[jira] [Comment Edited] (ARROW-11989) [C++][Python] Improve ChunkedArray's complexity for the access of elements

2021-12-26 Thread Eduardo Ponce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17465561#comment-17465561
 ] 

Eduardo Ponce edited comment on ARROW-11989 at 12/27/21, 7:14 AM:
--

Is the structure of the example ChunkedArray common (i.e., having many chunks 
with the same length)? If that is the case, this type of access pattern would 
benefit from a "FixedSizeChunkedArray" (which does not exist) where all chunks 
have the same length, so the chunk holding a given index could be located in 
O(1). This could be implemented without defining a new class, simply by having 
ChunkedArray keep a flag that tracks whether all chunks have the same length.
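
To make the flag idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: {{ChunkedSequence}}, {{uniform_length}} and {{locate}} are hypothetical names, not part of the Arrow API, and plain Python lists stand in for Arrow arrays.

```python
class ChunkedSequence:
    """Toy stand-in for a chunked array (hypothetical, not the Arrow API)."""

    def __init__(self, chunks):
        self.chunks = [list(chunk) for chunk in chunks]
        self.total_length = sum(len(chunk) for chunk in self.chunks)
        lengths = {len(chunk) for chunk in self.chunks}
        # The "flag": the shared chunk length, or None when lengths differ.
        self.uniform_length = lengths.pop() if len(lengths) == 1 else None

    def locate(self, index):
        """Return (chunk_index, offset_in_chunk) for a logical index."""
        if not 0 <= index < self.total_length:
            raise IndexError("index out of bounds")
        if self.uniform_length:
            # All chunks share one length: a single division, O(1).
            return divmod(index, self.uniform_length)
        # Otherwise fall back to a forward linear scan over the chunks.
        for chunk_index, chunk in enumerate(self.chunks):
            if index < len(chunk):
                return chunk_index, index
            index -= len(chunk)


uniform = ChunkedSequence([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
ragged = ChunkedSequence([[0], [1, 2, 3], [4, 5]])
```

With equal-length chunks the lookup never touches the chunk list; with ragged chunks the same call silently uses the slower scan.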

Now, with respect to using a binary search instead of a linear search to find 
the chunk of interest, I expect the binary search to improve access time for 
high-value indices but worsen access time for low-value indices due to the 
overhead of performing the binary search. Measurements are needed to verify 
this claim. The overall access time will depend on the application and its 
access patterns, and although a binary search would make chunk lookup time 
more consistent, it will also have drawbacks in certain cases.
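
For reference, the two lookups being compared could look roughly like the Python sketch below. This is an assumption-laden illustration rather than Arrow code: the function names are made up, and the binary search assumes a precomputed list of cumulative chunk end offsets, which is one possible index structure among several.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_linear(chunk_lengths, index):
    """Forward linear scan: cost grows with the position of the target chunk."""
    if index < 0:
        raise IndexError("index out of bounds")
    for chunk_index, length in enumerate(chunk_lengths):
        if index < length:
            return chunk_index, index
        index -= length
    raise IndexError("index out of bounds")


def build_end_offsets(chunk_lengths):
    """end_offsets[k] is the total number of rows in chunks 0..k."""
    return list(accumulate(chunk_lengths))


def locate_binary(end_offsets, index):
    """Binary search over cumulative offsets: O(log n) for every index."""
    if not end_offsets or not 0 <= index < end_offsets[-1]:
        raise IndexError("index out of bounds")
    # First chunk whose end offset is strictly greater than the index.
    chunk_index = bisect_right(end_offsets, index)
    chunk_start = end_offsets[chunk_index - 1] if chunk_index else 0
    return chunk_index, index - chunk_start


lengths = [3, 0, 2, 4]
end_offsets = build_end_offsets(lengths)
```

Both functions return the same (chunk, offset) pair; whether the binary search actually loses on low-value indices is exactly what would need to be measured.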

_Food for thought:_ An alternative solution is to allow the client code to 
specify the direction of the linear search. This will help control performance 
based on the expected access patterns. The search direction could be specified 
as an object attribute or function parameter.
* *forward* - begins at first chunk and is useful for low-value indices
* *backward* - begins search at last chunk and is useful for high-value indices
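
A rough Python sketch of the direction idea follows. The names ({{locate_directed}}, the {{direction}} parameter) are hypothetical, and the sketch assumes the total length is already known so that the backward scan does not need a first pass over all chunks.

```python
def locate_directed(chunk_lengths, total_length, index, direction="forward"):
    """Linear chunk lookup whose starting end is chosen by the caller."""
    if direction not in ("forward", "backward"):
        raise ValueError("direction must be 'forward' or 'backward'")
    if not 0 <= index < total_length:
        raise IndexError("index out of bounds")
    if direction == "forward":
        # Walk from the first chunk: cheap for low-value indices.
        for chunk_index, length in enumerate(chunk_lengths):
            if index < length:
                return chunk_index, index
            index -= length
    else:
        # Walk from the last chunk: cheap for high-value indices.
        # 'remaining' counts rows from the target to the end (inclusive).
        remaining = total_length - index
        for chunk_index in range(len(chunk_lengths) - 1, -1, -1):
            length = chunk_lengths[chunk_index]
            if remaining <= length:
                return chunk_index, length - remaining
            remaining -= length
    raise IndexError("chunk lengths do not match total_length")


lengths = [3, 0, 2, 4]
total = sum(lengths)
```

Both directions give the same answer; only the number of chunks visited before the answer is found differs.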



[jira] [Commented] (ARROW-11989) [C++][Python] Improve ChunkedArray's complexity for the access of elements

2021-12-26 Thread Eduardo Ponce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17465561#comment-17465561
 ] 

Eduardo Ponce commented on ARROW-11989:
---

Is the structure of the example ChunkedArray common (i.e., having many chunks 
with the same number of rows)? If that is the case, this type of access pattern 
would benefit from a "FixedSizeChunkedArray" (which does not exist) where all 
chunks have the same length, so the chunk holding a given index could be 
located in O(1). This could be implemented without defining a new class, simply 
by having ChunkedArray keep a flag that tracks whether all chunks have the same 
size.

Now, with respect to using a binary search instead of a linear search to find 
the chunk of interest, I expect the binary search to improve access time for 
high-value indices but worsen access time for low-value indices due to the 
overhead of performing the binary search. The overall access time will depend 
on the application and its access patterns, and although a binary search would 
make chunk lookup time more consistent, it will also have drawbacks in certain 
cases.

_Food for thought:_ An alternative solution is to allow the client code to 
specify the direction of the linear search. This will help control performance 
based on the expected access patterns. The search direction could be specified 
as an object attribute or function parameter.
* *forward* - begins at first chunk and is useful for low-value indices
* *backward* - begins search at last chunk and is useful for high-value indices

> [C++][Python] Improve ChunkedArray's complexity for the access of elements
> --
>
> Key: ARROW-11989
> URL: https://issues.apache.org/jira/browse/ARROW-11989
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Affects Versions: 3.0.0
> Reporter: quentin lhoest
> Priority: Major
>
> Chunked arrays are stored as a C++ vector of Arrays.
> There is currently no indexing structure on top of the vector to allow for 
> anything better than O(chunk) to access an arbitrary element.
> For example, with a Table consisting of 1 column “text” defined by:
> - 1024 chunks
> - each chunk is 1024 rows
> - each row is a text of 1024 characters
> Then the times it takes to access one example are:
> {code:java}
> Time to access example at i=0%    : 6.7μs
> Time to access example at i=10%   : 7.2μs
> Time to access example at i=20%   : 9.1μs
> Time to access example at i=30%   : 11.4μs
> Time to access example at i=40%   : 13.8μs
> Time to access example at i=50%   : 16.2μs
> Time to access example at i=60%   : 18.7μs
> Time to access example at i=70%   : 21.1μs
> Time to access example at i=80%   : 26.8μs
> Time to access example at i=90%   : 25.2μs
> {code}
> The times measured are the average times to do `table["text"][j]`, depending 
> on the index we want to fetch (from the first example at 0% to the example at 
> 90% of the length of the table).
> You can take a look at the code that produces this benchmark 
> [here|https://pastebin.com/pSkYHQn9].
> Some discussions in [this thread on the mailing 
> list|https://lists.apache.org/thread.html/r82d4cb40d72914977bf4c3c5b4c168ea03f6060d24279a44258a6394%40%3Cuser.arrow.apache.org%3E]
>  suggested different approaches to improve the complexity:
> - use a contiguous array of chunk lengths, since having a contiguous array of 
> lengths makes the iteration over the chunks lengths faster;
> - use a binary search, as in the Julia implementation 
> [here|https://github.com/JuliaData/SentinelArrays.jl/blob/fe14a82b815438ee2e04b59bf7f337feb1ffd022/src/chainedvector.jl#L14];
> - use interpolation search.
> Apparently there is also a lookup structure in the compute layer 
> [here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_sort.cc#L94].
> cc [~emkornfield], [~wesm]
> Thanks again for the amazing work !
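
As an illustration of the interpolation search listed among the suggested approaches in the issue description, here is a small Python sketch. It assumes a precomputed list of cumulative start offsets (with a leading 0 and the total length as the last entry); the function name and the returned probe count are hypothetical and only there to make the behaviour visible.

```python
from itertools import accumulate


def locate_interpolation(start_offsets, index):
    """Return (chunk_index, offset_in_chunk, probes) via interpolation search.

    start_offsets has one entry per chunk plus a final entry holding the
    total length, e.g. [0, 3, 3, 5, 9] for chunk lengths [3, 0, 2, 4].
    """
    if len(start_offsets) < 2 or not 0 <= index < start_offsets[-1]:
        raise IndexError("index out of bounds")
    lo, hi = 0, len(start_offsets) - 2
    probes = 0
    while True:
        probes += 1
        # Guess the chunk by assuming rows are spread evenly over lo..hi.
        span = start_offsets[hi + 1] - start_offsets[lo]
        guess = lo + (index - start_offsets[lo]) * (hi - lo + 1) // span
        if index < start_offsets[guess]:
            hi = guess - 1
        elif index >= start_offsets[guess + 1]:
            lo = guess + 1
        else:
            return guess, index - start_offsets[guess], probes


equal_offsets = [0] + list(accumulate([1024] * 4))
ragged_offsets = [0] + list(accumulate([3, 0, 2, 4]))
```

When every chunk has the same length, as in the benchmark table, the first guess already lands on the right chunk, so the lookup takes a single probe regardless of the index.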



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (ARROW-14332) [C++] Rename type traits utilities to improve semantic consistency

2021-12-26 Thread Eduardo Ponce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17465527#comment-17465527
 ] 

Eduardo Ponce edited comment on ARROW-14332 at 12/27/21, 6:09 AM:
--

The inconsistency/non-symmetry between the *is_xxx_type* type traits and 
*is_xxx* functions arises from how derived and specific types are treated. 
This shows clearly in the {{is_xxx_like}} variants. Also, it is a bit difficult 
to distinguish specific types from derived classes using only the *is* predicate.


was (Author: edponce):
The inconsistency/non-symmetry between the *is_xxx_type* type traits and 
*is_xxx* functions arises due to how derived and specific-types are considered, 
specifically in the {{is_xxx_like}} variants.

> [C++] Rename type traits utilities to improve semantic consistency
> --
>
> Key: ARROW-14332
> URL: https://issues.apache.org/jira/browse/ARROW-14332
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Eduardo Ponce
> Assignee: Eduardo Ponce
> Priority: Minor
> Labels: type
> Fix For: 8.0.0
>
>
> There are semantic differences between *enable_ifs-related* utils and 
> *is_xxx* functions with the same name. For example, *is_binary_like* 
> [here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/type_traits.h#L596]
>  != 
> [here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/type_traits.h#L924].
>  The former includes binary only and the latter binary/string types.
> Also, the *_like* suffix seems unwarranted as they always refer to binary or 
> string.
> Also, the *is_fixed_size_binary* includes both _FixedSizeBinaryType_ and 
> {_}DecimalXXXType{_}. A better name is *is_base_fixed_size_binary* to match 
> how binary/string utils are used.
> {_}Note{_}: There might be other inconsistencies.






[jira] [Comment Edited] (ARROW-14332) [C++] Rename type traits utilities to improve semantic consistency

2021-12-26 Thread Eduardo Ponce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17465524#comment-17465524
 ] 

Eduardo Ponce edited comment on ARROW-14332 at 12/27/21, 4:50 AM:
--

This table shows Arrow datatypes and corresponding *is_xxx_type* type traits.
||Datatype||Current type trait||
|FixedWidthType| |
|PrimitiveCType|is_primitive_ctype|
|NumberType|is_number_type|
|IntegerType|is_integer_type|
| |is_signed_integer_type|
| |is_unsigned_integer_type|
|[U]Int[8,16,32,64]Type| |
|FloatingPointType|is_floating_type|
|HalfFloatType|is_half_float_type|
|FloatType| |
|DoubleType| |
|ParametricType| |
|NestedType|is_nested_type|
|NullType|is_null_type|
|BooleanType|is_boolean_type|
|BaseBinaryType|is_base_binary_type|
|BinaryType|is_binary_type|
|LargeBinaryType| |
|StringType|is_string_type|
|LargeStringType| |
|FixedSizeBinaryType|is_fixed_size_binary_type|
|DecimalType|is_decimal_type|
|Decimal128Type|is_decimal128_type|
|Decimal256Type|is_decimal256_type|
|BaseListType|is_var_length_list_type|
|ListType|is_list_type|
|LargeListType| |
|FixedSizeListType|is_fixed_size_list_type|
|MapType| |
|StructType|is_struct_type|
|UnionType|is_union_type|
|SparseUnionType| |
|DenseUnionType| |
|TemporalType|is_temporal_type|
|DateType|is_date_type|
|Date64Type| |
|TimeType|is_time_type|
|Time32Type| |
|Time64Type| |
|TimestampType|is_timestamp_type|
|IntervalType|is_interval_type|
|MonthIntervalType| |
|DayTimeIntervalType| |
|MonthDayNanoIntervalType| |
|DurationType|is_duration_type|
|DictionaryType|is_dictionary_type|
|ExtensionType|is_extension_type|

These are special type traits:
 * {{is_string_like_type = is_base_binary_type && T::is_utf8}}
    a. (Eduardo) Seems like a semantic duplicate of {{is_string_type}}
 * {{is_binary_like_type = (is_base_binary_type && !is_string_like_type) || is_fixed_size_binary_type}}
 * {{is_base_list_type}} deprecated for {{is_var_length_list_type}}
 * {{is_list_like_type = is_base_list_type || is_fixed_size_list_type}}
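
To see how the string/binary definitions above compose, the following Python sketch models them with ordinary classes and {{issubclass}}. It is a toy model, not the C++ traits themselves: the class hierarchy below (string types deriving from binary types, decimals deriving from fixed-size binary) is a simplifying assumption, and the Python function names merely mirror the trait names.

```python
# Simplified, assumed class hierarchy; the real Arrow C++ types carry far more.
class DataType:
    pass


class BaseBinaryType(DataType):
    is_utf8 = False


class BinaryType(BaseBinaryType):
    pass


class StringType(BinaryType):
    is_utf8 = True


class FixedSizeBinaryType(DataType):
    pass


class Decimal128Type(FixedSizeBinaryType):
    pass


def is_base_binary_type(t):
    return issubclass(t, BaseBinaryType)


def is_fixed_size_binary_type(t):
    return issubclass(t, FixedSizeBinaryType)


def is_string_like_type(t):
    # is_base_binary_type && T::is_utf8
    return is_base_binary_type(t) and t.is_utf8


def is_binary_like_type(t):
    # (is_base_binary_type && !is_string_like_type) || is_fixed_size_binary_type
    return (is_base_binary_type(t) and not is_string_like_type(t)) or (
        is_fixed_size_binary_type(t)
    )
```

In this model {{is_binary_like_type}} accepts {{BinaryType}}, {{FixedSizeBinaryType}} and {{Decimal128Type}} but rejects {{StringType}}, which is the "binary only" behaviour the issue description contrasts with the *is_binary_like* function.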




[jira] [Updated] (ARROW-14332) [C++] Rename type traits utilities to improve semantic consistency

2021-12-26 Thread Eduardo Ponce (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eduardo Ponce updated ARROW-14332:
--
Fix Version/s: 8.0.0
   (was: 7.0.0)





[jira] [Comment Edited] (ARROW-14332) [C++] Rename type traits utilities to improve semantic consistency

2021-12-26 Thread Eduardo Ponce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17465515#comment-17465515
 ] 

Eduardo Ponce edited comment on ARROW-14332 at 12/27/21, 3:08 AM:
--

There is a more general issue at hand here: the naming convention used when 
referring to concepts (type trait functions, SFINAE conditions, GD, etc.) 
corresponding to base/derived types.


was (Author: edponce):
There is a more general issue at hand here: the naming convention used when 
referring to concepts (functions, GD, SFINAE conditions, etc.) corresponding to 
base/derived types.






[jira] [Commented] (ARROW-14332) [C++] Rename type traits utilities to improve semantic consistency

2021-12-26 Thread Eduardo Ponce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17465515#comment-17465515
 ] 

Eduardo Ponce commented on ARROW-14332:
---

There is a more general issue at hand here: the naming convention used when 
referring to concepts (functions, GD, SFINAE conditions, etc.) corresponding to 
base/derived types.






[jira] [Updated] (ARROW-14332) [C++] Rename type traits utilities to improve semantic consistency

2021-12-26 Thread Eduardo Ponce (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eduardo Ponce updated ARROW-14332:
--
Description: 
There are semantic differences between *enable_ifs-related* utils and *is_xxx* 
functions with the same name. For example, *is_binary_like* 
[here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/type_traits.h#L596]
 != 
[here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/type_traits.h#L924].
 The former includes binary only and the latter binary/string types.

Also, the *_like* suffix seems unwarranted as they always refer to binary or 
string.

Also, the *is_fixed_size_binary* includes both _FixedSizeBinaryType_ and 
{_}DecimalXXXType{_}. A better name is *is_base_fixed_size_binary* to match how 
binary/string utils are used.

{_}Note{_}: There might be other inconsistencies.

  was:
There are semantic differences between *enable_ifs-related* utils and *is_xxx* 
functions with the same name. For example, *is_binary_like* 
[here|[https://github.com/apache/arrow/blob/master/cpp/src/arrow/type_traits.h#L596]]
 != 
[here|[https://github.com/apache/arrow/blob/master/cpp/src/arrow/type_traits.h#L924]|https://github.com/apache/arrow/blob/master/cpp/src/arrow/type_traits.h#L924].
 The former includes binary only and the latter binary/string types.

Also, the *_like* suffix seems unwarranted as they always refer to binary or 
string.

Also, the *is_fixed_size_binary* includes both _FixedSizeBinaryType_ and 
{_}DecimalXXXType{_}. A better name is *is_base_fixed_size_binary* to match how 
binary/string utils are used.

{_}Note{_}: There might be other inconsistencies.







[jira] [Updated] (ARROW-14332) [C++] Rename type traits utilities to improve semantic consistency

2021-12-26 Thread Eduardo Ponce (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eduardo Ponce updated ARROW-14332:
--
Description: 
There are semantic differences between *enable_ifs-related* utils and *is_xxx* 
functions with the same name. For example, *is_binary_like* 
[here|[https://github.com/apache/arrow/blob/master/cpp/src/arrow/type_traits.h#L596]]
 != 
[here|[https://github.com/apache/arrow/blob/master/cpp/src/arrow/type_traits.h#L924]|https://github.com/apache/arrow/blob/master/cpp/src/arrow/type_traits.h#L924].
 The former includes binary only and the latter binary/string types.

Also, the *_like* suffix seems unwarranted as they always refer to binary or 
string.

Also, the *is_fixed_size_binary* includes both _FixedSizeBinaryType_ and 
{_}DecimalXXXType{_}. A better name is *is_base_fixed_size_binary* to match how 
binary/string utils are used.

{_}Note{_}: There might be other inconsistencies.

  was:
There are semantic differences between  *enable_ifs-related* utils and *is_xxx* 
functions with the same name. For example, *is_binary_like* 
[here](https://github.com/apache/arrow/blob/master/cpp/src/arrow/type_traits.h#L596)
 != 
[here](https://github.com/apache/arrow/blob/master/cpp/src/arrow/type_traits.h#L924).
 The former includes binary only and the latter binary/string types.

Also, the *_like* suffix seems unwarranted as they always refer to binary or 
string.

Also, the *is_fixed_size_binary* includes both _FixedSizeBinaryType_ and 
_DecimalXXXType_. A better name is *is_base_fixed_size_binary* to match how 
binary/string utils are used.

_Note_: There might be other inconsistencies.




