[GitHub] [arrow] Dandandan commented on pull request #9595: ARROW-11806: [Rust][DataFusion] Optimize join / inner join creation of indices
Dandandan commented on pull request #9595: URL: https://github.com/apache/arrow/pull/9595#issuecomment-790407618 @alamb ha, thanks, fixed!
[GitHub] [arrow] emkornfield commented on pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics
emkornfield commented on pull request #9582: URL: https://github.com/apache/arrow/pull/9582#issuecomment-790405647 @pitrou I think I addressed all of your comments, would you mind taking another look?
[GitHub] [arrow] emkornfield commented on a change in pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics
emkornfield commented on a change in pull request #9582: URL: https://github.com/apache/arrow/pull/9582#discussion_r587235073

## File path: cpp/src/parquet/statistics.cc

@@ -133,56 +136,95 @@ struct CompareHelper {
   static T Max(int type_length, const T& a, const T& b) {
     return Compare(0, a, b) ? b : a;
   }
-};  // namespace parquet
-
-template <bool is_signed>
-struct ByteLikeCompareHelperBase {
-  using T = ByteArrayType::c_type;
-  using PtrType = typename std::conditional<is_signed, int8_t, uint8_t>::type;
+};

-  static T DefaultMin() { return {}; }
-  static T DefaultMax() { return {}; }
-  static T Coalesce(T val, T fallback) { return val; }
+template <typename T, bool is_signed>
+struct BinaryLikeComparer {};

-  static inline bool Compare(int type_length, const T& a, const T& b) {
-    const auto* aptr = reinterpret_cast<const PtrType*>(a.ptr);
-    const auto* bptr = reinterpret_cast<const PtrType*>(b.ptr);
-    return std::lexicographical_compare(aptr, aptr + a.len, bptr, bptr + b.len);
+template <typename T>
+struct BinaryLikeComparer<T, /*is_signed=*/false> {
+  static bool Compare(int type_length, const T& a, const T& b) {
+    int a_length = value_length(type_length, a);
+    int b_length = value_length(type_length, b);
+    // Unsigned comparison is used for non-numeric types so straight
+    // lexicographic comparison makes sense. (a.ptr is always unsigned)
+    return std::lexicographical_compare(a.ptr, a.ptr + a_length, b.ptr,
+                                        b.ptr + b_length);
   }
+};

-  static T Min(int type_length, const T& a, const T& b) {
-    if (a.ptr == nullptr) return b;
-    if (b.ptr == nullptr) return a;
-    return Compare(type_length, a, b) ? a : b;
-  }
+template <typename T>
+struct BinaryLikeComparer<T, /*is_signed=*/true> {
+  static bool Compare(int type_length, const T& a, const T& b) {
+    // is_signed is used for integers encoded as big-endian two's
+    // complement integers (e.g. decimals).
+    int a_length = value_length(type_length, a);
+    int b_length = value_length(type_length, b);
+
+    // At least one of the lengths is zero.
+    if (a_length == 0 || b_length == 0) {
+      return a_length == 0 && b_length > 0;
+    }

-  static T Max(int type_length, const T& a, const T& b) {
-    if (a.ptr == nullptr) return b;
-    if (b.ptr == nullptr) return a;
-    return Compare(type_length, a, b) ? b : a;
+    int8_t first_a = *a.ptr;
+    int8_t first_b = *b.ptr;
+    // We can short circuit for different signed numbers or
+    // for equal length byte arrays that have different first bytes.
+    if ((0x80 & first_a) != (0x80 & first_b) ||
+        (a_length == b_length && first_a != first_b)) {
+      return first_a < first_b;
+    }
+    // When the lengths are unequal and the numbers are of the same
+    // sign we need to extend the digits.
+    const uint8_t* a_start = a.ptr;
+    const uint8_t* b_start = b.ptr;
+    if (a_length != b_length) {
+      const uint8_t* lead_start = nullptr;
+      const uint8_t* lead_end = nullptr;
+      if (a_length > b_length) {
+        int lead_length = a_length - b_length;
+        lead_start = a.ptr;
+        lead_end = a.ptr + lead_length;
+        a_start += lead_length;
+      } else {
+        DCHECK_LT(a_length, b_length);
+        int lead_length = b_length - a_length;
+        lead_start = b.ptr;
+        lead_end = b.ptr + lead_length;
+        b_start += lead_length;
+      }
+      // Compare extra bytes to the sign extension of the first
+      // byte of the other number.
+      uint8_t extension = first_a < 0 ? 0xFF : 0;
+      for (; lead_start != lead_end; lead_start++) {
+        if (*lead_start < extension) {
+          // The first bytes of the long value are less
+          // than the extended short one. So if a is the long value
+          // we can return true.
+          return a_length > b_length;
+        } else if (*lead_start > extension) {

Review comment: I did a slightly different factoring; let me know if you think that is clearer?
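For intuition, here is a minimal Python sketch of the comparison rule discussed in this diff: sign-extend the shorter big-endian two's-complement operand, short-circuit on differing sign bits, then compare bytewise. This is an illustration only, not the C++ under review; the names are invented, and the result is cross-checked against Python's own signed-integer decoding.

```python
import random

def signed_be_less(a: bytes, b: bytes) -> bool:
    """Illustrative sketch: order big-endian two's-complement byte strings."""
    if not a or not b:
        # Mirrors the diff: an empty value sorts before any non-empty value.
        return len(a) == 0 and len(b) > 0
    width = max(len(a), len(b))

    def extend(v: bytes) -> bytes:
        # Sign extension: pad with 0xFF for negative values, 0x00 otherwise.
        fill = b"\xff" if v[0] & 0x80 else b"\x00"
        return fill * (width - len(v)) + v

    ea, eb = extend(a), extend(b)
    if (ea[0] ^ eb[0]) & 0x80:
        return bool(ea[0] & 0x80)  # different signs: the negative one is smaller
    return ea < eb  # same sign and width: bytewise order matches numeric order

# Cross-check against Python's built-in two's-complement decoding.
for _ in range(1000):
    x = bytes(random.randrange(256) for _ in range(random.randrange(1, 5)))
    y = bytes(random.randrange(256) for _ in range(random.randrange(1, 5)))
    expected = int.from_bytes(x, "big", signed=True) < int.from_bytes(y, "big", signed=True)
    assert signed_be_less(x, y) == expected
```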
[GitHub] [arrow] emkornfield commented on a change in pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics
emkornfield commented on a change in pull request #9582: URL: https://github.com/apache/arrow/pull/9582#discussion_r587234738

## File path: cpp/src/parquet/statistics_test.cc

@@ -111,21 +117,28 @@ TEST(Comparison, UnsignedByteArray) {
 }

 TEST(Comparison, SignedFLBA) {
-  int size = 10;
+  int size = 4;
   auto comparator = MakeComparator(Type::FIXED_LEN_BYTE_ARRAY, SortOrder::SIGNED, size);

-  std::string s1 = "Anti123456";
-  std::string s2 = "Bunkd123456";
-  FLBA s1flba = FLBAFromString(s1);
-  FLBA s2flba = FLBAFromString(s2);
-  ASSERT_TRUE(comparator->Compare(s1flba, s2flba));
+  std::vector<uint8_t> byte_values[] = {
+      {0x80, 0, 0, 0},          {0xFF, 0xFF, 0x01, 0},    {0xFF, 0xFF, 0x80, 0},
+      {0xFF, 0xFF, 0xFF, 0x80}, {0xFF, 0xFF, 0xFF, 0xFF}, {0, 0, 0x01, 0x01},
+      {0, 0x01, 0x01, 0},       {0x01, 0x01, 0, 0}};
+  std::vector<FLBA> values_to_compare;
+  for (auto& bytes : byte_values) {
+    values_to_compare.emplace_back(FLBA(bytes.data()));
+  }

-  s1 = "Bünk123456";
-  s2 = "Bunk123456";
-  s1flba = FLBAFromString(s1);
-  s2flba = FLBAFromString(s2);
-  ASSERT_TRUE(comparator->Compare(s1flba, s2flba));
+  for (size_t x = 0; x < values_to_compare.size(); x++) {
+    EXPECT_FALSE(comparator->Compare(values_to_compare[x], values_to_compare[x])) << x;
+    for (size_t y = x + 1; y < values_to_compare.size(); y++) {
+      EXPECT_TRUE(comparator->Compare(values_to_compare[x], values_to_compare[y]))
+          << x << " " << y;
+      EXPECT_FALSE(comparator->Compare(values_to_compare[y], values_to_compare[x]))
+          << y << " " << x;
+    }
+  }

Review comment: You probably shouldn't trust me. I added a test when writing a column of Decimal128 values to arrow_reader_writer_test which failed without this change. This required fixing some cases with extracting scalar values, and there are still some broken aspects to the extraction. I'll open up some JIRAs to fix these:

1. Decimals encoded as Ints aren't extracted correctly.
2. The decimal type chosen might not correspond to the decoded decimal type because we don't pass the arrow schema information through to the decoding (i.e. we depend solely on precision for deciding Decimal128Scalar or Decimal256Scalar).
[GitHub] [arrow] emkornfield commented on a change in pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics
emkornfield commented on a change in pull request #9582: URL: https://github.com/apache/arrow/pull/9582#discussion_r587233240

## File path: cpp/src/parquet/statistics.cc

+    int8_t first_a = *a.ptr;
+    int8_t first_b = *b.ptr;
+    // We can short circuit for different signed numbers or
+    // for equal length byte arrays that have different first bytes.
+    if ((0x80 & first_a) != (0x80 & first_b) ||
+        (a_length == b_length && first_a != first_b)) {
+      return first_a < first_b;
+    }

Review comment: Added the example above. I can add more of the explanation if you think it is worthwhile.
[GitHub] [arrow] emkornfield commented on a change in pull request #9582: PARQUET-1655: [C++] Fix comparison of Decimal values in statistics
emkornfield commented on a change in pull request #9582: URL: https://github.com/apache/arrow/pull/9582#discussion_r587230891

## File path: cpp/src/parquet/statistics_test.cc

@@ -71,19 +71,25 @@ static FLBA FLBAFromString(const std::string& s) {

 TEST(Comparison, SignedByteArray) {
   auto comparator = MakeComparator(Type::BYTE_ARRAY, SortOrder::SIGNED);

-  std::string s1 = "12345";
-  std::string s2 = "12345678";
-  ByteArray s1ba = ByteArrayFromString(s1);
-  ByteArray s2ba = ByteArrayFromString(s2);
-  ASSERT_TRUE(comparator->Compare(s1ba, s2ba));
+  std::vector<uint8_t> byte_values[] = {
+      {0x80, 0x80, 0, 0},       {/*0xFF,*/ 0x80, 0, 0},     {/*0xFF,*/ 0xFF, 0x01, 0},
+      {/*0xFF,0xFF,*/ 0x80, 0}, {/*0xFF,0xFF,0xFF,*/ 0x80}, {/*0xFF, 0xFF, 0xFF,*/ 0xFF},
+      {/*0, 0,*/ 0x01, 0x01},   {/*0,*/ 0x01, 0x01, 0},     {0x01, 0x01, 0, 0}};

Review comment: Did something similar to this. I don't think ByteArray supports string insertion?
[GitHub] [arrow] dmyersturnbull opened a new issue #9628: write_feather incorrectly deletes files
dmyersturnbull opened a new issue #9628: URL: https://github.com/apache/arrow/issues/9628

I'm happy to fill this out on Jira and/or to submit a pull request. I just can't log in to the Jira right now -- as soon as I log in it "forgets". Two related issues about [write_feather](https://github.com/apache/arrow/blob/b8b44197866242aa1fec6a7f76e3e1005ac3c6de/python/pyarrow/feather.py#L119):

**pathlib:** `write_feather` could accept a `pathlib.PurePath`, since Python 3.6+ is already required.

**files are deleted:** On error, `write_feather` will delete a file that already existed. Suppose some code is supposed to read `myfile.feather`, update it, and write it back. If the writer throws an error, the original copy of `myfile.feather` is deleted. [Here's the culprit](https://github.com/apache/arrow/blob/b8b44197866242aa1fec6a7f76e3e1005ac3c6de/python/pyarrow/feather.py#L185):

```python
except Exception:
    if isinstance(dest, str):
        try:
            os.remove(dest)
        except os.error:
            pass
    raise
```
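One possible shape of a fix, as a hedged Python sketch rather than an actual patch: record whether the destination already existed before writing, and only remove it on failure when this call created the file. `write_feather_impl` is a hypothetical stand-in for the real writer.

```python
import os

def write_feather_safely(df, dest):
    # Remember whether `dest` existed so we never delete a pre-existing file.
    existed_before = isinstance(dest, str) and os.path.exists(dest)
    try:
        write_feather_impl(df, dest)  # hypothetical stand-in for the real writer
    except Exception:
        if isinstance(dest, str) and not existed_before:
            try:
                os.remove(dest)  # clean up only the partial file we created
            except OSError:
                pass
        raise
```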
[GitHub] [arrow] github-actions[bot] commented on pull request #9627: ARROW-11856: [C++] Remove unused reference to RecordBatchStreamWriter
github-actions[bot] commented on pull request #9627: URL: https://github.com/apache/arrow/pull/9627#issuecomment-790353990 https://issues.apache.org/jira/browse/ARROW-11856
[GitHub] [arrow] westonpace opened a new pull request #9627: ARROW-11856: [C++] Remove unused reference to RecordBatchStreamWriter
westonpace opened a new pull request #9627: URL: https://github.com/apache/arrow/pull/9627 The type RecordBatchStreamWriter was in a type_fwd but never implemented anywhere. The proper type would be RecordBatchWriter.
[GitHub] [arrow] mathyingzhou edited a comment on pull request #8648: ARROW-7906: [C++] [Python] Add ORC write support
mathyingzhou edited a comment on pull request #8648: URL: https://github.com/apache/arrow/pull/8648#issuecomment-790297527 @pitrou Yes, now it is ready for another review. I have fixed all the issues you mentioned, found and fixed a previously hidden bug, and shortened the tests to about 650 lines (with more tests!). It should be a lot neater this time.
[GitHub] [arrow] kou closed pull request #9622: ARROW-11836: [C++] Avoid requiring arrow_bundled_dependencies when it doesn't exist for arrow_static.
kou closed pull request #9622: URL: https://github.com/apache/arrow/pull/9622
[GitHub] [arrow] liyafan82 commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4
liyafan82 commented on pull request #8949: URL: https://github.com/apache/arrow/pull/8949#issuecomment-790239103

> Congratulations @liyafan82 ! Do you have an idea how hard it will be to add zstd support?

@pitrou Support for zstd should be much easier; as you can see, most of the effort was devoted to framework changes and library selection. That effort can be saved when supporting zstd.
[GitHub] [arrow] sighingnow commented on a change in pull request #9622: ARROW-11836: [C++] Avoid requiring arrow_bundled_dependencies when it doesn't exist for arrow_static.
sighingnow commented on a change in pull request #9622: URL: https://github.com/apache/arrow/pull/9622#discussion_r586993756

## File path: cpp/src/arrow/ArrowConfig.cmake.in

@@ -71,19 +71,21 @@ if(NOT (TARGET arrow_shared OR TARGET arrow_static))
     get_property(arrow_static_loc TARGET arrow_static PROPERTY LOCATION)
     get_filename_component(arrow_lib_dir ${arrow_static_loc} DIRECTORY)

-    add_library(arrow_bundled_dependencies STATIC IMPORTED)
-    set_target_properties(
-      arrow_bundled_dependencies
-      PROPERTIES
-        IMPORTED_LOCATION
-          "${arrow_lib_dir}/${CMAKE_STATIC_LIBRARY_PREFIX}arrow_bundled_dependencies${CMAKE_STATIC_LIBRARY_SUFFIX}"
-      )
+    if(@ARROW_BUNDLED_STATIC_LIBS@)

Review comment: @kou Fixed.
[GitHub] [arrow] sighingnow commented on a change in pull request #9622: ARROW-11836: [C++] Avoid requiring arrow_bundled_dependencies when it doesn't exist for arrow_static.
sighingnow commented on a change in pull request #9622: URL: https://github.com/apache/arrow/pull/9622#discussion_r586992064

## File path: cpp/src/arrow/ArrowConfig.cmake.in

+    if(@ARROW_BUNDLED_STATIC_LIBS@)

Review comment: Thanks! That definitely works!
[GitHub] [arrow] sundy-li commented on a change in pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel
sundy-li commented on a change in pull request #9602: URL: https://github.com/apache/arrow/pull/9602#discussion_r586975774

## File path: rust/arrow/src/compute/kernels/sort.rs

@@ -278,12 +347,27 @@ fn sort_boolean(
     let valids_len = valids.len();
     let nulls_len = nulls.len();

-    if !descending {
-        valids.sort_by(|a, b| a.1.cmp(&b.1));
-    } else {
-        valids.sort_by(|a, b| a.1.cmp(&b.1).reverse());
-        // reverse to keep a stable ordering
-        nulls.reverse();
+    let mut len = values.len();
+    match limit {
+        Some(limit) => {
+            len = limit.min(len);
+            if !descending {
+                valids.partial_sort(len, |a, b| cmp(a.1, b.1));

Review comment:

> Sorry, not "fast path", but would the performance be the same as `sort_by`?

I'm afraid not. The sort function in Rust is a merge sort; it is slightly faster than the heap sort at sorting all elements of larger sets, but it requires twice the memory of the heap sort because of the second array.

```
sort 2^12                  time: [799.70 us 806.41 us 814.15 us]
sort 2^12 limit 2^12       time: [1.2848 ms 1.3012 ms 1.3229 ms]
sort nulls 2^12            time: [647.20 us 649.27 us 651.61 us]
sort nulls 2^12 limit 2^12 time: [780.17 us 788.48 us 798.04 us]
```

We can make a new function to reduce the repeated patterns, or make `partial_sort` fall back to the default sort when the limit equals the length.
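For intuition about the trade-off being benchmarked above, here is a Python sketch (not the Rust kernel): a limit-aware partial sort only has to order the smallest `limit` elements, for example with a bounded heap, while a full sort orders everything.

```python
import heapq
import random

values = [random.random() for _ in range(1 << 12)]
limit = 10

# Heap-based selection runs in O(n log limit) rather than the O(n log n) of a
# full sort, and needs no auxiliary array of size n the way a merge sort does.
smallest = heapq.nsmallest(limit, values)

assert smallest == sorted(values)[:limit]
```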
[GitHub] [arrow] github-actions[bot] commented on pull request #9626: ARROW-11855: Memory leak in to_pandas when converting chunked struct array
github-actions[bot] commented on pull request #9626: URL: https://github.com/apache/arrow/pull/9626#issuecomment-790226986 https://issues.apache.org/jira/browse/ARROW-11855
[GitHub] [arrow] westonpace opened a new pull request #9626: ARROW-11855: Memory leak in to_pandas when converting chunked struct array
westonpace opened a new pull request #9626: URL: https://github.com/apache/arrow/pull/9626

When converting a struct with chunks to Python, the ownership of the arrow arrays was not being properly tracked, and the deletion of the resulting pandas dataframe would leave some buffers behind. I believe it was something like this (pseudo-code)...

```
array_tracker** refs[num_fields];
for chunk in chunks:
    *refs[chunk.index] = convert(...)
    for row in rows:
        row.owns.append(refs[chunk.index])
```
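A rough illustration of the aliasing bug described by the pseudo-code, in hypothetical Python rather than the actual conversion code: when one reference slot per field is reused across chunks, every row ends up holding the slot's final value, so the owners for earlier chunks are lost.

```python
chunks = ["chunk0", "chunk1"]  # two chunks of a single struct field

# Buggy pattern: one shared slot per field, overwritten on every chunk.
slot = [None]
rows_buggy = []
for chunk in chunks:
    slot[0] = chunk               # clobbers the previous chunk's owner
    rows_buggy.append(slot)       # every row aliases the same slot
print([r[0] for r in rows_buggy])  # ['chunk1', 'chunk1'] -- chunk0's owner is gone

# Fixed pattern: each row keeps a reference created for its own chunk.
rows_fixed = [[chunk] for chunk in chunks]
print([r[0] for r in rows_fixed])  # ['chunk0', 'chunk1']
```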
[GitHub] [arrow] seddonm1 commented on pull request #9625: ARROW-11653: [Rust][DataFusion] Postgres String Functions: ascii, chr, initcap, repeat, reverse, to_hex
seddonm1 commented on pull request #9625: URL: https://github.com/apache/arrow/pull/9625#issuecomment-790222729 Thanks @andygrove. I have a few more PRs to do to finish this first phase of work. Then I think it's time to tackle type coercion.
[GitHub] [arrow] nealrichardson closed pull request #9561: ARROW-11649: [R] Add support for null_fallback to R
nealrichardson closed pull request #9561: URL: https://github.com/apache/arrow/pull/9561
[GitHub] [arrow] nealrichardson commented on a change in pull request #9606: ARROW-10405: [C++] IsIn kernel should be able to lookup dictionary in string
nealrichardson commented on a change in pull request #9606: URL: https://github.com/apache/arrow/pull/9606#discussion_r586922616

## File path: r/tests/testthat/test-Array.R

@@ -723,6 +723,17 @@ test_that("[ accepts Arrays and otherwise handles bad input", {
 })

+test_that("[ %in% looks up string key in dictionary", {

Review comment:

```suggestion
test_that("%in% works on dictionary arrays", {
```

## File path: r/tests/testthat/test-Array.R

@@ -723,6 +723,17 @@ test_that("[ accepts Arrays and otherwise handles bad input", {
 })

+test_that("[ %in% looks up string key in dictionary", {
+  a1 <- Array$create(as.factor(c("A", "B", "C")))
+  a2 <- DictionaryArray$create(c(0L, 1L, 2L), c(4.5, 3.2, 1.1))
+  b1 <- Array$create(c(FALSE, TRUE, FALSE))
+  b2 <- Array$create(c(FALSE, FALSE, FALSE))
+  expect_equal(b1, arrow:::call_function("is_in_meta_binary", a1, Array$create("B")))
+  expect_equal(b2, arrow:::call_function("is_in_meta_binary", a1, Array$create("D")))
+  expect_equal(b1, arrow:::call_function("is_in_meta_binary", a2, Array$create(3.2)))
+  expect_error(arrow:::call_function("is_in_meta_binary", a2, Array$create("B")))

Review comment: I would write these as e.g. `expect_equal(a1 %in% "B", b1)` etc.
[GitHub] [arrow] kou closed pull request #9616: ARROW-11837: [C++][Dataset] expose originating Fragment on ScanTask
kou closed pull request #9616: URL: https://github.com/apache/arrow/pull/9616
[GitHub] [arrow] kou commented on pull request #9616: ARROW-11837: [C++][Dataset] expose originating Fragment on ScanTask
kou commented on pull request #9616: URL: https://github.com/apache/arrow/pull/9616#issuecomment-790161292 CI is green. I'll merge this. Thanks for updating the GLib and Ruby parts too!
[GitHub] [arrow] rok commented on pull request #9606: ARROW-10405: [C++] IsIn kernel should be able to lookup dictionary in string
rok commented on pull request #9606: URL: https://github.com/apache/arrow/pull/9606#issuecomment-790159420

> Thanks for doing this. Can you add tests on the C++ side? I expect tests for `IsIn` and `IndexIn`, with non-trivial dictionary arrays, also in cases where not all dictionary values are referred by indices.

Added for string and float32 types. Should we check more?
[GitHub] [arrow] rok commented on a change in pull request #9606: ARROW-10405: [C++] IsIn kernel should be able to lookup dictionary in string
rok commented on a change in pull request #9606: URL: https://github.com/apache/arrow/pull/9606#discussion_r586896681

## File path: r/tests/testthat/test-Array.R

@@ -727,6 +727,15 @@ test_that("[ accepts Arrays and otherwise handles bad input", {
 })

+test_that("[ %in% looks up string key in dictionary", {
+  a1 <- Array$create(as.factor(c("A", "B", "C")))
+  a2 <- DictionaryArray$create(c(0L, 1L, 2L), c(4.5, 3.2, 1.1))
+  arrow:::call_function("is_in_meta_binary", a1, Array$create("B"))
+  arrow:::call_function("is_in_meta_binary", a1, Array$create("D"))
+  arrow:::call_function("is_in_meta_binary", a2, Array$create(3.2))

Review comment: I was looking to test for errors but the added check is better. Adding.
[GitHub] [arrow] seddonm1 commented on pull request #9625: ARROW-11653: [Rust][DataFusion] Postgres String Functions: ascii, chr, initcap, repeat, reverse, to_hex
seddonm1 commented on pull request #9625: URL: https://github.com/apache/arrow/pull/9625#issuecomment-790153156 @andygrove no problem. I will create a macro for that.
[GitHub] [arrow] alamb commented on pull request #9624: ARROW-11845: [Rust] Implement to_isize() for ArrowNativeTypes
alamb commented on pull request #9624: URL: https://github.com/apache/arrow/pull/9624#issuecomment-790144154 Thanks for this @ericwburden -- I hope to find time to review this tomorrow.
[GitHub] [arrow] alamb commented on pull request #9625: ARROW-11653: [Rust][DataFusion] Postgres String Functions: ascii, chr, initcap, repeat, reverse, to_hex
alamb commented on pull request #9625: URL: https://github.com/apache/arrow/pull/9625#issuecomment-790143072 Thanks @seddonm1 -- I plan to review this tomorrow.
[GitHub] [arrow] nealrichardson commented on a change in pull request #9579: ARROW-11774: [R] macos one line install
nealrichardson commented on a change in pull request #9579: URL: https://github.com/apache/arrow/pull/9579#discussion_r586867436

## File path: r/tools/nixlibs.R

@@ -396,6 +401,7 @@ cmake_version <- function(cmd = "cmake") {
 with_s3_support <- function(env_vars) {
   arrow_s3 <- toupper(Sys.getenv("ARROW_S3")) == "ON" || tolower(Sys.getenv("LIBARROW_MINIMAL")) == "false"
   if (arrow_s3) {
+    # TODO: add in brew versions of these?

Review comment: Please do
[GitHub] [arrow] alamb commented on pull request #9595: ARROW-11806: [Rust][DataFusion] Optimize join / inner join creation of indices
alamb commented on pull request #9595: URL: https://github.com/apache/arrow/pull/9595#issuecomment-790142862 @Dandandan oh no! It now seems to have failed `rust fmt` linting.
[GitHub] [arrow] nealrichardson commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
nealrichardson commented on a change in pull request #9610: URL: https://github.com/apache/arrow/pull/9610#discussion_r586859667

## File path: r/R/dataset-partition.R

@@ -76,7 +76,9 @@ HivePartitioning$create <- dataset___HivePartitioning
 #' calling `hive_partition()` with no arguments.
 #' @examples
 #' \donttest{
-#' hive_partition(year = int16(), month = int8())
+#' if (arrow_with_dataset()) {

Review comment: It does, but I think CRAN doesn't do `--run-donttest` everywhere. Anyway, we won't be turning off these features on CRAN except for Solaris, so I think it's fine, or worth the risk, so as not to pollute the examples with potentially confusing (to users) wrapping.
[GitHub] [arrow] jonkeane commented on pull request #9561: ARROW-11649: [R] Add support for null_fallback to R
jonkeane commented on pull request #9561: URL: https://github.com/apache/arrow/pull/9561#issuecomment-790126766 Ok, I've accepted, re-documented, and pushed. This is good to go once CI passes.
[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
ianmcook commented on a change in pull request #9610: URL: https://github.com/apache/arrow/pull/9610#discussion_r586830050

## File path: r/R/dataset-partition.R

@@ -76,7 +76,9 @@ HivePartitioning$create <- dataset___HivePartitioning
 #' calling `hive_partition()` with no arguments.
 #' @examples
 #' \donttest{
-#' hive_partition(year = int16(), month = int8())
+#' if (arrow_with_dataset()) {

Review comment: I initially did that, but apparently as of R 4.0.0:

> `R CMD check --as-cran` now runs `\donttest` examples

(from https://cran.r-project.org/doc/manuals/r-devel/NEWS.html)
[GitHub] [arrow] lidavidm commented on pull request #9616: ARROW-11837: [C++][Dataset] expose originating Fragment on ScanTask
lidavidm commented on pull request #9616: URL: https://github.com/apache/arrow/pull/9616#issuecomment-790099377 Thank you for the review and for patching up my very poor bindings :sweat_smile:
[GitHub] [arrow] github-actions[bot] commented on pull request #9625: ARROW-11653: [Rust][DataFusion] Postgres String Functions: ascii, chr, initcap, repeat, reverse, to_hex
github-actions[bot] commented on pull request #9625: URL: https://github.com/apache/arrow/pull/9625#issuecomment-790096651 https://issues.apache.org/jira/browse/ARROW-11653
[GitHub] [arrow] nealrichardson commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
nealrichardson commented on a change in pull request #9610: URL: https://github.com/apache/arrow/pull/9610#discussion_r586819763

## File path: r/tools/autobrew

@@ -48,7 +48,8 @@ fi
 # Hardcode this for my custom autobrew build
 rm -f $BREWDIR/lib/*.dylib
 AWS_LIBS="-laws-cpp-sdk-config -laws-cpp-sdk-transfer -laws-cpp-sdk-identity-management -laws-cpp-sdk-cognito-identity -laws-cpp-sdk-sts -laws-cpp-sdk-s3 -laws-cpp-sdk-core -laws-c-event-stream -laws-checksums -laws-c-common -lpthread -lcurl"
-PKG_LIBS="-L$BREWDIR/lib -lparquet -larrow_dataset -larrow -larrow_bundled_dependencies -lthrift -llz4 -lsnappy -lzstd $AWS_LIBS"
+PKG_LIBS="-lparquet -larrow_dataset -larrow -larrow_bundled_dependencies -lthrift -llz4 -lsnappy -lzstd $AWS_LIBS"
+PKG_DIRS="-L$BREWDIR/lib"

Review comment: We will need to remember to upstream this change when we release to CRAN.
[GitHub] [arrow] nealrichardson commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
nealrichardson commented on a change in pull request #9610: URL: https://github.com/apache/arrow/pull/9610#discussion_r586818947

## File path: r/configure

@@ -26,13 +26,13 @@
 # R CMD INSTALL --configure-vars='INCLUDE_DIR=/.../include LIB_DIR=/.../lib'

 # Library settings
-PKG_CONFIG_NAME="arrow parquet arrow-dataset"
+PKG_CONFIG_NAME="arrow"
 PKG_DEB_NAME="(unsuppored)"
 PKG_RPM_NAME="(unsuppored)"
 PKG_BREW_NAME="apache-arrow"
 PKG_TEST_HEADER=""
-# These must be the same order as $(pkg-config --libs arrow-dataset)
-PKG_LIBS="-larrow_dataset -lparquet -larrow"
+PKG_LIBS="-larrow"
+BUNDLED_LIBS=""

Review comment: Likewise, I don't think you need to set this here; "" is what it will be anyway if it's not set.
[GitHub] [arrow] seddonm1 opened a new pull request #9625: ARROW-11653: [Rust][DataFusion] Postgres String Functions: ascii, chr, initcap, repeat, reverse, to_hex
seddonm1 opened a new pull request #9625: URL: https://github.com/apache/arrow/pull/9625

@alamb This is the second-to-last of the current string functions, but I think there may be one after that with new code. This implements some of the miscellaneous string functions: `ascii`, `chr`, `initcap`, `repeat`, `reverse`, `to_hex`. The next PR will have more useful functions (including regex). A little bit of tidying for consistency with the other functions was applied. This PR is a child of #9243.
[GitHub] [arrow] nealrichardson commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
nealrichardson commented on a change in pull request #9610: URL: https://github.com/apache/arrow/pull/9610#discussion_r586818018

## File path: r/R/dataset-partition.R

@@ -76,7 +76,9 @@ HivePartitioning$create <- dataset___HivePartitioning
 #' calling `hive_partition()` with no arguments.
 #' @examples
 #' \donttest{
-#' hive_partition(year = int16(), month = int8())
+#' if (arrow_with_dataset()) {

Review comment: I don't feel strongly about not requiring dev roxygen, but it's fine either way. If you don't want to use `examplesIf` here, I think you can leave off the `if (arrow_with_dataset())` checks everywhere since it's already inside a `\donttest`.
[GitHub] [arrow] kou commented on a change in pull request #9622: ARROW-11836: [C++] Avoid requiring arrow_bundled_dependencies when it doesn't exist for arrow_static.
kou commented on a change in pull request #9622: URL: https://github.com/apache/arrow/pull/9622#discussion_r586790211

## File path: cpp/src/arrow/ArrowConfig.cmake.in

+    if(@ARROW_BUNDLED_STATIC_LIBS@)

Review comment: This doesn't work when we have two or more bundled static libraries. e.g.:

```cmake
if(jemalloc::jemalloc;mimalloc::mimalloc)
  # ...
endif()
```

doesn't work:

```text
CMake Error at /tmp/local/lib/cmake/arrow/ArrowConfig.cmake:99 (if):
  if given arguments:
    "jemalloc::jemalloc" "mimalloc::mimalloc"
  Unknown arguments specified
Call Stack (most recent call first):
  /tmp/local/lib/cmake/arrow/FindArrow.cmake:213 (find_package)
  /tmp/local/lib/cmake/arrow/FindArrow.cmake:320 (arrow_find_package_cmake_package_configuration)
  /tmp/local/lib/cmake/arrow/FindArrow.cmake:357 (arrow_find_package)
CMakeLists.txt:24 (find_package)
```

How about using a variable?

```diff
diff --git a/cpp/src/arrow/ArrowConfig.cmake.in b/cpp/src/arrow/ArrowConfig.cmake.in
index a5a3619eb..6209baeec 100644
--- a/cpp/src/arrow/ArrowConfig.cmake.in
+++ b/cpp/src/arrow/ArrowConfig.cmake.in
@@ -37,6 +37,7 @@ set(ARROW_FULL_SO_VERSION "@ARROW_FULL_SO_VERSION@")
 set(ARROW_LIBRARY_PATH_SUFFIXES "@ARROW_LIBRARY_PATH_SUFFIXES@")
 set(ARROW_INCLUDE_PATH_SUFFIXES "@ARROW_INCLUDE_PATH_SUFFIXES@")
 set(ARROW_SYSTEM_DEPENDENCIES "@ARROW_SYSTEM_DEPENDENCIES@")
+set(ARROW_BUNDLED_STATIC_LIBS "@ARROW_BUNDLED_STATIC_LIBS@")
 
 include("${CMAKE_CURRENT_LIST_DIR}/ArrowOptions.cmake")
 
@@ -71,7 +72,7 @@ if(NOT (TARGET arrow_shared OR TARGET arrow_static))
     get_property(arrow_static_loc TARGET arrow_static PROPERTY LOCATION)
     get_filename_component(arrow_lib_dir ${arrow_static_loc} DIRECTORY)
 
-    if(@ARROW_BUNDLED_STATIC_LIBS@)
+    if(ARROW_BUNDLED_STATIC_LIBS)
       add_library(arrow_bundled_dependencies STATIC IMPORTED)
       set_target_properties(
         arrow_bundled_dependencies
```
[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
ianmcook commented on a change in pull request #9610: URL: https://github.com/apache/arrow/pull/9610#discussion_r586785240

## File path: r/configure

@@ -145,7 +150,7 @@ else
     # TODO: what about non-bundled deps?
     BUNDLED_LIBS=`cd $LIB_DIR && ls *.a`
     BUNDLED_LIBS=`echo $BUNDLED_LIBS | sed -E "s/lib(.*)\.a/-l\1/" | sed -e "s/\\.a lib/ -l/g"`
-    PKG_LIBS="-L$(pwd)/$LIB_DIR $PKG_LIBS $BUNDLED_LIBS"
+    PKG_DIRS="-L$(pwd)/$LIB_DIR $PKG_DIRS"

Review comment: Oh, gotcha. I just went through and checked the control flow more carefully, and I agree—`PKG_DIRS` will never be set here. I removed it in 31bd5e9c12023b793a7ef30d6d14f2290698ffc1
[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
ianmcook commented on a change in pull request #9610: URL: https://github.com/apache/arrow/pull/9610#discussion_r586783964

## File path: r/configure

@@ -182,15 +187,33 @@ if [ $? -eq 0 ] || [ "$UNAME" = "Darwin" ]; then
   # Always build with arrow on macOS
   PKG_CFLAGS="$PKG_CFLAGS -DARROW_R_WITH_ARROW"
   # Check for features
-  LIB_DIR=`echo $PKG_LIBS | sed -e 's/ -l.*//' | sed -e 's/^-L//'`
-  grep 'set(ARROW_S3 "ON")' $LIB_DIR/cmake/arrow/ArrowOptions.cmake >/dev/null 2>&1
+  LIB_DIR=`echo $PKG_DIRS | sed -e 's/^-L//'`
+  ARROW_OPTS_CMAKE="$LIB_DIR/cmake/arrow/ArrowOptions.cmake"
+  # Check for Arrow Dataset subcomponent

Review comment: Fixed in b7cdbe26ef6ab9496b2a1f810e84469db9620fe1
[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
ianmcook commented on a change in pull request #9610: URL: https://github.com/apache/arrow/pull/9610#discussion_r586783690

## File path: r/configure

@@ -182,15 +187,33 @@ if [ $? -eq 0 ] || [ "$UNAME" = "Darwin" ]; then
   # Always build with arrow on macOS
   PKG_CFLAGS="$PKG_CFLAGS -DARROW_R_WITH_ARROW"
   # Check for features
-  LIB_DIR=`echo $PKG_LIBS | sed -e 's/ -l.*//' | sed -e 's/^-L//'`
-  grep 'set(ARROW_S3 "ON")' $LIB_DIR/cmake/arrow/ArrowOptions.cmake >/dev/null 2>&1
+  LIB_DIR=`echo $PKG_DIRS | sed -e 's/^-L//'`
+  ARROW_OPTS_CMAKE="$LIB_DIR/cmake/arrow/ArrowOptions.cmake"
+  # Check for Arrow Dataset subcomponent
+  grep 'set(ARROW_DATASET "ON")' $ARROW_OPTS_CMAKE >/dev/null 2>&1
+  if [ $? -eq 0 ]; then
+    PKG_CFLAGS="$PKG_CFLAGS -DARROW_R_WITH_DATASET"
+    PKG_LIBS="-larrow_dataset $PKG_LIBS"
+    # TODO: what if arrow-dataset has a different -L location than arrow?

Review comment: Done in 3dc2061abc13949249e689c64cbdab810948ec65
[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
ianmcook commented on a change in pull request #9610: URL: https://github.com/apache/arrow/pull/9610#discussion_r586783164

## File path: r/configure

@@ -26,13 +26,14 @@
 # R CMD INSTALL --configure-vars='INCLUDE_DIR=/.../include LIB_DIR=/.../lib'

 # Library settings
-PKG_CONFIG_NAME="arrow parquet arrow-dataset"
+PKG_CONFIG_NAME="arrow"
 PKG_DEB_NAME="(unsuppored)"
 PKG_RPM_NAME="(unsuppored)"
 PKG_BREW_NAME="apache-arrow"
 PKG_TEST_HEADER=""
-# These must be the same order as $(pkg-config --libs arrow-dataset)
-PKG_LIBS="-larrow_dataset -lparquet -larrow"
+PKG_LIBS="-larrow"
+PKG_DIRS=""

Review comment: I confirmed this. Removed in edbfb7e3732cf63e63ba2a7380771a2c56e722f2
[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
ianmcook commented on a change in pull request #9610: URL: https://github.com/apache/arrow/pull/9610#discussion_r586763543

## File path: r/configure

@@ -26,13 +26,14 @@
 # R CMD INSTALL --configure-vars='INCLUDE_DIR=/.../include LIB_DIR=/.../lib'

 # Library settings
-PKG_CONFIG_NAME="arrow parquet arrow-dataset"
+PKG_CONFIG_NAME="arrow"
 PKG_DEB_NAME="(unsuppored)"
 PKG_RPM_NAME="(unsuppored)"
 PKG_BREW_NAME="apache-arrow"
 PKG_TEST_HEADER=""
-# These must be the same order as $(pkg-config --libs arrow-dataset)
-PKG_LIBS="-larrow_dataset -lparquet -larrow"
+PKG_LIBS="-larrow"

Review comment: Correction: the order was wrong, but this was corrected in b7cdbe26ef6ab9496b2a1f810e84469db9620fe1
[GitHub] [arrow] lidavidm commented on pull request #9616: ARROW-11837: [C++][Dataset] expose originating Fragment on ScanTask
lidavidm commented on pull request #9616: URL: https://github.com/apache/arrow/pull/9616#issuecomment-790032276 Looks like things finally pass after fixing the GLib and Ruby bindings. I opted not to bind Fragment::Scan since the API there is being reworked at the moment anyways.
[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
ianmcook commented on a change in pull request #9610: URL: https://github.com/apache/arrow/pull/9610#discussion_r586750914

## File path: r/configure

@@ -182,15 +187,33 @@ if [ $? -eq 0 ] || [ "$UNAME" = "Darwin" ]; then
   # Always build with arrow on macOS
   PKG_CFLAGS="$PKG_CFLAGS -DARROW_R_WITH_ARROW"
   # Check for features
-  LIB_DIR=`echo $PKG_LIBS | sed -e 's/ -l.*//' | sed -e 's/^-L//'`
-  grep 'set(ARROW_S3 "ON")' $LIB_DIR/cmake/arrow/ArrowOptions.cmake >/dev/null 2>&1
+  LIB_DIR=`echo $PKG_DIRS | sed -e 's/^-L//'`
+  ARROW_OPTS_CMAKE="$LIB_DIR/cmake/arrow/ArrowOptions.cmake"
+  # Check for Arrow Dataset subcomponent

Review comment: D'oh—I switched from appending to prepending and forgot to change the order 😳 Thank you for catching that.
[GitHub] [arrow] nealrichardson commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
nealrichardson commented on a change in pull request #9610: URL: https://github.com/apache/arrow/pull/9610#discussion_r586743366

## File path: r/configure

@@ -182,15 +187,33 @@ if [ $? -eq 0 ] || [ "$UNAME" = "Darwin" ]; then
   # Always build with arrow on macOS
   PKG_CFLAGS="$PKG_CFLAGS -DARROW_R_WITH_ARROW"
   # Check for features
-  LIB_DIR=`echo $PKG_LIBS | sed -e 's/ -l.*//' | sed -e 's/^-L//'`
-  grep 'set(ARROW_S3 "ON")' $LIB_DIR/cmake/arrow/ArrowOptions.cmake >/dev/null 2>&1
+  LIB_DIR=`echo $PKG_DIRS | sed -e 's/^-L//'`
+  ARROW_OPTS_CMAKE="$LIB_DIR/cmake/arrow/ArrowOptions.cmake"
+  # Check for Arrow Dataset subcomponent

Review comment: Sorry, my note was ambiguous. The order should be `-larrow_dataset -lparquet`, correct. But if we're prepending them one at a time, we should first prepend parquet, then arrow_dataset. If I'm following correctly, the code here first does: if ARROW_DATASET, do `PKG_LIBS="-larrow_dataset $PKG_LIBS"`; then, if ARROW_PARQUET, do `PKG_LIBS="-lparquet $PKG_LIBS"`, which would result in `-lparquet -larrow_dataset`.
[GitHub] [arrow] codecov-io commented on pull request #9595: ARROW-11806: [Rust][DataFusion] Optimize join / inner join creation of indices
codecov-io commented on pull request #9595: URL: https://github.com/apache/arrow/pull/9595#issuecomment-790021228

# [Codecov](https://codecov.io/gh/apache/arrow/pull/9595?src=pr&el=h1) Report

> Merging [#9595](https://codecov.io/gh/apache/arrow/pull/9595?src=pr&el=desc) (2869040) into [master](https://codecov.io/gh/apache/arrow/commit/0f647261058892289b1722e5883f8610cc5dd3d9?el=desc) (0f64726) will **increase** coverage by `0.04%`.
> The diff coverage is `78.72%`.

[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9595/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9595?src=pr&el=tree)

```diff
@@            Coverage Diff             @@
##           master    #9595      +/-   ##
==========================================
+ Coverage   82.33%   82.38%   +0.04%
==========================================
  Files         245      245
  Lines       56407    57134     +727
==========================================
+ Hits        46443    47068     +625
- Misses       9964    10066     +102
```

| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9595?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow/pull/9595/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfam9pbi5ycw==) | `84.16% <77.77%> (+0.64%)` | :arrow_up: |
| [rust/datafusion/src/physical\_plan/repartition.rs](https://codecov.io/gh/apache/arrow/pull/9595/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3JlcGFydGl0aW9uLnJz) | `81.21% <100.00%> (-0.14%)` | :arrow_down: |
| [...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9595/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==) | `73.38% <0.00%> (-3.62%)` | :arrow_down: |
| [rust/arrow/src/array/equal/utils.rs](https://codecov.io/gh/apache/arrow/pull/9595/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvdXRpbHMucnM=) | `75.49% <0.00%> (-0.99%)` | :arrow_down: |
| [rust/arrow/src/datatypes/field.rs](https://codecov.io/gh/apache/arrow/pull/9595/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZGF0YXR5cGVzL2ZpZWxkLnJz) | `55.47% <0.00%> (-0.66%)` | :arrow_down: |
| [rust/datafusion/src/physical\_plan/parquet.rs](https://codecov.io/gh/apache/arrow/pull/9595/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3BhcnF1ZXQucnM=) | `87.83% <0.00%> (-0.22%)` | :arrow_down: |
| [rust/benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow/pull/9595/diff?src=pr&el=tree#diff-cnVzdC9iZW5jaG1hcmtzL3NyYy9iaW4vdHBjaC5ycw==) | `38.33% <0.00%> (ø)` | |
| [...datafusion/src/physical\_plan/crypto\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9595/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2NyeXB0b19leHByZXNzaW9ucy5ycw==) | `52.45% <0.00%> (ø)` | |
| [...integration-testing/src/flight\_server\_scenarios.rs](https://codecov.io/gh/apache/arrow/pull/9595/diff?src=pr&el=tree#diff-cnVzdC9pbnRlZ3JhdGlvbi10ZXN0aW5nL3NyYy9mbGlnaHRfc2VydmVyX3NjZW5hcmlvcy5ycw==) | `0.00% <0.00%> (ø)` | |
| [...-testing/src/flight\_server\_scenarios/middleware.rs](https://codecov.io/gh/apache/arrow/pull/9595/diff?src=pr&el=tree#diff-cnVzdC9pbnRlZ3JhdGlvbi10ZXN0aW5nL3NyYy9mbGlnaHRfc2VydmVyX3NjZW5hcmlvcy9taWRkbGV3YXJlLnJz) | `0.00% <0.00%> (ø)` | |
| ... and [10 more](https://codecov.io/gh/apache/arrow/pull/9595/diff?src=pr&el=tree-more) | |

[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9595?src=pr&el=continue).

> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9595?src=pr&el=footer). Last update [0f64726...2869040](https://codecov.io/gh/apache/arrow/pull/9595?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
[GitHub] [arrow] nealrichardson commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
nealrichardson commented on a change in pull request #9610: URL: https://github.com/apache/arrow/pull/9610#discussion_r586741670 ## File path: r/configure ## @@ -145,7 +150,7 @@ else # TODO: what about non-bundled deps? BUNDLED_LIBS=`cd $LIB_DIR && ls *.a` BUNDLED_LIBS=`echo $BUNDLED_LIBS | sed -E "s/lib(.*)\.a/-l\1/" | sed -e "s/\\.a lib/ -l/g"` -PKG_LIBS="-L$(pwd)/$LIB_DIR $PKG_LIBS $BUNDLED_LIBS" +PKG_DIRS="-L$(pwd)/$LIB_DIR $PKG_DIRS" Review comment: Yeah, I followed that. I just meant that I don't think it's possible for `PKG_DIRS` to be set to anything other than "" before here, so you don't need to append to it, do you? Just set `PKG_DIRS="-L$(pwd)/$LIB_DIR"`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #9528: ARROW-8732: [C++] Add basic cancellation API
github-actions[bot] commented on pull request #9528: URL: https://github.com/apache/arrow/pull/9528#issuecomment-790018580 Revision: 848e020cfaf234411d3a0167b58dc39c030823a4 Submitted crossbow builds: [ursacomputing/crossbow @ actions-174](https://github.com/ursacomputing/crossbow/branches/all?query=actions-174) |Task|Status| ||--| |test-conda-python-3.6|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.6)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.6)| |test-conda-python-3.6-pandas-0.23|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.6-pandas-0.23)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.6-pandas-0.23)| |test-conda-python-3.7|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.7)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.7)| |test-conda-python-3.7-dask-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.7-dask-latest)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.7-dask-latest)| |test-conda-python-3.7-hdfs-3.2|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.7-hdfs-3.2)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.7-hdfs-3.2)| |test-conda-python-3.7-kartothek-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.7-kartothek-latest)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.7-kartothek-latest)| |test-conda-python-3.7-kartothek-master|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.7-kartothek-master)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.7-kartothek-master)| |test-conda-python-3.7-pandas-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.7-pandas-latest)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.7-pandas-latest)| |test-conda-python-3.7-pandas-master|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.7-pandas-master)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.7-pandas-master)| |test-conda-python-3.7-spark-branch-3.0|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.7-spark-branch-3.0)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.7-spark-branch-3.0)| |test-conda-python-3.7-turbodbc-latest|[![Github 
Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.7-turbodbc-latest)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.7-turbodbc-latest)| |test-conda-python-3.7-turbodbc-master|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.7-turbodbc-master)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.7-turbodbc-master)| |test-conda-python-3.8|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.8)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.8)| |test-conda-python-3.8-dask-master|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.8-dask-master)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.8-dask-master)| |test-conda-python-3.8-hypothesis|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-174-github-test-conda-python-3.8-hypothesis)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-174-github-test-conda-python-3.8-hypothesis)| |test-conda-python-3.8-jpype|[![G
[GitHub] [arrow] ovr commented on a change in pull request #9232: ARROW-10818: [Rust] Implement DecimalType
ovr commented on a change in pull request #9232: URL: https://github.com/apache/arrow/pull/9232#discussion_r586722375 ## File path: rust/datafusion/src/physical_plan/group_scalar.rs ## @@ -22,10 +22,12 @@ use std::convert::{From, TryFrom}; use crate::error::{DataFusionError, Result}; use crate::scalar::ScalarValue; +use arrow::datatypes::Decimal128Type; /// Enumeration of types that can be used in a GROUP BY expression #[derive(Debug, PartialEq, Eq, Hash, Clone)] pub(crate) enum GroupByScalar { +Decimal128(i128, usize, usize), Review comment: Change it to
```rust
// Box is needed to reduce memory usage
Decimal128(Box<Decimal128Type>),
```
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
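To see why boxing the large payload helps here, consider that a Rust enum is as big as its largest variant. The following minimal sketch illustrates the trick; `Decimal128Value`, `GroupKeyInline`, and `GroupKeyBoxed` are hypothetical names for illustration, not the DataFusion types:

```rust
use std::mem::size_of;

// Hypothetical stand-in for a decimal payload: value + precision + scale.
struct Decimal128Value {
    value: i128,
    precision: usize,
    scale: usize,
}

// Storing the payload inline makes every variant as large as the biggest one.
enum GroupKeyInline {
    UInt32(u32),
    Decimal128(i128, usize, usize),
}

// Boxing the large variant keeps the enum roughly pointer-sized plus the tag.
enum GroupKeyBoxed {
    UInt32(u32),
    Decimal128(Box<Decimal128Value>),
}

fn main() {
    // Exact numbers depend on the target and layout, but the boxed enum
    // is substantially smaller, which matters when millions of group keys
    // are held in a hash map during aggregation.
    println!("inline: {} bytes", size_of::<GroupKeyInline>());
    println!("boxed:  {} bytes", size_of::<GroupKeyBoxed>());
}
```

The trade-off is one extra heap allocation and pointer chase per decimal key, which is usually worth it when the large variant is rare relative to the small ones.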
[GitHub] [arrow] pitrou commented on pull request #9528: ARROW-8732: [C++] Add basic cancellation API
pitrou commented on pull request #9528: URL: https://github.com/apache/arrow/pull/9528#issuecomment-789996441 @github-actions crossbow submit -g python This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #9528: ARROW-8732: [C++] Add basic cancellation API
pitrou commented on pull request #9528: URL: https://github.com/apache/arrow/pull/9528#issuecomment-789995923 @ursabot crossbow submit -g python This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] alamb commented on a change in pull request #9595: ARROW-11806: [Rust][DataFusion] Optimize join / inner join creation of indices
alamb commented on a change in pull request #9595: URL: https://github.com/apache/arrow/pull/9595#discussion_r586715897 ## File path: rust/datafusion/src/physical_plan/hash_join.rs ## @@ -311,6 +319,7 @@ fn update_hash( hash: &mut JoinHashMap, offset: usize, random_state: &RandomState, +hashes_buffer: &mut Vec<u64>, Review comment: 🤔 definitely not for this PR This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #7179: ARROW-8732: [C++] Add basic cancellation API
pitrou commented on pull request #7179: URL: https://github.com/apache/arrow/pull/7179#issuecomment-789990634 Closed, superseded by #9528. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou closed pull request #7179: ARROW-8732: [C++] Add basic cancellation API
pitrou closed pull request #7179: URL: https://github.com/apache/arrow/pull/7179 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] Dandandan commented on pull request #9595: ARROW-11806: [Rust][DataFusion] Optimize join / inner join creation of indices
Dandandan commented on pull request #9595: URL: https://github.com/apache/arrow/pull/9595#issuecomment-789968391 Thanks @alamb, resolved the inconsistent naming. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] Dandandan commented on a change in pull request #9595: ARROW-11806: [Rust][DataFusion] Optimize join / inner join creation of indices
Dandandan commented on a change in pull request #9595: URL: https://github.com/apache/arrow/pull/9595#discussion_r586686916 ## File path: rust/datafusion/src/physical_plan/hash_join.rs ## @@ -311,6 +319,7 @@ fn update_hash( hash: &mut JoinHashMap, offset: usize, random_state: &RandomState, +hashes_buffer: &mut Vec<u64>, Review comment: Indeed, this change is for reusing the allocated `Vec`. Yes, makes sense to group them in a struct. There are some opportunities in other functions (`build_join_indexes`, `build_batch`, etc.) for this as well. Not sure whether it makes sense for them all to receive the same struct, or maybe each a subset of the most commonly needed parts :thinking: This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
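The buffer-reuse pattern under discussion is simple but effective: the caller owns one scratch `Vec` and every batch overwrites it in place, so the allocation is paid once instead of once per batch. A minimal sketch, with a toy multiplicative hash standing in for DataFusion's real per-column hashing (all names here are illustrative):

```rust
// Overwrites the caller-owned buffer in place; one allocation is
// amortized across all batches instead of allocating per batch.
fn hash_batch(values: &[i64], hashes_buffer: &mut Vec<u64>) {
    hashes_buffer.clear();
    hashes_buffer.resize(values.len(), 0);
    for (hash, value) in hashes_buffer.iter_mut().zip(values) {
        // Stand-in for the real hashing of the join key columns.
        *hash = (*value as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15);
    }
}

fn main() {
    let mut hashes_buffer = Vec::new();
    for batch in [&[1_i64, 2, 3][..], &[4, 5][..]].iter() {
        hash_batch(batch, &mut hashes_buffer); // same allocation reused
        println!("{:?}", hashes_buffer);
    }
}
```

Because `resize` only reallocates when a batch is larger than any seen before, steady-state processing does no allocation at all.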
[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
ianmcook commented on a change in pull request #9610: URL: https://github.com/apache/arrow/pull/9610#discussion_r586648630 ## File path: r/configure ## @@ -69,19 +70,22 @@ fi if [ "$INCLUDE_DIR" ] || [ "$LIB_DIR" ]; then echo "*** Using INCLUDE_DIR/LIB_DIR" PKG_CFLAGS="-I$INCLUDE_DIR $PKG_CFLAGS" - PKG_LIBS="-L$LIB_DIR $PKG_LIBS" + PKG_DIRS="-L$LIB_DIR $PKG_DIRS" else # Use pkg-config if available and allowed pkg-config --version >/dev/null 2>&1 if [ "$ARROW_USE_PKG_CONFIG" != "false" ] && [ $? -eq 0 ]; then PKGCONFIG_CFLAGS=`pkg-config --cflags --silence-errors ${PKG_CONFIG_NAME}` -PKGCONFIG_LIBS=`pkg-config --libs --silence-errors ${PKG_CONFIG_NAME}` +PKGCONFIG_LIBS=`pkg-config --libs-only-l --silence-errors ${PKG_CONFIG_NAME}` +PKGCONFIG_DIRS=`pkg-config --libs-only-L --silence-errors ${PKG_CONFIG_NAME}` +# TODO: what about --libs-only-other? Review comment: Cool, thanks for pointing that out. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
ianmcook commented on a change in pull request #9610: URL: https://github.com/apache/arrow/pull/9610#discussion_r586643234 ## File path: r/configure ## @@ -145,7 +150,7 @@ else # TODO: what about non-bundled deps? BUNDLED_LIBS=`cd $LIB_DIR && ls *.a` BUNDLED_LIBS=`echo $BUNDLED_LIBS | sed -E "s/lib(.*)\.a/-l\1/" | sed -e "s/\\.a lib/ -l/g"` -PKG_LIBS="-L$(pwd)/$LIB_DIR $PKG_LIBS $BUNDLED_LIBS" +PKG_DIRS="-L$(pwd)/$LIB_DIR $PKG_DIRS" Review comment: `PKG_DIRS` contains only the `-L` flags, which are directories. `PKG_LIBS` contains only the `-l` flags, which are library names. Previously these were combined together in `PKG_LIBS`. I separated them because it makes the appending/prepending much more straightforward. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
ianmcook commented on a change in pull request #9610: URL: https://github.com/apache/arrow/pull/9610#discussion_r586639762 ## File path: r/configure ## @@ -145,7 +150,7 @@ else # TODO: what about non-bundled deps? BUNDLED_LIBS=`cd $LIB_DIR && ls *.a` BUNDLED_LIBS=`echo $BUNDLED_LIBS | sed -E "s/lib(.*)\.a/-l\1/" | sed -e "s/\\.a lib/ -l/g"` -PKG_LIBS="-L$(pwd)/$LIB_DIR $PKG_LIBS $BUNDLED_LIBS" +PKG_DIRS="-L$(pwd)/$LIB_DIR $PKG_DIRS" Review comment: `$(command)` is apparently not a bashism: https://stackoverflow.com/questions/3789894/is-command-substitution-foo-bashism This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
ianmcook commented on a change in pull request #9610: URL: https://github.com/apache/arrow/pull/9610#discussion_r586615537 ## File path: r/R/dataset-partition.R ## @@ -76,7 +76,9 @@ HivePartitioning$create <- dataset___HivePartitioning #' calling `hive_partition()` with no arguments. #' @examples #' \donttest{ -#' hive_partition(year = int16(), month = int8()) +#' if (arrow_with_dataset()) { Review comment: Right, roxygen2 does not run on CRAN, but I think we should avoid requiring R package contributors to use dev versions of devtools, roxygen2, etc. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson commented on pull request #9598: ARROW-11804: [Developer] Offer to create JIRA issue
nealrichardson commented on pull request #9598: URL: https://github.com/apache/arrow/pull/9598#issuecomment-789884959 > In general, I felt the approach to adding an optional, manually acknowledged step I agree. What I'm envisioning is that the existing bot that says "Thanks for opening a pull request! Please name your PR with ARROW-: Issue Title, or go to JIRA and make an issue" would instead say "Thanks for opening a pull request! Please name your PR with ARROW-: Issue Title, or comment `please make a JIRA` on the PR to have me make one for you". And a similar message for new GitHub Issues. That way, the contributor doesn't have to leave GitHub if they don't want, but we get the JIRA issue up front too. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
nealrichardson commented on a change in pull request #9610: URL: https://github.com/apache/arrow/pull/9610#discussion_r586587025 ## File path: r/R/dataset-partition.R ## @@ -76,7 +76,9 @@ HivePartitioning$create <- dataset___HivePartitioning #' calling `hive_partition()` with no arguments. #' @examples #' \donttest{ -#' hive_partition(year = int16(), month = int8()) +#' if (arrow_with_dataset()) { Review comment: We don't need roxygen2 to be on CRAN in order to use it, the Rd it generates is valid. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
ianmcook commented on a change in pull request #9610: URL: https://github.com/apache/arrow/pull/9610#discussion_r586568597 ## File path: r/configure ## @@ -182,15 +187,33 @@ if [ $? -eq 0 ] || [ "$UNAME" = "Darwin" ]; then # Always build with arrow on macOS PKG_CFLAGS="$PKG_CFLAGS -DARROW_R_WITH_ARROW" # Check for features - LIB_DIR=`echo $PKG_LIBS | sed -e 's/ -l.*//' | sed -e 's/^-L//'` - grep 'set(ARROW_S3 "ON")' $LIB_DIR/cmake/arrow/ArrowOptions.cmake >/dev/null 2>&1 + LIB_DIR=`echo $PKG_DIRS | sed -e 's/^-L//'` + ARROW_OPTS_CMAKE="$LIB_DIR/cmake/arrow/ArrowOptions.cmake" + # Check for Arrow Dataset subcomponent Review comment: I thought so too at first, but when I run
```
pkg-config --libs-only-l arrow-dataset
```
I get `-larrow_dataset` first and `-lparquet` second. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
ianmcook commented on a change in pull request #9610: URL: https://github.com/apache/arrow/pull/9610#discussion_r586564929 ## File path: r/configure ## @@ -26,13 +26,14 @@ # R CMD INSTALL --configure-vars='INCLUDE_DIR=/.../include LIB_DIR=/.../lib' # Library settings -PKG_CONFIG_NAME="arrow parquet arrow-dataset" +PKG_CONFIG_NAME="arrow" PKG_DEB_NAME="(unsuppored)" PKG_RPM_NAME="(unsuppored)" PKG_BREW_NAME="apache-arrow" PKG_TEST_HEADER="" -# These must be the same order as $(pkg-config --libs arrow-dataset) -PKG_LIBS="-larrow_dataset -lparquet -larrow" +PKG_LIBS="-larrow" Review comment: Yep, the order in which the code below prepends to `PKG_LIBS` makes its value match the output of
```
pkg-config --libs-only-l arrow-dataset
```
(at least on my Mac) which is:
```
-larrow_dataset -lparquet -larrow
```
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ianmcook commented on a change in pull request #9610: ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
ianmcook commented on a change in pull request #9610: URL: https://github.com/apache/arrow/pull/9610#discussion_r586560582 ## File path: r/data-raw/codegen.R ## @@ -194,10 +194,16 @@ arrow_exports_cpp <- glue::glue(' {feature_available("arrow")} +{feature_available("dataset")} Review comment: Done in c8d390d This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] andygrove closed pull request #9619: ARROW-11842: [Rust][Parquet] Use clone_from in get_batch_with_dict
andygrove closed pull request #9619: URL: https://github.com/apache/arrow/pull/9619 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] romainfrancois commented on pull request #9615: ARROW-3316: [R] Multi-threaded conversion from R data.frame to Arrow table / record batch
romainfrancois commented on pull request #9615: URL: https://github.com/apache/arrow/pull/9615#issuecomment-789772368
```cpp
bool CanExtendParallel(SEXP x, const std::shared_ptr<arrow::DataType>& type) {
  // TODO: identify when it's ok to do things in parallel
  return false;
}
```
turning this to `true` entirely makes everything fail dramatically. So we'll need to pay careful attention to what can and cannot be done concurrently. We might have to dial down the use of the `cpp11` package because I believe when we create a `cpp11` vector to wrap a `SEXP` this uses a central resource for the protection. It would probably be better to move the "can this be done in parallel" decision to a virtual method of `RConverter`, but we would need to capture `converter` in the `task` lambda, or at least figure out some way for the converter to still be alive when the task is run. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
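The lifetime concern in that last sentence is a general one: a task scheduled for later execution must own (or share ownership of) the converter rather than borrow it from the spawning scope. A conceptual Rust sketch of the same idea, where `Arc` plays the role `std::shared_ptr` would play in the C++ lambda capture; the `Converter` trait and `Int64Converter` type are illustrative assumptions, not the arrow R code:

```rust
use std::sync::Arc;
use std::thread;

// Illustrative trait; in the C++ code this would be a virtual
// method on RConverter.
trait Converter: Send + Sync {
    fn can_extend_parallel(&self) -> bool;
    fn extend(&self);
}

struct Int64Converter;

impl Converter for Int64Converter {
    fn can_extend_parallel(&self) -> bool {
        true
    }
    fn extend(&self) {
        // convert one chunk of the input
    }
}

fn main() {
    let converter: Arc<dyn Converter> = Arc::new(Int64Converter);
    // The task clones the Arc and moves it into the closure, so the
    // converter is guaranteed to be alive whenever the task runs.
    let task = {
        let converter = Arc::clone(&converter);
        thread::spawn(move || {
            if converter.can_extend_parallel() {
                converter.extend();
            }
        })
    };
    task.join().unwrap();
}
```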
[GitHub] [arrow] ericwburden commented on a change in pull request #9624: ARROW-11845: [Rust] Implement to_isize() for ArrowNativeTypes
ericwburden commented on a change in pull request #9624: URL: https://github.com/apache/arrow/pull/9624#discussion_r586483501 ## File path: rust/arrow/src/array/array_binary.rs ## @@ -89,7 +89,7 @@ impl<OffsetSize: BinaryOffsetSizeTrait> GenericBinaryArray<OffsetSize> { // pointer alignment & location is ensured by RawPtrBox // buffer bounds/offset is ensured by the value_offset invariants std::slice::from_raw_parts( -self.value_data.as_ptr().offset(start.to_isize()), +self.value_data.as_ptr().offset(start.to_isize().unwrap()), Review comment: I do think it would be good form to add the explanatory comments, though. I'll just hang out and see if anyone corrects my logic, then commit the comments. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ericwburden commented on a change in pull request #9624: ARROW-11845: [Rust] Implement to_isize() for ArrowNativeTypes
ericwburden commented on a change in pull request #9624: URL: https://github.com/apache/arrow/pull/9624#discussion_r586482079 ## File path: rust/arrow/src/array/array_binary.rs ## @@ -89,7 +89,7 @@ impl<OffsetSize: BinaryOffsetSizeTrait> GenericBinaryArray<OffsetSize> { // pointer alignment & location is ensured by RawPtrBox // buffer bounds/offset is ensured by the value_offset invariants std::slice::from_raw_parts( -self.value_data.as_ptr().offset(start.to_isize()), +self.value_data.as_ptr().offset(start.to_isize().unwrap()), Review comment: Actually, now that I consider it, the implementation of `to_isize()` that I removed from OffsetSizeTrait already had the unwraps, just inside the `to_isize()` function instead of on the result of `to_isize()`. So, at the very least, we shouldn't be adding any risks we weren't already taking. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ericwburden commented on a change in pull request #9624: ARROW-11845: [Rust] Implement to_isize() for ArrowNativeTypes
ericwburden commented on a change in pull request #9624: URL: https://github.com/apache/arrow/pull/9624#discussion_r586473814 ## File path: rust/arrow/src/array/array_binary.rs ## @@ -89,7 +89,7 @@ impl<OffsetSize: BinaryOffsetSizeTrait> GenericBinaryArray<OffsetSize> { // pointer alignment & location is ensured by RawPtrBox // buffer bounds/offset is ensured by the value_offset invariants std::slice::from_raw_parts( -self.value_data.as_ptr().offset(start.to_isize()), +self.value_data.as_ptr().offset(start.to_isize().unwrap()), Review comment: Here's my rationale:
> `start` is an &OffsetSize, which is a generic type that implements the OffsetSizeTrait. Currently, only i32 and i64 implement OffsetSizeTrait, both of which should cleanly cast to isize on an architecture that supports 32/64-bit offsets

I *think* that is true. Can anyone think of a situation in which that would not hold? (For reference, I'm primarily an R programmer, so questions about memory address size are a bit arcane for me). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
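A hedged sketch of that rationale, using the `num-traits` crate's `ToPrimitive` as a stand-in for the Arrow trait (an assumption for illustration, not the actual arrow code). It also shows the caveat: the conversion returns an `Option`, and on a 64-bit target every in-range `i32` and `i64` offset is `Some`, so the unwrap is effectively a loud assertion rather than a real risk; on a 32-bit target a large `i64` would make it panic:

```rust
// Requires the num-traits crate as a dependency.
use num_traits::ToPrimitive;

fn byte_offset<T: ToPrimitive>(offset: T) -> isize {
    // On 64-bit targets isize is 64 bits wide, so every i32 and every
    // i64 that indexes a real buffer fits. `to_isize` only returns None
    // when the value cannot be represented, which `expect` surfaces
    // immediately instead of silently corrupting pointer arithmetic.
    offset.to_isize().expect("offset does not fit in isize")
}

fn main() {
    assert_eq!(byte_offset(42_i32), 42_isize);
    assert_eq!(byte_offset(9_000_000_000_i64), 9_000_000_000_isize);
}
```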
[GitHub] [arrow] andygrove commented on pull request #9573: ARROW-11783: [Rust] Proposal for RFCs in Rust Arrow
andygrove commented on pull request #9573: URL: https://github.com/apache/arrow/pull/9573#issuecomment-789746539 > I've been looking for ways to link the mdbook that could be generated for the RFCs to the apache arrow website and to be honest, I don't think it is possible. Also, I think it will become very cumbersome when people want to add a new RFC to the repo if we keep the RFCs within the Arrow repository. > > I propose that we create another repo in the main Apache group and from there we can work with the RFCs. Something similar to what they have done to the [Arrow website](https://github.com/apache/arrow-site). I also think that by keeping it as a separate repository it will make it easier to add and maintain all the RFCs that may be received. > > What do you think? @andygrove @alamb @jorgecarleitao @nevi-me Personally, I think we should have the RFCs in the main repo. We already publish documentation from the main repo to the Arrow web site (for example, the content under `docs/source`). Sorry, I haven't been following this closely so need to catch up on the discussions as to why you are running into issues with publishing the mdbook. I'll try and do that over the next couple days. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ovr commented on a change in pull request #9232: ARROW-10818: [Rust] Implement DecimalType
ovr commented on a change in pull request #9232: URL: https://github.com/apache/arrow/pull/9232#discussion_r586437638 ## File path: rust/datafusion/src/physical_plan/group_scalar.rs ## @@ -22,10 +22,12 @@ use std::convert::{From, TryFrom}; use crate::error::{DataFusionError, Result}; use crate::scalar::ScalarValue; +use arrow::datatypes::Decimal128Type; /// Enumeration of types that can be used in a GROUP BY expression #[derive(Debug, PartialEq, Eq, Hash, Clone)] pub(crate) enum GroupByScalar { +Decimal128(i128, usize, usize), Review comment: @alamb Should I store it as (i128, usize, usize) or `Decimal128Type` (which is a structure) for `ScalarValue::Decimal128`? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] andygrove commented on a change in pull request #9624: ARROW-11845: [Rust] Implement to_isize() for ArrowNativeTypes
andygrove commented on a change in pull request #9624: URL: https://github.com/apache/arrow/pull/9624#discussion_r586449677 ## File path: rust/arrow/src/array/array_binary.rs ## @@ -89,7 +89,7 @@ impl<OffsetSize: BinaryOffsetSizeTrait> GenericBinaryArray<OffsetSize> { // pointer alignment & location is ensured by RawPtrBox // buffer bounds/offset is ensured by the value_offset invariants std::slice::from_raw_parts( -self.value_data.as_ptr().offset(start.to_isize()), +self.value_data.as_ptr().offset(start.to_isize().unwrap()), Review comment: If these calls to `unwrap` can't be avoided, could we at least document why they are needed/safe? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou edited a comment on pull request #8648: ARROW-7906: [C++] [Python] Add ORC write support
pitrou edited a comment on pull request #8648: URL: https://github.com/apache/arrow/pull/8648#issuecomment-789737670 @mathyingzhou What is the status of this PR? Are you waiting for another review? (feel free to ping) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #9614: ARROW-11839: [C++] Use xsimd for generation of accelerated bit-unpacking
pitrou commented on pull request #9614: URL: https://github.com/apache/arrow/pull/9614#issuecomment-789736290 Interesting, thank you. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] alamb commented on pull request #9213: ARROW-11266: [Rust][DataFusion] Implement vectorized hashing for hash aggregate [WIP]
alamb commented on pull request #9213: URL: https://github.com/apache/arrow/pull/9213#issuecomment-789733069 Closing this PR for now; we can reopen it if need be. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] alamb commented on pull request #9493: ARROW-11624: [Rust] Move Arrow benchmarks to its own crate
alamb commented on pull request #9493: URL: https://github.com/apache/arrow/pull/9493#issuecomment-789732796 Closing this PR for now This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] alamb closed pull request #9493: ARROW-11624: [Rust] Move Arrow benchmarks to its own crate
alamb closed pull request #9493: URL: https://github.com/apache/arrow/pull/9493 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ovr commented on pull request #9232: ARROW-10818: [Rust] Implement DecimalType
ovr commented on pull request #9232: URL: https://github.com/apache/arrow/pull/9232#issuecomment-789732012 > @ovr what do you think we should do with this PR? Is it worth keeping open as a Draft or shall we close it for now? It's time to finish it. So, I've rebased the PR and dropped `Decimal256` from it (will send it in another PR). Planning to finish it over the weekend. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] alamb commented on a change in pull request #9595: ARROW-11806: [Rust][DataFusion] Optimize join / inner join creation of indices
alamb commented on a change in pull request #9595: URL: https://github.com/apache/arrow/pull/9595#discussion_r586420039 ## File path: rust/datafusion/src/physical_plan/hash_join.rs ## @@ -699,13 +726,11 @@ } /// Creates hash values for every element in the row based on the values in the columns -pub fn create_hashes( +pub fn create_hashes<'a>( arrays: &[ArrayRef], random_state: &RandomState, -) -> Result<Vec<u64>> { -let rows = arrays[0].len(); -let mut hashes = vec![0; rows]; - +hashes: &'a mut Vec<u64>, Review comment: I think it would help readability to name this `hashes_buffer` for consistency ## File path: rust/datafusion/src/physical_plan/hash_join.rs ## @@ -311,6 +319,7 @@ fn update_hash( hash: &mut JoinHashMap, offset: usize, random_state: &RandomState, +hashes_buffer: &mut Vec<u64>, Review comment: This is effectively allowing `hashes_buffer` to be reused, right? It may eventually make sense to make some struct that holds all the relevant state (`on`, `random_state`, `hash_buf`, etc.). ## File path: rust/datafusion/src/physical_plan/hash_join.rs ## @@ -476,15 +485,16 @@ fn build_join_indexes( .into_array(left_data.1.num_rows())) }) .collect::<Result<Vec<_>>>()?; -let hash_values = create_hashes(&keys_values, &random_state)?; +let hash_buff = &mut vec![0; keys_values[0].len()]; Review comment: I recommend naming this `hashes_buffer` for consistency with its use elsewhere This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
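A sketch of the struct-grouping idea floated in the second comment above: instead of threading `random_state`, the scratch buffer, and the hash map through each helper as separate parameters, bundle them into one struct. Every name here (`JoinState`, the toy multiplicative hash, the plain `HashMap` standing in for `JoinHashMap`) is hypothetical, not the DataFusion API:

```rust
use std::collections::HashMap;

// Hypothetical bundle of the state the hash-join helpers share.
struct JoinState {
    random_seed: u64,                   // stand-in for ahash::RandomState
    hashes_buffer: Vec<u64>,            // reused per-batch scratch space
    hash_map: HashMap<u64, Vec<usize>>, // stand-in for JoinHashMap
}

impl JoinState {
    fn update_hash(&mut self, keys: &[i64], offset: usize) {
        self.hashes_buffer.clear();
        self.hashes_buffer.resize(keys.len(), 0);
        for (hash, key) in self.hashes_buffer.iter_mut().zip(keys) {
            // Toy hash; the real code hashes every join key column.
            *hash = (*key as u64).wrapping_mul(self.random_seed);
        }
        // Each row index is stored under its hash, shifted by the
        // offset of this batch within the build side.
        for (row, hash) in self.hashes_buffer.iter().enumerate() {
            self.hash_map.entry(*hash).or_default().push(row + offset);
        }
    }
}

fn main() {
    let mut state = JoinState {
        random_seed: 0x9E37_79B9_7F4A_7C15,
        hashes_buffer: Vec::new(),
        hash_map: HashMap::new(),
    };
    state.update_hash(&[1, 2, 3], 0);
    state.update_hash(&[4, 5], 3); // second batch starts at row 3
    println!("{} distinct hashes", state.hash_map.len());
}
```

Grouping the state this way keeps the helper signatures stable as new reusable buffers are added, at the cost of passing a `&mut` to the whole struct where a narrower borrow might otherwise do.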
[GitHub] [arrow] alamb commented on pull request #9595: ARROW-11806: [Rust][DataFusion] Optimize join / inner join creation of indices
alamb commented on pull request #9595: URL: https://github.com/apache/arrow/pull/9595#issuecomment-789714615 The integration test failure in https://github.com/apache/arrow/pull/9595/checks?check_run_id=1998235390 seems to be the same as the one fixed in https://github.com/apache/arrow/pull/9593. I also pulled this branch locally, re-ran the tests, and everything looks good to me. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #9624: ARROW-11845: [Rust] Implement to_isize() for ArrowNativeTypes
github-actions[bot] commented on pull request #9624: URL: https://github.com/apache/arrow/pull/9624#issuecomment-789710271 https://issues.apache.org/jira/browse/ARROW-11845 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4
pitrou commented on pull request #8949: URL: https://github.com/apache/arrow/pull/8949#issuecomment-789701493 Congratulations @liyafan82 ! Do you have an idea how hard it will be to add zstd support? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm commented on pull request #9616: ARROW-11837: [C++][Dataset] expose originating Fragment on ScanTask
lidavidm commented on pull request #9616: URL: https://github.com/apache/arrow/pull/9616#issuecomment-789698947 Ah, this is broken until I fully define GADInMemoryFragment...let me fix that up This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #9624: ARROW-11845: [Rust] Implement to_isize() for ArrowNativeTypes
github-actions[bot] commented on pull request #9624: URL: https://github.com/apache/arrow/pull/9624#issuecomment-789694328 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also: * [Other pull requests](https://github.com/apache/arrow/pulls/) * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] liyafan82 commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4
liyafan82 commented on pull request #8949: URL: https://github.com/apache/arrow/pull/8949#issuecomment-789684969 The integration tests have passed. Please take another look when you have time, dear reviewers. (Maybe just review the last three commits.) Thanks a lot. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] Dandandan commented on pull request #9605: ARROW-11802: [Rust][DataFusion] Remove use of crossbeam channels to avoid potential deadlocks
Dandandan commented on pull request #9605: URL: https://github.com/apache/arrow/pull/9605#issuecomment-789670134 I also tested partitioning parquet & reading parquet from datafusion - all worked OK :+1: This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org