[GitHub] [arrow] cyb70289 commented on issue #9735: [Golang] Create ipc format for ipc Reader
cyb70289 commented on issue #9735: URL: https://github.com/apache/arrow/issues/9735#issuecomment-801634282 > `subscribe first to send an e-mail` > Sorry, this is my first time using GitHub. Please show me how! @hunght3101, the mailing list is the preferred venue for discussing issues. You can simply click the `subscribe` link on the page https://arrow.apache.org/community/ to subscribe to the user@ mailing list, then send an email to the community about your issue.
[GitHub] [arrow] projjal commented on a change in pull request #9707: ARROW-11984: [C++] Implement SHA128 and SHA256 functions in Gandiva module - WIP
projjal commented on a change in pull request #9707: URL: https://github.com/apache/arrow/pull/9707#discussion_r596529641 ## File path: cpp/src/gandiva/gdv_function_stubs.cc ## @@ -122,6 +123,133 @@ int32_t gdv_fn_populate_varlen_vector(int64_t context_ptr, int8_t* data_ptr, return 0; } +#define SHA128_HASH_FUNCTION(TYPE) \ + GANDIVA_EXPORT \ + const char *gdv_fn_sha128_##TYPE(int64_t context, gdv_##TYPE value, \ +bool validity, int32_t *out_length) { \ +if (!validity) { \ + *out_length = 0; \ + return ""; \ Review comment: Return a hash for null values too. ## File path: cpp/src/gandiva/tests/hash_test.cc ## @@ -147,4 +153,252 @@ TEST_F(TestHash, TestBuf) { } } +TEST_F(TestHash, TestSha256Simple) { Review comment: I think you should also add a unit test for HashUtils::GetHash() with expected SHA values. You could also add a Java test in java/gandiva/ProjectorTest and compare against digests generated with the MessageDigest class. ## File path: cpp/src/gandiva/hash_utils.cc ## @@ -0,0 +1,137 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License.
+ +#include +#include "openssl/evp.h" +#include "gandiva/hash_utils.h" +#include "gandiva/execution_context.h" +#include "gandiva/gdv_function_stubs.h" + +namespace gandiva { + const char* HashUtils::HashUsingSha256(int64_t context, + const void* message, + size_t message_length, + int32_t *out_length) { +// The buffer size is the hash size + null character +int sha256_result_length = 65; +return HashUtils::GetHash(context, message, message_length, EVP_sha256(), + sha256_result_length, out_length); + } + const char* HashUtils::HashUsingSha128(int64_t context, + const void* message, + size_t message_length, + int32_t *out_length) { +// The buffer size is the hash size + null character +int sha128_result_length = 41; +return HashUtils::GetHash(context, message, message_length, EVP_sha1(), + sha128_result_length, out_length); + } + + const char* HashUtils::GetHash(int64_t context, + const void* message, + size_t message_length, + const EVP_MD *hash_type, + int result_buf_size, + int32_t *out_length) { +EVP_MD_CTX *md_ctx = EVP_MD_CTX_new(); + +if (md_ctx == nullptr) { + HashUtils::ErrorMessage(context, "Could not allocate memory " + "for SHA processing."); + return ""; +} + +int evp_success_status = 1; + +if (EVP_DigestInit_ex(md_ctx, hash_type, nullptr) != evp_success_status) { + HashUtils::ErrorMessage(context, "Could not obtain the hash " + "for the defined value."); + EVP_MD_CTX_free(md_ctx); + return ""; +} + +if (EVP_DigestUpdate(md_ctx, message, message_length) != evp_success_status) { + HashUtils::ErrorMessage(context, "Could not obtain the hash for " + "the defined value."); + EVP_MD_CTX_free(md_ctx); + return ""; +} + +int hash_size = EVP_MD_size(hash_type); +auto* result = static_cast<unsigned char*>(OPENSSL_malloc(hash_size)); + +if (result == nullptr) { + HashUtils::ErrorMessage(context, "Could not allocate memory "
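The review suggestion above (a unit test for `HashUtils::GetHash()` with expected SHA values) can be grounded in a published test vector. A minimal sketch, assuming OpenSSL's one-shot `EVP_Digest` (the condensed form of the `EVP_DigestInit_ex`/`EVP_DigestUpdate` sequence quoted above); `Sha256Hex` is an illustrative helper, not part of the PR:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

#include "openssl/evp.h"

// Hex-encode the SHA-256 digest of `message` using the one-shot EVP API.
std::string Sha256Hex(const void* message, size_t length) {
  unsigned char digest[EVP_MAX_MD_SIZE];
  unsigned int digest_len = 0;
  if (EVP_Digest(message, length, digest, &digest_len, EVP_sha256(), nullptr) != 1) {
    return "";  // digest computation failed
  }
  char hex[2 * EVP_MAX_MD_SIZE + 1] = {0};
  for (unsigned int i = 0; i < digest_len; ++i) {
    std::snprintf(hex + 2 * i, 3, "%02x", digest[i]);
  }
  return std::string(hex, 2 * digest_len);
}

int main() {
  // FIPS 180-2 test vector: SHA-256("abc").
  assert(Sha256Hex("abc", 3) ==
         "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad");
  return 0;
}
```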
[GitHub] [arrow] liyafan82 commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4
liyafan82 commented on pull request #8949: URL: https://github.com/apache/arrow/pull/8949#issuecomment-801600759 > If you've already started [ARROW-11899](https://issues.apache.org/jira/browse/ARROW-11899) then I'll let you finish it up; hopefully it isn't too much work. We are discussing on the ML the path forward for LZ4 in general; once that is cleared up we can figure out how to do the work, including whether @HedgehogCode is interested in contributing. Sounds good. Hopefully, I will prepare a PR in a few days.
[GitHub] [arrow] github-actions[bot] commented on pull request #9745: ARROW-11703: [R] Implement dplyr::arrange() [WIP]
github-actions[bot] commented on pull request #9745: URL: https://github.com/apache/arrow/pull/9745#issuecomment-801599144 https://issues.apache.org/jira/browse/ARROW-11703
[GitHub] [arrow] ianmcook opened a new pull request #9745: ARROW-11703: [R] Implement dplyr::arrange() [WIP]
ianmcook opened a new pull request #9745: URL: https://github.com/apache/arrow/pull/9745
[GitHub] [arrow] emkornfield commented on a change in pull request #9147: ARROW-11177: [Java] ArrowMessage failed to parse compressed grpc stream
emkornfield commented on a change in pull request #9147: URL: https://github.com/apache/arrow/pull/9147#discussion_r596528147 ## File path: java/flight/flight-core/src/main/java/org/apache/arrow/flight/ArrowMessage.java ## @@ -332,6 +333,23 @@ private static ArrowMessage frame(BufferAllocator allocator, final InputStream s } + /** + * Get the first byte with an EOF check; this is especially needed when using gRPC compression. + * InflaterInputStream needs another read to update reachEOF after all bytes have been read. + * + * @param is InputStream + * @return -1 if the stream has no more data, otherwise the actual first byte. + * @throws IOException if reading the first byte fails. + */ + private static int readRawVarint32WithEOFCheck(InputStream is) throws IOException { +int firstByte = is.read(); Review comment: I agree with @lidavidm; another scenario where zero is perfectly allowable is when no data is buffered but EOF hasn't been reached. @stczwd do you think you will be able to revise the PR to fix the underlying issue?
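The pattern described in the javadoc above is language-neutral: probe a single byte before varint decoding so that streams which only report EOF on the following read (such as InflaterInputStream) are detected cleanly. A rough C++ sketch of the same idea, assuming nothing from the PR beyond the described behavior:

```cpp
#include <cstdint>
#include <istream>
#include <optional>

// Probe the first byte before decoding: EOF here means the stream ended
// cleanly between messages; EOF later means a truncated varint.
std::optional<uint32_t> ReadRawVarint32WithEofCheck(std::istream& in) {
  int byte = in.get();
  if (byte == std::char_traits<char>::eof()) return std::nullopt;
  uint32_t result = 0;
  for (int shift = 0; shift < 35; shift += 7) {
    result |= static_cast<uint32_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) return result;  // continuation bit clear: done
    byte = in.get();
    if (byte == std::char_traits<char>::eof()) return std::nullopt;
  }
  return std::nullopt;  // malformed: more than five varint bytes
}
```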
[GitHub] [arrow] nealrichardson closed pull request #9741: ARROW-12005: [R] Fix a bash typo in configure
nealrichardson closed pull request #9741: URL: https://github.com/apache/arrow/pull/9741
[GitHub] [arrow] emkornfield commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4
emkornfield commented on pull request #8949: URL: https://github.com/apache/arrow/pull/8949#issuecomment-801585216 If you've already started ARROW-11899 then I'll let you finish it up; hopefully it isn't too much work. We are discussing on the ML the path forward for LZ4 in general; once that is cleared up we can figure out how to do the work, including whether @HedgehogCode is interested in contributing.
[GitHub] [arrow] cyb70289 commented on a change in pull request #9728: ARROW-10250: [C++][FlightRPC] Consistently use FlightClientOptions::Defaults
cyb70289 commented on a change in pull request #9728: URL: https://github.com/apache/arrow/pull/9728#discussion_r596522463 ## File path: cpp/src/arrow/flight/client.h ## @@ -94,8 +94,6 @@ class ARROW_FLIGHT_EXPORT FlightWriteSizeStatusDetail : public arrow::StatusDetail class ARROW_FLIGHT_EXPORT FlightClientOptions { public: - FlightClientOptions(); Review comment: I declared the default constructor as private and found that it is called implicitly at https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/client.cc#L1243, as the second argument `{}`. The default constructor now creates an invalid object, so it should not be called accidentally. Maybe add a constructor with explicit parameters?
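One way to realize the "constructor with explicit parameters" idea is to pair a private default constructor with the existing `Defaults()` factory. A sketch under that assumption; the single field shown is illustrative, not the real option set:

```cpp
class FlightClientOptions {
 public:
  // The sanctioned way to build options: always yields a valid baseline
  // that callers can then adjust field by field.
  static FlightClientOptions Defaults() {
    FlightClientOptions options;
    options.write_size_limit_bytes = 0;  // 0 = unlimited (illustrative)
    return options;
  }

  int64_t write_size_limit_bytes;

 private:
  // Private, so a stray `{}` argument can no longer construct an invalid
  // object implicitly; static members may still call it.
  FlightClientOptions() = default;
};

// Usage: auto options = FlightClientOptions::Defaults();
```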
[GitHub] [arrow] liyafan82 commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4
liyafan82 commented on pull request #8949: URL: https://github.com/apache/arrow/pull/8949#issuecomment-801583143 > +1 thank you. @liyafan82 did you have plans to work on the follow-up items or ZSTD? Otherwise I can take them up. > > @HedgehogCode any thoughts on how to proceed for LZ4? We can maybe discuss more on the performance JIRA? @emkornfield Thanks a lot for your effort. I started working on ARROW-11899 yesterday. If you are interested in any of the items (including ARROW-11899), please feel free to assign them to yourself. I'd like to help with the review/discussions :-)
[GitHub] [arrow] github-actions[bot] commented on pull request #9744: ARROW-12012: [Java][JDBC] Fix BinaryConsumer reallocation
github-actions[bot] commented on pull request #9744: URL: https://github.com/apache/arrow/pull/9744#issuecomment-801571749 https://issues.apache.org/jira/browse/ARROW-12012
[GitHub] [arrow] zxf opened a new pull request #9744: ARROW-12012: [Java][JDBC] Fix BinaryConsumer reallocation
zxf opened a new pull request #9744: URL: https://github.com/apache/arrow/pull/9744 [ARROW-12012](https://issues.apache.org/jira/browse/ARROW-12012) An exception is thrown when BinaryConsumer consumes a large value or a large number of values.
[GitHub] [arrow] github-actions[bot] commented on pull request #9743: first informative error msg for lz4 error
github-actions[bot] commented on pull request #9743: URL: https://github.com/apache/arrow/pull/9743#issuecomment-801510996 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also: * [Other pull requests](https://github.com/apache/arrow/pulls/) * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow] github-actions[bot] commented on pull request #9741: ARROW-12005: [R] Fix a bash typo in configure
github-actions[bot] commented on pull request #9741: URL: https://github.com/apache/arrow/pull/9741#issuecomment-801499897 Revision: e2c22c680aee40a3eadeb3ec583c560372734171 Submitted crossbow builds: [ursacomputing/crossbow @ actions-228](https://github.com/ursacomputing/crossbow/branches/all?query=actions-228)

| Task | Status |
|------|--------|
| test-r-install-local | [![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-228-github-test-r-install-local)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-228-github-test-r-install-local) |
[GitHub] [arrow] github-actions[bot] commented on pull request #9741: ARROW-12005: [R] Fix a bash typo in configure
github-actions[bot] commented on pull request #9741: URL: https://github.com/apache/arrow/pull/9741#issuecomment-801490891 https://issues.apache.org/jira/browse/ARROW-12005
[GitHub] [arrow] seddonm1 commented on pull request #9682: ARROW-7364: [Rust] Add cast options to cast kernel [WIP]
seddonm1 commented on pull request #9682: URL: https://github.com/apache/arrow/pull/9682#issuecomment-801488963 FYI https://github.com/ballista-compute/sqlparser-rs/pull/299 has been raised to add TRY_CAST to the parser.
[GitHub] [arrow] seddonm1 commented on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [Splitting to separate PRs]
seddonm1 commented on pull request #9243: URL: https://github.com/apache/arrow/pull/9243#issuecomment-801488843 Closed after splitting.
[GitHub] [arrow] seddonm1 closed pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [Splitting to separate PRs]
seddonm1 closed pull request #9243: URL: https://github.com/apache/arrow/pull/9243
[GitHub] [arrow] bkietz commented on pull request #9621: ARROW-11591: [C++][Compute] Grouped aggregation
bkietz commented on pull request #9621: URL: https://github.com/apache/arrow/pull/9621#issuecomment-801466302 @pitrou PTAL
[GitHub] [arrow] bkietz commented on a change in pull request #9621: ARROW-11591: [C++][Compute] Grouped aggregation
bkietz commented on a change in pull request #9621: URL: https://github.com/apache/arrow/pull/9621#discussion_r596407348 ## File path: cpp/src/arrow/compute/kernels/aggregate_basic.cc ## @@ -229,6 +604,710 @@ std::unique_ptr<KernelState> AllInit(KernelContext*, const KernelInitArgs& args) return ::arrow::internal::make_unique(); } +struct GroupByImpl : public ScalarAggregator { + using AddLengthImpl = std::function<void(const std::shared_ptr<ArrayData>&, int32_t*)>; + + struct GetAddLengthImpl { +static constexpr int32_t null_extra_byte = 1; + +static void AddFixedLength(int32_t fixed_length, int64_t num_repeats, + int32_t* lengths) { + for (int64_t i = 0; i < num_repeats; ++i) { +lengths[i] += fixed_length + null_extra_byte; + } +} + +static void AddVarLength(const std::shared_ptr<ArrayData>& data, int32_t* lengths) { + using offset_type = typename StringType::offset_type; + constexpr int32_t length_extra_bytes = sizeof(offset_type); + auto offset = data->offset; + const auto offsets = data->GetValues<offset_type>(1); + if (data->MayHaveNulls()) { +const uint8_t* nulls = data->buffers[0]->data(); + +for (int64_t i = 0; i < data->length; ++i) { + bool is_null = !BitUtil::GetBit(nulls, offset + i); + if (is_null) { +lengths[i] += null_extra_byte + length_extra_bytes; + } else { +lengths[i] += null_extra_byte + length_extra_bytes + offsets[offset + i + 1] - + offsets[offset + i]; + } +} + } else { +for (int64_t i = 0; i < data->length; ++i) { + lengths[i] += null_extra_byte + length_extra_bytes + offsets[offset + i + 1] - +offsets[offset + i]; +} + } +} + +template <typename T> +Status Visit(const T& input_type) { + int32_t num_bytes = (bit_width(input_type.id()) + 7) / 8; + add_length_impl = [num_bytes](const std::shared_ptr<ArrayData>& data, +int32_t* lengths) { +AddFixedLength(num_bytes, data->length, lengths); + }; + return Status::OK(); +} + +Status Visit(const StringType&) { + add_length_impl = [](const std::shared_ptr<ArrayData>& data, int32_t* lengths) { +AddVarLength(data, lengths); + }; + return Status::OK(); +} + +Status Visit(const BinaryType&) { + add_length_impl = [](const std::shared_ptr<ArrayData>& data, int32_t* lengths) { +AddVarLength(data, lengths); + }; + return Status::OK(); +} + +Status Visit(const FixedSizeBinaryType& type) { + int32_t num_bytes = type.byte_width(); + add_length_impl = [num_bytes](const std::shared_ptr<ArrayData>& data, +int32_t* lengths) { +AddFixedLength(num_bytes, data->length, lengths); + }; + return Status::OK(); +} + +AddLengthImpl add_length_impl; + }; + + using EncodeNextImpl = + std::function<void(const std::shared_ptr<ArrayData>&, uint8_t**)>; + + struct GetEncodeNextImpl { +template <int NumBits> +static void EncodeSmallFixed(const std::shared_ptr<ArrayData>& data, + uint8_t** encoded_bytes) { + auto raw_input = data->buffers[1]->data(); + auto offset = data->offset; + if (data->MayHaveNulls()) { +const uint8_t* nulls = data->buffers[0]->data(); +for (int64_t i = 0; i < data->length; ++i) { + auto& encoded_ptr = encoded_bytes[i]; + bool is_null = !BitUtil::GetBit(nulls, offset + i); + encoded_ptr[0] = is_null ? 1 : 0; + encoded_ptr += 1; + uint64_t null_multiplier = is_null ? 0 : 1; + if (NumBits == 1) { +encoded_ptr[0] = static_cast<uint8_t>( +null_multiplier * (BitUtil::GetBit(raw_input, offset + i) ? 1 : 0)); +encoded_ptr += 1; + } + if (NumBits == 8) { +encoded_ptr[0] = +static_cast<uint8_t>(null_multiplier * reinterpret_cast<const uint8_t*>( + raw_input)[offset + i]); +encoded_ptr += 1; + } + if (NumBits == 16) { +reinterpret_cast<uint16_t*>(encoded_ptr)[0] = +static_cast<uint16_t>(null_multiplier * reinterpret_cast<const uint16_t*>( +raw_input)[offset + i]); +encoded_ptr += 2; + } + if (NumBits == 32) { +reinterpret_cast<uint32_t*>(encoded_ptr)[0] = +static_cast<uint32_t>(null_multiplier * reinterpret_cast<const uint32_t*>( +raw_input)[offset + i]); +encoded_ptr += 4; + } + if (NumBits == 64) { +reinterpret_cast<uint64_t*>(encoded_ptr)[0] = +static_cast<uint64_t>(null_multiplier * reinterpret_cast<const uint64_t*>( +raw_input)[offset + i]); +encoded_ptr += 8; + } +} + } else { +for (int64_t i = 0; i < data->length; ++i)
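Distilled, the quoted `EncodeSmallFixed` loop writes each group-by key as a one-byte validity prefix followed by the fixed-width payload, with the payload forced to zero for nulls so that all null keys encode identically. A standalone sketch of that encoding step (the function name and `T` parameter are illustrative):

```cpp
#include <cstdint>
#include <cstring>

// Encode one nullable fixed-width value at *encoded_ptr and advance it.
// Layout: [1 validity byte][sizeof(T) payload bytes]; the payload is zeroed
// for nulls, which is what the branch-free null_multiplier trick achieves
// in the kernel above.
template <typename T>
void EncodeFixedWidthKey(bool is_null, T value, uint8_t** encoded_ptr) {
  uint8_t* out = *encoded_ptr;
  out[0] = is_null ? 1 : 0;
  T payload = is_null ? T{} : value;
  std::memcpy(out + 1, &payload, sizeof(T));
  *encoded_ptr = out + 1 + sizeof(T);
}
```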
[GitHub] [arrow] returnString commented on a change in pull request #9710: ARROW-11969: [Rust][DataFusion] Improve Examples in documentation
returnString commented on a change in pull request #9710: URL: https://github.com/apache/arrow/pull/9710#discussion_r596404956 ## File path: rust/datafusion/README.md ## @@ -58,6 +58,49 @@ Here are some of the projects known to use DataFusion: (if you know of another project, please submit a PR to add a link!) +## Example Usage + +Run a SQL query against data stored in a CSV: + +```rust + let mut ctx = ExecutionContext::new(); + ctx.register_csv("example", "tests/example.csv", CsvReadOptions::new())?; + + // Create a plan to run a SQL query + let df = ctx.sql("SELECT a, MIN(b) FROM example GROUP BY a LIMIT 100")?; + + // execute and print results + let results: Vec<RecordBatch> = df.collect().await?; + print_batches(&results)?; +``` + +Use the DataFrame API to process data stored in a CSV: + +```rust Review comment: Nitpicking: this might be a little bit fun with API churn, e.g. I believe the input expr ownership work you've recently opened would change these from slices to vecs and we don't have a way to catch that automatically like we do for the in-crate docs (am I right in thinking that `cargo test` runs all doctests?). Edit: to be clear, I don't think it's a reason to not do it, just curious if anyone has ideas for how to prevent doc drift :)
[GitHub] [arrow] returnString commented on a change in pull request #9704: ARROW-11964: [Rust][DataFusion] Extend constant folding and parquet filtering support
returnString commented on a change in pull request #9704: URL: https://github.com/apache/arrow/pull/9704#discussion_r596399314 ## File path: rust/datafusion/src/optimizer/constant_folding.rs ## @@ -188,6 +188,97 @@ impl<'a> ExprRewriter for ConstantRewriter<'a> { right, }, }, +Operator::Plus => match (left.as_ref(), right.as_ref()) { +( +Expr::Literal(ScalarValue::Float64(l)), +Expr::Literal(ScalarValue::Float64(r)), +) => match (l, r) { +(Some(l), Some(r)) => { +Expr::Literal(ScalarValue::Float64(Some(l + r))) +} +_ => Expr::Literal(ScalarValue::Float64(None)), +}, +( +Expr::Literal(ScalarValue::Int64(l)), +Expr::Literal(ScalarValue::Int64(r)), +) => match (l, r) { +(Some(l), Some(r)) => { +Expr::Literal(ScalarValue::Int64(Some(l + r))) +} +_ => Expr::Literal(ScalarValue::Int64(None)), +}, +( +Expr::Literal(ScalarValue::Int32(l)), +Expr::Literal(ScalarValue::Int32(r)), +) => match (l, r) { +(Some(l), Some(r)) => { +Expr::Literal(ScalarValue::Int32(Some(l + r))) +} +_ => Expr::Literal(ScalarValue::Int64(None)), Review comment: Would it be worth adding some kind of macro to simplify the maintenance of this section? I'm not sure if this is intentionally an Int64 or whether this is meant to be an Int32, and I wonder if we could have a macro that reduces these maths ops to just the ScalarValue variant and an operator (slight awkwardness: can't pass operators to macros).
[GitHub] [arrow] pachamaltese opened a new pull request #9743: first informative error msg for lz4 error
pachamaltese opened a new pull request #9743: URL: https://github.com/apache/arrow/pull/9743
[GitHub] [arrow] jonkeane commented on pull request #9741: ARROW-12005: [R] Fix a bash typo in configure
jonkeane commented on pull request #9741: URL: https://github.com/apache/arrow/pull/9741#issuecomment-801425608 @github-actions crossbow submit test-r-install-local
[GitHub] [arrow] github-actions[bot] commented on pull request #9740: ARROW-12003: [R] Fix NOTE re undefined global function group_by_drop_default
github-actions[bot] commented on pull request #9740: URL: https://github.com/apache/arrow/pull/9740#issuecomment-801401711 https://issues.apache.org/jira/browse/ARROW-12003
[GitHub] [arrow] nealrichardson closed pull request #9740: ARROW-12003: [R] Fix NOTE re undefined global function group_by_drop_default
nealrichardson closed pull request #9740: URL: https://github.com/apache/arrow/pull/9740
[GitHub] [arrow] pitrou commented on pull request #9742: [WIP] ARROW-11928: [C++] Execution engine API
pitrou commented on pull request #9742: URL: https://github.com/apache/arrow/pull/9742#issuecomment-801355489 @bkietz @wesm Here is an initial stab at the exec node API. Only base classes are present.
[GitHub] [arrow] pitrou opened a new pull request #9742: [WIP] ARROW-11928: [C++] Execution engine API
pitrou opened a new pull request #9742: URL: https://github.com/apache/arrow/pull/9742
[GitHub] [arrow] jonkeane opened a new pull request #9741: ARROW-12005: [R] Fix a bash typo in configure
jonkeane opened a new pull request #9741: URL: https://github.com/apache/arrow/pull/9741 Without the quotes, we get ` parse error: condition expected: !=`
[GitHub] [arrow] emkornfield closed pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4
emkornfield closed pull request #8949: URL: https://github.com/apache/arrow/pull/8949
[GitHub] [arrow] emkornfield commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4
emkornfield commented on pull request #8949: URL: https://github.com/apache/arrow/pull/8949#issuecomment-801332194 +1, thank you. @liyafan82 did you have plans to work on the follow-up items or ZSTD? Otherwise I can take them up. @hedgehogcode any thoughts on how to proceed for LZ4? We can maybe discuss more on the performance JIRA?
[GitHub] [arrow] emkornfield closed pull request #9421: ARROW-11066: [FlightRPC][Java] Make zero-copy writes a configurable option
emkornfield closed pull request #9421: URL: https://github.com/apache/arrow/pull/9421
[GitHub] [arrow] emkornfield commented on pull request #9421: ARROW-11066: [FlightRPC][Java] Make zero-copy writes a configurable option
emkornfield commented on pull request #9421: URL: https://github.com/apache/arrow/pull/9421#issuecomment-801327752 +1 merging.
[GitHub] [arrow] emkornfield commented on a change in pull request #9671: ARROW-7905: [Go][Parquet] Initial Chunk of Parquet port to Go
emkornfield commented on a change in pull request #9671: URL: https://github.com/apache/arrow/pull/9671#discussion_r596278550 ## File path: go/parquet/LICENSE.txt ## @@ -0,0 +1,1987 @@ + Review comment: yeah, looking back it seems like we needed to copy to make this usable. I think there is an open item for Go to support symlinks.
[GitHub] [arrow] pitrou commented on pull request #9629: ARROW-11838: [C++] Support IPC reads with shared dictionaries.
pitrou commented on pull request #9629: URL: https://github.com/apache/arrow/pull/9629#issuecomment-801280256 You need to update the git submodule as part of this PR.
[GitHub] [arrow] zeroshade commented on pull request #9671: ARROW-7905: [Go][Parquet] Initial Chunk of Parquet port to Go
zeroshade commented on pull request #9671: URL: https://github.com/apache/arrow/pull/9671#issuecomment-801275287 rebased branch with master, hopefully that fixes the failing check :)
[GitHub] [arrow] sbinet commented on a change in pull request #9671: ARROW-7905: [Go][Parquet] Initial Chunk of Parquet port to Go
sbinet commented on a change in pull request #9671: URL: https://github.com/apache/arrow/pull/9671#discussion_r596236416 ## File path: go/parquet/internal/bmi/_lib/bitmap_bmi2.s ## @@ -0,0 +1,174 @@ + .text Review comment: ok. fine by me.
[GitHub] [arrow] zeroshade commented on a change in pull request #9671: ARROW-7905: [Go][Parquet] Initial Chunk of Parquet port to Go
zeroshade commented on a change in pull request #9671: URL: https://github.com/apache/arrow/pull/9671#discussion_r596233932 ## File path: go/parquet/internal/bmi/_lib/bitmap_bmi2.s ## @@ -0,0 +1,174 @@ + .text Review comment: Well, the assembly in the _lib directory technically doesn't have to exist at build time / run time. It's only needed for generating the .s files in the bmi directory itself via c2goasm. So you'd still have a seamless `go get` installation workflow and wouldn't need to regenerate the _lib .s files in order to use this package. That all being said, I'd rather keep all the files checked in anyways.
[GitHub] [arrow] jmgpeeters commented on pull request #9629: ARROW-11838: [C++] Support IPC reads with shared dictionaries.
jmgpeeters commented on pull request #9629: URL: https://github.com/apache/arrow/pull/9629#issuecomment-801261388 Hm, for some reason the GitHub integration test checked out an older version of the arrow-testing data: Run ci/scripts/util_checkout.sh Submodule 'cpp/submodules/parquet-testing' (https://github.com/apache/parquet-testing.git) registered for path 'cpp/submodules/parquet-testing' Submodule 'testing' (https://github.com/apache/arrow-testing) registered for path 'testing' Cloning into '/home/runner/work/arrow/arrow/cpp/submodules/parquet-testing'... Cloning into '/home/runner/work/arrow/arrow/testing'... Submodule path 'cpp/submodules/parquet-testing': checked out '8e7badc6a3817a02e06d17b5d8ab6b6dc356e890' From https://github.com/apache/arrow-testing * branch e8ce32338f2dfeca3a5126f7677bdee159604000 -> FETCH_HEAD Submodule path 'testing': checked out 'e8ce32338f2dfeca3a5126f7677bdee159604000' which corresponds to one of yours: https://github.com/apache/arrow-testing/commit/e8ce32338f2dfeca3a5126f7677bdee159604000 Unless this is immediately obvious to you (the script seems to work fine for me locally), I'll take a look tomorrow.
[GitHub] [arrow] sbinet commented on a change in pull request #9671: ARROW-7905: [Go][Parquet] Initial Chunk of Parquet port to Go
sbinet commented on a change in pull request #9671: URL: https://github.com/apache/arrow/pull/9671#discussion_r596219612 ## File path: go/parquet/internal/bmi/_lib/bitmap_bmi2.s ## @@ -0,0 +1,174 @@ + .text Review comment: well, removing those would require people to have all the tooling installed to regenerate them before being able to use go-parquet, right? I'd rather keep checked in all the files needed to preserve the usual seamless "go get" installation workflow.
[GitHub] [arrow] sbinet commented on a change in pull request #9671: ARROW-7905: [Go][Parquet] Initial Chunk of Parquet port to Go
sbinet commented on a change in pull request #9671: URL: https://github.com/apache/arrow/pull/9671#discussion_r596217949 ## File path: go/parquet/internal/bmi/_lib/bitmap_bmi2.c ## @@ -0,0 +1,38 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include <stdint.h> +#include <immintrin.h> + +void extract_bits(uint64_t bitmap, uint64_t select_bitmap, uint64_t* res) { + *res = _pext_u64(bitmap, select_bitmap); +} + +void popcount64(uint64_t bitmap, uint64_t* res) { Review comment: giving back to Caesar what's Caesar's... (a.k.a @stuartcarnie in that instance) and where credits are due... https://www.influxdata.com/blog/influxdata-apache-arrow-go-implementation/
[GitHub] [arrow] mm0708 commented on issue #8099: Specifying schema when using write_feather?
mm0708 commented on issue #8099: URL: https://github.com/apache/arrow/issues/8099#issuecomment-801249604 I dropped the column causing issues because it wasn't essential
[GitHub] [arrow] kdkavanagh commented on issue #8099: Specifying schema when using write_feather?
kdkavanagh commented on issue #8099: URL: https://github.com/apache/arrow/issues/8099#issuecomment-801240233 @mm0708 what did you conclude here? I have this same question/problem
[GitHub] [arrow] pitrou closed pull request #9738: ARROW-11997: [Python] concat_tables crashes python interpreter
pitrou closed pull request #9738: URL: https://github.com/apache/arrow/pull/9738
[GitHub] [arrow] pitrou commented on pull request #9738: ARROW-11997: [Python] concat_tables crashes python interpreter
pitrou commented on pull request #9738: URL: https://github.com/apache/arrow/pull/9738#issuecomment-801225326 @dianaclarke You may want to enable Travis-CI on your Arrow fork for faster CI results. (in this case this doesn't matter, since Travis-CI doesn't run Python tests)
[GitHub] [arrow] pitrou commented on pull request #9738: ARROW-11997: [Python] concat_tables crashes python interpreter
pitrou commented on pull request #9738: URL: https://github.com/apache/arrow/pull/9738#issuecomment-801224615 Green CI run at https://github.com/dianaclarke/arrow/runs/2131383253
[GitHub] [arrow] ianmcook opened a new pull request #9740: ARROW-12003: [R] Fix NOTE re undefined global function group_by_drop_default
ianmcook opened a new pull request #9740: URL: https://github.com/apache/arrow/pull/9740
[GitHub] [arrow] github-actions[bot] commented on pull request #9739: ARROW-12000: [Documentation] Add note about deviation from style guide on struct/classes
github-actions[bot] commented on pull request #9739: URL: https://github.com/apache/arrow/pull/9739#issuecomment-801195182 https://issues.apache.org/jira/browse/ARROW-12000
[GitHub] [arrow] nealrichardson closed pull request #9733: ARROW-11996: [R] Make r/configure run successfully on Solaris
nealrichardson closed pull request #9733: URL: https://github.com/apache/arrow/pull/9733
[GitHub] [arrow] lidavidm commented on a change in pull request #9725: ARROW-8631: [C++][Python][Dataset] Add ReadOptions to CsvFileFormat, expose options to python
lidavidm commented on a change in pull request #9725: URL: https://github.com/apache/arrow/pull/9725#discussion_r596093793 ## File path: cpp/src/arrow/dataset/file_csv.h ## @@ -37,6 +37,11 @@ class ARROW_DS_EXPORT CsvFileFormat : public FileFormat { public: /// Options affecting the parsing of CSV files csv::ParseOptions parse_options = csv::ParseOptions::Defaults(); + /// Options affecting how CSV files are read. + /// + /// Note use_threads is ignored (it is always considered false) and block_size + /// should be set on CsvFragmentScanOptions. + csv::ReadOptions read_options = csv::ReadOptions::Defaults(); Review comment: Makes sense. I was going back and forth on it because of the redundancy. I'll inline all relevant fields and xref ReadOptions for documentation.
[GitHub] [arrow] bkietz commented on a change in pull request #9725: ARROW-8631: [C++][Python][Dataset] Add ReadOptions to CsvFileFormat, expose options to python
bkietz commented on a change in pull request #9725: URL: https://github.com/apache/arrow/pull/9725#discussion_r596084650 ## File path: cpp/src/arrow/dataset/dataset_internal.h ## @@ -185,5 +185,14 @@ inline bool operator==(const SubtreeImpl::Encoded& l, const SubtreeImpl::Encoded& r) { l.partition_expression == r.partition_expression; } +template <typename T> +std::shared_ptr<T> DowncastFragmentScanOptions( +const std::shared_ptr<ScanOptions>& scan_options, const std::string& type_name) { + if (!scan_options) return nullptr; + if (!scan_options->fragment_scan_options) return nullptr; + if (scan_options->fragment_scan_options->type_name() != type_name) return nullptr; Review comment: I think this should be an error: ```suggestion if (scan_options->fragment_scan_options->type_name() != type_name) { return Status::Invalid("FragmentScanOptions of type ", scan_options->fragment_scan_options->type_name(), " were provided for scanning a fragment of type ", type_name); } ```
[GitHub] [arrow] bkietz commented on a change in pull request #9725: ARROW-8631: [C++][Python][Dataset] Add ReadOptions to CsvFileFormat, expose options to python
bkietz commented on a change in pull request #9725: URL: https://github.com/apache/arrow/pull/9725#discussion_r596092120 ## File path: cpp/src/arrow/dataset/file_csv.h ## @@ -37,6 +37,11 @@ class ARROW_DS_EXPORT CsvFileFormat : public FileFormat { public: /// Options affecting the parsing of CSV files csv::ParseOptions parse_options = csv::ParseOptions::Defaults(); + /// Options affecting how CSV files are read. + /// + /// Note use_threads is ignored (it is always considered false) and block_size + /// should be set on CsvFragmentScanOptions. + csv::ReadOptions read_options = csv::ReadOptions::Defaults(); Review comment: I'm not sure it's reasonable to add `ReadOptions` here. I think only `skip_rows` and `autogenerate_column_names` are meaningful here. Column renaming can be accomplished in projection if desired (obviating `column_names`) and as noted `use_threads` and `block_size` are configured elsewhere.
[GitHub] [arrow] lidavidm commented on a change in pull request #9715: ARROW-11745: [C++] WIP: Add helper to generate random record batches by schema
lidavidm commented on a change in pull request #9715: URL: https://github.com/apache/arrow/pull/9715#discussion_r596088841 ## File path: cpp/src/arrow/testing/random.cc ## @@ -558,5 +584,248 @@ std::shared_ptr<Array> RandomArrayGenerator::ArrayOf(std::shared_ptr<DataType> t return RandomArrayGeneratorOfImpl{this, type, size, null_probability, nullptr}.Finish(); } +namespace { +template <typename T> +typename T::c_type GetMetadata(const KeyValueMetadata* metadata, const std::string& key, + typename T::c_type default_value) { Review comment: Nice, I admit I'm not as up to speed on my template metaprogramming as I'd like to be.
[GitHub] [arrow] lidavidm commented on a change in pull request #9715: ARROW-11745: [C++] Add helper to generate random record batches by schema
lidavidm commented on a change in pull request #9715: URL: https://github.com/apache/arrow/pull/9715#discussion_r596087724 ## File path: cpp/src/arrow/testing/random.cc ## @@ -558,5 +584,248 @@ std::shared_ptr<Array> RandomArrayGenerator::ArrayOf(std::shared_ptr<DataType> t return RandomArrayGeneratorOfImpl{this, type, size, null_probability, nullptr}.Finish(); } +namespace { +template <typename T> +typename T::c_type GetMetadata(const KeyValueMetadata* metadata, const std::string& key, + typename T::c_type default_value) { + if (!metadata) return default_value; + const auto index = metadata->FindKey(key); + if (index < 0) return default_value; + const auto& value = metadata->value(index); + typename T::c_type output{}; + auto type = checked_pointer_cast<T>(TypeTraits<T>::type_singleton()); + if (!internal::ParseValue(*type, value.data(), value.length(), &output)) { +ABORT_NOT_OK(Status::Invalid("Could not parse ", key, " = ", value)); + } + return output; +} + +Result<std::shared_ptr<Array>> GenerateArray(const Field& field, int64_t length, + RandomArrayGenerator* generator) { +#define GENERATE_INTEGRAL_CASE_VIEW(BASE_TYPE, VIEW_TYPE) \ + case VIEW_TYPE::type_id: { \ +const BASE_TYPE::c_type min_value = GetMetadata<BASE_TYPE>( \ +field.metadata().get(), "min", std::numeric_limits<BASE_TYPE::c_type>::min()); \ +const BASE_TYPE::c_type max_value = GetMetadata<BASE_TYPE>( \ +field.metadata().get(), "max", std::numeric_limits<BASE_TYPE::c_type>::max()); \ +return generator->Numeric<BASE_TYPE>(length, min_value, max_value, null_probability) \ +->View(field.type()); \ + } +#define GENERATE_INTEGRAL_CASE(ARROW_TYPE) \ + GENERATE_INTEGRAL_CASE_VIEW(ARROW_TYPE, ARROW_TYPE) +#define GENERATE_FLOATING_CASE(ARROW_TYPE, GENERATOR_FUNC) \ + case ARROW_TYPE::type_id: { \ +const ARROW_TYPE::c_type min_value = GetMetadata<ARROW_TYPE>( \ +field.metadata().get(), "min", std::numeric_limits<ARROW_TYPE::c_type>::min()); \ +const ARROW_TYPE::c_type max_value = GetMetadata<ARROW_TYPE>( \ +field.metadata().get(), "max", std::numeric_limits<ARROW_TYPE::c_type>::max()); \ +const double nan_probability = \ +GetMetadata<DoubleType>(field.metadata().get(), "nan_probability", 0); \ +return generator->GENERATOR_FUNC(length, min_value, max_value, null_probability,\ + nan_probability); \ + } + + const double null_probability = + field.nullable() + ? GetMetadata<DoubleType>(field.metadata().get(), "null_probability", 0.01) + : 0.0; + switch (field.type()->id()) { +case Type::type::NA: + return std::make_shared<NullArray>(length); + +case Type::type::BOOL: { + const double true_probability = + GetMetadata<DoubleType>(field.metadata().get(), "true_probability", 0.5); + return generator->Boolean(length, true_probability, null_probability); +} + + GENERATE_INTEGRAL_CASE(UInt8Type); + GENERATE_INTEGRAL_CASE(Int8Type); + GENERATE_INTEGRAL_CASE(UInt16Type); + GENERATE_INTEGRAL_CASE(Int16Type); + GENERATE_INTEGRAL_CASE(UInt32Type); + GENERATE_INTEGRAL_CASE(Int32Type); + GENERATE_INTEGRAL_CASE(UInt64Type); + GENERATE_INTEGRAL_CASE(Int64Type); + GENERATE_INTEGRAL_CASE_VIEW(Int16Type, HalfFloatType); + GENERATE_FLOATING_CASE(FloatType, Float32); + GENERATE_FLOATING_CASE(DoubleType, Float64); + +case Type::type::STRING: +case Type::type::BINARY: { + const int32_t min_length = GetMetadata<Int32Type>(field.metadata().get(), "min", 0); + const int32_t max_length = + GetMetadata<Int32Type>(field.metadata().get(), "max", 1024); + const int32_t unique_values = + GetMetadata<Int32Type>(field.metadata().get(), "unique", -1); + if (unique_values > 0) { +return generator +->StringWithRepeats(length, unique_values, min_length, max_length, +null_probability) +->View(field.type()); + } + return generator->String(length, min_length, max_length, null_probability) + ->View(field.type()); +} + +case Type::type::DECIMAL128: +case Type::type::DECIMAL256: +case Type::type::FIXED_SIZE_BINARY: { + auto byte_width = + internal::checked_pointer_cast<FixedSizeBinaryType>(field.type())->byte_width(); + return generator->FixedSizeBinary(length, byte_width, null_probability) + ->View(field.type()); +} + + GENERATE_INTEGRAL_CASE_VIEW(Int32Type, Date32Type); + GENERATE_INTEGRAL_CASE_VIEW(Int64Type, Date64Type); + GENERATE_INTEGRAL_CASE_VIEW(Int64Type, TimestampType); + GENERATE_INTE
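From the quoted patch, generation is steered by per-field `KeyValueMetadata` keys such as "min", "max", "null_probability", and "unique". A usage sketch, assuming the `Generate(fields, size, seed)` entry point quoted from random.h later in this thread lives in `arrow::random`:

```cpp
#include "arrow/testing/random.h"
#include "arrow/type.h"

// "score" becomes an int32 column drawn from [0, 100] with ~25% nulls.
auto metadata = arrow::key_value_metadata({"min", "max", "null_probability"},
                                          {"0", "100", "0.25"});
auto score = arrow::field("score", arrow::int32(), /*nullable=*/true, metadata);
auto batch = arrow::random::Generate({score}, /*size=*/1024, /*seed=*/42);
```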
[GitHub] [arrow] pitrou closed pull request #8023: ARROW-9318: [C++] Parquet encryption key management
pitrou closed pull request #8023: URL: https://github.com/apache/arrow/pull/8023
[GitHub] [arrow] pitrou commented on pull request #8023: ARROW-9318: [C++] Parquet encryption key management
pitrou commented on pull request #8023: URL: https://github.com/apache/arrow/pull/8023#issuecomment-801144014 I'll merge then. Thank you @thamht4190 and @ggershinsky for contributing this!
[GitHub] [arrow] lidavidm commented on a change in pull request #9715: ARROW-11745: [C++] Add helper to generate random record batches by schema
lidavidm commented on a change in pull request #9715: URL: https://github.com/apache/arrow/pull/9715#discussion_r596086831

## File path: cpp/src/arrow/testing/random.cc

@@ -558,5 +584,248 @@ std::shared_ptr<Array> RandomArrayGenerator::ArrayOf(std::shared_ptr<DataType> t

(same GenerateArray hunk as quoted above; it ends with:)

+  const double null_probability =
+      field.nullable()
+          ? GetMetadata<DoubleType>(field.metadata().get(), "null_probability", 0.01)
+          : 0.0;
+  switch (field.type()->id()) {
+    case Type::type::NA:

Review comment: Yeah, I will add more validation for these things - I also noticed that a null_probability outside [0, 1] on MSVC leads to runtime errors (which do not appear to apply to GCC).

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
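A minimal sketch of the kind of range check being promised here; the helper name and its placement are hypothetical, not the actual follow-up patch:

```c++
#include "arrow/status.h"
#include "arrow/testing/gtest_util.h"  // for ABORT_NOT_OK

// Fail loudly on a metadata-supplied probability outside [0, 1], instead of
// letting it reach the random distribution and trigger platform-dependent
// behavior (e.g. the MSVC runtime errors mentioned above).
double ValidateProbability(double p, const char* key) {
  if (p < 0.0 || p > 1.0) {
    ABORT_NOT_OK(arrow::Status::Invalid(key, " must be in [0, 1], got ", p));
  }
  return p;
}
```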
[GitHub] [arrow] lidavidm commented on a change in pull request #9715: ARROW-11745: [C++] Add helper to generate random record batches by schema
lidavidm commented on a change in pull request #9715: URL: https://github.com/apache/arrow/pull/9715#discussion_r596085481

## File path: cpp/src/arrow/testing/random.h

@@ -358,6 +362,10 @@ class ARROW_TESTING_EXPORT RandomArrayGenerator {
   std::default_random_engine seed_rng_;
 };
+ARROW_TESTING_EXPORT
+std::shared_ptr<RecordBatch> Generate(const FieldVector& fields, int64_t size,
+                                      SeedType seed);

Review comment: I am not particularly for magic like that. But maybe we could make Generate{Batch,Table,Array,...} methods of Generator to allow them to share a seed?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #9730: ARROW-9878: [Python] Document caveats of to_pandas(self_destruct=True)
pitrou commented on a change in pull request #9730: URL: https://github.com/apache/arrow/pull/9730#discussion_r596085147 ## File path: docs/source/python/pandas.rst ## @@ -293,3 +293,19 @@ Used together, the call will yield significantly lower memory usage in some scenarios. Without these options, ``to_pandas`` will always double memory. + +Note that ``self_destruct=True`` is not guaranteed to save memory. Since the +conversion happens column by column, memory is also freed column by column. But +if multiple columns share an underlying allocation, then no memory will be Review comment: Ah, you're right, thanks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm commented on a change in pull request #9715: ARROW-11745: [C++] Add helper to generate random record batches by schema
lidavidm commented on a change in pull request #9715: URL: https://github.com/apache/arrow/pull/9715#discussion_r596084830

## File path: cpp/src/arrow/testing/random.h

@@ -358,6 +362,10 @@ class ARROW_TESTING_EXPORT RandomArrayGenerator {
   std::default_random_engine seed_rng_;
 };
+ARROW_TESTING_EXPORT
+std::shared_ptr<RecordBatch> Generate(const FieldVector& fields, int64_t size,

Review comment: Might it be better to just allow passing an explicit schema instead?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm commented on a change in pull request #9730: ARROW-9878: [Python] Document caveats of to_pandas(self_destruct=True)
lidavidm commented on a change in pull request #9730: URL: https://github.com/apache/arrow/pull/9730#discussion_r596083089 ## File path: docs/source/python/pandas.rst ## @@ -293,3 +293,19 @@ Used together, the call will yield significantly lower memory usage in some scenarios. Without these options, ``to_pandas`` will always double memory. + +Note that ``self_destruct=True`` is not guaranteed to save memory. Since the +conversion happens column by column, memory is also freed column by column. But +if multiple columns share an underlying allocation, then no memory will be Review comment: Yes, multiple arrays may share the same buffer. In IPC for instance, we read a record batch's worth of data from the file at a time, and hence all arrays in that batch share the same buffer. In Flight, similarly, we receive a record batch's worth of data from gRPC and (for implementation reasons) concatenate it into a single buffer, so we end up in the same situation. (I don't think this applies to, say, Parquet or CSV, since there's actual decoding for those formats, but haven't tested it.) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm commented on a change in pull request #9730: ARROW-9878: [Python] Document caveats of to_pandas(self_destruct=True)
lidavidm commented on a change in pull request #9730: URL: https://github.com/apache/arrow/pull/9730#discussion_r596081637 ## File path: docs/source/python/pandas.rst ## @@ -293,3 +293,19 @@ Used together, the call will yield significantly lower memory usage in some scenarios. Without these options, ``to_pandas`` will always double memory. + +Note that ``self_destruct=True`` is not guaranteed to save memory. Since the +conversion happens column by column, memory is also freed column by column. But +if multiple columns share an underlying allocation, then no memory will be +freed until all of those columns are converted. In particular, data that comes +from IPC or Flight is prone to this, as memory will be laid out as follows:: + + Record Batch 0: Allocation 0: array 0 chunk 0, array 1 chunk 0, ... + Record Batch 1: Allocation 1: array 0 chunk 1, array 1 chunk 1, ... + ... + +In this case, no memory can be freed until the entire table is converted, even +with ``self_destruct=True``. + +Additionally, even if memory is freed by Arrow, depending on the allocator in +use, the memory may not be returned to the operating system immediately. Review comment: I'll remove this, but I do feel this is a commonly misunderstood point. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
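To make the sharing concrete, here is a hedged C++ sketch that reads one batch from an IPC stream and prints each column's parent allocation; `Buffer::parent()` is one way to observe the slicing, and exact buffer layout varies by type and version:

```c++
#include <iostream>

#include "arrow/api.h"
#include "arrow/io/memory.h"
#include "arrow/ipc/reader.h"

// Sketch only: `stream` is assumed to already hold a serialized IPC stream.
arrow::Status ShowSharedAllocation(std::shared_ptr<arrow::Buffer> stream) {
  ARROW_ASSIGN_OR_RAISE(
      auto reader, arrow::ipc::RecordBatchStreamReader::Open(
                       std::make_shared<arrow::io::BufferReader>(stream)));
  std::shared_ptr<arrow::RecordBatch> batch;
  ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
  for (const auto& column : batch->columns()) {
    const auto& buffers = column->data()->buffers;
    // For fixed-width types, buffers[1] holds the values; after an IPC read
    // it is typically a slice whose parent is the whole message body, so the
    // parent pointer printed here tends to be the same for every column.
    if (buffers.size() > 1 && buffers[1] != nullptr) {
      std::cout << column->type()->ToString() << ": parent buffer at "
                << static_cast<const void*>(buffers[1]->parent().get())
                << std::endl;
    }
  }
  return arrow::Status::OK();
}
```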
[GitHub] [arrow] bkietz commented on a change in pull request #9715: ARROW-11745: [C++] Add helper to generate random record batches by schema
bkietz commented on a change in pull request #9715: URL: https://github.com/apache/arrow/pull/9715#discussion_r596029757

## File path: cpp/src/arrow/testing/random.h

@@ -358,6 +362,10 @@ class ARROW_TESTING_EXPORT RandomArrayGenerator {
   std::default_random_engine seed_rng_;
 };
+ARROW_TESTING_EXPORT
+std::shared_ptr<RecordBatch> Generate(const FieldVector& fields, int64_t size,
+                                      SeedType seed);

Review comment: What do you think about replacing explicit seed arguments (which are usually repeated within a suite) with

```c++
/// The seed used for random generation. Non const to support customizing the
/// seed for a suite
static SeedType kSeed = 0xDEADBEEF;
```

or

```c++
/// Sets the seed used for random generation while it is in scope.
/// May be constructed at static scope to set the seed for an entire
/// test/benchmark suite. If no SeedGuard is in scope, the seed will
/// be kDefaultSeed
struct SeedGuard {
  explicit SeedGuard(SeedType seed) { PushSeed(seed); }
  ~SeedGuard() { PopSeed(); }

  static constexpr SeedType kDefaultSeed = 0xDEADBEEF;

  static void PopSeed();
  static void PushSeed(SeedType);
};
```

?

## File path: cpp/src/arrow/testing/random.cc

@@ -369,12 +371,16 @@ std::shared_ptr<Array> RandomArrayGenerator::FixedSizeBinary(int64_t size,
                                     std::move(null_bitmap), null_count);
 }
-std::shared_ptr<Array> RandomArrayGenerator::Offsets(int64_t size, int32_t first_offset,
-                                                     int32_t last_offset,
-                                                     double null_probability,
-                                                     bool force_empty_nulls) {
-  using GenOpt = GenerateOptions<int32_t, std::uniform_int_distribution<int32_t>>;
-  GenOpt options(seed(), first_offset, last_offset, null_probability);
+namespace {
+template

Review comment: When I read this first, I was looking for `typename ArrayType::TypeClass::offset_type` because I assumed this parameter referred to `StringArray` or so

```suggestion
template
```

## File path: cpp/src/arrow/testing/random_test.cc

@@ -0,0 +1,195 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <gtest/gtest.h>
+
+#include "arrow/array.h"
+#include "arrow/record_batch.h"
+#include "arrow/testing/gtest_util.h"
+#include "arrow/testing/random.h"
+#include "arrow/type.h"
+#include "arrow/util/key_value_metadata.h"
+
+namespace arrow {
+namespace random {
+
+class RandomArrayTest : public ::testing::TestWithParam<std::shared_ptr<Field>> {
+ protected:
+  std::shared_ptr<Field> GetField() { return GetParam(); }
+};
+
+template <typename T>
+class RandomNumericArrayTest : public ::testing::Test {
+ protected:
+  std::shared_ptr<Field> GetField() { return field("field0", std::make_shared<T>()); }
+
+  std::shared_ptr<NumericArray<T>> Downcast(std::shared_ptr<Array> array) {
+    return internal::checked_pointer_cast<NumericArray<T>>(array);
+  }
+};
+
+TEST_P(RandomArrayTest, GenerateArray) {
+  auto field = GetField();
+  auto batch = Generate({field}, 128, 0xDEADBEEF);
+  AssertSchemaEqual(schema({field}), batch->schema());
+  auto array = batch->column(0);
+  ASSERT_EQ(128, array->length());
+  ASSERT_OK(array->ValidateFull());
+}
+
+TEST_P(RandomArrayTest, GenerateNonNullArray) {
+  auto field =
+      GetField()->WithMetadata(key_value_metadata({{"null_probability", "0.0"}}));
+  if (field->type()->id() == Type::type::NA) {
+    GTEST_SKIP() << "Cannot generate non-null null arrays";
+  }
+  auto batch = Generate({field}, 128, 0xDEADBEEF);
+  AssertSchemaEqual(schema({field}), batch->schema());
+  auto array = batch->column(0);
+  ASSERT_OK(array->ValidateFull());
+  ASSERT_EQ(0, array->null_count());
+}
+
+TEST_P(RandomArrayTest, GenerateNonNullableArray) {
+  auto field = GetField()->WithNullable(false);
+  if (field->type()->id() == Type::type::NA) {
+    GTEST_SKIP() << "Cannot generate non-null null arrays";
+  }
+  auto batch = Generate({field}, 128, 0xDEADBEEF);
+  AssertSchemaEqual(schema({field}), batch->schema());
+  auto array = batch->column(0);
+  ASSERT_OK(array->ValidateFull());
+  ASSERT_EQ(0, array->null_count());
+}
+
+struct FieldParamName
[GitHub] [arrow] pitrou commented on a change in pull request #8468: ARROW-10306: [C++] Add string replacement kernel
pitrou commented on a change in pull request #8468: URL: https://github.com/apache/arrow/pull/8468#discussion_r596070474

## File path: cpp/src/arrow/compute/kernels/scalar_string.cc

@@ -1194,6 +1198,197 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
+// ----------------------------------------------------------------------
+// replace substring
+
+template <typename Type, typename Derived>
+struct ReplaceSubStringBase {
+  using ArrayType = typename TypeTraits<Type>::ArrayType;
+  using ScalarType = typename TypeTraits<Type>::ScalarType;
+  using BuilderType = typename TypeTraits<Type>::BuilderType;
+  using offset_type = typename Type::offset_type;
+  using ValueDataBuilder = TypedBufferBuilder<uint8_t>;
+  using OffsetBuilder = TypedBufferBuilder<offset_type>;
+  using State = OptionsWrapper<ReplaceSubstringOptions>;
+
+  static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    Derived derived(ctx, State::Get(ctx));
+    if (ctx->status().ok()) {
+      derived.Replace(ctx, batch, out);
+    }
+  }
+  void Replace(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    std::shared_ptr<ValueDataBuilder> value_data_builder =
+        std::make_shared<ValueDataBuilder>();
+    std::shared_ptr<OffsetBuilder> offset_builder = std::make_shared<OffsetBuilder>();
+
+    if (batch[0].kind() == Datum::ARRAY) {
+      // We already know how many strings we have, so we can use Reserve/UnsafeAppend
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Reserve(batch[0].array()->length));
+
+      const ArrayData& input = *batch[0].array();
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Append(0));  // offsets start at 0
+      KERNEL_RETURN_IF_ERROR(
+          ctx, VisitArrayDataInline<Type>(
+                   input,
+                   [&](util::string_view s) {
+                     RETURN_NOT_OK(static_cast<Derived&>(*this).ReplaceString(
+                         s, value_data_builder.get()));
+                     offset_builder->UnsafeAppend(
+                         static_cast<offset_type>(value_data_builder->length()));
+                     return Status::OK();
+                   },
+                   [&]() {
+                     // offset for null value
+                     offset_builder->UnsafeAppend(
+                         static_cast<offset_type>(value_data_builder->length()));
+                     return Status::OK();
+                   }));
+      ArrayData* output = out->mutable_array();
+      KERNEL_RETURN_IF_ERROR(ctx, value_data_builder->Finish(&output->buffers[2]));
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Finish(&output->buffers[1]));
+    } else {
+      const auto& input = checked_cast<const ScalarType&>(*batch[0].scalar());
+      auto result = std::make_shared<ScalarType>();
+      if (input.is_valid) {
+        util::string_view s = static_cast<util::string_view>(*input.value);
+        KERNEL_RETURN_IF_ERROR(
+            ctx, static_cast<Derived&>(*this).ReplaceString(s, value_data_builder.get()));
+        KERNEL_RETURN_IF_ERROR(ctx, value_data_builder->Finish(&result->value));
+        result->is_valid = true;
+      }
+      out->value = result;
+    }
+  }
+};
+
+template <typename Type>
+struct ReplaceSubString : ReplaceSubStringBase<Type, ReplaceSubString<Type>> {

Review comment: Well, you don't need to refactor other kernels for now, but I suppose this one could easily be adapted, no? :-)

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou closed pull request #9720: ARROW-11976: [C++] Fix sporadic TSAN error with GatingTask
pitrou closed pull request #9720: URL: https://github.com/apache/arrow/pull/9720 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #9730: ARROW-9878: [Python] Document caveats of to_pandas(self_destruct=True)
pitrou commented on a change in pull request #9730: URL: https://github.com/apache/arrow/pull/9730#discussion_r596063732 ## File path: docs/source/python/pandas.rst ## @@ -293,3 +293,19 @@ Used together, the call will yield significantly lower memory usage in some scenarios. Without these options, ``to_pandas`` will always double memory. + +Note that ``self_destruct=True`` is not guaranteed to save memory. Since the +conversion happens column by column, memory is also freed column by column. But +if multiple columns share an underlying allocation, then no memory will be Review comment: What does "share" mean here? Do you mean different arrays share the same underlying buffer? How does that happen? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #9730: ARROW-9878: [Python] Document caveats of to_pandas(self_destruct=True)
pitrou commented on a change in pull request #9730: URL: https://github.com/apache/arrow/pull/9730#discussion_r596063175 ## File path: docs/source/python/pandas.rst ## @@ -293,3 +293,19 @@ Used together, the call will yield significantly lower memory usage in some scenarios. Without these options, ``to_pandas`` will always double memory. + +Note that ``self_destruct=True`` is not guaranteed to save memory. Since the +conversion happens column by column, memory is also freed column by column. But +if multiple columns share an underlying allocation, then no memory will be +freed until all of those columns are converted. In particular, data that comes +from IPC or Flight is prone to this, as memory will be laid out as follows:: + + Record Batch 0: Allocation 0: array 0 chunk 0, array 1 chunk 0, ... + Record Batch 1: Allocation 1: array 0 chunk 1, array 1 chunk 1, ... + ... + +In this case, no memory can be freed until the entire table is converted, even +with ``self_destruct=True``. + +Additionally, even if memory is freed by Arrow, depending on the allocator in +use, the memory may not be returned to the operating system immediately. Review comment: This is true of all deallocations, so I find it rather un-useful to mention specifically here. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #9739: ARROW-12000: [Documentation] Add note about deviation from style guide on struct/classes
pitrou commented on pull request #9739: URL: https://github.com/apache/arrow/pull/9739#issuecomment-801118593

This doesn't seem accurate. We have `struct` types in public headers (such as `filesystem.h`). I think @wesm's rationale is more faithful to reality:

> In the public headers we should be trying to use struct only for objects that are principally simple data containers where we are OK with exposing all the internal members, and any methods are primarily conveniences.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
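A hypothetical illustration of that rationale; both types below are invented for the example, not taken from the Arrow headers:

```c++
#include <cstdint>

// Fine as a struct, even in a public header: a plain data container with all
// members public, where any methods would just be conveniences.
struct ScanOptions {
  int64_t block_size = 1 << 20;
  bool use_threads = true;
};

// Better as a class: the invariant (used_ <= limit_) must be protected, so
// the members stay private behind a narrow interface.
class ByteBudget {
 public:
  explicit ByteBudget(int64_t limit) : limit_(limit) {}

  // Returns false instead of over-committing, preserving the invariant.
  bool TryReserve(int64_t n) {
    if (used_ + n > limit_) return false;
    used_ += n;
    return true;
  }

 private:
  int64_t limit_;
  int64_t used_ = 0;
};
```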
[GitHub] [arrow] dianaclarke commented on a change in pull request #9738: ARROW-11997: [Python] concat_tables crashes python interpreter
dianaclarke commented on a change in pull request #9738: URL: https://github.com/apache/arrow/pull/9738#discussion_r596059886

## File path: python/pyarrow/tests/test_table.py

@@ -271,7 +271,9 @@ def ne(xarrs, yarrs):
     eq([a, c], [d])
     ne([c, a], [a, c])

-    assert not pa.chunked_array([], type=pa.int32()).equals(None)
+    # ARROW-4822
+    with pytest.raises(AttributeError):
+        pa.chunked_array([], type=pa.int32()).equals(None)

Review comment: Oh, I just saw your edit: `table.equals(None) can still raise TypeError though.` Are you okay with this? This is how it was before.

```
>>> import pyarrow
>>> table = pyarrow.chunked_array([], type=pyarrow.int32())
>>> table.equals(None)
False
```

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] dianaclarke commented on a change in pull request #9738: ARROW-11997: [Python] concat_tables crashes python interpreter
dianaclarke commented on a change in pull request #9738: URL: https://github.com/apache/arrow/pull/9738#discussion_r596057813

## File path: python/pyarrow/tests/test_table.py

@@ -271,7 +271,9 @@ def ne(xarrs, yarrs):
     eq([a, c], [d])
     ne([c, a], [a, c])

-    assert not pa.chunked_array([], type=pa.int32()).equals(None)
+    # ARROW-4822
+    with pytest.raises(AttributeError):
+        pa.chunked_array([], type=pa.int32()).equals(None)

Review comment: Behaviour is back to normal now.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #9738: ARROW-11997: [Python] concat_tables crashes python interpreter
pitrou commented on a change in pull request #9738: URL: https://github.com/apache/arrow/pull/9738#discussion_r596043491

## File path: python/pyarrow/tests/test_table.py

@@ -271,7 +271,9 @@ def ne(xarrs, yarrs):
     eq([a, c], [d])
     ne([c, a], [a, c])

-    assert not pa.chunked_array([], type=pa.int32()).equals(None)
+    # ARROW-4822
+    with pytest.raises(AttributeError):
+        pa.chunked_array([], type=pa.int32()).equals(None)

Review comment: Right, `table == None` should return False. `table.equals(None)` can still raise `TypeError` though.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #9738: ARROW-11997: [Python] concat_tables crashes python interpreter
pitrou commented on a change in pull request #9738: URL: https://github.com/apache/arrow/pull/9738#discussion_r596043491

## File path: python/pyarrow/tests/test_table.py

@@ -271,7 +271,9 @@ def ne(xarrs, yarrs):
     eq([a, c], [d])
     ne([c, a], [a, c])

-    assert not pa.chunked_array([], type=pa.int32()).equals(None)
+    # ARROW-4822
+    with pytest.raises(AttributeError):
+        pa.chunked_array([], type=pa.int32()).equals(None)

Review comment: Right, `table == None` should return False.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] dianaclarke commented on a change in pull request #9738: ARROW-11997: [Python] concat_tables crashes python interpreter
dianaclarke commented on a change in pull request #9738: URL: https://github.com/apache/arrow/pull/9738#discussion_r596042150

## File path: python/pyarrow/tests/test_table.py

@@ -271,7 +271,9 @@ def ne(xarrs, yarrs):
     eq([a, c], [d])
     ne([c, a], [a, c])

-    assert not pa.chunked_array([], type=pa.int32()).equals(None)
+    # ARROW-4822
+    with pytest.raises(AttributeError):
+        pa.chunked_array([], type=pa.int32()).equals(None)

Review comment: This is a behaviour change though.

Before:

```
>>> import pyarrow
>>> table = pyarrow.chunked_array([], type=pyarrow.int32())
>>> table is None
False
>>> table == None
False
```

After:

```
>>> import pyarrow
>>> table = pyarrow.chunked_array([], type=pyarrow.int32())
>>> table is None
False
>>> table == None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 187, in pyarrow.lib.ChunkedArray.__eq__
    return self.equals(other)
  File "pyarrow/table.pxi", line 212, in pyarrow.lib.ChunkedArray.equals
    CChunkedArray* other_arr = other.chunked_array
AttributeError: 'NoneType' object has no attribute 'chunked_array'
```

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] bkietz closed pull request #9685: ARROW-10372: [Dataset][C++][Python][R] Support reading compressed CSV
bkietz closed pull request #9685: URL: https://github.com/apache/arrow/pull/9685 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] dianaclarke commented on pull request #9738: ARROW-11997: [Python] concat_tables crashes python interpreter
dianaclarke commented on pull request #9738: URL: https://github.com/apache/arrow/pull/9738#issuecomment-801073738 Oh, thanks a ton @pitrou! I'm working on the additional failing tests now. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #9738: ARROW-11997: [Python] concat_tables crashes python interpreter
pitrou commented on pull request #9738: URL: https://github.com/apache/arrow/pull/9738#issuecomment-801062229

This is a recurring problem, so instead of testing for None manually we could let Cython do it for us:

```diff
diff --git a/python/pyarrow/lib.pyx b/python/pyarrow/lib.pyx
index eba0f5a47..aa6918b54 100644
--- a/python/pyarrow/lib.pyx
+++ b/python/pyarrow/lib.pyx
@@ -15,9 +15,10 @@
 # specific language governing permissions and limitations
 # under the License.

-# cython: profile=False
-# distutils: language = c++
+# cython: profile = False
 # cython: embedsignature = True
+# cython: nonecheck = True
+# distutils: language = c++

 import datetime
 import decimal as _pydecimal
diff --git a/python/pyarrow/table.pxi b/python/pyarrow/table.pxi
index 2d474ba76..3f1fc28ee 100644
--- a/python/pyarrow/table.pxi
+++ b/python/pyarrow/table.pxi
@@ -2225,8 +2225,6 @@ def concat_tables(tables, c_bool promote=False, MemoryPool memory_pool=None):
                                     CConcatenateTablesOptions.Defaults())

     for table in tables:
-        if table is None:
-            raise TypeError("Unable to concatenate table. Table is None.")
         c_tables.push_back(table.sp_table)

     with nogil:
```

Note that a few tests will need fixing.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] dianaclarke commented on a change in pull request #9739: ARROW-12000: [Documentation] Add note about deviation from style guide on struct/classes
dianaclarke commented on a change in pull request #9739: URL: https://github.com/apache/arrow/pull/9739#discussion_r595979695

## File path: docs/source/developers/cpp/development.rst

@@ -77,6 +77,10 @@ This project follows `Google's C++ Style Guide
 * We relax the line length restriction to 90 characters.
 * We use the ``NULLPTR`` macro in header files (instead of ``nullptr``) defined in
   ``src/arrow/util/macros.h`` to support building C++/CLI (ARROW-1134)
+* For internal types (not in public headers) it is ok to use structs for
+  convenience (e.g. list initialization) if access control is not needed
+  (e.g. no public access to a nested type) even if the struct has
+  state/invaiants

Review comment: `s/invaiants/invariants/` ?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace opened a new pull request #9739: ARROW-12000: [Documentation] Add note about deviation from style guide on struct/classes
westonpace opened a new pull request #9739: URL: https://github.com/apache/arrow/pull/9739 ![docs](https://user-images.githubusercontent.com/1696093/111465419-2b1e8880-86c6-11eb-98d5-5e5b873c224c.png) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm commented on pull request #9421: ARROW-11066: [FlightRPC][Java] Make zero-copy writes a configurable option
lidavidm commented on pull request #9421: URL: https://github.com/apache/arrow/pull/9421#issuecomment-801015523

I think we should be all good (though obviously let's keep an eye out in case it regresses again), thanks for taking a look.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm commented on pull request #9147: ARROW-11177: [Java] ArrowMessage failed to parse compressed grpc stream
lidavidm commented on pull request #9147: URL: https://github.com/apache/arrow/pull/9147#issuecomment-801014998 I still believe we should not be polling available(). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm commented on pull request #9728: ARROW-10250: [C++][FlightRPC] Consistently use FlightClientOptions::Defaults
lidavidm commented on pull request #9728: URL: https://github.com/apache/arrow/pull/9728#issuecomment-801014690 Done, sorry (I thought I ran archery locally…) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #9738: ARROW-11997: [Python] concat_tables crashes python interpreter
github-actions[bot] commented on pull request #9738: URL: https://github.com/apache/arrow/pull/9738#issuecomment-801014571 https://issues.apache.org/jira/browse/ARROW-11997 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kszucs closed pull request #9713: ARROW-11971: [Packaging] Vcpkg patch doesn't apply on windows due to line endings
kszucs closed pull request #9713: URL: https://github.com/apache/arrow/pull/9713 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kszucs commented on pull request #9713: ARROW-11971: [Packaging] Vcpkg patch doesn't apply on windows due to line endings
kszucs commented on pull request #9713: URL: https://github.com/apache/arrow/pull/9713#issuecomment-800999549 CI error is unrelated, merging. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] dianaclarke opened a new pull request #9738: ARROW-11997: [Python] concat_tables crashes python interpreter
dianaclarke opened a new pull request #9738: URL: https://github.com/apache/arrow/pull/9738

Before:

```
>>> import pyarrow
>>> pyarrow.concat_tables([None])
Bus error: 10
Python quit unexpectedly
```

After:

```
>>> import pyarrow
>>> pyarrow.concat_tables([None])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 2229, in pyarrow.lib.concat_tables
    raise TypeError("Unable to concatenate table. Table is None.")
TypeError: Unable to concatenate table. Table is None.
```

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace commented on pull request #9720: ARROW-11976: [C++] Fix sporadic TSAN error with GatingTask
westonpace commented on pull request #9720: URL: https://github.com/apache/arrow/pull/9720#issuecomment-800985902 Thanks for cleaning this up. LGTM. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou closed pull request #9678: ARROW-11907: [C++] Use our own executor in S3FileSystem
pitrou closed pull request #9678: URL: https://github.com/apache/arrow/pull/9678 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #9728: ARROW-10250: [C++][FlightRPC] Consistently use FlightClientOptions::Defaults
pitrou commented on pull request #9728: URL: https://github.com/apache/arrow/pull/9728#issuecomment-800968946 Can you fix the Python lint error? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #9734: ARROW-11998: [C++] Make it easier to create vectors of move-only data types for tests
pitrou commented on pull request #9734: URL: https://github.com/apache/arrow/pull/9734#issuecomment-800967868 Note: originally from https://github.com/westonpace/arrow/pull/6/commits/781dccacaea129668abf1aec87becfba910a266b This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] alamb closed pull request #9695: ARROW-11955: [Rust][DataFusion] Support Union
alamb closed pull request #9695: URL: https://github.com/apache/arrow/pull/9695 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] alamb commented on a change in pull request #9695: ARROW-11955: [Rust][DataFusion] Support Union
alamb commented on a change in pull request #9695: URL: https://github.com/apache/arrow/pull/9695#discussion_r595886251

## File path: rust/datafusion/src/logical_plan/builder.rs

@@ -413,6 +436,32 @@ mod tests {
     Ok(())
 }

+#[test]
+fn plan_builder_union_combined_single_union() -> Result<()> {
+    let plan = LogicalPlanBuilder::scan_empty(
+        "employee.csv",
+        &employee_schema(),
+        Some(vec![3, 4]),
+    )?;
+
+    let plan = plan
+        .union(plan.build()?)?
+        .union(plan.build()?)?
+        .union(plan.build()?)?
+        .build()?;
+
+    // output has only one union
+    let expected = "Union\

Review comment: 👍

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] hunght3101 opened a new issue #9737: [Golang] How i create input format for IPC Reader?
hunght3101 opened a new issue #9737: URL: https://github.com/apache/arrow/issues/9737

In the library:

```
// NewReader returns a reader that reads records from an input stream.
func NewReader(r io.Reader, opts ...Option) (*Reader, error) {
	return NewReaderFromMessageReader(NewMessageReader(r), opts...)
}
```

How do I create the parameter `r` correctly when I call the NewReader function? For example:

```
var r io.Reader
r = strings.NewReader(input)

rr, err := ipc.NewReader(r, ipc.WithSchema(schema), ipc.WithAllocator(mem))
if err != nil {
	fmt.Println(err)
}
```

How do I create the input?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #9720: ARROW-11976: [C++] Fix sporadic TSAN error with GatingTask
pitrou commented on a change in pull request #9720: URL: https://github.com/apache/arrow/pull/9720#discussion_r595837653

## File path: cpp/src/arrow/testing/gtest_util.cc

@@ -719,43 +719,44 @@ ExtensionTypeGuard::~ExtensionTypeGuard() {
   }
 }
-class GatingTask::Impl {
+class GatingTask::Impl : public std::enable_shared_from_this<Impl> {
  public:
   explicit Impl(double timeout_seconds)
       : timeout_seconds_(timeout_seconds), status_(), unlocked_(false) {}

   ~Impl() {
-    std::unique_lock<std::mutex> lk(mx_);
-    if (!finished_cv_.wait_for(
-            lk, std::chrono::nanoseconds(static_cast<int64_t>(timeout_seconds_ * 1e9)),
-            [this] { return num_finished_ == num_launched_; })) {
+    if (num_running_ != num_launched_) {
       ADD_FAILURE()
-          << "A GatingTask instance was destroyed but the underlying tasks did not "
+          << "A GatingTask instance was destroyed but some underlying tasks did not "
+             "start running"
+          << std::endl;
+    } else if (num_finished_ != num_launched_) {
+      ADD_FAILURE()
+          << "A GatingTask instance was destroyed but some underlying tasks did not "
             "finish running"
          << std::endl;
     }
   }

   std::function<void()> Task() {
     num_launched_++;

Review comment: I think Task() is only meant to be called from the main thread.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #9720: ARROW-11976: [C++] Fix sporadic TSAN error with GatingTask
pitrou commented on a change in pull request #9720: URL: https://github.com/apache/arrow/pull/9720#discussion_r595837013

## File path: cpp/src/arrow/testing/gtest_util.cc

@@ -769,10 +770,8 @@
   }

   Status Unlock() {
-    {
-      std::lock_guard<std::mutex> lk(mx_);
-      unlocked_ = true;
-    }
+    std::lock_guard<std::mutex> lk(mx_);

Review comment: Indeed, but this is a test utility so I'd favor simplicity.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
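For context, the simplicity-versus-contention trade-off under discussion looks roughly like this (a standalone sketch, not the actual GatingTask code):

```c++
#include <condition_variable>
#include <mutex>

std::mutex mx;
std::condition_variable cv;
bool unlocked = false;

// Notify while still holding the lock: simplest to reason about. A waiter
// woken by notify_all() may briefly block again on the still-held mutex,
// which is a harmless cost in a test utility.
void UnlockSimple() {
  std::lock_guard<std::mutex> lk(mx);
  unlocked = true;
  cv.notify_all();
}

// Alternative: drop the lock before notifying, avoiding that extra wakeup
// contention at the cost of a slightly more delicate critical section.
void UnlockOutsideLock() {
  {
    std::lock_guard<std::mutex> lk(mx);
    unlocked = true;
  }
  cv.notify_all();
}
```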
[GitHub] [arrow] hunght3101 commented on issue #9735: [Golang] Create ipc format for ipc Reader
hunght3101 commented on issue #9735: URL: https://github.com/apache/arrow/issues/9735#issuecomment-800908056

@emkornfield Excuse me, I know `r = strings.NewReader("Object Apache Arrow")` is the problem, but I don't know the right input to give it. Could you re-open this issue?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] liyafan82 commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4
liyafan82 commented on pull request #8949: URL: https://github.com/apache/arrow/pull/8949#issuecomment-800898224

> please update the docs to match, something like:
> "Slice the buffer to contain the uncompressed bytes"

Updated. Thank you.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] liyafan82 commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4
liyafan82 commented on pull request #8949: URL: https://github.com/apache/arrow/pull/8949#issuecomment-800897864

> With the new enum, maybe we can make this an accessor that returns an enum instead? And then the byte can be extracted from there where necessary?

Sounds good. I have revised the code accordingly.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] liyafan82 commented on a change in pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4
liyafan82 commented on a change in pull request #8949: URL: https://github.com/apache/arrow/pull/8949#discussion_r595805905

## File path: java/compression/src/main/java/org/apache/arrow/compression/Lz4CompressionCodec.java

@@ -0,0 +1,158 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.compression;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+
+import org.apache.arrow.flatbuf.CompressionType;
+import org.apache.arrow.memory.ArrowBuf;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.util.MemoryUtil;
+import org.apache.arrow.util.Preconditions;
+import org.apache.arrow.vector.compression.CompressionCodec;
+import org.apache.arrow.vector.compression.CompressionUtil;
+import org.apache.commons.compress.compressors.lz4.FramedLZ4CompressorInputStream;
+import org.apache.commons.compress.compressors.lz4.FramedLZ4CompressorOutputStream;
+import org.apache.commons.compress.utils.IOUtils;
+
+import io.netty.util.internal.PlatformDependent;

Review comment: I guess no for now. Maybe we can have one in the future, so we can remove the dependency on Netty (and other dependencies on Netty as well).

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org