[GitHub] [arrow] kou commented on pull request #33817: GH-33816: [CI][Conan] Use TARGET_FILE for portability
kou commented on PR #33817: URL: https://github.com/apache/arrow/pull/33817#issuecomment-1399202474 @github-actions crossbow submit wheel-windows-*-amd64 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kou commented on pull request #33817: GH-33816: [CI][Conan] Use TARGET_FILE for portability
kou commented on PR #33817: URL: https://github.com/apache/arrow/pull/33817#issuecomment-1399202261 @github-actions crossbow submit conan-maximum
[GitHub] [arrow] kou opened a new pull request, #33817: GH-33816: [CI][Conan] Use TARGET_FILE for portability
kou opened a new pull request, #33817: URL: https://github.com/apache/arrow/pull/33817

### What changes are included in this PR?
Use `$<TARGET_FILE:...>` instead of manual library path resolution.

### Are these changes tested?
Yes.

### Are there any user-facing changes?
No.
[GitHub] [arrow] kou commented on issue #33814: [Python] Can't install on Raspberry Pi (Failed building wheel for pyarrow)
kou commented on issue #33814: URL: https://github.com/apache/arrow/issues/33814#issuecomment-1399201865 It seems that you didn't define `CMAKE_PREFIX_PATH`: https://arrow.apache.org/docs/dev/developers/python.html#using-system-and-bundled-dependencies

```bash
export ARROW_HOME=$(pwd)/dist
export LD_LIBRARY_PATH=$(pwd)/dist/lib:$LD_LIBRARY_PATH
export CMAKE_PREFIX_PATH=$ARROW_HOME:$CMAKE_PREFIX_PATH
```
[GitHub] [arrow] kou opened a new pull request, #33815: GH-33813: [CI][GLib] Use Ruby 3.2 to update bundled MSYS2
kou opened a new pull request, #33815: URL: https://github.com/apache/arrow/pull/33815

### What changes are included in this PR?
Use Ruby 3.2 to update bundled MSYS2.

### Are these changes tested?
Yes.

### Are there any user-facing changes?
No.
[GitHub] [arrow] wgtmac commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
wgtmac commented on code in PR #14964: URL: https://github.com/apache/arrow/pull/14964#discussion_r1083257758

## cpp/src/parquet/page_index.cc: ##

```cpp
@@ -184,8 +185,219 @@ class OffsetIndexImpl : public OffsetIndex {
   std::vector<PageLocation> page_locations_;
 };

+class RowGroupPageIndexReaderImpl : public RowGroupPageIndexReader {
+ public:
+  RowGroupPageIndexReaderImpl(::arrow::io::RandomAccessFile* input,
+                              std::shared_ptr<RowGroupMetaData> row_group_metadata,
+                              const ReaderProperties& properties,
+                              int32_t row_group_ordinal,
+                              std::shared_ptr<InternalFileDecryptor> file_decryptor)
+      : input_(input),
+        row_group_metadata_(std::move(row_group_metadata)),
+        properties_(properties),
+        file_decryptor_(std::move(file_decryptor)),
+        index_read_range_(
+            PageIndexReader::DeterminePageIndexRangesInRowGroup(*row_group_metadata_)) {}
+
+  /// Read column index of a column chunk.
+  std::shared_ptr<ColumnIndex> GetColumnIndex(int32_t i) override {
+    if (i < 0 || i >= row_group_metadata_->num_columns()) {
+      throw ParquetException("Invalid column {} to get column index", i);
+    }
+
+    auto col_chunk = row_group_metadata_->ColumnChunk(i);
+
+    std::unique_ptr<ColumnCryptoMetaData> crypto_metadata = col_chunk->crypto_metadata();
+    if (crypto_metadata != nullptr && file_decryptor_ == nullptr) {
+      ParquetException::NYI("Cannot read encrypted column index yet");
+    }
+
+    auto column_index_location = col_chunk->GetColumnIndexLocation();
+    if (!column_index_location.has_value()) {
+      return nullptr;
+    }
+
+    if (!index_read_range_.column_index.has_value()) {
+      throw ParquetException("Missing column index read range");
+    }
+
+    if (column_index_buffer_ == nullptr) {
+      PARQUET_ASSIGN_OR_THROW(column_index_buffer_,
+                              input_->ReadAt(index_read_range_.column_index->offset,
+                                             index_read_range_.column_index->length));
+    }
+
+    auto buffer = column_index_buffer_.get();
+    int64_t buffer_offset =
+        column_index_location->offset - index_read_range_.column_index->offset;
+    uint32_t length = static_cast<uint32_t>(column_index_location->length);
+    DCHECK_GE(buffer_offset, 0);
```

Review Comment: `ThriftDeserializer::DeserializeMessage` requires a `uint32_t*` as the in/out parameter for the length of the Thrift message, and `ColumnIndex::Make()` simply reuses `uint32_t` for simplicity. BTW, this is not a down-cast: `column_index_location->length` is of type `int32_t`.
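The offset arithmetic under review can be sketched in Python with hypothetical values (the names and numbers below are illustrative, not the real parquet API): the reader fetches one contiguous byte range covering every column index in the row group, then slices each column's index out of that shared buffer.

```python
# Hypothetical coalesced read range covering all column indexes in a row group.
range_offset, range_length = 1000, 300
buffer = bytes(range_length)  # stands in for input->ReadAt(offset, length)

def slice_column_index(location_offset: int, location_length: int) -> bytes:
    # Each column's index location is absolute in the file; translate it to an
    # offset inside the coalesced buffer. This must be >= 0 (the DCHECK above).
    buffer_offset = location_offset - range_offset
    assert 0 <= buffer_offset <= range_length - location_length
    return buffer[buffer_offset:buffer_offset + location_length]

piece = slice_column_index(1100, 50)  # a column index 100 bytes into the range
print(len(piece))  # 50
```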
[GitHub] [arrow] kou commented on issue #15139: [C++] arrow.pc is missing dependencies with Windows static builds
kou commented on issue #15139: URL: https://github.com/apache/arrow/issues/15139#issuecomment-1399200167 Hmm. It seems that it's related to vcpkg. Could you open an issue on vcpkg?
[GitHub] [arrow] ursabot commented on pull request #33811: GH-33786: [C++] Ignore old system xsimd
ursabot commented on PR #33811: URL: https://github.com/apache/arrow/pull/33811#issuecomment-1399199450 Benchmark runs are scheduled for baseline = 0d9d132e9140f26578369b5ef0b44d25c501e45d and contender = 2117d028699edd9f4197650890f3226cdd285c23. 2117d028699edd9f4197650890f3226cdd285c23 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
[Failed] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/9fde0e445003471087541e47735a2e68...b3787b82b23b4d6b9b78386fe421dcc1/)
[Failed] [test-mac-arm](https://conbench.ursa.dev/compare/runs/1cd922b14c864ae2a84b80f4185faef6...c64d806e1d134ffab9d1ec9139bd444e/)
[Failed] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/0fbb4c22a0bd4465acac2bc9ed954cd1...d0bf5c298da6498594a348ce39a208da/)
[Failed] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/f3b67a6be39d4d1d8f884ac0fcf6624f...2df2681c6c164b6093f2773bfc4e6b3c/)

Buildkite builds:
[Failed] [`2117d028` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2241)
[Failed] [`2117d028` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2266)
[Failed] [`2117d028` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2236)
[Failed] [`2117d028` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2259)
[Failed] [`0d9d132e` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2240)
[Failed] [`0d9d132e` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2265)
[Failed] [`0d9d132e` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2235)
[Failed] [`0d9d132e` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2258)

Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
[GitHub] [arrow] wgtmac commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
wgtmac commented on code in PR #14964: URL: https://github.com/apache/arrow/pull/14964#discussion_r1083256983

## cpp/src/parquet/file_reader.cc: ##

```cpp
@@ -302,6 +303,17 @@ class SerializedFile : public ParquetFileReader::Contents {
   std::shared_ptr<FileMetaData> metadata() const override { return file_metadata_; }

+  std::shared_ptr<PageIndexReader> GetPageIndexReader() override {
+    if (!file_metadata_) {
+      throw ParquetException("Cannot get PageIndexReader as file metadata is not ready");
```

Review Comment: Usually this won't happen if the user calls one of the static Open() functions to create a ParquetFileReader instance. But if the user calls the constructor directly and calls GetPageIndexReader() before Open(), then this could happen. Added a comment to explain it and made the exception message clearer.
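The guard discussed here can be sketched in Python with hypothetical stand-ins (these are not the real parquet classes): constructing the file object directly and requesting the page index reader before Open() should fail loudly, because Open() is what parses the footer metadata.

```python
# Hypothetical sketch of the "metadata not ready" guard, not the real API.
class SerializedFile:
    def __init__(self):
        self._file_metadata = None  # not available until Open() parses the footer

    def open(self):
        self._file_metadata = {"num_row_groups": 1}  # stand-in for parsed footer

    def get_page_index_reader(self):
        if self._file_metadata is None:
            raise RuntimeError(
                "Cannot get PageIndexReader as file metadata is not ready")
        return object()  # stand-in for a PageIndexReader

f = SerializedFile()
try:
    f.get_page_index_reader()  # constructor called directly, Open() skipped
except RuntimeError as e:
    print(e)  # Cannot get PageIndexReader as file metadata is not ready
f.open()
reader = f.get_page_index_reader()  # now succeeds
```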
[GitHub] [arrow] Oduig commented on issue #33790: [Python] Support for reading .csv files from a zip archive
Oduig commented on issue #33790: URL: https://github.com/apache/arrow/issues/33790#issuecomment-1399198244 Thank you for the reply, I see the relevant code is in the cpp section! It already works with gz and bz2, but not with (Windows-esque) .zip files. Is there a reason why it is not included as a compression codec?
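One reason zip differs from gz/bz2 is that it is an archive of named members rather than a byte-stream codec, so it does not slot into a per-file "compression" option as-is. A minimal Python stdlib workaround, assuming the archive holds a single CSV member, is to unwrap the member first and hand its bytes to whatever CSV reader you use:

```python
import io
import zipfile

# Build a small zip in memory to stand in for a downloaded archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data.csv", "a,b\n1,2\n")

# Unwrap the CSV member; the resulting bytes are plain CSV.
with zipfile.ZipFile(buf) as zf:
    with zf.open("data.csv") as member:
        text = member.read().decode()

print(text.splitlines()[0])  # a,b
```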
[GitHub] [arrow-adbc] kou commented on a diff in pull request #356: feat(go/adbc/driver/pkg/cmake): cmake build for Go shared library drivers
kou commented on code in PR #356: URL: https://github.com/apache/arrow-adbc/pull/356#discussion_r1083254936 ## c/driver/flightsql/adbc-driver-flightsql.pc.in: ## @@ -0,0 +1,25 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +prefix=@CMAKE_INSTALL_PREFIX@ +libdir=@ADBC_PKG_CONFIG_LIBDIR@ + +Name: Apache Arrow Database Connectivity (ADBC) FlightSQL driver Review Comment: ```suggestion Name: Apache Arrow Database Connectivity (ADBC) Flight SQL driver ``` ## c/driver/flightsql/adbc-driver-flightsql.pc.in: ## @@ -0,0 +1,25 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. 
See the License for the +# specific language governing permissions and limitations +# under the License. + +prefix=@CMAKE_INSTALL_PREFIX@ +libdir=@ADBC_PKG_CONFIG_LIBDIR@ + +Name: Apache Arrow Database Connectivity (ADBC) FlightSQL driver +Description: The ADBC FlightSQL driver provides an ADBC driver for Flight SQL. Review Comment: ```suggestion Description: The ADBC Flight SQL driver provides an ADBC driver for Flight SQL. ``` ## c/cmake_modules/GoUtils.cmake: ## @@ -0,0 +1,210 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +find_program(GO_BIN "go" REQUIRED) +message(STATUS "Detecting Go executable: Found ${GO_BIN}") + +function(install_cmake_package PACKAGE_NAME EXPORT_NAME) + set(CONFIG_CMAKE "${PACKAGE_NAME}Config.cmake") + set(BUILT_CONFIG_CMAKE "${CMAKE_CURRENT_BINARY_DIR}/${CONFIG_CMAKE}") + configure_package_config_file("${CONFIG_CMAKE}.in" "${BUILT_CONFIG_CMAKE}" +INSTALL_DESTINATION "${ARROW_CMAKE_DIR}/${PACKAGE_NAME}") + set(CONFIG_VERSION_CMAKE "${PACKAGE_NAME}ConfigVersion.cmake") + set(BUILT_CONFIG_VERSION_CMAKE "${CMAKE_CURRENT_BINARY_DIR}/${CONFIG_VERSION_CMAKE}") + write_basic_package_version_file("${BUILT_CONFIG_VERSION_CMAKE}" + COMPATIBILITY SameMajorVersion) + set(INSTALL_CMAKEDIR "${CMAKE_INSTALL_LIBDIR}/cmake/${PACKAGE_NAME}") + install(FILES "${BUILT_CONFIG_CMAKE}" "${BUILT_CONFIG_VERSION_CMAKE}" + DESTINATION "${INSTALL_CMAKEDIR}") + set(TARGETS_CMAKE "${PACKAGE_NAME}Targets.cmake") + install(EXPORT ${EXPORT_NAME} + DESTINATION "${INSTALL_CMAKEDIR}" + NAMESPACE "${PACKAGE_NAME}::" + FILE "${TARGETS_CMAKE}" + EXPORT_LINK_INTERFACE_LIBRARIES) +endfunction() Review Comment: It seems that we can remove this.
[GitHub] [arrow-datafusion] xiaoyong-z commented on a diff in pull request #4995: [Feature] support describe file
xiaoyong-z commented on code in PR #4995: URL: https://github.com/apache/arrow-datafusion/pull/4995#discussion_r1083255049 ## datafusion/expr/src/logical_plan/plan.rs: ## @@ -1625,6 +1637,15 @@ pub struct Prepare { pub input: Arc<LogicalPlan>, } +/// Describe a file Review Comment: done
[GitHub] [arrow-datafusion] xiaoyong-z commented on a diff in pull request #4995: [Feature] support describe file
xiaoyong-z commented on code in PR #4995: URL: https://github.com/apache/arrow-datafusion/pull/4995#discussion_r1083254991 ## datafusion/core/tests/sqllogictests/test_files/describe.slt: ## @@ -0,0 +1,43 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + + +## +# Describe internal tables +## + +statement ok +set datafusion.catalog.information_schema = true + +statement ok +CREATE external table aggregate_simple(c1 real, c2 double, c3 boolean) STORED as CSV WITH HEADER ROW LOCATION 'tests/data/aggregate_simple.csv'; + +query C1 +DESCRIBE aggregate_simple; Review Comment: There are no sqllogictests covering describe, so I added some tests here. Now I have unified describe-table and describe-file, and I think it's necessary to add tests for this.
[GitHub] [arrow-datafusion] xiaoyong-z commented on a diff in pull request #4995: [Feature] support describe file
xiaoyong-z commented on code in PR #4995: URL: https://github.com/apache/arrow-datafusion/pull/4995#discussion_r1083254920 ## datafusion/core/src/datasource/listing/table.rs: ## @@ -67,6 +67,10 @@ pub struct ListingTableConfig { pub file_schema: Option<SchemaRef>, /// Optional `ListingOptions` for the to be created `ListingTable`. pub options: Option<ListingOptions>, +/// Optional, defaults to false. If temporary is true, it means that +/// we will create a temporary table for select * from `xxx.file` +/// or describe 'xxx.file'; the temporary table will be registered to the schema +pub temporary_file: bool, Review Comment: Updated the PR; there is no need to use temporary_file any more.
[GitHub] [arrow] wgtmac commented on a diff in pull request #33694: MINOR: [C++][Parquet] Rephrase decimal annotation
wgtmac commented on code in PR #33694: URL: https://github.com/apache/arrow/pull/33694#discussion_r1083254666 ## cpp/src/parquet/properties.h: ## @@ -452,19 +452,39 @@ class PARQUET_EXPORT WriterProperties { return this->disable_statistics(path->ToDotString()); } -/// Enable integer type to annotate decimal type as below: -/// int32: 1 <= precision <= 9 -/// int64: 10 <= precision <= 18 -/// Default disabled. -Builder* enable_integer_annotate_decimal() { - integer_annotate_decimal_ = true; +/// Enable decimal logical type with 1 <= precision <= 18 to be stored as +/// integer physical type. +/// +/// According to the specs, DECIMAL can be used to annotate the following types: +/// - int32: for 1 <= precision <= 9. +/// - int64: for 1 <= precision <= 18; precision < 10 will produce a warning. +/// - fixed_len_byte_array: precision is limited by the array size. +/// Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits. +/// - binary: precision is not limited, but is required. The minimum number of bytes +/// to store the unscaled value should be used. +/// +/// By default, this is DISABLED and all decimal types annotate fixed_len_byte_array. +/// +/// When enabled, the C++ writer will use following physical types to store decimals: +/// - int32: for 1 <= precision <= 9. +/// - int64: for 10 <= precision <= 18. +/// - fixed_len_byte_array: for precision > 18. +/// +/// As a consequence, decimal columns stored in integer types are more compact +/// but at the risk that the parquet file may not be readable by previous Arrow C++ +/// versions or other implementations. Review Comment: I have checked that both apache/parquet-mr and apache/impala have supported reading this for a long time. It seems that the native parquet reader in velox does not handle this yet. I'm not familiar with other implementations, but there may be some that do not support it. IMHO, we cannot take care of all other implementations. There is no risk to parquet-cpp itself because the reader support is already in, and it is well documented by the parquet specs. So I have removed the comment about the risk. @wjones127 @pitrou
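The precision-to-physical-type mapping and the digit-capacity formula quoted in the diff can be sketched in Python (function names here are illustrative, not part of any parquet API):

```python
import math

# Physical type the writer picks for a DECIMAL of a given precision
# when the integer-annotation option is enabled (per the quoted doc comment).
def decimal_physical_type(precision: int) -> str:
    if 1 <= precision <= 9:
        return "int32"
    if 10 <= precision <= 18:
        return "int64"
    return "fixed_len_byte_array"  # precision > 18 (precision < 1 is invalid)

# Digit capacity of a fixed_len_byte_array of n bytes (two's-complement):
# floor(log_10(2^(8*n - 1) - 1)) base-10 digits, as in the quoted formula.
def flba_max_precision(n: int) -> int:
    return int(math.floor(math.log10(2 ** (8 * n - 1) - 1)))

print(decimal_physical_type(9))   # int32
print(decimal_physical_type(18))  # int64
print(flba_max_precision(4))      # 9
```

Note that `flba_max_precision(16) == 38`, which matches the 38-digit capacity commonly associated with 128-bit decimals.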
[GitHub] [arrow] kou merged pull request #33811: GH-33786: [C++] Ignore old system xsimd
kou merged PR #33811: URL: https://github.com/apache/arrow/pull/33811
[GitHub] [arrow] kou opened a new pull request, #33812: GH-33796: [C++] Fix wrong arrow-testing.pc config with system GoogleTest
kou opened a new pull request, #33812: URL: https://github.com/apache/arrow/pull/33812

### Rationale for this change
An empty `-I` in `Cflags:` generates an invalid build command line.

### What changes are included in this PR?
Add `Requires: gtest` if `gtest.pc` exists. Add `Cflags: -I...` and `Libs: /.../libgtest.a` if `gtest.pc` doesn't exist. Bundled GoogleTest isn't supported yet.

### Are these changes tested?
Yes.

### Are there any user-facing changes?
Yes.
[GitHub] [arrow] wgtmac commented on a diff in pull request #33694: MINOR: [C++][Parquet] Rephrase decimal annotation
wgtmac commented on code in PR #33694: URL: https://github.com/apache/arrow/pull/33694#discussion_r1083253360 ## cpp/src/parquet/properties.h: ## @@ -452,19 +452,39 @@ class PARQUET_EXPORT WriterProperties { return this->disable_statistics(path->ToDotString()); } -/// Enable integer type to annotate decimal type as below: -/// int32: 1 <= precision <= 9 -/// int64: 10 <= precision <= 18 -/// Default disabled. -Builder* enable_integer_annotate_decimal() { - integer_annotate_decimal_ = true; +/// Enable decimal logical type with 1 <= precision <= 18 to be stored as +/// integer physical type. +/// +/// According to the specs, DECIMAL can be used to annotate the following types: +/// - int32: for 1 <= precision <= 9. +/// - int64: for 1 <= precision <= 18; precision < 10 will produce a warning. +/// - fixed_len_byte_array: precision is limited by the array size. +/// Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits. +/// - binary: precision is not limited, but is required. The minimum number of bytes +/// to store the unscaled value should be used. Review Comment: That makes sense. I have changed the comment accordingly.
[GitHub] [arrow] wgtmac commented on a diff in pull request #33694: MINOR: [C++][Parquet] Rephrase decimal annotation
wgtmac commented on code in PR #33694: URL: https://github.com/apache/arrow/pull/33694#discussion_r1083252465 ## cpp/src/parquet/properties.h: ## @@ -452,19 +452,39 @@ class PARQUET_EXPORT WriterProperties { return this->disable_statistics(path->ToDotString()); } -/// Enable integer type to annotate decimal type as below: -/// int32: 1 <= precision <= 9 -/// int64: 10 <= precision <= 18 -/// Default disabled. -Builder* enable_integer_annotate_decimal() { - integer_annotate_decimal_ = true; +/// Enable decimal logical type with 1 <= precision <= 18 to be stored as +/// integer physical type. Review Comment: I don't think negative precision decimals exist, but keeping `1 <= precision` doesn't hurt.
[GitHub] [arrow-julia] quinnj merged pull request #381: Tag new version dev/release/release.sh
quinnj merged PR #381: URL: https://github.com/apache/arrow-julia/pull/381
[GitHub] [arrow] kou commented on pull request #33753: GH-30891: [C++] The C++ API for writing datasets could be improved
kou commented on PR #33753: URL: https://github.com/apache/arrow/pull/33753#issuecomment-1399190351 GLib/Ruby parts are updated.
[GitHub] [arrow] kou commented on a diff in pull request #33791: GH-33782: [Release] Vote email number of issues is querying JIRA and producing a wrong number
kou commented on code in PR #33791: URL: https://github.com/apache/arrow/pull/33791#discussion_r1083240407 ## dev/release/02-source-test.rb: ## @@ -93,33 +93,26 @@ def test_python_version end def test_vote -jira_url = "https://issues.apache.org/jira"; -jql_conditions = [ - "project = ARROW", - "status in (Resolved, Closed)", - "fixVersion = #{@release_version}", -] -jql = jql_conditions.join(" AND ") +github_token = ENV["ARROW_GITHUB_API_TOKEN"] +uri = URI.parse("https://api.github.com/graphql";) +https = Net::HTTP.new(uri.host, uri.port) +https.use_ssl = true +req = Net::HTTP::Post.new(uri.path, initheader = {"Authorization" => "Bearer #{github_token}"}) + +n_issues_query = "{\"query\":\"query {search(query: \\\"repo:apache/arrow is:issue is:closed milestone:#{@release_version}\\\", type:ISSUE) {issueCount}}\"}" +req.body = n_issues_query n_resolved_issues = nil -search_url = URI("#{jira_url}/rest/api/2/search?jql=#{CGI.escape(jql)}") -search_url.open do |response| - n_resolved_issues = JSON.parse(response.read)["total"] +https.request(req) do |response| +n_resolved_issues = JSON.parse(response.body)["data"]["search"]["issueCount"] end Review Comment: We can use `Net::HTTP.post`: ```ruby uri = URI.parse("https://api.github.com/graphql";) n_issues_query = { "query" => <<-QUERY, query { search(query: "repo:apache/arrow is:issue is:closed milestone:#{@release_version}", type: ISSUE) { issueCount } } QUERY } response = Net::HTTP.post(uri, n_issues_query.to_json, "Content-Type" => "application/json", "Authorization" => "Bearer #{github_token}") n_resolved_issues = JSON.parse(response.body)["data"]["search"]["issueCount"] ``` ## dev/release/02-source-test.rb: ## @@ -93,33 +93,26 @@ def test_python_version end def test_vote -jira_url = "https://issues.apache.org/jira"; -jql_conditions = [ - "project = ARROW", - "status in (Resolved, Closed)", - "fixVersion = #{@release_version}", -] -jql = jql_conditions.join(" AND ") +github_token = ENV["ARROW_GITHUB_API_TOKEN"] +uri = 
URI.parse("https://api.github.com/graphql";) +https = Net::HTTP.new(uri.host, uri.port) +https.use_ssl = true +req = Net::HTTP::Post.new(uri.path, initheader = {"Authorization" => "Bearer #{github_token}"}) + +n_issues_query = "{\"query\":\"query {search(query: \\\"repo:apache/arrow is:issue is:closed milestone:#{@release_version}\\\", type:ISSUE) {issueCount}}\"}" +req.body = n_issues_query n_resolved_issues = nil -search_url = URI("#{jira_url}/rest/api/2/search?jql=#{CGI.escape(jql)}") -search_url.open do |response| - n_resolved_issues = JSON.parse(response.read)["total"] +https.request(req) do |response| +n_resolved_issues = JSON.parse(response.body)["data"]["search"]["issueCount"] end -github_api_url = "https://api.github.com"; -verify_prs = URI("#{github_api_url}/repos/apache/arrow/pulls" + - "?state=open" + - "&head=apache:release-#{@release_version}-rc0") + +pr_query = "{\"query\": \"query {repository(owner: \\\"apache\\\", name: \\\"arrow\\\") {refs(first: 1, refPrefix: \\\"refs/heads/\\\", query: \\\"release-#{@release_version}-rc0\\\") {nodes{associatedPullRequests(first: 1){edges{node{url}}}\"}" +req.body = pr_query verify_pr_url = nil -headers = { - "Accept" => "application/vnd.github+json", -} -github_token = ENV["ARROW_GITHUB_API_TOKEN"] -if github_token - headers["Authorization"] = "Bearer #{github_token}" -end -verify_prs.open(headers) do |response| - verify_pr_url = (JSON.parse(response.read)[0] || {})["html_url"] +https.request(req) do |response| + verify_pr_url = JSON.parse(response.body)["data"]["repository"]["refs"]["nodes"][0]["associatedPullRequests"]["edges"][0]["node"]["url"] end + Review Comment: Do we need to change this? It seems that we can still use the current simple REST API. 
## dev/release/02-source.sh: ## @@ -142,18 +142,18 @@ if [ ${SOURCE_PR} -gt 0 ]; then fi if [ ${SOURCE_VOTE} -gt 0 ]; then - jira_url="https://issues.apache.org/jira"; - jql="project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%20${version}" - n_resolved_issues=$(curl "${jira_url}/rest/api/2/search/?jql=${jql}" | jq ".total") - curl_options=(--header "Accept: application/vnd.github+json") - if [ -n "${ARROW_GITHUB_API_TOKEN:-}" ]; then -curl_options+=(--header "Authorization: Bearer ${ARROW_GITHUB_API_TOKEN}") - fi - curl_options+=(--get) - curl_options+=(--data "state=open") - curl_options+=(--data "head=apach
[GitHub] [arrow] kou commented on pull request #33811: GH-33786: [C++] Ignore old system xsimd
kou commented on PR #33811: URL: https://github.com/apache/arrow/pull/33811#issuecomment-1399162810 @github-actions crossbow submit verify-rc-source-cpp-linux-ubuntu-22.04-amd64 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ursabot commented on pull request #33739: GH-33655: [C++][Parquet] Fix occasional failure in TestArrowReadWrite.MultithreadedWrite
ursabot commented on PR #33739: URL: https://github.com/apache/arrow/pull/33739#issuecomment-1399162691 Benchmark runs are scheduled for baseline = a16c54567ada6729311fd26bdbee4b5e61901410 and contender = 0d9d132e9140f26578369b5ef0b44d25c501e45d. 0d9d132e9140f26578369b5ef0b44d25c501e45d is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Failed] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/a2c888cbc9e54acbb087f82996ef2976...9fde0e445003471087541e47735a2e68/) [Failed] [test-mac-arm](https://conbench.ursa.dev/compare/runs/9c950d82a1a141cb9555bb25a732624c...1cd922b14c864ae2a84b80f4185faef6/) [Failed] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/9ac607cae658446e885a40089cb71e52...0fbb4c22a0bd4465acac2bc9ed954cd1/) [Failed] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/6c0e57252df347318d11432fdaa0cdb9...f3b67a6be39d4d1d8f884ac0fcf6624f/) Buildkite builds: [Failed] [`0d9d132e` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2240) [Failed] [`0d9d132e` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2265) [Failed] [`0d9d132e` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2235) [Failed] [`0d9d132e` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2258) [Failed] [`a16c5456` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2239) [Failed] [`a16c5456` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2264) [Failed] [`a16c5456` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2234) [Failed] [`a16c5456` 
ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2257) Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
[GitHub] [arrow] kou opened a new pull request, #33811: GH-33786: [C++] Ignore old system xsimd
kou opened a new pull request, #33811: URL: https://github.com/apache/arrow/pull/33811 ### Rationale for this change If an old xsimd is installed, the CMake target for the bundled xsimd conflicts with it. ### What changes are included in this PR? Use `arrow::xsimd` as the bundled xsimd's target name to avoid the conflict. ### Are these changes tested? Yes. ### Are there any user-facing changes? No.
[GitHub] [arrow-datafusion] ozankabak commented on pull request #4866: Support non-equijoin predicate for EliminateCrossJoin
ozankabak commented on PR #4866: URL: https://github.com/apache/arrow-datafusion/pull/4866#issuecomment-1399161710 Any thoughts on how to make progress on this?
[GitHub] [arrow] kou commented on issue #33786: [Release][C++] Release verification tasks fail with libxsimd-dev installed on ubuntu 22.04
kou commented on issue #33786: URL: https://github.com/apache/arrow/issues/33786#issuecomment-1399160270 We can use `... ARROW_CMAKE_OPTIONS="-Dxsimd_SOURCE=BUNDLED" dev/release/verify-release-candidate.sh ...` instead of changing `verify-release-candidate.sh`.
[GitHub] [arrow-julia] kou commented on pull request #381: Tag new version dev/release/release.sh
kou commented on PR #381: URL: https://github.com/apache/arrow-julia/pull/381#issuecomment-1399158739 OK. We can try TagBot in the next release again. If TagBot doesn't work, we can use this manual tagging approach.
[GitHub] [arrow] mapleFU commented on issue #15173: [Parquet][C++] ByteStreamSplitDecoder broken in presence of nulls
mapleFU commented on issue #15173: URL: https://github.com/apache/arrow/issues/15173#issuecomment-1399158204 @wjones127 @emkornfield Hi, what do you think of this problem? Should we ensure there is no padding? Or just use a method like https://github.com/apache/arrow/issues/15173#issuecomment-1385858560 ?
[GitHub] [arrow] kou commented on a diff in pull request #33806: GH-33723: [C++] re2::RE2::RE2() result must be checked
kou commented on code in PR #33806: URL: https://github.com/apache/arrow/pull/33806#discussion_r1083228638

## cpp/src/arrow/compute/kernels/scalar_string_ascii.cc: ##
@@ -1505,6 +1508,13 @@ struct MatchLike {
 static const RE2 kLikePatternIsStartsWith(R"(([^%_]*[^\\%_])?%+)", kRE2Options);
 // A LIKE pattern matching this regex can be translated into a suffix search.
 static const RE2 kLikePatternIsEndsWith(R"(%+([^%_]*))", kRE2Options);
+static bool global_checked = false;
Review Comment:
  Can we use `std::once_flag`?

## cpp/src/arrow/compute/kernels/scalar_string_ascii.cc: ##
@@ -1505,6 +1508,13 @@ struct MatchLike {
 static const RE2 kLikePatternIsStartsWith(R"(([^%_]*[^\\%_])?%+)", kRE2Options);
 // A LIKE pattern matching this regex can be translated into a suffix search.
 static const RE2 kLikePatternIsEndsWith(R"(%+([^%_]*))", kRE2Options);
+static bool global_checked = false;
+if (ARROW_PREDICT_FALSE(!global_checked)) {
+  RETURN_NOT_OK(RegexStatus(kLikePatternIsSubstringMatch));
+  RETURN_NOT_OK(RegexStatus(kLikePatternIsStartsWith));
+  RETURN_NOT_OK(RegexStatus(kLikePatternIsEndsWith));
Review Comment:
  If we have an error in these patterns, does this code report an error on the second call? (It seems that this code reports an error only on the first call.)
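For reference, the combination kou is asking about — run validation once with `std::once_flag`, but cache the resulting status so the error is reported on every call, not only the first — can be sketched roughly like this. This is a standalone illustration, not Arrow's actual code: `Status`, `ValidatePattern` (a toy checker that only verifies balanced parentheses), and `EnsureLikePatternsValid` all stand in for the real RE2-based `RegexStatus` machinery, and one pattern is deliberately broken for the demo.

```cpp
#include <mutex>
#include <string>
#include <vector>

// Minimal stand-in for arrow::Status.
struct Status {
  bool ok;
  std::string message;
};

// Hypothetical validator: only checks that parentheses are balanced,
// standing in for compiling the pattern with RE2 and checking re.ok().
static Status ValidatePattern(const std::string& pattern) {
  int depth = 0;
  for (char c : pattern) {
    if (c == '(') ++depth;
    if (c == ')') --depth;
    if (depth < 0) return {false, "unbalanced ')': " + pattern};
  }
  if (depth != 0) return {false, "unbalanced '(': " + pattern};
  return {true, ""};
}

// The std::once_flag pattern: validation runs exactly once, but the
// resulting Status is cached in a function-local static, so an error
// is reported on every call, not only the first one.
const Status& EnsureLikePatternsValid() {
  static std::once_flag checked;
  static Status result{true, ""};
  std::call_once(checked, [] {
    const std::vector<std::string> patterns = {
        R"(([^%_]*[^\\%_])?%+)",  // starts-with pattern from the diff
        R"(%+([^%_]*)",           // deliberately broken for the demo
    };
    for (const auto& p : patterns) {
      Status s = ValidatePattern(p);
      if (!s.ok) {
        result = s;
        return;
      }
    }
  });
  return result;
}
```

Because the `Status` outlives the `call_once` body, a second caller sees the same failure instead of a silently skipped check, which addresses the second review comment.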
[GitHub] [arrow] cyb70289 commented on pull request #33739: GH-33655: [C++][Parquet] Fix occasional failure in TestArrowReadWrite.MultithreadedWrite
cyb70289 commented on PR #33739: URL: https://github.com/apache/arrow/pull/33739#issuecomment-1399156005 Thanks @wgtmac for fixing the issue quickly!
[GitHub] [arrow] cyb70289 merged pull request #33739: GH-33655: [C++][Parquet] Fix occasional failure in TestArrowReadWrite.MultithreadedWrite
cyb70289 merged PR #33739: URL: https://github.com/apache/arrow/pull/33739
[GitHub] [arrow] kou commented on pull request #33781: GH-33723: [C++] re2::RE2::RE2() result must be checked
kou commented on PR #33781: URL: https://github.com/apache/arrow/pull/33781#issuecomment-1399154574 Thanks! But #33806 is a better approach. I'll close this in favor of #33806.
[GitHub] [arrow] kou closed pull request #33781: GH-33723: [C++] re2::RE2::RE2() result must be checked
kou closed pull request #33781: GH-33723: [C++] re2::RE2::RE2() result must be checked URL: https://github.com/apache/arrow/pull/33781
[GitHub] [arrow] paleolimbot commented on issue #33702: [R] Package Arrow 11.0.0 for R/CRAN
paleolimbot commented on issue #33702: URL: https://github.com/apache/arrow/issues/33702#issuecomment-1399154374 Is there an Arrow issue for clang16 yet? I didn't get to it today but I'm sure I can get a docker image together on Monday with clang16 and R.
[GitHub] [arrow] zzzzwj commented on a diff in pull request #33703: GH-14748:[C++] Modify some comments about shared_ptr's ownership in parquet-cpp.
zzzzwj commented on code in PR #33703: URL: https://github.com/apache/arrow/pull/33703#discussion_r1083228089

## cpp/src/parquet/arrow/test_util.h: ##
@@ -463,12 +463,16 @@ Status MakeEmptyListsArray(int64_t size, std::shared_ptr* out_array) {
   return Status::OK();
 }
 
+// Please make sure that the return value should have an object assigned explicitly
+// if you want to use the pointer contained.
Review Comment:
  Oh, really sorry, it looks like I truly misunderstood your meaning. I thought that you wanted me to verify all the ownership of functions that return a shared_ptr. I'll have a careful check and then make a change on this PR.
[GitHub] [arrow] paleolimbot commented on issue #29428: [R] accept expression lists in Scanner$create() with arrow_dplyr_querys
paleolimbot commented on issue #29428: URL: https://github.com/apache/arrow/issues/29428#issuecomment-1399153279 Typically our `as_record_batch_reader()` methods just have a `schema` argument (as in, output schema...cast if you can, error if you can't). That's not *quite* what duckdb needs to pass through but I'm sure we can provide something more direct than going through the scanner.
[GitHub] [arrow] kou commented on pull request #33803: GH-33787: [C++] arrow/cpp/src/arrow/util/cpu_info.cc: the -Werror triggers an error: statement has no effect [-Werror=unused-value]
kou commented on PR #33803: URL: https://github.com/apache/arrow/pull/33803#issuecomment-1399152264 How about this?

```diff
diff --git a/cpp/src/arrow/util/cpu_info.cc b/cpp/src/arrow/util/cpu_info.cc
index 3ba8db216..08b7b8b21 100644
--- a/cpp/src/arrow/util/cpu_info.cc
+++ b/cpp/src/arrow/util/cpu_info.cc
@@ -348,6 +348,7 @@ int64_t LinuxGetCacheSize(int level) {
 // care about are present.
 // Returns a bitmap of flags.
 int64_t LinuxParseCpuFlags(const std::string& values) {
+#if defined(CPUINFO_ARCH_X86) || defined(CPUINFO_ARCH_ARM)
   const struct {
     std::string name;
     int64_t flag;
@@ -379,6 +380,9 @@ int64_t LinuxParseCpuFlags(const std::string& values) {
     }
   }
   return flags;
+#else
+  return 0;
+#endif
 }
 
 void OsRetrieveCacheSize(std::array* cache_sizes) {
```
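The idea in kou's patch is to compile the flag-mapping table only on architectures where it is meaningful, and fall back to returning 0 elsewhere, so the table never triggers an unused-value warning on other platforms. A self-contained sketch of that conditional-compilation pattern, using a made-up `DEMO_HAS_FLAGS` macro in place of Arrow's real `CPUINFO_ARCH_X86`/`CPUINFO_ARCH_ARM` macros and a tiny two-entry flag table:

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for CPUINFO_ARCH_X86 / CPUINFO_ARCH_ARM; remove
// this define to compile the fallback branch that always returns 0.
#define DEMO_HAS_FLAGS 1

// Parse a /proc/cpuinfo-style "flags" line into a bitmap. On unsupported
// architectures the table (and the loop over it) is compiled out entirely.
int64_t ParseCpuFlags(const std::string& values) {
#ifdef DEMO_HAS_FLAGS
  const struct {
    std::string name;
    int64_t flag;
  } flag_mappings[] = {
      {"sse4_2", INT64_C(1) << 0},
      {"avx2", INT64_C(1) << 1},
  };
  int64_t flags = 0;
  for (const auto& m : flag_mappings) {
    if (values.find(m.name) != std::string::npos) {
      flags |= m.flag;
    }
  }
  return flags;
#else
  return 0;
#endif
}
```

The `#else return 0;` branch keeps the function well-defined everywhere instead of leaving an expression with no effect, which is what the `-Werror=unused-value` report in the PR title was about.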
[GitHub] [arrow-adbc] kou commented on issue #366: [Discuss] Is the conventional commit format working?
kou commented on issue #366: URL: https://github.com/apache/arrow-adbc/issues/366#issuecomment-1399151048 > We can ask INFRA to change the setting to merge using the PR title/description, if people agree. I like the setting!
[GitHub] [arrow-datafusion] ursabot commented on pull request #4834: (#4462) Postgres compatibility tests using sqllogictest
ursabot commented on PR #4834: URL: https://github.com/apache/arrow-datafusion/pull/4834#issuecomment-1399149236 Benchmark runs are scheduled for baseline = b71cae8aa556369bc5ee72b063ed1fc5a81192f1 and contender = 1d69f28f14acf178377ecf55a343b6e71b4dd856. 1d69f28f14acf178377ecf55a343b6e71b4dd856 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/ebc718c2b1854f13b01b15b9ab814210...436e61feb07d482da9bb16171366b783/) [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/9c071dfa45984472843f37989cf6184f...d1c046c7b0cd4f8c9e2185b57735a6ae/) [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/84d58f2d947140169e6aa56dc6992ca0...5b5749ceb3254d02a9294123c31eaa8c/) [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/3fbb7ffd22844f168faeed49ff4bdbcc...9a91203aa4cf46d495f56c13bfe1f237/) Buildkite builds: Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
[GitHub] [arrow-datafusion] xudong963 merged pull request #4834: (#4462) Postgres compatibility tests using sqllogictest
xudong963 merged PR #4834: URL: https://github.com/apache/arrow-datafusion/pull/4834
[GitHub] [arrow-datafusion] xudong963 closed issue #4462: Replace python based integration test with sqllogictest
xudong963 closed issue #4462: Replace python based integration test with sqllogictest URL: https://github.com/apache/arrow-datafusion/issues/4462
[GitHub] [arrow-datafusion] xudong963 commented on pull request #4834: (#4462) Postgres compatibility tests using sqllogictest
xudong963 commented on PR #4834: URL: https://github.com/apache/arrow-datafusion/pull/4834#issuecomment-1399148189 merged, let's iterate!
[GitHub] [arrow] vibhatha commented on pull request #14596: ARROW-18258: [Docker] Substrait Integration Testing
vibhatha commented on PR #14596: URL: https://github.com/apache/arrow/pull/14596#issuecomment-1399110128 @github-actions crossbow submit test-conda-python-3.9-substrait
[GitHub] [arrow] westonpace commented on pull request #13487: ARROW-8991: [C++] Add hash_64 scalar compute function
westonpace commented on PR #13487: URL: https://github.com/apache/arrow/pull/13487#issuecomment-1399105512 @drin is this something you are still working on?
[GitHub] [arrow] vibhatha commented on pull request #14596: ARROW-18258: [Docker] Substrait Integration Testing
vibhatha commented on PR #14596: URL: https://github.com/apache/arrow/pull/14596#issuecomment-1399104798 @github-actions crossbow submit test-conda-python-3.9-substrait
[GitHub] [arrow] westonpace commented on issue #15098: FieldRef with clang-cl from VS 17.4.3 needs operator== with /std:c++20
westonpace commented on issue #15098: URL: https://github.com/apache/arrow/issues/15098#issuecomment-1399103925 It would seem we need a new CI environment for Visual Studio & C++20. It should be a fairly straightforward addition to https://github.com/apache/arrow/blob/master/.github/workflows/cpp.yml (we already have Windows / C++17). Would you be interested in contributing that?
[GitHub] [arrow] rok commented on pull request #33810: GH-33377: [Python] Table.drop should support passing a single column
rok commented on PR #33810: URL: https://github.com/apache/arrow/pull/33810#issuecomment-1399101391 > This is my first Arrow PR. I am very open to any feedback or suggestions you might have! Welcome @danepitkin! Thanks for opening the PR! > I followed the instructions as stated in the original GH issue, but there is now some redundancy between `Table.drop()`, `Table.drop_column()`, and `Table.remove_column()`. In my opinion, it would have been nicer to consolidate these from the start. However, since backwards compatibility is important, would you say this PR is the right approach? I would slightly prefer a single `Table.drop_column()` over having both `Table.drop_column()` and `Table.drop()`. I don't know if `.drop` -> `.drop_column` makes sense at this point. [Pandas uses `.drop` for everything](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) but it does index dropping as well. Perhaps you could just alias the two and introduce a deprecation warning? (e.g. [here](https://github.com/apache/arrow/blob/a16c54567ada6729311fd26bdbee4b5e61901410/python/pyarrow/plasma.py#L118)).
[GitHub] [arrow] westonpace commented on issue #15281: basic_string_view will be invalid in future libc++
westonpace commented on issue #15281: URL: https://github.com/apache/arrow/issues/15281#issuecomment-1399101379 Normally I think we'd want to set up a CI job to catch this sort of regression. However, clang-18 is not yet released. Do you have any suggestion on how a regression here could be caught?
[GitHub] [arrow] vibhatha commented on a diff in pull request #14596: ARROW-18258: [Docker] Substrait Integration Testing
vibhatha commented on code in PR #14596: URL: https://github.com/apache/arrow/pull/14596#discussion_r1083164628 ## ci/docker/conda-python-substrait.dockerfile: ## @@ -0,0 +1,45 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +ARG repo +ARG arch +ARG python=3.9 + +FROM ${repo}:${arch}-conda-python-${python} + +COPY ci/conda_env_python.txt \ + ci/conda_env_sphinx.txt \ + /arrow/ci/ +RUN mamba install -q -y \ +--file arrow/ci/conda_env_python.txt \ +--file arrow/ci/conda_env_sphinx.txt \ +$([ "$python" == "3.9" ] && echo "pickle5") \ +python=${python} \ Review Comment: you're probably correct -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace commented on issue #15103: Weighted stat aggregations in arrow-compute
westonpace commented on issue #15103: URL: https://github.com/apache/arrow/issues/15103#issuecomment-1399100195 There is a proposal to support aggregate UDFs but I don't know that it is high priority for anyone. I agree this sounds like a nice feature. Would you be interested in creating a PR?
[GitHub] [arrow] vibhatha commented on a diff in pull request #14596: ARROW-18258: [Docker] Substrait Integration Testing
vibhatha commented on code in PR #14596: URL: https://github.com/apache/arrow/pull/14596#discussion_r1083164377

## ci/scripts/integration_substrait.sh: ##
@@ -0,0 +1,30 @@
+#!/usr/bin/env bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+set -e
+
+# check that optional pyarrow modules are available
+# because pytest would just skip the substrait tests
+echo "Substrait Integration Tests";
+python -c "from substrait_consumer.consumers import AceroConsumer"
+python -c "import pyarrow.orc"
Review Comment:
  You're correct, I haven't cleaned this up properly. Very early draft.
[GitHub] [arrow] westonpace commented on issue #33627: [C++][HDFS] Can't get performance improve when increase the thread number of IO thread pool
westonpace commented on issue #33627: URL: https://github.com/apache/arrow/issues/33627#issuecomment-1399093386 No, I don't think you are wrong. You should be seeing more parallelism. I am not sure when I will get a chance to fully investigate this, however. I don't see anything in here that would lead threads to share HDFS connections, but I don't know anything about how HDFS works.
[GitHub] [arrow] kou commented on a diff in pull request #33805: GH-33804: [Python] Add support for manylinux_2_28 wheel
kou commented on code in PR #33805: URL: https://github.com/apache/arrow/pull/33805#discussion_r1083143197

## docker-compose.yml: ##
@@ -967,6 +970,31 @@ services:
       - ${DOCKER_VOLUME_PREFIX}python-wheel-manylinux2014-ccache:/ccache:delegated
     command: /arrow/ci/scripts/python_wheel_manylinux_build.sh
 
+  # See available versions at:
+  #    https://quay.io/repository/pypa/manylinux_2_28_x86_64?tab=tags
+  python-wheel-manylinux-2-28:
+    image: ${REPO}:${ARCH}-python-${PYTHON}-wheel-manylinux-2-28-vcpkg-${VCPKG}
+    build:
+      args:
+        arch: ${ARCH}
+        arch_short: ${ARCH_SHORT}
+        base: quay.io/pypa/manylinux_2_28_${ARCH_ALIAS}:2023-01-14-103cb93
+        vcpkg: ${VCPKG}
+        python: ${PYTHON}
+        manylinux: 2_28
+      context: .
+      dockerfile: ci/docker/python-wheel-manylinux-x-y.dockerfile
+      cache_from:
+        - ${REPO}:${ARCH}-python-${PYTHON}-wheel-manylinux-2-28-vcpkg-${VCPKG}
+    environment:
+      <<: *ccache
+    volumes:
+      - .:/arrow:delegated
+      - ${DOCKER_VOLUME_PREFIX}python-wheel-manylinux228-ccache:/ccache:delegated
+    command: /arrow/ci/scripts/python_wheel_manylinux_build.sh
+
+
+
Review Comment:
  ```suggestion
  ```

## docker-compose.yml: ##
@@ -180,6 +181,8 @@ volumes:
     name: maven-cache
   python-wheel-manylinux2014-ccache:
     name: python-wheel-manylinux2014-ccache
+  python-wheel-manylinux228-ccache:
+    name: python-wheel-manylinux228-ccache
Review Comment:
  ```suggestion
    python-wheel-manylinux-2-28-ccache:
      name: python-wheel-manylinux-2-28-ccache
  ```

## ci/docker/python-wheel-manylinux-x-y.dockerfile: ##
@@ -0,0 +1,95 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+ARG base
+FROM ${base}
+
+ARG arch
+ARG arch_short
+ARG manylinux
+
+ENV MANYLINUX_VERSION=${manylinux}
+
+# Install basic dependencies
+RUN yum install -y git flex curl autoconf zip perl-IPC-Cmd wget
+
+ENV HOST_PYTHON_VERSION=3.8
+RUN HOST_PYTHON_ROOT=$(find /opt/python -name cp${HOST_PYTHON_VERSION/./}-*) && \
+    echo "export PATH=$HOST_PYTHON_ROOT/bin:\$PATH" >> /python.sh
+
+# Install CMake
+# AWS SDK doesn't work with CMake=3.22 due to https://gitlab.kitware.com/cmake/cmake/-/issues/22524
+ARG cmake=3.21.4
+COPY ci/scripts/install_cmake.sh arrow/ci/scripts/
+RUN source /python.sh && /arrow/ci/scripts/install_cmake.sh ${arch} linux ${cmake} /usr/local
+
+# Install Ninja
+ARG ninja=1.10.2
+COPY ci/scripts/install_ninja.sh arrow/ci/scripts/
+RUN source /python.sh && /arrow/ci/scripts/install_ninja.sh ${ninja} /usr/local
+
+# Install ccache
+ARG ccache=4.1
+COPY ci/scripts/install_ccache.sh arrow/ci/scripts/
+RUN source /python.sh && /arrow/ci/scripts/install_ccache.sh ${ccache} /usr/local
+
+RUN echo "Hello World"
Review Comment:
  ```suggestion
  ```

## ci/docker/python-wheel-manylinux-x-y.dockerfile: ##
@@ -0,0 +1,95 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+ARG base
+FROM ${base}
+
+ARG arch
+ARG arch_short
+ARG manylinux
+
+ENV MANYLINUX_VERSION=${manylinux}
+
+# Install basic dependencies
+RUN yum install -y git flex curl autoconf zip perl-IPC-Cmd wget
+
+ENV HOST_PYTHON_VERSION=3.8
+RUN HOST_PYTHON_ROOT=$(find /opt/python -name cp${HOST_PYTHON_VERSION/./}-*) && \
+    echo "export PATH=$HOST_PYTHON_ROOT/bin:\$PATH" >> /python.sh
Review Comment:
  Does the following work?
  ```suggestion
  ENV HOST_PYTHON_VERSION=3.8
  ENV PATH=
  ```
[GitHub] [arrow] westonpace commented on issue #33668: Reading flat dataset with `partitioning="hive"` results in partition schema equal to dataset schema
westonpace commented on issue #33668: URL: https://github.com/apache/arrow/issues/33668#issuecomment-1399087997 I can confirm. I reproduced this with the latest and agree it is a bug. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson commented on pull request #33770: GH-33760: [R][C++] Handle nested field refs in scanner
nealrichardson commented on PR #33770: URL: https://github.com/apache/arrow/pull/33770#issuecomment-1399086990 > > I thought this was just going to be deleting code from the R package: instead of finding the top-level field names in the projection and sending them in the ScanNode, I'd send the projection and the scanner would pull out the fields. > > > > Got it, so this is not a behavior change, but moving the logic from R into C++ where it would appropriately belong? Right; nested field refs are currently not handled by the C++ scanner. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-rs] albel727 commented on issue #3579: `nullif` incorrectly calculates `null_count`, sometimes panics with subtraction overflow error
albel727 commented on issue #3579: URL: https://github.com/apache/arrow-rs/issues/3579#issuecomment-1399086004 Just in case, here's the code that I used to validate that the quick fix works:

```rust
std::panic::set_hook(Box::new(|_info| { /* silence panics */ }));
for n in 0..=128 + 64 {
    let left = Int32Array::from((0..n as i32).into_iter().collect::<Vec<_>>());
    for somes in 0..=n {
        for trues in 0..=somes {
            let right = BooleanArray::from(
                (0..n)
                    .into_iter()
                    .map(|i| Some(i < trues).filter(|_| i < somes))
                    .collect::<Vec<_>>(),
            );
            let ok = std::panic::catch_unwind(|| {
                let arr = arrow_select::nullif::nullif(&left, &right).unwrap();
                let arr: &Int32Array = arrow_array::cast::as_primitive_array(&arr);
                arrow_array::Array::data(arr).validate_full().unwrap();
            })
            .is_ok();
            if !ok {
                println!("{n} {somes} {trues}");
            }
        }
    }
}
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace commented on pull request #14799: ARROW-18417: [C++] Support emit info in Substrait extension-multi and AsOfJoin
westonpace commented on PR #14799: URL: https://github.com/apache/arrow/pull/14799#issuecomment-1399085737 @rtpsw this will need a rebase. Some of the changes here were brought in with the backpressure change. Would you like me to do this or do you want to? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-rs] albel727 opened a new issue, #3579: `nullif` incorrectly calculates `null_count`, sometimes panics with subtraction overflow error
albel727 opened a new issue, #3579: URL: https://github.com/apache/arrow-rs/issues/3579

**Describe the bug** `nullif(left, right)` incorrectly calculates `null_count` for the result array whenever `left` doesn't have a null_buffer and has `len % 64 == 0`. It can even panic if there are fewer than 64 true values in `right`.

**To Reproduce**
```rust
let n = 64;
let left = Int32Array::from((0..n as i32).into_iter().collect::<Vec<_>>());
let right = BooleanArray::from((0..n).into_iter().map(|_| None).collect::<Vec<Option<bool>>>());
arrow_select::nullif::nullif(&left, &right);
```

**Expected behavior** It works.

**Actual behavior** It panics with 'attempt to subtract with overflow' at this line: https://github.com/apache/arrow-rs/blob/796b670338ce33806a39777ea18cf6fae8fa7ee4/arrow-select/src/nullif.rs#L107

**Additional context** The problem evaded fixes #3034 and #3263, which were incomplete. The wrong calculation occurs in these lines: https://github.com/apache/arrow-rs/blob/796b670338ce33806a39777ea18cf6fae8fa7ee4/arrow-select/src/nullif.rs#L81-L90 The reason it is wrong is the unintuitive, if not to say wrong, behavior of `bitwise_unary_op_helper()` and friends. They call `op()` **unconditionally** on the remainder bits, even if there are none: https://github.com/apache/arrow-rs/blob/3a90654f4cb98ac7fe278991a6cbc11384664e2e/arrow-buffer/src/buffer/ops.rs#L119-L123 It doesn't make a difference for the produced output buffer, because it is then just extended with a slice of 0 bytes, i.e. it remains unaffected. But it does matter for `FnMut` closures with side effects, such as the one found in `nullif`, which, as a result of this behavior, adds an excess of 64 to `valid_count` when there are no remainder bits, i.e. when the bit length is a multiple of 64. The quick fix is to do `valid_count -= 64 - remainder_len;` in `nullif()` unconditionally, i.e. just remove the `if remainder_len != 0 {` condition around it.
It was clearly written on the assumption that `bitwise_unary_op_helper()` doesn't call `op()` in excess when there are no remainder bits. A better fix could have been to avoid making excess `op()` calls in `bitwise_unary_op_helper()` and friends, but since they're public, that could lead to bugs in external consumers, which might have come to rely on this weird behavior. In any case, I suggest at least checking whether other usages of `bitwise_unary_op_helper()` and friends in arrow-rs make the same incorrect assumption. Temporarily changing the type of the `op` parameter from `FnMut` to `Fn` and looking at the compilation-error fallout should point out all suspicious places. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
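The miscount albel727 describes can be modeled in a few lines of Python — a toy model of the 64-bit chunk walk, not arrow-rs code, included only to make the arithmetic concrete:

```python
# Toy model of the chunked bitmap walk described in the issue: the helper
# processes len // 64 full 64-bit chunks and then ALWAYS makes one more op()
# call for the remainder, even when remainder_len == 0. A side-effecting
# op() therefore runs once too often whenever len is a multiple of 64.

def op_calls(bit_len: int) -> int:
    full_chunks = bit_len // 64
    return full_chunks + 1  # the remainder op() is invoked unconditionally

def valid_count_correction(bit_len: int) -> int:
    # Quick fix from the issue: subtract the phantom bits unconditionally,
    # i.e. without the `if remainder_len != 0` guard around the subtraction.
    remainder_len = bit_len % 64
    return 64 - remainder_len

# With bit_len == 64 the old guarded code skipped the correction entirely,
# leaving valid_count 64 too high, which later made the null_count
# subtraction overflow.
assert op_calls(64) == 2               # one full chunk + one empty remainder call
assert valid_count_correction(64) == 64
assert valid_count_correction(70) == 58
```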
[GitHub] [arrow] vibhatha commented on a diff in pull request #14596: ARROW-18258: [Docker] Substrait Integration Testing
vibhatha commented on code in PR #14596: URL: https://github.com/apache/arrow/pull/14596#discussion_r1083145165 ## ci/scripts/integration_substrait.sh: ## @@ -0,0 +1,30 @@ +#!/usr/bin/env bash +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +set -e + +# check that optional pyarrow modules are available +# because pytest would just skip the substrait tests +echo "Substrait Integration Tests"; +python -c "from substrait_consumer.consumers import AceroConsumer" +python -c "import pyarrow.orc" +python -c "import pyarrow.substrait" + +pytest substrait_consumer/tests/functional/extension_functions/test_boolean_functions.py --producer IsthmusProducer --consumer AceroConsumer Review Comment: 😀 no choice yet -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace commented on issue #29428: [R] accept expression lists in Scanner$create() with arrow_dplyr_querys
westonpace commented on issue #29428: URL: https://github.com/apache/arrow/issues/29428#issuecomment-1399080062 +1 for some kind of `as_record_batch_reader` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace commented on pull request #33770: GH-33760: [R][C++] Handle nested field refs in scanner
westonpace commented on PR #33770: URL: https://github.com/apache/arrow/pull/33770#issuecomment-1399077722 > I thought this was just going to be deleting code from the R package: instead of finding the top-level field names in the projection and sending them in the ScanNode, I'd send the projection and the scanner would pull out the fields. Got it, so this is not a behavior change, but moving the logic from R into C++ where it would appropriately belong? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace commented on issue #15138: [C++] Clustered By -- how?
westonpace commented on issue #15138: URL: https://github.com/apache/arrow/issues/15138#issuecomment-1399076522 Internally we do this by hashing the column. There is a PR under way (though it could use some review) to add a new hash compute function (https://github.com/apache/arrow/pull/13487). If that were to merge then you could approximate this pretty well by partitioning on bit_wise_and(hash(x), mask) where `mask` is something like `0x7` (for 8 groups) or `0xF` (for 16 groups), etc. To support non-powers-of-two groups we would need modulo. There is a PR in place for modulo but it seems to have stalled. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
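The hash-and-mask bucketing westonpace describes can be mocked up in plain Python; here the built-in `hash()` stands in for the proposed Arrow hash kernel (an assumption for illustration only — the real kernel in apache/arrow#13487 may behave differently):

```python
# Partition rows into 2^k groups via bit_wise_and(hash(x), mask), as sketched
# in the comment. mask = 0x7 -> 8 groups, mask = 0xF -> 16 groups, etc.
def bucket(value, mask: int) -> int:
    return hash(value) & mask

groups = {bucket(v, 0x7) for v in range(1_000)}
assert groups <= set(range(8))  # every row lands in one of 8 buckets

# Non-power-of-two group counts are why modulo would be needed instead:
def bucket_mod(value, n_groups: int) -> int:
    return hash(value) % n_groups

assert all(0 <= bucket_mod(v, 5) < 5 for v in range(-100, 100))
```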
[GitHub] [arrow] ursabot commented on pull request #33725: GH-33724: [Doc] Update the substrait conformance doc with the latest support
ursabot commented on PR #33725: URL: https://github.com/apache/arrow/pull/33725#issuecomment-1399074351 Benchmark runs are scheduled for baseline = f7aa50dbeccdcc800a0ffc695b107c1cdc688156 and contender = a16c54567ada6729311fd26bdbee4b5e61901410. a16c54567ada6729311fd26bdbee4b5e61901410 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Failed] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/d176eae6bed441eb97ee4baa57dc3fdf...a2c888cbc9e54acbb087f82996ef2976/) [Failed] [test-mac-arm](https://conbench.ursa.dev/compare/runs/4f3232dcf7d04ea082736425413a1641...9c950d82a1a141cb9555bb25a732624c/) [Failed] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/b8cf43c42a19473b8089647069d4f1c5...9ac607cae658446e885a40089cb71e52/) [Failed] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b56d62491b8b490287ee545f66a3526e...6c0e57252df347318d11432fdaa0cdb9/) Buildkite builds: [Failed] [`a16c5456` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2239) [Failed] [`a16c5456` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2264) [Failed] [`a16c5456` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2234) [Failed] [`a16c5456` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2257) [Failed] [`f7aa50db` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2238) [Failed] [`f7aa50db` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2263) [Failed] [`f7aa50db` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2233) [Failed] [`f7aa50db` 
ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2256) Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace commented on a diff in pull request #33775: ARROW-18425: [Substrait] Add Substrait→Acero mapping for round operationMajor:
westonpace commented on code in PR #33775: URL: https://github.com/apache/arrow/pull/33775#discussion_r1083118332 ## cpp/src/arrow/compute/api_scalar.h: ## @@ -882,6 +891,20 @@ ARROW_EXPORT Result Round(const Datum& arg, RoundOptions options = RoundOptions::Defaults(), ExecContext* ctx = NULLPTR); +/// \brief Round a value to a given precision. +/// +/// If argument is null the result will be null. Review Comment: Which argument? I assume the output is null if either argument is null? Can we be more explicit. ## cpp/src/arrow/compute/kernels/scalar_round_arithmetic_test.cc: ## @@ -113,7 +112,7 @@ class TestBaseUnaryRoundArithmetic : public ::testing::Test { // (Array, Array) void AssertUnaryOp(UnaryFunction func, const std::shared_ptr& arg, const std::shared_ptr& expected) { -ASSERT_OK_AND_ASSIGN(auto actual, func(arg, options_, nullptr)); +ASSERT_OK_AND_ASSIGN(auto actual, func(arg, options_, nullptr)) Review Comment: While this does compile and work we try and add a `;` to the end of these kinds of macros anyways for readability ```suggestion ASSERT_OK_AND_ASSIGN(auto actual, func(arg, options_, nullptr)); ``` ## cpp/src/arrow/engine/substrait/extension_set.cc: ## @@ -790,6 +796,30 @@ ExtensionIdRegistry::SubstraitCallToArrow DecodeOptionlessUncheckedArithmetic( }; } +ExtensionIdRegistry::SubstraitCallToArrow DecodeBinaryRoundingMode( +const std::string& function_name) { + return [function_name](const SubstraitCall& call) -> Result { +ARROW_ASSIGN_OR_RAISE( +compute::RoundMode round_mode, +ParseOptionOrElse( +call, "rounding", kRoundModeParser, +{compute::RoundMode::DOWN, compute::RoundMode::UP, + compute::RoundMode::TOWARDS_ZERO, compute::RoundMode::TOWARDS_INFINITY, + compute::RoundMode::HALF_DOWN, compute::RoundMode::HALF_UP, + compute::RoundMode::HALF_TOWARDS_ZERO, + compute::RoundMode::HALF_TOWARDS_INFINITY, compute::RoundMode::HALF_TO_EVEN, + compute::RoundMode::HALF_TO_ODD}, +compute::RoundMode::HALF_TOWARDS_INFINITY)); +ARROW_ASSIGN_OR_RAISE(std::vector 
value_args, + GetValueArgs(call, 0)); +std::shared_ptr options = Review Comment: Do we want to optimize and call the unary round if the second value is a scalar? If not in this PR can we create a follow-up github issue so we don't lose track of it? Or maybe round_binary itself can fallback to unary rounding if the second argument is scalar. ## cpp/src/arrow/compute/api_scalar.h: ## @@ -882,6 +891,20 @@ ARROW_EXPORT Result Round(const Datum& arg, RoundOptions options = RoundOptions::Defaults(), ExecContext* ctx = NULLPTR); +/// \brief Round a value to a given precision. +/// +/// If argument is null the result will be null. +/// +/// \param[in] arg1 the value rounded Review Comment: ```suggestion /// \param[in] arg1 the value to be rounded ``` ## cpp/src/arrow/compute/api_scalar.h: ## @@ -882,6 +891,20 @@ ARROW_EXPORT Result Round(const Datum& arg, RoundOptions options = RoundOptions::Defaults(), ExecContext* ctx = NULLPTR); +/// \brief Round a value to a given precision. +/// +/// If argument is null the result will be null. +/// +/// \param[in] arg1 the value rounded +/// \param[in] arg2 the number of significant digits to round to +/// \param[in] options rounding options (rounding mode and number of digits), optional Review Comment: ```suggestion /// \param[in] options rounding options (rounding mode), optional ``` Or just get rid of the parentheses section entirely. ## cpp/src/arrow/compute/api_scalar.h: ## @@ -882,6 +891,20 @@ ARROW_EXPORT Result Round(const Datum& arg, RoundOptions options = RoundOptions::Defaults(), ExecContext* ctx = NULLPTR); +/// \brief Round a value to a given precision. +/// +/// If argument is null the result will be null. +/// +/// \param[in] arg1 the value rounded +/// \param[in] arg2 the number of significant digits to round to Review Comment: Can this be negative? Do we define elsewhere what that entails? 
## cpp/src/arrow/compute/kernels/scalar_round_arithmetic_test.cc: ## @@ -18,7 +18,6 @@ #include #include #include -#include Review Comment: Our guideline for includes is [`iwyu`](https://include-what-you-use.org/). We don't always follow it perfectly (the conformance tool doesn't like type_fwd files) but it is what we aim for. Please don't remove includes if they are used in the file (I still see many instances of `std::string`) even if the file compiles otherwise (transitive includes are potentially unstable). ## cpp/src/arrow/compute/kernels/scalar_round.cc: ## @@ -751,60 +877,25 @@ Array
[GitHub] [arrow] zeroshade commented on pull request #14989: ARROW-18438: [Go][Parquet] Panic in bitmap writer
zeroshade commented on PR #14989: URL: https://github.com/apache/arrow/pull/14989#issuecomment-1399068130 @minyoung I found the issue and the solution that's not a hack: in parquet/file/column_writer_types.gen.go.tmpl lines 143 - 147 change this: ```go if w.bitsBuffer != nil { w.writeValuesSpaced(vals, info.batchNum, w.bitsBuffer.Bytes(), 0) } else { w.writeValuesSpaced(vals, info.batchNum, validBits, validBitsOffset+valueOffset) } ``` To this instead: ```go if w.bitsBuffer != nil { w.writeValuesSpaced(vals, batch, w.bitsBuffer.Bytes(), 0) } else { w.writeValuesSpaced(vals, batch, validBits, validBitsOffset+valueOffset) } ``` Note the change from `info.batchNum` to `batch`. It turns out that in this scenario we should be passing the full batch size to write and not just the number of raw values it found. This way it properly calculates the number of nulls in the parent (non-leaf) columns. After you make this change, re-run `go generate` so that it re-generates all of the writers. Finally, on line 289 of `pqarrow/column_readers.go`, change the argument to BuildArray from `validityIO.Read` to `lenBound`. That way it creates the correctly sized null bitmap buffer. in my testing that was enough to solve the problem, leave your new tests in for good measure so this bug doesn't creep back :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] danepitkin commented on issue #33377: [Python] Table.drop should support passing a single column
danepitkin commented on issue #33377: URL: https://github.com/apache/arrow/issues/33377#issuecomment-1399062211 Whoops, looks like I did not correctly link my PR to the original issue. My PR is here https://github.com/apache/arrow/pull/33810. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] danepitkin commented on pull request #33810: GH-33377: [Python] Table.drop should support passing a single column
danepitkin commented on PR #33810: URL: https://github.com/apache/arrow/pull/33810#issuecomment-1399047969 This is my first Arrow PR. I am very open to any feedback or suggestions you might have! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace merged pull request #33725: GH-33724: [Doc] Update the substrait conformance doc with the latest support
westonpace merged PR #33725: URL: https://github.com/apache/arrow/pull/33725 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace commented on issue #33759: [Python][C++] How to limit the memory consumption of to_batches()
westonpace commented on issue #33759: URL: https://github.com/apache/arrow/issues/33759#issuecomment-1399046967 Hmm, backpressure should be applied then. Once you call `to_batches` it should start to read in the background. Eventually, at a certain point, it should stop reading because too much data has accumulated. This is normally around a few GB. You mention there are 13k fragments, just to confirm this is 13k files right? How large is each file? How many row groups are in each file? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] danepitkin opened a new pull request, #33810: GH-33377: [Python] Table.drop should support passing a single column
danepitkin opened a new pull request, #33810: URL: https://github.com/apache/arrow/pull/33810 ### Rationale for this change Provide a better user experience in pyarrow when working with `Table`. ### What changes are included in this PR? Allow `Table.drop()` to accept a single column name as a `str` argument. Provide a wrapper `Table.drop_column(str)`, which calls `Table.drop()`, to match similar APIs such as `add_column()`, `append_column()`. ### Are these changes tested? Updated the pytest for `Table.drop()` and added a pytest for `Table.drop_column()`. Verified both test cases ran successfully locally. ### Are there any user-facing changes? Yes, a new pyarrow API is added. The existing pyarrow API is backwards compatible, but does support additional types of input now (str). The doc strings are updated so I assume the python API reference will be auto-updated somehow? Let me know if this is not the case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace commented on issue #33790: [Python] Support for reading .csv files from a zip archive
westonpace commented on issue #33790: URL: https://github.com/apache/arrow/issues/33790#issuecomment-1399044202 Outside of datasets this is normally achieved by opening a compressed input stream and using the CSV stream reader. If the path ends in `.gz` or `.bz2` I think we also guess that it is compressed and do this for you. Within datasets there are a few un/under documented features which may help. There is a similar "extension guessing" mechanism: https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/file_base.cc#L93 So if your files end in `gz` or `gzip` it should automatically be picked up. There is also `stream_transform_func` as part of the dataset-csv options which takes an arbitrary callable that transforms the stream before you start reading it. In theory this could maybe be used to provide support for zipped files. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace commented on issue #33797: [C++] Add decimal version of Round benchmarks
westonpace commented on issue #33797: URL: https://github.com/apache/arrow/issues/33797#issuecomment-1399039367 @aayushpandey014 I have assigned this to you. In the future you can always comment `take` and our bots will assign an issue to you. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wjones127 commented on a diff in pull request #14353: ARROW-17735: [C++][Parquet] Optimize parquet reading for String/Binary type
wjones127 commented on code in PR #14353: URL: https://github.com/apache/arrow/pull/14353#discussion_r1083109332 ## cpp/src/parquet/encoding.h: ## @@ -317,6 +317,13 @@ class TypedDecoder : virtual public Decoder { int64_t valid_bits_offset, typename EncodingTraits::Accumulator* out) = 0; + virtual int DecodeArrow_opt(int num_values, int null_count, const uint8_t* valid_bits, + int32_t* offset, + std::shared_ptr<::arrow::ResizableBuffer>& values, + int64_t valid_bits_offset, int32_t* bianry_length) { +return 0; Review Comment: It seems like we are adding this because the other methods are based on builders (in the accumulator), and builders don't provide a way to transfer multiple values in one `memcpy`. Does that sound right? I wonder if we could add such a method on builders, and that might be a cleaner solution. Something like: ```cpp class BaseBinaryBuilder { ... Status UnsafeAppendValues(const uint8_t* values, int64_t length, const uint8_t* valid_bytes = NULL_PTR) { ... } }; ``` The reason getting back to builders might be good is that I don't think these changes handle the case where there are more values than can fit into a StringArray. The current implementation will split it up into chunks if it reaches capacity, and I think we need to keep that behavior. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ursabot commented on pull request #33809: MINOR: [R] Update BugReports field in DESCRIPTION
ursabot commented on PR #33809: URL: https://github.com/apache/arrow/pull/33809#issuecomment-1399037705 Benchmark runs are scheduled for baseline = f9ce32ebab5071b8fc48a135a730c22313aaf9b3 and contender = f7aa50dbeccdcc800a0ffc695b107c1cdc688156. f7aa50dbeccdcc800a0ffc695b107c1cdc688156 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Failed] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/4688a72856d74b63b4494800e30e4bfe...d176eae6bed441eb97ee4baa57dc3fdf/) [Failed] [test-mac-arm](https://conbench.ursa.dev/compare/runs/387f736caaf5494d9cad1f4081ca78d4...4f3232dcf7d04ea082736425413a1641/) [Failed] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/aff01843b1ca4ea79a6a5cbdeddc1644...b8cf43c42a19473b8089647069d4f1c5/) [Failed] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/3cad7f913b124ec1af1cd10ff0385c78...b56d62491b8b490287ee545f66a3526e/) Buildkite builds: [Failed] [`f7aa50db` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2238) [Failed] [`f7aa50db` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2263) [Failed] [`f7aa50db` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2233) [Failed] [`f7aa50db` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2256) [Failed] [`f9ce32eb` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2237) [Failed] [`f9ce32eb` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2262) [Failed] [`f9ce32eb` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2232) [Failed] [`f9ce32eb` 
ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2255) Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion-python] martin-g commented on pull request #129: test: Expand tests for built-in functions
martin-g commented on PR #129: URL: https://github.com/apache/arrow-datafusion-python/pull/129#issuecomment-1399036701 I always prefer using `git rebase` for Pull Request branches. This way the commit history is cleaner. Rebasing also makes it easier to use the `Squash and merge` GitHub UI functionality. The workflows should be triggered manually for first-time contributors. Once your first PR is merged, all following PRs will execute the checks automatically! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] richtia commented on a diff in pull request #14596: ARROW-18258: [Docker] Substrait Integration Testing
richtia commented on code in PR #14596: URL: https://github.com/apache/arrow/pull/14596#discussion_r1083107221 ## ci/docker/conda-python-substrait.dockerfile: ## @@ -0,0 +1,45 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +ARG repo +ARG arch +ARG python=3.9 + +FROM ${repo}:${arch}-conda-python-${python} + +COPY ci/conda_env_python.txt \ + ci/conda_env_sphinx.txt \ + /arrow/ci/ +RUN mamba install -q -y \ +--file arrow/ci/conda_env_python.txt \ +--file arrow/ci/conda_env_sphinx.txt \ +$([ "$python" == "3.9" ] && echo "pickle5") \ +python=${python} \ Review Comment: where do we install openjdk? should it be here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #4834: (#4462) Postgres compatibility tests using sqllogictest
alamb commented on PR #4834: URL: https://github.com/apache/arrow-datafusion/pull/4834#issuecomment-1399016790 Assuming this PR passes CI checks, I plan to merge.
[GitHub] [arrow-datafusion] alamb commented on pull request #4834: (#4462) Postgres compatibility tests using sqllogictest
alamb commented on PR #4834: URL: https://github.com/apache/arrow-datafusion/pull/4834#issuecomment-1399015229 > For now, I merge master into this branch, so that it is mergeable. Thanks @melgenek. I filed issues for the follow-on tasks:
- [ ] https://github.com/apache/arrow-datafusion/issues/5009
- [ ] https://github.com/apache/arrow-datafusion/issues/5010
- [ ] https://github.com/apache/arrow-datafusion/issues/5011
[GitHub] [arrow-datafusion] alamb opened a new issue, #5011: [sqllogictest] Remove `integration-tests` directory
alamb opened a new issue, #5011: URL: https://github.com/apache/arrow-datafusion/issues/5011 **Is your feature request related to a problem or challenge? Please describe what you are trying to do.** https://github.com/apache/arrow-datafusion/pull/4834 adds a way to run sqllogictests using postgres 🦾 (kudos to @melgenek ) This now duplicates the coverage in https://github.com/apache/arrow-datafusion/tree/master/integration-tests as described in https://github.com/apache/arrow-datafusion/issues/4462 **Describe the solution you'd like** Remove https://github.com/apache/arrow-datafusion/tree/master/integration-tests and the CI checks that use it
[GitHub] [arrow-datafusion] alamb opened a new issue, #5010: [sqllogictest] Consolidate normalization code for the postgres and non-postgres paths
alamb opened a new issue, #5010: URL: https://github.com/apache/arrow-datafusion/issues/5010 **Is your feature request related to a problem or challenge? Please describe what you are trying to do.** https://github.com/apache/arrow-datafusion/pull/4834 adds a way to run sqllogictests using postgres 🦾 (kudos to @melgenek ) However, it uses different normalization code for the postgres and non-postgres paths **Describe the solution you'd like** I would like a single set of comparison code that works for both the postgres and non-postgres paths
[GitHub] [arrow] ursabot commented on pull request #33795: GH-33794: [Go] Add SetRecordReader to PreparedStatement
ursabot commented on PR #33795: URL: https://github.com/apache/arrow/pull/33795#issuecomment-1399012764 Benchmark runs are scheduled for baseline = 0e4a2e19e36d70a3072ce5275129d15fdb187c64 and contender = f9ce32ebab5071b8fc48a135a730c22313aaf9b3. f9ce32ebab5071b8fc48a135a730c22313aaf9b3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
[Failed] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/deff7301a8094167a5f3458685d0c0e7...4688a72856d74b63b4494800e30e4bfe/)
[Failed] [test-mac-arm](https://conbench.ursa.dev/compare/runs/4ff6080bb6794a91b251f13e894603e4...387f736caaf5494d9cad1f4081ca78d4/)
[Failed] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/48362c3cb97c43f2afb7c20547cf4ad0...aff01843b1ca4ea79a6a5cbdeddc1644/)
[Failed] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/747a35465dfa415fa603317ae0ca67e0...3cad7f913b124ec1af1cd10ff0385c78/)

Buildkite builds:
[Failed] [`f9ce32eb` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2237)
[Failed] [`f9ce32eb` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2262)
[Failed] [`f9ce32eb` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2232)
[Failed] [`f9ce32eb` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2255)
[Failed] [`0e4a2e19` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2236)
[Failed] [`0e4a2e19` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2261)
[Failed] [`0e4a2e19` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2231)
[Failed] [`0e4a2e19` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2254)

Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
[GitHub] [arrow-datafusion] alamb opened a new issue, #5009: [sqllogictest] Don't orchestrate the postgres containers with rust / docker
alamb opened a new issue, #5009: URL: https://github.com/apache/arrow-datafusion/issues/5009 **Is your feature request related to a problem or challenge? Please describe what you are trying to do.** https://github.com/apache/arrow-datafusion/pull/4834 adds a way to run sqllogictests using postgres 🦾 (kudos to @melgenek ) However, it currently orchestrates the postgres containers with rust test code, which means docker is required to run the (full) datafusion test suite locally **Describe the solution you'd like** 1. Remove `testcontainers` dependency 2. Allow sqllogictest `PG_COMPAT` mode to run against an existing postgres database (supply the connection information)
[GitHub] [arrow] emkornfield commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
emkornfield commented on code in PR #14964: URL: https://github.com/apache/arrow/pull/14964#discussion_r1083090721 ## cpp/src/parquet/page_index.h: ## @@ -126,4 +132,94 @@ class PARQUET_EXPORT OffsetIndex { virtual const std::vector& page_locations() const = 0; }; +/// \brief Interface for reading the page index for a Parquet row group. +class PARQUET_EXPORT RowGroupPageIndexReader { + public: + virtual ~RowGroupPageIndexReader() = default; + + /// \brief Read column index of a column chunk. + /// + /// \param[in] i column ordinal of the column chunk. + /// \returns column index of the column or nullptr if it does not exist. + /// \throws ParquetException if the index is out of bound. + virtual std::shared_ptr GetColumnIndex(int32_t i) = 0; + + /// \brief Read offset index of a column chunk. + /// + /// \param[in] i column ordinal of the column chunk. + /// \returns offset index of the column or nullptr if it does not exist. + /// \throws ParquetException if the index is out of bound. + virtual std::shared_ptr GetOffsetIndex(int32_t i) = 0; +}; + +struct IndexSelection { + /// Specifies whether to read the column index. + bool column_index = false; + /// Specifies whether to read the offset index. + bool offset_index = false; +}; + +struct RowGroupIndexReadRange { + /// Base start and total size of column index of all column chunks in a row group. + /// If none of the column chunks have column index, it is set to std::nullopt. + std::optional<::arrow::io::ReadRange> column_index = std::nullopt; + /// Base start and total size of offset index of all column chunks in a row group. + /// If none of the column chunks have offset index, it is set to std::nullopt. + std::optional<::arrow::io::ReadRange> offset_index = std::nullopt; +}; + +/// \brief Interface for reading the page index for a Parquet file. +class PARQUET_EXPORT PageIndexReader { + public: + virtual ~PageIndexReader() = default; + + /// \brief Create a PageIndexReader instance. 
+ /// \returns a PageIndexReader instance. + /// WARNING: The returned PageIndexReader references to all the input parameters, so + /// it must not outlive all of the input parameters. Usually these input parameters + /// come from the same ParquetFileReader object, so it must not outlive the reader + /// that creates this PageIndexReader. + static std::shared_ptr Make( + ::arrow::io::RandomAccessFile* input, std::shared_ptr file_metadata, + const ReaderProperties& properties, + std::shared_ptr file_decryptor = NULLPTR); + + /// \brief Get the page index reader of a specific row group. + /// \param[in] i row group ordinal to get page index reader. + /// \returns RowGroupPageIndexReader of the specified row group. A nullptr may or may + /// not be returned if the page index for the row group is unavailable. It is + /// the caller's responsibility to check the return value of follow-up calls + /// to the RowGroupPageIndexReader. + /// \throws ParquetException if the index is out of bound. + virtual std::shared_ptr RowGroup(int i) = 0; + + /// \brief Advise the reader which part of page index will be read later. + /// + /// The PageIndexReader implementation can optionally prefetch and cache page index + /// that may be read later. Follow-up read should not fail even if WillNeed() is not + /// called, or the requested page index is out of range from WillNeed() call. + /// + /// \param[in] row_group_indices list of row group ordinal to read page index later. + /// \param[in] index_selection tell if any of the page index is required later. + virtual void WillNeed(const std::vector& row_group_indices, +IndexSelection index_selection) = 0; Review Comment: nit: even though this might be more efficient, it probably pays to use `const IndexSelection&` as the formal parameter.
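The nit above can be illustrated with a minimal sketch. The struct mirrors `IndexSelection` from the diff, but the function and its name are invented for illustration and are not part of the Arrow code:

```cpp
#include <cassert>

// Hypothetical mirror of the IndexSelection struct discussed in the review.
struct IndexSelection {
  bool column_index = false;
  bool offset_index = false;
};

// The reviewer's suggestion: take the struct by const reference. For a
// two-bool struct, pass-by-value may actually be cheaper (it fits in a
// register), but const& is the conventional signature for struct
// parameters and stays cheap if the struct later grows.
inline bool NeedsAnyIndex(const IndexSelection& selection) {
  return selection.column_index || selection.offset_index;
}
```

Either signature is correct here; the const-reference form mainly buys consistency with the rest of the codebase.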
[GitHub] [arrow] emkornfield commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
emkornfield commented on code in PR #14964: URL: https://github.com/apache/arrow/pull/14964#discussion_r1083090112 ## cpp/src/parquet/page_index.h: ## @@ -126,4 +132,94 @@ class PARQUET_EXPORT OffsetIndex { virtual const std::vector& page_locations() const = 0; }; +/// \brief Interface for reading the page index for a Parquet row group. +class PARQUET_EXPORT RowGroupPageIndexReader { + public: + virtual ~RowGroupPageIndexReader() = default; + + /// \brief Read column index of a column chunk. + /// + /// \param[in] i column ordinal of the column chunk. + /// \returns column index of the column or nullptr if it does not exist. + /// \throws ParquetException if the index is out of bound. + virtual std::shared_ptr GetColumnIndex(int32_t i) = 0; + + /// \brief Read offset index of a column chunk. + /// + /// \param[in] i column ordinal of the column chunk. + /// \returns offset index of the column or nullptr if it does not exist. + /// \throws ParquetException if the index is out of bound. + virtual std::shared_ptr GetOffsetIndex(int32_t i) = 0; +}; + +struct IndexSelection { + /// Specifies whether to read the column index. + bool column_index = false; + /// Specifies whether to read the offset index. + bool offset_index = false; +}; + +struct RowGroupIndexReadRange { Review Comment: Yes, thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
emkornfield commented on code in PR #14964: URL: https://github.com/apache/arrow/pull/14964#discussion_r1083089551 ## cpp/src/parquet/page_index.cc: ## @@ -184,8 +185,219 @@ class OffsetIndexImpl : public OffsetIndex { std::vector page_locations_; }; +class RowGroupPageIndexReaderImpl : public RowGroupPageIndexReader { + public: + RowGroupPageIndexReaderImpl(::arrow::io::RandomAccessFile* input, + std::shared_ptr row_group_metadata, + const ReaderProperties& properties, + int32_t row_group_ordinal, + std::shared_ptr file_decryptor) + : input_(input), +row_group_metadata_(std::move(row_group_metadata)), +properties_(properties), +file_decryptor_(std::move(file_decryptor)), +index_read_range_( + PageIndexReader::DeterminePageIndexRangesInRowGroup(*row_group_metadata_)) {} + + /// Read column index of a column chunk. + std::shared_ptr GetColumnIndex(int32_t i) override { +if (i < 0 || i >= row_group_metadata_->num_columns()) { + throw ParquetException("Invalid column {} to get column index", i); +} + +auto col_chunk = row_group_metadata_->ColumnChunk(i); + +std::unique_ptr crypto_metadata = col_chunk->crypto_metadata(); +if (crypto_metadata != nullptr && file_decryptor_ == nullptr) { + ParquetException::NYI("Cannot read encrypted column index yet"); +} + +auto column_index_location = col_chunk->GetColumnIndexLocation(); +if (!column_index_location.has_value()) { + return nullptr; +} + +if (!index_read_range_.column_index.has_value()) { + throw ParquetException("Missing column index read range"); +} + +if (column_index_buffer_ == nullptr) { + PARQUET_ASSIGN_OR_THROW(column_index_buffer_, + input_->ReadAt(index_read_range_.column_index->offset, + index_read_range_.column_index->length)); +} + +auto buffer = column_index_buffer_.get(); +int64_t buffer_offset = +column_index_location->offset - index_read_range_.column_index->offset; +uint32_t length = static_cast(column_index_location->length); +DCHECK_GE(buffer_offset, 0); +DCHECK_LE(buffer_offset + length, 
index_read_range_.column_index->length); + +auto descr = row_group_metadata_->schema()->Column(i); +std::shared_ptr column_index; +try { + column_index = + ColumnIndex::Make(*descr, buffer->data() + buffer_offset, length, properties_); +} catch (...) { + throw ParquetException("Cannot deserialize column index for column {}", i); +} +return column_index; + } + + /// Read offset index of a column chunk. + std::shared_ptr GetOffsetIndex(int32_t i) override { +if (i < 0 || i >= row_group_metadata_->num_columns()) { + throw ParquetException("Invalid column {} to get offset index", i); +} + +auto col_chunk = row_group_metadata_->ColumnChunk(i); + +std::unique_ptr crypto_metadata = col_chunk->crypto_metadata(); +if (crypto_metadata != nullptr && file_decryptor_ == nullptr) { + ParquetException::NYI("Cannot read encrypted offset index yet"); +} + +auto offset_index_location = col_chunk->GetOffsetIndexLocation(); +if (!offset_index_location.has_value()) { + return nullptr; +} + +if (!index_read_range_.offset_index.has_value()) { + throw ParquetException("Missing column index read range"); +} + +if (offset_index_buffer_ == nullptr) { + PARQUET_ASSIGN_OR_THROW(offset_index_buffer_, + input_->ReadAt(index_read_range_.offset_index->offset, + index_read_range_.offset_index->length)); +} + +auto buffer = offset_index_buffer_.get(); +int64_t buffer_offset = +offset_index_location->offset - index_read_range_.offset_index->offset; +uint32_t length = static_cast(offset_index_location->length); +DCHECK_GE(buffer_offset, 0); +DCHECK_LE(buffer_offset + length, index_read_range_.offset_index->length); + +std::shared_ptr offset_index; +try { + offset_index = + OffsetIndex::Make(buffer->data() + buffer_offset, length, properties_); +} catch (...) { + throw ParquetException("Cannot deserialize offset index for column {}", i); +} +return offset_index; + } + + private: + /// The input stream that can perform random access read. 
+ ::arrow::io::RandomAccessFile* input_; + + /// The row group metadata to get column chunk metadata. + std::shared_ptr row_group_metadata_; + + /// Reader properties used to deserialize thrift object. + const ReaderProperties& properties_; + + /// File-level decryptor. + std::shared_ptr file_decryptor_; + + /// File offsets and sizes of the page Index of all column chunks in the row group. + RowGroupIndexReadRange i
[GitHub] [arrow] emkornfield commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
emkornfield commented on code in PR #14964: URL: https://github.com/apache/arrow/pull/14964#discussion_r1083089071 ## cpp/src/parquet/page_index.cc: ## @@ -184,8 +185,219 @@ class OffsetIndexImpl : public OffsetIndex { std::vector page_locations_; }; +class RowGroupPageIndexReaderImpl : public RowGroupPageIndexReader { + public: + RowGroupPageIndexReaderImpl(::arrow::io::RandomAccessFile* input, + std::shared_ptr row_group_metadata, + const ReaderProperties& properties, + int32_t row_group_ordinal, + std::shared_ptr file_decryptor) + : input_(input), +row_group_metadata_(std::move(row_group_metadata)), +properties_(properties), +file_decryptor_(std::move(file_decryptor)), +index_read_range_( + PageIndexReader::DeterminePageIndexRangesInRowGroup(*row_group_metadata_)) {} + + /// Read column index of a column chunk. + std::shared_ptr GetColumnIndex(int32_t i) override { +if (i < 0 || i >= row_group_metadata_->num_columns()) { + throw ParquetException("Invalid column {} to get column index", i); +} + +auto col_chunk = row_group_metadata_->ColumnChunk(i); + +std::unique_ptr crypto_metadata = col_chunk->crypto_metadata(); +if (crypto_metadata != nullptr && file_decryptor_ == nullptr) { + ParquetException::NYI("Cannot read encrypted column index yet"); +} + +auto column_index_location = col_chunk->GetColumnIndexLocation(); +if (!column_index_location.has_value()) { + return nullptr; +} + +if (!index_read_range_.column_index.has_value()) { + throw ParquetException("Missing column index read range"); +} + +if (column_index_buffer_ == nullptr) { + PARQUET_ASSIGN_OR_THROW(column_index_buffer_, + input_->ReadAt(index_read_range_.column_index->offset, + index_read_range_.column_index->length)); +} + +auto buffer = column_index_buffer_.get(); +int64_t buffer_offset = +column_index_location->offset - index_read_range_.column_index->offset; +uint32_t length = static_cast(column_index_location->length); +DCHECK_GE(buffer_offset, 0); +DCHECK_LE(buffer_offset + length, 
index_read_range_.column_index->length); + +auto descr = row_group_metadata_->schema()->Column(i); +std::shared_ptr column_index; +try { + column_index = + ColumnIndex::Make(*descr, buffer->data() + buffer_offset, length, properties_); +} catch (...) { + throw ParquetException("Cannot deserialize column index for column {}", i); +} +return column_index; + } + + /// Read offset index of a column chunk. + std::shared_ptr GetOffsetIndex(int32_t i) override { +if (i < 0 || i >= row_group_metadata_->num_columns()) { + throw ParquetException("Invalid column {} to get offset index", i); +} + +auto col_chunk = row_group_metadata_->ColumnChunk(i); + +std::unique_ptr crypto_metadata = col_chunk->crypto_metadata(); +if (crypto_metadata != nullptr && file_decryptor_ == nullptr) { + ParquetException::NYI("Cannot read encrypted offset index yet"); +} + +auto offset_index_location = col_chunk->GetOffsetIndexLocation(); +if (!offset_index_location.has_value()) { + return nullptr; +} + +if (!index_read_range_.offset_index.has_value()) { + throw ParquetException("Missing column index read range"); +} + +if (offset_index_buffer_ == nullptr) { + PARQUET_ASSIGN_OR_THROW(offset_index_buffer_, + input_->ReadAt(index_read_range_.offset_index->offset, + index_read_range_.offset_index->length)); +} + +auto buffer = offset_index_buffer_.get(); +int64_t buffer_offset = +offset_index_location->offset - index_read_range_.offset_index->offset; +uint32_t length = static_cast(offset_index_location->length); +DCHECK_GE(buffer_offset, 0); +DCHECK_LE(buffer_offset + length, index_read_range_.offset_index->length); + +std::shared_ptr offset_index; +try { + offset_index = + OffsetIndex::Make(buffer->data() + buffer_offset, length, properties_); +} catch (...) { + throw ParquetException("Cannot deserialize offset index for column {}", i); Review Comment: propagate exception information from the caught one?
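The suggestion above — propagating information from the caught exception into the rethrown one — can be sketched as follows. This is a hypothetical stand-in with invented names and `std::runtime_error` in place of `ParquetException`, not the actual Parquet code:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a thrift deserialization step that can fail.
inline void DeserializeIndex(bool fail) {
  if (fail) throw std::runtime_error("truncated thrift payload");
}

// Catching std::exception& (instead of `catch (...)`) lets the rethrown
// message carry the root cause via e.what(), as the reviewer suggests.
inline std::string ReadColumnIndexOrThrow(int column, bool fail) {
  try {
    DeserializeIndex(fail);
    return "ok";
  } catch (const std::exception& e) {
    throw std::runtime_error("Cannot deserialize column index for column " +
                             std::to_string(column) + ": " + e.what());
  }
}
```

With this pattern the caller sees both the column ordinal and the underlying failure in a single message, instead of losing the cause to a bare `catch (...)`.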
[GitHub] [arrow] emkornfield commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
emkornfield commented on code in PR #14964: URL: https://github.com/apache/arrow/pull/14964#discussion_r1083088957 ## cpp/src/parquet/page_index.cc: ## @@ -184,8 +185,219 @@ class OffsetIndexImpl : public OffsetIndex { std::vector page_locations_; }; +class RowGroupPageIndexReaderImpl : public RowGroupPageIndexReader { + public: + RowGroupPageIndexReaderImpl(::arrow::io::RandomAccessFile* input, + std::shared_ptr row_group_metadata, + const ReaderProperties& properties, + int32_t row_group_ordinal, + std::shared_ptr file_decryptor) + : input_(input), +row_group_metadata_(std::move(row_group_metadata)), +properties_(properties), +file_decryptor_(std::move(file_decryptor)), +index_read_range_( + PageIndexReader::DeterminePageIndexRangesInRowGroup(*row_group_metadata_)) {} + + /// Read column index of a column chunk. + std::shared_ptr GetColumnIndex(int32_t i) override { +if (i < 0 || i >= row_group_metadata_->num_columns()) { + throw ParquetException("Invalid column {} to get column index", i); +} + +auto col_chunk = row_group_metadata_->ColumnChunk(i); + +std::unique_ptr crypto_metadata = col_chunk->crypto_metadata(); +if (crypto_metadata != nullptr && file_decryptor_ == nullptr) { + ParquetException::NYI("Cannot read encrypted column index yet"); +} + +auto column_index_location = col_chunk->GetColumnIndexLocation(); +if (!column_index_location.has_value()) { + return nullptr; +} + +if (!index_read_range_.column_index.has_value()) { + throw ParquetException("Missing column index read range"); +} + +if (column_index_buffer_ == nullptr) { + PARQUET_ASSIGN_OR_THROW(column_index_buffer_, + input_->ReadAt(index_read_range_.column_index->offset, + index_read_range_.column_index->length)); +} + +auto buffer = column_index_buffer_.get(); +int64_t buffer_offset = +column_index_location->offset - index_read_range_.column_index->offset; +uint32_t length = static_cast(column_index_location->length); +DCHECK_GE(buffer_offset, 0); +DCHECK_LE(buffer_offset + length, 
index_read_range_.column_index->length); + +auto descr = row_group_metadata_->schema()->Column(i); +std::shared_ptr column_index; +try { + column_index = + ColumnIndex::Make(*descr, buffer->data() + buffer_offset, length, properties_); +} catch (...) { + throw ParquetException("Cannot deserialize column index for column {}", i); +} +return column_index; + } + + /// Read offset index of a column chunk. + std::shared_ptr GetOffsetIndex(int32_t i) override { +if (i < 0 || i >= row_group_metadata_->num_columns()) { + throw ParquetException("Invalid column {} to get offset index", i); +} + +auto col_chunk = row_group_metadata_->ColumnChunk(i); + +std::unique_ptr crypto_metadata = col_chunk->crypto_metadata(); +if (crypto_metadata != nullptr && file_decryptor_ == nullptr) { + ParquetException::NYI("Cannot read encrypted offset index yet"); +} + +auto offset_index_location = col_chunk->GetOffsetIndexLocation(); +if (!offset_index_location.has_value()) { + return nullptr; +} + +if (!index_read_range_.offset_index.has_value()) { + throw ParquetException("Missing column index read range"); +} + +if (offset_index_buffer_ == nullptr) { + PARQUET_ASSIGN_OR_THROW(offset_index_buffer_, + input_->ReadAt(index_read_range_.offset_index->offset, + index_read_range_.offset_index->length)); +} + +auto buffer = offset_index_buffer_.get(); Review Comment: is this a pointer? Could we at least use `auto*` if we're not spelling out the whole type?
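A minimal illustration of the `auto*` suggestion, using invented names rather than the Arrow code: writing `auto*` makes the pointer-ness of the deduced type visible at the declaration site, while plain `auto` hides it.

```cpp
#include <cassert>
#include <memory>

// Hypothetical mirror of taking a raw, non-owning pointer out of a
// shared_ptr-held buffer, as in the diff under review.
inline int ReadThroughRawPointer(const std::shared_ptr<int>& buffer) {
  auto* raw = buffer.get();  // deduced as int*; `auto*` flags it as a pointer
  return *raw;
}
```

Both `auto` and `auto*` deduce the same type here; the `*` is purely documentation for the reader, which is why the reviewer frames it as a readability nit.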
[GitHub] [arrow-adbc] lidavidm merged pull request #369: fix(go/adbc/driver/flightsql): heap-allocate Go handles
lidavidm merged PR #369: URL: https://github.com/apache/arrow-adbc/pull/369
[GitHub] [arrow] emkornfield commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
emkornfield commented on code in PR #14964: URL: https://github.com/apache/arrow/pull/14964#discussion_r1083088746 ## cpp/src/parquet/page_index.cc: ## @@ -184,8 +185,219 @@ class OffsetIndexImpl : public OffsetIndex { std::vector page_locations_; }; +class RowGroupPageIndexReaderImpl : public RowGroupPageIndexReader { + public: + RowGroupPageIndexReaderImpl(::arrow::io::RandomAccessFile* input, + std::shared_ptr row_group_metadata, + const ReaderProperties& properties, + int32_t row_group_ordinal, + std::shared_ptr file_decryptor) + : input_(input), +row_group_metadata_(std::move(row_group_metadata)), +properties_(properties), +file_decryptor_(std::move(file_decryptor)), +index_read_range_( + PageIndexReader::DeterminePageIndexRangesInRowGroup(*row_group_metadata_)) {} + + /// Read column index of a column chunk. + std::shared_ptr GetColumnIndex(int32_t i) override { +if (i < 0 || i >= row_group_metadata_->num_columns()) { + throw ParquetException("Invalid column {} to get column index", i); +} + +auto col_chunk = row_group_metadata_->ColumnChunk(i); + +std::unique_ptr crypto_metadata = col_chunk->crypto_metadata(); +if (crypto_metadata != nullptr && file_decryptor_ == nullptr) { + ParquetException::NYI("Cannot read encrypted column index yet"); +} + +auto column_index_location = col_chunk->GetColumnIndexLocation(); +if (!column_index_location.has_value()) { + return nullptr; +} + +if (!index_read_range_.column_index.has_value()) { + throw ParquetException("Missing column index read range"); +} + +if (column_index_buffer_ == nullptr) { + PARQUET_ASSIGN_OR_THROW(column_index_buffer_, + input_->ReadAt(index_read_range_.column_index->offset, + index_read_range_.column_index->length)); +} + +auto buffer = column_index_buffer_.get(); +int64_t buffer_offset = +column_index_location->offset - index_read_range_.column_index->offset; +uint32_t length = static_cast(column_index_location->length); +DCHECK_GE(buffer_offset, 0); +DCHECK_LE(buffer_offset + length, 
index_read_range_.column_index->length); + +auto descr = row_group_metadata_->schema()->Column(i); +std::shared_ptr column_index; +try { + column_index = + ColumnIndex::Make(*descr, buffer->data() + buffer_offset, length, properties_); +} catch (...) { + throw ParquetException("Cannot deserialize column index for column {}", i); +} +return column_index; + } + + /// Read offset index of a column chunk. + std::shared_ptr GetOffsetIndex(int32_t i) override { +if (i < 0 || i >= row_group_metadata_->num_columns()) { + throw ParquetException("Invalid column {} to get offset index", i); +} + +auto col_chunk = row_group_metadata_->ColumnChunk(i); + +std::unique_ptr crypto_metadata = col_chunk->crypto_metadata(); +if (crypto_metadata != nullptr && file_decryptor_ == nullptr) { + ParquetException::NYI("Cannot read encrypted offset index yet"); +} + +auto offset_index_location = col_chunk->GetOffsetIndexLocation(); +if (!offset_index_location.has_value()) { + return nullptr; +} + +if (!index_read_range_.offset_index.has_value()) { + throw ParquetException("Missing column index read range"); +} + +if (offset_index_buffer_ == nullptr) { + PARQUET_ASSIGN_OR_THROW(offset_index_buffer_, + input_->ReadAt(index_read_range_.offset_index->offset, + index_read_range_.offset_index->length)); +} + +auto buffer = offset_index_buffer_.get(); +int64_t buffer_offset = +offset_index_location->offset - index_read_range_.offset_index->offset; +uint32_t length = static_cast(offset_index_location->length); Review Comment: same comment as above on casts. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
emkornfield commented on code in PR #14964:
URL: https://github.com/apache/arrow/pull/14964#discussion_r1083088486

## cpp/src/parquet/page_index.cc:

@@ -184,8 +185,219 @@ (same hunk as above; the comment anchors on the decryptor check in `GetOffsetIndex`)

```cpp
  /// Read offset index of a column chunk.
  std::shared_ptr<OffsetIndex> GetOffsetIndex(int32_t i) override {
    if (i < 0 || i >= row_group_metadata_->num_columns()) {
      throw ParquetException("Invalid column {} to get offset index", i);
    }

    auto col_chunk = row_group_metadata_->ColumnChunk(i);

    std::unique_ptr<ColumnCryptoMetaData> crypto_metadata = col_chunk->crypto_metadata();
    if (crypto_metadata != nullptr && file_decryptor_ == nullptr) {
```

Review Comment:
   same comment about removing the file_decryptor_ check?
[GitHub] [arrow] emkornfield commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
emkornfield commented on code in PR #14964:
URL: https://github.com/apache/arrow/pull/14964#discussion_r1083088136

## cpp/src/parquet/page_index.cc:

@@ -184,8 +185,219 @@ (same hunk as above; the comment anchors on the catch clause in `GetColumnIndex`)

```cpp
    auto descr = row_group_metadata_->schema()->Column(i);
    std::shared_ptr<ColumnIndex> column_index;
    try {
      column_index =
          ColumnIndex::Make(*descr, buffer->data() + buffer_offset, length, properties_);
    } catch (...) {
      throw ParquetException("Cannot deserialize column index for column {}", i);
```

Review Comment:
   seems like we should propagate information about the caught error here?
[GitHub] [arrow] emkornfield commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
emkornfield commented on code in PR #14964:
URL: https://github.com/apache/arrow/pull/14964#discussion_r1083087767

## cpp/src/parquet/page_index.cc:

@@ -184,8 +185,219 @@ (same hunk as above; the comment anchors on the cast and DCHECKs in `GetColumnIndex`)

```cpp
    auto buffer = column_index_buffer_.get();
    int64_t buffer_offset =
        column_index_location->offset - index_read_range_.column_index->offset;
    uint32_t length = static_cast<uint32_t>(column_index_location->length);
    DCHECK_GE(buffer_offset, 0);
```

Review Comment:
   if location->length is derived from data in the parquet file, we should be checking eagerly that the length is within the allowed range. Could we also add a comment on why the down-cast is safe?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [arrow-datafusion] ursabot commented on pull request #4989: Add support for linear range calculation in WINDOW functions
ursabot commented on PR #4989:
URL: https://github.com/apache/arrow-datafusion/pull/4989#issuecomment-1399003734

Benchmark runs are scheduled for baseline = 92d0a054c23e5fba91718db32ccd933ce86dd2b6 and contender = b71cae8aa556369bc5ee72b063ed1fc5a81192f1. b71cae8aa556369bc5ee72b063ed1fc5a81192f1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
- [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/b7602ae58f5d4b7ba3324d1ffe741657...ebc718c2b1854f13b01b15b9ab814210/)
- [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/a4af3422df424d45b8c593bf88fcdb2e...9c071dfa45984472843f37989cf6184f/)
- [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/62bf27b1bb66413091df940f3d3c4b2c...84d58f2d947140169e6aa56dc6992ca0/)
- [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/dcbd2be5fd934bd7957e5699f4165edd...3fbb7ffd22844f168faeed49ff4bdbcc/)

Buildkite builds:

Supported benchmarks:
- ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
- test-mac-arm: Supported benchmark langs: C++, Python, R
- ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
- ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
[GitHub] [arrow] emkornfield commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
emkornfield commented on code in PR #14964:
URL: https://github.com/apache/arrow/pull/14964#discussion_r1083085814

## cpp/src/parquet/page_index.cc:

@@ -184,8 +185,219 @@ (same hunk as above; the comment anchors on the NYI in `GetColumnIndex`)

```cpp
  /// Read column index of a column chunk.
  std::shared_ptr<ColumnIndex> GetColumnIndex(int32_t i) override {
    if (i < 0 || i >= row_group_metadata_->num_columns()) {
      throw ParquetException("Invalid column {} to get column index", i);
    }

    auto col_chunk = row_group_metadata_->ColumnChunk(i);

    std::unique_ptr<ColumnCryptoMetaData> crypto_metadata = col_chunk->crypto_metadata();
    if (crypto_metadata != nullptr && file_decryptor_ == nullptr) {
      ParquetException::NYI("Cannot read encrypted column index yet");
```

Review Comment:
   given this error should the second part of the check above be removed?
[GitHub] [arrow] emkornfield commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
emkornfield commented on code in PR #14964:
URL: https://github.com/apache/arrow/pull/14964#discussion_r1083084578

## cpp/src/parquet/file_reader.cc:

@@ -302,6 +303,17 @@

```cpp
class SerializedFile : public ParquetFileReader::Contents {
  ...
  std::shared_ptr<FileMetaData> metadata() const override { return file_metadata_; }

  std::shared_ptr<PageIndexReader> GetPageIndexReader() override {
    if (!file_metadata_) {
      throw ParquetException("Cannot get PageIndexReader as file metadata is not ready");
```

Review Comment:
   nit: can we instruct the api caller what went wrong here (i.e. should they call a certain method?)
[GitHub] [arrow] emkornfield commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
emkornfield commented on code in PR #14964:
URL: https://github.com/apache/arrow/pull/14964#discussion_r1083083858

## cpp/src/parquet/page_index.cc:

@@ -184,8 +185,241 @@

```cpp
class OffsetIndexImpl : public OffsetIndex {
  ...
  std::vector<PageLocation> page_locations_;
};

class RowGroupPageIndexReaderImpl : public RowGroupPageIndexReader {
 public:
  RowGroupPageIndexReaderImpl(::arrow::io::RandomAccessFile* input,
                              std::shared_ptr<RowGroupMetaData> row_group_metadata,
                              const ReaderProperties& properties,
                              int32_t row_group_ordinal,
                              std::shared_ptr<InternalFileDecryptor> file_decryptor)
      : input_(input),
        row_group_metadata_(row_group_metadata),
        properties_(properties),
        file_decryptor_(std::move(file_decryptor)) {
    bool has_column_index = false;
    bool has_offset_index = false;
    PageIndexReader::DeterminePageIndexRangesInRowGroup(
        *row_group_metadata, &column_index_base_offset_, &column_index_size_,
        &offset_index_base_offset_, &offset_index_size_, &has_column_index,
        &has_offset_index);
  }

  /// Read column index of a column chunk.
  ::arrow::Result<std::shared_ptr<ColumnIndex>> GetColumnIndex(int32_t i) override {
    if (i < 0 || i >= row_group_metadata_->num_columns()) {
      return ::arrow::Status::IndexError("Invalid column {} to get column index", i);
    }

    auto col_chunk = row_group_metadata_->ColumnChunk(i);

    std::unique_ptr<ColumnCryptoMetaData> crypto_metadata = col_chunk->crypto_metadata();
    if (crypto_metadata != nullptr && file_decryptor_ == nullptr) {
      ParquetException::NYI("Cannot read encrypted column index yet");
    }

    auto column_index_location = col_chunk->GetColumnIndexLocation();
    if (!column_index_location.has_value()) {
      return ::arrow::Status::Invalid("Column index does not exist for column {}", i);
    }

    if (column_index_buffer_ == nullptr) {
      ARROW_ASSIGN_OR_RAISE(
          column_index_buffer_,
          input_->ReadAt(column_index_base_offset_, column_index_size_));
    }

    auto buffer = column_index_buffer_.get();
    int64_t buffer_offset = column_index_location->offset - column_index_base_offset_;
    uint32_t length = column_index_location->length;
    DCHECK_GE(buffer_offset, 0);
    DCHECK_LE(buffer_offset + length, column_index_size_);

    auto descr = row_group_metadata_->schema()->Column(i);
    std::shared_ptr<ColumnIndex> column_index;
    try {
      column_index =
          ColumnIndex::Make(*descr, buffer->data() + buffer_offset, length, properties_);
    } catch (...) {
      return ::arrow::Status::SerializationError(
          "Cannot deserialize column index for column {}", i);
    }
    return column_index;
  }

  /// Read offset index of a column chunk.
  ::arrow::Result<std::shared_ptr<OffsetIndex>> GetOffsetIndex(int32_t i) override {
    if (i < 0 || i >= row_group_metadata_->num_columns()) {
      return ::arrow::Status::IndexError("Invalid column {} to get offset index", i);
    }

    auto col_chunk = row_group_metadata_->ColumnChunk(i);

    std::unique_ptr<ColumnCryptoMetaData> crypto_metadata = col_chunk->crypto_metadata();
    if (crypto_metadata != nullptr && file_decryptor_ == nullptr) {
      ParquetException::NYI("Cannot read encrypted offset index yet");
    }

    auto offset_index_location = col_chunk->GetOffsetIndexLocation();
    if (!offset_index_location.has_value()) {
      return ::arrow::Status::Invalid("Offset index does not exist for column {}", i);
    }

    if (offset_index_buffer_ == nullptr) {
      ARROW_ASSIGN_OR_RAISE(
          offset_index_buffer_,
          input_->ReadAt(offset_index_base_offset_, offset_index_size_));
    }

    auto buffer = offset_index_buffer_.get();
    int64_t buffer_offset = offset_index_location->offset - offset_index_base_offset_;
    uint32_t length = offset_index_location->length;
    DCHECK_GE(buffer_offset, 0);
    DCHECK_LE(buffer_offset + length, offset_index_size_);

    std::shared_ptr<OffsetIndex> offset_index;
    try {
      offset_index =
          OffsetIndex::Make(buffer->data() + buffer_offset, length, properties_);
    } catch (...) {
      return ::arrow::Status::SerializationError(
          "Cannot deserialize offset index for column {}", i);
    }
    return offset_index;
  }

 private:
  /// The input stream that can perform random access read.
  ::arrow::io::RandomAccessFile* input_;

  /// The row group metadata to get column chunk metadata.
  std::shared_ptr<RowGroupMetaData> row_group_metadata_;

  /// Reader properties used to deserialize thrift object.
  const ReaderProperties& properties_;

  /// File-level decryptor.
  std::shared_ptr<InternalFileDecryptor> file_decryptor_;

  /// File offsets and sizes of the page Index.
  int64_t column_index_base_offset_;
  int64_t column_index_size_;
  int64_t offset_index_base_offset_;
  int64_t offset_index_size_;

  /// Buffer to hold the raw bytes of the page index.
```
[GitHub] [arrow] thisisnic merged pull request #33809: MINOR: [R] Update BugReports field in DESCRIPTION
thisisnic merged PR #33809:
URL: https://github.com/apache/arrow/pull/33809
[GitHub] [arrow-datafusion] alamb closed issue #4979: Add support for linear range search
alamb closed issue #4979: Add support for linear range search
URL: https://github.com/apache/arrow-datafusion/issues/4979
[GitHub] [arrow-datafusion] alamb merged pull request #4989: Add support for linear range calculation in WINDOW functions
alamb merged PR #4989:
URL: https://github.com/apache/arrow-datafusion/pull/4989