[GitHub] [arrow] kou commented on pull request #33817: GH-33816: [CI][Conan] Use TARGET_FILE for portability
kou commented on PR #33817: URL: https://github.com/apache/arrow/pull/33817#issuecomment-1399202474 @github-actions crossbow submit wheel-windows-*-amd64 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kou commented on pull request #33817: GH-33816: [CI][Conan] Use TARGET_FILE for portability
kou commented on PR #33817: URL: https://github.com/apache/arrow/pull/33817#issuecomment-1399202261 @github-actions crossbow submit conan-maximum
[GitHub] [arrow] kou opened a new pull request, #33817: GH-33816: [CI][Conan] Use TARGET_FILE for portability
kou opened a new pull request, #33817: URL: https://github.com/apache/arrow/pull/33817

### What changes are included in this PR?
Use `$<TARGET_FILE:...>` instead of manual library path resolution.

### Are these changes tested?
Yes.

### Are there any user-facing changes?
No.
[GitHub] [arrow] kou commented on issue #33814: [Python] Can't install on Raspberry Pi (Failed building wheel for pyarrow)
kou commented on issue #33814: URL: https://github.com/apache/arrow/issues/33814#issuecomment-1399201865 It seems that you didn't define `CMAKE_PREFIX_PATH`: https://arrow.apache.org/docs/dev/developers/python.html#using-system-and-bundled-dependencies

```bash
export ARROW_HOME=$(pwd)/dist
export LD_LIBRARY_PATH=$(pwd)/dist/lib:$LD_LIBRARY_PATH
export CMAKE_PREFIX_PATH=$ARROW_HOME:$CMAKE_PREFIX_PATH
```
[GitHub] [arrow] kou opened a new pull request, #33815: GH-33813: [CI][GLib] Use Ruby 3.2 to update bundled MSYS2
kou opened a new pull request, #33815: URL: https://github.com/apache/arrow/pull/33815

### What changes are included in this PR?
Use Ruby 3.2 to update bundled MSYS2.

### Are these changes tested?
Yes.

### Are there any user-facing changes?
No.
[GitHub] [arrow] wgtmac commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
wgtmac commented on code in PR #14964: URL: https://github.com/apache/arrow/pull/14964#discussion_r1083257758

## cpp/src/parquet/page_index.cc: ##

```cpp
@@ -184,8 +185,219 @@ class OffsetIndexImpl : public OffsetIndex {
   std::vector<PageLocation> page_locations_;
 };

+class RowGroupPageIndexReaderImpl : public RowGroupPageIndexReader {
+ public:
+  RowGroupPageIndexReaderImpl(::arrow::io::RandomAccessFile* input,
+                              std::shared_ptr<RowGroupMetaData> row_group_metadata,
+                              const ReaderProperties& properties,
+                              int32_t row_group_ordinal,
+                              std::shared_ptr<InternalFileDecryptor> file_decryptor)
+      : input_(input),
+        row_group_metadata_(std::move(row_group_metadata)),
+        properties_(properties),
+        file_decryptor_(std::move(file_decryptor)),
+        index_read_range_(
+            PageIndexReader::DeterminePageIndexRangesInRowGroup(*row_group_metadata_)) {}
+
+  /// Read column index of a column chunk.
+  std::shared_ptr<ColumnIndex> GetColumnIndex(int32_t i) override {
+    if (i < 0 || i >= row_group_metadata_->num_columns()) {
+      throw ParquetException("Invalid column {} to get column index", i);
+    }
+
+    auto col_chunk = row_group_metadata_->ColumnChunk(i);
+
+    std::unique_ptr<ColumnCryptoMetaData> crypto_metadata = col_chunk->crypto_metadata();
+    if (crypto_metadata != nullptr && file_decryptor_ == nullptr) {
+      ParquetException::NYI("Cannot read encrypted column index yet");
+    }
+
+    auto column_index_location = col_chunk->GetColumnIndexLocation();
+    if (!column_index_location.has_value()) {
+      return nullptr;
+    }
+
+    if (!index_read_range_.column_index.has_value()) {
+      throw ParquetException("Missing column index read range");
+    }
+
+    if (column_index_buffer_ == nullptr) {
+      PARQUET_ASSIGN_OR_THROW(column_index_buffer_,
+                              input_->ReadAt(index_read_range_.column_index->offset,
+                                             index_read_range_.column_index->length));
+    }
+
+    auto buffer = column_index_buffer_.get();
+    int64_t buffer_offset =
+        column_index_location->offset - index_read_range_.column_index->offset;
+    uint32_t length = static_cast<uint32_t>(column_index_location->length);
+    DCHECK_GE(buffer_offset, 0);
```

Review Comment: `ThriftDeserializer::DeserializeMessage` requires a `uint32_t*` as the in/out parameter for the length of the Thrift message, and `ColumnIndex::Make()` simply reuses `uint32_t` for simplicity. BTW, this is not a down-cast: `column_index_location->length` is of type `int32_t`.
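The offset arithmetic under review can be sketched in Python with hypothetical values (the names and numbers below are illustrative, not the real parquet API): the reader fetches one contiguous byte range covering every column index in the row group, then slices each column's index out of that shared buffer.

```python
# Hypothetical coalesced read range covering all column indexes in a row group.
range_offset, range_length = 1000, 300
buffer = bytes(range_length)  # stands in for input->ReadAt(offset, length)

def slice_column_index(location_offset: int, location_length: int) -> bytes:
    # Each column's index location is absolute in the file; translate it to an
    # offset inside the coalesced buffer. This must be >= 0 (the DCHECK above).
    buffer_offset = location_offset - range_offset
    assert 0 <= buffer_offset <= range_length - location_length
    return buffer[buffer_offset:buffer_offset + location_length]

piece = slice_column_index(1100, 50)  # a column index 100 bytes into the range
print(len(piece))  # 50
```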
[GitHub] [arrow] kou commented on issue #15139: [C++] arrow.pc is missing dependencies with Windows static builds
kou commented on issue #15139: URL: https://github.com/apache/arrow/issues/15139#issuecomment-1399200167 Hmm. It seems that it's related to vcpkg. Could you open an issue on vcpkg?
[GitHub] [arrow] ursabot commented on pull request #33811: GH-33786: [C++] Ignore old system xsimd
ursabot commented on PR #33811: URL: https://github.com/apache/arrow/pull/33811#issuecomment-1399199450 Benchmark runs are scheduled for baseline = 0d9d132e9140f26578369b5ef0b44d25c501e45d and contender = 2117d028699edd9f4197650890f3226cdd285c23. 2117d028699edd9f4197650890f3226cdd285c23 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
[Failed] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/9fde0e445003471087541e47735a2e68...b3787b82b23b4d6b9b78386fe421dcc1/)
[Failed] [test-mac-arm](https://conbench.ursa.dev/compare/runs/1cd922b14c864ae2a84b80f4185faef6...c64d806e1d134ffab9d1ec9139bd444e/)
[Failed] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/0fbb4c22a0bd4465acac2bc9ed954cd1...d0bf5c298da6498594a348ce39a208da/)
[Failed] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/f3b67a6be39d4d1d8f884ac0fcf6624f...2df2681c6c164b6093f2773bfc4e6b3c/)

Buildkite builds:
[Failed] [`2117d028` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2241)
[Failed] [`2117d028` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2266)
[Failed] [`2117d028` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2236)
[Failed] [`2117d028` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2259)
[Failed] [`0d9d132e` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2240)
[Failed] [`0d9d132e` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2265)
[Failed] [`0d9d132e` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2235)
[Failed] [`0d9d132e` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2258)

Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
[GitHub] [arrow] wgtmac commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
wgtmac commented on code in PR #14964: URL: https://github.com/apache/arrow/pull/14964#discussion_r1083256983

## cpp/src/parquet/file_reader.cc: ##

```cpp
@@ -302,6 +303,17 @@ class SerializedFile : public ParquetFileReader::Contents {
   std::shared_ptr<FileMetaData> metadata() const override { return file_metadata_; }

+  std::shared_ptr<PageIndexReader> GetPageIndexReader() override {
+    if (!file_metadata_) {
+      throw ParquetException("Cannot get PageIndexReader as file metadata is not ready");
```

Review Comment: Usually this won't happen if the user calls one of the static Open() functions to create a ParquetFileReader instance. But if the user calls the constructor directly and calls GetPageIndexReader() before Open(), then this could happen. Added a comment to explain it and made the exception message clearer.
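The guard discussed here can be sketched in Python with hypothetical stand-ins (these are not the real parquet classes): constructing the file object directly and requesting the page index reader before Open() should fail loudly, because Open() is what parses the footer metadata.

```python
# Hypothetical sketch of the "metadata not ready" guard, not the real API.
class SerializedFile:
    def __init__(self):
        self._file_metadata = None  # not available until Open() parses the footer

    def open(self):
        self._file_metadata = {"num_row_groups": 1}  # stand-in for parsed footer

    def get_page_index_reader(self):
        if self._file_metadata is None:
            raise RuntimeError(
                "Cannot get PageIndexReader as file metadata is not ready")
        return object()  # stand-in for a PageIndexReader

f = SerializedFile()
try:
    f.get_page_index_reader()  # constructor called directly, Open() skipped
except RuntimeError as e:
    print(e)  # Cannot get PageIndexReader as file metadata is not ready
f.open()
reader = f.get_page_index_reader()  # now succeeds
```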
[GitHub] [arrow] Oduig commented on issue #33790: [Python] Support for reading .csv files from a zip archive
Oduig commented on issue #33790: URL: https://github.com/apache/arrow/issues/33790#issuecomment-1399198244 Thank you for the reply, I see the relevant code is in the cpp section! It already works with gz and bz2, but not with (Windows-esque) .zip files. Is there a reason why it is not included as a compression codec?
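One reason zip differs from gz/bz2 is that it is an archive of named members rather than a byte-stream codec, so it does not slot into a per-file "compression" option as-is. A minimal Python stdlib workaround, assuming the archive holds a single CSV member, is to unwrap the member first and hand its bytes to whatever CSV reader you use:

```python
import io
import zipfile

# Build a small zip in memory to stand in for a downloaded archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data.csv", "a,b\n1,2\n")

# Unwrap the CSV member; the resulting bytes are plain CSV.
with zipfile.ZipFile(buf) as zf:
    with zf.open("data.csv") as member:
        text = member.read().decode()

print(text.splitlines()[0])  # a,b
```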
[GitHub] [arrow-adbc] kou commented on a diff in pull request #356: feat(go/adbc/driver/pkg/cmake): cmake build for Go shared library drivers
kou commented on code in PR #356: URL: https://github.com/apache/arrow-adbc/pull/356#discussion_r1083254936 ## c/driver/flightsql/adbc-driver-flightsql.pc.in: ## @@ -0,0 +1,25 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +prefix=@CMAKE_INSTALL_PREFIX@ +libdir=@ADBC_PKG_CONFIG_LIBDIR@ + +Name: Apache Arrow Database Connectivity (ADBC) FlightSQL driver Review Comment: ```suggestion Name: Apache Arrow Database Connectivity (ADBC) Flight SQL driver ``` ## c/driver/flightsql/adbc-driver-flightsql.pc.in: ## @@ -0,0 +1,25 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. 
See the License for the +# specific language governing permissions and limitations +# under the License. + +prefix=@CMAKE_INSTALL_PREFIX@ +libdir=@ADBC_PKG_CONFIG_LIBDIR@ + +Name: Apache Arrow Database Connectivity (ADBC) FlightSQL driver +Description: The ADBC FlightSQL driver provides an ADBC driver for Flight SQL. Review Comment: ```suggestion Description: The ADBC Flight SQL driver provides an ADBC driver for Flight SQL. ``` ## c/cmake_modules/GoUtils.cmake: ## @@ -0,0 +1,210 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +find_program(GO_BIN "go" REQUIRED) +message(STATUS "Detecting Go executable: Found ${GO_BIN}") + +function(install_cmake_package PACKAGE_NAME EXPORT_NAME) + set(CONFIG_CMAKE "${PACKAGE_NAME}Config.cmake") + set(BUILT_CONFIG_CMAKE "${CMAKE_CURRENT_BINARY_DIR}/${CONFIG_CMAKE}") + configure_package_config_file("${CONFIG_CMAKE}.in" "${BUILT_CONFIG_CMAKE}" +INSTALL_DESTINATION "${ARROW_CMAKE_DIR}/${PACKAGE_NAME}") + set(CONFIG_VERSION_CMAKE "${PACKAGE_NAME}ConfigVersion.cmake") + set(BUILT_CONFIG_VERSION_CMAKE "${CMAKE_CURRENT_BINARY_DIR}/${CONFIG_VERSION_CMAKE}") + write_basic_package_version_file("${BUILT_CONFIG_VERSION_CMAKE}" + COMPATIBILITY SameMajorVersion) + set(INSTALL_CMAKEDIR "${CMAKE_INSTALL_LIBDIR}/cmake/${PACKAGE_NAME}") + install(FILES "${BUILT_CONFIG_CMAKE}" "${BUILT_CONFIG_VERSION_CMAKE}" + DESTINATION "${INSTALL_CMAKEDIR}") + set(TARGETS_CMAKE "${PACKAGE_NAME}Targets.cmake") + install(EXPORT ${EXPORT_NAME} + DESTINATION "${INSTALL_CMAKEDIR}" + NAMESPACE "${PACKAGE_NAME}::" + FILE "${TARGETS_CMAKE}" + EXPORT_LINK_INTERFACE_LIBRARIES) +endfunction() Review Comment: It seems that we can remove this.
[GitHub] [arrow-datafusion] xiaoyong-z commented on a diff in pull request #4995: [Feature] support describe file
xiaoyong-z commented on code in PR #4995: URL: https://github.com/apache/arrow-datafusion/pull/4995#discussion_r1083255049 ## datafusion/expr/src/logical_plan/plan.rs: ## @@ -1625,6 +1637,15 @@ pub struct Prepare { pub input: Arc<LogicalPlan>, } +/// Describe a file Review Comment: done
[GitHub] [arrow-datafusion] xiaoyong-z commented on a diff in pull request #4995: [Feature] support describe file
xiaoyong-z commented on code in PR #4995: URL: https://github.com/apache/arrow-datafusion/pull/4995#discussion_r1083254991 ## datafusion/core/tests/sqllogictests/test_files/describe.slt: ## @@ -0,0 +1,43 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + + +## +# Describe internal tables +## + +statement ok +set datafusion.catalog.information_schema = true + +statement ok +CREATE external table aggregate_simple(c1 real, c2 double, c3 boolean) STORED as CSV WITH HEADER ROW LOCATION 'tests/data/aggregate_simple.csv'; + +query C1 +DESCRIBE aggregate_simple; Review Comment: There are no sqllogictests covering describe, so I added some tests here. Now I have unified describe-table and describe-file, and I think it's necessary to add tests for this.
[GitHub] [arrow-datafusion] xiaoyong-z commented on a diff in pull request #4995: [Feature] support describe file
xiaoyong-z commented on code in PR #4995: URL: https://github.com/apache/arrow-datafusion/pull/4995#discussion_r1083254920 ## datafusion/core/src/datasource/listing/table.rs: ## @@ -67,6 +67,10 @@ pub struct ListingTableConfig { pub file_schema: Option<SchemaRef>, /// Optional `ListingOptions` for the to be created `ListingTable`. pub options: Option<ListingOptions>, +/// Optional, defaults to false. If temporary is true, it means that +/// we will create a temporary table for select * from `xxx.file` +/// or describe 'xxx.file'; the temporary table will be registered to the schema +pub temporary_file: bool, Review Comment: Updated the PR; there is no need to use temporary_file any more.
[GitHub] [arrow] wgtmac commented on a diff in pull request #33694: MINOR: [C++][Parquet] Rephrase decimal annotation
wgtmac commented on code in PR #33694: URL: https://github.com/apache/arrow/pull/33694#discussion_r1083254666 ## cpp/src/parquet/properties.h: ## @@ -452,19 +452,39 @@ class PARQUET_EXPORT WriterProperties { return this->disable_statistics(path->ToDotString()); } -/// Enable integer type to annotate decimal type as below: -/// int32: 1 <= precision <= 9 -/// int64: 10 <= precision <= 18 -/// Default disabled. -Builder* enable_integer_annotate_decimal() { - integer_annotate_decimal_ = true; +/// Enable decimal logical type with 1 <= precision <= 18 to be stored as +/// integer physical type. +/// +/// According to the specs, DECIMAL can be used to annotate the following types: +/// - int32: for 1 <= precision <= 9. +/// - int64: for 1 <= precision <= 18; precision < 10 will produce a warning. +/// - fixed_len_byte_array: precision is limited by the array size. +/// Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits. +/// - binary: precision is not limited, but is required. The minimum number of bytes +/// to store the unscaled value should be used. +/// +/// By default, this is DISABLED and all decimal types annotate fixed_len_byte_array. +/// +/// When enabled, the C++ writer will use following physical types to store decimals: +/// - int32: for 1 <= precision <= 9. +/// - int64: for 10 <= precision <= 18. +/// - fixed_len_byte_array: for precision > 18. +/// +/// As a consequence, decimal columns stored in integer types are more compact +/// but at the risk that the parquet file may not be readable by previous Arrow C++ +/// versions or other implementations. Review Comment: I have checked that both apache/parquet-mr and apache/impala have supported reading this for a long time. It seems that the native parquet reader in velox does not handle this yet. I'm not familiar with other implementations, but there may be some that do not support it. IMHO, we cannot take care of all other implementations. There is no risk to parquet-cpp itself because the reader support is already in, and it is well documented by the parquet specs. So I have removed the comment about the risk. @wjones127 @pitrou
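The precision-to-physical-type mapping and the digit-capacity formula quoted in the diff can be sketched in Python (function names here are illustrative, not part of any parquet API):

```python
import math

# Physical type the writer picks for a DECIMAL of a given precision
# when the integer-annotation option is enabled (per the quoted doc comment).
def decimal_physical_type(precision: int) -> str:
    if 1 <= precision <= 9:
        return "int32"
    if 10 <= precision <= 18:
        return "int64"
    return "fixed_len_byte_array"  # precision > 18 (precision < 1 is invalid)

# Digit capacity of a fixed_len_byte_array of n bytes (two's-complement):
# floor(log_10(2^(8*n - 1) - 1)) base-10 digits, as in the quoted formula.
def flba_max_precision(n: int) -> int:
    return int(math.floor(math.log10(2 ** (8 * n - 1) - 1)))

print(decimal_physical_type(9))   # int32
print(decimal_physical_type(18))  # int64
print(flba_max_precision(4))      # 9
```

Note that `flba_max_precision(16) == 38`, which matches the 38-digit capacity commonly associated with 128-bit decimals.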
[GitHub] [arrow] kou merged pull request #33811: GH-33786: [C++] Ignore old system xsimd
kou merged PR #33811: URL: https://github.com/apache/arrow/pull/33811
[GitHub] [arrow] kou opened a new pull request, #33812: GH-33796: [C++] Fix wrong arrow-testing.pc config with system GoogleTest
kou opened a new pull request, #33812: URL: https://github.com/apache/arrow/pull/33812

### Rationale for this change
An empty `-I` in `Cflags:` generates an invalid build command line.

### What changes are included in this PR?
Add `Requires: gtest` if `gtest.pc` exists. Add `Cflags: -I...` and `Libs: /.../libgtest.a` if `gtest.pc` doesn't exist. Bundled GoogleTest isn't supported yet.

### Are these changes tested?
Yes.

### Are there any user-facing changes?
Yes.
[GitHub] [arrow] wgtmac commented on a diff in pull request #33694: MINOR: [C++][Parquet] Rephrase decimal annotation
wgtmac commented on code in PR #33694: URL: https://github.com/apache/arrow/pull/33694#discussion_r1083253360 ## cpp/src/parquet/properties.h: ## @@ -452,19 +452,39 @@ class PARQUET_EXPORT WriterProperties { return this->disable_statistics(path->ToDotString()); } -/// Enable integer type to annotate decimal type as below: -/// int32: 1 <= precision <= 9 -/// int64: 10 <= precision <= 18 -/// Default disabled. -Builder* enable_integer_annotate_decimal() { - integer_annotate_decimal_ = true; +/// Enable decimal logical type with 1 <= precision <= 18 to be stored as +/// integer physical type. +/// +/// According to the specs, DECIMAL can be used to annotate the following types: +/// - int32: for 1 <= precision <= 9. +/// - int64: for 1 <= precision <= 18; precision < 10 will produce a warning. +/// - fixed_len_byte_array: precision is limited by the array size. +/// Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits. +/// - binary: precision is not limited, but is required. The minimum number of bytes +/// to store the unscaled value should be used. Review Comment: That makes sense. I have changed the comment accordingly.
[GitHub] [arrow] wgtmac commented on a diff in pull request #33694: MINOR: [C++][Parquet] Rephrase decimal annotation
wgtmac commented on code in PR #33694: URL: https://github.com/apache/arrow/pull/33694#discussion_r1083252465 ## cpp/src/parquet/properties.h: ## @@ -452,19 +452,39 @@ class PARQUET_EXPORT WriterProperties { return this->disable_statistics(path->ToDotString()); } -/// Enable integer type to annotate decimal type as below: -/// int32: 1 <= precision <= 9 -/// int64: 10 <= precision <= 18 -/// Default disabled. -Builder* enable_integer_annotate_decimal() { - integer_annotate_decimal_ = true; +/// Enable decimal logical type with 1 <= precision <= 18 to be stored as +/// integer physical type. Review Comment: I don't think negative precision decimals exist, but keeping `1 <= precision` doesn't hurt.
[GitHub] [arrow-julia] quinnj merged pull request #381: Tag new version dev/release/release.sh
quinnj merged PR #381: URL: https://github.com/apache/arrow-julia/pull/381
[GitHub] [arrow] kou commented on pull request #33753: GH-30891: [C++] The C++ API for writing datasets could be improved
kou commented on PR #33753: URL: https://github.com/apache/arrow/pull/33753#issuecomment-1399190351 GLib/Ruby parts are updated.
[GitHub] [arrow] kou commented on a diff in pull request #33791: GH-33782: [Release] Vote email number of issues is querying JIRA and producing a wrong number
kou commented on code in PR #33791: URL: https://github.com/apache/arrow/pull/33791#discussion_r1083240407 ## dev/release/02-source-test.rb: ## @@ -93,33 +93,26 @@ def test_python_version end def test_vote -jira_url = "https://issues.apache.org/jira"; -jql_conditions = [ - "project = ARROW", - "status in (Resolved, Closed)", - "fixVersion = #{@release_version}", -] -jql = jql_conditions.join(" AND ") +github_token = ENV["ARROW_GITHUB_API_TOKEN"] +uri = URI.parse("https://api.github.com/graphql";) +https = Net::HTTP.new(uri.host, uri.port) +https.use_ssl = true +req = Net::HTTP::Post.new(uri.path, initheader = {"Authorization" => "Bearer #{github_token}"}) + +n_issues_query = "{\"query\":\"query {search(query: \\\"repo:apache/arrow is:issue is:closed milestone:#{@release_version}\\\", type:ISSUE) {issueCount}}\"}" +req.body = n_issues_query n_resolved_issues = nil -search_url = URI("#{jira_url}/rest/api/2/search?jql=#{CGI.escape(jql)}") -search_url.open do |response| - n_resolved_issues = JSON.parse(response.read)["total"] +https.request(req) do |response| +n_resolved_issues = JSON.parse(response.body)["data"]["search"]["issueCount"] end Review Comment: We can use `Net::HTTP.post`: ```ruby uri = URI.parse("https://api.github.com/graphql";) n_issues_query = { "query" => <<-QUERY, query { search(query: "repo:apache/arrow is:issue is:closed milestone:#{@release_version}", type: ISSUE) { issueCount } } QUERY } response = Net::HTTP.post(uri, n_issues_query.to_json, "Content-Type" => "application/json", "Authorization" => "Bearer #{github_token}") n_resolved_issues = JSON.parse(response.body)["data"]["search"]["issueCount"] ``` ## dev/release/02-source-test.rb: ## @@ -93,33 +93,26 @@ def test_python_version end def test_vote -jira_url = "https://issues.apache.org/jira"; -jql_conditions = [ - "project = ARROW", - "status in (Resolved, Closed)", - "fixVersion = #{@release_version}", -] -jql = jql_conditions.join(" AND ") +github_token = ENV["ARROW_GITHUB_API_TOKEN"] +uri = 
URI.parse("https://api.github.com/graphql";) +https = Net::HTTP.new(uri.host, uri.port) +https.use_ssl = true +req = Net::HTTP::Post.new(uri.path, initheader = {"Authorization" => "Bearer #{github_token}"}) + +n_issues_query = "{\"query\":\"query {search(query: \\\"repo:apache/arrow is:issue is:closed milestone:#{@release_version}\\\", type:ISSUE) {issueCount}}\"}" +req.body = n_issues_query n_resolved_issues = nil -search_url = URI("#{jira_url}/rest/api/2/search?jql=#{CGI.escape(jql)}") -search_url.open do |response| - n_resolved_issues = JSON.parse(response.read)["total"] +https.request(req) do |response| +n_resolved_issues = JSON.parse(response.body)["data"]["search"]["issueCount"] end -github_api_url = "https://api.github.com"; -verify_prs = URI("#{github_api_url}/repos/apache/arrow/pulls" + - "?state=open" + - "&head=apache:release-#{@release_version}-rc0") + +pr_query = "{\"query\": \"query {repository(owner: \\\"apache\\\", name: \\\"arrow\\\") {refs(first: 1, refPrefix: \\\"refs/heads/\\\", query: \\\"release-#{@release_version}-rc0\\\") {nodes{associatedPullRequests(first: 1){edges{node{url}}}\"}" +req.body = pr_query verify_pr_url = nil -headers = { - "Accept" => "application/vnd.github+json", -} -github_token = ENV["ARROW_GITHUB_API_TOKEN"] -if github_token - headers["Authorization"] = "Bearer #{github_token}" -end -verify_prs.open(headers) do |response| - verify_pr_url = (JSON.parse(response.read)[0] || {})["html_url"] +https.request(req) do |response| + verify_pr_url = JSON.parse(response.body)["data"]["repository"]["refs"]["nodes"][0]["associatedPullRequests"]["edges"][0]["node"]["url"] end + Review Comment: Do we need to change this? It seems that we can still use the current simple REST API. 
## dev/release/02-source.sh: ## @@ -142,18 +142,18 @@ if [ ${SOURCE_PR} -gt 0 ]; then fi if [ ${SOURCE_VOTE} -gt 0 ]; then - jira_url="https://issues.apache.org/jira"; - jql="project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%20${version}" - n_resolved_issues=$(curl "${jira_url}/rest/api/2/search/?jql=${jql}" | jq ".total") - curl_options=(--header "Accept: application/vnd.github+json") - if [ -n "${ARROW_GITHUB_API_TOKEN:-}" ]; then -curl_options+=(--header "Authorization: Bearer ${ARROW_GITHUB_API_TOKEN}") - fi - curl_options+=(--get) - curl_options+=(--data "state=open") - curl_options+=(--data "head=apach
[GitHub] [arrow] kou commented on pull request #33811: GH-33786: [C++] Ignore old system xsimd
kou commented on PR #33811: URL: https://github.com/apache/arrow/pull/33811#issuecomment-1399162810 @github-actions crossbow submit verify-rc-source-cpp-linux-ubuntu-22.04-amd64 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ursabot commented on pull request #33739: GH-33655: [C++][Parquet] Fix occasional failure in TestArrowReadWrite.MultithreadedWrite
ursabot commented on PR #33739: URL: https://github.com/apache/arrow/pull/33739#issuecomment-1399162691 Benchmark runs are scheduled for baseline = a16c54567ada6729311fd26bdbee4b5e61901410 and contender = 0d9d132e9140f26578369b5ef0b44d25c501e45d. 0d9d132e9140f26578369b5ef0b44d25c501e45d is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Failed] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/a2c888cbc9e54acbb087f82996ef2976...9fde0e445003471087541e47735a2e68/) [Failed] [test-mac-arm](https://conbench.ursa.dev/compare/runs/9c950d82a1a141cb9555bb25a732624c...1cd922b14c864ae2a84b80f4185faef6/) [Failed] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/9ac607cae658446e885a40089cb71e52...0fbb4c22a0bd4465acac2bc9ed954cd1/) [Failed] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/6c0e57252df347318d11432fdaa0cdb9...f3b67a6be39d4d1d8f884ac0fcf6624f/) Buildkite builds: [Failed] [`0d9d132e` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2240) [Failed] [`0d9d132e` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2265) [Failed] [`0d9d132e` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2235) [Failed] [`0d9d132e` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2258) [Failed] [`a16c5456` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2239) [Failed] [`a16c5456` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2264) [Failed] [`a16c5456` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2234) [Failed] [`a16c5456` 
ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2257) Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
[GitHub] [arrow] kou opened a new pull request, #33811: GH-33786: [C++] Ignore old system xsimd
kou opened a new pull request, #33811: URL: https://github.com/apache/arrow/pull/33811 ### Rationale for this change If an old xsimd is installed, the CMake target for the bundled xsimd conflicts with it. ### What changes are included in this PR? Use `arrow::xsimd` as the bundled xsimd's target name to avoid the conflict. ### Are these changes tested? Yes. ### Are there any user-facing changes? No.
[GitHub] [arrow-datafusion] ozankabak commented on pull request #4866: Support non-equijoin predicate for EliminateCrossJoin
ozankabak commented on PR #4866: URL: https://github.com/apache/arrow-datafusion/pull/4866#issuecomment-1399161710 Any thoughts on how to make progress on this?
[GitHub] [arrow] kou commented on issue #33786: [Release][C++] Release verification tasks fail with libxsimd-dev installed on ubuntu 22.04
kou commented on issue #33786: URL: https://github.com/apache/arrow/issues/33786#issuecomment-1399160270 We can use `... ARROW_CMAKE_OPTIONS="-Dxsimd_SOURCE=BUNDLED" dev/release/verify-release-candidate.sh ...` instead of changing `verify-release-candidate.sh`.
[GitHub] [arrow-julia] kou commented on pull request #381: Tag new version dev/release/release.sh
kou commented on PR #381: URL: https://github.com/apache/arrow-julia/pull/381#issuecomment-1399158739 OK. We can try TagBot in the next release again. If TagBot doesn't work, we can use this manual tagging approach.
[GitHub] [arrow] mapleFU commented on issue #15173: [Parquet][C++] ByteStreamSplitDecoder broken in presence of nulls
mapleFU commented on issue #15173: URL: https://github.com/apache/arrow/issues/15173#issuecomment-1399158204 @wjones127 @emkornfield Hi, what do you think of this problem? Should we ensure there is no padding? Or just use a method like https://github.com/apache/arrow/issues/15173#issuecomment-1385858560 ?
[GitHub] [arrow] kou commented on a diff in pull request #33806: GH-33723: [C++] re2::RE2::RE2() result must be checked
kou commented on code in PR #33806: URL: https://github.com/apache/arrow/pull/33806#discussion_r1083228638

## cpp/src/arrow/compute/kernels/scalar_string_ascii.cc: ##
@@ -1505,6 +1508,13 @@ struct MatchLike {
 static const RE2 kLikePatternIsStartsWith(R"(([^%_]*[^\\%_])?%+)", kRE2Options);
 // A LIKE pattern matching this regex can be translated into a suffix search.
 static const RE2 kLikePatternIsEndsWith(R"(%+([^%_]*))", kRE2Options);
+static bool global_checked = false;
Review Comment:
  Can we use `std::once_flag`?

## cpp/src/arrow/compute/kernels/scalar_string_ascii.cc: ##
@@ -1505,6 +1508,13 @@ struct MatchLike {
 static const RE2 kLikePatternIsStartsWith(R"(([^%_]*[^\\%_])?%+)", kRE2Options);
 // A LIKE pattern matching this regex can be translated into a suffix search.
 static const RE2 kLikePatternIsEndsWith(R"(%+([^%_]*))", kRE2Options);
+static bool global_checked = false;
+if (ARROW_PREDICT_FALSE(!global_checked)) {
+  RETURN_NOT_OK(RegexStatus(kLikePatternIsSubstringMatch));
+  RETURN_NOT_OK(RegexStatus(kLikePatternIsStartsWith));
+  RETURN_NOT_OK(RegexStatus(kLikePatternIsEndsWith));
Review Comment:
  If we have an error in these patterns, does this code report an error on the second call? (It seems that this code reports an error only on the first call.)
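For reference, the combination kou is asking about — run validation once with `std::once_flag`, but cache the resulting status so the error is reported on every call, not only the first — can be sketched roughly like this. This is a standalone illustration, not Arrow's actual code: `Status`, `ValidatePattern` (a toy checker that only verifies balanced parentheses), and `EnsureLikePatternsValid` all stand in for the real RE2-based `RegexStatus` machinery, and one pattern is deliberately broken for the demo.

```cpp
#include <mutex>
#include <string>
#include <vector>

// Minimal stand-in for arrow::Status.
struct Status {
  bool ok;
  std::string message;
};

// Hypothetical validator: only checks that parentheses are balanced,
// standing in for compiling the pattern with RE2 and checking re.ok().
static Status ValidatePattern(const std::string& pattern) {
  int depth = 0;
  for (char c : pattern) {
    if (c == '(') ++depth;
    if (c == ')') --depth;
    if (depth < 0) return {false, "unbalanced ')': " + pattern};
  }
  if (depth != 0) return {false, "unbalanced '(': " + pattern};
  return {true, ""};
}

// The std::once_flag pattern: validation runs exactly once, but the
// resulting Status is cached in a function-local static, so an error
// is reported on every call, not only the first one.
const Status& EnsureLikePatternsValid() {
  static std::once_flag checked;
  static Status result{true, ""};
  std::call_once(checked, [] {
    const std::vector<std::string> patterns = {
        R"(([^%_]*[^\\%_])?%+)",  // starts-with pattern from the diff
        R"(%+([^%_]*)",           // deliberately broken for the demo
    };
    for (const auto& p : patterns) {
      Status s = ValidatePattern(p);
      if (!s.ok) {
        result = s;
        return;
      }
    }
  });
  return result;
}
```

Because the `Status` outlives the `call_once` body, a second caller sees the same failure instead of a silently skipped check, which addresses the second review comment.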
[GitHub] [arrow] cyb70289 commented on pull request #33739: GH-33655: [C++][Parquet] Fix occasional failure in TestArrowReadWrite.MultithreadedWrite
cyb70289 commented on PR #33739: URL: https://github.com/apache/arrow/pull/33739#issuecomment-1399156005 Thanks @wgtmac for fixing the issue quickly!
[GitHub] [arrow] cyb70289 merged pull request #33739: GH-33655: [C++][Parquet] Fix occasional failure in TestArrowReadWrite.MultithreadedWrite
cyb70289 merged PR #33739: URL: https://github.com/apache/arrow/pull/33739
[GitHub] [arrow] kou commented on pull request #33781: GH-33723: [C++] re2::RE2::RE2() result must be checked
kou commented on PR #33781: URL: https://github.com/apache/arrow/pull/33781#issuecomment-1399154574 Thanks! But #33806 is a better approach. I'll close this in favor of #33806.
[GitHub] [arrow] kou closed pull request #33781: GH-33723: [C++] re2::RE2::RE2() result must be checked
kou closed pull request #33781: GH-33723: [C++] re2::RE2::RE2() result must be checked URL: https://github.com/apache/arrow/pull/33781
[GitHub] [arrow] paleolimbot commented on issue #33702: [R] Package Arrow 11.0.0 for R/CRAN
paleolimbot commented on issue #33702: URL: https://github.com/apache/arrow/issues/33702#issuecomment-1399154374 Is there an Arrow issue for clang16 yet? I didn't get to it today but I'm sure I can get a docker image together on Monday with clang16 and R.
[GitHub] [arrow] zzzzwj commented on a diff in pull request #33703: GH-14748:[C++] Modify some comments about shared_ptr's ownership in parquet-cpp.
zzzzwj commented on code in PR #33703: URL: https://github.com/apache/arrow/pull/33703#discussion_r1083228089

## cpp/src/parquet/arrow/test_util.h: ##
@@ -463,12 +463,16 @@ Status MakeEmptyListsArray(int64_t size, std::shared_ptr* out_array) {
   return Status::OK();
 }
 
+// Please make sure that the return value should have an object assigned explicitly
+// if you want to use the pointer contained.
Review Comment:
  Oh, really sorry, it looks like I truly misunderstood your meaning. I thought that you wanted me to verify all the ownership of functions that return a shared_ptr. I'll have a careful check and then make a change on this PR.
[GitHub] [arrow] paleolimbot commented on issue #29428: [R] accept expression lists in Scanner$create() with arrow_dplyr_querys
paleolimbot commented on issue #29428: URL: https://github.com/apache/arrow/issues/29428#issuecomment-1399153279 Typically our `as_record_batch_reader()` methods just have a `schema` argument (as in, output schema...cast if you can, error if you can't). That's not *quite* what duckdb needs to pass through but I'm sure we can provide something more direct than going through the scanner.
[GitHub] [arrow] kou commented on pull request #33803: GH-33787: [C++] arrow/cpp/src/arrow/util/cpu_info.cc: the -Werror triggers an error: statement has no effect [-Werror=unused-value]
kou commented on PR #33803: URL: https://github.com/apache/arrow/pull/33803#issuecomment-1399152264 How about this?

```diff
diff --git a/cpp/src/arrow/util/cpu_info.cc b/cpp/src/arrow/util/cpu_info.cc
index 3ba8db216..08b7b8b21 100644
--- a/cpp/src/arrow/util/cpu_info.cc
+++ b/cpp/src/arrow/util/cpu_info.cc
@@ -348,6 +348,7 @@ int64_t LinuxGetCacheSize(int level) {
 // care about are present.
 // Returns a bitmap of flags.
 int64_t LinuxParseCpuFlags(const std::string& values) {
+#if defined(CPUINFO_ARCH_X86) || defined(CPUINFO_ARCH_ARM)
   const struct {
     std::string name;
     int64_t flag;
@@ -379,6 +380,9 @@ int64_t LinuxParseCpuFlags(const std::string& values) {
     }
   }
   return flags;
+#else
+  return 0;
+#endif
 }
 
 void OsRetrieveCacheSize(std::array* cache_sizes) {
```
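The idea in kou's patch is to compile the flag-mapping table only on architectures where it is meaningful, and fall back to returning 0 elsewhere, so the table never triggers an unused-value warning on other platforms. A self-contained sketch of that conditional-compilation pattern, using a made-up `DEMO_HAS_FLAGS` macro in place of Arrow's real `CPUINFO_ARCH_X86`/`CPUINFO_ARCH_ARM` macros and a tiny two-entry flag table:

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for CPUINFO_ARCH_X86 / CPUINFO_ARCH_ARM; remove
// this define to compile the fallback branch that always returns 0.
#define DEMO_HAS_FLAGS 1

// Parse a /proc/cpuinfo-style "flags" line into a bitmap. On unsupported
// architectures the table (and the loop over it) is compiled out entirely.
int64_t ParseCpuFlags(const std::string& values) {
#ifdef DEMO_HAS_FLAGS
  const struct {
    std::string name;
    int64_t flag;
  } flag_mappings[] = {
      {"sse4_2", INT64_C(1) << 0},
      {"avx2", INT64_C(1) << 1},
  };
  int64_t flags = 0;
  for (const auto& m : flag_mappings) {
    if (values.find(m.name) != std::string::npos) {
      flags |= m.flag;
    }
  }
  return flags;
#else
  return 0;
#endif
}
```

The `#else return 0;` branch keeps the function well-defined everywhere instead of leaving an expression with no effect, which is what the `-Werror=unused-value` report in the PR title was about.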
[GitHub] [arrow-adbc] kou commented on issue #366: [Discuss] Is the conventional commit format working?
kou commented on issue #366: URL: https://github.com/apache/arrow-adbc/issues/366#issuecomment-1399151048 > We can ask INFRA to change the setting to merge using the PR title/description, if people agree. I like the setting!
[GitHub] [arrow-datafusion] ursabot commented on pull request #4834: (#4462) Postgres compatibility tests using sqllogictest
ursabot commented on PR #4834: URL: https://github.com/apache/arrow-datafusion/pull/4834#issuecomment-1399149236 Benchmark runs are scheduled for baseline = b71cae8aa556369bc5ee72b063ed1fc5a81192f1 and contender = 1d69f28f14acf178377ecf55a343b6e71b4dd856. 1d69f28f14acf178377ecf55a343b6e71b4dd856 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/ebc718c2b1854f13b01b15b9ab814210...436e61feb07d482da9bb16171366b783/) [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/9c071dfa45984472843f37989cf6184f...d1c046c7b0cd4f8c9e2185b57735a6ae/) [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/84d58f2d947140169e6aa56dc6992ca0...5b5749ceb3254d02a9294123c31eaa8c/) [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/3fbb7ffd22844f168faeed49ff4bdbcc...9a91203aa4cf46d495f56c13bfe1f237/) Buildkite builds: Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
[GitHub] [arrow-datafusion] xudong963 merged pull request #4834: (#4462) Postgres compatibility tests using sqllogictest
xudong963 merged PR #4834: URL: https://github.com/apache/arrow-datafusion/pull/4834
[GitHub] [arrow-datafusion] xudong963 closed issue #4462: Replace python based integration test with sqllogictest
xudong963 closed issue #4462: Replace python based integration test with sqllogictest URL: https://github.com/apache/arrow-datafusion/issues/4462
[GitHub] [arrow-datafusion] xudong963 commented on pull request #4834: (#4462) Postgres compatibility tests using sqllogictest
xudong963 commented on PR #4834: URL: https://github.com/apache/arrow-datafusion/pull/4834#issuecomment-1399148189 merged, let's iterate!
[GitHub] [arrow] vibhatha commented on pull request #14596: ARROW-18258: [Docker] Substrait Integration Testing
vibhatha commented on PR #14596: URL: https://github.com/apache/arrow/pull/14596#issuecomment-1399110128 @github-actions crossbow submit test-conda-python-3.9-substrait
[GitHub] [arrow] westonpace commented on pull request #13487: ARROW-8991: [C++] Add hash_64 scalar compute function
westonpace commented on PR #13487: URL: https://github.com/apache/arrow/pull/13487#issuecomment-1399105512 @drin is this something you are still working on?
[GitHub] [arrow] vibhatha commented on pull request #14596: ARROW-18258: [Docker] Substrait Integration Testing
vibhatha commented on PR #14596: URL: https://github.com/apache/arrow/pull/14596#issuecomment-1399104798 @github-actions crossbow submit test-conda-python-3.9-substrait
[GitHub] [arrow] westonpace commented on issue #15098: FieldRef with clang-cl from VS 17.4.3 needs operator== with /std:c++20
westonpace commented on issue #15098: URL: https://github.com/apache/arrow/issues/15098#issuecomment-1399103925 It would seem we need a new CI environment for Visual Studio & C++20. It should be a fairly straightforward addition to https://github.com/apache/arrow/blob/master/.github/workflows/cpp.yml (we already have Windows / C++17). Would you be interested in contributing that?
[GitHub] [arrow] rok commented on pull request #33810: GH-33377: [Python] Table.drop should support passing a single column
rok commented on PR #33810: URL: https://github.com/apache/arrow/pull/33810#issuecomment-1399101391 > This is my first Arrow PR. I am very open to any feedback or suggestions you might have! Welcome @danepitkin! Thanks for opening the PR! > I followed the instructions as stated in the original GH issue, but there is now some redundancy between `Table.drop()`, `Table.drop_column()`, and `Table.remove_column()`. In my opinion, it would have been nicer to consolidate these from the start. However, since backwards compatibility is important, would you say this PR is the right approach? I would slightly prefer a single `Table.drop_column()` over having both `Table.drop_column()` and `Table.drop()`. I don't know if `.drop` -> `.drop_column` makes sense at this point. [Pandas uses `.drop` for everything](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) but it does index dropping as well. Perhaps you could just alias the two and introduce a deprecation warning? (e.g. [here](https://github.com/apache/arrow/blob/a16c54567ada6729311fd26bdbee4b5e61901410/python/pyarrow/plasma.py#L118)).
[GitHub] [arrow] westonpace commented on issue #15281: basic_string_view will be invalid in future libc++
westonpace commented on issue #15281: URL: https://github.com/apache/arrow/issues/15281#issuecomment-1399101379 Normally I think we'd want to set up a CI job to catch this sort of regression. However, clang-18 is not yet released. Do you have any suggestion on how a regression here could be caught?
[GitHub] [arrow] vibhatha commented on a diff in pull request #14596: ARROW-18258: [Docker] Substrait Integration Testing
vibhatha commented on code in PR #14596: URL: https://github.com/apache/arrow/pull/14596#discussion_r1083164628 ## ci/docker/conda-python-substrait.dockerfile: ## @@ -0,0 +1,45 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +ARG repo +ARG arch +ARG python=3.9 + +FROM ${repo}:${arch}-conda-python-${python} + +COPY ci/conda_env_python.txt \ + ci/conda_env_sphinx.txt \ + /arrow/ci/ +RUN mamba install -q -y \ +--file arrow/ci/conda_env_python.txt \ +--file arrow/ci/conda_env_sphinx.txt \ +$([ "$python" == "3.9" ] && echo "pickle5") \ +python=${python} \ Review Comment: you're probably correct -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace commented on issue #15103: Weighted stat aggregations in arrow-compute
westonpace commented on issue #15103: URL: https://github.com/apache/arrow/issues/15103#issuecomment-1399100195 There is a proposal to support aggregate UDFs but I don't know that it is high priority for anyone. I agree this sounds like a nice feature. Would you be interested in creating a PR?
[GitHub] [arrow] vibhatha commented on a diff in pull request #14596: ARROW-18258: [Docker] Substrait Integration Testing
vibhatha commented on code in PR #14596: URL: https://github.com/apache/arrow/pull/14596#discussion_r1083164377

## ci/scripts/integration_substrait.sh: ##
@@ -0,0 +1,30 @@
+#!/usr/bin/env bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+set -e
+
+# check that optional pyarrow modules are available
+# because pytest would just skip the substrait tests
+echo "Substrait Integration Tests";
+python -c "from substrait_consumer.consumers import AceroConsumer"
+python -c "import pyarrow.orc"
Review Comment:
  You're correct, I haven't cleaned this up properly. Very early draft.
[GitHub] [arrow] westonpace commented on issue #33627: [C++][HDFS] Can't get performance improve when increase the thread number of IO thread pool
westonpace commented on issue #33627: URL: https://github.com/apache/arrow/issues/33627#issuecomment-1399093386 No, I don't think you are wrong. You should be seeing more parallelism. I am not sure when I will get a chance to fully investigate this, however. I don't see anything in here that would lead threads to share HDFS connections, but I don't know anything about how HDFS works.
[GitHub] [arrow] kou commented on a diff in pull request #33805: GH-33804: [Python] Add support for manylinux_2_28 wheel
kou commented on code in PR #33805: URL: https://github.com/apache/arrow/pull/33805#discussion_r1083143197

## docker-compose.yml: ##
@@ -967,6 +970,31 @@ services:
       - ${DOCKER_VOLUME_PREFIX}python-wheel-manylinux2014-ccache:/ccache:delegated
     command: /arrow/ci/scripts/python_wheel_manylinux_build.sh
 
+  # See available versions at:
+  #    https://quay.io/repository/pypa/manylinux_2_28_x86_64?tab=tags
+  python-wheel-manylinux-2-28:
+    image: ${REPO}:${ARCH}-python-${PYTHON}-wheel-manylinux-2-28-vcpkg-${VCPKG}
+    build:
+      args:
+        arch: ${ARCH}
+        arch_short: ${ARCH_SHORT}
+        base: quay.io/pypa/manylinux_2_28_${ARCH_ALIAS}:2023-01-14-103cb93
+        vcpkg: ${VCPKG}
+        python: ${PYTHON}
+        manylinux: 2_28
+      context: .
+      dockerfile: ci/docker/python-wheel-manylinux-x-y.dockerfile
+      cache_from:
+        - ${REPO}:${ARCH}-python-${PYTHON}-wheel-manylinux-2-28-vcpkg-${VCPKG}
+    environment:
+      <<: *ccache
+    volumes:
+      - .:/arrow:delegated
+      - ${DOCKER_VOLUME_PREFIX}python-wheel-manylinux228-ccache:/ccache:delegated
+    command: /arrow/ci/scripts/python_wheel_manylinux_build.sh
+
+
+
Review Comment:
  ```suggestion
  ```

## docker-compose.yml: ##
@@ -180,6 +181,8 @@ volumes:
     name: maven-cache
   python-wheel-manylinux2014-ccache:
     name: python-wheel-manylinux2014-ccache
+  python-wheel-manylinux228-ccache:
+    name: python-wheel-manylinux228-ccache
Review Comment:
  ```suggestion
    python-wheel-manylinux-2-28-ccache:
      name: python-wheel-manylinux-2-28-ccache
  ```

## ci/docker/python-wheel-manylinux-x-y.dockerfile: ##
@@ -0,0 +1,95 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+ARG base
+FROM ${base}
+
+ARG arch
+ARG arch_short
+ARG manylinux
+
+ENV MANYLINUX_VERSION=${manylinux}
+
+# Install basic dependencies
+RUN yum install -y git flex curl autoconf zip perl-IPC-Cmd wget
+
+ENV HOST_PYTHON_VERSION=3.8
+RUN HOST_PYTHON_ROOT=$(find /opt/python -name cp${HOST_PYTHON_VERSION/./}-*) && \
+    echo "export PATH=$HOST_PYTHON_ROOT/bin:\$PATH" >> /python.sh
+
+# Install CMake
+# AWS SDK doesn't work with CMake=3.22 due to https://gitlab.kitware.com/cmake/cmake/-/issues/22524
+ARG cmake=3.21.4
+COPY ci/scripts/install_cmake.sh arrow/ci/scripts/
+RUN source /python.sh && /arrow/ci/scripts/install_cmake.sh ${arch} linux ${cmake} /usr/local
+
+# Install Ninja
+ARG ninja=1.10.2
+COPY ci/scripts/install_ninja.sh arrow/ci/scripts/
+RUN source /python.sh && /arrow/ci/scripts/install_ninja.sh ${ninja} /usr/local
+
+# Install ccache
+ARG ccache=4.1
+COPY ci/scripts/install_ccache.sh arrow/ci/scripts/
+RUN source /python.sh && /arrow/ci/scripts/install_ccache.sh ${ccache} /usr/local
+
+RUN echo "Hello World"
Review Comment:
  ```suggestion
  ```

## ci/docker/python-wheel-manylinux-x-y.dockerfile: ##
@@ -0,0 +1,95 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+ARG base
+FROM ${base}
+
+ARG arch
+ARG arch_short
+ARG manylinux
+
+ENV MANYLINUX_VERSION=${manylinux}
+
+# Install basic dependencies
+RUN yum install -y git flex curl autoconf zip perl-IPC-Cmd wget
+
+ENV HOST_PYTHON_VERSION=3.8
+RUN HOST_PYTHON_ROOT=$(find /opt/python -name cp${HOST_PYTHON_VERSION/./}-*) && \
+    echo "export PATH=$HOST_PYTHON_ROOT/bin:\$PATH" >> /python.sh
Review Comment:
  Does the following work?
  ```suggestion
  ENV HOST_PYTHON_VERSION=3.8
  ENV PATH=
  ```
[GitHub] [arrow] westonpace commented on issue #33668: Reading flat dataset with `partitioning="hive"` results in partition schema equal to dataset schema
westonpace commented on issue #33668: URL: https://github.com/apache/arrow/issues/33668#issuecomment-1399087997 I can confirm. I reproduced this with the latest and agree it is a bug. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson commented on pull request #33770: GH-33760: [R][C++] Handle nested field refs in scanner
nealrichardson commented on PR #33770: URL: https://github.com/apache/arrow/pull/33770#issuecomment-1399086990 > > I thought this was just going to be deleting code from the R package: instead of finding the top-level field names in the projection and sending them in the ScanNode, I'd send the projection and the scanner would pull out the fields. > > > > Got it, so this is not a behavior change, but moving the logic from R into C++ where it would appropriately belong? Right; nested field refs are currently not handled by the C++ scanner. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-rs] albel727 commented on issue #3579: `nullif` incorrectly calculates `null_count`, sometimes panics with subtraction overflow error
albel727 commented on issue #3579: URL: https://github.com/apache/arrow-rs/issues/3579#issuecomment-1399086004 Just in case, here's the code that I used to validate that the quick fix works:

```rust
std::panic::set_hook(Box::new(|_info| { /* silence panics */ }));
for n in 0..=128 + 64 {
    let left = Int32Array::from((0..n as i32).into_iter().collect::<Vec<_>>());
    for somes in 0..=n {
        for trues in 0..=somes {
            let right = BooleanArray::from(
                (0..n)
                    .into_iter()
                    .map(|i| Some(i < trues).filter(|_| i < somes))
                    .collect::<Vec<_>>(),
            );
            let ok = std::panic::catch_unwind(|| {
                let arr = arrow_select::nullif::nullif(&left, &right).unwrap();
                let arr: &Int32Array = arrow_array::cast::as_primitive_array(&arr);
                arrow_array::Array::data(arr).validate_full().unwrap();
            })
            .is_ok();
            if !ok {
                println!("{n} {somes} {trues}");
            }
        }
    }
}
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace commented on pull request #14799: ARROW-18417: [C++] Support emit info in Substrait extension-multi and AsOfJoin
westonpace commented on PR #14799: URL: https://github.com/apache/arrow/pull/14799#issuecomment-1399085737 @rtpsw this will need a rebase. Some of the changes here were brought in with the backpressure change. Would you like me to do this or do you want to? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-rs] albel727 opened a new issue, #3579: `nullif` incorrectly calculates `null_count`, sometimes panics with subtraction overflow error
albel727 opened a new issue, #3579: URL: https://github.com/apache/arrow-rs/issues/3579

**Describe the bug** `nullif(left, right)` incorrectly calculates `null_count` for the result array whenever `left` doesn't have a null_buffer and has `len % 64 == 0`. It can even panic if there are fewer than 64 true values in `right`.

**To Reproduce**
```rust
let n = 64;
let left = Int32Array::from((0..n as i32).into_iter().collect::<Vec<_>>());
let right = BooleanArray::from((0..n).into_iter().map(|_| None).collect::<Vec<Option<bool>>>());
arrow_select::nullif::nullif(&left, &right);
```

**Expected behavior** It works.

**Actual behavior** It panics with 'attempt to subtract with overflow' at this line: https://github.com/apache/arrow-rs/blob/796b670338ce33806a39777ea18cf6fae8fa7ee4/arrow-select/src/nullif.rs#L107

**Additional context** The problem evaded fixes #3034 and #3263, which were incomplete. The wrong calculation occurs in these lines: https://github.com/apache/arrow-rs/blob/796b670338ce33806a39777ea18cf6fae8fa7ee4/arrow-select/src/nullif.rs#L81-L90 The reason it is wrong is the unintuitive, if not to say wrong, behavior of `bitwise_unary_op_helper()` and friends. They call `op()` **unconditionally** on the remainder bits, even if there are none: https://github.com/apache/arrow-rs/blob/3a90654f4cb98ac7fe278991a6cbc11384664e2e/arrow-buffer/src/buffer/ops.rs#L119-L123 It doesn't make a difference for the produced output buffer, because it is then just extended with a slice of 0 bytes, i.e. it remains unaffected. But it does matter for `FnMut` closures with side effects, such as the one found in `nullif`, which, as a result of this behavior, adds an excess of 64 to `valid_count` when there are no remainder bits, i.e. when the bit length is a multiple of 64. The quick fix is to do `valid_count -= 64 - remainder_len;` in `nullif()` unconditionally, i.e. just remove the `if remainder_len != 0 {` condition around it.
It was clearly written on the assumption that `bitwise_unary_op_helper()` doesn't call `op()` in excess when there are no remainder bits. A better fix could have been to avoid making excess `op()` calls in `bitwise_unary_op_helper()` and friends, but since they're public, that could lead to bugs in external consumers, which might have come to rely on this weird behavior. In any case, I suggest at least checking whether other usages of `bitwise_unary_op_helper()` and friends in arrow-rs make the same incorrect assumption. Temporarily changing the type of the `op` parameter from `FnMut` to `Fn` and looking at the compilation-error fallout should point out all suspicious places. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
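The miscount albel727 describes can be modeled in a few lines of Python — a toy model of the 64-bit chunk walk, not arrow-rs code, included only to make the arithmetic concrete:

```python
# Toy model of the chunked bitmap walk described in the issue: the helper
# processes len // 64 full 64-bit chunks and then ALWAYS makes one more op()
# call for the remainder, even when remainder_len == 0. A side-effecting
# op() therefore runs once too often whenever len is a multiple of 64.

def op_calls(bit_len: int) -> int:
    full_chunks = bit_len // 64
    return full_chunks + 1  # the remainder op() is invoked unconditionally

def valid_count_correction(bit_len: int) -> int:
    # Quick fix from the issue: subtract the phantom bits unconditionally,
    # i.e. without the `if remainder_len != 0` guard around the subtraction.
    remainder_len = bit_len % 64
    return 64 - remainder_len

# With bit_len == 64 the old guarded code skipped the correction entirely,
# leaving valid_count 64 too high, which later made the null_count
# subtraction overflow.
assert op_calls(64) == 2               # one full chunk + one empty remainder call
assert valid_count_correction(64) == 64
assert valid_count_correction(70) == 58
```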
[GitHub] [arrow] vibhatha commented on a diff in pull request #14596: ARROW-18258: [Docker] Substrait Integration Testing
vibhatha commented on code in PR #14596: URL: https://github.com/apache/arrow/pull/14596#discussion_r1083145165 ## ci/scripts/integration_substrait.sh: ## @@ -0,0 +1,30 @@ +#!/usr/bin/env bash +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +set -e + +# check that optional pyarrow modules are available +# because pytest would just skip the substrait tests +echo "Substrait Integration Tests"; +python -c "from substrait_consumer.consumers import AceroConsumer" +python -c "import pyarrow.orc" +python -c "import pyarrow.substrait" + +pytest substrait_consumer/tests/functional/extension_functions/test_boolean_functions.py --producer IsthmusProducer --consumer AceroConsumer Review Comment: 😀 no choice yet -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace commented on issue #29428: [R] accept expression lists in Scanner$create() with arrow_dplyr_querys
westonpace commented on issue #29428: URL: https://github.com/apache/arrow/issues/29428#issuecomment-1399080062 +1 for some kind of `as_record_batch_reader` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace commented on pull request #33770: GH-33760: [R][C++] Handle nested field refs in scanner
westonpace commented on PR #33770: URL: https://github.com/apache/arrow/pull/33770#issuecomment-1399077722 > I thought this was just going to be deleting code from the R package: instead of finding the top-level field names in the projection and sending them in the ScanNode, I'd send the projection and the scanner would pull out the fields. Got it, so this is not a behavior change, but moving the logic from R into C++ where it would appropriately belong? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace commented on issue #15138: [C++] Clustered By -- how?
westonpace commented on issue #15138: URL: https://github.com/apache/arrow/issues/15138#issuecomment-1399076522 Internally we do this by hashing the column. There is a PR under way (though it could use some review) to add a new hash compute function (https://github.com/apache/arrow/pull/13487). If that were to merge then you could approximate this pretty well by partitioning on bit_wise_and(hash(x), mask) where `mask` is something like `0x7` (for 8 groups) or `0xF` (for 16 groups), etc. To support non-powers-of-two groups we would need modulo. There is a PR in place for modulo but it seems to have stalled. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
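The hash-and-mask bucketing westonpace describes can be mocked up in plain Python; here the built-in `hash()` stands in for the proposed Arrow hash kernel (an assumption for illustration only — the real kernel in apache/arrow#13487 may behave differently):

```python
# Partition rows into 2^k groups via bit_wise_and(hash(x), mask), as sketched
# in the comment. mask = 0x7 -> 8 groups, mask = 0xF -> 16 groups, etc.
def bucket(value, mask: int) -> int:
    return hash(value) & mask

groups = {bucket(v, 0x7) for v in range(1_000)}
assert groups <= set(range(8))  # every row lands in one of 8 buckets

# Non-power-of-two group counts are why modulo would be needed instead:
def bucket_mod(value, n_groups: int) -> int:
    return hash(value) % n_groups

assert all(0 <= bucket_mod(v, 5) < 5 for v in range(-100, 100))
```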
[GitHub] [arrow] ursabot commented on pull request #33725: GH-33724: [Doc] Update the substrait conformance doc with the latest support
ursabot commented on PR #33725: URL: https://github.com/apache/arrow/pull/33725#issuecomment-1399074351 Benchmark runs are scheduled for baseline = f7aa50dbeccdcc800a0ffc695b107c1cdc688156 and contender = a16c54567ada6729311fd26bdbee4b5e61901410. a16c54567ada6729311fd26bdbee4b5e61901410 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Failed] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/d176eae6bed441eb97ee4baa57dc3fdf...a2c888cbc9e54acbb087f82996ef2976/) [Failed] [test-mac-arm](https://conbench.ursa.dev/compare/runs/4f3232dcf7d04ea082736425413a1641...9c950d82a1a141cb9555bb25a732624c/) [Failed] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/b8cf43c42a19473b8089647069d4f1c5...9ac607cae658446e885a40089cb71e52/) [Failed] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b56d62491b8b490287ee545f66a3526e...6c0e57252df347318d11432fdaa0cdb9/) Buildkite builds: [Failed] [`a16c5456` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2239) [Failed] [`a16c5456` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2264) [Failed] [`a16c5456` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2234) [Failed] [`a16c5456` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2257) [Failed] [`f7aa50db` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2238) [Failed] [`f7aa50db` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2263) [Failed] [`f7aa50db` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2233) [Failed] [`f7aa50db` 
ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2256) Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace commented on a diff in pull request #33775: ARROW-18425: [Substrait] Add Substrait→Acero mapping for round operationMajor:
westonpace commented on code in PR #33775: URL: https://github.com/apache/arrow/pull/33775#discussion_r1083118332 ## cpp/src/arrow/compute/api_scalar.h: ## @@ -882,6 +891,20 @@ ARROW_EXPORT Result Round(const Datum& arg, RoundOptions options = RoundOptions::Defaults(), ExecContext* ctx = NULLPTR); +/// \brief Round a value to a given precision. +/// +/// If argument is null the result will be null. Review Comment: Which argument? I assume the output is null if either argument is null? Can we be more explicit. ## cpp/src/arrow/compute/kernels/scalar_round_arithmetic_test.cc: ## @@ -113,7 +112,7 @@ class TestBaseUnaryRoundArithmetic : public ::testing::Test { // (Array, Array) void AssertUnaryOp(UnaryFunction func, const std::shared_ptr& arg, const std::shared_ptr& expected) { -ASSERT_OK_AND_ASSIGN(auto actual, func(arg, options_, nullptr)); +ASSERT_OK_AND_ASSIGN(auto actual, func(arg, options_, nullptr)) Review Comment: While this does compile and work we try and add a `;` to the end of these kinds of macros anyways for readability ```suggestion ASSERT_OK_AND_ASSIGN(auto actual, func(arg, options_, nullptr)); ``` ## cpp/src/arrow/engine/substrait/extension_set.cc: ## @@ -790,6 +796,30 @@ ExtensionIdRegistry::SubstraitCallToArrow DecodeOptionlessUncheckedArithmetic( }; } +ExtensionIdRegistry::SubstraitCallToArrow DecodeBinaryRoundingMode( +const std::string& function_name) { + return [function_name](const SubstraitCall& call) -> Result { +ARROW_ASSIGN_OR_RAISE( +compute::RoundMode round_mode, +ParseOptionOrElse( +call, "rounding", kRoundModeParser, +{compute::RoundMode::DOWN, compute::RoundMode::UP, + compute::RoundMode::TOWARDS_ZERO, compute::RoundMode::TOWARDS_INFINITY, + compute::RoundMode::HALF_DOWN, compute::RoundMode::HALF_UP, + compute::RoundMode::HALF_TOWARDS_ZERO, + compute::RoundMode::HALF_TOWARDS_INFINITY, compute::RoundMode::HALF_TO_EVEN, + compute::RoundMode::HALF_TO_ODD}, +compute::RoundMode::HALF_TOWARDS_INFINITY)); +ARROW_ASSIGN_OR_RAISE(std::vector 
value_args, + GetValueArgs(call, 0)); +std::shared_ptr options = Review Comment: Do we want to optimize and call the unary round if the second value is a scalar? If not in this PR can we create a follow-up github issue so we don't lose track of it? Or maybe round_binary itself can fallback to unary rounding if the second argument is scalar. ## cpp/src/arrow/compute/api_scalar.h: ## @@ -882,6 +891,20 @@ ARROW_EXPORT Result Round(const Datum& arg, RoundOptions options = RoundOptions::Defaults(), ExecContext* ctx = NULLPTR); +/// \brief Round a value to a given precision. +/// +/// If argument is null the result will be null. +/// +/// \param[in] arg1 the value rounded Review Comment: ```suggestion /// \param[in] arg1 the value to be rounded ``` ## cpp/src/arrow/compute/api_scalar.h: ## @@ -882,6 +891,20 @@ ARROW_EXPORT Result Round(const Datum& arg, RoundOptions options = RoundOptions::Defaults(), ExecContext* ctx = NULLPTR); +/// \brief Round a value to a given precision. +/// +/// If argument is null the result will be null. +/// +/// \param[in] arg1 the value rounded +/// \param[in] arg2 the number of significant digits to round to +/// \param[in] options rounding options (rounding mode and number of digits), optional Review Comment: ```suggestion /// \param[in] options rounding options (rounding mode), optional ``` Or just get rid of the parentheses section entirely. ## cpp/src/arrow/compute/api_scalar.h: ## @@ -882,6 +891,20 @@ ARROW_EXPORT Result Round(const Datum& arg, RoundOptions options = RoundOptions::Defaults(), ExecContext* ctx = NULLPTR); +/// \brief Round a value to a given precision. +/// +/// If argument is null the result will be null. +/// +/// \param[in] arg1 the value rounded +/// \param[in] arg2 the number of significant digits to round to Review Comment: Can this be negative? Do we define elsewhere what that entails? 
## cpp/src/arrow/compute/kernels/scalar_round_arithmetic_test.cc: ## @@ -18,7 +18,6 @@ #include #include #include -#include Review Comment: Our guideline for includes is [`iwyu`](https://include-what-you-use.org/). We don't always follow it perfectly (the conformance tool doesn't like type_fwd files) but it is what we aim for. Please don't remove includes if they are used in the file (I still see many instances of `std::string`) even if the file compiles otherwise (transitive includes are potentially unstable). ## cpp/src/arrow/compute/kernels/scalar_round.cc: ## @@ -751,60 +877,25 @@ Array
[GitHub] [arrow] zeroshade commented on pull request #14989: ARROW-18438: [Go][Parquet] Panic in bitmap writer
zeroshade commented on PR #14989: URL: https://github.com/apache/arrow/pull/14989#issuecomment-1399068130 @minyoung I found the issue and the solution that's not a hack: in parquet/file/column_writer_types.gen.go.tmpl lines 143 - 147 change this: ```go if w.bitsBuffer != nil { w.writeValuesSpaced(vals, info.batchNum, w.bitsBuffer.Bytes(), 0) } else { w.writeValuesSpaced(vals, info.batchNum, validBits, validBitsOffset+valueOffset) } ``` To this instead: ```go if w.bitsBuffer != nil { w.writeValuesSpaced(vals, batch, w.bitsBuffer.Bytes(), 0) } else { w.writeValuesSpaced(vals, batch, validBits, validBitsOffset+valueOffset) } ``` Note the change from `info.batchNum` to `batch`. It turns out that in this scenario we should be passing the full batch size to write and not just the number of raw values it found. This way it properly calculates the number of nulls in the parent (non-leaf) columns. After you make this change, re-run `go generate` so that it re-generates all of the writers. Finally, on line 289 of `pqarrow/column_readers.go`, change the argument to BuildArray from `validityIO.Read` to `lenBound`. That way it creates the correctly sized null bitmap buffer. in my testing that was enough to solve the problem, leave your new tests in for good measure so this bug doesn't creep back :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] danepitkin commented on issue #33377: [Python] Table.drop should support passing a single column
danepitkin commented on issue #33377: URL: https://github.com/apache/arrow/issues/33377#issuecomment-1399062211 Whoops, looks like I did not correctly link my PR to the original issue. My PR is here https://github.com/apache/arrow/pull/33810. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] danepitkin commented on pull request #33810: GH-33377: [Python] Table.drop should support passing a single column
danepitkin commented on PR #33810: URL: https://github.com/apache/arrow/pull/33810#issuecomment-1399047969 This is my first Arrow PR. I am very open to any feedback or suggestions you might have! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace merged pull request #33725: GH-33724: [Doc] Update the substrait conformance doc with the latest support
westonpace merged PR #33725: URL: https://github.com/apache/arrow/pull/33725 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace commented on issue #33759: [Python][C++] How to limit the memory consumption of to_batches()
westonpace commented on issue #33759: URL: https://github.com/apache/arrow/issues/33759#issuecomment-1399046967 Hmm, backpressure should be applied then. Once you call `to_batches` it should start to read in the background. Eventually, at a certain point, it should stop reading because too much data has accumulated. This is normally around a few GB. You mention there are 13k fragments, just to confirm this is 13k files right? How large is each file? How many row groups are in each file? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] danepitkin opened a new pull request, #33810: GH-33377: [Python] Table.drop should support passing a single column
danepitkin opened a new pull request, #33810: URL: https://github.com/apache/arrow/pull/33810 ### Rationale for this change Provide a better user experience in pyarrow when working with `Table`. ### What changes are included in this PR? Allow `Table.drop()` to accept a single column name as a `str` argument. Provide a wrapper `Table.drop_column(str)`, which calls `Table.drop()`, to match similar APIs such as `add_column()`, `append_column()`. ### Are these changes tested? Updated the pytest for `Table.drop()` and added a pytest for `Table.drop_column()`. Verified both test cases ran successfully locally. ### Are there any user-facing changes? Yes, a new pyarrow API is added. The existing pyarrow API is backwards compatible, but does support additional types of input now (str). The doc strings are updated so I assume the python API reference will be auto-updated somehow? Let me know if this is not the case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace commented on issue #33790: [Python] Support for reading .csv files from a zip archive
westonpace commented on issue #33790: URL: https://github.com/apache/arrow/issues/33790#issuecomment-1399044202 Outside of datasets this is normally achieved by opening a compressed input stream and using the CSV stream reader. If the path ends in `.gz` or `.bz2` I think we also guess that it is compressed and do this for you. Within datasets there are a few un/under documented features which may help. There is a similar "extension guessing" mechanism: https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/file_base.cc#L93 So if your files end in `gz` or `gzip` it should automatically be picked up. There is also `stream_transform_func` as part of the dataset-csv options which takes an arbitrary callable that transforms the stream before you start reading it. In theory this could maybe be used to provide support for zipped files. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace commented on issue #33797: [C++] Add decimal version of Round benchmarks
westonpace commented on issue #33797: URL: https://github.com/apache/arrow/issues/33797#issuecomment-1399039367 @aayushpandey014 I have assigned this to you. In the future you can always comment `take` and our bots will assign an issue to you. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wjones127 commented on a diff in pull request #14353: ARROW-17735: [C++][Parquet] Optimize parquet reading for String/Binary type
wjones127 commented on code in PR #14353: URL: https://github.com/apache/arrow/pull/14353#discussion_r1083109332 ## cpp/src/parquet/encoding.h: ## @@ -317,6 +317,13 @@ class TypedDecoder : virtual public Decoder { int64_t valid_bits_offset, typename EncodingTraits::Accumulator* out) = 0; + virtual int DecodeArrow_opt(int num_values, int null_count, const uint8_t* valid_bits, + int32_t* offset, + std::shared_ptr<::arrow::ResizableBuffer>& values, + int64_t valid_bits_offset, int32_t* bianry_length) { +return 0; Review Comment: It seems like we are adding this because the other methods are based on builders (in the accumulator), and builders don't provide a way to transfer multiple values in one `memcpy`. Does that sound right? I wonder if we could add such a method on builders, and that might be a cleaner solution. Something like: ```cpp class BaseBinaryBuilder { ... Status UnsafeAppendValues(const uint8_t* values, int64_t length, const uint8_t* valid_bytes = NULL_PTR) { ... } }; ``` The reason getting back to builders might be good is that I don't think these changes handle the case where there are more values than can fit into a StringArray. The current implementation will split it up into chunks if it reaches capacity, and I think we need to keep that behavior. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ursabot commented on pull request #33809: MINOR: [R] Update BugReports field in DESCRIPTION
ursabot commented on PR #33809: URL: https://github.com/apache/arrow/pull/33809#issuecomment-1399037705 Benchmark runs are scheduled for baseline = f9ce32ebab5071b8fc48a135a730c22313aaf9b3 and contender = f7aa50dbeccdcc800a0ffc695b107c1cdc688156. f7aa50dbeccdcc800a0ffc695b107c1cdc688156 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Failed] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/4688a72856d74b63b4494800e30e4bfe...d176eae6bed441eb97ee4baa57dc3fdf/) [Failed] [test-mac-arm](https://conbench.ursa.dev/compare/runs/387f736caaf5494d9cad1f4081ca78d4...4f3232dcf7d04ea082736425413a1641/) [Failed] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/aff01843b1ca4ea79a6a5cbdeddc1644...b8cf43c42a19473b8089647069d4f1c5/) [Failed] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/3cad7f913b124ec1af1cd10ff0385c78...b56d62491b8b490287ee545f66a3526e/) Buildkite builds: [Failed] [`f7aa50db` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2238) [Failed] [`f7aa50db` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2263) [Failed] [`f7aa50db` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2233) [Failed] [`f7aa50db` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2256) [Failed] [`f9ce32eb` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2237) [Failed] [`f9ce32eb` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2262) [Failed] [`f9ce32eb` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2232) [Failed] [`f9ce32eb` 
ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2255) Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion-python] martin-g commented on pull request #129: test: Expand tests for built-in functions
martin-g commented on PR #129: URL: https://github.com/apache/arrow-datafusion-python/pull/129#issuecomment-1399036701 I always prefer using `git rebase` for Pull Request branches. This way the commit history is cleaner. Rebasing also makes it easier to use the `Squash and merge` GitHub UI functionality. The workflows should be triggered manually for first-time contributors. Once your first PR is merged, all following PRs will execute the checks automatically! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] richtia commented on a diff in pull request #14596: ARROW-18258: [Docker] Substrait Integration Testing
richtia commented on code in PR #14596: URL: https://github.com/apache/arrow/pull/14596#discussion_r1083107221 ## ci/docker/conda-python-substrait.dockerfile: ## @@ -0,0 +1,45 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +ARG repo +ARG arch +ARG python=3.9 + +FROM ${repo}:${arch}-conda-python-${python} + +COPY ci/conda_env_python.txt \ + ci/conda_env_sphinx.txt \ + /arrow/ci/ +RUN mamba install -q -y \ +--file arrow/ci/conda_env_python.txt \ +--file arrow/ci/conda_env_sphinx.txt \ +$([ "$python" == "3.9" ] && echo "pickle5") \ +python=${python} \ Review Comment: where do we install openjdk? should it be here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #4834: (#4462) Postgres compatibility tests using sqllogictest
alamb commented on PR #4834: URL: https://github.com/apache/arrow-datafusion/pull/4834#issuecomment-1399016790 Assuming this PR passes CI checks, I plan to merge.
[GitHub] [arrow-datafusion] alamb commented on pull request #4834: (#4462) Postgres compatibility tests using sqllogictest
alamb commented on PR #4834: URL: https://github.com/apache/arrow-datafusion/pull/4834#issuecomment-1399015229 > For now, I merge master into this branch, so that it is mergeable. Thanks @melgenek. I filed issues for the follow-on tasks:
- [ ] https://github.com/apache/arrow-datafusion/issues/5009
- [ ] https://github.com/apache/arrow-datafusion/issues/5010
- [ ] https://github.com/apache/arrow-datafusion/issues/5011
[GitHub] [arrow-datafusion] alamb opened a new issue, #5011: [sqllogictest] Remove `integration-tests` directory
alamb opened a new issue, #5011: URL: https://github.com/apache/arrow-datafusion/issues/5011 **Is your feature request related to a problem or challenge? Please describe what you are trying to do.** https://github.com/apache/arrow-datafusion/pull/4834 adds a way to run sqllogictests using postgres 🦾 (kudos to @melgenek ) This now duplicates the coverage in https://github.com/apache/arrow-datafusion/tree/master/integration-tests as described in https://github.com/apache/arrow-datafusion/issues/4462 **Describe the solution you'd like** Remove https://github.com/apache/arrow-datafusion/tree/master/integration-tests and the CI checks that use it
[GitHub] [arrow-datafusion] alamb opened a new issue, #5010: [sqllogictest] Consolidate normalization code for the postgres and non-postgres paths
alamb opened a new issue, #5010: URL: https://github.com/apache/arrow-datafusion/issues/5010 **Is your feature request related to a problem or challenge? Please describe what you are trying to do.** https://github.com/apache/arrow-datafusion/pull/4834 adds a way to run sqllogictests using postgres 🦾 (kudos to @melgenek ) However, it uses different normalization code for the postgres and non-postgres paths **Describe the solution you'd like** I would like a single set of comparison code that works for both the postgres and non-postgres paths
[GitHub] [arrow] ursabot commented on pull request #33795: GH-33794: [Go] Add SetRecordReader to PreparedStatement
ursabot commented on PR #33795: URL: https://github.com/apache/arrow/pull/33795#issuecomment-1399012764 Benchmark runs are scheduled for baseline = 0e4a2e19e36d70a3072ce5275129d15fdb187c64 and contender = f9ce32ebab5071b8fc48a135a730c22313aaf9b3. f9ce32ebab5071b8fc48a135a730c22313aaf9b3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
[Failed] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/deff7301a8094167a5f3458685d0c0e7...4688a72856d74b63b4494800e30e4bfe/)
[Failed] [test-mac-arm](https://conbench.ursa.dev/compare/runs/4ff6080bb6794a91b251f13e894603e4...387f736caaf5494d9cad1f4081ca78d4/)
[Failed] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/48362c3cb97c43f2afb7c20547cf4ad0...aff01843b1ca4ea79a6a5cbdeddc1644/)
[Failed] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/747a35465dfa415fa603317ae0ca67e0...3cad7f913b124ec1af1cd10ff0385c78/)

Buildkite builds:
[Failed] [`f9ce32eb` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2237)
[Failed] [`f9ce32eb` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2262)
[Failed] [`f9ce32eb` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2232)
[Failed] [`f9ce32eb` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2255)
[Failed] [`0e4a2e19` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2236)
[Failed] [`0e4a2e19` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2261)
[Failed] [`0e4a2e19` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2231)
[Failed] [`0e4a2e19` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2254)

Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
[GitHub] [arrow-datafusion] alamb opened a new issue, #5009: [sqllogictest] Don't orchestrate the postgres containers with rust / docker
alamb opened a new issue, #5009: URL: https://github.com/apache/arrow-datafusion/issues/5009 **Is your feature request related to a problem or challenge? Please describe what you are trying to do.** https://github.com/apache/arrow-datafusion/pull/4834 adds a way to run sqllogictests using postgres 🦾 (kudos to @melgenek ) However, it currently orchestrates the postgres containers with rust test code, which means docker is required to run the (full) datafusion test suite locally **Describe the solution you'd like** 1. Remove `testcontainers` dependency 2. Allow sqllogictest `PG_COMPAT` mode to run against an existing postgres database (supply the connection information)
[GitHub] [arrow] emkornfield commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
emkornfield commented on code in PR #14964: URL: https://github.com/apache/arrow/pull/14964#discussion_r1083090721 ## cpp/src/parquet/page_index.h: ## @@ -126,4 +132,94 @@ class PARQUET_EXPORT OffsetIndex { virtual const std::vector& page_locations() const = 0; }; +/// \brief Interface for reading the page index for a Parquet row group. +class PARQUET_EXPORT RowGroupPageIndexReader { + public: + virtual ~RowGroupPageIndexReader() = default; + + /// \brief Read column index of a column chunk. + /// + /// \param[in] i column ordinal of the column chunk. + /// \returns column index of the column or nullptr if it does not exist. + /// \throws ParquetException if the index is out of bound. + virtual std::shared_ptr GetColumnIndex(int32_t i) = 0; + + /// \brief Read offset index of a column chunk. + /// + /// \param[in] i column ordinal of the column chunk. + /// \returns offset index of the column or nullptr if it does not exist. + /// \throws ParquetException if the index is out of bound. + virtual std::shared_ptr GetOffsetIndex(int32_t i) = 0; +}; + +struct IndexSelection { + /// Specifies whether to read the column index. + bool column_index = false; + /// Specifies whether to read the offset index. + bool offset_index = false; +}; + +struct RowGroupIndexReadRange { + /// Base start and total size of column index of all column chunks in a row group. + /// If none of the column chunks have column index, it is set to std::nullopt. + std::optional<::arrow::io::ReadRange> column_index = std::nullopt; + /// Base start and total size of offset index of all column chunks in a row group. + /// If none of the column chunks have offset index, it is set to std::nullopt. + std::optional<::arrow::io::ReadRange> offset_index = std::nullopt; +}; + +/// \brief Interface for reading the page index for a Parquet file. +class PARQUET_EXPORT PageIndexReader { + public: + virtual ~PageIndexReader() = default; + + /// \brief Create a PageIndexReader instance. 
+ /// \returns a PageIndexReader instance. + /// WARNING: The returned PageIndexReader references to all the input parameters, so + /// it must not outlive all of the input parameters. Usually these input parameters + /// come from the same ParquetFileReader object, so it must not outlive the reader + /// that creates this PageIndexReader. + static std::shared_ptr Make( + ::arrow::io::RandomAccessFile* input, std::shared_ptr file_metadata, + const ReaderProperties& properties, + std::shared_ptr file_decryptor = NULLPTR); + + /// \brief Get the page index reader of a specific row group. + /// \param[in] i row group ordinal to get page index reader. + /// \returns RowGroupPageIndexReader of the specified row group. A nullptr may or may + /// not be returned if the page index for the row group is unavailable. It is + /// the caller's responsibility to check the return value of follow-up calls + /// to the RowGroupPageIndexReader. + /// \throws ParquetException if the index is out of bound. + virtual std::shared_ptr RowGroup(int i) = 0; + + /// \brief Advise the reader which part of page index will be read later. + /// + /// The PageIndexReader implementation can optionally prefetch and cache page index + /// that may be read later. Follow-up read should not fail even if WillNeed() is not + /// called, or the requested page index is out of range from WillNeed() call. + /// + /// \param[in] row_group_indices list of row group ordinal to read page index later. + /// \param[in] index_selection tell if any of the page index is required later. + virtual void WillNeed(const std::vector& row_group_indices, +IndexSelection index_selection) = 0; Review Comment: nit: even though this might be more efficient, it probably pays to use `const IndexSelection&` as the formal parameter.
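The nit above can be illustrated with a minimal sketch. The struct mirrors `IndexSelection` from the diff, but the function and its name are invented for illustration and are not part of the Arrow code:

```cpp
#include <cassert>

// Hypothetical mirror of the IndexSelection struct discussed in the review.
struct IndexSelection {
  bool column_index = false;
  bool offset_index = false;
};

// The reviewer's suggestion: take the struct by const reference. For a
// two-bool struct, pass-by-value may actually be cheaper (it fits in a
// register), but const& is the conventional signature for struct
// parameters and stays cheap if the struct later grows.
inline bool NeedsAnyIndex(const IndexSelection& selection) {
  return selection.column_index || selection.offset_index;
}
```

Either signature is correct here; the const-reference form mainly buys consistency with the rest of the codebase.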
[GitHub] [arrow] emkornfield commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
emkornfield commented on code in PR #14964: URL: https://github.com/apache/arrow/pull/14964#discussion_r1083090112 ## cpp/src/parquet/page_index.h: ## @@ -126,4 +132,94 @@ class PARQUET_EXPORT OffsetIndex { virtual const std::vector& page_locations() const = 0; }; +/// \brief Interface for reading the page index for a Parquet row group. +class PARQUET_EXPORT RowGroupPageIndexReader { + public: + virtual ~RowGroupPageIndexReader() = default; + + /// \brief Read column index of a column chunk. + /// + /// \param[in] i column ordinal of the column chunk. + /// \returns column index of the column or nullptr if it does not exist. + /// \throws ParquetException if the index is out of bound. + virtual std::shared_ptr GetColumnIndex(int32_t i) = 0; + + /// \brief Read offset index of a column chunk. + /// + /// \param[in] i column ordinal of the column chunk. + /// \returns offset index of the column or nullptr if it does not exist. + /// \throws ParquetException if the index is out of bound. + virtual std::shared_ptr GetOffsetIndex(int32_t i) = 0; +}; + +struct IndexSelection { + /// Specifies whether to read the column index. + bool column_index = false; + /// Specifies whether to read the offset index. + bool offset_index = false; +}; + +struct RowGroupIndexReadRange { Review Comment: Yes, thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
emkornfield commented on code in PR #14964: URL: https://github.com/apache/arrow/pull/14964#discussion_r1083089551 ## cpp/src/parquet/page_index.cc: ## @@ -184,8 +185,219 @@ class OffsetIndexImpl : public OffsetIndex { std::vector page_locations_; }; +class RowGroupPageIndexReaderImpl : public RowGroupPageIndexReader { + public: + RowGroupPageIndexReaderImpl(::arrow::io::RandomAccessFile* input, + std::shared_ptr row_group_metadata, + const ReaderProperties& properties, + int32_t row_group_ordinal, + std::shared_ptr file_decryptor) + : input_(input), +row_group_metadata_(std::move(row_group_metadata)), +properties_(properties), +file_decryptor_(std::move(file_decryptor)), +index_read_range_( + PageIndexReader::DeterminePageIndexRangesInRowGroup(*row_group_metadata_)) {} + + /// Read column index of a column chunk. + std::shared_ptr GetColumnIndex(int32_t i) override { +if (i < 0 || i >= row_group_metadata_->num_columns()) { + throw ParquetException("Invalid column {} to get column index", i); +} + +auto col_chunk = row_group_metadata_->ColumnChunk(i); + +std::unique_ptr crypto_metadata = col_chunk->crypto_metadata(); +if (crypto_metadata != nullptr && file_decryptor_ == nullptr) { + ParquetException::NYI("Cannot read encrypted column index yet"); +} + +auto column_index_location = col_chunk->GetColumnIndexLocation(); +if (!column_index_location.has_value()) { + return nullptr; +} + +if (!index_read_range_.column_index.has_value()) { + throw ParquetException("Missing column index read range"); +} + +if (column_index_buffer_ == nullptr) { + PARQUET_ASSIGN_OR_THROW(column_index_buffer_, + input_->ReadAt(index_read_range_.column_index->offset, + index_read_range_.column_index->length)); +} + +auto buffer = column_index_buffer_.get(); +int64_t buffer_offset = +column_index_location->offset - index_read_range_.column_index->offset; +uint32_t length = static_cast(column_index_location->length); +DCHECK_GE(buffer_offset, 0); +DCHECK_LE(buffer_offset + length, 
index_read_range_.column_index->length); + +auto descr = row_group_metadata_->schema()->Column(i); +std::shared_ptr column_index; +try { + column_index = + ColumnIndex::Make(*descr, buffer->data() + buffer_offset, length, properties_); +} catch (...) { + throw ParquetException("Cannot deserialize column index for column {}", i); +} +return column_index; + } + + /// Read offset index of a column chunk. + std::shared_ptr GetOffsetIndex(int32_t i) override { +if (i < 0 || i >= row_group_metadata_->num_columns()) { + throw ParquetException("Invalid column {} to get offset index", i); +} + +auto col_chunk = row_group_metadata_->ColumnChunk(i); + +std::unique_ptr crypto_metadata = col_chunk->crypto_metadata(); +if (crypto_metadata != nullptr && file_decryptor_ == nullptr) { + ParquetException::NYI("Cannot read encrypted offset index yet"); +} + +auto offset_index_location = col_chunk->GetOffsetIndexLocation(); +if (!offset_index_location.has_value()) { + return nullptr; +} + +if (!index_read_range_.offset_index.has_value()) { + throw ParquetException("Missing column index read range"); +} + +if (offset_index_buffer_ == nullptr) { + PARQUET_ASSIGN_OR_THROW(offset_index_buffer_, + input_->ReadAt(index_read_range_.offset_index->offset, + index_read_range_.offset_index->length)); +} + +auto buffer = offset_index_buffer_.get(); +int64_t buffer_offset = +offset_index_location->offset - index_read_range_.offset_index->offset; +uint32_t length = static_cast(offset_index_location->length); +DCHECK_GE(buffer_offset, 0); +DCHECK_LE(buffer_offset + length, index_read_range_.offset_index->length); + +std::shared_ptr offset_index; +try { + offset_index = + OffsetIndex::Make(buffer->data() + buffer_offset, length, properties_); +} catch (...) { + throw ParquetException("Cannot deserialize offset index for column {}", i); +} +return offset_index; + } + + private: + /// The input stream that can perform random access read. 
+ ::arrow::io::RandomAccessFile* input_; + + /// The row group metadata to get column chunk metadata. + std::shared_ptr row_group_metadata_; + + /// Reader properties used to deserialize thrift object. + const ReaderProperties& properties_; + + /// File-level decryptor. + std::shared_ptr file_decryptor_; + + /// File offsets and sizes of the page Index of all column chunks in the row group. + RowGroupIndexReadRange i
[GitHub] [arrow] emkornfield commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
emkornfield commented on code in PR #14964: URL: https://github.com/apache/arrow/pull/14964#discussion_r1083089071 ## cpp/src/parquet/page_index.cc: ## @@ -184,8 +185,219 @@ class OffsetIndexImpl : public OffsetIndex { std::vector page_locations_; }; +class RowGroupPageIndexReaderImpl : public RowGroupPageIndexReader { + public: + RowGroupPageIndexReaderImpl(::arrow::io::RandomAccessFile* input, + std::shared_ptr row_group_metadata, + const ReaderProperties& properties, + int32_t row_group_ordinal, + std::shared_ptr file_decryptor) + : input_(input), +row_group_metadata_(std::move(row_group_metadata)), +properties_(properties), +file_decryptor_(std::move(file_decryptor)), +index_read_range_( + PageIndexReader::DeterminePageIndexRangesInRowGroup(*row_group_metadata_)) {} + + /// Read column index of a column chunk. + std::shared_ptr GetColumnIndex(int32_t i) override { +if (i < 0 || i >= row_group_metadata_->num_columns()) { + throw ParquetException("Invalid column {} to get column index", i); +} + +auto col_chunk = row_group_metadata_->ColumnChunk(i); + +std::unique_ptr crypto_metadata = col_chunk->crypto_metadata(); +if (crypto_metadata != nullptr && file_decryptor_ == nullptr) { + ParquetException::NYI("Cannot read encrypted column index yet"); +} + +auto column_index_location = col_chunk->GetColumnIndexLocation(); +if (!column_index_location.has_value()) { + return nullptr; +} + +if (!index_read_range_.column_index.has_value()) { + throw ParquetException("Missing column index read range"); +} + +if (column_index_buffer_ == nullptr) { + PARQUET_ASSIGN_OR_THROW(column_index_buffer_, + input_->ReadAt(index_read_range_.column_index->offset, + index_read_range_.column_index->length)); +} + +auto buffer = column_index_buffer_.get(); +int64_t buffer_offset = +column_index_location->offset - index_read_range_.column_index->offset; +uint32_t length = static_cast(column_index_location->length); +DCHECK_GE(buffer_offset, 0); +DCHECK_LE(buffer_offset + length, 
index_read_range_.column_index->length); + +auto descr = row_group_metadata_->schema()->Column(i); +std::shared_ptr column_index; +try { + column_index = + ColumnIndex::Make(*descr, buffer->data() + buffer_offset, length, properties_); +} catch (...) { + throw ParquetException("Cannot deserialize column index for column {}", i); +} +return column_index; + } + + /// Read offset index of a column chunk. + std::shared_ptr GetOffsetIndex(int32_t i) override { +if (i < 0 || i >= row_group_metadata_->num_columns()) { + throw ParquetException("Invalid column {} to get offset index", i); +} + +auto col_chunk = row_group_metadata_->ColumnChunk(i); + +std::unique_ptr crypto_metadata = col_chunk->crypto_metadata(); +if (crypto_metadata != nullptr && file_decryptor_ == nullptr) { + ParquetException::NYI("Cannot read encrypted offset index yet"); +} + +auto offset_index_location = col_chunk->GetOffsetIndexLocation(); +if (!offset_index_location.has_value()) { + return nullptr; +} + +if (!index_read_range_.offset_index.has_value()) { + throw ParquetException("Missing column index read range"); +} + +if (offset_index_buffer_ == nullptr) { + PARQUET_ASSIGN_OR_THROW(offset_index_buffer_, + input_->ReadAt(index_read_range_.offset_index->offset, + index_read_range_.offset_index->length)); +} + +auto buffer = offset_index_buffer_.get(); +int64_t buffer_offset = +offset_index_location->offset - index_read_range_.offset_index->offset; +uint32_t length = static_cast(offset_index_location->length); +DCHECK_GE(buffer_offset, 0); +DCHECK_LE(buffer_offset + length, index_read_range_.offset_index->length); + +std::shared_ptr offset_index; +try { + offset_index = + OffsetIndex::Make(buffer->data() + buffer_offset, length, properties_); +} catch (...) { + throw ParquetException("Cannot deserialize offset index for column {}", i); Review Comment: propagate exception information from the caught one?
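The suggestion above — propagating information from the caught exception into the rethrown one — can be sketched as follows. This is a hypothetical stand-in with invented names and `std::runtime_error` in place of `ParquetException`, not the actual Parquet code:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a thrift deserialization step that can fail.
inline void DeserializeIndex(bool fail) {
  if (fail) throw std::runtime_error("truncated thrift payload");
}

// Catching std::exception& (instead of `catch (...)`) lets the rethrown
// message carry the root cause via e.what(), as the reviewer suggests.
inline std::string ReadColumnIndexOrThrow(int column, bool fail) {
  try {
    DeserializeIndex(fail);
    return "ok";
  } catch (const std::exception& e) {
    throw std::runtime_error("Cannot deserialize column index for column " +
                             std::to_string(column) + ": " + e.what());
  }
}
```

With this pattern the caller sees both the column ordinal and the underlying failure in a single message, instead of losing the cause to a bare `catch (...)`.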
[GitHub] [arrow] emkornfield commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
emkornfield commented on code in PR #14964: URL: https://github.com/apache/arrow/pull/14964#discussion_r1083088957 ## cpp/src/parquet/page_index.cc: ## @@ -184,8 +185,219 @@ class OffsetIndexImpl : public OffsetIndex { std::vector page_locations_; }; +class RowGroupPageIndexReaderImpl : public RowGroupPageIndexReader { + public: + RowGroupPageIndexReaderImpl(::arrow::io::RandomAccessFile* input, + std::shared_ptr row_group_metadata, + const ReaderProperties& properties, + int32_t row_group_ordinal, + std::shared_ptr file_decryptor) + : input_(input), +row_group_metadata_(std::move(row_group_metadata)), +properties_(properties), +file_decryptor_(std::move(file_decryptor)), +index_read_range_( + PageIndexReader::DeterminePageIndexRangesInRowGroup(*row_group_metadata_)) {} + + /// Read column index of a column chunk. + std::shared_ptr GetColumnIndex(int32_t i) override { +if (i < 0 || i >= row_group_metadata_->num_columns()) { + throw ParquetException("Invalid column {} to get column index", i); +} + +auto col_chunk = row_group_metadata_->ColumnChunk(i); + +std::unique_ptr crypto_metadata = col_chunk->crypto_metadata(); +if (crypto_metadata != nullptr && file_decryptor_ == nullptr) { + ParquetException::NYI("Cannot read encrypted column index yet"); +} + +auto column_index_location = col_chunk->GetColumnIndexLocation(); +if (!column_index_location.has_value()) { + return nullptr; +} + +if (!index_read_range_.column_index.has_value()) { + throw ParquetException("Missing column index read range"); +} + +if (column_index_buffer_ == nullptr) { + PARQUET_ASSIGN_OR_THROW(column_index_buffer_, + input_->ReadAt(index_read_range_.column_index->offset, + index_read_range_.column_index->length)); +} + +auto buffer = column_index_buffer_.get(); +int64_t buffer_offset = +column_index_location->offset - index_read_range_.column_index->offset; +uint32_t length = static_cast(column_index_location->length); +DCHECK_GE(buffer_offset, 0); +DCHECK_LE(buffer_offset + length, 
index_read_range_.column_index->length); + +auto descr = row_group_metadata_->schema()->Column(i); +std::shared_ptr column_index; +try { + column_index = + ColumnIndex::Make(*descr, buffer->data() + buffer_offset, length, properties_); +} catch (...) { + throw ParquetException("Cannot deserialize column index for column {}", i); +} +return column_index; + } + + /// Read offset index of a column chunk. + std::shared_ptr GetOffsetIndex(int32_t i) override { +if (i < 0 || i >= row_group_metadata_->num_columns()) { + throw ParquetException("Invalid column {} to get offset index", i); +} + +auto col_chunk = row_group_metadata_->ColumnChunk(i); + +std::unique_ptr crypto_metadata = col_chunk->crypto_metadata(); +if (crypto_metadata != nullptr && file_decryptor_ == nullptr) { + ParquetException::NYI("Cannot read encrypted offset index yet"); +} + +auto offset_index_location = col_chunk->GetOffsetIndexLocation(); +if (!offset_index_location.has_value()) { + return nullptr; +} + +if (!index_read_range_.offset_index.has_value()) { + throw ParquetException("Missing column index read range"); +} + +if (offset_index_buffer_ == nullptr) { + PARQUET_ASSIGN_OR_THROW(offset_index_buffer_, + input_->ReadAt(index_read_range_.offset_index->offset, + index_read_range_.offset_index->length)); +} + +auto buffer = offset_index_buffer_.get(); Review Comment: is this a pointer? Could we at least use `auto*` if we're not spelling out the whole type?
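A minimal illustration of the `auto*` suggestion, using invented names rather than the Arrow code: writing `auto*` makes the pointer-ness of the deduced type visible at the declaration site, while plain `auto` hides it.

```cpp
#include <cassert>
#include <memory>

// Hypothetical mirror of taking a raw, non-owning pointer out of a
// shared_ptr-held buffer, as in the diff under review.
inline int ReadThroughRawPointer(const std::shared_ptr<int>& buffer) {
  auto* raw = buffer.get();  // deduced as int*; `auto*` flags it as a pointer
  return *raw;
}
```

Both `auto` and `auto*` deduce the same type here; the `*` is purely documentation for the reader, which is why the reviewer frames it as a readability nit.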
[GitHub] [arrow-adbc] lidavidm merged pull request #369: fix(go/adbc/driver/flightsql): heap-allocate Go handles
lidavidm merged PR #369: URL: https://github.com/apache/arrow-adbc/pull/369
[GitHub] [arrow] emkornfield commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
emkornfield commented on code in PR #14964: URL: https://github.com/apache/arrow/pull/14964#discussion_r1083088746 ## cpp/src/parquet/page_index.cc: ## @@ -184,8 +185,219 @@ class OffsetIndexImpl : public OffsetIndex { std::vector page_locations_; }; +class RowGroupPageIndexReaderImpl : public RowGroupPageIndexReader { + public: + RowGroupPageIndexReaderImpl(::arrow::io::RandomAccessFile* input, + std::shared_ptr row_group_metadata, + const ReaderProperties& properties, + int32_t row_group_ordinal, + std::shared_ptr file_decryptor) + : input_(input), +row_group_metadata_(std::move(row_group_metadata)), +properties_(properties), +file_decryptor_(std::move(file_decryptor)), +index_read_range_( + PageIndexReader::DeterminePageIndexRangesInRowGroup(*row_group_metadata_)) {} + + /// Read column index of a column chunk. + std::shared_ptr GetColumnIndex(int32_t i) override { +if (i < 0 || i >= row_group_metadata_->num_columns()) { + throw ParquetException("Invalid column {} to get column index", i); +} + +auto col_chunk = row_group_metadata_->ColumnChunk(i); + +std::unique_ptr crypto_metadata = col_chunk->crypto_metadata(); +if (crypto_metadata != nullptr && file_decryptor_ == nullptr) { + ParquetException::NYI("Cannot read encrypted column index yet"); +} + +auto column_index_location = col_chunk->GetColumnIndexLocation(); +if (!column_index_location.has_value()) { + return nullptr; +} + +if (!index_read_range_.column_index.has_value()) { + throw ParquetException("Missing column index read range"); +} + +if (column_index_buffer_ == nullptr) { + PARQUET_ASSIGN_OR_THROW(column_index_buffer_, + input_->ReadAt(index_read_range_.column_index->offset, + index_read_range_.column_index->length)); +} + +auto buffer = column_index_buffer_.get(); +int64_t buffer_offset = +column_index_location->offset - index_read_range_.column_index->offset; +uint32_t length = static_cast(column_index_location->length); +DCHECK_GE(buffer_offset, 0); +DCHECK_LE(buffer_offset + length, 
index_read_range_.column_index->length); + +auto descr = row_group_metadata_->schema()->Column(i); +std::shared_ptr column_index; +try { + column_index = + ColumnIndex::Make(*descr, buffer->data() + buffer_offset, length, properties_); +} catch (...) { + throw ParquetException("Cannot deserialize column index for column {}", i); +} +return column_index; + } + + /// Read offset index of a column chunk. + std::shared_ptr GetOffsetIndex(int32_t i) override { +if (i < 0 || i >= row_group_metadata_->num_columns()) { + throw ParquetException("Invalid column {} to get offset index", i); +} + +auto col_chunk = row_group_metadata_->ColumnChunk(i); + +std::unique_ptr crypto_metadata = col_chunk->crypto_metadata(); +if (crypto_metadata != nullptr && file_decryptor_ == nullptr) { + ParquetException::NYI("Cannot read encrypted offset index yet"); +} + +auto offset_index_location = col_chunk->GetOffsetIndexLocation(); +if (!offset_index_location.has_value()) { + return nullptr; +} + +if (!index_read_range_.offset_index.has_value()) { + throw ParquetException("Missing column index read range"); +} + +if (offset_index_buffer_ == nullptr) { + PARQUET_ASSIGN_OR_THROW(offset_index_buffer_, + input_->ReadAt(index_read_range_.offset_index->offset, + index_read_range_.offset_index->length)); +} + +auto buffer = offset_index_buffer_.get(); +int64_t buffer_offset = +offset_index_location->offset - index_read_range_.offset_index->offset; +uint32_t length = static_cast(offset_index_location->length); Review Comment: same comment as above on casts. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
emkornfield commented on code in PR #14964:
URL: https://github.com/apache/arrow/pull/14964#discussion_r1083088486

## cpp/src/parquet/page_index.cc:

@@ -184,8 +185,219 @@ (same hunk as above; the comment anchors on the decryptor check in `GetOffsetIndex`)

```cpp
  /// Read offset index of a column chunk.
  std::shared_ptr<OffsetIndex> GetOffsetIndex(int32_t i) override {
    if (i < 0 || i >= row_group_metadata_->num_columns()) {
      throw ParquetException("Invalid column {} to get offset index", i);
    }

    auto col_chunk = row_group_metadata_->ColumnChunk(i);

    std::unique_ptr<ColumnCryptoMetaData> crypto_metadata = col_chunk->crypto_metadata();
    if (crypto_metadata != nullptr && file_decryptor_ == nullptr) {
```

Review Comment:
   same comment about removing the file_decryptor_ check?
[GitHub] [arrow] emkornfield commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
emkornfield commented on code in PR #14964:
URL: https://github.com/apache/arrow/pull/14964#discussion_r1083088136

## cpp/src/parquet/page_index.cc:

@@ -184,8 +185,219 @@ (same hunk as above; the comment anchors on the catch clause in `GetColumnIndex`)

```cpp
    auto descr = row_group_metadata_->schema()->Column(i);
    std::shared_ptr<ColumnIndex> column_index;
    try {
      column_index =
          ColumnIndex::Make(*descr, buffer->data() + buffer_offset, length, properties_);
    } catch (...) {
      throw ParquetException("Cannot deserialize column index for column {}", i);
```

Review Comment:
   seems like we should propagate information about the caught error here?
[GitHub] [arrow] emkornfield commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
emkornfield commented on code in PR #14964:
URL: https://github.com/apache/arrow/pull/14964#discussion_r1083087767

## cpp/src/parquet/page_index.cc:

@@ -184,8 +185,219 @@ (same hunk as above; the comment anchors on the cast and DCHECKs in `GetColumnIndex`)

```cpp
    auto buffer = column_index_buffer_.get();
    int64_t buffer_offset =
        column_index_location->offset - index_read_range_.column_index->offset;
    uint32_t length = static_cast<uint32_t>(column_index_location->length);
    DCHECK_GE(buffer_offset, 0);
```

Review Comment:
   if location->length is derived from data in the parquet file, we should be checking eagerly that the length is within the allowed range. Could we also add a comment on why the down-cast is safe?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [arrow-datafusion] ursabot commented on pull request #4989: Add support for linear range calculation in WINDOW functions
ursabot commented on PR #4989:
URL: https://github.com/apache/arrow-datafusion/pull/4989#issuecomment-1399003734

Benchmark runs are scheduled for baseline = 92d0a054c23e5fba91718db32ccd933ce86dd2b6 and contender = b71cae8aa556369bc5ee72b063ed1fc5a81192f1. b71cae8aa556369bc5ee72b063ed1fc5a81192f1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
- [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/b7602ae58f5d4b7ba3324d1ffe741657...ebc718c2b1854f13b01b15b9ab814210/)
- [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/a4af3422df424d45b8c593bf88fcdb2e...9c071dfa45984472843f37989cf6184f/)
- [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/62bf27b1bb66413091df940f3d3c4b2c...84d58f2d947140169e6aa56dc6992ca0/)
- [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/dcbd2be5fd934bd7957e5699f4165edd...3fbb7ffd22844f168faeed49ff4bdbcc/)

Buildkite builds:

Supported benchmarks:
- ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
- test-mac-arm: Supported benchmark langs: C++, Python, R
- ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
- ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
[GitHub] [arrow] emkornfield commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
emkornfield commented on code in PR #14964:
URL: https://github.com/apache/arrow/pull/14964#discussion_r1083085814

## cpp/src/parquet/page_index.cc:

@@ -184,8 +185,219 @@ (same hunk as above; the comment anchors on the NYI in `GetColumnIndex`)

```cpp
  /// Read column index of a column chunk.
  std::shared_ptr<ColumnIndex> GetColumnIndex(int32_t i) override {
    if (i < 0 || i >= row_group_metadata_->num_columns()) {
      throw ParquetException("Invalid column {} to get column index", i);
    }

    auto col_chunk = row_group_metadata_->ColumnChunk(i);

    std::unique_ptr<ColumnCryptoMetaData> crypto_metadata = col_chunk->crypto_metadata();
    if (crypto_metadata != nullptr && file_decryptor_ == nullptr) {
      ParquetException::NYI("Cannot read encrypted column index yet");
```

Review Comment:
   given this error should the second part of the check above be removed?
[GitHub] [arrow] emkornfield commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
emkornfield commented on code in PR #14964:
URL: https://github.com/apache/arrow/pull/14964#discussion_r1083084578

## cpp/src/parquet/file_reader.cc:

@@ -302,6 +303,17 @@

```cpp
class SerializedFile : public ParquetFileReader::Contents {
  ...
  std::shared_ptr<FileMetaData> metadata() const override { return file_metadata_; }

  std::shared_ptr<PageIndexReader> GetPageIndexReader() override {
    if (!file_metadata_) {
      throw ParquetException("Cannot get PageIndexReader as file metadata is not ready");
```

Review Comment:
   nit: can we instruct the api caller what went wrong here (i.e. should they call a certain method?)
[GitHub] [arrow] emkornfield commented on a diff in pull request #14964: GH-33596: [C++][Parquet] Parquet page index read support
emkornfield commented on code in PR #14964:
URL: https://github.com/apache/arrow/pull/14964#discussion_r1083083858

## cpp/src/parquet/page_index.cc:

@@ -184,8 +185,241 @@

```cpp
class OffsetIndexImpl : public OffsetIndex {
  ...
  std::vector<PageLocation> page_locations_;
};

class RowGroupPageIndexReaderImpl : public RowGroupPageIndexReader {
 public:
  RowGroupPageIndexReaderImpl(::arrow::io::RandomAccessFile* input,
                              std::shared_ptr<RowGroupMetaData> row_group_metadata,
                              const ReaderProperties& properties,
                              int32_t row_group_ordinal,
                              std::shared_ptr<InternalFileDecryptor> file_decryptor)
      : input_(input),
        row_group_metadata_(row_group_metadata),
        properties_(properties),
        file_decryptor_(std::move(file_decryptor)) {
    bool has_column_index = false;
    bool has_offset_index = false;
    PageIndexReader::DeterminePageIndexRangesInRowGroup(
        *row_group_metadata, &column_index_base_offset_, &column_index_size_,
        &offset_index_base_offset_, &offset_index_size_, &has_column_index,
        &has_offset_index);
  }

  /// Read column index of a column chunk.
  ::arrow::Result<std::shared_ptr<ColumnIndex>> GetColumnIndex(int32_t i) override {
    if (i < 0 || i >= row_group_metadata_->num_columns()) {
      return ::arrow::Status::IndexError("Invalid column {} to get column index", i);
    }

    auto col_chunk = row_group_metadata_->ColumnChunk(i);

    std::unique_ptr<ColumnCryptoMetaData> crypto_metadata = col_chunk->crypto_metadata();
    if (crypto_metadata != nullptr && file_decryptor_ == nullptr) {
      ParquetException::NYI("Cannot read encrypted column index yet");
    }

    auto column_index_location = col_chunk->GetColumnIndexLocation();
    if (!column_index_location.has_value()) {
      return ::arrow::Status::Invalid("Column index does not exist for column {}", i);
    }

    if (column_index_buffer_ == nullptr) {
      ARROW_ASSIGN_OR_RAISE(
          column_index_buffer_,
          input_->ReadAt(column_index_base_offset_, column_index_size_));
    }

    auto buffer = column_index_buffer_.get();
    int64_t buffer_offset = column_index_location->offset - column_index_base_offset_;
    uint32_t length = column_index_location->length;
    DCHECK_GE(buffer_offset, 0);
    DCHECK_LE(buffer_offset + length, column_index_size_);

    auto descr = row_group_metadata_->schema()->Column(i);
    std::shared_ptr<ColumnIndex> column_index;
    try {
      column_index =
          ColumnIndex::Make(*descr, buffer->data() + buffer_offset, length, properties_);
    } catch (...) {
      return ::arrow::Status::SerializationError(
          "Cannot deserialize column index for column {}", i);
    }
    return column_index;
  }

  /// Read offset index of a column chunk.
  ::arrow::Result<std::shared_ptr<OffsetIndex>> GetOffsetIndex(int32_t i) override {
    if (i < 0 || i >= row_group_metadata_->num_columns()) {
      return ::arrow::Status::IndexError("Invalid column {} to get offset index", i);
    }

    auto col_chunk = row_group_metadata_->ColumnChunk(i);

    std::unique_ptr<ColumnCryptoMetaData> crypto_metadata = col_chunk->crypto_metadata();
    if (crypto_metadata != nullptr && file_decryptor_ == nullptr) {
      ParquetException::NYI("Cannot read encrypted offset index yet");
    }

    auto offset_index_location = col_chunk->GetOffsetIndexLocation();
    if (!offset_index_location.has_value()) {
      return ::arrow::Status::Invalid("Offset index does not exist for column {}", i);
    }

    if (offset_index_buffer_ == nullptr) {
      ARROW_ASSIGN_OR_RAISE(
          offset_index_buffer_,
          input_->ReadAt(offset_index_base_offset_, offset_index_size_));
    }

    auto buffer = offset_index_buffer_.get();
    int64_t buffer_offset = offset_index_location->offset - offset_index_base_offset_;
    uint32_t length = offset_index_location->length;
    DCHECK_GE(buffer_offset, 0);
    DCHECK_LE(buffer_offset + length, offset_index_size_);

    std::shared_ptr<OffsetIndex> offset_index;
    try {
      offset_index =
          OffsetIndex::Make(buffer->data() + buffer_offset, length, properties_);
    } catch (...) {
      return ::arrow::Status::SerializationError(
          "Cannot deserialize offset index for column {}", i);
    }
    return offset_index;
  }

 private:
  /// The input stream that can perform random access read.
  ::arrow::io::RandomAccessFile* input_;

  /// The row group metadata to get column chunk metadata.
  std::shared_ptr<RowGroupMetaData> row_group_metadata_;

  /// Reader properties used to deserialize thrift object.
  const ReaderProperties& properties_;

  /// File-level decryptor.
  std::shared_ptr<InternalFileDecryptor> file_decryptor_;

  /// File offsets and sizes of the page Index.
  int64_t column_index_base_offset_;
  int64_t column_index_size_;
  int64_t offset_index_base_offset_;
  int64_t offset_index_size_;

  /// Buffer to hold the raw bytes of the page index.
```
[GitHub] [arrow] thisisnic merged pull request #33809: MINOR: [R] Update BugReports field in DESCRIPTION
thisisnic merged PR #33809:
URL: https://github.com/apache/arrow/pull/33809
[GitHub] [arrow-datafusion] alamb closed issue #4979: Add support for linear range search
alamb closed issue #4979: Add support for linear range search
URL: https://github.com/apache/arrow-datafusion/issues/4979
[GitHub] [arrow-datafusion] alamb merged pull request #4989: Add support for linear range calculation in WINDOW functions
alamb merged PR #4989:
URL: https://github.com/apache/arrow-datafusion/pull/4989