[jira] [Created] (ARROW-16277) [Python] No builds for macOS arm64.
A. Coady created ARROW-16277: Summary: [Python] No builds for macOS arm64. Key: ARROW-16277 URL: https://issues.apache.org/jira/browse/ARROW-16277 Project: Apache Arrow Issue Type: Task Components: Python Affects Versions: 8.0.0 Environment: macOS Reporter: A. Coady
Nightly builds no longer include a build for macOS on arm64. The last one to do so was 8.0.0.dev312.
[jira] [Created] (ARROW-16276) [R] Release News
Jonathan Keane created ARROW-16276: Summary: [R] Release News Key: ARROW-16276 URL: https://issues.apache.org/jira/browse/ARROW-16276 Project: Apache Arrow Issue Type: Improvement Reporter: Jonathan Keane Assignee: Will Jones
I typically use a command like:
{code}
git log fcab481 --grep=".*\[R\].*" --format="%s"
{code}
which finds all the commits tagged with {{[R]}} since commit fcab481. I found commit fcab481 by going to the 7.0.0 release branch and finding the last commit that is in the master branch as well as in the 7.0.0 release.
[jira] [Created] (ARROW-16275) [C++] Add support for pushdown projection of nested references
Weston Pace created ARROW-16275: Summary: [C++] Add support for pushdown projection of nested references Key: ARROW-16275 URL: https://issues.apache.org/jira/browse/ARROW-16275 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace
Now that we support nested field references we should support pushdown predicates based on them. For example:
{noformat}
dataset.to_table(filter=ds.field('values', 'one') > 200)
{noformat}
First, {{file_parquet.cc}} tests which row groups to include when scanning a parquet fragment using parquet statistics; at the moment it skips any non-leaf columns, and that will need to change. Second, even if we were able to detect and produce a guarantee based on nested references, it's not clear the simplification logic would be able to detect this and simplify appropriately, so there may be changes needed there too.
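For context, a minimal sketch of the desired end state (the dataset path and data are illustrative; {{ds.field('values', 'one')}} is the nested reference syntax from the example above):
{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# Write a tiny parquet dataset with a struct column "values"
# whose child field "one" we want to filter on.
table = pa.table({"values": pa.array([{"one": 100}, {"one": 300}])})
ds.write_dataset(table, "/tmp/nested_demo", format="parquet")

dataset = ds.dataset("/tmp/nested_demo", format="parquet")
# The filter already evaluates correctly on nested references; the goal
# of this issue is for parquet row-group statistics to also prune on it.
filtered = dataset.to_table(filter=ds.field("values", "one") > 200)
print(filtered)
{code}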
[jira] [Created] (ARROW-16274) [C++] Substrait consumer should be feature-aware
Weston Pace created ARROW-16274: Summary: [C++] Substrait consumer should be feature-aware Key: ARROW-16274 URL: https://issues.apache.org/jira/browse/ARROW-16274 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace
The Substrait consumer should be aware of which features it was built with and gracefully reject plans that request unsupported features. For example, a Substrait plan could specify the parquet format while Arrow was built without parquet support. In that case the consumer should still compile, but reject all parquet plans. Today we simply force on all features that Substrait could possibly require.
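Illustratively (a hypothetical sketch, not the actual consumer API; the real check would live in C++ and derive the feature set from build flags):
{code:python}
# Hypothetical feature-aware validation: reject up front instead of
# requiring every optional feature to be compiled in.
SUPPORTED_FILE_FORMATS = {"ipc", "csv"}  # e.g. parquet omitted at build time

def validate_plan_features(required_formats):
    missing = sorted(set(required_formats) - SUPPORTED_FILE_FORMATS)
    if missing:
        # Reject gracefully rather than failing to build or crashing at runtime.
        raise NotImplementedError(
            f"plan requires features not compiled in: {missing}")

validate_plan_features(["parquet"])  # raises NotImplementedError
{code}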
[jira] [Created] (ARROW-16273) [C++] Valgrind error in arrow-compute-scalar-test
Weston Pace created ARROW-16273: Summary: [C++] Valgrind error in arrow-compute-scalar-test Key: ARROW-16273 URL: https://issues.apache.org/jira/browse/ARROW-16273 Project: Apache Arrow Issue Type: Bug Reporter: Weston Pace
Currently valgrind is failing earlier, on the tpch-node-test and hash-join-node-test. Once we fix those tests it seems the next error is this:
{noformat}
[ RUN      ] TestStringKernels/0.Strptime
==9928== Conditional jump or move depends on uninitialised value(s)
==9928==    at 0x411AEA2: arrow::TestInitialized(arrow::ArrayData const&) (gtest_util.cc:682)
==9928==    by 0xAE1C79: arrow::compute::(anonymous namespace)::ValidateOutput(arrow::ArrayData const&) (test_util.cc:287)
==9928==    by 0xAE23FC: arrow::compute::ValidateOutput(arrow::Datum const&) (test_util.cc:320)
==9928==    by 0xAE4946: arrow::compute::CheckScalarNonRecursive(std::__cxx11::basic_string, std::allocator > const&, std::vector > const&, arrow::Datum const&, arrow::compute::FunctionOptions const*) (test_util.cc:80)
==9928==    by 0xAE63A4: arrow::compute::CheckScalar(std::__cxx11::basic_string, std::allocator >, std::vector > const&, arrow::Datum, arrow::compute::FunctionOptions const*) (test_util.cc:108)
==9928==    by 0xAE7E28: arrow::compute::CheckScalarUnary(std::__cxx11::basic_string, std::allocator >, arrow::Datum, arrow::Datum, arrow::compute::FunctionOptions const*) (test_util.cc:254)
==9928==    by 0xAE80D3: arrow::compute::CheckScalarUnary(std::__cxx11::basic_string, std::allocator >, std::shared_ptr, std::__cxx11::basic_string, std::allocator >, std::shared_ptr, std::__cxx11::basic_string, std::allocator >, arrow::compute::FunctionOptions const*) (test_util.cc:260)
==9928==    by 0x9F783F: arrow::compute::BaseTestStringKernels::CheckUnary(std::__cxx11::basic_string, std::allocator >, std::__cxx11::basic_string, std::allocator >, std::shared_ptr, std::__cxx11::basic_string, std::allocator >, arrow::compute::FunctionOptions const*) (scalar_string_test.cc:56)
==9928==    by 0xA2A62D: arrow::compute::TestStringKernels_Strptime_Test::TestBody() (scalar_string_test.cc:1855)
==9928==    by 0x64974DC: void testing::internal::HandleSehExceptionsInMethodIfSupported(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2607)
==9928==    by 0x648E90C: void testing::internal::HandleExceptionsInMethodIfSupported(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2643)
==9928==    by 0x6469CDC: testing::Test::Run() (gtest.cc:2682)
==9928==    by 0x646A6FE: testing::TestInfo::Run() (gtest.cc:2861)
==9928==    by 0x646B0BD: testing::TestSuite::Run() (gtest.cc:3015)
==9928==    by 0x647B1DB: testing::internal::UnitTestImpl::RunAllTests() (gtest.cc:5855)
==9928==    by 0x6498497: bool testing::internal::HandleSehExceptionsInMethodIfSupported(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2607)
==9928==    by 0x648FAF9: bool testing::internal::HandleExceptionsInMethodIfSupported(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2643)
==9928==    by 0x64796A8: testing::UnitTest::Run() (gtest.cc:5438)
==9928==    by 0x4204918: RUN_ALL_TESTS() (gtest.h:2490)
==9928==    by 0x420495B: main (gtest_main.cc:52)
==9928==
{
   Memcheck:Cond
   fun:_ZN5arrow15TestInitializedERKNS_9ArrayDataE
   fun:_ZN5arrow7compute12_GLOBAL__N_114ValidateOutputERKNS_9ArrayDataE
   fun:_ZN5arrow7compute14ValidateOutputERKNS_5DatumE
   fun:_ZN5arrow7compute23CheckScalarNonRecursiveERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorINS_5DatumESaISA_EERKSA_PKNS0_15FunctionOptionsE
   fun:_ZN5arrow7compute11CheckScalarENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorINS_5DatumESaIS8_EES8_PKNS0_15FunctionOptionsE
   fun:_ZN5arrow7compute16CheckScalarUnaryENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_5DatumES7_PKNS0_15FunctionOptionsE
   fun:_ZN5arrow7compute16CheckScalarUnaryENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt10shared_ptrINS_8DataTypeEES6_S9_S6_PKNS0_15FunctionOptionsE
   fun:_ZN5arrow7compute21BaseTestStringKernelsINS_10StringTypeEE10CheckUnaryENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES9_St10shared_ptrINS_8DataTypeEES9_PKNS0_15FunctionOptionsE
   fun:_ZN5arrow7compute31TestStringKernels_Strptime_TestINS_10StringTypeEE8TestBodyEv
   fun:_ZN7testing8internal38HandleSehExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc
   fun:_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc
   fun:_ZN7testing4Test3RunEv
   fun:_ZN7testing8TestInfo3RunEv
   fun:_ZN7testing9TestSuite3RunEv
   fun:_ZN7testing8internal12UnitTestImpl11RunAllTestsEv
   fun:_ZN7testing8internal38HandleSehExceptionsInMethodIfSupportedINS0_12UnitTestImplEbEET0_PT_MS4_FS3_vEPKc
   fun:_ZN
[jira] [Created] (ARROW-16272) Poor read performance of S3FileSystem.open_input_file when used with `pd.read_csv`
Sahil Gupta created ARROW-16272: Summary: Poor read performance of S3FileSystem.open_input_file when used with `pd.read_csv` Key: ARROW-16272 URL: https://issues.apache.org/jira/browse/ARROW-16272 Project: Apache Arrow Issue Type: Bug Affects Versions: 7.0.0, 5.0.0, 4.0.1 Environment: MacOS 12.1 MacBook Pro Intel x86 Reporter: Sahil Gupta
`pyarrow.fs.S3FileSystem.open_input_file` and `pyarrow.fs.S3FileSystem.open_input_stream` perform very poorly when used with Pandas' `read_csv`.
```python
import pandas as pd
import time
from pyarrow.fs import S3FileSystem


def load_parking_tickets():
    print("Running...")
    t0 = time.time()
    fs = S3FileSystem(
        anonymous=True,
        region="us-east-2",
        endpoint_override=None,
        proxy_options=None,
    )
    print("Time to create fs: ", time.time() - t0)

    t0 = time.time()
    # fhandler = fs.open_input_stream(
    #     "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    # )
    fhandler = fs.open_input_file(
        "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    )
    print("Time to create fhandler: ", time.time() - t0)

    t0 = time.time()
    year_2016_df = pd.read_csv(
        fhandler,
        nrows=100,
    )
    print("read time:", time.time() - t0)
    return year_2016_df


t0 = time.time()
load_parking_tickets()
print("total time:", time.time() - t0)
```
Output:
```shell
Running...
Time to create fs: 0.0003612041473388672
Time to create fhandler: 0.22461509704589844
read time: 105.76488208770752
total time: 105.99135684967041
```
This is with `pandas==1.4.2`. Performance is similar with `fs.open_input_stream` as well (commented out in the code):
```shell
Running...
Time to create fs: 0.0002570152282714844
Time to create fhandler: 0.18540692329406738
read time: 186.8419930934906
total time: 187.03169012069702
```
When running it with just pandas (which uses `s3fs` under the hood), it's much faster:
```python
import pandas as pd
import time


def load_parking_tickets():
    print("Running...")
    t0 = time.time()
    year_2016_df = pd.read_csv(
        "s3://bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
        nrows=100,
    )
    print("read time:", time.time() - t0)
    return year_2016_df


t0 = time.time()
load_parking_tickets()
print("total time:", time.time() - t0)
```
Output:
```shell
Running...
read time: 1.1012001037597656
total time: 1.101264238357544
```
Surprisingly, when we use `fsspec`'s `ArrowFSWrapper`, it matches the s3fs performance:
```python
import pandas as pd
import time
from pyarrow.fs import S3FileSystem
from fsspec.implementations.arrow import ArrowFSWrapper


def load_parking_tickets():
    print("Running...")
    t0 = time.time()
    fs = ArrowFSWrapper(
        S3FileSystem(
            anonymous=True,
            region="us-east-2",
            endpoint_override=None,
            proxy_options=None,
        )
    )
    print("Time to create fs: ", time.time() - t0)

    t0 = time.time()
    fhandler = fs._open(
        "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    )
    print("Time to create fhandler: ", time.time() - t0)

    t0 = time.time()
    year_2016_df = pd.read_csv(
        fhandler,
        nrows=100,
    )
    print("read time:", time.time() - t0)
    return year_2016_df


t0 = time.time()
load_parking_tickets()
print("total time:", time.time() - t0)
```
Output:
```shell
Running...
Time to create fs: 0.0002467632293701172
Time to create fhandler: 0.1858382225036621
read time: 0.13701486587524414
total time: 0.3232450485229492
```
Packages:
```
pyarrow : 7.0.0
pandas  : 1.4.2
numpy   : 1.20.3
```
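One plausible mitigation, offered as an assumption rather than a confirmed diagnosis: `read_csv` issues many small reads, each of which may become a separate S3 request on a raw Arrow handle, whereas fsspec file objects add readahead caching. `open_input_stream` does accept a `buffer_size` argument, which may serve those small reads from memory:
```python
import pandas as pd
from pyarrow.fs import S3FileSystem

fs = S3FileSystem(anonymous=True, region="us-east-2")
# buffer_size enables read buffering on the stream, so pandas' many
# small reads can be served from memory rather than one S3 request each.
with fs.open_input_stream(
    "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    buffer_size=4 * 1024 * 1024,  # 4 MiB; a guess worth tuning
) as fhandler:
    df = pd.read_csv(fhandler, nrows=100)
```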
[jira] [Created] (ARROW-16271) [C++] Implement full chunked array support for replace_with_mask
David Li created ARROW-16271: Summary: [C++] Implement full chunked array support for replace_with_mask Key: ARROW-16271 URL: https://issues.apache.org/jira/browse/ARROW-16271 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: David Li
ARROW-15928 enables this function to accept a chunked array for the input, but not for the mask or replacements. More work is needed to implement those cases (which currently just return an error). We should also consider how to make this work at least somewhat reusable for similar kernels (e.g. replace_with_indices).
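For illustration, a minimal sketch from Python of the supported and unsupported combinations described above (exact error text aside):
{code:python}
import pyarrow as pa
import pyarrow.compute as pc

values = pa.chunked_array([[1, 2], [3, 4]])   # chunked input: supported
mask = pa.array([True, False, True, False])   # plain mask: supported
replacements = pa.array([10, 30])             # one value per True in mask

pc.replace_with_mask(values, mask, replacements)  # works after ARROW-15928

chunked_mask = pa.chunked_array([[True, False], [True, False]])
# A chunked mask (or chunked replacements) is the case this issue
# tracks; today it returns an error instead of a result.
pc.replace_with_mask(values, chunked_mask, replacements)
{code}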
[jira] [Created] (ARROW-16270) [C++][Python][FileSystem] Make directory paths returned uniform
Micah Kornfield created ARROW-16270: Summary: [C++][Python][FileSystem] Make directory paths returned uniform Key: ARROW-16270 URL: https://issues.apache.org/jira/browse/ARROW-16270 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Reporter: Micah Kornfield
Depending on whether paths are selected with or without recursion, the returned directory paths either include or omit a trailing slash (see the code linked below). It would be nice to provide consistent output here, though it isn't clear if the breaking change is worthwhile.
[1] https://github.com/apache/arrow/blob/3eaa7dd0e8b3dabc5438203331f05e3e6c011e37/python/pyarrow/tests/test_fs.py#L688
[2] https://github.com/apache/arrow/blob/3eaa7dd0e8b3dabc5438203331f05e3e6c011e37/cpp/src/arrow/filesystem/test_util.cc#L767
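A small sketch of how the inconsistency surfaces from Python (the directory layout is hypothetical, and whether the slash appears depends on the filesystem and selector, per the tests linked above):
{code:python}
from pyarrow.fs import LocalFileSystem, FileSelector

fs = LocalFileSystem()
# Compare directory entries discovered with and without recursion;
# per this issue, one form may carry a trailing slash and the other not.
flat = fs.get_file_info(FileSelector("/tmp/demo", recursive=False))
deep = fs.get_file_info(FileSelector("/tmp/demo", recursive=True))
for info in flat + deep:
    print(repr(info.path), info.type)
{code}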
[jira] [Created] (ARROW-16269) [R][Python] Roundtrip ChunkedArray with ExtensionType drops type
Dewey Dunnington created ARROW-16269: Summary: [R][Python] Roundtrip ChunkedArray with ExtensionType drops type Key: ARROW-16269 URL: https://issues.apache.org/jira/browse/ARROW-16269 Project: Apache Arrow Issue Type: Improvement Components: Python, R Reporter: Dewey Dunnington
After ARROW-15168 we will use ExtensionType in more cases to handle R vector types that we don't natively implement a conversion for; however, roundtripping a Table through Python results in a Table in a slightly inconsistent state, where the type of the ChunkedArray doesn't line up with the type in the schema:
{code:R}
# remotes::install_github("apache/arrow/r")
library(arrow, warn.conflicts = FALSE)
pa <- reticulate::import("pyarrow", convert = FALSE)

table <- arrow_table(
  ext_col = chunked_array(vctrs_extension_array(1:10))
)

table$ext_col$type
#> VctrsExtensionType
#> integer(0)
table$schema$ext_col$type
#> VctrsExtensionType
#> integer(0)

table_py <- pa$Table$from_arrays(table$columns, schema = table$schema)

table_py$column("ext_col")$type
#> int32
table_py$schema$field("ext_col")$type
#> int32

cols <- reticulate::py_to_r(table_py$columns)
names(cols) <- reticulate::py_to_r(table_py$column_names)
table2 <- Table$create(!!! cols, schema = table$schema)

table2$ext_col$type
#> Int32
#> int32
table2$schema$ext_col$type
#> VctrsExtensionType
#> integer(0)
{code}
The workaround in ARROW-15168 is to go through RecordBatchReader, which is probably fine but in some cases might result in ChunkedArray columns getting re-chunked to the intersection of all the chunks. This doesn't copy any data, but isn't ideal (we should be able to roundtrip column-wise and avoid any re-chunking).
{code:R}
# remotes::install_github("apache/arrow/r#12817")
library(arrow, warn.conflicts = FALSE)

table <- arrow_table(
  c1 = chunked_array(1:2, 3:4, 5:6),
  c2 = chunked_array(1:6)
)

table$c1
#> ChunkedArray
#> [
#>   [
#>     1,
#>     2
#>   ],
#>   [
#>     3,
#>     4
#>   ],
#>   [
#>     5,
#>     6
#>   ]
#> ]
table$c2
#> ChunkedArray
#> [
#>   [
#>     1,
#>     2,
#>     3,
#>     4,
#>     5,
#>     6
#>   ]
#> ]

rbr <- as_record_batch_reader(table)
table2 <- rbr$read_table()

table2$c1
#> ChunkedArray
#> [
#>   [
#>     1,
#>     2
#>   ],
#>   [
#>     3,
#>     4
#>   ],
#>   [
#>     5,
#>     6
#>   ]
#> ]
table2$c2
#> ChunkedArray
#> [
#>   [
#>     1,
#>     2
#>   ],
#>   [
#>     3,
#>     4
#>   ],
#>   [
#>     5,
#>     6
#>   ]
#> ]
{code}
[jira] [Created] (ARROW-16268) [R] Reorganize deprecated functions
Dewey Dunnington created ARROW-16268: Summary: [R] Reorganize deprecated functions Key: ARROW-16268 URL: https://issues.apache.org/jira/browse/ARROW-16268 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Dewey Dunnington
In R/deprecated.R we have a few functions that have been deprecated for some time. In ARROW-15168 we deprecated {{type()}} in favour of {{infer_type()}}, and there are a few deprecated methods in dataset.R as well. We should consider removing some of the older deprecations and, for the newer ones, adding a comment about when they were deprecated so that we can better schedule their removal (or adopt the tidyverse lifecycle terminology: https://lifecycle.r-lib.org/articles/stages.html).
[jira] [Created] (ARROW-16267) [Java] Support Java 18
Alessandro Molina created ARROW-16267: Summary: [Java] Support Java 18 Key: ARROW-16267 URL: https://issues.apache.org/jira/browse/ARROW-16267 Project: Apache Arrow Issue Type: Sub-task Reporter: Alessandro Molina Assignee: David Dali Susanibar Arce
[jira] [Created] (ARROW-16266) [R] Add StructArray$create()
Dewey Dunnington created ARROW-16266: Summary: [R] Add StructArray$create() Key: ARROW-16266 URL: https://issues.apache.org/jira/browse/ARROW-16266 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Dewey Dunnington
In ARROW-13371 we implemented the {{make_struct}} compute function, bound to {{data.frame()}} / {{tibble()}} in dplyr evaluation; however, we didn't actually implement {{StructArray$create()}}. In ARROW-15168, it turns out that we need this to support {{StructArray}} creation from data.frames whose columns aren't all convertible using the internal C++ conversion. The hack used in that PR is below (but we should clearly implement the binding to the C++ function instead of using the hack):
{code:R}
library(arrow, warn.conflicts = FALSE)

struct_array <- function(...) {
  batch <- record_batch(...)
  array_ptr <- arrow:::allocate_arrow_array()
  schema_ptr <- arrow:::allocate_arrow_schema()
  batch$export_to_c(array_ptr, schema_ptr)
  Array$import_from_c(array_ptr, schema_ptr)
}

struct_array(a = 1, b = "two")
#> StructArray
#> <struct<a: double, b: string>>
#> -- is_valid: all not null
#> -- child 0 type: double
#> [
#>   1
#> ]
#> -- child 1 type: string
#> [
#>   "two"
#> ]
{code}
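For comparison, the Python bindings already expose this constructor (shown for reference; an R {{StructArray$create()}} would presumably bind the same underlying C++ {{StructArray::Make}}):
{code:python}
import pyarrow as pa

# Equivalent construction via the existing Python binding.
struct = pa.StructArray.from_arrays(
    [pa.array([1.0]), pa.array(["two"])],
    names=["a", "b"],
)
print(struct.type)  # struct<a: double, b: string>
{code}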
[jira] [Created] (ARROW-16265) [C++][Docs]
Jacob Wujciak-Jens created ARROW-16265: Summary: [C++][Docs] Key: ARROW-16265 URL: https://issues.apache.org/jira/browse/ARROW-16265 Project: Apache Arrow Issue Type: Improvement Components: C++, Documentation Reporter: Jacob Wujciak-Jens Fix For: 9.0.0
Add a note [here|https://arrow.apache.org/docs/developers/cpp/windows.html#windows-dependency-resolution-issues] that {{ZSTD_MSVC_STATIC_LIB_SUFFIX}} is automatically set to {{_static}} if MSVC is used (see [PR #7388|https://github.com/apache/arrow/pull/7388]), and that the option needs to be passed as {{-DZSTD_MSVC_STATIC_LIB_SUFFIX=}}. Maybe a log message when the suffix is set would also make sense. I ran into this issue because the current VCPKG version deviated from the zstd_static.lib naming scheme in a recent commit.
[jira] [Created] (ARROW-16264) [C++][CI] Valgrind timeout in arrow-compute-hash-join-node-test
Weston Pace created ARROW-16264: Summary: [C++][CI] Valgrind timeout in arrow-compute-hash-join-node-test Key: ARROW-16264 URL: https://issues.apache.org/jira/browse/ARROW-16264 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace Assignee: Weston Pace
This started showing up once we fixed the valgrind errors in the tpch node test. https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=23628&view=results
[jira] [Created] (ARROW-16263) [Doc] Document backpressure for the C++ streaming exec plan
Weston Pace created ARROW-16263: Summary: [Doc] Document backpressure for the C++ streaming exec plan Key: ARROW-16263 URL: https://issues.apache.org/jira/browse/ARROW-16263 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace Assignee: Weston Pace
This is described somewhat in https://github.com/apache/arrow/pull/12228. We should update our user guide for datasets to help users understand how much RAM they should expect to use and what parameters (if any) are available for tuning.
[jira] [Created] (ARROW-16262) [CI] Kartothek nightly integration build is failing because of Parquet statistics date change
Joris Van den Bossche created ARROW-16262: Summary: [CI] Kartothek nightly integration build is failing because of Parquet statistics date change Key: ARROW-16262 URL: https://issues.apache.org/jira/browse/ARROW-16262 Project: Apache Arrow Issue Type: Test Components: Continuous Integration, Python Reporter: Joris Van den Bossche
Caused by ARROW-7350, see discussion at https://github.com/apache/arrow/pull/12902#issuecomment-1102750381 Upstream issue at https://github.com/JDASoftwareGroup/kartothek/issues/515 In the short term, we should also fix our nightly builds (either by temporarily disabling them altogether, or ideally by skipping just those failing tests).
[jira] [Created] (ARROW-16261) [CI][C++] HDFS Test failures
Antoine Pitrou created ARROW-16261: Summary: [CI][C++] HDFS Test failures Key: ARROW-16261 URL: https://issues.apache.org/jira/browse/ARROW-16261 Project: Apache Arrow Issue Type: Bug Components: C++, Continuous Integration Reporter: Antoine Pitrou Fix For: 8.0.0
The HDFS DeleteDirContents tests are failing, possibly following ARROW-16159.
[jira] [Created] (ARROW-16260) [C++] Add backpressure to aggregate node
Weston Pace created ARROW-16260: Summary: [C++] Add backpressure to aggregate node Key: ARROW-16260 URL: https://issues.apache.org/jira/browse/ARROW-16260 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace
There are two cases I can think of where we might need backpressure handling in the aggregate node, though neither is a concern until we have spillover:
* Once we have spillover, we may want to pause the input while we are busy spilling to disk.
* Once we have spillover, we may want to pause reading from the spillover cache if the sink is busy.