[jira] [Created] (ARROW-16278) [CI] Git installation failure on homebrew
Raúl Cumplido created ARROW-16278: - Summary: [CI] Git installation failure on homebrew Key: ARROW-16278 URL: https://issues.apache.org/jira/browse/ARROW-16278 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration Reporter: Raúl Cumplido Fix For: 8.0.0

Some builds are failing because git cannot be installed via Homebrew. This seems to be related to the new git release:

_With the fixes for CVE-2022-24765 that are common with versions of Git 2.30.4, 2.31.3, 2.32.2, 2.33.3, 2.34.3, and 2.35.3, Git has been taught not to recognise repositories owned by other users, in order to avoid getting affected by their config files and hooks. You can list the path to the safe/trusted repositories that may be owned by others on a multi-valued configuration variable safe.directory to override this behaviour, or use '*' to declare that you trust anything._

Failed job example https://github.com/apache/arrow/runs/6114985460?check_suite_focus=true:

{code:java}
Installing automake
Installing aws-sdk-cpp
Installing boost
Using brotli
Using c-ares
Installing ccache
Using cmake
Installing flatbuffers
Installing git
==> Downloading https://ghcr.io/v2/homebrew/core/git/manifests/2.36.0
==> Downloading https://ghcr.io/v2/homebrew/core/git/blobs/sha256:5739e703f9ad34dba01e343d76f363143f740bf6e05c945c8f19a073546c6ce5
==> Downloading from https://pkg-containers.githubusercontent.com/ghcr1/blobs/sha256:5739e703f9ad34dba01e343d76f363143f740bf6e05c945c8f19a073546c6ce5?se=2022-04-21T18%3A35%3A00Z&sig=ZdiaSBdomnIwd4Ga4PORXPs2%2FYZXrrLLaks61mgmyEs%3D&sp=r&spr=https&sr=b&sv=2019-12-12
==> Pouring git--2.36.0.big_sur.bottle.tar.gz
Error: The `brew link` step did not complete successfully
The formula built, but is not symlinked into /usr/local
Could not symlink etc/bash_completion.d/git-completion.bash
Target /usr/local/etc/bash_completion.d/git-completion.bash is a symlink belonging to git@2.35.1.
You can unlink it:
  brew unlink git@2.35.1
To force the link and overwrite all conflicting files:
  brew link --overwrite git
To list all files that would be deleted:
  brew link --overwrite --dry-run git
Possible conflicting files are:
/usr/local/etc/bash_completion.d/git-completion.bash -> /usr/local/Cellar/git@2.35.1/2.35.1/etc/bash_completion.d/git-completion.bash
/usr/local/etc/bash_completion.d/git-prompt.sh -> /usr/local/Cellar/git@2.35.1/2.35.1/etc/bash_completion.d/git-prompt.sh
/usr/local/bin/git -> /usr/local/Cellar/git@2.35.1/2.35.1/bin/git
/usr/local/bin/git-cvsserver -> /usr/local/Cellar/git@2.35.1/2.35.1/bin/git-cvsserver
/usr/local/bin/git-receive-pack -> /usr/local/Cellar/git@2.35.1/2.35.1/bin/git-receive-pack
/usr/local/bin/git-shell -> /usr/local/Cellar/git@2.35.1/2.35.1/bin/git-shell
/usr/local/bin/git-upload-archive -> /usr/local/Cellar/git@2.35.1/2.35.1/bin/git-upload-archive
/usr/local/bin/git-upload-pack -> /usr/local/Cellar/git@2.35.1/2.35.1/bin/git-upload-pack
Error: Could not symlink share/doc/git-doc/MyFirstContribution.html
Target /usr/local/share/doc/git-doc/MyFirstContribution.html is a symlink belonging to git@2.35.1.
You can unlink it:
  brew unlink git@2.35.1
To force the link and overwrite all conflicting files:
  brew link --overwrite git@2.35.1
To list all files that would be deleted:
  brew link --overwrite --dry-run git@2.35.1
Installing git has failed!
Installing glog
Installing grpc
Using llvm
Installing llvm@12
Using lz4
Installing minio
Installing ninja
Installing numpy
Using openssl@1.1
Installing protobuf
Using python
Installing rapidjson
Installing snappy
Installing thrift
Using wget
Using zstd
Homebrew Bundle failed! 1 Brewfile dependency failed to install.
Error: Process completed with exit code 1.
{code}
-- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16279) [Python] Support Expressions in `Table.filter`
Alessandro Molina created ARROW-16279: - Summary: [Python] Support Expressions in `Table.filter` Key: ARROW-16279 URL: https://issues.apache.org/jira/browse/ARROW-16279 Project: Apache Arrow Issue Type: New Feature Components: Python Reporter: Alessandro Molina Assignee: Alessandro Molina Fix For: 9.0.0 *Umbrella ticket* At the moment {{Table.filter}} only accepts a mask, and building a mask that actually leads to the rows we care about can be complex and slow in cases where more than one compute function is used to generate the mask. It would be helpful to be able to pass an {{Expression}} as the argument and get the table filtered by that expression as expressions are easier to understand and reason about than masks. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16280) [C++] Avoid copying shared_ptr in Expression::type()
Tobias Zagorni created ARROW-16280: -- Summary: [C++] Avoid copying shared_ptr in Expression::type() Key: ARROW-16280 URL: https://issues.apache.org/jira/browse/ARROW-16280 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Tobias Zagorni Assignee: Tobias Zagorni Split off from ARROW-16161, since this is a fairly straightforward fix and completely independent of ExecBatch. Expression::type() currently copies a shared_ptr, while the return value is often used directly. We can avoid copying the shared_ptr by returning a reference to it. This reduces thread contention on these shared_ptrs (ARROW-16161). -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16281) [R] [CI]
Jonathan Keane created ARROW-16281: -- Summary: [R] [CI] Key: ARROW-16281 URL: https://issues.apache.org/jira/browse/ARROW-16281 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, R Reporter: Jonathan Keane Assignee: Jacob Wujciak-Jens Now that R 4.2 is released, we should bump all of our R versions where we have ones hardcoded. This will mean dropping support for 3.4 entirely and adding in 4.0 to https://github.com/apache/arrow/blob/c4b646e715d155c1f77d34804796864465caa97b/dev/tasks/r/github.linux.versions.yml#L34 There are a few other places that we have hard-coded versions (we might need to wait a few days for these to catch up): https://github.com/apache/arrow/blob/c4b646e715d155c1f77d34804796864465caa97b/dev/tasks/tasks.yml#L1291-L1295 https://github.com/apache/arrow/blob/c4b646e715d155c1f77d34804796864465caa97b/.github/workflows/r.yml#L60 (and a few other places in that file — though one note: we build an old version of windows that uses rtools35 in the GHA CI so that we catch when we break that — we'll want to keep that!) -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16282) [CI] [C#] Verify release on c-sharp has been failing since upgrading ubuntu to 22.04
Raúl Cumplido created ARROW-16282: - Summary: [CI] [C#] Verify release on c-sharp has been failing since upgrading ubuntu to 22.04 Key: ARROW-16282 URL: https://issues.apache.org/jira/browse/ARROW-16282 Project: Apache Arrow Issue Type: Bug Components: C#, Continuous Integration Reporter: Raúl Cumplido Fix For: 8.0.0

We upgraded the verify-release job for c-sharp from Ubuntu 20.04 to Ubuntu 22.04 and the nightly release job has been failing since then.

Working for ubuntu 20.04 on 2022-04-08: [https://github.com/ursacomputing/crossbow/tree/nightly-release-2022-04-08-0-github-verify-rc-source-csharp-linux-ubuntu-20.04-amd64]

Failing for ubuntu 22.04 on 2022-04-09: [https://github.com/ursacomputing/crossbow/tree/nightly-release-2022-04-09-0-github-verify-rc-source-csharp-linux-ubuntu-22.04-amd64]

The error seems to be related to a missing libssl:

{code:java}
=== Build and test C# libraries ===
└ Ensuring that C# is installed...
└ Installed C# at (.NET 3.1.405)

Welcome to .NET Core 3.1!
SDK Version: 3.1.405

Telemetry
The .NET Core tools collect usage data in order to help us improve your experience. It is collected by Microsoft and shared with the community.
You can opt-out of telemetry by setting the DOTNET_CLI_TELEMETRY_OPTOUT environment variable to '1' or 'true' using your favorite shell.
Read more about .NET Core CLI Tools telemetry: https://aka.ms/dotnet-cli-telemetry

Explore documentation: https://aka.ms/dotnet-docs
Report issues and find source on GitHub: https://github.com/dotnet/core
Find out what's new: https://aka.ms/dotnet-whats-new
Learn about the installed HTTPS developer cert: https://aka.ms/aspnet-core-https
Use 'dotnet --help' to see available commands or visit: https://aka.ms/dotnet-cli-docs
Write your first app: https://aka.ms/first-net-core-app

No usable version of libssl was found
/arrow/dev/release/verify-release-candidate.sh: line 325: 49 Aborted (core dumped) dotnet tool install --tool-path ${csharp_bin} sourcelink
Failed to verify release candidate. See /tmp/arrow-HEAD.CiwJM for details.
134
Error: `docker-compose --file /home/runner/work/crossbow/crossbow/arrow/docker-compose.yml run --rm -e VERIFY_VERSION= -e VERIFY_RC= -e TEST_DEFAULT=0 -e TEST_CSHARP=1 ubuntu-verify-rc` exited with a non-zero exit code 134, see the process log above.
{code}
-- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16283) [Go] Cleanup Panics in new Buffered Reader
Matthew Topol created ARROW-16283: - Summary: [Go] Cleanup Panics in new Buffered Reader Key: ARROW-16283 URL: https://issues.apache.org/jira/browse/ARROW-16283 Project: Apache Arrow Issue Type: Improvement Components: Go Reporter: Matthew Topol Assignee: Matthew Topol -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16284) [Python][Packaging] Use delocate-fuse to create universal2 wheels
Krisztian Szucs created ARROW-16284: --- Summary: [Python][Packaging] Use delocate-fuse to create universal2 wheels Key: ARROW-16284 URL: https://issues.apache.org/jira/browse/ARROW-16284 Project: Apache Arrow Issue Type: Improvement Components: Packaging, Python Reporter: Krisztian Szucs Previously we used specific universal2 configurations for vcpkg to build the dependencies containing symbols for both architectures. This approach proved fragile to vcpkg changes, making it hard to upgrade the vcpkg version. As an example, https://github.com/apache/arrow/pull/12893 bumps the vcpkg version to one where absl stopped compiling for two CMAKE_OSX_ARCHITECTURES; this has already been fixed in absl's upstream, but the fix has not been released yet. The new approach uses multibuild's delocate to build the wheels for arm64 and amd64 separately and fuse them in a subsequent step into a universal2 wheel (using {{lipo}} under the hood). -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16285) [CI][Python] Enable skipped kartothek integration tests
Jacob Wujciak-Jens created ARROW-16285: -- Summary: [CI][Python] Enable skipped kartothek integration tests Key: ARROW-16285 URL: https://issues.apache.org/jira/browse/ARROW-16285 Project: Apache Arrow Issue Type: Task Components: Continuous Integration, Python Reporter: Jacob Wujciak-Jens -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16286) [C++] SimplifyWithGuarantee does not work with non-deterministic expressions
Weston Pace created ARROW-16286: --- Summary: [C++] SimplifyWithGuarantee does not work with non-deterministic expressions Key: ARROW-16286 URL: https://issues.apache.org/jira/browse/ARROW-16286 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace If an expression is non-deterministic (e.g. "random") then SimplifyWithGuarantee may incorrectly think it can fold constants. For example, if the call is {{random()}} then {{SimplifyWithGuarantee}} will detect that all the arguments are constants (or, more accurately, there are zero non-constant arguments) and decide it can execute the expression immediately and fold it into a constant. We could maybe add a hack for the random case since it is the only nullary function but, in general, we will probably need a way to define functions as "non-deterministic" and prevent constant folding. -- This message was sent by Atlassian Jira (v8.20.7#820007)
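The hazard can be illustrated outside Arrow with a plain-Python sketch (the names here are illustrative, not the Arrow expression machinery): folding a non-deterministic call into a constant during simplification pins a single value for every subsequent evaluation.

```python
import random

def fold_constant(fn):
    """Naive constant folding: evaluate a nullary call once at
    simplification time and reuse the result forever after."""
    value = fn()
    return lambda: value

folded = fold_constant(random.random)

# Every "row" now sees the same value, which is exactly the wrong
# behaviour for a non-deterministic function like random().
assert folded() == folded()
```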
[jira] [Created] (ARROW-16287) PyArrow: RuntimeError: AppendRowGroups requires equal schemas when writing _metadata file
Kyle Barron created ARROW-16287: --- Summary: PyArrow: RuntimeError: AppendRowGroups requires equal schemas when writing _metadata file Key: ARROW-16287 URL: https://issues.apache.org/jira/browse/ARROW-16287 Project: Apache Arrow Issue Type: Bug Components: Parquet Affects Versions: 7.0.0 Environment: MacOS. Python 3.8.10. pyarrow: '7.0.0' pandas: '1.4.2' numpy: '1.22.3' Reporter: Kyle Barron

I'm trying to follow the example here: [https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files] to write an example partitioned dataset, but I'm consistently getting an error about non-equal schemas. Here's an MCVE:

```
from pathlib import Path

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

size = 100_000_000
partition_col = np.random.randint(0, 10, size)
values = np.random.rand(size)
table = pa.Table.from_pandas(
    pd.DataFrame({"partition_col": partition_col, "values": values})
)
metadata_collector = []
root_path = Path("random.parquet")
pq.write_to_dataset(
    table,
    root_path,
    partition_cols=["partition_col"],
    metadata_collector=metadata_collector,
)

# Write the ``_common_metadata`` parquet file without row groups statistics
pq.write_metadata(table.schema, root_path / "_common_metadata")

# Write the ``_metadata`` parquet file with row groups statistics of all files
pq.write_metadata(
    table.schema, root_path / "_metadata", metadata_collector=metadata_collector
)
```

This raises the error

```
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [92], in ()
----> 1 pq.write_metadata(
      2     table.schema, root_path / "_metadata", metadata_collector=metadata_collector
      3 )

File ~/tmp/env/lib/python3.8/site-packages/pyarrow/parquet.py:2324, in write_metadata(schema, where, metadata_collector, **kwargs)
   2322 metadata = read_metadata(where)
   2323 for m in metadata_collector:
-> 2324     metadata.append_row_groups(m)
   2325 metadata.write_metadata_file(where)

File ~/tmp/env/lib/python3.8/site-packages/pyarrow/_parquet.pyx:628, in pyarrow._parquet.FileMetaData.append_row_groups()

RuntimeError: AppendRowGroups requires equal schemas.
```

But all schemas in the `metadata_collector` list seem to be the same:

```
all(metadata_collector[0].schema == meta.schema for meta in metadata_collector)
# True
```
-- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16288) [C++] ValueDescr::SCALAR nearly unused and does not work for projection
Weston Pace created ARROW-16288: --- Summary: [C++] ValueDescr::SCALAR nearly unused and does not work for projection Key: ARROW-16288 URL: https://issues.apache.org/jira/browse/ARROW-16288 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace First, there are almost no kernels that actually use this shape. Only the functions "all", "any", "list_element", "mean", "product", "struct_field", and "sum" have kernels with this shape. Most kernels that have special logic for scalars handle it by using {{ValueDescr::ANY}}. Second, when passing an expression to the project node, the expression must be bound based on the dataset schema. Since the binding happens based on a schema (and not a batch) the function is bound to ValueDescr::ARRAY (https://github.com/apache/arrow/blob/a16be6b7b6c8271202ff766b99c199b2e29bdfa8/cpp/src/arrow/compute/exec/expression.cc#L461). This results in an error if the function has only ValueDescr::SCALAR kernels, and would likely be a problem even if the function had both types of kernels because it would get bound to the wrong kernel. The simplest fix may be to just get rid of ValueDescr and change all kernels to ValueDescr::ANY behavior. If we choose to keep it we will need to figure out how to handle this kind of binding. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16289) [C++] (eventually) abandon scalar columns of an ExecBatch in favor of RLE encoded arrays
Weston Pace created ARROW-16289: --- Summary: [C++] (eventually) abandon scalar columns of an ExecBatch in favor of RLE encoded arrays Key: ARROW-16289 URL: https://issues.apache.org/jira/browse/ARROW-16289 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace This JIRA is a proposal / discussion. I am not asserting this is the way to go but I would like to consider it. From the execution engine's perspective an exec batch's columns are always either arrays or scalars. The only time we make use of scalars today is for the four augmented columns (e.g. __filename). Once we have support for RLE arrays a scalar could easily be encoded as an RLE array and there would be no need to use scalars here. The advantage would be reducing the complexity in exec nodes and avoiding issues like ARROW-16288. It is already rather difficult to explain the idea of a "scalar" and "vector" function and then have to turn around and explain that the word "scalar" has an entirely different meaning when talking about field shape. I think it's worth considering taking this even further and removing the concept from the compute layer entirely. Kernel functions that want to have special logic for scalars could do so using the RLE array. This would be a significant change to many kernels which currently declare the ANY shape and determine which logic to apply within the kernel itself (e.g. there is one array OR scalar kernel and not one kernel for each). Admittedly, handling an RLE scalar probably costs a few more instructions and a few more bytes than the scalar we have today. However, these are just different flavors of O(1) and not likely to have significant impact. -- This message was sent by Atlassian Jira (v8.20.7#820007)
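How a scalar column could become an RLE array can be sketched in plain Python (a toy encoding for illustration, not Arrow's actual RLE layout): a scalar spanning n rows is simply a single run.

```python
def scalar_as_rle(value, length):
    """Encode a 'scalar column' of the given length as RLE with a
    single run: one run end and one value (toy representation)."""
    return {"run_ends": [length], "values": [value]}

def rle_decode(rle):
    """Expand the toy RLE encoding back into a plain list."""
    out, prev = [], 0
    for end, value in zip(rle["run_ends"], rle["values"]):
        out.extend([value] * (end - prev))
        prev = end
    return out

# A scalar 7 over a batch of 4 rows costs O(1) storage...
encoded = scalar_as_rle(7, 4)
# ...but decodes to the same column an array-shaped value would give.
assert rle_decode(encoded) == [7, 7, 7, 7]
```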
[jira] [Created] (ARROW-16290) [C++] ExecuteScalarExpression, when calling a nullary function on a nullary batch, resets the batch length to 1
Weston Pace created ARROW-16290: --- Summary: [C++] ExecuteScalarExpression, when calling a nullary function on a nullary batch, resets the batch length to 1 Key: ARROW-16290 URL: https://issues.apache.org/jira/browse/ARROW-16290 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace At the moment ARROW-16286 prevents us from using ExecuteScalarExpression on nullary functions. However, if we bypass constant folding, then we run into another problem. The batch passed to the function always has length = 1. This appears to be tied up with the logic of ExecBatchIterator that I don't quite follow entirely. However, we should be preserving the batch length of the input to ExecuteScalarExpression and passing that to the function. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16291) [Java]: Support JSE17 for Java Cookbooks
David Dali Susanibar Arce created ARROW-16291: - Summary: [Java]: Support JSE17 for Java Cookbooks Key: ARROW-16291 URL: https://issues.apache.org/jira/browse/ARROW-16291 Project: Apache Arrow Issue Type: Sub-task Reporter: David Dali Susanibar Arce Assignee: David Dali Susanibar Arce Make the changes needed to run the cookbooks on JSE17. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16292) [Java][Doc]: Upgrade java documentation for JSE17
David Dali Susanibar Arce created ARROW-16292: - Summary: [Java][Doc]: Upgrade java documentation for JSE17 Key: ARROW-16292 URL: https://issues.apache.org/jira/browse/ARROW-16292 Project: Apache Arrow Issue Type: Sub-task Components: Documentation, Java Affects Versions: 9.0.0 Reporter: David Dali Susanibar Arce Assignee: David Dali Susanibar Arce

Document the changes needed to support JSE17:

# Changes on the Arrow side: changes related to {{--add-exports}} are needed to keep supporting Error Prone on JSE11+ (see the [installation doc|https://errorprone.info/docs/installation]). This means these changes are not needed if you build the Arrow Java code without Error Prone validation (mvn clean install -P-error-prone-jdk11+).
# Changes as a user of Arrow: users planning to use Arrow with JSE17 need to pass the required modules. For example, running the IO cookbook https://arrow.apache.org/cookbook/java/io.html fails with the error {{Unable to make field long java.nio.Buffer.address accessible: module java.base does not "opens java.nio" to unnamed module}}. For that reason, a JSE17 user (with no Arrow change needed) has to add VM arguments such as {{-ea --add-opens=java.base/java.nio=ALL-UNNAMED}}, after which it finishes without errors.

This ticket is related to https://github.com/apache/arrow/pull/12941#pullrequestreview-950090643 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16293) [CI][GLib] Tests are unstable
Kouhei Sutou created ARROW-16293: Summary: [CI][GLib] Tests are unstable Key: ARROW-16293 URL: https://issues.apache.org/jira/browse/ARROW-16293 Project: Apache Arrow Issue Type: Test Components: Continuous Integration, GLib Reporter: Kouhei Sutou Assignee: Kouhei Sutou 1. macOS test is timed out because ccache cache isn't available: https://github.com/apache/arrow/runs/6134456502?check_suite_focus=true 2. {{gparquet_row_group_metadata_equal()}} isn't stable on Windows: https://github.com/apache/arrow/runs/6134457213?check_suite_focus=true#step:14:308 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16294) [C++] Improve performance of parquet readahead
Weston Pace created ARROW-16294: --- Summary: [C++] Improve performance of parquet readahead Key: ARROW-16294 URL: https://issues.apache.org/jira/browse/ARROW-16294 Project: Apache Arrow Issue Type: Improvement Reporter: Weston Pace The 7.0.0 readahead for parquet would read up to 256 row groups at once which meant that, if the consumer were too slow, we would almost certainly run out of memory. ARROW-15410 improved readahead as a whole and, in the process, changed parquet so it's always reading 1 row group in advance. This is not always ideal in S3 scenarios. We may want to read many row groups in advance if the row groups are small. To fix this we should continue reading in parallel until there are at least batch_size * batch_readahead rows being fetched. -- This message was sent by Atlassian Jira (v8.20.7#820007)
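The proposed rule can be sketched as follows (a hypothetical helper, not the Arrow C++ implementation): keep issuing row-group reads until the rows in flight reach batch_size * batch_readahead.

```python
def row_groups_to_prefetch(row_group_sizes, batch_size, batch_readahead):
    """Return how many leading row groups to read in parallel so that
    at least batch_size * batch_readahead rows are being fetched."""
    target = batch_size * batch_readahead
    fetched = rows = 0
    for n_rows in row_group_sizes:
        if rows >= target:
            break
        rows += n_rows
        fetched += 1
    return fetched

# Small row groups: many are fetched to cover the readahead target
# (66 groups of 1,000 rows to exceed 32,768 * 2 = 65,536 rows).
assert row_groups_to_prefetch([1_000] * 100, 32_768, 2) == 66
# One huge row group already exceeds the target on its own, which
# matches today's behaviour of reading a single row group ahead.
assert row_groups_to_prefetch([10_000_000], 32_768, 2) == 1
```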
[jira] [Created] (ARROW-16295) [CI][Release] verify-rc-source-windows still uses windows-2016
Kouhei Sutou created ARROW-16295: Summary: [CI][Release] verify-rc-source-windows still uses windows-2016 Key: ARROW-16295 URL: https://issues.apache.org/jira/browse/ARROW-16295 Project: Apache Arrow Issue Type: Test Components: Continuous Integration Reporter: Kouhei Sutou Assignee: Kouhei Sutou windows-2016 is deprecated: https://github.com/actions/virtual-environments/issues/4312 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16296) [GLib][Parquet] Add missing casts for GArrowRoundMode
Kouhei Sutou created ARROW-16296: Summary: [GLib][Parquet] Add missing casts for GArrowRoundMode Key: ARROW-16296 URL: https://issues.apache.org/jira/browse/ARROW-16296 Project: Apache Arrow Issue Type: Improvement Components: GLib, Parquet Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.20.7#820007)