[jira] [Created] (ARROW-18048) [Dev][Archery][Crossbow] Comment bot doesn't report suitable task URLs
Kouhei Sutou created ARROW-18048: Summary: [Dev][Archery][Crossbow] Comment bot doesn't report suitable task URLs Key: ARROW-18048 URL: https://issues.apache.org/jira/browse/ARROW-18048 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Kouhei Sutou Assignee: Kouhei Sutou If we generate a report immediately after submitting CI tasks, the report doesn't include suitable CI task URLs because the CI tasks haven't started yet. We need to wait for a while before generating the report so that we can collect suitable CI task URLs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
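The fix described above amounts to polling: wait, re-query, and only generate the report once the URLs are available. A minimal Python sketch is below; `fetch_urls`, the attempt count, and the delay are hypothetical stand-ins for whatever Archery/Crossbow actually uses, not its real API.

```python
import time

def wait_for_task_urls(fetch_urls, attempts=10, delay=60):
    """Poll until CI task URLs become available.

    `fetch_urls` is a hypothetical callable that queries the CI service
    and returns a (possibly empty) list of task URLs.
    """
    for _ in range(attempts):
        urls = fetch_urls()
        if urls:
            return urls
        time.sleep(delay)  # CI tasks need a moment to start
    return []
```

The report generator would then call this before rendering, instead of querying the URLs once immediately after submission.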
[jira] [Created] (ARROW-18049) [R] Support column renaming in col_select argument to file reading functions
Nicola Crane created ARROW-18049: Summary: [R] Support column renaming in col_select argument to file reading functions Key: ARROW-18049 URL: https://issues.apache.org/jira/browse/ARROW-18049 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane We should support the ability to rename columns when reading in data via the CSV/Parquet/Feather/JSON file readers. We currently have an argument {{col_select}}, which allows users to choose which columns to read in, but renaming doesn't work. To implement this, we'd need to check whether any columns have been renamed by {{col_select}} and then update the schema of the object being returned once the file has been read.
{code:r}
library(readr)
library(arrow)

readr::read_csv(readr_example("mtcars.csv"), col_select = c(not_hp = hp))
#> # A tibble: 32 × 1
#>    not_hp
#>     <dbl>
#>  1    110
#>  2    110
#>  3     93
#>  4    110
#>  5    175
#>  6    105
#>  7    245
#>  8     62
#>  9     95
#> 10    123
#> # … with 22 more rows

arrow::read_csv_arrow(readr_example("mtcars.csv"), col_select = c(not_hp = hp))
#> # A tibble: 32 × 1
#>       hp
#>    <int>
#>  1   110
#>  2   110
#>  3    93
#>  4   110
#>  5   175
#>  6   105
#>  7   245
#>  8    62
#>  9    95
#> 10   123
#> # … with 22 more rows
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18050) [C++] Substrait consumer should reject plans containing options that it doesn't recognize
Weston Pace created ARROW-18050: --- Summary: [C++] Substrait consumer should reject plans containing options that it doesn't recognize Key: ARROW-18050 URL: https://issues.apache.org/jira/browse/ARROW-18050 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace This is a follow-up to ARROW-17966. This is a bit tricky because today we basically just pass the call to:
{noformat}
using SubstraitCallToArrow = std::function<Result<compute::Expression>(const SubstraitCall&)>;
{noformat}
We then have no way of knowing which options were actually used by the call, and therefore no way of checking that all of the given options were consumed. This is rather important, since an option that is given but not recognized should be rejected: it means the producer is asking for some kind of specific behavior that we are probably not going to be able to provide. -- This message was sent by Atlassian Jira (v8.20.10#820010)
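The bookkeeping the ticket asks for can be illustrated with a short sketch (Python here for brevity, though the consumer itself is C++). `convert_call`, `call_options`, and `recognized` are invented names; the point is only the check that every supplied option was understood before conversion proceeds.

```python
def convert_call(call_options, recognized, convert):
    """Reject the call if any provided option is not recognized.

    `call_options` maps option names to values, `recognized` is the set
    of option names this consumer understands, and `convert` performs
    the actual translation. All names are illustrative, not Arrow's API.
    """
    unknown = set(call_options) - set(recognized)
    if unknown:
        # A given-but-unrecognized option means the producer wants
        # behavior we probably can't provide, so fail loudly.
        raise ValueError("unrecognized Substrait options: %s" % sorted(unknown))
    return convert(call_options)
```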
[jira] [Created] (ARROW-18051) [C++] Enable tests skipped by ARROW-16392
Weston Pace created ARROW-18051: --- Summary: [C++] Enable tests skipped by ARROW-16392 Key: ARROW-18051 URL: https://issues.apache.org/jira/browse/ARROW-18051 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace There are a number of unit tests that we still skip (on Windows) due to ARROW-16392. However, ARROW-16392 has been fixed. There is no reason to skip these any longer. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18052) pyarrow.parquet.write_to_dataset permission denied when writing to s3 bucket in another account
John created ARROW-18052: Summary: pyarrow.parquet.write_to_dataset permission denied when writing to s3 bucket in another account Key: ARROW-18052 URL: https://issues.apache.org/jira/browse/ARROW-18052 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 9.0.0 Reporter: John We are using pyarrow.parquet.write_to_dataset to write to an S3 file system in another account. This fails because write_to_dataset calls pyarrow.dataset.write_dataset with create_dir=True, and there is no parameter on write_to_dataset that allows you to set create_dir to False. When create_dir is set to False, the write works correctly. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18053) [Dev] merge_arrow_pr.py doesn't detect Co-authored-by:
Kouhei Sutou created ARROW-18053: Summary: [Dev] merge_arrow_pr.py doesn't detect Co-authored-by: Key: ARROW-18053 URL: https://issues.apache.org/jira/browse/ARROW-18053 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Kouhei Sutou Assignee: Kouhei Sutou For example, {{Co-authored-by:}} in https://github.com/apache/arrow/pull/14381/commits/41b7f26a3631c5c6cfa3abd369fbf39263cfb536 isn't detected. -- This message was sent by Atlassian Jira (v8.20.10#820010)
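Trailer detection of this kind is commonly done with a line-anchored regular expression over the commit message. A minimal Python sketch (not merge_arrow_pr.py's actual implementation):

```python
import re

# Matches git trailer lines of the form "Co-authored-by: Name <email>"
CO_AUTHOR_RE = re.compile(
    r"^Co-authored-by:\s*(?P<name>.+?)\s*<(?P<email>[^>]+)>\s*$",
    re.MULTILINE,
)

def find_co_authors(commit_message):
    """Return (name, email) tuples for every Co-authored-by: trailer."""
    return CO_AUTHOR_RE.findall(commit_message)
```

A merge script would run this over every commit message in the PR and append the results to the squashed commit's trailers.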
[jira] [Created] (ARROW-18054) [Python][CI] Enable Cython tests on windows wheels
Raúl Cumplido created ARROW-18054: - Summary: [Python][CI] Enable Cython tests on windows wheels Key: ARROW-18054 URL: https://issues.apache.org/jira/browse/ARROW-18054 Project: Apache Arrow Issue Type: Task Components: Continuous Integration, Python Reporter: Raúl Cumplido Assignee: Raúl Cumplido Fix For: 10.0.0 We currently have `set PYARROW_TEST_CYTHON=OFF` in the Windows wheel tests. We should run the Cython tests on Windows too. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18055) arrow-dataset-dataset-writer-test still times out occasionally
Weston Pace created ARROW-18055: --- Summary: arrow-dataset-dataset-writer-test still times out occasionally Key: ARROW-18055 URL: https://issues.apache.org/jira/browse/ARROW-18055 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Weston Pace Fix For: 10.0.0 https://github.com/ursacomputing/crossbow/actions/runs/3242586440/jobs/5316061574 https://github.com/ursacomputing/crossbow/actions/runs/3246657465/jobs/5325660174 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18056) [Ruby] Add support for building Arrow::Table from {name: Arrow::Tensor}
Kouhei Sutou created ARROW-18056: Summary: [Ruby] Add support for building Arrow::Table from {name: Arrow::Tensor} Key: ARROW-18056 URL: https://issues.apache.org/jira/browse/ARROW-18056 Project: Apache Arrow Issue Type: Improvement Components: Ruby Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18057) [R] tests for slice functions fail on builds without Datasets capability
Nicola Crane created ARROW-18057: Summary: [R] tests for slice functions fail on builds without Datasets capability Key: ARROW-18057 URL: https://issues.apache.org/jira/browse/ARROW-18057 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane The changes in ARROW-13766 introduced a test which depends on datasets functionality being enabled; we should skip it on CI builds where that is not the case. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18058) [Dev][Archery] Remove removed ARROW_JNI related code
Kouhei Sutou created ARROW-18058: Summary: [Dev][Archery] Remove removed ARROW_JNI related code Key: ARROW-18058 URL: https://issues.apache.org/jira/browse/ARROW-18058 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18059) [CI][Packaging] Nightly test for centos-8-stream-arm64 times out installing arrow-libs
Raúl Cumplido created ARROW-18059: - Summary: [CI][Packaging] Nightly test for centos-8-stream-arm64 times out installing arrow-libs Key: ARROW-18059 URL: https://issues.apache.org/jira/browse/ARROW-18059 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration, Packaging Reporter: Raúl Cumplido It seems we sometimes time out when installing some dependencies:
{code:java}
+ dnf install -y --enablerepo=powertools arrow9-libs
Last metadata expiration check: 0:00:05 ago on Thu Oct 13 10:45:07 2022.
Dependencies resolved.
 Package      Arch     Version      Repository                   Size
Installing:
 arrow9-libs  aarch64  9.0.0-1.el8  apache-arrow-centos-stream  6.3 M

Transaction Summary
Install  1 Package

Total download size: 6.3 M
Installed size: 26 M
Downloading Packages:
[MIRROR] arrow9-libs-9.0.0-1.el8.aarch64.rpm: Curl error (28): Timeout was reached for https://apache.jfrog.io/artifactory/arrow/centos/8-stream/aarch64/Packages/arrow9-libs-9.0.0-1.el8.aarch64.rpm [Operation timed out after 3 milliseconds with 0 out of 0 bytes received]
[MIRROR] arrow9-libs-9.0.0-1.el8.aarch64.rpm: Curl error (35): SSL connect error for https://apache.jfrog.io/artifactory/arrow/centos/8-stream/aarch64/Packages/arrow9-libs-9.0.0-1.el8.aarch64.rpm [OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to apache.jfrog.io:443 ]
[MIRROR] arrow9-libs-9.0.0-1.el8.aarch64.rpm: Curl error (35): SSL connect error for https://apache.jfrog.io/artifactory/arrow/centos/8-stream/aarch64/Packages/arrow9-libs-9.0.0-1.el8.aarch64.rpm [OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to apache.jfrog.io:443 ]
[MIRROR] arrow9-libs-9.0.0-1.el8.aarch64.rpm: Curl error (28): Timeout was reached for https://apache.jfrog.io/artifactory/arrow/centos/8-stream/aarch64/Packages/arrow9-libs-9.0.0-1.el8.aarch64.rpm [Operation too slow. Less than 1000 bytes/sec transferred the last 30 seconds]
[FAILED] arrow9-libs-9.0.0-1.el8.aarch64.rpm: No more mirrors to try - All mirrors were already tried without success
{code}
This is the job that failed:
* [centos-8-stream-arm64|https://github.com/ursacomputing/crossbow/runs/8864772917]
We should probably retry, or see whether there is a way to override the default timeout, since the library is available. -- This message was sent by Atlassian Jira (v8.20.10#820010)
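If the download limits themselves are the problem, dnf reads them from its configuration, so one possible mitigation is a CI-side override. The values below are illustrative, not recommendations:

```ini
# /etc/dnf/dnf.conf (illustrative values)
[main]
# per-connection timeout in seconds
timeout=300
# number of times to retry a failing download
retries=10
# only fail when throughput stays below this many bytes/sec
minrate=1000
```

The same options can be passed ad hoc without editing the file, e.g. `dnf install -y --setopt=timeout=300 --setopt=retries=10 --enablerepo=powertools arrow9-libs`.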
[jira] [Created] (ARROW-18060) [C++] Writing a dataset with 0 rows doesn't create any files
David Li created ARROW-18060: Summary: [C++] Writing a dataset with 0 rows doesn't create any files Key: ARROW-18060 URL: https://issues.apache.org/jira/browse/ARROW-18060 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 9.0.0 Reporter: David Li If the input data has no rows, no files get created. This is potentially unexpected, as it looks like "nothing happened". It might be nicer to create an empty file. With partitioning, though, that gets weird (there are no partition values), so maybe an error would make more sense instead. Reproduction in Python:
{code:python}
import tempfile
from pathlib import Path

import pyarrow
import pyarrow.dataset

print("PyArrow version:", pyarrow.__version__)

table = pyarrow.table([
    [],
], schema=pyarrow.schema([
    ("ints", "int64"),
]))

with tempfile.TemporaryDirectory() as d:
    pyarrow.dataset.write_dataset(table, d, format="feather")
    print(list(Path(d).iterdir()))
{code}
Output:
{noformat}
> python repro.py
PyArrow version: 9.0.0
[]
{noformat}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18061) [CI][R] Reduce number of jobs on every commit
Neal Richardson created ARROW-18061: --- Summary: [CI][R] Reduce number of jobs on every commit Key: ARROW-18061 URL: https://issues.apache.org/jira/browse/ARROW-18061 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, R Reporter: Neal Richardson
* Force tests: true fully contains all tests in the false version, so false is redundant
* The CentOS 7 job was previously useful to catch gcc 4.8 issues, but that's no longer a problem, and the CentOS with devtoolset-8 variant is already tested nightly
* The Windows builds can be pruned if we test only on the current release and development versions, which only use UCRT. We test the older build nightly in the r-binary-packages workflow
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18062) [R] error in CI jobs for R 3.5 and 3.6 when R package being installed
Nicola Crane created ARROW-18062: Summary: [R] error in CI jobs for R 3.5 and 3.6 when R package being installed Key: ARROW-18062 URL: https://issues.apache.org/jira/browse/ARROW-18062 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Nicola Crane e.g. https://github.com/ursacomputing/crossbow/actions/runs/3246698242/jobs/5325752692#step:5:3164 From the install logs on that CI job:
{code}
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
Error: package or namespace load failed for ‘arrow’:
 .onLoad failed in loadNamespace() for 'arrow', details:
  call: fun_cache[[unqualified_name]] <- fun
  error: invalid type/length (closure/0) in vector allocation
Error: loading failed
{code}
The nightlies for R 3.5 and 3.6 are currently failing with this error. The line of code where the error comes from was added in ARROW-16444, but seeing as that was 3 months ago, it seems unlikely that that change introduced the error. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18063) [C++][Python]
Ben Kietzman created ARROW-18063: Summary: [C++][Python] Key: ARROW-18063 URL: https://issues.apache.org/jira/browse/ARROW-18063 Project: Apache Arrow Issue Type: Improvement Reporter: Ben Kietzman [Mailing list thread|https://lists.apache.org/thread/r484sqrd6xjdd058prbrcwh3t5vg91so] The goal is to:
- generate a Substrait plan in Python using Ibis
- ... wherein tables are specified using custom URLs
- use the Python API {{run_query}} to execute the plan
- ... against source data which is *streamed* from those URLs rather than pulled fully into local memory
The obstacles include:
- The API for constructing a data stream from the custom URLs is only available in C++
- The Python {{run_query}} function requires tables as input and cannot accept a RecordBatchReader, even if one could be constructed from a custom URL
- Writing custom Cython is not preferred
Some potential solutions:
- Make {{ExecuteSerializedPlan()}} directly usable from C++ so that construction of data sources need not be handled in Python; passing a buffer from Python/Ibis down to C++ is much simpler and can be navigated without writing Cython
- Refactor NamedTableProvider from a lambda mapping {{names -> data source}} into a registry, so that data source factories can be added from C++ and then referenced by name from Python
- Extend {{run_query}} to support non-Table sources and require the user to write a Python mapping from URLs to {{pa.RecordBatchReader}}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18064) Error of wrong number of rows read from file
Blake erickson created ARROW-18064: -- Summary: Error of wrong number of rows read from file Key: ARROW-18064 URL: https://issues.apache.org/jira/browse/ARROW-18064 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 9.0.0, 8.0.1, 8.0.0, 7.0.1, 7.0.0 Environment: Python Info 3.10.7 (tags/v3.10.7:6cc6b13, Sep 5 2022, 14:08:36) [MSC v.1933 64 bit (AMD64)] Pyarrow Info 6.0.1 Platform Info Windows-10-10.0.19042-SP0 Windows 10 10.0.19042 19042 AMD64 Reporter: Blake erickson Attachments: badplug.parquet, readBadParquet.py On versions greater than 6.0.1, reading the table fails with an error saying expected length n, got n=1 rows. The table can be read fine column by column, or with a fixed number of rows matching the metadata. It reads correctly in version 6.0.1. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18065) Change the way how arrow reads IPC buffered files
Percy Camilo Triveño Aucahuasi created ARROW-18065: -- Summary: Change the way how arrow reads IPC buffered files Key: ARROW-18065 URL: https://issues.apache.org/jira/browse/ARROW-18065 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Percy Camilo Triveño Aucahuasi Assignee: Percy Camilo Triveño Aucahuasi The [PR#14226|https://github.com/apache/arrow/pull/14226], which solves ARROW-17599, changes the logic slightly for reading Parquet buffered files through the async FileReader::RecordBatchGenerator. Those changes were initially intended for the Parquet file reader, but we need to reflect the same changes in the IPC file reader as well. The goal of this ticket is to change the IPC reader for buffered files in the same way as defined by the [PR#14226|https://github.com/apache/arrow/pull/14226]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18066) [Python][Doc] Documentation for Test Groups out of sync
Bryce Mecum created ARROW-18066: --- Summary: [Python][Doc] Documentation for Test Groups out of sync Key: ARROW-18066 URL: https://issues.apache.org/jira/browse/ARROW-18066 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 10.0.0 Reporter: Bryce Mecum In the Python documentation for [Test Groups|https://arrow.apache.org/docs/dev/developers/python.html#test-groups] it's stated:
{quote}We have many tests that are grouped together using pytest marks. Some of these are disabled by default. To enable a test group, pass --$GROUP_NAME, e.g. --parquet. [continues]
{quote}
However, I get:
{{$ pytest --parquet # from arrow/python dir}}
{{ERROR: usage: pytest [options] [file_or_dir] [file_or_dir] [...]}}
{{pytest: error: unrecognized arguments: --parquet}}
Am I interpreting the docs wrong, or has something changed since they were written? Related: `pytest --markers` only reports the full list of pytest marks when I'm in the `arrow/python/pyarrow` directory, but not in `arrow/python`. When I'm in the `arrow/python/pyarrow` subdirectory, marks work as expected (i.e. `pytest -m parquet` works). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18067) compile error in ARM64
chenqiang created ARROW-18067: - Summary: compile error in ARM64 Key: ARROW-18067 URL: https://issues.apache.org/jira/browse/ARROW-18067 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 8.0.0 Environment: Kunpeng 920 CentOS 7 Reporter: chenqiang Attachments: image-2022-10-15-11-08-33-142.png Compile error on ARM64: the build failed to verify the third-party Boost tarball's SHA256. The Boost version is boost_1_75_0.tar.gz, and the download address is: [https://sourceforge.net/projects/boost/files/boost/1.75.0/] !image-2022-10-15-11-08-33-142.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)