[jira] [Created] (ARROW-14357) [C++] Improve array size estimation to account for shared buffers
Weston Pace created ARROW-14357:

Summary: [C++] Improve array size estimation to account for shared buffers
Key: ARROW-14357
URL: https://issues.apache.org/jira/browse/ARROW-14357
Project: Apache Arrow
Issue Type: Sub-task
Components: C++
Reporter: Weston Pace

Overlapping buffers could be detected by keeping a sorted list of ranges and then detecting and subtracting the overlaps. This could provide a more accurate size estimate when tables or record batches share the same buffers.

This should be controlled by an option, because sometimes it is important to know how much space in memory a table occupies, and sometimes it is more important to know how much data a table represents (e.g. the amount of CPU work necessary to process a table is going to depend on the latter).

--
This message was sent by Atlassian Jira (v8.3.4#803005)
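For illustration only, the sorted-range idea from ARROW-14357 could look roughly like the sketch below. It is plain Python rather than Arrow C++, and every name in it (`merge_ranges`, `estimate_size`, the `deduplicate` option, the `(start, size)` tuples standing in for buffer address ranges) is a hypothetical placeholder, not an existing Arrow API.

```python
from typing import Iterable, List, Tuple

# Hypothetical stand-in for a buffer: (start_address, size_in_bytes).
Range = Tuple[int, int]


def merge_ranges(ranges: Iterable[Range]) -> List[Tuple[int, int]]:
    """Merge overlapping or touching ranges into disjoint [start, end) intervals."""
    merged: List[List[int]] = []
    for start, size in sorted(ranges):
        end = start + size
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def estimate_size(ranges: Iterable[Range], deduplicate: bool = True) -> int:
    """Return total bytes, counting shared bytes once when deduplicate is True."""
    ranges = list(ranges)
    if not deduplicate:
        # "How much data is represented": every buffer counts in full.
        return sum(size for _, size in ranges)
    # "How much memory is occupied": overlapping bytes are counted once.
    return sum(end - start for start, end in merge_ranges(ranges))
```

The `deduplicate` flag mirrors the option discussed in the issue: with it off, the result is the plain sum of buffer sizes; with it on, bytes shared between buffers are only counted once.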
[jira] [Created] (ARROW-14356) [C++] Improve array size estimation to account for offsets
Weston Pace created ARROW-14356:

Summary: [C++] Improve array size estimation to account for offsets
Key: ARROW-14356
URL: https://issues.apache.org/jira/browse/ARROW-14356
Project: Apache Arrow
Issue Type: Sub-task
Components: C++
Reporter: Weston Pace

It is difficult to calculate the size (in bytes) of an array that has offsets: offsets are expressed as a number of values, and there is no type-erased way to know how many bytes each value occupies. This could be handled somewhat manually with a visitor.
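A rough sketch of the per-type handling ARROW-14356 has in mind, written as plain Python rather than as an Arrow C++ visitor. `ArraySlice`, `FIXED_WIDTH_BYTES` and `sliced_data_bytes` are hypothetical names invented for this example. It assumes the standard Arrow layouts (fixed-width values, and 32-bit offsets with `length + 1` entries for strings) and ignores validity bitmaps for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ArraySlice:
    """Hypothetical description of a sliced array; not an Arrow class."""
    type_name: str
    offset: int  # slice start, counted in values
    length: int  # slice length, counted in values
    value_offsets: Optional[List[int]] = None  # only for variable-width types


# Byte width per value for a few fixed-width types.
FIXED_WIDTH_BYTES = {"int8": 1, "int32": 4, "int64": 8, "float64": 8}


def sliced_data_bytes(arr: ArraySlice) -> int:
    """Bytes actually referenced by the slice, dispatching on the type."""
    if arr.type_name in FIXED_WIDTH_BYTES:
        # Fixed width: number of values times bytes per value.
        return arr.length * FIXED_WIDTH_BYTES[arr.type_name]
    if arr.type_name == "string":
        offsets = arr.value_offsets
        if offsets is None:
            raise ValueError("string arrays need value_offsets")
        # Character data covered by the slice ...
        data = offsets[arr.offset + arr.length] - offsets[arr.offset]
        # ... plus the length + 1 int32 offsets that describe it.
        return data + (arr.length + 1) * 4
    raise NotImplementedError(f"no size rule for type {arr.type_name!r}")
```

Each branch plays the role of one visitor method: the type decides how a count of values is turned into a count of bytes.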
[jira] [Created] (ARROW-14355) [C++] Create naive implementation of algorithm to estimate table/batch buffer size
Weston Pace created ARROW-14355:

Summary: [C++] Create naive implementation of algorithm to estimate table/batch buffer size
Key: ARROW-14355
URL: https://issues.apache.org/jira/browse/ARROW-14355
Project: Apache Arrow
Issue Type: Sub-task
Components: C++
Reporter: Weston Pace

This will simply sum up all of the buffers. It will overestimate in a few cases:

* If there are offsets it will overestimate
* If there are shared buffers it will overestimate

It only measures the size of the buffers and will not consider the control data (e.g. the C objects wrapping the data) or, specifically for ExecBatch, it will not count scalars.
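As a minimal sketch of the naive estimate described in ARROW-14355, assume a table is modelled as columns, each holding chunks, each holding a list of buffer sizes in bytes (with `None` for an absent buffer such as a missing validity bitmap). That nested-list model and the function name `naive_buffer_size` are assumptions made for this example, not Arrow APIs.

```python
from typing import Optional, Sequence

# Hypothetical model: table -> columns -> chunks -> buffer sizes in bytes.
Chunk = Sequence[Optional[int]]
Column = Sequence[Chunk]


def naive_buffer_size(columns: Sequence[Column]) -> int:
    """Sum every buffer size, ignoring offsets and sharing between buffers."""
    total = 0
    for chunks in columns:
        for buffers in chunks:
            # Absent buffers (None) contribute nothing.
            total += sum(size for size in buffers if size is not None)
    return total
```

Because each buffer is counted in full every time it appears, a sliced array or a buffer referenced by two columns inflates the result, which is exactly the overestimation listed above.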
[jira] [Created] (ARROW-14354) [C++] Investigate reducing I/O thread pool size to avoid CPU wastage.
Weston Pace created ARROW-14354:

Summary: [C++] Investigate reducing I/O thread pool size to avoid CPU wastage.
Key: ARROW-14354
URL: https://issues.apache.org/jira/browse/ARROW-14354
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Weston Pace

If we are reading over HTTP (e.g. S3) we generally want high parallelism in the I/O thread pool.

If we are reading from disk then high parallelism is usually harmless but ineffective. Most of the I/O threads will spend their time in a waiting state and the cores can be used for other work.

However, it appears that when we are reading locally, and the data is cached in memory, then having too much parallelism will be harmful, but some parallelism is beneficial. Once the DRAM <-> CPU bandwidth limit is hit then all reading threads will experience high DRAM latency. Unlike an I/O bottleneck, a RAM bottleneck will waste cycles on the physical core.
[jira] [Created] (ARROW-14353) [CI][C++] Update linting from Clang 8
Benson Muite created ARROW-14353:

Summary: [CI][C++] Update linting from Clang 8
Key: ARROW-14353
URL: https://issues.apache.org/jira/browse/ARROW-14353
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Continuous Integration
Reporter: Benson Muite
Assignee: Benson Muite

Update linting from Clang 8, which was released in 2019; the current version is Clang 13.
[jira] [Created] (ARROW-14352) [IR] Remove schema property from Source
Phillip Cloud created ARROW-14352:

Summary: [IR] Remove schema property from Source
Key: ARROW-14352
URL: https://issues.apache.org/jira/browse/ARROW-14352
Project: Apache Arrow
Issue Type: Task
Components: Compute IR
Affects Versions: 6.0.0
Reporter: Phillip Cloud
Assignee: Phillip Cloud

The {{schema}} field of {{Source}} isn't being used by any producer (ibis, duckdb) or consumer (arrow C++, duckdb). It's not clear that it's useful, so let's consider removing it.
[jira] [Created] (ARROW-14351) [IR] Add projection list to Source node
Phillip Cloud created ARROW-14351:

Summary: [IR] Add projection list to Source node
Key: ARROW-14351
URL: https://issues.apache.org/jira/browse/ARROW-14351
Project: Apache Arrow
Issue Type: Improvement
Components: Compute IR
Affects Versions: 6.0.0
Reporter: Phillip Cloud
Assignee: Phillip Cloud
Fix For: 7.0.0

{{Source}} should store a list of columns to read, so that consumers can prune columns and push projections all the way down to the source.
[jira] [Created] (ARROW-14350) [IR] Add filter expression to Source node
Phillip Cloud created ARROW-14350:

Summary: [IR] Add filter expression to Source node
Key: ARROW-14350
URL: https://issues.apache.org/jira/browse/ARROW-14350
Project: Apache Arrow
Issue Type: Improvement
Components: Compute IR
Affects Versions: 6.0.0
Reporter: Phillip Cloud
Assignee: Phillip Cloud
Fix For: 7.0.0

Add an optional filter expression to {{Source}} nodes to allow consumers that push predicates down to push them all the way to the source.
[jira] [Created] (ARROW-14349) [IR] Remove RelBase
Phillip Cloud created ARROW-14349:

Summary: [IR] Remove RelBase
Key: ARROW-14349
URL: https://issues.apache.org/jira/browse/ARROW-14349
Project: Apache Arrow
Issue Type: Bug
Components: Compute IR
Affects Versions: 6.0.0
Reporter: Phillip Cloud
Assignee: Phillip Cloud
Fix For: 7.0.0

Based on conversations with the folks at DuckDB working on this PR (https://github.com/duckdb/duckdb/pull/2331) and on our own consumer implementation, {{RelBase}} isn't very useful. We should remove it.
[jira] [Created] (ARROW-14348) [R] add group_vars.RecordBatchReader method
Jonathan Keane created ARROW-14348:

Summary: [R] add group_vars.RecordBatchReader method
Key: ARROW-14348
URL: https://issues.apache.org/jira/browse/ARROW-14348
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Jonathan Keane
Assignee: Jonathan Keane

https://github.com/apache/arrow/pull/11032/commits/fbe6e884fa3457e9d20e93137688b85346fa86df added a hack to get around the lack of this method. Instead, we should add a method that returns {{NULL}}.
[jira] [Created] (ARROW-14347) [C++] Implement "random access" reads for GCS FileSystem
Carlos O'Ryan created ARROW-14347:

Summary: [C++] Implement "random access" reads for GCS FileSystem
Key: ARROW-14347
URL: https://issues.apache.org/jira/browse/ARROW-14347
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Carlos O'Ryan

Implement the {{GcsFileSystem::OpenInputFile()}} overloads and tests for them.
[jira] [Created] (ARROW-14346) [C++] Implement streaming writes for GCS FileSystem
Carlos O'Ryan created ARROW-14346:

Summary: [C++] Implement streaming writes for GCS FileSystem
Key: ARROW-14346
URL: https://issues.apache.org/jira/browse/ARROW-14346
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Carlos O'Ryan

Implement the {{GcsFileSystem::OpenOutputStream}} function and tests for it.
[jira] [Created] (ARROW-14345) [C++] Implement streaming reads for GCS FileSystem
Carlos O'Ryan created ARROW-14345:

Summary: [C++] Implement streaming reads for GCS FileSystem
Key: ARROW-14345
URL: https://issues.apache.org/jira/browse/ARROW-14345
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Carlos O'Ryan

Implement the {{GcsFileSystem::OpenInputStream()}} functions and tests for them.
[jira] [Created] (ARROW-14344) Crash when reading empty .feather file
Reinier van Linschoten created ARROW-14344:

Summary: Crash when reading empty .feather file
Key: ARROW-14344
URL: https://issues.apache.org/jira/browse/ARROW-14344
Project: Apache Arrow
Issue Type: Bug
Components: Python, R
Affects Versions: 5.0.0
Environment: Ubuntu Server 20.04.3, arrow (R) 5.0.02, pyarrow 3.0.0 (Python), RStudio 1.4.1717, R 4.1.0
Reporter: Reinier van Linschoten

I get an R Session Error in RStudio Server when I try to read an empty .feather file.

Error: The previous R session was abnormally terminated due to an unexpected crash. You may have lost workspace data as a result of this crash.

Reproduce:

* Create empty pandas dataframe in Python
* Write to .feather file with .reset_index(drop=True) and compression="uncompressed"
* Try to read data in RStudio with arrow::read_feather(path)
* Error

I can read dataframes with one or more rows in RStudio. I can read the empty dataframe with pandas.read_feather(). This returns an empty pandas dataframe.
[jira] [Created] (ARROW-14343) [Packaging][Python] Enable NEON SIMD optimization for M1 wheels
Krisztian Szucs created ARROW-14343:

Summary: [Packaging][Python] Enable NEON SIMD optimization for M1 wheels
Key: ARROW-14343
URL: https://issues.apache.org/jira/browse/ARROW-14343
Project: Apache Arrow
Issue Type: New Feature
Components: Packaging, Python
Reporter: Krisztian Szucs
Fix For: 6.0.0
[jira] [Created] (ARROW-14342) Add support for the SSO credential provider
Björn Boschman created ARROW-14342:

Summary: Add support for the SSO credential provider
Key: ARROW-14342
URL: https://issues.apache.org/jira/browse/ARROW-14342
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Affects Versions: 5.0.0, 3.0.0
Reporter: Björn Boschman

see also: [https://github.com/boto/botocore/pull/2070]

{code:java}
from pyarrow.fs import S3FileSystem

bucket = 'some-bucket-with-read-access'
key = 'some-existing-key'

s3 = S3FileSystem()
s3.open_input_file(f'{bucket}/{key}')
{code}

results in

{code:java}
Traceback (most recent call last):
  File "test.py", line 7, in <module>
    s3.open_input_file(f'{bucket}/{key}')
  File "pyarrow/_fs.pyx", line 587, in pyarrow._fs.FileSystem.open_input_file
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
OSError: When reading information for key 'some-existing-key' in bucket 'some-bucket-with-read-access': AWS Error [code 15]: No response body.
{code}

Without SSO credentials being supported, shouldn't it raise some kind of AccessDenied exception?
[jira] [Created] (ARROW-14341) [C++] Refine decimal benchmark
Yibo Cai created ARROW-14341:

Summary: [C++] Refine decimal benchmark
Key: ARROW-14341
URL: https://issues.apache.org/jira/browse/ARROW-14341
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Yibo Cai
Assignee: Yibo Cai

The decimal benchmark mixes {{+-*/}} operations in one test loop[1]. Division always dominates the result: it is ~6x slower than multiplication, let alone addition. It's better to test division, multiplication, and addition/subtraction separately to get more reasonable results.

[1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/decimal_benchmark.cc#L141-L145
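To illustrate the idea of timing each operation on its own, here is a small sketch using Python's standard `decimal` and `timeit` modules. It is not the Arrow C++ benchmark and does not use Arrow's decimal types; `OPERATIONS` and `benchmark_each` are hypothetical names, and the timings it produces say nothing about Arrow itself.

```python
import operator
import timeit
from decimal import Decimal
from typing import Dict

# One entry per operation so that each gets its own measurement.
OPERATIONS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
}


def benchmark_each(lhs: Decimal, rhs: Decimal, number: int = 10_000) -> Dict[str, float]:
    """Time each operation in its own loop; returns seconds per operation name."""
    results = {}
    for name, op in OPERATIONS.items():
        # A separate timing loop per operation keeps a slow operation
        # (typically division) from dominating the others.
        results[name] = timeit.timeit(lambda: op(lhs, rhs), number=number)
    return results
```

Calling `benchmark_each(Decimal("123456.789"), Decimal("3.21"))` yields one timing per operation, so each can be compared against its own history instead of being hidden inside a mixed loop.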
[jira] [Created] (ARROW-14340) [C++] Fix xsimd build error on apple m1
Yibo Cai created ARROW-14340:

Summary: [C++] Fix xsimd build error on apple m1
Key: ARROW-14340
URL: https://issues.apache.org/jira/browse/ARROW-14340
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Yibo Cai
Assignee: Yibo Cai

Related fixes have been merged in xsimd. Bumping xsimd to the latest version should fix the error.

https://github.com/xtensor-stack/xsimd/issues/597
[jira] [Created] (ARROW-14339) [Docs] Add canonical url to the pkgdown (R) docs
Nicola Crane created ARROW-14339:

Summary: [Docs] Add canonical url to the pkgdown (R) docs
Key: ARROW-14339
URL: https://issues.apache.org/jira/browse/ARROW-14339
Project: Apache Arrow
Issue Type: Sub-task
Reporter: Nicola Crane
[jira] [Created] (ARROW-14338) [Docs] Add version dropdown to the pkgdown (R) docs
Nicola Crane created ARROW-14338:

Summary: [Docs] Add version dropdown to the pkgdown (R) docs
Key: ARROW-14338
URL: https://issues.apache.org/jira/browse/ARROW-14338
Project: Apache Arrow
Issue Type: Sub-task
Reporter: Nicola Crane
[jira] [Created] (ARROW-14337) Arrow doesn't build on M1 when SIMD acceleration is enabled
Alessandro Molina created ARROW-14337:

Summary: Arrow doesn't build on M1 when SIMD acceleration is enabled
Key: ARROW-14337
URL: https://issues.apache.org/jira/browse/ARROW-14337
Project: Apache Arrow
Issue Type: Improvement
Affects Versions: 6.0.0
Reporter: Alessandro Molina
Assignee: Krisztian Szucs
Fix For: 7.0.0

There is a build error in C++ that seems related to XSIMD. An issue was opened on XSIMD ( [https://github.com/xtensor-stack/xsimd/issues/597] ) which now looks resolved.

It's necessary to test if Arrow now builds with the new XSIMD release.