[jira] [Created] (ARROW-14357) [C++] Improve array size estimation to account for shared buffers

2021-10-15 Thread Weston Pace (Jira)
Weston Pace created ARROW-14357:
---

 Summary: [C++] Improve array size estimation to account for shared 
buffers
 Key: ARROW-14357
 URL: https://issues.apache.org/jira/browse/ARROW-14357
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Weston Pace


Overlapping buffers could be detected using some kind of sorted list of ranges 
and then detecting and subtracting overlaps.  This could provide a more 
accurate size estimation when tables or record batches share the same buffers.

This should be controlled by an option, as sometimes it is important to know 
how much space in memory a table is occupying and sometimes it is more 
important to know how much data a table represents (e.g. the amount of CPU 
work necessary to process a table is going to depend on the latter).
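
A minimal sketch of the sorted-ranges idea, written in Python rather than C++ 
for brevity. Each buffer is modelled as a hypothetical (start address, size) 
pair; the function names are illustrative and not part of Arrow.

```python
from typing import Iterable, Tuple

# Hypothetical model: a buffer is a (start_address, size_in_bytes) pair.
ByteRange = Tuple[int, int]


def naive_bytes(ranges: Iterable[ByteRange]) -> int:
    """Sum every range in full, counting shared bytes more than once."""
    return sum(size for _, size in ranges)


def unique_bytes(ranges: Iterable[ByteRange]) -> int:
    """Count bytes covered by at least one range, counting overlaps once."""
    total = 0
    current_end = None  # exclusive end of the merged run built so far
    for start, size in sorted(ranges):
        end = start + size
        if current_end is None or start >= current_end:
            # Disjoint from everything seen so far: count it in full.
            total += size
            current_end = end
        elif end > current_end:
            # Partial overlap: count only the part past the merged run.
            total += end - current_end
            current_end = end
        # Otherwise the range is fully contained and adds nothing.
    return total


# Two partially overlapping buffers plus one buffer referenced twice.
buffers = [(0, 100), (50, 100), (300, 40), (300, 40)]
print(naive_bytes(buffers), unique_bytes(buffers))
```

An option of the kind described above could simply select between the two 
functions, depending on whether the caller wants memory occupied or data 
represented.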



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-14356) [C++] Improve array size estimation to account for offsets

2021-10-15 Thread Weston Pace (Jira)
Weston Pace created ARROW-14356:
---

 Summary: [C++] Improve array size estimation to account for offsets
 Key: ARROW-14356
 URL: https://issues.apache.org/jira/browse/ARROW-14356
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Weston Pace


It is difficult to calculate the size (in bytes) of an array that has offsets. 
Because offsets are expressed as a number of values, there is no type-erased 
way to know how many bytes each value occupies.

This could be handled somewhat manually with a visitor.
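
A rough sketch of what such a visitor could look like, in Python rather than 
C++. The two classes are hypothetical stand-ins for sliced arrays, not Arrow 
types, and validity bitmaps are ignored; 32-bit string offsets are assumed.

```python
from dataclasses import dataclass
from functools import singledispatch
from typing import List


@dataclass
class FixedWidthSlice:
    """Hypothetical stand-in for a sliced fixed-width array (e.g. int32)."""
    byte_width: int  # bytes per value
    offset: int      # index of the first value visible through the slice
    length: int      # number of values visible through the slice


@dataclass
class StringSlice:
    """Hypothetical stand-in for a sliced variable-width string array."""
    value_offsets: List[int]  # positions in the data buffer, one per value plus one
    offset: int
    length: int


@singledispatch
def referenced_bytes(array) -> int:
    """Per-type dispatch playing the role of the visitor."""
    raise TypeError(f"unsupported array type: {type(array).__name__}")


@referenced_bytes.register
def _(array: FixedWidthSlice) -> int:
    # The byte width is known per type, so the slice length is enough.
    return array.length * array.byte_width


@referenced_bytes.register
def _(array: StringSlice) -> int:
    # Data bytes come from the offsets that bound the slice.
    first = array.value_offsets[array.offset]
    last = array.value_offsets[array.offset + array.length]
    offsets_bytes = (array.length + 1) * 4  # assumes 32-bit offsets
    return (last - first) + offsets_bytes


ints = FixedWidthSlice(byte_width=4, offset=10, length=5)
strings = StringSlice(value_offsets=[0, 3, 3, 8, 12], offset=1, length=2)
print(referenced_bytes(ints), referenced_bytes(strings))
```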





[jira] [Created] (ARROW-14355) [C++] Create naive implementation of algorithm to estimate table/batch buffer size

2021-10-15 Thread Weston Pace (Jira)
Weston Pace created ARROW-14355:
---

 Summary: [C++] Create naive implementation of algorithm to 
estimate table/batch buffer size
 Key: ARROW-14355
 URL: https://issues.apache.org/jira/browse/ARROW-14355
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Weston Pace


This will simply sum up all of the buffers.  It will overestimate in a few 
cases:

 * If there are offsets it will overestimate
 * If there are shared buffers it will overestimate

It only measures the size of the buffers and will not consider the control data 
(e.g. the C objects wrapping the data) or, specifically for ExecBatch, it will 
not count scalars.
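
As an illustration only, the naive approach could be sketched as follows in 
Python. The batch model (column name mapped to a list of optional buffers) is 
an assumption made for the example and does not mirror Arrow's classes.

```python
from typing import Dict, List, Optional

# Hypothetical model: each column owns a list of buffers (validity bitmap,
# data, ...); a buffer that is absent is represented by None.
Batch = Dict[str, List[Optional[bytes]]]


def naive_buffer_size(batch: Batch) -> int:
    """Sum the size of every buffer, ignoring offsets and sharing."""
    return sum(
        len(buffer)
        for buffers in batch.values()
        for buffer in buffers
        if buffer is not None
    )


data = bytes(1024)
# Both columns reference the same 1024-byte buffer.
batch = {"a": [None, data], "b": [None, data]}
print(naive_buffer_size(batch))
```

Here the shared buffer is counted twice, which is the overestimate described 
above.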





[jira] [Created] (ARROW-14354) [C++] Investigate reducing I/O thread pool size to avoid CPU wastage.

2021-10-15 Thread Weston Pace (Jira)
Weston Pace created ARROW-14354:
---

 Summary: [C++] Investigate reducing I/O thread pool size to avoid 
CPU wastage.
 Key: ARROW-14354
 URL: https://issues.apache.org/jira/browse/ARROW-14354
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace


If we are reading over HTTP (e.g. S3) we generally want high parallelism in the 
I/O thread pool.

If we are reading from disk then high parallelism is usually harmless but 
ineffective.  Most of the I/O threads will spend their time in a waiting state 
and the cores can be used for other work.

However, it appears that when we are reading locally and the data is cached in 
memory, some parallelism is beneficial but too much is harmful.  Once the 
DRAM <-> CPU bandwidth limit is hit, all reading threads will experience high 
DRAM latency.  Unlike an I/O bottleneck, a RAM bottleneck will waste cycles on 
the physical core.






[jira] [Created] (ARROW-14353) [CI][C++] Update linting from Clang 8

2021-10-15 Thread Benson Muite (Jira)
Benson Muite created ARROW-14353:


 Summary: [CI][C++] Update linting from Clang 8
 Key: ARROW-14353
 URL: https://issues.apache.org/jira/browse/ARROW-14353
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Continuous Integration
Reporter: Benson Muite
Assignee: Benson Muite


Update linting from Clang 8, which was released in 2019; the current version 
is Clang 13.





[jira] [Created] (ARROW-14352) [IR] Remove schema property from Source

2021-10-15 Thread Phillip Cloud (Jira)
Phillip Cloud created ARROW-14352:
-

 Summary: [IR] Remove schema property from Source
 Key: ARROW-14352
 URL: https://issues.apache.org/jira/browse/ARROW-14352
 Project: Apache Arrow
  Issue Type: Task
  Components: Compute IR
Affects Versions: 6.0.0
Reporter: Phillip Cloud
Assignee: Phillip Cloud


The {{schema}} field of {{Source}} isn't being used by any producer (ibis, 
duckdb) or consumer (arrow C++, duckdb). It's not clear that it's useful, so 
let's consider removing it.





[jira] [Created] (ARROW-14351) [IR] Add projection list to Source node

2021-10-15 Thread Phillip Cloud (Jira)
Phillip Cloud created ARROW-14351:
-

 Summary: [IR] Add projection list to Source node
 Key: ARROW-14351
 URL: https://issues.apache.org/jira/browse/ARROW-14351
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Compute IR
Affects Versions: 6.0.0
Reporter: Phillip Cloud
Assignee: Phillip Cloud
 Fix For: 7.0.0


{{Source}} should store a list of columns to read, so that consumers can prune 
columns and push projections all the way down to the source.
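
A toy illustration of the idea in Python. The {{Source}} class below is a 
hypothetical in-memory stand-in, not the actual IR definition, and the 
{{projection}} field name is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

Table = Dict[str, List[int]]  # column name -> values


@dataclass
class Source:
    """Hypothetical stand-in for an IR Source node."""
    name: str
    projection: Optional[List[str]] = None  # None means "read every column"


def scan(source: Source, catalog: Dict[str, Table]) -> Table:
    """Read a table, touching only the columns the source asks for."""
    table = catalog[source.name]
    if source.projection is None:
        return dict(table)
    return {column: table[column] for column in source.projection}


catalog = {"t": {"a": [1, 2], "b": [3, 4], "c": [5, 6]}}
pruned = scan(Source("t", projection=["a", "c"]), catalog)
print(pruned)
```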





[jira] [Created] (ARROW-14350) [IR] Add filter expression to Source node

2021-10-15 Thread Phillip Cloud (Jira)
Phillip Cloud created ARROW-14350:
-

 Summary: [IR] Add filter expression to Source node
 Key: ARROW-14350
 URL: https://issues.apache.org/jira/browse/ARROW-14350
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Compute IR
Affects Versions: 6.0.0
Reporter: Phillip Cloud
Assignee: Phillip Cloud
 Fix For: 7.0.0


Add an optional filter expression to {{Source}} nodes to allow consumers that 
push predicates down to push them all the way to the source.
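
A toy illustration in Python. The {{Source}} class is a hypothetical stand-in, 
and the filter is modelled as a plain callable, whereas the IR would carry an 
expression.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Row = Dict[str, int]


@dataclass
class Source:
    """Hypothetical stand-in for an IR Source node."""
    name: str
    filter: Optional[Callable[[Row], bool]] = None  # None means "keep every row"


def scan(source: Source, catalog: Dict[str, List[Row]]) -> List[Row]:
    """Read rows, applying the optional filter while reading."""
    rows = catalog[source.name]
    if source.filter is None:
        return list(rows)
    return [row for row in rows if source.filter(row)]


catalog = {"t": [{"x": 1}, {"x": 5}, {"x": 9}]}
kept = scan(Source("t", filter=lambda row: row["x"] > 3), catalog)
print(kept)
```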





[jira] [Created] (ARROW-14349) [IR] Remove RelBase

2021-10-15 Thread Phillip Cloud (Jira)
Phillip Cloud created ARROW-14349:
-

 Summary: [IR] Remove RelBase
 Key: ARROW-14349
 URL: https://issues.apache.org/jira/browse/ARROW-14349
 Project: Apache Arrow
  Issue Type: Bug
  Components: Compute IR
Affects Versions: 6.0.0
Reporter: Phillip Cloud
Assignee: Phillip Cloud
 Fix For: 7.0.0


Based on conversations with the folks at DuckDB working on this PR 
(https://github.com/duckdb/duckdb/pull/2331) and on our own consumer 
implementation, {{RelBase}} isn't very useful. We should remove it.





[jira] [Created] (ARROW-14348) [R] add group_vars.RecordBatchReader method

2021-10-15 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-14348:
--

 Summary: [R] add group_vars.RecordBatchReader method
 Key: ARROW-14348
 URL: https://issues.apache.org/jira/browse/ARROW-14348
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Jonathan Keane
Assignee: Jonathan Keane


https://github.com/apache/arrow/pull/11032/commits/fbe6e884fa3457e9d20e93137688b85346fa86df
 

A hack was added to get around the lack of this method. Instead, we should add 
a method that returns {{NULL}}.





[jira] [Created] (ARROW-14347) [C++] Implement "random access" reads for GCS FileSystem

2021-10-15 Thread Carlos O'Ryan (Jira)
Carlos O'Ryan created ARROW-14347:
-

 Summary: [C++] Implement "random access" reads for GCS FileSystem
 Key: ARROW-14347
 URL: https://issues.apache.org/jira/browse/ARROW-14347
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Carlos O'Ryan


Implement the {{GcsFileSystem::OpenInputFile()}} overloads and tests for them.





[jira] [Created] (ARROW-14346) [C++] Implement streaming writes for GCS FileSystem

2021-10-15 Thread Carlos O'Ryan (Jira)
Carlos O'Ryan created ARROW-14346:
-

 Summary: [C++] Implement streaming writes for GCS FileSystem
 Key: ARROW-14346
 URL: https://issues.apache.org/jira/browse/ARROW-14346
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Carlos O'Ryan


Implement the {{GcsFileSystem::OpenOutputStream}} function and tests for it.





[jira] [Created] (ARROW-14345) [C++] Implement streaming reads for GCS FileSystem

2021-10-15 Thread Carlos O'Ryan (Jira)
Carlos O'Ryan created ARROW-14345:
-

 Summary: [C++] Implement streaming reads for GCS FileSystem
 Key: ARROW-14345
 URL: https://issues.apache.org/jira/browse/ARROW-14345
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Carlos O'Ryan


Implement the {{GcsFileSystem::OpenInputStream()}} functions and tests for them.






[jira] [Created] (ARROW-14344) Crash when reading empty .feather file

2021-10-15 Thread Reinier van Linschoten (Jira)
Reinier van Linschoten created ARROW-14344:
--

 Summary: Crash when reading empty .feather file
 Key: ARROW-14344
 URL: https://issues.apache.org/jira/browse/ARROW-14344
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python, R
Affects Versions: 5.0.0
 Environment: Ubuntu Server 20.04.3, arrow (R) 5.0.02, pyarrow 3.0.0 
(Python), RStudio 1.4.1717, R 4.1.0
Reporter: Reinier van Linschoten


I get an R Session Error in RStudio Server when I try to read an empty .feather 
file.

Error: The previous R session was abnormally terminated due to an unexpected 
crash. You may have lost workspace data as a result of this crash. 

Reproduce:
 * Create empty pandas dataframe in Python
 * Write to .feather file with .reset_index(drop=True) and 
compression="uncompressed"
 * Try to read data in RStudio with arrow::read_feather(path)
 * Error

I can read dataframes with one or more rows in RStudio.

I can read the empty dataframe with pandas.read_feather(). This returns an 
empty pandas dataframe.





[jira] [Created] (ARROW-14343) [Packaging][Python] Enable NEON SIMD optimization for M1 wheels

2021-10-15 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-14343:
---

 Summary: [Packaging][Python] Enable NEON SIMD optimization for M1 
wheels
 Key: ARROW-14343
 URL: https://issues.apache.org/jira/browse/ARROW-14343
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Packaging, Python
Reporter: Krisztian Szucs
 Fix For: 6.0.0








[jira] [Created] (ARROW-14342) Add support for the SSO credential provider

2021-10-15 Thread Björn Boschman (Jira)
Björn Boschman created ARROW-14342:
--

 Summary: Add support for the SSO credential provider
 Key: ARROW-14342
 URL: https://issues.apache.org/jira/browse/ARROW-14342
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 5.0.0, 3.0.0
Reporter: Björn Boschman


 
see also: [https://github.com/boto/botocore/pull/2070]
{code:python}
from pyarrow.fs import S3FileSystem

bucket = 'some-bucket-with-read-access'
key = 'some-existing-key'

s3 = S3FileSystem()
s3.open_input_file(f'{bucket}/{key}')
{code}
 
results in

 
{code:java}
Traceback (most recent call last):
  File "test.py", line 7, in <module>
    s3.open_input_file(f'{bucket}/{key}')
  File "pyarrow/_fs.pyx", line 587, in pyarrow._fs.FileSystem.open_input_file
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
OSError: When reading information for key 'some-existing-key' in bucket 'some-bucket-with-read-access': AWS Error [code 15]: No response body.
{code}
 

If SSO credentials are not supported, shouldn't this raise some kind of 
AccessDenied exception?

 

 





[jira] [Created] (ARROW-14341) [C++] Refine decimal benchmark

2021-10-15 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-14341:


 Summary: [C++] Refine decimal benchmark
 Key: ARROW-14341
 URL: https://issues.apache.org/jira/browse/ARROW-14341
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yibo Cai
Assignee: Yibo Cai


The decimal benchmark mixes {{+-*/}} operations in one test loop[1]. Division 
always dominates the result: it is ~6x slower than multiplication, let alone 
addition. It would be better to benchmark division, multiplication, and 
addition/subtraction separately to get more meaningful results.
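
To illustrate the separation only, here is a small Python sketch that times 
each operation in its own loop. It uses the standard library {{decimal}} and 
{{timeit}} modules, not Arrow's decimal types or its C++ benchmark harness, 
and the function name is hypothetical.

```python
import operator
import timeit
from decimal import Decimal

# One entry per operation so that each gets its own measurement.
OPERATIONS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
}


def benchmark_operations(lhs, rhs, number=10_000):
    """Time each operation separately; return seconds keyed by operation name."""
    return {
        name: timeit.timeit(lambda op=op: op(lhs, rhs), number=number)
        for name, op in OPERATIONS.items()
    }


timings = benchmark_operations(Decimal("12345.6789"), Decimal("98.7654"))
for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f}s")
```

Reporting one number per operation keeps a slow operation from hiding changes 
in the faster ones.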

[1] 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/decimal_benchmark.cc#L141-L145





[jira] [Created] (ARROW-14340) [C++] Fix xsimd build error on apple m1

2021-10-15 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-14340:


 Summary: [C++] Fix xsimd build error on apple m1
 Key: ARROW-14340
 URL: https://issues.apache.org/jira/browse/ARROW-14340
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yibo Cai
Assignee: Yibo Cai


Related fixes have been merged in xsimd. Bumping xsimd to the latest version 
should fix the error.
https://github.com/xtensor-stack/xsimd/issues/597





[jira] [Created] (ARROW-14339) [Docs] Add canonical url to the pkgdown (R) docs

2021-10-15 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-14339:


 Summary: [Docs] Add canonical url to the pkgdown (R) docs
 Key: ARROW-14339
 URL: https://issues.apache.org/jira/browse/ARROW-14339
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Nicola Crane








[jira] [Created] (ARROW-14338) [Docs] Add version dropdown to the pkgdown (R) docs

2021-10-15 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-14338:


 Summary: [Docs] Add version dropdown to the pkgdown (R) docs
 Key: ARROW-14338
 URL: https://issues.apache.org/jira/browse/ARROW-14338
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Nicola Crane








[jira] [Created] (ARROW-14337) Arrow doesn't build on M1 when SIMD acceleration is enabled

2021-10-15 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-14337:
-

 Summary: Arrow doesn't build on M1 when SIMD acceleration is 
enabled
 Key: ARROW-14337
 URL: https://issues.apache.org/jira/browse/ARROW-14337
 Project: Apache Arrow
  Issue Type: Improvement
Affects Versions: 6.0.0
Reporter: Alessandro Molina
Assignee: Krisztian Szucs
 Fix For: 7.0.0


There is a build error in C++ that seems related to XSIMD.

An issue was opened on XSIMD 
([https://github.com/xtensor-stack/xsimd/issues/597]), which now looks 
resolved. We need to test whether Arrow now builds with the new XSIMD release.


