[jira] [Assigned] (ARROW-8201) [Python][Dataset] Improve ergonomics of FileFragment

2022-09-02 Thread Miles Granger (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miles Granger reassigned ARROW-8201:


Assignee: Miles Granger

> [Python][Dataset] Improve ergonomics of FileFragment
> 
>
> Key: ARROW-8201
> URL: https://issues.apache.org/jira/browse/ARROW-8201
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Miles Granger
>Priority: Major
>  Labels: dataset
>
> FileFragment can be made more directly useful by adding convenience methods.
> For example, a FileFragment could allow the underlying file or buffer to be
> opened directly:
> {code}
> def open(self):
>     """
>     Open a NativeFile of the buffer or file viewed by this fragment.
>     """
>     cdef:
>         CFileSystem* c_filesystem
>         shared_ptr[CRandomAccessFile] opened
>         NativeFile out = NativeFile()
>     buf = self.buffer
>     if buf is not None:
>         # in-memory fragment: wrap the buffer directly (zero-copy)
>         return pa.BufferReader(buf)
>     with nogil:
>         c_filesystem = self.file_fragment.source().filesystem()
>         opened = GetResultValue(c_filesystem.OpenInputFile(
>             self.file_fragment.source().path()))
>     out.set_random_access_file(opened)
>     out.is_readable = True
>     return out
> {code}
> Additionally, a ParquetFileFragment's metadata could be made introspectable:
> {code}
> @property
> def metadata(self):
>     from pyarrow._parquet import ParquetReader
>     reader = ParquetReader()
>     reader.open(self.open())
>     return reader.metadata
> {code}
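
A minimal usage sketch for the proposed helpers (assuming the method names
above land as shown; this is not yet a released API):

{code:python}
import pyarrow.dataset as ds

dataset = ds.dataset("data/", format="parquet")
for fragment in dataset.get_fragments():
    # proposed: open the underlying file or buffer as a NativeFile
    with fragment.open() as f:
        magic = f.read(4)  # Parquet files start with b'PAR1'
    # proposed: introspect Parquet metadata without reading the data
    print(fragment.path, fragment.metadata.num_row_groups)
{code}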



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-17374) [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND

2022-09-02 Thread Vincent Nijs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599759#comment-17599759
 ] 

Vincent Nijs edited comment on ARROW-17374 at 9/3/22 2:03 AM:
--

Made some progress. I had to install `libsnappy-dev` and `libre2-dev` on Ubuntu 
22.04 and use the workaround from 
[https://github.com/duckdb/duckdb/issues/3049#issuecomment-1096671708]

Arrow still doesn't compile successfully, however; the build ends with the error 
below. The build logs are attached ("build-images.out").

 

{noformat}
#21 1786.1 ** testing if installed package can be loaded from temporary location
#21 1786.4 Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath = DLLpath, ...):
#21 1786.4  unable to load shared object '/usr/local/lib/R/site-library/00LOCK-arrow/00new/arrow/libs/arrow.so':
#21 1786.4   /usr/local/lib/R/site-library/00LOCK-arrow/00new/arrow/libs/arrow.so: undefined symbol: _ZTIN6apache6thrift8protocol9TProtocolE
{noformat}

{noformat}
# workaround for fPIC issue https://github.com/duckdb/duckdb/issues/3049#issuecomment-1096671708
cp -a /usr/local/lib/R/etc/Makeconf /usr/local/lib/R/etc/Makeconf.bak;
sed -i 's/fpic/fPIC/g' /usr/local/lib/R/etc/Makeconf;
R -e "options(HTTPUserAgent = sprintf('R/%s R (%s)', getRversion(), paste(getRversion(), R.version['platform'], R.version['arch'], R.version['os']))); Sys.setenv('ARROW_R_DEV' = TRUE); install.packages(c('duckdb', 'arrow'), repo='${CRAN}', Ncpus=${NCPUS})"
mv /usr/local/lib/R/etc/Makeconf.bak /usr/local/lib/R/etc/Makeconf;
{noformat}


was (Author: vnijs):
Made some progress. I had to install `libsnappy-dev` and `libre2-dev` on Ubuntu 
22.04 and use the workaround from 
[https://github.com/duckdb/duckdb/issues/3049#issuecomment-1096671708]

Arrow still doesn't compile successfully, however; the build ends with the error 
below. The build logs are attached ("build-images.out").

 

```

#21 1786.1 ** testing if installed package can be loaded from temporary location
#21 1786.4 Error: package or namespace load failed for ‘arrow’ in 
dyn.load(file, DLLpath = DLLpath, ...):
#21 1786.4  unable to load shared object 
'/usr/local/lib/R/site-library/00LOCK-arrow/00new/arrow/libs/arrow.so':
#21 1786.4   
/usr/local/lib/R/site-library/00LOCK-arrow/00new/arrow/libs/arrow.so: undefined 
symbol: _ZTIN6apache6thrift8protocol9TProtocolE

```

 

``` 

# workaround for fPIC issue 
[https://github.com/duckdb/duckdb/issues/3049#issuecomment-1096671708]

cp -a /usr/local/lib/R/etc/Makeconf /usr/local/lib/R/etc/Makeconf.bak;
sed -i 's/fpic/fPIC/g' /usr/local/lib/R/etc/Makeconf;
R -e "options(HTTPUserAgent = sprintf('R/%s R (%s)', getRversion(), 
paste(getRversion(), R.version['platform'], R.version['arch'], 
R.version['os']))); Sys.setenv('ARROW_R_DEV' = TRUE); 
install.packages(c('duckdb', 'arrow'), repo='${CRAN}', Ncpus=${NCPUS})"
mv /usr/local/lib/R/etc/Makeconf.bak /usr/local/lib/R/etc/Makeconf;

```

> [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND
> --
>
> Key: ARROW-17374
> URL: https://issues.apache.org/jira/browse/ARROW-17374
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0, 9.0.0, 8.0.1
> Environment: Amazon Linux 2 (RHEL) - 5.10.102-99.473.amzn2.x86_64
>Reporter: Shane Brennan
>Priority: Blocker
> Attachments: build-images.out
>
>
> I've been trying to install Arrow on an R notebook within AWS SageMaker. 
> SageMaker provides Jupyter-like notebooks, with each instance running Amazon 
> Linux 2 as its OS, itself based on RHEL. 
> Trying to install a few ways, e.g., using the standard binaries, using the 
> nightly builds, setting ARROW_WITH_SNAPPY to ON and LIBARROW_MINIMAL all 
> still result in the following error. 
> {noformat}
> x86_64-conda-linux-gnu-c++ -std=gnu++11 -shared 
> -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -Wl,-O2 -Wl,--sort-common 
> -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags 
> -Wl,--gc-sections -Wl,--allow-shlib-undefined 
> -Wl,-rpath,/home/ec2-user/anaconda3/envs/R/lib 
> -Wl,-rpath-link,/home/ec2-user/anaconda3/envs/R/lib 
> -L/home/ec2-user/anaconda3/envs/R/lib -o arrow.so RTasks.o altrep.o array.o 
> array_to_vector.o arraydata.o arrowExports.o bridge.o buffer.o chunkedarray.o 
> compression.o compute-exec.o compute.o config.o csv.o dataset.o datatype.o 
> expression.o extension-impl.o feather.o field.o filesystem.o imports.o io.o 
> json.o memorypool.o message.o parquet.o r_to_arrow.o recordbatch.o 
> recordbatchreader.o recordbatchwriter.o safe-call-into-r-impl.o scalar.o 
> schema.o symbols.o table.o threadpool.o type_infer.o 
> -L/tmp/Rtmpuh87oc/R.INSTALL67114493a3de/arrow/libarrow/arrow-9.0.0.20220809/lib
>  -larrow_dataset -lparquet -larrow -larrow_bundled_dependencies -lz 
> SNAPPY_LIB-NOTFOUND 

[jira] [Commented] (ARROW-17374) [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND

2022-09-02 Thread Vincent Nijs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599759#comment-17599759
 ] 

Vincent Nijs commented on ARROW-17374:
--

Made some progress. I had to install `libsnappy-dev` and `libre2-dev` on Ubuntu 
22.04 and use the workaround from 
[https://github.com/duckdb/duckdb/issues/3049#issuecomment-1096671708]

Arrow still doesn't compile successfully, however; the build ends with the error 
below. The build logs are attached ("build-images.out").

 

```

#21 1786.1 ** testing if installed package can be loaded from temporary location
#21 1786.4 Error: package or namespace load failed for ‘arrow’ in 
dyn.load(file, DLLpath = DLLpath, ...):
#21 1786.4  unable to load shared object 
'/usr/local/lib/R/site-library/00LOCK-arrow/00new/arrow/libs/arrow.so':
#21 1786.4   
/usr/local/lib/R/site-library/00LOCK-arrow/00new/arrow/libs/arrow.so: undefined 
symbol: _ZTIN6apache6thrift8protocol9TProtocolE

```

 

``` 

# workaround for fPIC issue 
[https://github.com/duckdb/duckdb/issues/3049#issuecomment-1096671708]

cp -a /usr/local/lib/R/etc/Makeconf /usr/local/lib/R/etc/Makeconf.bak;
sed -i 's/fpic/fPIC/g' /usr/local/lib/R/etc/Makeconf;
R -e "options(HTTPUserAgent = sprintf('R/%s R (%s)', getRversion(), 
paste(getRversion(), R.version['platform'], R.version['arch'], 
R.version['os']))); Sys.setenv('ARROW_R_DEV' = TRUE); 
install.packages(c('duckdb', 'arrow'), repo='${CRAN}', Ncpus=${NCPUS})"
mv /usr/local/lib/R/etc/Makeconf.bak /usr/local/lib/R/etc/Makeconf;

```

> [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND
> --
>
> Key: ARROW-17374
> URL: https://issues.apache.org/jira/browse/ARROW-17374
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0, 9.0.0, 8.0.1
> Environment: Amazon Linux 2 (RHEL) - 5.10.102-99.473.amzn2.x86_64
>Reporter: Shane Brennan
>Priority: Blocker
> Attachments: build-images.out
>
>
> I've been trying to install Arrow on an R notebook within AWS SageMaker. 
> SageMaker provides Jupyter-like notebooks, with each instance running Amazon 
> Linux 2 as its OS, itself based on RHEL. 
> Trying to install a few ways, e.g., using the standard binaries, using the 
> nightly builds, setting ARROW_WITH_SNAPPY to ON and LIBARROW_MINIMAL all 
> still result in the following error. 
> {noformat}
> x86_64-conda-linux-gnu-c++ -std=gnu++11 -shared 
> -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -Wl,-O2 -Wl,--sort-common 
> -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags 
> -Wl,--gc-sections -Wl,--allow-shlib-undefined 
> -Wl,-rpath,/home/ec2-user/anaconda3/envs/R/lib 
> -Wl,-rpath-link,/home/ec2-user/anaconda3/envs/R/lib 
> -L/home/ec2-user/anaconda3/envs/R/lib -o arrow.so RTasks.o altrep.o array.o 
> array_to_vector.o arraydata.o arrowExports.o bridge.o buffer.o chunkedarray.o 
> compression.o compute-exec.o compute.o config.o csv.o dataset.o datatype.o 
> expression.o extension-impl.o feather.o field.o filesystem.o imports.o io.o 
> json.o memorypool.o message.o parquet.o r_to_arrow.o recordbatch.o 
> recordbatchreader.o recordbatchwriter.o safe-call-into-r-impl.o scalar.o 
> schema.o symbols.o table.o threadpool.o type_infer.o 
> -L/tmp/Rtmpuh87oc/R.INSTALL67114493a3de/arrow/libarrow/arrow-9.0.0.20220809/lib
>  -larrow_dataset -lparquet -larrow -larrow_bundled_dependencies -lz 
> SNAPPY_LIB-NOTFOUND /home/ec2-user/anaconda3/envs/R/lib/libbz2.so -pthread 
> -larrow -larrow_bundled_dependencies -larrow_dataset -lparquet -lssl -lcrypto 
> -lcurl -lssl -lcrypto -lcurl -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -lR
> x86_64-conda-linux-gnu-c++: error: SNAPPY_LIB-NOTFOUND: No such file or 
> directory
> make: *** [/home/ec2-user/anaconda3/envs/R/lib/R/share/make/shlib.mk:10: 
> arrow.so] Error 1{noformat}
> Snappy is installed on the systems, and both shared object (.so) and cmake 
> files are there, where I've tried setting the system env variables Snappy_DIR 
> and Snappy_LIB to point at them, but to no avail.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17374) [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND

2022-09-02 Thread Vincent Nijs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent Nijs updated ARROW-17374:
-
Attachment: build-images.out

> [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND
> --
>
> Key: ARROW-17374
> URL: https://issues.apache.org/jira/browse/ARROW-17374
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0, 9.0.0, 8.0.1
> Environment: Amazon Linux 2 (RHEL) - 5.10.102-99.473.amzn2.x86_64
>Reporter: Shane Brennan
>Priority: Blocker
> Attachments: build-images.out
>
>
> I've been trying to install Arrow on an R notebook within AWS SageMaker. 
> SageMaker provides Jupyter-like notebooks, with each instance running Amazon 
> Linux 2 as its OS, itself based on RHEL. 
> Trying to install a few ways, e.g., using the standard binaries, using the 
> nightly builds, setting ARROW_WITH_SNAPPY to ON and LIBARROW_MINIMAL all 
> still result in the following error. 
> {noformat}
> x86_64-conda-linux-gnu-c++ -std=gnu++11 -shared 
> -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -Wl,-O2 -Wl,--sort-common 
> -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags 
> -Wl,--gc-sections -Wl,--allow-shlib-undefined 
> -Wl,-rpath,/home/ec2-user/anaconda3/envs/R/lib 
> -Wl,-rpath-link,/home/ec2-user/anaconda3/envs/R/lib 
> -L/home/ec2-user/anaconda3/envs/R/lib -o arrow.so RTasks.o altrep.o array.o 
> array_to_vector.o arraydata.o arrowExports.o bridge.o buffer.o chunkedarray.o 
> compression.o compute-exec.o compute.o config.o csv.o dataset.o datatype.o 
> expression.o extension-impl.o feather.o field.o filesystem.o imports.o io.o 
> json.o memorypool.o message.o parquet.o r_to_arrow.o recordbatch.o 
> recordbatchreader.o recordbatchwriter.o safe-call-into-r-impl.o scalar.o 
> schema.o symbols.o table.o threadpool.o type_infer.o 
> -L/tmp/Rtmpuh87oc/R.INSTALL67114493a3de/arrow/libarrow/arrow-9.0.0.20220809/lib
>  -larrow_dataset -lparquet -larrow -larrow_bundled_dependencies -lz 
> SNAPPY_LIB-NOTFOUND /home/ec2-user/anaconda3/envs/R/lib/libbz2.so -pthread 
> -larrow -larrow_bundled_dependencies -larrow_dataset -lparquet -lssl -lcrypto 
> -lcurl -lssl -lcrypto -lcurl -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -lR
> x86_64-conda-linux-gnu-c++: error: SNAPPY_LIB-NOTFOUND: No such file or 
> directory
> make: *** [/home/ec2-user/anaconda3/envs/R/lib/R/share/make/shlib.mk:10: 
> arrow.so] Error 1{noformat}
> Snappy is installed on the systems, and both shared object (.so) and cmake 
> files are there, where I've tried setting the system env variables Snappy_DIR 
> and Snappy_LIB to point at them, but to no avail.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17374) [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND

2022-09-02 Thread Vincent Nijs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent Nijs updated ARROW-17374:
-
Attachment: (was: image-2022-08-26-21-42-13-338.png)

> [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND
> --
>
> Key: ARROW-17374
> URL: https://issues.apache.org/jira/browse/ARROW-17374
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0, 9.0.0, 8.0.1
> Environment: Amazon Linux 2 (RHEL) - 5.10.102-99.473.amzn2.x86_64
>Reporter: Shane Brennan
>Priority: Blocker
> Attachments: build-images.out
>
>
> I've been trying to install Arrow on an R notebook within AWS SageMaker. 
> SageMaker provides Jupyter-like notebooks, with each instance running Amazon 
> Linux 2 as its OS, itself based on RHEL. 
> Trying to install a few ways, e.g., using the standard binaries, using the 
> nightly builds, setting ARROW_WITH_SNAPPY to ON and LIBARROW_MINIMAL all 
> still result in the following error. 
> {noformat}
> x86_64-conda-linux-gnu-c++ -std=gnu++11 -shared 
> -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -Wl,-O2 -Wl,--sort-common 
> -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags 
> -Wl,--gc-sections -Wl,--allow-shlib-undefined 
> -Wl,-rpath,/home/ec2-user/anaconda3/envs/R/lib 
> -Wl,-rpath-link,/home/ec2-user/anaconda3/envs/R/lib 
> -L/home/ec2-user/anaconda3/envs/R/lib -o arrow.so RTasks.o altrep.o array.o 
> array_to_vector.o arraydata.o arrowExports.o bridge.o buffer.o chunkedarray.o 
> compression.o compute-exec.o compute.o config.o csv.o dataset.o datatype.o 
> expression.o extension-impl.o feather.o field.o filesystem.o imports.o io.o 
> json.o memorypool.o message.o parquet.o r_to_arrow.o recordbatch.o 
> recordbatchreader.o recordbatchwriter.o safe-call-into-r-impl.o scalar.o 
> schema.o symbols.o table.o threadpool.o type_infer.o 
> -L/tmp/Rtmpuh87oc/R.INSTALL67114493a3de/arrow/libarrow/arrow-9.0.0.20220809/lib
>  -larrow_dataset -lparquet -larrow -larrow_bundled_dependencies -lz 
> SNAPPY_LIB-NOTFOUND /home/ec2-user/anaconda3/envs/R/lib/libbz2.so -pthread 
> -larrow -larrow_bundled_dependencies -larrow_dataset -lparquet -lssl -lcrypto 
> -lcurl -lssl -lcrypto -lcurl -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -lR
> x86_64-conda-linux-gnu-c++: error: SNAPPY_LIB-NOTFOUND: No such file or 
> directory
> make: *** [/home/ec2-user/anaconda3/envs/R/lib/R/share/make/shlib.mk:10: 
> arrow.so] Error 1{noformat}
> Snappy is installed on the systems, and both shared object (.so) and cmake 
> files are there, where I've tried setting the system env variables Snappy_DIR 
> and Snappy_LIB to point at them, but to no avail.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-17374) [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND

2022-09-02 Thread Vincent Nijs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17585662#comment-17585662
 ] 

Vincent Nijs edited comment on ARROW-17374 at 9/3/22 1:51 AM:
--

I ran into the same issue installing arrow from source for R on aarch64. Python 
is installed using conda but R was installed from source. The docker container 
for aarch64 is 
[vnijs/rsm-jupyter|https://hub.docker.com/repository/docker/vnijs/rsm-jupyter]. 
Note that the image is rather large.


was (Author: vnijs):
I ran into the same issue installing arrow from source for R on aarch64. Python 
is installed using conda but R was installed from source. The docker container 
for aarch64 is 
[vnijs/rsm-jupyter|https://hub.docker.com/repository/docker/vnijs/rsm-jupyter]. 
Note that the image is rather large.

 

FYI I also tried installing using binaries but that gives the error below even 
though the file clearly exists.

!image-2022-08-26-21-42-13-338.png!

 

 

> [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND
> --
>
> Key: ARROW-17374
> URL: https://issues.apache.org/jira/browse/ARROW-17374
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0, 9.0.0, 8.0.1
> Environment: Amazon Linux 2 (RHEL) - 5.10.102-99.473.amzn2.x86_64
>Reporter: Shane Brennan
>Priority: Blocker
> Attachments: image-2022-08-26-21-42-13-338.png
>
>
> I've been trying to install Arrow on an R notebook within AWS SageMaker. 
> SageMaker provides Jupyter-like notebooks, with each instance running Amazon 
> Linux 2 as its OS, itself based on RHEL. 
> Trying to install a few ways, e.g., using the standard binaries, using the 
> nightly builds, setting ARROW_WITH_SNAPPY to ON and LIBARROW_MINIMAL all 
> still result in the following error. 
> {noformat}
> x86_64-conda-linux-gnu-c++ -std=gnu++11 -shared 
> -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -Wl,-O2 -Wl,--sort-common 
> -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags 
> -Wl,--gc-sections -Wl,--allow-shlib-undefined 
> -Wl,-rpath,/home/ec2-user/anaconda3/envs/R/lib 
> -Wl,-rpath-link,/home/ec2-user/anaconda3/envs/R/lib 
> -L/home/ec2-user/anaconda3/envs/R/lib -o arrow.so RTasks.o altrep.o array.o 
> array_to_vector.o arraydata.o arrowExports.o bridge.o buffer.o chunkedarray.o 
> compression.o compute-exec.o compute.o config.o csv.o dataset.o datatype.o 
> expression.o extension-impl.o feather.o field.o filesystem.o imports.o io.o 
> json.o memorypool.o message.o parquet.o r_to_arrow.o recordbatch.o 
> recordbatchreader.o recordbatchwriter.o safe-call-into-r-impl.o scalar.o 
> schema.o symbols.o table.o threadpool.o type_infer.o 
> -L/tmp/Rtmpuh87oc/R.INSTALL67114493a3de/arrow/libarrow/arrow-9.0.0.20220809/lib
>  -larrow_dataset -lparquet -larrow -larrow_bundled_dependencies -lz 
> SNAPPY_LIB-NOTFOUND /home/ec2-user/anaconda3/envs/R/lib/libbz2.so -pthread 
> -larrow -larrow_bundled_dependencies -larrow_dataset -lparquet -lssl -lcrypto 
> -lcurl -lssl -lcrypto -lcurl -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -lR
> x86_64-conda-linux-gnu-c++: error: SNAPPY_LIB-NOTFOUND: No such file or 
> directory
> make: *** [/home/ec2-user/anaconda3/envs/R/lib/R/share/make/shlib.mk:10: 
> arrow.so] Error 1{noformat}
> Snappy is installed on the systems, and both shared object (.so) and cmake 
> files are there, where I've tried setting the system env variables Snappy_DIR 
> and Snappy_LIB to point at them, but to no avail.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17599) [C++] ReadRangeCache should not retain data after read

2022-09-02 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-17599:

Labels: good-second-issue  (was: )

> [C++] ReadRangeCache should not retain data after read
> --
>
> Key: ARROW-17599
> URL: https://issues.apache.org/jira/browse/ARROW-17599
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>  Labels: good-second-issue
>
> I've added a unit test of the issue here: 
> https://github.com/westonpace/arrow/tree/experiment/read-range-cache-retention
> We use the ReadRangeCache for pre-buffering IPC and parquet files.  Sometimes 
> those files are quite large (gigabytes).  The usage is roughly:
> for X in num_row_groups:
>   CacheAllThePiecesWeNeedForRowGroupX
>   WaitForPiecesToArriveForRowGroupX
>   ReadThePiecesWeNeedForRowGroupX
> However, once we've read in row group X and passed it on to Acero, etc. we do 
> not release the data for row group X.  The read range cache's entries vector 
> still holds a pointer to the buffer.  The data is not released until the file 
> reader itself is destroyed which only happens when we have finished 
> processing an entire file.
> This leads to excessive memory usage when pre-buffering is enabled.
> This could potentially be a little difficult to implement because a single 
> read range's cache entry could be shared by multiple ranges so we will need 
> some kind of reference counting to know when we have fully finished with an 
> entry and can release it.
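
A toy sketch of the reference-counting idea, in Python for brevity (the real
change would live in the C++ ReadRangeCache; all names here are hypothetical):

{code:python}
class CacheEntry:
    """Hypothetical ref-counted entry: the cached buffer is released as
    soon as every range that shares it has been consumed."""

    def __init__(self, buffer, num_ranges):
        self.buffer = buffer
        self.refcount = num_ranges  # ranges still needing this entry

    def consume(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.buffer = None  # drop the last reference; memory is freed
{code}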



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17604) [Java][Docs] Improve docs around JVM flags

2022-09-02 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li reassigned ARROW-17604:


Assignee: David Li

> [Java][Docs] Improve docs around JVM flags
> --
>
> Key: ARROW-17604
> URL: https://issues.apache.org/jira/browse/ARROW-17604
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Java
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>
> * Clarify where the {{--add-opens}} flag should be added (as an argument to 
> {{java}})
> * Demonstrate how to configure Surefire with it
> * Demonstrate how to configure IntelliJ with it
> * Catch the exception that results if you forget and rethrow it with a 
> message telling you what to do/what docs to visit
> * Investigate MemoryHandles as a possible replacement for this
> * Poke MemoryUtil.UNSAFE ASAP at startup (e.g. in RootAllocator) to fail 
> sooner
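
For reference, {{--add-opens}} goes on the {{java}} command line itself; a
minimal sketch (the jar name is illustrative, and the exact module list may
differ by Arrow version):

{noformat}
java --add-opens=java.base/java.nio=ALL-UNNAMED -jar my-app.jar
{noformat}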



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17604) [Java][Docs] Improve docs around JVM flags

2022-09-02 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-17604:
-
Description: 
* Clarify where the {{--add-opens}} flag should be added (as an argument to 
{{java}})
* Demonstrate how to configure Surefire with it
* Demonstrate how to configure IntelliJ with it
* Catch the exception that results if you forget and rethrow it with a message 
telling you what to do/what docs to visit
* Investigate MemoryHandles as a possible replacement for this

  was:
* Clarify where the {{--add-opens}} flag should be added (as an argument to 
{{java}})
* Demonstrate how to configure Surefire with it
* Demonstrate how to configure IntelliJ with it
* Catch the exception that results if you forget and rethrow it with a message 
telling you what to do/what docs to visit


> [Java][Docs] Improve docs around JVM flags
> --
>
> Key: ARROW-17604
> URL: https://issues.apache.org/jira/browse/ARROW-17604
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Java
>Reporter: David Li
>Priority: Major
>
> * Clarify where the {{--add-opens}} flag should be added (as an argument to 
> {{java}})
> * Demonstrate how to configure Surefire with it
> * Demonstrate how to configure IntelliJ with it
> * Catch the exception that results if you forget and rethrow it with a 
> message telling you what to do/what docs to visit
> * Investigate MemoryHandles as a possible replacement for this



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17604) [Java][Docs] Improve docs around JVM flags

2022-09-02 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-17604:
-
Description: 
* Clarify where the {{--add-opens}} flag should be added (as an argument to 
{{java}})
* Demonstrate how to configure Surefire with it
* Demonstrate how to configure IntelliJ with it
* Catch the exception that results if you forget and rethrow it with a message 
telling you what to do/what docs to visit
* Investigate MemoryHandles as a possible replacement for this
* Poke MemoryUtil.UNSAFE ASAP at startup (e.g. in RootAllocator) to fail sooner

  was:
* Clarify where the {{--add-opens}} flag should be added (as an argument to 
{{java}})
* Demonstrate how to configure Surefire with it
* Demonstrate how to configure IntelliJ with it
* Catch the exception that results if you forget and rethrow it with a message 
telling you what to do/what docs to visit
* Investigate MemoryHandles as a possible replacement for this


> [Java][Docs] Improve docs around JVM flags
> --
>
> Key: ARROW-17604
> URL: https://issues.apache.org/jira/browse/ARROW-17604
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Java
>Reporter: David Li
>Priority: Major
>
> * Clarify where the {{--add-opens}} flag should be added (as an argument to 
> {{java}})
> * Demonstrate how to configure Surefire with it
> * Demonstrate how to configure IntelliJ with it
> * Catch the exception that results if you forget and rethrow it with a 
> message telling you what to do/what docs to visit
> * Investigate MemoryHandles as a possible replacement for this
> * Poke MemoryUtil.UNSAFE ASAP at startup (e.g. in RootAllocator) to fail 
> sooner



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17604) [Java][Docs] Improve docs around JVM flags

2022-09-02 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-17604:
-
Description: 
* Clarify where the {{--add-opens}} flag should be added (as an argument to 
{{java}})
* Demonstrate how to configure Surefire with it
* Demonstrate how to configure IntelliJ with it
* Catch the exception that results if you forget and rethrow it with a message 
telling you what to do/what docs to visit

  was:
* Clarify where the {{--add-opens}} flag should be added (as an argument to 
{{java}})
* Demonstrate how to configure Surefire with it
* Demonstrate how to configure IntelliJ with it


> [Java][Docs] Improve docs around JVM flags
> --
>
> Key: ARROW-17604
> URL: https://issues.apache.org/jira/browse/ARROW-17604
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Java
>Reporter: David Li
>Priority: Major
>
> * Clarify where the {{--add-opens}} flag should be added (as an argument to 
> {{java}})
> * Demonstrate how to configure Surefire with it
> * Demonstrate how to configure IntelliJ with it
> * Catch the exception that results if you forget and rethrow it with a 
> message telling you what to do/what docs to visit



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17604) [Java][Docs] Improve docs around JVM flags

2022-09-02 Thread David Li (Jira)
David Li created ARROW-17604:


 Summary: [Java][Docs] Improve docs around JVM flags
 Key: ARROW-17604
 URL: https://issues.apache.org/jira/browse/ARROW-17604
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Java
Reporter: David Li


* Clarify where the {{--add-opens}} flag should be added (as an argument to 
{{java}})
* Demonstrate how to configure Surefire with it
* Demonstrate how to configure IntelliJ with it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17595) [C++] Installation Error stdlib.h no such file or directory

2022-09-02 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599720#comment-17599720
 ] 

Kouhei Sutou commented on ARROW-17595:
--

Thanks. I could install {{gcc-toolset-11}}. But it's installed from the 
"ol8_appstream" repository. What is your "acx-appstream" repository?

> [C++] Installation Error stdlib.h no such file or directory
> ---
>
> Key: ARROW-17595
> URL: https://issues.apache.org/jira/browse/ARROW-17595
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 9.0.0
> Environment: OEL 8.5
>Reporter: Robert Tidwell
>Priority: Major
>
> Getting an error installing arrow.  
> {noformat}
> /opt/rh/gcc-toolset-11/root/usr/include/c++/11/cstdlib:75:15: fatal error: 
> stdlib.h: No such file or directory
>    75 | #include_next 
> {noformat}
> Using the c\+\+11 toolset. Tried with the default c\+\+ on the system and got 
> the same type of error. OS is OEL 8.5.
> Here is the detailed output:
> {noformat}
> Installing package into ‘/usr/lib64/R/library’
> (as ‘lib’ is unspecified)
> trying URL 
> 'http://**/prod/cran/src/contrib/arrow_9.0.0.tar.gz'
> Content type 'application/octet-stream' length 4900968 bytes (4.7 MB)
> ==
> downloaded 4.7 MB
> * installing *source* package ‘arrow’ ...
> ** package ‘arrow’ successfully unpacked and MD5 sums checked
> ** using staged installation
> *** Found local C++ source: 'tools/cpp'
> *** Building libarrow from source
> For build options and troubleshooting, see the install vignette:
> https://cran.r-project.org/web/packages/arrow/vignettes/install.html
> *** Building with MAKEFLAGS= -j2 
>  cmake: /bin/cmake
>  arrow with SOURCE_DIR='tools/cpp' BUILD_DIR='/tmp/rmt' 
> DEST_DIR='libarrow/arrow-9.0.0' CMAKE='/bin/cmake' EXTRA_CMAKE_FLAGS='' 
> CC='gcc -m64' CXX='g++ -m64 -std=gnu++11' LDFLAGS='-Wl,-z,relro -Wl,-z,now 
> -specs=/usr/lib/rpm/redhat/redhat-hardened-ld' ARROW_S3='OFF' ARROW_GCS='OFF' 
> ARROW_MIMALLOC='OFF' 
> ++ pwd
> + : /tmp/Rtmp6bSuH3/R.INSTALL40fe61c13650d/arrow
> + : tools/cpp
> + : /tmp/rmt
> + : libarrow/arrow-9.0.0
> + : /bin/cmake
> ++ cd tools/cpp
> ++ pwd
> + SOURCE_DIR=/tmp/Rtmp6bSuH3/R.INSTALL40fe61c13650d/arrow/tools/cpp
> ++ mkdir -p libarrow/arrow-9.0.0
> ++ cd libarrow/arrow-9.0.0
> ++ pwd
> + DEST_DIR=/tmp/Rtmp6bSuH3/R.INSTALL40fe61c13650d/arrow/libarrow/arrow-9.0.0
> + '[' '' '!=' '' ']'
> + '[' '' = false ']'
> + ARROW_DEFAULT_PARAM=OFF
> + mkdir -p /tmp/rmt
> + pushd /tmp/rmt
> /tmp/rmt /tmp/Rtmp6bSuH3/R.INSTALL40fe61c13650d/arrow
> + /bin/cmake -DARROW_BOOST_USE_SHARED=OFF -DARROW_BUILD_TESTS=OFF 
> -DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON -DARROW_COMPUTE=ON 
> -DARROW_CSV=ON -DARROW_DATASET=ON -DARROW_DEPENDENCY_SOURCE=AUTO 
> -DAWSSDK_SOURCE= -DARROW_FILESYSTEM=ON -DARROW_GCS=OFF -DARROW_JEMALLOC=OFF 
> -DARROW_MIMALLOC=OFF -DARROW_JSON=ON -DARROW_PARQUET=ON -DARROW_S3=OFF 
> -DARROW_WITH_BROTLI=OFF -DARROW_WITH_BZ2=OFF -DARROW_WITH_LZ4=ON 
> -DARROW_WITH_RE2=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_UTF8PROC=ON 
> -DARROW_WITH_ZLIB=OFF -DARROW_WITH_ZSTD=OFF 
> -DARROW_VERBOSE_THIRDPARTY_BUILD=OFF -DCMAKE_BUILD_TYPE=Release 
> -DCMAKE_INSTALL_LIBDIR=lib 
> -DCMAKE_INSTALL_PREFIX=/tmp/Rtmp6bSuH3/R.INSTALL40fe61c13650d/arrow/libarrow/arrow-9.0.0
>  -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON 
> -DCMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY=ON -DCMAKE_UNITY_BUILD=OFF 
> -Dxsimd_SOURCE= -G 'Unix Makefiles' 
> /tmp/Rtmp6bSuH3/R.INSTALL40fe61c13650d/arrow/tools/cpp
> -- Building using CMake version: 3.20.2
> -- The C compiler identification is GNU 11.2.1
> -- The CXX compiler identification is GNU 11.2.1
> -- Detecting C compiler ABI info
> -- Detecting C compiler ABI info - failed
> -- Check for working C compiler: /opt/rh/gcc-toolset-11/root/usr/bin/gcc
> -- Check for working C compiler: /opt/rh/gcc-toolset-11/root/usr/bin/gcc - 
> works
> -- Detecting C compile features
> -- Detecting C compile features - done
> -- Detecting CXX compiler ABI info
> -- Detecting CXX compiler ABI info - failed
> -- Check for working CXX compiler: /opt/rh/gcc-toolset-11/root/usr/bin/g++
> -- Check for working CXX compiler: /opt/rh/gcc-toolset-11/root/usr/bin/g++ - 
> works
> -- Detecting CXX compile features
> -- Detecting CXX compile features - done
> -- Arrow version: 9.0.0 (full: '9.0.0')
> -- Arrow SO version: 900 (full: 900.0.0)
> -- clang-tidy 12 not found
> -- clang-format 12 not found
> -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) 
> -- infer not found
> fatal: not a git repository (or any parent up to mount point /)
> Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
> -- Found Python3: /bin/python3.6 (found version "3.6.8") found 

[jira] [Comment Edited] (ARROW-17595) [C++] Installation Error stdlib.h no such file or directory

2022-09-02 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599511#comment-17599511
 ] 

Kouhei Sutou edited comment on ARROW-17595 at 9/2/22 9:20 PM:
--

{noformat}
dnf install gcc-toolset-11
{noformat}

Here are the packages that I have installed:

{noformat}
gcc-toolset-11.x86_64                             11.1-1.el8          @acx-appstream
gcc-toolset-11-annobin-docs.noarch                10.23-1.el8         @acx-appstream
gcc-toolset-11-annobin-plugin-gcc.x86_64          10.23-1.el8         @acx-appstream
gcc-toolset-11-binutils.x86_64                    2.36.1-2.0.1.el8    @acx-appstream
gcc-toolset-11-dwz.x86_64                         0.14-2.el8          @acx-appstream
gcc-toolset-11-dyninst.x86_64                     11.0.0-2.el8        @acx-appstream
gcc-toolset-11-elfutils.x86_64                    0.185-5.el8         @acx-appstream
gcc-toolset-11-elfutils-debuginfod-client.x86_64  0.185-5.el8         @acx-appstream
gcc-toolset-11-elfutils-libelf.x86_64             0.185-5.el8         @acx-appstream
gcc-toolset-11-elfutils-libs.x86_64               0.185-5.el8         @acx-appstream
gcc-toolset-11-gcc.x86_64                         11.2.1-9.1.0.2.el8  @acx-appstream
gcc-toolset-11-gcc-c++.x86_64                     11.2.1-9.1.0.2.el8  @acx-appstream
gcc-toolset-11-gcc-gdb-plugin.x86_64              11.2.1-9.1.0.2.el8  @acx-appstream
gcc-toolset-11-gcc-gfortran.x86_64                11.2.1-9.1.0.2.el8  @acx-appstream
gcc-toolset-11-gcc-plugin-devel.x86_64            11.2.1-9.1.0.2.el8  @acx-appstream
gcc-toolset-11-gdb.x86_64                         10.2-5.0.1.el8      @acx-appstream
gcc-toolset-11-libquadmath-devel.x86_64           11.2.1-9.1.0.2.el8  @acx-appstream
gcc-toolset-11-libstdc++-devel.x86_64             11.2.1-9.1.0.2.el8  @acx-appstream
gcc-toolset-11-ltrace.x86_64                      0.7.91-1.el8        @acx-appstream
gcc-toolset-11-make.x86_64                        1:4.3-2.el8         @acx-appstream
gcc-toolset-11-make-devel.x86_64                  1:4.3-2.el8         @acx-appstream
gcc-toolset-11-perftools.x86_64                   11.1-1.el8          @acx-appstream
gcc-toolset-11-runtime.x86_64                     11.0-1.el8          @acx-appstream
gcc-toolset-11-strace.x86_64                      5.13-7.el8          @acx-appstream
gcc-toolset-11-systemtap.x86_64                   4.5-4.el8           @acx-appstream
gcc-toolset-11-systemtap-client.x86_64            4.5-4.el8           @acx-appstream
gcc-toolset-11-systemtap-devel.x86_64             4.5-4.el8           @acx-appstream
gcc-toolset-11-systemtap-runtime.x86_64           4.5-4.el8           @acx-appstream
gcc-toolset-11-toolchain.x86_64                   11.1-1.el8          @acx-appstream
gcc-toolset-11-valgrind.x86_64                    1:3.17.0-6.el8      @acx-appstream
{noformat}


was (Author: JIRAUSER295252):
dnf install gcc-toolset-11

Here are the packages that I have installed:
gcc-toolset-11.x86_64                             11.1-1.el8          @acx-appstream
gcc-toolset-11-annobin-docs.noarch                10.23-1.el8         @acx-appstream
gcc-toolset-11-annobin-plugin-gcc.x86_64          10.23-1.el8         @acx-appstream
gcc-toolset-11-binutils.x86_64                    2.36.1-2.0.1.el8    @acx-appstream
gcc-toolset-11-dwz.x86_64                         0.14-2.el8          @acx-appstream
gcc-toolset-11-dyninst.x86_64                     11.0.0-2.el8        @acx-appstream
gcc-toolset-11-elfutils.x86_64                    0.185-5.el8         @acx-appstream
gcc-toolset-11-elfutils-debuginfod-client.x86_64  0.185-5.el8         @acx-appstream
gcc-toolset-11-elfutils-libelf.x86_64             0.185-5.el8         @acx-appstream

[jira] [Commented] (ARROW-17319) [Python] pyarrow seems to set default CPU affinity to 0 on shutdown, crashes if CPU 0 is not available

2022-09-02 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599715#comment-17599715
 ] 

Kouhei Sutou commented on ARROW-17319:
--

Sorry. I missed this.

We don't need to do anything for "Please also update the crossbow configuration 
in order to keep the github ..." in the pull request; that's a task for after we 
merge the pull request.

Our CI jobs did pick up the new vcpkg revision from the pull request: 
https://github.com/apache/arrow/pull/13835#issuecomment-1210326357

But it failed in the patching phase:

https://github.com/ursacomputing/crossbow/runs/7762521768?check_suite_focus=true#step:6:65

{noformat}
error: patch failed: ports/snappy/portfile.cmake:4
error: ports/snappy/portfile.cmake: patch does not apply
Checking patch ports/snappy/snappy-disable-bmi.patch...
{noformat}

We need to confirm that the failed patch 
https://github.com/apache/arrow/blob/master/ci/vcpkg/ports.patch#L89-L125 is 
still needed. If it's no longer needed, we can remove the failed patch from our 
patch set.

> [Python] pyarrow seems to set default CPU affinity to 0 on shutdown, crashes 
> if CPU 0 is not available
> --
>
> Key: ARROW-17319
> URL: https://issues.apache.org/jira/browse/ARROW-17319
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
> Environment: Ubuntu 20.04 / Python 3.8.10 (default, Jun 22 2022, 
> 20:18:18)
> $ pip list 
> Package Version
> --- ---
> numpy   1.23.1 
> pandas  1.4.3  
> pip 20.0.2 
> pkg-resources   0.0.0  
> pyarrow 9.0.0  
> python-dateutil 2.8.2  
> pytz2022.1 
> setuptools  44.0.0 
> six 1.16.0 
>Reporter: Mike Gevaert
>Assignee: Mike Gevaert
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> I get the following traceback when exiting python after loading 
> {{pyarrow.parquet}}
> {code}
> Python 3.8.10 (default, Jun 22 2022, 20:18:18) 
> [GCC 9.4.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> os.getpid()
> 25106
> >>> import pyarrow.parquet
> >>> 
> Fatal error condition occurred in 
> /opt/vcpkg/buildtrees/aws-c-io/src/9e6648842a-364b708815.clean/source/event_loop.c:72:
>  aws_thread_launch(_thread, s_event_loop_destroy_async_thread_fn, 
> el_group, _options) == AWS_OP_SUCCESS
> Exiting Application
> 
> Stack trace:
> 
> /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x200af06) 
> [0x7f831b2b3f06]
> /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x20028e5) 
> [0x7f831b2ab8e5]
> /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x1f27e09) 
> [0x7f831b1d0e09]
> /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) 
> [0x7f831b2b4a3d]
> /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x1f25948) 
> [0x7f831b1ce948]
> /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) 
> [0x7f831b2b4a3d]
> /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x1ee0b46) 
> [0x7f831b189b46]
> /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x194546a) 
> [0x7f831abee46a]
> /lib/x86_64-linux-gnu/libc.so.6(+0x468a7) [0x7f831c6188a7]
> /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7f831c618a60]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7f831c5f608a]
>  {code}
> To replicate this, one needs to make sure that CPU 0 isn't available to 
> schedule tasks on.  In our HPC environment, that happens due to slurm using 
> cgroups to constrain CPU usage.
> On a linux workstation, one should be able to:
> 1) open python as a normal user
> 2) get the pid
> 3) as root:
> {code}
> cd /sys/fs/cgroup/cpuset/
> mkdir pyarrow
> cd pyarrow
> echo 0 > cpuset.mems
> echo 1 > cpuset.cpus # sets the cgroup to only have access to cpu 1
> echo $PID > tasks
> {code}
> Then, in the python environment:
> {code}
> import pyarrow.parquet
> exit()
> {code}
> Which should trigger the crash.
> Sadly, I couldn't track down which {{aws-c-common}} and {{aws-c-io}} are 
> being used for the 9.0.0 py38 manylinux wheels. (libarrow.so.900 has 
> BuildID[sha1]=dd6c5a2efd5cacf09657780a58c40f7c930e4df1)
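
One way to confirm from inside the process that CPU 0 is excluded before
exiting (standard library, Linux only):

{code:python}
import os

# shows the set of CPUs this process may be scheduled on, e.g. {1}
print(os.sched_getaffinity(0))
{code}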



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-6772) [C++] Add operator== for interfaces with an Equals() method

2022-09-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6772:
--
Labels: good-first-issue pull-request-available  (was: good-first-issue)

> [C++] Add operator== for interfaces with an Equals() method
> ---
>
> Key: ARROW-6772
> URL: https://issues.apache.org/jira/browse/ARROW-6772
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Harkins
>Priority: Major
>  Labels: good-first-issue, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A common pattern in tests is {{ASSERT_TRUE(schm->Equals(*other))}}. The 
> addition of overloaded equality operators will allow this to be written 
> {{ASSERT_EQ(*schm, *other)}}, which is more idiomatic GTEST usage and will 
> allow more informative assertion failure messages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17601) [C++] Error when creating Expression on Decimal128 types: precision out of range

2022-09-02 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599693#comment-17599693
 ] 

Neal Richardson commented on ARROW-17601:
-

Also, in case it's relevant: we don't have any problems like this with integer 
or floating-point types:

{code}
> expr <- Expression$create("multiply_checked", Expression$field_ref("col1"), 
> Expression$field_ref("col2"))
> 
> expr$type(schema(col1=int64(), col2=int64()))
Int64
int64
> expr$type(schema(col1=int32(), col2=int32()))
Int32
int32
> expr$type(schema(col1=float32(), col2=float32()))
Float32
float
{code}

> [C++] Error when creating Expression on Decimal128 types: precision out of 
> range
> 
>
> Key: ARROW-17601
> URL: https://issues.apache.org/jira/browse/ARROW-17601
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
>
> Reproducer in R:
> {code}
> library(arrow)
> library(dplyr)
> tab <- Table$create(col1 = 1:4, col2 = 5:8)
> tab <- tab$cast(schema(col1 = decimal128(33, 4), col2 = decimal128(15, 2)))
> tab %>% mutate(col1 * col2)
> # Error: Invalid: Decimal precision out of range [1, 38]: 49
> # /Users/me/arrow/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc:1078  
> DecimalType::Make(left_type.id(), precision, scale)
> # /Users/me/arrow/cpp/src/arrow/compute/exec/expression.cc:413  
> call.kernel->signature->out_type().Resolve(_context, types)
> {code}
> With integers and floats, we upcast to a wider size in some compute functions 
> like multiplication. Should this go up to Decimal256? Or is there a better 
> way to determine the size required?
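
An equivalent Python-level reproducer (a sketch; it assumes the error comes
from the same multiply_checked output-type resolution shown in the traceback):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

t = pa.table({"col1": [1, 2, 3, 4], "col2": [5, 6, 7, 8]})
t = t.cast(pa.schema([("col1", pa.decimal128(33, 4)),
                      ("col2", pa.decimal128(15, 2))]))
# result precision for decimal multiply is 33 + 15 + 1 = 49 > 38,
# so resolving the output type raises ArrowInvalid
pc.multiply_checked(t["col1"], t["col2"])
{code}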



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17492) [C++] Hashing32/64 support for large var-binary types

2022-09-02 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace resolved ARROW-17492.
-
Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 13940
[https://github.com/apache/arrow/pull/13940]

> [C++] Hashing32/64 support for large var-binary types
> -
>
> Key: ARROW-17492
> URL: https://issues.apache.org/jira/browse/ARROW-17492
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Currently, Hashing32/64 only supports non-large var-binary types. This issue 
> will add support for large var-binary types.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17593) [C++] Try and maintain input shape in Acero

2022-09-02 Thread Aldrin Montana (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599684#comment-17599684
 ] 

Aldrin Montana commented on ARROW-17593:


Nice.

Since only the final projection would need to worry about the max_batch_size, 
not the internal plan nodes (I think), this sounds like there wouldn't be too 
much cost either. I'm mostly curious how efficiently aggregates can size their 
outputs, since they won't know exactly how many groups there will be until 
runtime. But I also guess that would only be inefficient in the last batch, so 
it would be amortized for larger result sets.

> [C++] Try and maintain input shape in Acero
> ---
>
> Key: ARROW-17593
> URL: https://issues.apache.org/jira/browse/ARROW-17593
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>
> Data is scanned in large chunks based on the format.  For example, CSV scans 
> chunks based on a chunk_size while parquet scans entire row groups.
> Then, upon entry into Acero, these chunks are sliced into morsels (~L3 size) 
> for parallelism and batches (~L1-L2 size) for cache efficient processing.
> However, the way it is currently done, means that the output of Acero is a 
> stream of tiny batches.  This is somewhat undesirable in many cases.
> For example, if a pyarrow user calls pq.read_table they might expect to get 
> one batch per row group.  If they were to turn around and write out that 
> table to a new parquet file then either they end up with a non-ideal parquet 
> file (tiny row groups) or they are forced to concatenate the batches (which 
> is an allocation + copy).
> Even if the user is doing their own streaming processing (e.g. in pyarrow) 
> these small batch sizes are undesirable as the overhead of python means that 
> streaming processing should be done in larger batches.
> Instead, there should be a configurable max_batch_size, independent of row 
> group size and morsel size, which is configurable, and quite large by default 
> (1Mi or 64Mi rows).  This control exists for users that want to do their own 
> streaming processing and need to be able to tune for RAM usage.
> Acero will read in data based on the format, as it does today (e.g. CSV chunk 
> size, row group size).  If the source data is very large (bigger than 
> max_batch_size) it will be sliced.  From that point on, any morsels or 
> batches should simply be views into this larger output batch.  For example, 
> when doing a projection to add a new column, we should allocate a 
> max_batch_size array and then populate it over many runs of the project node.
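
The "views into this larger output batch" part is essentially zero-copy
slicing, e.g. (plain pyarrow sketch, not the Acero internals):

{code:python}
import pyarrow as pa

big = pa.RecordBatch.from_pydict({"x": list(range(1_000_000))})
morsel_size = 131_072
# each slice is a zero-copy view sharing the parent batch's buffers
morsels = [big.slice(i, morsel_size)
           for i in range(0, big.num_rows, morsel_size)]
{code}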



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17598) [C++] Arrow io memory_benchmark does not compile if SIMD level is NEON

2022-09-02 Thread Aldrin Montana (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599671#comment-17599671
 ] 

Aldrin Montana commented on ARROW-17598:


Sounds good [~yibocai]. I wasn't sure if it would need any more discussion 
(which is why I made a JIRA instead of submitting a MINOR PR).

I opened a PR, but I don't think the CI will exercise it, so I'm not sure what 
to do about that. Let me know if you'd like any other follow-up!

> [C++] Arrow io memory_benchmark does not compile if SIMD level is NEON
> --
>
> Key: ARROW-17598
> URL: https://issues.apache.org/jira/browse/ARROW-17598
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Aldrin Montana
>Assignee: Aldrin Montana
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When compiling Arrow with the following parameter:
> {code:bash}
> -DARROW_BUILD_BENCHMARKS_REFERENCE="ON"
> {code}
> I get this error and other similar ones:
> {code:bash}
> cpp/src/arrow/io/memory_benchmark.cc:244:26: error: use of undeclared 
> identifier 'Read'; did you mean 'read'?
> using ApplyFn = decltype(Read);
> {code}
>  
> The reason is that `memory_benchmark.cc` does not define the Read function 
> (and others) when the SIMD level is NEON (`ARROW_HAVE_NEON` is true).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17598) [C++] Arrow io memory_benchmark does not compile if SIMD level is NEON

2022-09-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17598:
---
Labels: pull-request-available  (was: )

> [C++] Arrow io memory_benchmark does not compile if SIMD level is NEON
> --
>
> Key: ARROW-17598
> URL: https://issues.apache.org/jira/browse/ARROW-17598
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Aldrin Montana
>Assignee: Aldrin Montana
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When compiling Arrow with the following parameter:
> {code:bash}
> -DARROW_BUILD_BENCHMARKS_REFERENCE="ON"
> {code}
> I get this error and other similar ones:
> {code:bash}
> cpp/src/arrow/io/memory_benchmark.cc:244:26: error: use of undeclared 
> identifier 'Read'; did you mean 'read'?
> using ApplyFn = decltype(Read);
> {code}
>  
> The reason is that `memory_benchmark.cc` does not define the Read function 
> (and others) when the SIMD level is NEON (`ARROW_HAVE_NEON` is true).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17598) [C++] Arrow io memory_benchmark does not compile if SIMD level is NEON

2022-09-02 Thread Aldrin Montana (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aldrin Montana reassigned ARROW-17598:
--

Assignee: Aldrin Montana

> [C++] Arrow io memory_benchmark does not compile if SIMD level is NEON
> --
>
> Key: ARROW-17598
> URL: https://issues.apache.org/jira/browse/ARROW-17598
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Aldrin Montana
>Assignee: Aldrin Montana
>Priority: Minor
>
> When compiling Arrow with the following parameter:
> {code:bash}
> -DARROW_BUILD_BENCHMARKS_REFERENCE="ON"
> {code}
> I get this error and other similar ones:
> {code:bash}
> cpp/src/arrow/io/memory_benchmark.cc:244:26: error: use of undeclared 
> identifier 'Read'; did you mean 'read'?
> using ApplyFn = decltype(Read);
> {code}
>  
> The reason is that `memory_benchmark.cc` does not define the Read function 
> (and others) when the SIMD level is NEON (`ARROW_HAVE_NEON` is true).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16481) [Python] Remove the test usage of legacy dataset

2022-09-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16481:
---
Labels: pull-request-available  (was: )

> [Python] Remove the test usage of legacy dataset
> 
>
> Key: ARROW-16481
> URL: https://issues.apache.org/jira/browse/ARROW-16481
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Alenka Frim
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently the parquet tests cover both the new and the old dataset 
> implementations. The usage of the legacy dataset can be removed from the 
> tests, as the legacy ParquetDataset is being deprecated.
> See https://issues.apache.org/jira/browse/ARROW-16119.
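
For instance, a test currently parametrized over both implementations would
keep only the non-legacy path (a sketch; the actual helpers in the pyarrow
test suite may differ):

{code:python}
import pyarrow.parquet as pq

# before: each test ran twice, once per implementation
# pq.ParquetDataset("data/", use_legacy_dataset=True)   # deprecated path
table = pq.ParquetDataset("data/", use_legacy_dataset=False).read()
{code}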



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17519) [R] RTools35 job is failing

2022-09-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17519:
---
Labels: pull-request-available  (was: )

> [R] RTools35 job is failing
> ---
>
> Key: ARROW-17519
> URL: https://issues.apache.org/jira/browse/ARROW-17519
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dewey Dunnington
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After ARROW-17436, the RTools35 job is consistently failing with:
> {noformat}
> Error: Error: package or namespace load failed for 'arrow' in inDL(x, 
> as.logical(local), as.logical(now), ...):
>  unable to load shared object 
> 'D:/a/arrow/arrow/r/check/arrow.Rcheck/00LOCK-arrow/00new/arrow/libs/i386/arrow.dll':
>   LoadLibrary failure:  A dynamic link library (DLL) initialization routine 
> failed.
> {noformat}
> Given that there is a mailing list discussion about dropping support for that 
> platform, should we disable the check? Or wait until that is resolved to 
> disable the check?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17519) [R] RTools35 job is failing

2022-09-02 Thread Dewey Dunnington (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dewey Dunnington reassigned ARROW-17519:


Assignee: Dewey Dunnington

> [R] RTools35 job is failing
> ---
>
> Key: ARROW-17519
> URL: https://issues.apache.org/jira/browse/ARROW-17519
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dewey Dunnington
>Assignee: Dewey Dunnington
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> After ARROW-17436, the RTools35 job is consistently failing with:
> {noformat}
> Error: Error: package or namespace load failed for 'arrow' in inDL(x, 
> as.logical(local), as.logical(now), ...):
>  unable to load shared object 
> 'D:/a/arrow/arrow/r/check/arrow.Rcheck/00LOCK-arrow/00new/arrow/libs/i386/arrow.dll':
>   LoadLibrary failure:  A dynamic link library (DLL) initialization routine 
> failed.
> {noformat}
> Given that there is a mailing list discussion about dropping support for that 
> platform, should we disable the check? Or wait until that is resolved to 
> disable the check?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17603) [C++][FlightRPC] Print build logs if gRPC TlsCredentialsOptions detection fails

2022-09-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17603:
---
Labels: pull-request-available  (was: )

> [C++][FlightRPC] Print build logs if gRPC TlsCredentialsOptions detection 
> fails
> ---
>
> Key: ARROW-17603
> URL: https://issues.apache.org/jira/browse/ARROW-17603
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, FlightRPC
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Make it easier to debug build failures in CI.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17603) [C++][FlightRPC] Print build logs if gRPC TlsCredentialsOptions detection fails

2022-09-02 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li reassigned ARROW-17603:


Assignee: David Li

> [C++][FlightRPC] Print build logs if gRPC TlsCredentialsOptions detection 
> fails
> ---
>
> Key: ARROW-17603
> URL: https://issues.apache.org/jira/browse/ARROW-17603
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, FlightRPC
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>
> Make it easier to debug build failures in CI.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17603) [C++][FlightRPC] Print build logs if gRPC TlsCredentialsOptions detection fails

2022-09-02 Thread David Li (Jira)
David Li created ARROW-17603:


 Summary: [C++][FlightRPC] Print build logs if gRPC 
TlsCredentialsOptions detection fails
 Key: ARROW-17603
 URL: https://issues.apache.org/jira/browse/ARROW-17603
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, FlightRPC
Reporter: David Li


Make it easier to debug build failures in CI.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-11335) [C++] Support FixedSizeList JSON parsing

2022-09-02 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599626#comment-17599626
 ] 

Antoine Pitrou commented on ARROW-11335:


cc [~benpharkins] 
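
For reference, a minimal reproducer sketch (the field name and input are 
hypothetical; per the description this should currently fail with the error 
quoted below):

{code:python}
import io
import pyarrow as pa
from pyarrow import json

# An explicit schema that maps a JSON array to a fixed-size list of 12 floats.
schema = pa.schema([("x", pa.list_(pa.float64(), 12))])
data = io.BytesIO(b'{"x": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}\n')

# Expected per this issue: ArrowNotImplementedError, since the reader cannot
# convert JSON arrays to fixed_size_list yet.
json.read_json(data, parse_options=json.ParseOptions(explicit_schema=schema))
{code}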

 

> [C++] Support FixedSizeList JSON parsing
> 
>
> Key: ARROW-11335
> URL: https://issues.apache.org/jira/browse/ARROW-11335
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Johan Peltenburg
>Priority: Major
>  Labels: good-first-issue
>
> Currently, the C++ JSON table reader does not support the Arrow FixedSizeList 
> type.
> When the Arrow schema is not inferred but supplied to the parser, it should 
> attempt to turn JSON arrays into an Arrow FixedSizeList if the schema 
> dictates this for specific fields.
> This currently results in the JSON table reader returning an error, e.g.:
> {code:java}
> JSON conversion to fixed_size_list[12] is not 
> supported{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-4709) [C++] Optimize for ordered JSON fields

2022-09-02 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599627#comment-17599627
 ] 

Antoine Pitrou commented on ARROW-4709:
---

cc [~benpharkins] if you'd like to tackle a performance issue at some point.
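
For context, the optimization amounts to remembering where the last field 
matched and trying the next position before falling back to the hash table. A 
rough Python sketch of the lookup strategy (illustration only, not the C++ 
parser code):

{code:python}
class OrderedFieldLookup:
    """Field-index lookup that exploits consistently ordered fields."""

    def __init__(self, field_names):
        self.names = list(field_names)
        self.index = {name: i for i, name in enumerate(self.names)}
        self.last = -1  # index of the previously matched field

    def lookup(self, name):
        guess = self.last + 1
        if guess < len(self.names) and self.names[guess] == name:
            # Fast path: fields arrive in schema order, no hashing needed.
            self.last = guess
        else:
            # Slow path: out-of-order field, fall back to the hash table.
            self.last = self.index[name]
        return self.last

lookup = OrderedFieldLookup(["a", "b", "c"])
assert [lookup.lookup(n) for n in ["a", "b", "c"]] == [0, 1, 2]
{code}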

> [C++] Optimize for ordered JSON fields
> --
>
> Key: ARROW-4709
> URL: https://issues.apache.org/jira/browse/ARROW-4709
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Minor
>  Labels: good-second-issue
>
> Fields appear consistently ordered in most JSON data in the wild, but the 
> JSON parser currently looks fields up in a hash table. The ordering can 
> probably be exploited to yield better performance when looking up field 
> indices.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-4709) [C++] Optimize for ordered JSON fields

2022-09-02 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-4709:
--
Labels: good-second-issue  (was: )

> [C++] Optimize for ordered JSON fields
> --
>
> Key: ARROW-4709
> URL: https://issues.apache.org/jira/browse/ARROW-4709
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Minor
>  Labels: good-second-issue
>
> Fields appear consistently ordered in most JSON data in the wild, but the 
> JSON parser currently looks fields up in a hash table. The ordering can 
> probably be exploited to yield better performance when looking up field 
> indices.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-17602) [R] RTools35/R 3.6 CI job is failing

2022-09-02 Thread Dewey Dunnington (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dewey Dunnington closed ARROW-17602.

Resolution: Duplicate

> [R] RTools35/R 3.6 CI job is failing
> 
>
> Key: ARROW-17602
> URL: https://issues.apache.org/jira/browse/ARROW-17602
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Dewey Dunnington
>Priority: Major
>
> After ARROW-17436, the Windows/R 3.6 job is consistently failing. We're about 
> to drop support for R 3.6 with the move to C++17, so instead of debugging 
> this failure I think we should just disable this check.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-11335) [C++] Support FixedSizeList JSON parsing

2022-09-02 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-11335:
---
Priority: Major  (was: Minor)

> [C++] Support FixedSizeList JSON parsing
> 
>
> Key: ARROW-11335
> URL: https://issues.apache.org/jira/browse/ARROW-11335
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Johan Peltenburg
>Priority: Major
>  Labels: good-first-issue
>
> Currently, the C++ JSON table reader does not support the Arrow FixedSizeList 
> type.
> When the Arrow schema is not inferred but supplied to the parser, it should 
> attempt to turn JSON arrays into an Arrow FixedSizeList if the schema 
> dictates this for specific fields.
> This currently results in the JSON table reader returning an error, e.g.:
> {code:java}
> JSON conversion to fixed_size_list[12] is not 
> supported{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17602) [R] RTools35/R 3.6 CI job is failing

2022-09-02 Thread Dewey Dunnington (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599625#comment-17599625
 ] 

Dewey Dunnington commented on ARROW-17602:
--

(Facepalm...I forgot I created that!)

> [R] RTools35/R 3.6 CI job is failing
> 
>
> Key: ARROW-17602
> URL: https://issues.apache.org/jira/browse/ARROW-17602
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Dewey Dunnington
>Priority: Major
>
> After ARROW-17436, the Windows/R 3.6 job is consistently failing. We're about 
> to drop support for R 3.6 with the move to C++17, so instead of debugging 
> this failure I think we should just disable this check.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16226) [C++] Add better coverage for filesystem tell.

2022-09-02 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-16226:
---
Labels: good-first-issue  (was: )

> [C++] Add better coverage for filesystem tell.
> --
>
> Key: ARROW-16226
> URL: https://issues.apache.org/jira/browse/ARROW-16226
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
>  Labels: good-first-issue
>
> Add a C++ generic file system test that writes N bytes to a file, then 
> seeks to N/2 and reads the remainder.  Verify the remainder is N/2 bytes 
> long and matches the bytes written.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-6772) [C++] Add operator== for interfaces with an Equals() method

2022-09-02 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599618#comment-17599618
 ] 

Antoine Pitrou commented on ARROW-6772:
---

I think it would be nice to add at least _basic_ tests to ensure that the 
operators properly redirect. No need to duplicate all tests though.

> [C++] Add operator== for interfaces with an Equals() method
> ---
>
> Key: ARROW-6772
> URL: https://issues.apache.org/jira/browse/ARROW-6772
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Harkins
>Priority: Major
>  Labels: good-first-issue
>
> A common pattern in tests is {{ASSERT_TRUE(schm->Equals(*other))}}. The 
> addition of overloaded equality operators will allow this to be written 
> {{ASSERT_EQ(*schm, *other)}}, which is more idiomatic GTEST usage and will 
> allow more informative assertion failure messages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-15479) [C++] Cast fixed size list to compatible fixed size list type (other values type, other field name)

2022-09-02 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-15479:
---
Labels: good-second-issue kernel  (was: kernel)

> [C++] Cast fixed size list to compatible fixed size list type (other values 
> type, other field name)
> ---
>
> Key: ARROW-15479
> URL: https://issues.apache.org/jira/browse/ARROW-15479
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: good-second-issue, kernel
>
> Casting a FixedSizeListArray to a compatible type but only a different field 
> name isn't implemented:
> {code:python}
> >>> my_type = pa.list_(pa.field("element", pa.int64()), 2)
> >>> arr = pa.FixedSizeListArray.from_arrays(pa.array([1, 2, 3, 4, 5, 6]), 2)
> >>> arr.type
> FixedSizeListType(fixed_size_list<item: int64>[2])
> >>> my_type
> FixedSizeListType(fixed_size_list<element: int64>[2])
> >>> arr.cast(my_type)
> ...
> ArrowNotImplementedError: Unsupported cast from fixed_size_list<item: 
> int64>[2] to fixed_size_list<element: int64>[2] using function 
> cast_fixed_size_list
> {code}
> While the similar operation with a variable sized list actually works:
> {code:python}
> >>> my_type = pa.list_(pa.field("element", pa.int64()))
> >>> arr = pa.array([[1, 2], [3, 4]], pa.list_(pa.int64()))
> >>> arr.type
> ListType(list<item: int64>)
> >>> arr.cast(my_type).type
> ListType(list<element: int64>)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-6772) [C++] Add operator== for interfaces with an Equals() method

2022-09-02 Thread Ben Harkins (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599617#comment-17599617
 ] 

Ben Harkins commented on ARROW-6772:


I'm currently working on this one - planning on adding 
{{util::EqualityComparable}} to {{DataType}}, {{Field}}, and {{FieldRef}}. 
Should additional comparison tests be added (in addition to AssertXXXEqual) 
to type_test.cc or would that be considered redundant?

> [C++] Add operator== for interfaces with an Equals() method
> ---
>
> Key: ARROW-6772
> URL: https://issues.apache.org/jira/browse/ARROW-6772
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Harkins
>Priority: Major
>  Labels: good-first-issue
>
> A common pattern in tests is {{ASSERT_TRUE(schm->Equals(*other))}}. The 
> addition of overloaded equality operators will allow this to be written 
> {{ASSERT_EQ(*schm, *other)}}, which is more idiomatic GTEST usage and will 
> allow more informative assertion failure messages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-16326) [C++][Python] Add GCS Timeout parameter for GCS FileSystem.

2022-09-02 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield reassigned ARROW-16326:
---

Assignee: Micah Kornfield

> [C++][Python] Add GCS Timeout parameter for GCS FileSystem.
> ---
>
> Key: ARROW-16326
> URL: https://issues.apache.org/jira/browse/ARROW-16326
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: good-first-issue, good-second-issue
>
> Follow-up from [https://github.com/apache/arrow/pull/12763]: if the gcs 
> testbench isn't installed properly, the failure mode is test timeouts 
> because the connection hangs.  We should add a timeout parameter to prevent 
> this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16326) [C++][Python] Add GCS Timeout parameter for GCS FileSystem.

2022-09-02 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599613#comment-17599613
 ] 

Micah Kornfield commented on ARROW-16326:
-

This was actually done in the PR.  "retry_limit_seconds" can be passed through 
to the URI and if no connection is established within that time it will fail.
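
For example (a minimal sketch; the bucket name is hypothetical):

{code:python}
from pyarrow import fs

# retry_limit_seconds bounds how long the client keeps retrying before
# failing, instead of hanging when no connection can be established.
gcs, path = fs.FileSystem.from_uri("gs://my-bucket?retry_limit_seconds=15")
print(gcs.get_file_info(path))
{code}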

> [C++][Python] Add GCS Timeout parameter for GCS FileSystem.
> ---
>
> Key: ARROW-16326
> URL: https://issues.apache.org/jira/browse/ARROW-16326
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Micah Kornfield
>Priority: Major
>  Labels: good-first-issue, good-second-issue
>
> Follow-up from [https://github.com/apache/arrow/pull/12763]: if the gcs 
> testbench isn't installed properly, the failure mode is test timeouts 
> because the connection hangs.  We should add a timeout parameter to prevent 
> this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-16326) [C++][Python] Add GCS Timeout parameter for GCS FileSystem.

2022-09-02 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-16326.
-
Resolution: Fixed

> [C++][Python] Add GCS Timeout parameter for GCS FileSystem.
> ---
>
> Key: ARROW-16326
> URL: https://issues.apache.org/jira/browse/ARROW-16326
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: good-first-issue, good-second-issue
>
> Follow-up from [https://github.com/apache/arrow/pull/12763] if gcs testbench 
> isn't installed properly the failure mode is tests timeouts because the 
> connection hangs.  We should add a timeout parameter to prevent this



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17033) [C++] Add GCS connection pool size option

2022-09-02 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599612#comment-17599612
 ] 

Antoine Pitrou commented on ARROW-17033:


cc [~benpharkins] 

> [C++] Add GCS connection pool size option
> -
>
> Key: ARROW-17033
> URL: https://issues.apache.org/jira/browse/ARROW-17033
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 8.0.0
>Reporter: Leonhard Gruenschloss
>Priority: Minor
>  Labels: GCP, good-first-issue, performance
>
> Multi-threaded read performance in Arrow's GCS file system implementation 
> currently is relatively low. Given the high latency of cloud blob systems 
> like GCS, a common strategy is to use many concurrent readers (if the system 
> has enough memory to support that), e.g. using 100 threads.
> The GCS client library offers a [{{ConnectionPoolSize}} 
> option|https://googleapis.dev/cpp/google-cloud-storage/latest/structgoogle_1_1cloud_1_1storage_1_1v1_1_1ConnectionPoolSizeOption.html].
>  If this option is set to a value that's too low, concurrency is throttled. 
> At the moment, this is not exposed in 
> [{{GcsOptions}}|https://github.com/apache/arrow/blob/73cdd6a59b52781cc43e097ccd63ac36f705ee2e/cpp/src/arrow/filesystem/gcsfs.h#L59],
>  consequently limiting multi-threaded throughput.
> Instead of exposing this option, an alternative implementation strategy could 
> be to use the same value as set by {{arrow::io::SetIOThreadPoolCapacity}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-8201) [Python][Dataset] Improve ergonomics of FileFragment

2022-09-02 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599610#comment-17599610
 ] 

Antoine Pitrou edited comment on ARROW-8201 at 9/2/22 4:41 PM:
---

[~milesgranger]  Perhaps you would be interested in finding out whether this 
issue still applies, and if so, to come up with a PR?


was (Author: pitrou):
[~milesgranger]  Perhaps you would be interested whether this issue still 
applies, and if so, to come up with a PR?

> [Python][Dataset] Improve ergonomics of FileFragment
> 
>
> Key: ARROW-8201
> URL: https://issues.apache.org/jira/browse/ARROW-8201
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Priority: Major
>  Labels: dataset
>
> FileFragment can be made more directly useful by adding convenience methods.
> For example, a FileFragment could allow underlying file/buffer to be opened 
> directly:
> {code}
> def open(self):
> """
> Open a NativeFile of the buffer or file viewed by this fragment.
> """
> cdef:
> CFileSystem* c_filesystem
> shared_ptr[CRandomAccessFile] opened
> NativeFile out = NativeFile()
> buf = self.buffer
> if buf is not None:
> return pa.io.BufferReader(buf)
> with nogil:
> c_filesystem = self.file_fragment.source().filesystem()
> opened = GetResultValue(c_filesystem.OpenInputFile(
> self.file_fragment.source().path()))
> out.set_random_access_file(opened)
> out.is_readable = True
> return out
> {code}
> Additionally, a ParquetFileFragment's metadata could be introspectable:
> {code}
> @property
> def metadata(self):
> from pyarrow._parquet import ParquetReader
> reader = ParquetReader()
> reader.open(self.open())
> return reader.metadata
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-8201) [Python][Dataset] Improve ergonomics of FileFragment

2022-09-02 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599610#comment-17599610
 ] 

Antoine Pitrou commented on ARROW-8201:
---

[~milesgranger]  Perhaps you would be interested whether this issue still 
applies, and if so, to come up with a PR?
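
For reference, a sketch of how the proposed conveniences might be used from 
pyarrow.dataset (hypothetical API: {{fragment.open()}} as proposed in the 
description does not exist yet):

{code:python}
import pyarrow.dataset as ds

dataset = ds.dataset("data/", format="parquet")
for fragment in dataset.get_fragments():
    # Proposed: open a NativeFile over the fragment's file or buffer.
    with fragment.open() as f:
        print(f.size())
    # Proposed: introspect the fragment's parquet metadata directly.
    print(fragment.metadata)
{code}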

> [Python][Dataset] Improve ergonomics of FileFragment
> 
>
> Key: ARROW-8201
> URL: https://issues.apache.org/jira/browse/ARROW-8201
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Priority: Major
>  Labels: dataset
>
> FileFragment can be made more directly useful by adding convenience methods.
> For example, a FileFragment could allow underlying file/buffer to be opened 
> directly:
> {code}
> def open(self):
> """
> Open a NativeFile of the buffer or file viewed by this fragment.
> """
> cdef:
> CFileSystem* c_filesystem
> shared_ptr[CRandomAccessFile] opened
> NativeFile out = NativeFile()
> buf = self.buffer
> if buf is not None:
> return pa.io.BufferReader(buf)
> with nogil:
> c_filesystem = self.file_fragment.source().filesystem()
> opened = GetResultValue(c_filesystem.OpenInputFile(
> self.file_fragment.source().path()))
> out.set_random_access_file(opened)
> out.is_readable = True
> return out
> {code}
> Additionally, a ParquetFileFragment's metadata could be introspectable:
> {code}
> @property
> def metadata(self):
> from pyarrow._parquet import ParquetReader
> reader = ParquetReader()
> reader.open(self.open())
> return reader.metadata
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17602) [R] RTools35/R 3.6 CI job is failing

2022-09-02 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599609#comment-17599609
 ] 

Neal Richardson commented on ARROW-17602:
-

Duplicate of ARROW-17519?

> [R] RTools35/R 3.6 CI job is failing
> 
>
> Key: ARROW-17602
> URL: https://issues.apache.org/jira/browse/ARROW-17602
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Dewey Dunnington
>Priority: Major
>
> After ARROW-17436, the Windows/R 3.6 job is consistently failing. We're about 
> to drop support for R 3.6 with the move to C++17, so instead of debugging 
> this failure I think we should just disable this check.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16753) [C++] LocalFileSystem cannot list Linux directory recursively when permission to subdirectory contents are denied

2022-09-02 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-16753:
---
Labels: good-second-issue pull-request-available  (was: 
pull-request-available)

> [C++] LocalFileSystem cannot list Linux directory recursively when permission 
> to subdirectory contents are denied
> -
>
> Key: ARROW-16753
> URL: https://issues.apache.org/jira/browse/ARROW-16753
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 6.0.1
> Environment: Ubuntu 20.04 LTS
>Reporter: David Rauschenbach
>Assignee: David Rauschenbach
>Priority: Major
>  Labels: good-second-issue, pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The following code to list my root directory fails:
>  
> {code:java}
> FileSelector file_selector;
> file_selector.base_dir = "/";
> file_selector.allow_not_found = true;
> file_selector.recursive = true;
> auto result = fs.GetFileInfo(file_selector);{code}
> The result.ok() value returns {+}false{+}, and then result.status().message() 
> returns {+}Cannot list directory '/var/run/wpa_supplicant'{+}. 
> An examination of the /run directory (which /var/run symlinks to) shows:
>  
> {code:java}
> $ ls -al /run
> drwxr-xr-x 35 root              root  1040 Jun  6 06:11 .
> drwxr-xr-x 20 root              root  4096 May 20 12:42 ..
> ...
> drwxr-x---  2 root              root    60 Jun  4 12:14 wpa_supplicant{code}
> And then attempting to list this directory reveals:
>  
> {code:java}
> $ ls -al /run/wpa_supplicant/
> ls: cannot open directory '/run/wpa_supplicant/': Permission denied{code}
>  
> As a user of LocalFileSystem, I should be able to list all of the files that 
> I have access to.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16746) [C++][Python] S3 tag support on write

2022-09-02 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-16746:
---
Labels: good-second-issue  (was: )

> [C++][Python] S3 tag support on write
> -
>
> Key: ARROW-16746
> URL: https://issues.apache.org/jira/browse/ARROW-16746
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: André Kelpe
>Priority: Major
>  Labels: good-second-issue
>
> S3 allows tagging objects to better organize one's data 
> ([https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-tagging.html]). 
> We use this for efficient downstream processes/inventory management.
> Currently arrow/pyarrow does not allow tags to be added on write. This is 
> causing us to scan the bucket and re-apply the tags after a pyarrow-based 
> process has run.
> I looked through the code and think that it could potentially be done via the 
> metadata mechanism.
> The tags need to be added to the CreateMultipartUploadRequest here: 
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3fs.cc#L1156
> See also
> http://sdk.amazonaws.com/cpp/api/LATEST/class_aws_1_1_s3_1_1_model_1_1_create_multipart_upload_request.html#af791f34a65dc69bd681d6995313be2da



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-16326) [C++][Python] Add GCS Timeout parameter for GCS FileSystem.

2022-09-02 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-16326:
--

Assignee: (was: Micah Kornfield)

> [C++][Python] Add GCS Timeout parameter for GCS FileSystem.
> ---
>
> Key: ARROW-16326
> URL: https://issues.apache.org/jira/browse/ARROW-16326
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Micah Kornfield
>Priority: Major
>  Labels: good-first-issue, good-second-issue
>
> Follow-up from [https://github.com/apache/arrow/pull/12763]: if the gcs 
> testbench isn't installed properly, the failure mode is test timeouts 
> because the connection hangs.  We should add a timeout parameter to prevent 
> this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16326) [C++][Python] Add GCS Timeout parameter for GCS FileSystem.

2022-09-02 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599607#comment-17599607
 ] 

Antoine Pitrou commented on ARROW-16326:


cc [~benpharkins] 

> [C++][Python] Add GCS Timeout parameter for GCS FileSystem.
> ---
>
> Key: ARROW-16326
> URL: https://issues.apache.org/jira/browse/ARROW-16326
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>
> Follow-up from [https://github.com/apache/arrow/pull/12763]: if the gcs 
> testbench isn't installed properly, the failure mode is test timeouts 
> because the connection hangs.  We should add a timeout parameter to prevent 
> this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16326) [C++][Python] Add GCS Timeout parameter for GCS FileSystem.

2022-09-02 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-16326:
---
Labels: good-first-issue good-second-issue  (was: )

> [C++][Python] Add GCS Timeout parameter for GCS FileSystem.
> ---
>
> Key: ARROW-16326
> URL: https://issues.apache.org/jira/browse/ARROW-16326
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: good-first-issue, good-second-issue
>
> Follow-up from [https://github.com/apache/arrow/pull/12763]: if the gcs 
> testbench isn't installed properly, the failure mode is test timeouts 
> because the connection hangs.  We should add a timeout parameter to prevent 
> this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16226) [C++] Add better coverage for filesystem tell.

2022-09-02 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599605#comment-17599605
 ] 

Antoine Pitrou commented on ARROW-16226:


cc [~benpharkins] 
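
The requested check, sketched in Python against an in-memory file (the real 
test would live in the generic C++ filesystem test suite):

{code:python}
import pyarrow as pa

N = 1024
data = bytes(range(256)) * 4  # N bytes of known content

sink = pa.BufferOutputStream()
sink.write(data)

f = pa.BufferReader(sink.getvalue())
f.seek(N // 2)
assert f.tell() == N // 2          # tell() reports the seek position
remainder = f.read()
assert len(remainder) == N // 2
assert remainder == data[N // 2:]  # remainder matches the bytes written
{code}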

> [C++] Add better coverage for filesystem tell.
> --
>
> Key: ARROW-16226
> URL: https://issues.apache.org/jira/browse/ARROW-16226
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
>
> Add a C++ generic file system test that writes N bytes to a file, then 
> seeks to N/2 and reads the remainder.  Verify the remainder is N/2 bytes 
> long and matches the bytes written.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17601) [C++] Error when creating Expression on Decimal128 types: precision out of range

2022-09-02 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599604#comment-17599604
 ] 

Neal Richardson commented on ARROW-17601:
-

Here's a reproducer just using Expressions, no dplyr:

{code}
expr <- Expression$create("multiply_checked", Expression$field_ref("col1"), 
Expression$field_ref("col2"))
expr

# Expression
# multiply_checked(col1, col2)

expr$type(schema(col1=decimal128(33, 4), col2=decimal128(15, 2)))

# Error: Invalid: Decimal precision out of range [1, 38]: 49
{code}

Note that the different kernels have different promotion logic. Subtraction, 
for example, doesn't go out of range with these types.

{code}
expr <- Expression$create("subtract_checked", Expression$field_ref("col1"), 
Expression$field_ref("col2"))
expr$type(schema(col1=decimal128(33, 4), col2=decimal128(15, 2)))

# Decimal128Type
# decimal128(34, 4)
{code}
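
The observed precisions match the usual SQL-style decimal promotion rules 
(stated here as an assumption, inferred from the output types above):

{noformat}
multiply: precision = p1 + p2 + 1                 -> 33 + 15 + 1 = 49 (> 38)
          scale     = s1 + s2                     ->  4 +  2     =  6
add/sub:  precision = max(p1 - s1, p2 - s2) + max(s1, s2) + 1
                                                  -> max(29, 13) + 4 + 1 = 34
          scale     = max(s1, s2)                 ->  4
{noformat}

so checked multiplication of decimal128(33, 4) and decimal128(15, 2) requires 
precision 49, which no Decimal128 can hold.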

> [C++] Error when creating Expression on Decimal128 types: precision out of 
> range
> 
>
> Key: ARROW-17601
> URL: https://issues.apache.org/jira/browse/ARROW-17601
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
>
> Reproducer in R:
> {code}
> library(arrow)
> library(dplyr)
> tab <- Table$create(col1 = 1:4, col2 = 5:8)
> tab <- tab$cast(schema(col1 = decimal128(33, 4), col2 = decimal128(15, 2)))
> tab %>% mutate(col1 * col2)
> # Error: Invalid: Decimal precision out of range [1, 38]: 49
> # /Users/me/arrow/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc:1078  
> DecimalType::Make(left_type.id(), precision, scale)
> # /Users/me/arrow/cpp/src/arrow/compute/exec/expression.cc:413  
> call.kernel->signature->out_type().Resolve(_context, types)
> {code}
> With integers and floats, we upcast to a wider size in some compute functions 
> like multiplication. Should this go up to Decimal256? Or is there a better 
> way to determine the size required?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17602) [R] RTools35/R 3.6 CI job is failing

2022-09-02 Thread Dewey Dunnington (Jira)
Dewey Dunnington created ARROW-17602:


 Summary: [R] RTools35/R 3.6 CI job is failing
 Key: ARROW-17602
 URL: https://issues.apache.org/jira/browse/ARROW-17602
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Dewey Dunnington


After ARROW-17436, the Windows/R 3.6 job is consistently failing. We're about 
to drop support for R 3.6 with the move to C++17, so instead of debugging this 
failure I think we should just disable this check.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17601) [C++] Error when creating Expression on Decimal128 types: precision out of range

2022-09-02 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-17601:
---

 Summary: [C++] Error when creating Expression on Decimal128 types: 
precision out of range
 Key: ARROW-17601
 URL: https://issues.apache.org/jira/browse/ARROW-17601
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Neal Richardson


Reproducer in R:

{code}
library(arrow)
library(dplyr)

tab <- Table$create(col1 = 1:4, col2 = 5:8)
tab <- tab$cast(schema(col1 = decimal128(33, 4), col2 = decimal128(15, 2)))
tab %>% mutate(col1 * col2)

# Error: Invalid: Decimal precision out of range [1, 38]: 49
# /Users/me/arrow/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc:1078  
DecimalType::Make(left_type.id(), precision, scale)
# /Users/me/arrow/cpp/src/arrow/compute/exec/expression.cc:413  
call.kernel->signature->out_type().Resolve(_context, types)
{code}

With integers and floats, we upcast to a wider size in some compute functions 
like multiplication. Should this go up to Decimal256? Or is there a better way 
to determine the size required?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17600) [Go] Implement Casting for Complex Types (List/Struct/etc.)

2022-09-02 Thread Matthew Topol (Jira)
Matthew Topol created ARROW-17600:
-

 Summary: [Go] Implement Casting for Complex Types 
(List/Struct/etc.)
 Key: ARROW-17600
 URL: https://issues.apache.org/jira/browse/ARROW-17600
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Go
Reporter: Matthew Topol
Assignee: Matthew Topol






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17588) [Go] Casting to BinaryLike types

2022-09-02 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol resolved ARROW-17588.
---
Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 14027
[https://github.com/apache/arrow/pull/14027]

> [Go] Casting to BinaryLike types
> 
>
> Key: ARROW-17588
> URL: https://issues.apache.org/jira/browse/ARROW-17588
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-6772) [C++] Add operator== for interfaces with an Equals() method

2022-09-02 Thread Ben Harkins (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Harkins reassigned ARROW-6772:
--

Assignee: Ben Harkins

> [C++] Add operator== for interfaces with an Equals() method
> ---
>
> Key: ARROW-6772
> URL: https://issues.apache.org/jira/browse/ARROW-6772
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Harkins
>Priority: Major
>  Labels: good-first-issue
>
> A common pattern in tests is {{ASSERT_TRUE(schm->Equals(*other))}}. The 
> addition of overloaded equality operators will allow this to be written 
> {{ASSERT_EQ(*schm, *other)}}, which is more idiomatic GTEST usage and will 
> allow more informative assertion failure messages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17595) [C++] Installation Error stdlib.h no such file or directory

2022-09-02 Thread Robert Tidwell (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599511#comment-17599511
 ] 

Robert Tidwell commented on ARROW-17595:


dnf install gcc-toolset-11

Here are the packages that I have installed:
gcc-toolset-11.x86_64                              11.1-1.el8          @acx-appstream
gcc-toolset-11-annobin-docs.noarch                 10.23-1.el8         @acx-appstream
gcc-toolset-11-annobin-plugin-gcc.x86_64           10.23-1.el8         @acx-appstream
gcc-toolset-11-binutils.x86_64                     2.36.1-2.0.1.el8    @acx-appstream
gcc-toolset-11-dwz.x86_64                          0.14-2.el8          @acx-appstream
gcc-toolset-11-dyninst.x86_64                      11.0.0-2.el8        @acx-appstream
gcc-toolset-11-elfutils.x86_64                     0.185-5.el8         @acx-appstream
gcc-toolset-11-elfutils-debuginfod-client.x86_64   0.185-5.el8         @acx-appstream
gcc-toolset-11-elfutils-libelf.x86_64              0.185-5.el8         @acx-appstream
gcc-toolset-11-elfutils-libs.x86_64                0.185-5.el8         @acx-appstream
gcc-toolset-11-gcc.x86_64                          11.2.1-9.1.0.2.el8  @acx-appstream
gcc-toolset-11-gcc-c++.x86_64                      11.2.1-9.1.0.2.el8  @acx-appstream
gcc-toolset-11-gcc-gdb-plugin.x86_64               11.2.1-9.1.0.2.el8  @acx-appstream
gcc-toolset-11-gcc-gfortran.x86_64                 11.2.1-9.1.0.2.el8  @acx-appstream
gcc-toolset-11-gcc-plugin-devel.x86_64             11.2.1-9.1.0.2.el8  @acx-appstream
gcc-toolset-11-gdb.x86_64                          10.2-5.0.1.el8      @acx-appstream
gcc-toolset-11-libquadmath-devel.x86_64            11.2.1-9.1.0.2.el8  @acx-appstream
gcc-toolset-11-libstdc++-devel.x86_64              11.2.1-9.1.0.2.el8  @acx-appstream
gcc-toolset-11-ltrace.x86_64                       0.7.91-1.el8        @acx-appstream
gcc-toolset-11-make.x86_64                         1:4.3-2.el8         @acx-appstream
gcc-toolset-11-make-devel.x86_64                   1:4.3-2.el8         @acx-appstream
gcc-toolset-11-perftools.x86_64                    11.1-1.el8          @acx-appstream
gcc-toolset-11-runtime.x86_64                      11.0-1.el8          @acx-appstream
gcc-toolset-11-strace.x86_64                       5.13-7.el8          @acx-appstream
gcc-toolset-11-systemtap.x86_64                    4.5-4.el8           @acx-appstream
gcc-toolset-11-systemtap-client.x86_64             4.5-4.el8           @acx-appstream
gcc-toolset-11-systemtap-devel.x86_64              4.5-4.el8           @acx-appstream
gcc-toolset-11-systemtap-runtime.x86_64            4.5-4.el8           @acx-appstream
gcc-toolset-11-toolchain.x86_64                    11.1-1.el8          @acx-appstream
gcc-toolset-11-valgrind.x86_64                     1:3.17.0-6.el8      @acx-appstream

> [C++] Installation Error stdlib.h no such file or directory
> ---
>
> Key: ARROW-17595
> URL: https://issues.apache.org/jira/browse/ARROW-17595
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 9.0.0
> Environment: OEL 8.5
>Reporter: Robert Tidwell
>Priority: Major
>
> Getting error installing arrow.  
> {noformat}
> /opt/rh/gcc-toolset-11/root/usr/include/c++/11/cstdlib:75:15: fatal error: 
> stdlib.h: No such file or directory
>    75 | #include_next <stdlib.h>
> {noformat}
> Using the c++11 toolset.  Tried with default c++ on the system and was 
> getting the same type error.  OS is OEL 8.5.
> Here is the detailed output:
> {noformat}
> Installing package into ‘/usr/lib64/R/library’
> (as ‘lib’ is unspecified)
> trying URL 
> 'http://**/prod/cran/src/contrib/arrow_9.0.0.tar.gz'
> Content type 'application/octet-stream' length 4900968 bytes (4.7 MB)
> ==
> downloaded 4.7 MB
> * installing *source* package ‘arrow’ ...
> ** package ‘arrow’ successfully unpacked and 

[jira] [Updated] (ARROW-17595) [C++] Installation Error stdlib.h no such file or directory

2022-09-02 Thread Robert Tidwell (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Tidwell updated ARROW-17595:
---
Description: 
Getting error installing arrow.  

{noformat}
/opt/rh/gcc-toolset-11/root/usr/include/c++/11/cstdlib:75:15: fatal error: 
stdlib.h: No such file or directory
   75 | #include_next <stdlib.h>
{noformat}

Using the c++11 toolset.  Tried with default c++ on the system and was 
getting the same type error.  OS is OEL 8.5.

Here is the detailed output:

{noformat}
Installing package into ‘/usr/lib64/R/library’
(as ‘lib’ is unspecified)
trying URL 
'http://**/prod/cran/src/contrib/arrow_9.0.0.tar.gz'
Content type 'application/octet-stream' length 4900968 bytes (4.7 MB)
==
downloaded 4.7 MB

* installing *source* package ‘arrow’ ...
** package ‘arrow’ successfully unpacked and MD5 sums checked
** using staged installation
*** Found local C++ source: 'tools/cpp'
*** Building libarrow from source
For build options and troubleshooting, see the install vignette:
https://cran.r-project.org/web/packages/arrow/vignettes/install.html
*** Building with MAKEFLAGS= -j2 
 cmake: /bin/cmake
 arrow with SOURCE_DIR='tools/cpp' BUILD_DIR='/tmp/rmt' 
DEST_DIR='libarrow/arrow-9.0.0' CMAKE='/bin/cmake' EXTRA_CMAKE_FLAGS='' CC='gcc 
-m64' CXX='g++ -m64 -std=gnu++11' LDFLAGS='-Wl,-z,relro -Wl,-z,now 
-specs=/usr/lib/rpm/redhat/redhat-hardened-ld' ARROW_S3='OFF' ARROW_GCS='OFF' 
ARROW_MIMALLOC='OFF' 
++ pwd
+ : /tmp/Rtmp6bSuH3/R.INSTALL40fe61c13650d/arrow
+ : tools/cpp
+ : /tmp/rmt
+ : libarrow/arrow-9.0.0
+ : /bin/cmake
++ cd tools/cpp
++ pwd
+ SOURCE_DIR=/tmp/Rtmp6bSuH3/R.INSTALL40fe61c13650d/arrow/tools/cpp
++ mkdir -p libarrow/arrow-9.0.0
++ cd libarrow/arrow-9.0.0
++ pwd
+ DEST_DIR=/tmp/Rtmp6bSuH3/R.INSTALL40fe61c13650d/arrow/libarrow/arrow-9.0.0
+ '[' '' '!=' '' ']'
+ '[' '' = false ']'
+ ARROW_DEFAULT_PARAM=OFF
+ mkdir -p /tmp/rmt
+ pushd /tmp/rmt
/tmp/rmt /tmp/Rtmp6bSuH3/R.INSTALL40fe61c13650d/arrow
+ /bin/cmake -DARROW_BOOST_USE_SHARED=OFF -DARROW_BUILD_TESTS=OFF 
-DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON -DARROW_COMPUTE=ON 
-DARROW_CSV=ON -DARROW_DATASET=ON -DARROW_DEPENDENCY_SOURCE=AUTO 
-DAWSSDK_SOURCE= -DARROW_FILESYSTEM=ON -DARROW_GCS=OFF -DARROW_JEMALLOC=OFF 
-DARROW_MIMALLOC=OFF -DARROW_JSON=ON -DARROW_PARQUET=ON -DARROW_S3=OFF 
-DARROW_WITH_BROTLI=OFF -DARROW_WITH_BZ2=OFF -DARROW_WITH_LZ4=ON 
-DARROW_WITH_RE2=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_UTF8PROC=ON 
-DARROW_WITH_ZLIB=OFF -DARROW_WITH_ZSTD=OFF 
-DARROW_VERBOSE_THIRDPARTY_BUILD=OFF -DCMAKE_BUILD_TYPE=Release 
-DCMAKE_INSTALL_LIBDIR=lib 
-DCMAKE_INSTALL_PREFIX=/tmp/Rtmp6bSuH3/R.INSTALL40fe61c13650d/arrow/libarrow/arrow-9.0.0
 -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON 
-DCMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY=ON -DCMAKE_UNITY_BUILD=OFF 
-Dxsimd_SOURCE= -G 'Unix Makefiles' 
/tmp/Rtmp6bSuH3/R.INSTALL40fe61c13650d/arrow/tools/cpp
-- Building using CMake version: 3.20.2
-- The C compiler identification is GNU 11.2.1
-- The CXX compiler identification is GNU 11.2.1
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - failed
-- Check for working C compiler: /opt/rh/gcc-toolset-11/root/usr/bin/gcc
-- Check for working C compiler: /opt/rh/gcc-toolset-11/root/usr/bin/gcc - works
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - failed
-- Check for working CXX compiler: /opt/rh/gcc-toolset-11/root/usr/bin/g++
-- Check for working CXX compiler: /opt/rh/gcc-toolset-11/root/usr/bin/g++ - 
works
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Arrow version: 9.0.0 (full: '9.0.0')
-- Arrow SO version: 900 (full: 900.0.0)
-- clang-tidy 12 not found
-- clang-format 12 not found
-- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) 
-- infer not found
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
-- Found Python3: /bin/python3.6 (found version "3.6.8") found components: 
Interpreter 
-- Found cpplint executable at 
/tmp/Rtmp6bSuH3/R.INSTALL40fe61c13650d/arrow/tools/cpp/build-support/cpplint.py
-- System processor: x86_64
-- Performing Test CXX_SUPPORTS_SSE4_2
-- Performing Test CXX_SUPPORTS_SSE4_2 - Success
-- Performing Test CXX_SUPPORTS_AVX2
-- Performing Test CXX_SUPPORTS_AVX2 - Success
-- Performing Test CXX_SUPPORTS_AVX512
-- Performing Test CXX_SUPPORTS_AVX512 - Success
-- Arrow build warning level: PRODUCTION
-- Using ld linker
-- Configured for RELEASE build (set with cmake 
-DCMAKE_BUILD_TYPE={release,debug,...})
-- Build Type: RELEASE
-- Performing Test CXX_LINKER_SUPPORTS_VERSION_SCRIPT
-- Performing Test CXX_LINKER_SUPPORTS_VERSION_SCRIPT - Success
-- Using AUTO approach to find dependencies
-- ARROW_ABSL_BUILD_VERSION: 

[jira] [Updated] (ARROW-17512) [Doc] Updates to crossbow documentation for clarity

2022-09-02 Thread Fiona La (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fiona La updated ARROW-17512:
-
Description: 
While setting up a queue repository for testing changes to {{crossbow}}, we 
noticed some updates that can be made to help future developers set up their 
environment. 

They include:
 # Clarify Travis CI auto-cancellation default behavior
 # Fixing broken links
 # Typos

  was:
While setting up a queue repository for testing changes to {{crossbow}}, we 
noticed some updates that can be made to help future developers set up their 
environment. 

They include:
 # Clarify Travis CI auto-cancellation default behavior.
 # Typos


> [Doc] Updates to crossbow documentation for clarity
> ---
>
> Key: ARROW-17512
> URL: https://issues.apache.org/jira/browse/ARROW-17512
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Reporter: Fiona La
>Assignee: Fiona La
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> While setting up a queue repository for testing changes to {{crossbow}}, 
> we noticed some updates that can be made to help future developers set up 
> their environment. 
> They include:
>  # Clarify Travis CI auto-cancellation default behavior
>  # Fixing broken links
>  # Typos



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-15691) [Dev] Update archery to work with either master or main as default branch

2022-09-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-15691:
---
Labels: pull-request-available  (was: )

> [Dev] Update archery to work with either master or main as default branch
> -
>
> Key: ARROW-15691
> URL: https://issues.apache.org/jira/browse/ARROW-15691
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Developer Tools
>Reporter: Neal Richardson
>Assignee: Fiona La
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-17590) Lower memory usage with filters

2022-09-02 Thread Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin closed ARROW-17590.
---
Resolution: Duplicate

> Lower memory usage with filters
> ---
>
> Key: ARROW-17590
> URL: https://issues.apache.org/jira/browse/ARROW-17590
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Yin
>Priority: Major
> Attachments: sample-1.py, sample.py
>
>
> Hi,
> When I read a parquet file (about 23MB with 250K rows and 600 object/string 
> columns with lots of None) with a filter on a non-null column for a small 
> number of rows (e.g. 1 to 500), the memory usage is pretty high (around 900MB 
> to 1GB). The result table and dataframe have only a few rows (1 row 20kb, 500 
> rows 20MB). It looks like it scans/loads many rows from the parquet file. Not 
> only is the footprint (high-water mark) of memory usage high, but the memory 
> also does not seem to be released in time (such as after GC in Python, though 
> it may get reused for a subsequent read).
> When reading the same parquet file for all columns without filtering, the 
> memory usage is about the same at 900MB. It goes up to 2.3GB after to_pandas; 
> df.info(memory_usage='deep') shows 4.3GB, maybe double counting something.
> It helps to limit the number of columns read. Reading 1 column, with a filter 
> for 1 row or more or without a filter, takes about 10MB, which is much 
> smaller and better, but still bigger than the size of a table or data frame 
> with 1 or 500 rows of 1 column (under 1MB).
> The filtered column is not a partition key, which functionally works to get 
> the correct rows. But the memory usage is quite high even when the parquet 
> file is not really large, partitioned or not. There were some references 
> similar to this issue, for example: 
> [https://github.com/apache/arrow/issues/7338]
> Related classes/methods in (pyarrow 9.0.0) 
> _ParquetDatasetV2.read
>     self._dataset.to_table(columns=columns, filter=self._filter_expression, 
> use_threads=use_threads)
> pyarrow._dataset.FileSystemDataset.to_table
> I played with pyarrow._dataset.Scanner.to_table
>     self._dataset.scanner(columns=columns, 
> filter=self._filter_expression).to_table()
> The memory usage is small when constructing the scanner, but goes up once 
> the to_table call materializes it.
> Is there some way or workaround to reduce the memory usage with read 
> filtering? 
> If not supported yet, can it be fixed/improved with priority? 
> This is a blocking issue for us when we need to load all or many columns. 
> I am not sure what improvement is possible with respect to how the parquet 
> columnar format works, and if it can be patched somehow in the Pyarrow Python 
> code, or need to change and build the arrow C++ code.
> Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17590) Lower memory usage with filters

2022-09-02 Thread Yin (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599489#comment-17599489
 ] 

Yin commented on ARROW-17590:
-

Yep,
pa.total_allocated_bytes 289.74639892578125 MB dt.nbytes 0.0011539459228515625 
MB
sleep  5 seconds
pa.total_allocated_bytes 0.0184326171875 MB dt.nbytes 0.0011539459228515625 MB

Thanks Weston. Let me close this jira.
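
A sketch of the measurement above (file name and filter column are 
hypothetical):

{code:python}
import time
import pyarrow as pa
import pyarrow.parquet as pq

dt = pq.read_table("sample.parquet", filters=[("key", "=", 1)])
print(pa.total_allocated_bytes() / 2**20, "MB held by the memory pool")
print(dt.nbytes / 2**20, "MB referenced by the result table")

time.sleep(5)  # scan buffers are returned to the pool lazily, not at GC time
print(pa.total_allocated_bytes() / 2**20, "MB after the scan buffers settle")
{code}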

> Lower memory usage with filters
> ---
>
> Key: ARROW-17590
> URL: https://issues.apache.org/jira/browse/ARROW-17590
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Yin
>Priority: Major
> Attachments: sample-1.py, sample.py
>
>
> Hi,
> When I read a parquet file (about 23MB with 250K rows and 600 object/string 
> columns with lots of None) with a filter on a non-null column for a small 
> number of rows (e.g. 1 to 500), the memory usage is pretty high (around 900MB 
> to 1GB). The result table and dataframe have only a few rows (1 row 20kb, 500 
> rows 20MB). It looks like it scans/loads many rows from the parquet file. Not 
> only is the footprint (high-water mark) of memory usage high, but the memory 
> also does not seem to be released in time (such as after GC in Python, though 
> it may get reused for a subsequent read).
> When reading the same parquet file for all columns without filtering, the 
> memory usage is about the same at 900MB. It goes up to 2.3GB after to_pandas; 
> df.info(memory_usage='deep') shows 4.3GB, maybe double counting something.
> It helps to limit the number of columns read. Reading 1 column, with a filter 
> for 1 row or more or without a filter, takes about 10MB, which is much 
> smaller and better, but still bigger than the size of a table or data frame 
> with 1 or 500 rows of 1 column (under 1MB).
> The filtered column is not a partition key, which functionally works to get 
> the correct rows. But the memory usage is quite high even when the parquet 
> file is not really large, partitioned or not. There were some references 
> similar to this issue, for example: 
> [https://github.com/apache/arrow/issues/7338]
> Related classes/methods in (pyarrow 9.0.0) 
> _ParquetDatasetV2.read
>     self._dataset.to_table(columns=columns, filter=self._filter_expression, 
> use_threads=use_threads)
> pyarrow._dataset.FileSystemDataset.to_table
> I played with pyarrow._dataset.Scanner.to_table
>     self._dataset.scanner(columns=columns, 
> filter=self._filter_expression).to_table()
> The memory usage is small when constructing the scanner, but goes up once 
> the to_table call materializes it.
> Is there some way or workaround to reduce the memory usage with read 
> filtering? 
> If not supported yet, can it be fixed/improved with priority? 
> This is a blocking issue for us when we need to load all or many columns. 
> I am not sure what improvement is possible with respect to how the parquet 
> columnar format works, and if it can be patched somehow in the Pyarrow Python 
> code, or need to change and build the arrow C++ code.
> Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16791) [R] Expose Azure Blob Storage filesystem

2022-09-02 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599485#comment-17599485
 ] 

Neal Richardson commented on ARROW-16791:
-

Thanks, yeah, that's being added to the C++ library in ARROW-2034. Once that is 
merged, we can add R bindings. 

> [R] Expose Azure Blob Storage filesystem
> 
>
> Key: ARROW-16791
> URL: https://issues.apache.org/jira/browse/ARROW-16791
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Affects Versions: 8.0.0
>Reporter: Dean MacGregor
>Priority: Critical
>
> I'd like to see the R arrow package be able to interface with the Azure Blob 
> Storage file system from the AzureStor package.
>  
> In python, pyarrow and adlfs work together so I'd like for AzureStor and 
> arrow under R to also work together.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17583) [Python] File write visitor throws exception on large parquet file

2022-09-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17583:
---
Labels: pull-request-available  (was: )

> [Python] File write visitor throws exception on large parquet file
> --
>
> Key: ARROW-17583
> URL: https://issues.apache.org/jira/browse/ARROW-17583
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: Joost Hoozemans
>Assignee: Joost Hoozemans
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When writing a large parquet file (e.g. 5GB) using pyarrow.dataset, it throws 
> an exception:
> Traceback (most recent call last):
>   File "pyarrow/_dataset_parquet.pyx", line 165, in 
> pyarrow._dataset_parquet.ParquetFileFormat._finish_write
>   File "pyarrow/{_}dataset.pyx", line 2695, in 
> pyarrow._dataset.WrittenFile.{_}{_}init{_}_
> OverflowError: value too large to convert to int
> Exception ignored in: 'pyarrow._dataset._filesystemdataset_write_visitor'
> The file is written successfully though. It seems related to this issue 
> https://issues.apache.org/jira/browse/ARROW-16761.
> I would guess the problem is the python field is an int while the C++ code 
> returns an int64_t 
> [https://github.com/apache/arrow/pull/13338/files#diff-4f2eb12337651b45bab2b03abe2552dd7fc9958b1fbbeb09a2a488804b097109R164]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16791) [R] Expose Azure Blob Storage filesystem

2022-09-02 Thread Dean MacGregor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599464#comment-17599464
 ] 

Dean MacGregor commented on ARROW-16791:


I found that in arrow/r/src/filesystem.cpp, on lines 38-40, there's a comment 
block which says to uncomment those lines for AzureBlobFileSystem once R6 
classes are made.  However, looking further, the cpp file contains much more 
for S3 and GCS than just such a reference, and nothing else for Azure. So I 
guess that comment should be more of a TODO, since, unless I'm missing a lot 
(which is all too possible), there's much more needed for Azure to work than 
just an R6 class, no?

 

To that end, Microsoft has [https://github.com/Azure/azure-sdk-for-cpp] for 
using azure in cpp.

 

I don't know cpp enough (really, at all) to be able to extend MS's code into 
something usable for arrow but just wanted to put the link in here in case it's 
helpful.

> [R] Expose Azure Blob Storage filesystem
> 
>
> Key: ARROW-16791
> URL: https://issues.apache.org/jira/browse/ARROW-16791
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Affects Versions: 8.0.0
>Reporter: Dean MacGregor
>Priority: Critical
>
> I'd like to see the R arrow package be able to interface with the Azure Blob 
> Storage file system from the AzureStor package.
>  
> In python, pyarrow and adlfs work together so I'd like for AzureStor and 
> arrow under R to also work together.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17599) [C++] ReadRangeCache should not retain data after read

2022-09-02 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599444#comment-17599444
 ] 

David Li commented on ARROW-17599:
--

You could perhaps store each entry as a custom Buffer subclass (which delegates 
to the actual buffer) and hand out shared pointers of that.

> [C++] ReadRangeCache should not retain data after read
> --
>
> Key: ARROW-17599
> URL: https://issues.apache.org/jira/browse/ARROW-17599
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>
> I've added a unit test of the issue here: 
> https://github.com/westonpace/arrow/tree/experiment/read-range-cache-retention
> We use the ReadRangeCache for pre-buffering IPC and parquet files.  Sometimes 
> those files are quite large (gigabytes).  The usage is roughly:
> for X in num_row_groups:
>   CacheAllThePiecesWeNeedForRowGroupX
>   WaitForPiecesToArriveForRowGroupX
>   ReadThePiecesWeNeedForRowGroupX
> However, once we've read in row group X and passed it on to Acero, etc. we do 
> not release the data for row group X.  The read range cache's entries vector 
> still holds a pointer to the buffer.  The data is not released until the file 
> reader itself is destroyed which only happens when we have finished 
> processing an entire file.
> This leads to excessive memory usage when pre-buffering is enabled.
> This could potentially be a little difficult to implement because a single 
> read range's cache entry could be shared by multiple ranges so we will need 
> some kind of reference counting to know when we have fully finished with an 
> entry and can release it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17319) [Python] pyarrow seems to set default CPU affinity to 0 on shutdown, crashes if CPU 0 is not available

2022-09-02 Thread Mike Gevaert (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599432#comment-17599432
 ] 

Mike Gevaert commented on ARROW-17319:
--

Ping.

It looks like the package isn't building with vcpkg.  I'm not knowledgeable 
about that ecosystem, so I'm not sure I can debug this.

> [Python] pyarrow seems to set default CPU affinity to 0 on shutdown, crashes 
> if CPU 0 is not available
> --
>
> Key: ARROW-17319
> URL: https://issues.apache.org/jira/browse/ARROW-17319
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
> Environment: Ubuntu 20.02 / Python 3.8.10 (default, Jun 22 2022, 
> 20:18:18)
> $ pip list 
> Package Version
> --- ---
> numpy   1.23.1 
> pandas  1.4.3  
> pip 20.0.2 
> pkg-resources   0.0.0  
> pyarrow 9.0.0  
> python-dateutil 2.8.2  
> pytz2022.1 
> setuptools  44.0.0 
> six 1.16.0 
>Reporter: Mike Gevaert
>Assignee: Mike Gevaert
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> I get the following traceback when exiting python after loading 
> {{pyarrow.parquet}}
> {code}
> Python 3.8.10 (default, Jun 22 2022, 20:18:18) 
> [GCC 9.4.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> os.getpid()
> 25106
> >>> import pyarrow.parquet
> >>> 
> Fatal error condition occurred in 
> /opt/vcpkg/buildtrees/aws-c-io/src/9e6648842a-364b708815.clean/source/event_loop.c:72:
>  aws_thread_launch(_thread, s_event_loop_destroy_async_thread_fn, 
> el_group, _options) == AWS_OP_SUCCESS
> Exiting Application
> 
> Stack trace:
> 
> /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x200af06) 
> [0x7f831b2b3f06]
> /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x20028e5) 
> [0x7f831b2ab8e5]
> /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x1f27e09) 
> [0x7f831b1d0e09]
> /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) 
> [0x7f831b2b4a3d]
> /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x1f25948) 
> [0x7f831b1ce948]
> /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) 
> [0x7f831b2b4a3d]
> /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x1ee0b46) 
> [0x7f831b189b46]
> /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x194546a) 
> [0x7f831abee46a]
> /lib/x86_64-linux-gnu/libc.so.6(+0x468a7) [0x7f831c6188a7]
> /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7f831c618a60]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7f831c5f608a]
>  {code}
> To replicate this, one needs to make sure that CPU 0 isn't available to 
> schedule tasks on.  In our HPC environment, that happens due to Slurm using 
> cgroups to constrain CPU usage.
> On a Linux workstation, one should be able to:
> 1) open python as a normal user
> 2) get the pid
> 3) as root:
> {code}
> cd /sys/fs/cgroup/cpuset/
> mkdir pyarrow
> cd pyarrow
> echo 0 > cpuset.mems
> echo 1 > cpuset.cpus # sets the cgroup to only have access to cpu 1
> echo $PID > tasks
> {code}
> Then, in the Python environment:
> {code}
> import pyarrow.parquet
> exit()
> {code}
> Which should trigger the crash.
> Sadly, I couldn't track down which {{aws-c-common}} and {{aws-c-io}} are 
> being used for the 9.0.0 py38 manylinux wheels. (libarrow.so.900 has 
> BuildID[sha1]=dd6c5a2efd5cacf09657780a58c40f7c930e4df1)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17355) [R] Refactor the handle_* utility functions for a better dev experience

2022-09-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17355:
---
Labels: pull-request-available  (was: )

> [R] Refactor the handle_* utility functions for a better dev experience
> ---
>
> Key: ARROW-17355
> URL: https://issues.apache.org/jira/browse/ARROW-17355
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In ARROW-15260, the utility functions for handling different kinds of 
> reading errors (handle_parquet_io_error, handle_csv_read_error, and 
> handle_augmented_field_misuse) were refactored so that multiple ones could 
> be chained together. An issue with this is that errors these handlers don't 
> capture may be silently swallowed unless they are manually re-raised 
> afterwards. We should update the code to prevent this from being possible.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17355) [R] Refactor the handle_* utility functions for a better dev experience

2022-09-02 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane reassigned ARROW-17355:


Assignee: Nicola Crane

> [R] Refactor the handle_* utility functions for a better dev experience
> ---
>
> Key: ARROW-17355
> URL: https://issues.apache.org/jira/browse/ARROW-17355
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Major
> Fix For: 10.0.0
>
>
> In ARROW-15260, the utility functions for handling different kinds of 
> reading errors (handle_parquet_io_error, handle_csv_read_error, and 
> handle_augmented_field_misuse) were refactored so that multiple ones could 
> be chained together. An issue with this is that errors these handlers don't 
> capture may be silently swallowed unless they are manually re-raised 
> afterwards. We should update the code to prevent this from being possible.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17580) [Doc][C++][Python] Unclear how to influence compilation flags

2022-09-02 Thread Alenka Frim (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599308#comment-17599308
 ] 

Alenka Frim commented on ARROW-17580:
-

I did some research and found:
 - {{CMAKE_CXX_FLAGS}} in pyarrow includes {{CXX_COMMON_FLAGS}}, but this is 
not true for pyarrow c++ *->* I need to add {{set(CMAKE_CXX_FLAGS 
"${CXX_COMMON_FLAGS} ${CMAKE_CXX_FLAGS}")}} to the pyarrow c++ CMakeLists file
 - {{PYARROW_CXXFLAGS}} is ignored by both pyarrow c++ and pyarrow *->* I need 
to add it to the cmake_options in setup.py; then it is read by CMake
 - {{ARROW_CXXFLAGS}} is recognised by all three builds
 
{code:bash}
arrow c++
-- ARROW_CXXFLAGS: -ggdb
-- PYARROW_CXXFLAGS: 
-- CXXFLAGS: 
-- CXX_COMMON_FLAGS:  -Wall -Wextra -Wdocumentation -Wshorten-64-to-32 
-Wno-missing-braces -Wno-unused-parameter -Wno-constant-logical-operand 
-Wno-return-stack-address -Werror -Wno-unknown-warning-option -Wno-pass-failed 
-stdlib=libc++ -march=armv8-a
--CMAKE_CXX_FLAGS:   -Qunused-arguments -fcolor-diagnostics -ggdb -O0  -Wall 
-Wextra -Wdocumentation -Wshorten-64-to-32 -Wno-missing-braces 
-Wno-unused-parameter -Wno-constant-logical-operand -Wno-return-stack-address 
-Werror -Wno-unknown-warning-option -Wno-pass-failed -stdlib=libc++ 
-march=armv8-a -ggdb

pyarrow c++
-- ARROW_CXXFLAGS: -ggdb
-- PYARROW_CXXFLAGS: 
-- CXXFLAGS: 
-- CXX_COMMON_FLAGS:  -Wall -Wextra -Wdocumentation -Wshorten-64-to-32 
-Wno-missing-braces -Wno-unused-parameter -Wno-constant-logical-operand 
-Wno-return-stack-address -Werror -Wno-unknown-warning-option -Wno-pass-failed 
-stdlib=libc++ -march=armv8-a
--CMAKE_CXX_FLAGS:  -Qunused-arguments -fcolor-diagnostics -ggdb -O0

pyarrow
-- ARROW_CXXFLAGS: -ggdb
-- PYARROW_CXXFLAGS: 
-- CXXFLAGS: 
-- CXX_COMMON_FLAGS:  -Wall -Wno-unknown-warning-option -Wno-pass-failed 
-stdlib=libc++ -march=armv8-a
-- CMAKE_CXX_FLAGS:   -Wall -Wno-unknown-warning-option -Wno-pass-failed 
-stdlib=libc++ -march=armv8-a  -Qunused-arguments -fcolor-diagnostics -g -O0  
-fno-omit-frame-pointer -Wno-unused-variable -Wno-maybe-uninitialized 
-Wno-parentheses-equality -Wno-constant-logical-operand 
-Wno-missing-declarations -Wno-sometimes-uninitialized 
-Wno-return-type-c-linkage
{code}
 

 - I think {{CXXFLAGS}} doesn't get included in the {{CMAKE_CXX_FLAGS}} list. 
Maybe setting {{CMAKE_CXX_FLAGS}} in 
[https://github.com/apache/arrow/blob/413e49385f52b8c76aa02f67588f4dc66d5ed86b/cpp/CMakeLists.txt#L555]
 even causes the variable to be ignored?

{code:bash}
(pyarrow-dev-9) (base) alenkafrim@Alenkas-MacBook-Pro arrow % printenv CXXFLAGS
-Woverloaded-virtual ../dist/configure
(pyarrow-dev-9) (base) alenkafrim@Alenkas-MacBook-Pro arrow % arrow cmake-debug
...
-- ARROW_CXXFLAGS: -ggdb
-- PYARROW_CXXFLAGS: 
-- CXXFLAGS: 
-- CXX_COMMON_FLAGS:  -Wall -Wextra -Wdocumentation -Wshorten-64-to-32 
-Wno-missing-braces -Wno-unused-parameter -Wno-constant-logical-operand 
-Wno-return-stack-address -Werror -Wno-unknown-warning-option -Wno-pass-failed 
-stdlib=libc++ -march=armv8-a
-- CMAKE_CXX_FLAGS:   -Qunused-arguments -fcolor-diagnostics -ggdb -O0  -Wall 
-Wextra -Wdocumentation -Wshorten-64-to-32 -Wno-missing-braces 
-Wno-unused-parameter -Wno-constant-logical-operand -Wno-return-stack-address 
-Werror -Wno-unknown-warning-option -Wno-pass-failed -stdlib=libc++ 
-march=armv8-a -ggdb
-- **
-- Configuring done
-- Generating done
-- Build files have been written to: /Users/alenkafrim/repos/arrow/cpp/build
~/repos/arrow

{code}

> [Doc][C++][Python] Unclear how to influence compilation flags
> -
>
> Key: ARROW-17580
> URL: https://issues.apache.org/jira/browse/ARROW-17580
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation, Python
>Reporter: Antoine Pitrou
>Priority: Critical
>
> Frequently people need to customize compilation flags for C++ and/or C files.
> Unfortunately, both for Arrow C++ and PyArrow, it is very difficult to find 
> out the proper way to do this.
> For Arrow C++, it seems {{ARROW_CXXFLAGS}} should be passed to CMake, while 
> the {{CXXFLAGS}} environment variable is ignored (it probably shouldn't?). 
> For PyArrow, I have not found a way to do it. -The {{CXXFLAGS}} environment 
> variable is ignored, and -the {{PYARROW_CXXFLAGS}} CMake variable has two 
> problems:
> * it is only recognized for Cython-generated files, not for PyArrow C++ 
> sources
> * it only affects linker calls, while it should actually affect compiler 
> calls (edit: does it?)
> *Edit: I was wrong, the {{CXXFLAGS}} environment variable is not ignored.*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16431) [C++][Parquet] Improve error message in append_row_groups() when appending disjoint metadata

2022-09-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16431:
---
Labels: pull-request-available  (was: )

> [C++][Parquet] Improve error message in append_row_groups() when appending 
> disjoint metadata
> 
>
> Key: ARROW-16431
> URL: https://issues.apache.org/jira/browse/ARROW-16431
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Michael Milton
>Assignee: Miles Granger
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently if you try to append together metadata from row groups with 
> different schemas (?), you get the following error: 
> {code:java}
>   File 
> "/home/mmilton/.conda/envs/mmilton/envs/driverpipe/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py",
>  line 52, in _append_row_groups
> metadata.append_row_groups(md)
>   File "pyarrow/_parquet.pyx", line 628, in 
> pyarrow._parquet.FileMetaData.append_row_groups
> self._metadata.AppendRowGroups(deref(c_metadata))
> RuntimeError: AppendRowGroups requires equal schemas.
> {code}
> What would be useful here is to actually include the schema difference in 
> the error object, specifying which columns disagree. This information should 
> _also_ be in the error message.
> For example if it said:
> {code:java}
> RuntimeError: AppendRowGroups requires equal schemas. Column "foo" was 
> previously an int32 but the latest row group is storing it as an int64
> {code}
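
For what it's worth, a hedged sketch of how that column-level difference could 
be computed (illustrative only, not from an actual patch; the exact 
integration point inside AppendRowGroups would need checking):

{code:cpp}
#include <sstream>
#include <string>

#include "parquet/schema.h"
#include "parquet/types.h"

// Walk two schemas and describe the first column that disagrees, so the
// error can say *what* differs instead of just "requires equal schemas".
std::string DescribeSchemaDiff(const parquet::SchemaDescriptor& before,
                               const parquet::SchemaDescriptor& after) {
  std::ostringstream msg;
  if (before.num_columns() != after.num_columns()) {
    msg << "column counts differ: " << before.num_columns() << " vs "
        << after.num_columns();
    return msg.str();
  }
  for (int i = 0; i < before.num_columns(); ++i) {
    const auto* lhs = before.Column(i);
    const auto* rhs = after.Column(i);
    if (lhs->path()->ToDotString() != rhs->path()->ToDotString() ||
        lhs->physical_type() != rhs->physical_type()) {
      msg << "column \"" << lhs->path()->ToDotString()
          << "\" was previously " << parquet::TypeToString(lhs->physical_type())
          << " but the latest row group is storing it as "
          << parquet::TypeToString(rhs->physical_type());
      return msg.str();
    }
  }
  return "schemas are equal";
}
{code}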



--
This message was sent by Atlassian Jira
(v8.20.10#820010)