[jira] [Assigned] (ARROW-8201) [Python][Dataset] Improve ergonomics of FileFragment
[ https://issues.apache.org/jira/browse/ARROW-8201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Miles Granger reassigned ARROW-8201:

Assignee: Miles Granger

> [Python][Dataset] Improve ergonomics of FileFragment
>
> Key: ARROW-8201
> URL: https://issues.apache.org/jira/browse/ARROW-8201
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Affects Versions: 0.16.0
> Reporter: Ben Kietzman
> Assignee: Miles Granger
> Priority: Major
> Labels: dataset
>
> FileFragment can be made more directly useful by adding convenience methods.
> For example, a FileFragment could allow the underlying file/buffer to be opened directly:
> {code}
> def open(self):
>     """
>     Open a NativeFile of the buffer or file viewed by this fragment.
>     """
>     cdef:
>         CFileSystem* c_filesystem
>         shared_ptr[CRandomAccessFile] opened
>         NativeFile out = NativeFile()
>     buf = self.buffer
>     if buf is not None:
>         return pa.io.BufferReader(buf)
>     with nogil:
>         c_filesystem = self.file_fragment.source().filesystem()
>         opened = GetResultValue(c_filesystem.OpenInputFile(
>             self.file_fragment.source().path()))
>     out.set_random_access_file(opened)
>     out.is_readable = True
>     return out
> {code}
> Additionally, a ParquetFileFragment's metadata could be introspectable:
> {code}
> @property
> def metadata(self):
>     from pyarrow._parquet import ParquetReader
>     reader = ParquetReader()
>     reader.open(self.open())
>     return reader.metadata
> {code}

-- This message was sent by Atlassian Jira (v8.20.10#820010)
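The proposed open() boils down to a simple dispatch: serve the in-memory buffer when the fragment wraps one, otherwise open the file from the fragment's filesystem. A minimal pure-Python sketch of that pattern follows; FileSource here is a hypothetical stand-in for illustration, not a pyarrow class.

```python
import io
import os
import tempfile

class FileSource:
    """Stand-in for a fragment's source: either an in-memory buffer or a path."""

    def __init__(self, path=None, buffer=None):
        self.path = path
        self.buffer = buffer

    def open(self):
        # Mirror of the proposed dispatch: buffer-backed fragments return a
        # reader over the buffer (like pa.BufferReader in the snippet above);
        # file-backed fragments open the underlying file.
        if self.buffer is not None:
            return io.BytesIO(self.buffer)
        return open(self.path, "rb")

# Buffer-backed source: no filesystem access needed.
assert FileSource(buffer=b"abc").open().read() == b"abc"

# File-backed source: reads from disk.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"xyz")
    path = f.name
assert FileSource(path=path).open().read() == b"xyz"
os.unlink(path)
```

Either way the caller gets back a readable file-like object, which is what makes the convenience method ergonomic: downstream readers need not care whether the fragment is backed by memory or by a filesystem.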
[jira] [Comment Edited] (ARROW-17374) [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND
[ https://issues.apache.org/jira/browse/ARROW-17374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599759#comment-17599759 ] Vincent Nijs edited comment on ARROW-17374 at 9/3/22 2:03 AM:
--

Made some progress. I had to install `libsnappy-dev` and `libre2-dev` on Ubuntu 22.04 and use the workaround from https://github.com/duckdb/duckdb/issues/3049#issuecomment-1096671708. Arrow still doesn't compile successfully, however, and ends with the error below. The build logs are attached ("build-images.out").

{noformat}
#21 1786.1 ** testing if installed package can be loaded from temporary location
#21 1786.4 Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath = DLLpath, ...):
#21 1786.4 unable to load shared object '/usr/local/lib/R/site-library/00LOCK-arrow/00new/arrow/libs/arrow.so':
#21 1786.4 /usr/local/lib/R/site-library/00LOCK-arrow/00new/arrow/libs/arrow.so: undefined symbol: _ZTIN6apache6thrift8protocol9TProtocolE
{noformat}

{noformat}
# workaround for fPIC issue: https://github.com/duckdb/duckdb/issues/3049#issuecomment-1096671708
cp -a /usr/local/lib/R/etc/Makeconf /usr/local/lib/R/etc/Makeconf.bak;
sed -i 's/fpic/fPIC/g' /usr/local/lib/R/etc/Makeconf;
R -e "options(HTTPUserAgent = sprintf('R/%s R (%s)', getRversion(), paste(getRversion(), R.version['platform'], R.version['arch'], R.version['os']))); Sys.setenv('ARROW_R_DEV' = TRUE); install.packages(c('duckdb', 'arrow'), repo='${CRAN}', Ncpus=${NCPUS})"
mv /usr/local/lib/R/etc/Makeconf.bak /usr/local/lib/R/etc/Makeconf;
{noformat}

was (Author: vnijs): Made some progress. I had to install `libsnappy-dev` and `libre2-dev` on Ubuntu 22.04 and use the workaround from https://github.com/duckdb/duckdb/issues/3049#issuecomment-1096671708. Arrow still doesn't compile successfully, however, and ends with the same error and workaround shown above. The build logs are attached ("build-images.out").

> [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND
> --
>
> Key: ARROW-17374
> URL: https://issues.apache.org/jira/browse/ARROW-17374
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 8.0.0, 9.0.0, 8.0.1
> Environment: Amazon Linux 2 (RHEL) - 5.10.102-99.473.amzn2.x86_64
> Reporter: Shane Brennan
> Priority: Blocker
> Attachments: build-images.out
>
> I've been trying to install Arrow on an R notebook within AWS SageMaker.
> SageMaker provides Jupyter-like notebooks, with each instance running Amazon Linux 2 as its OS, itself based on RHEL.
> Trying to install a few ways, e.g., using the standard binaries, using the nightly builds, setting ARROW_WITH_SNAPPY to ON and LIBARROW_MINIMAL, all still result in the following error.
> {noformat}
> x86_64-conda-linux-gnu-c++ -std=gnu++11 -shared -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -Wl,-O2 -Wl,--sort-common
> -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,--allow-shlib-undefined
> -Wl,-rpath,/home/ec2-user/anaconda3/envs/R/lib -Wl,-rpath-link,/home/ec2-user/anaconda3/envs/R/lib
> -L/home/ec2-user/anaconda3/envs/R/lib -o arrow.so RTasks.o altrep.o array.o array_to_vector.o arraydata.o arrowExports.o
> bridge.o buffer.o chunkedarray.o compression.o compute-exec.o compute.o config.o csv.o dataset.o datatype.o expression.o
> extension-impl.o feather.o field.o filesystem.o imports.o io.o json.o memorypool.o message.o parquet.o r_to_arrow.o
> recordbatch.o recordbatchreader.o recordbatchwriter.o safe-call-into-r-impl.o scalar.o schema.o symbols.o table.o
> threadpool.o type_infer.o -L/tmp/Rtmpuh87oc/R.INSTALL67114493a3de/arrow/libarrow/arrow-9.0.0.20220809/lib
> -larrow_dataset -lparquet -larrow -larrow_bundled_dependencies -lz SNAPPY_LIB-NOTFOUND
> /home/ec2-user/anaconda3/envs/R/lib/libbz2.so -pthread -larrow -larrow_bundled_dependencies -larrow_dataset -lparquet
> -lssl -lcrypto -lcurl -lssl -lcrypto -lcurl -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -lR
> x86_64-conda-linux-gnu-c++: error: SNAPPY_LIB-NOTFOUND: No such file or directory
> make: *** [/home/ec2-user/anaconda3/envs/R/lib/R/share/make/shlib.mk:10: arrow.so] Error 1
> {noformat}
> Snappy is installed on the systems, and both shared object (.so) and cmake files are there, and I've tried setting the system env variables Snappy_DIR and Snappy_LIB to point at them, but to no avail.
[jira] [Commented] (ARROW-17374) [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND
[ https://issues.apache.org/jira/browse/ARROW-17374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599759#comment-17599759 ] Vincent Nijs commented on ARROW-17374:
--

Made some progress. I had to install `libsnappy-dev` and `libre2-dev` on Ubuntu 22.04 and use the workaround from https://github.com/duckdb/duckdb/issues/3049#issuecomment-1096671708. Arrow still doesn't compile successfully, however, and ends with the error below. The build logs are attached ("build-images.out").

{noformat}
#21 1786.1 ** testing if installed package can be loaded from temporary location
#21 1786.4 Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath = DLLpath, ...):
#21 1786.4 unable to load shared object '/usr/local/lib/R/site-library/00LOCK-arrow/00new/arrow/libs/arrow.so':
#21 1786.4 /usr/local/lib/R/site-library/00LOCK-arrow/00new/arrow/libs/arrow.so: undefined symbol: _ZTIN6apache6thrift8protocol9TProtocolE
{noformat}

{noformat}
# workaround for fPIC issue: https://github.com/duckdb/duckdb/issues/3049#issuecomment-1096671708
cp -a /usr/local/lib/R/etc/Makeconf /usr/local/lib/R/etc/Makeconf.bak;
sed -i 's/fpic/fPIC/g' /usr/local/lib/R/etc/Makeconf;
R -e "options(HTTPUserAgent = sprintf('R/%s R (%s)', getRversion(), paste(getRversion(), R.version['platform'], R.version['arch'], R.version['os']))); Sys.setenv('ARROW_R_DEV' = TRUE); install.packages(c('duckdb', 'arrow'), repo='${CRAN}', Ncpus=${NCPUS})"
mv /usr/local/lib/R/etc/Makeconf.bak /usr/local/lib/R/etc/Makeconf;
{noformat}
[jira] [Updated] (ARROW-17374) [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND
[ https://issues.apache.org/jira/browse/ARROW-17374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vincent Nijs updated ARROW-17374:
-

Attachment: build-images.out
[jira] [Updated] (ARROW-17374) [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND
[ https://issues.apache.org/jira/browse/ARROW-17374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vincent Nijs updated ARROW-17374:
-

Attachment: (was: image-2022-08-26-21-42-13-338.png)
[jira] [Comment Edited] (ARROW-17374) [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND
[ https://issues.apache.org/jira/browse/ARROW-17374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17585662#comment-17585662 ] Vincent Nijs edited comment on ARROW-17374 at 9/3/22 1:51 AM:
--

I ran into the same issue installing arrow from source for R on aarch64. Python is installed using conda but R was installed from source. The docker container for aarch64 is vnijs/rsm-jupyter (https://hub.docker.com/repository/docker/vnijs/rsm-jupyter). Note that the image is rather large.

was (Author: vnijs): I ran into the same issue installing arrow from source for R on aarch64. Python is installed using conda but R was installed from source. The docker container for aarch64 is vnijs/rsm-jupyter (https://hub.docker.com/repository/docker/vnijs/rsm-jupyter). Note that the image is rather large. FYI I also tried installing using binaries but that gives the error below even though the file clearly exists. !image-2022-08-26-21-42-13-338.png!
[jira] [Updated] (ARROW-17599) [C++] ReadRangeCache should not retain data after read
[ https://issues.apache.org/jira/browse/ARROW-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace updated ARROW-17599:
-

Labels: good-second-issue (was: )

> [C++] ReadRangeCache should not retain data after read
> --
>
> Key: ARROW-17599
> URL: https://issues.apache.org/jira/browse/ARROW-17599
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Priority: Major
> Labels: good-second-issue
>
> I've added a unit test of the issue here:
> https://github.com/westonpace/arrow/tree/experiment/read-range-cache-retention
> We use the ReadRangeCache for pre-buffering IPC and parquet files. Sometimes those files are quite large (gigabytes). The usage is roughly:
> {noformat}
> for X in num_row_groups:
>     CacheAllThePiecesWeNeedForRowGroupX
>     WaitForPiecesToArriveForRowGroupX
>     ReadThePiecesWeNeedForRowGroupX
> {noformat}
> However, once we've read in row group X and passed it on to Acero, etc., we do not release the data for row group X. The read range cache's entries vector still holds a pointer to the buffer. The data is not released until the file reader itself is destroyed, which only happens when we have finished processing an entire file.
> This leads to excessive memory usage when pre-buffering is enabled.
> This could potentially be a little difficult to implement because a single read range's cache entry could be shared by multiple ranges, so we will need some kind of reference counting to know when we have fully finished with an entry and can release it.
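The reference counting the last paragraph calls for could work roughly as sketched below. This is a hypothetical Python illustration of the idea, not Arrow's actual C++ ReadRangeCache: each entry records how many pending ranges still need it, and its buffer is dropped as soon as the last consumer reads it rather than when the file reader is destroyed.

```python
class CacheEntry:
    """A cached buffer plus the number of pending ranges that still need it."""

    def __init__(self, buffer, refcount):
        self.buffer = buffer
        self.refcount = refcount

class ReadRangeCache:
    def __init__(self):
        self.entries = {}  # offset -> CacheEntry

    def cache(self, offset, buffer, consumers=1):
        # A single coalesced entry may serve several requested ranges.
        self.entries[offset] = CacheEntry(buffer, consumers)

    def read(self, offset):
        entry = self.entries[offset]
        data = entry.buffer
        entry.refcount -= 1
        if entry.refcount == 0:
            # Last consumer is done: release the buffer now instead of
            # holding it until the whole file has been processed.
            del self.entries[offset]
        return data

cache = ReadRangeCache()
cache.cache(0, b"row group 0 bytes", consumers=2)
cache.read(0)
assert 0 in cache.entries      # one range still pending; buffer retained
cache.read(0)
assert 0 not in cache.entries  # all consumers done; buffer released
```

The design choice here matches the issue's concern: because one entry can back multiple ranges, a plain "delete after first read" would drop data another range still needs, so the count must reach zero before release.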
[jira] [Assigned] (ARROW-17604) [Java][Docs] Improve docs around JVM flags
[ https://issues.apache.org/jira/browse/ARROW-17604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li reassigned ARROW-17604:

Assignee: David Li

> [Java][Docs] Improve docs around JVM flags
> --
>
> Key: ARROW-17604
> URL: https://issues.apache.org/jira/browse/ARROW-17604
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Documentation, Java
> Reporter: David Li
> Assignee: David Li
> Priority: Major
>
> * Clarify where the {{--add-opens}} flag should be added (as an argument to {{java}})
> * Demonstrate how to configure Surefire with it
> * Demonstrate how to configure IntelliJ with it
> * Catch the exception that results if you forget and rethrow it with a message telling you what to do/what docs to visit
> * Investigate MemoryHandles as a possible replacement for this
> * Poke MemoryUtil.UNSAFE ASAP at startup (e.g. in RootAllocator) to fail sooner
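On the first bullet: {{--add-opens}} is an option to the {{java}} launcher itself, so it must appear before the main class or {{-jar}} argument; anything after the main class is passed to the application. A small illustrative sketch of that ordering (the module spec used below is an assumption for illustration, not necessarily the exact spec Arrow's docs settle on):

```python
def jvm_command(main_class, add_opens=("java.base/java.nio=ALL-UNNAMED",)):
    """Build a java command line with --add-opens placed as a JVM option."""
    cmd = ["java"]
    for spec in add_opens:
        # JVM options must come before the main class / -jar argument.
        cmd.append("--add-opens=" + spec)
    cmd.append(main_class)  # arguments after this go to the application
    return cmd

assert jvm_command("com.example.App") == [
    "java",
    "--add-opens=java.base/java.nio=ALL-UNNAMED",
    "com.example.App",
]
```

The same ordering rule is what the Surefire and IntelliJ bullets reduce to: both tools have a place to supply extra JVM arguments, and the flag belongs there, not in the program's argument list.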
[jira] [Updated] (ARROW-17604) [Java][Docs] Improve docs around JVM flags
[ https://issues.apache.org/jira/browse/ARROW-17604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-17604:
-

Description:
* Clarify where the {{--add-opens}} flag should be added (as an argument to {{java}})
* Demonstrate how to configure Surefire with it
* Demonstrate how to configure IntelliJ with it
* Catch the exception that results if you forget and rethrow it with a message telling you what to do/what docs to visit
* Investigate MemoryHandles as a possible replacement for this

was:
* Clarify where the {{--add-opens}} flag should be added (as an argument to {{java}})
* Demonstrate how to configure Surefire with it
* Demonstrate how to configure IntelliJ with it
* Catch the exception that results if you forget and rethrow it with a message telling you what to do/what docs to visit
[jira] [Updated] (ARROW-17604) [Java][Docs] Improve docs around JVM flags
[ https://issues.apache.org/jira/browse/ARROW-17604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-17604:
-

Description:
* Clarify where the {{--add-opens}} flag should be added (as an argument to {{java}})
* Demonstrate how to configure Surefire with it
* Demonstrate how to configure IntelliJ with it
* Catch the exception that results if you forget and rethrow it with a message telling you what to do/what docs to visit
* Investigate MemoryHandles as a possible replacement for this
* Poke MemoryUtil.UNSAFE ASAP at startup (e.g. in RootAllocator) to fail sooner

was:
* Clarify where the {{--add-opens}} flag should be added (as an argument to {{java}})
* Demonstrate how to configure Surefire with it
* Demonstrate how to configure IntelliJ with it
* Catch the exception that results if you forget and rethrow it with a message telling you what to do/what docs to visit
* Investigate MemoryHandles as a possible replacement for this
[jira] [Updated] (ARROW-17604) [Java][Docs] Improve docs around JVM flags
[ https://issues.apache.org/jira/browse/ARROW-17604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-17604:
-

Description:
* Clarify where the {{--add-opens}} flag should be added (as an argument to {{java}})
* Demonstrate how to configure Surefire with it
* Demonstrate how to configure IntelliJ with it
* Catch the exception that results if you forget and rethrow it with a message telling you what to do/what docs to visit

was:
* Clarify where the {{--add-opens}} flag should be added (as an argument to {{java}})
* Demonstrate how to configure Surefire with it
* Demonstrate how to configure IntelliJ with it
[jira] [Created] (ARROW-17604) [Java][Docs] Improve docs around JVM flags
David Li created ARROW-17604:

Summary: [Java][Docs] Improve docs around JVM flags
Key: ARROW-17604
URL: https://issues.apache.org/jira/browse/ARROW-17604
Project: Apache Arrow
Issue Type: Improvement
Components: Documentation, Java
Reporter: David Li

* Clarify where the {{--add-opens}} flag should be added (as an argument to {{java}})
* Demonstrate how to configure Surefire with it
* Demonstrate how to configure IntelliJ with it
[jira] [Commented] (ARROW-17595) [C++] Installation Error stdlib.h no such file or directory
[ https://issues.apache.org/jira/browse/ARROW-17595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599720#comment-17599720 ] Kouhei Sutou commented on ARROW-17595:
--

Thanks. I could install {{gcc-toolset-11}}. But it's installed from the "ol8_appstream" repository. What is your "acx-appstream" repository?

> [C++] Installation Error stdlib.h no such file or directory
> ---
>
> Key: ARROW-17595
> URL: https://issues.apache.org/jira/browse/ARROW-17595
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 9.0.0
> Environment: OEL 8.5
> Reporter: Robert Tidwell
> Priority: Major
>
> Getting an error installing arrow:
> {noformat}
> /opt/rh/gcc-toolset-11/root/usr/include/c++/11/cstdlib:75:15: fatal error: stdlib.h: No such file or directory
> 75 | #include_next <stdlib.h>
> {noformat}
> Using the c++11 toolset. Tried with the default c++ on the system and was getting the same type of error. OS is OEL 8.5.
> Here is the detailed output:
> {noformat}
> Installing package into ‘/usr/lib64/R/library’
> (as ‘lib’ is unspecified)
> trying URL 'http://**/prod/cran/src/contrib/arrow_9.0.0.tar.gz'
> Content type 'application/octet-stream' length 4900968 bytes (4.7 MB)
> ==
> downloaded 4.7 MB
> * installing *source* package ‘arrow’ ...
> ** package ‘arrow’ successfully unpacked and MD5 sums checked
> ** using staged installation
> *** Found local C++ source: 'tools/cpp'
> *** Building libarrow from source
> For build options and troubleshooting, see the install vignette:
> https://cran.r-project.org/web/packages/arrow/vignettes/install.html
> *** Building with MAKEFLAGS= -j2
> cmake: /bin/cmake
> arrow with SOURCE_DIR='tools/cpp' BUILD_DIR='/tmp/rmt' DEST_DIR='libarrow/arrow-9.0.0' CMAKE='/bin/cmake'
> EXTRA_CMAKE_FLAGS='' CC='gcc -m64' CXX='g++ -m64 -std=gnu++11' LDFLAGS='-Wl,-z,relro -Wl,-z,now
> -specs=/usr/lib/rpm/redhat/redhat-hardened-ld' ARROW_S3='OFF' ARROW_GCS='OFF' ARROW_MIMALLOC='OFF'
> ++ pwd
> + : /tmp/Rtmp6bSuH3/R.INSTALL40fe61c13650d/arrow
> + : tools/cpp
> + : /tmp/rmt
> + : libarrow/arrow-9.0.0
> + : /bin/cmake
> ++ cd tools/cpp
> ++ pwd
> + SOURCE_DIR=/tmp/Rtmp6bSuH3/R.INSTALL40fe61c13650d/arrow/tools/cpp
> ++ mkdir -p libarrow/arrow-9.0.0
> ++ cd libarrow/arrow-9.0.0
> ++ pwd
> + DEST_DIR=/tmp/Rtmp6bSuH3/R.INSTALL40fe61c13650d/arrow/libarrow/arrow-9.0.0
> + '[' '' '!=' '' ']'
> + '[' '' = false ']'
> + ARROW_DEFAULT_PARAM=OFF
> + mkdir -p /tmp/rmt
> + pushd /tmp/rmt
> /tmp/rmt /tmp/Rtmp6bSuH3/R.INSTALL40fe61c13650d/arrow
> + /bin/cmake -DARROW_BOOST_USE_SHARED=OFF -DARROW_BUILD_TESTS=OFF -DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON
> -DARROW_COMPUTE=ON -DARROW_CSV=ON -DARROW_DATASET=ON -DARROW_DEPENDENCY_SOURCE=AUTO -DAWSSDK_SOURCE=
> -DARROW_FILESYSTEM=ON -DARROW_GCS=OFF -DARROW_JEMALLOC=OFF -DARROW_MIMALLOC=OFF -DARROW_JSON=ON -DARROW_PARQUET=ON
> -DARROW_S3=OFF -DARROW_WITH_BROTLI=OFF -DARROW_WITH_BZ2=OFF -DARROW_WITH_LZ4=ON -DARROW_WITH_RE2=ON
> -DARROW_WITH_SNAPPY=ON -DARROW_WITH_UTF8PROC=ON -DARROW_WITH_ZLIB=OFF -DARROW_WITH_ZSTD=OFF
> -DARROW_VERBOSE_THIRDPARTY_BUILD=OFF -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_LIBDIR=lib
> -DCMAKE_INSTALL_PREFIX=/tmp/Rtmp6bSuH3/R.INSTALL40fe61c13650d/arrow/libarrow/arrow-9.0.0
> -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON -DCMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY=ON -DCMAKE_UNITY_BUILD=OFF
> -Dxsimd_SOURCE= -G 'Unix Makefiles' /tmp/Rtmp6bSuH3/R.INSTALL40fe61c13650d/arrow/tools/cpp
> -- Building using CMake version: 3.20.2
> -- The C compiler identification is GNU 11.2.1
> -- The CXX compiler identification is GNU 11.2.1
> -- Detecting C compiler ABI info
> -- Detecting C compiler ABI info - failed
> -- Check for working C compiler: /opt/rh/gcc-toolset-11/root/usr/bin/gcc
> -- Check for working C compiler: /opt/rh/gcc-toolset-11/root/usr/bin/gcc - works
> -- Detecting C compile features
> -- Detecting C compile features - done
> -- Detecting CXX compiler ABI info
> -- Detecting CXX compiler ABI info - failed
> -- Check for working CXX compiler: /opt/rh/gcc-toolset-11/root/usr/bin/g++
> -- Check for working CXX compiler: /opt/rh/gcc-toolset-11/root/usr/bin/g++ - works
> -- Detecting CXX compile features
> -- Detecting CXX compile features - done
> -- Arrow version: 9.0.0 (full: '9.0.0')
> -- Arrow SO version: 900 (full: 900.0.0)
> -- clang-tidy 12 not found
> -- clang-format 12 not found
> -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN)
> -- infer not found
> fatal: not a git repository (or any parent up to mount point /)
> Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
> -- Found Python3: /bin/python3.6 (found version "3.6.8") found
[jira] [Comment Edited] (ARROW-17595) [C++] Installation Error stdlib.h no such file or directory
[ https://issues.apache.org/jira/browse/ARROW-17595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599511#comment-17599511 ] Kouhei Sutou edited comment on ARROW-17595 at 9/2/22 9:20 PM: -- {noformat} dnf install gcc-toolset-11 {noformat} Here are the packages that I have installed: {noformat}
gcc-toolset-11.x86_64 11.1-1.el8 @acx-appstream
gcc-toolset-11-annobin-docs.noarch 10.23-1.el8 @acx-appstream
gcc-toolset-11-annobin-plugin-gcc.x86_64 10.23-1.el8 @acx-appstream
gcc-toolset-11-binutils.x86_64 2.36.1-2.0.1.el8 @acx-appstream
gcc-toolset-11-dwz.x86_64 0.14-2.el8 @acx-appstream
gcc-toolset-11-dyninst.x86_64 11.0.0-2.el8 @acx-appstream
gcc-toolset-11-elfutils.x86_64 0.185-5.el8 @acx-appstream
gcc-toolset-11-elfutils-debuginfod-client.x86_64 0.185-5.el8 @acx-appstream
gcc-toolset-11-elfutils-libelf.x86_64 0.185-5.el8 @acx-appstream
gcc-toolset-11-elfutils-libs.x86_64 0.185-5.el8 @acx-appstream
gcc-toolset-11-gcc.x86_64 11.2.1-9.1.0.2.el8 @acx-appstream
gcc-toolset-11-gcc-c++.x86_64 11.2.1-9.1.0.2.el8 @acx-appstream
gcc-toolset-11-gcc-gdb-plugin.x86_64 11.2.1-9.1.0.2.el8 @acx-appstream
gcc-toolset-11-gcc-gfortran.x86_64 11.2.1-9.1.0.2.el8 @acx-appstream
gcc-toolset-11-gcc-plugin-devel.x86_64 11.2.1-9.1.0.2.el8 @acx-appstream
gcc-toolset-11-gdb.x86_64 10.2-5.0.1.el8 @acx-appstream
gcc-toolset-11-libquadmath-devel.x86_64 11.2.1-9.1.0.2.el8 @acx-appstream
gcc-toolset-11-libstdc++-devel.x86_64 11.2.1-9.1.0.2.el8 @acx-appstream
gcc-toolset-11-ltrace.x86_64 0.7.91-1.el8 @acx-appstream
gcc-toolset-11-make.x86_64 1:4.3-2.el8 @acx-appstream
gcc-toolset-11-make-devel.x86_64 1:4.3-2.el8 @acx-appstream
gcc-toolset-11-perftools.x86_64 11.1-1.el8 @acx-appstream
gcc-toolset-11-runtime.x86_64 11.0-1.el8 @acx-appstream
gcc-toolset-11-strace.x86_64 5.13-7.el8 @acx-appstream
gcc-toolset-11-systemtap.x86_64 4.5-4.el8 @acx-appstream
gcc-toolset-11-systemtap-client.x86_64 4.5-4.el8 @acx-appstream
gcc-toolset-11-systemtap-devel.x86_64 4.5-4.el8 @acx-appstream
gcc-toolset-11-systemtap-runtime.x86_64 4.5-4.el8 @acx-appstream
gcc-toolset-11-toolchain.x86_64 11.1-1.el8 @acx-appstream
gcc-toolset-11-valgrind.x86_64 1:3.17.0-6.el8 @acx-appstream
{noformat}
[jira] [Commented] (ARROW-17319) [Python] pyarrow seems to set default CPU affinity to 0 on shutdown, crashes if CPU 0 is not available
[ https://issues.apache.org/jira/browse/ARROW-17319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599715#comment-17599715 ] Kouhei Sutou commented on ARROW-17319: -- Sorry. I missed this. We don't need to do anything for "Please also update the crossbow configuration in order to keep the github ..." in the pull request. It's a task for after we merge the pull request. Our CI jobs use the new vcpkg revision in the pull request: https://github.com/apache/arrow/pull/13835#issuecomment-1210326357 But it failed in the patching phase: https://github.com/ursacomputing/crossbow/runs/7762521768?check_suite_focus=true#step:6:65 {noformat} error: patch failed: ports/snappy/portfile.cmake:4 error: ports/snappy/portfile.cmake: patch does not apply Checking patch ports/snappy/snappy-disable-bmi.patch... {noformat} We need to confirm that the failed patch https://github.com/apache/arrow/blob/master/ci/vcpkg/ports.patch#L89-L125 is still needed. If it's no longer needed, we can remove it from our patch set. > [Python] pyarrow seems to set default CPU affinity to 0 on shutdown, crashes > if CPU 0 is not available > -- > > Key: ARROW-17319 > URL: https://issues.apache.org/jira/browse/ARROW-17319 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 9.0.0 > Environment: Ubuntu 20.02 / Python 3.8.10 (default, Jun 22 2022, > 20:18:18) > $ pip list > Package Version > --- --- > numpy 1.23.1 > pandas 1.4.3 > pip 20.0.2 > pkg-resources 0.0.0 > pyarrow 9.0.0 > python-dateutil 2.8.2 > pytz 2022.1 > setuptools 44.0.0 > six 1.16.0 >Reporter: Mike Gevaert >Assignee: Mike Gevaert >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > I get the following traceback when exiting python after loading > {{pyarrow.parquet}} > {code} > Python 3.8.10 (default, Jun 22 2022, 20:18:18) > [GCC 9.4.0] on linux > Type "help", "copyright", "credits" or "license" for more information. 
> >>> os.getpid() > 25106 > >>> import pyarrow.parquet > >>> > Fatal error condition occurred in > /opt/vcpkg/buildtrees/aws-c-io/src/9e6648842a-364b708815.clean/source/event_loop.c:72: > aws_thread_launch(_thread, s_event_loop_destroy_async_thread_fn, > el_group, _options) == AWS_OP_SUCCESS > Exiting Application > > Stack trace: > > /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x200af06) > [0x7f831b2b3f06] > /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x20028e5) > [0x7f831b2ab8e5] > /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x1f27e09) > [0x7f831b1d0e09] > /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) > [0x7f831b2b4a3d] > /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x1f25948) > [0x7f831b1ce948] > /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) > [0x7f831b2b4a3d] > /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x1ee0b46) > [0x7f831b189b46] > /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x194546a) > [0x7f831abee46a] > /lib/x86_64-linux-gnu/libc.so.6(+0x468a7) [0x7f831c6188a7] > /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7f831c618a60] > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7f831c5f608a] > {code} > To replicate this, one needs to make sure that CPU 0 isn't available to > schedule tasks on. In our HPC environment, that happens due to slurm using > cgroups to constrain CPU usage. > On a Linux workstation, one should be able to: > 1) open python as a normal user > 2) get the pid > 3) as root: > {code} > cd /sys/fs/cgroup/cpuset/ > mkdir pyarrow > cd pyarrow > echo 0 > cpuset.mems > echo 1 > cpuset.cpus # sets the cgroup to only have access to cpu 1 > echo $PID > tasks > {code} > Then, in the python environment: > {code} > import pyarrow.parquet > exit() > {code} > Which should trigger the crash. 
> Sadly, I couldn't track down which {{aws-c-common}} and {{aws-c-io}} are > being used for the 9.0.0 py38 manylinux wheels. (libarrow.so.900 has > BuildID[sha1]=dd6c5a2efd5cacf09657780a58c40f7c930e4df1) -- This message was sent by Atlassian Jira (v8.20.10#820010)
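The cgroup repro above hinges on CPU 0 being absent from the process's affinity mask. As an illustration only (not part of the ticket or of pyarrow's API), a small Python check of the mask before importing pyarrow might look like this; `os.sched_getaffinity` is Linux-only, so the sketch falls back to assuming CPU 0 is present elsewhere:

```python
import os

def cpu_zero_available():
    """Return True if CPU 0 is in this process's affinity mask.

    os.sched_getaffinity exists only on Linux; on other platforms
    we conservatively assume CPU 0 is available.
    """
    if hasattr(os, "sched_getaffinity"):
        return 0 in os.sched_getaffinity(0)
    return True

print(cpu_zero_available())
```

Under the slurm/cgroup setup described in the ticket, this would return False, matching the condition that triggers the crash.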
[jira] [Updated] (ARROW-6772) [C++] Add operator== for interfaces with an Equals() method
[ https://issues.apache.org/jira/browse/ARROW-6772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6772: -- Labels: good-first-issue pull-request-available (was: good-first-issue) > [C++] Add operator== for interfaces with an Equals() method > --- > > Key: ARROW-6772 > URL: https://issues.apache.org/jira/browse/ARROW-6772 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ben Kietzman >Assignee: Ben Harkins >Priority: Major > Labels: good-first-issue, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > A common pattern in tests is {{ASSERT_TRUE(schm->Equals(*other))}}. The > addition of overloaded equality operators will allow this to be written > {{ASSERT_EQ(*schm, *other)}}, which is more idiomatic GTEST usage and will > allow more informative assertion failure messages. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17601) [C++] Error when creating Expression on Decimal128 types: precision out of range
[ https://issues.apache.org/jira/browse/ARROW-17601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599693#comment-17599693 ] Neal Richardson commented on ARROW-17601: - Also, in case it's relevant: we don't have any problems like this with integer or floating-point types: {code} > expr <- Expression$create("multiply_checked", Expression$field_ref("col1"), > Expression$field_ref("col2")) > > expr$type(schema(col1=int64(), col2=int64())) Int64 int64 > expr$type(schema(col1=int32(), col2=int32())) Int32 int32 > expr$type(schema(col1=float32(), col2=float32())) Float32 float {code} > [C++] Error when creating Expression on Decimal128 types: precision out of > range > > > Key: ARROW-17601 > URL: https://issues.apache.org/jira/browse/ARROW-17601 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Neal Richardson >Priority: Major > > Reproducer in R: > {code} > library(arrow) > library(dplyr) > tab <- Table$create(col1 = 1:4, col2 = 5:8) > tab <- tab$cast(schema(col1 = decimal128(33, 4), col2 = decimal128(15, 2))) > tab %>% mutate(col1 * col2) > # Error: Invalid: Decimal precision out of range [1, 38]: 49 > # /Users/me/arrow/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc:1078 > DecimalType::Make(left_type.id(), precision, scale) > # /Users/me/arrow/cpp/src/arrow/compute/exec/expression.cc:413 > call.kernel->signature->out_type().Resolve(_context, types) > {code} > With integers and floats, we upcast to a wider size in some compute functions > like multiplication. Should this go up to Decimal256? Or is there a better > way to determine the size required? -- This message was sent by Atlassian Jira (v8.20.10#820010)
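The reported precision 49 is consistent with the common decimal multiplication rule (result precision = p1 + p2 + 1), which overflows Decimal128's maximum of 38 for the types in the reproducer. A quick arithmetic check — the rule here is inferred from the error message and usual SQL-style decimal semantics, not quoted from Arrow's source:

```python
# Assumed promotion rule for decimal multiplication: p = p1 + p2 + 1
DECIMAL128_MAX_PRECISION = 38

def multiply_result_precision(p1, p2):
    """Combined precision of multiplying decimal(p1, s1) * decimal(p2, s2)."""
    return p1 + p2 + 1

# decimal128(33, 4) * decimal128(15, 2) from the reproducer
p = multiply_result_precision(33, 15)
print(p, p > DECIMAL128_MAX_PRECISION)  # 49 True — matches "out of range [1, 38]: 49"
```

This is why the ticket asks whether the result should be promoted to Decimal256 (which allows precision up to 76) rather than erroring.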
[jira] [Resolved] (ARROW-17492) [C++] Hashing32/64 support for large var-binary types
[ https://issues.apache.org/jira/browse/ARROW-17492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace resolved ARROW-17492. - Fix Version/s: 10.0.0 Resolution: Fixed Issue resolved by pull request 13940 [https://github.com/apache/arrow/pull/13940] > [C++] Hashing32/64 support for large var-binary types > - > > Key: ARROW-17492 > URL: https://issues.apache.org/jira/browse/ARROW-17492 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Yaron Gvili >Assignee: Yaron Gvili >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Currently, Hashing32/64 only supports non-large var-binary types. This issue > will add support for large var-binary types. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17593) [C++] Try and maintain input shape in Acero
[ https://issues.apache.org/jira/browse/ARROW-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599684#comment-17599684 ] Aldrin Montana commented on ARROW-17593: Nice. Since only the final projection would need to worry about the max_batch_size, not the internal plan nodes (I think), this sounds like there wouldn't be too much cost either. I'm mostly curious how efficiently aggregates can size their outputs, since they won't know exactly how many groups there will be until runtime. But, I also guess that would only be inefficient in the last batch, so it would be amortized for larger resultsets. > [C++] Try and maintain input shape in Acero > --- > > Key: ARROW-17593 > URL: https://issues.apache.org/jira/browse/ARROW-17593 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Weston Pace >Priority: Major > > Data is scanned in large chunks based on the format. For example, CSV scans > chunks based on a chunk_size while parquet scans entire row groups. > Then, upon entry into Acero, these chunks are sliced into morsels (~L3 size) > for parallelism and batches (~L1-L2 size) for cache efficient processing. > However, the way it is currently done, means that the output of Acero is a > stream of tiny batches. This is somewhat undesirable in many cases. > For example, if a pyarrow user calls pq.read_table they might expect to get > one batch per row group. If they were to turn around and write out that > table to a new parquet file then either they end up with a non-ideal parquet > file (tiny row groups) or they are forced to concatenate the batches (which > is an allocation + copy). > Even if the user is doing their own streaming processing (e.g. in pyarrow) > these small batch sizes are undesirable as the overhead of python means that > streaming processing should be done in larger batches. 
> Instead, there should be a max_batch_size, independent of row > group size and morsel size, which is configurable and quite large by default > (1Mi or 64Mi rows). This control exists for users that want to do their own > streaming processing and need to be able to tune for RAM usage. > Acero will read in data based on the format, as it does today (e.g. CSV chunk > size, row group size). If the source data is very large (bigger than > max_batch_size) it will be sliced. From that point on, any morsels or > batches should simply be views into this larger output batch. For example, > when doing a projection to add a new column, we should allocate a > max_batch_size array and then populate it over many runs of the project node. -- This message was sent by Atlassian Jira (v8.20.10#820010)
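The slicing the proposal describes — large scanned chunks cut into batches no bigger than max_batch_size, each a view of the parent — can be sketched in a few lines. This is an illustration of the shape of the idea, not Acero's API; a plain Python list stands in for a record batch (in Arrow the slices would be zero-copy views):

```python
def slice_into_batches(chunk, max_batch_size):
    """Yield consecutive slices of `chunk`, each at most max_batch_size
    rows long. Only the final slice may be shorter."""
    for start in range(0, len(chunk), max_batch_size):
        yield chunk[start:start + max_batch_size]

# A "row group" of 10 rows sliced with max_batch_size=4
batches = list(slice_into_batches(list(range(10)), 4))
print([len(b) for b in batches])  # [4, 4, 2]
```

As the comment above notes, only the last batch falls short of the maximum, so any inefficiency from undersized batches is amortized over larger result sets.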
[jira] [Commented] (ARROW-17598) [C++] Arrow io memory_benchmark does not compile if SIMD level is NEON
[ https://issues.apache.org/jira/browse/ARROW-17598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599671#comment-17599671 ] Aldrin Montana commented on ARROW-17598: sounds good [~yibocai]. I wasn't sure if it would need any more discussion (which is why I made a JIRA instead of submitting a MINOR PR). I opened a PR, but I don't think the CI will exercise it, so I'm not sure what to do about that. Let me know if you'd like any other follow up! > [C++] Arrow io memory_benchmark does not compile if SIMD level is NEON > -- > > Key: ARROW-17598 > URL: https://issues.apache.org/jira/browse/ARROW-17598 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Aldrin Montana >Assignee: Aldrin Montana >Priority: Minor > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > When compiling Arrow with the following parameter: > {code:bash} > -DARROW_BUILD_BENCHMARKS_REFERENCE="ON" > {code} > I get this error and other similar ones: > {code:bash} > cpp/src/arrow/io/memory_benchmark.cc:244:26: error: use of undeclared > identifier 'Read'; did you mean 'read'? > using ApplyFn = decltype(Read); > {code} > > The reason is that `memory_benchmark.cc` does not define the Read function, > and others, if the SIMD level is NEON (`ARROW_HAVE_NEON` is true). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17598) [C++] Arrow io memory_benchmark does not compile if SIMD level is NEON
[ https://issues.apache.org/jira/browse/ARROW-17598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17598: --- Labels: pull-request-available (was: ) > [C++] Arrow io memory_benchmark does not compile if SIMD level is NEON > -- > > Key: ARROW-17598 > URL: https://issues.apache.org/jira/browse/ARROW-17598 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Aldrin Montana >Assignee: Aldrin Montana >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > When compiling Arrow with the following parameter: > {code:bash} > -DARROW_BUILD_BENCHMARKS_REFERENCE="ON" > {code} > I get this error and other similar ones: > {code:bash} > cpp/src/arrow/io/memory_benchmark.cc:244:26: error: use of undeclared > identifier 'Read'; did you mean 'read'? > using ApplyFn = decltype(Read); > {code} > > The reason is that `memory_benchmark.cc` does not define the Read function, > and others, if the SIMD level is NEON (`ARROW_HAVE_NEON` is true). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17598) [C++] Arrow io memory_benchmark does not compile if SIMD level is NEON
[ https://issues.apache.org/jira/browse/ARROW-17598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aldrin Montana reassigned ARROW-17598: -- Assignee: Aldrin Montana > [C++] Arrow io memory_benchmark does not compile if SIMD level is NEON > -- > > Key: ARROW-17598 > URL: https://issues.apache.org/jira/browse/ARROW-17598 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Aldrin Montana >Assignee: Aldrin Montana >Priority: Minor > > When compiling Arrow with the following parameter: > {code:bash} > -DARROW_BUILD_BENCHMARKS_REFERENCE="ON" > {code} > I get this error and other similar ones: > {code:bash} > cpp/src/arrow/io/memory_benchmark.cc:244:26: error: use of undeclared > identifier 'Read'; did you mean 'read'? > using ApplyFn = decltype(Read); > {code} > > The reason is that `memory_benchmark.cc` does not define the Read function, > and others, if the SIMD level is NEON (`ARROW_HAVE_NEON` is true). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16481) [Python] Remove the test usage of legacy dataset
[ https://issues.apache.org/jira/browse/ARROW-16481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16481: --- Labels: pull-request-available (was: ) > [Python] Remove the test usage of legacy dataset > > > Key: ARROW-16481 > URL: https://issues.apache.org/jira/browse/ARROW-16481 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Alenka Frim >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Currently the parquet tests cover both the new and the legacy dataset > implementations. The usage of the legacy dataset can be removed from the > tests as the legacy ParquetDataset is being deprecated. > See https://issues.apache.org/jira/browse/ARROW-16119. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17519) [R] RTools35 job is failing
[ https://issues.apache.org/jira/browse/ARROW-17519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17519: --- Labels: pull-request-available (was: ) > [R] RTools35 job is failing > --- > > Key: ARROW-17519 > URL: https://issues.apache.org/jira/browse/ARROW-17519 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dewey Dunnington >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > After ARROW-17436, the RTools35 job is consistently failing with: > {noformat} > Error: Error: package or namespace load failed for 'arrow' in inDL(x, > as.logical(local), as.logical(now), ...): > unable to load shared object > 'D:/a/arrow/arrow/r/check/arrow.Rcheck/00LOCK-arrow/00new/arrow/libs/i386/arrow.dll': > LoadLibrary failure: A dynamic link library (DLL) initialization routine > failed. > {noformat} > Given that there is a mailing list discussion about dropping support for that > platform, should we disable the check? Or wait until that is resolved to > disable the check? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17519) [R] RTools35 job is failing
[ https://issues.apache.org/jira/browse/ARROW-17519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dewey Dunnington reassigned ARROW-17519: Assignee: Dewey Dunnington > [R] RTools35 job is failing > --- > > Key: ARROW-17519 > URL: https://issues.apache.org/jira/browse/ARROW-17519 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dewey Dunnington >Assignee: Dewey Dunnington >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > After ARROW-17436, the RTools35 job is consistently failing with: > {noformat} > Error: Error: package or namespace load failed for 'arrow' in inDL(x, > as.logical(local), as.logical(now), ...): > unable to load shared object > 'D:/a/arrow/arrow/r/check/arrow.Rcheck/00LOCK-arrow/00new/arrow/libs/i386/arrow.dll': > LoadLibrary failure: A dynamic link library (DLL) initialization routine > failed. > {noformat} > Given that there is a mailing list discussion about dropping support for that > platform, should we disable the check? Or wait until that is resolved to > disable the check? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17603) [C++][FlightRPC] Print build logs if gRPC TlsCredentialsOptions detection fails
[ https://issues.apache.org/jira/browse/ARROW-17603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17603: --- Labels: pull-request-available (was: ) > [C++][FlightRPC] Print build logs if gRPC TlsCredentialsOptions detection > fails > --- > > Key: ARROW-17603 > URL: https://issues.apache.org/jira/browse/ARROW-17603 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, FlightRPC >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Make it easier to debug build failures in CI. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17603) [C++][FlightRPC] Print build logs if gRPC TlsCredentialsOptions detection fails
[ https://issues.apache.org/jira/browse/ARROW-17603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li reassigned ARROW-17603: Assignee: David Li > [C++][FlightRPC] Print build logs if gRPC TlsCredentialsOptions detection > fails > --- > > Key: ARROW-17603 > URL: https://issues.apache.org/jira/browse/ARROW-17603 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, FlightRPC >Reporter: David Li >Assignee: David Li >Priority: Major > > Make it easier to debug build failures in CI. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17603) [C++][FlightRPC] Print build logs if gRPC TlsCredentialsOptions detection fails
David Li created ARROW-17603: Summary: [C++][FlightRPC] Print build logs if gRPC TlsCredentialsOptions detection fails Key: ARROW-17603 URL: https://issues.apache.org/jira/browse/ARROW-17603 Project: Apache Arrow Issue Type: Improvement Components: C++, FlightRPC Reporter: David Li Make it easier to debug build failures in CI. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-11335) [C++] Support FixedSizeList JSON parsing
[ https://issues.apache.org/jira/browse/ARROW-11335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599626#comment-17599626 ] Antoine Pitrou commented on ARROW-11335: cc [~benpharkins] > [C++] Support FixedSizeList JSON parsing > > > Key: ARROW-11335 > URL: https://issues.apache.org/jira/browse/ARROW-11335 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Johan Peltenburg >Priority: Major > Labels: good-first-issue > > Currently, the C++ JSON table reader does not support the Arrow FixedSizeList > type. > When the Arrow schema is not inferred but supplied to the parser, it should > attempt to turn JSON arrays into an Arrow FixedSizeList if the schema > dictates this for specific fields. > This currently results in the JSON table reader returning an error, e.g.: > {code:java} > JSON conversion to fixed_size_list[12] is not > supported{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-4709) [C++] Optimize for ordered JSON fields
[ https://issues.apache.org/jira/browse/ARROW-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599627#comment-17599627 ] Antoine Pitrou commented on ARROW-4709: --- cc [~benpharkins] if you'd like to tackle a performance issue at some point. > [C++] Optimize for ordered JSON fields > -- > > Key: ARROW-4709 > URL: https://issues.apache.org/jira/browse/ARROW-4709 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Ben Kietzman >Priority: Minor > Labels: good-second-issue > > Fields appear consistently ordered in most JSON data in the wild, but the > JSON parser currently looks fields up in a hash table. The ordering can > probably be exploited to yield better performance when looking up field > indices -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-4709) [C++] Optimize for ordered JSON fields
[ https://issues.apache.org/jira/browse/ARROW-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-4709: -- Labels: good-second-issue (was: ) > [C++] Optimize for ordered JSON fields > -- > > Key: ARROW-4709 > URL: https://issues.apache.org/jira/browse/ARROW-4709 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Ben Kietzman >Priority: Minor > Labels: good-second-issue > > Fields appear consistently ordered in most JSON data in the wild, but the > JSON parser currently looks fields up in a hash table. The ordering can > probably be exploited to yield better performance when looking up field > indices -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-17602) [R] RTools35/R 3.6 CI job is failing
[ https://issues.apache.org/jira/browse/ARROW-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dewey Dunnington closed ARROW-17602. Resolution: Duplicate > [R] RTools35/R 3.6 CI job is failing > > > Key: ARROW-17602 > URL: https://issues.apache.org/jira/browse/ARROW-17602 > Project: Apache Arrow > Issue Type: Bug >Reporter: Dewey Dunnington >Priority: Major > > After ARROW-17436, the Windows/R 3.6 job is consistently failing. We're about > to drop support for R 3.6 with the move to C++17, so instead of debugging > this failure I think we should just disable this check. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-11335) [C++] Support FixedSizeList JSON parsing
[ https://issues.apache.org/jira/browse/ARROW-11335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-11335: --- Priority: Major (was: Minor) > [C++] Support FixedSizeList JSON parsing > > > Key: ARROW-11335 > URL: https://issues.apache.org/jira/browse/ARROW-11335 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Johan Peltenburg >Priority: Major > Labels: good-first-issue > > Currently, the C++ JSON table reader does not support the Arrow FixedSizeList > type. > When the Arrow schema is not inferred but supplied to the parser, it should > attempt to turn JSON arrays into an Arrow FixedSizeList if the schema > dictates this for specific fields. > This currently results in the JSON table reader returning an error, e.g.: > {code:java} > JSON conversion to fixed_size_list[12] is not > supported{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17602) [R] RTools35/R 3.6 CI job is failing
[ https://issues.apache.org/jira/browse/ARROW-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599625#comment-17599625 ] Dewey Dunnington commented on ARROW-17602: -- (Facepalm...I forgot I created that!) > [R] RTools35/R 3.6 CI job is failing > > > Key: ARROW-17602 > URL: https://issues.apache.org/jira/browse/ARROW-17602 > Project: Apache Arrow > Issue Type: Bug >Reporter: Dewey Dunnington >Priority: Major > > After ARROW-17436, the Windows/R 3.6 job is consistently failing. We're about > to drop support for R 3.6 with the move to C++17, so instead of debugging > this failure I think we should just disable this check. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16226) [C++] Add better coverage for filesystem tell.
[ https://issues.apache.org/jira/browse/ARROW-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-16226: --- Labels: good-first-issue (was: ) > [C++] Add better coverage for filesystem tell. > -- > > Key: ARROW-16226 > URL: https://issues.apache.org/jira/browse/ARROW-16226 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Micah Kornfield >Priority: Major > Labels: good-first-issue > > Add a C++ generic file system test that writes N bytes to a file, then > seeks to N/2 and reads the remainder. Verify the remainder is N/2 bytes > and matches the bytes written. -- This message was sent by Atlassian Jira (v8.20.10#820010)
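The test the ticket describes — write N bytes, seek to N/2, read the rest, and check the position and contents — can be sketched with Python's in-memory file API standing in for Arrow's generic C++ filesystem test. The helper name is hypothetical, not Arrow code:

```python
import io

def check_seek_and_remainder(data: bytes) -> bool:
    """Write len(data) bytes, seek to the midpoint, and verify that
    tell() reports the midpoint and the remainder matches the tail
    of what was written."""
    n = len(data)
    f = io.BytesIO()
    f.write(data)
    f.seek(n // 2)
    assert f.tell() == n // 2            # position after seeking to N/2
    remainder = f.read()
    return len(remainder) == n - n // 2 and remainder == data[n // 2:]

print(check_seek_and_remainder(b"0123456789abcdef"))  # True
```

The Arrow version would run the same sequence through each registered filesystem implementation's input/output streams rather than `io.BytesIO`.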
[jira] [Commented] (ARROW-6772) [C++] Add operator== for interfaces with an Equals() method
[ https://issues.apache.org/jira/browse/ARROW-6772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599618#comment-17599618 ] Antoine Pitrou commented on ARROW-6772: --- I think it would be nice to add at least _basic_ tests to ensure that the operators properly redirect. No need to duplicate all tests though. > [C++] Add operator== for interfaces with an Equals() method > --- > > Key: ARROW-6772 > URL: https://issues.apache.org/jira/browse/ARROW-6772 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ben Kietzman >Assignee: Ben Harkins >Priority: Major > Labels: good-first-issue > > A common pattern in tests is {{ASSERT_TRUE(schm->Equals(*other))}}. The > addition of overloaded equality operators will allow this to be written > {{ASSERT_EQ(*schm, *other)}}, which is more idiomatic GTEST usage and will > allow more informative assertion failure messages. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-15479) [C++] Cast fixed size list to compatible fixed size list type (other values type, other field name)
[ https://issues.apache.org/jira/browse/ARROW-15479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-15479: --- Labels: good-second-issue kernel (was: kernel) > [C++] Cast fixed size list to compatible fixed size list type (other values > type, other field name) > --- > > Key: ARROW-15479 > URL: https://issues.apache.org/jira/browse/ARROW-15479 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Priority: Major > Labels: good-second-issue, kernel > > Casting a FixedSizeListArray to a compatible type but only a different field > name isn't implemented: > {code:python} > >>> my_type = pa.list_(pa.field("element", pa.int64()), 2) > >>> arr = pa.FixedSizeListArray.from_arrays(pa.array([1, 2, 3, 4, 5, 6]), 2) > >>> arr.type > FixedSizeListType(fixed_size_list<item: int64>[2]) > >>> my_type > FixedSizeListType(fixed_size_list<element: int64>[2]) > >>> arr.cast(my_type) > ... > ArrowNotImplementedError: Unsupported cast from fixed_size_list<item: > int64>[2] to fixed_size_list<element: int64>[2] using function > cast_fixed_size_list > {code} > While the similar operation with a variable sized list actually works: > {code:python} > >>> my_type = pa.list_(pa.field("element", pa.int64())) > >>> arr = pa.array([[1, 2], [3, 4]], pa.list_(pa.int64())) > >>> arr.type > ListType(list<item: int64>) > >>> arr.cast(my_type).type > ListType(list<element: int64>) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-6772) [C++] Add operator== for interfaces with an Equals() method
[ https://issues.apache.org/jira/browse/ARROW-6772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599617#comment-17599617 ] Ben Harkins commented on ARROW-6772: I'm currently working on this one - planning on adding {{util::EqualityComparable}} to {{DataType}}, {{Field}}, and {{FieldRef}}. Should additional comparison tests be added (in addition to AssertXXXEqual) to type_test.cc or would that be considered redundant? > [C++] Add operator== for interfaces with an Equals() method > --- > > Key: ARROW-6772 > URL: https://issues.apache.org/jira/browse/ARROW-6772 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ben Kietzman >Assignee: Ben Harkins >Priority: Major > Labels: good-first-issue > > A common pattern in tests is {{ASSERT_TRUE(schm->Equals(*other))}}. The > addition of overloaded equality operators will allow this to be written > {{ASSERT_EQ(*schm, *other)}}, which is more idiomatic GTEST usage and will > allow more informative assertion failure messages. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-16326) [C++][Python] Add GCS Timeout parameter for GCS FileSystem.
[ https://issues.apache.org/jira/browse/ARROW-16326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield reassigned ARROW-16326: --- Assignee: Micah Kornfield > [C++][Python] Add GCS Timeout parameter for GCS FileSystem. > --- > > Key: ARROW-16326 > URL: https://issues.apache.org/jira/browse/ARROW-16326 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Labels: good-first-issue, good-second-issue > > Follow-up from [https://github.com/apache/arrow/pull/12763]: if the gcs testbench > isn't installed properly, the failure mode is test timeouts because the > connection hangs. We should add a timeout parameter to prevent this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-16326) [C++][Python] Add GCS Timeout parameter for GCS FileSystem.
[ https://issues.apache.org/jira/browse/ARROW-16326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599613#comment-17599613 ] Micah Kornfield commented on ARROW-16326: - This was actually done in the PR: "retry_limit_seconds" can be passed through the URI, and if no connection is established within that time it will fail. > [C++][Python] Add GCS Timeout parameter for GCS FileSystem. > --- > > Key: ARROW-16326 > URL: https://issues.apache.org/jira/browse/ARROW-16326 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Micah Kornfield >Priority: Major > Labels: good-first-issue, good-second-issue > > Follow-up from [https://github.com/apache/arrow/pull/12763]: if the gcs testbench > isn't installed properly, the failure mode is test timeouts because the > connection hangs. We should add a timeout parameter to prevent this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
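The comment above says "retry_limit_seconds" can be passed through the filesystem URI. A minimal sketch of what such a URI looks like and how its query string decomposes — the bucket name and path are hypothetical, and whether a given pyarrow build honors the option depends on version, so treat this only as an illustration of the URI shape:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical bucket/path; retry_limit_seconds bounds how long the GCS
# client keeps retrying before the connection attempt is reported as failed.
uri = "gs://my-bucket/path/to/data?retry_limit_seconds=15"

parsed = urlparse(uri)
options = parse_qs(parsed.query)

print(parsed.netloc)                       # bucket the filesystem would target
print(options["retry_limit_seconds"][0])   # value handed to the GCS client
```

With the option set, a misconfigured or missing gcs testbench should surface as a quick connection failure rather than a hung test.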
[jira] [Resolved] (ARROW-16326) [C++][Python] Add GCS Timeout parameter for GCS FileSystem.
[ https://issues.apache.org/jira/browse/ARROW-16326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-16326. - Resolution: Fixed > [C++][Python] Add GCS Timeout parameter for GCS FileSystem. > --- > > Key: ARROW-16326 > URL: https://issues.apache.org/jira/browse/ARROW-16326 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Labels: good-first-issue, good-second-issue > > Follow-up from [https://github.com/apache/arrow/pull/12763]: if the gcs testbench > isn't installed properly, the failure mode is test timeouts because the > connection hangs. We should add a timeout parameter to prevent this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17033) [C++] Add GCS connection pool size option
[ https://issues.apache.org/jira/browse/ARROW-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599612#comment-17599612 ] Antoine Pitrou commented on ARROW-17033: cc [~benpharkins] > [C++] Add GCS connection pool size option > - > > Key: ARROW-17033 > URL: https://issues.apache.org/jira/browse/ARROW-17033 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 8.0.0 >Reporter: Leonhard Gruenschloss >Priority: Minor > Labels: GCP, good-first-issue, performance > > Multi-threaded read performance in Arrow's GCS file system implementation > currently is relatively low. Given the high latency of cloud blob systems > like GCS, a common strategy is to use many concurrent readers (if the system > has enough memory to support that), e.g. using 100 threads. > The GCS client library offers a [{{ConnectionPoolSize}} > option|https://googleapis.dev/cpp/google-cloud-storage/latest/structgoogle_1_1cloud_1_1storage_1_1v1_1_1ConnectionPoolSizeOption.html]. > If this option is set to a value that's too low, concurrency is throttled. > At the moment, this is not exposed in > [{{GcsOptions}}|https://github.com/apache/arrow/blob/73cdd6a59b52781cc43e097ccd63ac36f705ee2e/cpp/src/arrow/filesystem/gcsfs.h#L59], > consequently limiting multi-threaded throughput. > Instead of exposing this option, an alternative implementation strategy could > be to use the same value as set by {{arrow::io::SetIOThreadPoolCapacity}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-8201) [Python][Dataset] Improve ergonomics of FileFragment
[ https://issues.apache.org/jira/browse/ARROW-8201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599610#comment-17599610 ] Antoine Pitrou edited comment on ARROW-8201 at 9/2/22 4:41 PM: --- [~milesgranger] Perhaps you would be interested in finding out whether this issue still applies, and if so, to come up with a PR? was (Author: pitrou): [~milesgranger] Perhaps you would be interested whether this issue still applies, and if so, to come up with a PR? > [Python][Dataset] Improve ergonomics of FileFragment > > > Key: ARROW-8201 > URL: https://issues.apache.org/jira/browse/ARROW-8201 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.16.0 >Reporter: Ben Kietzman >Priority: Major > Labels: dataset > > FileFragment can be made more directly useful by adding convenience methods. > For example, a FileFragment could allow underlying file/buffer to be opened > directly: > {code} > def open(self): > """ > Open a NativeFile of the buffer or file viewed by this fragment. > """ > cdef: > CFileSystem* c_filesystem > shared_ptr[CRandomAccessFile] opened > NativeFile out = NativeFile() > buf = self.buffer > if buf is not None: > return pa.io.BufferReader(buf) > with nogil: > c_filesystem = self.file_fragment.source().filesystem() > opened = GetResultValue(c_filesystem.OpenInputFile( > self.file_fragment.source().path())) > out.set_random_access_file(opened) > out.is_readable = True > return out > {code} > Additionally, a ParquetFileFragment's metadata could be introspectable: > {code} > @property > def metadata(self): > from pyarrow._parquet import ParquetReader > reader = ParquetReader() > reader.open(self.open()) > return reader.metadata > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-8201) [Python][Dataset] Improve ergonomics of FileFragment
[ https://issues.apache.org/jira/browse/ARROW-8201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599610#comment-17599610 ] Antoine Pitrou commented on ARROW-8201: --- [~milesgranger] Perhaps you would be interested whether this issue still applies, and if so, to come up with a PR? > [Python][Dataset] Improve ergonomics of FileFragment > > > Key: ARROW-8201 > URL: https://issues.apache.org/jira/browse/ARROW-8201 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.16.0 >Reporter: Ben Kietzman >Priority: Major > Labels: dataset > > FileFragment can be made more directly useful by adding convenience methods. > For example, a FileFragment could allow underlying file/buffer to be opened > directly: > {code} > def open(self): > """ > Open a NativeFile of the buffer or file viewed by this fragment. > """ > cdef: > CFileSystem* c_filesystem > shared_ptr[CRandomAccessFile] opened > NativeFile out = NativeFile() > buf = self.buffer > if buf is not None: > return pa.io.BufferReader(buf) > with nogil: > c_filesystem = self.file_fragment.source().filesystem() > opened = GetResultValue(c_filesystem.OpenInputFile( > self.file_fragment.source().path())) > out.set_random_access_file(opened) > out.is_readable = True > return out > {code} > Additionally, a ParquetFileFragment's metadata could be introspectable: > {code} > @property > def metadata(self): > from pyarrow._parquet import ParquetReader > reader = ParquetReader() > reader.open(self.open()) > return reader.metadata > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17602) [R] RTools35/R 3.6 CI job is failing
[ https://issues.apache.org/jira/browse/ARROW-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599609#comment-17599609 ] Neal Richardson commented on ARROW-17602: - Duplicate of ARROW-17519? > [R] RTools35/R 3.6 CI job is failing > > > Key: ARROW-17602 > URL: https://issues.apache.org/jira/browse/ARROW-17602 > Project: Apache Arrow > Issue Type: Bug >Reporter: Dewey Dunnington >Priority: Major > > After ARROW-17436, the Windows/R 3.6 job is consistently failing. We're about > to drop support for R 3.6 with the move to C++17, so instead of debugging > this failure I think we should just disable this check. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16753) [C++] LocalFileSystem cannot list Linux directory recursively when permission to subdirectory contents are denied
[ https://issues.apache.org/jira/browse/ARROW-16753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-16753: --- Labels: good-second-issue pull-request-available (was: pull-request-available) > [C++] LocalFileSystem cannot list Linux directory recursively when permission > to subdirectory contents are denied > - > > Key: ARROW-16753 > URL: https://issues.apache.org/jira/browse/ARROW-16753 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 6.0.1 > Environment: Ubuntu 20.04 LTS >Reporter: David Rauschenbach >Assignee: David Rauschenbach >Priority: Major > Labels: good-second-issue, pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > The following code to list my root directory fails: > > {code:java} > FileSelector file_selector; > file_selector.base_dir = "/"; > file_selector.allow_not_found = true; > file_selector.recursive = true; > auto result = fs.GetFileInfo(file_selector);{code} > The result.ok() value returns {+}false{+}, and then result.status().message() > returns {+}Cannot list directory '/var/run/wpa_supplicant'{+}. > An examination of the /run directory (which /var/run symlinks to) shows: > > {code:java} > $ ls -al /run > drwxr-xr-x 35 root root 1040 Jun 6 06:11 . > drwxr-xr-x 20 root root 4096 May 20 12:42 .. > ... > drwxr-x--- 2 root root 60 Jun 4 12:14 wpa_supplicant{code} > And then attempting to list this directory reveals: > > {code:java} > $ ls -al /run/wpa_supplicant/ > ls: cannot open directory '/run/wpa_supplicant/': Permission denied{code} > > As a user of LocalFileSystem, I should be able to list all of the files that > I have access to. > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
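The behavior the reporter asks for — keep listing everything that is readable and handle the directories that are not, instead of failing the whole recursive request — is the pattern Python's stdlib `os.walk` supports through its `onerror` callback. This is only an analogy sketch of that recovery pattern, not Arrow's C++ implementation:

```python
import os

def list_all_accessible(base_dir):
    """Recursively list files, collecting unreadable subdirectories
    instead of aborting the whole traversal."""
    files, denied = [], []
    # onerror receives the OSError for a directory that cannot be opened
    # (e.g. permission denied); os.walk records it via the callback and
    # keeps walking the rest of the tree.
    for root, _dirs, names in os.walk(base_dir, onerror=denied.append):
        files.extend(os.path.join(root, name) for name in names)
    return files, denied
```

A recursive `FileSelector` listing could analogously report (or skip) per-directory errors such as `/run/wpa_supplicant` while still returning the accessible entries.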
[jira] [Updated] (ARROW-16746) [C++][Python] S3 tag support on write
[ https://issues.apache.org/jira/browse/ARROW-16746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-16746: --- Labels: good-second-issue (was: ) > [C++][Python] S3 tag support on write > - > > Key: ARROW-16746 > URL: https://issues.apache.org/jira/browse/ARROW-16746 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: André Kelpe >Priority: Major > Labels: good-second-issue > > S3 allows tagging objects to better organize one's data > ([https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-tagging.html)] > We use this for efficient downstream processes/inventory management. > Currently arrow/pyarrow does not allow tags to be added on write. This is > causing us to scan the bucket and re-apply the tags after a pyarrow-based > process has run. > I looked through the code and think that it could potentially be done via the > metadata mechanism. > The tags need to be added to the CreateMultipartUploadRequest here: > https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3fs.cc#L1156 > See also > http://sdk.amazonaws.com/cpp/api/LATEST/class_aws_1_1_s3_1_1_model_1_1_create_multipart_upload_request.html#af791f34a65dc69bd681d6995313be2da -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-16326) [C++][Python] Add GCS Timeout parameter for GCS FileSystem.
[ https://issues.apache.org/jira/browse/ARROW-16326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-16326: -- Assignee: (was: Micah Kornfield) > [C++][Python] Add GCS Timeout parameter for GCS FileSystem. > --- > > Key: ARROW-16326 > URL: https://issues.apache.org/jira/browse/ARROW-16326 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Micah Kornfield >Priority: Major > Labels: good-first-issue, good-second-issue > > Follow-up from [https://github.com/apache/arrow/pull/12763]: if the gcs testbench > isn't installed properly, the failure mode is test timeouts because the > connection hangs. We should add a timeout parameter to prevent this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-16326) [C++][Python] Add GCS Timeout parameter for GCS FileSystem.
[ https://issues.apache.org/jira/browse/ARROW-16326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599607#comment-17599607 ] Antoine Pitrou commented on ARROW-16326: cc [~benpharkins] > [C++][Python] Add GCS Timeout parameter for GCS FileSystem. > --- > > Key: ARROW-16326 > URL: https://issues.apache.org/jira/browse/ARROW-16326 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > > Follow-up from [https://github.com/apache/arrow/pull/12763]: if the gcs testbench > isn't installed properly, the failure mode is test timeouts because the > connection hangs. We should add a timeout parameter to prevent this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16326) [C++][Python] Add GCS Timeout parameter for GCS FileSystem.
[ https://issues.apache.org/jira/browse/ARROW-16326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-16326: --- Labels: good-first-issue good-second-issue (was: ) > [C++][Python] Add GCS Timeout parameter for GCS FileSystem. > --- > > Key: ARROW-16326 > URL: https://issues.apache.org/jira/browse/ARROW-16326 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Labels: good-first-issue, good-second-issue > > Follow-up from [https://github.com/apache/arrow/pull/12763]: if the gcs testbench > isn't installed properly, the failure mode is test timeouts because the > connection hangs. We should add a timeout parameter to prevent this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-16226) [C++] Add better coverage for filesystem tell.
[ https://issues.apache.org/jira/browse/ARROW-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599605#comment-17599605 ] Antoine Pitrou commented on ARROW-16226: cc [~benpharkins] > [C++] Add better coverage for filesystem tell. > -- > > Key: ARROW-16226 > URL: https://issues.apache.org/jira/browse/ARROW-16226 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Micah Kornfield >Priority: Major > > Add a generic C++ file system test that writes N bytes to a file, then > seeks to N/2 and reads the remainder. Verify the remainder is N/2 bytes > and matches the bytes written. -- This message was sent by Atlassian Jira (v8.20.10#820010)
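The proposed generic test — write N bytes, seek to N/2, read the remainder, and check both length and content — can be sketched against any seekable stream. Here Python's in-memory `io.BytesIO` stands in for an Arrow filesystem output/input file, purely to illustrate the assertion pattern:

```python
import io
import os

def check_seek_read(n=1024):
    """Write n bytes, seek to n//2, and verify tell() and the remainder."""
    data = os.urandom(n)
    f = io.BytesIO(data)          # stand-in for a filesystem-opened file
    f.seek(n // 2)
    assert f.tell() == n // 2     # tell() must report the seek position
    remainder = f.read()
    assert len(remainder) == n - n // 2   # N/2 bytes (rounded up) remain
    assert remainder == data[n // 2:]     # and they match what was written
    return remainder

check_seek_read()
```

The same four assertions, expressed with Arrow's `OutputStream`/`RandomAccessFile` interfaces, would make up the generic C++ filesystem test.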
[jira] [Commented] (ARROW-17601) [C++] Error when creating Expression on Decimal128 types: precision out of range
[ https://issues.apache.org/jira/browse/ARROW-17601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599604#comment-17599604 ] Neal Richardson commented on ARROW-17601: - Here's a reproducer just using Expressions, no dplyr: {code} expr <- Expression$create("multiply_checked", Expression$field_ref("col1"), Expression$field_ref("col2")) expr # Expression # multiply_checked(col1, col2) expr$type(schema(col1=decimal128(33, 4), col2=decimal128(15, 2))) # Error: Invalid: Decimal precision out of range [1, 38]: 49 {code} Note that the different kernels have different promotion logic. Subtraction, for example, doesn't go out of range with these types. {code} expr <- Expression$create("subtract_checked", Expression$field_ref("col1"), Expression$field_ref("col2")) expr$type(schema(col1=decimal128(33, 4), col2=decimal128(15, 2))) # Decimal128Type # decimal128(34, 4) {code} > [C++] Error when creating Expression on Decimal128 types: precision out of > range > > > Key: ARROW-17601 > URL: https://issues.apache.org/jira/browse/ARROW-17601 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Neal Richardson >Priority: Major > > Reproducer in R: > {code} > library(arrow) > library(dplyr) > tab <- Table$create(col1 = 1:4, col2 = 5:8) > tab <- tab$cast(schema(col1 = decimal128(33, 4), col2 = decimal128(15, 2))) > tab %>% mutate(col1 * col2) > # Error: Invalid: Decimal precision out of range [1, 38]: 49 > # /Users/me/arrow/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc:1078 > DecimalType::Make(left_type.id(), precision, scale) > # /Users/me/arrow/cpp/src/arrow/compute/exec/expression.cc:413 > call.kernel->signature->out_type().Resolve(_context, types) > {code} > With integers and floats, we upcast to a wider size in some compute functions > like multiplication. Should this go up to Decimal256? Or is there a better > way to determine the size required? -- This message was sent by Atlassian Jira (v8.20.10#820010)
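The out-of-range precision follows directly from the usual decimal promotion rules. The formulas below are the standard SQL-style rules — multiplication needs `p1 + p2 + 1` digits with scale `s1 + s2`, subtraction needs `max(s1, s2) + max(p1 - s1, p2 - s2) + 1` — which appear to match the numbers in the reproducer, though the exact rules Arrow applies live in its kernels. A quick check of why `(33, 4) * (15, 2)` overflows decimal128 while subtraction does not:

```python
DECIMAL128_MAX_PRECISION = 38

def multiply_result_type(p1, s1, p2, s2):
    # Product scale is the sum of the scales; precision must hold
    # p1 + p2 digits plus one carry digit.
    return p1 + p2 + 1, s1 + s2

def subtract_result_type(p1, s1, p2, s2):
    # Align scales, keep enough integer digits for either operand,
    # plus one carry digit.
    scale = max(s1, s2)
    return scale + max(p1 - s1, p2 - s2) + 1, scale

print(multiply_result_type(33, 4, 15, 2))   # precision 49 > 38: overflows
print(subtract_result_type(33, 4, 15, 2))   # decimal128(34, 4): fits
```

This reproduces both observations above: multiplication yields precision 49 (the value in the error message), and subtraction yields `decimal128(34, 4)`.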
[jira] [Created] (ARROW-17602) [R] RTools35/R 3.6 CI job is failing
Dewey Dunnington created ARROW-17602: Summary: [R] RTools35/R 3.6 CI job is failing Key: ARROW-17602 URL: https://issues.apache.org/jira/browse/ARROW-17602 Project: Apache Arrow Issue Type: Bug Reporter: Dewey Dunnington After ARROW-17436, the Windows/R 3.6 job is consistently failing. We're about to drop support for R 3.6 with the move to C++17, so instead of debugging this failure I think we should just disable this check. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17601) [C++] Error when creating Expression on Decimal128 types: precision out of range
Neal Richardson created ARROW-17601: --- Summary: [C++] Error when creating Expression on Decimal128 types: precision out of range Key: ARROW-17601 URL: https://issues.apache.org/jira/browse/ARROW-17601 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Neal Richardson Reproducer in R: {code} library(arrow) library(dplyr) tab <- Table$create(col1 = 1:4, col2 = 5:8) tab <- tab$cast(schema(col1 = decimal128(33, 4), col2 = decimal128(15, 2))) tab %>% mutate(col1 * col2) # Error: Invalid: Decimal precision out of range [1, 38]: 49 # /Users/me/arrow/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc:1078 DecimalType::Make(left_type.id(), precision, scale) # /Users/me/arrow/cpp/src/arrow/compute/exec/expression.cc:413 call.kernel->signature->out_type().Resolve(_context, types) {code} With integers and floats, we upcast to a wider size in some compute functions like multiplication. Should this go up to Decimal256? Or is there a better way to determine the size required? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17600) [Go] Implement Casting for Complex Types (List/Struct/etc.)
Matthew Topol created ARROW-17600: - Summary: [Go] Implement Casting for Complex Types (List/Struct/etc.) Key: ARROW-17600 URL: https://issues.apache.org/jira/browse/ARROW-17600 Project: Apache Arrow Issue Type: Sub-task Components: Go Reporter: Matthew Topol Assignee: Matthew Topol -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17588) [Go] Casting to BinaryLike types
[ https://issues.apache.org/jira/browse/ARROW-17588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol resolved ARROW-17588. --- Fix Version/s: 10.0.0 Resolution: Fixed Issue resolved by pull request 14027 [https://github.com/apache/arrow/pull/14027] > [Go] Casting to BinaryLike types > > > Key: ARROW-17588 > URL: https://issues.apache.org/jira/browse/ARROW-17588 > Project: Apache Arrow > Issue Type: Sub-task > Components: Go >Reporter: Matthew Topol >Assignee: Matthew Topol >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-6772) [C++] Add operator== for interfaces with an Equals() method
[ https://issues.apache.org/jira/browse/ARROW-6772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Harkins reassigned ARROW-6772: -- Assignee: Ben Harkins > [C++] Add operator== for interfaces with an Equals() method > --- > > Key: ARROW-6772 > URL: https://issues.apache.org/jira/browse/ARROW-6772 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ben Kietzman >Assignee: Ben Harkins >Priority: Major > Labels: good-first-issue > > A common pattern in tests is {{ASSERT_TRUE(schm->Equals(*other))}}. The > addition of overloaded equality operators will allow this to be written > {{ASSERT_EQ(*schm, *other)}}, which is more idiomatic GTEST usage and will > allow more informative assertion failure messages. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17595) [C++] Installation Error stdlib.h no such file or directory
[ https://issues.apache.org/jira/browse/ARROW-17595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599511#comment-17599511 ] Robert Tidwell commented on ARROW-17595: dnf install gcc-toolset-11 Here are the packages that I have installed:
gcc-toolset-11.x86_64 11.1-1.el8 @acx-appstream
gcc-toolset-11-annobin-docs.noarch 10.23-1.el8 @acx-appstream
gcc-toolset-11-annobin-plugin-gcc.x86_64 10.23-1.el8 @acx-appstream
gcc-toolset-11-binutils.x86_64 2.36.1-2.0.1.el8 @acx-appstream
gcc-toolset-11-dwz.x86_64 0.14-2.el8 @acx-appstream
gcc-toolset-11-dyninst.x86_64 11.0.0-2.el8 @acx-appstream
gcc-toolset-11-elfutils.x86_64 0.185-5.el8 @acx-appstream
gcc-toolset-11-elfutils-debuginfod-client.x86_64 0.185-5.el8 @acx-appstream
gcc-toolset-11-elfutils-libelf.x86_64 0.185-5.el8 @acx-appstream
gcc-toolset-11-elfutils-libs.x86_64 0.185-5.el8 @acx-appstream
gcc-toolset-11-gcc.x86_64 11.2.1-9.1.0.2.el8 @acx-appstream
gcc-toolset-11-gcc-c++.x86_64 11.2.1-9.1.0.2.el8 @acx-appstream
gcc-toolset-11-gcc-gdb-plugin.x86_64 11.2.1-9.1.0.2.el8 @acx-appstream
gcc-toolset-11-gcc-gfortran.x86_64 11.2.1-9.1.0.2.el8 @acx-appstream
gcc-toolset-11-gcc-plugin-devel.x86_64 11.2.1-9.1.0.2.el8 @acx-appstream
gcc-toolset-11-gdb.x86_64 10.2-5.0.1.el8 @acx-appstream
gcc-toolset-11-libquadmath-devel.x86_64 11.2.1-9.1.0.2.el8 @acx-appstream
gcc-toolset-11-libstdc++-devel.x86_64 11.2.1-9.1.0.2.el8 @acx-appstream
gcc-toolset-11-ltrace.x86_64 0.7.91-1.el8 @acx-appstream
gcc-toolset-11-make.x86_64 1:4.3-2.el8 @acx-appstream
gcc-toolset-11-make-devel.x86_64 1:4.3-2.el8 @acx-appstream
gcc-toolset-11-perftools.x86_64 11.1-1.el8 @acx-appstream
gcc-toolset-11-runtime.x86_64 11.0-1.el8 @acx-appstream
gcc-toolset-11-strace.x86_64 5.13-7.el8 @acx-appstream
gcc-toolset-11-systemtap.x86_64 4.5-4.el8 @acx-appstream
gcc-toolset-11-systemtap-client.x86_64 4.5-4.el8 @acx-appstream
gcc-toolset-11-systemtap-devel.x86_64 4.5-4.el8 @acx-appstream
gcc-toolset-11-systemtap-runtime.x86_64 4.5-4.el8 @acx-appstream
gcc-toolset-11-toolchain.x86_64 11.1-1.el8 @acx-appstream
gcc-toolset-11-valgrind.x86_64 1:3.17.0-6.el8 @acx-appstream
> [C++] Installation Error stdlib.h no such file or directory > --- > > Key: ARROW-17595 > URL: https://issues.apache.org/jira/browse/ARROW-17595 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 9.0.0 > Environment: OEL 8.5 >Reporter: Robert Tidwell >Priority: Major > > Getting error installing arrow. > {noformat} > /opt/rh/gcc-toolset-11/root/usr/include/c++/11/cstdlib:75:15: fatal error: > stdlib.h: No such file or directory > 75 | #include_next <stdlib.h> > {noformat} > Using the gcc-toolset-11 C++ toolchain. Tried with the default C++ on the system and was > getting the same type of error. OS is OEL 8.5. > Here is the detailed output: > {noformat} > Installing package into ‘/usr/lib64/R/library’ > (as ‘lib’ is unspecified) > trying URL > 'http://**/prod/cran/src/contrib/arrow_9.0.0.tar.gz' > Content type 'application/octet-stream' length 4900968 bytes (4.7 MB) > == > downloaded 4.7 MB > * installing *source* package ‘arrow’ ... > ** package ‘arrow’ successfully unpacked and
[jira] [Updated] (ARROW-17595) [C++] Installation Error stdlib.h no such file or directory
[ https://issues.apache.org/jira/browse/ARROW-17595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Tidwell updated ARROW-17595: --- Description: Getting error installing arrow. {noformat} /opt/rh/gcc-toolset-11/root/usr/include/c++/11/cstdlib:75:15: fatal error: stdlib.h: No such file or directory 75 | #include_next {noformat} Using the c\+\+11 toolset. Tried with default c\+\+ on the system and was getting the same type error. OS is OEL 8.5. Here is the detailed output: {noformat} Installing package into ‘/usr/lib64/R/library’ (as ‘lib’ is unspecified) trying URL 'http://**/prod/cran/src/contrib/arrow_9.0.0.tar.gz' Content type 'application/octet-stream' length 4900968 bytes (4.7 MB) == downloaded 4.7 MB * installing *source* package ‘arrow’ ... ** package ‘arrow’ successfully unpacked and MD5 sums checked ** using staged installation *** Found local C++ source: 'tools/cpp' *** Building libarrow from source For build options and troubleshooting, see the install vignette: https://cran.r-project.org/web/packages/arrow/vignettes/install.html *** Building with MAKEFLAGS= -j2 cmake: /bin/cmake arrow with SOURCE_DIR='tools/cpp' BUILD_DIR='/tmp/rmt' DEST_DIR='libarrow/arrow-9.0.0' CMAKE='/bin/cmake' EXTRA_CMAKE_FLAGS='' CC='gcc -m64' CXX='g++ -m64 -std=gnu++11' LDFLAGS='-Wl,-z,relro -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld' ARROW_S3='OFF' ARROW_GCS='OFF' ARROW_MIMALLOC='OFF' ++ pwd + : /tmp/Rtmp6bSuH3/R.INSTALL40fe61c13650d/arrow + : tools/cpp + : /tmp/rmt + : libarrow/arrow-9.0.0 + : /bin/cmake ++ cd tools/cpp ++ pwd + SOURCE_DIR=/tmp/Rtmp6bSuH3/R.INSTALL40fe61c13650d/arrow/tools/cpp ++ mkdir -p libarrow/arrow-9.0.0 ++ cd libarrow/arrow-9.0.0 ++ pwd + DEST_DIR=/tmp/Rtmp6bSuH3/R.INSTALL40fe61c13650d/arrow/libarrow/arrow-9.0.0 + '[' '' '!=' '' ']' + '[' '' = false ']' + ARROW_DEFAULT_PARAM=OFF + mkdir -p /tmp/rmt + pushd /tmp/rmt /tmp/rmt /tmp/Rtmp6bSuH3/R.INSTALL40fe61c13650d/arrow + /bin/cmake -DARROW_BOOST_USE_SHARED=OFF 
-DARROW_BUILD_TESTS=OFF -DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON -DARROW_COMPUTE=ON -DARROW_CSV=ON -DARROW_DATASET=ON -DARROW_DEPENDENCY_SOURCE=AUTO -DAWSSDK_SOURCE= -DARROW_FILESYSTEM=ON -DARROW_GCS=OFF -DARROW_JEMALLOC=OFF -DARROW_MIMALLOC=OFF -DARROW_JSON=ON -DARROW_PARQUET=ON -DARROW_S3=OFF -DARROW_WITH_BROTLI=OFF -DARROW_WITH_BZ2=OFF -DARROW_WITH_LZ4=ON -DARROW_WITH_RE2=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_UTF8PROC=ON -DARROW_WITH_ZLIB=OFF -DARROW_WITH_ZSTD=OFF -DARROW_VERBOSE_THIRDPARTY_BUILD=OFF -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_LIBDIR=lib -DCMAKE_INSTALL_PREFIX=/tmp/Rtmp6bSuH3/R.INSTALL40fe61c13650d/arrow/libarrow/arrow-9.0.0 -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON -DCMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY=ON -DCMAKE_UNITY_BUILD=OFF -Dxsimd_SOURCE= -G 'Unix Makefiles' /tmp/Rtmp6bSuH3/R.INSTALL40fe61c13650d/arrow/tools/cpp -- Building using CMake version: 3.20.2 -- The C compiler identification is GNU 11.2.1 -- The CXX compiler identification is GNU 11.2.1 -- Detecting C compiler ABI info -- Detecting C compiler ABI info - failed -- Check for working C compiler: /opt/rh/gcc-toolset-11/root/usr/bin/gcc -- Check for working C compiler: /opt/rh/gcc-toolset-11/root/usr/bin/gcc - works -- Detecting C compile features -- Detecting C compile features - done -- Detecting CXX compiler ABI info -- Detecting CXX compiler ABI info - failed -- Check for working CXX compiler: /opt/rh/gcc-toolset-11/root/usr/bin/g++ -- Check for working CXX compiler: /opt/rh/gcc-toolset-11/root/usr/bin/g++ - works -- Detecting CXX compile features -- Detecting CXX compile features - done -- Arrow version: 9.0.0 (full: '9.0.0') -- Arrow SO version: 900 (full: 900.0.0) -- clang-tidy 12 not found -- clang-format 12 not found -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) -- infer not found fatal: not a git repository (or any parent up to mount point /) Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set). 
-- Found Python3: /bin/python3.6 (found version "3.6.8") found components: Interpreter -- Found cpplint executable at /tmp/Rtmp6bSuH3/R.INSTALL40fe61c13650d/arrow/tools/cpp/build-support/cpplint.py -- System processor: x86_64 -- Performing Test CXX_SUPPORTS_SSE4_2 -- Performing Test CXX_SUPPORTS_SSE4_2 - Success -- Performing Test CXX_SUPPORTS_AVX2 -- Performing Test CXX_SUPPORTS_AVX2 - Success -- Performing Test CXX_SUPPORTS_AVX512 -- Performing Test CXX_SUPPORTS_AVX512 - Success -- Arrow build warning level: PRODUCTION -- Using ld linker -- Configured for RELEASE build (set with cmake -DCMAKE_BUILD_TYPE={release,debug,...}) -- Build Type: RELEASE -- Performing Test CXX_LINKER_SUPPORTS_VERSION_SCRIPT -- Performing Test CXX_LINKER_SUPPORTS_VERSION_SCRIPT - Success -- Using AUTO approach to find dependencies -- ARROW_ABSL_BUILD_VERSION:
[jira] [Updated] (ARROW-17512) [Doc] Updates to crossbow documentation for clarity
[ https://issues.apache.org/jira/browse/ARROW-17512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fiona La updated ARROW-17512: - Description: While setting up a queue repository for testing changes to {{{}crossbow{}}}, we noticed some updates that can be made to help future developers set up their environment. They include: # Clarify Travis CI auto-cancellation default behavior # Fixing broken links # Typos was: While setting up a queue repository for testing changes to {{{}crossbow{}}}, we noticed some updates that can be made to help future developers set up their environment. They include: # Clarify Travis CI auto-cancellation default behavior. # Typos > [Doc] Updates to crossbow documentation for clarity > --- > > Key: ARROW-17512 > URL: https://issues.apache.org/jira/browse/ARROW-17512 > Project: Apache Arrow > Issue Type: Task > Components: Documentation >Reporter: Fiona La >Assignee: Fiona La >Priority: Minor > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > While setting up a queue repository for testing changes to {{{}crossbow{}}}, > we noticed some updates that can be made to help future developers set up > their environment. > They include: > # Clarify Travis CI auto-cancellation default behavior > # Fixing broken links > # Typos -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-15691) [Dev] Update archery to work with either master or main as default branch
[ https://issues.apache.org/jira/browse/ARROW-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15691: --- Labels: pull-request-available (was: ) > [Dev] Update archery to work with either master or main as default branch > - > > Key: ARROW-15691 > URL: https://issues.apache.org/jira/browse/ARROW-15691 > Project: Apache Arrow > Issue Type: Sub-task > Components: Developer Tools >Reporter: Neal Richardson >Assignee: Fiona La >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-17590) Lower memory usage with filters
[ https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin closed ARROW-17590. --- Resolution: Duplicate > Lower memory usage with filters > --- > > Key: ARROW-17590 > URL: https://issues.apache.org/jira/browse/ARROW-17590 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Yin >Priority: Major > Attachments: sample-1.py, sample.py > > > Hi, > When I read a parquet file (about 23MB with 250K rows and 600 object/string > columns with lots of None) with a filter on a not-null column for a small > number of rows (e.g. 1 to 500), the memory usage is pretty high (around 900MB > to 1GB). The result table and dataframe have only a few rows (1 row 20kb, 500 > rows 20MB). It looks like it scans/loads many rows from the parquet file. Not > only is the footprint or watermark of memory usage high, but it also seems > not to release the memory in time (such as after GC in Python, though it may get > reused for a subsequent read). > When reading the same parquet file for all columns without filtering, the > memory usage is about the same at 900MB. It goes up to 2.3GB after to_pandas. > df.info(memory_usage='deep') shows 4.3GB, maybe double-counting > something. > It helps to limit the number of columns read. Reading 1 column, with a filter for 1 > row or more or without a filter, takes about 10MB, which is much smaller, > but still bigger than the size of the table or data frame with 1 or > 500 rows of 1 column (under 1MB). > The filtered column is not a partition key, which functionally works to get > the correct rows. But the memory usage is quite high even when the parquet > file is not really large, partitioned or not. 
There are some references > to similar issues, for example: > [https://github.com/apache/arrow/issues/7338] > Related classes/methods (in pyarrow 9.0.0): > _ParquetDatasetV2.read > self._dataset.to_table(columns=columns, filter=self._filter_expression, > use_threads=use_threads) > pyarrow._dataset.FileSystemDataset.to_table > I played with pyarrow._dataset.Scanner.to_table: > self._dataset.scanner(columns=columns, > filter=self._filter_expression).to_table() > The memory usage is small when constructing the scanner but goes up after the > to_table call materializes it. > Is there some way or workaround to reduce the memory usage with read > filtering? > If not supported yet, can it be fixed/improved with priority? > This is a blocking issue for us when we need to load all or many columns. > I am not sure what improvement is possible with respect to how the parquet > columnar format works, and whether it can be patched somehow in the PyArrow Python > code, or needs changes to the arrow C++ code. > Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17590) Lower memory usage with filters
[ https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599489#comment-17599489 ] Yin commented on ARROW-17590: - Yep: pa.total_allocated_bytes 289.74639892578125 MB, dt.nbytes 0.0011539459228515625 MB; after sleeping 5 seconds: pa.total_allocated_bytes 0.0184326171875 MB, dt.nbytes 0.0011539459228515625 MB. Thanks Weston. Let me close this jira. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-16791) [R] Expose Azure Blob Storage filesystem
[ https://issues.apache.org/jira/browse/ARROW-16791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599485#comment-17599485 ] Neal Richardson commented on ARROW-16791: - Thanks, yeah, that's being added to the C++ library in ARROW-2034. Once that is merged, we can add R bindings. > [R] Expose Azure Blob Storage filesystem > > > Key: ARROW-16791 > URL: https://issues.apache.org/jira/browse/ARROW-16791 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Affects Versions: 8.0.0 >Reporter: Dean MacGregor >Priority: Critical > > I'd like to see the R arrow package be able to interface with the Azure Blob > Storage file system from the AzureStor package. > > In python, pyarrow and adlfs work together so I'd like for AzureStor and > arrow under R to also work together. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17583) [Python] File write visitor throws exception on large parquet file
[ https://issues.apache.org/jira/browse/ARROW-17583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17583: --- Labels: pull-request-available (was: ) > [Python] File write visitor throws exception on large parquet file > -- > > Key: ARROW-17583 > URL: https://issues.apache.org/jira/browse/ARROW-17583 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 9.0.0 >Reporter: Joost Hoozemans >Assignee: Joost Hoozemans >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > When writing a large parquet file (e.g. 5GB) using pyarrow.dataset, it throws > an exception: > Traceback (most recent call last): > File "pyarrow/_dataset_parquet.pyx", line 165, in > pyarrow._dataset_parquet.ParquetFileFormat._finish_write > File "pyarrow/_dataset.pyx", line 2695, in > pyarrow._dataset.WrittenFile.__init__ > OverflowError: value too large to convert to int > Exception ignored in: 'pyarrow._dataset._filesystemdataset_write_visitor' > The file is written successfully though. It seems related to this issue > https://issues.apache.org/jira/browse/ARROW-16761. > I would guess the problem is that the Python field is an int while the C++ code > returns an int64_t > [https://github.com/apache/arrow/pull/13338/files#diff-4f2eb12337651b45bab2b03abe2552dd7fc9958b1fbbeb09a2a488804b097109R164] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-16791) [R] Expose Azure Blob Storage filesystem
[ https://issues.apache.org/jira/browse/ARROW-16791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599464#comment-17599464 ] Dean MacGregor commented on ARROW-16791: I found that in arrow/r/src/filesystem.cpp, on lines 38-40, there's a comment block which says to uncomment those lines for AzureBlobFileSystem when R6 classes are made. However, looking further, there is much more in the cpp file for S3 and GCS than just such a reference, and nothing else for Azure. So I guess that comment should be more of a TODO: unless I'm missing a lot (which is all too possible), there's much more needed for Azure to work than just an R6 class, no? To that end, Microsoft has [https://github.com/Azure/azure-sdk-for-cpp] for using Azure in cpp. I don't know cpp well enough (really, at all) to be able to extend MS's code into something usable for arrow, but I just wanted to put the link here in case it's helpful. > [R] Expose Azure Blob Storage filesystem > > > Key: ARROW-16791 > URL: https://issues.apache.org/jira/browse/ARROW-16791 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Affects Versions: 8.0.0 >Reporter: Dean MacGregor >Priority: Critical > > I'd like to see the R arrow package be able to interface with the Azure Blob > Storage file system from the AzureStor package. > > In python, pyarrow and adlfs work together so I'd like for AzureStor and > arrow under R to also work together. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17599) [C++] ReadRangeCache should not retain data after read
[ https://issues.apache.org/jira/browse/ARROW-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599444#comment-17599444 ] David Li commented on ARROW-17599: -- You could perhaps store each entry as a custom Buffer subclass (which delegates to the actual buffer) and hand out shared pointers of that. > [C++] ReadRangeCache should not retain data after read > -- > > Key: ARROW-17599 > URL: https://issues.apache.org/jira/browse/ARROW-17599 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Priority: Major > > I've added a unit test of the issue here: > https://github.com/westonpace/arrow/tree/experiment/read-range-cache-retention > We use the ReadRangeCache for pre-buffering IPC and parquet files. Sometimes > those files are quite large (gigabytes). The usage is roughly: > for X in num_row_groups: > CacheAllThePiecesWeNeedForRowGroupX > WaitForPiecesToArriveForRowGroupX > ReadThePiecesWeNeedForRowGroupX > However, once we've read in row group X and passed it on to Acero, etc. we do > not release the data for row group X. The read range cache's entries vector > still holds a pointer to the buffer. The data is not released until the file > reader itself is destroyed which only happens when we have finished > processing an entire file. > This leads to excessive memory usage when pre-buffering is enabled. > This could potentially be a little difficult to implement because a single > read range's cache entry could be shared by multiple ranges so we will need > some kind of reference counting to know when we have fully finished with an > entry and can release it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
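The reference-counting idea in the last paragraph of ARROW-17599 (a cached buffer may back several read ranges, so it can only be released once every range mapped onto it has been consumed) can be sketched language-neutrally. This is a hypothetical Python illustration, not Arrow's C++ implementation:

```python
# Minimal sketch of per-entry reference counting for a read-range cache.
# `data` stands in for the underlying Buffer; `refcount` is the number
# of ranges that map onto this cache entry.
class CacheEntry:
    def __init__(self, data, refcount):
        self.data = data
        self.refcount = refcount

    def release_one(self):
        """Called when one range backed by this entry has been read.

        Returns True once the buffer has actually been dropped.
        """
        self.refcount -= 1
        if self.refcount == 0:
            self.data = None  # release the buffer as soon as possible
        return self.data is None

entry = CacheEntry(data=b"row group bytes", refcount=2)
assert not entry.release_one()  # first range read: buffer still retained
assert entry.release_one()      # second range read: buffer dropped
```

This mirrors the issue's point: without such counting, the entries vector keeps the buffer alive until the whole file reader is destroyed.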
[jira] [Commented] (ARROW-17319) [Python] pyarrow seems to set default CPU affinity to 0 on shutdown, crashes if CPU 0 is not available
[ https://issues.apache.org/jira/browse/ARROW-17319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599432#comment-17599432 ] Mike Gevaert commented on ARROW-17319: -- ping. It looks like the package isn't building w/ vcpkg - I'm not knowledgeable about that ecosystem, so I'm not sure I can debug this. > [Python] pyarrow seems to set default CPU affinity to 0 on shutdown, crashes > if CPU 0 is not available > -- > > Key: ARROW-17319 > URL: https://issues.apache.org/jira/browse/ARROW-17319 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 9.0.0 > Environment: Ubuntu 20.02 / Python 3.8.10 (default, Jun 22 2022, > 20:18:18) > $ pip list > Package Version > --- --- > numpy 1.23.1 > pandas 1.4.3 > pip 20.0.2 > pkg-resources 0.0.0 > pyarrow 9.0.0 > python-dateutil 2.8.2 > pytz2022.1 > setuptools 44.0.0 > six 1.16.0 >Reporter: Mike Gevaert >Assignee: Mike Gevaert >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > I get the following traceback when exiting python after loading > {{pyarrow.parquet}} > {code} > Python 3.8.10 (default, Jun 22 2022, 20:18:18) > [GCC 9.4.0] on linux > Type "help", "copyright", "credits" or "license" for more information. 
> >>> os.getpid() > 25106 > >>> import pyarrow.parquet > >>> > Fatal error condition occurred in > /opt/vcpkg/buildtrees/aws-c-io/src/9e6648842a-364b708815.clean/source/event_loop.c:72: > aws_thread_launch(_thread, s_event_loop_destroy_async_thread_fn, > el_group, _options) == AWS_OP_SUCCESS > Exiting Application > > Stack trace: > > /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x200af06) > [0x7f831b2b3f06] > /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x20028e5) > [0x7f831b2ab8e5] > /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x1f27e09) > [0x7f831b1d0e09] > /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) > [0x7f831b2b4a3d] > /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x1f25948) > [0x7f831b1ce948] > /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) > [0x7f831b2b4a3d] > /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x1ee0b46) > [0x7f831b189b46] > /tmp/venv/lib/python3.8/site-packages/pyarrow/libarrow.so.900(+0x194546a) > [0x7f831abee46a] > /lib/x86_64-linux-gnu/libc.so.6(+0x468a7) [0x7f831c6188a7] > /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7f831c618a60] > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7f831c5f608a] > {code} > To replicate this, one needs to make sure that CPU 0 isn't available to > schedule tasks on. In our HPC environment, that happens due to slurm using > cgroups to constrain CPU usage. > On a linux workstation, one should be able to: > 1) open python as a normal user > 2) get the pid > 3) as root: > {code} > cd /sys/fs/cgroup/cpuset/ > mkdir pyarrow > cd pyarrow > echo 0 > cpuset.mems > echo 1 > cpuset.cpus # sets the cgroup to only have access to cpu 1 > echo $PID > tasks > {code} > Then, in the Python environment: > {code} > import pyarrow.parquet > exit() > {code} > Which should trigger the crash. 
> Sadly, I couldn't track down which {{aws-c-common}} and {{aws-c-io}} are > being used for the 9.0.0 py38 manylinux wheels. (libarrow.so.900 has > BuildID[sha1]=dd6c5a2efd5cacf09657780a58c40f7c930e4df1) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17355) [R] Refactor the handle_* utility functions for a better dev experience
[ https://issues.apache.org/jira/browse/ARROW-17355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17355: --- Labels: pull-request-available (was: ) > [R] Refactor the handle_* utility functions for a better dev experience > --- > > Key: ARROW-17355 > URL: https://issues.apache.org/jira/browse/ARROW-17355 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Assignee: Nicola Crane >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > In ARROW-15260, the utility functions for handling different kinds of reading > errors (handle_parquet_io_error, handle_csv_read_error, and > handle_augmented_field_misuse) were refactored so that multiple ones could be > chained together. An issue with this is that other errors may be swallowed if > they're used without any errors that they don't capture being raised manually > afterwards. We should update the code to prevent this from being possible. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17355) [R] Refactor the handle_* utility functions for a better dev experience
[ https://issues.apache.org/jira/browse/ARROW-17355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane reassigned ARROW-17355: Assignee: Nicola Crane > [R] Refactor the handle_* utility functions for a better dev experience > --- > > Key: ARROW-17355 > URL: https://issues.apache.org/jira/browse/ARROW-17355 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Assignee: Nicola Crane >Priority: Major > Fix For: 10.0.0 > > > In ARROW-15260, the utility functions for handling different kinds of reading > errors (handle_parquet_io_error, handle_csv_read_error, and > handle_augmented_field_misuse) were refactored so that multiple ones could be > chained together. An issue with this is that other errors may be swallowed if > they're used without any errors that they don't capture being raised manually > afterwards. We should update the code to prevent this from being possible. -- This message was sent by Atlassian Jira (v8.20.10#820010)
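The pitfall ARROW-17355 describes (chained error handlers that swallow errors they don't recognize) is language-neutral; here is a hypothetical Python sketch, not the R package's code. The handler name mirrors the R helpers, but the body is illustrative:

```python
# A handler in a chain must re-raise any error it does not recognize;
# otherwise errors it was never meant to handle are silently swallowed.
def handle_parquet_io_error(err):
    # Hypothetical: translate only the one error this handler owns.
    if "Parquet magic bytes not found" in str(err):
        raise ValueError("file is not a parquet file") from err
    # Crucial step: re-raise everything else so the next handler
    # (or the caller) still sees it.
    raise err

try:
    try:
        raise RuntimeError("some unrelated failure")
    except Exception as e:
        handle_parquet_io_error(e)
except RuntimeError as e:
    seen = str(e)

print(seen)
```

Dropping the final `raise err` reproduces the bug described above: the unrelated `RuntimeError` would vanish inside the handler instead of reaching the caller.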
[jira] [Commented] (ARROW-17580) [Doc][C++][Python] Unclear how to influence compilation flags
[ https://issues.apache.org/jira/browse/ARROW-17580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599308#comment-17599308 ] Alenka Frim commented on ARROW-17580: - I did some research and found: - {{CMAKE_CXX_FLAGS}} in pyarrow include {{CXX_COMMON_FLAGS}} but this is not true for pyarrow c++ *->* I need to add {{set(CMAKE_CXX_FLAGS "${CXX_COMMON_FLAGS} ${CMAKE_CXX_FLAGS}")}} to pyarrow c++ CMakeLists file - {{PYARROW_CXXFLAGS}} is ignored by pyarrow c++ and pyarrow *->* I need to add it to the cmake_options in setup.py, then they are read by the cmake - {{ARROW_CXXFLAGS}} are recognised by all three builds {code:bash} arrow c++ -- ARROW_CXXFLAGS: -ggdb -- PYARROW_CXXFLAGS: -- CXXFLAGS: -- CXX_COMMON_FLAGS: -Wall -Wextra -Wdocumentation -Wshorten-64-to-32 -Wno-missing-braces -Wno-unused-parameter -Wno-constant-logical-operand -Wno-return-stack-address -Werror -Wno-unknown-warning-option -Wno-pass-failed -stdlib=libc++ -march=armv8-a --CMAKE_CXX_FLAGS: -Qunused-arguments -fcolor-diagnostics -ggdb -O0 -Wall -Wextra -Wdocumentation -Wshorten-64-to-32 -Wno-missing-braces -Wno-unused-parameter -Wno-constant-logical-operand -Wno-return-stack-address -Werror -Wno-unknown-warning-option -Wno-pass-failed -stdlib=libc++ -march=armv8-a -ggdb pyarrow c++ -- ARROW_CXXFLAGS: -ggdb -- PYARROW_CXXFLAGS: -- CXXFLAGS: -- CXX_COMMON_FLAGS: -Wall -Wextra -Wdocumentation -Wshorten-64-to-32 -Wno-missing-braces -Wno-unused-parameter -Wno-constant-logical-operand -Wno-return-stack-address -Werror -Wno-unknown-warning-option -Wno-pass-failed -stdlib=libc++ -march=armv8-a --CMAKE_CXX_FLAGS: -Qunused-arguments -fcolor-diagnostics -ggdb -O0 pyarrow -- ARROW_CXXFLAGS: -ggdb -- PYARROW_CXXFLAGS: -- CXXFLAGS: -- CXX_COMMON_FLAGS: -Wall -Wno-unknown-warning-option -Wno-pass-failed -stdlib=libc++ -march=armv8-a -- CMAKE_CXX_FLAGS: -Wall -Wno-unknown-warning-option -Wno-pass-failed -stdlib=libc++ -march=armv8-a -Qunused-arguments -fcolor-diagnostics -g -O0 
-fno-omit-frame-pointer -Wno-unused-variable -Wno-maybe-uninitialized -Wno-parentheses-equality -Wno-constant-logical-operand -Wno-missing-declarations -Wno-sometimes-uninitialized -Wno-return-type-c-linkage {code} - I think {{CXXFLAGS}} doesn't get included in the {{CMAKE_CXX_FLAGS}} list. Maybe even setting {{CMAKE_CXX_FLAGS}} in [https://github.com/apache/arrow/blob/413e49385f52b8c76aa02f67588f4dc66d5ed86b/cpp/CMakeLists.txt#L555] causes the variable to be ignored? {code:bash} (pyarrow-dev-9) (base) alenkafrim@Alenkas-MacBook-Pro arrow % printenv CXXFLAGS -Woverloaded-virtual ../dist/configure (pyarrow-dev-9) (base) alenkafrim@Alenkas-MacBook-Pro arrow % arrow cmake-debug ... -- ARROW_CXXFLAGS: -ggdb -- PYARROW_CXXFLAGS: -- CXXFLAGS: -- CXX_COMMON_FLAGS: -Wall -Wextra -Wdocumentation -Wshorten-64-to-32 -Wno-missing-braces -Wno-unused-parameter -Wno-constant-logical-operand -Wno-return-stack-address -Werror -Wno-unknown-warning-option -Wno-pass-failed -stdlib=libc++ -march=armv8-a -- CMAKE_CXX_FLAGS: -Qunused-arguments -fcolor-diagnostics -ggdb -O0 -Wall -Wextra -Wdocumentation -Wshorten-64-to-32 -Wno-missing-braces -Wno-unused-parameter -Wno-constant-logical-operand -Wno-return-stack-address -Werror -Wno-unknown-warning-option -Wno-pass-failed -stdlib=libc++ -march=armv8-a -ggdb -- ** -- Configuring done -- Generating done -- Build files have been written to: /Users/alenkafrim/repos/arrow/cpp/build ~/repos/arrow {code} > [Doc][C++][Python] Unclear how to influence compilation flags > - > > Key: ARROW-17580 > URL: https://issues.apache.org/jira/browse/ARROW-17580 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation, Python >Reporter: Antoine Pitrou >Priority: Critical > > Frequently people need to customize compilation flags for C++ and/or C files. > Unfortunately, both for Arrow C++ and PyArrow, it is very difficult to find > out the proper way to do this. 
> For Arrow C++, it seems {{ARROW_CXXFLAGS}} should be passed to CMake, while > the {{CXXFLAGS}} environment variable is ignored (it probably shouldn't?). > For PyArrow, I have not found a way to do it. -The {{CXXFLAGS}} environment > variable is ignored, and -the {{PYARROW_CXXFLAGS}} CMake variable has two > problems: > * it is only recognized for Cython-generated files, not for PyArrow C++ > sources > * it only affects linker calls, while it should actually affect compiler > calls (edit: does it?) > *Edit: I was wrong, the {{CXXFLAGS}} environment variable is not ignored.* -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16431) [C++][Parquet] Improve error message in append_row_groups() when appending disjoint metadata
[ https://issues.apache.org/jira/browse/ARROW-16431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16431: --- Labels: pull-request-available (was: ) > [C++][Parquet] Improve error message in append_row_groups() when appending > disjoint metadata > > > Key: ARROW-16431 > URL: https://issues.apache.org/jira/browse/ARROW-16431 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Parquet >Reporter: Michael Milton >Assignee: Miles Granger >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Currently if you try to append together metadata from row groups with > different schemas (?), you get the following error: > {code:java} > File > "/home/mmilton/.conda/envs/mmilton/envs/driverpipe/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py", > line 52, in _append_row_groups > metadata.append_row_groups(md) > File "pyarrow/_parquet.pyx", line 628, in > pyarrow._parquet.FileMetaData.append_row_groups > self._metadata.AppendRowGroups(deref(c_metadata)) > RuntimeError: AppendRowGroups requires equal schemas. > {code} > What would be useful here is to actually pass the schema difference in the > error object in terms of which columns disagree. This information should > _also_ be in the error message. > For example if it said: > {code:java} > RuntimeError: AppendRowGroups requires equal schemas. Column "foo" was > previously an int32 but the latest row group is storing it as an int64 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)