[jira] [Commented] (ARROW-17068) [Python] "pyarrow.parquet.write_to_dataset", option "file_visitor" nothing happen
[ https://issues.apache.org/jira/browse/ARROW-17068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566637#comment-17566637 ] Alejandro Marco Ramos commented on ARROW-17068: --- Hi Will, thanks for the response. Passing `use_legacy_dataset=False` doesn't fix this situation; the list remains empty. I will follow your recommendation to use the new dataset API. Thanks. > [Python] "pyarrow.parquet.write_to_dataset", option "file_visitor" nothing > happen > - > > Key: ARROW-17068 > URL: https://issues.apache.org/jira/browse/ARROW-17068 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 8.0.0 >Reporter: Alejandro Marco Ramos >Priority: Minor > > When trying to use the callback "file_visitor", nothing happens. > > Example: > {code:python} > import pyarrow as pa > from pyarrow import parquet as pa_parquet > table = pa.table([ > pa.array([1, 2, 3, 4, 5]), > pa.array(["a", "b", "c", "d", "e"]), > pa.array([1.0, 2.0, 3.0, 4.0, 5.0]) > ], names=["col1", "col2", "col3"]) > written_files = [] > pa_parquet.write_to_dataset(table, partition_cols=["col2"], > root_path="tests", file_visitor=lambda x: written_files.append(x.path)) > assert len(written_files) > 0 # This raises, length is 0{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
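For readers following the thread, the contract of the `file_visitor` callback can be illustrated with a stdlib-only sketch. The `WrittenFile` class and `write_partitions` function below are hypothetical stand-ins (not the pyarrow implementation); only the shape of the callback — an object exposing `.path`, invoked once per written file — mirrors what `pyarrow.dataset.write_dataset` passes.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class WrittenFile:
    """Stand-in for the object pyarrow passes to file_visitor (exposes .path)."""
    path: str


def write_partitions(rows, root, file_visitor=None):
    """Hypothetical partitioned writer: one file per distinct key,
    invoking file_visitor after each file is written."""
    by_key = {}
    for key, value in rows:
        by_key.setdefault(key, []).append(value)
    for key, values in by_key.items():
        part_dir = os.path.join(root, f"col2={key}")
        os.makedirs(part_dir, exist_ok=True)
        path = os.path.join(part_dir, "part-0.txt")
        with open(path, "w") as f:
            f.write("\n".join(map(str, values)))
        if file_visitor is not None:
            file_visitor(WrittenFile(path))  # called once per written file


with tempfile.TemporaryDirectory() as root:
    written_files = []
    write_partitions([("a", 1), ("b", 2), ("a", 3)], root,
                     file_visitor=lambda f: written_files.append(f.path))
    assert len(written_files) == 2  # one file per partition key
```

The reported bug is precisely that the legacy `write_to_dataset` path never made the equivalent of the `file_visitor(...)` call, so the caller's list stayed empty.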
[jira] [Updated] (ARROW-17066) [C++][Python][Substrait] "ignore_unknown_fields" should be specified when converting JSON to binary
[ https://issues.apache.org/jira/browse/ARROW-17066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Tia updated ARROW-17066: Priority: Critical (was: Blocker) > [C++][Python][Substrait] "ignore_unknown_fields" should be specified when > converting JSON to binary > --- > > Key: ARROW-17066 > URL: https://issues.apache.org/jira/browse/ARROW-17066 > Project: Apache Arrow > Issue Type: Bug >Reporter: Richard Tia >Assignee: Vibhatha Lakmal Abeykoon >Priority: Critical > Labels: pull-request-available > Time Spent: 1h 10m > Remaining Estimate: 0h > > [https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.util.json_util#JsonParseOptions] > > When converting a substrait JSON to binary, there are many unknown fields > that may exist since substrait is being built every week. > ignore_unknown_fields should be specified when doing this conversion. > > This is resulting in frequent errors similar to this: > {code:java} > E pyarrow.lib.ArrowInvalid: JsonToBinaryStream returned > INVALID_ARGUMENT:(relations[0].root.input.sort.input.aggregate.measures[0].measure) > arguments: Cannot find field. > pyarrow/error.pxi:100: ArrowInvalid {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
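The effect of `ignore_unknown_fields` can be illustrated without protobuf: a strict JSON-to-message converter rejects keys absent from its schema, while a lenient one silently drops them. This is a plain-Python sketch of the semantics only — `KNOWN_FIELDS` is a hypothetical subset of a Substrait plan schema, and the function is not the protobuf API.

```python
import json

# Hypothetical subset of top-level Substrait plan fields, for illustration.
KNOWN_FIELDS = {"relations", "extensions"}


def json_to_message(text, ignore_unknown_fields=False):
    """Validate top-level keys the way a strict JSON parser would:
    unknown keys raise unless ignore_unknown_fields is set."""
    obj = json.loads(text)
    unknown = set(obj) - KNOWN_FIELDS
    if unknown and not ignore_unknown_fields:
        raise ValueError(f"Cannot find field: {sorted(unknown)}")
    return {k: v for k, v in obj.items() if k in KNOWN_FIELDS}


plan = '{"relations": [], "brand_new_field": 1}'
try:
    json_to_message(plan)  # strict mode fails on the newer field
except ValueError:
    pass
msg = json_to_message(plan, ignore_unknown_fields=True)  # lenient: field dropped
assert msg == {"relations": []}
```

This is why a weekly-evolving Substrait schema breaks strict parsing: any field added upstream becomes an "unknown field" to an older consumer unless the lenient option is enabled.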
[jira] [Updated] (ARROW-17066) [C++][Python][Substrait] "ignore_unknown_fields" should be specified when converting JSON to binary
[ https://issues.apache.org/jira/browse/ARROW-17066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17066: --- Labels: pull-request-available (was: ) > [C++][Python][Substrait] "ignore_unknown_fields" should be specified when > converting JSON to binary > --- > > Key: ARROW-17066 > URL: https://issues.apache.org/jira/browse/ARROW-17066 > Project: Apache Arrow > Issue Type: Bug >Reporter: Richard Tia >Assignee: Vibhatha Lakmal Abeykoon >Priority: Blocker > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > [https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.util.json_util#JsonParseOptions] > > When converting a substrait JSON to binary, there are many unknown fields > that may exist since substrait is being built every week. > ignore_unknown_fields should be specified when doing this conversion. > > This is resulting in frequent errors similar to this: > {code:java} > E pyarrow.lib.ArrowInvalid: JsonToBinaryStream returned > INVALID_ARGUMENT:(relations[0].root.input.sort.input.aggregate.measures[0].measure) > arguments: Cannot find field. > pyarrow/error.pxi:100: ArrowInvalid {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17066) [C++][Python][Substrait] "ignore_unknown_fields" should be specified when converting JSON to binary
[ https://issues.apache.org/jira/browse/ARROW-17066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vibhatha Lakmal Abeykoon updated ARROW-17066: - Summary: [C++][Python][Substrait] "ignore_unknown_fields" should be specified when converting JSON to binary (was: [Python][Substrait] "ignore_unknown_fields" should be specified when converting JSON to binary) > [C++][Python][Substrait] "ignore_unknown_fields" should be specified when > converting JSON to binary > --- > > Key: ARROW-17066 > URL: https://issues.apache.org/jira/browse/ARROW-17066 > Project: Apache Arrow > Issue Type: Bug >Reporter: Richard Tia >Assignee: Vibhatha Lakmal Abeykoon >Priority: Blocker > > [https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.util.json_util#JsonParseOptions] > > When converting a substrait JSON to binary, there are many unknown fields > that may exist since substrait is being built every week. > ignore_unknown_fields should be specified when doing this conversion. > > This is resulting in frequent errors similar to this: > {code:java} > E pyarrow.lib.ArrowInvalid: JsonToBinaryStream returned > INVALID_ARGUMENT:(relations[0].root.input.sort.input.aggregate.measures[0].measure) > arguments: Cannot find field. > pyarrow/error.pxi:100: ArrowInvalid {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16759) [Go] Update testify to fix security vulnerability
[ https://issues.apache.org/jira/browse/ARROW-16759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-16759: - Fix Version/s: 8.0.1 (was: 8.0.2) > [Go] Update testify to fix security vulnerability > > > Key: ARROW-16759 > URL: https://issues.apache.org/jira/browse/ARROW-16759 > Project: Apache Arrow > Issue Type: Task > Components: Go >Affects Versions: 7.0.0, 8.0.0 >Reporter: Dominic Barnes >Assignee: Dominic Barnes >Priority: Minor > Labels: pull-request-available > Fix For: 9.0.0, 6.0.2, 7.0.1, 8.0.1 > > Time Spent: 4h > Remaining Estimate: 0h > > The packages under github.com/apache/arrow/go currently have a dependency on > github.com/stretchr/testify v1.7.0, which has a dependency on gopkg.in/yaml.v3 > with an outstanding security vulnerability > ([CVE-2022-28948|https://github.com/advisories/GHSA-hp87-p4gw-j4gq]). > While testify is only used during tests, this is not distinguished by the Go > toolchain or by other tools like Snyk that scan the dependency chain for > vulnerabilities. Unfortunately, due to Go's [minimal version > selection|https://go.dev/ref/mod#minimal-version-selection], this ends up > requiring us to update our dependencies to ensure this security vulnerability > is addressed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17069) [Python][R] GCSFileSystem reports cannot resolve host on public buckets
[ https://issues.apache.org/jira/browse/ARROW-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones updated ARROW-17069: --- Description: GCSFileSystem will return {{Couldn't resolve host name}} if you don't supply {{anonymous}} as the user: {code:python} import pyarrow.dataset as ds # Fails: dataset = ds.dataset("gs://voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3") Traceback (most recent call last): File "", line 1, in File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", line 749, in dataset return _filesystem_dataset(source, **kwargs) File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", line 441, in _filesystem_dataset fs, paths_or_selector = _ensure_single_source(source, filesystem) File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", line 408, in _ensure_single_source file_info = filesystem.get_file_info(path) File "pyarrow/_fs.pyx", line 444, in pyarrow._fs.FileSystem.get_file_info info = GetResultValue(self.fs.GetFileInfo(path)) File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status return check_status(status) File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status raise IOError(message) OSError: google::cloud::Status(UNAVAILABLE: Retry policy exhausted in GetObjectMetadata: EasyPerform() - CURL error [6]=Couldn't resolve host name) # This works fine: >>> dataset = >>> ds.dataset("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3") {code} I would expect that we could connect. 
was: GCSFileSystem will return {{Couldn't resolve host name}} if you don't supply {{anonymous}} as the user: {code:python} import pyarrow.dataset as ds # Fails: dataset = ds.dataset("gs://anonymous@voltrondata-labs-datasets/taxi-data/?retry_limit_seconds=3") # Traceback (most recent call last): # File "", line 1, in # File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", line 749, in dataset # return _filesystem_dataset(source, **kwargs) # File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", line 441, in _filesystem_dataset # fs, paths_or_selector = _ensure_single_source(source, filesystem) # File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", line 417, in _ensure_single_source # raise FileNotFoundError(path) # FileNotFoundError: voltrondata-labs-datasets/taxi-data # This works fine: >>> dataset = >>> ds.dataset("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3") {code} I would expect that we could connect. 
> [Python][R] GCSFileSystem reports cannot resolve host on public buckets > --- > > Key: ARROW-17069 > URL: https://issues.apache.org/jira/browse/ARROW-17069 > Project: Apache Arrow > Issue Type: Bug > Components: Python, R >Affects Versions: 8.0.0 >Reporter: Will Jones >Assignee: Will Jones >Priority: Critical > Fix For: 9.0.0 > > > GCSFileSystem will return {{Couldn't resolve host name}} if you don't supply > {{anonymous}} as the user: > {code:python} > import pyarrow.dataset as ds > # Fails: > dataset = > ds.dataset("gs://voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3") > Traceback (most recent call last): > File "", line 1, in > File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", > line 749, in dataset > return _filesystem_dataset(source, **kwargs) > File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", > line 441, in _filesystem_dataset > fs, paths_or_selector = _ensure_single_source(source, filesystem) > File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", > line 408, in _ensure_single_source > file_info = filesystem.get_file_info(path) > File "pyarrow/_fs.pyx", line 444, in pyarrow._fs.FileSystem.get_file_info > info = GetResultValue(self.fs.GetFileInfo(path)) > File "pyarrow/error.pxi", line 144, in > pyarrow.lib.pyarrow_internal_check_status > return check_status(status) > File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status > raise IOError(message) > OSError: google::cloud::Status(UNAVAILABLE: Retry policy exhausted in > GetObjectMetadata: EasyPerform() - CURL error [6]=Couldn't resolve host name) > # This works fine: > >>> dataset = > >>> ds.dataset("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3") > {code} > I would expect that we could connect. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17069) [Python][R] GCSFileSystem reports cannot resolve host on public buckets
Will Jones created ARROW-17069: -- Summary: [Python][R] GCSFileSystem reports cannot resolve host on public buckets Key: ARROW-17069 URL: https://issues.apache.org/jira/browse/ARROW-17069 Project: Apache Arrow Issue Type: Bug Components: Python, R Affects Versions: 8.0.0 Reporter: Will Jones Assignee: Will Jones Fix For: 9.0.0 GCSFileSystem will return {{Couldn't resolve host name}} if you don't supply {{anonymous}} as the user: {code:python} import pyarrow.dataset as ds # Fails: dataset = ds.dataset("gs://anonymous@voltrondata-labs-datasets/taxi-data/?retry_limit_seconds=3") # Traceback (most recent call last): # File "", line 1, in # File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", line 749, in dataset # return _filesystem_dataset(source, **kwargs) # File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", line 441, in _filesystem_dataset # fs, paths_or_selector = _ensure_single_source(source, filesystem) # File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", line 417, in _ensure_single_source # raise FileNotFoundError(path) # FileNotFoundError: voltrondata-labs-datasets/taxi-data # This works fine: >>> dataset = >>> ds.dataset("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3") {code} I would expect that we could connect. -- This message was sent by Atlassian Jira (v8.20.10#820010)
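The workaround in the report encodes the anonymous credential in the URI's userinfo component. A stdlib sketch of how such a filesystem URI decomposes (the interpretation of each part follows the `gs://anonymous@bucket/path?option=value` convention shown above):

```python
from urllib.parse import urlparse, parse_qs

uri = "gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3"
parts = urlparse(uri)

assert parts.scheme == "gs"
assert parts.username == "anonymous"  # userinfo component: request unauthenticated access
assert parts.hostname == "voltrondata-labs-datasets"  # the bucket name
assert parts.path == "/nyc-taxi/"  # object prefix within the bucket
assert parse_qs(parts.query) == {"retry_limit_seconds": ["3"]}  # filesystem option
```

The issue is that omitting the `anonymous@` userinfo should still work for public buckets, instead of surfacing a misleading "Couldn't resolve host name" error.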
[jira] [Assigned] (ARROW-17066) [Python][Substrait] "ignore_unknown_fields" should be specified when converting JSON to binary
[ https://issues.apache.org/jira/browse/ARROW-17066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vibhatha Lakmal Abeykoon reassigned ARROW-17066: Assignee: Vibhatha Lakmal Abeykoon > [Python][Substrait] "ignore_unknown_fields" should be specified when > converting JSON to binary > -- > > Key: ARROW-17066 > URL: https://issues.apache.org/jira/browse/ARROW-17066 > Project: Apache Arrow > Issue Type: Bug >Reporter: Richard Tia >Assignee: Vibhatha Lakmal Abeykoon >Priority: Blocker > > [https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.util.json_util#JsonParseOptions] > > When converting a substrait JSON to binary, there are many unknown fields > that may exist since substrait is being built every week. > ignore_unknown_fields should be specified when doing this conversion. > > This is resulting in frequent errors similar to this: > {code:java} > E pyarrow.lib.ArrowInvalid: JsonToBinaryStream returned > INVALID_ARGUMENT:(relations[0].root.input.sort.input.aggregate.measures[0].measure) > arguments: Cannot find field. > pyarrow/error.pxi:100: ArrowInvalid {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-16918) [Gandiva][C++] Adding UTC and local time zone conversion functions to Gandiva
[ https://issues.apache.org/jira/browse/ARROW-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou reassigned ARROW-16918: Assignee: Palak Pariawala > [Gandiva][C++] Adding UTC and local time zone conversion functions to Gandiva > - > > Key: ARROW-16918 > URL: https://issues.apache.org/jira/browse/ARROW-16918 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ - Gandiva >Reporter: Palak Pariawala >Assignee: Palak Pariawala >Priority: Minor > Labels: newbie, pull-request-available > Original Estimate: 168h > Time Spent: 3h > Remaining Estimate: 165h > > Adding functions in Gandiva to convert timestamps between UTC and local time > zones > to_utc_timestamp(timestamp, timezone name) > from_utc_timestamp(timestamp, timezone name) -- This message was sent by Atlassian Jira (v8.20.10#820010)
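The intended semantics of the two functions (they mirror the Hive/Spark functions of the same names) can be sketched with stdlib `zoneinfo`. This is an illustration of the expected behavior only, not the Gandiva implementation:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

UTC = ZoneInfo("UTC")


def to_utc_timestamp(ts: datetime, tz_name: str) -> datetime:
    """Treat the naive timestamp as local time in tz_name; return naive UTC."""
    return ts.replace(tzinfo=ZoneInfo(tz_name)).astimezone(UTC).replace(tzinfo=None)


def from_utc_timestamp(ts: datetime, tz_name: str) -> datetime:
    """Treat the naive timestamp as UTC; return naive local time in tz_name."""
    return ts.replace(tzinfo=UTC).astimezone(ZoneInfo(tz_name)).replace(tzinfo=None)


# India is UTC+05:30 year-round, so noon IST is 06:30 UTC.
noon = datetime(2022, 7, 13, 12, 0)
assert to_utc_timestamp(noon, "Asia/Kolkata") == datetime(2022, 7, 13, 6, 30)
# The two functions are inverses for a fixed zone.
assert from_utc_timestamp(to_utc_timestamp(noon, "Asia/Kolkata"), "Asia/Kolkata") == noon
```

Note the round-trip property only holds in general for zones without ambiguous local times; around DST transitions the inverse relationship can break, which is exactly the kind of edge case the Gandiva tests need to cover.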
[jira] [Commented] (ARROW-13062) [Dev] Add a way for people to add information to our saved crossbow data
[ https://issues.apache.org/jira/browse/ARROW-13062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566564#comment-17566564 ] Sam Albers commented on ARROW-13062: I have not added this ability. We certainly have the ability to add annotations like this, though it does introduce some manual futzing unless we connect the report to this Jira board. > [Dev] Add a way for people to add information to our saved crossbow data > > > Key: ARROW-13062 > URL: https://issues.apache.org/jira/browse/ARROW-13062 > Project: Apache Arrow > Issue Type: Sub-task > Components: Developer Tools >Reporter: Jonathan Keane >Priority: Major > > We should have a simple + lightweight way to annotate specific builds with > information like "won't be fixed until dask has a new release" or "this is > supposed to be fixed in ARROW-XXX". > We should find an easy, lightweight way to add this kind of information. > Only relevant in its previous parent: -We *should not* require, ask, or allow > people to add this information to the JSON that is saved as part of > ARROW-13509. That JSON should be kept pristine and not have manual edits. > Instead, we should have a plain-text lookup file that matches notes to > specific builds (maybe to specific dates?)- -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17068) [Python] "pyarrow.parquet.write_to_dataset", option "file_visitor" nothing happen
[ https://issues.apache.org/jira/browse/ARROW-17068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566543#comment-17566543 ] Will Jones commented on ARROW-17068: My guess is that if you pass {{use_legacy_dataset=False}} it should work. This option will become the default in 9.0.0, and we are removing the legacy dataset implementation eventually, so we might not fix this. If you can, it would be preferable to use the dataset writer in {{pyarrow.dataset}}: {code:python} import pyarrow.dataset as ds written_files = [] ds.write_dataset(table, base_dir="tests", partitioning=["col2"], file_visitor=lambda x: written_files.append(x.path)) {code} > [Python] "pyarrow.parquet.write_to_dataset", option "file_visitor" nothing > happen > - > > Key: ARROW-17068 > URL: https://issues.apache.org/jira/browse/ARROW-17068 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 8.0.0 >Reporter: Alejandro Marco Ramos >Priority: Minor > > When trying to use the callback "file_visitor", nothing happens. > > Example: > {code:python} > import pyarrow as pa > from pyarrow import parquet as pa_parquet > table = pa.table([ > pa.array([1, 2, 3, 4, 5]), > pa.array(["a", "b", "c", "d", "e"]), > pa.array([1.0, 2.0, 3.0, 4.0, 5.0]) > ], names=["col1", "col2", "col3"]) > written_files = [] > pa_parquet.write_to_dataset(table, partition_cols=["col2"], > root_path="tests", file_visitor=lambda x: written_files.append(x.path)) > assert len(written_files) > 0 # This raises, length is 0{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16887) [Doc][R] Document GCSFileSystem for R package
[ https://issues.apache.org/jira/browse/ARROW-16887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16887: --- Labels: pull-request-available (was: ) > [Doc][R] Document GCSFileSystem for R package > - > > Key: ARROW-16887 > URL: https://issues.apache.org/jira/browse/ARROW-16887 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, R >Reporter: Will Jones >Assignee: Will Jones >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > We should update the [cloud storage > vignette|https://arrow.apache.org/docs/r/articles/fs.html] and the filesystem > RD to show configuration and usage of GCSFileSystem. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-13656) [Website] [Rust] DataFusion 5.0.0 and Ballista 0.5.0 Blog Posts
[ https://issues.apache.org/jira/browse/ARROW-13656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove closed ARROW-13656. -- Resolution: Won't Fix This is an old issue > [Website] [Rust] DataFusion 5.0.0 and Ballista 0.5.0 Blog Posts > --- > > Key: ARROW-13656 > URL: https://issues.apache.org/jira/browse/ARROW-13656 > Project: Apache Arrow > Issue Type: Task > Components: Website >Reporter: Andy Grove >Priority: Major > Time Spent: 50m > Remaining Estimate: 0h > > https://github.com/apache/arrow-datafusion/issues/881 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-16002) [Go] fileBlock.NewMessage should use memory.Allocator
[ https://issues.apache.org/jira/browse/ARROW-16002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol resolved ARROW-16002. --- Fix Version/s: 9.0.0 Resolution: Fixed Issue resolved by pull request 13554 [https://github.com/apache/arrow/pull/13554] > [Go] fileBlock.NewMessage should use memory.Allocator > - > > Key: ARROW-16002 > URL: https://issues.apache.org/jira/browse/ARROW-16002 > Project: Apache Arrow > Issue Type: Bug > Components: Go >Affects Versions: 8.0.0 >Reporter: Arjan Topolovec >Assignee: Matthew Topol >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > The current implementation of ipc.FileReader does not use the > memory.Allocator interface. Reading records from a file results in a large > number of allocations since the record body buffer is allocated each time > without reuse. > https://github.com/apache/arrow/blob/master/go/arrow/ipc/metadata.go#L106 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-14182) [C++][Compute] Hash Join performance improvement
[ https://issues.apache.org/jira/browse/ARROW-14182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace resolved ARROW-14182. - Resolution: Fixed Issue resolved by pull request 13493 [https://github.com/apache/arrow/pull/13493] > [C++][Compute] Hash Join performance improvement > > > Key: ARROW-14182 > URL: https://issues.apache.org/jira/browse/ARROW-14182 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 6.0.0 >Reporter: Michal Nowakiewicz >Assignee: Michal Nowakiewicz >Priority: Major > Labels: pull-request-available, query-engine > Fix For: 9.0.0 > > Time Spent: 6.5h > Remaining Estimate: 0h > > Add micro-benchmarks for hash join exec node. > Write a new implementation of the interface HashJoinImpl making sure that it > is efficient for all types of join. Current implementation, based on > unordered map, trades performance for a simpler code and is likely not as > fast as it could be. -- This message was sent by Atlassian Jira (v8.20.10#820010)
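The "unordered map" approach the ticket describes is the classic build/probe hash join. A minimal stdlib sketch of an inner equi-join (an illustration of the scheme only, not the Arrow C++ implementation):

```python
def hash_join(left, right, key):
    """Inner equi-join: build a hash table on the smaller input,
    then probe it with each row of the larger input."""
    if len(left) <= len(right):
        build, probe, swapped = left, right, False
    else:
        build, probe, swapped = right, left, True

    table = {}
    for row in build:
        table.setdefault(row[key], []).append(row)  # build phase

    out = []
    for row in probe:  # probe phase
        for match in table.get(row[key], []):
            l, r = (match, row) if not swapped else (row, match)
            out.append({**l, **r})
    return out


left = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}]
right = [{"k": 1, "b": 10}, {"k": 1, "b": 11}, {"k": 3, "b": 12}]
assert len(hash_join(left, right, "k")) == 2  # k=1 matches twice, k=2/k=3 don't
```

Building on the smaller side keeps the hash table small; the performance work in this ticket is about replacing exactly this kind of generic hashtable probe with a vectorized, cache-friendly layout.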
[jira] [Closed] (ARROW-16288) [C++] ValueDescr::SCALAR nearly unused and does not work for projection
[ https://issues.apache.org/jira/browse/ARROW-16288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li closed ARROW-16288. Resolution: Not A Problem ValueDescr was simply removed. > [C++] ValueDescr::SCALAR nearly unused and does not work for projection > --- > > Key: ARROW-16288 > URL: https://issues.apache.org/jira/browse/ARROW-16288 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Priority: Major > > First, there are almost no kernels that actually use this shape. Only the > functions "all", "any", "list_element", "mean", "product", "struct_field", > and "sum" have kernels with this shape. Most kernels that have special logic > for scalars handle it by using {{ValueDescr::ANY}} > Second, when passing an expression to the project node, the expression must > be bound based on the dataset schema. Since the binding happens based on a > schema (and not a batch) the function is bound to ValueDescr::ARRAY > (https://github.com/apache/arrow/blob/a16be6b7b6c8271202ff766b99c199b2e29bdfa8/cpp/src/arrow/compute/exec/expression.cc#L461) > This results in an error if the function has only ValueDescr::SCALAR kernels > and would likely be a problem even if the function had both types of kernels > because it would get bound to the wrong kernel. > This simplest fix may be to just get rid of ValueDescr and change all kernels > to ValueDescr::ANY behavior. If we choose to keep it we will need to figure > out how to handle this kind of binding. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17068) [Python] "pyarrow.parquet.write_to_dataset", option "file_visitor" nothing happen
Alejandro Marco Ramos created ARROW-17068: - Summary: [Python] "pyarrow.parquet.write_to_dataset", option "file_visitor" nothing happen Key: ARROW-17068 URL: https://issues.apache.org/jira/browse/ARROW-17068 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 8.0.0 Reporter: Alejandro Marco Ramos When try to use the callback "file_visitor", nothing happens. Example: {code:java} import pyarrow as pa from pyarrow import parquet as pa_parquet table = pa.table([ pa.array([1, 2, 3, 4, 5]), pa.array(["a", "b", "c", "d", "e"]), pa.array([1.0, 2.0, 3.0, 4.0, 5.0]) ], names=["col1", "col2", "col3"]) written_files = [] pa_parquet.write_to_dataset(table, partition_cols=["col2"], root_path="tests", file_visitor=lambda x: written_files.append(x.path))) assert len(written_files) > 0 # This raises, length is 0{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17064) [Python] Python hangs when use pyarrow.fs.copy_files with "used_threads=True"
[ https://issues.apache.org/jira/browse/ARROW-17064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alejandro Marco Ramos updated ARROW-17064: -- Summary: [Python] Python hangs when use pyarrow.fs.copy_files with "used_threads=True" (was: Python hangs when use pyarrow.fs.copy_files is used with "used_threads=True") > [Python] Python hangs when use pyarrow.fs.copy_files with "used_threads=True" > - > > Key: ARROW-17064 > URL: https://issues.apache.org/jira/browse/ARROW-17064 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 8.0.0 >Reporter: Alejandro Marco Ramos >Priority: Major > > When trying to copy a local path to a remote S3 filesystem using > `pyarrow.fs.copy_files` with the default parameter `use_threads=True`, the > system hangs. If `use_threads=False` is used, the operation completes > correctly (but more slowly). > > My code is: > {code:python} > >>> import pyarrow as pa > >>> s3fs=pa.fs.S3FileSystem(endpoint_override="http://xx") > >>> pa.fs.copy_files("tests/data/payments", "bucket/payments", > >>> destination_filesystem=s3fs) > ... (doesn't return){code} > If the remote S3 bucket is checked, all files appear, but the function > doesn't return > > Platform: Windows -- This message was sent by Atlassian Jira (v8.20.10#820010)
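For context on what the `use_threads` toggle controls, here is a generic stdlib copier that switches between serial and thread-pool execution. This is a sketch of the pattern only — it is not pyarrow's implementation and says nothing about the cause of the reported hang:

```python
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor


def copy_files(src_dir, dst_dir, use_threads=True):
    """Copy every regular file under src_dir into dst_dir, optionally in parallel."""
    os.makedirs(dst_dir, exist_ok=True)
    names = [n for n in os.listdir(src_dir)
             if os.path.isfile(os.path.join(src_dir, n))]
    if use_threads:
        with ThreadPoolExecutor() as pool:
            # list() drains the iterator so all copies finish before the pool exits
            list(pool.map(lambda n: shutil.copy2(os.path.join(src_dir, n),
                                                 os.path.join(dst_dir, n)), names))
    else:
        for n in names:
            shutil.copy2(os.path.join(src_dir, n), os.path.join(dst_dir, n))
    return names


with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for i in range(3):
        with open(os.path.join(src, f"f{i}.txt"), "w") as f:
            f.write("payload")
    copied = copy_files(src, dst, use_threads=True)
    assert sorted(os.listdir(dst)) == sorted(copied)
```

In the threaded variant the caller blocks until every worker completes, which is why a stuck worker (as in the report) manifests as the whole call never returning.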
[jira] [Commented] (ARROW-16919) [C++] Flight integration tests fail on verify rc nightly on linux amd64
[ https://issues.apache.org/jira/browse/ARROW-16919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566462#comment-17566462 ] David Li commented on ARROW-16919: -- This is still happening and I wasn't able to get a backtrace…I'll make another try soon. > [C++] Flight integration tests fail on verify rc nightly on linux amd64 > --- > > Key: ARROW-16919 > URL: https://issues.apache.org/jira/browse/ARROW-16919 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration, FlightRPC >Reporter: Raúl Cumplido >Priority: Critical > Labels: Nightly, pull-request-available > Fix For: 9.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Some of our nightly builds to verify the release are failing: > {color:#1d1c1d}- > {color}[verify-rc-source-integration-linux-almalinux-8-amd64|https://github.com/ursacomputing/crossbow/runs/7073206980?check_suite_focus=true] > {color:#1d1c1d}- > {color}[verify-rc-source-integration-linux-ubuntu-18.04-amd64|https://github.com/ursacomputing/crossbow/runs/7073217433?check_suite_focus=true] > {color:#1d1c1d}- > {color}[verify-rc-source-integration-linux-ubuntu-20.04-amd64|https://github.com/ursacomputing/crossbow/runs/7073210299?check_suite_focus=true] > {color:#1d1c1d}- > {color}[verify-rc-source-integration-linux-ubuntu-22.04-amd64|https://github.com/ursacomputing/crossbow/runs/7073273051?check_suite_focus=true] > with the following: > {code:java} > # FAILURES # > FAILED TEST: middleware C++ producing, C++ consuming > 1 failures > File "/arrow/dev/archery/archery/integration/util.py", line 139, in run_cmd > output = subprocess.check_output(cmd, stderr=subprocess.STDOUT) > File "/usr/lib/python3.8/subprocess.py", line 411, in check_output > return run(*popenargs, stdout=PIPE, timeout=timeout, check=True, > File "/usr/lib/python3.8/subprocess.py", line 512, in run > raise CalledProcessError(retcode, process.args, > subprocess.CalledProcessError: Command > 
'['/tmp/arrow-HEAD.PZocX/cpp-build/release/flight-test-integration-client', > '-host', 'localhost', '-port=36719', '-scenario', 'middleware']' died with > . > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "/arrow/dev/archery/archery/integration/runner.py", line 379, in > _run_flight_test_case > consumer.flight_request(port, **client_args) > File "/arrow/dev/archery/archery/integration/tester_cpp.py", line 134, in > flight_request > run_cmd(cmd) > File "/arrow/dev/archery/archery/integration/util.py", line 148, in run_cmd > raise RuntimeError(sio.getvalue()) > RuntimeError: Command failed: > /tmp/arrow-HEAD.PZocX/cpp-build/release/flight-test-integration-client -host > localhost -port=36719 -scenario middleware > With output: > -- > Headers received successfully on failing call. > Headers received successfully on passing call. > free(): double free detected in tcache 2 {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17067) Implement Substring_Index
[ https://issues.apache.org/jira/browse/ARROW-17067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17067: --- Labels: pull-request-available (was: ) > Implement Substring_Index > - > > Key: ARROW-17067 > URL: https://issues.apache.org/jira/browse/ARROW-17067 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Sahaj Gupta >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Adding Substring_index Function. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17067) Implement Substring_Index
Sahaj Gupta created ARROW-17067: --- Summary: Implement Substring_Index Key: ARROW-17067 URL: https://issues.apache.org/jira/browse/ARROW-17067 Project: Apache Arrow Issue Type: New Feature Reporter: Sahaj Gupta Adding Substring_index Function. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17066) [Python][Substrait] "ignore_unknown_fields" should be specified when converting JSON to binary
[ https://issues.apache.org/jira/browse/ARROW-17066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566446#comment-17566446 ] Richard Tia commented on ARROW-17066: - CC: [~westonpace] > [Python][Substrait] "ignore_unknown_fields" should be specified when > converting JSON to binary > -- > > Key: ARROW-17066 > URL: https://issues.apache.org/jira/browse/ARROW-17066 > Project: Apache Arrow > Issue Type: Bug >Reporter: Richard Tia >Priority: Blocker > > [https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.util.json_util#JsonParseOptions] > > When converting a substrait JSON to binary, there are many unknown fields > that may exist since substrait is being built every week. > ignore_unknown_fields should be specified when doing this conversion. > > This is resulting in frequent errors similar to this: > {code:java} > E pyarrow.lib.ArrowInvalid: JsonToBinaryStream returned > INVALID_ARGUMENT:(relations[0].root.input.sort.input.aggregate.measures[0].measure) > arguments: Cannot find field. > pyarrow/error.pxi:100: ArrowInvalid {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17066) [Python][Substrait] "ignore_unknown_fields" should be specified when converting JSON to binary
[ https://issues.apache.org/jira/browse/ARROW-17066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Tia updated ARROW-17066: Summary: [Python][Substrait] "ignore_unknown_fields" should be specified when converting JSON to binary (was: [C++][Substrait] "ignore_unknown_fields" should be specified when converting JSON to binary) > [Python][Substrait] "ignore_unknown_fields" should be specified when > converting JSON to binary > -- > > Key: ARROW-17066 > URL: https://issues.apache.org/jira/browse/ARROW-17066 > Project: Apache Arrow > Issue Type: Bug >Reporter: Richard Tia >Priority: Blocker > > [https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.util.json_util#JsonParseOptions] > > When converting a substrait JSON to binary, there are many unknown fields > that may exist since substrait is being built every week. > ignore_unknown_fields should be specified when doing this conversion. > > This is resulting in frequent errors similar to this: > {code:java} > E pyarrow.lib.ArrowInvalid: JsonToBinaryStream returned > INVALID_ARGUMENT:(relations[0].root.input.sort.input.aggregate.measures[0].measure) > arguments: Cannot find field. > pyarrow/error.pxi:100: ArrowInvalid {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17066) [C++][Substrait] "ignore_unknown_fields" should be specified when converting JSON to binary
Richard Tia created ARROW-17066: --- Summary: [C++][Substrait] "ignore_unknown_fields" should be specified when converting JSON to binary Key: ARROW-17066 URL: https://issues.apache.org/jira/browse/ARROW-17066 Project: Apache Arrow Issue Type: Bug Reporter: Richard Tia [https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.util.json_util#JsonParseOptions] When converting a substrait JSON to binary, there are many unknown fields that may exist since substrait is being built every week. ignore_unknown_fields should be specified when doing this conversion. This is resulting in frequent errors similar to this: {code:java} E pyarrow.lib.ArrowInvalid: JsonToBinaryStream returned INVALID_ARGUMENT:(relations[0].root.input.sort.input.aggregate.measures[0].measure) arguments: Cannot find field. pyarrow/error.pxi:100: ArrowInvalid {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17051) [C++][Flight] arrow-flight-sql-test fails with ASAN UBSAN
[ https://issues.apache.org/jira/browse/ARROW-17051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17051: --- Labels: pull-request-available (was: ) > [C++][Flight] arrow-flight-sql-test fails with ASAN UBSAN > - > > Key: ARROW-17051 > URL: https://issues.apache.org/jira/browse/ARROW-17051 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Reporter: Raúl Cumplido >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The CI job for ASAN UBSAN is based on Ubuntu 20.04: *C++ / AMD64 Ubuntu 20.04 > C++ ASAN UBSAN* > Trying to build Flight and Flight SQL on Ubuntu 20.04 the job for ASAN UBSAN > will also build with Flight and Flight SQL. This triggers some > arrow-flight-sql-test failures like: > {code:java} > [ RUN ] TestFlightSqlClient.TestGetDbSchemas > unknown file: Failure > Unexpected mock function call - taking default action specified at: > /arrow/cpp/src/arrow/flight/sql/client_test.cc:151: > Function call: GetFlightInfo(@0x6157d948 184-byte object <00-00 00-00 > 00-00 F0-BF 40-00 00-00 00-00 00-00 80-4C 06-49 CF-7F 00-00 00-00 00-00 00-00 > 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 01-01 00-00 00-00 00-00 > 00-20 00-00 00-00 00-00 ... 01-00 00-04 00-00 00-00 00-00 00-00 00-00 00-00 > 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 > 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>, > @0x7fff35794e80 64-byte object <02-00 00-00 00-00 00-00 C0-45 08-00 B0-60 > 00-00 65-00 00-00 00-00 00-00 65-00 00-00 00-00 00-00 C4-A9 AE-66 00-10 00-00 > 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>) > Returns: (nullptr) > Google Mock tried the following 1 expectation, but it didn't match: > /arrow/cpp/src/arrow/flight/sql/client_test.cc:152: EXPECT_CALL(sql_client_, > GetFlightInfo(Ref(call_options_), descriptor))... 
> Expected arg #1: is equal to 64-byte object <02-00 00-00 BE-BE BE-BE C0-6B > 05-00 C0-60 00-00 73-00 00-00 00-00 00-00 73-00 00-00 00-00 00-00 BE-BE BE-BE > BE-BE BE-BE 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 > 00-00> > Actual: 64-byte object <02-00 00-00 00-00 00-00 C0-45 08-00 B0-60 > 00-00 65-00 00-00 00-00 00-00 65-00 00-00 00-00 00-00 C4-A9 AE-66 00-10 00-00 > 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00> > Expected: to be called once > Actual: never called - unsatisfied and active > /arrow/cpp/src/arrow/flight/sql/client_test.cc:152: Failure > Actual function call count doesn't match EXPECT_CALL(sql_client_, > GetFlightInfo(Ref(call_options_), descriptor))... > Expected: to be called once > Actual: never called - unsatisfied and active > [ FAILED ] TestFlightSqlClient.TestGetDbSchemas (1 ms){code} > The error can be seen here: > [https://github.com/apache/arrow/runs/7297442828?check_suite_focus=true] > This is the initial PR that triggered it: > [https://github.com/apache/arrow/pull/13548] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17051) [C++][Flight] arrow-flight-sql-test fails with ASAN UBSAN
[ https://issues.apache.org/jira/browse/ARROW-17051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566440#comment-17566440 ] David Li commented on ARROW-17051: -- Ok, it only occurs with bundled (static) Protobuf/gRPC. It's not related to ASAN/UBSAN, this will do it: {noformat} -DARROW_FLIGHT=ON -DARROW_FLIGHT_SQL=ON -DARROW_BUILD_TESTS=ON -DProtobuf_SOURCE=BUNDLED -DgRPC_SOURCE=BUNDLED -DGTest_SOURCE=BUNDLED -DARROW_BUILD_SHARED=ON -DARROW_BUILD_STATIC=OFF {noformat} It also fails differently when only a single test is run. I suspect that gRPC/Protobuf is getting linked twice, which is a common issue. Both libarrow_flight and libarrow_flight_sql contain Protobuf symbols. {{env LD_DEBUG=all}} shows the dynamic linker is not resolving any Protobuf symbols - so presumably each library is using its own copy of Protobuf. But Protobuf has global state. To wit, it passes if we set {{-DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON}} instead. So I think the solution here is: change this job to link statically instead of dynamically, and prevent Flight from building shared libraries if Protobuf/gRPC are static dependencies. > [C++][Flight] arrow-flight-sql-test fails with ASAN UBSAN > - > > Key: ARROW-17051 > URL: https://issues.apache.org/jira/browse/ARROW-17051 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Reporter: Raúl Cumplido >Priority: Major > > The CI job for ASAN UBSAN is based on Ubuntu 20.04: *C++ / AMD64 Ubuntu 20.04 > C++ ASAN UBSAN* > Trying to build Flight and Flight SQL on Ubuntu 20.04 the job for ASAN UBSAN > will also build with Flight and Flight SQL. 
This triggers some > arrow-flight-sql-test failures like: > {code:java} > [ RUN ] TestFlightSqlClient.TestGetDbSchemas > unknown file: Failure > Unexpected mock function call - taking default action specified at: > /arrow/cpp/src/arrow/flight/sql/client_test.cc:151: > Function call: GetFlightInfo(@0x6157d948 184-byte object <00-00 00-00 > 00-00 F0-BF 40-00 00-00 00-00 00-00 80-4C 06-49 CF-7F 00-00 00-00 00-00 00-00 > 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 01-01 00-00 00-00 00-00 > 00-20 00-00 00-00 00-00 ... 01-00 00-04 00-00 00-00 00-00 00-00 00-00 00-00 > 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 > 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>, > @0x7fff35794e80 64-byte object <02-00 00-00 00-00 00-00 C0-45 08-00 B0-60 > 00-00 65-00 00-00 00-00 00-00 65-00 00-00 00-00 00-00 C4-A9 AE-66 00-10 00-00 > 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>) > Returns: (nullptr) > Google Mock tried the following 1 expectation, but it didn't match: > /arrow/cpp/src/arrow/flight/sql/client_test.cc:152: EXPECT_CALL(sql_client_, > GetFlightInfo(Ref(call_options_), descriptor))... > Expected arg #1: is equal to 64-byte object <02-00 00-00 BE-BE BE-BE C0-6B > 05-00 C0-60 00-00 73-00 00-00 00-00 00-00 73-00 00-00 00-00 00-00 BE-BE BE-BE > BE-BE BE-BE 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 > 00-00> > Actual: 64-byte object <02-00 00-00 00-00 00-00 C0-45 08-00 B0-60 > 00-00 65-00 00-00 00-00 00-00 65-00 00-00 00-00 00-00 C4-A9 AE-66 00-10 00-00 > 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00> > Expected: to be called once > Actual: never called - unsatisfied and active > /arrow/cpp/src/arrow/flight/sql/client_test.cc:152: Failure > Actual function call count doesn't match EXPECT_CALL(sql_client_, > GetFlightInfo(Ref(call_options_), descriptor))... 
> Expected: to be called once > Actual: never called - unsatisfied and active > [ FAILED ] TestFlightSqlClient.TestGetDbSchemas (1 ms){code} > The error can be seen here: > [https://github.com/apache/arrow/runs/7297442828?check_suite_focus=true] > This is the initial PR that triggered it: > [https://github.com/apache/arrow/pull/13548] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-15762) [R] Revisit binding_format_datetime and remove manual casting
[ https://issues.apache.org/jira/browse/ARROW-15762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dragoș Moldovan-Grünfeld reassigned ARROW-15762: Assignee: Dragoș Moldovan-Grünfeld > [R] Revisit binding_format_datetime and remove manual casting > -- > > Key: ARROW-15762 > URL: https://issues.apache.org/jira/browse/ARROW-15762 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dragoș Moldovan-Grünfeld >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > > This is a follow-up issue to revisit the casting step in format once > [https://github.com/apache/arrow/pull/12240] gets merged. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-16863) [R] open_dataset() silently drops the missing values from a csv file
[ https://issues.apache.org/jira/browse/ARROW-16863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson closed ARROW-16863. --- Assignee: Neal Richardson Resolution: Not A Problem > [R] open_dataset() silently drops the missing values from a csv file > > > Key: ARROW-16863 > URL: https://issues.apache.org/jira/browse/ARROW-16863 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Zsolt Kegyes-Brassai >Assignee: Neal Richardson >Priority: Major > > The {{open_dataset()}} +silently+ drops the empty/missing values from a csv > file. This empty string was generated when writing a dataframe containing a > NA value using the {{{}write_csv_arrow(){}}}. > > {code:java} > df_numbers <- tibble::tibble(number = c(1, 2, "error", 4, 5, NA, 7, 8)) > arrow::write_csv_arrow(df_numbers, "numbers.csv") > readLines("numbers.csv") > #> [1] "\"number\"" "\"1\"" "\"2\"" "\"error\"" "\"4\"" > #> [6] "\"5\"" "" "\"7\"" "\"8\"" > arrow::open_dataset("numbers.csv", format = "csv") |> dplyr::collect() > #> # A tibble: 7 x 1 > #> number > #> > #> 1 1 > #> 2 2 > #> 3 error > #> 4 4 > #> 5 5 > #> 6 7 > #> 7 8 > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-16863) [R] open_dataset() silently drops the missing values from a csv file
[ https://issues.apache.org/jira/browse/ARROW-16863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566426#comment-17566426 ] Neal Richardson commented on ARROW-16863: - I think this is only an issue because the "csv" just has a single column (no commas involved really). So your missing value shows up as just an extra newline character. This behavior is consistent with base::read.csv() and readr::read_csv(): {code} > read.csv("numbers.csv") number 1 1 2 2 3 error 4 4 5 5 6 7 7 8 > readr::read_csv("numbers.csv") Rows: 7 Columns: 1 ── Column specification ─ Delimiter: "," chr (1): number ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. # A tibble: 7 × 1 number 1 1 2 2 3 error 4 4 5 5 6 7 7 8 {code} And if you have more than one column, there is no issue: {code} > df_numbers$num2 <- df_numbers$number > tf <- tempfile() > write_csv_arrow(df_numbers, tf) > open_dataset(tf, format = "csv") %>% collect() # A tibble: 8 × 2 number num2 1 "1" "1" 2 "2" "2" 3 "error" "error" 4 "4" "4" 5 "5" "5" 6 "" "" 7 "7" "7" 8 "8" "8" {code} > [R] open_dataset() silently drops the missing values from a csv file > > > Key: ARROW-16863 > URL: https://issues.apache.org/jira/browse/ARROW-16863 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Zsolt Kegyes-Brassai >Priority: Major > > The {{open_dataset()}} +silently+ drops the empty/missing values from a csv > file. This empty string was generated when writing a dataframe containing a > NA value using the {{{}write_csv_arrow(){}}}. 
> > {code:java} > df_numbers <- tibble::tibble(number = c(1, 2, "error", 4, 5, NA, 7, 8)) > arrow::write_csv_arrow(df_numbers, "numbers.csv") > readLines("numbers.csv") > #> [1] "\"number\"" "\"1\"" "\"2\"" "\"error\"" "\"4\"" > #> [6] "\"5\"" "" "\"7\"" "\"8\"" > arrow::open_dataset("numbers.csv", format = "csv") |> dplyr::collect() > #> # A tibble: 7 x 1 > #> number > #> > #> 1 1 > #> 2 2 > #> 3 error > #> 4 4 > #> 5 5 > #> 6 7 > #> 7 8 > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
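For reference, the single-column ambiguity Neal describes can be reproduced outside Arrow with Python's stdlib csv module: in a one-column file a missing value is just an empty line, which parses as an empty row and is easily (and silently) dropped, while a second column keeps the row alive. This is only an analogy to the R behavior above, not the Arrow CSV reader itself:

```python
import csv
import io

# One column: the NA was written as an empty line. csv.reader yields
# a blank line as an empty row ([]), which a naive filter discards.
one_col = '"number"\n"1"\n\n"3"\n'
rows_one = [row for row in csv.reader(io.StringIO(one_col)) if row]

# Two columns: the same NA row survives as ['', ''] because the
# delimiter makes the line non-empty.
two_col = '"number","num2"\n"1","1"\n"",""\n"3","3"\n'
rows_two = list(csv.reader(io.StringIO(two_col)))
```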
[jira] [Resolved] (ARROW-17045) [C++] Reject trailing slashes on file path
[ https://issues.apache.org/jira/browse/ARROW-17045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-17045. Resolution: Fixed Issue resolved by pull request 13577 [https://github.com/apache/arrow/pull/13577] > [C++] Reject trailing slashes on file path > -- > > Key: ARROW-17045 > URL: https://issues.apache.org/jira/browse/ARROW-17045 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 8.0.0 >Reporter: Will Jones >Assignee: Will Jones >Priority: Critical > Labels: breaking-api, pull-request-available > Fix For: 9.0.0 > > Time Spent: 4h 10m > Remaining Estimate: 0h > > We had several different behaviors when passing in file paths with trailing > slashes: LocalFileSystem would return IOError, S3 would trim off the trailing > slash, and GCS would keep the trailing slash as part of the file name (later > creating confusion as the file would be labelled a "directory" in list > calls). This PR moves them all to the behavior of LocalFileSystem: return > IOError. > The R filesystem bindings relied on the behavior provided by S3, so they are > now modified to trim the trailing slash before passing down to C++. > Here is an example of the differences in behavior between S3 and GCS: > {code:python} > import pyarrow.fs > from pyarrow.fs import FileSelector > from datetime import timedelta > gcs = pyarrow.fs.GcsFileSystem( > endpoint_override="localhost:9001", > scheme="http", > anonymous=True, > retry_time_limit=timedelta(seconds=1), > ) > gcs.create_dir("py_test") > # Writing to test.txt with and without slash produces a file and a directory!? 
> with gcs.open_output_stream("py_test/test.txt") as out_stream: > out_stream.write(b"Hello world!") > with gcs.open_output_stream("py_test/test.txt/") as out_stream: > out_stream.write(b"Hello world!") > gcs.get_file_info(FileSelector("py_test")) > # [, for 'py_test/test.txt': type=FileType.Directory>] > s3 = pyarrow.fs.S3FileSystem( > access_key="minioadmin", > secret_key="minioadmin", > scheme="http", > endpoint_override="localhost:9000", > allow_bucket_creation=True, > allow_bucket_deletion=True, > ) > s3.create_dir("py-test") > # Writing to test.txt with and without slash writes to same file > with s3.open_output_stream("py-test/test.txt") as out_stream: > out_stream.write(b"Hello world!") > with s3.open_output_stream("py-test/test.txt/") as out_stream: > out_stream.write(b"Hello world!") > s3.get_file_info(FileSelector("py-test")) > # [] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
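A minimal sketch of the behavior this change standardizes on — reject file paths with trailing slashes up front, instead of each backend improvising. The helper name is hypothetical; the real check lives in the C++ filesystem layer:

```python
def validate_file_path(path: str) -> str:
    """Return the path unchanged, or raise IOError if it ends with a
    slash -- the LocalFileSystem behavior all filesystems now share,
    rather than S3's silent trimming or GCS's phantom 'directory'."""
    if path.endswith("/"):
        raise IOError(f"Expected a file path, got a trailing slash: {path!r}")
    return path
```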
[jira] [Resolved] (ARROW-11341) [Python] [Gandiva] Check parameters are not None
[ https://issues.apache.org/jira/browse/ARROW-11341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-11341. -- Fix Version/s: 9.0.0 Resolution: Fixed Issue resolved by pull request 9289 [https://github.com/apache/arrow/pull/9289] > [Python] [Gandiva] Check parameters are not None > > > Key: ARROW-11341 > URL: https://issues.apache.org/jira/browse/ARROW-11341 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Gandiva, Python >Reporter: Will Jones >Assignee: Will Jones >Priority: Minor > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > Most of the functions in Gandiva's Python Expression builder interface > currently accept None in their arguments, but will segfault once they are used. > Example: > {code:python} > import pyarrow > import pyarrow.gandiva as gandiva > builder = gandiva.TreeExprBuilder() > field = pyarrow.field('whatever', type=pyarrow.date64()) > date_col = builder.make_field(field) > func = builder.make_function('less_than_or_equal_to', [date_col, None], > pyarrow.bool_()) > condition = builder.make_condition(func) > # Will segfault on this line: > gandiva.make_filter(pyarrow.schema([field]), condition) > {code} > I think this is just a matter of adding {{not None}} to the appropriate > function arguments. -- This message was sent by Atlassian Jira (v8.20.10#820010)
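In Cython, appending {{not None}} to a typed argument makes the generated wrapper raise TypeError before native code ever sees a NULL. A plain-Python sketch of the equivalent guard (the names echo the example above, but this is an illustration, not the pyarrow source):

```python
def make_function(name, children, return_type):
    """Reject None arguments up front, as Cython's `not None` clause
    would, instead of letting them reach Gandiva and segfault."""
    if name is None or children is None or return_type is None:
        raise TypeError("name, children and return_type must not be None")
    if any(child is None for child in children):
        raise TypeError("children must not contain None")
    return (name, tuple(children), return_type)
```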
[jira] [Resolved] (ARROW-16324) [Go] Implement Dictionary Unification
[ https://issues.apache.org/jira/browse/ARROW-16324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol resolved ARROW-16324. --- Fix Version/s: 9.0.0 Resolution: Fixed Issue resolved by pull request 13529 [https://github.com/apache/arrow/pull/13529] > [Go] Implement Dictionary Unification > - > > Key: ARROW-16324 > URL: https://issues.apache.org/jira/browse/ARROW-16324 > Project: Apache Arrow > Issue Type: New Feature > Components: Go >Reporter: Matthew Topol >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-8043) [Developer] Provide better visibility for failed nightly builds
[ https://issues.apache.org/jira/browse/ARROW-8043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane closed ARROW-8043. - Resolution: Fixed Done with the work on https://crossbow.voltrondata.com > [Developer] Provide better visibility for failed nightly builds > --- > > Key: ARROW-8043 > URL: https://issues.apache.org/jira/browse/ARROW-8043 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Developer Tools >Affects Versions: 0.16.0 >Reporter: Ben Kietzman >Assignee: Sam Albers >Priority: Major > Labels: pull-request-available > > Emails reporting nightly failures are unsatisfactory in two ways: there is a > large click/scroll distance between the links presented in that email and the > actual error message. Worse, once one is there it's not clear what JIRAs have > been made or which of them are in progress. > One solution would be to replace or augment the [NIGHTLY] email with a page > ([https://ursa-labs.github.org/crossbow] would be my favorite) which shows > how many nights it has failed, a shortcut to the actual error line in CI's > logs, and useful views of JIRA. We could accomplish this with: > - dedicated JIRA tags; one for each nightly job so a JIRA can be easily > associated with specific jobs > - A static HTML dashboard with client side JavaScript to > ** scrape JIRA and update the page dynamically as soon as JIRAs are opened > ** show any relationships between failing jobs > ** highlight jobs that have not been addressed, along with a counter of how > many nights it has gone unaddressed > - provide automatic and expedited creation of correctly labelled JIRAs, so > that viewers can quickly organize/take ownership of a failed nightly job. 
> JIRA supports reading form fields from URL parameters, so this would be > fairly straightforward: > > [https://issues.apache.org/jira/secure/CreateIssueDetails!init.jspa?pid=12319525=1=[NIGHTLY:gandiva-jar-osx,gandiva-jar-trusty]=12340948=12347769=12334626] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (ARROW-8043) [Developer] Provide better visibility for failed nightly builds
[ https://issues.apache.org/jira/browse/ARROW-8043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reopened ARROW-8043: --- Assignee: Sam Albers > [Developer] Provide better visibility for failed nightly builds > --- > > Key: ARROW-8043 > URL: https://issues.apache.org/jira/browse/ARROW-8043 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Developer Tools >Affects Versions: 0.16.0 >Reporter: Ben Kietzman >Assignee: Sam Albers >Priority: Major > Labels: pull-request-available > > Emails reporting nightly failures are unsatisfactory in two ways: there is a > large click/scroll distance between the links presented in that email and the > actual error message. Worse, once one is there it's not clear what JIRAs have > been made or which of them are in progress. > One solution would be to replace or augment the [NIGHTLY] email with a page > ([https://ursa-labs.github.org/crossbow] would be my favorite) which shows > how many nights it has failed, a shortcut to the actual error line in CI's > logs, and useful views of JIRA. We could accomplish this with: > - dedicated JIRA tags; one for each nightly job so a JIRA can be easily > associated with specific jobs > - A static HTML dashboard with client side JavaScript to > ** scrape JIRA and update the page dynamically as soon as JIRAs are opened > ** show any relationships between failing jobs > ** highlight jobs that have not been addressed, along with a counter of how > many nights it has gone unaddressed > - provide automatic and expedited creation of correctly labelled JIRAs, so > that viewers can quickly organize/take ownership of a failed nightly job. 
> JIRA supports reading form fields from URL parameters, so this would be > fairly straightforward: > > [https://issues.apache.org/jira/secure/CreateIssueDetails!init.jspa?pid=12319525=1=[NIGHTLY:gandiva-jar-osx,gandiva-jar-trusty]=12340948=12347769=12334626] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-13936) Add a column to show us the number of times that this job is failing
[ https://issues.apache.org/jira/browse/ARROW-13936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane closed ARROW-13936. -- Resolution: Fixed Done with the work on https://crossbow.voltrondata.com > Add a column to show us the number of times that this job is failing > --- > > Key: ARROW-13936 > URL: https://issues.apache.org/jira/browse/ARROW-13936 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: David Dali Susanibar Arce >Assignee: Sam Albers >Priority: Minor > > Try to use an external repository to collect information about failing job names -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (ARROW-13936) Add a column to show us the number of times that this job is failing
[ https://issues.apache.org/jira/browse/ARROW-13936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reopened ARROW-13936: Assignee: Sam Albers > Add a column to show us the number of times that this job is failing > --- > > Key: ARROW-13936 > URL: https://issues.apache.org/jira/browse/ARROW-13936 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: David Dali Susanibar Arce >Assignee: Sam Albers >Priority: Minor > > Try to use an external repository to collect information about failing job names -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-12845) [R] [C++] S3 connections for different providers
[ https://issues.apache.org/jira/browse/ARROW-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane closed ARROW-12845. -- Resolution: Won't Fix > [R] [C++] S3 connections for different providers > > > Key: ARROW-12845 > URL: https://issues.apache.org/jira/browse/ARROW-12845 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, R >Affects Versions: 4.0.0 >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Minor > Labels: pull-request-available > Time Spent: 3h > Remaining Estimate: 0h > > Hi > As part of my thesis, I want to create an S3 bucket on DigitalOcean (what > PUC uses), and while I can write parquet files on my laptop and upload to > DigitalOcean Spaces (i.e. an "S3 + Google Drive") from the browser or by > using rclone, I could work on editing the existing code that allows > connecting to Amazon S3, and provide a function that connects to > DigitalOcean/Linode/IBM/etc. > This could be done so that the Amazon URL is the default and the user could > specify something like `new_s3_fun(..., provider = "Tencent")` to connect > to an S3 that is not Amazon. > Also, this involves the need to write more S3 documentation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-12862) [CI] Gather + display reliability of crossbow builds
[ https://issues.apache.org/jira/browse/ARROW-12862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane closed ARROW-12862. -- Resolution: Fixed Done with the work on https://crossbow.voltrondata.com > [CI] Gather + display reliability of crossbow builds > > > Key: ARROW-12862 > URL: https://issues.apache.org/jira/browse/ARROW-12862 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Jonathan Keane >Assignee: Sam Albers >Priority: Major > > From Wes's suggestion on the mailing list: > Having a website > dashboard showing build health over time along with a ~ weekly e-mail > to dev@ indicating currently broken builds and the reliability of each > build over the trailing 7 or 30 days would be useful. Knowing that a > particular build is only passing 20% of the time would help steer our > efforts. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-12862) [CI] Gather + display reliability of crossbow builds
[ https://issues.apache.org/jira/browse/ARROW-12862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reassigned ARROW-12862: -- Assignee: Sam Albers > [CI] Gather + display reliability of crossbow builds > > > Key: ARROW-12862 > URL: https://issues.apache.org/jira/browse/ARROW-12862 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Jonathan Keane >Assignee: Sam Albers >Priority: Major > > From Wes's suggestion on the mailing list: > Having a website > dashboard showing build health over time along with a ~ weekly e-mail > to dev@ indicating currently broken builds and the reliability of each > build over the trailing 7 or 30 days would be useful. Knowing that a > particular build is only passing 20% of the time would help steer our > efforts. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-14378) [R] Make custom extension classes for (some) cols with row-level metadata
[ https://issues.apache.org/jira/browse/ARROW-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane closed ARROW-14378. -- Resolution: Won't Fix We ended up supporting geo columns using the geoarrow package + extension types > [R] Make custom extension classes for (some) cols with row-level metadata > - > > Key: ARROW-14378 > URL: https://issues.apache.org/jira/browse/ARROW-14378 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Jonathan Keane >Priority: Major > > The major usecase for this is SF columns which have attributes/metadata for > each element of a column. We originally stored these in our standard > column-level metadata, but that was very fragile and took forever, so we > disabled it ARROW-13189 > This will likely take some steps to accomplish. I've sketched out some in the > subtasks here (though if we have a different approach, we could do that > directly) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-12182) [R] [Dev] new helpers and suggests for testing
[ https://issues.apache.org/jira/browse/ARROW-12182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane closed ARROW-12182. -- Resolution: Won't Fix > [R] [Dev] new helpers and suggests for testing > -- > > Key: ARROW-12182 > URL: https://issues.apache.org/jira/browse/ARROW-12182 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools, R >Affects Versions: 3.0.0 >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Minor > > _Related to https://issues.apache.org/jira/browse/ARROW-11705_ > While working on the related tickets I've found the following blockers: > 1. Does it make sense to create expect_dplyr_named()? (i.e. to mimic > https://github.com/tidyverse/dplyr/blob/master/tests/testthat/test-mutate.r#L56-L59) > 2. Does it make sense to create expect_dplyr_identical() (i.e. to mimic > https://github.com/tidyverse/dplyr/blob/master/tests/testthat/test-mutate.r#L61-L69 > and > https://github.com/tidyverse/dplyr/blob/master/tests/testthat/test-mutate.r#L83-L91) > 3. Do we need to add glue to Suggests? (i.e. replicate > https://github.com/tidyverse/dplyr/blob/master/tests/testthat/test-mutate.r#L95-L100) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-14624) [R] [Docs] Remove our tabbing hack now that it's supported by pkgdown
[ https://issues.apache.org/jira/browse/ARROW-14624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane closed ARROW-14624. -- Resolution: Fixed This was fixed as part of the work to update the version switcher in the docs. > [R] [Docs] Remove our tabbing hack now that it's supported by pkgdown > - > > Key: ARROW-14624 > URL: https://issues.apache.org/jira/browse/ARROW-14624 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Jonathan Keane >Priority: Major > > tabsets are now supported natively in pkgdown (with bootstrap 5) > https://github.com/r-lib/pkgdown/pull/1694 > So we can pull out the hack we have to make that work for our dev docs > vignette -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-16076) [R] Bindings for the new TPC-H generator
[ https://issues.apache.org/jira/browse/ARROW-16076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane closed ARROW-16076. -- Resolution: Won't Fix Since the TPC-H generator does not generate compliant data, there's not a big need to expose this in R. > [R] Bindings for the new TPC-H generator > > > Key: ARROW-16076 > URL: https://issues.apache.org/jira/browse/ARROW-16076 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Jonathan Keane >Priority: Major > Labels: pull-request-available > Time Spent: 3h 20m > Remaining Estimate: 0h > > Now that https://github.com/apache/arrow/pull/12537 is merged, we should > implement the R changes needed to make that useable from R. > We should basically do the opposite of > https://github.com/apache/arrow/pull/12537/commits/4b16296b4ef8cd3b3d440e8b7f8af32a89a16788 > But also add in the fixes from weston: > https://github.com/westonpace/arrow/commit/7c4c0e0b4e208918eb195701fab5d631b8c9517a -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-13062) [Dev] Add a way for people to add information to our saved crossbow data
[ https://issues.apache.org/jira/browse/ARROW-13062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566383#comment-17566383 ] Jonathan Keane commented on ARROW-13062: [~boshek] Did you already add this ability? I know it's a slightly different set of tickets than the ones we actually worked, but we should either close it as duplicate, done, or won't fix (and feel free to take credit for it if you did it elsewhere as part of a larger ticket!) > [Dev] Add a way for people to add information to our saved crossbow data > > > Key: ARROW-13062 > URL: https://issues.apache.org/jira/browse/ARROW-13062 > Project: Apache Arrow > Issue Type: Sub-task > Components: Developer Tools >Reporter: Jonathan Keane >Priority: Major > > We should have a simple + lightweight way to annotate specific builds with > information like "won't be fixed until dask has a new release" or "this is > supposed to be fixed in ARROW-XXX". > We should find an easy, lightweight way to add this kind of information. > Only relevant in its previous parent: -We *should not* require, ask, or allow > people to add this information to the JSON that is saved as part of > ARROW-13509. That JSON should be kept pristine and not have manual edits. > Instead, we should have a plain-text lookup file that matches notes to > specific builds (maybe to specific dates?)- -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17041) [C++][CI] arrow-compute-scalar-test fails on test-conda-cpp-valgrind
[ https://issues.apache.org/jira/browse/ARROW-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-17041. Resolution: Fixed Issue resolved by pull request 13597 [https://github.com/apache/arrow/pull/13597] > [C++][CI] arrow-compute-scalar-test fails on test-conda-cpp-valgrind > > > Key: ARROW-17041 > URL: https://issues.apache.org/jira/browse/ARROW-17041 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Raúl Cumplido >Assignee: Antoine Pitrou >Priority: Critical > Labels: Nightly, pull-request-available > Fix For: 9.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > There seems to be an issue on the arrow-compute-scalar-test as it has been > failing for the last days, example: > [https://github.com/ursacomputing/crossbow/runs/7274655770] > See [https://crossbow.voltrondata.com/] > Error: > {code:java} > ==13125== > ==13125== HEAP SUMMARY: > ==13125== in use at exit: 16,090 bytes in 161 blocks > ==13125== total heap usage: 14,612,979 allocs, 14,612,818 frees, > 2,853,741,784 bytes allocated > ==13125== > ==13125== LEAK SUMMARY: > ==13125==definitely lost: 0 bytes in 0 blocks > ==13125==indirectly lost: 0 bytes in 0 blocks > ==13125== possibly lost: 0 bytes in 0 blocks > ==13125==still reachable: 16,090 bytes in 161 blocks > ==13125== suppressed: 0 bytes in 0 blocks > ==13125== Reachable blocks (those to which a pointer was found) are not shown. > ==13125== To see them, rerun with: --leak-check=full --show-leak-kinds=all > ==13125== > ==13125== Use --track-origins=yes to see where uninitialised values come from > ==13125== For lists of detected and suppressed errors, rerun with: -s > ==13125== ERROR SUMMARY: 54 errors from 12 contexts (suppressed: 517836 from > 44) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17055) [Java][FlightRPC] flight-core and flight-sql jars delivering same class names
[ https://issues.apache.org/jira/browse/ARROW-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-17055. -- Fix Version/s: 9.0.0 Resolution: Fixed Issue resolved by pull request 13596 [https://github.com/apache/arrow/pull/13596] > [Java][FlightRPC] flight-core and flight-sql jars delivering same class names > - > > Key: ARROW-17055 > URL: https://issues.apache.org/jira/browse/ARROW-17055 > Project: Apache Arrow > Issue Type: Improvement > Components: FlightRPC, Java >Reporter: Kevin Bambrick >Assignee: Kevin Bambrick >Priority: Minor > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > Hello. I am trying to uptake arrow flight sql. We have a check in our build > to make sure that there are no overlapping class files in our project. When > adding the flight-sql dependency to our project, a warning is raised that > flight-sql and flight-core overlap and the jars deliver the same class files. > {code:java} > Sample Warning: WARNING: CLASSPATH OVERLAP: These jars deliver the same class > files: [flight-sql-7.0.0.jar, flight-core-7.0.0.jar] files: > [org/apache/arrow/flight/impl/Flight$FlightDescriptor$1.class, > org/apache/arrow/flight/impl/Flight$ActionOrBuilder.class{code} > > It seems that the classes generated by Flight.proto get generated in both > the flight-sql and flight-core jars. Since these classes are generated in > flight-core, and flight-sql depends on flight-core, can the generation > of Flight.java and FlightServiceGrpc.java be removed from flight-sql, which would > instead rely on them being pulled directly from flight-core? > > thanks in advance! -- This message was sent by Atlassian Jira (v8.20.10#820010)
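The kind of classpath-overlap check the reporter describes can be sketched in a few lines, since a jar is just a zip archive. This is a hedged illustration, not the reporter's actual build check; the jar contents below are fabricated for the demo:

```python
# Detect .class entries delivered by more than one jar (a jar is a zip).
import io
import zipfile

def class_entries(jar_bytes):
    """Return the set of .class entry names inside a jar."""
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        return {name for name in jar.namelist() if name.endswith(".class")}

def make_jar(entries):
    """Build an in-memory jar containing the given (empty) entries."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as jar:
        for name in entries:
            jar.writestr(name, b"")
    return buf.getvalue()

# Two toy jars that both ship the same generated protobuf class.
flight_core = make_jar(
    ["org/apache/arrow/flight/impl/Flight$ActionOrBuilder.class"])
flight_sql = make_jar(
    ["org/apache/arrow/flight/impl/Flight$ActionOrBuilder.class",
     "org/apache/arrow/flight/sql/FlightSqlClient.class"])

overlap = class_entries(flight_core) & class_entries(flight_sql)
```

Any non-empty `overlap` set is exactly the "CLASSPATH OVERLAP" condition the build warning reports.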
[jira] [Assigned] (ARROW-17055) [Java][FlightRPC] flight-core and flight-sql jars delivering same class names
[ https://issues.apache.org/jira/browse/ARROW-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li reassigned ARROW-17055: Assignee: Kevin Bambrick -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-16992) [Java][C++] Separate JNI compilation & linking from main arrow CMakeLists
[ https://issues.apache.org/jira/browse/ARROW-16992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566332#comment-17566332 ] David Dali Susanibar Arce commented on ARROW-16992: --- I agree with all of these points. A PoC could help us a lot to understand how the JNI Java modules can be built in isolation, and then how to invoke that build from the Maven side. > [Java][C++] Separate JNI compilation & linking from main arrow CMakeLists > -- > > Key: ARROW-16992 > URL: https://issues.apache.org/jira/browse/ARROW-16992 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Java >Reporter: Larry White >Priority: Major > > We need to separate the JNI elements from CMakeLists, with related > modifications to the CI build scripts likely. Separating the JNI portion > serves two related purposes: > # Simplify building JNI code against precompiled lib arrow C++ code > # Enable control of the JNI build through Maven, rather than requiring Java devs > to work with CMake directly > [~dsusanibara] > [~kou] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-15938) [R][C++] Segfault in left join with empty right table when filtered on partition
[ https://issues.apache.org/jira/browse/ARROW-15938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace reassigned ARROW-15938: --- Assignee: Weston Pace > [R][C++] Segfault in left join with empty right table when filtered on > partition > > > Key: ARROW-15938 > URL: https://issues.apache.org/jira/browse/ARROW-15938 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 7.0.2 > Environment: ubuntu linux, R4.1.2 >Reporter: Vitalie Spinu >Assignee: Weston Pace >Priority: Major > Labels: query-engine > Fix For: 9.0.0 > > > When the right table in a join is empty as a result of a filtering on a > partition group the join segfaults: > {code:java} > library(arrow) > library(glue) > df <- mutate(iris, id = runif(n())) > dir <- "./tmp/iris" > dir.create(glue("{dir}/group=a/"), recursive = T, showWarnings = F) > dir.create(glue("{dir}/group=b/"), recursive = T, showWarnings = F) > write_parquet(df, glue("{dir}/group=a/part1.parquet")) > write_parquet(df, glue("{dir}/group=b/part2.parquet")) > db1 <- open_dataset(dir) %>% > filter(group == "blabla") > open_dataset(dir) %>% > filter(group == "b") %>% > select(id) %>% > left_join(db1, by = "id") %>% > collect() > {code} > {code:java} > ==24063== Thread 7: > ==24063== Invalid read of size 1 > ==24063== at 0x1FFE606D: > arrow::compute::HashJoinBasicImpl::ProbeBatch_OutputOne(long, > arrow::compute::ExecBatch*, arrow::compute::ExecBatch*, > arrow::compute::ExecBatch*, arrow::compute::ExecBatch*) (in > /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0) > ==24063== by 0x1FFE68CC: > arrow::compute::HashJoinBasicImpl::ProbeBatch_OutputOne(unsigned long, long, > int const*, int const*) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0) > ==24063== by 0x1FFE84D5: > arrow::compute::HashJoinBasicImpl::ProbeBatch(unsigned long, > arrow::compute::ExecBatch const&) (in > /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0) > ==24063== by 0x1FFE8CB4: > 
arrow::compute::HashJoinBasicImpl::InputReceived(unsigned long, int, > arrow::compute::ExecBatch) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0) > ==24063== by 0x200011CF: > arrow::compute::HashJoinNode::InputReceived(arrow::compute::ExecNode*, > arrow::compute::ExecBatch) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0) > ==24063== by 0x1FFB580E: > arrow::compute::MapNode::SubmitTask(std::function > (arrow::compute::ExecBatch)>, > arrow::compute::ExecBatch)::{lambda()#1}::operator()() const (in > /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0) > ==24063== by 0x1FFB6444: arrow::internal::FnOnce ()>::FnImpl (arrow::Future, > arrow::compute::MapNode::SubmitTask(std::function > (arrow::compute::ExecBatch)>, > arrow::compute::ExecBatch)::{lambda()#2}::operator()() const::{lambda()#1})> > >::invoke() (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0) > ==24063== by 0x1FE2B2A0: > std::thread::_State_impl > > >::_M_run() (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0) > ==24063== by 0x92844BF: ??? (in > /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.29) > ==24063== by 0x6DD46DA: start_thread (pthread_create.c:463) > ==24063== by 0x710D71E: clone (clone.S:95) > ==24063== Address 0x10 is not stack'd, malloc'd or (recently) free'd > ==24063== *** caught segfault *** > address 0x10, cause 'memory not mapped'Traceback: > 1: Table__from_RecordBatchReader(self) > 2: tab$read_table() > 3: do_exec_plan(x) > 4: doTryCatch(return(expr), name, parentenv, handler) > 5: tryCatchOne(expr, names, parentenv, handlers[[1L]]) > 6: tryCatchList(expr, classes, parentenv, handlers) > 7: tryCatch(tab <- do_exec_plan(x), error = function(e) { > handle_csv_read_error(e, x$.data$schema)}) > 8: collect.arrow_dplyr_query(.) > 9: collect(.) 
> 10: open_dataset(dir) %>% filter(group == "b") %>% select(id) %>% > left_join(db1, by = "id") %>% collect()Possible actions: > 1: abort (with core dump, if enabled) > 2: normal R exit > 3: exit R without saving workspace > 4: exit R saving workspace {code} > This is arrow from the current master ece0e23f1. > It's worth noting that if the right table is filtered on a non-partitioned > variable the problem does not occur. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16523) [C++] Move ExecPlan scheduling into the plan
[ https://issues.apache.org/jira/browse/ARROW-16523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace updated ARROW-16523: Labels: pull-request-available query-engine (was: acero pull-request-available) > [C++] Move ExecPlan scheduling into the plan > > > Key: ARROW-16523 > URL: https://issues.apache.org/jira/browse/ARROW-16523 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Weston Pace >Assignee: Sasha Krassovsky >Priority: Major > Labels: pull-request-available, query-engine > Time Spent: 3h > Remaining Estimate: 0h > > Source nodes and pipeline breakers need to schedule new thread tasks. These > tasks run entire fused pipelines (e.g. the thread task could be thought of as > analogous to a "driver" in some other models). > At the moment every node that needs to schedule tasks (scan node, hash-join > node, aggregate node, etc.) handles this independently. The result is a lot > of similar looking code and bugs like ARROW-15221 where one node takes care > of cleanup but another doesn't. > We can centralize this by moving this scheduling into the ExecPlan itself and > giving nodes an ability to schedule tasks via the ExecPlan. -- This message was sent by Atlassian Jira (v8.20.10#820010)
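The refactor described above — nodes asking the plan to schedule tasks instead of each node owning its own thread bookkeeping — can be illustrated with a toy sketch. All names below are invented; this is not Arrow's C++ design, just the shape of the centralization:

```python
# Toy sketch: the plan owns the thread pool and task tracking, so cleanup
# happens in one place instead of per node (cf. bugs like ARROW-15221).
from concurrent.futures import ThreadPoolExecutor

class ExecPlan:
    def __init__(self, max_workers=4):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = []

    def schedule_task(self, fn, *args):
        """Single entry point every node uses to launch work."""
        self._futures.append(self._pool.submit(fn, *args))

    def finish(self):
        """One place where waiting and cleanup happen for all nodes."""
        results = [f.result() for f in self._futures]
        self._pool.shutdown()
        return results

plan = ExecPlan()
# e.g. a source node submitting one fused-pipeline task per batch:
for batch in ([1, 2], [3, 4], [5]):
    plan.schedule_task(sum, batch)
totals = plan.finish()
```

With this shape, a scan node or hash-join node never touches the pool directly, so no single node can forget the cleanup step.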
[jira] [Updated] (ARROW-16628) [C++] Support limit operation
[ https://issues.apache.org/jira/browse/ARROW-16628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace updated ARROW-16628: Labels: query-engine (was: acero) > [C++] Support limit operation > - > > Key: ARROW-16628 > URL: https://issues.apache.org/jira/browse/ARROW-16628 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Neal Richardson >Priority: Major > Labels: query-engine > > Either an option to a SinkNode (TopK already takes a number of results to > keep) or a streaming LimitNode that only lets N rows through. -- This message was sent by Atlassian Jira (v8.20.10#820010)
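The "streaming LimitNode that only lets N rows through" can be sketched as a generator that stops pulling from upstream once the limit is reached — the point being that, unlike slicing a materialized result, an unbounded source is never fully consumed (a sketch, not Arrow's implementation):

```python
# A streaming limit in miniature: emit at most n rows, then stop
# consuming the upstream iterator entirely.

def limit_node(rows, n):
    """Yield at most n rows from the upstream iterable."""
    for i, row in enumerate(rows):
        if i >= n:
            return
        yield row

def endless_source():
    """An unbounded upstream; the limit must short-circuit it."""
    row = 0
    while True:
        yield row
        row += 1

first_five = list(limit_node(endless_source(), 5))
```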
[jira] [Closed] (ARROW-8043) [Developer] Provide better visibility for failed nightly builds
[ https://issues.apache.org/jira/browse/ARROW-8043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Dali Susanibar Arce closed ARROW-8043. Resolution: Abandoned Implemented at: https://issues.apache.org/jira/browse/ARROW-16333 > [Developer] Provide better visibility for failed nightly builds > --- > > Key: ARROW-8043 > URL: https://issues.apache.org/jira/browse/ARROW-8043 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Developer Tools >Affects Versions: 0.16.0 >Reporter: Ben Kietzman >Priority: Major > Labels: pull-request-available > > Emails reporting nightly failures are unsatisfactory in two ways: there is a > large click/scroll distance between the links presented in that email and the > actual error message. Worse, once one is there it's not clear what JIRAs have > been made or which of them are in progress. > One solution would be to replace or augment the [NIGHTLY] email with a page > ([https://ursa-labs.github.org/crossbow] would be my favorite) which shows > how many nights it has failed, a shortcut to the actual error line in CI's > logs, and useful views of JIRA. We could accomplish this with: > - dedicated JIRA tags; one for each nightly job so a JIRA can be easily > associated with specific jobs > - A static HTML dashboard with client side JavaScript to > ** scrape JIRA and update the page dynamically as soon as JIRAs are opened > ** show any relationships between failing jobs > ** highlight jobs that have not been addressed, along with a counter of how > many nights it has gone unaddressed > - provide automatic and expedited creation of correctly labelled JIRAs, so > that viewers can quickly organize/take ownership of a failed nightly job. 
> JIRA supports reading form fields from URL parameters, so this would be > fairly straightforward: > > [https://issues.apache.org/jira/secure/CreateIssueDetails!init.jspa?pid=12319525=1=[NIGHTLY:gandiva-jar-osx,gandiva-jar-trusty]=12340948=12347769=12334626] -- This message was sent by Atlassian Jira (v8.20.10#820010)
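The prefilled-creation mechanism mentioned above — Jira reading form fields from URL parameters — can be sketched with the standard library. The project id below is taken from the ticket's example URL; the other field values are assumptions for illustration:

```python
# Build a prefilled Jira "create issue" link via URL parameters.
from urllib.parse import urlencode

base = "https://issues.apache.org/jira/secure/CreateIssueDetails!init.jspa"
params = {
    "pid": "12319525",   # Apache Arrow project id (from the ticket's URL)
    "issuetype": "1",    # assumed issue-type id
    "summary": "[NIGHTLY] gandiva-jar-osx failed",  # assumed summary
}
create_url = base + "?" + urlencode(params)
```

A dashboard could emit one such link per failing job so viewers can file a correctly labelled JIRA in one click.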
[jira] [Commented] (ARROW-8043) [Developer] Provide better visibility for failed nightly builds
[ https://issues.apache.org/jira/browse/ARROW-8043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566325#comment-17566325 ] David Dali Susanibar Arce commented on ARROW-8043: -- This implementation is covered in a more complete fashion by this ticket: https://issues.apache.org/jira/browse/ARROW-16333 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-16575) [R] arrow::write_dataset() does nothing with 0 row dataframes in R
[ https://issues.apache.org/jira/browse/ARROW-16575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566324#comment-17566324 ] Neal Richardson commented on ARROW-16575: - This matches my expectations. write_dataset also won't write files for partitions that don't exist. If you want a file/dataset with 0 rows and just the schema, you can use the single-file writer, write_feather: {code} > write_feather(cars[cars$speed > 1000, ], "test.arrow") > read_feather("test.arrow", as_data_frame=FALSE) Table 0 rows x 2 columns $speed <double> $dist <double> See $metadata for additional Schema metadata {code} > [R] arrow::write_dataset() does nothing with 0 row dataframes in R > -- > > Key: ARROW-16575 > URL: https://issues.apache.org/jira/browse/ARROW-16575 > Project: Apache Arrow > Issue Type: Improvement > Components: R > Environment: Mac OS 12.3, R 4.1 >Reporter: Adam Black >Priority: Minor > > In R a dataframe can have 0 rows. It still has column names and types. > > Expected behavior of arrow::write_dataset > I would expect that it would be possible to have a FileSystemDataset with > zero rows that would contain metadata about the column names and types. > arrow::write_dataset would create the FileSystemDataset metadata when given a > dataframe with zero rows. > > Actual behavior > arrow::write_dataset() does nothing when passed a dataframe with zero rows. 
> > Reproducible example using the current arrow package on CRAN > {code:java} > arrow::write_dataset(cars, here::here("cars")) > arrow::open_dataset(here::here("cars")) > #> FileSystemDataset with 1 Parquet file > #> speed: double > #> dist: double > #> > #> See $metadata for additional Schema metadata > file.exists(here::here("cars")) > #> [1] TRUE > df <- cars[cars$speed > 1000, ] > nrow(df) > #> [1] 0 > arrow::write_dataset(df, here::here("df"), format = "feather") > arrow::open_dataset(here::here("df")) > #> Error: IOError: Cannot list directory > '/private/var/folders/xx/01v98b6546ldnm1rg1_bvk00gn/T/RtmpGkX0gK/reprex-17c305ed29ad5-nerdy-ram/df'. > Detail: [errno 2] No such file or directory > file.exists(here::here("df")) > #> [1] FALSE{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-15938) [R][C++] Segfault in left join with empty right table when filtered on partition
[ https://issues.apache.org/jira/browse/ARROW-15938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-15938: Labels: query-engine (was: ) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-15938) [R][C++] Segfault in left join with empty right table when filtered on partition
[ https://issues.apache.org/jira/browse/ARROW-15938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-15938: Fix Version/s: 9.0.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-15938) [R][C++] Segfault in left join with empty right table when filtered on partition
[ https://issues.apache.org/jira/browse/ARROW-15938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566322#comment-17566322 ] Neal Richardson commented on ARROW-15938: - Confirmed that this is still an issue. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-13936) Add a column to show us the number of times that this job is failing
[ https://issues.apache.org/jira/browse/ARROW-13936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Dali Susanibar Arce closed ARROW-13936. - Resolution: Abandoned > Add a column to show us the number of times that this job is failing > --- > > Key: ARROW-13936 > URL: https://issues.apache.org/jira/browse/ARROW-13936 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: David Dali Susanibar Arce >Priority: Minor > > Try to use an external repository to collect information about failing job names. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-15938) [R][C++] Segfault in left join with empty right table when filtered on partition
[ https://issues.apache.org/jira/browse/ARROW-15938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-15938: Component/s: (was: Compute IR) > [R][C++] Segfault in left join with empty right table when filtered on > partition > > > Key: ARROW-15938 > URL: https://issues.apache.org/jira/browse/ARROW-15938 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 7.0.2 > Environment: ubuntu linux, R4.1.2 >Reporter: Vitalie Spinu >Priority: Major > > When the right table in a join is empty as a result of a filtering on a > partition group the join segfaults: > {code:java} > library(arrow) > library(glue) > df <- mutate(iris, id = runif(n())) > dir <- "./tmp/iris" > dir.create(glue("{dir}/group=a/"), recursive = T, showWarnings = F) > dir.create(glue("{dir}/group=b/"), recursive = T, showWarnings = F) > write_parquet(df, glue("{dir}/group=a/part1.parquet")) > write_parquet(df, glue("{dir}/group=b/part2.parquet")) > db1 <- open_dataset(dir) %>% > filter(group == "blabla") > open_dataset(dir) %>% > filter(group == "b") %>% > select(id) %>% > left_join(db1, by = "id") %>% > collect() > {code} > {code:java} > ==24063== Thread 7: > ==24063== Invalid read of size 1 > ==24063== at 0x1FFE606D: > arrow::compute::HashJoinBasicImpl::ProbeBatch_OutputOne(long, > arrow::compute::ExecBatch*, arrow::compute::ExecBatch*, > arrow::compute::ExecBatch*, arrow::compute::ExecBatch*) (in > /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0) > ==24063== by 0x1FFE68CC: > arrow::compute::HashJoinBasicImpl::ProbeBatch_OutputOne(unsigned long, long, > int const*, int const*) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0) > ==24063== by 0x1FFE84D5: > arrow::compute::HashJoinBasicImpl::ProbeBatch(unsigned long, > arrow::compute::ExecBatch const&) (in > /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0) > ==24063== by 0x1FFE8CB4: > arrow::compute::HashJoinBasicImpl::InputReceived(unsigned long, int, > arrow::compute::ExecBatch) (in 
/home/vspinu/bin/arrow/lib/libarrow.so.800.0.0) > ==24063== by 0x200011CF: > arrow::compute::HashJoinNode::InputReceived(arrow::compute::ExecNode*, > arrow::compute::ExecBatch) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0) > ==24063== by 0x1FFB580E: > arrow::compute::MapNode::SubmitTask(std::function > (arrow::compute::ExecBatch)>, > arrow::compute::ExecBatch)::{lambda()#1}::operator()() const (in > /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0) > ==24063== by 0x1FFB6444: arrow::internal::FnOnce ()>::FnImpl (arrow::Future, > arrow::compute::MapNode::SubmitTask(std::function > (arrow::compute::ExecBatch)>, > arrow::compute::ExecBatch)::{lambda()#2}::operator()() const::{lambda()#1})> > >::invoke() (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0) > ==24063== by 0x1FE2B2A0: > std::thread::_State_impl > > >::_M_run() (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0) > ==24063== by 0x92844BF: ??? (in > /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.29) > ==24063== by 0x6DD46DA: start_thread (pthread_create.c:463) > ==24063== by 0x710D71E: clone (clone.S:95) > ==24063== Address 0x10 is not stack'd, malloc'd or (recently) free'd > ==24063== *** caught segfault *** > address 0x10, cause 'memory not mapped'Traceback: > 1: Table__from_RecordBatchReader(self) > 2: tab$read_table() > 3: do_exec_plan(x) > 4: doTryCatch(return(expr), name, parentenv, handler) > 5: tryCatchOne(expr, names, parentenv, handlers[[1L]]) > 6: tryCatchList(expr, classes, parentenv, handlers) > 7: tryCatch(tab <- do_exec_plan(x), error = function(e) { > handle_csv_read_error(e, x$.data$schema)}) > 8: collect.arrow_dplyr_query(.) > 9: collect(.) > 10: open_dataset(dir) %>% filter(group == "b") %>% select(id) %>% > left_join(db1, by = "id") %>% collect()Possible actions: > 1: abort (with core dump, if enabled) > 2: normal R exit > 3: exit R without saving workspace > 4: exit R saving workspace {code} > This is arrow from the current master ece0e23f1. 
> It's worth noting that if the right table is filtered on a non-partitioned > variable the problem does not occur. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17041) [C++][CI] arrow-compute-scalar-test fails on test-conda-cpp-valgrind
[ https://issues.apache.org/jira/browse/ARROW-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17041: --- Labels: Nightly pull-request-available (was: Nightly) > [C++][CI] arrow-compute-scalar-test fails on test-conda-cpp-valgrind > > > Key: ARROW-17041 > URL: https://issues.apache.org/jira/browse/ARROW-17041 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Raúl Cumplido >Assignee: Antoine Pitrou >Priority: Critical > Labels: Nightly, pull-request-available > Fix For: 9.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > There seems to be an issue on the arrow-compute-scalar-test as it has been > failing for the last days, example: > [https://github.com/ursacomputing/crossbow/runs/7274655770] > See [https://crossbow.voltrondata.com/] > Error: > {code:java} > ==13125== > ==13125== HEAP SUMMARY: > ==13125== in use at exit: 16,090 bytes in 161 blocks > ==13125== total heap usage: 14,612,979 allocs, 14,612,818 frees, > 2,853,741,784 bytes allocated > ==13125== > ==13125== LEAK SUMMARY: > ==13125==definitely lost: 0 bytes in 0 blocks > ==13125==indirectly lost: 0 bytes in 0 blocks > ==13125== possibly lost: 0 bytes in 0 blocks > ==13125==still reachable: 16,090 bytes in 161 blocks > ==13125== suppressed: 0 bytes in 0 blocks > ==13125== Reachable blocks (those to which a pointer was found) are not shown. > ==13125== To see them, rerun with: --leak-check=full --show-leak-kinds=all > ==13125== > ==13125== Use --track-origins=yes to see where uninitialised values come from > ==13125== For lists of detected and suppressed errors, rerun with: -s > ==13125== ERROR SUMMARY: 54 errors from 12 contexts (suppressed: 517836 from > 44) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17041) [C++][CI] arrow-compute-scalar-test fails on test-conda-cpp-valgrind
[ https://issues.apache.org/jira/browse/ARROW-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-17041: -- Assignee: Antoine Pitrou > [C++][CI] arrow-compute-scalar-test fails on test-conda-cpp-valgrind > > > Key: ARROW-17041 > URL: https://issues.apache.org/jira/browse/ARROW-17041 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Raúl Cumplido >Assignee: Antoine Pitrou >Priority: Critical > Labels: Nightly > Fix For: 9.0.0 > > > There seems to be an issue on the arrow-compute-scalar-test as it has been > failing for the last days, example: > [https://github.com/ursacomputing/crossbow/runs/7274655770] > See [https://crossbow.voltrondata.com/] > Error: > {code:java} > ==13125== > ==13125== HEAP SUMMARY: > ==13125== in use at exit: 16,090 bytes in 161 blocks > ==13125== total heap usage: 14,612,979 allocs, 14,612,818 frees, > 2,853,741,784 bytes allocated > ==13125== > ==13125== LEAK SUMMARY: > ==13125==definitely lost: 0 bytes in 0 blocks > ==13125==indirectly lost: 0 bytes in 0 blocks > ==13125== possibly lost: 0 bytes in 0 blocks > ==13125==still reachable: 16,090 bytes in 161 blocks > ==13125== suppressed: 0 bytes in 0 blocks > ==13125== Reachable blocks (those to which a pointer was found) are not shown. > ==13125== To see them, rerun with: --leak-check=full --show-leak-kinds=all > ==13125== > ==13125== Use --track-origins=yes to see where uninitialised values come from > ==13125== For lists of detected and suppressed errors, rerun with: -s > ==13125== ERROR SUMMARY: 54 errors from 12 contexts (suppressed: 517836 from > 44) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17062) [C#] Support compression in IPC format
[ https://issues.apache.org/jira/browse/ARROW-17062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566301#comment-17566301 ] Neal Richardson commented on ARROW-17062: - It looks like the C# implementation does not yet support compression: https://arrow.apache.org/docs/status.html#ipc-format > [C#] Support compression in IPC format > -- > > Key: ARROW-17062 > URL: https://issues.apache.org/jira/browse/ARROW-17062 > Project: Apache Arrow > Issue Type: Bug > Components: C#, R >Affects Versions: 8.0.0 > Environment: Arrow 8.0.0, R 4.2.1, VS 17.2.4 >Reporter: Todd West >Priority: Major > Fix For: 8.0.2 > > > Hello world between write_feather() and ArrowFileReader.ReadNextRecordBatch() > fails with default settings. This is specific to compressed files (see > workaround below) and it looks like what happens is C# correctly decompresses > the batches but provides the caller with the compressed versions of the data > arrays instead of the uncompressed ones. While all of the various Length > properties are set correctly in C#, the data arrays are too short to contain > all of the values in the file, the bytes do not match what the decompressed > bytes should be, and basic data accessors like PrimitiveArray.Values can't > be used because they throw ArgumentOutOfRangeException. Looking through the > C# classes in the github repo it doesn't appear there's a way for the caller > to request decompression. So I'm guessing decompression is supposed to be > automatic but, for some reason, isn't. > > While functionally successful, the workaround of using uncompressed feather > isn't great as the uncompressed files are bigger than .csv. In my application > the resulting disk space penalty is hundreds of megabytes compared to the > footprint of using compressed feather. 
> > Simple single field reprex: > In R (arrow 8.0.0): > {{write_feather(tibble(value = seq(0, 1, length.out = 21)), "test > lz4.feather")}} > In C# (Apache.Arrow 8.0.0): > {{using Apache.Arrow;}} > {{using Apache.Arrow.Ipc;}} > {{using System.IO;}} > {{using System.Runtime.InteropServices;}} > {{ using FileStream stream = new("test lz4.feather", > FileMode.Open, FileAccess.Read, FileShare.Read);}} > {{ using ArrowFileReader arrowFile = new(stream);}} > {{ for (RecordBatch batch = arrowFile.ReadNextRecordBatch(); batch > != null; batch = arrowFile.ReadNextRecordBatch())}} > {{ {}} > {{ IArrowArray[] fields = batch.Arrays.ToArray();}} > {{ ReadOnlySpan<double> test = MemoryMarshal.Cast<byte, > double>(((DoubleArray)fields[0]).ValueBuffer.Span); // 15 incorrect values > instead of 21 correctly incrementing ones (0, 0.05, 0.10, ..., 1)}} > {{ }}} > Workaround in R: > {{write_feather(tibble(value = seq(0, 1, length.out = 21)), "test.feather", > compression = "uncompressed")}} > > Apologies if this is a known issue. I didn't find anything on a Jira search > and this isn't included in the [known issues list on > github|http://example.com/]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17062) [C#] Support compression in IPC format
[ https://issues.apache.org/jira/browse/ARROW-17062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-17062: Summary: [C#] Support compression in IPC format (was: [C#] write_feather() in R doesn't interop with ArrowFileReader.ReadNextRecordBatch()) > [C#] Support compression in IPC format > -- > > Key: ARROW-17062 > URL: https://issues.apache.org/jira/browse/ARROW-17062 > Project: Apache Arrow > Issue Type: Bug > Components: C#, R >Affects Versions: 8.0.0 > Environment: Arrow 8.0.0, R 4.2.1, VS 17.2.4 >Reporter: Todd West >Priority: Major > Fix For: 8.0.2 > > > Hello world between write_feather() and ArrowFileReader.ReadNextRecordBatch() > fails with default settings. This is specific to compressed files (see > workaround below) and it looks like what happens is C# correctly decompresses > the batches but provides the caller with the compressed versions of the data > arrays instead of the uncompressed ones. While all of the various Length > properties are set correctly in C#, the data arrays are too short to contain > all of the values in the file, the bytes do not match what the decompressed > bytes should be, and basic data accessors like PrimitiveArray.Values can't > be used because they throw ArgumentOutOfRangeException. Looking through the > C# classes in the github repo it doesn't appear there's a way for the caller > to request decompression. So I'm guessing decompression is supposed to be > automatic but, for some reason, isn't. > > While functionally successful, the workaround of using uncompressed feather > isn't great as the uncompressed files are bigger than .csv. In my application > the resulting disk space penalty is hundreds of megabytes compared to the > footprint of using compressed feather. 
> > Simple single field reprex: > In R (arrow 8.0.0): > {{write_feather(tibble(value = seq(0, 1, length.out = 21)), "test > lz4.feather")}} > In C# (Apache.Arrow 8.0.0): > {{using Apache.Arrow;}} > {{using Apache.Arrow.Ipc;}} > {{using System.IO;}} > {{using System.Runtime.InteropServices;}} > {{ using FileStream stream = new("test lz4.feather", > FileMode.Open, FileAccess.Read, FileShare.Read);}} > {{ using ArrowFileReader arrowFile = new(stream);}} > {{ for (RecordBatch batch = arrowFile.ReadNextRecordBatch(); batch > != null; batch = arrowFile.ReadNextRecordBatch())}} > {{ {}} > {{ IArrowArray[] fields = batch.Arrays.ToArray();}} > {{ ReadOnlySpan<double> test = MemoryMarshal.Cast<byte, > double>(((DoubleArray)fields[0]).ValueBuffer.Span); // 15 incorrect values > instead of 21 correctly incrementing ones (0, 0.05, 0.10, ..., 1)}} > {{ }}} > Workaround in R: > {{write_feather(tibble(value = seq(0, 1, length.out = 21)), "test.feather", > compression = "uncompressed")}} > > Apologies if this is a known issue. I didn't find anything on a Jira search > and this isn't included in the [known issues list on > github|http://example.com/]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17062) [C#] write_feather() in R doesn't interop with ArrowFileReader.ReadNextRecordBatch()
[ https://issues.apache.org/jira/browse/ARROW-17062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-17062: Summary: [C#] write_feather() in R doesn't interop with ArrowFileReader.ReadNextRecordBatch() (was: write_feather() in R doesn't interop with ArrowFileReader.ReadNextRecordBatch()) > [C#] write_feather() in R doesn't interop with > ArrowFileReader.ReadNextRecordBatch() > > > Key: ARROW-17062 > URL: https://issues.apache.org/jira/browse/ARROW-17062 > Project: Apache Arrow > Issue Type: Bug > Components: C#, R >Affects Versions: 8.0.0 > Environment: Arrow 8.0.0, R 4.2.1, VS 17.2.4 >Reporter: Todd West >Priority: Major > Fix For: 8.0.2 > > > Hello world between write_feather() and ArrowFileReader.ReadNextRecordBatch() > fails with default settings. This is specific to compressed files (see > workaround below) and it looks like what happens is C# correctly decompresses > the batches but provides the caller with the compressed versions of the data > arrays instead of the uncompressed ones. While all of the various Length > properties are set correctly in C#, the data arrays are too short to contain > all of the values in the file, the bytes do not match what the decompressed > bytes should be, and basic data accessors like PrimitiveArray.Values can't > be used because they throw ArgumentOutOfRangeException. Looking through the > C# classes in the github repo it doesn't appear there's a way for the caller > to request decompression. So I'm guessing decompression is supposed to be > automatic but, for some reason, isn't. > > While functionally successful, the workaround of using uncompressed feather > isn't great as the uncompressed files are bigger than .csv. In my application > the resulting disk space penalty is hundreds of megabytes compared to the > footprint of using compressed feather. 
> > Simple single field reprex: > In R (arrow 8.0.0): > {{write_feather(tibble(value = seq(0, 1, length.out = 21)), "test > lz4.feather")}} > In C# (Apache.Arrow 8.0.0): > {{using Apache.Arrow;}} > {{using Apache.Arrow.Ipc;}} > {{using System.IO;}} > {{using System.Runtime.InteropServices;}} > {{ using FileStream stream = new("test lz4.feather", > FileMode.Open, FileAccess.Read, FileShare.Read);}} > {{ using ArrowFileReader arrowFile = new(stream);}} > {{ for (RecordBatch batch = arrowFile.ReadNextRecordBatch(); batch > != null; batch = arrowFile.ReadNextRecordBatch())}} > {{ {}} > {{ IArrowArray[] fields = batch.Arrays.ToArray();}} > {{ ReadOnlySpan<double> test = MemoryMarshal.Cast<byte, > double>(((DoubleArray)fields[0]).ValueBuffer.Span); // 15 incorrect values > instead of 21 correctly incrementing ones (0, 0.05, 0.10, ..., 1)}} > {{ }}} > Workaround in R: > {{write_feather(tibble(value = seq(0, 1, length.out = 21)), "test.feather", > compression = "uncompressed")}} > > Apologies if this is a known issue. I didn't find anything on a Jira search > and this isn't included in the [known issues list on > github|http://example.com/]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-14889) [C++] GCSFS tests hang if testbench not installed
[ https://issues.apache.org/jira/browse/ARROW-14889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-14889: - Summary: [C++] GCSFS tests hang if testbench not installed (was: [C++] GCFS tests hang if testbench not installed) > [C++] GCSFS tests hang if testbench not installed > - > > Key: ARROW-14889 > URL: https://issues.apache.org/jira/browse/ARROW-14889 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Antoine Pitrou >Assignee: Will Jones >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 4h 40m > Remaining Estimate: 0h > > They should probably error out instead of hanging. > {code} > Running main() from > /home/antoine/arrow/dev/cpp/build-preset/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest_main.cc > [==] Running 22 tests from 2 test suites. > [--] Global test environment set-up. > [--] 13 tests from GcsFileSystem > [ RUN ] GcsFileSystem.OptionsCompare > [ OK ] GcsFileSystem.OptionsCompare (0 ms) > [ RUN ] GcsFileSystem.ToArrowStatusOK > [ OK ] GcsFileSystem.ToArrowStatusOK (0 ms) > [ RUN ] GcsFileSystem.ToArrowStatus > [ OK ] GcsFileSystem.ToArrowStatus (0 ms) > [ RUN ] GcsFileSystem.FileSystemCompare > [ OK ] GcsFileSystem.FileSystemCompare (2 ms) > [ RUN ] GcsFileSystem.ToEncryptionKey > [ OK ] GcsFileSystem.ToEncryptionKey (0 ms) > [ RUN ] GcsFileSystem.ToEncryptionKeyEmpty > [ OK ] GcsFileSystem.ToEncryptionKeyEmpty (0 ms) > [ RUN ] GcsFileSystem.ToKmsKeyName > [ OK ] GcsFileSystem.ToKmsKeyName (0 ms) > [ RUN ] GcsFileSystem.ToKmsKeyNameEmpty > [ OK ] GcsFileSystem.ToKmsKeyNameEmpty (0 ms) > [ RUN ] GcsFileSystem.ToPredefinedAcl > [ OK ] GcsFileSystem.ToPredefinedAcl (0 ms) > [ RUN ] GcsFileSystem.ToPredefinedAclEmpty > [ OK ] GcsFileSystem.ToPredefinedAclEmpty (0 ms) > [ RUN ] GcsFileSystem.ToObjectMetadata > [ OK ] GcsFileSystem.ToObjectMetadata (0 ms) > [ RUN ] GcsFileSystem.ToObjectMetadataEmpty > [ OK ] GcsFileSystem.ToObjectMetadataEmpty (0 
ms) > [ RUN ] GcsFileSystem.ToObjectMetadataInvalidCustomTime > [ OK ] GcsFileSystem.ToObjectMetadataInvalidCustomTime (0 ms) > [--] 13 tests from GcsFileSystem (3 ms total) > [--] 9 tests from GcsIntegrationTest > [ RUN ] GcsIntegrationTest.GetFileInfoBucket > /home/antoine/miniconda3/envs/pyarrow/bin/python3: No module named testbench > ^C > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
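The suggested "error out instead of hanging" behaviour can be sketched in Python (the actual fix lives in the C++ test harness; this guard is only an illustration of the fail-fast idea, with `testbench` as the module name the tests try to launch):

```python
import importlib.util

def testbench_available() -> bool:
    """Return True if the GCS 'testbench' module can be imported.

    A guard like this lets a test suite skip or fail fast instead of
    hanging while waiting for a server process that never starts.
    """
    return importlib.util.find_spec("testbench") is not None

# A stdlib module is found; an uninstalled one is reported as absent.
assert importlib.util.find_spec("json") is not None
print(testbench_available())
```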
[jira] [Updated] (ARROW-16093) [Python] Address docstrings in Filesystems (Python Implementations)
[ https://issues.apache.org/jira/browse/ARROW-16093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16093: --- Labels: pull-request-available (was: ) > [Python] Address docstrings in Filesystems (Python Implementations) > --- > > Key: ARROW-16093 > URL: https://issues.apache.org/jira/browse/ARROW-16093 > Project: Apache Arrow > Issue Type: Sub-task > Components: Python >Reporter: Alenka Frim >Assignee: Alenka Frim >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Ensure docstrings for Filesystem Interface have an {{Examples}} section: > * > [https://arrow.apache.org/docs/python/generated/pyarrow.fs.PyFileSystem.html#pyarrow.fs.PyFileSystem] > * > [https://arrow.apache.org/docs/python/generated/pyarrow.fs.FileSystemHandler.html#pyarrow.fs.FileSystemHandler] > * > [https://arrow.apache.org/docs/python/generated/pyarrow.fs.FSSpecHandler.html#pyarrow.fs.FSSpecHandler] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-14889) [C++] GCFS tests hang if testbench not installed
[ https://issues.apache.org/jira/browse/ARROW-14889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-14889. -- Fix Version/s: 9.0.0 Resolution: Fixed Issue resolved by pull request 13520 [https://github.com/apache/arrow/pull/13520] > [C++] GCFS tests hang if testbench not installed > > > Key: ARROW-14889 > URL: https://issues.apache.org/jira/browse/ARROW-14889 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Antoine Pitrou >Assignee: Will Jones >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 4h 40m > Remaining Estimate: 0h > > They should probably error out instead of hanging. > {code} > Running main() from > /home/antoine/arrow/dev/cpp/build-preset/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest_main.cc > [==] Running 22 tests from 2 test suites. > [--] Global test environment set-up. > [--] 13 tests from GcsFileSystem > [ RUN ] GcsFileSystem.OptionsCompare > [ OK ] GcsFileSystem.OptionsCompare (0 ms) > [ RUN ] GcsFileSystem.ToArrowStatusOK > [ OK ] GcsFileSystem.ToArrowStatusOK (0 ms) > [ RUN ] GcsFileSystem.ToArrowStatus > [ OK ] GcsFileSystem.ToArrowStatus (0 ms) > [ RUN ] GcsFileSystem.FileSystemCompare > [ OK ] GcsFileSystem.FileSystemCompare (2 ms) > [ RUN ] GcsFileSystem.ToEncryptionKey > [ OK ] GcsFileSystem.ToEncryptionKey (0 ms) > [ RUN ] GcsFileSystem.ToEncryptionKeyEmpty > [ OK ] GcsFileSystem.ToEncryptionKeyEmpty (0 ms) > [ RUN ] GcsFileSystem.ToKmsKeyName > [ OK ] GcsFileSystem.ToKmsKeyName (0 ms) > [ RUN ] GcsFileSystem.ToKmsKeyNameEmpty > [ OK ] GcsFileSystem.ToKmsKeyNameEmpty (0 ms) > [ RUN ] GcsFileSystem.ToPredefinedAcl > [ OK ] GcsFileSystem.ToPredefinedAcl (0 ms) > [ RUN ] GcsFileSystem.ToPredefinedAclEmpty > [ OK ] GcsFileSystem.ToPredefinedAclEmpty (0 ms) > [ RUN ] GcsFileSystem.ToObjectMetadata > [ OK ] GcsFileSystem.ToObjectMetadata (0 ms) > [ RUN ] GcsFileSystem.ToObjectMetadataEmpty > [ OK ] GcsFileSystem.ToObjectMetadataEmpty 
(0 ms) > [ RUN ] GcsFileSystem.ToObjectMetadataInvalidCustomTime > [ OK ] GcsFileSystem.ToObjectMetadataInvalidCustomTime (0 ms) > [--] 13 tests from GcsFileSystem (3 ms total) > [--] 9 tests from GcsIntegrationTest > [ RUN ] GcsIntegrationTest.GetFileInfoBucket > /home/antoine/miniconda3/envs/pyarrow/bin/python3: No module named testbench > ^C > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17003) [Java][Docs] Document JDBC module
[ https://issues.apache.org/jira/browse/ARROW-17003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-17003. -- Fix Version/s: 9.0.0 Resolution: Fixed Issue resolved by pull request 13543 [https://github.com/apache/arrow/pull/13543] > [Java][Docs] Document JDBC module > - > > Key: ARROW-17003 > URL: https://issues.apache.org/jira/browse/ARROW-17003 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, Java >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > The arrow-jdbc submodule could use its own documentation page. > In particular, we should document the type mapping it uses (and the rationale > where applicable) and how to customize it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17055) [Java][FlightRPC] flight-core and flight-sql jars delivering same class names
[ https://issues.apache.org/jira/browse/ARROW-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17055: --- Labels: pull-request-available (was: ) > [Java][FlightRPC] flight-core and flight-sql jars delivering same class names > - > > Key: ARROW-17055 > URL: https://issues.apache.org/jira/browse/ARROW-17055 > Project: Apache Arrow > Issue Type: Improvement > Components: FlightRPC, Java >Reporter: Kevin Bambrick >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Hello. I am trying to adopt arrow flight sql. We have a check in our build > to make sure that there are no overlapping class files in our project. When > adding the flight sql dependency to our project, a warning is thrown that > flight-sql and flight-core overlap and the jars deliver the same class files. > {code:java} > Sample Warning: WARNING: CLASSPATH OVERLAP: These jars deliver the same class > files: [flight-sql-7.0.0.jar, flight-core-7.0.0.jar] files: > [org/apache/arrow/flight/impl/Flight$FlightDescriptor$1.class, > org/apache/arrow/flight/impl/Flight$ActionOrBuilder.class{code} > > It seems that the classes generated from Flight.proto end up in both the > flight-sql and flight-core jars. Since these classes are generated in > flight-core, and flight-sql depends on flight-core, can the generation > of Flight.java and FlightServiceGrpc.java be removed from flight-sql so that > they are pulled directly from flight-core? > > Thanks in advance! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17065) [Python] Allow using subclassed ExtensionScalar in ExtensionType
[ https://issues.apache.org/jira/browse/ARROW-17065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17065: --- Labels: pull-request-available (was: ) > [Python] Allow using subclassed ExtensionScalar in ExtensionType > > > Key: ARROW-17065 > URL: https://issues.apache.org/jira/browse/ARROW-17065 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Rok Mihevc >Assignee: Rok Mihevc >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > This is a follow-up to ARROW-13612. > [See > discussion.|https://github.com/apache/arrow/pull/13454#issuecomment-1177140141] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17065) [Python] Allow using subclassed ExtensionScalar in ExtensionType
[ https://issues.apache.org/jira/browse/ARROW-17065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rok Mihevc reassigned ARROW-17065: -- Assignee: Rok Mihevc > [Python] Allow using subclassed ExtensionScalar in ExtensionType > > > Key: ARROW-17065 > URL: https://issues.apache.org/jira/browse/ARROW-17065 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Rok Mihevc >Assignee: Rok Mihevc >Priority: Major > Fix For: 9.0.0 > > > This is a follow-up to ARROW-13612. > [See > discussion.|https://github.com/apache/arrow/pull/13454#issuecomment-1177140141] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17065) [Python] Allow using subclassed ExtensionScalar in ExtensionType
Rok Mihevc created ARROW-17065: -- Summary: [Python] Allow using subclassed ExtensionScalar in ExtensionType Key: ARROW-17065 URL: https://issues.apache.org/jira/browse/ARROW-17065 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Rok Mihevc Fix For: 9.0.0 This is a follow-up to ARROW-13612. [See discussion.|https://github.com/apache/arrow/pull/13454#issuecomment-1177140141] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-17056) [C++] Bump version of bundled substrait
[ https://issues.apache.org/jira/browse/ARROW-17056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raúl Cumplido closed ARROW-17056. - Fix Version/s: (was: 9.0.0) Resolution: Won't Fix This is not required, as seen in the ticket comments. We will update substrait once an Arrow feature requires it. At the moment substrait is evolving rapidly and we don't need to keep up with the latest version. > [C++] Bump version of bundled substrait > --- > > Key: ARROW-17056 > URL: https://issues.apache.org/jira/browse/ARROW-17056 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Raúl Cumplido >Priority: Major > > There has been a new substrait version released: > https://github.com/substrait-io/substrait/releases/tag/v0.7.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17056) [C++] Bump version of bundled substrait
[ https://issues.apache.org/jira/browse/ARROW-17056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566243#comment-17566243 ] Raúl Cumplido commented on ARROW-17056: --- Ok, I'll close this one then, as for now we will update based on feature needs. > [C++] Bump version of bundled substrait > --- > > Key: ARROW-17056 > URL: https://issues.apache.org/jira/browse/ARROW-17056 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Raúl Cumplido >Priority: Major > Fix For: 9.0.0 > > > There has been a new substrait version released: > https://github.com/substrait-io/substrait/releases/tag/v0.7.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17064) Python hangs when pyarrow.fs.copy_files is used with "use_threads=True"
[ https://issues.apache.org/jira/browse/ARROW-17064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alejandro Marco Ramos updated ARROW-17064: -- Description: When try to copy a local path to s3 remote filesystem using `pyarrow.fs.copy_files` and using default parameter `use_threads=True`, the system hangs. If use "use_threads=False` the operation must complete ok (but more slow). My code is: {code:java} >>> import pyarrow as pa >>> s3fs=pa.fs.S3FileSystem(endpoint_override="http://xx;) >>> pa.fs.copy_files("tests/data/payments", "bucket/payments", >>> destination_filesystem=s3fs) ... (don't return){code} If check remote s3, all files appear, but the function don't return Platform: Windows was: When try to copy a local path to s3 remote filesystem using `pyarrow.fs.copy_files` and using default parameter `use_threads=True`, the system hangs. If use "use_threads=False` the operation must complete ok (but more slow). My code is: {code:java} >>> import pyarrow as pa >>> s3fs=pa.fs.S3FileSystem(endpoint_override="http://xx;) >>> pa.fs.copy_files("tests/data/payments", "bucket/payments", >>> destination_filesystem=s3fs) ... (don't return){code} If check remote s3, all files appear, but the function don't return > Python hangs when use pyarrow.fs.copy_files is used with "used_threads=True" > > > Key: ARROW-17064 > URL: https://issues.apache.org/jira/browse/ARROW-17064 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 8.0.0 >Reporter: Alejandro Marco Ramos >Priority: Major > > When try to copy a local path to s3 remote filesystem using > `pyarrow.fs.copy_files` and using default parameter `use_threads=True`, the > system hangs. If use "use_threads=False` the operation must complete ok (but > more slow). 
> > My code is: > {code:python} > >>> import pyarrow as pa > >>> s3fs=pa.fs.S3FileSystem(endpoint_override="http://xx") > >>> pa.fs.copy_files("tests/data/payments", "bucket/payments", > >>> destination_filesystem=s3fs) > ... (doesn't return){code} > If I check the remote S3 bucket, all the files appear, but the function doesn't return. > > Platform: Windows -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-16093) [Python] Address docstrings in Filesystems (Python Implementations)
[ https://issues.apache.org/jira/browse/ARROW-16093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alenka Frim reassigned ARROW-16093: --- Assignee: Alenka Frim > [Python] Address docstrings in Filesystems (Python Implementations) > --- > > Key: ARROW-16093 > URL: https://issues.apache.org/jira/browse/ARROW-16093 > Project: Apache Arrow > Issue Type: Sub-task > Components: Python >Reporter: Alenka Frim >Assignee: Alenka Frim >Priority: Major > > Ensure docstrings for Filesystem Interface have an {{Examples}} section: > * > [https://arrow.apache.org/docs/python/generated/pyarrow.fs.PyFileSystem.html#pyarrow.fs.PyFileSystem] > * > [https://arrow.apache.org/docs/python/generated/pyarrow.fs.FileSystemHandler.html#pyarrow.fs.FileSystemHandler] > * > [https://arrow.apache.org/docs/python/generated/pyarrow.fs.FSSpecHandler.html#pyarrow.fs.FSSpecHandler] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-16091) [Python] Continuation of improving Classes and Methods Docstrings
[ https://issues.apache.org/jira/browse/ARROW-16091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alenka Frim reassigned ARROW-16091: --- Assignee: Alenka Frim > [Python] Continuation of improving Classes and Methods Docstrings > -- > > Key: ARROW-16091 > URL: https://issues.apache.org/jira/browse/ARROW-16091 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Alenka Frim >Assignee: Alenka Frim >Priority: Major > > Continuation of the initiative aimed at improving methods and classes > docstrings, especially from the point of view of ensuring they have an > {{Examples}} section. -- This message was sent by Atlassian Jira (v8.20.10#820010)
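The {{Examples}} section these docstring tickets call for follows the numpydoc convention: doctest-style snippets embedded in the docstring. A minimal sketch with a hypothetical function (not part of pyarrow) showing the layout:

```python
def double(x):
    """Return twice the input value.

    Parameters
    ----------
    x : int or float
        The value to double.

    Returns
    -------
    int or float
        Twice the input.

    Examples
    --------
    >>> double(21)
    42
    """
    return x * 2
```

Sections written this way can be checked automatically with the `doctest` module, which is one reason the initiative asks for them.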
[jira] [Closed] (ARROW-14494) [C++] signal cancel test fails occasionally on Windows
[ https://issues.apache.org/jira/browse/ARROW-14494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yibo Cai closed ARROW-14494. Resolution: Cannot Reproduce > [C++] signal cancel test fails occasionally on Windows > - > > Key: ARROW-14494 > URL: https://issues.apache.org/jira/browse/ARROW-14494 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 6.0.0 >Reporter: Yibo Cai >Priority: Major > > Log: > https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/41278276/job/j9k4897e9ppwt2q4#L782 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-16665) [Release] Update 03-binary-submit.sh to comment on PR and track binary submission with badges
[ https://issues.apache.org/jira/browse/ARROW-16665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raúl Cumplido reassigned ARROW-16665: - Assignee: Raúl Cumplido > [Release] Update 03-binary-submit.sh to comment on PR and track binary > submission with badges > - > > Key: ARROW-16665 > URL: https://issues.apache.org/jira/browse/ARROW-16665 > Project: Apache Arrow > Issue Type: Sub-task > Components: Developer Tools >Reporter: Raúl Cumplido >Assignee: Raúl Cumplido >Priority: Major > Fix For: 9.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-15793) [C++][FlightRPC] DoPutLargeBatch test sometimes stuck for 10 seconds
[ https://issues.apache.org/jira/browse/ARROW-15793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yibo Cai closed ARROW-15793. Resolution: Not A Bug > [C++][FlightRPC] DoPutLargeBatch test sometimes stuck for 10 seconds > > > Key: ARROW-15793 > URL: https://issues.apache.org/jira/browse/ARROW-15793 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Reporter: Yibo Cai >Priority: Major > > Normally the test finishes in 100 ms, but it often takes 10 s on my test > machine. > A debug build is fine. > After a brief debugging session, it looks related to > [https://github.com/apache/arrow/pull/12302]. > It gets stuck for 10 seconds destructing grpc::Server at > [https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/server.cc#L863] > To reproduce: > {code:bash} > $ cmake -GNinja -DARROW_BUILD_TESTS=ON -DCMAKE_BUILD_TYPE=RelWithDebInfo > -DARROW_FLIGHT=ON .. > $ ninja arrow-flight-test > $ relwithdebinfo/arrow-flight-test --gtest_filter="*DoPutLargeBatch*" > [==] Running 1 test from 1 test suite. > [--] Global test environment set-up. > [--] 1 test from TestDoPut > [ RUN ] TestDoPut.DoPutLargeBatch > [ OK ] TestDoPut.DoPutLargeBatch (10017 ms) > [--] 1 test from TestDoPut (10017 ms total) > [--] Global test environment tear-down > [==] 1 test from 1 test suite ran. (10017 ms total) > [ PASSED ] 1 test. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-16667) [Release] Post merge script for release should not be necessary with the new workflow
[ https://issues.apache.org/jira/browse/ARROW-16667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raúl Cumplido reassigned ARROW-16667: - Assignee: Raúl Cumplido > [Release] Post merge script for release should not be necessary with the new > workflow > - > > Key: ARROW-16667 > URL: https://issues.apache.org/jira/browse/ARROW-16667 > Project: Apache Arrow > Issue Type: Sub-task > Components: Developer Tools >Reporter: Raúl Cumplido >Assignee: Raúl Cumplido >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > In discussion with Krisztián we think that post-01-merge.sh is not required > as we should be using archery cherry-pick on the maintenance branch instead > of creating branches and cherry picking manually for patch releases. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16667) [Release] Post merge script for release should not be necessary with the new workflow
[ https://issues.apache.org/jira/browse/ARROW-16667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16667: --- Labels: pull-request-available (was: ) > [Release] Post merge script for release should not be necessary with the new > workflow > - > > Key: ARROW-16667 > URL: https://issues.apache.org/jira/browse/ARROW-16667 > Project: Apache Arrow > Issue Type: Sub-task > Components: Developer Tools >Reporter: Raúl Cumplido >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > In discussion with Krisztián we think that post-01-merge.sh is not required > as we should be using archery cherry-pick on the maintenance branch instead > of creating branches and cherry picking manually for patch releases. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16667) [Release] Post merge script for release should not be necessary with the new workflow
[ https://issues.apache.org/jira/browse/ARROW-16667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raúl Cumplido updated ARROW-16667: -- Description: In discussion with Krisztián we think that post-01-merge.sh is not required as we should be using archery cherry-pick on the maintenance branch instead of creating branches and cherry picking manually for patch releases. (was: In discussion with Krisztián we think that post-01-merge.sh is not required if we fix post-12-bump-versions.sh to support minor releases. Investigate, fix and remove if not necessary.) > [Release] Post merge script for release should not be necessary with the new > workflow > - > > Key: ARROW-16667 > URL: https://issues.apache.org/jira/browse/ARROW-16667 > Project: Apache Arrow > Issue Type: Sub-task > Components: Developer Tools >Reporter: Raúl Cumplido >Priority: Major > Fix For: 9.0.0 > > > In discussion with Krisztián we think that post-01-merge.sh is not required > as we should be using archery cherry-pick on the maintenance branch instead > of creating branches and cherry picking manually for patch releases. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16759) [Go] Update testify to fix security vulnerability
[ https://issues.apache.org/jira/browse/ARROW-16759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-16759: - Fix Version/s: 8.0.2 > [Go] Update testify to fix security vulnerability > > > Key: ARROW-16759 > URL: https://issues.apache.org/jira/browse/ARROW-16759 > Project: Apache Arrow > Issue Type: Task > Components: Go >Affects Versions: 7.0.0, 8.0.0 >Reporter: Dominic Barnes >Assignee: Dominic Barnes >Priority: Minor > Labels: pull-request-available > Fix For: 9.0.0, 8.0.2, 6.0.2, 7.0.1 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > The packages under github.com/apache/arrow/go currently have a dependency on > github.com/stretchr/testify v1.7.0, which has a dependency on gopkg.in/yaml.v3 > that has an outstanding security vulnerability > ([CVE-2022-28948|https://github.com/advisories/GHSA-hp87-p4gw-j4gq]). > While testify is only used during tests, this is not distinguished by the Go > toolchain and other tools like Snyk which scan the dependency chain for > vulnerabilities. Unfortunately, due to Go's [Minimal version > selection|https://go.dev/ref/mod#minimal-version-selection], this ends up > requiring us to revisit our dependencies to ensure this security vulnerability > is addressed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16759) [Go] Update testify to fix security vulnerability
[ https://issues.apache.org/jira/browse/ARROW-16759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-16759: - Fix Version/s: 7.0.1 > [Go] Update testify to fix security vulnerability > > > Key: ARROW-16759 > URL: https://issues.apache.org/jira/browse/ARROW-16759 > Project: Apache Arrow > Issue Type: Task > Components: Go >Affects Versions: 7.0.0, 8.0.0 >Reporter: Dominic Barnes >Assignee: Dominic Barnes >Priority: Minor > Labels: pull-request-available > Fix For: 9.0.0, 6.0.2, 7.0.1 > > Time Spent: 3.5h > Remaining Estimate: 0h > > The packages under github.com/apache/arrow/go currently have a dependency on > github.com/stretchr/testify v1.7.0, which has a dependency on gopkg.in/yaml.v3 > that has an outstanding security vulnerability > ([CVE-2022-28948|https://github.com/advisories/GHSA-hp87-p4gw-j4gq]). > While testify is only used during tests, this is not distinguished by the Go > toolchain and other tools like Snyk which scan the dependency chain for > vulnerabilities. Unfortunately, due to Go's [Minimal version > selection|https://go.dev/ref/mod#minimal-version-selection], this ends up > requiring us to revisit our dependencies to ensure this security vulnerability > is addressed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16759) [Go] Update testify to fix security vulnerability
[ https://issues.apache.org/jira/browse/ARROW-16759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-16759: - Fix Version/s: 6.0.2 (was: 6.0.3) > [Go] Update testify to fix security vulnerability > > > Key: ARROW-16759 > URL: https://issues.apache.org/jira/browse/ARROW-16759 > Project: Apache Arrow > Issue Type: Task > Components: Go >Affects Versions: 7.0.0, 8.0.0 >Reporter: Dominic Barnes >Assignee: Dominic Barnes >Priority: Minor > Labels: pull-request-available > Fix For: 9.0.0, 6.0.2 > > Time Spent: 3.5h > Remaining Estimate: 0h > > The packages under github.com/apache/arrow/go currently have a dependency on > github.com/stretchr/testify v1.7.0, which has a dependency on gopkg.in/yaml.v3 > that has an outstanding security vulnerability > ([CVE-2022-28948|https://github.com/advisories/GHSA-hp87-p4gw-j4gq]). > While testify is only used during tests, this is not distinguished by the Go > toolchain and other tools like Snyk which scan the dependency chain for > vulnerabilities. Unfortunately, due to Go's [Minimal version > selection|https://go.dev/ref/mod#minimal-version-selection], this ends up > requiring us to revisit our dependencies to ensure this security vulnerability > is addressed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16759) [Go] Update testify to fix security vulnerability
[ https://issues.apache.org/jira/browse/ARROW-16759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-16759: - Fix Version/s: 6.0.2 > [Go] Update testify to fix security vulnerability > > > Key: ARROW-16759 > URL: https://issues.apache.org/jira/browse/ARROW-16759 > Project: Apache Arrow > Issue Type: Task > Components: Go >Affects Versions: 7.0.0, 8.0.0 >Reporter: Dominic Barnes >Assignee: Dominic Barnes >Priority: Minor > Labels: pull-request-available > Fix For: 6.0.2, 9.0.0 > > Time Spent: 3h 10m > Remaining Estimate: 0h > > The packages under github.com/apache/arrow/go currently have a dependency on > github.com/stretchr/testify v1.7.0, which has a dependency on gopkg.in/yaml.v3 > that has an outstanding security vulnerability > ([CVE-2022-28948|https://github.com/advisories/GHSA-hp87-p4gw-j4gq]). > While testify is only used during tests, this is not distinguished by the Go > toolchain and other tools like Snyk which scan the dependency chain for > vulnerabilities. Unfortunately, due to Go's [Minimal version > selection|https://go.dev/ref/mod#minimal-version-selection], this ends up > requiring us to revisit our dependencies to ensure this security vulnerability > is addressed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16759) [Go] Update testify to fix security vulnerability
[ https://issues.apache.org/jira/browse/ARROW-16759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-16759: - Summary: [Go] Update testify to fix security vulnerability (was: [Go] update testify to fix security vulnerability) > [Go] Update testify to fix security vulnerability > > > Key: ARROW-16759 > URL: https://issues.apache.org/jira/browse/ARROW-16759 > Project: Apache Arrow > Issue Type: Task > Components: Go >Affects Versions: 7.0.0, 8.0.0 >Reporter: Dominic Barnes >Assignee: Dominic Barnes >Priority: Minor > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 3h > Remaining Estimate: 0h > > The packages under github.com/apache/arrow/go currently have a dependency on > github.com/stretchr/testify v1.7.0, which has a dependency on gopkg.in/yaml.v3 > that has an outstanding security vulnerability > ([CVE-2022-28948|https://github.com/advisories/GHSA-hp87-p4gw-j4gq]). > While testify is only used during tests, this is not distinguished by the Go > toolchain and other tools like Snyk which scan the dependency chain for > vulnerabilities. Unfortunately, due to Go's [Minimal version > selection|https://go.dev/ref/mod#minimal-version-selection], this ends up > requiring us to revisit our dependencies to ensure this security vulnerability > is addressed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16759) [Go] update testify to fix security vulnerability
[ https://issues.apache.org/jira/browse/ARROW-16759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-16759: - Summary: [Go] update testify to fix security vulnerability (was: [Go]) > [Go] update testify to fix security vulnerability > > > Key: ARROW-16759 > URL: https://issues.apache.org/jira/browse/ARROW-16759 > Project: Apache Arrow > Issue Type: Task > Components: Go >Affects Versions: 7.0.0, 8.0.0 >Reporter: Dominic Barnes >Assignee: Dominic Barnes >Priority: Minor > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 3h > Remaining Estimate: 0h > > The packages under github.com/apache/arrow/go currently have a dependency on > github.com/stretchr/testify v1.7.0, which has a dependency on gopkg.in/yaml.v3 > that has an outstanding security vulnerability > ([CVE-2022-28948|https://github.com/advisories/GHSA-hp87-p4gw-j4gq]). > While testify is only used during tests, this is not distinguished by the Go > toolchain and other tools like Snyk which scan the dependency chain for > vulnerabilities. Unfortunately, due to Go's [Minimal version > selection|https://go.dev/ref/mod#minimal-version-selection], this ends up > requiring us to revisit our dependencies to ensure this security vulnerability > is addressed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17064) Python hangs when pyarrow.fs.copy_files is used with "use_threads=True"
Alejandro Marco Ramos created ARROW-17064: - Summary: Python hangs when pyarrow.fs.copy_files is used with "use_threads=True" Key: ARROW-17064 URL: https://issues.apache.org/jira/browse/ARROW-17064 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 8.0.0 Reporter: Alejandro Marco Ramos When trying to copy a local path to a remote S3 filesystem with `pyarrow.fs.copy_files` and the default parameter `use_threads=True`, the system hangs. With `use_threads=False` the operation completes correctly (but more slowly). My code is: {code:python} >>> import pyarrow as pa >>> s3fs=pa.fs.S3FileSystem(endpoint_override="http://xx") >>> pa.fs.copy_files("tests/data/payments", "bucket/payments", >>> destination_filesystem=s3fs) ... (doesn't return){code} If I check the remote S3 bucket, all the files appear, but the function doesn't return. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17049) [C++] arrow-compute-expression-benchmark aborts with sanity check failure
[ https://issues.apache.org/jira/browse/ARROW-17049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566202#comment-17566202 ] Antoine Pitrou commented on ARROW-17049: Yes! > [C++] arrow-compute-expression-benchmark aborts with sanity check failure > - > > Key: ARROW-17049 > URL: https://issues.apache.org/jira/browse/ARROW-17049 > Project: Apache Arrow > Issue Type: Bug > Components: Benchmarking, C++ >Reporter: Antoine Pitrou >Priority: Blocker > Fix For: 9.0.0 > > > {code} > $ ./build-release/relwithdebinfo/arrow-compute-expression-benchmark > 2022-07-12T11:56:06+02:00 > Running ./build-release/relwithdebinfo/arrow-compute-expression-benchmark > Run on (24 X 3800 MHz CPU s) > CPU Caches: > L1 Data 32 KiB (x12) > L1 Instruction 32 KiB (x12) > L2 Unified 512 KiB (x12) > L3 Unified 16384 KiB (x4) > Load Average: 0.44, 3.87, 2.60 > ***WARNING*** CPU scaling is enabled, the benchmark real time measurements > may be noisy and will incur extra overhead. > - > Benchmark > Time CPU Iterations > - > SimplifyFilterWithGuarantee/negative_filter_simple_guarantee_simple >5734 ns 5733 ns 122775 > SimplifyFilterWithGuarantee/negative_filter_cast_guarantee_simple >9094 ns 9092 ns76172 > SimplifyFilterWithGuarantee/negative_filter_simple_guarantee_dictionary > 12992 ns12989 ns53601 > SimplifyFilterWithGuarantee/negative_filter_cast_guarantee_dictionary > 16395 ns16392 ns42601 > SimplifyFilterWithGuarantee/positive_filter_simple_guarantee_simple >5756 ns 5755 ns 120485 > SimplifyFilterWithGuarantee/positive_filter_cast_guarantee_simple >9197 ns 9195 ns76168 > SimplifyFilterWithGuarantee/positive_filter_simple_guarantee_dictionary > 12875 ns12872 ns54240 > SimplifyFilterWithGuarantee/positive_filter_cast_guarantee_dictionary > 16567 ns16563 ns42539 > BindAndEvaluate/simple_array > 255 ns 255 ns 2748813 > BindAndEvaluate/simple_scalar > 252 ns 252 ns 2765200 > BindAndEvaluate/nested_array >2251 ns 2251 ns 310424 > BindAndEvaluate/nested_scalar >2687 ns 2686 ns 
261939 > -- Arrow Fatal Error -- > Invalid: Value lengths differed from ExecBatch length > Abandon > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-17049) [C++] arrow-compute-expression-benchmark aborts with sanity check failure
[ https://issues.apache.org/jira/browse/ARROW-17049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou closed ARROW-17049. -- Fix Version/s: (was: 9.0.0) Resolution: Duplicate > [C++] arrow-compute-expression-benchmark aborts with sanity check failure > - > > Key: ARROW-17049 > URL: https://issues.apache.org/jira/browse/ARROW-17049 > Project: Apache Arrow > Issue Type: Bug > Components: Benchmarking, C++ >Reporter: Antoine Pitrou >Priority: Blocker > > {code} > $ ./build-release/relwithdebinfo/arrow-compute-expression-benchmark > 2022-07-12T11:56:06+02:00 > Running ./build-release/relwithdebinfo/arrow-compute-expression-benchmark > Run on (24 X 3800 MHz CPU s) > CPU Caches: > L1 Data 32 KiB (x12) > L1 Instruction 32 KiB (x12) > L2 Unified 512 KiB (x12) > L3 Unified 16384 KiB (x4) > Load Average: 0.44, 3.87, 2.60 > ***WARNING*** CPU scaling is enabled, the benchmark real time measurements > may be noisy and will incur extra overhead. > - > Benchmark > Time CPU Iterations > - > SimplifyFilterWithGuarantee/negative_filter_simple_guarantee_simple >5734 ns 5733 ns 122775 > SimplifyFilterWithGuarantee/negative_filter_cast_guarantee_simple >9094 ns 9092 ns76172 > SimplifyFilterWithGuarantee/negative_filter_simple_guarantee_dictionary > 12992 ns12989 ns53601 > SimplifyFilterWithGuarantee/negative_filter_cast_guarantee_dictionary > 16395 ns16392 ns42601 > SimplifyFilterWithGuarantee/positive_filter_simple_guarantee_simple >5756 ns 5755 ns 120485 > SimplifyFilterWithGuarantee/positive_filter_cast_guarantee_simple >9197 ns 9195 ns76168 > SimplifyFilterWithGuarantee/positive_filter_simple_guarantee_dictionary > 12875 ns12872 ns54240 > SimplifyFilterWithGuarantee/positive_filter_cast_guarantee_dictionary > 16567 ns16563 ns42539 > BindAndEvaluate/simple_array > 255 ns 255 ns 2748813 > BindAndEvaluate/simple_scalar > 252 ns 252 ns 2765200 > BindAndEvaluate/nested_array >2251 ns 2251 ns 310424 > BindAndEvaluate/nested_scalar >2687 ns 2686 ns 261939 
> -- Arrow Fatal Error -- > Invalid: Value lengths differed from ExecBatch length > Abandon > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-15213) [Ruby] Add bindings for between kernel
[ https://issues.apache.org/jira/browse/ARROW-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Muite reassigned ARROW-15213: Assignee: Benson Muite > [Ruby] Add bindings for between kernel > -- > > Key: ARROW-15213 > URL: https://issues.apache.org/jira/browse/ARROW-15213 > Project: Apache Arrow > Issue Type: Improvement > Components: Ruby >Reporter: Benson Muite >Assignee: Benson Muite >Priority: Major > > Ruby bindings for between kernel. Follow on to > https://issues.apache.org/jira/browse/ARROW-9843 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-6322) [C#] Implement a plasma client
[ https://issues.apache.org/jira/browse/ARROW-6322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566190#comment-17566190 ] Kouhei Sutou commented on ARROW-6322: - [~eerhardt] Can we close this because Plasma is deprecated? > [C#] Implement a plasma client > -- > > Key: ARROW-6322 > URL: https://issues.apache.org/jira/browse/ARROW-6322 > Project: Apache Arrow > Issue Type: New Feature > Components: C# >Reporter: Eric Erhardt >Priority: Major > Original Estimate: 96h > Remaining Estimate: 96h > > We should create a C# plasma client, so .NET code can get and put objects > into the plasma store. > An easy-ish way of implementing this would be to build on the c_glib C APIs > already exposed for the plasma client. Unfortunately, I haven't found a > decent C# GObject generator, so I think the C bindings will need to be > written by hand, but there aren't too many of them. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-10911) [C++] Improve *_SOURCE CMake variables naming
[ https://issues.apache.org/jira/browse/ARROW-10911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou reassigned ARROW-10911: Assignee: Kouhei Sutou > [C++] Improve *_SOURCE CMake variables naming > - > > Key: ARROW-10911 > URL: https://issues.apache.org/jira/browse/ARROW-10911 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > > https://github.com/apache/arrow/pull/8908#issuecomment-744780934 > {quote} > > This change also renamed our Boost dependency name to "Boost" from > "BOOST". It means that users need to use -DBoost_SOURCE not > -DBOOST_SOURCE. To keep backward compatibility, -DBOOST_SOURCE is > still accepted when -DBoost_SOURCE isn't specified. > > Users also need to use -Dre2_SOURCE not -DRE2_SOURCE. To keep backward > compatibility, -DRE2_SOURCE is still accepted when -Dre2_SOURCE isn't > specified. > I would love to have this kind of case-insensitive handling for all > dependencies. This has tripped me up many times and it is difficult to > explain to others why everything else is ALL_CAPS but these dependencies are > a mix. > {quote} > https://github.com/apache/arrow/pull/8908#issuecomment-744898897 > {quote} > OK. How about using `ARROW_${UPPERCASE_DEPENDENCY_NAME}_SOURCE` CMake > variables for them like `ARROW_*_USE_SHARED`? > If it sounds reasonable, we can work on it as a separated task. > {quote} > https://github.com/apache/arrow/pull/8908#issuecomment-744954917 > {quote} > Why does it need the `ARROW_` namespace prefix? > I'm fine with anything that is intuitive and trivial to document. > {quote} > https://github.com/apache/arrow/pull/8908#issuecomment-745005158 > {quote} > Because of consistency. 
> If we use `ARROW_${UPPERCASE_DEPENDENCY_NAME}_SOURCE` not > `${UPPERCASE_DEPENDENCY_NAME}_SOURCE`, we can explain that you can customize > how to use `${DEPENDENCY}` by > `ARROW_${UPPERCASE_DEPENDENCY_NAME}_{SOURCE,USE_SHARED}` CMake variables. > It'll be more intuitive than using `${UPPERCASE_DEPENDENCY_NAME}_SOURCE` and > `ARROW_${UPPERCASE_DEPENDENCY_NAME}_USE_SHARED`. > {quote} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17061) [Python][Substrait] Acero consumer is unable to consume count function from substrait query plan
[ https://issues.apache.org/jira/browse/ARROW-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vibhatha Lakmal Abeykoon reassigned ARROW-17061: Assignee: Vibhatha Lakmal Abeykoon > [Python][Substrait] Acero consumer is unable to consume count function from > substrait query plan > > > Key: ARROW-17061 > URL: https://issues.apache.org/jira/browse/ARROW-17061 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Richard Tia >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > > SQL > {code:java} > SELECT > o_orderpriority, > count(*) AS order_count > FROM > orders > GROUP BY > o_orderpriority{code} > The substrait plan generated from SQL, using Isthmus. > > substrait count: > [https://github.com/substrait-io/substrait/blob/main/extensions/functions_aggregate_generic.yaml] > > Running the substrait plan with Acero returns this error: > {code:java} > E pyarrow.lib.ArrowInvalid: JsonToBinaryStream returned > INVALID_ARGUMENT:(relations[0].root.input.aggregate.measures[0].measure) > arguments: Cannot find field. {code} > > From substrait query plan: > relations[0].root.input.aggregate.measures[0].measure > {code:java} > "measure": { > "functionReference": 0, > "args": [], > "sorts": [], > "phase": "AGGREGATION_PHASE_INITIAL_TO_RESULT", > "outputType": { > "i64": { > "typeVariationReference": 0, > "nullability": "NULLABILITY_REQUIRED" > } > }, > "invocation": "AGGREGATION_INVOCATION_ALL", > "arguments": [] > }{code} > {code:java} > "extensions": [{ > "extensionFunction": { > "extensionUriReference": 1, > "functionAnchor": 0, > "name": "count:opt" > } > }],{code} > Count is a unary function and should be consumable, but isn't in this case. -- This message was sent by Atlassian Jira (v8.20.10#820010)
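The "Cannot find field" failure above comes from protobuf's JSON parser rejecting fields it does not recognize, which is exactly the knob ARROW-17066 asks to flip on the C++ side (`ignore_unknown_fields` in `JsonParseOptions`). A sketch of the same behavior from Python using `google.protobuf.json_format.Parse`; the message type and field names here are stand-ins chosen only to demonstrate the flag, not part of the Substrait plan:

```python
from google.protobuf import json_format
from google.protobuf.descriptor_pb2 import FileDescriptorProto

# "someNewerField" plays the role of a field added to Substrait after
# the bundled .proto definitions were generated.
payload = '{"name": "plan.proto", "someNewerField": 1}'

# Strict parsing rejects the unknown field, mirroring the
# INVALID_ARGUMENT "Cannot find field" error reported above.
try:
    json_format.Parse(payload, FileDescriptorProto())
except json_format.ParseError as exc:
    print("strict parse failed:", exc)

# With ignore_unknown_fields=True the unknown field is silently skipped
# and the known fields are still populated.
msg = json_format.Parse(payload, FileDescriptorProto(),
                        ignore_unknown_fields=True)
print(msg.name)  # → plan.proto
```

The trade-off is the usual one: ignoring unknown fields keeps the consumer working across Substrait releases, but it also hides genuinely misspelled or unsupported fields.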
[jira] [Closed] (ARROW-7131) [GLib][CI] Fail to execute lua examples in the MacOS build
[ https://issues.apache.org/jira/browse/ARROW-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou closed ARROW-7131. --- Resolution: Won't Do We don't need this. > [GLib][CI] Fail to execute lua examples in the MacOS build > -- > > Key: ARROW-7131 > URL: https://issues.apache.org/jira/browse/ARROW-7131 > Project: Apache Arrow > Issue Type: Improvement > Components: CI, Continuous Integration, GLib >Reporter: Krisztian Szucs >Priority: Major > > Fails to load 'lgi.corelgilua51' despite lgi being installed in the macOS > build. > References: > - https://github.com/apache/arrow/blob/master/.github/workflows/ruby.yml#L77 > - https://github.com/apache/arrow/blob/master/ci/scripts/c_glib_test.sh#L35 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-14316) [CI] extends is removed from docker v3
[ https://issues.apache.org/jira/browse/ARROW-14316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou closed ARROW-14316. Resolution: Invalid We can use extends again with recent docker-compose. > [CI] extends is removed from docker v3 > -- > > Key: ARROW-14316 > URL: https://issues.apache.org/jira/browse/ARROW-14316 > Project: Apache Arrow > Issue Type: Bug >Reporter: Benson Muite >Priority: Minor > > As explained in [https://github.com/docker/compose/issues/4315], extends has > been removed from the docker compose v3 schema; it should therefore be removed > from the schema used in Arrow. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-3902) [Gandiva] [C++] Remove static c++ linked in Gandiva.
[ https://issues.apache.org/jira/browse/ARROW-3902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou closed ARROW-3902. --- Resolution: Invalid Now, we need to do nothing for this. > [Gandiva] [C++] Remove static c++ linked in Gandiva. > > > Key: ARROW-3902 > URL: https://issues.apache.org/jira/browse/ARROW-3902 > Project: Apache Arrow > Issue Type: Task > Components: C++ - Gandiva >Affects Versions: 0.12.0 >Reporter: Praveen Kumar >Priority: Major > > Hi, > [~wesm_impala_7e40], I am looking into switching Gandiva to the Red Hat > developer toolchain. We are not too familiar with it and are not sure of the > effort required there. > In the meanwhile, for the short term, can we get Crossbow builds to only > do static linking for Dremio builds (through a Travis env variable), and have > Arrow ship Gandiva linked to std-c++ dynamically? > We can then move to the Red Hat toolchain for the 0.13 version of Arrow? > Thx. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-14360) [Ruby] Add DSL to build expression
[ https://issues.apache.org/jira/browse/ARROW-14360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou reassigned ARROW-14360: Assignee: Kouhei Sutou > [Ruby] Add DSL to build expression > -- > > Key: ARROW-14360 > URL: https://issues.apache.org/jira/browse/ARROW-14360 > Project: Apache Arrow > Issue Type: Improvement > Components: Ruby >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)