[jira] [Created] (ARROW-12070) [GLib] Drop support for GNU Autotools
Kouhei Sutou created ARROW-12070: Summary: [GLib] Drop support for GNU Autotools Key: ARROW-12070 URL: https://issues.apache.org/jira/browse/ARROW-12070 Project: Apache Arrow Issue Type: Improvement Components: GLib Reporter: Kouhei Sutou Assignee: Kouhei Sutou If we drop support for GNU Autotools, we can simplify our source archive release process. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12069) Implement IN expressions for Decimal types
João Victor Huguenin created ARROW-12069: Summary: Implement IN expressions for Decimal types Key: ARROW-12069 URL: https://issues.apache.org/jira/browse/ARROW-12069 Project: Apache Arrow Issue Type: New Feature Components: C++ - Gandiva Reporter: João Victor Huguenin Implement support for checking whether a Decimal value is IN an Arrow decimal field, regardless of its precision or scale. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12068) [Python] Stop using distutils
Antoine Pitrou created ARROW-12068: -- Summary: [Python] Stop using distutils Key: ARROW-12068 URL: https://issues.apache.org/jira/browse/ARROW-12068 Project: Apache Arrow Issue Type: Task Components: Python Reporter: Antoine Pitrou Fix For: 5.0.0 According to [PEP 632|https://www.python.org/dev/peps/pep-0632/], distutils will be deprecated in Python 3.10 and removed in 3.12. -- This message was sent by Atlassian Jira (v8.3.4#803005)
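PEP 632 points most distutils users at stdlib replacements. A minimal sketch of such a migration (editor's illustration of the general pattern, not the actual changes made in pyarrow):

```python
# Illustrative distutils -> stdlib migration, per PEP 632's suggestions.

import sysconfig  # stdlib replacement for distutils.sysconfig

# Before: from distutils.sysconfig import get_python_inc
# After:
include_dir = sysconfig.get_paths()["include"]

# Before: from distutils.util import strtobool
# After: a small local helper with the same semantics.
def strtobool(value: str) -> bool:
    value = value.lower()
    if value in ("y", "yes", "t", "true", "on", "1"):
        return True
    if value in ("n", "no", "f", "false", "off", "0"):
        return False
    raise ValueError(f"invalid truth value {value!r}")

print(strtobool("yes"), strtobool("off"))
```

`setuptools` also vendors much of distutils for packaging code that cannot switch to `sysconfig` directly.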
[jira] [Created] (ARROW-12067) [Python][Doc] Document pyarrow_(un)wrap_scalar
Antoine Pitrou created ARROW-12067: -- Summary: [Python][Doc] Document pyarrow_(un)wrap_scalar Key: ARROW-12067 URL: https://issues.apache.org/jira/browse/ARROW-12067 Project: Apache Arrow Issue Type: Bug Components: C++, Documentation Reporter: Antoine Pitrou Assignee: Antoine Pitrou Fix For: 4.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12066) [Python] Dataset API seg fault when filtering string column for None
Thomas Blauth created ARROW-12066: - Summary: [Python] Dataset API seg fault when filtering string column for None Key: ARROW-12066 URL: https://issues.apache.org/jira/browse/ARROW-12066 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 3.0.0 Environment: macOS 10.15.7 Reporter: Thomas Blauth Trying to load a parquet file using the dataset api leads to a segmentation fault when filtering string columns for None values. Minimal reproducing example: {code:python} import pyarrow as pa import pyarrow.dataset import pyarrow.parquet import pandas as pd path = "./test.parquet" df = pd.DataFrame({"A": ("a", "b", None)}) pa.parquet.write_table(pa.table(df), path) ds = pa.dataset.dataset(path, format="parquet") filter = pa.dataset.field("A") == pa.dataset.scalar(None) table = ds.to_table(filter=filter) {code} Backtrace: {code:bash} (lldb) target create "/usr/local/mambaforge/envs/xxx/bin/python" Current executable set to '/usr/local/mambaforge/envs/xxx/bin/python' (x86_64). (lldb) settings set -- target.run-args "./tmp.py" (lldb) r Process 35235 launched: '/usr/local/mambaforge/envs/xxx/bin/python' (x86_64) Process 35235 stopped * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x9) frame #0: 0x00010314be48 libarrow.300.0.0.dylib`arrow::Status arrow::VisitScalarInline(arrow::Scalar const&, arrow::ScalarHashImpl*) + 104 libarrow.300.0.0.dylib`arrow::VisitScalarInline: -> 0x10314be48 <+104>: cmpb $0x0, 0x9(%rax) 0x10314be4c <+108>: je 0x10314c0bc ; <+732> 0x10314be52 <+114>: movq 0x10(%rax), %rdi 0x10314be56 <+118>: movq 0x20(%rax), %rsi Target 0: (python) stopped. 
(lldb) bt * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x9) * frame #0: 0x00010314be48 libarrow.300.0.0.dylib`arrow::Status arrow::VisitScalarInline(arrow::Scalar const&, arrow::ScalarHashImpl*) + 104 frame #1: 0x00010314bd4f libarrow.300.0.0.dylib`arrow::ScalarHashImpl::AccumulateHashFrom(arrow::Scalar const&) + 111 frame #2: 0x000103134bca libarrow.300.0.0.dylib`arrow::Scalar::Hash::hash(arrow::Scalar const&) + 42 frame #3: 0x000132fa0ea8 libarrow_dataset.300.0.0.dylib`arrow::dataset::Expression::hash() const + 264 frame #4: 0x000132fc913c libarrow_dataset.300.0.0.dylib`std::__1::__hash_const_iterator*> std::__1::__hash_table, std::__1::allocator >::find(arrow::dataset::Expression const&) const + 28 frame #5: 0x000132faca9b libarrow_dataset.300.0.0.dylib`arrow::Result arrow::dataset::Modify(arrow::dataset::Expression, arrow::dataset::Canonicalize(arrow::dataset::Expression, arrow::compute::ExecContext*)::$_1 const&, arrow::dataset::Canonicalize(arrow::dataset::Expression, arrow::compute::ExecContext*)::$_9 const&) + 123 frame #6: 0x000132fac623 libarrow_dataset.300.0.0.dylib`arrow::dataset::Canonicalize(arrow::dataset::Expression, arrow::compute::ExecContext*) + 131 frame #7: 0x000132fac76d libarrow_dataset.300.0.0.dylib`arrow::dataset::Canonicalize(arrow::dataset::Expression, arrow::compute::ExecContext*) + 461 frame #8: 0x000132fb00cb libarrow_dataset.300.0.0.dylib`arrow::dataset::SimplifyWithGuarantee(arrow::dataset::Expression, arrow::dataset::Expression const&)::$_10::operator()() const + 75 frame #9: 0x000132faf6b5 libarrow_dataset.300.0.0.dylib`arrow::dataset::SimplifyWithGuarantee(arrow::dataset::Expression, arrow::dataset::Expression const&) + 517 frame #10: 0x000132f893f8 libarrow_dataset.300.0.0.dylib`arrow::dataset::Dataset::GetFragments(arrow::dataset::Expression) + 88 frame #11: 0x000132f8d25c libarrow_dataset.300.0.0.dylib`arrow::dataset::GetFragmentsFromDatasets(std::__1::vector, std::__1::allocator > 
> const&, arrow::dataset::Expression)::'lambda'(std::__1::shared_ptr)::operator()(std::__1::shared_ptr) const + 76 frame #12: 0x000132f8cd6c libarrow_dataset.300.0.0.dylib`arrow::MapIterator, std::__1::allocator > > const&, arrow::dataset::Expression)::'lambda'(std::__1::shared_ptr), std::__1::shared_ptr, arrow::Iterator > >::Next() + 316 frame #13: 0x000132f8cb27 libarrow_dataset.300.0.0.dylib`arrow::Result > > arrow::Iterator > >::Next, std::__1::allocator > > const&, arrow::dataset::Expression)::'lambda'(std::__1::shared_ptr), std::__1::shared_ptr, arrow::Iterator > > >(void*) + 39 frame #14: 0x000132f8dcdb libarrow_dataset.300.0.0.dylib`arrow::Iterator > >::Next() + 43 frame #15: 0x000132f8d692 libarrow_dataset.300.0.0.dylib`arrow::FlattenIterator >::Next() + 258 frame #16: 0x000132f8d477 libarrow_dataset.300.0.0.dylib`arrow::Result > arrow::Iterator >::Next > >(void*) + 39 frame #17: 0x000132f8de0b libarrow_dataset.300.0.0.dylib`arrow::It
[jira] [Created] (ARROW-12065) segfault in pyarrow read_json
Patrick created ARROW-12065: --- Summary: segfault in pyarrow read_json Key: ARROW-12065 URL: https://issues.apache.org/jira/browse/ARROW-12065 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 3.0.0 Environment: arch linux, 31G ram Reporter: Patrick I noticed this when doing some analysis on a not very complex, but reasonably large JSON file, and I've simplified it to a fairly minimal reproduction: ``` import pyarrow.json pyarrow.json.read_json('test.json') ``` and `test.json` is ``` {"A":"<0 repeated 1.6 million times>"} {"B":[]} ``` This seems like it shouldn't be too large to load into memory all at once, so I'm surprised there is a segfault. Running via gdb and getting a backtrace gives ``` (gdb) bt #0 0x75c1965d in std::__shared_ptr::__shared_ptr(std::__shared_ptr const&) () from /home/patrick/.local/lib/python3.9/site-packages/pyarrow/libarrow.so.300 #1 0x75ca8d9e in arrow::json::ChunkedListArrayBuilder::Insert(long, std::shared_ptr const&, std::shared_ptr const&) () from /home/patrick/.local/lib/python3.9/site-packages/pyarrow/libarrow.so.300 #2 0x75cabcc8 in arrow::json::ChunkedStructArrayBuilder::Finish(std::shared_ptr*) () from /home/patrick/.local/lib/python3.9/site-packages/pyarrow/libarrow.so.300 #3 0x75c1fc16 in arrow::json::TableReaderImpl::Read() () from /home/patrick/.local/lib/python3.9/site-packages/pyarrow/libarrow.so.300 #4 0x7fffcf73da69 in __pyx_pw_7pyarrow_5_json_1read_json(_object*, _object*, _object*) () from /home/patrick/.local/lib/python3.9/site-packages/pyarrow/_json.cpython-39-x86_64-linux-gnu.so #5 0x77d35a43 in ?? () from /usr/lib/libpython3.9.so.1.0 #6 0x77d1be6d in _PyObject_MakeTpCall () from /usr/lib/libpython3.9.so.1.0 #7 0x77d17b3a in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.9.so.1.0 #8 0x77d119ad in ??
() from /usr/lib/libpython3.9.so.1.0 #9 0x77d11371 in _PyEval_EvalCodeWithName () from /usr/lib/libpython3.9.so.1.0 #10 0x77dd3f83 in PyEval_EvalCode () from /usr/lib/libpython3.9.so.1.0 #11 0x77de43dd in ?? () from /usr/lib/libpython3.9.so.1.0 #12 0x77ddfc7b in ?? () from /usr/lib/libpython3.9.so.1.0 #13 0x77cf38ab in ?? () from /usr/lib/libpython3.9.so.1.0 #14 0x77cf3a63 in PyRun_InteractiveLoopFlags () from /usr/lib/libpython3.9.so.1.0 #15 0x77c81f6b in PyRun_AnyFileExFlags () from /usr/lib/libpython3.9.so.1.0 #16 0x77c7665c in ?? () from /usr/lib/libpython3.9.so.1.0 #17 0x77dc6fa9 in Py_BytesMain () from /usr/lib/libpython3.9.so.1.0 #18 0x77a43b25 in __libc_start_main () from /usr/lib/libc.so.6 #19 0x504e in _start () (gdb) ``` -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [arrow-testing] pitrou merged pull request #60: ARROW-11838: fix offset buffer in golden file.
pitrou merged pull request #60: URL: https://github.com/apache/arrow-testing/pull/60 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (ARROW-12064) [Rust] [DataFusion] Make DataFrame extensible
Andy Grove created ARROW-12064: -- Summary: [Rust] [DataFusion] Make DataFrame extensible Key: ARROW-12064 URL: https://issues.apache.org/jira/browse/ARROW-12064 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove The DataFrame implementation currently has two types of logic: (1) logic for building a logical query plan, and (2) logic for executing a query using the DataFusion context. We can make DataFrame more extensible by having it always delegate to the context for execution, allowing the same DataFrame logic to be used for local and distributed execution. We will likely need to introduce a new ExecutionContext trait with different implementations for DataFusion and Ballista. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12063) [C++] Add nulls position option to sort functions
Ian Cook created ARROW-12063: Summary: [C++] Add nulls position option to sort functions Key: ARROW-12063 URL: https://issues.apache.org/jira/browse/ARROW-12063 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 3.0.0 Reporter: Ian Cook Currently in the [sort functions|https://arrow.apache.org/docs/cpp/compute.html#sorts-and-partitions], nulls are considered greater than any other value and are sorted at the end of the array. Add an option to enable users to sort nulls at the beginning if they wish. This option is common in analytic data systems, e.g. SQL {{NULLS FIRST}} and {{NULLS LAST}}, pandas {{na_position}}, R {{na.last}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
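The requested behavior can be illustrated with plain Python lists, using None to stand in for null (editor's sketch of the semantics, not Arrow's API; the `nulls_last` flag mirrors SQL's NULLS LAST / pandas' `na_position="last"`):

```python
# Ascending sort with configurable null placement. The sort key is a tuple:
# the first element groups nulls before or after all values, the second
# orders the non-null values among themselves.
def sort_with_nulls(values, nulls_last=True):
    return sorted(
        values,
        key=lambda v: (v is None if nulls_last else v is not None,
                       0 if v is None else v),
    )

data = [3, None, 1, 2]
print(sort_with_nulls(data))                    # [1, 2, 3, None]
print(sort_with_nulls(data, nulls_last=False))  # [None, 1, 2, 3]
```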
[jira] [Created] (ARROW-12061) Proxy set as environment variables doesn't seem to be picked up
Karthikeyan Janakiraman created ARROW-12061: --- Summary: Proxy set as environment variables doesn't seem to be picked up Key: ARROW-12061 URL: https://issues.apache.org/jira/browse/ARROW-12061 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 3.0.0 Environment: QA Reporter: Karthikeyan Janakiraman I am using arrow 3.0.0 to read parquet from AWS S3. I have set the proxy in the R terminal using Sys.setenv; however, it doesn't seem to be picked up when arrow tries to hit the S3 bucket. I'd appreciate any help here. {code:java} > Sys.setenv(http_proxy="http://proxy:9099";) > Sys.setenv(https_proxy="http://proxy:9099";) > Sys.setenv(HTTPS_PROXY="http://proxy:9099";) > Sys.setenv(HTTP_PROXY="http://proxy:9099";) > Sys.getenv("http_proxy") [1] "http://proxy:9099"; > df <- > read_parquet("s3://my_bucket/test-parquet/refinement.parquet?region=eu-west-1") Error: IOError: When reading information for key 'test-parquet/refinement.parquet' in bucket 'my_bucket': AWS Error [code 99]: Unable to connect to endpoint with address : 52.218.57.8 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12062) Proxy set as environment variables doesn't seem to be picked up
Karthikeyan Janakiraman created ARROW-12062: --- Summary: Proxy set as environment variables doesn't seem to be picked up Key: ARROW-12062 URL: https://issues.apache.org/jira/browse/ARROW-12062 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 3.0.0 Environment: QA Reporter: Karthikeyan Janakiraman I am using arrow 3.0.0 to read parquet from AWS S3. I have set the proxy in the R terminal using Sys.setenv; however, it doesn't seem to be picked up when arrow tries to hit the S3 bucket. I'd appreciate any help here. {code:java} > Sys.setenv(http_proxy="http://proxy:9099";) > Sys.setenv(https_proxy="http://proxy:9099";) > Sys.setenv(HTTPS_PROXY="http://proxy:9099";) > Sys.setenv(HTTP_PROXY="http://proxy:9099";) > Sys.getenv("http_proxy") [1] "http://proxy:9099"; > df <- > read_parquet("s3://my_bucket/test-parquet/refinement.parquet?region=eu-west-1") Error: IOError: When reading information for key 'test-parquet/refinement.parquet' in bucket 'my_bucket': AWS Error [code 99]: Unable to connect to endpoint with address : 52.218.57.8 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12060) [Python] Enable calling compute functions on Expressions
Joris Van den Bossche created ARROW-12060: - Summary: [Python] Enable calling compute functions on Expressions Key: ARROW-12060 URL: https://issues.apache.org/jira/browse/ARROW-12060 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche Fix For: 4.0.0 To expose the full power of dataset (projection/filter) expressions, we should ensure that all compute kernels can be used in combination with expressions. -- This message was sent by Atlassian Jira (v8.3.4#803005)
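The idea of letting compute functions accept deferred expressions can be sketched in plain Python (editor's illustration of the general pattern; names like `make_deferrable` are hypothetical, not pyarrow's API):

```python
# A deferred expression node: records an operation and its arguments
# instead of computing a value.
class Expr:
    def __init__(self, op, *args):
        self.op, self.args = op, args

    def __repr__(self):
        return f"{self.op}({', '.join(map(repr, self.args))})"

# Wrap an eager compute function so that, when handed an Expr, it returns
# a new Expr node; with concrete inputs it computes immediately.
def make_deferrable(name, eager_fn):
    def wrapper(*args):
        if any(isinstance(a, Expr) for a in args):
            return Expr(name, *args)
        return eager_fn(*args)
    return wrapper

add = make_deferrable("add", lambda a, b: a + b)
print(add(1, 2))                    # 3 (eager)
print(add(Expr("field", "a"), 1))   # add(field('a'), 1) (deferred)
```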
[jira] [Created] (ARROW-12059) [R] Accept format-specific scan options in collect()
David Li created ARROW-12059: Summary: [R] Accept format-specific scan options in collect() Key: ARROW-12059 URL: https://issues.apache.org/jira/browse/ARROW-12059 Project: Apache Arrow Issue Type: Task Components: R Affects Versions: 4.0.0 Reporter: David Li Fix For: 5.0.0 ARROW-9749 and ARROW-8631 added format/scan-specific options. In R, the most natural place to accept these is in collect(), but this isn't yet done. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12058) [Python] Enable arithmetic operations on Expressions
Joris Van den Bossche created ARROW-12058: - Summary: [Python] Enable arithmetic operations on Expressions Key: ARROW-12058 URL: https://issues.apache.org/jira/browse/ARROW-12058 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 4.0.0 To make the dataset (projection) expressions more usable, we can add some more dunder methods to the class: just as we can already do {{expr == 1}} for comparison operations, we can also enable arithmetic Python operators. -- This message was sent by Atlassian Jira (v8.3.4#803005)
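A minimal sketch of the dunder-method pattern being proposed (editor's illustration, not pyarrow's actual Expression class): each operator builds a deferred expression node rather than computing a value.

```python
class Expr:
    def __init__(self, op, *args):
        self.op, self.args = op, args

    def __eq__(self, other):   # already supported: expr == 1
        return Expr("equal", self, other)

    def __add__(self, other):  # proposed: expr + 1
        return Expr("add", self, other)

    def __mul__(self, other):  # proposed: expr * 2
        return Expr("multiply", self, other)

    def __repr__(self):
        return f"{self.op}({', '.join(map(repr, self.args))})"

field = Expr("field", "a")
print((field + 1) * 2)  # multiply(add(field('a'), 1), 2)
```

The real implementation would also need the reflected variants (`__radd__`, etc.) so that `1 + expr` works.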
[jira] [Created] (ARROW-12057) [Python] Remove direct usage of pandas' Block subclasses
Joris Van den Bossche created ARROW-12057: - Summary: [Python] Remove direct usage of pandas' Block subclasses Key: ARROW-12057 URL: https://issues.apache.org/jira/browse/ARROW-12057 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche Fix For: 4.0.0 The {{CategoricalBlock}} was removed in pandas (https://github.com/pandas-dev/pandas/pull/40527), which breaks the nightly tests. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12056) [C++] Create sequencing operator
Weston Pace created ARROW-12056: --- Summary: [C++] Create sequencing operator Key: ARROW-12056 URL: https://issues.apache.org/jira/browse/ARROW-12056 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Weston Pace ARROW-7001 needs a sequencing operator to reorder fragments & scan tasks that arrive out of order. This AsyncGenerator would poll the source and buffer results until the "next" result arrives. For example, given a source of 6,2,1,3,4,5 the operator would return 1,2,3,4,5,6 and would need to buffer 2 items (6 & 2 at the beginning). The notion of which item comes next will be configurable via a function. -- This message was sent by Atlassian Jira (v8.3.4#803005)
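The buffering behavior described above can be sketched synchronously (editor's illustration; the real operator would wrap an AsyncGenerator, and here each item serves as its own sequencing key):

```python
# Yield items from `source` in sequence order, buffering anything that
# arrives early. `next_key` is the configurable "what comes next" function.
def sequence(source, first, next_key):
    pending = set()
    expected = first
    for item in source:
        pending.add(item)
        # Drain every consecutive item that is now available.
        while expected in pending:
            pending.discard(expected)
            yield expected
            expected = next_key(expected)

out = list(sequence([6, 2, 1, 3, 4, 5], first=1, next_key=lambda k: k + 1))
print(out)  # [1, 2, 3, 4, 5, 6]
```

On this input the buffer holds at most two items (6 and 2, while waiting for 1), matching the example in the issue.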
[jira] [Created] (ARROW-12055) [R] is.na() evaluates to FALSE on Arrow NaN values
Ian Cook created ARROW-12055: Summary: [R] is.na() evaluates to FALSE on Arrow NaN values Key: ARROW-12055 URL: https://issues.apache.org/jira/browse/ARROW-12055 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 3.0.0 Reporter: Ian Cook {code:java} > is.na(NaN) [1] TRUE > is.na(Scalar$create(NaN)) [1] FALSE{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12054) [C++] Parquet statistics incorrect for decimal128
Weston Pace created ARROW-12054: --- Summary: [C++] Parquet statistics incorrect for decimal128 Key: ARROW-12054 URL: https://issues.apache.org/jira/browse/ARROW-12054 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Weston Pace {code:java} import decimal import pyarrow as pa import pyarrow.parquet as pq dtype = pa.decimal128(12, 4) ctx = decimal.Context(prec=12) arr = pa.array([0, ctx.create_decimal(3.99)], dtype) table = pa.Table.from_arrays([arr], ["foo"]) pq.write_table(table, '/tmp/foo.pq') meta = pq.read_metadata('/tmp/foo.pq') print(meta.row_group(0).column(0).statistics) {code} Expected 0 to be the min and 3.99 to be the max but got the reverse. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12053) Implement aggregate compute functions for decimal datatype
Taras Kuzyo created ARROW-12053: --- Summary: Implement aggregate compute functions for decimal datatype Key: ARROW-12053 URL: https://issues.apache.org/jira/browse/ARROW-12053 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Affects Versions: 3.0.0 Reporter: Taras Kuzyo When I try to run an aggregate function on a decimal array I get the following errors: pyarrow.lib.ArrowNotImplementedError: Function min_max has no kernel matching input types (array[decimal(12, 4)]) pyarrow.lib.ArrowNotImplementedError: Function sum has no kernel matching input types (array[decimal(12, 4)]) -- This message was sent by Atlassian Jira (v8.3.4#803005)
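The semantics such kernels would need can be shown with the stdlib decimal module (editor's sketch; aggregating exactly, without rounding through float, is the point of having dedicated decimal kernels rather than casting to float64 as a workaround):

```python
from decimal import Decimal

# Exact min/max/sum over decimal values, skipping nulls (None), mirroring
# what min_max and sum kernels would compute for a decimal(12, 4) array.
def min_max_sum(values):
    present = [v for v in values if v is not None]
    return min(present), max(present), sum(present, Decimal(0))

data = [Decimal("0.0000"), Decimal("3.9900"), None]
print(min_max_sum(data))
# (Decimal('0.0000'), Decimal('3.9900'), Decimal('3.9900'))
```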
[jira] [Created] (ARROW-12052) Implement child data in Arrow Rust C FFI
Ritchie created ARROW-12052: --- Summary: Implement child data in Arrow Rust C FFI Key: ARROW-12052 URL: https://issues.apache.org/jira/browse/ARROW-12052 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Ritchie Assignee: Ritchie -- This message was sent by Atlassian Jira (v8.3.4#803005)