[jira] [Updated] (ARROW-13603) [GLib] GARROW_VERSION_CHECK() always returns false
[ https://issues.apache.org/jira/browse/ARROW-13603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13603: --- Labels: pull-request-available (was: ) > [GLib] GARROW_VERSION_CHECK() always returns false > -- > > Key: ARROW-13603 > URL: https://issues.apache.org/jira/browse/ARROW-13603 > Project: Apache Arrow > Issue Type: Bug > Components: GLib >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13603) [GLib] GARROW_VERSION_CHECK() always returns false
Kouhei Sutou created ARROW-13603:

Summary: [GLib] GARROW_VERSION_CHECK() always returns false
Key: ARROW-13603
URL: https://issues.apache.org/jira/browse/ARROW-13603
Project: Apache Arrow
Issue Type: Bug
Components: GLib
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
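For context, a hedged sketch of how GLib-style version-check macros are usually structured. The `EXAMPLE_*` names and version values below are illustrative only, not Arrow GLib's actual definitions; a check that "always returns false" typically means the comparison direction or the encoded compile-time version in such a macro is wrong.

```cpp
// Illustrative GLib-convention version-check macros (hypothetical names/values,
// not Arrow GLib's real code).
#define EXAMPLE_MAJOR_VERSION 5
#define EXAMPLE_MINOR_VERSION 0
#define EXAMPLE_MICRO_VERSION 0

// Encode a major.minor.micro triple as a single comparable integer.
#define EXAMPLE_VERSION_ENCODE(major, minor, micro) \
  (((major) * 10000) + ((minor) * 100) + (micro))

// The library's own compile-time version, encoded once.
#define EXAMPLE_VERSION                                                \
  EXAMPLE_VERSION_ENCODE(EXAMPLE_MAJOR_VERSION, EXAMPLE_MINOR_VERSION, \
                         EXAMPLE_MICRO_VERSION)

// True when the compile-time version is at least major.minor.micro.
#define EXAMPLE_VERSION_CHECK(major, minor, micro) \
  (EXAMPLE_VERSION >= EXAMPLE_VERSION_ENCODE(major, minor, micro))
```

Reversing the `>=` comparison (or encoding the wrong side) makes the check fail for every released version, which matches the symptom in the issue title.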
[jira] [Updated] (ARROW-13601) [C++] Tests maybe uninitialized compiler warnings
[ https://issues.apache.org/jira/browse/ARROW-13601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith Kraus updated ARROW-13601:
Description:

Using gcc9 I get the following:

{code}
/home/keith/git/arrow/cpp/src/arrow/testing/json_internal.cc: In function 'arrow::Status arrow::testing::json::{anonymous}::GetField(const Value&, arrow::ipc::internal::FieldPosition, arrow::ipc::DictionaryMemo*, std::shared_ptr*)':
/home/keith/git/arrow/cpp/src/arrow/testing/json_internal.cc:1128:31: warning: 'is_ordered' may be used uninitialized in this function [-Wmaybe-uninitialized]
 1128 |   type = ::arrow::dictionary(index_type, type, is_ordered);
{code}

{code}
In file included from /home/keith/git/arrow/cpp/src/arrow/util/bit_run_reader.h:26,
                 from /home/keith/git/arrow/cpp/src/arrow/util/bit_util_test.cc:45:
/home/keith/git/arrow/cpp/src/arrow/util/bitmap_reader.h: In member function 'void arrow::TestBitmapUInt64Reader::AssertWords(const arrow::Buffer&, int64_t, int64_t, const std::vector&)':
/home/keith/git/arrow/cpp/src/arrow/util/bitmap_reader.h:99:16: warning: 'reader.arrow::internal::BitmapUInt64Reader::carry_bits_' may be used uninitialized in this function [-Wmaybe-uninitialized]
   99 |   uint64_t word = carry_bits_ | (next_word << num_carry_bits_);
/home/keith/git/arrow/cpp/src/arrow/util/bit_util_test.cc:242:34: note: 'reader.arrow::internal::BitmapUInt64Reader::carry_bits_' was declared here
  242 |   internal::BitmapUInt64Reader reader(buffer.data(), start_offset, length);
{code}

{code}
In file included from /home/keith/git/arrow/cpp/src/arrow/util/async_generator.h:30,
                 from /home/keith/git/arrow/cpp/src/arrow/util/iterator_test.cc:30:
/home/keith/git/arrow/cpp/src/arrow/util/iterator.h: In member function 'arrow::Result arrow::TransformIterator::Next() [with T = std::shared_ptr; V = std::shared_ptr]':
/home/keith/git/arrow/cpp/src/arrow/util/iterator.h:288:12: warning: '*((void*)(& next)+24).std::__shared_count<>::_M_pi' may be used uninitialized in this function [-Wmaybe-uninitialized]
  288 |   auto next = *next_res;
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/util/utf8_util_test.cc: In function 'arrow::util::UTF8FindIf_Basics_Test::TestBody()::':
/home/keith/git/arrow/cpp/src/arrow/util/utf8_util_test.cc:463:35: warning: 'right' may be used uninitialized in this function [-Wmaybe-uninitialized]
  463 |   EXPECT_EQ(offset_right, right - begin);
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/util/utf8_util_test.cc:462:33: warning: 'left' may be used uninitialized in this function [-Wmaybe-uninitialized]
  462 |   EXPECT_EQ(offset_left, left - begin);
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/util/utf8_util_test.cc: In function 'arrow::util::UTF8FindIf_Basics_Test::TestBody()::':
/home/keith/git/arrow/cpp/src/arrow/util/utf8_util_test.cc:447:35: warning: 'right' may be used uninitialized in this function [-Wmaybe-uninitialized]
  447 |   EXPECT_EQ(offset_right, right - begin);
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/util/utf8_util_test.cc:446:33: warning: 'left' may be used uninitialized in this function [-Wmaybe-uninitialized]
  446 |   EXPECT_EQ(offset_left, left - begin);
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/util/variant_test.cc: In member function 'virtual void arrow::util::{anonymous}::VariantTest_Visit_Test::TestBody()':
/home/keith/git/arrow/cpp/src/arrow/util/variant_test.cc:260:3: warning: '*((void*)& +8)' may be used uninitialized in this function [-Wmaybe-uninitialized]
  260 |   for (auto Assert :
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/util/variant_test.cc:230:8: warning: '.arrow::util::{anonymous}::AssertVisitOne >, const char*>, std::vector >::expected_' may be used uninitialized in this function [-Wmaybe-uninitialized]
  230 |   struct AssertVisitOne {
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/gpu/cuda_benchmark.cc: In function 'void arrow::cuda::CudaBufferWriterBenchmark(benchmark::State&, int64_t, int64_t, int64_t)':
/home/keith/git/arrow/cpp/src/arrow/gpu/cuda_benchmark.cc:42:35: warning: 'manager' may be used uninitialized in this function [-Wmaybe-uninitialized]
   42 |   ABORT_NOT_OK(manager->GetContext(kGpuNumber).Value());
{code}

{code}
/home/keith/git/arrow/cpp/src/parquet/arrow/arrow_schema_test.cc: In function 'arrow::Result > parquet::arrow::RootToTreeLeafLevels(const parquet::arrow::SchemaManifest&, int)':
/home/keith/git/arrow/cpp/src/parquet/arrow/arrow_schema_test.cc:1343:16:
{code}
[jira] [Created] (ARROW-13602) [C++] Tests dereferencing type-punned pointer compiler warnings
Keith Kraus created ARROW-13602:

Summary: [C++] Tests dereferencing type-punned pointer compiler warnings
Key: ARROW-13602
URL: https://issues.apache.org/jira/browse/ARROW-13602
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Keith Kraus

Using gcc9:

{code}
In file included from /home/keith/miniconda3/envs/dev/include/gtest/gtest.h:375,
                 from /home/keith/miniconda3/envs/dev/include/gmock/internal/gmock-internal-utils.h:47,
                 from /home/keith/miniconda3/envs/dev/include/gmock/gmock-actions.h:51,
                 from /home/keith/miniconda3/envs/dev/include/gmock/gmock.h:59,
                 from /home/keith/git/arrow/cpp/src/arrow/util/bit_util_test.cc:30:
/home/keith/git/arrow/cpp/src/arrow/util/bit_util_test.cc: In member function 'virtual void arrow::BitUtil_ByteSwap_Test::TestBody()':
/home/keith/git/arrow/cpp/src/arrow/util/bit_util_test.cc:1835:32: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
 1835 |   EXPECT_EQ(BitUtil::ByteSwap(*reinterpret_cast()),
/home/keith/git/arrow/cpp/src/arrow/util/bit_util_test.cc:1836:14: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
 1836 |             *reinterpret_cast());
/home/keith/git/arrow/cpp/src/arrow/util/bit_util_test.cc:1838:32: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
 1838 |   EXPECT_EQ(BitUtil::ByteSwap(*reinterpret_cast()),
/home/keith/git/arrow/cpp/src/arrow/util/bit_util_test.cc:1839:14: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
 1839 |             *reinterpret_cast());
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
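The warnings above come from punning a value through `reinterpret_cast`, which violates strict aliasing. The usual well-defined alternative is a `memcpy` round trip, the pattern C++20 formalizes as `std::bit_cast`. This is an illustrative sketch of that pattern, not Arrow's actual fix:

```cpp
#include <cstdint>
#include <cstring>

// Aliasing-safe reinterpretation of an object's bytes: unlike
// *reinterpret_cast<To*>(&from), memcpy between trivially copyable
// types of equal size is well-defined and optimizes to the same code.
template <typename To, typename From>
To BitCastSafe(const From& from) {
  static_assert(sizeof(To) == sizeof(From), "sizes must match");
  To to;
  std::memcpy(&to, &from, sizeof(To));
  return to;
}
```

For example, `BitCastSafe<uint64_t>(1.0)` yields the IEEE 754 bit pattern of 1.0 without triggering `-Wstrict-aliasing`.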
[jira] [Created] (ARROW-13601) [C++] Tests maybe uninitialized compiler warnings
Keith Kraus created ARROW-13601:

Summary: [C++] Tests maybe uninitialized compiler warnings
Key: ARROW-13601
URL: https://issues.apache.org/jira/browse/ARROW-13601
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Keith Kraus

Using gcc9 I get the following:

{code}
/home/keith/git/arrow/cpp/src/arrow/testing/json_internal.cc: In function 'arrow::Status arrow::testing::json::{anonymous}::GetField(const Value&, arrow::ipc::internal::FieldPosition, arrow::ipc::DictionaryMemo*, std::shared_ptr*)':
/home/keith/git/arrow/cpp/src/arrow/testing/json_internal.cc:1128:31: warning: 'is_ordered' may be used uninitialized in this function [-Wmaybe-uninitialized]
 1128 |   type = ::arrow::dictionary(index_type, type, is_ordered);
{code}

{code}
In file included from /home/keith/git/arrow/cpp/src/arrow/util/bit_run_reader.h:26,
                 from /home/keith/git/arrow/cpp/src/arrow/util/bit_util_test.cc:45:
/home/keith/git/arrow/cpp/src/arrow/util/bitmap_reader.h: In member function 'void arrow::TestBitmapUInt64Reader::AssertWords(const arrow::Buffer&, int64_t, int64_t, const std::vector&)':
/home/keith/git/arrow/cpp/src/arrow/util/bitmap_reader.h:99:16: warning: 'reader.arrow::internal::BitmapUInt64Reader::carry_bits_' may be used uninitialized in this function [-Wmaybe-uninitialized]
   99 |   uint64_t word = carry_bits_ | (next_word << num_carry_bits_);
/home/keith/git/arrow/cpp/src/arrow/util/bit_util_test.cc:242:34: note: 'reader.arrow::internal::BitmapUInt64Reader::carry_bits_' was declared here
  242 |   internal::BitmapUInt64Reader reader(buffer.data(), start_offset, length);
{code}

{code}
In file included from /home/keith/git/arrow/cpp/src/arrow/util/async_generator.h:30,
                 from /home/keith/git/arrow/cpp/src/arrow/util/iterator_test.cc:30:
/home/keith/git/arrow/cpp/src/arrow/util/iterator.h: In member function 'arrow::Result arrow::TransformIterator::Next() [with T = std::shared_ptr; V = std::shared_ptr]':
/home/keith/git/arrow/cpp/src/arrow/util/iterator.h:288:12: warning: '*((void*)(& next)+24).std::__shared_count<>::_M_pi' may be used uninitialized in this function [-Wmaybe-uninitialized]
  288 |   auto next = *next_res;
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/util/iterator.h:288:12: warning: '*((void*)(& next)+16).std::__shared_ptr::_M_ptr' may be used uninitialized in this function [-Wmaybe-uninitialized]
[359/671] Building CXX object src/arrow/util/CMakeFiles/arrow-utility-test.dir/utf8_util_test.cc.o
/home/keith/git/arrow/cpp/src/arrow/util/utf8_util_test.cc: In function 'arrow::util::UTF8FindIf_Basics_Test::TestBody()::':
/home/keith/git/arrow/cpp/src/arrow/util/utf8_util_test.cc:463:35: warning: 'right' may be used uninitialized in this function [-Wmaybe-uninitialized]
  463 |   EXPECT_EQ(offset_right, right - begin);
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/util/utf8_util_test.cc:462:33: warning: 'left' may be used uninitialized in this function [-Wmaybe-uninitialized]
  462 |   EXPECT_EQ(offset_left, left - begin);
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/util/utf8_util_test.cc: In function 'arrow::util::UTF8FindIf_Basics_Test::TestBody()::':
/home/keith/git/arrow/cpp/src/arrow/util/utf8_util_test.cc:447:35: warning: 'right' may be used uninitialized in this function [-Wmaybe-uninitialized]
  447 |   EXPECT_EQ(offset_right, right - begin);
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/util/utf8_util_test.cc:446:33: warning: 'left' may be used uninitialized in this function [-Wmaybe-uninitialized]
  446 |   EXPECT_EQ(offset_left, left - begin);
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/util/variant_test.cc: In member function 'virtual void arrow::util::{anonymous}::VariantTest_Visit_Test::TestBody()':
/home/keith/git/arrow/cpp/src/arrow/util/variant_test.cc:260:3: warning: '*((void*)& +8)' may be used uninitialized in this function [-Wmaybe-uninitialized]
  260 |   for (auto Assert :
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/util/variant_test.cc:230:8: warning: '.arrow::util::{anonymous}::AssertVisitOne >, const char*>, std::vector >::expected_' may be used uninitialized in this function [-Wmaybe-uninitialized]
  230 |   struct AssertVisitOne {
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/gpu/cuda_benchmark.cc: In function 'void arrow::cuda::CudaBufferWriterBenchmark(benchmark::State&, int64_t, int64_t, int64_t)':
/home/keith/git/arrow/cpp/src/arrow/gpu/cuda_benchmark.cc:42:35: warning: 'manager'
{code}
[jira] [Updated] (ARROW-13600) [C++] Maybe uninitialized warnings
[ https://issues.apache.org/jira/browse/ARROW-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13600: --- Labels: pull-request-available (was: ) > [C++] Maybe uninitialized warnings > -- > > Key: ARROW-13600 > URL: https://issues.apache.org/jira/browse/ARROW-13600 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Keith Kraus >Assignee: Keith Kraus >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > {code} > Building CXX object > src/arrow/CMakeFiles/arrow_objlib.dir/compute/exec/key_hash.cc.o > /home/keith/git/arrow/cpp/src/arrow/compute/exec/key_hash.cc: In static > member function 'static void > arrow::compute::Hashing::hash_varlen_helper(uint32_t, const uint8_t*, > uint32_t*)': > /home/keith/git/arrow/cpp/src/arrow/compute/exec/key_hash.cc:202:16: warning: > 'last_stripe' may be used uninitialized in this function > [-Wmaybe-uninitialized] > 202 | uint32_t lane = reinterpret_cast uint32_t*>(last_stripe)[j]; > |^~~~ > {code} > {code} > Building CXX object src/arrow/CMakeFiles/arrow_objlib.dir/tensor.cc.o > /home/keith/git/arrow/cpp/src/arrow/tensor.cc: In member function > 'arrow::Result arrow::Tensor::CountNonZero() const': > /home/keith/git/arrow/cpp/src/arrow/tensor.cc:337:18: warning: '*((void*)& > counter +8)' may be used uninitialized in this function > [-Wmaybe-uninitialized] > 337 | NonZeroCounter counter(*this); > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13600) [C++] Maybe uninitialized warnings
Keith Kraus created ARROW-13600:

Summary: [C++] Maybe uninitialized warnings
Key: ARROW-13600
URL: https://issues.apache.org/jira/browse/ARROW-13600
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Keith Kraus
Assignee: Keith Kraus

{code}
Building CXX object src/arrow/CMakeFiles/arrow_objlib.dir/compute/exec/key_hash.cc.o
/home/keith/git/arrow/cpp/src/arrow/compute/exec/key_hash.cc: In static member function 'static void arrow::compute::Hashing::hash_varlen_helper(uint32_t, const uint8_t*, uint32_t*)':
/home/keith/git/arrow/cpp/src/arrow/compute/exec/key_hash.cc:202:16: warning: 'last_stripe' may be used uninitialized in this function [-Wmaybe-uninitialized]
  202 |   uint32_t lane = reinterpret_cast(last_stripe)[j];
{code}

{code}
Building CXX object src/arrow/CMakeFiles/arrow_objlib.dir/tensor.cc.o
/home/keith/git/arrow/cpp/src/arrow/tensor.cc: In member function 'arrow::Result arrow::Tensor::CountNonZero() const':
/home/keith/git/arrow/cpp/src/arrow/tensor.cc:337:18: warning: '*((void*)& counter +8)' may be used uninitialized in this function [-Wmaybe-uninitialized]
  337 |   NonZeroCounter counter(*this);
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
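`-Wmaybe-uninitialized` fires when gcc cannot prove that every control path assigns a variable before it is read. A minimal illustration of the warning shape and the usual remedy, zero-initialization at the declaration; the function below is a hypothetical sketch, not Arrow's actual patch:

```cpp
#include <cstdint>

// gcc cannot always prove that the conditional write below happens before
// the read, so 'last_stripe' would be flagged by -Wmaybe-uninitialized
// unless it is given an initial value at its declaration.
uint32_t StripeLane(const uint8_t* bytes, bool has_tail, int j) {
  uint32_t last_stripe[4] = {0, 0, 0, 0};  // zero-init silences the warning
  if (has_tail) {
    for (int i = 0; i < 4; ++i) {
      last_stripe[i] = static_cast<uint32_t>(bytes[i]);
    }
  }
  return last_stripe[j];
}
```

When the initialization is provably dead, compilers typically eliminate it, so the fix is usually free.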
[jira] [Commented] (ARROW-13518) Identify selected row when using filters
[ https://issues.apache.org/jira/browse/ARROW-13518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396886#comment-17396886 ] Weston Pace commented on ARROW-13518: - ARROW-13599 is somewhat related. > Identify selected row when using filters > > > Key: ARROW-13518 > URL: https://issues.apache.org/jira/browse/ARROW-13518 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Parquet, Python >Reporter: Yair Lenga >Priority: Major > > I created a proposed enhancement to speed up reading of specific rows > arrow-13517 https://issues.apache.org/jira/browse/ARROW-13517 > proposing extending the functions that provides filter parquet.read_table > ([https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html#pyarrow.parquet.read_table]) > to support returning actual row numbers (e.g, row_group and row_index). > with the proposed enhancement, this can provide for faster reading of the > data (e.g. by caching the return indices, and reading the full data when > needed). > proposed implementation will be to add 2 pseudo columns, which can be > requested in the columns list. E.g., columns=[ ‘$row_group’, ‘$row_index’, > ‘dealid’, …] or similar. > * $row_group - 0 based row group index > * $row_index - 0 based position within the row group > * $row_file_index - 0 based position in the file (not critical), can be > constructed from the other two > > not sure if this requires change to the c++ interface, or just to the python > part of pyarrow. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13599) [C++] [Dataset] Add optional scan type that tags batches with locational information
Weston Pace created ARROW-13599:

Summary: [C++] [Dataset] Add optional scan type that tags batches with locational information
Key: ARROW-13599
URL: https://issues.apache.org/jira/browse/ARROW-13599
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Weston Pace

Currently there are two types of scans:
* Ordered scan - Yields batches in order (includes batch index and fragment index)
* Unordered scan - Yields batches in any order (no batch index or fragment index)

There is a third type of scan (Tagged scan? Indexed scan?) which could tag each batch with the starting row # of the batch. Certain file types (like Parquet & IPC) should be able to support this with similar performance to an unordered scan (since the # of rows is in the metadata). Other file types (like CSV) could fall back to an ordered scan or do something like a two-pass approach to count the # of newlines in a file and then scan the file itself (not sure if this makes sense yet).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
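The metadata-only computation such an "indexed scan" could perform can be sketched as follows: given per-row-group (or per-batch) row counts, each batch's starting global row number is an exclusive prefix sum. The function name is illustrative; no such API exists in Arrow today.

```cpp
#include <cstdint>
#include <vector>

// Given per-batch row counts (available from Parquet/IPC metadata without
// reading data pages), compute each batch's 0-based starting global row
// number as an exclusive prefix sum.
std::vector<int64_t> BatchStartRows(const std::vector<int64_t>& row_counts) {
  std::vector<int64_t> starts;
  starts.reserve(row_counts.size());
  int64_t running = 0;
  for (int64_t n : row_counts) {
    starts.push_back(running);  // starting row of this batch
    running += n;
  }
  return starts;
}
```

For row groups of 100, 50, and 75 rows this yields starts 0, 100, and 150, which is why Parquet and IPC can support the scan at unordered-scan cost while CSV would need a counting pass first.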
[jira] [Assigned] (ARROW-13013) [C++][Compute][Python] Move (majority of) kernel unit tests to python
[ https://issues.apache.org/jira/browse/ARROW-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace reassigned ARROW-13013: --- Assignee: Weston Pace > [C++][Compute][Python] Move (majority of) kernel unit tests to python > - > > Key: ARROW-13013 > URL: https://issues.apache.org/jira/browse/ARROW-13013 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Ben Kietzman >Assignee: Weston Pace >Priority: Major > > mailing list discussion: > [https://lists.apache.org/thread.html/r09e0e0fbb8b655bbec8cf5662d224f3dfc4fba894a312900f73ae3bf%40%3Cdev.arrow.apache.org%3E] > Writing unit tests for compute functions in c++ is laborious, entails a lot > of boilerplate, and slows iteration since it requires recompilation when > adding new tests. The majority of these test cases need not be written in C++ > at all and could instead be made part of the pyarrow test suite. > In order to make the kernels' C++ implementations easily debuggable from unit > tests, we'll have to expose a c++ function named {{AssertCallFunction}} or > so. {{AssertCallFunction}} will invoke the named compute::Function and > compare actual results to expected without crossing the C++/python boundary, > allowing a developer to step through all relevant code with a single > breakpoint in GDB. Construction of scalars/arrays/function options and any > other inputs to the function is amply supported by {{pyarrow}}, and will > happen outside the scope of {{AssertCallFunction}}. > {{AssertCallFunction}} should not try to derive additional assertions from > its arguments - for example {{CheckScalar("add", > {left, right} > , expected)}} will first assert that {{left + right == expected}} then > {{left.slice(1) + right.slice(1) == expected.slice(1)}} to ensure that > offsets are handled correctly. This has value but can be easily expressed in > Python and configuration of such behavior would overcomplicate the interface > of {{AssertCallFunction}}. 
> Unit tests for kernels would then be written in > {{arrow/python/pyarrow/test/kernels/test_*.py}}. The C++ unit test for > [addition with implicit > casts|https://github.com/apache/arrow/blob/b38ab81cb96e393a026d05a22e5a2f62ff6c23d7/cpp/src/arrow/compute/kernels/scalar_arithmetic_test.cc#L897-L918] > could then be rewritten as
> {code:python}
> def test_addition_implicit_casts():
>     AssertCallFunction("add", [[0, 1, 2, None],
>                                [0.25, 1.5, 2.75, None]],
>                        expected=[0.25, 2.5, 4.75, None])
>     # ...
> {code}
> NB: Some unit tests will probably still reside in C++ since we'll need to test things we don't wish to expose in a user facing API, such as "whether a boolean kernel avoids clobbering bits when outputting into a slice". These should be far more manageable since they won't need to assert correct logic across all possible input types -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12922) [Java][FlightSQL] Create stubbed APIs for Flight SQL
[ https://issues.apache.org/jira/browse/ARROW-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396858#comment-17396858 ] Kyle Porter commented on ARROW-12922: - Work has now moved to https://github.com/apache/arrow/pull/10906. > [Java][FlightSQL] Create stubbed APIs for Flight SQL > > > Key: ARROW-12922 > URL: https://issues.apache.org/jira/browse/ARROW-12922 > Project: Apache Arrow > Issue Type: Task > Components: Java >Reporter: Tiffany Lam >Assignee: Kyle Porter >Priority: Major > Labels: FlightSQL, pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > This task is to create stubbed APIs for a Flight SQL client, server and > sample application. > * Contain TODOs referencing implementation Jira tasks. > * Should also be accompanied by Javadocs describing behaviour of the > methods/APIs. > * TODO: breakdown poc PR > > *Acceptance Criteria* > * TODO -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12922) [Java][FlightSQL] Create stubbed APIs for Flight SQL
[ https://issues.apache.org/jira/browse/ARROW-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12922: --- Labels: FlightSQL pull-request-available (was: FlightSQL) > [Java][FlightSQL] Create stubbed APIs for Flight SQL > > > Key: ARROW-12922 > URL: https://issues.apache.org/jira/browse/ARROW-12922 > Project: Apache Arrow > Issue Type: Task > Components: Java >Reporter: Tiffany Lam >Assignee: Kyle Porter >Priority: Major > Labels: FlightSQL, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > This task is to create stubbed APIs for a Flight SQL client, server and > sample application. > * Contain TODOs referencing implementation Jira tasks. > * Should also be accompanied by Javadocs describing behaviour of the > methods/APIs. > * TODO: breakdown poc PR > > *Acceptance Criteria* > * TODO -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9226) [Python] pyarrow.fs.HadoopFileSystem - retrieve options from core-site.xml or hdfs-site.xml if available
[ https://issues.apache.org/jira/browse/ARROW-9226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396886#comment-17396886 ] Itamar Turner-Trauring commented on ARROW-9226: ---

Looking through the code—

**The deprecated API:**
1. `pyarrow.hdfs` imports from `_hdfsio.pyx`.
2. This is a thin wrapper around `CIOHadoopFileSystem` and `HdfsConnectionConfig`.
3. The former is a wrapper around `arrow::io::HadoopFileSystem` (see `libarrow.pxd`).

**The new API:**
1. `pyarrow.fs` imports from `_hdfs.pyx`.
2. This builds on the Cython classes `CHdfsOptions` and `CHadoopFileSystem`, with a very small amount of wrapper code.
3. These are synonyms for the C++ classes `arrow::fs::HdfsOptions` and `arrow::fs::HadoopFilesystem` (see `libarrow_fs.pxd`).

---

Looking through the old code, the connection code path is in `cpp/src/arrow/io/hdfs.cc` and mostly interacts with the driver, which comes from either `libhdfs` or `libhdfs3` via a shim (https://github.com/apache/arrow/blob/7b66f97330215fe020ec536671ee50f41aa1af35/cpp/src/arrow/io/hdfs_internal.h) that `dlopen()`s the underlying `libhdfs` library... and `libhdfs` then calls into the Java implementation.

Further digging suggests that the new code path (`arrow::fs::HadoopFilesystem`) still uses the `libhdfs`/`libhdfs3` drivers from `arrow::io`: https://github.com/apache/arrow/blob/7b66f97330215fe020ec536671ee50f41aa1af35/cpp/src/arrow/filesystem/hdfs.cc#L56

Given that all the heavy lifting seems to be done by the underlying libraries, this functionality could likely be exposed again; the issue is less implementing the logic and more just exposing the underlying API again. I am probably going to try to do this.
> [Python] pyarrow.fs.HadoopFileSystem - retrieve options from core-site.xml or > hdfs-site.xml if available > > > Key: ARROW-9226 > URL: https://issues.apache.org/jira/browse/ARROW-9226 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.17.1 >Reporter: Bruno Quinart >Priority: Minor > Labels: hdfs > > 'Legacy' pyarrow.hdfs.connect was somehow able to get the namenode info from > the hadoop configuration files. > The new pyarrow.fs.HadoopFileSystem requires the host to be specified. > Inferring this info from "the environment" makes it easier to deploy > pipelines. > But more important, for HA namenodes it is almost impossible to know for sure > what to specify. If a rolling restart is ongoing, the namenode is changing. > There is no guarantee on which will be active in a HA setup. > I tried connecting to the standby namenode. The connection gets established, > but when writing a file an error is raised that standby namenodes are not > allowed to write to. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
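The shim pattern described in the comment above, loading a driver library at runtime and resolving an entry point from it, can be sketched as below. This is an illustrative example of the `dlopen()`/`dlsym()` technique only, not Arrow's actual `hdfs_internal` code; the library and symbol names are whatever the caller passes in.

```cpp
#include <dlfcn.h>

// Resolve a driver entry point from a shared library at runtime.
// Returns nullptr if the library cannot be loaded or the symbol is absent;
// a real shim would also record dlerror() for diagnostics.
void* LoadDriverSymbol(const char* library, const char* symbol) {
  void* handle = dlopen(library, RTLD_NOW | RTLD_GLOBAL);
  if (handle == nullptr) {
    return nullptr;  // library not found / not loadable
  }
  return dlsym(handle, symbol);
}
```

This is what lets one binary work against either `libhdfs` or `libhdfs3` without linking to them at build time.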
[jira] [Commented] (ARROW-9226) [Python] pyarrow.fs.HadoopFileSystem - retrieve options from core-site.xml or hdfs-site.xml if available
[ https://issues.apache.org/jira/browse/ARROW-9226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396847#comment-17396847 ] Antoine Pitrou commented on ARROW-9226: --- [~itamarst] Could you explain how this logic can be exposed? Most of us here don't have any deep HDFS knowledge. > [Python] pyarrow.fs.HadoopFileSystem - retrieve options from core-site.xml or > hdfs-site.xml if available > > > Key: ARROW-9226 > URL: https://issues.apache.org/jira/browse/ARROW-9226 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.17.1 >Reporter: Bruno Quinart >Priority: Minor > Labels: hdfs > > 'Legacy' pyarrow.hdfs.connect was somehow able to get the namenode info from > the hadoop configuration files. > The new pyarrow.fs.HadoopFileSystem requires the host to be specified. > Inferring this info from "the environment" makes it easier to deploy > pipelines. > But more important, for HA namenodes it is almost impossible to know for sure > what to specify. If a rolling restart is ongoing, the namenode is changing. > There is no guarantee on which will be active in a HA setup. > I tried connecting to the standby namenode. The connection gets established, > but when writing a file an error is raised that standby namenodes are not > allowed to write to. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9226) [Python] pyarrow.fs.HadoopFileSystem - retrieve options from core-site.xml or hdfs-site.xml if available
[ https://issues.apache.org/jira/browse/ARROW-9226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396846#comment-17396846 ] Itamar Turner-Trauring commented on ARROW-9226: --- Digging through the code, it doesn't seem like this logic was ever implemented in Arrow itself; deep down enough, it's logic from `libhdfs`/`libhdfs3`. If I read this correctly, since the new API still uses those underneath, it's probably just a matter of (re)exposing the low-level logic in the Arrow wrapper. > [Python] pyarrow.fs.HadoopFileSystem - retrieve options from core-site.xml or > hdfs-site.xml if available > > > Key: ARROW-9226 > URL: https://issues.apache.org/jira/browse/ARROW-9226 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.17.1 >Reporter: Bruno Quinart >Priority: Minor > Labels: hdfs > > 'Legacy' pyarrow.hdfs.connect was somehow able to get the namenode info from > the hadoop configuration files. > The new pyarrow.fs.HadoopFileSystem requires the host to be specified. > Inferring this info from "the environment" makes it easier to deploy > pipelines. > But more important, for HA namenodes it is almost impossible to know for sure > what to specify. If a rolling restart is ongoing, the namenode is changing. > There is no guarantee on which will be active in a HA setup. > I tried connecting to the standby namenode. The connection gets established, > but when writing a file an error is raised that standby namenodes are not > allowed to write to. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13574) [C++] Add 'count all' option to count (hash) aggregate kernel
[ https://issues.apache.org/jira/browse/ARROW-13574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook updated ARROW-13574: - Description: The current "count" hash aggregate kernel counts either all non-null or all null values, but doesn't count all values regardless of nullity (i.e. SQL "count(\*)" or Pandas size()). (was: The current "count" hash aggregate kernel counts either all non-null or all null values, but doesn't count all values regardless of nullity (i.e. SQL "count(*)" or Pandas size()).) > [C++] Add 'count all' option to count (hash) aggregate kernel > - > > Key: ARROW-13574 > URL: https://issues.apache.org/jira/browse/ARROW-13574 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: kernel, pull-request-available > Fix For: 6.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > The current "count" hash aggregate kernel counts either all non-null or all > null values, but doesn't count all values regardless of nullity (i.e. SQL > "count(\*)" or Pandas size()). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13541) [C++][Python] Implement ExtensionScalar
[ https://issues.apache.org/jira/browse/ARROW-13541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13541: --- Labels: pull-request-available (was: ) > [C++][Python] Implement ExtensionScalar > --- > > Key: ARROW-13541 > URL: https://issues.apache.org/jira/browse/ARROW-13541 > Project: Apache Arrow > Issue Type: Task > Components: C++, Python >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Currently, ExtensionScalar is just an empty shell around the base Scalar > class. > It should have a ValueType member, and support the various usual operations > (hashing, equality, validation, GetScalar, etc.). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13541) [C++] Implement ExtensionScalar
[ https://issues.apache.org/jira/browse/ARROW-13541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-13541: --- Component/s: Python > [C++] Implement ExtensionScalar > --- > > Key: ARROW-13541 > URL: https://issues.apache.org/jira/browse/ARROW-13541 > Project: Apache Arrow > Issue Type: Task > Components: C++, Python >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Fix For: 6.0.0 > > > Currently, ExtensionScalar is just an empty shell around the base Scalar > class. > It should have a ValueType member, and support the various usual operations > (hashing, equality, validation, GetScalar, etc.). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13541) [C++][Python] Implement ExtensionScalar
[ https://issues.apache.org/jira/browse/ARROW-13541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-13541: --- Summary: [C++][Python] Implement ExtensionScalar (was: [C++] Implement ExtensionScalar) > [C++][Python] Implement ExtensionScalar > --- > > Key: ARROW-13541 > URL: https://issues.apache.org/jira/browse/ARROW-13541 > Project: Apache Arrow > Issue Type: Task > Components: C++, Python >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Fix For: 6.0.0 > > > Currently, ExtensionScalar is just an empty shell around the base Scalar > class. > It should have a ValueType member, and support the various usual operations > (hashing, equality, validation, GetScalar, etc.). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13597) [C++] [R] ExecNode factory named source not present in registry
[ https://issues.apache.org/jira/browse/ARROW-13597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman updated ARROW-13597: - Fix Version/s: 6.0.0 > [C++] [R] ExecNode factory named source not present in registry > --- > > Key: ARROW-13597 > URL: https://issues.apache.org/jira/browse/ARROW-13597 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Reporter: Jonathan Keane >Assignee: Ben Kietzman >Priority: Major > Fix For: 6.0.0 > > > {code} > ── Failure (test-dplyr-aggregate.R:166:3): Filter and aggregate > > 860 > `via_batch <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = > record_batch(tbl` generated warnings: > 861 > * Error : Key error: ExecNode factory named source not present in registry.; > pulling data into R > 862 > Backtrace: > 863 > █ > 864 > 1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:166:2 > 865 > 2. └─testthat::expect_warning(...) helper-expectation.R:101:4 > 866 > ── Failure (test-dplyr-aggregate.R:166:3): Filter and aggregate > > 867 > `via_table <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = > Table$create(tbl` generated warnings: > 868 > * Error : Key error: ExecNode factory named source not present in registry.; > pulling data into R > 869 > Backtrace: > 870 > █ > 871 > 1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:166:2 > 872 > 2. └─testthat::expect_warning(...) helper-expectation.R:114:4 > 873 > ── Failure (test-dplyr-aggregate.R:176:3): Filter and aggregate > > 874 > `via_batch <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = > record_batch(tbl` generated warnings: > 875 > * Error : Key error: ExecNode factory named source not present in registry.; > pulling data into R > 876 > Backtrace: > 877 > █ > 878 > 1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:176:2 > 879 > 2. └─testthat::expect_warning(...) 
helper-expectation.R:101:4 > 880 > ── Failure (test-dplyr-aggregate.R:176:3): Filter and aggregate > > 881 > `via_table <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = > Table$create(tbl` generated warnings: > 882 > * Error : Key error: ExecNode factory named source not present in registry.; > pulling data into R > 883 > Backtrace: > 884 > █ > 885 > 1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:176:2 > 886 > 2. └─testthat::expect_warning(...) helper-expectation.R:114:4 > 887 > 888 > [ FAIL 23 | WARN 0 | SKIP 32 | PASS 4954 ] > {code} > Link to an example: > https://github.com/ursacomputing/crossbow/runs/3287857304#step:7:875 > https://github.com/apache/arrow/commit/2aa94c4712ce406d7c87d361b5c655a6ea585701 > might be the culprit, it was merged the day before the failures -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-13597) [C++] [R] ExecNode factory named source not present in registry
[ https://issues.apache.org/jira/browse/ARROW-13597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman reassigned ARROW-13597: Assignee: Ben Kietzman > [C++] [R] ExecNode factory named source not present in registry > --- > > Key: ARROW-13597 > URL: https://issues.apache.org/jira/browse/ARROW-13597 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Reporter: Jonathan Keane >Assignee: Ben Kietzman >Priority: Major > > {code} > ── Failure (test-dplyr-aggregate.R:166:3): Filter and aggregate > > 860 > `via_batch <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = > record_batch(tbl` generated warnings: > 861 > * Error : Key error: ExecNode factory named source not present in registry.; > pulling data into R > 862 > Backtrace: > 863 > █ > 864 > 1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:166:2 > 865 > 2. └─testthat::expect_warning(...) helper-expectation.R:101:4 > 866 > ── Failure (test-dplyr-aggregate.R:166:3): Filter and aggregate > > 867 > `via_table <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = > Table$create(tbl` generated warnings: > 868 > * Error : Key error: ExecNode factory named source not present in registry.; > pulling data into R > 869 > Backtrace: > 870 > █ > 871 > 1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:166:2 > 872 > 2. └─testthat::expect_warning(...) helper-expectation.R:114:4 > 873 > ── Failure (test-dplyr-aggregate.R:176:3): Filter and aggregate > > 874 > `via_batch <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = > record_batch(tbl` generated warnings: > 875 > * Error : Key error: ExecNode factory named source not present in registry.; > pulling data into R > 876 > Backtrace: > 877 > █ > 878 > 1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:176:2 > 879 > 2. └─testthat::expect_warning(...) 
helper-expectation.R:101:4 > 880 > ── Failure (test-dplyr-aggregate.R:176:3): Filter and aggregate > > 881 > `via_table <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = > Table$create(tbl` generated warnings: > 882 > * Error : Key error: ExecNode factory named source not present in registry.; > pulling data into R > 883 > Backtrace: > 884 > █ > 885 > 1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:176:2 > 886 > 2. └─testthat::expect_warning(...) helper-expectation.R:114:4 > 887 > 888 > [ FAIL 23 | WARN 0 | SKIP 32 | PASS 4954 ] > {code} > Link to an example: > https://github.com/ursacomputing/crossbow/runs/3287857304#step:7:875 > https://github.com/apache/arrow/commit/2aa94c4712ce406d7c87d361b5c655a6ea585701 > might be the culprit, it was merged the day before the failures -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13574) [C++] Add 'count all' option to count (hash) aggregate kernel
[ https://issues.apache.org/jira/browse/ARROW-13574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-13574: - Summary: [C++] Add 'count all' option to count (hash) aggregate kernel (was: [C++] Implement "count(*)" hash aggregate kernel) > [C++] Add 'count all' option to count (hash) aggregate kernel > - > > Key: ARROW-13574 > URL: https://issues.apache.org/jira/browse/ARROW-13574 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: kernel, pull-request-available > Fix For: 6.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > The current "count" hash aggregate kernel counts either all non-null or all > null values, but doesn't count all values regardless of nullity (i.e. SQL > "count(*)" or Pandas size()). -- This message was sent by Atlassian Jira (v8.3.4#803005)
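The three counting behaviours the ticket contrasts — count non-null, count null, and count everything regardless of nullity (SQL count(*), Pandas size()) — can be illustrated in plain Python; the `mode` names here are made up for illustration, not the kernel's actual option spelling:

```python
# Illustrative: the three count behaviours over a column where None
# marks a null. Mode names are hypothetical.
def count(values, mode="only_valid"):
    if mode == "only_valid":     # existing behaviour: non-null values
        return sum(v is not None for v in values)
    if mode == "only_null":      # existing behaviour: null values
        return sum(v is None for v in values)
    if mode == "all":            # requested: count(*) / size() semantics
        return len(values)
    raise ValueError(mode)

col = [1, None, 3, None, 5]
assert count(col, "only_valid") == 3
assert count(col, "only_null") == 2
assert count(col, "all") == 5
```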
[jira] [Updated] (ARROW-13595) [C++] Add debug mode check for compute kernel output type
[ https://issues.apache.org/jira/browse/ARROW-13595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13595: --- Labels: pull-request-available (was: ) > [C++] Add debug mode check for compute kernel output type > - > > Key: ARROW-13595 > URL: https://issues.apache.org/jira/browse/ARROW-13595 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > As discovered in https://github.com/apache/arrow/pull/10890, it's currently > possible for a kernel to declare an output type and actually return another > one. > It would be useful to add debug-mode check in the kernel or function > execution machinery to validate the concrete output type returned by the > kernel exec function. -- This message was sent by Atlassian Jira (v8.3.4#803005)
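The debug-mode check this ticket proposes — verifying that the type a kernel actually returns matches the type it declared — can be sketched as a wrapper around the exec function (all names here are hypothetical, not Arrow's kernel machinery):

```python
# Illustrative sketch of a debug-mode output-type check: wrap a kernel's
# exec function and compare the declared output type against what it
# actually produced. Names (checked_kernel, exec_fn) are hypothetical.
DEBUG = True

def checked_kernel(declared_type, exec_fn):
    def wrapper(*args):
        out_type, out_value = exec_fn(*args)
        if DEBUG and out_type != declared_type:
            raise AssertionError(
                f"kernel declared {declared_type!r} but returned {out_type!r}")
        return out_type, out_value
    return wrapper

# A kernel that honestly returns its declared type passes the check...
add = checked_kernel("int64", lambda a, b: ("int64", a + b))
assert add(2, 3) == ("int64", 5)

# ...while a kernel that lies about its output type is caught in debug mode.
bad = checked_kernel("int64", lambda a, b: ("float64", a + b))
try:
    bad(2, 3)
    raised = False
except AssertionError:
    raised = True
assert raised
```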
[jira] [Assigned] (ARROW-13541) [C++] Implement ExtensionScalar
[ https://issues.apache.org/jira/browse/ARROW-13541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-13541: -- Assignee: Antoine Pitrou > [C++] Implement ExtensionScalar > --- > > Key: ARROW-13541 > URL: https://issues.apache.org/jira/browse/ARROW-13541 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Fix For: 6.0.0 > > > Currently, ExtensionScalar is just an empty shell around the base Scalar > class. > It should have a ValueType member, and support the various usual operations > (hashing, equality, validation, GetScalar, etc.). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-13596) [C++] Remove util/logging.h in compute/exec/util.h
[ https://issues.apache.org/jira/browse/ARROW-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niranda Perera closed ARROW-13596. -- Resolution: Invalid > [C++] Remove util/logging.h in compute/exec/util.h > -- > > Key: ARROW-13596 > URL: https://issues.apache.org/jira/browse/ARROW-13596 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Niranda Perera >Assignee: Niranda Perera >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > util/logging.h is included in the compute/exec/util.h. Remove it and move > TempVectorStack and AtomicCounter impl to util.cc > > [https://github.com/apache/arrow/blob/c2e198b84d6752733bdd20089195dc9c47df73a1/cpp/src/arrow/compute/exec/util.h#L31] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9773) [C++] Take kernel can't handle ChunkedArrays that don't fit in an Array
[ https://issues.apache.org/jira/browse/ARROW-9773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Percy Camilo Triveño Aucahuasi updated ARROW-9773: -- Labels: kernel (was: ) > [C++] Take kernel can't handle ChunkedArrays that don't fit in an Array > --- > > Key: ARROW-9773 > URL: https://issues.apache.org/jira/browse/ARROW-9773 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 1.0.0 >Reporter: David Li >Assignee: Percy Camilo Triveño Aucahuasi >Priority: Major > Labels: kernel > Fix For: 6.0.0 > > > Take() currently concatenates ChunkedArrays first. However, this breaks down > when calling Take() from a ChunkedArray or Table where concatenating the > arrays would result in an array that's too large. While inconvenient to > implement, it would be useful if this case were handled. > This could be done as a higher-level wrapper around Take(), perhaps. > Example in Python: > {code:python} > >>> import pyarrow as pa > >>> pa.__version__ > '1.0.0' > >>> rb1 = pa.RecordBatch.from_arrays([["a" * 2**30]], names=["a"]) > >>> rb2 = pa.RecordBatch.from_arrays([["b" * 2**30]], names=["a"]) > >>> table = pa.Table.from_batches([rb1, rb2], schema=rb1.schema) > >>> table.take([1, 0]) > Traceback (most recent call last): > File "", line 1, in > File "pyarrow/table.pxi", line 1145, in pyarrow.lib.Table.take > File > "/home/lidavidm/Code/twosigma/arrow/venv/lib/python3.8/site-packages/pyarrow/compute.py", > line 268, in take > return call_function('take', [data, indices], options) > File "pyarrow/_compute.pyx", line 298, in pyarrow._compute.call_function > File "pyarrow/_compute.pyx", line 192, in pyarrow._compute.Function.call > File "pyarrow/error.pxi", line 122, in > pyarrow.lib.pyarrow_internal_check_status > File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays > {code} > In this example, it would be useful if Take() or a higher-level wrapper 
could > generate multiple record batches as output. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9773) [C++] Take kernel can't handle ChunkedArrays that don't fit in an Array
[ https://issues.apache.org/jira/browse/ARROW-9773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Percy Camilo Triveño Aucahuasi updated ARROW-9773: -- Fix Version/s: 6.0.0 > [C++] Take kernel can't handle ChunkedArrays that don't fit in an Array > --- > > Key: ARROW-9773 > URL: https://issues.apache.org/jira/browse/ARROW-9773 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 1.0.0 >Reporter: David Li >Assignee: Percy Camilo Triveño Aucahuasi >Priority: Major > Fix For: 6.0.0 > > > Take() currently concatenates ChunkedArrays first. However, this breaks down > when calling Take() from a ChunkedArray or Table where concatenating the > arrays would result in an array that's too large. While inconvenient to > implement, it would be useful if this case were handled. > This could be done as a higher-level wrapper around Take(), perhaps. > Example in Python: > {code:python} > >>> import pyarrow as pa > >>> pa.__version__ > '1.0.0' > >>> rb1 = pa.RecordBatch.from_arrays([["a" * 2**30]], names=["a"]) > >>> rb2 = pa.RecordBatch.from_arrays([["b" * 2**30]], names=["a"]) > >>> table = pa.Table.from_batches([rb1, rb2], schema=rb1.schema) > >>> table.take([1, 0]) > Traceback (most recent call last): > File "", line 1, in > File "pyarrow/table.pxi", line 1145, in pyarrow.lib.Table.take > File > "/home/lidavidm/Code/twosigma/arrow/venv/lib/python3.8/site-packages/pyarrow/compute.py", > line 268, in take > return call_function('take', [data, indices], options) > File "pyarrow/_compute.pyx", line 298, in pyarrow._compute.call_function > File "pyarrow/_compute.pyx", line 192, in pyarrow._compute.Function.call > File "pyarrow/error.pxi", line 122, in > pyarrow.lib.pyarrow_internal_check_status > File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays > {code} > In this example, it would be useful if Take() or a higher-level wrapper could > generate 
multiple record batches as output. -- This message was sent by Atlassian Jira (v8.3.4#803005)
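The higher-level wrapper suggested in the ticket — taking from a chunked container without first concatenating it — can be sketched by mapping each global index to a (chunk, local offset) pair, so no oversized intermediate array is ever built. A plain-Python sketch with lists standing in for chunks (not Arrow's implementation):

```python
# Illustrative sketch: take over a "chunked array" (list of lists)
# without concatenating the chunks, sidestepping the offset overflow
# shown in the traceback above. Real chunking policy would differ.
import bisect

def chunked_take(chunks, indices):
    # Prefix sums of chunk lengths let us map a global index to
    # (chunk number, offset within chunk) via binary search.
    starts = [0]
    for c in chunks:
        starts.append(starts[-1] + len(c))
    out = []
    for i in indices:
        chunk_idx = bisect.bisect_right(starts, i) - 1
        out.append(chunks[chunk_idx][i - starts[chunk_idx]])
    return out

chunks = [["a", "b"], ["c"], ["d", "e"]]
assert chunked_take(chunks, [4, 0, 2]) == ["e", "a", "c"]
```

A production version would also emit its output in chunks (e.g. one record batch per run of indices) rather than as a single list, which is the "multiple record batches as output" idea the ticket closes with.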
[jira] [Updated] (ARROW-13595) [C++] Add debug mode check for compute kernel output type
[ https://issues.apache.org/jira/browse/ARROW-13595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-13595: --- Summary: [C++] Add debug mode check for compute kernel output type (was: [C++] Add debug mode check for compile kernel output type) > [C++] Add debug mode check for compute kernel output type > - > > Key: ARROW-13595 > URL: https://issues.apache.org/jira/browse/ARROW-13595 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Minor > Fix For: 6.0.0 > > > As discovered in https://github.com/apache/arrow/pull/10890, it's currently > possible for a kernel to declare an output type and actually return another > one. > It would be useful to add debug-mode check in the kernel or function > execution machinery to validate the concrete output type returned by the > kernel exec function. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13595) [C++] Add debug mode check for compile kernel output type
[ https://issues.apache.org/jira/browse/ARROW-13595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-13595: --- Fix Version/s: 6.0.0 > [C++] Add debug mode check for compile kernel output type > - > > Key: ARROW-13595 > URL: https://issues.apache.org/jira/browse/ARROW-13595 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Minor > Fix For: 6.0.0 > > > As discovered in https://github.com/apache/arrow/pull/10890, it's currently > possible for a kernel to declare an output type and actually return another > one. > It would be useful to add debug-mode check in the kernel or function > execution machinery to validate the concrete output type returned by the > kernel exec function. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-13595) [C++] Add debug mode check for compile kernel output type
[ https://issues.apache.org/jira/browse/ARROW-13595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-13595: -- Assignee: Antoine Pitrou > [C++] Add debug mode check for compile kernel output type > - > > Key: ARROW-13595 > URL: https://issues.apache.org/jira/browse/ARROW-13595 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Minor > > As discovered in https://github.com/apache/arrow/pull/10890, it's currently > possible for a kernel to declare an output type and actually return another > one. > It would be useful to add debug-mode check in the kernel or function > execution machinery to validate the concrete output type returned by the > kernel exec function. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13598) [C++] Deprecate Datum::COLLECTION
[ https://issues.apache.org/jira/browse/ARROW-13598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-13598: --- Description: It looks like "collection" datums are not used anywhere. Where we want to return several pieces of data, we generally return a Struct array or scalar wrapping them. Perhaps we should simply deprecate or even remove them. was: It looks like "collection" datums are not used anywhere. Where we want to return several pieces of data, we generally return a Struct array or scalar wrapping them. Perhaps we should simply deprecate them. > [C++] Deprecate Datum::COLLECTION > - > > Key: ARROW-13598 > URL: https://issues.apache.org/jira/browse/ARROW-13598 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Antoine Pitrou >Priority: Minor > Fix For: 6.0.0 > > > It looks like "collection" datums are not used anywhere. Where we want to > return several pieces of data, we generally return a Struct array or scalar > wrapping them. > Perhaps we should simply deprecate or even remove them. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-13598) [C++] Deprecate Datum::COLLECTION
[ https://issues.apache.org/jira/browse/ARROW-13598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396745#comment-17396745 ] Antoine Pitrou commented on ARROW-13598: [~wesm] [~bkietz] What do you think? > [C++] Deprecate Datum::COLLECTION > - > > Key: ARROW-13598 > URL: https://issues.apache.org/jira/browse/ARROW-13598 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Antoine Pitrou >Priority: Minor > Fix For: 6.0.0 > > > It looks like "collection" datums are not used anywhere. Where we want to > return several pieces of data, we generally return a Struct array or scalar > wrapping them. > Perhaps we should simply deprecate or even remove them. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13598) [C++] Deprecate Datum::COLLECTION
Antoine Pitrou created ARROW-13598: -- Summary: [C++] Deprecate Datum::COLLECTION Key: ARROW-13598 URL: https://issues.apache.org/jira/browse/ARROW-13598 Project: Apache Arrow Issue Type: Task Components: C++ Reporter: Antoine Pitrou Fix For: 6.0.0 It looks like "collection" datums are not used anywhere. Where we want to return several pieces of data, we generally return a Struct array or scalar wrapping them. Perhaps we should simply deprecate them. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13597) [C++] [R] ExecNode factory named source not present in registry
[ https://issues.apache.org/jira/browse/ARROW-13597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-13597: --- Summary: [C++] [R] ExecNode factory named source not present in registry (was: [R] ExecNode factory named source not present in registry) > [C++] [R] ExecNode factory named source not present in registry > --- > > Key: ARROW-13597 > URL: https://issues.apache.org/jira/browse/ARROW-13597 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Jonathan Keane >Priority: Major > > {code} > ── Failure (test-dplyr-aggregate.R:166:3): Filter and aggregate > > 860 > `via_batch <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = > record_batch(tbl` generated warnings: > 861 > * Error : Key error: ExecNode factory named source not present in registry.; > pulling data into R > 862 > Backtrace: > 863 > █ > 864 > 1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:166:2 > 865 > 2. └─testthat::expect_warning(...) helper-expectation.R:101:4 > 866 > ── Failure (test-dplyr-aggregate.R:166:3): Filter and aggregate > > 867 > `via_table <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = > Table$create(tbl` generated warnings: > 868 > * Error : Key error: ExecNode factory named source not present in registry.; > pulling data into R > 869 > Backtrace: > 870 > █ > 871 > 1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:166:2 > 872 > 2. └─testthat::expect_warning(...) helper-expectation.R:114:4 > 873 > ── Failure (test-dplyr-aggregate.R:176:3): Filter and aggregate > > 874 > `via_batch <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = > record_batch(tbl` generated warnings: > 875 > * Error : Key error: ExecNode factory named source not present in registry.; > pulling data into R > 876 > Backtrace: > 877 > █ > 878 > 1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:176:2 > 879 > 2. └─testthat::expect_warning(...) 
helper-expectation.R:101:4 > 880 > ── Failure (test-dplyr-aggregate.R:176:3): Filter and aggregate > > 881 > `via_table <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = > Table$create(tbl` generated warnings: > 882 > * Error : Key error: ExecNode factory named source not present in registry.; > pulling data into R > 883 > Backtrace: > 884 > █ > 885 > 1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:176:2 > 886 > 2. └─testthat::expect_warning(...) helper-expectation.R:114:4 > 887 > 888 > [ FAIL 23 | WARN 0 | SKIP 32 | PASS 4954 ] > {code} > Link to an example: > https://github.com/ursacomputing/crossbow/runs/3287857304#step:7:875 > https://github.com/apache/arrow/commit/2aa94c4712ce406d7c87d361b5c655a6ea585701 > might be the culprit, it was merged the day before the failures -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13597) [C++] [R] ExecNode factory named source not present in registry
[ https://issues.apache.org/jira/browse/ARROW-13597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-13597: --- Component/s: C++ > [C++] [R] ExecNode factory named source not present in registry > --- > > Key: ARROW-13597 > URL: https://issues.apache.org/jira/browse/ARROW-13597 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Reporter: Jonathan Keane >Priority: Major > > {code} > ── Failure (test-dplyr-aggregate.R:166:3): Filter and aggregate > > 860 > `via_batch <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = > record_batch(tbl` generated warnings: > 861 > * Error : Key error: ExecNode factory named source not present in registry.; > pulling data into R > 862 > Backtrace: > 863 > █ > 864 > 1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:166:2 > 865 > 2. └─testthat::expect_warning(...) helper-expectation.R:101:4 > 866 > ── Failure (test-dplyr-aggregate.R:166:3): Filter and aggregate > > 867 > `via_table <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = > Table$create(tbl` generated warnings: > 868 > * Error : Key error: ExecNode factory named source not present in registry.; > pulling data into R > 869 > Backtrace: > 870 > █ > 871 > 1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:166:2 > 872 > 2. └─testthat::expect_warning(...) helper-expectation.R:114:4 > 873 > ── Failure (test-dplyr-aggregate.R:176:3): Filter and aggregate > > 874 > `via_batch <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = > record_batch(tbl` generated warnings: > 875 > * Error : Key error: ExecNode factory named source not present in registry.; > pulling data into R > 876 > Backtrace: > 877 > █ > 878 > 1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:176:2 > 879 > 2. └─testthat::expect_warning(...) 
helper-expectation.R:101:4 > 880 > ── Failure (test-dplyr-aggregate.R:176:3): Filter and aggregate > > 881 > `via_table <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = > Table$create(tbl` generated warnings: > 882 > * Error : Key error: ExecNode factory named source not present in registry.; > pulling data into R > 883 > Backtrace: > 884 > █ > 885 > 1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:176:2 > 886 > 2. └─testthat::expect_warning(...) helper-expectation.R:114:4 > 887 > 888 > [ FAIL 23 | WARN 0 | SKIP 32 | PASS 4954 ] > {code} > Link to an example: > https://github.com/ursacomputing/crossbow/runs/3287857304#step:7:875 > https://github.com/apache/arrow/commit/2aa94c4712ce406d7c87d361b5c655a6ea585701 > might be the culprit, it was merged the day before the failures -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13596) [C++] Remove util/logging.h in compute/exec/util.h
[ https://issues.apache.org/jira/browse/ARROW-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13596: --- Labels: pull-request-available (was: ) > [C++] Remove util/logging.h in compute/exec/util.h > -- > > Key: ARROW-13596 > URL: https://issues.apache.org/jira/browse/ARROW-13596 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Niranda Perera >Assignee: Niranda Perera >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > util/logging.h is included in the compute/exec/util.h. Remove it and move > TempVectorStack and AtomicCounter impl to util.cc > > [https://github.com/apache/arrow/blob/c2e198b84d6752733bdd20089195dc9c47df73a1/cpp/src/arrow/compute/exec/util.h#L31] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13596) [C++] Remove util/logging.h in compute/exec/util.h
[ https://issues.apache.org/jira/browse/ARROW-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niranda Perera updated ARROW-13596: --- Description: util/logging.h is included in the compute/exec/util.h. Remove it and move TempVectorStack and AtomicCounter impl to util.cc [https://github.com/apache/arrow/blob/c2e198b84d6752733bdd20089195dc9c47df73a1/cpp/src/arrow/compute/exec/util.h#L31] was: util/logging.h is included in the compute/exec/util.h. Remove it and move AtomicCounter impl to util.cc https://github.com/apache/arrow/blob/c2e198b84d6752733bdd20089195dc9c47df73a1/cpp/src/arrow/compute/exec/util.h#L31 > [C++] Remove util/logging.h in compute/exec/util.h > -- > > Key: ARROW-13596 > URL: https://issues.apache.org/jira/browse/ARROW-13596 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Niranda Perera >Assignee: Niranda Perera >Priority: Major > > util/logging.h is included in the compute/exec/util.h. Remove it and move > TempVectorStack and AtomicCounter impl to util.cc > > [https://github.com/apache/arrow/blob/c2e198b84d6752733bdd20089195dc9c47df73a1/cpp/src/arrow/compute/exec/util.h#L31] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-13575) [C++] Implement product aggregate & hash aggregate kernels
[ https://issues.apache.org/jira/browse/ARROW-13575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-13575. Resolution: Fixed Issue resolved by pull request 10890 [https://github.com/apache/arrow/pull/10890] > [C++] Implement product aggregate & hash aggregate kernels > -- > > Key: ARROW-13575 > URL: https://issues.apache.org/jira/browse/ARROW-13575 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: kernel, pull-request-available > Fix For: 6.0.0 > > Time Spent: 2h 40m > Remaining Estimate: 0h > > Like Pandas > [prod|https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.prod.html]. > Note that Pandas has a min_count option. -- This message was sent by Atlassian Jira (v8.3.4#803005)
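The Pandas min_count behaviour the ticket notes — return null when there are fewer than min_count valid values — can be illustrated in plain Python (None marks a null; this models the semantics, not Arrow's API):

```python
# Illustrative: product aggregate with a Pandas-style min_count option,
# over plain Python values where None marks a null.
def product(values, min_count=0):
    valid = [v for v in values if v is not None]
    if len(valid) < min_count:
        return None  # too few valid values: result is null
    result = 1
    for v in valid:
        result *= v
    return result

assert product([2, None, 3]) == 6
assert product([None, None], min_count=1) is None
assert product([], min_count=0) == 1  # empty product
```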
[jira] [Created] (ARROW-13597) [R] ExecNode factory named source not present in registry
Jonathan Keane created ARROW-13597: -- Summary: [R] ExecNode factory named source not present in registry Key: ARROW-13597 URL: https://issues.apache.org/jira/browse/ARROW-13597 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Jonathan Keane {code} ── Failure (test-dplyr-aggregate.R:166:3): Filter and aggregate 860 `via_batch <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = record_batch(tbl` generated warnings: 861 * Error : Key error: ExecNode factory named source not present in registry.; pulling data into R 862 Backtrace: 863 █ 864 1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:166:2 865 2. └─testthat::expect_warning(...) helper-expectation.R:101:4 866 ── Failure (test-dplyr-aggregate.R:166:3): Filter and aggregate 867 `via_table <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = Table$create(tbl` generated warnings: 868 * Error : Key error: ExecNode factory named source not present in registry.; pulling data into R 869 Backtrace: 870 █ 871 1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:166:2 872 2. └─testthat::expect_warning(...) helper-expectation.R:114:4 873 ── Failure (test-dplyr-aggregate.R:176:3): Filter and aggregate 874 `via_batch <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = record_batch(tbl` generated warnings: 875 * Error : Key error: ExecNode factory named source not present in registry.; pulling data into R 876 Backtrace: 877 █ 878 1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:176:2 879 2. └─testthat::expect_warning(...) helper-expectation.R:101:4 880 ── Failure (test-dplyr-aggregate.R:176:3): Filter and aggregate 881 `via_table <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = Table$create(tbl` generated warnings: 882 * Error : Key error: ExecNode factory named source not present in registry.; pulling data into R 883 Backtrace: 884 █ 885 1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:176:2 886 2. 
└─testthat::expect_warning(...) helper-expectation.R:114:4 887 888 [ FAIL 23 | WARN 0 | SKIP 32 | PASS 4954 ] {code} Link to an example: https://github.com/ursacomputing/crossbow/runs/3287857304#step:7:875 https://github.com/apache/arrow/commit/2aa94c4712ce406d7c87d361b5c655a6ea585701 might be the culprit, it was merged the day before the failures -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13596) [C++] Remove util/logging.h in compute/exec/util.h
Niranda Perera created ARROW-13596: -- Summary: [C++] Remove util/logging.h in compute/exec/util.h Key: ARROW-13596 URL: https://issues.apache.org/jira/browse/ARROW-13596 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Niranda Perera Assignee: Niranda Perera util/logging.h is included in the compute/exec/util.h. Remove it and move AtomicCounter impl to util.cc https://github.com/apache/arrow/blob/c2e198b84d6752733bdd20089195dc9c47df73a1/cpp/src/arrow/compute/exec/util.h#L31 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-13588) [R] Empty character attributes not stored
[ https://issues.apache.org/jira/browse/ARROW-13588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396714#comment-17396714 ] Charlie Gao commented on ARROW-13588: - Hi Neal, I guess that would likely be the issue. In R, the empty character vector "" is different to NULL. An R attribute cannot be stored as NULL - setting an attribute to NULL removes it. The wider issue is that although you _can_ remove the tzone attribute for dates, certain object classes (including those popular for time series e.g. 'xts') _require_ the tzone attribute to be set. > [R] Empty character attributes not stored > - > > Key: ARROW-13588 > URL: https://issues.apache.org/jira/browse/ARROW-13588 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 5.0.0 > Environment: Ubuntu 20.04 R 4.1 release >Reporter: Charlie Gao >Priority: Minor > Labels: attributes, feather > Fix For: 6.0.0 > > > Date-times in the POSIXct format have a 'tzone' attribute that by default is > set to "", an empty character vector (not NULL) when created. > This however is not stored in the Arrow feather file. When the file is read > back, the original and restored dataframes are not identical as per the below > reprex. > I am thinking that this should not be the intention? My workaround at the > moment is making a check when reading back to write the empty string if the > tzone attribute does not exist. > Just to confirm, the attribute is stored correctly when it is not empty. > Thanks. 
> {code:java} > ``` r > dates <- as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-02")) > attributes(dates) > #> $class > #> [1] "POSIXct" "POSIXt" > #> > #> $tzone > #> [1] "" > values <- c(1:3) > original <- data.frame(dates, values) > original > #> dates values > #> 1 2020-01-01 1 > #> 2 2020-01-02 2 > #> 3 2020-01-02 3 > tempfile <- tempfile() > arrow::write_feather(original, tempfile) > restored <- arrow::read_feather(tempfile) > identical(original, restored) > #> [1] FALSE > waldo::compare(original, restored) > #> `attr(old$dates, 'tzone')` is a character vector ('') > #> `attr(new$dates, 'tzone')` is absent > unlink(tempfile) > ``` > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-1888) [C++] Implement casts from one struct type to another (with same field names and number of fields)
[ https://issues.apache.org/jira/browse/ARROW-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-1888: --- Labels: analytics kernel (was: analytics) > [C++] Implement casts from one struct type to another (with same field names > and number of fields) > -- > > Key: ARROW-1888 > URL: https://issues.apache.org/jira/browse/ARROW-1888 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Fernando Rodriguez >Priority: Major > Labels: analytics, kernel > Fix For: 6.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-1888) [C++] Implement casts from one struct type to another (with same field names and number of fields)
[ https://issues.apache.org/jira/browse/ARROW-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fernando Rodriguez reassigned ARROW-1888: - Assignee: Fernando Rodriguez (was: David Li) > [C++] Implement casts from one struct type to another (with same field names > and number of fields) > -- > > Key: ARROW-1888 > URL: https://issues.apache.org/jira/browse/ARROW-1888 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Fernando Rodriguez >Priority: Major > Labels: analytics > Fix For: 6.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13595) [C++] Add debug mode check for compile kernel output type
Antoine Pitrou created ARROW-13595: -- Summary: [C++] Add debug mode check for compile kernel output type Key: ARROW-13595 URL: https://issues.apache.org/jira/browse/ARROW-13595 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Antoine Pitrou As discovered in https://github.com/apache/arrow/pull/10890, it's currently possible for a kernel to declare an output type and actually return another one. It would be useful to add a debug-mode check in the kernel or function execution machinery to validate the concrete output type returned by the kernel exec function. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7179) [C++][Compute] Consolidate fill_null and coalesce
[ https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li reassigned ARROW-7179: --- Assignee: David Li > [C++][Compute] Consolidate fill_null and coalesce > - > > Key: ARROW-7179 > URL: https://issues.apache.org/jira/browse/ARROW-7179 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.15.1 >Reporter: Ben Kietzman >Assignee: David Li >Priority: Major > Labels: analytics, kernel > Fix For: 6.0.0 > > > fill_null and coalesce are essentially the same kernel, except the former is > binary and doesn't support an array fill value, and the latter is variadic > and supports scalar and array fill values. > We should consolidate them into one kernel, picking the faster implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7179) [C++][Compute] Consolidate fill_null and coalesce
[ https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396656#comment-17396656 ] Joris Van den Bossche commented on ARROW-7179: -- Yes, that sounds good > [C++][Compute] Consolidate fill_null and coalesce > - > > Key: ARROW-7179 > URL: https://issues.apache.org/jira/browse/ARROW-7179 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.15.1 >Reporter: Ben Kietzman >Priority: Major > Labels: analytics, kernel > Fix For: 6.0.0 > > > fill_null and coalesce are essentially the same kernel, except the former is > binary and doesn't support an array fill value, and the latter is variadic > and supports scalar and array fill values. > We should consolidate them into one kernel, picking the faster implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7179) [C++][Compute] Consolidate fill_null and coalesce
[ https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-7179: Description: fill_null and coalesce are essentially the same kernel, except the former is binary and doesn't support an array fill value, and the latter is variadic and supports scalar and array fill values. We should consolidate them into one kernel, picking the faster implementation. was: Add kernels to support which replacing null values in an array with values taken from corresponding slots in another array: {code} fill_null([1, null, null, 3], [5, 6, null, 8]) -> [1, 6, null, 3] {code} > [C++][Compute] Consolidate fill_null and coalesce > - > > Key: ARROW-7179 > URL: https://issues.apache.org/jira/browse/ARROW-7179 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.15.1 >Reporter: Ben Kietzman >Priority: Major > Labels: analytics > > fill_null and coalesce are essentially the same kernel, except the former is > binary and doesn't support an array fill value, and the latter is variadic > and supports scalar and array fill values. > We should consolidate them into one kernel, picking the faster implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7179) [C++][Compute] Consolidate fill_null and coalesce
[ https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-7179: Labels: analytics kernel (was: analytics) > [C++][Compute] Consolidate fill_null and coalesce > - > > Key: ARROW-7179 > URL: https://issues.apache.org/jira/browse/ARROW-7179 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.15.1 >Reporter: Ben Kietzman >Priority: Major > Labels: analytics, kernel > > fill_null and coalesce are essentially the same kernel, except the former is > binary and doesn't support an array fill value, and the latter is variadic > and supports scalar and array fill values. > We should consolidate them into one kernel, picking the faster implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7179) [C++][Compute] Consolidate fill_null and coalesce
[ https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-7179: Fix Version/s: 6.0.0 > [C++][Compute] Consolidate fill_null and coalesce > - > > Key: ARROW-7179 > URL: https://issues.apache.org/jira/browse/ARROW-7179 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.15.1 >Reporter: Ben Kietzman >Priority: Major > Labels: analytics, kernel > Fix For: 6.0.0 > > > fill_null and coalesce are essentially the same kernel, except the former is > binary and doesn't support an array fill value, and the latter is variadic > and supports scalar and array fill values. > We should consolidate them into one kernel, picking the faster implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (ARROW-7179) [C++][Compute] Array support for fill_null
[ https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li reopened ARROW-7179: - Assignee: (was: David Li) > [C++][Compute] Array support for fill_null > -- > > Key: ARROW-7179 > URL: https://issues.apache.org/jira/browse/ARROW-7179 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.15.1 >Reporter: Ben Kietzman >Priority: Major > Labels: analytics > Fix For: 5.0.0 > > > Add kernels to support which replacing null values in an array with values > taken from corresponding slots in another array: > {code} > fill_null([1, null, null, 3], [5, 6, null, 8]) -> [1, 6, null, 3] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7179) [C++][Compute] Consolidate fill_null and coalesce
[ https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-7179: Summary: [C++][Compute] Consolidate fill_null and coalesce (was: [C++][Compute] Array support for fill_null) > [C++][Compute] Consolidate fill_null and coalesce > - > > Key: ARROW-7179 > URL: https://issues.apache.org/jira/browse/ARROW-7179 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.15.1 >Reporter: Ben Kietzman >Priority: Major > Labels: analytics > > Add kernels to support which replacing null values in an array with values > taken from corresponding slots in another array: > {code} > fill_null([1, null, null, 3], [5, 6, null, 8]) -> [1, 6, null, 3] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7179) [C++][Compute] Array support for fill_null
[ https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-7179: Fix Version/s: (was: 5.0.0) > [C++][Compute] Array support for fill_null > -- > > Key: ARROW-7179 > URL: https://issues.apache.org/jira/browse/ARROW-7179 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.15.1 >Reporter: Ben Kietzman >Priority: Major > Labels: analytics > > Add kernels to support which replacing null values in an array with values > taken from corresponding slots in another array: > {code} > fill_null([1, null, null, 3], [5, 6, null, 8]) -> [1, 6, null, 3] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7179) [C++][Compute] Array support for fill_null
[ https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396652#comment-17396652 ] David Li commented on ARROW-7179: - Ah, I see, thanks Joris. It looks like coalesce has slightly more complete type support; it's also variadic instead of binary. Maybe the path here then should be to benchmark the common implementations and choose the faster one, then consolidate both into one kernel. (And maybe provide aliases from one to the other?) > [C++][Compute] Array support for fill_null > -- > > Key: ARROW-7179 > URL: https://issues.apache.org/jira/browse/ARROW-7179 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.15.1 >Reporter: Ben Kietzman >Assignee: David Li >Priority: Major > Labels: analytics > Fix For: 5.0.0 > > > Add kernels to support which replacing null values in an array with values > taken from corresponding slots in another array: > {code} > fill_null([1, null, null, 3], [5, 6, null, 8]) -> [1, 6, null, 3] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13590) [C++] Ensure dataset writing applies back pressure
[ https://issues.apache.org/jira/browse/ARROW-13590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-13590: Summary: [C++] Ensure dataset writing applies back pressure (was: Ensure dataset writing applies back pressure) > [C++] Ensure dataset writing applies back pressure > -- > > Key: ARROW-13590 > URL: https://issues.apache.org/jira/browse/ARROW-13590 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Labels: query-engine > Fix For: 6.0.0 > > > Dataset writing via exec plan (ARROW-13542) does not apply back pressure > currently and will take up far more RAM than it should when writing a large > dataset. The node should be applying back pressure. However, the preferred > back pressure method (via scheduling) will need to wait for ARROW-13576. > Once those two are finished this can be studied in more detail. Also, the > vm.dirty_ratio might be experimented with. In theory we should be applying > our own back pressure and have no need of dirty pages. In practice, it may > be more work than we want to tackle right now and we just let it do its thing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13588) [R] Empty character attributes not stored
[ https://issues.apache.org/jira/browse/ARROW-13588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-13588: Fix Version/s: 6.0.0 > [R] Empty character attributes not stored > - > > Key: ARROW-13588 > URL: https://issues.apache.org/jira/browse/ARROW-13588 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 5.0.0 > Environment: Ubuntu 20.04 R 4.1 release >Reporter: Charlie Gao >Priority: Minor > Labels: attributes, feather > Fix For: 6.0.0 > > > Date-times in the POSIXct format have a 'tzone' attribute that by default is > set to "", an empty character vector (not NULL) when created. > This however is not stored in the Arrow feather file. When the file is read > back, the original and restored dataframes are not identical as per the below > reprex. > I am thinking that this should not be the intention? My workaround at the > moment is making a check when reading back to write the empty string if the > tzone attribute does not exist. > Just to confirm, the attribute is stored correctly when it is not empty. > Thanks. > {code:java} > ``` r > dates <- as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-02")) > attributes(dates) > #> $class > #> [1] "POSIXct" "POSIXt" > #> > #> $tzone > #> [1] "" > values <- c(1:3) > original <- data.frame(dates, values) > original > #> dates values > #> 1 2020-01-01 1 > #> 2 2020-01-02 2 > #> 3 2020-01-02 3 > tempfile <- tempfile() > arrow::write_feather(original, tempfile) > restored <- arrow::read_feather(tempfile) > identical(original, restored) > #> [1] FALSE > waldo::compare(original, restored) > #> `attr(old$dates, 'tzone')` is a character vector ('') > #> `attr(new$dates, 'tzone')` is absent > unlink(tempfile) > ``` > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-13588) [R] Empty character attributes not stored
[ https://issues.apache.org/jira/browse/ARROW-13588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396644#comment-17396644 ] Neal Richardson commented on ARROW-13588: - I am guessing that the issue is that in C++, the empty string "" means that the field is not set. Provided that there is no different meaning of {{tzone = NULL}} from {{tzone = ""}} in R, we can handle this field specially on the round trip. > [R] Empty character attributes not stored > - > > Key: ARROW-13588 > URL: https://issues.apache.org/jira/browse/ARROW-13588 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 5.0.0 > Environment: Ubuntu 20.04 R 4.1 release >Reporter: Charlie Gao >Priority: Minor > Labels: attributes, feather > > Date-times in the POSIXct format have a 'tzone' attribute that by default is > set to "", an empty character vector (not NULL) when created. > This however is not stored in the Arrow feather file. When the file is read > back, the original and restored dataframes are not identical as per the below > reprex. > I am thinking that this should not be the intention? My workaround at the > moment is making a check when reading back to write the empty string if the > tzone attribute does not exist. > Just to confirm, the attribute is stored correctly when it is not empty. > Thanks. 
> {code:java} > ``` r > dates <- as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-02")) > attributes(dates) > #> $class > #> [1] "POSIXct" "POSIXt" > #> > #> $tzone > #> [1] "" > values <- c(1:3) > original <- data.frame(dates, values) > original > #> dates values > #> 1 2020-01-01 1 > #> 2 2020-01-02 2 > #> 3 2020-01-02 3 > tempfile <- tempfile() > arrow::write_feather(original, tempfile) > restored <- arrow::read_feather(tempfile) > identical(original, restored) > #> [1] FALSE > waldo::compare(original, restored) > #> `attr(old$dates, 'tzone')` is a character vector ('') > #> `attr(new$dates, 'tzone')` is absent > unlink(tempfile) > ``` > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12959) [C++][R] Option for is_null(NaN) to evaluate to true
[ https://issues.apache.org/jira/browse/ARROW-12959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-12959: Labels: kernel pull-request-available (was: pull-request-available) > [C++][R] Option for is_null(NaN) to evaluate to true > > > Key: ARROW-12959 > URL: https://issues.apache.org/jira/browse/ARROW-12959 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, R >Reporter: Ian Cook >Assignee: Christian Cordova >Priority: Major > Labels: kernel, pull-request-available > Time Spent: 1h 50m > Remaining Estimate: 0h > > (This is the flip side of ARROW-12960.) > Currently the Arrow compute kernel {{is_null}} always treats {{NaN}} as a > non-missing value, returning {{false}} at positions of the input datum with > value {{NaN}}. > It would be helpful to be able to control this behavior with an option. The > option could be named {{nan_is_null}} or something similar. It would default > to {{false}}, consistent with current behavior. When set to {{true}}, it > should check if the input datum has a floating point data type, and if so, > return {{true}} at positions where the input is {{NaN}}. If the input datum > has some other type, the option should be silently ignored. > Among other things, this would enable the {{arrow}} R package to evaluate > {{is.na()}} consistently with the way base R does. In base R, {{is.na()}} > returns {{TRUE}} on {{NaN}}. But in the {{arrow}} R package, it returns > {{FALSE}}: > {code:r} > is.na(c(3.14, NA, NaN)) > ## [1] FALSE TRUE TRUE > as.vector(is.na(Array$create(c(3.14, NA, NaN > ## [1] FALSE TRUE FALSE{code} > I think solving this with an option in the C++ kernel is the best solution, > because I suspect there are other cases in which users might want to treat > {{NaN}} as a missing value. 
However, it would also be possible to solve this > just in the R package, by defining a mapping of {{is.na}} in the R package > that checks if the input {{x}} has a floating point data type, and if so, > evaluates {{is.na\(x\) | is.nan\(x\)}}. If we choose to go that route, we > should change this Jira issue summary to "[R] Make is.na(NaN) consistent with > base R". -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-13479) [Format] Make requirement around dense union offsets less ambiguous
[ https://issues.apache.org/jira/browse/ARROW-13479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396606#comment-17396606 ] Antoine Pitrou commented on ARROW-13479: Same opinion. We should probably be conservative. [~wesm] What do you say? > [Format] Make requirement around dense union offsets less ambiguous > --- > > Key: ARROW-13479 > URL: https://issues.apache.org/jira/browse/ARROW-13479 > Project: Apache Arrow > Issue Type: Task > Components: Format >Reporter: Antoine Pitrou >Priority: Major > Fix For: 6.0.0 > > > Currently, the spec states that dense union offsets for each child array must > be "in order / increasing". There is an ambiguity: should they be strictly > increasing, or are equal values supported? > The C++ implementation currently considers that equal offsets are acceptable. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-13578) [Python] Inconsistent handling of integer-valued partitions in dataset filters API
[ https://issues.apache.org/jira/browse/ARROW-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396598#comment-17396598 ] Joris Van den Bossche edited comment on ARROW-13578 at 8/10/21, 10:34 AM: -- [~mnizol] thanks a lot for the clear and detailed report. What's causing the confusion here is that we have both a legacy Python implementation of ParquetDataset and a new generic Datasets API, and that we are still in the middle of moving to the new implementation: {{ParquetDataset}} still uses the legacy implementation by default (but you can use {{use_legacy_dataset=False}} to opt in to the new), while {{pq.read_table}} (which is what {{pd.read_parquet}} uses under the hood) is already defaulting to the new implementation (but you can fall back to the old with {{use_legacy_dataset=True}}). In the legacy ParquetDataset implementation, all partition keys are indeed parsed as strings as you show with the output of {{ParquetDataset.partitions.levels}}. So when passing {{use_legacy_dataset=True}} to the read function, using a string actually works: {code} In [19]: pd.read_parquet('./test.parquet', filters=[('key1','=','1')], use_legacy_dataset=True) Out[19]: data key1 key2 key3 key4 key5 key6 0 bar11b 2.2 False 2021-06-03 00:00:00 {code} BTW, also using an integer works here ({{('key1', '=', 1)}}), because the legacy implementation will try to interpret the value with the type of the partition levels. In the new Datasets API, the parsing of the directory paths currently supports int32 and strings (when inferring, you can use other types when explicitly passing the schema for the partition keys). 
So when creating a ParquetDataset with use_legacy_dataset=False, we see: {code} In [21]: ds = pq.ParquetDataset('test_partitions', use_legacy_dataset=False) In [22]: ds._dataset.partitioning Out[22]: In [23]: ds._dataset.partitioning.schema Out[23]: key1: dictionary key2: dictionary key3: dictionary key4: dictionary key5: dictionary key6: dictionary -- schema metadata -- pandas: '{"index_columns": [], "column_indexes": [{"name": null, "field_n' + 340 In [24]: ds._dataset.partitioning.dictionaries Out[24]: [ [ 0, 1, 2 ], [ 0, 1, 2 ], [ "a", "b", "c" ], [ "1.1", "2.2", "3.3" ], [ "True", "False" ], [ "2021-06-02 00:00:00", "2021-06-03 00:00:00", "2021-06-04 00:00:00" ]] {code} So the first two partition keys are inferred as int, the others as string. And that's also the reason that for this case, you actually need to specify the filter using an integer (we decided to not do such automatic casting here in the new implementation). Sidenote: I am using {{ds._dataset.partitioning}} above, but this will become {{ds.partitioning}} after ARROW-13525. So with an integer value in the filter this works (adding {{use_legacy_dataset=False}} explicitly, but so this is the default in {{pq.read_table}} / {{pd.read_parquet}}): {code} In [20]: pd.read_parquet('./test_partitions/', filters=[('key1','=', 1)], use_legacy_dataset=False) Out[20]: data key1 key2 key3 key4 key5 key6 0 bar11b 2.2 False 2021-06-03 00:00:00 {code} Using the new datasets API directly, this looks like: {code} In [25]: import pyarrow.dataset as ds In [26]: dataset = ds.dataset("test_partitions/", format="parquet", partitioning="hive") In [28]: dataset.to_table(filter=ds.field("key1") == 1).to_pandas() Out[28]: data key1 key2 key3 key4 key5 key6 0 bar 1 1b 2.2 False 2021-06-03 00:00:00 {code} was (Author: jorisvandenbossche): [~mnizol] thanks a lot for the clear and detailed report. 
What's causing the confusion here is that we have both a legacy Python implementation of ParquetDataset and a new generic Datasets API, and that we are still in the middle of moving to the new implementation: {{ParquetDataset}} still uses the legacy implementation by default (but you can use {{use_legacy_dataset=False}} to opt in to the new), while {{pq.read_table}} (which is what {{pd.read_parquet}} uses under the hood) is already defaulting to the new implementation (but you can fall back to the old with {{use_legacy_dataset=True}}). In the legacy ParquetDataset implementation, all partition keys are indeed parsed as strings as you show with the output of {{ParquetDataset.partitions.levels}}. So when passing {{use_legacy_dataset=True}} to the read function, using a string actually works: {code} In [19]: pd.read_parquet('./test.parquet', filters=[('key1','=','1')], use_legacy_dataset=True) Out[19]: data key1 key2 key3 key4 key5 key6 0 bar11b 2.2 False 2021-06-03 00:00:00 {code} BTW, also using an integer works here ({{('key1', '=', 1)}}), because the legacy implementation will try to interpret the
[jira] [Updated] (ARROW-13525) [Python] Mention alternatives in deprecation message of ParquetDataset attributes
[ https://issues.apache.org/jira/browse/ARROW-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-13525: -- Description: Follow-up on ARROW-13074. We should maybe also expose the {{partitioning}} attribute on ParquetDataset (if constructed with {{use_legacy_dataset=False}}), as I did for the {{filesystem}}/{{files}}/{{fragments}} attributes. was:Follow-up on ARROW-13074. Better mention the alternatives (eg pieces -> fragments > [Python] Mention alternatives in deprecation message of ParquetDataset > attributes > - > > Key: ARROW-13525 > URL: https://issues.apache.org/jira/browse/ARROW-13525 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Fix For: 5.0.1 > > > Follow-up on ARROW-13074. > We should maybe also expose the {{partitioning}} attribute on ParquetDataset > (if constructed with {{use_legacy_dataset=False}}), as I did for the > {{filesystem}}/{{files}}/{{fragments}} attributes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13525) [Python] Mention alternatives in deprecation message of ParquetDataset attributes
[ https://issues.apache.org/jira/browse/ARROW-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-13525: -- Description: Follow-up on ARROW-13074. Better mention the alternatives (eg pieces -> fragments (was: Follow-up on ARROW-13074) > [Python] Mention alternatives in deprecation message of ParquetDataset > attributes > - > > Key: ARROW-13525 > URL: https://issues.apache.org/jira/browse/ARROW-13525 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Fix For: 5.0.1 > > > Follow-up on ARROW-13074. Better mention the alternatives (eg pieces -> > fragments -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-13525) [Python] Mention alternatives in deprecation message of ParquetDataset attributes
[ https://issues.apache.org/jira/browse/ARROW-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-13525: - Assignee: Joris Van den Bossche > [Python] Mention alternatives in deprecation message of ParquetDataset > attributes > - > > Key: ARROW-13525 > URL: https://issues.apache.org/jira/browse/ARROW-13525 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Fix For: 5.0.1 > > > Follow-up on ARROW-13074 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13578) [Python] Inconsistent handling of integer-valued partitions in dataset filters API
[ https://issues.apache.org/jira/browse/ARROW-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-13578: -- Component/s: Python > [Python] Inconsistent handling of integer-valued partitions in dataset > filters API > -- > > Key: ARROW-13578 > URL: https://issues.apache.org/jira/browse/ARROW-13578 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Matt Nizol >Priority: Minor > > When creating a partitioned data set via the pandas.to_parquet() method, > partition columns are ostensibly cast to strings in the partition metadata. > When reading specific partitions via the filters parameter in > pandas.read_parquet(), string values must be used for filter operands _except > when_ the partition column has an integer value. > Consider the following example: > {code:python} > import datetime > import pandas as pd > df = pd.DataFrame({ > "key1": ['0', '1', '2'], > "key2": [0, 1, 2], > "key3": ['a', 'b', 'c'], > "key4": [1.1, 2.2, 3.3], > "key5": [True, False, True], > "key6": [datetime.date(2021, 6, 2), datetime.date(2021, 6, 3), > datetime.date(2021, 6, 4)], > "data": ["foo", "bar", "baz"] > }) > df['key6'] = pd.to_datetime(df['key6']) > df.to_parquet('./test.parquet', partition_cols=['key1', 'key2', 'key3', > 'key4', 'key5', 'key6']) > {code} > Reading into a ParquetDataset and inspecting the partition levels suggests > that partition keys have been cast to string, regardless of the original type: > {code:python} > import pyarrow.parquet as pq > ds = pq.ParquetDataset('./test.parquet') > for level in ds.partitions.levels: > print(f"{level.name}: {level.keys}") > {code} > Output: > {noformat} > key1: ['0', '1', '2'] > key2: ['0', '1', '2'] > key3: ['a', 'b', 'c'] > key4: ['1.1', '2.2', '3.3'] > key5: ['True', 'False'] > key6: ['2021-06-02 00:00:00', '2021-06-03 00:00:00', '2021-06-04 > 00:00:00']{noformat} > Filtering the dataset using any of the non-integer partition keys along with > 
string-valued operands works as expected: > {code:python} > df2=pd.read_parquet('./test.parquet', filters=[('key4','=','1.1'), ('key5', > '=', 'True')]) > df2.head() > {code} > Output: > {noformat} > data key1 key2 key3 key4 key5 key6 > 0 foo 0 0 a 1.1 True 2021-06-02 00:00:00 > {noformat} > However, filtering the dataset using either of the integer-valued partition > keys with a string-valued operand raises an exception, *even when the > original column's data type is string*: > {code:python} > df2=pd.read_parquet('./test.parquet', filters=[('key1','=','1')]) > df2.head() > {code} > {noformat} > ArrowNotImplementedError: Function equal has no kernel matching input types > (array[int32], scalar[string]) > {noformat} > It would seem to be less surprising / more consistent if filter operands > either (a) are always cast to string, or (b) always retain their original > type. > Note, this issue may be related to ARROW-12114. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13578) [Python] Inconsistent handling of integer-valued partitions in dataset filters API
[ https://issues.apache.org/jira/browse/ARROW-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-13578: -- Summary: [Python] Inconsistent handling of integer-valued partitions in dataset filters API (was: Inconsistent handling of integer-valued partitions in dataset filters API) > [Python] Inconsistent handling of integer-valued partitions in dataset > filters API > -- > > Key: ARROW-13578 > URL: https://issues.apache.org/jira/browse/ARROW-13578 > Project: Apache Arrow > Issue Type: Bug >Reporter: Matt Nizol >Priority: Minor > > When creating a partitioned data set via the pandas.to_parquet() method, > partition columns are ostensibly cast to strings in the partition metadata. > When reading specific partitions via the filters parameter in > pandas.read_parquet(), string values must be used for filter operands _except > when_ the partition column has an integer value. > Consider the following example: > {code:python} > import datetime > import pandas as pd > df = pd.DataFrame({ > "key1": ['0', '1', '2'], > "key2": [0, 1, 2], > "key3": ['a', 'b', 'c'], > "key4": [1.1, 2.2, 3.3], > "key5": [True, False, True], > "key6": [datetime.date(2021, 6, 2), datetime.date(2021, 6, 3), > datetime.date(2021, 6, 4)], > "data": ["foo", "bar", "baz"] > }) > df['key6'] = pd.to_datetime(df['key6']) > df.to_parquet('./test.parquet', partition_cols=['key1', 'key2', 'key3', > 'key4', 'key5', 'key6']) > {code} > Reading into a ParquetDataset and inspecting the partition levels suggests > that partition keys have been cast to string, regardless of the original type: > {code:python} > import pyarrow.parquet as pq > ds = pq.ParquetDataset('./test.parquet') > for level in ds.partitions.levels: > print(f"{level.name}: {level.keys}") > {code} > Output: > {noformat} > key1: ['0', '1', '2'] > key2: ['0', '1', '2'] > key3: ['a', 'b', 'c'] > key4: ['1.1', '2.2', '3.3'] > key5: ['True', 'False'] > key6: ['2021-06-02 00:00:00', '2021-06-03 
00:00:00', '2021-06-04 > 00:00:00']{noformat} > Filtering the dataset using any of the non-integer partition keys along with > string-valued operands works as expected: > {code:python} > df2=pd.read_parquet('./test.parquet', filters=[('key4','=','1.1'), ('key5', > '=', 'True')]) > df2.head() > {code} > Output: > {noformat} > datakey1key2key3key4key5key6 > 0 foo 0 0 a 1.1 True2021-06-02 00:00:00 > {noformat} > However, filtering the dataset using either of the integer-valued partition > keys with a string-valued operand raises an exception, *even when the > original column's data type is string*: > {code:python} > df2=pd.read_parquet('./test.parquet', filters=[('key1','=','1')]) > df2.head() > {code} > {noformat} > ArrowNotImplementedError: Function equal has no kernel matching input types > (array[int32], scalar[string]) > {noformat} > It would seem to be less surprising / more consistent if filter operands > either (a) are always cast to string, or (b) always retain their original > type. > Note, this issue may be related to ARROW-12114. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-13578) Inconsistent handling of integer-valued partitions in dataset filters API
[ https://issues.apache.org/jira/browse/ARROW-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396598#comment-17396598 ] Joris Van den Bossche commented on ARROW-13578: --- [~mnizol] thanks a lot for the clear and detailed report. What's causing the confusion here is that we have both a legacy Python implementation of ParquetDataset and a new generic Datasets API, and that we are still in the middle of moving to the new implementation: {{ParquetDataset}} still uses the legacy implementation by default (but you can use {{use_legacy_dataset=False}} to opt in to the new), while {{pq.read_table}} (which is what {{pd.read_parquet}} uses under the hood) is already defaulting to the new implementation (but you can fall back to the old with {{use_legacy_dataset=True}}). In the legacy ParquetDataset implementation, all partition keys are indeed parsed as strings as you show with the output of {{ParquetDataset.partitions.levels}}. So when passing {{use_legacy_dataset=True}} to the read function, using a string actually works: {code} In [19]: pd.read_parquet('./test.parquet', filters=[('key1','=','1')], use_legacy_dataset=True) Out[19]: data key1 key2 key3 key4 key5 key6 0 bar11b 2.2 False 2021-06-03 00:00:00 {code} BTW, also using an integer works here ({{('key1', '=', 1)}}), because the legacy implementation will try to interpret the value with the type of the partition levels. In the new Datasets API, the parsing of the directory paths currently supports int32 and strings (when inferring, you can use other types when explicitly passing the schema for the partition keys). 
So when creating a ParquetDataset with use_legacy_dataset=False, we see: {code} In [21]: ds = pq.ParquetDataset('test_partitions', use_legacy_dataset=False) In [22]: ds._dataset.partitioning Out[22]: In [23]: ds._dataset.partitioning.schema Out[23]: key1: dictionary key2: dictionary key3: dictionary key4: dictionary key5: dictionary key6: dictionary -- schema metadata -- pandas: '{"index_columns": [], "column_indexes": [{"name": null, "field_n' + 340 In [24]: ds._dataset.partitioning.dictionaries Out[24]: [ [ 0, 1, 2 ], [ 0, 1, 2 ], [ "a", "b", "c" ], [ "1.1", "2.2", "3.3" ], [ "True", "False" ], [ "2021-06-02 00:00:00", "2021-06-03 00:00:00", "2021-06-04 00:00:00" ]] {code} So the first two partition keys are inferred as int, the others as string. And that's also the reason that for this case, you actually need to specify the filter using an integer (we decided to not do such automatic casting here in the new implementation). Sidenote: I am using {{ds._dataset.partitioning}} above, but this will become {{ds.partitioning}} after ARROW-13525. Using the new datasets API directly, this looks like: {code} In [25]: import pyarrow.dataset as ds In [26]: dataset = ds.dataset("test_partitions/", format="parquet", partitioning="hive") In [28]: dataset.to_table(filter=ds.field("key1") == 1).to_pandas() Out[28]: data key1 key2 key3 key4 key5 key6 0 bar 1 1b 2.2 False 2021-06-03 00:00:00 {code} > Inconsistent handling of integer-valued partitions in dataset filters API > - > > Key: ARROW-13578 > URL: https://issues.apache.org/jira/browse/ARROW-13578 > Project: Apache Arrow > Issue Type: Bug >Reporter: Matt Nizol >Priority: Minor > > When creating a partitioned data set via the pandas.to_parquet() method, > partition columns are ostensibly cast to strings in the partition metadata. > When reading specific partitions via the filters parameter in > pandas.read_parquet(), string values must be used for filter operands _except > when_ the partition column has an integer value. 
[jira] [Commented] (ARROW-7179) [C++][Compute] Array support for fill_null
[ https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396577#comment-17396577 ] Joris Van den Bossche commented on ARROW-7179: -- We actually already have a "fill_null" kernel for filling with a scalar (added in ), so this issue was about expanding that kernel to also support array fill values. That's indeed the same as coalesce. But, "fill_null" is currently a fully separate implementation (scalar_fill_null.cc), while I suppose "coalesce" might also have a specialized path for scalar fill values? That could potentially be consolidated? > [C++][Compute] Array support for fill_null > -- > > Key: ARROW-7179 > URL: https://issues.apache.org/jira/browse/ARROW-7179 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.15.1 >Reporter: Ben Kietzman >Assignee: David Li >Priority: Major > Labels: analytics > Fix For: 5.0.0 > > > Add kernels to support which replacing null values in an array with values > taken from corresponding slots in another array: > {code} > fill_null([1, null, null, 3], [5, 6, null, 8]) -> [1, 6, null, 3] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13594) [CI] Turbodbc integration builds are failing due to use of deprecated/removed APIs
[ https://issues.apache.org/jira/browse/ARROW-13594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-13594: --- Component/s: Python > [CI] Turbodbc integration builds are failing due to use of deprecated/removed > APIs > -- > > Key: ARROW-13594 > URL: https://issues.apache.org/jira/browse/ARROW-13594 > Project: Apache Arrow > Issue Type: Test > Components: C++, Continuous Integration, Python >Reporter: Joris Van den Bossche >Priority: Major > > See eg https://github.com/ursacomputing/crossbow/runs/3277446055 > ARROW-13552 removed some deprecated C++ APIs, and turbodbc was still using > some of those. See https://github.com/apache/arrow/pull/10868/files#r685800679 > cc [~uwe] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-13581) [Python] pyarrow array equals return False if there's nan
[ https://issues.apache.org/jira/browse/ARROW-13581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche closed ARROW-13581. - Resolution: Duplicate > [Python] pyarrow array equals return False if there's nan > -- > > Key: ARROW-13581 > URL: https://issues.apache.org/jira/browse/ARROW-13581 > Project: Apache Arrow > Issue Type: Bug >Reporter: David Zhang >Priority: Major > > pyarrow array / chunked array / table `.equals` would return False if there > is nan value(s) in the data. > Example: > {code:java} > pa.array([1, np.nan]).equals(pa.array([1, np.nan])) {code} > Above will return False instead of True -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-13581) [Python] pyarrow array equals return False if there's nan
[ https://issues.apache.org/jira/browse/ARROW-13581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396553#comment-17396553 ] Joris Van den Bossche commented on ARROW-13581: --- See the discussion in ARROW-6043. This is currently somewhat deliberate, but the plan is to add an option to consider NaNs as equal if they occur in the same location. Closing this as a duplicate of ARROW-6043. > [Python] pyarrow array equals return False if there's nan > -- > > Key: ARROW-13581 > URL: https://issues.apache.org/jira/browse/ARROW-13581 > Project: Apache Arrow > Issue Type: Bug >Reporter: David Zhang >Priority: Major > > pyarrow array / chunked array / table `.equals` would return False if there > is nan value(s) in the data. > Example: > {code:java} > pa.array([1, np.nan]).equals(pa.array([1, np.nan])) {code} > Above will return False instead of True -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13581) [Python] pyarrow array equals return False if there's nan
[ https://issues.apache.org/jira/browse/ARROW-13581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-13581: -- Summary: [Python] pyarrow array equals return False if there's nan (was: pyarrow array equals return False if there's nan ) > [Python] pyarrow array equals return False if there's nan > -- > > Key: ARROW-13581 > URL: https://issues.apache.org/jira/browse/ARROW-13581 > Project: Apache Arrow > Issue Type: Bug >Reporter: David Zhang >Priority: Major > > pyarrow array / chunked array / table `.equals` would return False if there > is nan value(s) in the data. > Example: > {code:java} > pa.array([1, np.nan]).equals(pa.array([1, np.nan])) {code} > Above will return False instead of True -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-13594) [CI] Turbodbc integration builds are failing due to use of deprecated/removed APIs
[ https://issues.apache.org/jira/browse/ARROW-13594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396535#comment-17396535 ] Joris Van den Bossche commented on ARROW-13594: --- I opened https://github.com/blue-yonder/turbodbc/issues/317 on the turbodbc side. > [CI] Turbodbc integration builds are failing due to use of deprecated/removed > APIs > -- > > Key: ARROW-13594 > URL: https://issues.apache.org/jira/browse/ARROW-13594 > Project: Apache Arrow > Issue Type: Test > Components: C++, Continuous Integration >Reporter: Joris Van den Bossche >Priority: Major > > See eg https://github.com/ursacomputing/crossbow/runs/3277446055 > ARROW-13552 removed some deprecated C++ APIs, and turbodbc was still using > some of those. See https://github.com/apache/arrow/pull/10868/files#r685800679 > cc [~uwe] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13594) [CI] Turbodbc integration builds are failing due to use of deprecated/removed APIs
Joris Van den Bossche created ARROW-13594: - Summary: [CI] Turbodbc integration builds are failing due to use of deprecated/removed APIs Key: ARROW-13594 URL: https://issues.apache.org/jira/browse/ARROW-13594 Project: Apache Arrow Issue Type: Test Components: C++, Continuous Integration Reporter: Joris Van den Bossche See eg https://github.com/ursacomputing/crossbow/runs/3277446055 ARROW-13552 removed some deprecated C++ APIs, and turbodbc was still using some of those. See https://github.com/apache/arrow/pull/10868/files#r685800679 cc [~uwe] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13593) [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API
Maya Anderson created ARROW-13593: - Summary: [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API Key: ARROW-13593 URL: https://issues.apache.org/jira/browse/ARROW-13593 Project: Apache Arrow Issue Type: New Feature Components: C++, Parquet Reporter: Maya Anderson Assignee: Maya Anderson In order for the new Dataset API to fully support PME, the same writer properties that include file_encryption_properties shouldn’t be used for the whole dataset. file_encryption_properties should be per file, for example in order to support key rotation https://issues.apache.org/jira/browse/ARROW-9960 . -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-6908) [C++] Add support for Bazel
[ https://issues.apache.org/jira/browse/ARROW-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou closed ARROW-6908. - Assignee: (was: Micah Kornfield) > [C++] Add support for Bazel > --- > > Key: ARROW-6908 > URL: https://issues.apache.org/jira/browse/ARROW-6908 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Aryan Naraghi >Priority: Major > Labels: pull-request-available > Time Spent: 1h > Remaining Estimate: 0h > > I would like to use Arrow in a C++ project that uses Bazel. > > Would it be possible to add support for building Arrow using Bazel? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13592) How to use "-D_GLIBCXX_USE_CXX11_ABI=0" correctly in arrow
[ https://issues.apache.org/jira/browse/ARROW-13592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangdapeng updated ARROW-13592: --- Description: One of the key libraries in myproject uses "-D_GLIBCXX_USE_CXX11_ABI=0", and there is no version "-D_GLIBCXX_USE_CXX11_ABI=1",so I had to compile arrow with "-D_GLIBCXX_USE_CXX11_ABI=0". env: GCC 7.5 cmake 3.16 Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, the link error : file was not recognized, and the file was truncated commands: cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON -DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0' make arrow make parquet was: env: GCC 7.5 cmake 3.16 Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, the link error : file was not recognized, and the file was truncated commands: cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON -DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0' make arrow make parquet > How to use "-D_GLIBCXX_USE_CXX11_ABI=0" correctly in arrow > -- > > Key: ARROW-13592 > URL: https://issues.apache.org/jira/browse/ARROW-13592 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 5.0.0 >Reporter: wangdapeng >Priority: Blocker > > One of the key libraries in myproject uses "-D_GLIBCXX_USE_CXX11_ABI=0", and > there is no version "-D_GLIBCXX_USE_CXX11_ABI=1",so I had to compile arrow > with "-D_GLIBCXX_USE_CXX11_ABI=0". > env: GCC 7.5 cmake 3.16 > Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, the link error : file was > not recognized, and the file was truncated > commands: > cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON > -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON > -DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0' > make arrow > make parquet -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13592) How to use "-D_GLIBCXX_USE_CXX11_ABI=0" correctly in arrow
[ https://issues.apache.org/jira/browse/ARROW-13592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangdapeng updated ARROW-13592: --- Description: One of the key libraries in my project uses "-D_GLIBCXX_USE_CXX11_ABI=0", and there is no version "-D_GLIBCXX_USE_CXX11_ABI=1",so I had to compile arrow with "-D_GLIBCXX_USE_CXX11_ABI=0". env: GCC 7.5 cmake 3.16 Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, the link error : file was not recognized, and the file was truncated commands: cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON -DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0' make arrow make parquet was: One of the key libraries in myproject uses "-D_GLIBCXX_USE_CXX11_ABI=0", and there is no version "-D_GLIBCXX_USE_CXX11_ABI=1",so I had to compile arrow with "-D_GLIBCXX_USE_CXX11_ABI=0". env: GCC 7.5 cmake 3.16 Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, the link error : file was not recognized, and the file was truncated commands: cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON -DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0' make arrow make parquet > How to use "-D_GLIBCXX_USE_CXX11_ABI=0" correctly in arrow > -- > > Key: ARROW-13592 > URL: https://issues.apache.org/jira/browse/ARROW-13592 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 5.0.0 >Reporter: wangdapeng >Priority: Blocker > > One of the key libraries in my project uses "-D_GLIBCXX_USE_CXX11_ABI=0", and > there is no version "-D_GLIBCXX_USE_CXX11_ABI=1",so I had to compile arrow > with "-D_GLIBCXX_USE_CXX11_ABI=0". > env: GCC 7.5 cmake 3.16 > Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, the link error : file was > not recognized, and the file was truncated > commands: > cmake .. 
-DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON > -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON > -DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0' > make arrow > make parquet -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13592) How to use "-D_GLIBCXX_USE_CXX11_ABI=0" correctly in arrow
[ https://issues.apache.org/jira/browse/ARROW-13592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangdapeng updated ARROW-13592: --- Summary: How to use "-D_GLIBCXX_USE_CXX11_ABI=0" correctly in arrow (was: How to compile arrow with "-D_GLIBCXX_USE_CXX11_ABI=0") > How to use "-D_GLIBCXX_USE_CXX11_ABI=0" correctly in arrow > -- > > Key: ARROW-13592 > URL: https://issues.apache.org/jira/browse/ARROW-13592 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 5.0.0 >Reporter: wangdapeng >Priority: Blocker > > env: GCC 7.5 cmake 3.16 > Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, the link error : file was > not recognized, and the file was truncated > commands: > cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON > -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON > -DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0' > make arrow > make parquet -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13592) How to compile arrow with "-D_GLIBCXX_USE_CXX11_ABI=0"
[ https://issues.apache.org/jira/browse/ARROW-13592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangdapeng updated ARROW-13592: --- Description: env: GCC 7.5 cmake 3.16 Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, the link error : file was not recognized, and the file was truncated commands: cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON -DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0' make arrow make parquet was: Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, the link error : file was not recognized, and the file was truncated commands: cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON -DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0' make arrow make parquet > How to compile arrow with "-D_GLIBCXX_USE_CXX11_ABI=0" > -- > > Key: ARROW-13592 > URL: https://issues.apache.org/jira/browse/ARROW-13592 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 5.0.0 >Reporter: wangdapeng >Priority: Blocker > > env: GCC 7.5 cmake 3.16 > Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, the link error : file was > not recognized, and the file was truncated > commands: > cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON > -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON > -DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0' > make arrow > make parquet -- This message was sent by Atlassian Jira (v8.3.4#803005)
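Editor's note: the "file was not recognized, and the file was truncated" link error in the report above often indicates stale object files compiled under the other ABI. A hedged sketch, not a confirmed fix: rebuild from a clean build directory so nothing compiled without the flag is reused, passing the flag through `CMAKE_CXX_FLAGS` exactly as in the report.

```shell
# Start from a clean build tree so no objects built under the
# default ABI (_GLIBCXX_USE_CXX11_ABI=1) are linked in.
cd arrow/cpp
rm -rf build && mkdir build && cd build

cmake .. \
  -DARROW_DATASET=ON \
  -DARROW_FILESYSTEM=ON \
  -DARROW_PARQUET=ON \
  -DARROW_WITH_SNAPPY=ON \
  -DARROW_WITH_ZLIB=ON \
  -DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0'

make arrow
make parquet
```

Any third-party dependencies linked into Arrow (e.g. a system-provided Snappy or zlib built with the default ABI) would also need to be consistent with the flag, which is the usual remaining failure mode.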