[jira] [Resolved] (ARROW-16296) [GLib][Parquet] Add missing casts for GArrowRoundMode
[ https://issues.apache.org/jira/browse/ARROW-16296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-16296. -- Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12971 [https://github.com/apache/arrow/pull/12971] > [GLib][Parquet] Add missing casts for GArrowRoundMode > - > > Key: ARROW-16296 > URL: https://issues.apache.org/jira/browse/ARROW-16296 > Project: Apache Arrow > Issue Type: Improvement > Components: GLib, Parquet >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16296) [GLib][Parquet] Add missing casts for GArrowRoundMode
[ https://issues.apache.org/jira/browse/ARROW-16296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16296: --- Labels: pull-request-available (was: ) > [GLib][Parquet] Add missing casts for GArrowRoundMode > - > > Key: ARROW-16296 > URL: https://issues.apache.org/jira/browse/ARROW-16296 > Project: Apache Arrow > Issue Type: Improvement > Components: GLib, Parquet >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16296) [GLib][Parquet] Add missing casts for GArrowRoundMode
Kouhei Sutou created ARROW-16296: Summary: [GLib][Parquet] Add missing casts for GArrowRoundMode Key: ARROW-16296 URL: https://issues.apache.org/jira/browse/ARROW-16296 Project: Apache Arrow Issue Type: Improvement Components: GLib, Parquet Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16295) [CI][Release] verify-rc-source-windows still uses windows-2016
[ https://issues.apache.org/jira/browse/ARROW-16295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16295: --- Labels: pull-request-available (was: ) > [CI][Release] verify-rc-source-windows still uses windows-2016 > -- > > Key: ARROW-16295 > URL: https://issues.apache.org/jira/browse/ARROW-16295 > Project: Apache Arrow > Issue Type: Test > Components: Continuous Integration >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > windows-2016 is deprecated: > https://github.com/actions/virtual-environments/issues/4312 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16295) [CI][Release] verify-rc-source-windows still uses windows-2016
Kouhei Sutou created ARROW-16295: Summary: [CI][Release] verify-rc-source-windows still uses windows-2016 Key: ARROW-16295 URL: https://issues.apache.org/jira/browse/ARROW-16295 Project: Apache Arrow Issue Type: Test Components: Continuous Integration Reporter: Kouhei Sutou Assignee: Kouhei Sutou windows-2016 is deprecated: https://github.com/actions/virtual-environments/issues/4312 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-16293) [CI][GLib] Tests are unstable
[ https://issues.apache.org/jira/browse/ARROW-16293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-16293. -- Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12964 [https://github.com/apache/arrow/pull/12964] > [CI][GLib] Tests are unstable > - > > Key: ARROW-16293 > URL: https://issues.apache.org/jira/browse/ARROW-16293 > Project: Apache Arrow > Issue Type: Test > Components: Continuous Integration, GLib >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > 1. macOS test is timed out because ccache cache isn't available: > https://github.com/apache/arrow/runs/6134456502?check_suite_focus=true > 2. {{gparquet_row_group_metadata_equal()}} isn't stable on Windows: > https://github.com/apache/arrow/runs/6134457213?check_suite_focus=true#step:14:308 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16294) [C++] Improve performance of parquet readahead
[ https://issues.apache.org/jira/browse/ARROW-16294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16294: --- Labels: pull-request-available (was: ) > [C++] Improve performance of parquet readahead > -- > > Key: ARROW-16294 > URL: https://issues.apache.org/jira/browse/ARROW-16294 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Weston Pace >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The 7.0.0 readahead for parquet would read up to 256 row groups at once which > meant that, if the consumer were too slow, we would almost certainly run out > of memory. > ARROW-15410 improved readahead as a whole and, in the process, changed > parquet so it's always reading 1 row group in advance. > This is not always ideal in S3 scenarios. We may want to read many row > groups in advance if the row groups are small. To fix this we should > continue reading in parallel until there are at least batch_size * > batch_readahead rows being fetched. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16291) [Java]: Support JSE17 for Java Cookbooks
[ https://issues.apache.org/jira/browse/ARROW-16291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16291: --- Labels: pull-request-available (was: ) > [Java]: Support JSE17 for Java Cookbooks > > > Key: ARROW-16291 > URL: https://issues.apache.org/jira/browse/ARROW-16291 > Project: Apache Arrow > Issue Type: Sub-task > Components: Java >Reporter: David Dali Susanibar Arce >Assignee: David Dali Susanibar Arce >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > > Realize changes needed to run cookbooks through JSE17. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16294) [C++] Improve performance of parquet readahead
[ https://issues.apache.org/jira/browse/ARROW-16294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526714#comment-17526714 ] David Li commented on ARROW-16294: -- This is very similar to ARROW-14648 right? Or rather ARROW-14648 is the fully general solution? > [C++] Improve performance of parquet readahead > -- > > Key: ARROW-16294 > URL: https://issues.apache.org/jira/browse/ARROW-16294 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Weston Pace >Priority: Major > > The 7.0.0 readahead for parquet would read up to 256 row groups at once which > meant that, if the consumer were too slow, we would almost certainly run out > of memory. > ARROW-15410 improved readahead as a whole and, in the process, changed > parquet so it's always reading 1 row group in advance. > This is not always ideal in S3 scenarios. We may want to read many row > groups in advance if the row groups are small. To fix this we should > continue reading in parallel until there are at least batch_size * > batch_readahead rows being fetched. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16294) [C++] Improve performance of parquet readahead
Weston Pace created ARROW-16294: --- Summary: [C++] Improve performance of parquet readahead Key: ARROW-16294 URL: https://issues.apache.org/jira/browse/ARROW-16294 Project: Apache Arrow Issue Type: Improvement Reporter: Weston Pace The 7.0.0 readahead for parquet would read up to 256 row groups at once which meant that, if the consumer were too slow, we would almost certainly run out of memory. ARROW-15410 improved readahead as a whole and, in the process, changed parquet so it's always reading 1 row group in advance. This is not always ideal in S3 scenarios. We may want to read many row groups in advance if the row groups are small. To fix this we should continue reading in parallel until there are at least batch_size * batch_readahead rows being fetched. -- This message was sent by Atlassian Jira (v8.20.7#820007)
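For illustration, here is a minimal sketch of the proposed policy; the names ({{PlanReadahead}}, {{RowGroup}}) are hypothetical and not Arrow APIs. Reads keep being issued until the rows already in flight reach batch_size * batch_readahead, so small row groups trigger many parallel reads while large row groups stop the loop after one or two.

{code:cpp}
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a row group's metadata; not an Arrow type.
struct RowGroup {
  int64_t num_rows;
};

// Returns the indices of the row groups to fetch eagerly: keep launching
// reads until the rows already in flight cover the readahead window of
// batch_size * batch_readahead rows.
std::vector<std::size_t> PlanReadahead(const std::vector<RowGroup>& row_groups,
                                       int64_t batch_size,
                                       int64_t batch_readahead) {
  const int64_t target_rows = batch_size * batch_readahead;
  std::vector<std::size_t> to_fetch;
  int64_t rows_in_flight = 0;
  for (std::size_t i = 0;
       i < row_groups.size() && rows_in_flight < target_rows; ++i) {
    to_fetch.push_back(i);
    rows_in_flight += row_groups[i].num_rows;
  }
  return to_fetch;
}
{code}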
[jira] [Resolved] (ARROW-15410) [C++][Datasets] Improve memory usage of datasets API when scanning parquet
[ https://issues.apache.org/jira/browse/ARROW-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace resolved ARROW-15410. - Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12228 [https://github.com/apache/arrow/pull/12228] > [C++][Datasets] Improve memory usage of datasets API when scanning parquet > -- > > Key: ARROW-15410 > URL: https://issues.apache.org/jira/browse/ARROW-15410 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 4h 20m > Remaining Estimate: 0h > > This is a more targeted fix to improve memory usage when scanning parquet > files. It is related to broader issues like ARROW-14648 but those will > likely take longer to fix. The goal here is to make it possible to scan > large parquet datasets with many files where each file has reasonably sized > row groups (e.g. 1 million rows). Currently we run out of memory scanning a > configuration as simple as: > 21 parquet files > Each parquet file has 10 million rows split into row groups of size 1 million -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16273) [C++] Valgrind error in arrow-compute-scalar-test
[ https://issues.apache.org/jira/browse/ARROW-16273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526712#comment-17526712 ] Weston Pace commented on ARROW-16273: - I have not been able to reproduce this and it appears the nightly valgrind is now passing. I'm not sure if some issue got fixed concurrently or if this is just flaky. > [C++] Valgrind error in arrow-compute-scalar-test > - > > Key: ARROW-16273 > URL: https://issues.apache.org/jira/browse/ARROW-16273 > Project: Apache Arrow > Issue Type: Bug >Reporter: Weston Pace >Priority: Major > > Currently valgrind is failing earlier on the tpch-node-test and hash-join-node-test. Once we fix those tests it seems the next error is this:
> {noformat}
> [ RUN ] TestStringKernels/0.Strptime
> ==9928== Conditional jump or move depends on uninitialised value(s)
> ==9928==    at 0x411AEA2: arrow::TestInitialized(arrow::ArrayData const&) (gtest_util.cc:682)
> ==9928==    by 0xAE1C79: arrow::compute::(anonymous namespace)::ValidateOutput(arrow::ArrayData const&) (test_util.cc:287)
> ==9928==    by 0xAE23FC: arrow::compute::ValidateOutput(arrow::Datum const&) (test_util.cc:320)
> ==9928==    by 0xAE4946: arrow::compute::CheckScalarNonRecursive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<arrow::Datum, std::allocator<arrow::Datum> > const&, arrow::Datum const&, arrow::compute::FunctionOptions const*) (test_util.cc:80)
> ==9928==    by 0xAE63A4: arrow::compute::CheckScalar(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<arrow::Datum, std::allocator<arrow::Datum> > const&, arrow::Datum, arrow::compute::FunctionOptions const*) (test_util.cc:108)
> ==9928==    by 0xAE7E28: arrow::compute::CheckScalarUnary(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, arrow::Datum, arrow::Datum, arrow::compute::FunctionOptions const*) (test_util.cc:254)
> ==9928==    by 0xAE80D3: arrow::compute::CheckScalarUnary(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<arrow::DataType>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<arrow::DataType>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, arrow::compute::FunctionOptions const*) (test_util.cc:260)
> ==9928==    by 0x9F783F: arrow::compute::BaseTestStringKernels<arrow::StringType>::CheckUnary(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<arrow::DataType>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, arrow::compute::FunctionOptions const*) (scalar_string_test.cc:56)
> ==9928==    by 0xA2A62D: arrow::compute::TestStringKernels_Strptime_Test<arrow::StringType>::TestBody() (scalar_string_test.cc:1855)
> ==9928==    by 0x64974DC: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2607)
> ==9928==    by 0x648E90C: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2643)
> ==9928==    by 0x6469CDC: testing::Test::Run() (gtest.cc:2682)
> ==9928==    by 0x646A6FE: testing::TestInfo::Run() (gtest.cc:2861)
> ==9928==    by 0x646B0BD: testing::TestSuite::Run() (gtest.cc:3015)
> ==9928==    by 0x647B1DB: testing::internal::UnitTestImpl::RunAllTests() (gtest.cc:5855)
> ==9928==    by 0x6498497: bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2607)
> ==9928==    by 0x648FAF9: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2643)
> ==9928==    by 0x64796A8: testing::UnitTest::Run() (gtest.cc:5438)
> ==9928==    by 0x4204918: RUN_ALL_TESTS() (gtest.h:2490)
> ==9928==    by 0x420495B: main (gtest_main.cc:52)
> ==9928==
> {
>    <insert_a_suppression_name_here>
>    Memcheck:Cond
>    fun:_ZN5arrow15TestInitializedERKNS_9ArrayDataE
>    fun:_ZN5arrow7compute12_GLOBAL__N_114ValidateOutputERKNS_9ArrayDataE
>    fun:_ZN5arrow7compute14ValidateOutputERKNS_5DatumE
>    fun:_ZN5arrow7compute23CheckScalarNonRecursiveERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorINS_5DatumESaISA_EERKSA_PKNS0_15FunctionOptionsE
>    fun:_ZN5arrow7compute11CheckScalarENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorINS_5DatumESaIS8_EES8_PKNS0_15FunctionOptionsE
>    fun:_ZN5arrow7compute16CheckScalarUnaryENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_5DatumES7_PKNS0_15FunctionOptionsE
>    fun:_ZN5arrow7compute16CheckScalarUnaryENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt10shared_ptrINS_8DataTypeEES6_S9_S6_PKNS0_15FunctionOptionsE
>    fun:_ZN5arrow7
[jira] [Resolved] (ARROW-16264) [C++][CI] Valgrind timeout in arrow-compute-hash-join-node-test
[ https://issues.apache.org/jira/browse/ARROW-16264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace resolved ARROW-16264. - Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12944 [https://github.com/apache/arrow/pull/12944] > [C++][CI] Valgrind timeout in arrow-compute-hash-join-node-test > --- > > Key: ARROW-16264 > URL: https://issues.apache.org/jira/browse/ARROW-16264 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > This is starting to show up once we fixed the valgrind errors in the tpch > node test. > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=23628&view=results -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Assigned] (ARROW-16173) [C++] Add benchmarks for temporal functions/kernels
[ https://issues.apache.org/jira/browse/ARROW-16173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rok Mihevc reassigned ARROW-16173: -- Assignee: Rok Mihevc > [C++] Add benchmarks for temporal functions/kernels > --- > > Key: ARROW-16173 > URL: https://issues.apache.org/jira/browse/ARROW-16173 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Assignee: Rok Mihevc >Priority: Major > Labels: good-second-issue, kernel > > See ML: https://lists.apache.org/thread/bp2f036sgfj72o46yqmglnx20zfc6tfq -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16234) [C++] Implement Rank Kernel
[ https://issues.apache.org/jira/browse/ARROW-16234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526694#comment-17526694 ] David Li commented on ARROW-16234: -- Ah, sounds reasonable. > [C++] Implement Rank Kernel > --- > > Key: ARROW-16234 > URL: https://issues.apache.org/jira/browse/ARROW-16234 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Will Ayd >Assignee: Will Ayd >Priority: Minor > Labels: C++, good-second-issue, kernel, pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > Didn't see this in the library already so apologies if overlooked, but I > think it would be nice to add a compute kernel for ranking. Here is a similar > function in pandas: > [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.rank.html] -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16293) [CI][GLib] Tests are unstable
[ https://issues.apache.org/jira/browse/ARROW-16293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16293: --- Labels: pull-request-available (was: ) > [CI][GLib] Tests are unstable > - > > Key: ARROW-16293 > URL: https://issues.apache.org/jira/browse/ARROW-16293 > Project: Apache Arrow > Issue Type: Test > Components: Continuous Integration, GLib >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > 1. macOS test is timed out because ccache cache isn't available: > https://github.com/apache/arrow/runs/6134456502?check_suite_focus=true > 2. {{gparquet_row_group_metadata_equal()}} isn't stable on Windows: > https://github.com/apache/arrow/runs/6134457213?check_suite_focus=true#step:14:308 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-16240) [Python] Support row_group_size/chunk_size keyword in pq.write_to_dataset with use_legacy_dataset=False
[ https://issues.apache.org/jira/browse/ARROW-16240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs resolved ARROW-16240. - Resolution: Fixed Issue resolved by pull request 12955 [https://github.com/apache/arrow/pull/12955] > [Python] Support row_group_size/chunk_size keyword in pq.write_to_dataset > with use_legacy_dataset=False > --- > > Key: ARROW-16240 > URL: https://issues.apache.org/jira/browse/ARROW-16240 > Project: Apache Arrow > Issue Type: Sub-task > Components: Python >Reporter: Alenka Frim >Assignee: Alenka Frim >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > The {{pq.write_to_dataset}} (legacy implementation) supports the > {{row_group_size}}/{{chunk_size}} keyword to specify the row group size of > the written parquet files. > Now that we made {{use_legacy_dataset=False}} the default, this keyword > doesn't work anymore. > This is because {{dataset.write_dataset(..)}} doesn't support the parquet > {{row_group_size}} keyword. The {{ParquetFileWriteOptions}} class doesn't > support this keyword. > On the parquet side, this is also the only keyword that is not passed to the > {{ParquetWriter}} init (and thus to parquet's {{WriterProperties}} or > {{ArrowWriterProperties}}), but to the actual {{write_table}} call. In C++ > this can be seen at > https://github.com/apache/arrow/blob/76d064c729f5e2287bf2a2d5e02d1fb192ae5738/cpp/src/parquet/arrow/writer.h#L62-L71 > See discussion: > [https://github.com/apache/arrow/pull/12811#discussion_r845304218] -- This message was sent by Atlassian Jira (v8.20.7#820007)
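For reference, a minimal C++ sketch of the API shape described above, built on the public {{parquet::arrow::WriteTable()}} entry point: the row group size travels as the {{chunk_size}} call argument rather than inside {{WriterProperties}} or {{ArrowWriterProperties}}, which is why {{ParquetFileWriteOptions}} has nowhere to put it. The helper name {{WriteWithRowGroupSize}} is invented for illustration.

{code:cpp}
#include <cstdint>
#include <memory>

#include "arrow/io/api.h"
#include "arrow/memory_pool.h"
#include "arrow/status.h"
#include "arrow/table.h"
#include "parquet/arrow/writer.h"

// Illustrative helper (not an Arrow API): write `table` to `sink` with the
// requested row group size. Note that the row group size is passed per
// write call, not through the WriterProperties builders.
arrow::Status WriteWithRowGroupSize(
    const arrow::Table& table,
    std::shared_ptr<arrow::io::OutputStream> sink, int64_t row_group_size) {
  std::shared_ptr<parquet::WriterProperties> props =
      parquet::WriterProperties::Builder().build();
  std::shared_ptr<parquet::ArrowWriterProperties> arrow_props =
      parquet::ArrowWriterProperties::Builder().build();
  // chunk_size caps the number of rows per row group in the output file.
  return parquet::arrow::WriteTable(table, arrow::default_memory_pool(), sink,
                                    /*chunk_size=*/row_group_size, props,
                                    arrow_props);
}
{code}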
[jira] [Created] (ARROW-16293) [CI][GLib] Tests are unstable
Kouhei Sutou created ARROW-16293: Summary: [CI][GLib] Tests are unstable Key: ARROW-16293 URL: https://issues.apache.org/jira/browse/ARROW-16293 Project: Apache Arrow Issue Type: Test Components: Continuous Integration, GLib Reporter: Kouhei Sutou Assignee: Kouhei Sutou 1. macOS test is timed out because ccache cache isn't available: https://github.com/apache/arrow/runs/6134456502?check_suite_focus=true 2. {{gparquet_row_group_metadata_equal()}} isn't stable on Windows: https://github.com/apache/arrow/runs/6134457213?check_suite_focus=true#step:14:308 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16234) [C++] Implement Rank Kernel
[ https://issues.apache.org/jira/browse/ARROW-16234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526689#comment-17526689 ] Will Ayd commented on ARROW-16234: -- I was thinking they wouldn't - the returning array would just give back NULL where NULL was initially provided. You'll see this in the pandas docs as "na_option" https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.rank.html > [C++] Implement Rank Kernel > --- > > Key: ARROW-16234 > URL: https://issues.apache.org/jira/browse/ARROW-16234 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Will Ayd >Assignee: Will Ayd >Priority: Minor > Labels: C++, good-second-issue, kernel, pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > Didn't see this in the library already so apologies if overlooked, but I > think it would be nice to add a compute kernel for ranking. Here is a similar > function in pandas: > [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.rank.html] -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-15542) [GLib][Parquet] Add GParquet*Metadata
[ https://issues.apache.org/jira/browse/ARROW-15542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-15542. -- Fix Version/s: 8.0.0 Resolution: Fixed We can use {{parquet_arrow_file_reader_get_metadata()}}, {{gparquet_file_metadata_get_row_group()}}, {{gparquet_row_group_metadata_get_column_chunk()}}, {{gparquet_column_chunk_metadata_get_statistics()}}, {{gparquet_statistics_get_n_distinct_values()}}, {{gparquet_*_statistics_get_min()}} and {{gparquet_*_statistics_get_max()}} for this. > [GLib][Parquet] Add GParquet*Metadata > - > > Key: ARROW-15542 > URL: https://issues.apache.org/jira/browse/ARROW-15542 > Project: Apache Arrow > Issue Type: Wish > Components: GLib >Affects Versions: 6.0.1 >Reporter: FrankJiao >Assignee: Kouhei Sutou >Priority: Major > Fix For: 8.0.0 > > > how to read ColumnChunkMetaData in parquet-glib? -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-16251) [GLib][Parquet] Add GParquetStatistics
[ https://issues.apache.org/jira/browse/ARROW-16251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-16251. -- Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12953 [https://github.com/apache/arrow/pull/12953] > [GLib][Parquet] Add GParquetStatistics > -- > > Key: ARROW-16251 > URL: https://issues.apache.org/jira/browse/ARROW-16251 > Project: Apache Arrow > Issue Type: Sub-task > Components: GLib, Parquet >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-12885) [C++] Error: template with C linkage template
[ https://issues.apache.org/jira/browse/ARROW-12885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526683#comment-17526683 ] Kouhei Sutou commented on ARROW-12885: -- Thanks for the information. It seems that we need something like https://github.com/protocolbuffers/protobuf/pull/9065/files . > [C++] Error: template with C linkage template > --- > > Key: ARROW-12885 > URL: https://issues.apache.org/jira/browse/ARROW-12885 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: IBM i | AS400 | AIX >Reporter: Menno >Priority: Major > Attachments: 2021-05-26 16_31_09-Window.png, thrift_ep-build-err.log > > > When installing arrow on IBM i it fails the install at the thrift dependency > install with the following output: > !2021-05-26 16_31_09-Window.png! -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16282) [CI] [C#] Verifiy release on c-sharp has been failing since upgrading ubuntu to 22.04
[ https://issues.apache.org/jira/browse/ARROW-16282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526678#comment-17526678 ] Kouhei Sutou commented on ARROW-16282: -- We have an issue for this: ARROW-16176. I don't think that this is a blocker for 8.0.0 because we didn't have support for Ubuntu 22.04 (OpenSSL 3) in the previous release. This does not break backward compatibility; it is just a new feature. > [CI] [C#] Verifiy release on c-sharp has been failing since upgrading ubuntu > to 22.04 > - > > Key: ARROW-16282 > URL: https://issues.apache.org/jira/browse/ARROW-16282 > Project: Apache Arrow > Issue Type: Bug > Components: C#, Continuous Integration >Reporter: Raúl Cumplido >Assignee: Jacob Wujciak-Jens >Priority: Blocker > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > We upgraded the verify-release job for c-sharp from Ubuntu 20.04 to Ubuntu > 22.04 and we can see how the nightly release job has been failing since then. > Working for ubuntu 20.04 on 2022-04-08: > [https://github.com/ursacomputing/crossbow/tree/nightly-release-2022-04-08-0-github-verify-rc-source-csharp-linux-ubuntu-20.04-amd64] > Failing for ubuntu 22.04 on 2022-04-09: > [https://github.com/ursacomputing/crossbow/tree/nightly-release-2022-04-09-0-github-verify-rc-source-csharp-linux-ubuntu-22.04-amd64] > The error seems to be related with missing libssl: > {code:java} > === > Build and test C# libraries > === > └ Ensuring that C# is installed... > └ Installed C# at (.NET 3.1.405)Welcome to .NET Core 3.1! > - > SDK Version: 3.1.405Telemetry > - > The .NET Core tools collect usage data in order to help us improve your > experience. It is collected by Microsoft and shared with the community. You > can opt-out of telemetry by setting the DOTNET_CLI_TELEMETRY_OPTOUT > environment variable to '1' or 'true' using your favorite shell.Read more > about .NET Core CLI Tools telemetry: > https://aka.ms/dotnet-cli-telemetry > Explore documentation: https://aka.ms/dotnet-docs > Report issues and find source on GitHub: https://github.com/dotnet/core > Find out what's new: https://aka.ms/dotnet-whats-new > Learn about the installed HTTPS developer cert: > https://aka.ms/aspnet-core-https > Use 'dotnet --help' to see available commands or visit: > https://aka.ms/dotnet-cli-docs > Write your first app: https://aka.ms/first-net-core-app > -- > No usable version of libssl was found > /arrow/dev/release/verify-release-candidate.sh: line 325: 49 Aborted > (core dumped) dotnet tool install --tool-path ${csharp_bin} > sourcelink > Failed to verify release candidate. See /tmp/arrow-HEAD.CiwJM for details. > 134 > Error: `docker-compose --file > /home/runner/work/crossbow/crossbow/arrow/docker-compose.yml run --rm -e > VERIFY_VERSION= -e VERIFY_RC= -e TEST_DEFAULT=0 -e TEST_CSHARP=1 > ubuntu-verify-rc` exited with a non-zero exit code 134, see the process log > above.{code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-15757) [Python] Missing bindings for existing_data_behavior makes it impossible to maintain old behavior
[ https://issues.apache.org/jira/browse/ARROW-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs resolved ARROW-15757. - Resolution: Fixed Issue resolved by pull request 12838 [https://github.com/apache/arrow/pull/12838] > [Python] Missing bindings for existing_data_behavior makes it impossible to > maintain old behavior > -- > > Key: ARROW-15757 > URL: https://issues.apache.org/jira/browse/ARROW-15757 > Project: Apache Arrow > Issue Type: Bug > Components: Parquet, Python >Affects Versions: 7.0.0 >Reporter: christophe bagot >Assignee: Alenka Frim >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0, 7.0.1 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Shouldn't the missing bindings reported earlier in > [https://github.com/apache/arrow/pull/11632] be propagated higher up [here in > the parquet.py > module|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L2217]? > Passing **kwargs as is the case for {{write_table}} would do the trick I > think. > I am finding myself stuck while using pandas.to_parquet with > {{use_legacy_dataset=false}} and no way to set the {{existing_data_behavior}} > flag to {{overwrite_or_ignore}} > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16291) [Java]: Support JSE17 for Java Cookbooks
[ https://issues.apache.org/jira/browse/ARROW-16291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Dali Susanibar Arce updated ARROW-16291: -- Fix Version/s: 9.0.0 > [Java]: Support JSE17 for Java Cookbooks > > > Key: ARROW-16291 > URL: https://issues.apache.org/jira/browse/ARROW-16291 > Project: Apache Arrow > Issue Type: Sub-task > Components: Java >Reporter: David Dali Susanibar Arce >Assignee: David Dali Susanibar Arce >Priority: Major > Fix For: 9.0.0 > > > Realize changes needed to run cookbooks through JSE17. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16292) [Java][Doc]: Upgrade java documentation for JSE17
David Dali Susanibar Arce created ARROW-16292: - Summary: [Java][Doc]: Upgrade java documentation for JSE17 Key: ARROW-16292 URL: https://issues.apache.org/jira/browse/ARROW-16292 Project: Apache Arrow Issue Type: Sub-task Components: Documentation, Java Affects Versions: 9.0.0 Reporter: David Dali Susanibar Arce Assignee: David Dali Susanibar Arce Document the changes needed to support JSE17: # Changes on the Arrow side: changes related to {{--add-exports}} are needed to continue supporting errorProne on JSE11+ ([installation doc|https://errorprone.info/docs/installation]). This means you won't need these changes if you build the Arrow Java code without errorProne validation ({{mvn clean install -P-error-prone-jdk11+}}). # Changes as a user of Arrow: users planning to use Arrow with JSE17 need to pass the required modules. For example, running the IO cookbook [https://arrow.apache.org/cookbook/java/io.html] fails with the error {{Unable to make field long java.nio.Buffer.address accessible: module java.base does not "opens java.nio" to unnamed module}}; for that reason, as a JSE17 user (with no Arrow changes involved), it is necessary to add VM arguments such as {{-ea --add-opens=java.base/java.nio=ALL-UNNAMED}}, after which it finishes without errors. This ticket is related to https://github.com/apache/arrow/pull/12941#pullrequestreview-950090643 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16291) [Java]: Support JSE17 for Java Cookbooks
[ https://issues.apache.org/jira/browse/ARROW-16291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Dali Susanibar Arce updated ARROW-16291: -- Component/s: Java > [Java]: Support JSE17 for Java Cookbooks > > > Key: ARROW-16291 > URL: https://issues.apache.org/jira/browse/ARROW-16291 > Project: Apache Arrow > Issue Type: Sub-task > Components: Java >Reporter: David Dali Susanibar Arce >Assignee: David Dali Susanibar Arce >Priority: Major > > Realize changes needed to run cookbooks through JSE17. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16291) [Java]: Support JSE17 for Java Cookbooks
David Dali Susanibar Arce created ARROW-16291: - Summary: [Java]: Support JSE17 for Java Cookbooks Key: ARROW-16291 URL: https://issues.apache.org/jira/browse/ARROW-16291 Project: Apache Arrow Issue Type: Sub-task Reporter: David Dali Susanibar Arce Assignee: David Dali Susanibar Arce Realize changes needed to run cookbooks through JSE17. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16234) [C++] Implement Rank Kernel
[ https://issues.apache.org/jira/browse/ARROW-16234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526665#comment-17526665 ] David Li commented on ARROW-16234: -- Not sure 'ignore nulls' works for sorting in general, how would nulls get compared to other items? > [C++] Implement Rank Kernel > --- > > Key: ARROW-16234 > URL: https://issues.apache.org/jira/browse/ARROW-16234 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Will Ayd >Assignee: Will Ayd >Priority: Minor > Labels: C++, good-second-issue, kernel, pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Didn't see this in the library already so apologies if overlooked, but I > think it would be nice to add a compute kernel for ranking. Here is a similar > function in pandas: > [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.rank.html] -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16234) [C++] Implement Rank Kernel
[ https://issues.apache.org/jira/browse/ARROW-16234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526664#comment-17526664 ] David Li commented on ARROW-16234: -- We could offer function options for different modes which will also impact the return type. This is what we do for other similar kernels (e.g. quantile I think where you can choose to interpolate and get a float, or not interpolate and get the original data type). > [C++] Implement Rank Kernel > --- > > Key: ARROW-16234 > URL: https://issues.apache.org/jira/browse/ARROW-16234 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Will Ayd >Assignee: Will Ayd >Priority: Minor > Labels: C++, good-second-issue, kernel, pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Didn't see this in the library already so apologies if overlooked, but I > think it would be nice to add a compute kernel for ranking. Here is a similar > function in pandas: > [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.rank.html] -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-16280) [C++] Avoid copying shared_ptr in Expression::type()
[ https://issues.apache.org/jira/browse/ARROW-16280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-16280. -- Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12957 [https://github.com/apache/arrow/pull/12957] > [C++] Avoid copying shared_ptr in Expression::type() > > > Key: ARROW-16280 > URL: https://issues.apache.org/jira/browse/ARROW-16280 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Tobias Zagorni >Assignee: Tobias Zagorni >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Split off from ARROW-16161, since this is a fairly straightforward fix and > completely independent of ExecBatch. > Expression::type() currently copies a shared_ptr, while the return > value is often used directly. We can avoid copying the shared_ptr, by > returning a reference to it. This reduces thread contention on these > shared_ptrs (ARROW-16161). -- This message was sent by Atlassian Jira (v8.20.7#820007)
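A simplified sketch of the change (the class below is a stand-in, not Arrow's actual declaration): returning a const reference lets hot callers inspect the type without the atomic reference-count update that copying a {{shared_ptr}} performs, which is the contention noted in ARROW-16161.

{code:cpp}
#include <memory>

struct DataType {};  // stand-in for arrow::DataType

class Expression {
 public:
  // Before: every call copied the shared_ptr, bumping the atomic refcount.
  // std::shared_ptr<DataType> type() const { return type_; }

  // After: callers borrow the pointer; no refcount traffic on inspection.
  const std::shared_ptr<DataType>& type() const { return type_; }

 private:
  std::shared_ptr<DataType> type_;
};
{code}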
[jira] [Commented] (ARROW-16234) [C++] Implement Rank Kernel
[ https://issues.apache.org/jira/browse/ARROW-16234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526659#comment-17526659 ] Will Ayd commented on ARROW-16234: -- I think we also need to consider how to handle NULL. In my current design I was thinking we should delegate as much responsibility as possible to the standard sorting behavior, but AFAICT there are only SortOptions to rank NULLs at the start or the end, not necessarily to ignore NULL altogether. If we want to completely remove NULL from being calculated in the ranking algorithm I wonder if we should try and work that up the class hierarchy a bit to do the same thing in general sorting. > [C++] Implement Rank Kernel > --- > > Key: ARROW-16234 > URL: https://issues.apache.org/jira/browse/ARROW-16234 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Will Ayd >Assignee: Will Ayd >Priority: Minor > Labels: C++, good-second-issue, kernel, pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Didn't see this in the library already so apologies if overlooked, but I > think it would be nice to add a compute kernel for ranking. Here is a similar > function in pandas: > [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.rank.html] -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16234) [C++] Implement Rank Kernel
[ https://issues.apache.org/jira/browse/ARROW-16234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526658#comment-17526658 ] Will Ayd commented on ARROW-16234: -- I pushed up a rough draft for this on GH just to make sure the foundation was right. However, I'm wondering if you think we should mirror what pandas does in cases of ties or pick another default. Pandas interpolates an average for tied rankings by default, which of course is going to change our returned data type. Not sure if we want to stray from the integral return value as a default or instead pick another thing like dense ranking > [C++] Implement Rank Kernel > --- > > Key: ARROW-16234 > URL: https://issues.apache.org/jira/browse/ARROW-16234 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Will Ayd >Assignee: Will Ayd >Priority: Minor > Labels: C++, good-second-issue, kernel, pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Didn't see this in the library already so apologies if overlooked, but I > think it would be nice to add a compute kernel for ranking. Here is a similar > function in pandas: > [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.rank.html] -- This message was sent by Atlassian Jira (v8.20.7#820007)
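To make the trade-off concrete, a self-contained sketch of the tie-handling modes under discussion (illustrative only, not the kernel implementation): min and dense tiebreakers keep integral ranks, while pandas-style average interpolation forces a floating-point result, which is exactly the return-type question raised above.

{code:cpp}
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

enum class Tiebreaker { Min, Dense, Average };

// Computes 1-based ranks of `values` under the given tie-handling mode.
// Returns doubles because Average can yield non-integral ranks, e.g.
// Rank({1, 2, 2, 3}, Tiebreaker::Average) == {1, 2.5, 2.5, 4}.
std::vector<double> Rank(const std::vector<double>& values, Tiebreaker mode) {
  const std::size_t n = values.size();
  std::vector<std::size_t> order(n);
  std::iota(order.begin(), order.end(), 0);
  std::sort(order.begin(), order.end(),
            [&](std::size_t a, std::size_t b) { return values[a] < values[b]; });
  std::vector<double> ranks(n);
  std::size_t i = 0, dense_rank = 0;
  while (i < n) {
    std::size_t j = i;
    while (j < n && values[order[j]] == values[order[i]]) ++j;  // tie run [i, j)
    ++dense_rank;
    for (std::size_t k = i; k < j; ++k) {
      switch (mode) {
        case Tiebreaker::Min:
          ranks[order[k]] = static_cast<double>(i + 1);
          break;
        case Tiebreaker::Dense:
          ranks[order[k]] = static_cast<double>(dense_rank);
          break;
        case Tiebreaker::Average:
          ranks[order[k]] = (i + 1 + j) / 2.0;  // mean of positions i+1..j
          break;
      }
    }
    i = j;
  }
  return ranks;
}
{code}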
[jira] [Commented] (ARROW-16289) [C++] (eventually) abandon scalar columns of an ExecBatch in favor of RLE encoded arrays
[ https://issues.apache.org/jira/browse/ARROW-16289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526656#comment-17526656 ] Eduardo Ponce commented on ARROW-16289: --- The term Scalar is used in different (but related) contexts. For example, the notion of a Scalar value, Scalar kernels, Scalar expressions, etc. I recall an ad-hoc conversation last year where it was discussed that we should consider treating Scalars as a 1-element Array to make the compute layer logic more straightforward. The front-end API would still have the concept of a Scalar but it would be disguised as an Array for execution purposes. I think such a proposal has its merits, but we should decide where the concept of Scalar will remain and make these distinctions clear. > [C++] (eventually) abandon scalar columns of an ExecBatch in favor of RLE > encoded arrays > > > Key: ARROW-16289 > URL: https://issues.apache.org/jira/browse/ARROW-16289 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Priority: Major > > This JIRA is a proposal / discussion. I am not asserting this is the way to > go but I would like to consider it. > From the execution engine's perspective an exec batch's columns are always > either arrays or scalars. The only time we make use of scalars today is for > the four augmented columns (e.g. __filename). Once we have support for RLE > arrays a scalar could easily be encoded as an RLE array and there would be no > need to use scalars here. > The advantage would be reducing the complexity in exec nodes and avoiding > issues like ARROW-16288. It is already rather difficult to explain the idea > of a "scalar" and "vector" function and then have to turn around and explain > that the word "scalar" has an entirely different meaning when talking about > field shape. > I think it's worth considering taking this even further and removing the > concept from the compute layer entirely. Kernel functions that want to have > special logic for scalars could do so using the RLE array. This would be a > significant change to many kernels which currently declare the ANY shape and > determine which logic to apply within the kernel itself (e.g. there is one > array OR scalar kernel and not one kernel for each). > Admittedly there is probably a few instructions and a few bytes more to > handle an RLE scalar than the scalar we have today. However, this is just > different flavors of O(1) and not likely to have significant impact. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16234) [C++] Implement Rank Kernel
[ https://issues.apache.org/jira/browse/ARROW-16234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16234: --- Labels: C++ good-second-issue kernel pull-request-available (was: C++ good-second-issue kernel) > [C++] Implement Rank Kernel > --- > > Key: ARROW-16234 > URL: https://issues.apache.org/jira/browse/ARROW-16234 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Will Ayd >Assignee: Will Ayd >Priority: Minor > Labels: C++, good-second-issue, kernel, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Didn't see this in the library already so apologies if overlooked, but I > think it would be nice to add a compute kernel for ranking. Here is a similar > function in pandas: > [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.rank.html] -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16290) [C++] ExecuteScalarExpression, when calling a nullary function on a nullary batch, resets the batch length to 1
Weston Pace created ARROW-16290: --- Summary: [C++] ExecuteScalarExpression, when calling a nullary function on a nullary batch, resets the batch length to 1 Key: ARROW-16290 URL: https://issues.apache.org/jira/browse/ARROW-16290 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace At the moment ARROW-16286 prevents us from using ExecuteScalarExpression on nullary functions. However, if we bypass constant folding, then we run into another problem. The batch passed to the function always has length = 1. This appears to be tied up with the logic of ExecBatchIterator that I don't quite follow entirely. However, we should be preserving the batch length of the input to ExecuteScalarExpression and passing that to the function. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16289) [C++] (eventually) abandon scalar columns of an ExecBatch in favor of RLE encoded arrays
[ https://issues.apache.org/jira/browse/ARROW-16289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526637#comment-17526637 ] David Li commented on ARROW-16289: -- The concept of scalars would still exist (e.g. in expressions, options) so there's still potential for confusion though this would reduce it. Aggregations would presumably still return scalars, too. It does seem being able to accept scalars is more confusing than it's worth, though. > [C++] (eventually) abandon scalar columns of an ExecBatch in favor of RLE > encoded arrays > > > Key: ARROW-16289 > URL: https://issues.apache.org/jira/browse/ARROW-16289 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Priority: Major > > This JIRA is a proposal / discussion. I am not asserting this is the way to > go but I would like to consider it. > From the execution engine's perspective an exec batch's columns are always > either arrays or scalars. The only time we make use of scalars today is for > the four augmented columns (e.g. __filename). Once we have support for RLE > arrays a scalar could easily be encoded as an RLE array and there would be no > need to use scalars here. > The advantage would be reducing the complexity in exec nodes and avoiding > issues like ARROW-16288. It is already rather difficult to explain the idea > of a "scalar" and "vector" function and then have to turn around and explain > that the word "scalar" has an entirely different meaning when talking about > field shape. > I think it's worth considering taking this even further and removing the > concept from the compute layer entirely. Kernel functions that want to have > special logic for scalars could do so using the RLE array. This would be a > significant change to many kernels which currently declare the ANY shape and > determine which logic to apply within the kernel itself (e.g. there is one > array OR scalar kernel and not one kernel for each). > Admittedly there is probably a few instructions and a few bytes more to > handle an RLE scalar than the scalar we have today. However, this is just > different flavors of O(1) and not likely to have significant impact. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-15015) [R] Test / CI flag for ensuring all tests are run?
[ https://issues.apache.org/jira/browse/ARROW-15015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15015. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12940 [https://github.com/apache/arrow/pull/12940] > [R] Test / CI flag for ensuring all tests are run? > -- > > Key: ARROW-15015 > URL: https://issues.apache.org/jira/browse/ARROW-15015 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Jonathan Keane >Assignee: Jacob Wujciak-Jens >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > We frequently skip [tests that depend on features that are not > available|https://github.com/apache/arrow/blob/b9ac245afef081339093cd1930153d6b18b0479d/r/tests/testthat/helper-skip.R#L24-L34] > which is nice (especially for CRAN tests, where the features might not be > buildable. > But should we have a CI flag that we could turn on (that is, off by default) > that forces all of these tests to be run so we can know there is at least one > build that runs all the tests? AFAICT, right now if we have no CI that > successfully builds parquet, for example, most of our parquet tests would > silently not run. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16289) [C++] (eventually) abandon scalar columns of an ExecBatch in favor of RLE encoded arrays
[ https://issues.apache.org/jira/browse/ARROW-16289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526627#comment-17526627 ] Weston Pace commented on ARROW-16289: - CC [~lidavidm] [~edponce] [~apitrou] [~michalno] [~yibocai] > [C++] (eventually) abandon scalar columns of an ExecBatch in favor of RLE > encoded arrays > > > Key: ARROW-16289 > URL: https://issues.apache.org/jira/browse/ARROW-16289 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Priority: Major > > This JIRA is a proposal / discussion. I am not asserting this is the way to > go but I would like to consider it. > From the execution engine's perspective an exec batch's columns are always > either arrays or scalars. The only time we make use of scalars today is for > the four augmented columns (e.g. __filename). Once we have support for RLE > arrays a scalar could easily be encoded as an RLE array and there would be no > need to use scalars here. > The advantage would be reducing the complexity in exec nodes and avoiding > issues like ARROW-16288. It is already rather difficult to explain the idea > of a "scalar" and "vector" function and then have to turn around and explain > that the word "scalar" has an entirely different meaning when talking about > field shape. > I think it's worth considering taking this even further and removing the > concept from the compute layer entirely. Kernel functions that want to have > special logic for scalars could do so using the RLE array. This would be a > significant change to many kernels which currently declare the ANY shape and > determine which logic to apply within the kernel itself (e.g. there is one > array OR scalar kernel and not one kernel for each). > Admittedly there is probably a few instructions and a few bytes more to > handle an RLE scalar than the scalar we have today. However, this is just > different flavors of O(1) and not likely to have significant impact. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16282) [CI] [C#] Verifiy release on c-sharp has been failing since upgrading ubuntu to 22.04
[ https://issues.apache.org/jira/browse/ARROW-16282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16282: --- Labels: pull-request-available (was: ) > [CI] [C#] Verifiy release on c-sharp has been failing since upgrading ubuntu > to 22.04 > - > > Key: ARROW-16282 > URL: https://issues.apache.org/jira/browse/ARROW-16282 > Project: Apache Arrow > Issue Type: Bug > Components: C#, Continuous Integration >Reporter: Raúl Cumplido >Assignee: Jacob Wujciak-Jens >Priority: Blocker > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > We upgraded the verify-release job for c-sharp from Ubuntu 20.04 to Ubuntu > 22.04 and we can see how the nightly release job has been failing since then. > Working for ubuntu 20.04 on 2022-04-08: > [https://github.com/ursacomputing/crossbow/tree/nightly-release-2022-04-08-0-github-verify-rc-source-csharp-linux-ubuntu-20.04-amd64] > Failing for ubuntu 22.04 on 2022-04-09: > [https://github.com/ursacomputing/crossbow/tree/nightly-release-2022-04-09-0-github-verify-rc-source-csharp-linux-ubuntu-22.04-amd64] > The error seems to be related with missing libssl: > {code:java} > === > Build and test C# libraries > === > └ Ensuring that C# is installed... > └ Installed C# at (.NET 3.1.405)Welcome to .NET Core 3.1! > - > SDK Version: 3.1.405Telemetry > - > The .NET Core tools collect usage data in order to help us improve your > experience. It is collected by Microsoft and shared with the community. You > can opt-out of telemetry by setting the DOTNET_CLI_TELEMETRY_OPTOUT > environment variable to '1' or 'true' using your favorite shell.Read more > about .NET Core CLI Tools telemetry: > https://aka.ms/dotnet-cli-telemetry > Explore documentation: https://aka.ms/dotnet-docs > Report issues and find source on GitHub: https://github.com/dotnet/core > Find out what's new: https://aka.ms/dotnet-whats-new > Learn about the installed HTTPS developer cert: > https://aka.ms/aspnet-core-https > Use 'dotnet --help' to see available commands or visit: > https://aka.ms/dotnet-cli-docs > Write your first app: https://aka.ms/first-net-core-app > -- > No usable version of libssl was found > /arrow/dev/release/verify-release-candidate.sh: line 325: 49 Aborted > (core dumped) dotnet tool install --tool-path ${csharp_bin} > sourcelink > Failed to verify release candidate. See /tmp/arrow-HEAD.CiwJM for details. > 134 > Error: `docker-compose --file > /home/runner/work/crossbow/crossbow/arrow/docker-compose.yml run --rm -e > VERIFY_VERSION= -e VERIFY_RC= -e TEST_DEFAULT=0 -e TEST_CSHARP=1 > ubuntu-verify-rc` exited with a non-zero exit code 134, see the process log > above.{code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16289) [C++] (eventually) abandon scalar columns of an ExecBatch in favor of RLE encoded arrays
Weston Pace created ARROW-16289: --- Summary: [C++] (eventually) abandon scalar columns of an ExecBatch in favor of RLE encoded arrays Key: ARROW-16289 URL: https://issues.apache.org/jira/browse/ARROW-16289 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace This JIRA is a proposal / discussion. I am not asserting this is the way to go but I would like to consider it. From the execution engine's perspective an exec batch's columns are always either arrays or scalars. The only time we make use of scalars today is for the four augmented columns (e.g. __filename). Once we have support for RLE arrays a scalar could easily be encoded as an RLE array and there would be no need to use scalars here. The advantage would be reducing the complexity in exec nodes and avoiding issues like ARROW-16288. It is already rather difficult to explain the idea of a "scalar" and "vector" function and then have to turn around and explain that the word "scalar" has an entirely different meaning when talking about field shape. I think it's worth considering taking this even further and removing the concept from the compute layer entirely. Kernel functions that want to have special logic for scalars could do so using the RLE array. This would be a significant change to many kernels which currently declare the ANY shape and determine which logic to apply within the kernel itself (e.g. there is one array OR scalar kernel and not one kernel for each). Admittedly there is probably a few instructions and a few bytes more to handle an RLE scalar than the scalar we have today. However, this is just different flavors of O(1) and not likely to have significant impact. -- This message was sent by Atlassian Jira (v8.20.7#820007)
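A conceptual sketch of the idea, using hypothetical types rather than whatever RLE layout Arrow ends up with: a scalar column becomes an RLE array whose single run spans the batch length, so exec nodes only ever see arrays.

{code:cpp}
#include <cstdint>
#include <vector>

// Hypothetical run-length-encoded array: one value per run, with the
// exclusive end offset of each run. Not Arrow's actual RLE design.
template <typename T>
struct RleArray {
  std::vector<int64_t> run_ends;
  std::vector<T> values;
};

// A "scalar" is just an RLE array with a single run covering the batch,
// e.g. ScalarAsRle(42, 1024) stands for 42 repeated over a 1024-row batch.
template <typename T>
RleArray<T> ScalarAsRle(T value, int64_t batch_length) {
  return RleArray<T>{{batch_length}, {value}};
}
{code}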
[jira] [Created] (ARROW-16288) [C++] ValueDescr::SCALAR nearly unused and does not work for projection
Weston Pace created ARROW-16288: --- Summary: [C++] ValueDescr::SCALAR nearly unused and does not work for projection Key: ARROW-16288 URL: https://issues.apache.org/jira/browse/ARROW-16288 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace First, there are almost no kernels that actually use this shape. Only the functions "all", "any", "list_element", "mean", "product", "struct_field", and "sum" have kernels with this shape. Most kernels that have special logic for scalars handle it by using {{ValueDescr::ANY}}. Second, when passing an expression to the project node, the expression must be bound based on the dataset schema. Since the binding happens based on a schema (and not a batch) the function is bound to ValueDescr::ARRAY (https://github.com/apache/arrow/blob/a16be6b7b6c8271202ff766b99c199b2e29bdfa8/cpp/src/arrow/compute/exec/expression.cc#L461). This results in an error if the function has only ValueDescr::SCALAR kernels and would likely be a problem even if the function had both types of kernels because it would get bound to the wrong kernel. The simplest fix may be to just get rid of ValueDescr and change all kernels to ValueDescr::ANY behavior. If we choose to keep it we will need to figure out how to handle this kind of binding. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Assigned] (ARROW-16282) [CI] [C#] Verifiy release on c-sharp has been failing since upgrading ubuntu to 22.04
[ https://issues.apache.org/jira/browse/ARROW-16282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacob Wujciak-Jens reassigned ARROW-16282: -- Assignee: Jacob Wujciak-Jens > [CI] [C#] Verifiy release on c-sharp has been failing since upgrading ubuntu > to 22.04 > - > > Key: ARROW-16282 > URL: https://issues.apache.org/jira/browse/ARROW-16282 > Project: Apache Arrow > Issue Type: Bug > Components: C#, Continuous Integration >Reporter: Raúl Cumplido >Assignee: Jacob Wujciak-Jens >Priority: Blocker > Fix For: 8.0.0 > > > We upgraded the verify-release job for c-sharp from Ubuntu 20.04 to Ubuntu > 22.04 and we can see how the nightly release job has been failing since then. > Working for ubuntu 20.04 on 2022-04-08: > [https://github.com/ursacomputing/crossbow/tree/nightly-release-2022-04-08-0-github-verify-rc-source-csharp-linux-ubuntu-20.04-amd64] > Failing for ubuntu 22.04 on 2022-04-09: > [https://github.com/ursacomputing/crossbow/tree/nightly-release-2022-04-09-0-github-verify-rc-source-csharp-linux-ubuntu-22.04-amd64] > The error seems to be related with missing libssl: > {code:java} > === > Build and test C# libraries > === > └ Ensuring that C# is installed... > └ Installed C# at (.NET 3.1.405)Welcome to .NET Core 3.1! > - > SDK Version: 3.1.405Telemetry > - > The .NET Core tools collect usage data in order to help us improve your > experience. It is collected by Microsoft and shared with the community. You > can opt-out of telemetry by setting the DOTNET_CLI_TELEMETRY_OPTOUT > environment variable to '1' or 'true' using your favorite shell.Read more > about .NET Core CLI Tools telemetry: > https://aka.ms/dotnet-cli-telemetry > Explore documentation: https://aka.ms/dotnet-docs > Report issues and find source on GitHub: https://github.com/dotnet/core > Find out what's new: https://aka.ms/dotnet-whats-new > Learn about the installed HTTPS developer cert: > https://aka.ms/aspnet-core-https > Use 'dotnet --help' to see available commands or visit: > https://aka.ms/dotnet-cli-docs > Write your first app: https://aka.ms/first-net-core-app > -- > No usable version of libssl was found > /arrow/dev/release/verify-release-candidate.sh: line 325: 49 Aborted > (core dumped) dotnet tool install --tool-path ${csharp_bin} > sourcelink > Failed to verify release candidate. See /tmp/arrow-HEAD.CiwJM for details. > 134 > Error: `docker-compose --file > /home/runner/work/crossbow/crossbow/arrow/docker-compose.yml run --rm -e > VERIFY_VERSION= -e VERIFY_RC= -e TEST_DEFAULT=0 -e TEST_CSHARP=1 > ubuntu-verify-rc` exited with a non-zero exit code 134, see the process log > above.{code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-16121) [Python] Deprecate the (common_)metadata(_path) attributes of ParquetDataset
[ https://issues.apache.org/jira/browse/ARROW-16121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs resolved ARROW-16121. - Resolution: Fixed Issue resolved by pull request 12952 [https://github.com/apache/arrow/pull/12952] > [Python] Deprecate the (common_)metadata(_path) attributes of ParquetDataset > > > Key: ARROW-16121 > URL: https://issues.apache.org/jira/browse/ARROW-16121 > Project: Apache Arrow > Issue Type: Sub-task > Components: Python >Reporter: Joris Van den Bossche >Assignee: Alenka Frim >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > The custom python ParquetDataset implementation exposes the {{metadata}}, > {{metadata_path}}, {{common_metadata}} and {{common_metadata_path}} > attributes, something for which we didn't add an equivalent to the new > dataset API. > Unless we still want to add something for this, we should deprecate those > attributes in the legacy ParquetDataset. > In addition, we should also deprecate passing the {{metadata}} keyword in the > ParquetDataset constructor. -- This message was sent by Atlassian Jira (v8.20.7#820007)
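For orientation, a minimal sketch of the legacy surface being deprecated (the path is a stand-in, and the {{use_legacy_dataset}} keyword is per the pyarrow docs of this era; treat the details as assumptions):

{code:python}
import pyarrow.parquet as pq

# "some_dataset/" is a hypothetical partitioned-dataset path.
dataset = pq.ParquetDataset("some_dataset/", use_legacy_dataset=True)
dataset.metadata              # deprecated
dataset.metadata_path         # deprecated
dataset.common_metadata       # deprecated
dataset.common_metadata_path  # deprecated
{code}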
[jira] [Commented] (ARROW-16282) [CI] [C#] Verify release on c-sharp has been failing since upgrading Ubuntu to 22.04
[ https://issues.apache.org/jira/browse/ARROW-16282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526597#comment-17526597 ] Jacob Wujciak-Jens commented on ARROW-16282: This is most likely due to ubuntu 22 using openssl 3 https://discourse.ubuntu.com/t/openssl-3-0-transition-plans/24453 > [CI] [C#] Verifiy release on c-sharp has been failing since upgrading ubuntu > to 22.04 > - > > Key: ARROW-16282 > URL: https://issues.apache.org/jira/browse/ARROW-16282 > Project: Apache Arrow > Issue Type: Bug > Components: C#, Continuous Integration >Reporter: Raúl Cumplido >Priority: Blocker > Fix For: 8.0.0 > > > We upgraded the verify-release job for c-sharp from Ubuntu 20.04 to Ubuntu > 22.04 and we can see how the nightly release job has been failing since then. > Working for ubuntu 20.04 on 2022-04-08: > [https://github.com/ursacomputing/crossbow/tree/nightly-release-2022-04-08-0-github-verify-rc-source-csharp-linux-ubuntu-20.04-amd64] > Failing for ubuntu 22.04 on 2022-04-09: > [https://github.com/ursacomputing/crossbow/tree/nightly-release-2022-04-09-0-github-verify-rc-source-csharp-linux-ubuntu-22.04-amd64] > The error seems to be related with missing libssl: > {code:java} > === > Build and test C# libraries > === > └ Ensuring that C# is installed... > └ Installed C# at (.NET 3.1.405)Welcome to .NET Core 3.1! > - > SDK Version: 3.1.405Telemetry > - > The .NET Core tools collect usage data in order to help us improve your > experience. It is collected by Microsoft and shared with the community. You > can opt-out of telemetry by setting the DOTNET_CLI_TELEMETRY_OPTOUT > environment variable to '1' or 'true' using your favorite shell.Read more > about .NET Core CLI Tools telemetry: > https://aka.ms/dotnet-cli-telemetry > Explore documentation: https://aka.ms/dotnet-docs > Report issues and find source on GitHub: https://github.com/dotnet/core > Find out what's new: https://aka.ms/dotnet-whats-new > Learn about the installed HTTPS developer cert: > https://aka.ms/aspnet-core-https > Use 'dotnet --help' to see available commands or visit: > https://aka.ms/dotnet-cli-docs > Write your first app: https://aka.ms/first-net-core-app > -- > No usable version of libssl was found > /arrow/dev/release/verify-release-candidate.sh: line 325: 49 Aborted > (core dumped) dotnet tool install --tool-path ${csharp_bin} > sourcelink > Failed to verify release candidate. See /tmp/arrow-HEAD.CiwJM for details. > 134 > Error: `docker-compose --file > /home/runner/work/crossbow/crossbow/arrow/docker-compose.yml run --rm -e > VERIFY_VERSION= -e VERIFY_RC= -e TEST_DEFAULT=0 -e TEST_CSHARP=1 > ubuntu-verify-rc` exited with a non-zero exit code 134, see the process log > above.{code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-15989) [R] Implement rbind for Table and RecordBatch
[ https://issues.apache.org/jira/browse/ARROW-15989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-15989. - Resolution: Fixed Issue resolved by pull request 12751 [https://github.com/apache/arrow/pull/12751] > [R] Implement rbind for Table and RecordBatch > - > > Key: ARROW-15989 > URL: https://issues.apache.org/jira/browse/ARROW-15989 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Affects Versions: 7.0.0 >Reporter: Will Jones >Assignee: Will Jones >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 12h 50m > Remaining Estimate: 0h > > In ARROW-15013 we implemented c() for Arrow arrays. We should now be able to > implement rbind for Tables and RecordBatches (rbind on batches would produce > a table). -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-16284) [Python][Packaging] Use delocate-fuse to create universal2 wheels
[ https://issues.apache.org/jira/browse/ARROW-16284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs resolved ARROW-16284. - Resolution: Fixed Issue resolved by pull request 12959 [https://github.com/apache/arrow/pull/12959] > [Python][Packaging] Use delocate-fuse to create universal2 wheels > - > > Key: ARROW-16284 > URL: https://issues.apache.org/jira/browse/ARROW-16284 > Project: Apache Arrow > Issue Type: Improvement > Components: Packaging, Python >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Previously we used specific universal2 configurations for vcpkg to build the > dependencies containing symbols for both architectures. This approach proved > to be fragile to vcpkg changes making it hard to upgrade the vcpkg version. > As an example https://github.com/apache/arrow/pull/12893 bumps the vcpkg > version where absl has stopped compiling for two CMAKE_OSX_ARCHITECTURES, it > has been already fixed in absl's upstream but that hasn't been released yet. > The new approach uses multibuild's delocate to build the wheels for both > arm64 and amd64 separately and fuse them in an upcoming step to a universal2 > wheel (using {{lipo}} under the hood). -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16287) PyArrow: RuntimeError: AppendRowGroups requires equal schemas when writing _metadata file
[ https://issues.apache.org/jira/browse/ARROW-16287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Barron updated ARROW-16287: Description: I'm trying to follow the example here: [https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files] to write an example partitioned dataset. But I'm consistently getting an error about non-equal schemas. Here's an MCVE: {code:python} from pathlib import Path import numpy as np import pandas as pd import pyarrow as pa import pyarrow.parquet as pq size = 100_000_000 partition_col = np.random.randint(0, 10, size) values = np.random.rand(size) table = pa.Table.from_pandas( pd.DataFrame({"partition_col": partition_col, "values": values}) ) metadata_collector = [] root_path = Path("random.parquet") pq.write_to_dataset( table, root_path, partition_cols=["partition_col"], metadata_collector=metadata_collector, ) # Write the ``_common_metadata`` parquet file without row groups statistics pq.write_metadata(table.schema, root_path / "_common_metadata") # Write the ``_metadata`` parquet file with row groups statistics of all files pq.write_metadata( table.schema, root_path / "_metadata", metadata_collector=metadata_collector ) {code} This raises the error {code} --- RuntimeError Traceback (most recent call last) Input In [92], in () > 1 pq.write_metadata( 2 table.schema, root_path / "_metadata", metadata_collector=metadata_collector 3 ) File ~/tmp/env/lib/python3.8/site-packages/pyarrow/parquet.py:2324, in write_metadata(schema, where, metadata_collector, **kwargs) 2322 metadata = read_metadata(where) 2323 for m in metadata_collector: -> 2324 metadata.append_row_groups(m) 2325 metadata.write_metadata_file(where) File ~/tmp/env/lib/python3.8/site-packages/pyarrow/_parquet.pyx:628, in pyarrow._parquet.FileMetaData.append_row_groups() RuntimeError: AppendRowGroups requires equal schemas. {code} But all schemas in the `metadata_collector` list seem to be the same: {code:python} all(metadata_collector[0].schema == meta.schema for meta in metadata_collector) # True {code} was: I'm trying to follow the example here: [https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files] to write an example partitioned dataset. But I'm consistently getting an error about non-equal schemas.
Here's an MCVE: ``` from pathlib import Path import numpy as np import pandas as pd import pyarrow as pa import pyarrow.parquet as pq size = 100_000_000 partition_col = np.random.randint(0, 10, size) values = np.random.rand(size) table = pa.Table.from_pandas( pd.DataFrame({"partition_col": partition_col, "values": values}) ) metadata_collector = [] root_path = Path("random.parquet") pq.write_to_dataset( table, root_path, partition_cols=["partition_col"], metadata_collector=metadata_collector, ) # Write the ``_common_metadata`` parquet file without row groups statistics pq.write_metadata(table.schema, root_path / "_common_metadata") # Write the ``_metadata`` parquet file with row groups statistics of all files pq.write_metadata( table.schema, root_path / "_metadata", metadata_collector=metadata_collector ) ``` This raises the error ``` --- RuntimeError Traceback (most recent call last) Input In [92], in () > 1 pq.write_metadata( 2 table.schema, root_path / "_metadata", metadata_collector=metadata_collector 3 ) File ~/tmp/env/lib/python3.8/site-packages/pyarrow/parquet.py:2324, in write_metadata(schema, where, metadata_collector, **kwargs) 2322 metadata = read_metadata(where) 2323 for m in metadata_collector: -> 2324 metadata.append_row_groups(m) 2325 metadata.write_metadata_file(where) File ~/tmp/env/lib/python3.8/site-packages/pyarrow/_parquet.pyx:628, in pyarrow._parquet.FileMetaData.append_row_groups() RuntimeError: AppendRowGroups requires equal schemas. ``` But all schemas in the `metadata_collector` list seem to be the same: ``` all(metadata_collector[0].schema == meta.schema for meta in metadata_collector) # True ``` > PyArrow: RuntimeError: AppendRowGroups requires equal schemas when writing > _metadata file > - > > Key: ARROW-16287 > URL: https://issues.apache.org/jira/browse/ARROW-16287 > Project: Apache Arrow > Issue Type: Bug > Components: Parquet >Affects Versions: 7.0.0 > Environment: MacOS. Python 3.8.10. > pyarrow: '7.0.0' > pandas: '1.4.2' > numpy: '1.22.3' >Reporter: Kyle Barron >Priority: Major > > I'm trying to follow the example h
[jira] [Created] (ARROW-16287) PyArrow: RuntimeError: AppendRowGroups requires equal schemas when writing _metadata file
Kyle Barron created ARROW-16287: --- Summary: PyArrow: RuntimeError: AppendRowGroups requires equal schemas when writing _metadata file Key: ARROW-16287 URL: https://issues.apache.org/jira/browse/ARROW-16287 Project: Apache Arrow Issue Type: Bug Components: Parquet Affects Versions: 7.0.0 Environment: MacOS. Python 3.8.10. pyarrow: '7.0.0' pandas: '1.4.2' numpy: '1.22.3' Reporter: Kyle Barron I'm trying to follow the example here: [https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files] to write an example partitioned dataset. But I'm consistently getting an error about non-equal schemas. Here's an MCVE: ``` from pathlib import Path import numpy as np import pandas as pd import pyarrow as pa import pyarrow.parquet as pq size = 100_000_000 partition_col = np.random.randint(0, 10, size) values = np.random.rand(size) table = pa.Table.from_pandas( pd.DataFrame({"partition_col": partition_col, "values": values}) ) metadata_collector = [] root_path = Path("random.parquet") pq.write_to_dataset( table, root_path, partition_cols=["partition_col"], metadata_collector=metadata_collector, ) # Write the ``_common_metadata`` parquet file without row groups statistics pq.write_metadata(table.schema, root_path / "_common_metadata") # Write the ``_metadata`` parquet file with row groups statistics of all files pq.write_metadata( table.schema, root_path / "_metadata", metadata_collector=metadata_collector ) ``` This raises the error ``` --- RuntimeError Traceback (most recent call last) Input In [92], in () > 1 pq.write_metadata( 2 table.schema, root_path / "_metadata", metadata_collector=metadata_collector 3 ) File ~/tmp/env/lib/python3.8/site-packages/pyarrow/parquet.py:2324, in write_metadata(schema, where, metadata_collector, **kwargs) 2322 metadata = read_metadata(where) 2323 for m in metadata_collector: -> 2324 metadata.append_row_groups(m) 2325 metadata.write_metadata_file(where) File ~/tmp/env/lib/python3.8/site-packages/pyarrow/_parquet.pyx:628, in pyarrow._parquet.FileMetaData.append_row_groups() RuntimeError: AppendRowGroups requires equal schemas. ``` But all schemas in the `metadata_collector` list seem to be the same: ``` all(metadata_collector[0].schema == meta.schema for meta in metadata_collector) # True ``` -- This message was sent by Atlassian Jira (v8.20.7#820007)
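For context, a plausible cause, not stated in the report and so flagged as an assumption: {{pq.write_to_dataset}} drops the partition columns from the files it writes, so the schemas gathered in {{metadata_collector}} lack {{partition_col}} while {{table.schema}} still includes it. A hedged workaround sketch, continuing the MCVE above:

{code:python}
# Assumption: the schemas differ only by the partition column; removing it
# from the schema passed to write_metadata may make them match again.
metadata_schema = table.schema.remove(
    table.schema.get_field_index("partition_col")
)
pq.write_metadata(
    metadata_schema,
    root_path / "_metadata",
    metadata_collector=metadata_collector,
)
{code}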
[jira] [Closed] (ARROW-13505) [R] Installation of Arrow fails on Debian Gnu/Linux
[ https://issues.apache.org/jira/browse/ARROW-13505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane closed ARROW-13505. -- Resolution: Done > [R] Installation of Arrow fails on Debian Gnu/Linux > --- > > Key: ARROW-13505 > URL: https://issues.apache.org/jira/browse/ARROW-13505 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 5.0.0 > Environment: Debian Gnu/Linux 11 > R 4.0.4 >Reporter: Amit Ramon >Priority: Major > Labels: Linux, debian, r > Attachments: arrow-1.log > > > I'm trying to install Arrow on Debian Gnu/Linux using R 4.0.4. Arrow is not > installed, and I tried both `install.packages("arrow")` and the script > provided by the Arrow project that contains the `install_arrow()` function. > The installation always fails at some point after lots of compilation with a > message > {code:java} > /usr/bin/ld: cannot find > /home/amit/tmp/Rtmp54dYjQ/R.INSTALL49a67e7832c7/arrow/libarrow/arrow-5.0.0/lib: > file format not recognized{code} > I've tried calling the `install_arrow()` function in the following ways: > > {code:java} > install_arrow(binary = TRUE, minimal = TRUE, verbose = TRUE) > install_arrow(binary = FALSE, minimal = TRUE, verbose = TRUE) {code} > I also tried to install the Arrow binaries (using the command in > [https://arrow.apache.org/install/]) > and then ran the above commands again but got the same error. > I'm attaching the log from running the first command above. > > > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-12654) [R] Bundled C++ build fails with ccache
[ https://issues.apache.org/jira/browse/ARROW-12654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526574#comment-17526574 ] Jonathan Keane commented on ARROW-12654: We resolved an issue that looks very similar in ARROW-14638 and I suspect this will be resolved by that as well. I'm going to close this for now, but if you do still run into this issue after our next release (8.0.0, which should be done shortly) either re-open this, or please create a new Jira so we can dig into it. Thanks! > [R] Bundled C++ build fails with ccache > --- > > Key: ARROW-12654 > URL: https://issues.apache.org/jira/browse/ARROW-12654 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 4.0.0 > Environment: debian:buster-slim, pkg-config installed >Reporter: Stephan Radel >Priority: Major > Labels: Debian > > Dear Apache Team, > I'm not able to install Arrow 4.0.0 properly on Debian 10 via Docker. I tried > several switches from recommendation here: > [https://arrow.apache.org/docs/r/articles/install.html|https://arrow.apache.org/docs/r/articles/install.html.] > (ARROW_USE_PKG_CONFIG TRUE/FALSE, LIBARROW_BINARY TRUE/FALSE) but with no > success so far. > Here the latest log: > {code:java} > * installing *source* package ‘arrow’ ... > ** package ‘arrow’ successfully unpacked and MD5 sums checked > trying URL > 'https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/debian-10/arrow-4.0.0.zip' > Error in download.file(from_url, to_file, quiet = quietly) : > cannot open URL > 'https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/debian-10/arrow-4.0.0.zip' > *** No C++ binaries found for debian-10 > trying URL > 'https://arrow-r-nightly.s3.amazonaws.com/libarrow/src/arrow-4.0.0.zip' > Error in download.file(from_url, to_file, quiet = quietly) : > cannot open URL > 'https://arrow-r-nightly.s3.amazonaws.com/libarrow/src/arrow-4.0.0.zip' > trying URL > 'https://www.apache.org/dyn/closer.lua?action=download&filename=arrow/arrow-4.0.0/apache-arrow-4.0.0.tar.gz' > Content type 'application/octet-stream' length 9042294 bytes (8.6 MB) > == > downloaded 8.6 MB > *** Successfully retrieved C++ source > *** Building C++ libraries > *** Building with MAKEFLAGS= -j2 > cmake > trying URL > 'https://github.com/Kitware/CMake/releases/download/v3.19.2/cmake-3.19.2-Linux-x86_64.tar.gz' > Content type 'application/octet-stream' length 42931014 bytes (40.9 MB) > == > downloaded 40.9 MB > arrow with > SOURCE_DIR="/tmp/RtmpykibIC/file77bf2da3d338/apache-arrow-4.0.0/cpp" > BUILD_DIR="/tmp/RtmpykibIC/file77bf609e654f" DEST_DIR="libarrow/arrow-4.0.0" > CMAKE="/tmp/RtmpykibIC/file77bf5d6bcf01/cmake-3.19.2-Linux-x86_64/bin/cmake" > CC="ccache gcc" CXX="ccache g++ -std=gnu++11" LDFLAGS="-Wl,-z,relro" > ARROW_S3=OFF ARROW_MIMALLOC=OFF > ++ pwd > + : /tmp/Rtmpru2gYI/R.INSTALL779923ea9560/arrow > + : /tmp/RtmpykibIC/file77bf2da3d338/apache-arrow-4.0.0/cpp > + : /tmp/RtmpykibIC/file77bf609e654f > + : libarrow/arrow-4.0.0 > + : /tmp/RtmpykibIC/file77bf5d6bcf01/cmake-3.19.2-Linux-x86_64/bin/cmake > ++ cd /tmp/RtmpykibIC/file77bf2da3d338/apache-arrow-4.0.0/cpp > ++ pwd > + SOURCE_DIR=/tmp/RtmpykibIC/file77bf2da3d338/apache-arrow-4.0.0/cpp > ++ mkdir -p libarrow/arrow-4.0.0 > ++ cd libarrow/arrow-4.0.0 > ++ pwd > + DEST_DIR=/tmp/Rtmpru2gYI/R.INSTALL779923ea9560/arrow/libarrow/arrow-4.0.0 > + '[' '' = false ']' > + ARROW_DEFAULT_PARAM=OFF > + mkdir -p /tmp/RtmpykibIC/file77bf609e654f > + pushd /tmp/RtmpykibIC/file77bf609e654f > /tmp/RtmpykibIC/file77bf609e654f 
/tmp/Rtmpru2gYI/R.INSTALL779923ea9560/arrow > + /tmp/RtmpykibIC/file77bf5d6bcf01/cmake-3.19.2-Linux-x86_64/bin/cmake > -DARROW_BOOST_USE_SHARED=OFF -DARROW_BUILD_TESTS=OFF -DARROW_BUILD_SHARED=OFF > -DARROW_BUILD_STATIC=ON -DARROW_COMPUTE=ON -DARROW_CSV=ON -DARROW_DATASET=ON > -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_FILESYSTEM=ON -DARROW_JEMALLOC=ON > -DARROW_MIMALLOC=OFF -DARROW_JSON=ON -DARROW_PARQUET=ON -DARROW_S3=OFF > -DARROW_WITH_BROTLI=OFF -DARROW_WITH_BZ2=OFF -DARROW_WITH_LZ4=OFF > -DARROW_WITH_RE2=ON -DARROW_WITH_SNAPPY=OFF -DARROW_WITH_UTF8PROC=ON > -DARROW_WITH_ZLIB=OFF -DARROW_WITH_ZSTD=OFF -DCMAKE_BUILD_TYPE=Release > -DCMAKE_INSTALL_LIBDIR=lib > -DCMAKE_INSTALL_PREFIX=/tmp/Rtmpru2gYI/R.INSTALL779923ea9560/arrow/libarrow/arrow-4.0.0 > -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON > -DCMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY=ON -DCMAKE_UNITY_BUILD=ON -G 'Unix > Makefiles' /tmp/RtmpykibIC/file77bf2da3d338/apache-arrow-4.0.0/cpp > -- Building using CMake version: 3.19.2 > -- The C compiler identification is GNU 8.3.0 > -- The CXX compiler identification is GNU 8.3.0 > -- Detecting C compiler ABI info > --
[jira] [Closed] (ARROW-12654) [R] Bundled C++ build fails with ccache
[ https://issues.apache.org/jira/browse/ARROW-12654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane closed ARROW-12654. -- Resolution: Fixed > [R] Bundled C++ build fails with ccache > --- > > Key: ARROW-12654 > URL: https://issues.apache.org/jira/browse/ARROW-12654 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 4.0.0 > Environment: debian:buster-slim, pkg-config installed >Reporter: Stephan Radel >Priority: Major > Labels: Debian > > Dear Apache Team, > I'm not able to install Arrow 4.0.0 properly on Debian 10 via Docker. I tried > several switches from recommendation here: > [https://arrow.apache.org/docs/r/articles/install.html|https://arrow.apache.org/docs/r/articles/install.html.] > (ARROW_USE_PKG_CONFIG TRUE/FALSE, LIBARROW_BINARY TRUE/FALSE) but with no > success so far. > Here the latest log: > {code:java} > * installing *source* package ‘arrow’ ... > ** package ‘arrow’ successfully unpacked and MD5 sums checked > trying URL > 'https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/debian-10/arrow-4.0.0.zip' > Error in download.file(from_url, to_file, quiet = quietly) : > cannot open URL > 'https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/debian-10/arrow-4.0.0.zip' > *** No C++ binaries found for debian-10 > trying URL > 'https://arrow-r-nightly.s3.amazonaws.com/libarrow/src/arrow-4.0.0.zip' > Error in download.file(from_url, to_file, quiet = quietly) : > cannot open URL > 'https://arrow-r-nightly.s3.amazonaws.com/libarrow/src/arrow-4.0.0.zip' > trying URL > 'https://www.apache.org/dyn/closer.lua?action=download&filename=arrow/arrow-4.0.0/apache-arrow-4.0.0.tar.gz' > Content type 'application/octet-stream' length 9042294 bytes (8.6 MB) > == > downloaded 8.6 MB > *** Successfully retrieved C++ source > *** Building C++ libraries > *** Building with MAKEFLAGS= -j2 > cmake > trying URL > 'https://github.com/Kitware/CMake/releases/download/v3.19.2/cmake-3.19.2-Linux-x86_64.tar.gz' > Content type 'application/octet-stream' length 42931014 bytes (40.9 MB) > == > downloaded 40.9 MB > arrow with > SOURCE_DIR="/tmp/RtmpykibIC/file77bf2da3d338/apache-arrow-4.0.0/cpp" > BUILD_DIR="/tmp/RtmpykibIC/file77bf609e654f" DEST_DIR="libarrow/arrow-4.0.0" > CMAKE="/tmp/RtmpykibIC/file77bf5d6bcf01/cmake-3.19.2-Linux-x86_64/bin/cmake" > CC="ccache gcc" CXX="ccache g++ -std=gnu++11" LDFLAGS="-Wl,-z,relro" > ARROW_S3=OFF ARROW_MIMALLOC=OFF > ++ pwd > + : /tmp/Rtmpru2gYI/R.INSTALL779923ea9560/arrow > + : /tmp/RtmpykibIC/file77bf2da3d338/apache-arrow-4.0.0/cpp > + : /tmp/RtmpykibIC/file77bf609e654f > + : libarrow/arrow-4.0.0 > + : /tmp/RtmpykibIC/file77bf5d6bcf01/cmake-3.19.2-Linux-x86_64/bin/cmake > ++ cd /tmp/RtmpykibIC/file77bf2da3d338/apache-arrow-4.0.0/cpp > ++ pwd > + SOURCE_DIR=/tmp/RtmpykibIC/file77bf2da3d338/apache-arrow-4.0.0/cpp > ++ mkdir -p libarrow/arrow-4.0.0 > ++ cd libarrow/arrow-4.0.0 > ++ pwd > + DEST_DIR=/tmp/Rtmpru2gYI/R.INSTALL779923ea9560/arrow/libarrow/arrow-4.0.0 > + '[' '' = false ']' > + ARROW_DEFAULT_PARAM=OFF > + mkdir -p /tmp/RtmpykibIC/file77bf609e654f > + pushd /tmp/RtmpykibIC/file77bf609e654f > /tmp/RtmpykibIC/file77bf609e654f /tmp/Rtmpru2gYI/R.INSTALL779923ea9560/arrow > + /tmp/RtmpykibIC/file77bf5d6bcf01/cmake-3.19.2-Linux-x86_64/bin/cmake > -DARROW_BOOST_USE_SHARED=OFF -DARROW_BUILD_TESTS=OFF -DARROW_BUILD_SHARED=OFF > -DARROW_BUILD_STATIC=ON -DARROW_COMPUTE=ON -DARROW_CSV=ON -DARROW_DATASET=ON > -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_FILESYSTEM=ON -DARROW_JEMALLOC=ON > -DARROW_MIMALLOC=OFF 
-DARROW_JSON=ON -DARROW_PARQUET=ON -DARROW_S3=OFF > -DARROW_WITH_BROTLI=OFF -DARROW_WITH_BZ2=OFF -DARROW_WITH_LZ4=OFF > -DARROW_WITH_RE2=ON -DARROW_WITH_SNAPPY=OFF -DARROW_WITH_UTF8PROC=ON > -DARROW_WITH_ZLIB=OFF -DARROW_WITH_ZSTD=OFF -DCMAKE_BUILD_TYPE=Release > -DCMAKE_INSTALL_LIBDIR=lib > -DCMAKE_INSTALL_PREFIX=/tmp/Rtmpru2gYI/R.INSTALL779923ea9560/arrow/libarrow/arrow-4.0.0 > -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON > -DCMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY=ON -DCMAKE_UNITY_BUILD=ON -G 'Unix > Makefiles' /tmp/RtmpykibIC/file77bf2da3d338/apache-arrow-4.0.0/cpp > -- Building using CMake version: 3.19.2 > -- The C compiler identification is GNU 8.3.0 > -- The CXX compiler identification is GNU 8.3.0 > -- Detecting C compiler ABI info > -- Detecting C compiler ABI info - done > -- Check for working C compiler: /usr/bin/ccache - skipped > -- Detecting C compile features > -- Detecting C compile features - done > -- Detecting CXX compiler ABI info > -- Detecting CXX compiler ABI info - done > -- Check for working CXX compiler: /usr/bin/ccache - skipped > -- Detecting CXX compile features > -- Detecting
[jira] [Assigned] (ARROW-16281) [R] [CI] Bump versions with the release of 4.2
[ https://issues.apache.org/jira/browse/ARROW-16281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dragoș Moldovan-Grünfeld reassigned ARROW-16281: Assignee: Dragoș Moldovan-Grünfeld (was: Jacob Wujciak-Jens) > [R] [CI] Bump versions with the release of 4.2 > -- > > Key: ARROW-16281 > URL: https://issues.apache.org/jira/browse/ARROW-16281 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jonathan Keane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > > Now that R 4.2 is released, we should bump all of our R versions where we > have ones hardcoded. > This will mean dropping support for 3.4 entirely and adding in 4.0 to > https://github.com/apache/arrow/blob/c4b646e715d155c1f77d34804796864465caa97b/dev/tasks/r/github.linux.versions.yml#L34 > There are a few other places that we have hard-coded versions (we might need > to wait a few days for these to catch up): > https://github.com/apache/arrow/blob/c4b646e715d155c1f77d34804796864465caa97b/dev/tasks/tasks.yml#L1291-L1295 > https://github.com/apache/arrow/blob/c4b646e715d155c1f77d34804796864465caa97b/.github/workflows/r.yml#L60 > (and a few other places in that file — though one note: we build an old > version of windows that uses rtools35 in the GHA CI so that we catch when we > break that — we'll want to keep that!) -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-15163) [R] lubridate functions for 8.0.0
[ https://issues.apache.org/jira/browse/ARROW-15163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dragoș Moldovan-Grünfeld resolved ARROW-15163. -- Resolution: Fixed > [R] lubridate functions for 8.0.0 > - > > Key: ARROW-15163 > URL: https://issues.apache.org/jira/browse/ARROW-15163 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Alessandro Molina >Assignee: Jonathan Keane >Priority: Major > Fix For: 8.0.0 > > > *Umbrella ticket for the Initiative aimed at reaching support for the most > important lubridate functions in the R bindings* -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-9235) [R] Support for `connection` class when reading and writing files
[ https://issues.apache.org/jira/browse/ARROW-9235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-9235. Resolution: Fixed Issue resolved by pull request 12323 [https://github.com/apache/arrow/pull/12323] > [R] Support for `connection` class when reading and writing files > - > > Key: ARROW-9235 > URL: https://issues.apache.org/jira/browse/ARROW-9235 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Michael Quinn >Assignee: Dewey Dunnington >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 5h 50m > Remaining Estimate: 0h > > We have an internal filesystem that we interact with through objects that > inherit from the connection class. These files aren't necessarily local, > making it slightly more complicated to read and write parquet files, for > example. > For now, we're generating raw vectors and using that to create the file. For > example, to read files > {noformat} > ReadParquet <- function(filename, ...) { > file <- file(filename, "rb") > on.exit(close(file)) > raw <- readBin(file, "raw", FileInfo(filename)$size) > return(arrow::read_parquet(raw, ...)) > } > {noformat} > And to write, > {noformat} > WriteParquet <- function(df, filepath, ...) { > stream <- BufferOutputStream$create() > write_parquet(df, stream, ...) > raw <- stream$finish()$data() > file <- file(filepath, "wb") > on.exit(close(file)) > writeBin(raw, file) > return(invisible()) > } > {noformat} > At the C++ level, we are interacting with `R_new_custom_connection` defined > here: > [https://github.com/wch/r-source/blob/trunk/src/include/R_ext/Connections.h] > I've been very impressed with how feature-rich arrow is. It would be nice to > see this API supported as well. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16286) [C++] SimplifyWithGuarantee does not work with non-deterministic expressions
Weston Pace created ARROW-16286: --- Summary: [C++] SimplifyWithGuarantee does not work with non-deterministic expressions Key: ARROW-16286 URL: https://issues.apache.org/jira/browse/ARROW-16286 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace If an expression is non-deterministic (e.g. "random") then SimplifyWithGuarantee may incorrectly think it can fold constants. For example, if the call is {{random()}} then {{SimplifyWithGuarantee}} will detect that all the arguments are constants (or, more accurately, there are zero non-constant arguments) and decide it can execute the expression immediately and fold it into a constant. We could maybe add a hack for the random case since it is the only nullary function but, in general, we will probably need a way to define functions as "non-deterministic" and prevent constant folding. -- This message was sent by Atlassian Jira (v8.20.7#820007)
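To make the folding hazard concrete, here is a small self-contained sketch (hypothetical Python, not Arrow's C++ implementation): a naive folder that treats "zero non-constant arguments" as foldable will collapse {{random()}} into a single frozen value.

{code:python}
import random

FUNCS = {"add": lambda a, b: a + b, "random": random.random}

def fold_constants(call):
    """Naive constant folding: evaluate a call whose arguments are all
    constants. This is exactly the step that is unsafe for
    non-deterministic functions."""
    name, args = call
    if all(isinstance(a, (int, float)) for a in args):
        return FUNCS[name](*args)
    return call

print(fold_constants(("add", (1, 2))))  # 3 -- safe: add is deterministic
print(fold_constants(("random", ())))   # one frozen value -- incorrect fold
{code}

Marking functions as non-deterministic and skipping them in the fold, as the issue suggests, removes the second case.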
[jira] [Updated] (ARROW-16189) [CI][C++] Implement CI on Apple M1 for C++
[ https://issues.apache.org/jira/browse/ARROW-16189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacob Wujciak-Jens updated ARROW-16189: --- Parent: ARROW-10657 Issue Type: Sub-task (was: Improvement) > [CI][C++] Implement CI on Apple M1 for C++ > -- > > Key: ARROW-16189 > URL: https://issues.apache.org/jira/browse/ARROW-16189 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++, Continuous Integration >Reporter: Jacob Wujciak-Jens >Priority: Critical > Fix For: 9.0.0 > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16190) [CI][R] Implement CI on Apple M1 for R
[ https://issues.apache.org/jira/browse/ARROW-16190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacob Wujciak-Jens updated ARROW-16190: --- Parent: ARROW-10657 Issue Type: Sub-task (was: Improvement) > [CI][R] Implement CI on Apple M1 for R > -- > > Key: ARROW-16190 > URL: https://issues.apache.org/jira/browse/ARROW-16190 > Project: Apache Arrow > Issue Type: Sub-task > Components: Continuous Integration, R >Reporter: Jacob Wujciak-Jens >Priority: Critical > Fix For: 9.0.0 > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16285) [CI][Python] Enable skipped kartothek integration tests
[ https://issues.apache.org/jira/browse/ARROW-16285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526561#comment-17526561 ] Jacob Wujciak-Jens commented on ARROW-16285: This issue is blocked until kartothek fixes the linked issue. > [CI][Python] Enable skipped kartothek integration tests > > > Key: ARROW-16285 > URL: https://issues.apache.org/jira/browse/ARROW-16285 > Project: Apache Arrow > Issue Type: Task > Components: Continuous Integration, Python >Reporter: Jacob Wujciak-Jens >Priority: Critical > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-12203) [C++][Python] Switch default Parquet version to 2.4
[ https://issues.apache.org/jira/browse/ARROW-12203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526558#comment-17526558 ] Antoine Pitrou commented on ARROW-12203: Marking this as critical for 9.0 so that we finally do it. > [C++][Python] Switch default Parquet version to 2.4 > --- > > Key: ARROW-12203 > URL: https://issues.apache.org/jira/browse/ARROW-12203 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Python >Reporter: Antoine Pitrou >Priority: Critical > Fix For: 9.0.0 > > > Currently, Parquet write APIs default to maximum-compatibility Parquet > version "1.0", which disables some logical types such as UINT32. We may want > to switch the default to "2.0" instead, to allow faithful representation of > more types. -- This message was sent by Atlassian Jira (v8.20.7#820007)
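Until the default changes, the format version is an explicit knob on the write APIs; a brief hedged sketch (parameter values per the pyarrow docs of this era, with "2.4" being the proposed new default):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# UINT32 is one of the logical types that the "1.0" compatibility default
# cannot represent faithfully.
table = pa.table({"u": pa.array([1, 2, 3], type=pa.uint32())})
pq.write_table(table, "u.parquet", version="2.4")
{code}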
[jira] [Commented] (ARROW-16181) [CI][C++] Valgrind failure in TPCH node tests
[ https://issues.apache.org/jira/browse/ARROW-16181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526559#comment-17526559 ] Krisztian Szucs commented on ARROW-16181: - [~apitrou] is this still valid? > [CI][C++] Valgrind failure in TPCH node tests > - > > Key: ARROW-16181 > URL: https://issues.apache.org/jira/browse/ARROW-16181 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration >Reporter: Antoine Pitrou >Priority: Critical > Fix For: 8.0.0 > > > See > [https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=23077&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=7667] > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-12203) [C++][Python] Switch default Parquet version to 2.4
[ https://issues.apache.org/jira/browse/ARROW-12203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-12203: --- Priority: Critical (was: Major) > [C++][Python] Switch default Parquet version to 2.4 > --- > > Key: ARROW-12203 > URL: https://issues.apache.org/jira/browse/ARROW-12203 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Python >Reporter: Antoine Pitrou >Priority: Critical > Fix For: 9.0.0 > > > Currently, Parquet write APIs default to maximum-compatibility Parquet > version "1.0", which disables some logical types such as UINT32. We may want > to switch the default to "2.0" instead, to allow faithful representation of > more types. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16285) [CI][Python] Enable skipped kartothek integration tests
[ https://issues.apache.org/jira/browse/ARROW-16285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacob Wujciak-Jens updated ARROW-16285: --- Summary: [CI][Python] Enable skipped kartothek integration tests (was: [CI][Python} Enable skipped kartothek integration tests ) > [CI][Python] Enable skipped kartothek integration tests > > > Key: ARROW-16285 > URL: https://issues.apache.org/jira/browse/ARROW-16285 > Project: Apache Arrow > Issue Type: Task > Components: Continuous Integration, Python >Reporter: Jacob Wujciak-Jens >Priority: Critical > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16285) [CI][Python} Enable skipped kartothek integration tests
Jacob Wujciak-Jens created ARROW-16285: -- Summary: [CI][Python} Enable skipped kartothek integration tests Key: ARROW-16285 URL: https://issues.apache.org/jira/browse/ARROW-16285 Project: Apache Arrow Issue Type: Task Components: Continuous Integration, Python Reporter: Jacob Wujciak-Jens -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-12203) [C++][Python] Switch default Parquet version to 2.4
[ https://issues.apache.org/jira/browse/ARROW-12203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526556#comment-17526556 ] Krisztian Szucs commented on ARROW-12203: - Postponing to 9.0 > [C++][Python] Switch default Parquet version to 2.4 > --- > > Key: ARROW-12203 > URL: https://issues.apache.org/jira/browse/ARROW-12203 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Python >Reporter: Antoine Pitrou >Priority: Major > Fix For: 8.0.0 > > > Currently, Parquet write APIs default to maximum-compatibility Parquet > version "1.0", which disables some logical types such as UINT32. We may want > to switch the default to "2.0" instead, to allow faithful representation of > more types. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-12203) [C++][Python] Switch default Parquet version to 2.4
[ https://issues.apache.org/jira/browse/ARROW-12203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-12203: Fix Version/s: 9.0.0 (was: 8.0.0) > [C++][Python] Switch default Parquet version to 2.4 > --- > > Key: ARROW-12203 > URL: https://issues.apache.org/jira/browse/ARROW-12203 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Python >Reporter: Antoine Pitrou >Priority: Major > Fix For: 9.0.0 > > > Currently, Parquet write APIs default to maximum-compatibility Parquet > version "1.0", which disables some logical types such as UINT32. We may want > to switch the default to "2.0" instead, to allow faithful representation of > more types. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-16182) [C++][CI] TPCH node tests timeout under ThreadSanitizer
[ https://issues.apache.org/jira/browse/ARROW-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs resolved ARROW-16182. - Resolution: Fixed > [C++][CI] TPCH node tests timeout under ThreadSanitizer > --- > > Key: ARROW-16182 > URL: https://issues.apache.org/jira/browse/ARROW-16182 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration >Reporter: Antoine Pitrou >Assignee: Sasha Krassovsky >Priority: Critical > Fix For: 8.0.0 > > > See > https://github.com/ursacomputing/crossbow/runs/6000716964?check_suite_focus=true#step:5:4854 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16182) [C++][CI] TPCH node tests timeout under ThreadSanitizer
[ https://issues.apache.org/jira/browse/ARROW-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526554#comment-17526554 ] Krisztian Szucs commented on ARROW-16182: - Seems to be resolved now https://github.com/apache/arrow/pull/12843#issuecomment-1106673435 > [C++][CI] TPCH node tests timeout under ThreadSanitizer > --- > > Key: ARROW-16182 > URL: https://issues.apache.org/jira/browse/ARROW-16182 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration >Reporter: Antoine Pitrou >Assignee: Sasha Krassovsky >Priority: Critical > Fix For: 8.0.0 > > > See > https://github.com/ursacomputing/crossbow/runs/6000716964?check_suite_focus=true#step:5:4854 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-16257) [R] Break-up as_date and as_datetime into individual functions
[ https://issues.apache.org/jira/browse/ARROW-16257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dragoș Moldovan-Grünfeld resolved ARROW-16257. -- Resolution: Fixed > [R] Break-up as_date and as_datetime into individual functions > -- > > Key: ARROW-16257 > URL: https://issues.apache.org/jira/browse/ARROW-16257 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Dragoș Moldovan-Grünfeld >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Fix For: 8.0.0 > > > A follow-up from > [ARROW-15800|https://issues.apache.org/jira/browse/ARROW-15800]. > See also: https://github.com/apache/arrow/pull/12738#discussion_r854329903 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-15801) [R] Implement bindings for lubridate date-time helpers
[ https://issues.apache.org/jira/browse/ARROW-15801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dragoș Moldovan-Grünfeld resolved ARROW-15801. -- Resolution: Fixed > [R] Implement bindings for lubridate date-time helpers > -- > > Key: ARROW-15801 > URL: https://issues.apache.org/jira/browse/ARROW-15801 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dragoș Moldovan-Grünfeld >Assignee: Jonathan Keane >Priority: Major > Fix For: 8.0.0 > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-16219) [CI][Python] Install failure on s390x
[ https://issues.apache.org/jira/browse/ARROW-16219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs resolved ARROW-16219. - Resolution: Fixed Issue resolved by pull request 12945 [https://github.com/apache/arrow/pull/12945] > [CI][Python] Install failure on s390x > - > > Key: ARROW-16219 > URL: https://issues.apache.org/jira/browse/ARROW-16219 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, Python >Reporter: Antoine Pitrou >Assignee: Raúl Cumplido >Priority: Blocker > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > Seems to happen quite reliably on Travis-CI: > https://app.travis-ci.com/github/apache/arrow/builds/249511328 > Perhaps just a matter of setting the SETUPTOOLS_SCM_VERSION environment > variable to some dummy value? -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-16278) [CI] Git installation failure on homebrew
[ https://issues.apache.org/jira/browse/ARROW-16278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs resolved ARROW-16278. - Resolution: Fixed Issue resolved by pull request 12958 [https://github.com/apache/arrow/pull/12958] > [CI] Git installation failure on homebrew > - > > Key: ARROW-16278 > URL: https://issues.apache.org/jira/browse/ARROW-16278 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Raúl Cumplido >Assignee: Raúl Cumplido >Priority: Blocker > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Some builds are failing due to git unable to install on homebrew. This seems > to be related to the new git release: > _With the fixes for CVE-2022-24765 that are common with versions of_ > _Git 2.30.4, 2.31.3, 2.32.2, 2.33.3, 2.34.3, and 2.35.3, Git has_ > _been taught not to recognise repositories owned by other users, in_ > _order to avoid getting affected by their config files and hooks._ > _You can list the path to the safe/trusted repositories that may be_ > _owned by others on a multi-valued configuration variable_ > _safe.directory to override this behaviour, or use '*' to declare_ > _that you trust anything._ > Failed job example > https://github.com/apache/arrow/runs/6114985460?check_suite_focus=true: > {code:java} > Installing automake > Installing aws-sdk-cpp > Installing boost > Using brotli > Using c-ares > Installing ccache > Using cmake > Installing flatbuffers > Installing git > ==> Downloading https://ghcr.io/v2/homebrew/core/git/manifests/2.36.0 > ==> Downloading > https://ghcr.io/v2/homebrew/core/git/blobs/sha256:5739e703f9ad34dba01e343d76f363143f740bf6e05c945c8f19a073546c6ce5 > ==> Downloading from > https://pkg-containers.githubusercontent.com/ghcr1/blobs/sha256:5739e703f9ad34dba01e343d76f363143f740bf6e05c945c8f19a073546c6ce5?se=2022-04-21T18%3A35%3A00Z&sig=ZdiaSBdomnIwd4Ga4PORXPs2%2FYZXrrLLaks61mgmyEs%3D&sp=r&spr=https&sr=b&sv=2019-12-12 > ==> Pouring git--2.36.0.big_sur.bottle.tar.gz > Error: The `brew link` step did not complete successfully > The formula built, but is not symlinked into /usr/local > Could not symlink etc/bash_completion.d/git-completion.bash > Target /usr/local/etc/bash_completion.d/git-completion.bash > is a symlink belonging to git@2.35.1. You can unlink it: > brew unlink git@2.35.1To force the link and overwrite all conflicting files: > brew link --overwrite gitTo list all files that would be deleted: > brew link --overwrite --dry-run gitPossible conflicting files are: > /usr/local/etc/bash_completion.d/git-completion.bash -> > /usr/local/Cellar/git@2.35.1/2.35.1/etc/bash_completion.d/git-completion.bash > /usr/local/etc/bash_completion.d/git-prompt.sh -> > /usr/local/Cellar/git@2.35.1/2.35.1/etc/bash_completion.d/git-prompt.sh > /usr/local/bin/git -> /usr/local/Cellar/git@2.35.1/2.35.1/bin/git > /usr/local/bin/git-cvsserver -> > /usr/local/Cellar/git@2.35.1/2.35.1/bin/git-cvsserver > /usr/local/bin/git-receive-pack -> > /usr/local/Cellar/git@2.35.1/2.35.1/bin/git-receive-pack > /usr/local/bin/git-shell -> /usr/local/Cellar/git@2.35.1/2.35.1/bin/git-shell > /usr/local/bin/git-upload-archive -> > /usr/local/Cellar/git@2.35.1/2.35.1/bin/git-upload-archive > /usr/local/bin/git-upload-pack -> > /usr/local/Cellar/git@2.35.1/2.35.1/bin/git-upload-pack > Error: Could not symlink share/doc/git-doc/MyFirstContribution.html > Target /usr/local/share/doc/git-doc/MyFirstContribution.html > is a symlink belonging to git@2.35.1. 
You can unlink it: > brew unlink git@2.35.1To force the link and overwrite all conflicting files: > brew link --overwrite git@2.35.1To list all files that would be deleted: > brew link --overwrite --dry-run git@2.35.1 > Installing git has failed! > Installing glog > Installing grpc > Using llvm > Installing llvm@12 > Using lz4 > Installing minio > Installing ninja > Installing numpy > Using openssl@1.1 > Installing protobuf > Using python > Installing rapidjson > Installing snappy > Installing thrift > Using wget > Using zstd > Homebrew Bundle failed! 1 Brewfile dependency failed to install. > Error: Process completed with exit code 1. {code} > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16121) [Python] Deprecate the (common_)metadata(_path) attributes of ParquetDataset
[ https://issues.apache.org/jira/browse/ARROW-16121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-16121: -- Fix Version/s: 8.0.0 (was: 9.0.0) > [Python] Deprecate the (common_)metadata(_path) attributes of ParquetDataset > > > Key: ARROW-16121 > URL: https://issues.apache.org/jira/browse/ARROW-16121 > Project: Apache Arrow > Issue Type: Sub-task > Components: Python >Reporter: Joris Van den Bossche >Assignee: Alenka Frim >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > The custom python ParquetDataset implementation exposes the {{metadata}}, > {{metadata_path}}, {{common_metadata}} and {{common_metadata_path}} > attributes, something for which we didn't add an equivalent to the new > dataset API. > Unless we still want to add something for this, we should deprecate those > attributes in the legacy ParquetDataset. > In addition, we should also deprecate passing the {{metadata}} keyword in the > ParquetDataset constructor. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16121) [Python] Deprecate the (common_)metadata(_path) attributes of ParquetDataset
[ https://issues.apache.org/jira/browse/ARROW-16121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526530#comment-17526530 ] Joris Van den Bossche commented on ARROW-16121: --- Moved back to 8.0.0 milestone, the PR is ready > [Python] Deprecate the (common_)metadata(_path) attributes of ParquetDataset > > > Key: ARROW-16121 > URL: https://issues.apache.org/jira/browse/ARROW-16121 > Project: Apache Arrow > Issue Type: Sub-task > Components: Python >Reporter: Joris Van den Bossche >Assignee: Alenka Frim >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > The custom python ParquetDataset implementation exposes the {{metadata}}, > {{metadata_path}}, {{common_metadata}} and {{common_metadata_path}} > attributes, something for which we didn't add an equivalent to the new > dataset API. > Unless we still want to add something for this, we should deprecate those > attributes in the legacy ParquetDataset. > In addition, we should also deprecate passing the {{metadata}} keyword in the > ParquetDataset constructor. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-16204) [C++][Dataset] Default error existing_data_behaviour for writing dataset ignores a single file
[ https://issues.apache.org/jira/browse/ARROW-16204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche resolved ARROW-16204. --- Resolution: Fixed Issue resolved by pull request 12898 [https://github.com/apache/arrow/pull/12898] > [C++][Dataset] Default error existing_data_behaviour for writing dataset > ignores a single file > -- > > Key: ARROW-16204 > URL: https://issues.apache.org/jira/browse/ARROW-16204 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: dataset, pull-request-available > Fix For: 8.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > While trying to understand a failing test in > https://github.com/apache/arrow/pull/12811#discussion_r851128672, I noticed > that the {{write_dataset}} function does not actually always raise an error > by default if there is already existing data in the target location. > The documentation says it will raise "if any data exists in the destination" > (which is also what I would expect), but in practice it seems that it does > ignore certain file names: > {code:python} > import pyarrow as pa > import pyarrow.dataset as ds > table = pa.table({'a': [1, 2, 3]}) > # write a first time to new directory: OK > >>> ds.write_dataset(table, "test_overwrite", format="parquet") > >>> !ls test_overwrite > part-0.parquet > # write a second time to the same directory: passes, but should raise? > >>> ds.write_dataset(table, "test_overwrite", format="parquet") > >>> !ls test_overwrite > part-0.parquet > # write another time to the same directory with different name: still passes > >>> ds.write_dataset(table, "test_overwrite", format="parquet", > >>> basename_template="data-{i}.parquet") > >>> !ls test_overwrite > data-0.parquet part-0.parquet > # now writing again finally raises an error > >>> ds.write_dataset(table, "test_overwrite", format="parquet") > ... > ArrowInvalid: Could not write to test_overwrite as the directory is not empty > and existing_data_behavior is to error > {code} > So it seems that, when checking whether existing data exists, it ignores > any files that match the basename template pattern. > cc [~westonpace] do you know if this was intentional? (I would find that a > strange corner case, and in any case it is also not documented) -- This message was sent by Atlassian Jira (v8.20.7#820007)
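Whatever the default ends up doing, callers can make the intent explicit; a short hedged sketch continuing the snippet above (option values per the pyarrow.dataset documentation):

{code:python}
import pyarrow.dataset as ds

# "error" (the default), "overwrite_or_ignore", and "delete_matching" are the
# documented options; being explicit sidesteps the corner case above.
ds.write_dataset(table, "test_overwrite", format="parquet",
                 existing_data_behavior="overwrite_or_ignore")
{code}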
[jira] [Resolved] (ARROW-15800) [R] Implement bindings for lubridate::as_date() and lubridate::as_datetime()
[ https://issues.apache.org/jira/browse/ARROW-15800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15800. Resolution: Fixed Issue resolved by pull request 12738 [https://github.com/apache/arrow/pull/12738] > [R] Implement bindings for lubridate::as_date() and lubridate::as_datetime() > > > Key: ARROW-15800 > URL: https://issues.apache.org/jira/browse/ARROW-15800 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Dragoș Moldovan-Grünfeld >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 9h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16284) [Python][Packaging] Use delocate-fuse to create universal2 wheels
[ https://issues.apache.org/jira/browse/ARROW-16284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16284: --- Labels: pull-request-available (was: ) > [Python][Packaging] Use delocate-fuse to create universal2 wheels > - > > Key: ARROW-16284 > URL: https://issues.apache.org/jira/browse/ARROW-16284 > Project: Apache Arrow > Issue Type: Improvement > Components: Packaging, Python >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Previously we used specific universal2 configurations for vcpkg to build the > dependencies containing symbols for both architectures. This approach proved > to be fragile to vcpkg changes making it hard to upgrade the vcpkg version. > As an example https://github.com/apache/arrow/pull/12893 bumps the vcpkg > version where absl has stopped compiling for two CMAKE_OSX_ARCHITECTURES, it > has been already fixed in absl's upstream but that hasn't been released yet. > The new approach uses multibuild's delocate to build the wheels for both > arm64 and amd64 separately and fuse them in an upcoming step to a universal2 > wheel (using {{lipo}} under the hood). -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16284) [Python][Packaging] Use delocate-fuse to create universal2 wheels
[ https://issues.apache.org/jira/browse/ARROW-16284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-16284: Fix Version/s: 8.0.0 > [Python][Packaging] Use delocate-fuse to create universal2 wheels > - > > Key: ARROW-16284 > URL: https://issues.apache.org/jira/browse/ARROW-16284 > Project: Apache Arrow > Issue Type: Improvement > Components: Packaging, Python >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Fix For: 8.0.0 > > > Previously we used specific universal2 configurations for vcpkg to build the > dependencies containing symbols for both architectures. This approach proved > to be fragile to vcpkg changes making it hard to upgrade the vcpkg version. > As an example https://github.com/apache/arrow/pull/12893 bumps the vcpkg > version where absl has stopped compiling for two CMAKE_OSX_ARCHITECTURES, it > has been already fixed in absl's upstream but that hasn't been released yet. > The new approach uses multibuild's delocate to build the wheels for both > arm64 and amd64 separately and fuse them in an upcoming step to a universal2 > wheel (using {{lipo}} under the hood). -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16284) [Python][Packaging] Use delocate-fuse to create universal2 wheels
Krisztian Szucs created ARROW-16284: --- Summary: [Python][Packaging] Use delocate-fuse to create universal2 wheels Key: ARROW-16284 URL: https://issues.apache.org/jira/browse/ARROW-16284 Project: Apache Arrow Issue Type: Improvement Components: Packaging, Python Reporter: Krisztian Szucs Previously we used specific universal2 configurations for vcpkg to build the dependencies containing symbols for both architectures. This approach proved to be fragile to vcpkg changes, making it hard to upgrade the vcpkg version. As an example, https://github.com/apache/arrow/pull/12893 bumps the vcpkg version where absl has stopped compiling for two CMAKE_OSX_ARCHITECTURES; it has already been fixed in absl's upstream, but that hasn't been released yet. The new approach uses multibuild's delocate to build the wheels for arm64 and amd64 separately and fuse them in a subsequent step into a universal2 wheel (using {{lipo}} under the hood). -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Assigned] (ARROW-16284) [Python][Packaging] Use delocate-fuse to create universal2 wheels
[ https://issues.apache.org/jira/browse/ARROW-16284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs reassigned ARROW-16284: --- Assignee: Krisztian Szucs > [Python][Packaging] Use delocate-fuse to create universal2 wheels > - > > Key: ARROW-16284 > URL: https://issues.apache.org/jira/browse/ARROW-16284 > Project: Apache Arrow > Issue Type: Improvement > Components: Packaging, Python >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > > Previously we used specific universal2 configurations for vcpkg to build the > dependencies containing symbols for both architectures. This approach proved > to be fragile to vcpkg changes making it hard to upgrade the vcpkg version. > As an example https://github.com/apache/arrow/pull/12893 bumps the vcpkg > version where absl has stopped compiling for two CMAKE_OSX_ARCHITECTURES, it > has been already fixed in absl's upstream but that hasn't been released yet. > The new approach uses multibuild's delocate to build the wheels for both > arm64 and amd64 separately and fuse them in an upcoming step to a universal2 > wheel (using {{lipo}} under the hood). -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16283) [Go] Cleanup Panics in new Buffered Reader
Matthew Topol created ARROW-16283: - Summary: [Go] Cleanup Panics in new Buffered Reader Key: ARROW-16283 URL: https://issues.apache.org/jira/browse/ARROW-16283 Project: Apache Arrow Issue Type: Improvement Components: Go Reporter: Matthew Topol Assignee: Matthew Topol -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-6390) [Python][Flight] Add Python documentation / tutorial for Flight
[ https://issues.apache.org/jira/browse/ARROW-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526515#comment-17526515 ] David Li commented on ARROW-6390: - ARROW-16065 adds a basic Python/Flight documentation page. Maybe further work can be done in the cookbook? > [Python][Flight] Add Python documentation / tutorial for Flight > --- > > Key: ARROW-6390 > URL: https://issues.apache.org/jira/browse/ARROW-6390 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, FlightRPC, Python >Reporter: Wes McKinney >Assignee: Alessandro Molina >Priority: Major > > There is no Sphinx documentation for using Flight from Python. I have found > that writing documentation is an effective way to uncover usability problems > -- I would suggest we write comprehensive documentation for using Flight from > Python as a way to refine the public Python API -- This message was sent by Atlassian Jira (v8.20.7#820007)
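[Editor's note] To give a flavor of what such a documentation page can cover, here is a minimal, self-contained Flight server and client sketch. The port, ticket contents, and table are arbitrary choices for illustration, not the content of the documentation page itself:
{code:python}
import pyarrow as pa
import pyarrow.flight as flight

class TinyServer(flight.FlightServerBase):
    """Serves one fixed table, regardless of the ticket the client sends."""

    def do_get(self, context, ticket):
        table = pa.table({"x": [1, 2, 3]})
        return flight.RecordBatchStream(table)

# Server side (blocking call, typically run in its own process):
#     TinyServer("grpc://0.0.0.0:8815").serve()

# Client side, once a server is listening:
client = flight.connect("grpc://localhost:8815")
reader = client.do_get(flight.Ticket(b"anything"))
print(reader.read_all())
{code}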
[jira] [Commented] (ARROW-16282) [CI] [C#] Verify release on c-sharp has been failing since upgrading ubuntu to 22.04
[ https://issues.apache.org/jira/browse/ARROW-16282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526514#comment-17526514 ] Krisztian Szucs commented on ARROW-16282: - cc [~eerhardt] > [CI] [C#] Verify release on c-sharp has been failing since upgrading ubuntu > to 22.04 > - > > Key: ARROW-16282 > URL: https://issues.apache.org/jira/browse/ARROW-16282 > Project: Apache Arrow > Issue Type: Bug > Components: C#, Continuous Integration >Reporter: Raúl Cumplido >Priority: Blocker > Fix For: 8.0.0 > > > We upgraded the verify-release job for c-sharp from Ubuntu 20.04 to Ubuntu > 22.04 and we can see how the nightly release job has been failing since then. > Working for ubuntu 20.04 on 2022-04-08: > [https://github.com/ursacomputing/crossbow/tree/nightly-release-2022-04-08-0-github-verify-rc-source-csharp-linux-ubuntu-20.04-amd64] > Failing for ubuntu 22.04 on 2022-04-09: > [https://github.com/ursacomputing/crossbow/tree/nightly-release-2022-04-09-0-github-verify-rc-source-csharp-linux-ubuntu-22.04-amd64] > The error seems to be related to a missing libssl: > {code:java} > === > Build and test C# libraries > === > └ Ensuring that C# is installed... > └ Installed C# at (.NET 3.1.405)Welcome to .NET Core 3.1! > - > SDK Version: 3.1.405Telemetry > - > The .NET Core tools collect usage data in order to help us improve your > experience. It is collected by Microsoft and shared with the community. You > can opt-out of telemetry by setting the DOTNET_CLI_TELEMETRY_OPTOUT > environment variable to '1' or 'true' using your favorite shell.Read more > about .NET Core CLI Tools telemetry: > https://aka.ms/dotnet-cli-telemetry > Explore documentation: https://aka.ms/dotnet-docs > Report issues and find source on GitHub: https://github.com/dotnet/core > Find out what's new: https://aka.ms/dotnet-whats-new > Learn about the installed HTTPS developer cert: > https://aka.ms/aspnet-core-https > Use 'dotnet --help' to see available commands or visit: > https://aka.ms/dotnet-cli-docs > Write your first app: https://aka.ms/first-net-core-app > -- > No usable version of libssl was found > /arrow/dev/release/verify-release-candidate.sh: line 325: 49 Aborted > (core dumped) dotnet tool install --tool-path ${csharp_bin} > sourcelink > Failed to verify release candidate. See /tmp/arrow-HEAD.CiwJM for details. > 134 > Error: `docker-compose --file > /home/runner/work/crossbow/crossbow/arrow/docker-compose.yml run --rm -e > VERIFY_VERSION= -e VERIFY_RC= -e TEST_DEFAULT=0 -e TEST_CSHARP=1 > ubuntu-verify-rc` exited with a non-zero exit code 134, see the process log > above.{code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] (ARROW-16282) [CI] [C#] Verify release on c-sharp has been failing since upgrading ubuntu to 22.04
[ https://issues.apache.org/jira/browse/ARROW-16282 ] Krisztian Szucs deleted comment on ARROW-16282: - was (Author: kszucs): cc @eerhardt > [CI] [C#] Verify release on c-sharp has been failing since upgrading ubuntu > to 22.04 > - > > Key: ARROW-16282 > URL: https://issues.apache.org/jira/browse/ARROW-16282 > Project: Apache Arrow > Issue Type: Bug > Components: C#, Continuous Integration >Reporter: Raúl Cumplido >Priority: Blocker > Fix For: 8.0.0 > > > We upgraded the verify-release job for c-sharp from Ubuntu 20.04 to Ubuntu > 22.04 and we can see how the nightly release job has been failing since then. > Working for ubuntu 20.04 on 2022-04-08: > [https://github.com/ursacomputing/crossbow/tree/nightly-release-2022-04-08-0-github-verify-rc-source-csharp-linux-ubuntu-20.04-amd64] > Failing for ubuntu 22.04 on 2022-04-09: > [https://github.com/ursacomputing/crossbow/tree/nightly-release-2022-04-09-0-github-verify-rc-source-csharp-linux-ubuntu-22.04-amd64] > The error seems to be related to a missing libssl: > {code:java} > === > Build and test C# libraries > === > └ Ensuring that C# is installed... > └ Installed C# at (.NET 3.1.405)Welcome to .NET Core 3.1! > - > SDK Version: 3.1.405Telemetry > - > The .NET Core tools collect usage data in order to help us improve your > experience. It is collected by Microsoft and shared with the community. You > can opt-out of telemetry by setting the DOTNET_CLI_TELEMETRY_OPTOUT > environment variable to '1' or 'true' using your favorite shell.Read more > about .NET Core CLI Tools telemetry: > https://aka.ms/dotnet-cli-telemetry > Explore documentation: https://aka.ms/dotnet-docs > Report issues and find source on GitHub: https://github.com/dotnet/core > Find out what's new: https://aka.ms/dotnet-whats-new > Learn about the installed HTTPS developer cert: > https://aka.ms/aspnet-core-https > Use 'dotnet --help' to see available commands or visit: > https://aka.ms/dotnet-cli-docs > Write your first app: https://aka.ms/first-net-core-app > -- > No usable version of libssl was found > /arrow/dev/release/verify-release-candidate.sh: line 325: 49 Aborted > (core dumped) dotnet tool install --tool-path ${csharp_bin} > sourcelink > Failed to verify release candidate. See /tmp/arrow-HEAD.CiwJM for details. > 134 > Error: `docker-compose --file > /home/runner/work/crossbow/crossbow/arrow/docker-compose.yml run --rm -e > VERIFY_VERSION= -e VERIFY_RC= -e TEST_DEFAULT=0 -e TEST_CSHARP=1 > ubuntu-verify-rc` exited with a non-zero exit code 134, see the process log > above.{code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16282) [CI] [C#] Verify release on c-sharp has been failing since upgrading ubuntu to 22.04
[ https://issues.apache.org/jira/browse/ARROW-16282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526513#comment-17526513 ] Krisztian Szucs commented on ARROW-16282: - cc @eerhardt > [CI] [C#] Verify release on c-sharp has been failing since upgrading ubuntu > to 22.04 > - > > Key: ARROW-16282 > URL: https://issues.apache.org/jira/browse/ARROW-16282 > Project: Apache Arrow > Issue Type: Bug > Components: C#, Continuous Integration >Reporter: Raúl Cumplido >Priority: Blocker > Fix For: 8.0.0 > > > We upgraded the verify-release job for c-sharp from Ubuntu 20.04 to Ubuntu > 22.04 and we can see how the nightly release job has been failing since then. > Working for ubuntu 20.04 on 2022-04-08: > [https://github.com/ursacomputing/crossbow/tree/nightly-release-2022-04-08-0-github-verify-rc-source-csharp-linux-ubuntu-20.04-amd64] > Failing for ubuntu 22.04 on 2022-04-09: > [https://github.com/ursacomputing/crossbow/tree/nightly-release-2022-04-09-0-github-verify-rc-source-csharp-linux-ubuntu-22.04-amd64] > The error seems to be related to a missing libssl: > {code:java} > === > Build and test C# libraries > === > └ Ensuring that C# is installed... > └ Installed C# at (.NET 3.1.405)Welcome to .NET Core 3.1! > - > SDK Version: 3.1.405Telemetry > - > The .NET Core tools collect usage data in order to help us improve your > experience. It is collected by Microsoft and shared with the community. You > can opt-out of telemetry by setting the DOTNET_CLI_TELEMETRY_OPTOUT > environment variable to '1' or 'true' using your favorite shell.Read more > about .NET Core CLI Tools telemetry: > https://aka.ms/dotnet-cli-telemetry > Explore documentation: https://aka.ms/dotnet-docs > Report issues and find source on GitHub: https://github.com/dotnet/core > Find out what's new: https://aka.ms/dotnet-whats-new > Learn about the installed HTTPS developer cert: > https://aka.ms/aspnet-core-https > Use 'dotnet --help' to see available commands or visit: > https://aka.ms/dotnet-cli-docs > Write your first app: https://aka.ms/first-net-core-app > -- > No usable version of libssl was found > /arrow/dev/release/verify-release-candidate.sh: line 325: 49 Aborted > (core dumped) dotnet tool install --tool-path ${csharp_bin} > sourcelink > Failed to verify release candidate. See /tmp/arrow-HEAD.CiwJM for details. > 134 > Error: `docker-compose --file > /home/runner/work/crossbow/crossbow/arrow/docker-compose.yml run --rm -e > VERIFY_VERSION= -e VERIFY_RC= -e TEST_DEFAULT=0 -e TEST_CSHARP=1 > ubuntu-verify-rc` exited with a non-zero exit code 134, see the process log > above.{code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-9825) [FlightRPC] Add a "Flight SQL" extension on top of FlightRPC
[ https://issues.apache.org/jira/browse/ARROW-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-9825. - Resolution: Fixed This was added; we can open follow-up tickets for any other work. > [FlightRPC] Add a "Flight SQL" extension on top of FlightRPC > - > > Key: ARROW-9825 > URL: https://issues.apache.org/jira/browse/ARROW-9825 > Project: Apache Arrow > Issue Type: New Feature > Components: FlightRPC, Format >Reporter: Ryan Nicholson >Priority: Major > > As a developer of database clients and backends, I would like to have a > standard in place to communicate between the two while being able to leverage > the data transfer features of Arrow Flight. > The Arrow Flight RPC specification allows for extensibility by using opaque > payloads to perform "Actions", specify "Commands" and so on. > I propose the addition of a Flight SQL extension consisting of predefined > protobuf messages and workflows to enable features such as: > * Discovering specific database capabilities from generic clients. > * Browsing catalogs. > * Executing different types of SQL commands. > Supporting documentation and a POC changelist will be sent to the mailing > list in the coming days, describing the protobuf messages and workflows > enabling these features. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset
[ https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526487#comment-17526487 ] Lance Dacey commented on ARROW-12358: - Nice, thanks. I can try to test with a nightly build this weekend. > [C++][Python][R][Dataset] Control overwriting vs appending when writing to > existing dataset > --- > > Key: ARROW-12358 > URL: https://issues.apache.org/jira/browse/ARROW-12358 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Weston Pace >Priority: Major > Labels: dataset > Fix For: 9.0.0 > > > Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}}) > uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when > you are writing to an existing dataset, you de facto overwrite previous data > when using this default template. > There is some discussion in ARROW-10695 about how the user can avoid this by > ensuring the file names are unique (the user can specify the > {{basename_template}} to be something unique). There is also ARROW-7706 about > silently doubling data (so _not_ overwriting existing data) with the legacy > {{parquet.write_to_dataset}} implementation. > It could be good to have a "mode" when writing datasets that controls the > different possible behaviours. And erroring when there is pre-existing data > in the target directory is maybe the safest default, because both appending > vs overwriting silently can be surprising behaviour depending on your > expectations. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset
[ https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526483#comment-17526483 ] Weston Pace commented on ARROW-12358: - [~ldacey] now that ARROW-16159 has merged this is probably ready to test again. Are you able to test with the nightly builds? Or do you want to wait for the release? > [C++][Python][R][Dataset] Control overwriting vs appending when writing to > existing dataset > --- > > Key: ARROW-12358 > URL: https://issues.apache.org/jira/browse/ARROW-12358 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Weston Pace >Priority: Major > Labels: dataset > Fix For: 9.0.0 > > > Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}}) > uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when > you are writing to an existing dataset, you de facto overwrite previous data > when using this default template. > There is some discussion in ARROW-10695 about how the user can avoid this by > ensuring the file names are unique (the user can specify the > {{basename_template}} to be something unique). There is also ARROW-7706 about > silently doubling data (so _not_ overwriting existing data) with the legacy > {{parquet.write_to_dataset}} implementation. > It could be good to have a "mode" when writing datasets that controls the > different possible behaviours. And erroring when there is pre-existing data > in the target directory is maybe the safest default, because both appending > vs overwriting silently can be surprising behaviour depending on your > expectations. -- This message was sent by Atlassian Jira (v8.20.7#820007)
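[Editor's note] As a sketch of the behavior under test here: the snippet below assumes the {{existing_data_behavior}} option that this line of work adds to {{pyarrow.dataset.write_dataset}} (value names as merged at the time; treat them as illustrative rather than final):
{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"part": ["a", "a", "b"], "value": [1, 2, 3]})

# Default mode: raise if the target directory already contains data.
ds.write_dataset(table, "out", format="parquet", partitioning=["part"],
                 existing_data_behavior="error")

# Append-style mode: keep existing files; a unique basename_template avoids
# silently overwriting the files from the first write.
ds.write_dataset(table, "out", format="parquet", partitioning=["part"],
                 basename_template="batch-2-{i}.parquet",
                 existing_data_behavior="overwrite_or_ignore")
{code}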
[jira] [Commented] (ARROW-16282) [CI] [C#] Verify release on c-sharp has been failing since upgrading ubuntu to 22.04
[ https://issues.apache.org/jira/browse/ARROW-16282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526475#comment-17526475 ] Raúl Cumplido commented on ARROW-16282: --- I can reproduce locally by using: {code:java} $ UBUNTU=22.04 docker-compose run --rm ubuntu-verify-rc {code} > [CI] [C#] Verify release on c-sharp has been failing since upgrading ubuntu > to 22.04 > - > > Key: ARROW-16282 > URL: https://issues.apache.org/jira/browse/ARROW-16282 > Project: Apache Arrow > Issue Type: Bug > Components: C#, Continuous Integration >Reporter: Raúl Cumplido >Priority: Blocker > Fix For: 8.0.0 > > > We upgraded the verify-release job for c-sharp from Ubuntu 20.04 to Ubuntu > 22.04 and we can see how the nightly release job has been failing since then. > Working for ubuntu 20.04 on 2022-04-08: > [https://github.com/ursacomputing/crossbow/tree/nightly-release-2022-04-08-0-github-verify-rc-source-csharp-linux-ubuntu-20.04-amd64] > Failing for ubuntu 22.04 on 2022-04-09: > [https://github.com/ursacomputing/crossbow/tree/nightly-release-2022-04-09-0-github-verify-rc-source-csharp-linux-ubuntu-22.04-amd64] > The error seems to be related to a missing libssl: > {code:java} > === > Build and test C# libraries > === > └ Ensuring that C# is installed... > └ Installed C# at (.NET 3.1.405)Welcome to .NET Core 3.1! > - > SDK Version: 3.1.405Telemetry > - > The .NET Core tools collect usage data in order to help us improve your > experience. It is collected by Microsoft and shared with the community. You > can opt-out of telemetry by setting the DOTNET_CLI_TELEMETRY_OPTOUT > environment variable to '1' or 'true' using your favorite shell.Read more > about .NET Core CLI Tools telemetry: > https://aka.ms/dotnet-cli-telemetry > Explore documentation: https://aka.ms/dotnet-docs > Report issues and find source on GitHub: https://github.com/dotnet/core > Find out what's new: https://aka.ms/dotnet-whats-new > Learn about the installed HTTPS developer cert: > https://aka.ms/aspnet-core-https > Use 'dotnet --help' to see available commands or visit: > https://aka.ms/dotnet-cli-docs > Write your first app: https://aka.ms/first-net-core-app > -- > No usable version of libssl was found > /arrow/dev/release/verify-release-candidate.sh: line 325: 49 Aborted > (core dumped) dotnet tool install --tool-path ${csharp_bin} > sourcelink > Failed to verify release candidate. See /tmp/arrow-HEAD.CiwJM for details. > 134 > Error: `docker-compose --file > /home/runner/work/crossbow/crossbow/arrow/docker-compose.yml run --rm -e > VERIFY_VERSION= -e VERIFY_RC= -e TEST_DEFAULT=0 -e TEST_CSHARP=1 > ubuntu-verify-rc` exited with a non-zero exit code 134, see the process log > above.{code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Closed] (ARROW-14591) [R] Implement bindings for lubridate duration types
[ https://issues.apache.org/jira/browse/ARROW-14591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dragoș Moldovan-Grünfeld closed ARROW-14591. Resolution: Fixed > [R] Implement bindings for lubridate duration types > --- > > Key: ARROW-14591 > URL: https://issues.apache.org/jira/browse/ARROW-14591 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Assignee: Jonathan Keane >Priority: Major > Fix For: 8.0.0 > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-14596) [Python] parquet.read_table nested fields in columns does not work for use_legacy_dataset=False
[ https://issues.apache.org/jira/browse/ARROW-14596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526407#comment-17526407 ] Alenka Frim commented on ARROW-14596: - I would like to add the observations we made today when pairing with [~jorisvandenbossche] on this topic. The first concerns the result of using {{pq.read_table}} with the legacy implementation vs using {{ds.dataset}} with column projection. The data is selected correctly with the dataset implementation, but the structure of the nested field is not kept (the struct is flattened to a string column). When using column selection with a list in {{ds.dataset}}, it errors, as reported in the issue.
{code:python}
>>> import pandas as pd
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>>
>>> df = pd.DataFrame({
...     'user_id': ['abc123', 'qrs456'],
...     'interaction': [{'type': 'click', 'element': 'button'}, {'type': 'scroll', 'element': 'window'}]
... })
>>>
>>> table = pa.Table.from_pandas(df)
>>> pq.write_table(table, 'example.parquet')
{code}
{code:python}
>>> pq.read_table('example.parquet', columns=['user_id', 'interaction.type'],
...               use_legacy_dataset=True)
pyarrow.Table
user_id: string
interaction: struct<type: string>
  child 0, type: string
----
user_id: [["abc123","qrs456"]]
interaction: [
  -- is_valid: all not null
  -- child 0 type: string
    ["click","scroll"]]
{code}
{code:python}
>>> import pyarrow.dataset as ds
>>> projection = {
...     'user_id': ds.field('user_id'),
...     'new': ds.field(('interaction', 'type'))
... }
>>> ds.dataset('example.parquet').to_table(columns=projection)
pyarrow.Table
user_id: string
new: string
----
user_id: [["abc123","qrs456"]]
new: [["click","scroll"]]
{code}
{code:python}
>>> ds.dataset('example.parquet').to_table(columns=['user_id', 'interaction.type'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_dataset.pyx", line 303, in pyarrow._dataset.Dataset.to_table
    return self.scanner(**kwargs).to_table()
  File "pyarrow/_dataset.pyx", line 270, in pyarrow._dataset.Dataset.scanner
    return Scanner.from_dataset(self, **kwargs)
  File "pyarrow/_dataset.pyx", line 2322, in pyarrow._dataset.Scanner.from_dataset
    _populate_builder(builder, columns=columns, filter=filter,
  File "pyarrow/_dataset.pyx", line 2168, in pyarrow._dataset._populate_builder
    check_status(builder.ProjectColumns([tobytes(c) for c in columns]))
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
    raise ArrowInvalid(message)
pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(interaction.type) in user_id: string
interaction: struct<type: string, element: string>
__fragment_index: int32
__batch_index: int32
__last_in_fragment: bool
__filename: string
/Users/alenkafrim/repos/arrow/cpp/src/arrow/type.h:1722  CheckNonEmpty(matches, root)
/Users/alenkafrim/repos/arrow/cpp/src/arrow/type.h:1757  FindOne(root)
/Users/alenkafrim/repos/arrow/cpp/src/arrow/dataset/scanner.cc:714  ref->GetOne(dataset_schema)
/Users/alenkafrim/repos/arrow/cpp/src/arrow/dataset/scanner.cc:784  ProjectionDescr::FromNames(std::move(columns), *scan_options_->dataset_schema)
{code}
When the Scanner object is created from the dataset via {{to_table}} (through {{_populate_builder}}) and the columns are given as a list, the {{ProjectColumns}} method ({{arrow::dataset::ScannerBuilder}}) is called; it only accepts plain string column names and errors when a column refers to a struct field.
We were wondering whether it would be a good idea to add a new method in {{scanner.cc}} that mimics the {{FromNames}} method but takes a {{field_ref}} as an argument. Afterwards, there would also be a need to recreate the struct field, which we are not sure how to approach. cc [~westonpace] [~apitrou] do you think that would be a correct way to go? > [Python] parquet.read_table nested fields in columns does not work for > use_legacy_dataset=False > --- > > Key: ARROW-14596 > URL: https://issues.apache.org/jira/browse/ARROW-14596 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Tom Scheffers >Assignee: Alenka Frim >Priority: Critical > Fix For: 9.0.0 > > > Reading nested fields does not work with use_legacy_dataset=False. > This works: > > {code:java} > import pyarrow.parquet as pq > t = pq.read_table( > source=*filename*, > columns=['store_key', 'properties.country'], > use_legacy_dataset=True, > ).to_pandas() > {code} > This does not work (for the same parquet file): > > {code:java} > import pyarrow.parquet as pq > t = pq.read_table( > source=*filename*, > columns=['store_key', 'properties.country'], > use_legacy_dataset=False, > ).to_pandas(){code} > -- This message was sent by Atlassian Jira (v8.20.7#820007)
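[Editor's note] For anyone following along, a rough sketch of the manual workaround implied above: select the nested child with the dict/{{FieldRef}} projection that already works, then rebuild the struct column by hand. The column names here are illustrative only, not a proposed API:
{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# Project the nested child explicitly (this path works today).
flat = ds.dataset('example.parquet').to_table(columns={
    'user_id': ds.field('user_id'),
    'interaction_type': ds.field(('interaction', 'type')),  # FieldRef path
})

# Recreate a one-field struct column from the flattened child.
type_array = pa.concat_arrays(flat['interaction_type'].chunks)
interaction = pa.StructArray.from_arrays([type_array], names=['type'])
rebuilt = flat.drop(['interaction_type']).append_column('interaction', interaction)
{code}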
[jira] [Closed] (ARROW-15224) [R] Add binding for not_between() ternary kernel
[ https://issues.apache.org/jira/browse/ARROW-15224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dragoș Moldovan-Grünfeld closed ARROW-15224. Resolution: Won't Fix A corresponding {{dplyr::not_between()}} function does not exist. > [R] Add binding for not_between() ternary kernel > > > Key: ARROW-15224 > URL: https://issues.apache.org/jira/browse/ARROW-15224 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Eduardo Ponce >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Fix For: 9.0.0 > > > Add R binding for {{not_between()}} compute function from ARROW-15223. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-15224) [R] Add binding for not_between() ternary kernel
[ https://issues.apache.org/jira/browse/ARROW-15224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526444#comment-17526444 ] Dragoș Moldovan-Grünfeld commented on ARROW-15224: -- I will _close_ the issue with _won't fix._ Thanks __ > [R] Add binding for not_between() ternary kernel > > > Key: ARROW-15224 > URL: https://issues.apache.org/jira/browse/ARROW-15224 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Eduardo Ponce >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Fix For: 9.0.0 > > > Add R binding for {{not_between()}} compute function from ARROW-15223. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Comment Edited] (ARROW-15224) [R] Add binding for not_between() ternary kernel
[ https://issues.apache.org/jira/browse/ARROW-15224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526444#comment-17526444 ] Dragoș Moldovan-Grünfeld edited comment on ARROW-15224 at 4/22/22 1:53 PM: --- I will _close_ the issue with _won't fix._ Thanks was (Author: dragosmg): I will _close_ the issue with _won't fix._ Thanks __ > [R] Add binding for not_between() ternary kernel > > > Key: ARROW-15224 > URL: https://issues.apache.org/jira/browse/ARROW-15224 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Eduardo Ponce >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Fix For: 9.0.0 > > > Add R binding for {{not_between()}} compute function from ARROW-15223. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16282) [CI] [C#] Verify release on c-sharp has been failing since upgrading ubuntu to 22.04
Raúl Cumplido created ARROW-16282: - Summary: [CI] [C#] Verify release on c-sharp has been failing since upgrading ubuntu to 22.04 Key: ARROW-16282 URL: https://issues.apache.org/jira/browse/ARROW-16282 Project: Apache Arrow Issue Type: Bug Components: C#, Continuous Integration Reporter: Raúl Cumplido Fix For: 8.0.0 We upgraded the verify-release job for c-sharp from Ubuntu 20.04 to Ubuntu 22.04 and we can see how the nightly release job has been failing since then. Working for ubuntu 20.04 on 2022-04-08: [https://github.com/ursacomputing/crossbow/tree/nightly-release-2022-04-08-0-github-verify-rc-source-csharp-linux-ubuntu-20.04-amd64] Failing for ubuntu 22.04 on 2022-04-09: [https://github.com/ursacomputing/crossbow/tree/nightly-release-2022-04-09-0-github-verify-rc-source-csharp-linux-ubuntu-22.04-amd64] The error seems to be related to a missing libssl: {code:java} === Build and test C# libraries === └ Ensuring that C# is installed... └ Installed C# at (.NET 3.1.405)Welcome to .NET Core 3.1! - SDK Version: 3.1.405Telemetry - The .NET Core tools collect usage data in order to help us improve your experience. It is collected by Microsoft and shared with the community. You can opt-out of telemetry by setting the DOTNET_CLI_TELEMETRY_OPTOUT environment variable to '1' or 'true' using your favorite shell.Read more about .NET Core CLI Tools telemetry: https://aka.ms/dotnet-cli-telemetry Explore documentation: https://aka.ms/dotnet-docs Report issues and find source on GitHub: https://github.com/dotnet/core Find out what's new: https://aka.ms/dotnet-whats-new Learn about the installed HTTPS developer cert: https://aka.ms/aspnet-core-https Use 'dotnet --help' to see available commands or visit: https://aka.ms/dotnet-cli-docs Write your first app: https://aka.ms/first-net-core-app -- No usable version of libssl was found /arrow/dev/release/verify-release-candidate.sh: line 325: 49 Aborted (core dumped) dotnet tool install --tool-path ${csharp_bin} sourcelink Failed to verify release candidate. See /tmp/arrow-HEAD.CiwJM for details. 134 Error: `docker-compose --file /home/runner/work/crossbow/crossbow/arrow/docker-compose.yml run --rm -e VERIFY_VERSION= -e VERIFY_RC= -e TEST_DEFAULT=0 -e TEST_CSHARP=1 ubuntu-verify-rc` exited with a non-zero exit code 134, see the process log above.{code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-15224) [R] Add binding for not_between() ternary kernel
[ https://issues.apache.org/jira/browse/ARROW-15224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526442#comment-17526442 ] Eduardo Ponce commented on ARROW-15224: --- Based on these observations, it seems we can conclude that a {{not_between}} function will not be included, so we can close this issue. > [R] Add binding for not_between() ternary kernel > > > Key: ARROW-15224 > URL: https://issues.apache.org/jira/browse/ARROW-15224 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Eduardo Ponce >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Fix For: 9.0.0 > > > Add R binding for {{not_between()}} compute function from ARROW-15223. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-15224) [R] Add binding for not_between() ternary kernel
[ https://issues.apache.org/jira/browse/ARROW-15224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526439#comment-17526439 ] Dragoș Moldovan-Grünfeld commented on ARROW-15224: -- I think the situation is similar in {{dplyr}} - the data manipulation R package we link to.
{code:r}
library(dplyr, warn.conflicts = FALSE)
starwars %>% filter(between(height, 100, 150))
#> # A tibble: 5 × 14
#>   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
#> 1 Leia Org…    150    49 brown      light      brown             19 fema… femin…
#> 2 Mon Moth…    150    NA auburn     fair       blue              48 fema… femin…
#> 3 Watto        137    NA black      blue, grey yellow            NA male  mascu…
#> 4 Sebulba      112    40 none       grey, red  orange            NA male  mascu…
#> 5 Gasgano      122    NA none       white, bl… black             NA male  mascu…
#> # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>
starwars %>% filter(!between(height, 100, 150))
#> # A tibble: 76 × 14
#>    name     height  mass hair_color skin_color eye_color birth_year sex   gender
#>    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
#>  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
#>  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
#>  3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
#>  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
#>  5 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
#>  6 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
#>  7 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
#>  8 Biggs D…    183    84 black      light      brown           24   male  mascu…
#>  9 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
#> 10 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
#> # … with 66 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>
{code}
> [R] Add binding for not_between() ternary kernel > > > Key: ARROW-15224 > URL: https://issues.apache.org/jira/browse/ARROW-15224 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Eduardo Ponce >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Fix For: 9.0.0 > > > Add R binding for {{not_between()}} compute function from ARROW-15223. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Comment Edited] (ARROW-15224) [R] Add binding for not_between() ternary kernel
[ https://issues.apache.org/jira/browse/ARROW-15224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526406#comment-17526406 ] Dragoș Moldovan-Grünfeld edited comment on ARROW-15224 at 4/22/22 1:43 PM: --- Given {{dplyr::not_between()}} does not exist, do we need an R {{not_between()}} binding? What do you think? [~jonkeane] [~thisisnic][~paleolimbot] was (Author: dragosmg): Given {{dplyr::not_between()}} does not exist, do we need an R {{not_between()} binding? What do you think? [~jonkeane] [~thisisnic][~paleolimbot] > [R] Add binding for not_between() ternary kernel > > > Key: ARROW-15224 > URL: https://issues.apache.org/jira/browse/ARROW-15224 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Eduardo Ponce >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Fix For: 9.0.0 > > > Add R binding for {{not_between()}} compute function from ARROW-15223. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16281) [R] [CI] Bump versions with the release of 4.2
[ https://issues.apache.org/jira/browse/ARROW-16281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-16281: --- Summary: [R] [CI] Bump versions with the release of 4.2 (was: [R] [CI]) > [R] [CI] Bump versions with the release of 4.2 > -- > > Key: ARROW-16281 > URL: https://issues.apache.org/jira/browse/ARROW-16281 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jonathan Keane >Assignee: Jacob Wujciak-Jens >Priority: Major > > Now that R 4.2 is released, we should bump all of our R versions where we > have ones hardcoded. > This will mean dropping support for 3.4 entirely and adding in 4.0 to > https://github.com/apache/arrow/blob/c4b646e715d155c1f77d34804796864465caa97b/dev/tasks/r/github.linux.versions.yml#L34 > There are a few other places that we have hard-coded versions (we might need > to wait a few days for these to catch up): > https://github.com/apache/arrow/blob/c4b646e715d155c1f77d34804796864465caa97b/dev/tasks/tasks.yml#L1291-L1295 > https://github.com/apache/arrow/blob/c4b646e715d155c1f77d34804796864465caa97b/.github/workflows/r.yml#L60 > (and a few other places in that file — though one note: we build an old > version of windows that uses rtools35 in the GHA CI so that we catch when we > break that — we'll want to keep that!) -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16281) [R] [CI]
Jonathan Keane created ARROW-16281: -- Summary: [R] [CI] Key: ARROW-16281 URL: https://issues.apache.org/jira/browse/ARROW-16281 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, R Reporter: Jonathan Keane Assignee: Jacob Wujciak-Jens Now that R 4.2 is released, we should bump all of our R versions where we have ones hardcoded. This will mean dropping support for 3.4 entirely and adding in 4.0 to https://github.com/apache/arrow/blob/c4b646e715d155c1f77d34804796864465caa97b/dev/tasks/r/github.linux.versions.yml#L34 There are a few other places that we have hard-coded versions (we might need to wait a few days for these to catch up): https://github.com/apache/arrow/blob/c4b646e715d155c1f77d34804796864465caa97b/dev/tasks/tasks.yml#L1291-L1295 https://github.com/apache/arrow/blob/c4b646e715d155c1f77d34804796864465caa97b/.github/workflows/r.yml#L60 (and a few other places in that file — though one note: we build an old version of windows that uses rtools35 in the GHA CI so that we catch when we break that — we'll want to keep that!) -- This message was sent by Atlassian Jira (v8.20.7#820007)