[jira] [Resolved] (ARROW-17212) [Python] Support lazy Dataset.filter

2022-12-16 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina resolved ARROW-17212.
---
Resolution: Done

> [Python] Support lazy Dataset.filter
> 
>
> Key: ARROW-17212
> URL: https://issues.apache.org/jira/browse/ARROW-17212
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Alessandro Molina
>Assignee: Alessandro Molina
>Priority: Major
> Fix For: 11.0.0
>
>
> Given that, where possible, we would like to keep the Dataset and Table APIs 
> similar enough that the most common operations can be performed on a Dataset 
> without having to materialise it to a Table, it would be good to add proper 
> support for a {{Dataset.filter}} method like the one we have on 
> {{Table.filter}}.
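> A minimal sketch of the intended usage (an editor's illustration, assuming 
> the lazy {{Dataset.filter}} lands as described; {{ds.dataset}} accepting a 
> Table is just a convenient way to build an in-memory dataset):
> {code:python}
> import pyarrow as pa
> import pyarrow.dataset as ds
> 
> dataset = ds.dataset(pa.table({"a": [1, 2, 3, 4], "b": ["x", "y", "x", "y"]}))
> 
> # filter() should only record the expression; nothing is materialised
> # until the dataset is actually consumed, e.g. via to_table().
> filtered = dataset.filter(ds.field("a") > 2)
> print(filtered.to_table())
> {code}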



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-16616) [Python] Allow lazy evaluation of filters in Dataset and add Dataset.filter method

2022-12-12 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina resolved ARROW-16616.
---
Resolution: Fixed

Issue resolved by pull request 13409
https://github.com/apache/arrow/pull/13409

> [Python] Allow lazy evaluation of filters in Dataset and add Dataset.filter 
> method
> -
>
> Key: ARROW-16616
> URL: https://issues.apache.org/jira/browse/ARROW-16616
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Python
>Reporter: Alessandro Molina
>Assignee: Alessandro Molina
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> To keep the {{Dataset}} API compatible with the {{Table}} one in terms of 
> analytics capabilities, we should add a {{Dataset.filter}} method. The 
> initial POC was based on {{_table_filter}}, but that required materialising 
> all the {{Dataset}} content after filtering, as it returned an 
> {{InMemoryDataset}}. 
> Given that {{Scanner}} can filter a dataset without actually materialising 
> the data until a final step happens, it would be good to have 
> {{Dataset.filter}} return some form of lazy dataset, where the filter is only 
> stored aside and the Scanner is created when data is actually retrieved.
> PS: Also update the {{test_dataset_filter}} test to use the 
> {{Dataset.filter}} method.
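> For reference, an editor's sketch of how the existing {{Scanner}} already 
> achieves this laziness today, which is the behaviour {{Dataset.filter}} 
> should wrap:
> {code:python}
> import pyarrow as pa
> import pyarrow.dataset as ds
> 
> dataset = ds.dataset(pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]}))
> 
> # The Scanner stores the filter expression aside; rows are only read
> # and filtered when a consuming call such as to_table() happens.
> scanner = dataset.scanner(filter=ds.field("a") >= 2)
> print(scanner.to_table())
> {code}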



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16518) [Python] Ensure _exec_plan.execplan preserves order of inputs

2022-12-02 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-16518:
--
Parent: (was: ARROW-17212)
Issue Type: Bug  (was: Sub-task)

> [Python] Ensure _exec_plan.execplan preserves order of inputs
> -
>
> Key: ARROW-16518
> URL: https://issues.apache.org/jira/browse/ARROW-16518
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Alessandro Molina
>Assignee: Alessandro Molina
>Priority: Major
> Fix For: 11.0.0
>
>
> At the moment execplan doesn't guarantee any ordered output; the batches are 
> consumed in a random order. This can lead to unordered rows in the output 
> when {{use_threads=True}}.
> For example, providing a column with {{b=[a, a, a, a, b, b, b, b]}} will 
> sometimes give back {{b=[a, b]}} and sometimes {{b=[b, a]}}.
> See
> {code:java}
> In [18]: table1 = pa.table({'a': [1, 2, 3, 4], 'b': ['a'] * 4})
> In [19]: table2 = pa.table({'a': [1, 2, 3, 4], 'b': ['b'] * 4})
> In [20]: table = pa.concat_tables([table1, table2])
> In [21]: ep._filter_table(table, pc.field('a') == 1)
> Out[21]: 
> pyarrow.Table
> a: int64
> b: string
> ----
> a: [[1],[1]]
> b: [["b"],["a"]]
> In [22]: ep._filter_table(table, pc.field('a') == 1)
> Out[22]: 
> pyarrow.Table
> a: int64
> b: string
> ----
> a: [[1],[1]]
> b: [["a"],["b"]] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18002) [C++] Ensure sorting Table, RecordBatch can sort Struct and Dictionary cols

2022-12-02 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18002:
--
Summary: [C++] Ensure sorting Table, RecordBatch can sort Struct and 
Dictionary cols  (was: [Python] Improve Sorting Capabilities in PyArrow)

> [C++] Ensure sorting Table, RecordBatch can sort Struct and Dictionary cols
> ---
>
> Key: ARROW-18002
> URL: https://issues.apache.org/jira/browse/ARROW-18002
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Alessandro Molina
>Priority: Major
>
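> An editor's illustration of the gap this issue tracks; at the time of 
> writing the sort kernels are expected to raise a "not implemented" error for 
> dictionary (and struct) sort keys rather than sort them:
> {code:python}
> import pyarrow as pa
> import pyarrow.compute as pc
> 
> table = pa.table({"d": pa.array(["b", "a", "c"]).dictionary_encode()})
> 
> # Expected to fail until dictionary columns are supported as sort keys.
> pc.sort_indices(table, sort_keys=[("d", "ascending")])
> {code}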




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18003) [Python] Add sort_by to Table and RecordBatch

2022-12-02 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18003:
--
Parent: (was: ARROW-18002)
Issue Type: Improvement  (was: Sub-task)

> [Python] Add sort_by to Table and RecordBatch
> -
>
> Key: ARROW-18003
> URL: https://issues.apache.org/jira/browse/ARROW-18003
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Alessandro Molina
>Priority: Major
> Fix For: 11.0.0
>
>
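> An editor's sketch of what such a {{sort_by}} helper would presumably wrap, 
> using the compute functions available today:
> {code:python}
> import pyarrow as pa
> import pyarrow.compute as pc
> 
> table = pa.table({"a": [3, 1, 2], "b": ["x", "y", "z"]})
> 
> # sort_by boils down to: compute the sorted indices, then take.
> indices = pc.sort_indices(table, sort_keys=[("a", "ascending")])
> print(table.take(indices))
> {code}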




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-14656) [Python] Add sort_by method to Array, StructArray and ChunkedArray

2022-12-02 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-14656:
--
Parent: (was: ARROW-18002)
Issue Type: Improvement  (was: Sub-task)

> [Python] Add sort_by method to Array, StructArray and ChunkedArray
> --
>
> Key: ARROW-14656
> URL: https://issues.apache.org/jira/browse/ARROW-14656
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Alessandro Molina
>Assignee: Alessandro Molina
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> It would be convenient to be able to sort a {{StructArray}} on one of its 
> columns. This can be done by combining multiple compute functions, but having 
> a helper that does that for you would probably be better.
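> The combination of compute functions mentioned above, as an editor's sketch 
> of what the helper would hide:
> {code:python}
> import pyarrow as pa
> import pyarrow.compute as pc
> 
> arr = pa.StructArray.from_arrays(
>     [pa.array([3, 1, 2]), pa.array(["x", "y", "z"])],
>     names=["a", "b"],
> )
> 
> # Sort on one child field, then reorder the whole struct array.
> indices = pc.sort_indices(arr.field("a"))
> print(arr.take(indices))
> {code}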



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18380) MIGRATION: Enable bot handling of GitHub issue linked PRs

2022-11-29 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina resolved ARROW-18380.
---
Resolution: Fixed

Issue resolved by pull request 14731
https://github.com/apache/arrow/pull/14731

> MIGRATION: Enable bot handling of GitHub issue linked PRs
> -
>
> Key: ARROW-18380
> URL: https://issues.apache.org/jira/browse/ARROW-18380
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Developer Tools
>Reporter: Todd Farmer
>Assignee: Raúl Cumplido
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> GitHub workflows for the Apache Arrow project assume that PRs reference ASF 
> Jira issues (or are minor changes). This needs to be revisited now that 
> GitHub issue reporting is enabled, as there may well be no ASF Jira issue to 
> link a PR against going forward. The resulting bot comments can be confusing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18214) [C++][R][Python] Use ISO 8601 in character representations of timestamps?

2022-11-03 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18214:
--
Issue Type: Improvement  (was: Bug)

> [C++][R][Python] Use ISO 8601 in character representations of timestamps?
> -
>
> Key: ARROW-18214
> URL: https://issues.apache.org/jira/browse/ARROW-18214
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Carl Boettiger
>Priority: Major
>  Labels: triaged
>
> Arrow needs to represent datetime / timestamp values as character strings, 
> e.g. when writing to CSV or when generating partitions on a timestamp-valued 
> column. When this occurs, Arrow generates a string such as:
> "2022-11-01 21:12:46.771925+"
> In particular, this uses a space instead of a "T" between the date and time 
> components. I believe either is permitted in [RFC 
> 3339|https://www.rfc-editor.org/rfc/rfc3339.html#section-5]:
> ??5.6. NOTE: ISO 8601 defines date and time separated by "T". Applications 
> using this syntax may choose, for the sake of readability, to specify a 
> full-date and full-time separated by (say) a space character.??
>  
> But as RFC 3339 notes, this is not valid under ISO 8601.  It would be 
> preferable to stick to the stricter ISO 8601 convention.
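> To make the behaviour concrete, an editor's sketch (the exact output of the 
> default cast may vary by version and timezone handling):
> {code:python}
> import pyarrow as pa
> import pyarrow.compute as pc
> 
> ts = pa.array([0], pa.timestamp("s", tz="UTC"))
> 
> # The default string representation uses a space separator,
> # e.g. "1970-01-01 00:00:00Z".
> print(ts.cast(pa.string()))
> 
> # strftime can produce the strict ISO 8601 "T"-separated form instead.
> print(pc.strftime(ts, format="%Y-%m-%dT%H:%M:%S"))
> {code}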



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18230) [Python] Pass CMake args to Python CPP

2022-11-03 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18230:
--
Labels: triaged  (was: )

> [Python] Pass CMake args to Python CPP 
> ---
>
> Key: ARROW-18230
> URL: https://issues.apache.org/jira/browse/ARROW-18230
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Will Jones
>Priority: Major
>  Labels: triaged
> Fix For: 11.0.0
>
>
> We pass {{extra_cmake_args}} to {{_run_cmake}} (Cython build) but not to 
> {{_run_cmake_pyarrow_cpp}} (PyArrow C++ build). We should probably be 
> passing it to both.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18229) [C++][Python] RecordBatchReader can be created with a 'dict' schema which then crashes on use

2022-11-03 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18229:
--
Labels: triaged  (was: )

> [C++][Python] RecordBatchReader can be created with a 'dict' schema which 
> then crashes on use
> -
>
> Key: ARROW-18229
> URL: https://issues.apache.org/jira/browse/ARROW-18229
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 10.0.0
>Reporter: David Li
>Priority: Blocker
>  Labels: triaged
>
> Presumably we should disallow this or convert it to a schema?
> https://github.com/duckdb/duckdb/issues/5143
> {noformat}
> >>> import pyarrow as pa
> >>> pa.__version__
> '10.0.0'
> >>> reader = pa.RecordBatchReader.from_batches({"a": pa.int8()}, [])
> >>> reader.schema
> fish: Job 1, 'python3' terminated by signal SIGSEGV (Address boundary error)
> (gdb) bt
> #0  0x74247580 in arrow::Schema::num_fields() const ()
>from 
> /home/lidavidm/miniconda3/lib/python3.9/site-packages/pyarrow/libarrow.so.1000
> #1  0x742b93f7 in arrow::(anonymous namespace)::SchemaPrinter::Print()
> ()
>from 
> /home/lidavidm/miniconda3/lib/python3.9/site-packages/pyarrow/libarrow.so.1000
> #2  0x742b98a7 in arrow::PrettyPrint(arrow::Schema const&, 
> arrow::PrettyPrintOptions const&, std::string*) ()
>from 
> /home/lidavidm/miniconda3/lib/python3.9/site-packages/pyarrow/libarrow.so.1000
> #3  0x764f814b in 
> __pyx_pw_7pyarrow_3lib_6Schema_52to_string(_object*, _object*, _object*) ()
> {noformat}
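> An editor's sketch of the conversion suggested above: coercing the argument 
> through {{pa.schema()}}, which accepts a mapping of names to types and 
> produces a proper Schema up front:
> {code:python}
> import pyarrow as pa
> 
> # Build the Schema explicitly instead of passing a raw dict.
> schema = pa.schema({"a": pa.int8()})
> reader = pa.RecordBatchReader.from_batches(schema, [])
> print(reader.schema)  # no crash: the reader holds a real Schema
> {code}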



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18214) [C++][R][Python] Use ISO 8601 in character representations of timestamps?

2022-11-03 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18214:
--
Labels: triaged  (was: )

> [C++][R][Python] Use ISO 8601 in character representations of timestamps?
> -
>
> Key: ARROW-18214
> URL: https://issues.apache.org/jira/browse/ARROW-18214
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Carl Boettiger
>Priority: Major
>  Labels: triaged
>
> Arrow needs to represent datetime / timestamp values as character strings, 
> e.g. when writing to CSV or when generating partitions on a timestamp-valued 
> column. When this occurs, Arrow generates a string such as:
> "2022-11-01 21:12:46.771925+"
> In particular, this uses a space instead of a "T" between the date and time 
> components. I believe either is permitted in [RFC 
> 3339|https://www.rfc-editor.org/rfc/rfc3339.html#section-5]:
> ??5.6. NOTE: ISO 8601 defines date and time separated by "T". Applications 
> using this syntax may choose, for the sake of readability, to specify a 
> full-date and full-time separated by (say) a space character.??
>  
> But as RFC 3339 notes, this is not valid under ISO 8601.  It would be 
> preferable to stick to the stricter ISO 8601 convention.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18191) [C++] Valgrind failure in arrow-gcsfs-test

2022-11-03 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18191:
--
Priority: Critical  (was: Major)

> [C++] Valgrind failure in arrow-gcsfs-test
> --
>
> Key: ARROW-18191
> URL: https://issues.apache.org/jira/browse/ARROW-18191
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: David Li
>Priority: Critical
>  Labels: triaged
>
> https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=38546&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181
> {noformat}
> ==11267== 
> ==11267== HEAP SUMMARY:
> ==11267== in use at exit: 12,091 bytes in 190 blocks
> ==11267==   total heap usage: 982,685 allocs, 982,495 frees, 1,332,264,705 
> bytes allocated
> ==11267== 
> ==11267== 192 bytes in 8 blocks are definitely lost in loss record 35 of 45
> ==11267==at 0x40377A5: operator new(unsigned long, std::nothrow_t const&) 
> (vg_replace_malloc.c:542)
> ==11267==by 0x682B079: __cxa_thread_atexit (atexit_thread.cc:152)
> ==11267==by 0x672F2D6: 
> google::cloud::v2_3_0::internal::OptionsSpan::OptionsSpan(google::cloud::v2_3_0::Options)
>  (in /opt/conda/envs/arrow/lib/libgoogle_cloud_cpp_common.so.2.3.0)
> ==11267==by 0x5DFCA33: google::cloud::v2_3_0::Status 
> google::cloud::storage::v2_3_0::Client::DeleteObject(std::__cxx11::basic_string  std::char_traits, std::allocator > const&, 
> std::__cxx11::basic_string, std::allocator 
> > const&, google::cloud::storage::v2_3_0::Generation&&) (client.h:1285)
> ==11267==by 0x5DFD022: operator() (gcsfs.cc:550)
> ==11267==by 0x5DFD022: 
> operator() arrow::fs::(anonymous namespace)::GcsPath&, bool, const 
> arrow::io::IOContext&):: google::cloud::v2_3_0::StatusOr&)>&,
>  
> google::cloud::v2_3_0::StatusOr&>
>  (future.h:150)
> ==11267==by 0x5DFD022: __invoke_impl arrow::detail::ContinueFuture&, arrow::Future&, 
> arrow::fs::GcsFileSystem::Impl::DeleteDirContents(const arrow::fs::(anonymous 
> namespace)::GcsPath&, bool, const arrow::io::IOContext&):: google::cloud::v2_3_0::StatusOr&)>&,
>  
> google::cloud::v2_3_0::StatusOr&>
>  (invoke.h:60)
> ==11267==by 0x5DFD022: __invoke arrow::Future&, 
> arrow::fs::GcsFileSystem::Impl::DeleteDirContents(const arrow::fs::(anonymous 
> namespace)::GcsPath&, bool, const arrow::io::IOContext&):: google::cloud::v2_3_0::StatusOr&)>&,
>  
> google::cloud::v2_3_0::StatusOr&>
>  (invoke.h:95)
> ==11267==by 0x5DFD022: __call (functional:416)
> ==11267==by 0x5DFD022: operator()<> (functional:499)
> ==11267==by 0x5DFD022: arrow::internal::FnOnce ()>::FnImpl (arrow::Future, 
> arrow::fs::GcsFileSystem::Impl::DeleteDirContents(arrow::fs::(anonymous 
> namespace)::GcsPath const&, bool, arrow::io::IOContext 
> const&)::{lambda(google::cloud::v2_3_0::StatusOr
>  const&)#1}, 
> google::cloud::v2_3_0::StatusOr)>
>  >::invoke() (functional.h:152)
> ==11267==by 0x50BDAA1: operator() (functional.h:140)
> ==11267==by 0x50BDAA1: 
> arrow::internal::WorkerLoop(std::shared_ptr,
>  std::_List_iterator) (thread_pool.cc:243)
> ==11267==by 0x50BE161: operator() (thread_pool.cc:414)
> ==11267==by 0x50BE161: __invoke_impl arrow::internal::ThreadPool::LaunchWorkersUnlocked(int):: > 
> (invoke.h:60)
> ==11267==by 0x50BE161: 
> __invoke 
> > (invoke.h:95)
> ==11267==by 0x50BE161: _M_invoke<0> (thread:264)
> ==11267==by 0x50BE161: operator() (thread:271)
> ==11267==by 0x50BE161: 
> std::thread::_State_impl
>  > >::_M_run() (thread:215)
> ==11267==by 0x6849A92: execute_native_thread_routine (thread.cc:82)
> ==11267==by 0x69666DA: start_thread (pthread_create.c:463)
> ==11267==by 0x6C9F61E: clone (clone.S:95)
> ==11267== 
> {
>
>Memcheck:Leak
>match-leak-kinds: definite
>fun:_ZnwmRKSt9nothrow_t
>fun:execute_native_thread_routine
>fun:start_thread
>fun:clone
> }
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18193) [C++] Acero should reject Substrait plans that require an implicit cast from decimal to float

2022-11-03 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18193:
--
Labels: triaged  (was: )

> [C++] Acero should reject Substrait plans that require an implicit cast from 
> decimal to float
> -
>
> Key: ARROW-18193
> URL: https://issues.apache.org/jira/browse/ARROW-18193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>  Labels: triaged
>
> There are a number of functions that Acero currently implements for decimal 
> types by first casting to float implicitly.  Substrait does not allow for 
> this kind of implicit cast.  If Acero does not have a decimal-native 
> implementation of a function used by a plan, it should reject the plan.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18191) [C++] Valgrind failure in arrow-gcsfs-test

2022-11-03 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18191:
--
Labels: triaged  (was: )

> [C++] Valgrind failure in arrow-gcsfs-test
> --
>
> Key: ARROW-18191
> URL: https://issues.apache.org/jira/browse/ARROW-18191
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: David Li
>Priority: Major
>  Labels: triaged
>
> https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=38546&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181
> {noformat}
> ==11267== 
> ==11267== HEAP SUMMARY:
> ==11267== in use at exit: 12,091 bytes in 190 blocks
> ==11267==   total heap usage: 982,685 allocs, 982,495 frees, 1,332,264,705 
> bytes allocated
> ==11267== 
> ==11267== 192 bytes in 8 blocks are definitely lost in loss record 35 of 45
> ==11267==at 0x40377A5: operator new(unsigned long, std::nothrow_t const&) 
> (vg_replace_malloc.c:542)
> ==11267==by 0x682B079: __cxa_thread_atexit (atexit_thread.cc:152)
> ==11267==by 0x672F2D6: 
> google::cloud::v2_3_0::internal::OptionsSpan::OptionsSpan(google::cloud::v2_3_0::Options)
>  (in /opt/conda/envs/arrow/lib/libgoogle_cloud_cpp_common.so.2.3.0)
> ==11267==by 0x5DFCA33: google::cloud::v2_3_0::Status 
> google::cloud::storage::v2_3_0::Client::DeleteObject(std::__cxx11::basic_string  std::char_traits, std::allocator > const&, 
> std::__cxx11::basic_string, std::allocator 
> > const&, google::cloud::storage::v2_3_0::Generation&&) (client.h:1285)
> ==11267==by 0x5DFD022: operator() (gcsfs.cc:550)
> ==11267==by 0x5DFD022: 
> operator() arrow::fs::(anonymous namespace)::GcsPath&, bool, const 
> arrow::io::IOContext&):: google::cloud::v2_3_0::StatusOr&)>&,
>  
> google::cloud::v2_3_0::StatusOr&>
>  (future.h:150)
> ==11267==by 0x5DFD022: __invoke_impl arrow::detail::ContinueFuture&, arrow::Future&, 
> arrow::fs::GcsFileSystem::Impl::DeleteDirContents(const arrow::fs::(anonymous 
> namespace)::GcsPath&, bool, const arrow::io::IOContext&):: google::cloud::v2_3_0::StatusOr&)>&,
>  
> google::cloud::v2_3_0::StatusOr&>
>  (invoke.h:60)
> ==11267==by 0x5DFD022: __invoke arrow::Future&, 
> arrow::fs::GcsFileSystem::Impl::DeleteDirContents(const arrow::fs::(anonymous 
> namespace)::GcsPath&, bool, const arrow::io::IOContext&):: google::cloud::v2_3_0::StatusOr&)>&,
>  
> google::cloud::v2_3_0::StatusOr&>
>  (invoke.h:95)
> ==11267==by 0x5DFD022: __call (functional:416)
> ==11267==by 0x5DFD022: operator()<> (functional:499)
> ==11267==by 0x5DFD022: arrow::internal::FnOnce ()>::FnImpl (arrow::Future, 
> arrow::fs::GcsFileSystem::Impl::DeleteDirContents(arrow::fs::(anonymous 
> namespace)::GcsPath const&, bool, arrow::io::IOContext 
> const&)::{lambda(google::cloud::v2_3_0::StatusOr
>  const&)#1}, 
> google::cloud::v2_3_0::StatusOr)>
>  >::invoke() (functional.h:152)
> ==11267==by 0x50BDAA1: operator() (functional.h:140)
> ==11267==by 0x50BDAA1: 
> arrow::internal::WorkerLoop(std::shared_ptr,
>  std::_List_iterator) (thread_pool.cc:243)
> ==11267==by 0x50BE161: operator() (thread_pool.cc:414)
> ==11267==by 0x50BE161: __invoke_impl arrow::internal::ThreadPool::LaunchWorkersUnlocked(int):: > 
> (invoke.h:60)
> ==11267==by 0x50BE161: 
> __invoke 
> > (invoke.h:95)
> ==11267==by 0x50BE161: _M_invoke<0> (thread:264)
> ==11267==by 0x50BE161: operator() (thread:271)
> ==11267==by 0x50BE161: 
> std::thread::_State_impl
>  > >::_M_run() (thread:215)
> ==11267==by 0x6849A92: execute_native_thread_routine (thread.cc:82)
> ==11267==by 0x69666DA: start_thread (pthread_create.c:463)
> ==11267==by 0x6C9F61E: clone (clone.S:95)
> ==11267== 
> {
>
>Memcheck:Leak
>match-leak-kinds: definite
>fun:_ZnwmRKSt9nothrow_t
>fun:execute_native_thread_routine
>fun:start_thread
>fun:clone
> }
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18201) Remove ad-hoc substrait version after substrait#342

2022-11-03 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18201:
--
Labels: triaged  (was: )

> Remove ad-hoc substrait version after substrait#342
> ---
>
> Key: ARROW-18201
> URL: https://issues.apache.org/jira/browse/ARROW-18201
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Major
>  Labels: triaged
>
> After https://github.com/apache/arrow/pull/14415, arrow's version of substrait 
> will be a specific git hash instead of a released version. As soon as a 
> released version is available which includes 
> https://github.com/substrait-io/substrait/pull/342, we should update to depend 
> on that, in order to minimize the build-up of divergence between arrow's 
> consumption of substrait and its released API.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18229) [C++][Python] RecordBatchReader can be created with a 'dict' schema which then crashes on use

2022-11-03 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18229:
--
Priority: Blocker  (was: Major)

> [C++][Python] RecordBatchReader can be created with a 'dict' schema which 
> then crashes on use
> -
>
> Key: ARROW-18229
> URL: https://issues.apache.org/jira/browse/ARROW-18229
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 10.0.0
>Reporter: David Li
>Priority: Blocker
>
> Presumably we should disallow this or convert it to a schema?
> https://github.com/duckdb/duckdb/issues/5143
> {noformat}
> >>> import pyarrow as pa
> >>> pa.__version__
> '10.0.0'
> >>> reader = pa.RecordBatchReader.from_batches({"a": pa.int8()}, [])
> >>> reader.schema
> fish: Job 1, 'python3' terminated by signal SIGSEGV (Address boundary error)
> (gdb) bt
> #0  0x74247580 in arrow::Schema::num_fields() const ()
>from 
> /home/lidavidm/miniconda3/lib/python3.9/site-packages/pyarrow/libarrow.so.1000
> #1  0x742b93f7 in arrow::(anonymous namespace)::SchemaPrinter::Print()
> ()
>from 
> /home/lidavidm/miniconda3/lib/python3.9/site-packages/pyarrow/libarrow.so.1000
> #2  0x742b98a7 in arrow::PrettyPrint(arrow::Schema const&, 
> arrow::PrettyPrintOptions const&, std::string*) ()
>from 
> /home/lidavidm/miniconda3/lib/python3.9/site-packages/pyarrow/libarrow.so.1000
> #3  0x764f814b in 
> __pyx_pw_7pyarrow_3lib_6Schema_52to_string(_object*, _object*, _object*) ()
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18189) [Python] Table.drop should support passing a single column

2022-10-28 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18189:
--
Description: 
{code}
> 2 data = dataset.drop("Churn") 

/usr/local/lib/python3.7/dist-packages/pyarrow/table.pxi in 
pyarrow.lib.Table.drop()
KeyError: "Column 'C' not found"
{code}

Also, for consistency, it would probably be good to have a 
{{Table.drop_column}} alias, as all the other methods are named 
{{Table.add_column}}, {{Table.append_column}} and {{Table.set_column}}

  was:
{code}
> 2 data = dataset.drop("Churn") 

/usr/local/lib/python3.7/dist-packages/pyarrow/table.pxi in 
pyarrow.lib.Table.drop()
KeyError: "Column 'C' not found"
{code}

Also, for consistency, it would probably be good to have a 
{{Table.drop_column}} alias, as all the other methods are name 
{{Table.add_column}}, {{Table.append_column}} and {{Table.set_column}}


> [Python] Table.drop should support passing a single column
> --
>
> Key: ARROW-18189
> URL: https://issues.apache.org/jira/browse/ARROW-18189
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Alessandro Molina
>Priority: Major
>  Labels: good-first-issue
>
> {code}
> > 2 data = dataset.drop("Churn") 
> /usr/local/lib/python3.7/dist-packages/pyarrow/table.pxi in 
> pyarrow.lib.Table.drop()
> KeyError: "Column 'C' not found"
> {code}
> Also, for consistency, it would probably be good to have a 
> {{Table.drop_column}} alias, as all the other methods are named 
> {{Table.add_column}}, {{Table.append_column}} and {{Table.set_column}}
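> An editor's note on the error above: {{Table.drop}} iterates its argument, 
> so a bare string is treated as a sequence of one-character column names, and 
> 'C' is simply the first of them:
> {code:python}
> import pyarrow as pa
> 
> table = pa.table({"Churn": [1, 2], "other": ["a", "b"]})
> 
> print(table.drop(["Churn"]))  # works: the argument is a list
> # table.drop("Churn")         # KeyError: "Column 'C' not found"
> {code}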



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18189) [Python] Table.drop should support passing a single column

2022-10-28 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18189:
--
Description: 
{code}
> 2 data = dataset.drop("Churn") 

/usr/local/lib/python3.7/dist-packages/pyarrow/table.pxi in 
pyarrow.lib.Table.drop()
KeyError: "Column 'C' not found"
{code}

Also, for consistency, it would probably be good to have a 
{{Table.drop_column}} alias, as all the other methods are name 
{{Table.add_column}}, {{Table.append_column}} and {{Table.set_column}}

  was:
{code}
> 2 data = dataset.drop("Churn") 

/usr/local/lib/python3.7/dist-packages/pyarrow/table.pxi in 
pyarrow.lib.Table.drop()
KeyError: "Column 'C' not found"
{code}

Also, for consistency, it would probably be good to have a 
{{Table.drop_column}} alias, as all the other methods are name 
{{Table.add_column}} and {{Table.set_column}}


> [Python] Table.drop should support passing a single column
> --
>
> Key: ARROW-18189
> URL: https://issues.apache.org/jira/browse/ARROW-18189
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Alessandro Molina
>Priority: Major
>  Labels: good-first-issue
>
> {code}
> > 2 data = dataset.drop("Churn") 
> /usr/local/lib/python3.7/dist-packages/pyarrow/table.pxi in 
> pyarrow.lib.Table.drop()
> KeyError: "Column 'C' not found"
> {code}
> Also, for consistency, it would probably be good to have a 
> {{Table.drop_column}} alias, as all the other methods are name 
> {{Table.add_column}}, {{Table.append_column}} and {{Table.set_column}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18189) [Python] Table.drop should support passing a single column

2022-10-28 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-18189:
-

 Summary: [Python] Table.drop should support passing a single column
 Key: ARROW-18189
 URL: https://issues.apache.org/jira/browse/ARROW-18189
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Alessandro Molina


{code}
> 2 data = dataset.drop("Churn") 

/usr/local/lib/python3.7/dist-packages/pyarrow/table.pxi in 
pyarrow.lib.Table.drop()
KeyError: "Column 'C' not found"
{code}

Also, for consistency, it would probably be good to have a 
{{Table.drop_column}} alias, as all the other methods are name 
{{Table.add_column}} and {{Table.set_column}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18143) [Java] Allow to set Compression Codec in Arrow Writer

2022-10-25 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18143:
--
Priority: Critical  (was: Minor)

> [Java] Allow to set Compression Codec in Arrow Writer
> -
>
> Key: ARROW-18143
> URL: https://issues.apache.org/jira/browse/ARROW-18143
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Affects Versions: 9.0.0
>Reporter: Michal Zaborec
>Priority: Critical
>  Labels: triaged
>
> ArrowFileWriter and ArrowStreamWriter don't expose a constructor for 
> setting a {_}CompressionCodec{_}.
> If they are not meant to, there should be proper documentation on how to save 
> compressed data.
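> For comparison, an editor's sketch of the equivalent knob already exposed by 
> the Python bindings; the ticket asks for the same option in the Java writers:
> {code:python}
> import pyarrow as pa
> 
> table = pa.table({"a": [1, 2, 3]})
> options = pa.ipc.IpcWriteOptions(compression="zstd")
> 
> # The IPC file writer receives the compression codec via its options.
> with pa.OSFile("/tmp/data.arrow", "wb") as sink:
>     with pa.ipc.new_file(sink, table.schema, options=options) as writer:
>         writer.write_table(table)
> {code}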



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18143) [Java] Allow to set Compression Codec in Arrow Writer

2022-10-25 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18143:
--
Description: 
ArrowFileWriter and ArrowStreamWriter don't expose a constructor for setting 
a {_}CompressionCodec{_}.

If they are not meant to, there should be proper documentation on how to save 
compressed data.

  was:ArrowFileWriter and ArrowStreamWriter doesn't expose a constructor for 
setting a {_}CompressionCodec{_}.


> [Java] Allow to set Compression Codec in Arrow Writer
> -
>
> Key: ARROW-18143
> URL: https://issues.apache.org/jira/browse/ARROW-18143
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Affects Versions: 9.0.0
>Reporter: Michal Zaborec
>Priority: Minor
>  Labels: triaged
>
> ArrowFileWriter and ArrowStreamWriter don't expose a constructor for 
> setting a {_}CompressionCodec{_}.
> If they are not meant to, there should be proper documentation on how to save 
> compressed data.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18143) [Java] Allow to set Compression Codec in Arrow Writer

2022-10-25 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18143:
--
Labels: triaged  (was: )

> [Java] Allow to set Compression Codec in Arrow Writer
> -
>
> Key: ARROW-18143
> URL: https://issues.apache.org/jira/browse/ARROW-18143
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Affects Versions: 9.0.0
>Reporter: Michal Zaborec
>Priority: Minor
>  Labels: triaged
>
> ArrowFileWriter and ArrowStreamWriter don't expose a constructor for 
> setting a {_}CompressionCodec{_}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18128) [Java][CI] Java nightlies does not remove binaries keeping the newest ones

2022-10-25 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18128:
--
Priority: Critical  (was: Major)

> [Java][CI] Java nightlies does not remove binaries keeping the newest ones
> --
>
> Key: ARROW-18128
> URL: https://issues.apache.org/jira/browse/ARROW-18128
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Java
>Reporter: Raúl Cumplido
>Priority: Critical
> Fix For: 11.0.0
>
>
> After some investigation into why some of our jars did not have a correct 
> 10.0.0-SNAPSHOT folder in the repository, e.g.:
> [https://nightlies.apache.org/arrow/java/org/apache/arrow/arrow-vector/]
> [https://nightlies.apache.org/arrow/java/org/apache/arrow/flight-core/]
> it seems that when pruning the old artifacts we are not ordering them from 
> newest to oldest:
> {code:java}
>       - name: Prune Repository
>         shell: bash
>         env:
>           KEEP: ${{ github.event.inputs.keep || 14 }}
>         run: |
>           for i in `ls -t repo/org/apache/arrow`; do
>             find repo/org/apache/arrow/$i -mindepth 1 -maxdepth 1 -type d 
> -print0 \
>             | xargs -0 ls -t -d \
>             | tail -n +$((KEEP + 1)) \
>             | xargs rm -rf
>           done {code}
> That makes us delete based on the raw output of find, leaving gaps like:
> {code:java}
> [DIR] 2022-09-21/                   2022-09-21 15:43    -   
> [DIR] 2022-09-22/                   2022-09-22 20:53    -   
> [DIR] 2022-09-23/                   2022-09-23 15:14    -   
> [DIR] 2022-10-11/                   2022-10-11 15:18    -   
> [DIR] 2022-10-12/                   2022-10-12 18:04    -   
> [DIR] 2022-10-13/                   2022-10-13 20:35    -   
> [DIR] 2022-10-14/                   2022-10-14 17:28    -   
> [DIR] 2022-10-15/                   2022-10-15 14:10    -   
> [DIR] 2022-10-16/                   2022-10-16 14:13    -   
> [DIR] 2022-10-17/                   2022-10-17 14:21    -   
> [DIR] 2022-10-18/                   2022-10-18 16:24    -   
> [DIR] 2022-10-19/                   2022-10-19 14:31    -   
> [DIR] 2022-10-20/                   2022-10-20 17:09    -   {code}
> Note how the artifacts run from 21st-23rd September and then jump straight 
> to 11th October.
> We should fix how we prune the older artifacts.
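> A hypothetical sketch of the intended pruning logic, in Python rather than 
> the actual workflow shell (paths and KEEP value follow the snippet above):
> {code:python}
> from pathlib import Path
> import shutil
> 
> KEEP = 14
> for artifact in Path("repo/org/apache/arrow").iterdir():
>     # Order the dated directories newest first, then drop the rest.
>     versions = sorted(
>         (d for d in artifact.iterdir() if d.is_dir()),
>         key=lambda d: d.stat().st_mtime,
>         reverse=True,
>     )
>     for stale in versions[KEEP:]:
>         shutil.rmtree(stale)
> {code}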



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17972) [CI] Update CUDA docker jobs

2022-10-25 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina resolved ARROW-17972.
---
Fix Version/s: 10.0.0
   (was: 11.0.0)
   Resolution: Fixed

Issue resolved by pull request 14362
[https://github.com/apache/arrow/pull/14362]

> [CI] Update CUDA docker jobs
> 
>
> Key: ARROW-17972
> URL: https://issues.apache.org/jira/browse/ARROW-17972
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Continuous Integration
>Reporter: Jacob Wujciak-Jens
>Assignee: Jacob Wujciak-Jens
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The CUDA job configs in docker-compose.yml are outdated and do not work 
> anymore. Additionally, disable optional features to keep build time low.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18129) get_include() gives wrong directory

2022-10-24 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18129:
--
Priority: Critical  (was: Minor)

> get_include() gives wrong directory
> ---
>
> Key: ARROW-18129
> URL: https://issues.apache.org/jira/browse/ARROW-18129
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
> Environment: conda
>Reporter: Left Screen
>Priority: Critical
>  Labels: triaged
>
> {{get_include}} seems to do:
>  
> {code:java}
> def get_include():
>     """
>     Return absolute path to directory containing Arrow C++ include
>     headers. Similar to numpy.get_include
>     """
>     return _os.path.join(_os.path.dirname(__file__), 'include') {code}
> This returns something like:
> {code:java}
> /path/to/myconda/envs/envname/lib/python3.8/site-packages/pyarrow/include{code}
> which does not exist in a conda environment. The headers are actually 
> installed under:
>  
> {code:java}
> $ echo $CONDA_PREFIX
> /path/to/myconda/envs/envname
> $ ls $CONDA_PREFIX/include/arrow | head
> adapters
> api.h
> array
> array.h
> buffer_builder.h
> buffer.h
> builder.h
> c
> chunked_array.h
> chunk_resolver.h
> {code}
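> An editor's sketch of a hypothetical caller-side workaround for conda 
> environments until this is fixed:
> {code:python}
> import os
> import pyarrow as pa
> 
> # Fall back to $CONDA_PREFIX/include when the packaged path is missing.
> include_dir = pa.get_include()
> if not os.path.exists(include_dir) and "CONDA_PREFIX" in os.environ:
>     include_dir = os.path.join(os.environ["CONDA_PREFIX"], "include")
> print(include_dir)
> {code}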



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18129) get_include() gives wrong directory

2022-10-24 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18129:
--
Labels: triaged  (was: )

> get_include() gives wrong directory
> ---
>
> Key: ARROW-18129
> URL: https://issues.apache.org/jira/browse/ARROW-18129
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
> Environment: conda
>Reporter: Left Screen
>Priority: Minor
>  Labels: triaged
>
> {{get_include}} seems to do:
>  
> {code:java}
> def get_include():
>     """
>     Return absolute path to directory containing Arrow C++ include
>     headers. Similar to numpy.get_include
>     """
>     return _os.path.join(_os.path.dirname(__file__), 'include') {code}
> This returns something like:
> {code:java}
> /path/to/myconda/envs/envname/lib/python3.8/site-packages/pyarrow/include{code}
> which does not exist in a conda environment. The headers are actually 
> installed under:
>  
> {code:java}
> $ echo $CONDA_PREFIX
> /path/to/myconda/envs/envname
> $ ls $CONDA_PREFIX/include/arrow | head
> adapters
> api.h
> array
> array.h
> buffer_builder.h
> buffer.h
> builder.h
> c
> chunked_array.h
> chunk_resolver.h
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18116) [R][Doc] correct paths for the read_parquet examples in cloud storage vignette

2022-10-24 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18116:
--
Labels: triaged  (was: )

> [R][Doc] correct paths for the read_parquet examples in cloud storage vignette
> --
>
> Key: ARROW-18116
> URL: https://issues.apache.org/jira/browse/ARROW-18116
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, R
>Reporter: Stephanie Hazlitt
>Priority: Major
>  Labels: triaged
>
> The S3 file paths don't run:
> {code:java}
> > library(arrow) 
> > read_parquet(file = 
> > "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet") 
> Error in url(file, open = "rb") : URL scheme unsupported by this method{code}
> It looks like the file names are `part-0.parquet`, not `data.parquet`.
> This runs:
> {code:java}
> read_parquet(file = 
> "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet"){code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18114) [R] unify_schemas=FALSE does not improve open_dataset() read times

2022-10-24 Thread Alessandro Molina (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623258#comment-17623258
 ] 

Alessandro Molina commented on ARROW-18114:
---

[~westonpace] I checked that the R bindings seem to be properly passing 
{{InspectOptions}}: 
https://github.com/apache/arrow/blob/afc6840c28f69f1554fa4a974b90195348f48978/r/src/dataset.cpp#L135-L142

Is this the expected behaviour?

> [R] unify_schemas=FALSE does not improve open_dataset() read times
> --
>
> Key: ARROW-18114
> URL: https://issues.apache.org/jira/browse/ARROW-18114
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Carl Boettiger
>Priority: Major
>
> open_dataset() provides the very helpful optional argument to set 
> unify_schemas=FALSE, which should allow arrow to inspect a single parquet 
> file instead of touching potentially thousands or more parquet files to 
> determine a consistent unified schema.  This ought to provide a substantial 
> performance increase in contexts where the schema is known in advance.
> Unfortunately, in my tests it seems to have no impact on performance.  
> Consider the following reprexes:
>  default, unify_schemas=TRUE 
> {code:java}
> library(arrow)
> ex <- s3_bucket("neon4cast-scores/parquet/terrestrial_30min", 
> endpoint_override = "data.ecoforecast.org", anonymous=TRUE)
> bench::bench_time(
> { open_dataset(ex) }
> ){code}
> about 32 seconds for me.
>  manual, unify_schemas=FALSE:  
> {code:java}
> bench::bench_time({
> open_dataset(ex, unify_schemas = FALSE)
> }){code}
> takes about 32 seconds as well. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18137) [Python][Docs] Allow passing no aggregations to TableGroupBy.aggregate

2022-10-24 Thread Alessandro Molina (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623257#comment-17623257
 ] 

Alessandro Molina commented on ARROW-18137:
---

I wonder if we should have a dedicated helper {{TableGroupBy.collect()}} 
instead of abusing {{TableGroupBy.aggregate()}}; if {{collect()}} is then 
implemented under the hood by invoking {{aggregate([])}}, that's an 
implementation detail.

> [Python][Docs] Allow passing no aggregations to TableGroupBy.aggregate
> --
>
> Key: ARROW-18137
> URL: https://issues.apache.org/jira/browse/ARROW-18137
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Documentation, Python
>Affects Versions: 9.0.0
>Reporter: Jacek Pliszka
>Assignee: Jacek Pliszka
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> If we could allow TableGroupBy.aggregate to accept no aggregation functions 
> then it would behave like pandas drop_duplicates.
> {code:python}
> t.group_by(['keys', 'values']).aggregate()
> {code}
> I did some naive benchmarks and it looks like it should be about 30% faster 
> than converting to pandas and deduplicating. This was my naive test:
> {code:python}
>  t.append_column('i', pa.array([1]*len(t),pa.int64())).group_by(['keys', 
> 'values']).aggregate([("i", "max")]).drop(['i_max'])
> {code}
> On a small 5M-row table it took 245 ms, versus 359 ms for 
> t.to_pandas().drop_duplicates().
> An actual aggregation without the dummy column should be even faster, and it 
> would provide drop_duplicates functionality until a better implementation 
> arrives.
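> Assuming the change lands as described, the deduplication would then read 
> (editor's sketch):
> {code:python}
> import pyarrow as pa
> 
> t = pa.table({"keys": ["a", "a", "b"], "values": [1, 1, 2]})
> 
> # An empty aggregation list keeps one row per distinct
> # (keys, values) pair, like pandas drop_duplicates.
> print(t.group_by(["keys", "values"]).aggregate([]))
> {code}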



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18114) [R] unify_schemas=FALSE does not improve open_dataset() read times

2022-10-24 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18114:
--
Description: 
open_dataset() provides the very helpful optional argument to set 
unify_schemas=FALSE, which should allow arrow to inspect a single parquet file 
instead of touching potentially thousands or more parquet files to determine a 
consistent unified schema.  This ought to provide a substantial performance 
increase in contexts where the schema is known in advance.

Unfortunately, in my tests it seems to have no impact on performance.  Consider 
the following reprexes:

 default, unify_schemas=TRUE 
{code:java}
library(arrow)
ex <- s3_bucket("neon4cast-scores/parquet/terrestrial_30min", endpoint_override 
= "data.ecoforecast.org", anonymous=TRUE)
bench::bench_time(
{ open_dataset(ex) }
){code}
about 32 seconds for me.

 manual, unify_schemas=FALSE:  
{code:java}
bench::bench_time({
open_dataset(ex, unify_schemas = FALSE)
}){code}
takes about 32 seconds as well. 

  was:
open_dataset() provides the very helpful optional argument to set 
unify_schemas=FALSE, which should allow arrow to inspect a single parquet file 
instead of touching potentially thousands or more parquet files to determine a 
consistent unified schema.  This ought to provide a substantial performance 
increase in contexts where the schema is known in advance. 

Unfortunately, in my tests it seems to have no impact on performance.  Consider 
the following reprexes:

default, unify_schemas=TRUE
library(arrow)
 ex <- s3_bucket("neon4cast-scores/parquet/terrestrial_30min", 
endpoint_override = "data.ecoforecast.org", anonymous=TRUE)

bench::bench_time({
open_dataset(ex) 
})
about 32 seconds for me.

manual, unify_schemas=FALSE:

 
bench::bench_time(\{

open_dataset(ex, unify_schemas = FALSE)

})
takes about 32 seconds as well. 


> [R] unify_schemas=FALSE does not improve open_dataset() read times
> --
>
> Key: ARROW-18114
> URL: https://issues.apache.org/jira/browse/ARROW-18114
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Carl Boettiger
>Priority: Major
>
> open_dataset() provides the very helpful optional argument to set 
> unify_schemas=FALSE, which should allow arrow to inspect a single parquet 
> file instead of touching potentially thousands or more parquet files to 
> determine a consistent unified schema.  This ought to provide a substantial 
> performance increase in contexts where the schema is known in advance.
> Unfortunately, in my tests it seems to have no impact on performance.  
> Consider the following reprexes:
>  default, unify_schemas=TRUE 
> {code:java}
> library(arrow)
> ex <- s3_bucket("neon4cast-scores/parquet/terrestrial_30min", 
> endpoint_override = "data.ecoforecast.org", anonymous=TRUE)
> bench::bench_time(
> { open_dataset(ex) }
> ){code}
> about 32 seconds for me.
>  manual, unify_schemas=FALSE:  
> {code:java}
> bench::bench_time({
> open_dataset(ex, unify_schemas = FALSE)
> }){code}
> takes about 32 seconds as well. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18089) [R] Cannot read_parquet on http URL

2022-10-24 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18089:
--
Priority: Critical  (was: Major)

> [R] Cannot read_parquet on http URL
> ---
>
> Key: ARROW-18089
> URL: https://issues.apache.org/jira/browse/ARROW-18089
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Priority: Critical
>  Labels: triaged
> Fix For: 11.0.0
>
>
> {code}
> u <- 
> "https://raw.githubusercontent.com/apache/arrow/master/r/inst/v0.7.1.parquet";
> read_parquet(u)
> # Error: file must be a "RandomAccessFile"
> read_parquet(url(u))
> # Error: file must be a "RandomAccessFile"
> {code}
> The issue is that URLs get turned into an InputStream by 
> {{make_readable_file}}, and parquet requires a RandomAccessFile. 
> {code}
> arrow:::make_readable_file(u)
> # InputStream
> {code}
> There are two relevant codepaths in make_readable_file: if given a string 
> URL, it tries {{FileSystem$from_uri()}} and falls back to 
> {{MakeRConnectionInputStream}}, which returns InputStream not 
> RandomAccessFile. If provided a connection object (i.e. {{url(u)}}), it tries 
> MakeRConnectionRandomAccessFile first and falls back to 
> MakeRConnectionInputStream. If you provide a {{url()}} it does fall back to 
> InputStream: 
> {code}
> arrow:::MakeRConnectionRandomAccessFile(url(u))
> # Error: Tell() returned an error
> {code}
> If we truly can't work with a HTTP URL in read_parquet, we should at least 
> document that. We could also do the workaround of downloading to a tempfile 
> first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18089) [R] Cannot read_parquet on http URL

2022-10-24 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18089:
--
Labels: triaged  (was: )

> [R] Cannot read_parquet on http URL
> ---
>
> Key: ARROW-18089
> URL: https://issues.apache.org/jira/browse/ARROW-18089
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
>  Labels: triaged
> Fix For: 11.0.0
>
>
> {code}
> u <- 
> "https://raw.githubusercontent.com/apache/arrow/master/r/inst/v0.7.1.parquet";
> read_parquet(u)
> # Error: file must be a "RandomAccessFile"
> read_parquet(url(u))
> # Error: file must be a "RandomAccessFile"
> {code}
> The issue is that URLs get turned into an InputStream by 
> {{make_readable_file}}, and parquet requires a RandomAccessFile. 
> {code}
> arrow:::make_readable_file(u)
> # InputStream
> {code}
> There are two relevant codepaths in make_readable_file: if given a string 
> URL, it tries {{FileSystem$from_uri()}} and falls back to 
> {{MakeRConnectionInputStream}}, which returns InputStream not 
> RandomAccessFile. If provided a connection object (i.e. {{url(u)}}), it tries 
> MakeRConnectionRandomAccessFile first and falls back to 
> MakeRConnectionInputStream. If you provide a {{url()}} it does fall back to 
> InputStream: 
> {code}
> arrow:::MakeRConnectionRandomAccessFile(url(u))
> # Error: Tell() returned an error
> {code}
> If we truly can't work with a HTTP URL in read_parquet, we should at least 
> document that. We could also do the workaround of downloading to a tempfile 
> first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18123) [Python] Cannot use multi-byte characters in file names in write_table

2022-10-24 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18123:
--
Summary: [Python] Cannot use multi-byte characters in file names in 
write_table  (was: [Python] Cannot use multi-byte characters in file names)

> [Python] Cannot use multi-byte characters in file names in write_table
> --
>
> Key: ARROW-18123
> URL: https://issues.apache.org/jira/browse/ARROW-18123
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> Error when specifying a file path containing multi-byte characters in 
> {{pyarrow.parquet.write_table}}.
> For example, use {{例.parquet}} as the file path.
> {code:python}
> Python 3.10.7 (main, Oct  5 2022, 14:33:54) [GCC 10.2.1 20210110] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pandas as pd
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> df = pd.DataFrame({'one': [-1, np.nan, 2.5],
> ...'two': ['foo', 'bar', 'baz'],
> ...'three': [True, False, True]},
> ...index=list('abc'))
> >>> table = pa.Table.from_pandas(df)
> >>> import pyarrow.parquet as pq
> >>> pq.write_table(table, '例.parquet')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 2920, in write_table
> with ParquetWriter(
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 911, in __init__
> filesystem, path = _resolve_filesystem_and_path(
>   File "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/fs.py", line
> 184, in _resolve_filesystem_and_path
> filesystem, path = FileSystem.from_uri(path)
>   File "pyarrow/_fs.pyx", line 463, in pyarrow._fs.FileSystem.from_uri
>   File "pyarrow/error.pxi", line 144, in
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Cannot parse URI: '例.parquet'
> {code}
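> An editor's note on a workaround: absolute paths are not run through the 
> URI-parsing branch that fails above, so until relative paths are fixed this 
> should work:
> {code:python}
> import os
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> table = pa.table({"a": [1]})
> 
> # os.path.abspath avoids FileSystem.from_uri() choking on the
> # multi-byte relative path.
> pq.write_table(table, os.path.abspath("例.parquet"))
> {code}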



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18123) [Python] Cannot use multi-byte characters in file names

2022-10-24 Thread Alessandro Molina (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623249#comment-17623249
 ] 

Alessandro Molina commented on ARROW-18123:
---

Fair point, I got distracted by the ticket summary. The problem is not that 
we don't generally support unicode file names, but that {{write_table}}, 
which should definitely support relative paths, doesn't. I'll update the 
ticket summary to make that clear and reopen it.

> [Python] Cannot use multi-byte characters in file names
> ---
>
> Key: ARROW-18123
> URL: https://issues.apache.org/jira/browse/ARROW-18123
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> Error when specifying a file path containing multi-byte characters in 
> {{pyarrow.parquet.write_table}}.
> For example, use {{例.parquet}} as the file path.
> {code:python}
> Python 3.10.7 (main, Oct  5 2022, 14:33:54) [GCC 10.2.1 20210110] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pandas as pd
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> df = pd.DataFrame({'one': [-1, np.nan, 2.5],
> ...'two': ['foo', 'bar', 'baz'],
> ...'three': [True, False, True]},
> ...index=list('abc'))
> >>> table = pa.Table.from_pandas(df)
> >>> import pyarrow.parquet as pq
> >>> pq.write_table(table, '例.parquet')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 2920, in write_table
> with ParquetWriter(
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 911, in __init__
> filesystem, path = _resolve_filesystem_and_path(
>   File "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/fs.py", line
> 184, in _resolve_filesystem_and_path
> filesystem, path = FileSystem.from_uri(path)
>   File "pyarrow/_fs.pyx", line 463, in pyarrow._fs.FileSystem.from_uri
>   File "pyarrow/error.pxi", line 144, in
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Cannot parse URI: '例.parquet'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (ARROW-18123) [Python] Cannot use multi-byte characters in file names

2022-10-24 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina reopened ARROW-18123:
---

> [Python] Cannot use multi-byte characters in file names
> ---
>
> Key: ARROW-18123
> URL: https://issues.apache.org/jira/browse/ARROW-18123
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> Error when specifying a file path containing multi-byte characters in 
> {{pyarrow.parquet.write_table}}.
> For example, use {{例.parquet}} as the file path.
> {code:python}
> Python 3.10.7 (main, Oct  5 2022, 14:33:54) [GCC 10.2.1 20210110] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pandas as pd
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> df = pd.DataFrame({'one': [-1, np.nan, 2.5],
> ...'two': ['foo', 'bar', 'baz'],
> ...'three': [True, False, True]},
> ...index=list('abc'))
> >>> table = pa.Table.from_pandas(df)
> >>> import pyarrow.parquet as pq
> >>> pq.write_table(table, '例.parquet')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 2920, in write_table
> with ParquetWriter(
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 911, in __init__
> filesystem, path = _resolve_filesystem_and_path(
>   File "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/fs.py", line
> 184, in _resolve_filesystem_and_path
> filesystem, path = FileSystem.from_uri(path)
>   File "pyarrow/_fs.pyx", line 463, in pyarrow._fs.FileSystem.from_uri
>   File "pyarrow/error.pxi", line 144, in
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Cannot parse URI: '例.parquet'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18123) [Python] Cannot use multi-byte characters in file names in write_table

2022-10-24 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18123:
--
Priority: Critical  (was: Major)

> [Python] Cannot use multi-byte characters in file names in write_table
> --
>
> Key: ARROW-18123
> URL: https://issues.apache.org/jira/browse/ARROW-18123
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Critical
>
> Error when specifying a file path containing multi-byte characters in 
> {{pyarrow.parquet.write_table}}.
> For example, use {{例.parquet}} as the file path.
> {code:python}
> Python 3.10.7 (main, Oct  5 2022, 14:33:54) [GCC 10.2.1 20210110] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pandas as pd
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> df = pd.DataFrame({'one': [-1, np.nan, 2.5],
> ...'two': ['foo', 'bar', 'baz'],
> ...'three': [True, False, True]},
> ...index=list('abc'))
> >>> table = pa.Table.from_pandas(df)
> >>> import pyarrow.parquet as pq
> >>> pq.write_table(table, '例.parquet')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 2920, in write_table
> with ParquetWriter(
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 911, in __init__
> filesystem, path = _resolve_filesystem_and_path(
>   File "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/fs.py", line
> 184, in _resolve_filesystem_and_path
> filesystem, path = FileSystem.from_uri(path)
>   File "pyarrow/_fs.pyx", line 463, in pyarrow._fs.FileSystem.from_uri
>   File "pyarrow/error.pxi", line 144, in
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Cannot parse URI: '例.parquet'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18099) [Python] Cannot create pandas categorical from table only with nulls

2022-10-24 Thread Alessandro Molina (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623235#comment-17623235
 ] 

Alessandro Molina commented on ARROW-18099:
---

[~jorisvandenbossche] what is your thinking on this one? The need to be able 
to convert to pandas categoricals seems reasonable; I'm just not sure the 
result semantically retains the same meaning from the point of view of 
missing/null values.
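
For context, pandas itself treats None/NaN as "missing" rather than as a 
category, so an all-null column becomes an empty categorical; a quick 
illustration of the semantics in question:

{code:python}
import pandas as pd

# An all-null series converted to categorical keeps the values as NaN and
# ends up with zero categories.
s = pd.Series([None, None], dtype="category")
print(s.cat.categories)  # Index([], dtype='object')
print(s.isna().all())    # True
{code}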

> [Python] Cannot create pandas categorical from table only with nulls
> 
>
> Key: ARROW-18099
> URL: https://issues.apache.org/jira/browse/ARROW-18099
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
> Environment: OSX 12.6
> M1 silicon
>Reporter: Damian Barabonkov
>Priority: Minor
>  Labels: python-conversion
>
> A pyarrow Table with only null values cannot be instantiated as a Pandas 
> DataFrame with said column as a category. However, pandas does support 
> "empty" categoricals. Therefore, a simple patch would be to load the pa.Table 
> as an object first and convert, once in pandas, to a categorical which will 
> be empty. However, that does not solve the pyarrow bug at its root.
>  
> Sample reproducible example
> {code:java}
> import pyarrow as pa
> pylist = [{'x': None, '__index_level_0__': 2}, {'x': None, 
> '__index_level_0__': 3}]
> tbl = pa.Table.from_pylist(pylist)
>  
> # Errors
> df_broken = tbl.to_pandas(categories=["x"])
>  
> # Works
> df_works = tbl.to_pandas()
> df_works = df_works.astype({"x": "category"}) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18139) [C++][Release] Verification failures on CentOS7

2022-10-24 Thread Alessandro Molina (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623220#comment-17623220
 ] 

Alessandro Molina commented on ARROW-18139:
---

[~raulcd] is this something that we should bring up in the context of the 
current RC?

> [C++][Release] Verification failures on CentOS7
> ---
>
> Key: ARROW-18139
> URL: https://issues.apache.org/jira/browse/ARROW-18139
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, FlightRPC
>Reporter: Benson Muite
>Priority: Minor
>
> When verifying on CentOS 7 with OpenSSL 3.0, devtoolset 13 and 
> llvm-devtoolset 8, I get the following errors:
> ```
> /root/apache-arrow-10.0.0/cpp/src/arrow/util/value_parsing_test.cc:805: 
> Failure
> Expected equality of these values:
>   expected
> Which is: 1514769420
>   converted
> Which is: 1514769408
> Google Test trace:
> /root/apache-arrow-10.0.0/cpp/src/arrow/util/value_parsing_test.cc:800: 
> 2018-01-01 00:00:00-0117
> [  FAILED  ] TimestampParser.StrptimeZoneOffset (0 ms)
> ```
> ```
> /root/apache-arrow-10.0.0/cpp/src/arrow/flight/sql/server_test.cc:329: Failure
> Failed
> '_error_or_value59.status()' failed with Invalid: Can't prepare statement: 
> near "(": syntax error. gRPC client debug context: 
> {"created":"@1666547640.550873050","description":"Error received from peer 
> ipv6:[::1]:41650","file":"/tmp/arrow-HEAD.bOtfP/cpp-build/grpc_ep-prefix/src/grpc_ep/src/core/lib/surface/call.cc","file_line":952,"grpc_message":"Can't
>  prepare statement: near "(": syntax error","grpc_status":3}. Client context: 
> OK
> [  FAILED  ] TestFlightSqlServer.TestCommandGetTablesWithIncludedSchemas (12 
> ms)
> ```
> ```
> /root/apache-arrow-10.0.0/cpp/src/arrow/flight/sql/server_test.cc:618: Failure
> Failed
> '_error_or_value104.status()' failed with Invalid: Can't prepare statement: 
> near "(": syntax error. gRPC client debug context: 
> {"created":"@1666547640.649382989","description":"Error received from peer 
> ipv6:[::1]:38451","file":"/tmp/arrow-HEAD.bOtfP/cpp-build/grpc_ep-prefix/src/grpc_ep/src/core/lib/surface/call.cc","file_line":952,"grpc_message":"Can't
>  prepare statement: near "(": syntax error","grpc_status":3}. Client context: 
> IOError: Server never sent a data message. Detail: Internal
> [  FAILED  ] TestFlightSqlServer.TestCommandGetPrimaryKeys (15 ms)
> ```
> ```
> /root/apache-arrow-10.0.0/cpp/src/arrow/flight/sql/server_test.cc:642: Failure
> Failed
> '_error_or_value107.status()' failed with Invalid: Can't prepare statement: 
> near "(": syntax error. gRPC client debug context: 
> {"created":"@1666547640.653859201","description":"Error received from peer 
> ipv6:[::1]:38210","file":"/tmp/arrow-HEAD.bOtfP/cpp-build/grpc_ep-prefix/src/grpc_ep/src/core/lib/surface/call.cc","file_line":952,"grpc_message":"Can't
>  prepare statement: near "(": syntax error","grpc_status":3}. Client context: 
> IOError: Server never sent a data message. Detail: Internal
> [  FAILED  ] TestFlightSqlServer.TestCommandGetImportedKeys (4 ms)
> ```
> ```
> /root/apache-arrow-10.0.0/cpp/src/arrow/flight/sql/server_test.cc:674: Failure
> Failed
> '_error_or_value110.status()' failed with Invalid: Can't prepare statement: 
> near "(": syntax error. gRPC client debug context: 
> {"created":"@1666547640.657013835","description":"Error received from peer 
> ipv6:[::1]:36523","file":"/tmp/arrow-HEAD.bOtfP/cpp-build/grpc_ep-prefix/src/grpc_ep/src/core/lib/surface/call.cc","file_line":952,"grpc_message":"Can't
>  prepare statement: near "(": syntax error","grpc_status":3}. Client context: 
> IOError: Server never sent a data message. Detail: Internal
> [  FAILED  ] TestFlightSqlServer.TestCommandGetExportedKeys (5 ms)
> ```
> ```
> /root/apache-arrow-10.0.0/cpp/src/arrow/flight/sql/server_test.cc:708: Failure
> Failed
> '_error_or_value113.status()' failed with Invalid: Can't prepare statement: 
> near "(": syntax error. gRPC client debug context: 
> {"created":"@1666547640.662347626","description":"Error received from peer 
> ipv6:[::1]:43394","file":"/tmp/arrow-HEAD.bOtfP/cpp-build/grpc_ep-prefix/src/grpc_ep/src/core/lib/surface/call.cc","file_line":952,"grpc_message":"Can't
>  prepare statement: near "(": syntax error","grpc_status":3}. Client context: 
> IOError: Server never sent a data message. Detail: Internal
> [  FAILED  ] TestFlightSqlServer.TestCommandGetCrossReference (3 ms)
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18133) [C++] Update "options" handling for Substrait functions

2022-10-24 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18133:
--
Labels: triaged  (was: )

> [C++] Update "options" handling for Substrait functions
> ---
>
> Key: ARROW-18133
> URL: https://issues.apache.org/jira/browse/ARROW-18133
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>  Labels: triaged
>
> ARROW-17966 will adjust to the new style of options.  However, many of our 
> existing mappings are treating options incorrectly, largely because the 
> Substrait function definitions changed after the mappings were added.  For 
> example, the arithmetic functions are always looking for an "overflow" 
> option, even though that option is only defined for integral kernels.
> For the time being, this is mostly harmless.  No producers that I am aware of 
> specify options yet.  When they do, some of these issues will probably still 
> be harmless.  For example, it should not hurt to look for an overflow option 
> when it can never be defined.
> However, we should do a pass through and cleanup our handling of options once 
> the Substrait spec has stabilized.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18134) [C++][CI] Add Substrait integration testing to CI

2022-10-24 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18134:
--
Labels: triaged  (was: )

> [C++][CI] Add Substrait integration testing to CI
> -
>
> Key: ARROW-18134
> URL: https://issues.apache.org/jira/browse/ARROW-18134
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>  Labels: triaged
>
> I don't think we are quite ready (need some of the options handling things to 
> stabilize) but pretty soon we should be able to pass (or expected fail) the 
> tests defined at https://github.com/substrait-io/consumer-testing
> We should add some portion of these tests to our CI as an integration test 
> (nightly is fine) so that we can detect when Substrait definitions change 
> (this should eventually be a rare / non-existent event but in the meantime it 
> is not)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18134) [C++][CI] Add Substrait integration testing to CI

2022-10-24 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18134:
--
Issue Type: Improvement  (was: Bug)

> [C++][CI] Add Substrait integration testing to CI
> -
>
> Key: ARROW-18134
> URL: https://issues.apache.org/jira/browse/ARROW-18134
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>
> I don't think we are quite ready (need some of the options handling things to 
> stabilize) but pretty soon we should be able to pass (or expected fail) the 
> tests defined at https://github.com/substrait-io/consumer-testing
> We should add some portion of these tests to our CI as an integration test 
> (nightly is fine) so that we can detect when Substrait definitions change 
> (this should eventually be a rare / non-existent event but in the meantime it 
> is not)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18100) [C++] Intermittent failure in TestNewScanner.Backpressure

2022-10-24 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18100:
--
Labels: triaged  (was: )

> [C++] Intermittent failure in TestNewScanner.Backpressure
> -
>
> Key: ARROW-18100
> URL: https://issues.apache.org/jira/browse/ARROW-18100
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>  Labels: triaged
>
> For example:
> https://github.com/ursacomputing/crossbow/actions/runs/3277989378
> /jobs/5395881371#step:5:3133



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18140) The metadata info will be lost in the parquet file schema after writing the parquet file by calling the FileSystemDataset::Write() method.

2022-10-24 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18140:
--
Component/s: C++

> The metadata info will be lost in the parquet file schema after writing the 
> parquet file by calling the FileSystemDataset::Write() method.
> ---
>
> Key: ARROW-18140
> URL: https://issues.apache.org/jira/browse/ARROW-18140
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Ke Jia
>Priority: Major
>
> This issue can be reproduced by the following code.
> auto format = std::make_shared<ParquetFileFormat>();
> auto fs = std::make_shared<fs::internal::MockFileSystem>(fs::kNoTime);
> FileSystemDatasetWriteOptions write_options;
> write_options.file_write_options = format->DefaultWriteOptions();
> write_options.filesystem = fs;
> write_options.base_dir = "root";
> write_options.partitioning = std::make_shared<HivePartitioning>(schema({}));
> write_options.basename_template = "{i}.parquet";
> auto metadata =
>     std::shared_ptr<KeyValueMetadata>(new KeyValueMetadata({"foo"}, {"bar"}));
> auto dataset_schema = schema({field("a", int64())}, metadata);
> RecordBatchVector batches{
>     ConstantArrayGenerator::Zeroes(kRowsPerBatch, dataset_schema)};
> ASSERT_EQ(0, batches[0]->column(0)->null_count());
> auto dataset = std::make_shared<InMemoryDataset>(dataset_schema, batches);
> ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan());
> ASSERT_OK(scanner_builder->Project(
>     {compute::call("add", {compute::field_ref("a"), compute::literal(1)})},
>     {"a_plus_one"}));
> ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
> // Before the write the schema has the metadata info.
> ASSERT_EQ(1, dataset_schema->HasMetadata());
> ASSERT_OK(FileSystemDataset::Write(write_options, scanner));
> ASSERT_OK_AND_ASSIGN(auto dataset_factory, FileSystemDatasetFactory::Make(
>                          fs, {"root/0.parquet"}, format, {}));
> ASSERT_OK_AND_ASSIGN(auto written_dataset,
>                      dataset_factory->Finish(FinishOptions{}));
> // After the write the schema no longer has the metadata info.
> ASSERT_EQ(0, written_dataset->schema()->HasMetadata());



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-18123) [Python] Cannot use multi-byte characters in file names

2022-10-24 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina closed ARROW-18123.
-
Resolution: Not A Bug

> [Python] Cannot use multi-byte characters in file names
> ---
>
> Key: ARROW-18123
> URL: https://issues.apache.org/jira/browse/ARROW-18123
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> Error when specifying a file path containing multi-byte characters in 
> {{pyarrow.parquet.write_table}}.
> For example, use {{例.parquet}} as the file path.
> {code:python}
> Python 3.10.7 (main, Oct  5 2022, 14:33:54) [GCC 10.2.1 20210110] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pandas as pd
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> df = pd.DataFrame({'one': [-1, np.nan, 2.5],
> ...'two': ['foo', 'bar', 'baz'],
> ...'three': [True, False, True]},
> ...index=list('abc'))
> >>> table = pa.Table.from_pandas(df)
> >>> import pyarrow.parquet as pq
> >>> pq.write_table(table, '例.parquet')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 2920, in write_table
> with ParquetWriter(
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 911, in __init__
> filesystem, path = _resolve_filesystem_and_path(
>   File "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/fs.py", line
> 184, in _resolve_filesystem_and_path
> filesystem, path = FileSystem.from_uri(path)
>   File "pyarrow/_fs.pyx", line 463, in pyarrow._fs.FileSystem.from_uri
>   File "pyarrow/error.pxi", line 144, in
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Cannot parse URI: '例.parquet'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18123) [Python] Cannot use multi-byte characters in file names

2022-10-24 Thread Alessandro Molina (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623170#comment-17623170
 ] 

Alessandro Molina commented on ARROW-18123:
---

The documentation states
{code:java}
the argument can be a pathlib.Path object, or a string describing an absolute 
local path. {code}
*absolute local path* is the key here
{code:java}
>>> f = pyarrow.fs.FileSystem.from_uri("/home/amol/ARROW/arrow/python/例.pippo")
>>> f
(<pyarrow._fs.LocalFileSystem object at 0x...>, 
'/home/amol/ARROW/arrow/python/例.pippo')
>>> f[0].open_input_file(f[1]).read()
b''{code}
 

If you are willing to use a local path, you can rely on {{pathlib.Path}} for 
that
{code:java}
>>> f = pyarrow.fs.FileSystem.from_uri(pathlib.Path("例.pippo"))
>>> f
(<pyarrow._fs.LocalFileSystem object at 0x...>, 
'/home/amol/ARROW/arrow/python/例.pippo')
>>> f[0].open_input_file(f[1])
<pyarrow.NativeFile ...>
>>> f[0].open_input_file(f[1]).read()
b''
{code}
 

Trying to use an actual URI (with the {{file://}} scheme) will result in an 
error by the way, and that should probably be supported too:
{code:java}
>>> f = 
>>> pyarrow.fs.FileSystem.from_uri("file:///home/amol/ARROW/arrow/python/例.pippo")
Traceback (most recent call last):
  File "", line 1, in 
  File "pyarrow/_fs.pyx", line 463, in pyarrow._fs.FileSystem.from_uri
    return FileSystem.wrap(GetResultValue(result)), frombytes(c_path)
  File "pyarrow/error.pxi", line 144, in 
pyarrow.lib.pyarrow_internal_check_status
    return check_status(status)
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
    raise ArrowInvalid(message)
pyarrow.lib.ArrowInvalid: Cannot parse URI: 
'file:///home/amol/ARROW/arrow/python/例.pippo'{code}
As URIs are expected to be percent-encoded, I tried percent-encoding the 
provided URI. That parses as expected, but since the file path is not decoded 
afterwards, it results in {{NotFound}} errors. 
{code:java}
>>> f = 
>>> pyarrow.fs.FileSystem.from_uri("file:///home/amol/ARROW/arrow/python/%E4%BE%8B.pippo")
>>> f
(<pyarrow._fs.LocalFileSystem object at 0x...>, 
'/home/amol/ARROW/arrow/python/%E4%BE%8B.pippo')
>>> f[0].get_file_info(f[1])
<FileInfo for '/home/amol/ARROW/arrow/python/%E4%BE%8B.pippo': ...>
>>> f[0].open_input_file(f[1])
Traceback (most recent call last):
  File "", line 1, in 
  File "pyarrow/_fs.pyx", line 763, in pyarrow._fs.FileSystem.open_input_file
    in_handle = GetResultValue(self.fs.OpenInputFile(pathstr))
  File "pyarrow/error.pxi", line 144, in 
pyarrow.lib.pyarrow_internal_check_status
    return check_status(status)
  File "pyarrow/error.pxi", line 113, in pyarrow.lib.check_status
    raise IOError(errno, message)
FileNotFoundError: [Errno 2] Failed to open local file 
'/home/amol/ARROW/arrow/python/%E4%BE%8B.pippo'. Detail: [errno 2] No such file 
or directory {code}
This should probably be an issue we want to fix.
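
For illustration, the missing step is just percent-decoding the path after 
parsing; a minimal sketch of what the filesystem layer would need to do 
(standard library only, not the actual Arrow code path):

{code:python}
from urllib.parse import unquote, urlparse

uri = "file:///home/amol/ARROW/arrow/python/%E4%BE%8B.pippo"

# Split off the scheme, then decode the percent-encoded path component.
path = unquote(urlparse(uri).path)
print(path)  # /home/amol/ARROW/arrow/python/例.pippo
{code}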

 

> [Python] Cannot use multi-byte characters in file names
> ---
>
> Key: ARROW-18123
> URL: https://issues.apache.org/jira/browse/ARROW-18123
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> Error when specifying a file path containing multi-byte characters in 
> {{pyarrow.parquet.write_table}}.
> For example, use {{例.parquet}} as the file path.
> {code:python}
> Python 3.10.7 (main, Oct  5 2022, 14:33:54) [GCC 10.2.1 20210110] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pandas as pd
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> df = pd.DataFrame({'one': [-1, np.nan, 2.5],
> ...'two': ['foo', 'bar', 'baz'],
> ...'three': [True, False, True]},
> ...index=list('abc'))
> >>> table = pa.Table.from_pandas(df)
> >>> import pyarrow.parquet as pq
> >>> pq.write_table(table, '例.parquet')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 2920, in write_table
> with ParquetWriter(
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 911, in __init__
> filesystem, path = _resolve_filesystem_and_path(
>   File "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/fs.py", line
> 184, in _resolve_filesystem_and_path
> filesystem, path = FileSystem.from_uri(path)
>   File "pyarrow/_fs.pyx", line 463, in pyarrow._fs.FileSystem.from_uri
>   File "pyarrow/error.pxi", line 144, in
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Cannot parse URI: '例.parquet'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18123) [Python] Cannot use multi-byte characters in file names

2022-10-24 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18123:
--
Issue Type: Bug  (was: Improvement)

> [Python] Cannot use multi-byte characters in file names
> ---
>
> Key: ARROW-18123
> URL: https://issues.apache.org/jira/browse/ARROW-18123
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> Error when specifying a file path containing multi-byte characters in 
> {{pyarrow.parquet.write_table}}.
> For example, use {{例.parquet}} as the file path.
> {code:python}
> Python 3.10.7 (main, Oct  5 2022, 14:33:54) [GCC 10.2.1 20210110] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pandas as pd
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> df = pd.DataFrame({'one': [-1, np.nan, 2.5],
> ...'two': ['foo', 'bar', 'baz'],
> ...'three': [True, False, True]},
> ...index=list('abc'))
> >>> table = pa.Table.from_pandas(df)
> >>> import pyarrow.parquet as pq
> >>> pq.write_table(table, '例.parquet')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 2920, in write_table
> with ParquetWriter(
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 911, in __init__
> filesystem, path = _resolve_filesystem_and_path(
>   File "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/fs.py", line
> 184, in _resolve_filesystem_and_path
> filesystem, path = FileSystem.from_uri(path)
>   File "pyarrow/_fs.pyx", line 463, in pyarrow._fs.FileSystem.from_uri
>   File "pyarrow/error.pxi", line 144, in
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Cannot parse URI: '例.parquet'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18076) [Python] PyArrow cannot read from R2 (Cloudflare's S3)

2022-10-24 Thread Alessandro Molina (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623138#comment-17623138
 ] 

Alessandro Molina commented on ARROW-18076:
---

Have you tried reaching out to Cloudflare to verify whether it might be a 
problem with the file itself? That error is usually caused by a mismatch 
between the {{Content-Length}} header and the amount of bytes actually 
transferred. In the majority of cases the problem is the server setting a 
wrong {{Content-Length}} or truncating the connection, so I would check with 
Cloudflare support, especially since you say the same file works correctly on 
S3.

I'm going to close the ticket; if you get an answer from Cloudflare confirming 
that everything is fine on their side, feel free to reopen it.
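
A quick way to check the mismatch from the client side, independently of 
pyarrow; the URL below is a hypothetical placeholder for the R2 object:

{code:python}
import urllib.request

# Hypothetical placeholder; substitute a (presigned) URL for the R2 object.
URL = "https://example.r2.cloudflarestorage.com/bucket/file.parquet"

with urllib.request.urlopen(URL) as resp:
    advertised = int(resp.headers["Content-Length"])  # if the server sends it
    received = len(resp.read())

# If these differ, the server advertised a wrong Content-Length or truncated
# the connection, which is what curl reports as error 18.
print(advertised, received)
{code}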

> [Python] PyArrow cannot read from R2 (Cloudflare's S3)
> --
>
> Key: ARROW-18076
> URL: https://issues.apache.org/jira/browse/ARROW-18076
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Ubuntu 20
>Reporter: Vedant Roy
>Priority: Major
>
> When using pyarrow to read parquet data (as part of the Ray project), I get 
> the following stack trace:
> {noformat}
> (_sample_piece pid=49818) Traceback (most recent call last):
> (_sample_piece pid=49818)   File "python/ray/_raylet.pyx", line 859, in 
> ray._raylet.execute_task
> (_sample_piece pid=49818)   File "python/ray/_raylet.pyx", line 863, in 
> ray._raylet.execute_task
> (_sample_piece pid=49818)   File 
> "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py",
>  line 446, in _sample_piece
> (_sample_piece pid=49818) batch = next(batches)
> (_sample_piece pid=49818)   File "pyarrow/_dataset.pyx", line 3202, in 
> _iterator
> (_sample_piece pid=49818)   File "pyarrow/_dataset.pyx", line 2891, in 
> pyarrow._dataset.TaggedRecordBatchIterator.__next__
> (_sample_piece pid=49818)   File "pyarrow/error.pxi", line 143, in 
> pyarrow.lib.pyarrow_internal_check_status
> (_sample_piece pid=49818)   File "pyarrow/error.pxi", line 114, in 
> pyarrow.lib.check_status
> (_sample_piece pid=49818) OSError: AWS Error [code 99]: curlCode: 18, 
> Transferred a partial file
> {noformat}
> I do not get this error when using Amazon S3 for the exact same data.
> The error is coming from this line:
> https://github.com/ray-project/ray/blob/6fb605379a726d889bd25cf0ee4ed335c74408ff/python/ray/data/datasource/parquet_datasource.py#L446



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-18076) [Python] PyArrow cannot read from R2 (Cloudflare's S3)

2022-10-24 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina closed ARROW-18076.
-
Resolution: Not A Bug

> [Python] PyArrow cannot read from R2 (Cloudflare's S3)
> --
>
> Key: ARROW-18076
> URL: https://issues.apache.org/jira/browse/ARROW-18076
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Ubuntu 20
>Reporter: Vedant Roy
>Priority: Major
>
> When using pyarrow to read parquet data (as part of the Ray project), I get 
> the following stack trace:
> {noformat}
> (_sample_piece pid=49818) Traceback (most recent call last):
> (_sample_piece pid=49818)   File "python/ray/_raylet.pyx", line 859, in 
> ray._raylet.execute_task
> (_sample_piece pid=49818)   File "python/ray/_raylet.pyx", line 863, in 
> ray._raylet.execute_task
> (_sample_piece pid=49818)   File 
> "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py",
>  line 446, in _sample_piece
> (_sample_piece pid=49818) batch = next(batches)
> (_sample_piece pid=49818)   File "pyarrow/_dataset.pyx", line 3202, in 
> _iterator
> (_sample_piece pid=49818)   File "pyarrow/_dataset.pyx", line 2891, in 
> pyarrow._dataset.TaggedRecordBatchIterator.__next__
> (_sample_piece pid=49818)   File "pyarrow/error.pxi", line 143, in 
> pyarrow.lib.pyarrow_internal_check_status
> (_sample_piece pid=49818)   File "pyarrow/error.pxi", line 114, in 
> pyarrow.lib.check_status
> (_sample_piece pid=49818) OSError: AWS Error [code 99]: curlCode: 18, 
> Transferred a partial file
> {noformat}
> I do not get this error when using Amazon S3 for the exact same data.
> The error is coming from this line:
> https://github.com/ray-project/ray/blob/6fb605379a726d889bd25cf0ee4ed335c74408ff/python/ray/data/datasource/parquet_datasource.py#L446



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16658) [Python] Support arithmetic on arrays and scalars

2022-10-19 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-16658:
--
Priority: Major  (was: Critical)

> [Python] Support arithmetic on arrays and scalars
> -
>
> Key: ARROW-16658
> URL: https://issues.apache.org/jira/browse/ARROW-16658
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Will Jones
>Priority: Major
>
> I was surprised to find you can't use standard arithmetic operators on 
> PyArrow arrays and scalars. Instead, one must use the compute functions:
> {code:Python}
> import pyarrow as pa
> arr = pa.array([1, 2, 3])
> pc.add(arr, 2)
> # Doesn't work:
> # arr + 2
> # arr + pa.scalar(2)
> # arr + arr
> pc.multiply(arr, 20)
> # Doesn't work:
> # 20 * arr
> # pa.scalar(20) * arr
> {code}
> Is it intentional we don't support this?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18089) [R] Cannot read_parquet on http URL

2022-10-19 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18089:
--
Labels:   (was: triaged)

> [R] Cannot read_parquet on http URL
> ---
>
> Key: ARROW-18089
> URL: https://issues.apache.org/jira/browse/ARROW-18089
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 11.0.0
>
>
> {code}
> u <- 
> "https://raw.githubusercontent.com/apache/arrow/master/r/inst/v0.7.1.parquet";
> read_parquet(u)
> # Error: file must be a "RandomAccessFile"
> read_parquet(url(u))
> # Error: file must be a "RandomAccessFile"
> {code}
> The issue is that urls get turned into InputStream by {{make_readable_file}}, 
> and parquet requires RandomAccessFile. 
> {code}
> arrow:::make_readable_file(u)
> # InputStream
> {code}
> There are two relevant codepaths in make_readable_file: if given a string 
> URL, it tries {{FileSystem$from_uri()}} and falls back to 
> {{MakeRConnectionInputStream}}, which returns InputStream not 
> RandomAccessFile. If provided a connection object (i.e. {{url(u)}}), it tries 
> MakeRConnectionRandomAccessFile first and falls back to 
> MakeRConnectionInputStream. If you provide a {{url()}} it does fall back to 
> InputStream: 
> {code}
> arrow:::MakeRConnectionRandomAccessFile(url(u))
> # Error: Tell() returned an error
> {code}
> If we truly can't work with an HTTP URL in read_parquet, we should at least 
> document that. We could also do the workaround of downloading to a tempfile 
> first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18089) [R] Cannot read_parquet on http URL

2022-10-19 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-18089:
--
Labels: triaged  (was: )

> [R] Cannot read_parquet on http URL
> ---
>
> Key: ARROW-18089
> URL: https://issues.apache.org/jira/browse/ARROW-18089
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
>  Labels: triaged
> Fix For: 11.0.0
>
>
> {code}
> u <- 
> "https://raw.githubusercontent.com/apache/arrow/master/r/inst/v0.7.1.parquet";
> read_parquet(u)
> # Error: file must be a "RandomAccessFile"
> read_parquet(url(u))
> # Error: file must be a "RandomAccessFile"
> {code}
> The issue is that urls get turned into InputStream by {{make_readable_file}}, 
> and parquet requires RandomAccessFile. 
> {code}
> arrow:::make_readable_file(u)
> # InputStream
> {code}
> There are two relevant codepaths in make_readable_file: if given a string 
> URL, it tries {{FileSystem$from_uri()}} and falls back to 
> {{MakeRConnectionInputStream}}, which returns InputStream not 
> RandomAccessFile. If provided a connection object (i.e. {{url(u)}}), it tries 
> MakeRConnectionRandomAccessFile first and falls back to 
> MakeRConnectionInputStream. If you provide a {{url()}} it does fall back to 
> InputStream: 
> {code}
> arrow:::MakeRConnectionRandomAccessFile(url(u))
> # Error: Tell() returned an error
> {code}
> If we truly can't work with an HTTP URL in read_parquet, we should at least 
> document that. We could also do the workaround of downloading to a tempfile 
> first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-15904) [C++] Support rolling backwards and forwards with temporal arithmetic

2022-10-17 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-15904:
--
Priority: Major  (was: Blocker)

> [C++] Support rolling backwards and forwards with temporal arithmetic
> -
>
> Key: ARROW-15904
> URL: https://issues.apache.org/jira/browse/ARROW-15904
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Rok Mihevc
>Priority: Major
>
> Original description in ARROW-11090: 
> "This should also cover the ability to do with and without rollback (so have 
> the ability to do e.g. 2021-03-30 minus 1 month and either get a null back, 
> or 2021-02-28), plus the ability to specify whether to rollback to the first 
> or last, and whether to preserve or reset the time.)"
> For example, in R, lubridate has the following functionality:
> * {{rollbackward()}} or {{rollback()}} which changes a date to the last day 
> of the previous month or to the first day of the current month
> * {{rollforward()}} which rolls to the last day of the current month or to 
> the first day of the next month.
> * all of the above also offer the option to preserve hms (hours, minutes and 
> seconds) when rolling. 
> This functionality underpins functions such as {{%m-%}} and {{%m+%}} which 
> are used to add or subtract months to a date without exceeding the last day 
> of the new month.
> {code:r}
> library(lubridate)
> jan <- ymd_hms("2010-01-31 03:04:05")
> jan + months(1:3) # Feb 31 and April 31 returned as NA
> #> [1] NA"2010-03-31 03:04:05 UTC"
> #> [3] NA
> # NA "2010-03-31 03:04:05 UTC" NA
> jan %m+% months(1:3) # No rollover
> #> [1] "2010-02-28 03:04:05 UTC" "2010-03-31 03:04:05 UTC"
> #> [3] "2010-04-30 03:04:05 UTC"
> leap <- ymd("2012-02-29")
> "2012-02-29 UTC"
> #> [1] "2012-02-29 UTC"
> leap %m+% years(1)
> #> [1] "2013-02-28"
> leap %m+% years(-1)
> #> [1] "2011-02-28"
> leap %m-% years(1)
> #> [1] "2011-02-28"
> x <- ymd_hms("2019-01-29 01:02:03")
> add_with_rollback(x, months(1))
> #> [1] "2019-02-28 01:02:03 UTC"
> add_with_rollback(x, months(1), preserve_hms = FALSE)
> #> [1] "2019-02-28 UTC"
> add_with_rollback(x, months(1), roll_to_first = TRUE)
> #> [1] "2019-03-01 01:02:03 UTC"
> add_with_rollback(x, months(1), roll_to_first = TRUE, preserve_hms = FALSE)
> #> [1] "2019-03-01 UTC"
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17932) [C++] Implement streaming RecordBatchReader for JSON

2022-10-17 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-17932:
--
Priority: Major  (was: Blocker)

> [C++] Implement streaming RecordBatchReader for JSON
> 
>
> Key: ARROW-17932
> URL: https://issues.apache.org/jira/browse/ARROW-17932
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Ben Harkins
>Assignee: Ben Harkins
>Priority: Major
>  Labels: json, pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> We don't currently support incremental RecordBatch reading from JSON streams, 
> which is needed to properly implement JSON support in Dataset. The existing 
> CSV StreamingReader API can be used as a model.
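> As a reference point, a minimal sketch of the existing CSV streaming reader 
> mentioned above (the file name is a hypothetical placeholder):
> {code:python}
> import pyarrow.csv as pv
> 
> # open_csv returns a streaming reader; iterating yields RecordBatches
> # incrementally instead of loading the whole file at once.
> reader = pv.open_csv("data.csv")  # hypothetical file
> for batch in reader:
>     print(batch.num_rows)
> {code}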



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17292) [C++] Segmentation fault on arrow-compute-hash-join-node-test on macos nightlies

2022-10-13 Thread Alessandro Molina (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617120#comment-17617120
 ] 

Alessandro Molina commented on ARROW-17292:
---

I'm lowering this from Blocker to Critical as it has already been present in 
past releases. We should still keep it under close attention; the nightlies 
will keep reminding us that it's a problem we need to fix.

> [C++] Segmentation fault on arrow-compute-hash-join-node-test on macos 
> nightlies
> 
>
> Key: ARROW-17292
> URL: https://issues.apache.org/jira/browse/ARROW-17292
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Raúl Cumplido
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Critical
>  Labels: Nightly, pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 17.5h
>  Remaining Estimate: 0h
>
> Some of our nightly builds are failing due to a segmentation fault on 
> hash-join tests:
> {code:java}
>  33/90 Test #35: arrow-compute-hash-join-node-test .***Failed    1.21 
> sec
> Running arrow-compute-hash-join-node-test, redirecting output into 
> /var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/arrow-HEAD.X.W72iCJcj/cpp-build/build/test-logs/arrow-compute-hash-join-node-test.txt
>  (attempt 1/1)
> /Users/runner/work/crossbow/crossbow/arrow/cpp/build-support/run-test.sh: 
> line 88: 78018 Segmentation fault: 11  $TEST_EXECUTABLE "$@" > $LOGFILE.raw 
> 2>&1
> Running main() from 
> /var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/arrow-HEAD.X.W72iCJcj/cpp-build/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest_main.cc
> [==] Running 29 tests from 4 test suites.
> [--] Global test environment set-up.
> [--] 10 tests from HashJoin
> [ RUN      ] HashJoin.Suffix
> [       OK ] HashJoin.Suffix (4 ms)
> [ RUN      ] HashJoin.Random
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/arrow-HEAD.X.W72iCJcj/cpp-build/src/arrow/compute/exec
>  {code}
> The failures can be seen in the jobs below; they seem to be related only to 
> macOS:
> [verify-rc-source-cpp-macos-conda-amd64|https://github.com/ursacomputing/crossbow/runs/7631965199?check_suite_focus=true]
> [verify-rc-source-integration-macos-conda-amd64|https://github.com/ursacomputing/crossbow/runs/7631969879?check_suite_focus=true]
> [verify-rc-source-python-macos-amd64|https://github.com/ursacomputing/crossbow/runs/7631926429?check_suite_focus=true]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17292) [C++] Segmentation fault on arrow-compute-hash-join-node-test on macos nightlies

2022-10-13 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-17292:
--
Priority: Critical  (was: Blocker)

> [C++] Segmentation fault on arrow-compute-hash-join-node-test on macos 
> nightlies
> 
>
> Key: ARROW-17292
> URL: https://issues.apache.org/jira/browse/ARROW-17292
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Raúl Cumplido
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Critical
>  Labels: Nightly, pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 17.5h
>  Remaining Estimate: 0h
>
> Some of our nightly builds are failing due to a segmentation fault on 
> hash-join tests:
> {code:java}
>  33/90 Test #35: arrow-compute-hash-join-node-test .***Failed    1.21 
> sec
> Running arrow-compute-hash-join-node-test, redirecting output into 
> /var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/arrow-HEAD.X.W72iCJcj/cpp-build/build/test-logs/arrow-compute-hash-join-node-test.txt
>  (attempt 1/1)
> /Users/runner/work/crossbow/crossbow/arrow/cpp/build-support/run-test.sh: 
> line 88: 78018 Segmentation fault: 11  $TEST_EXECUTABLE "$@" > $LOGFILE.raw 
> 2>&1
> Running main() from 
> /var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/arrow-HEAD.X.W72iCJcj/cpp-build/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest_main.cc
> [==] Running 29 tests from 4 test suites.
> [--] Global test environment set-up.
> [--] 10 tests from HashJoin
> [ RUN      ] HashJoin.Suffix
> [       OK ] HashJoin.Suffix (4 ms)
> [ RUN      ] HashJoin.Random
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/arrow-HEAD.X.W72iCJcj/cpp-build/src/arrow/compute/exec
>  {code}
> The failures can be seen in the jobs below; they seem to be related only to 
> macOS:
> [verify-rc-source-cpp-macos-conda-amd64|https://github.com/ursacomputing/crossbow/runs/7631965199?check_suite_focus=true]
> [verify-rc-source-integration-macos-conda-amd64|https://github.com/ursacomputing/crossbow/runs/7631969879?check_suite_focus=true]
> [verify-rc-source-python-macos-amd64|https://github.com/ursacomputing/crossbow/runs/7631926429?check_suite_focus=true]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-18005) [C++] Bind the JSON RecordBatchReader to Dataset

2022-10-12 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina reassigned ARROW-18005:
-

Assignee: Ben Harkins

> [C++] Bind the JSON RecordBatchReader to Dataset
> 
>
> Key: ARROW-18005
> URL: https://issues.apache.org/jira/browse/ARROW-18005
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Alessandro Molina
>Assignee: Ben Harkins
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18005) [C++] Bind the JSON RecordBatchReader to Dataset

2022-10-12 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-18005:
-

 Summary: [C++] Bind the JSON RecordBatchReader to Dataset
 Key: ARROW-18005
 URL: https://issues.apache.org/jira/browse/ARROW-18005
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Alessandro Molina






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-16279) [Python] Support Expressions in Table.filter

2022-10-12 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina closed ARROW-16279.
-
Resolution: Fixed

> [Python] Support Expressions in Table.filter
> 
>
> Key: ARROW-16279
> URL: https://issues.apache.org/jira/browse/ARROW-16279
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Alessandro Molina
>Assignee: Alessandro Molina
>Priority: Major
> Fix For: 10.0.0
>
>
> *Umbrella ticket*
> At the moment {{Table.filter}} only accepts a mask, and building a mask that 
> actually leads to the rows we care about can be complex and slow in cases 
> where more than one compute function is used to generate the mask. It would 
> be helpful to be able to pass an {{Expression}} as the argument and get the 
> table filtered by that expression as expressions are easier to understand and 
> reason about than masks.
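> A minimal sketch of the intended usage, assuming this lands as described:
> {code:python}
> import pyarrow as pa
> import pyarrow.compute as pc
> 
> table = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})
> 
> # An Expression instead of a precomputed boolean mask.
> filtered = table.filter(pc.field("a") > 1)
> print(filtered.column("b"))  # ["y", "z"]
> {code}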



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17212) [Python] Support lazy Dataset.filter

2022-10-12 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-17212:
--
Fix Version/s: 11.0.0
   (was: 10.0.0)

> [Python] Support lazy Dataset.filter
> 
>
> Key: ARROW-17212
> URL: https://issues.apache.org/jira/browse/ARROW-17212
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Alessandro Molina
>Assignee: Alessandro Molina
>Priority: Major
> Fix For: 11.0.0
>
>
> Given that, when possible, we would like to keep Dataset and Table APIs 
> similar enough to perform the most common operations on a Dataset without 
> having to materialise it into a table, it would be good to add proper 
> support for a {{Dataset.filter}} method like the one we have on 
> {{Table.filter}}.
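> A minimal sketch of how the lazy behaviour could look, assuming the method 
> lands as described (path and column name are hypothetical):
> {code:python}
> import pyarrow.compute as pc
> import pyarrow.dataset as ds
> 
> dataset = ds.dataset("path/to/data")  # hypothetical path
> 
> # filter() only records the expression; no data is read yet.
> filtered = dataset.filter(pc.field("a") > 1)
> 
> # Materialisation happens here, with the filter pushed into the scan.
> table = filtered.to_table()
> {code}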



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16616) [Python] Allow lazy evaluation of filters in Dataset and add Dataset.filter method

2022-10-12 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-16616:
--
Fix Version/s: 11.0.0
   (was: 10.0.0)

> [Python] Allow lazy evaluation of filters in Dataset and add Dataset.filter 
> method
> -
>
> Key: ARROW-16616
> URL: https://issues.apache.org/jira/browse/ARROW-16616
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Python
>Reporter: Alessandro Molina
>Assignee: Alessandro Molina
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> To keep the {{Dataset}} api compatible with the {{Table}} one in terms of 
> analytics capabilities, we should add a {{Dataset.filter}} method. The 
> initial POC was based on {{_table_filter}} but that required materialising 
> all the {{Dataset}} content after filtering as it returned an 
> {{InMemoryDataset}}. 
> Given that {{Scanner}} can filter a dataset without actually materialising 
> the data until a final step happens, it would be good to have 
> {{Dataset.filter}} return some form of lazy dataset where the filter is only 
> stored aside and the Scanner is created when data is actually retrieved.
> PS: Also update {{test_dataset_filter}} test to use the {{Dataset.filter}} 
> method



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18003) [Python] Add sort_by to Table and RecordBatch

2022-10-12 Thread Alessandro Molina (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17616417#comment-17616417
 ] 

Alessandro Molina commented on ARROW-18003:
---

Part of the work here was being conducted in 
https://github.com/apache/arrow/pull/11659
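
A minimal sketch of what the proposed API could look like, assuming it 
mirrors the Array-level {{sort_by}} (hypothetical until the work lands):

{code:python}
import pyarrow as pa

table = pa.table({"a": [3, 1, 2], "b": ["z", "x", "y"]})

# Proposed: sort a Table by one or more (column, order) pairs.
sorted_table = table.sort_by([("a", "ascending")])
print(sorted_table.column("a"))  # [1, 2, 3]
{code}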

> [Python] Add sort_by to Table and RecordBatch
> -
>
> Key: ARROW-18003
> URL: https://issues.apache.org/jira/browse/ARROW-18003
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Python
>Reporter: Alessandro Molina
>Priority: Major
> Fix For: 11.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-14656) [Python] Add sort_by method to Array, StructArray and ChunkedArray

2022-10-12 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-14656:
--
Fix Version/s: 11.0.0

> [Python] Add sort_by method to Array, StructArray and ChunkedArray
> --
>
> Key: ARROW-14656
> URL: https://issues.apache.org/jira/browse/ARROW-14656
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Python
>Reporter: Alessandro Molina
>Assignee: Alessandro Molina
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> It would be convenient to be able to sort a {{StructArray}} on one of its 
> columns. This can be done by combining multiple compute functions, but having 
> a helper that does that for you would probably be better.
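> A minimal sketch of the "combining multiple compute functions" route that 
> this helper would wrap:
> {code:python}
> import pyarrow as pa
> import pyarrow.compute as pc
> 
> arr = pa.StructArray.from_arrays(
>     [pa.array([3, 1, 2]), pa.array(["c", "a", "b"])], names=["k", "v"])
> 
> # sort_indices on the key field, then take() to reorder the whole struct.
> idx = pc.sort_indices(arr.field("k"))
> print(arr.take(idx))
> {code}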



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18003) [Python] Add sort_by to Table and RecordBatch

2022-10-12 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-18003:
-

 Summary: [Python] Add sort_by to Table and RecordBatch
 Key: ARROW-18003
 URL: https://issues.apache.org/jira/browse/ARROW-18003
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Python
Reporter: Alessandro Molina
 Fix For: 11.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-14656) [Python] Add sort_by method to Array, StructArray and ChunkedArray

2022-10-12 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-14656:
--
Parent: ARROW-18002
Issue Type: Sub-task  (was: New Feature)

> [Python] Add sort_by method to Array, StructArray and ChunkedArray
> --
>
> Key: ARROW-14656
> URL: https://issues.apache.org/jira/browse/ARROW-14656
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Python
>Reporter: Alessandro Molina
>Assignee: Alessandro Molina
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> It would be convenient to be able to sort a {{StructArray}} on one of its 
> columns. This can be done by combining multiple compute functions, but having 
> a helper that does that for you would probably be better.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18002) [Python] Improve Sorting Capabilities in PyArrow

2022-10-12 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-18002:
-

 Summary: [Python] Improve Sorting Capabilities in PyArrow
 Key: ARROW-18002
 URL: https://issues.apache.org/jira/browse/ARROW-18002
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Alessandro Molina






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-14656) [Python] Add sort_by method to Array, StructArray and ChunkedArray

2022-10-12 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-14656:
--
Summary: [Python] Add sort_by method to Array, StructArray and ChunkedArray 
 (was: [Python] Add sort_by helper method to StructArray)

> [Python] Add sort_by method to Array, StructArray and ChunkedArray
> --
>
> Key: ARROW-14656
> URL: https://issues.apache.org/jira/browse/ARROW-14656
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Alessandro Molina
>Assignee: Alessandro Molina
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> It would be convenient to be able to sort a {{StructArray}} on one of its 
> columns. This can be done by combining multiple compute functions, but having 
> a helper that does that for you would probably be better.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16728) [Python] Switch default and deprecate use_legacy_dataset=True in ParquetDataset

2022-09-27 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-16728:
--
Priority: Blocker  (was: Major)

> [Python] Switch default and deprecate use_legacy_dataset=True in 
> ParquetDataset
> ---
>
> Key: ARROW-16728
> URL: https://issues.apache.org/jira/browse/ARROW-16728
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The ParquetDataset() constructor itself still defaults to 
> {{use_legacy_dataset=True}} (although using specific attributes or keywords 
> related to that will raise a warning). So a next step will be to actually 
> deprecate passing that and switch the default, and only afterwards can we 
> remove the code.
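> For reference, a minimal sketch of opting into the new implementation that 
> the switched default would make standard (the path is a placeholder):
> {code:python}
> import pyarrow.parquet as pq
> 
> # Explicitly request the new Dataset-backed implementation.
> dataset = pq.ParquetDataset("path/to/dataset", use_legacy_dataset=False)
> table = dataset.read()
> {code}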



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17660) [Doc] pyarrow.Array.diff Examples is wrongly rendered

2022-09-09 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-17660:
--
Component/s: Documentation
 Python

> [Doc] pyarrow.Array.diff Examples is wrongly rendered
> -
>
> Key: ARROW-17660
> URL: https://issues.apache.org/jira/browse/ARROW-17660
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, Python
>Reporter: Alessandro Molina
>Assignee: Alessandro Molina
>Priority: Major
>
> See 
> [https://arrow.apache.org/docs/python/generated/pyarrow.Array.html#pyarrow.Array.diff]
> The output of the diff function is rendered outside of the code block



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17660) [Doc] pyarrow.Array.diff Examples is wrongly rendered

2022-09-09 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-17660:
--
Affects Version/s: 9.0.0

> [Doc] pyarrow.Array.diff Examples is wrongly rendered
> -
>
> Key: ARROW-17660
> URL: https://issues.apache.org/jira/browse/ARROW-17660
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, Python
>Affects Versions: 9.0.0
>Reporter: Alessandro Molina
>Assignee: Alessandro Molina
>Priority: Major
>
> See 
> [https://arrow.apache.org/docs/python/generated/pyarrow.Array.html#pyarrow.Array.diff]
> The output of the diff function is rendered outside of the code block



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17660) [Doc] pyarrow.Array.diff Examples is wrongly rendered

2022-09-09 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-17660:
-

 Summary: [Doc] pyarrow.Array.diff Examples is wrongly rendered
 Key: ARROW-17660
 URL: https://issues.apache.org/jira/browse/ARROW-17660
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Alessandro Molina
Assignee: Alessandro Molina


See 
[https://arrow.apache.org/docs/python/generated/pyarrow.Array.html#pyarrow.Array.diff]

The output of the diff function is rendered outside of the code block



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-15693) [Dev] Update crossbow templates to use master or main

2022-09-06 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina resolved ARROW-15693.
---
Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 13975
[https://github.com/apache/arrow/pull/13975]

> [Dev] Update crossbow templates to use master or main
> -
>
> Key: ARROW-15693
> URL: https://issues.apache.org/jira/browse/ARROW-15693
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Developer Tools
>Reporter: Neal Richardson
>Assignee: Kevin Gurney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16616) [Python] Allow lazy evaluation of filters in Dataset and add Dataset.filter method

2022-07-26 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-16616:
--
Parent Issue: ARROW-17212  (was: ARROW-16279)

> [Python] Allow lazy evaluation of filters in Dataset and add Dataset.filter 
> method
> -
>
> Key: ARROW-16616
> URL: https://issues.apache.org/jira/browse/ARROW-16616
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Python
>Reporter: Alessandro Molina
>Assignee: Alessandro Molina
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> To keep the {{Dataset}} API compatible with the {{Table}} one in terms of 
> analytics capabilities, we should add a {{Dataset.filter}} method. The 
> initial POC was based on {{_table_filter}}, but that required materialising 
> all the {{Dataset}} content after filtering, as it returned an 
> {{InMemoryDataset}}. 
> Given that {{Scanner}} can filter a dataset without actually materialising 
> the data until a final step happens, it would be good to have 
> {{Dataset.filter}} return some form of lazy dataset, where the filter is 
> only stored aside and the Scanner is created when the data is actually 
> retrieved (see the sketch below).
> PS: Also update the {{test_dataset_filter}} test to use the 
> {{Dataset.filter}} method
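
A minimal sketch of the intended lazy behaviour; the path {{data/}} and the column name are illustrative:

{code:python}
# A minimal sketch of the intended lazy behaviour; the path "data/"
# and the column "a" are illustrative.
import pyarrow.dataset as ds
import pyarrow.compute as pc

dataset = ds.dataset("data/", format="parquet")
filtered = dataset.filter(pc.field("a") > 2)  # lazy: the filter is only stored aside
table = filtered.to_table()                   # a Scanner applies the filter here
{code}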



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16518) [Python] Ensure _exec_plan.execplan preserves order of inputs

2022-07-26 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-16518:
--
Parent Issue: ARROW-17212  (was: ARROW-16279)

> [Python] Ensure _exec_plan.execplan preserves order of inputs
> -
>
> Key: ARROW-16518
> URL: https://issues.apache.org/jira/browse/ARROW-16518
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Python
>Reporter: Alessandro Molina
>Assignee: Alessandro Molina
>Priority: Major
> Fix For: 10.0.0
>
>
> At the moment execplan doesn't guarantee any ordered output; the batches are 
> consumed in a random order. This can lead to unordered rows in outputs when 
> {{use_threads=True}}.
> For example, providing a column with {{b=[a, a, a, a, b, b, b, b]}} will 
> sometimes give back {{b=[a, b]}} and sometimes {{b=[b, a]}}
> See
> {code:java}
> In [18]: table1 = pa.table({'a': [1, 2, 3, 4], 'b': ['a'] * 4})
> In [19]: table2 = pa.table({'a': [1, 2, 3, 4], 'b': ['b'] * 4})
> In [20]: table = pa.concat_tables([table1, table2])
> In [21]: ep._filter_table(table, pc.field('a') == 1)
> Out[21]: 
> pyarrow.Table
> a: int64
> b: string
> 
> a: [[1],[1]]
> b: [["b"],["a"]]
> In [22]: ep._filter_table(table, pc.field('a') == 1)
> Out[22]: 
> pyarrow.Table
> a: int64
> b: string
> 
> a: [[1],[1]]
> b: [["a"],["b"]] {code}
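
Until ordered output is guaranteed, one hedged workaround is to tag rows with their input position and sort on that tag afterwards. A minimal sketch (the helper column name {{__idx}} is illustrative, and the public {{Table.filter}} call stands in for the ExecPlan-backed path):

{code:python}
# A hedged workaround sketch: tag each row with its input position
# before filtering, then sort on that tag to restore input order.
# The helper column name "__idx" is illustrative.
import pyarrow as pa
import pyarrow.compute as pc

table1 = pa.table({"a": [1, 2, 3, 4], "b": ["a"] * 4})
table2 = pa.table({"a": [1, 2, 3, 4], "b": ["b"] * 4})
table = pa.concat_tables([table1, table2])

indexed = table.append_column("__idx", pa.array(range(table.num_rows)))
filtered = indexed.filter(pc.field("a") == 1)         # order may vary on threaded paths
restored = filtered.sort_by("__idx").drop(["__idx"])  # deterministic again
print(restored.column("b"))
{code}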



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17212) [Python] Support lazy Dataset.filter

2022-07-26 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-17212:
-

 Summary: [Python] Support lazy Dataset.filter
 Key: ARROW-17212
 URL: https://issues.apache.org/jira/browse/ARROW-17212
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Alessandro Molina
Assignee: Alessandro Molina
 Fix For: 10.0.0


Given that, where possible, we would like to keep Dataset and Table with APIs 
similar enough to perform the most convenient operations on a Dataset without 
having to materialise it to a table, it would be good to add proper support 
for a {{Dataset.filter}} method like the one we have on {{Table.filter}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17066) [C++][Substrait] "ignore_unknown_fields" should be specified when converting JSON to binary

2022-07-21 Thread Alessandro Molina (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569462#comment-17569462
 ] 

Alessandro Molina commented on ARROW-17066:
---

Is this one an actual blocker? We are not really exposing Substrait usage to 
the public, so I guess it can land at any time during 10.0.0 too

> [C++][Substrait] "ignore_unknown_fields" should be specified when converting 
> JSON to binary
> ---
>
> Key: ARROW-17066
> URL: https://issues.apache.org/jira/browse/ARROW-17066
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Richard Tia
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Blocker
>  Labels: pull-request-available, substrait
> Fix For: 9.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> [https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.util.json_util#JsonParseOptions]
>  
> When converting a Substrait JSON plan to binary, unknown fields may be 
> present, because Substrait gains new fields on its weekly release cadence. 
> ignore_unknown_fields should be specified when doing this conversion (see 
> the sketch after the error example below).
>  
> This is resulting in frequent errors similar to this:
> {code:java}
> E   pyarrow.lib.ArrowInvalid: JsonToBinaryStream returned 
> INVALID_ARGUMENT:(relations[0].root.input.sort.input.aggregate.measures[0].measure)
>  arguments: Cannot find field.
> pyarrow/error.pxi:100: ArrowInvalid {code}
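
A hedged sketch of the fix at the protobuf level, shown with the Python bindings for brevity; the C++ {{JsonParseOptions}} struct exposes the equivalent {{ignore_unknown_fields}} flag, and the {{Plan}} import path and {{plan.json}} file below are assumptions:

{code:python}
# A hedged sketch using the Python protobuf bindings; the C++
# JsonParseOptions struct exposes the same ignore_unknown_fields flag.
# The Plan import path and "plan.json" file are assumptions.
from google.protobuf import json_format
from substrait.plan_pb2 import Plan  # assumed generated Substrait bindings

with open("plan.json") as f:
    # Unknown fields from newer Substrait releases are skipped
    # instead of raising "Cannot find field".
    plan = json_format.Parse(f.read(), Plan(), ignore_unknown_fields=True)

binary = plan.SerializeToString()
{code}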



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16661) [Docs] Move verify Release candidate documentation to development guide

2022-07-07 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-16661:
--
Parent: (was: ARROW-16655)
Issue Type: Improvement  (was: Sub-task)

> [Docs] Move verify Release candidate documentation to development guide
> ---
>
> Key: ARROW-16661
> URL: https://issues.apache.org/jira/browse/ARROW-16661
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Raúl Cumplido
>Priority: Major
> Fix For: 9.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16662) [Doc] Split release guide into maintenance and major release

2022-07-07 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-16662:
--
Parent: (was: ARROW-16655)
Issue Type: Improvement  (was: Sub-task)

> [Doc] Split release guide into maintenance and major release
> 
>
> Key: ARROW-16662
> URL: https://issues.apache.org/jira/browse/ARROW-16662
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Raúl Cumplido
>Priority: Major
>
> The current release guide uses a single document for both major and 
> maintenance releases, and it is difficult to follow. Split the guide into 
> two separate parts, noting the distinctions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16432) [Docs] Update verify RC instructions - JDK8

2022-07-07 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-16432:
--
Parent: (was: ARROW-16655)
Issue Type: Improvement  (was: Sub-task)

> [Docs] Update verify RC instructions - JDK8
> ---
>
> Key: ARROW-16432
> URL: https://issues.apache.org/jira/browse/ARROW-16432
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Jacob Wujciak-Jens
>Priority: Major
> Fix For: 9.0.0
>
>
> The PPA for Oracle JDK is discontinued due to the license change 
> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-16333) [Release] Improve Nightly Reports

2022-07-07 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina reassigned ARROW-16333:
-

Assignee: Alessandro Molina  (was: Raúl Cumplido)

> [Release] Improve Nightly Reports
> -
>
> Key: ARROW-16333
> URL: https://issues.apache.org/jira/browse/ARROW-16333
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Archery, Continuous Integration, Developer Tools
>Reporter: Raúl Cumplido
>Assignee: Alessandro Molina
>Priority: Major
> Fix For: 9.0.0
>
>
> This initiative tries to improve on some of the issues we currently have with 
> our nightly reports, to get a clearer understanding of the status of our 
> nightly builds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-16333) [Release] Improve Nightly Reports

2022-07-07 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina resolved ARROW-16333.
---
Resolution: Fixed

> [Release] Improve Nightly Reports
> -
>
> Key: ARROW-16333
> URL: https://issues.apache.org/jira/browse/ARROW-16333
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Archery, Continuous Integration, Developer Tools
>Reporter: Raúl Cumplido
>Assignee: Raúl Cumplido
>Priority: Major
> Fix For: 9.0.0
>
>
> This initiative tries to improve on some of the issues we currently have with 
> our nightly reports, to get a clearer understanding of the status of our 
> nightly builds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16362) [Dev][CI] Add to nightlies dashboard list of Jira tickets open/in progress related with nightly failures

2022-07-07 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-16362:
--
Parent: (was: ARROW-16333)
Issue Type: Improvement  (was: Sub-task)

> [Dev][CI] Add to nightlies dashboard list of Jira tickets open/in progress 
> related with nightly failures
> 
>
> Key: ARROW-16362
> URL: https://issues.apache.org/jira/browse/ARROW-16362
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Raúl Cumplido
>Assignee: Raúl Cumplido
>Priority: Major
> Fix For: 9.0.0
>
>
> Be able to see the list of open Jira tickets on the nightlies dashboard page 
> by querying the Jira API, filtering by label. There is an example in Python 
> that gives the gist of the idea (see also the sketch below): 
> [https://gist.github.com/raulcd/a033f5761f290ee4ab6fb349640e0d5b]
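
A minimal sketch of that idea, assuming the public Jira REST search endpoint; the label name used here is illustrative, and the linked gist remains the reference:

{code:python}
# A minimal sketch, assuming the public Jira REST search endpoint;
# the label name is illustrative.
import requests

JIRA_SEARCH = "https://issues.apache.org/jira/rest/api/2/search"

def open_tickets(label):
    jql = (
        f'project = ARROW AND labels = "{label}" '
        'AND status in (Open, "In Progress")'
    )
    resp = requests.get(JIRA_SEARCH, params={"jql": jql, "fields": "summary,status"})
    resp.raise_for_status()
    return [(i["key"], i["fields"]["summary"]) for i in resp.json()["issues"]]

for key, summary in open_tickets("Nightly"):  # illustrative label
    print(key, summary)
{code}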



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-15179) [Java] Ensure Support for modern Java versions

2022-07-06 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina closed ARROW-15179.
-

> [Java] Ensure Support for modern Java versions
> --
>
> Key: ARROW-15179
> URL: https://issues.apache.org/jira/browse/ARROW-15179
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Alessandro Molina
>Assignee: David Dali Susanibar Arce
>Priority: Major
> Fix For: 9.0.0
>
>
> *Umbrella ticket for supporting recent java versions*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-12724) [C++] Add documentation for authoring compute kernels

2022-07-05 Thread Alessandro Molina (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562784#comment-17562784
 ] 

Alessandro Molina commented on ARROW-12724:
---

Given [https://github.com/apache/arrow/pull/12460#issuecomment-1057520554]
What are we planning to do with this one? Has been around for a while.

> [C++] Add documentation for authoring compute kernels
> -
>
> Key: ARROW-12724
> URL: https://issues.apache.org/jira/browse/ARROW-12724
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Eduardo Ponce
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10.5h
>  Remaining Estimate: 0h
>
> To help incoming developers work in the compute layer, it would be good 
> to have documentation on the process to follow for authoring a new compute 
> kernel. This document can help demystify the inner workings of the functions 
> and data structures in the compute layer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-12724) [C++] Add documentation for authoring compute kernels

2022-07-05 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-12724:
--
Fix Version/s: 10.0.0
   (was: 9.0.0)

> [C++] Add documentation for authoring compute kernels
> -
>
> Key: ARROW-12724
> URL: https://issues.apache.org/jira/browse/ARROW-12724
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Eduardo Ponce
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 10.5h
>  Remaining Estimate: 0h
>
> To help incoming developers work in the compute layer, it would be good 
> to have documentation on the process to follow for authoring a new compute 
> kernel. This document can help demystify the inner workings of the functions 
> and data structures in the compute layer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-12724) [C++] Add documentation for authoring compute kernels

2022-07-05 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-12724:
--
Priority: Critical  (was: Major)

> [C++] Add documentation for authoring compute kernels
> -
>
> Key: ARROW-12724
> URL: https://issues.apache.org/jira/browse/ARROW-12724
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Eduardo Ponce
>Assignee: Eduardo Ponce
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 10.5h
>  Remaining Estimate: 0h
>
> To help incoming developers work in the compute layer, it would be good 
> to have documentation on the process to follow for authoring a new compute 
> kernel. This document can help demystify the inner workings of the functions 
> and data structures in the compute layer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-13316) [C++][Doc] Fix warnings generated by sphinx when incorporating doxygen docs

2022-07-05 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-13316:
--
Fix Version/s: (was: 9.0.0)

> [C++][Doc] Fix warnings generated by sphinx when incorporating doxygen docs
> ---
>
> Key: ARROW-13316
> URL: https://issues.apache.org/jira/browse/ARROW-13316
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Weston Pace
>Assignee: Eduardo Ponce
>Priority: Major
>
> Sphinx interprets the doxygen output to build the final documentation.  This 
> process generates some warnings.
>  
> This warning is generated when running doxygen:
> {code:java}
> warning: Tag 'COLS_IN_ALPHA_INDEX' at line 1118 of file 'Doxyfile' has become 
> obsolete.
>  To avoid this warning please remove this line from your 
> configuration file or upgrade it using "doxygen -u"
> {code}
> There are many warnings contributed to compute.rst that look like this (it is 
> unclear where this static constexpr static is coming from, as it is not 
> present in the repo or the doxygen output as far as I can find):
> {code:java}
> /home/pace/dev/arrow/docs/source/cpp/api/compute.rst:51: WARNING: Invalid C++ 
> declaration: Expected identifier in nested name, got keyword: static [error 
> at 23]
>   static constexpr static char const kTypeName []  = "ScalarAggregateOptions"
> {code}
> There is a duplicate definition warning (I think this one is because the doc 
> comment is present on both the definition and the override)
> {code:java}
> /home/pace/dev/arrow/docs/source/cpp/api/dataset.rst:69: WARNING: Duplicate 
> declaration, Result< std::shared_ptr< FileFragment > > MakeFragment 
> (FileSource source, compute::Expression partition_expression, 
> std::shared_ptr< Schema > physical_schema)
> {code}
> Finally, there is a specific issue with the GetRecordBatchGenerator function
> {code:java}
> /home/pace/dev/arrow/docs/source/cpp/api/formats.rst:80: WARNING: Error when 
> parsing function declaration.
> If the function has no return type:
>   Error in declarator or parameters-and-qualifiers
>   Main error:
> Invalid C++ declaration: Expecting "(" in parameters-and-qualifiers. 
> [error at 23]
>   virtual ::arrow::Result< std::function<::arrow::Future< 
> std::shared_ptr<::arrow::RecordBatch > >)> > GetRecordBatchGenerator 
> (std::shared_ptr< FileReader > reader, const std::vector< int > 
> row_group_indices, const std::vector< int > column_indices, 
> ::arrow::internal::Executor *cpu_executor=NULLPTR)=0
>   ---^
>   Potential other error:
> Error in parsing template argument list.
> If type argument:
>   Main error:
> Invalid C++ declaration: Expected "...>", ">" or "," in template 
> argument list. [error at 38]
>   virtual ::arrow::Result< std::function<::arrow::Future< 
> std::shared_ptr<::arrow::RecordBatch > >)> > GetRecordBatchGenerator 
> (std::shared_ptr< FileReader > reader, const std::vector< int > 
> row_group_indices, const std::vector< int > column_indices, 
> ::arrow::internal::Executor *cpu_executor=NULLPTR)=0
>   --^
>   Potential other error:
> Main error:
>   Invalid C++ declaration: Expected identifier in nested name. [error 
> at 38]
> virtual ::arrow::Result< std::function<::arrow::Future< 
> std::shared_ptr<::arrow::RecordBatch > >)> > GetRecordBatchGenerator 
> (std::shared_ptr< FileReader > reader, const std::vector< int > 
> row_group_indices, const std::vector< int > column_indices, 
> ::arrow::internal::Executor *cpu_executor=NULLPTR)=0
> --^
> Potential other error:
>   Error in parsing template argument list.
>   If type argument:
> Invalid C++ declaration: Expected "...>", ">" or "," in template 
> argument list. [error at 96]
>   virtual ::arrow::Result< std::function<::arrow::Future< 
> std::shared_ptr<::arrow::RecordBatch > >)> > GetRecordBatchGenerator 
> (std::shared_ptr< FileReader > reader, const std::vector< int > 
> row_group_indices, const std::vector< int > column_indices, 
> ::arrow::internal::Executor *cpu_executor=NULLPTR)=0
>   
> ^
>   If non-type argument:
> Invalid C++ declaration: Expected "...>", ">" or "," in template 
> argument list. [error at 96]
>   virtual ::arrow::Result< std::function<::arrow::Future< 
> std::shared_ptr<::arrow::RecordBatch > >)> > GetRecordBatchGenerator 
> (std::shared_ptr< FileReader > reader, const std::vector< int > 
> row_group_indices, const std::vector< int 

[jira] [Updated] (ARROW-8047) [Python][Documentation] Document migration from ParquetDataset to pyarrow.datasets

2022-07-05 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-8047:
-
Priority: Critical  (was: Major)

> [Python][Documentation] Document migration from ParquetDataset to 
> pyarrow.datasets
> --
>
> Key: ARROW-8047
> URL: https://issues.apache.org/jira/browse/ARROW-8047
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Python
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Joris Van den Bossche
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We need documentation describing a migration path from ParquetDataset, at 
> least for the basic user-facing API of ParquetDataset (as I read it, that's 
> construction, projection, filtering, and threading, for a first pass). 
> Following this, we could mark ParquetDataset as deprecated, building the 
> features needed by power users like dask and adding those to the migration 
> document. A migration sketch follows below.
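
A hedged sketch of the before/after pair such a document could show; the path {{data/}} and the column names are illustrative:

{code:python}
# A hedged sketch of the before/after pair a migration document could
# show; the path "data/" and column names are illustrative.
import pyarrow.dataset as ds
import pyarrow.compute as pc

# Legacy API (pyarrow.parquet):
#   import pyarrow.parquet as pq
#   table = pq.ParquetDataset("data/").read(columns=["a", "b"])

# New API (pyarrow.dataset): construction, projection, and filtering.
dataset = ds.dataset("data/", format="parquet")
table = dataset.to_table(columns=["a", "b"], filter=pc.field("a") > 0)
{code}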



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-8047) [Python][Documentation] Document migration from ParquetDataset to pyarrow.datasets

2022-07-05 Thread Alessandro Molina (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562782#comment-17562782
 ] 

Alessandro Molina commented on ARROW-8047:
--

Postponing this to version 10

> [Python][Documentation] Document migration from ParquetDataset to 
> pyarrow.datasets
> --
>
> Key: ARROW-8047
> URL: https://issues.apache.org/jira/browse/ARROW-8047
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Python
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We need documentation describing a migration path from ParquetDataset, at 
> least for the basic user-facing API of ParquetDataset (as I read it, that's 
> construction, projection, filtering, and threading, for a first pass). 
> Following this, we could mark ParquetDataset as deprecated, building the 
> features needed by power users like dask and adding those to the migration 
> document



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-12634) [Docs] Describe use of Jira Affects Version in Contributing docs

2022-07-05 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-12634:
--
Fix Version/s: (was: 9.0.0)

> [Docs] Describe use of Jira Affects Version in Contributing docs
> 
>
> Key: ARROW-12634
> URL: https://issues.apache.org/jira/browse/ARROW-12634
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Major
>
> Update the [*Contributing to Apache 
> Arrow*|https://github.com/apache/arrow/blob/master/docs/source/developers/contributing.rst]
>  page to describe the preferred way to use *Affects Version* in Jira, as 
> described in this email to the dev list: 
> [https://mail-archives.apache.org/mod_mbox/arrow-dev/202104.mbox/%3cCABCGCVfqQgqrh1bGpzhK5no+_KYm1ctvYvs2ajJ_by-8=vr...@mail.gmail.com%3e]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-8047) [Python][Documentation] Document migration from ParquetDataset to pyarrow.datasets

2022-07-05 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-8047:
-
Fix Version/s: 10.0.0
   (was: 9.0.0)

> [Python][Documentation] Document migration from ParquetDataset to 
> pyarrow.datasets
> --
>
> Key: ARROW-8047
> URL: https://issues.apache.org/jira/browse/ARROW-8047
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Python
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We need documentation describing a migration path from ParquetDataset, at 
> least for the basic user-facing API of ParquetDataset (as I read it, that's 
> construction, projection, filtering, and threading, for a first pass). 
> Following this, we could mark ParquetDataset as deprecated, building the 
> features needed by power users like dask and adding those to the migration 
> document



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16080) [R][Documentation] Document filename-based partitioning and filename-as-variable functionality

2022-07-05 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-16080:
--
Fix Version/s: (was: 9.0.0)

> [R][Documentation] Document filename-based partitioning and 
> filename-as-variable functionality
> --
>
> Key: ARROW-16080
> URL: https://issues.apache.org/jira/browse/ARROW-16080
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Major
>
> Filename-based partitioning has been implemented in C++; we should add 
> something to our docs about this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-14280) [Doc] R package Architectural Overview

2022-07-05 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-14280:
--
Description: 
Add a description of how the R-Arrow source code is structured in the New 
Contributors Guide, so that new contributors know where to look if they 
want to fix/extend a feature.

See 
https://arrow.apache.org/docs/dev/developers/guide/architectural_overview.html#architectural-overview

  was:Add a description of how the R-Arrow source code is structured in the New 
Contributors Guide, so that new contributors know where to look if they 
want to fix/extend a feature.


> [Doc] R package Architectural Overview
> --
>
> Key: ARROW-14280
> URL: https://issues.apache.org/jira/browse/ARROW-14280
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Alessandro Molina
>Assignee: Jonathan Keane
>Priority: Critical
> Fix For: 10.0.0
>
>
> Add a description of how the R-Arrow source code is structured in the New 
> Contributors Guide, so that new contributors know where to look if they 
> want to fix/extend a feature.
> See 
> https://arrow.apache.org/docs/dev/developers/guide/architectural_overview.html#architectural-overview



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-14280) [Doc] R package Architectural Overview

2022-07-05 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-14280:
--
Priority: Critical  (was: Major)

> [Doc] R package Architectural Overview
> --
>
> Key: ARROW-14280
> URL: https://issues.apache.org/jira/browse/ARROW-14280
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Alessandro Molina
>Assignee: Jonathan Keane
>Priority: Critical
> Fix For: 10.0.0
>
>
> Add a description of how the R-Arrow source code is structured in the New 
> Contributors Guide, so that new contributors know where to look if they 
> want to fix/extend a feature.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-14280) [Doc] R package Architectural Overview

2022-07-05 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-14280:
--
Fix Version/s: 10.0.0
   (was: 9.0.0)

> [Doc] R package Architectural Overview
> --
>
> Key: ARROW-14280
> URL: https://issues.apache.org/jira/browse/ARROW-14280
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Alessandro Molina
>Assignee: Jonathan Keane
>Priority: Major
> Fix For: 10.0.0
>
>
> Add a description of how the R-Arrow source code is structured in the New 
> Contributors Guide, so that new contributors know where to look if they 
> want to fix/extend a feature.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-15117) [Docs] Splitting the sphinx-based Arrow docs into separate sphinx projects

2022-07-05 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-15117:
--
Fix Version/s: 10.0.0
   (was: 9.0.0)

> [Docs] Splitting the sphinx-based Arrow docs into separate sphinx projects
> --
>
> Key: ARROW-15117
> URL: https://issues.apache.org/jira/browse/ARROW-15117
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> See the mailing list 
> (https://mail-archives.apache.org/mod_mbox/arrow-dev/202112.mbox/%3CCALQtMBbiasQtXYc46kpw-TyQ-TQSPjNQ5%2BkoREuKvJ3hJSdWjw%40mail.gmail.com%3E)
>  and this google doc 
> (https://docs.google.com/document/d/1AXDNwU5CSnZ1cSeUISwy_xgvTzoYWeuqWApC8UEv97Q/edit?usp=sharing)
>  for more context.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-15754) [Java] ORC JNI bridge should use the C data interface

2022-07-05 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-15754:
--
Parent: ARROW-16979
Issue Type: Sub-task  (was: Improvement)

> [Java] ORC JNI bridge should use the C data interface
> -
>
> Key: ARROW-15754
> URL: https://issues.apache.org/jira/browse/ARROW-15754
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Antoine Pitrou
>Assignee: Larry White
>Priority: Major
>
> Right now the ORC JNI bridge uses some custom buffer passing which only seems 
> to handle primitive arrays correctly (child array buffers and dictionaries 
> are not considered):
> https://github.com/apache/arrow/blob/master/cpp/src/jni/orc/jni_wrapper.cpp#L263-L265
> Instead, it should use the C data interface, which is now implemented in Java.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16979) [Java] Further Consolidate JNI compilation

2022-07-05 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-16979:
--
Description: See https://issues.apache.org/jira/browse/ARROW-15174 for the 
original effort that shipped in version 9.0.0

> [Java] Further Consolidate JNI compilation
> --
>
> Key: ARROW-16979
> URL: https://issues.apache.org/jira/browse/ARROW-16979
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Alessandro Molina
>Priority: Major
> Fix For: 10.0.0
>
>
> See https://issues.apache.org/jira/browse/ARROW-15174 for the original effort 
> that shipped in version 9.0.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-16979) [Java] Further Consolidate JNI compilation

2022-07-05 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-16979:
-

 Summary: [Java] Further Consolidate JNI compilation
 Key: ARROW-16979
 URL: https://issues.apache.org/jira/browse/ARROW-16979
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Alessandro Molina
 Fix For: 10.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-15174) [Java] Consolidate JNI compilation

2022-07-05 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina resolved ARROW-15174.
---
Resolution: Fixed

> [Java] Consolidate JNI compilation
> --
>
> Key: ARROW-15174
> URL: https://issues.apache.org/jira/browse/ARROW-15174
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Alessandro Molina
>Assignee: Larry White
>Priority: Major
> Fix For: 9.0.0
>
>
> *Umbrella ticket for the Java JNI compilation consolidation initiative*
> It seems we have spread the JNI code across the {{cpp}} and {{java}} 
> directories. As for other bindings (Python), we already discussed that it 
> would be great to consolidate and move all C++ code related to Python into 
> PyArrow; we should do something equivalent for Java too and move all C++ 
> code specific to Java into the Java project.
> At the moment there are two JNI-related directories:
>  * [https://github.com/apache/arrow/tree/master/java/c]
>  * [https://github.com/apache/arrow/tree/master/cpp/src/jni]
> Let's also research the best method to build those. The {{java/c}} 
> directory seems to be already integrated with the Java build process; let's 
> check if that approach is something we can reuse for the {{dataset}} 
> directory too



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-16913) [Java] Implement ArrowArrayStream/C Stream Interface

2022-07-05 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina resolved ARROW-16913.
---
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13465
[https://github.com/apache/arrow/pull/13465]

> [Java] Implement ArrowArrayStream/C Stream Interface
> 
>
> Key: ARROW-16913
> URL: https://issues.apache.org/jira/browse/ARROW-16913
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> ARROW-12965 implemented the core C Data Interface, but we still need to 
> implement the streaming interface.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


  1   2   3   4   5   6   7   8   9   10   >