[jira] [Created] (ARROW-13971) [C++][Compute] Improve top_k/bottom_k Selectors via CRTP

2021-09-09 Thread Alexander (Jira)
Alexander created ARROW-13971:
-

 Summary: [C++][Compute] Improve top_k/bottom_k Selectors via CRTP
 Key: ARROW-13971
 URL: https://issues.apache.org/jira/browse/ARROW-13971
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Alexander


As mentioned here: 

[https://github.com/apache/arrow/pull/11019#discussion_r701349253]

 

The selectors for SelectKUnstable all share a relatively similar core 
structure. It might be worth considering how some templating (e.g. via 
[CRTP|https://en.wikipedia.org/wiki/Curiously_recurring_template_pattern], or 
via a set of helper comparator/iteration templates) could factor out the core 
algorithm from the container-specific bits.

It would then also be easier to share generated code between types with the 
same physical representation (e.g., as mentioned, Int64, Timestamp, and Date64 
should all use the same generated code underneath). The related idea of 
creating template specializations for these types was also mentioned here: 

https://github.com/apache/arrow/pull/11019#discussion_r700238908

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13970) [C++][Compute] Implement streaming version for SelectK

2021-09-09 Thread Alexander (Jira)
Alexander created ARROW-13970:
-

 Summary: [C++][Compute] Implement streaming version for SelectK
 Key: ARROW-13970
 URL: https://issues.apache.org/jira/browse/ARROW-13970
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Alexander
 Fix For: 6.0.0


PR [https://github.com/apache/arrow/pull/11019] implements SelectKUnstable.

A streaming version using a heap-based solution seems to be the right 
direction, as discussed here: 

https://github.com/apache/arrow/pull/11019#issuecomment-914419100 





[jira] [Assigned] (ARROW-13969) [C++][Compute] Implement SelectKStable

2021-09-09 Thread Alexander (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander reassigned ARROW-13969:
-

Assignee: (was: Alexander)

> [C++][Compute] Implement SelectKStable
> --
>
> Key: ARROW-13969
> URL: https://issues.apache.org/jira/browse/ARROW-13969
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Alexander
>Priority: Major
>  Labels: analytics, query-engine
> Fix For: 6.0.0
>
>
> PR [https://github.com/apache/arrow/pull/11019] implements SelectKUnstable.
>  
> Some previous results of SelectKUnstable using StableHeap are shown here: 
> [https://github.com/apache/arrow/pull/11019#issuecomment-913977337] 
>  
> So the implementation of SelectKStable should explore how to implement this 
> algorithm using StablePartition + stable sorting.





[jira] [Created] (ARROW-13969) [C++][Compute] Implement SelectKStable

2021-09-09 Thread Alexander (Jira)
Alexander created ARROW-13969:
-

 Summary: [C++][Compute] Implement SelectKStable
 Key: ARROW-13969
 URL: https://issues.apache.org/jira/browse/ARROW-13969
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Alexander
Assignee: Alexander
 Fix For: 6.0.0


PR [https://github.com/apache/arrow/pull/11019] implements SelectKUnstable.

 

Some previous results of SelectKUnstable using StableHeap are shown here: 
[https://github.com/apache/arrow/pull/11019#issuecomment-913977337] 

So the implementation of SelectKStable should explore how to implement this 
algorithm using StablePartition + stable sorting.





[jira] [Updated] (ARROW-1565) [C++][Compute] Implement TopK/BottomK

2021-09-09 Thread Alexander (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander updated ARROW-1565:
-
Summary: [C++][Compute] Implement TopK/BottomK  (was: [C++][Compute] 
Implement TopK/BottomK streaming execution nodes)

> [C++][Compute] Implement TopK/BottomK
> -
>
> Key: ARROW-1565
> URL: https://issues.apache.org/jira/browse/ARROW-1565
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Alexander
>Priority: Major
>  Labels: Analytics, pull-request-available, query-engine
> Fix For: 6.0.0
>
>  Time Spent: 11.5h
>  Remaining Estimate: 0h
>
> Heap-based topk can compute these indices in O(n log k) time





[jira] [Commented] (ARROW-13965) [C++] dynamic_casts in parquet TypedColumnWriterImpl impacting performance

2021-09-09 Thread Yibo Cai (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412901#comment-17412901
 ] 

Yibo Cai commented on ARROW-13965:
--

This looks like a nice improvement. Will you create a PR? Thanks.

> [C++] dynamic_casts in parquet TypedColumnWriterImpl impacting performance
> --
>
> Key: ARROW-13965
> URL: https://issues.apache.org/jira/browse/ARROW-13965
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
> Environment: arrow 6.0.0-SNAPSHOT on both RHEL8 (gcc 8.4.1) and MacOS 
> 11.5.2 (clang 11.0.0)
>Reporter: Edward Seidl
>Priority: Minor
> Attachments: arrow_downcast.patch
>
>
> The methods WriteDictionaryPage(), CheckDictionarySizeLimit(), WriteValues(), 
> and WriteValuesSpaced() in TypedColumnWriterImpl 
> (cpp/src/parquet/column_writer.cc) perform dynamic_casts of the current_dict_ 
> object to either DictEncoder or ValueEncoderType pointers.  When calling 
> WriteBatch() with a large number of values this is fine, but when writing 
> batches of 1 (as when using the stream API), these dynamic casts can consume 
> a great deal of CPU.  Using gperftools against code I wrote to do a 
> log-structured merge of several parquet files, I measured the dynamic_casts 
> taking as much as 25% of execution time.
> By modifying TypedColumnWriterImpl to save downcasted observer pointers of 
> the appropriate types, I was able to cut my execution time from 32 to 24 
> seconds, validating the gperftools results.  I've attached a patch to show 
> what I did.





[jira] [Resolved] (ARROW-13942) [Dev] cmake_format autotune doesn't work

2021-09-09 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-13942.
--
Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 2
[https://github.com/apache/arrow/pull/2]

> [Dev] cmake_format autotune doesn't work
> 
>
> Key: ARROW-13942
> URL: https://issues.apache.org/jira/browse/ARROW-13942
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> https://github.com/apache/arrow/runs/3550654193?check_suite_focus=true
> {noformat}
> + python3 -m pip install -r dev/archery/requirements-lint.txt
> Defaulting to user installation because normal site-packages is not writeable
> ERROR: Could not open requirements file: [Errno 2] No such file or directory: 
> 'dev/archery/requirements-lint.txt'
> {noformat}





[jira] [Closed] (ARROW-13968) [R] [CI] Add r-lib actions-based UCRT CI job

2021-09-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson closed ARROW-13968.
---
Resolution: Duplicate

> [R] [CI] Add r-lib actions-based UCRT CI job
> 
>
> Key: ARROW-13968
> URL: https://issues.apache.org/jira/browse/ARROW-13968
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Reporter: Jonathan Keane
>Priority: Major
>
> https://github.com/r-lib/actions/commit/a89f65b5ed2e9ad25e9518e2796604a1495a2c55
>  has been added to r-lib/actions
> an example: 
> https://github.com/jeroen/openssl/commit/fa15ce5ca57f8662d9aa07344b5447ea07457df4





[jira] [Updated] (ARROW-13967) [Go] Implement Concatenate function for Arrays

2021-09-09 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol updated ARROW-13967:
--
Component/s: Go

> [Go] Implement Concatenate function for Arrays
> --
>
> Key: ARROW-13967
> URL: https://issues.apache.org/jira/browse/ARROW-13967
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This is needed for proper handling of MakeArrayFromScalar when dealing with 
> nested types, and likely could be useful in other use cases too.





[jira] [Updated] (ARROW-13967) [Go] Implement Concatenate function for Arrays

2021-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13967:
---
Labels: pull-request-available  (was: )

> [Go] Implement Concatenate function for Arrays
> --
>
> Key: ARROW-13967
> URL: https://issues.apache.org/jira/browse/ARROW-13967
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is needed for proper handling of MakeArrayFromScalar when dealing with 
> nested types, and likely could be useful in other use cases too.





[jira] [Created] (ARROW-13968) [R] [CI] Add r-lib actions-based UCRT CI job

2021-09-09 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-13968:
--

 Summary: [R] [CI] Add r-lib actions-based UCRT CI job
 Key: ARROW-13968
 URL: https://issues.apache.org/jira/browse/ARROW-13968
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, R
Reporter: Jonathan Keane


https://github.com/r-lib/actions/commit/a89f65b5ed2e9ad25e9518e2796604a1495a2c55
 has been added to r-lib/actions

an example: 
https://github.com/jeroen/openssl/commit/fa15ce5ca57f8662d9aa07344b5447ea07457df4





[jira] [Created] (ARROW-13967) [Go] Implement Concatenate function for Arrays

2021-09-09 Thread Matthew Topol (Jira)
Matthew Topol created ARROW-13967:
-

 Summary: [Go] Implement Concatenate function for Arrays
 Key: ARROW-13967
 URL: https://issues.apache.org/jira/browse/ARROW-13967
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Matthew Topol
Assignee: Matthew Topol


This is needed for proper handling of MakeArrayFromScalar when dealing with 
nested types, and likely could be useful in other use cases too.





[jira] [Resolved] (ARROW-13964) [Go] Remove Parquet bitmap reader/writer implementations and use the shared arrow bitutils versions

2021-09-09 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol resolved ARROW-13964.
---
Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 11126
[https://github.com/apache/arrow/pull/11126]

> [Go] Remove Parquet bitmap reader/writer implementations and use the shared 
> arrow bitutils versions
> ---
>
> Key: ARROW-13964
> URL: https://issues.apache.org/jira/browse/ARROW-13964
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go, Parquet
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>






[jira] [Resolved] (ARROW-13940) [R] Turn on multithreading with Arrow engine queries

2021-09-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-13940.
-
Resolution: Fixed

Issue resolved by pull request 8
[https://github.com/apache/arrow/pull/8]

> [R] Turn on multithreading with Arrow engine queries
> 
>
> Key: ARROW-13940
> URL: https://issues.apache.org/jira/browse/ARROW-13940
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Neal Richardson
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Since ARROW-13740 was merged, we're seeing dataset queries take close to 2x 
> longer on conbench 
> https://conbench.ursa.dev/benchmarks/e54ae362090b4a868bee929d45936400/
> I'm also seeing only one core utilized when running queries locally.





[jira] [Resolved] (ARROW-13962) [R] Catch up on the NEWS

2021-09-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-13962.
-
Resolution: Fixed

Issue resolved by pull request 11122
[https://github.com/apache/arrow/pull/11122]

> [R] Catch up on the NEWS
> 
>
> Key: ARROW-13962
> URL: https://issues.apache.org/jira/browse/ARROW-13962
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-13877) [C++] Added support for fixed sized list to compute functions that process lists

2021-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13877:
---
Labels: kernel pull-request-available types  (was: kernel types)

> [C++] Added support for fixed sized list to compute functions that process 
> lists
> 
>
> Key: ARROW-13877
> URL: https://issues.apache.org/jira/browse/ARROW-13877
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Assignee: David Li
>Priority: Major
>  Labels: kernel, pull-request-available, types
> Fix For: 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The following functions do not support fixed size list (and should):
>  - list_flatten
>  - list_parent_indices (one could argue this doesn't need to be supported 
> since this should be obvious and fixed_size_list doesn't have an indices 
> array)
>  - list_value_length (should be easy)
> For reference, the following functions do correctly consume fixed_size_list 
> (there may be more, this isn't an exhaustive list):
>  - count
>  - drop_null
>  - is_null
>  - is_valid





[jira] [Commented] (ARROW-13939) how to do resampling of arrow table using cython

2021-09-09 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412791#comment-17412791
 ] 

Weston Pace commented on ARROW-13939:
-

The slice function should be O(1).  It does not actually copy memory or create 
a new array; it simply creates a new view of the same data.  The same goes for 
"table[x:y]", which calls slice under the hood.

> how to do resampling of arrow table using cython
> 
>
> Key: ARROW-13939
> URL: https://issues.apache.org/jira/browse/ARROW-13939
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: krishna deepak
>Priority: Minor
>
> Can someone please point me to resources on how to write resampling code in 
> Cython for an Arrow table.
>  # Will iterating the whole table be slow in Cython?
>  # Which is the best structure to append new elements to? Is there a way I 
> can create an empty table of the same schema and keep appending to it, or 
> should I use vectors/lists and then pass them to create a table?
> Performance is very important for me. Any help is highly appreciated.





[jira] [Assigned] (ARROW-13966) [C++] Comparison kernel(s) for decimals

2021-09-09 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li reassigned ARROW-13966:


Assignee: David Li

> [C++] Comparison kernel(s) for decimals
> ---
>
> Key: ARROW-13966
> URL: https://issues.apache.org/jira/browse/ARROW-13966
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Jonathan Keane
>Assignee: David Li
>Priority: Major
>  Labels: kernel, types
>
> Even decimal-decimal comparisons return an error:
> {code:r}
> > Scalar$create(1.5)$cast(decimal(15, 2)) > 
> > Scalar$create(1.1)$cast(decimal(15, 2))
> Error: NotImplemented: Function greater has no kernel matching input types 
> (scalar[decimal128(15, 2)], scalar[decimal128(15, 2)])
> {code}
> Ideally, we would also be able to (autocast in order to) compare 
> decimal-float or decimal-integer.





[jira] [Updated] (ARROW-13966) [C++] Comparison kernel(s) for decimals

2021-09-09 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-13966:
-
Labels: kernel types  (was: )

> [C++] Comparison kernel(s) for decimals
> ---
>
> Key: ARROW-13966
> URL: https://issues.apache.org/jira/browse/ARROW-13966
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Jonathan Keane
>Priority: Major
>  Labels: kernel, types
>
> Even decimal-decimal comparisons return an error:
> {code:r}
> > Scalar$create(1.5)$cast(decimal(15, 2)) > 
> > Scalar$create(1.1)$cast(decimal(15, 2))
> Error: NotImplemented: Function greater has no kernel matching input types 
> (scalar[decimal128(15, 2)], scalar[decimal128(15, 2)])
> {code}
> Ideally, we would also be able to (autocast in order to) compare 
> decimal-float or decimal-integer.





[jira] [Created] (ARROW-13966) [C++] Comparison kernel(s) for decimals

2021-09-09 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-13966:
--

 Summary: [C++] Comparison kernel(s) for decimals
 Key: ARROW-13966
 URL: https://issues.apache.org/jira/browse/ARROW-13966
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Jonathan Keane


Even decimal-decimal comparisons return an error:

{code:r}
> Scalar$create(1.5)$cast(decimal(15, 2)) > Scalar$create(1.1)$cast(decimal(15, 
> 2))
Error: NotImplemented: Function greater has no kernel matching input types 
(scalar[decimal128(15, 2)], scalar[decimal128(15, 2)])
{code}

Ideally, we would also be able to (autocast in order to) compare decimal-float 
or decimal-integer.





[jira] [Resolved] (ARROW-13961) [C++] iso_calendar may be uninitialized

2021-09-09 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-13961.
--
Resolution: Fixed

Issue resolved by pull request 11121
[https://github.com/apache/arrow/pull/11121]

> [C++] iso_calendar may be uninitialized
> ---
>
> Key: ARROW-13961
> URL: https://issues.apache.org/jira/browse/ARROW-13961
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Neal Richardson
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> {code}
> /arrow/cpp/src/arrow/scalar.h:137:64: warning: ‘*((void*)& iso_calendar +8)’ 
> may be used uninitialized in this function [-Wmaybe-uninitialized]
>   137 |   : PrimitiveScalarBase(std::move(type), true), value(value) {}
>   |^
> In file included from 
> /tmp/RtmpoS4YCn/file8773f4430f/src/arrow/CMakeFiles/arrow_objlib.dir/Unity/unity_17_cxx.cxx:7:
> /arrow/cpp/src/arrow/compute/kernels/scalar_temporal.cc:697:30: note: 
> ‘*((void*)& iso_calendar +8)’ was declared here
>   697 |   std::array iso_calendar;
> {code}
> fyi [~rokm]





[jira] [Created] (ARROW-13965) [C++] dynamic_casts in parquet TypedColumnWriterImpl impacting performance

2021-09-09 Thread Edward Seidl (Jira)
Edward Seidl created ARROW-13965:


 Summary: [C++] dynamic_casts in parquet TypedColumnWriterImpl 
impacting performance
 Key: ARROW-13965
 URL: https://issues.apache.org/jira/browse/ARROW-13965
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
 Environment: arrow 6.0.0-SNAPSHOT on both RHEL8 (gcc 8.4.1) and MacOS 
11.5.2 (clang 11.0.0)
Reporter: Edward Seidl
 Attachments: arrow_downcast.patch

The methods WriteDictionaryPage(), CheckDictionarySizeLimit(), WriteValues(), 
and WriteValuesSpaced() in TypedColumnWriterImpl 
(cpp/src/parquet/column_writer.cc) perform dynamic_casts of the current_dict_ 
object to either DictEncoder or ValueEncoderType pointers.  When calling 
WriteBatch() with a large number of values this is fine, but when writing 
batches of 1 (as when using the stream API), these dynamic casts can consume a 
great deal of CPU.  Using gperftools against code I wrote to do a 
log-structured merge of several parquet files, I measured the dynamic_casts 
taking as much as 25% of execution time.

By modifying TypedColumnWriterImpl to save downcasted observer pointers of the 
appropriate types, I was able to cut my execution time from 32 to 24 seconds, 
validating the gperftools results.  I've attached a patch to show what I did.





[jira] [Assigned] (ARROW-13293) [R] open_dataset followed by collect hangs (while compute works)

2021-09-09 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane reassigned ARROW-13293:


Assignee: (was: Nicola Crane)

> [R] open_dataset followed by collect hangs (while compute works)
> 
>
> Key: ARROW-13293
> URL: https://issues.apache.org/jira/browse/ARROW-13293
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 4.0.1
> Environment: Windows 10 (see also session info included in reprex)
>Reporter: Hans Van Calster
>Priority: Minor
>
> I tried to make a reproducible example using the iris dataset, but it works as 
> expected for that dataset. So the issue might be specific to the dataset I am 
> using (which contains over 100 columns). The example below illustrates the 
> issue.
> The parquet data used in the example can be downloaded from [this 
> link|https://drive.google.com/file/d/1MHaq3KqlheqrNm8dk71we74n_ip9hMqJ/view?usp=sharing]
>  
> The issue I see is the following:
>  
>  * calling open_dataset() %>% filter() %>% collect() hangs on my machine 
> (while I would expect that a tibble 1,646 x 116 would be returned very fast)
>  * The two alternative calls (one using read_parquet on the specific parquet 
> file within the Dataset on which I filter, and the other using compute() 
> instead of collect()) seem to work as expected
>  
> ``` r
> library(dplyr)
> #> 
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #> 
> #> filter, lag
> #> The following objects are masked from 'package:base':
> #> 
> #> intersect, setdiff, setequal, union
> library(arrow)
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #> timestamp
> read_parquet("data/lucas_harmonised/1_table/parquet_hive/year=2018/part-4.parquet")
>  %>%
>  filter(nuts1 == "BE2")
> #> # A tibble: 1,646 x 116
> #> id point_id nuts0 nuts1 nuts2 nuts3 th_lat th_long office_pi ex_ante
> #>   
> #> 1 199451 39803106 BE BE2 BE22 BE221 51.0 5.14 1 0 
> #> 2 220669 39623116 BE BE2 BE21 BE213 51.0 4.88 1 0 
> #> 3 215557 39483154 BE BE2 BE21 BE211 51.4 4.64 1 0 
> #> 4 223579 40303122 BE BE2 BE22 BE222 51.1 5.84 1 0 
> #> 5 331079 39783134 BE BE2 BE21 BE213 51.2 5.09 0 0 
> #> 6 225417 39403150 BE BE2 BE21 BE211 51.3 4.53 1 0 
> #> 7 3340 38863118 BE BE2 BE23 BE234 51.0 3.79 1 0 
> #> 8 137361 38143132 BE BE2 BE25 BE258 51.1 2.75 1 0 
> #> 9 221861 38343148 BE BE2 BE25 BE255 51.2 3.02 1 0 
> #> 10 787 39523148 BE BE2 BE21 BE211 51.3 4.70 1 0 
> #> # ... with 1,636 more rows, and 106 more variables: survey_date ,
> #> # car_latitude , car_ew , car_longitude , gps_proj ,
> #> # gps_prec , gps_altitude , gps_lat , gps_ew ,
> #> # gps_long , obs_dist , obs_direct , obs_type ,
> #> # obs_radius , letter_group , lc1 , lc1_label ,
> #> # lc1_spec , lc1_spec_label , lc1_perc , lc2 ,
> #> # lc2_label , lc2_spec , lc2_spec_label , lc2_perc ,
> #> # lu1 , lu1_label , lu1_type , lu1_type_label ,
> #> # lu1_perc , lu2 , lu2_label , lu2_type ,
> #> # lu2_type_label , lu2_perc , parcel_area_ha ,
> #> # tree_height_maturity , tree_height_survey , feature_width 
> ,
> #> # lm_stone_walls , crop_residues , lm_grass_margins ,
> #> # grazing , special_status , lc_lu_special_remark ,
> #> # cprn_cando , cprn_lc , cprn_lc_label , cprn_lc1n ,
> #> # cprnc_lc1e , cprnc_lc1s , cprnc_lc1w ,
> #> # cprn_lc1n_brdth , cprn_lc1e_brdth , cprn_lc1s_brdth ,
> #> # cprn_lc1w_brdth , cprn_lc1n_next , cprn_lc1s_next ,
> #> # cprn_lc1e_next , cprn_lc1w_next , cprn_urban ,
> #> # cprn_impervious_perc , inspire_plcc1 , inspire_plcc2 ,
> #> # inspire_plcc3 , inspire_plcc4 , inspire_plcc5 ,
> #> # inspire_plcc6 , inspire_plcc7 , inspire_plcc8 ,
> #> # eunis_complex , grassland_sample , grass_cando , wm ,
> #> # wm_source , wm_type , wm_delivery , erosion_cando ,
> #> # soil_stones_perc , bio_sample , soil_bio_taken ,
> #> # bulk0_10_sample , soil_blk_0_10_taken , bulk10_20_sample ,
> #> # soil_blk_10_20_taken , bulk20_30_sample ,
> #> # soil_blk_20_30_taken , standard_sample , soil_std_taken ,
> #> # organic_sample , soil_org_depth_cando , soil_taken ,
> #> # soil_crop , photo_point , photo_north , photo_south ,
> #> # photo_east , photo_west , transect , revisit , ...
> open_dataset("data/lucas_harmonised/1_table/parquet_hive/") %>%
>  filter(nuts1 == "BE2", year == 2018) %>%
>  compute() 
> #> Table
> #> 1646 rows x 117 columns
> #> $id 
> #> $point_id 
> #> $nuts0 
> #> $nuts1 
> #> $nuts2 
> #> $nuts3 
> #> $th_lat 
> #> $th_long 
> #> $office_pi 
> #> $ex_ante 
> #> $survey_date 
> #> $car_latitude 
> #> $car_ew 
> #> $car_longitude 
> #> $gps_proj 
> #> $gps_prec 
> #> $gps_altitude 
> #> $gps_lat 
> #> $gps_ew 
> #> $gps_long 
> #> $obs_dist 
> #> $obs_direct 
> #> $obs_type 
> #> 

[jira] [Commented] (ARROW-13939) how to do resampling of arrow table using cython

2021-09-09 Thread krishna deepak (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412781#comment-17412781
 ] 

krishna deepak commented on ARROW-13939:


Can you please tell me the time complexity of the C++ Slice function and of 
pyarrow's table[x:y]?

> how to do resampling of arrow table using cython
> 
>
> Key: ARROW-13939
> URL: https://issues.apache.org/jira/browse/ARROW-13939
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: krishna deepak
>Priority: Minor
>
> Can someone please point me to resources on how to write resampling code in 
> Cython for an Arrow table.
>  # Will iterating the whole table be slow in Cython?
>  # Which is the best structure to append new elements to? Is there a way I 
> can create an empty table of the same schema and keep appending to it, or 
> should I use vectors/lists and then pass them to create a table?
> Performance is very important for me. Any help is highly appreciated.





[jira] [Assigned] (ARROW-13877) [C++] Added support for fixed sized list to compute functions that process lists

2021-09-09 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li reassigned ARROW-13877:


Assignee: David Li

> [C++] Added support for fixed sized list to compute functions that process 
> lists
> 
>
> Key: ARROW-13877
> URL: https://issues.apache.org/jira/browse/ARROW-13877
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Assignee: David Li
>Priority: Major
>  Labels: kernel, types
> Fix For: 6.0.0
>
>
> The following functions do not support fixed size list (and should):
>  - list_flatten
>  - list_parent_indices (one could argue this doesn't need to be supported 
> since this should be obvious and fixed_size_list doesn't have an indices 
> array)
>  - list_value_length (should be easy)
> For reference, the following functions do correctly consume fixed_size_list 
> (there may be more, this isn't an exhaustive list):
>  - count
>  - drop_null
>  - is_null
>  - is_valid





[jira] [Updated] (ARROW-13964) [Go] Remove Parquet bitmap reader/writer implementations and use the shared arrow bitutils versions

2021-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13964:
---
Labels: pull-request-available  (was: )

> [Go] Remove Parquet bitmap reader/writer implementations and use the shared 
> arrow bitutils versions
> ---
>
> Key: ARROW-13964
> URL: https://issues.apache.org/jira/browse/ARROW-13964
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go, Parquet
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-13954) [Python] Extend compute kernel type testing to supply scalars

2021-09-09 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412766#comment-17412766
 ] 

Weston Pace commented on ARROW-13954:
-

Thanks for pointing this out.  You are correct: this is not about testing 
Python's type support but actually about testing the C++ compute kernels via 
Python.  I have attempted to clarify.

> [Python] Extend compute kernel type testing to supply scalars
> -
>
> Key: ARROW-13954
> URL: https://issues.apache.org/jira/browse/ARROW-13954
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: kernel, query-engine
>
> ARROW-13952 introduced testing for the various compute kernel signatures.  
> The current compute kernel type testing passes in all arguments as arrays.  
> We should extend it to account for cases where an argument is allowed to be a 
> scalar.





[jira] [Updated] (ARROW-13955) [Python] Extend compute kernel type testing to cover kernels which take recordbatch / table input

2021-09-09 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-13955:

Description: ARROW-13952 introduced testing for the various compute kernel 
signatures.  Some compute kernels (e.g. drop_nulls) have special handling for 
RecordBatch & Table.  These are not covered by the first pass of type testing.  
(was: Some compute kernels (e.g. drop_nulls) have special handling for 
RecordBatch & Table.  These are not covered by the first pass of type testing.)

> [Python] Extend compute kernel type testing to cover kernels which take 
> recordbatch / table input
> -
>
> Key: ARROW-13955
> URL: https://issues.apache.org/jira/browse/ARROW-13955
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: kernel, query-engine
>
> ARROW-13952 introduced testing for the various compute kernel signatures.  
> Some compute kernels (e.g. drop_nulls) have special handling for RecordBatch 
> & Table.  These are not covered by the first pass of type testing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13954) [Python] Extend compute kernel type testing to supply scalars

2021-09-09 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-13954:

Description: ARROW-13952 introduced testing for the various compute kernel 
signatures.  The current compute kernel type testing passes in all arguments as 
arrays.  We should extend it to account for cases where an argument is allowed 
to be a scalar.  (was: The current compute kernel type testing passes in all 
arguments as arrays.  We should extend it to account for cases where an 
argument is allowed to be a scalar.)

> [Python] Extend compute kernel type testing to supply scalars
> -
>
> Key: ARROW-13954
> URL: https://issues.apache.org/jira/browse/ARROW-13954
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: kernel, query-engine
>
> ARROW-13952 introduced testing for the various compute kernel signatures.  
> The current compute kernel type testing passes in all arguments as arrays.  
> We should extend it to account for cases where an argument is allowed to be a 
> scalar.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13955) [Python] Extend compute kernel type testing to cover kernels which take recordbatch / table input

2021-09-09 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-13955:

Description: Some compute kernels (e.g. drop_nulls) have special handling 
for RecordBatch & Table.  These are not covered by the first pass of type 
testing.  (was: Some kernels (e.g. drop_nulls) have special handling for 
RecordBatch & Table.  These are not covered by the first pass of type testing.)

> [Python] Extend compute kernel type testing to cover kernels which take 
> recordbatch / table input
> -
>
> Key: ARROW-13955
> URL: https://issues.apache.org/jira/browse/ARROW-13955
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: kernel, query-engine
>
> Some compute kernels (e.g. drop_nulls) have special handling for RecordBatch 
> & Table.  These are not covered by the first pass of type testing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13954) [Python] Extend compute kernel type testing to supply scalars

2021-09-09 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-13954:

Summary: [Python] Extend compute kernel type testing to supply scalars  
(was: [Python] Extend type testing to supply scalars)

> [Python] Extend compute kernel type testing to supply scalars
> -
>
> Key: ARROW-13954
> URL: https://issues.apache.org/jira/browse/ARROW-13954
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: kernel, query-engine
>
> The current type testing passes in all arguments as arrays.  We should extend 
> it to account for cases where an argument is allowed to be a scalar.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13955) [Python] Extend compute kernel type testing to cover kernels which take recordbatch / table input

2021-09-09 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-13955:

Summary: [Python] Extend compute kernel type testing to cover kernels which 
take recordbatch / table input  (was: [Python] Extend type testing to cover 
kernels which take recordbatch / table input)

> [Python] Extend compute kernel type testing to cover kernels which take 
> recordbatch / table input
> -
>
> Key: ARROW-13955
> URL: https://issues.apache.org/jira/browse/ARROW-13955
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: kernel, query-engine
>
> Some kernels (e.g. drop_nulls) have special handling for RecordBatch & Table. 
>  These are not covered by the first pass of type testing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-13963) [Go] Shift Bitmap Reader/Writer implementations from Parquet to Arrow bitutil package

2021-09-09 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol resolved ARROW-13963.
---
Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 11124
[https://github.com/apache/arrow/pull/11124]

> [Go] Shift Bitmap Reader/Writer implementations from Parquet to Arrow bitutil 
> package
> 
>
> Key: ARROW-13963
> URL: https://issues.apache.org/jira/browse/ARROW-13963
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Move the implementations of the BitmapReader/Writers from an internal Parquet 
> module to the arrow bitutil package in order to share them between the Arrow 
> and Parquet repos.
> This covers the Arrow side of adding the implementations, it will be followed 
> by a change to the parquet module to remove them and point to the merged 
> arrow utils.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13954) [Python] Extend compute kernel type testing to supply scalars

2021-09-09 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-13954:

Description: The current compute kernel type testing passes in all 
arguments as arrays.  We should extend it to account for cases where an 
argument is allowed to be a scalar.  (was: The current type testing passes in 
all arguments as arrays.  We should extend it to account for cases where an 
argument is allowed to be a scalar.)

> [Python] Extend compute kernel type testing to supply scalars
> -
>
> Key: ARROW-13954
> URL: https://issues.apache.org/jira/browse/ARROW-13954
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: kernel, query-engine
>
> The current compute kernel type testing passes in all arguments as arrays.  
> We should extend it to account for cases where an argument is allowed to be a 
> scalar.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13953) [Python] Extend compute kernel type testing to also test for union types

2021-09-09 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-13953:

Description: ARROW-13952 introduced testing for the various compute kernel 
signatures.  Many kernels likely do not support union types (e.g. arithmetic or 
string kernels) but there are a number of kernels that should operate on any 
possible input type (e.g. drop_null, filter, take) and we should verify these 
kernels work correctly with union types.

> [Python] Extend compute kernel type testing to also test for union types
> 
>
> Key: ARROW-13953
> URL: https://issues.apache.org/jira/browse/ARROW-13953
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: kernel, query-engine
>
> ARROW-13952 introduced testing for the various compute kernel signatures.  
> Many kernels likely do not support union types (e.g. arithmetic or string 
> kernels) but there are a number of kernels that should operate on any 
> possible input type (e.g. drop_null, filter, take) and we should verify these 
> kernels work correctly with union types.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13953) [Python] Extend compute kernel type testing to also test for union types

2021-09-09 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-13953:

Summary: [Python] Extend compute kernel type testing to also test for union 
types  (was: [Python] Extend type testing to also test for union types)

> [Python] Extend compute kernel type testing to also test for union types
> 
>
> Key: ARROW-13953
> URL: https://issues.apache.org/jira/browse/ARROW-13953
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: kernel, query-engine
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13952) [Python] Add initial type testing for compute kernels

2021-09-09 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-13952:

Summary: [Python] Add initial type testing for compute kernels  (was: 
[Python] Add initial type testing for kernels)

> [Python] Add initial type testing for compute kernels
> -
>
> Key: ARROW-13952
> URL: https://issues.apache.org/jira/browse/ARROW-13952
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: kernel, query-engine
>
> We need tests that ensure that we are supporting all types that should be 
> supported for the various compute kernels.  I've created a first pass at this 
> (and filed a number of JIRAs for places we are missing support).  This PR is 
> to get the test itself upstreamed into Arrow.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13952) [Python] Add initial type testing for compute kernels

2021-09-09 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-13952:

Description: We need tests that ensure that we are supporting all types 
that should be supported for the various compute kernels.  For example, 
arithmetic kernels should support any combination of numeric inputs.  I've 
created a first pass at this (and filed a number of JIRAs for places we are 
missing support).  This PR is to get the test itself upstreamed into Arrow.  
(was: We need tests that ensure that we are supporting all types that should be 
supported for the various compute kernels.  I've created a first pass at this 
(and filed a number of JIRAs for places we are missing support).  This PR is to 
get the test itself upstreamed into Arrow.)

> [Python] Add initial type testing for compute kernels
> -
>
> Key: ARROW-13952
> URL: https://issues.apache.org/jira/browse/ARROW-13952
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: kernel, query-engine
>
> We need tests that ensure that we are supporting all types that should be 
> supported for the various compute kernels.  For example, arithmetic kernels 
> should support any combination of numeric inputs.  I've created a first pass 
> at this (and filed a number of JIRAs for places we are missing support).  
> This PR is to get the test itself upstreamed into Arrow.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13398) [R] Update install.Rmd vignette

2021-09-09 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-13398:
-
Description: 
Proposed changes:
 * Break up further to make more skimmable, by using more subheadings
 * Add flowchart to "how dependencies are resolved" section


  was:
Proposed changes:
 * Break up further to make more skimmable, by using more subheadings
 * Add flowchart to "how dependencies are resolved" section
* make sure the instructions on how to install from the nightly build set 
{{repos}} to a vector including the current default so R dependencies will be 
installed too


> [R] Update install.Rmd vignette
> ---
>
> Key: ARROW-13398
> URL: https://issues.apache.org/jira/browse/ARROW-13398
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Documentation, R
>Reporter: Nicola Crane
>Priority: Major
>
> Proposed changes:
>  * Break up further to make more skimmable, by using more subheadings
>  * Add flowchart to "how dependencies are resolved" section



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-13842) [C++] Bump vendored date library version

2021-09-09 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-13842.

Resolution: Fixed

Issue resolved by pull request 7
[https://github.com/apache/arrow/pull/7]

> [C++] Bump vendored date library version
> 
>
> Key: ARROW-13842
> URL: https://issues.apache.org/jira/browse/ARROW-13842
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> This fix: [https://github.com/HowardHinnant/date/issues/696]
> should let us always re-enable this test:
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/pretty_print_test.cc#L454-L466



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13398) [R] Update install.Rmd vignette

2021-09-09 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-13398:
-
Description: 
Proposed changes:
 * Break up further to make more skimmable, by using more subheadings
 * Add flowchart to "how dependencies are resolved" section
* make sure the instructions on how to install from the nightly build set 
{{repos}} to a vector so dependencies will be installed

  was:
Proposed changes:
 * Break up further to make more skimmable, by using more subheadings
 * Add flowchart to "how dependencies are resolved" section


> [R] Update install.Rmd vignette
> ---
>
> Key: ARROW-13398
> URL: https://issues.apache.org/jira/browse/ARROW-13398
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Documentation, R
>Reporter: Nicola Crane
>Priority: Major
>
> Proposed changes:
>  * Break up further to make more skimmable, by using more subheadings
>  * Add flowchart to "how dependencies are resolved" section
> * make sure the instructions on how to install from the nightly build set 
> {{repos}} to a vector so dependencies will be installed



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13398) [R] Update install.Rmd vignette

2021-09-09 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-13398:
-
Description: 
Proposed changes:
 * Break up further to make more skimmable, by using more subheadings
 * Add flowchart to "how dependencies are resolved" section
* make sure the instructions on how to install from the nightly build set 
{{repos}} to a vector including the current default so R dependencies will be 
installed too

  was:
Proposed changes:
 * Break up further to make more skimmable, by using more subheadings
 * Add flowchart to "how dependencies are resolved" section
* make sure the instructions on how to install from the nightly build set 
{{repos}} to a vector so dependencies will be installed


> [R] Update install.Rmd vignette
> ---
>
> Key: ARROW-13398
> URL: https://issues.apache.org/jira/browse/ARROW-13398
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Documentation, R
>Reporter: Nicola Crane
>Priority: Major
>
> Proposed changes:
>  * Break up further to make more skimmable, by using more subheadings
>  * Add flowchart to "how dependencies are resolved" section
> * make sure the instructions on how to install from the nightly build set 
> {{repos}} to a vector including the current default so R dependencies will be 
> installed too



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13964) [Go] Remove Parquet bitmap reader/writer implementations and use the shared arrow bitutils versions

2021-09-09 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol updated ARROW-13964:
--
Component/s: Parquet
 Go

> [Go] Remove Parquet bitmap reader/writer implementations and use the shared 
> arrow bitutils versions
> ---
>
> Key: ARROW-13964
> URL: https://issues.apache.org/jira/browse/ARROW-13964
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go, Parquet
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13963) [Go] Shift Bitmap Reader/Writer implementations from Parquet to Arrow bitutil package

2021-09-09 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol updated ARROW-13963:
--
Component/s: Go

> [Go] Shift Bitmap Reader/Writer implementations from Parquet to Arrow bitutil 
> package
> 
>
> Key: ARROW-13963
> URL: https://issues.apache.org/jira/browse/ARROW-13963
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Move the implementations of the BitmapReader/Writers from an internal Parquet 
> module to the arrow bitutil package in order to share them between the Arrow 
> and Parquet repos.
> This covers the Arrow side of adding the implementations, it will be followed 
> by a change to the parquet module to remove them and point to the merged 
> arrow utils.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13963) [Go] Shift Bitmap Reader/Writer implementations from Parquet to Arrow bitutil package

2021-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13963:
---
Labels: pull-request-available  (was: )

> [Go] Shift Bitmap Reader/Writer implementations from Parquet to Arrow bitutil 
> package
> 
>
> Key: ARROW-13963
> URL: https://issues.apache.org/jira/browse/ARROW-13963
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Move the implementations of the BitmapReader/Writers from an internal Parquet 
> module to the arrow bitutil package in order to share them between the Arrow 
> and Parquet repos.
> This covers the Arrow side of adding the implementations, it will be followed 
> by a change to the parquet module to remove them and point to the merged 
> arrow utils.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13964) [Go] Remove Parquet bitmap reader/writer implementations and use the shared arrow bitutils versions

2021-09-09 Thread Matthew Topol (Jira)
Matthew Topol created ARROW-13964:
-

 Summary: [Go] Remove Parquet bitmap reader/writer implementations 
and use the shared arrow bitutils versions
 Key: ARROW-13964
 URL: https://issues.apache.org/jira/browse/ARROW-13964
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Matthew Topol
Assignee: Matthew Topol






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13963) [Go] Shift Bitmap Reader/Writer implementations from Parquet to Arrow bitutil package

2021-09-09 Thread Matthew Topol (Jira)
Matthew Topol created ARROW-13963:
-

 Summary: [Go] Shift Bitmap Reader/Writer implementations from 
Parquet to Arrow bitutil package
 Key: ARROW-13963
 URL: https://issues.apache.org/jira/browse/ARROW-13963
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Matthew Topol
Assignee: Matthew Topol


Move the implementations of the BitmapReader/Writers from an internal Parquet 
module to the arrow bitutil package in order to share them between the Arrow 
and Parquet repos.

This covers the Arrow side of adding the implementations, it will be followed 
by a change to the parquet module to remove them and point to the merged arrow 
utils.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13655) [C++][Parquet] Reading large Parquet file can give "MaxMessageSize reached" error with Thrift 0.14

2021-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13655:
---
Labels: pull-request-available  (was: )

> [C++][Parquet] Reading large Parquet file can give "MaxMessageSize reached" 
> error with Thrift 0.14
> --
>
> Key: ARROW-13655
> URL: https://issues.apache.org/jira/browse/ARROW-13655
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.1, 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> From https://github.com/dask/dask/issues/8027
> Apache Thrift introduced a `MaxMessageSize` configuration option 
> (https://github.com/apache/thrift/blob/master/doc/specs/thrift-tconfiguration.md#maxmessagesize)
>  in version 0.14 (THRIFT-5237). 
> I think this is the cause of an issue reported originally at 
> https://github.com/dask/dask/issues/8027, where one can get a _"OSError: 
> Couldn't deserialize thrift: MaxMessageSize reached"_ error while reading a 
> large Parquet (metadata-only) file. 
> In the original report, the file was written using the python fastparquet 
> library (which uses the python thrift bindings, which still use Thrift 0.13), 
> but I was able to construct a reproducible code example with pyarrow.
> Create a large metadata Parquet file with pyarrow in an environment with 
> Arrow built against Thrift 0.13 (e.g. a local install from source, or 
> pyarrow 2.0 from conda-forge, which can be installed with libthrift 0.13):
> {code:python}
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({str(i): np.random.randn(10) for i in range(1_000)})
> pq.write_table(table, "__temp_file_for_metadata.parquet")
> metadata = pq.read_metadata("__temp_file_for_metadata.parquet")
> metadata2 = pq.read_metadata("__temp_file_for_metadata.parquet")
> [metadata.append_row_groups(metadata2) for _ in range(4000)]
> metadata.write_metadata_file("test_parquet_metadata_large_file.parquet")
> {code}
> And then reading this file again in the same environment works fine, but 
> reading it in an environment with recent Thrift 0.14 (e.g. installing latest 
> pyarrow with conda-forge) gives the following error:
> {code:python}
> In [1]: import pyarrow.parquet as pq
> In [2]: pq.read_metadata("test_parquet_metadata_large_file.parquet")
> ...
> OSError: Couldn't deserialize thrift: MaxMessageSize reached
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13959) [R] Update tests for extracting components from date32 objects

2021-09-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13959:

Fix Version/s: 6.0.0

> [R] Update tests for extracting components from date32 objects 
> ---
>
> Key: ARROW-13959
> URL: https://issues.apache.org/jira/browse/ARROW-13959
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The R tests implemented in the PR which adds C++ functionality for extracting 
> components from date32 objects don't compare Arrow dplyr code with R dplyr 
> code - these tests should be updated to do so.
> https://github.com/apache/arrow/commit/4b5ed4eb5583cf24d8daff05a865c8d1cb616576#diff-1cddae31f0151681f8b551bf834f1b9fb5ebac6061521efc70084b5057f7



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13924) [R] Bindings for stringr::str_starts, stringr::str_ends, base::startsWith and base::endsWith

2021-09-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13924:

Fix Version/s: 6.0.0
   Labels: good-first-issue kernel  (was: good-first-issue)

> [R] Bindings for stringr::str_starts, stringr::str_ends, base::startsWith and 
> base::endsWith
> 
>
> Key: ARROW-13924
> URL: https://issues.apache.org/jira/browse/ARROW-13924
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Minor
>  Labels: good-first-issue, kernel
> Fix For: 6.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13940) [R] Turn on multithreading with Arrow engine queries

2021-09-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13940:

Fix Version/s: 6.0.0

> [R] Turn on multithreading with Arrow engine queries
> 
>
> Key: ARROW-13940
> URL: https://issues.apache.org/jira/browse/ARROW-13940
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Since ARROW-13740 was merged, we're seeing dataset queries take close to 2x 
> longer on conbench 
> https://conbench.ursa.dev/benchmarks/e54ae362090b4a868bee929d45936400/
> I'm also seeing only one core utilized when running queries locally as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13927) [R] Add Karl to the contributors list for the package

2021-09-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13927:

Fix Version/s: 6.0.0

> [R] Add Karl to the contributors list for the package
> -
>
> Key: ARROW-13927
> URL: https://issues.apache.org/jira/browse/ARROW-13927
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Jonathan Keane
>Priority: Major
> Fix For: 6.0.0
>
>
> [~karldw] : As recognition of the contributions you have made, especially the 
> herculean effort with ARROW-12981 we would like to add you to the 
> contributors list of the R package. 
> If you are ok with this, would you mind giving us the email you would like to 
> have listed there (and ORCID, if you have one and want it there as well)?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13655) [C++][Parquet] Reading large Parquet file can give "MaxMessageSize reached" error with Thrift 0.14

2021-09-09 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-13655:
---
Fix Version/s: 5.0.1

> [C++][Parquet] Reading large Parquet file can give "MaxMessageSize reached" 
> error with Thrift 0.14
> --
>
> Key: ARROW-13655
> URL: https://issues.apache.org/jira/browse/ARROW-13655
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 5.0.1, 6.0.0
>
>
> From https://github.com/dask/dask/issues/8027
> Apache Thrift introduced a `MaxMessageSize` configuration option 
> (https://github.com/apache/thrift/blob/master/doc/specs/thrift-tconfiguration.md#maxmessagesize)
>  in version 0.14 (THRIFT-5237). 
> I think this is the cause of an issue reported originally at 
> https://github.com/dask/dask/issues/8027, where one can get a _"OSError: 
> Couldn't deserialize thrift: MaxMessageSize reached"_ error while reading a 
> large Parquet (metadata-only) file. 
> In the original report, the file was written using the python fastparquet 
> library (which uses the python thrift bindings, which still use Thrift 0.13), 
> but I was able to construct a reproducible code example with pyarrow.
> Create a large metadata Parquet file with pyarrow in an environment with 
> Arrow built against Thrift 0.13 (e.g. a local install from source, or 
> pyarrow 2.0 from conda-forge, which can be installed with libthrift 0.13):
> {code:python}
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({str(i): np.random.randn(10) for i in range(1_000)})
> pq.write_table(table, "__temp_file_for_metadata.parquet")
> metadata = pq.read_metadata("__temp_file_for_metadata.parquet")
> metadata2 = pq.read_metadata("__temp_file_for_metadata.parquet")
> [metadata.append_row_groups(metadata2) for _ in range(4000)]
> metadata.write_metadata_file("test_parquet_metadata_large_file.parquet")
> {code}
> And then reading this file again in the same environment works fine, but 
> reading it in an environment with recent Thrift 0.14 (e.g. installing latest 
> pyarrow with conda-forge) gives the following error:
> {code:python}
> In [1]: import pyarrow.parquet as pq
> In [2]: pq.read_metadata("test_parquet_metadata_large_file.parquet")
> ...
> OSError: Couldn't deserialize thrift: MaxMessageSize reached
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11885) [R] Turn off some capabilities when LIBARROW_MINIMAL=true

2021-09-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-11885.
-
Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 11109
[https://github.com/apache/arrow/pull/11109]

> [R] Turn off some capabilities when LIBARROW_MINIMAL=true
> -
>
> Key: ARROW-11885
> URL: https://issues.apache.org/jira/browse/ARROW-11885
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Currently when {{LIBARROW_MINIMAL}} takes a value other than {{false}}, the 
> Arrow C++ library is built with mimalloc, S3, and compression algos turned 
> off. Consider whether to turn off some other capabilities when 
> {{LIBARROW_MINIMAL}} is explicitly set to {{true}}, including Arrow Dataset 
> and Parquet.
> The code that controls this is in {{r/inst/build_arrow_static.sh}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13962) [R] Catch up on the NEWS

2021-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13962:
---
Labels: pull-request-available  (was: )

> [R] Catch up on the NEWS
> 
>
> Key: ARROW-13962
> URL: https://issues.apache.org/jira/browse/ARROW-13962
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-12669) [C++] Kernel to return Array of elements at index of list in ListArray

2021-09-09 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-12669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Percy Camilo Triveño Aucahuasi reassigned ARROW-12669:
--

Assignee: Percy Camilo Triveño Aucahuasi

> [C++] Kernel to return Array of elements at index of list in ListArray
> --
>
> Key: ARROW-12669
> URL: https://issues.apache.org/jira/browse/ARROW-12669
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ian Cook
>Assignee: Percy Camilo Triveño Aucahuasi
>Priority: Critical
>  Labels: kernel, types
> Fix For: 6.0.0
>
>
> It would be useful to have a compute function that takes a 
> {{ListArray}} and an integer index {{n}} and returns an 
> {{Array}} containing the {{n}}th item from each list.
> This would be useful in combination with existing functions that return 
> list-type output, for example the string splitting functions.
> Let's please ensure that this also works on fixed size list.
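The requested kernel's semantics can be sketched in plain Python (an illustration of the expected behavior, not Arrow code; the function name `list_element` and the null-for-out-of-bounds policy are assumptions): given list-typed values and an index n, take the n-th item of each list, producing null (None) where the input list is null or too short.

```python
from typing import Optional, Sequence

def list_element(lists: Sequence[Optional[list]], n: int) -> list:
    """Return the n-th item of each list; None for null or too-short lists."""
    out = []
    for lst in lists:
        if lst is None or n >= len(lst):
            # Null propagation; out-of-bounds also maps to null here,
            # though erroring instead is a valid design choice for the kernel.
            out.append(None)
        else:
            out.append(lst[n])
    return out

# E.g. applied to the output of a string-splitting function:
split = [["a", "b"], ["c"], None, ["d", "e", "f"]]
print(list_element(split, 1))  # ['b', None, None, 'e']
```

Whether an out-of-range index yields null or raises is one of the decisions the kernel design would need to settle; the sketch picks null for simplicity.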



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12669) [C++] Kernel to return Array of elements at index of list in ListArray

2021-09-09 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-12669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Percy Camilo Triveño Aucahuasi updated ARROW-12669:
---
Labels: Kernels kernel types  (was: kernel types)

> [C++] Kernel to return Array of elements at index of list in ListArray
> --
>
> Key: ARROW-12669
> URL: https://issues.apache.org/jira/browse/ARROW-12669
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ian Cook
>Assignee: Percy Camilo Triveño Aucahuasi
>Priority: Critical
>  Labels: Kernels, kernel, types
> Fix For: 6.0.0
>
>
> It would be useful to have a compute function that takes a 
> {{ListArray}} and an integer index {{n}} and returns an 
> {{Array}} containing the {{n}}th item from each list.
> This would be useful in combination with existing functions that return 
> list-type output, for example the string splitting functions.
> Let's please ensure that this also works on fixed size list.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-12669) [C++] Kernel to return Array of elements at index of list in ListArray

2021-09-09 Thread Eduardo Ponce (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eduardo Ponce reassigned ARROW-12669:
-

Assignee: (was: Eduardo Ponce)

> [C++] Kernel to return Array of elements at index of list in ListArray
> --
>
> Key: ARROW-12669
> URL: https://issues.apache.org/jira/browse/ARROW-12669
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ian Cook
>Priority: Critical
>  Labels: kernel, types
> Fix For: 6.0.0
>
>
> It would be useful to have a compute function that takes a 
> {{ListArray}} and an integer index {{n}} and returns an 
> {{Array}} containing the {{n}}th item from each list.
> This would be useful in combination with existing functions that return 
> list-type output, for example the string splitting functions.
> Let's please ensure that this also works on fixed size list.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13718) [Doc][Cookbook] Creating Arrays - R

2021-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13718:
---
Labels: pull-request-available  (was: )

> [Doc][Cookbook] Creating Arrays - R
> ---
>
> Key: ARROW-13718
> URL: https://issues.apache.org/jira/browse/ARROW-13718
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alessandro Molina
>Assignee: Nicola Crane
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13709) [Doc][Cookbook] Reading JSON Files - R

2021-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13709:
---
Labels: pull-request-available  (was: )

> [Doc][Cookbook] Reading JSON Files - R
> --
>
> Key: ARROW-13709
> URL: https://issues.apache.org/jira/browse/ARROW-13709
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Alessandro Molina
>Assignee: Nicola Crane
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12669) [C++] Kernel to return Array of elements at index of list in ListArray

2021-09-09 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated ARROW-12669:
-
Priority: Critical  (was: Major)

> [C++] Kernel to return Array of elements at index of list in ListArray
> --
>
> Key: ARROW-12669
> URL: https://issues.apache.org/jira/browse/ARROW-12669
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ian Cook
>Assignee: Eduardo Ponce
>Priority: Critical
>  Labels: kernel, types
> Fix For: 6.0.0
>
>
> It would be useful to have a compute function that takes a 
> {{ListArray}} and an integer index {{n}} and returns an 
> {{Array}} containing the {{n}}th item from each list.
> This would be useful in combination with existing functions that return 
> list-type output, for example the string splitting functions.
> Let's please ensure that this also works on fixed size list.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-13655) [C++][Parquet] Reading large Parquet file can give "MaxMessageSize reached" error with Thrift 0.14

2021-09-09 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-13655:
--

Assignee: Antoine Pitrou

> [C++][Parquet] Reading large Parquet file can give "MaxMessageSize reached" 
> error with Thrift 0.14
> --
>
> Key: ARROW-13655
> URL: https://issues.apache.org/jira/browse/ARROW-13655
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 6.0.0
>
>
> From https://github.com/dask/dask/issues/8027
> Apache Thrift introduced a `MaxMessageSize` configuration option 
> (https://github.com/apache/thrift/blob/master/doc/specs/thrift-tconfiguration.md#maxmessagesize)
>  in version 0.14 (THRIFT-5237). 
> I think this is the cause of an issue reported originally at 
> https://github.com/dask/dask/issues/8027, where one can get an _"OSError: 
> Couldn't deserialize thrift: MaxMessageSize reached"_ error while reading a 
> large Parquet (metadata-only) file. 
> In the original report, the file was written using the Python fastparquet 
> library (which uses the Python thrift bindings, which still use Thrift 0.13), 
> but I was able to construct a reproducible code example with pyarrow.
> Create a large metadata Parquet file with pyarrow in an environment with 
> Arrow built against Thrift 0.13 (e.g. with a local install from source, or 
> by installing pyarrow 2.0 from conda-forge together with libthrift 0.13):
> {code:python}
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({str(i): np.random.randn(10) for i in range(1_000)})
> pq.write_table(table, "__temp_file_for_metadata.parquet")
> metadata = pq.read_metadata("__temp_file_for_metadata.parquet")
> metadata2 = pq.read_metadata("__temp_file_for_metadata.parquet")
> [metadata.append_row_groups(metadata2) for _ in range(4000)]
> metadata.write_metadata_file("test_parquet_metadata_large_file.parquet")
> {code}
> And then reading this file again in the same environment works fine, but 
> reading it in an environment with the more recent Thrift 0.14 (e.g. installing 
> the latest pyarrow from conda-forge) gives the following error:
> {code:python}
> In [1]: import pyarrow.parquet as pq
> In [2]: pq.read_metadata("test_parquet_metadata_large_file.parquet")
> ...
> OSError: Couldn't deserialize thrift: MaxMessageSize reached
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13961) [C++] iso_calendar may be uninitialized

2021-09-09 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412658#comment-17412658
 ] 

Neal Richardson commented on ARROW-13961:
-

https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=11314=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=6c939d89-0d1a-51f2-8b30-091a7a82e98c=461

> [C++] iso_calendar may be uninitialized
> ---
>
> Key: ARROW-13961
> URL: https://issues.apache.org/jira/browse/ARROW-13961
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Neal Richardson
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {code}
> /arrow/cpp/src/arrow/scalar.h:137:64: warning: ‘*((void*)& iso_calendar +8)’ 
> may be used uninitialized in this function [-Wmaybe-uninitialized]
>   137 |   : PrimitiveScalarBase(std::move(type), true), value(value) {}
>   |^
> In file included from 
> /tmp/RtmpoS4YCn/file8773f4430f/src/arrow/CMakeFiles/arrow_objlib.dir/Unity/unity_17_cxx.cxx:7:
> /arrow/cpp/src/arrow/compute/kernels/scalar_temporal.cc:697:30: note: 
> ‘*((void*)& iso_calendar +8)’ was declared here
>   697 |   std::array iso_calendar;
> {code}
> fyi [~rokm]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13655) [C++][Parquet] Reading large Parquet file can give "MaxMessageSize reached" error with Thrift 0.14

2021-09-09 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-13655:
---
Fix Version/s: 6.0.0

> [C++][Parquet] Reading large Parquet file can give "MaxMessageSize reached" 
> error with Thrift 0.14
> --
>
> Key: ARROW-13655
> URL: https://issues.apache.org/jira/browse/ARROW-13655
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 6.0.0
>
>
> From https://github.com/dask/dask/issues/8027
> Apache Thrift introduced a `MaxMessageSize` configuration option 
> (https://github.com/apache/thrift/blob/master/doc/specs/thrift-tconfiguration.md#maxmessagesize)
>  in version 0.14 (THRIFT-5237). 
> I think this is the cause of an issue reported originally at 
> https://github.com/dask/dask/issues/8027, where one can get an _"OSError: 
> Couldn't deserialize thrift: MaxMessageSize reached"_ error while reading a 
> large Parquet (metadata-only) file. 
> In the original report, the file was written using the Python fastparquet 
> library (which uses the Python thrift bindings, which still use Thrift 0.13), 
> but I was able to construct a reproducible code example with pyarrow.
> Create a large metadata Parquet file with pyarrow in an environment with 
> Arrow built against Thrift 0.13 (e.g. with a local install from source, or 
> by installing pyarrow 2.0 from conda-forge together with libthrift 0.13):
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({str(i): np.random.randn(10) for i in range(1_000)})
> pq.write_table(table, "__temp_file_for_metadata.parquet")
> metadata = pq.read_metadata("__temp_file_for_metadata.parquet")
> metadata2 = pq.read_metadata("__temp_file_for_metadata.parquet")
> [metadata.append_row_groups(metadata2) for _ in range(4000)]
> metadata.write_metadata_file("test_parquet_metadata_large_file.parquet")
> {code}
> And then reading this file again in the same environment works fine, but 
> reading it in an environment with the more recent Thrift 0.14 (e.g. installing 
> the latest pyarrow from conda-forge) gives the following error:
> {code:python}
> In [1]: import pyarrow.parquet as pq
> In [2]: pq.read_metadata("test_parquet_metadata_large_file.parquet")
> ...
> OSError: Couldn't deserialize thrift: MaxMessageSize reached
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13961) [C++] iso_calendar may be uninitialized

2021-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13961:
---
Labels: pull-request-available  (was: )

> [C++] iso_calendar may be uninitialized
> ---
>
> Key: ARROW-13961
> URL: https://issues.apache.org/jira/browse/ARROW-13961
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Neal Richardson
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code}
> /arrow/cpp/src/arrow/scalar.h:137:64: warning: ‘*((void*)& iso_calendar +8)’ 
> may be used uninitialized in this function [-Wmaybe-uninitialized]
>   137 |   : PrimitiveScalarBase(std::move(type), true), value(value) {}
>   |^
> In file included from 
> /tmp/RtmpoS4YCn/file8773f4430f/src/arrow/CMakeFiles/arrow_objlib.dir/Unity/unity_17_cxx.cxx:7:
> /arrow/cpp/src/arrow/compute/kernels/scalar_temporal.cc:697:30: note: 
> ‘*((void*)& iso_calendar +8)’ was declared here
>   697 |   std::array iso_calendar;
> {code}
> fyi [~rokm]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-13960) [C++] Fix use of non-const references in temporal kernels

2021-09-09 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li closed ARROW-13960.

Resolution: Duplicate

> [C++] Fix use of non-const references in temporal kernels
> -
>
> Key: ARROW-13960
> URL: https://issues.apache.org/jira/browse/ARROW-13960
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: kernel
>
> See final comments in https://github.com/apache/arrow/pull/11075



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-13033) [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

2021-09-09 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-13033.

Resolution: Fixed

Issue resolved by pull request 10610
[https://github.com/apache/arrow/pull/10610]

> [C++] Kernel to localize naive timestamps to a timezone (preserving 
> clock-time)
> ---
>
> Key: ARROW-13033
> URL: https://issues.apache.org/jira/browse/ARROW-13033
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Rok Mihevc
>Priority: Major
>  Labels: pull-request-available, timestamp, timezone
> Fix For: 6.0.0
>
>  Time Spent: 15.5h
>  Remaining Estimate: 0h
>
> Given a tz-naive timestamp, "localize" would interpret that timestamp as 
> local in a given timezone, and return a tz-aware timestamp keeping the same 
> "clock time" (the same year/month/day/hour/etc in the printed 
> representation). Under the hood this converts the timestamp value from that 
> timezone to UTC, since tz-aware timestamps are stored as UTC.
> References: 
> [tz_localize|https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.tz_localize.html]
>  in pandas, or 
> [force_tz|https://lubridate.tidyverse.org/reference/force_tz.html] in R's 
> lubridate package
> This will (eventually) also have to deal with ambiguous or non-existing times.
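The "localize" semantics described above can be sketched with the Python standard library's zoneinfo module (an illustration of the intended behavior, not Arrow's implementation; the choice of zone and timestamp is arbitrary): interpreting a tz-naive timestamp as local time in a zone keeps the printed clock time, while the underlying UTC instant shifts by the zone offset.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+, needs the system tz database

naive = datetime(2021, 1, 15, 12, 0, 0)  # tz-naive "clock time"

# Localize: interpret the naive timestamp as Europe/Brussels local time.
# The clock time (12:00) is preserved in the printed representation.
localized = naive.replace(tzinfo=ZoneInfo("Europe/Brussels"))

# Under the hood the value is stored as UTC: Brussels is UTC+1 in January,
# so the same instant is 11:00 UTC.
as_utc = localized.astimezone(timezone.utc)
print(localized.hour, as_utc.hour)  # 12 11
```

A DST transition is where the "ambiguous or non-existing times" mentioned above appear: a clock time that occurs twice (or never) in the local zone has no single UTC mapping, so the kernel will eventually need a policy for those cases.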



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13549) [C++] Implement timestamp to date/time cast that extracts value

2021-09-09 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-13549:
-
Description: 
Change casting from timestamp to date/time to extract the value, instead of 
just truncating as we currently do (which rounds, giving incorrect answers, in 
some cases). This should also be a safe cast by default (unless you want to do 
something like cast from timestamp[ns] to time32[s] which may overflow).

This should behave like Postgres DATE/CAST(... as TIME), or Pandas 
Timestamp.date/Timestamp.time.

  was:
Add a kernel that can extract just the date or the time from a timestamp.

This should behave like Postgres DATE/CAST(... as TIME), or Pandas 
Timestamp.date/Timestamp.time.

Extracting the date appears to be doable with an unsafe cast, but it might be 
more convenient to have an explicit kernel (and an unsafe cast, at least in the 
Python bindings, disables all checks, not just the check we care about).


> [C++] Implement timestamp to date/time cast that extracts value
> ---
>
> Key: ARROW-13549
> URL: https://issues.apache.org/jira/browse/ARROW-13549
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Change casting from timestamp to date/time to extract the value, instead of 
> just truncating as we currently do (which rounds, giving incorrect answers, 
> in some cases). This should also be a safe cast by default (unless you want 
> to do something like cast from timestamp[ns] to time32[s] which may overflow).
> This should behave like Postgres DATE/CAST(... as TIME), or Pandas 
> Timestamp.date/Timestamp.time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13549) [C++] Implement timestamp to date/time cast that extracts value

2021-09-09 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-13549:
-
Summary: [C++] Implement timestamp to date/time cast that extracts value  
(was: [C++] Implement kernel to extract date/time from timestamp)

> [C++] Implement timestamp to date/time cast that extracts value
> ---
>
> Key: ARROW-13549
> URL: https://issues.apache.org/jira/browse/ARROW-13549
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Add a kernel that can extract just the date or the time from a timestamp.
> This should behave like Postgres DATE/CAST(... as TIME), or Pandas 
> Timestamp.date/Timestamp.time.
> Extracting the date appears to be doable with an unsafe cast, but it might be 
> more convenient to have an explicit kernel (and an unsafe cast, at least in 
> the Python bindings, disables all checks, not just the check we care about).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-10213) [C++] Temporal cast from timestamp to date rounds instead of extracting date component

2021-09-09 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li closed ARROW-10213.

Fix Version/s: 6.0.0
 Assignee: David Li
   Resolution: Duplicate

In ARROW-13549 we're implementing a timestamp->date cast which extracts the 
components instead of truncating. Closing this as a duplicate.

> [C++] Temporal cast from timestamp to date rounds instead of extracting date 
> component
> --
>
> Key: ARROW-10213
> URL: https://issues.apache.org/jira/browse/ARROW-10213
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.1
>Reporter: David Li
>Assignee: David Li
>Priority: Minor
> Fix For: 6.0.0
>
>
> I'd expect this code to give 1950-01-01 twice (i.e. a timestamp -> date cast 
> extracts the date component, ignoring the time component):
> {code:python}
> import datetime
> import pyarrow as pa
> arr = pa.array([
> datetime.datetime(1950, 1, 1, 0, 0, 0),
> datetime.datetime(1950, 1, 1, 12, 0, 0),
> ], type=pa.timestamp("ns"))
> print(arr)
> print(arr.cast(pa.date32(), safe=False)) {code}
> However it gives 1950-01-02 in the second case:
> {noformat}
> [
>   1950-01-01 00:00:00.0,
>   1950-01-01 12:00:00.0
> ]
> [
>   1950-01-01,
>   1950-01-02
> ]
> {noformat}
> The reason is that the temporal cast simply divides, and C truncates towards 
> 0 (note: Python truncates towards -Infinity, so it would give the right 
> answer in this case!), resulting in -7304 days instead of -7305.
> Depending on the intended semantics of a temporal cast, either it should be 
> fixed to extract the date component, or the rounding behavior should be noted 
> and a separate kernel should be implemented for extracting the date component.
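The rounding problem described above comes down to the direction of integer division, and can be checked with a few lines of Python (illustrative arithmetic only, not Arrow code): 1950-01-01 12:00 is -7304.5 days before the epoch, so C-style truncation toward zero gives -7304 (printed as 1950-01-02) while flooring toward -Infinity gives the expected -7305 (1950-01-01).

```python
from datetime import datetime

epoch = datetime(1970, 1, 1)
ts = datetime(1950, 1, 1, 12, 0, 0)

seconds = int((ts - epoch).total_seconds())  # negative: before the epoch

# C integer division truncates toward zero (modeled here with int() on the
# float quotient); Python's // floors toward -Infinity.
c_style_days = int(seconds / 86400)
floor_days = seconds // 86400

print(c_style_days, floor_days)  # -7304 -7305
```

For non-negative timestamps the two divisions agree, which is why the bug only shows up for pre-epoch values.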



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-13961) [C++] iso_calendar may be uninitialized

2021-09-09 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li reassigned ARROW-13961:


Assignee: David Li

> [C++] iso_calendar may be uninitialized
> ---
>
> Key: ARROW-13961
> URL: https://issues.apache.org/jira/browse/ARROW-13961
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Neal Richardson
>Assignee: David Li
>Priority: Major
> Fix For: 6.0.0
>
>
> {code}
> /arrow/cpp/src/arrow/scalar.h:137:64: warning: ‘*((void*)& iso_calendar +8)’ 
> may be used uninitialized in this function [-Wmaybe-uninitialized]
>   137 |   : PrimitiveScalarBase(std::move(type), true), value(value) {}
>   |^
> In file included from 
> /tmp/RtmpoS4YCn/file8773f4430f/src/arrow/CMakeFiles/arrow_objlib.dir/Unity/unity_17_cxx.cxx:7:
> /arrow/cpp/src/arrow/compute/kernels/scalar_temporal.cc:697:30: note: 
> ‘*((void*)& iso_calendar +8)’ was declared here
>   697 |   std::array iso_calendar;
> {code}
> fyi [~rokm]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-13960) [C++] Fix use of non-const references in temporal kernels

2021-09-09 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li reassigned ARROW-13960:


Assignee: David Li

> [C++] Fix use of non-const references in temporal kernels
> -
>
> Key: ARROW-13960
> URL: https://issues.apache.org/jira/browse/ARROW-13960
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: kernel
>
> See final comments in https://github.com/apache/arrow/pull/11075



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-13958) [Python] Migrate Python ORC bindings to use new Result-based APIs

2021-09-09 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-13958:
-

Assignee: Joris Van den Bossche

> [Python] Migrate Python ORC bindings to use new Result-based APIs
> -
>
> Key: ARROW-13958
> URL: https://issues.apache.org/jira/browse/ARROW-13958
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Needed follow-up on ARROW-13793 (currently compiling pyarrow gives 
> deprecation warnings about it)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13958) [Python] Migrate Python ORC bindings to use new Result-based APIs

2021-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13958:
---
Labels: pull-request-available  (was: )

> [Python] Migrate Python ORC bindings to use new Result-based APIs
> -
>
> Key: ARROW-13958
> URL: https://issues.apache.org/jira/browse/ARROW-13958
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Needed follow-up on ARROW-13793 (currently compiling pyarrow gives 
> deprecation warnings about it)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13959) [R] Update tests for extracting components from date32 objects

2021-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13959:
---
Labels: pull-request-available  (was: )

> [R] Update tests for extracting components from date32 objects 
> ---
>
> Key: ARROW-13959
> URL: https://issues.apache.org/jira/browse/ARROW-13959
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The R tests implemented in the PR which adds C++ functionality for extracting 
> components from date32 objects don't compare Arrow dplyr code with R dplyr 
> code - these tests should be updated to do so.
> https://github.com/apache/arrow/commit/4b5ed4eb5583cf24d8daff05a865c8d1cb616576#diff-1cddae31f0151681f8b551bf834f1b9fb5ebac6061521efc70084b5057f7



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13961) [C++] iso_calendar may be uninitialized

2021-09-09 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412576#comment-17412576
 ] 

David Li commented on ARROW-13961:
--

I just merged something here from [~aucahuasi] that may be relevant. (I could 
take care of this when I go back and fix ARROW-13960 as well since it's all 
around the same code.)

> [C++] iso_calendar may be uninitialized
> ---
>
> Key: ARROW-13961
> URL: https://issues.apache.org/jira/browse/ARROW-13961
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 6.0.0
>
>
> {code}
> /arrow/cpp/src/arrow/scalar.h:137:64: warning: ‘*((void*)& iso_calendar +8)’ 
> may be used uninitialized in this function [-Wmaybe-uninitialized]
>   137 |   : PrimitiveScalarBase(std::move(type), true), value(value) {}
>   |^
> In file included from 
> /tmp/RtmpoS4YCn/file8773f4430f/src/arrow/CMakeFiles/arrow_objlib.dir/Unity/unity_17_cxx.cxx:7:
> /arrow/cpp/src/arrow/compute/kernels/scalar_temporal.cc:697:30: note: 
> ‘*((void*)& iso_calendar +8)’ was declared here
>   697 |   std::array iso_calendar;
> {code}
> fyi [~rokm]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13962) [R] Catch up on the NEWS

2021-09-09 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-13962:
---

 Summary: [R] Catch up on the NEWS
 Key: ARROW-13962
 URL: https://issues.apache.org/jira/browse/ARROW-13962
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 6.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13961) [C++] iso_calendar may be uninitialized

2021-09-09 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-13961:
---

 Summary: [C++] iso_calendar may be uninitialized
 Key: ARROW-13961
 URL: https://issues.apache.org/jira/browse/ARROW-13961
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Neal Richardson
 Fix For: 6.0.0


{code}
/arrow/cpp/src/arrow/scalar.h:137:64: warning: ‘*((void*)& iso_calendar +8)’ 
may be used uninitialized in this function [-Wmaybe-uninitialized]
  137 |   : PrimitiveScalarBase(std::move(type), true), value(value) {}
  |^
In file included from 
/tmp/RtmpoS4YCn/file8773f4430f/src/arrow/CMakeFiles/arrow_objlib.dir/Unity/unity_17_cxx.cxx:7:
/arrow/cpp/src/arrow/compute/kernels/scalar_temporal.cc:697:30: note: 
‘*((void*)& iso_calendar +8)’ was declared here
  697 |   std::array iso_calendar;
{code}

fyi [~rokm]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-13959) [R] Update tests for extracting components from date32 objects

2021-09-09 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane reassigned ARROW-13959:


Assignee: Nicola Crane

> [R] Update tests for extracting components from date32 objects 
> ---
>
> Key: ARROW-13959
> URL: https://issues.apache.org/jira/browse/ARROW-13959
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Major
>
> The R tests implemented in the PR which adds C++ functionality for extracting 
> components from date32 objects don't compare Arrow dplyr code with R dplyr 
> code - these tests should be updated to do so.
> https://github.com/apache/arrow/commit/4b5ed4eb5583cf24d8daff05a865c8d1cb616576#diff-1cddae31f0151681f8b551bf834f1b9fb5ebac6061521efc70084b5057f7





[jira] [Updated] (ARROW-13940) [R] Turn on multithreading with Arrow engine queries

2021-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13940:
---
Labels: pull-request-available  (was: )

> [R] Turn on multithreading with Arrow engine queries
> 
>
> Key: ARROW-13940
> URL: https://issues.apache.org/jira/browse/ARROW-13940
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Since ARROW-13740 was merged, we're seeing dataset queries take close to 2x 
> longer on conbench 
> https://conbench.ursa.dev/benchmarks/e54ae362090b4a868bee929d45936400/
> I'm also seeing only one core utilized when running queries locally as well.





[jira] [Updated] (ARROW-13940) [R] Turn on multithreading with Arrow engine queries

2021-09-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13940:

Summary: [R] Turn on multithreading with Arrow engine queries  (was: [R] 
Multi-threading with Arrow engine queries?)

> [R] Turn on multithreading with Arrow engine queries
> 
>
> Key: ARROW-13940
> URL: https://issues.apache.org/jira/browse/ARROW-13940
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Neal Richardson
>Priority: Major
>
> Since ARROW-13740 was merged, we're seeing dataset queries take close to 2x 
> longer on conbench 
> https://conbench.ursa.dev/benchmarks/e54ae362090b4a868bee929d45936400/
> I'm also seeing only one core utilized when running queries locally as well.





[jira] [Created] (ARROW-13960) [C++] Fix use of non-const references in temporal kernels

2021-09-09 Thread David Li (Jira)
David Li created ARROW-13960:


 Summary: [C++] Fix use of non-const references in temporal kernels
 Key: ARROW-13960
 URL: https://issues.apache.org/jira/browse/ARROW-13960
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: David Li


See final comments in https://github.com/apache/arrow/pull/11075





[jira] [Created] (ARROW-13959) [R] Update tests for extracting components from date32 objects

2021-09-09 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-13959:


 Summary: [R] Update tests for extracting components from date32 
objects 
 Key: ARROW-13959
 URL: https://issues.apache.org/jira/browse/ARROW-13959
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


The R tests implemented in the PR which adds C++ functionality for extracting 
components from date32 objects don't compare Arrow dplyr code with R dplyr code 
- these tests should be updated to do so.

https://github.com/apache/arrow/commit/4b5ed4eb5583cf24d8daff05a865c8d1cb616576#diff-1cddae31f0151681f8b551bf834f1b9fb5ebac6061521efc70084b5057f7







[jira] [Assigned] (ARROW-13940) [R] Multi-threading with Arrow engine queries?

2021-09-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-13940:
---

Assignee: Neal Richardson

> [R] Multi-threading with Arrow engine queries?
> --
>
> Key: ARROW-13940
> URL: https://issues.apache.org/jira/browse/ARROW-13940
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Neal Richardson
>Priority: Major
>
> Since ARROW-13740 was merged, we're seeing dataset queries take close to 2x 
> longer on conbench 
> https://conbench.ursa.dev/benchmarks/e54ae362090b4a868bee929d45936400/
> I'm also seeing only one core utilized when running queries locally as well.





[jira] [Updated] (ARROW-13842) [C++] Bump vendored date library version

2021-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13842:
---
Labels: pull-request-available  (was: )

> [C++] Bump vendored date library version
> 
>
> Key: ARROW-13842
> URL: https://issues.apache.org/jira/browse/ARROW-13842
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This fix: [https://github.com/HowardHinnant/date/issues/696]
> should let us always re-enable this test:
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/pretty_print_test.cc#L454-L466





[jira] [Commented] (ARROW-13957) [C++] Make Windows S3FileSystem/Minio tests more reliable

2021-09-09 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412547#comment-17412547
 ] 

Antoine Pitrou commented on ARROW-13957:


One possibility would be to generate a different bucket name for each test, to 
make them more independent of each other.

> [C++] Make Windows S3FileSystem/Minio tests more reliable
> -
>
> Key: ARROW-13957
> URL: https://issues.apache.org/jira/browse/ARROW-13957
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Priority: Major
>
> [Example 
> log|https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/40696885/job/5t25hl7biwxdipe9]
> {noformat}
> [ RUN  ] TestS3FS.FileSystemFromUri
> WARNING: maximum file descriptor limit 0 is too low for production servers. 
> At least 4096 is recommended. Fix with "ulimit -n 4096"
> C:/projects/arrow/cpp/src/arrow/filesystem/s3fs_test.cc(387): error: Failed
> 'OutcomeToStatus(client_->CreateBucket(req))' failed with IOError: AWS Error 
> [code 130]: Your previous request to create the named bucket succeeded and 
> you already own it.
> C:/projects/arrow/cpp/src/arrow/util/io_util.cc:1523: When trying to delete 
> temporary directory: IOError: Cannot delete directory entry 
> 'C:/Users/appveyor/AppData/Local/Temp/1/s3fs-test-s6295hb6/.minio.sys/tmp/3cb9aaa7-6716-4c53-a30e-c2348f122148':
>  . Detail: [Windows error 145] The directory is not empty.
> [  FAILED  ] TestS3FS.FileSystemFromUri (7172 ms)
> [ RUN  ] TestS3FS.CustomRetryStrategy
> WARNING: maximum file descriptor limit 0 is too low for production servers. 
> At least 4096 is recommended. Fix with "ulimit -n 4096"
> C:/projects/arrow/cpp/src/arrow/util/io_util.cc:1523: When trying to delete 
> temporary directory: IOError: Cannot delete directory entry 
> 'C:/Users/appveyor/AppData/Local/Temp/1/s3fs-test-wm32qa0y/.minio.sys': . 
> Detail: [Windows error 145] The directory is not empty.
> [   OK ] TestS3FS.CustomRetryStrategy (814 ms)
> [--] 23 tests from TestS3FS (51710 ms total) {noformat}
> The tests are quite slow, and it seems in part because the bucket is being 
> recreated/deleted on every test; also because some things seem to be 
> eventually consistent(?) so we aren't cleaning files up properly.
> It would also be nice here if the error from CreateBucket contained the 
> bucket name.





[jira] [Assigned] (ARROW-13842) [C++] Bump vendored date library version

2021-09-09 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-13842:
--

Assignee: Antoine Pitrou

> [C++] Bump vendored date library version
> 
>
> Key: ARROW-13842
> URL: https://issues.apache.org/jira/browse/ARROW-13842
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
> Fix For: 6.0.0
>
>
> This fix: [https://github.com/HowardHinnant/date/issues/696]
> should let us always re-enable this test:
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/pretty_print_test.cc#L454-L466





[jira] [Created] (ARROW-13958) [Python] Migrate Python ORC bindings to use new Result-based APIs

2021-09-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-13958:
-

 Summary: [Python] Migrate Python ORC bindings to use new 
Result-based APIs
 Key: ARROW-13958
 URL: https://issues.apache.org/jira/browse/ARROW-13958
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 6.0.0


Needed follow-up on ARROW-13793 (currently compiling pyarrow gives deprecation 
warnings about it)





[jira] [Resolved] (ARROW-13138) [C++] Implement kernel to extract datetime components (year, month, day, etc) from date type objects

2021-09-09 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-13138.
--
Resolution: Fixed

Issue resolved by pull request 11075
[https://github.com/apache/arrow/pull/11075]

> [C++] Implement kernel to extract datetime components (year, month, day, etc) 
> from date type objects
> 
>
> Key: ARROW-13138
> URL: https://issues.apache.org/jira/browse/ARROW-13138
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Nicola Crane
>Assignee: Percy Camilo Triveño Aucahuasi
>Priority: Major
>  Labels: Kernels, kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> ARROW-11759 implemented extraction of datetime components for timestamp 
> objects; please can we have the equivalent extraction functions implemented 
> for date objects too?





[jira] [Created] (ARROW-13957) [C++] Make Windows S3FileSystem/Minio tests more reliable

2021-09-09 Thread David Li (Jira)
David Li created ARROW-13957:


 Summary: [C++] Make Windows S3FileSystem/Minio tests more reliable
 Key: ARROW-13957
 URL: https://issues.apache.org/jira/browse/ARROW-13957
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: David Li


[Example 
log|https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/40696885/job/5t25hl7biwxdipe9]
{noformat}
[ RUN  ] TestS3FS.FileSystemFromUri
WARNING: maximum file descriptor limit 0 is too low for production servers. At 
least 4096 is recommended. Fix with "ulimit -n 4096"
C:/projects/arrow/cpp/src/arrow/filesystem/s3fs_test.cc(387): error: Failed
'OutcomeToStatus(client_->CreateBucket(req))' failed with IOError: AWS Error 
[code 130]: Your previous request to create the named bucket succeeded and you 
already own it.
C:/projects/arrow/cpp/src/arrow/util/io_util.cc:1523: When trying to delete 
temporary directory: IOError: Cannot delete directory entry 
'C:/Users/appveyor/AppData/Local/Temp/1/s3fs-test-s6295hb6/.minio.sys/tmp/3cb9aaa7-6716-4c53-a30e-c2348f122148':
 . Detail: [Windows error 145] The directory is not empty.
[  FAILED  ] TestS3FS.FileSystemFromUri (7172 ms)
[ RUN  ] TestS3FS.CustomRetryStrategy
WARNING: maximum file descriptor limit 0 is too low for production servers. At 
least 4096 is recommended. Fix with "ulimit -n 4096"
C:/projects/arrow/cpp/src/arrow/util/io_util.cc:1523: When trying to delete 
temporary directory: IOError: Cannot delete directory entry 
'C:/Users/appveyor/AppData/Local/Temp/1/s3fs-test-wm32qa0y/.minio.sys': . 
Detail: [Windows error 145] The directory is not empty.
[   OK ] TestS3FS.CustomRetryStrategy (814 ms)
[--] 23 tests from TestS3FS (51710 ms total) {noformat}
The tests are quite slow, and it seems in part because the bucket is being 
recreated/deleted on every test; also because some things seem to be eventually 
consistent(?) so we aren't cleaning files up properly.

It would also be nice here if the error from CreateBucket contained the bucket 
name.





[jira] [Resolved] (ARROW-13680) [C++] Create an asynchronous nursery to simplify capture logic

2021-09-09 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-13680.
--
Resolution: Fixed

Issue resolved by pull request 11084
[https://github.com/apache/arrow/pull/11084]

> [C++] Create an asynchronous nursery to simplify capture logic
> --
>
> Key: ARROW-13680
> URL: https://issues.apache.org/jira/browse/ARROW-13680
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: pull-request-available, query-engine
> Fix For: 6.0.0
>
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> The asynchronous nursery manages a set of asynchronous tasks and objects.  
> The nursery will not exit until all of those tasks have finished.  This 
> allows one to safely capture fields for asynchronous callbacks.





[jira] [Commented] (ARROW-13882) [C++] Add compute function min_max support for more types

2021-09-09 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412530#comment-17412530
 ] 

David Li commented on ARROW-13882:
--

The linked PR already adds string support too (for the scalar aggregate only, 
not for the hash aggregate which I want to split out).

I don't think std::minmax is relevant.

> [C++] Add compute function min_max support for more types
> -
>
> Key: ARROW-13882
> URL: https://issues.apache.org/jira/browse/ARROW-13882
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Assignee: David Li
>Priority: Major
>  Labels: kernel, pull-request-available, types
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The min_max compute function does not support the following types but should:
>  - time32
>  - time64
>  - timestamp
>  - null
>  - binary
>  - large_binary
>  - fixed_size_binary
>  - string
>  - large_string





[jira] [Updated] (ARROW-12755) [C++][Compute] Add quotient and modulo kernels

2021-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12755:
---
Labels: beginner kernel pull-request-available  (was: beginner kernel)

> [C++][Compute] Add quotient and modulo kernels
> --
>
> Key: ARROW-12755
> URL: https://issues.apache.org/jira/browse/ARROW-12755
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ian Cook
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: beginner, kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Add a pair of binary kernels to compute the:
>  * quotient (result after division, discarding any fractional part, a.k.a 
> integer division)
>  * mod or modulo (remainder after division, a.k.a {{%}} / {{%%}} / modulus).
> The returned array should have the same data type as the input arrays or 
> promote to an appropriate type to avoid loss of precision if the input types 
> differ.





[jira] [Closed] (ARROW-13898) [C++][Compute] Add support for string binary transforms

2021-09-09 Thread Eduardo Ponce (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eduardo Ponce closed ARROW-13898.
-
Resolution: Abandoned

> [C++][Compute] Add support for string binary transforms
> ---
>
> Key: ARROW-13898
> URL: https://issues.apache.org/jira/browse/ARROW-13898
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Eduardo Ponce
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Add kernel exec generator for string binary functions (similar to 
> StringTransformExecBase) that always expect the first parameter to be of 
> string type and the second parameter is generic. It should also support 
> scalar and array inputs for the following common cases:
> * Scalar, Scalar
> * Array, Scalar - scalar is broadcasted and paired with all values from array
> * Array, Array - process arrays element-wise
> The Scalar, Array case is not necessary as it is difficult to generalize, and 
> there are not many functions with this pattern.





[jira] [Commented] (ARROW-13939) how to do resampling of arrow table using cython

2021-09-09 Thread krishna deepak (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412481#comment-17412481
 ] 

krishna deepak commented on ARROW-13939:


Hi Will,

Yes, resampling a timeseries table, e.g. from 1-minute buckets to 5-minute buckets. Same as dataframe.resample. Does Arrow provide this functionality already?

Also, how should I go about iterating the table? From the documentation, all I could find is the *Slice* function, which does not feel like a proper iterator. Is there a better way to iterate?

By "build arrays", do you mean Arrow chunked arrays, array builders, or C++ vectors?

Thanks

> how to do resampling of arrow table using cython
> 
>
> Key: ARROW-13939
> URL: https://issues.apache.org/jira/browse/ARROW-13939
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: krishna deepak
>Priority: Minor
>
> Please can someone point me to resources on how to write resampling code in 
> Cython for an Arrow table.
>  # Will iterating the whole table be slow in Cython?
>  # Which is the best structure to append new elements to? Is there a way to 
> create an empty table of the same schema and keep appending to it, or should I 
> use vectors/lists and then pass them to create a table?
> Performance is very important for me. Any help is highly appreciated.





[jira] [Assigned] (ARROW-10415) [R] Support for dplyr::distinct()

2021-09-09 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane reassigned ARROW-10415:


Assignee: Nicola Crane

> [R] Support for dplyr::distinct()
> -
>
> Key: ARROW-10415
> URL: https://issues.apache.org/jira/browse/ARROW-10415
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: R
>Affects Versions: 2.0.0
>Reporter: Christian M
>Assignee: Nicola Crane
>Priority: Minor
>  Labels: dplyr, query-engine
> Fix For: 6.0.0
>
> Attachments: image-2020-10-28-15-01-54-198.png
>
>
> It would be nice if dplyr::distinct worked with arrow tables: 
>  
> !image-2020-10-28-15-01-54-198.png!





[jira] [Updated] (ARROW-12084) [C++][Compute] Add remainder and quotient compute::Function

2021-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12084:
---
Labels: kernel pull-request-available  (was: kernel)

> [C++][Compute] Add remainder and quotient compute::Function
> ---
>
> Key: ARROW-12084
> URL: https://issues.apache.org/jira/browse/ARROW-12084
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: kernel, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In addition to {{divide}} which returns only the quotient, it'd be useful to 
> have a function which returns both quotient and remainder (these are 
> efficient to compute simultaneously), probably as a {{struct<quotient: T, remainder: T>}}. 





[jira] [Commented] (ARROW-13954) [Python] Extend type testing to supply scalars

2021-09-09 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412420#comment-17412420
 ] 

Antoine Pitrou commented on ARROW-13954:


Can you update these JIRAs to be more precise about what "type testing" is? 
Right now I'm thinking {{test_types.py}}.

> [Python] Extend type testing to supply scalars
> --
>
> Key: ARROW-13954
> URL: https://issues.apache.org/jira/browse/ARROW-13954
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: kernel, query-engine
>
> The current type testing passes in all arguments as arrays.  We should extend 
> it to account for cases where an argument is allowed to be a scalar.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13956) [C++] Add a RETURN_NOT_OK_ELSE_WITH_STATUS macro to support changing the Status

2021-09-09 Thread Junwang Zhao (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412386#comment-17412386
 ] 

Junwang Zhao commented on ARROW-13956:
--

Hi [~apitrou], could you please review this issue? I came up with it when I saw 
this kind of usage in [0], and thought it might be better if we provided this 
macro.

[0]: 
https://github.com/apache/arrow/pull/10991/files/5ca6aa26b84d9fd89384032383c36bd48259e60b#

> [C++] Add a RETURN_NOT_OK_ELSE_WITH_STATUS macro to support changing the 
> Status
> ---
>
> Key: ARROW-13956
> URL: https://issues.apache.org/jira/browse/ARROW-13956
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Junwang Zhao
>Assignee: Junwang Zhao
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As Result is encouraged to be used, it would be better to supply a macro to 
> change the internal Status. We could do this with RETURN_NOT_OK_ELSE; 
> one example is:
>  
> {code:java}
> auto reader = arrow::adapters::orc::ORCFileReader::Open(input, pool);
> RETURN_NOT_OK_ELSE(reader, _s.WithMessage("Could not open ORC input source 
> '", source.path(), "': ", _s.message()));
> return reader;
> {code}
> but this is ugly, since it relies on the macro-internal _s.
>  
> Recommended fix:
> Add a macro RETURN_NOT_OK_ELSE_WITH_STATUS:
> {code:java}
> // Use this when you want to change the status in the else_ expression
> #define RETURN_NOT_OK_ELSE_WITH_STATUS(s, _s, else_)            \
>   do {                                                          \
>     ::arrow::Status _s = ::arrow::internal::GenericToStatus(s); \
>     if (!_s.ok()) {                                             \
>       else_;                                                    \
>       return _s;                                                \
>     }                                                           \
>   } while (false)
> {code}
> And the following statements are more natural
>  
> {code:java}
> auto reader = arrow::adapters::orc::ORCFileReader::Open(input, pool);
> RETURN_NOT_OK_ELSE_WITH_STATUS(reader, status, status.WithMessage("Could not 
> open ORC input source '",  source.path(), "': ", status.message()));
> return reader;
> {code}





[jira] [Updated] (ARROW-13956) [C++] Add a RETURN_NOT_OK_ELSE_WITH_STATUS macro to support changing the Status

2021-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13956:
---
Labels: pull-request-available  (was: )

> [C++] Add a RETURN_NOT_OK_ELSE_WITH_STATUS macro to support changing the 
> Status
> ---
>
> Key: ARROW-13956
> URL: https://issues.apache.org/jira/browse/ARROW-13956
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Junwang Zhao
>Assignee: Junwang Zhao
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As Result is encouraged to be used, it would be better to supply a macro to 
> change the internal Status. We could do this with RETURN_NOT_OK_ELSE; 
> one example is:
>  
> {code:java}
> auto reader = arrow::adapters::orc::ORCFileReader::Open(input, pool);
> RETURN_NOT_OK_ELSE(reader, _s.WithMessage("Could not open ORC input source 
> '", source.path(), "': ", _s.message()));
> return reader;
> {code}
> but this is ugly, since it relies on the macro-internal _s.
>  
> Recommended fix:
> Add a macro RETURN_NOT_OK_ELSE_WITH_STATUS:
> {code:java}
> // Use this when you want to change the status in the else_ expression
> #define RETURN_NOT_OK_ELSE_WITH_STATUS(s, _s, else_)            \
>   do {                                                          \
>     ::arrow::Status _s = ::arrow::internal::GenericToStatus(s); \
>     if (!_s.ok()) {                                             \
>       else_;                                                    \
>       return _s;                                                \
>     }                                                           \
>   } while (false)
> {code}
> And the following statements are more natural
>  
> {code:java}
> auto reader = arrow::adapters::orc::ORCFileReader::Open(input, pool);
> RETURN_NOT_OK_ELSE_WITH_STATUS(reader, status, status.WithMessage("Could not 
> open ORC input source '",  source.path(), "': ", status.message()));
> return reader;
> {code}




