[jira] [Updated] (ARROW-13592) How to use "-D_GLIBCXX_USE_CXX11_ABI=0" correctly in arrow

2021-08-09 Thread wangdapeng (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangdapeng updated ARROW-13592:
---
Summary: How to use "-D_GLIBCXX_USE_CXX11_ABI=0" correctly in arrow  (was: 
How to compile arrow with "-D_GLIBCXX_USE_CXX11_ABI=0")

> How to use "-D_GLIBCXX_USE_CXX11_ABI=0" correctly in arrow
> --
>
> Key: ARROW-13592
> URL: https://issues.apache.org/jira/browse/ARROW-13592
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 5.0.0
>Reporter: wangdapeng
>Priority: Blocker
>
> env: GCC 7.5 cmake 3.16
> Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, and linking failed with 
> the errors "file was not recognized" and "file was truncated".
> commands:
> cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON 
> -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON 
> -DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0'
>  make arrow
>  make parquet



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13592) How to compile arrow with "-D_GLIBCXX_USE_CXX11_ABI=0"

2021-08-09 Thread wangdapeng (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangdapeng updated ARROW-13592:
---
Description: 
env: GCC 7.5 cmake 3.16

Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, and linking failed with 
the errors "file was not recognized" and "file was truncated".

commands:

cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON 
-DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON 
-DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0'
 make arrow
 make parquet

  was:
Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, the link error : file was 
not recognized, and the file was truncated  

commands:

cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON 
-DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON 
-DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0'
 make arrow
 make parquet


> How to compile arrow with "-D_GLIBCXX_USE_CXX11_ABI=0"
> --
>
> Key: ARROW-13592
> URL: https://issues.apache.org/jira/browse/ARROW-13592
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 5.0.0
>Reporter: wangdapeng
>Priority: Blocker
>
> env: GCC 7.5 cmake 3.16
> Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, and linking failed with 
> the errors "file was not recognized" and "file was truncated".
> commands:
> cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON 
> -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON 
> -DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0'
>  make arrow
>  make parquet





[jira] [Resolved] (ARROW-13585) [GLib] Add support for C ABI interface

2021-08-09 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-13585.
--
Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 10900
[https://github.com/apache/arrow/pull/10900]

> [GLib] Add support for C ABI interface
> --
>
> Key: ARROW-13585
> URL: https://issues.apache.org/jira/browse/ARROW-13585
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-13592) How to compile arrow with "-D_GLIBCXX_USE_CXX11_ABI=0"

2021-08-09 Thread wangdapeng (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangdapeng updated ARROW-13592:
---
Description: 
Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, and linking failed with 
the errors "file was not recognized" and "file was truncated".

commands:

cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON 
-DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON 
-DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0'
 make arrow
 make parquet

  was:
Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, the link error : file was 
not recognized, and the file was truncated  

command:

cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON 
-DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON 
-DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0'
make arrow
make parquet


> How to compile arrow with "-D_GLIBCXX_USE_CXX11_ABI=0"
> --
>
> Key: ARROW-13592
> URL: https://issues.apache.org/jira/browse/ARROW-13592
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 5.0.0
>Reporter: wangdapeng
>Priority: Blocker
>
> Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, and linking failed with 
> the errors "file was not recognized" and "file was truncated".
> commands:
> cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON 
> -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON 
> -DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0'
>  make arrow
>  make parquet





[jira] [Updated] (ARROW-13592) How to compile arrow with "-D_GLIBCXX_USE_CXX11_ABI=0"

2021-08-09 Thread wangdapeng (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangdapeng updated ARROW-13592:
---
Description: 
Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, and linking failed with 
the errors "file was not recognized" and "file was truncated".

command:

cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON 
-DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON 
-DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0'
make arrow
make parquet

  was:Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, the link error : file 
was not recognized, and the file was truncated  


> How to compile arrow with "-D_GLIBCXX_USE_CXX11_ABI=0"
> --
>
> Key: ARROW-13592
> URL: https://issues.apache.org/jira/browse/ARROW-13592
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 5.0.0
>Reporter: wangdapeng
>Priority: Blocker
>
> Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, and linking failed with 
> the errors "file was not recognized" and "file was truncated".
> command:
> cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON 
> -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON 
> -DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0'
> make arrow
> make parquet





[jira] [Created] (ARROW-13592) How to compile arrow with "-D_GLIBCXX_USE_CXX11_ABI=0"

2021-08-09 Thread wangdapeng (Jira)
wangdapeng created ARROW-13592:
--

 Summary: How to compile arrow with "-D_GLIBCXX_USE_CXX11_ABI=0"
 Key: ARROW-13592
 URL: https://issues.apache.org/jira/browse/ARROW-13592
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 5.0.0
Reporter: wangdapeng


Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, and linking failed with 
the errors "file was not recognized" and "file was truncated".





[jira] [Commented] (ARROW-13240) [C++][Parquet] Page statistics not written in v2?

2021-08-09 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396431#comment-17396431
 ] 

Micah Kornfield commented on ARROW-13240:
-

I think [~westonpace] fixed this while fixing statistics for dictionaries.

> [C++][Parquet] Page statistics not written in v2?
> -
>
> Key: ARROW-13240
> URL: https://issues.apache.org/jira/browse/ARROW-13240
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jorge Leitão
>Priority: Minor
>
> While working in integration tests of parquet2 against pyarrow, I noticed 
> that page statistics are only written by pyarrow when using version 1.
> I do not have an easy way to reproduce this within pyarrow as I am not sure 
> how to access individual pages from a column chunk, but it is something that 
> I observe when trying to integrate.
> The row group stats are still written, this only affects page statistics.
> pyarrow call:
> ```
> pa.parquet.write_table(
> t,
> path,
> version="2.0",
> data_page_version="2.0",
> write_statistics=True,
> )
> ```
> Changing version to "1.0" does not affect this behavior, which suggests that 
> data_page_version="2.0" is the option responsible.





[jira] [Commented] (ARROW-6908) [C++] Add support for Bazel

2021-08-09 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396398#comment-17396398
 ] 

Micah Kornfield commented on ARROW-6908:


No, I think we can close for now. It was a nicer developer experience, but it 
is hard to maintain without buy-in from active developers once it is in place.

> [C++] Add support for Bazel
> ---
>
> Key: ARROW-6908
> URL: https://issues.apache.org/jira/browse/ARROW-6908
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Aryan Naraghi
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> I would like to use Arrow in a C++ project that uses Bazel.
>  
> Would it be possible to add support for building Arrow using Bazel?





[jira] [Assigned] (ARROW-13172) [Java] Make TYPE_WIDTH in Vector public

2021-08-09 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield reassigned ARROW-13172:
---

Assignee: Eduard Tudenhoefner

> [Java] Make TYPE_WIDTH in Vector public
> ---
>
> Key: ARROW-13172
> URL: https://issues.apache.org/jira/browse/ARROW-13172
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Eduard Tudenhoefner
>Assignee: Eduard Tudenhoefner
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Some Vector classes already expose TYPE_WIDTH publicly. It would be 
> helpful if all Vectors did so.





[jira] [Resolved] (ARROW-13172) [Java] Make TYPE_WIDTH in Vector public

2021-08-09 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-13172.
-
Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 10600
[https://github.com/apache/arrow/pull/10600]

> [Java] Make TYPE_WIDTH in Vector public
> ---
>
> Key: ARROW-13172
> URL: https://issues.apache.org/jira/browse/ARROW-13172
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Eduard Tudenhoefner
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Some Vector classes already expose TYPE_WIDTH publicly. It would be 
> helpful if all Vectors did so.





[jira] [Commented] (ARROW-13479) [Format] Make requirement around dense union offsets less ambiguous

2021-08-09 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396408#comment-17396408
 ] 

Micah Kornfield commented on ARROW-13479:
-

Hmm, I would have thought strictly, but I don't know the implications of 
changing it at this point.

> [Format] Make requirement around dense union offsets less ambiguous
> ---
>
> Key: ARROW-13479
> URL: https://issues.apache.org/jira/browse/ARROW-13479
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Format
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 6.0.0
>
>
> Currently, the spec states that dense union offsets for each child array must 
> be "in order / increasing". There is an ambiguity: should they be strictly 
> increasing, or are equal values supported?
> The C++ implementation currently considers that equal offsets are acceptable.
>  





[jira] [Assigned] (ARROW-13506) Upgrade ORC to 1.6.9

2021-08-09 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield reassigned ARROW-13506:
---

Assignee: Dongjoon Hyun

> Upgrade ORC to 1.6.9
> 
>
> Key: ARROW-13506
> URL: https://issues.apache.org/jira/browse/ARROW-13506
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Java
>Affects Versions: 6.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Currently, C++ module uses ORC 1.6.6 and Java module uses ORC 1.5.5.
> The latest version is Apache ORC 1.6.9.





[jira] [Commented] (ARROW-3822) [C++] parquet::arrow::FileReader::GetRecordBatchReader may not iterate through chunked columns completely

2021-08-09 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396401#comment-17396401
 ] 

Micah Kornfield commented on ARROW-3822:


https://issues.apache.org/jira/browse/ARROW-11607 is the only bug I touched 
recently that seems like it might be related.

> [C++] parquet::arrow::FileReader::GetRecordBatchReader may not iterate 
> through chunked columns completely
> -
>
> Key: ARROW-3822
> URL: https://issues.apache.org/jira/browse/ARROW-3822
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> EDIT: https://github.com/apache/arrow/pull/3911#issuecomment-473679153
> We don't currently test that all data is iterated through when reading from a 
> Parquet file where the result is chunked.





[jira] [Resolved] (ARROW-6908) [C++] Add support for Bazel

2021-08-09 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6908.

Resolution: Won't Fix

> [C++] Add support for Bazel
> ---
>
> Key: ARROW-6908
> URL: https://issues.apache.org/jira/browse/ARROW-6908
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Aryan Naraghi
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> I would like to use Arrow in a C++ project that uses Bazel.
>  
> Would it be possible to add support for building Arrow using Bazel?





[jira] [Commented] (ARROW-11828) [C++] Expose CSVWriter object in api

2021-08-09 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396397#comment-17396397
 ] 

Micah Kornfield commented on ARROW-11828:
-

This was fixed as part of David Li's work to expose this for datasets.

> [C++] Expose CSVWriter object in api
> 
>
> Key: ARROW-11828
> URL: https://issues.apache.org/jira/browse/ARROW-11828
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>
> Based on feedback from initial CSV PR this is likely the preferred API.





[jira] [Resolved] (ARROW-11828) [C++] Expose CSVWriter object in api

2021-08-09 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-11828.
-
Fix Version/s: 6.0.0
   Resolution: Fixed

> [C++] Expose CSVWriter object in api
> 
>
> Key: ARROW-11828
> URL: https://issues.apache.org/jira/browse/ARROW-11828
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
> Fix For: 6.0.0
>
>
> Based on feedback from initial CSV PR this is likely the preferred API.





[jira] [Created] (ARROW-13591) [C++] Make index kernel work in exec plans

2021-08-09 Thread David Li (Jira)
David Li created ARROW-13591:


 Summary: [C++] Make index kernel work in exec plans
 Key: ARROW-13591
 URL: https://issues.apache.org/jira/browse/ARROW-13591
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: David Li
 Fix For: 6.0.0


As written, it gives the wrong result because it's order-dependent. Needs 
ARROW-13540 for ordering.





[jira] [Commented] (ARROW-12873) [C++][Compute] Support tagging ExecBatches with arbitrary extra information

2021-08-09 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396329#comment-17396329
 ] 

David Li commented on ARROW-12873:
--

I think so, but I still think this is very close to the original proposal, just 
using Datums in place of void*. Also, this requires nodes to know which metadata 
the kernels can make use of (which I guess I argued they must know anyway), but 
it might not be as clean as passing it implicitly as part of the ExecBatch 
(though implicit may not be what we want here!)

In any case: my main argument is that I think passing metadata as part of the 
values in the ExecBatch and as part of arguments is likely too 
brittle/inextensible to be what we want.

> [C++][Compute] Support tagging ExecBatches with arbitrary extra information
> ---
>
> Key: ARROW-12873
> URL: https://issues.apache.org/jira/browse/ARROW-12873
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Major
>
> Ideally, ExecBatches could be tagged with arbitrary optional objects for 
> tracing purposes and to transmit execution hints from one ExecNode to another.
> These should *not* be explicit members like ExecBatch::selection_vector is, 
> since they may not originate from the arrow library. For an example within 
> the arrow project: {{libarrow_dataset}} will be used to produce ScanNodes and 
> a WriteNodes and it's useful to tag scanned batches with their {{Fragment}} 
> of origin. However adding {{ExecBatch::fragment}} would result in a cyclic 
> dependency.
> To facilitate this tagging capability, we would need a type erased container 
> something like
> {code}
> struct AnySet {
>   void* Get(tag_t tag);
>   void Set(tag_t tag, void* value, FnOnce destructor);
> };
> {code}





[jira] [Updated] (ARROW-13590) Ensure dataset writing applies back pressure

2021-08-09 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-13590:

Labels: query-engine  (was: )

> Ensure dataset writing applies back pressure
> 
>
> Key: ARROW-13590
> URL: https://issues.apache.org/jira/browse/ARROW-13590
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: query-engine
> Fix For: 6.0.0
>
>
> Dataset writing via exec plan (ARROW-13542) does not apply back pressure 
> currently and will take up far more RAM than it should when writing a large 
> dataset.  The node should be applying back pressure.  However, the preferred 
> back pressure method (via scheduling) will need to wait for ARROW-13576.
> Once those two are finished this can be studied in more detail.  Also, the 
> vm.dirty_ratio might be experimented with.  In theory we should be applying 
> our own back pressure and have no need of dirty pages.  In practice, it may 
> be more work than we want to tackle right now and we just let it do its thing.





[jira] [Created] (ARROW-13590) Ensure dataset writing applies back pressure

2021-08-09 Thread Weston Pace (Jira)
Weston Pace created ARROW-13590:
---

 Summary: Ensure dataset writing applies back pressure
 Key: ARROW-13590
 URL: https://issues.apache.org/jira/browse/ARROW-13590
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace
 Fix For: 6.0.0


Dataset writing via exec plan (ARROW-13542) does not apply back pressure 
currently and will take up far more RAM than it should when writing a large 
dataset.  The node should be applying back pressure.  However, the preferred 
back pressure method (via scheduling) will need to wait for ARROW-13576.

Once those two are finished this can be studied in more detail.  Also, the 
vm.dirty_ratio might be experimented with.  In theory we should be applying our 
own back pressure and have no need of dirty pages.  In practice, it may be more 
work than we want to tackle right now and we just let it do its thing.





[jira] [Assigned] (ARROW-13590) Ensure dataset writing applies back pressure

2021-08-09 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace reassigned ARROW-13590:
---

Assignee: Weston Pace

> Ensure dataset writing applies back pressure
> 
>
> Key: ARROW-13590
> URL: https://issues.apache.org/jira/browse/ARROW-13590
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
> Fix For: 6.0.0
>
>
> Dataset writing via exec plan (ARROW-13542) does not apply back pressure 
> currently and will take up far more RAM than it should when writing a large 
> dataset.  The node should be applying back pressure.  However, the preferred 
> back pressure method (via scheduling) will need to wait for ARROW-13576.
> Once those two are finished this can be studied in more detail.  Also, the 
> vm.dirty_ratio might be experimented with.  In theory we should be applying 
> our own back pressure and have no need of dirty pages.  In practice, it may 
> be more work than we want to tackle right now and we just let it do its thing.





[jira] [Updated] (ARROW-13588) [R] Empty character attributes not stored

2021-08-09 Thread Charlie Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charlie Gao updated ARROW-13588:

Description: 
Date-times in the POSIXct format have a 'tzone' attribute that by default is 
set to "", an empty character vector (not NULL) when created.

This however is not stored in the Arrow feather file. When the file is read 
back, the original and restored dataframes are not identical as per the below 
reprex.

I assume this is not intentional? My current workaround when reading back is to 
restore the empty string if the tzone attribute is absent.

Just to confirm, the attribute is stored correctly when it is not empty.

Thanks.
{code:java}
``` r
 dates <- as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-02"))
 attributes(dates)
 #> $class
 #> [1] "POSIXct" "POSIXt" 
 #> 
 #> $tzone
 #> [1] ""

 values <- c(1:3)
 original <- data.frame(dates, values)
 original
 #> dates values
 #> 1 2020-01-01 1
 #> 2 2020-01-02 2
 #> 3 2020-01-02 3

tempfile <- tempfile()
arrow::write_feather(original, tempfile)

restored <- arrow::read_feather(tempfile)

identical(original, restored)
 #> [1] FALSE

 waldo::compare(original, restored)
 #> `attr(old$dates, 'tzone')` is a character vector ('')
 #> `attr(new$dates, 'tzone')` is absent

unlink(tempfile)
 ```
{code}
 

  was:
Date-times in the POSIXct format have a 'tzone' attribute that by default is 
set to "", an empty character vector (not NULL) when created.

This however is not stored in the Arrow feather file. When the file is read 
back, the original and restored dataframes are not identical as per the below 
reprex.

I am thinking that this should not be the intention? My workaround at the 
moment is making a check when reading back to write the empty string if the 
tzone attribute does not exist.

Just to confirm, the attribute is stored correctly when it is not empty.

Thanks.
{code:java}
``` r
 dates <- as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-02"))
 attributes(dates)
 #> $class
 #> [1] "POSIXct" "POSIXt" 
 #> 
 #> $tzone
 #> [1] ""
 values <- c(1:3)
 original <- data.frame(dates, values)
 original
 #> dates values
 #> 1 2020-01-01 1
 #> 2 2020-01-02 2
 #> 3 2020-01-02 3
tempfile <- tempfile()
 arrow::write_feather(original, tempfile)
restored <- arrow::read_feather(tempfile)
identical(original, restored)
 #> [1] FALSE
 waldo::compare(original, restored)
 #> `attr(old$dates, 'tzone')` is a character vector ('')
 #> `attr(new$dates, 'tzone')` is absent
unlink(tempfile)
 ```
{code}
 


> [R] Empty character attributes not stored
> -
>
> Key: ARROW-13588
> URL: https://issues.apache.org/jira/browse/ARROW-13588
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 5.0.0
> Environment: Ubuntu 20.04 R 4.1 release
>Reporter: Charlie Gao
>Priority: Minor
>  Labels: attributes, feather
>
> Date-times in the POSIXct format have a 'tzone' attribute that by default is 
> set to "", an empty character vector (not NULL) when created.
> This however is not stored in the Arrow feather file. When the file is read 
> back, the original and restored dataframes are not identical as per the below 
> reprex.
> I am thinking that this should not be the intention? My workaround at the 
> moment is making a check when reading back to write the empty string if the 
> tzone attribute does not exist.
> Just to confirm, the attribute is stored correctly when it is not empty.
> Thanks.
> {code:java}
> ``` r
>  dates <- as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-02"))
>  attributes(dates)
>  #> $class
>  #> [1] "POSIXct" "POSIXt" 
>  #> 
>  #> $tzone
>  #> [1] ""
>  values <- c(1:3)
>  original <- data.frame(dates, values)
>  original
>  #> dates values
>  #> 1 2020-01-01 1
>  #> 2 2020-01-02 2
>  #> 3 2020-01-02 3
> tempfile <- tempfile()
> arrow::write_feather(original, tempfile)
> restored <- arrow::read_feather(tempfile)
> identical(original, restored)
>  #> [1] FALSE
>  waldo::compare(original, restored)
>  #> `attr(old$dates, 'tzone')` is a character vector ('')
>  #> `attr(new$dates, 'tzone')` is absent
> unlink(tempfile)
>  ```
> {code}
>  





[jira] [Updated] (ARROW-13588) [R] Empty character attributes not stored

2021-08-09 Thread Charlie Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charlie Gao updated ARROW-13588:

Labels: attributes feather  (was: attributes)

> [R] Empty character attributes not stored
> -
>
> Key: ARROW-13588
> URL: https://issues.apache.org/jira/browse/ARROW-13588
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 5.0.0
> Environment: Ubuntu 20.04 R 4.1 release
>Reporter: Charlie Gao
>Priority: Minor
>  Labels: attributes, feather
>
> Date-times in the POSIXct format have a 'tzone' attribute that by default is 
> set to "", an empty character vector (not NULL) when created.
> This however is not stored in the Arrow feather file. When the file is read 
> back, the original and restored dataframes are not identical as per the below 
> reprex.
> I am thinking that this should not be the intention? My workaround at the 
> moment is making a check when reading back to write the empty string if the 
> tzone attribute does not exist.
> Just to confirm, the attribute is stored correctly when it is not empty.
> Thanks.
> {code:java}
> ``` r
>  dates <- as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-02"))
>  attributes(dates)
>  #> $class
>  #> [1] "POSIXct" "POSIXt" 
>  #> 
>  #> $tzone
>  #> [1] ""
>  values <- c(1:3)
>  original <- data.frame(dates, values)
>  original
>  #> dates values
>  #> 1 2020-01-01 1
>  #> 2 2020-01-02 2
>  #> 3 2020-01-02 3
> tempfile <- tempfile()
>  arrow::write_feather(original, tempfile)
> restored <- arrow::read_feather(tempfile)
> identical(original, restored)
>  #> [1] FALSE
>  waldo::compare(original, restored)
>  #> `attr(old$dates, 'tzone')` is a character vector ('')
>  #> `attr(new$dates, 'tzone')` is absent
> unlink(tempfile)
>  ```
> {code}
>  





[jira] [Updated] (ARROW-13588) [R] Empty character attributes not stored

2021-08-09 Thread Charlie Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charlie Gao updated ARROW-13588:

Labels: attributes  (was: )

> [R] Empty character attributes not stored
> -
>
> Key: ARROW-13588
> URL: https://issues.apache.org/jira/browse/ARROW-13588
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 5.0.0
> Environment: Ubuntu 20.04 R 4.1 release
>Reporter: Charlie Gao
>Priority: Minor
>  Labels: attributes
>
> Date-times in the POSIXct format have a 'tzone' attribute that by default is 
> set to "", an empty character vector (not NULL) when created.
> This however is not stored in the Arrow feather file. When the file is read 
> back, the original and restored dataframes are not identical as per the below 
> reprex.
> I am thinking that this should not be the intention? My workaround at the 
> moment is making a check when reading back to write the empty string if the 
> tzone attribute does not exist.
> Just to confirm, the attribute is stored correctly when it is not empty.
> Thanks.
> {code:java}
> ``` r
>  dates <- as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-02"))
>  attributes(dates)
>  #> $class
>  #> [1] "POSIXct" "POSIXt" 
>  #> 
>  #> $tzone
>  #> [1] ""
>  values <- c(1:3)
>  original <- data.frame(dates, values)
>  original
>  #> dates values
>  #> 1 2020-01-01 1
>  #> 2 2020-01-02 2
>  #> 3 2020-01-02 3
> tempfile <- tempfile()
>  arrow::write_feather(original, tempfile)
> restored <- arrow::read_feather(tempfile)
> identical(original, restored)
>  #> [1] FALSE
>  waldo::compare(original, restored)
>  #> `attr(old$dates, 'tzone')` is a character vector ('')
>  #> `attr(new$dates, 'tzone')` is absent
> unlink(tempfile)
>  ```
> {code}
>  





[jira] [Updated] (ARROW-13588) [R] Empty character attributes not stored

2021-08-09 Thread Charlie Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charlie Gao updated ARROW-13588:

Description: 
Date-times in POSIXct format are created with a 'tzone' attribute that 
defaults to "", an empty character vector (not NULL).

This attribute, however, is not stored in the Arrow Feather file. When the 
file is read back, the original and restored data frames are not identical, 
as the reprex below shows.

I don't think this is intentional. My workaround at the moment is to check, 
when reading back, whether the tzone attribute exists and to restore the 
empty string if it does not.

Just to confirm, the attribute is stored correctly when it is not empty.

Thanks.
{code:java}
``` r
 dates <- as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-02"))
 attributes(dates)
 #> $class
 #> [1] "POSIXct" "POSIXt" 
 #> 
 #> $tzone
 #> [1] ""
 values <- c(1:3)
 original <- data.frame(dates, values)
 original
 #> dates values
 #> 1 2020-01-01 1
 #> 2 2020-01-02 2
 #> 3 2020-01-02 3
tempfile <- tempfile()
 arrow::write_feather(original, tempfile)
restored <- arrow::read_feather(tempfile)
identical(original, restored)
 #> [1] FALSE
 waldo::compare(original, restored)
 #> `attr(old$dates, 'tzone')` is a character vector ('')
 #> `attr(new$dates, 'tzone')` is absent
unlink(tempfile)
 ```
{code}
 

  was:
I have come across an issue in the process of incorporating arrow in a package 
I develop.

Date-times in POSIXct format are created with a 'tzone' attribute that 
defaults to "", an empty character vector (not NULL).

This attribute, however, is not stored in the Arrow Feather file. When the 
file is read back, the original and restored data frames are not identical, 
as the reprex below shows.

I don't think this is intentional. My workaround at the moment is to check, 
when reading back, whether the tzone attribute exists and to restore the 
empty string if it does not.

Just to confirm, this is not an issue when the attribute is not empty - it gets 
stored correctly.

Thanks.

``` r
 dates <- as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-02"))
 attributes(dates)
 #> $class
 #> [1] "POSIXct" "POSIXt" 
 #> 
 #> $tzone
 #> [1] ""
 values <- c(1:3)
 original <- data.frame(dates, values)
 original
 #> dates values
 #> 1 2020-01-01 1
 #> 2 2020-01-02 2
 #> 3 2020-01-02 3

tempfile <- tempfile()
 arrow::write_feather(original, tempfile)

restored <- arrow::read_feather(tempfile)

identical(original, restored)
 #> [1] FALSE
 waldo::compare(original, restored)
 #> `attr(old$dates, 'tzone')` is a character vector ('')
 #> `attr(new$dates, 'tzone')` is absent

unlink(tempfile)
 ```


> [R] Empty character attributes not stored
> -
>
> Key: ARROW-13588
> URL: https://issues.apache.org/jira/browse/ARROW-13588
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 5.0.0
> Environment: Ubuntu 20.04 R 4.1 release
>Reporter: Charlie Gao
>Priority: Minor
>
> Date-times in POSIXct format are created with a 'tzone' attribute that 
> defaults to "", an empty character vector (not NULL).
> This attribute, however, is not stored in the Arrow Feather file. When the 
> file is read back, the original and restored data frames are not identical, 
> as the reprex below shows.
> I don't think this is intentional. My workaround at the moment is to check, 
> when reading back, whether the tzone attribute exists and to restore the 
> empty string if it does not.
> Just to confirm, the attribute is stored correctly when it is not empty.
> Thanks.
> {code:java}
> ``` r
>  dates <- as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-02"))
>  attributes(dates)
>  #> $class
>  #> [1] "POSIXct" "POSIXt" 
>  #> 
>  #> $tzone
>  #> [1] ""
>  values <- c(1:3)
>  original <- data.frame(dates, values)
>  original
>  #> dates values
>  #> 1 2020-01-01 1
>  #> 2 2020-01-02 2
>  #> 3 2020-01-02 3
> tempfile <- tempfile()
>  arrow::write_feather(original, tempfile)
> restored <- arrow::read_feather(tempfile)
> identical(original, restored)
>  #> [1] FALSE
>  waldo::compare(original, restored)
>  #> `attr(old$dates, 'tzone')` is a character vector ('')
>  #> `attr(new$dates, 'tzone')` is absent
> unlink(tempfile)
>  ```
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13588) [R] Empty character attributes not stored

2021-08-09 Thread Charlie Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charlie Gao updated ARROW-13588:

Summary: [R] Empty character attributes not stored  (was: R empty character 
attributes not stored)

> [R] Empty character attributes not stored
> -
>
> Key: ARROW-13588
> URL: https://issues.apache.org/jira/browse/ARROW-13588
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 5.0.0
> Environment: Ubuntu 20.04 R 4.1 release
>Reporter: Charlie Gao
>Priority: Minor
>
> I have come across an issue in the process of incorporating arrow in a 
> package I develop.
> Date-times in POSIXct format are created with a 'tzone' attribute that 
> defaults to "", an empty character vector (not NULL).
> This attribute, however, is not stored in the Arrow Feather file. When the 
> file is read back, the original and restored data frames are not identical, 
> as the reprex below shows.
> I don't think this is intentional. My workaround at the moment is to check, 
> when reading back, whether the tzone attribute exists and to restore the 
> empty string if it does not.
> Just to confirm, this is not an issue when the attribute is not empty - it 
> gets stored correctly.
> Thanks.
> ``` r
>  dates <- as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-02"))
>  attributes(dates)
>  #> $class
>  #> [1] "POSIXct" "POSIXt" 
>  #> 
>  #> $tzone
>  #> [1] ""
>  values <- c(1:3)
>  original <- data.frame(dates, values)
>  original
>  #> dates values
>  #> 1 2020-01-01 1
>  #> 2 2020-01-02 2
>  #> 3 2020-01-02 3
> tempfile <- tempfile()
>  arrow::write_feather(original, tempfile)
> restored <- arrow::read_feather(tempfile)
> identical(original, restored)
>  #> [1] FALSE
>  waldo::compare(original, restored)
>  #> `attr(old$dates, 'tzone')` is a character vector ('')
>  #> `attr(new$dates, 'tzone')` is absent
> unlink(tempfile)
>  ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-13473) [C++] Add super-scalar impl for BitUtil::SetBitTo

2021-08-09 Thread Niranda Perera (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niranda Perera closed ARROW-13473.
--
Resolution: Abandoned

Closing as abandoned: the change did not yield any substantial performance 
improvement.

> [C++] Add super-scalar impl for BitUtil::SetBitTo 
> --
>
> Key: ARROW-13473
> URL: https://issues.apache.org/jira/browse/ARROW-13473
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Niranda Perera
>Assignee: Cristhian Gonzales Castillo
>Priority: Major
>  Labels: beginner
>
> {code:java}
> void SetBitTo(uint8_t* bits, int64_t i, bool bit_is_set){code}
> Add the super-scalar (branchless) variant to set a bit, as described here:
> [https://graphics.stanford.edu/~seander/bithacks.html#ConditionalSetOrClearBitsWithoutBranching]
> Add the implementation and run benchmarks (or create benchmarks if none 
> exist).
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-12513) [C++][Parquet] Parquet Writer always puts null_count=0 in Parquet statistics for dictionary-encoded array with nulls

2021-08-09 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace resolved ARROW-12513.
-
Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 10729
[https://github.com/apache/arrow/pull/10729]

> [C++][Parquet] Parquet Writer always puts null_count=0 in Parquet statistics 
> for dictionary-encoded array with nulls
> 
>
> Key: ARROW-12513
> URL: https://issues.apache.org/jira/browse/ARROW-12513
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet, Python
>Affects Versions: 1.0.1, 2.0.0, 3.0.0
> Environment: RHEL6
>Reporter: David Beach
>Assignee: Weston Pace
>Priority: Critical
>  Labels: parquet-statistics, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> When writing a Table as Parquet, when the table contains columns represented 
> as dictionary-encoded arrays, those columns show an incorrect null_count of 0 
> in the Parquet metadata.  If the same data is saved without 
> dictionary-encoding the array, then the null_count is correct.
> Confirmed bug with PyArrow 1.0.1, 2.0.0, and 3.0.0.
> NOTE: I'm a PyArrow user, but I believe this bug is actually in the C++ 
> implementation of the Arrow/Parquet writer.
> h3. Setup
> {code:python}
> import pyarrow as pa
> from pyarrow import parquet{code}
> h3. Bug
> (writes a dictionary encoded Arrow array to parquet)
> {code:python}
> array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
> assert array1.null_count == 5
> array1dict = array1.dictionary_encode()
> assert array1dict.null_count == 5
> table = pa.Table.from_arrays([array1dict], ["mycol"])
> parquet.write_table(table, "testtable.parquet")
> meta = parquet.read_metadata("testtable.parquet")
> meta.row_group(0).column(0).statistics.null_count  # RESULT: 0 (WRONG!){code}
> h3. Correct
> (writes same data without dictionary encoding the Arrow array)
> {code:python}
> array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
> assert array1.null_count == 5
> table = pa.Table.from_arrays([array1], ["mycol"])
> parquet.write_table(table, "testtable.parquet")
> meta = parquet.read_metadata("testtable.parquet")
> meta.row_group(0).column(0).statistics.null_count  # RESULT: 5 (CORRECT)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13573) [C++] Support dictionaries directly in case_when kernel

2021-08-09 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-13573:
-
Labels: kernel  (was: )

> [C++] Support dictionaries directly in case_when kernel
> ---
>
> Key: ARROW-13573
> URL: https://issues.apache.org/jira/browse/ARROW-13573
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: kernel
>
> case_when (and other similar kernels) currently dictionary-decode inputs, 
> then operate on the decoded values. This is both inefficient and unexpected. 
> We should instead operate directly on dictionary indices.
> Of course, this introduces more edge cases. If the dictionaries of inputs do 
> not match, we have the following choices:
>  # Raise an error.
>  # Unify the dictionaries.
>  # Use one of the dictionaries, and raise an error if an index of another 
> dictionary cannot be mapped to an index of the chosen dictionary.
>  # Use one of the dictionaries, and emit null if an index of another 
> dictionary cannot be mapped to an index of the chosen dictionary. (This is 
> what base dplyr if_else does with factors.)
> All of these options are reasonable, so we should introduce an options 
> struct. We can implement #3 and #4 at first (to cover R); #2 isn't strictly 
> necessary, as the user can unify the dictionaries manually first, but it may 
> be more efficient to do it this way. Similarly, #1 isn't strictly necessary.
> #3 and #4 are justifiable (beyond just "it's what R does") since users may 
> filter down disjoint dictionaries into a set of common values and then expect 
> to combine the remaining values with a kernel like case_when.
> As described on 
> [GitHub|https://github.com/apache/arrow/pull/10724#discussion_r682671015].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13390) [C++] Improve type support for 'coalesce' kernel

2021-08-09 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-13390:
-
Labels: kernel  (was: )

> [C++] Improve type support for 'coalesce' kernel
> 
>
> Key: ARROW-13390
> URL: https://issues.apache.org/jira/browse/ARROW-13390
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: kernel
>
> * Direct support for dictionaries (unifying dictionaries or building a new 
> dictionary as necessary)
>  * Fixed-length list
>  * Union, struct, list, other variable-width types



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13390) [C++] Improve type support for 'coalesce' kernel

2021-08-09 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-13390:
-
Fix Version/s: 6.0.0

> [C++] Improve type support for 'coalesce' kernel
> 
>
> Key: ARROW-13390
> URL: https://issues.apache.org/jira/browse/ARROW-13390
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: kernel
> Fix For: 6.0.0
>
>
> * Direct support for dictionaries (unifying dictionaries or building a new 
> dictionary as necessary)
>  * Fixed-length list
>  * Union, struct, list, other variable-width types



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13573) [C++] Support dictionaries directly in case_when kernel

2021-08-09 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-13573:
-
Fix Version/s: 6.0.0

> [C++] Support dictionaries directly in case_when kernel
> ---
>
> Key: ARROW-13573
> URL: https://issues.apache.org/jira/browse/ARROW-13573
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: kernel
> Fix For: 6.0.0
>
>
> case_when (and other similar kernels) currently dictionary-decode inputs, 
> then operate on the decoded values. This is both inefficient and unexpected. 
> We should instead operate directly on dictionary indices.
> Of course, this introduces more edge cases. If the dictionaries of inputs do 
> not match, we have the following choices:
>  # Raise an error.
>  # Unify the dictionaries.
>  # Use one of the dictionaries, and raise an error if an index of another 
> dictionary cannot be mapped to an index of the chosen dictionary.
>  # Use one of the dictionaries, and emit null if an index of another 
> dictionary cannot be mapped to an index of the chosen dictionary. (This is 
> what base dplyr if_else does with factors.)
> All of these options are reasonable, so we should introduce an options 
> struct. We can implement #3 and #4 at first (to cover R); #2 isn't strictly 
> necessary, as the user can unify the dictionaries manually first, but it may 
> be more efficient to do it this way. Similarly, #1 isn't strictly necessary.
> #3 and #4 are justifiable (beyond just "it's what R does") since users may 
> filter down disjoint dictionaries into a set of common values and then expect 
> to combine the remaining values with a kernel like case_when.
> As described on 
> [GitHub|https://github.com/apache/arrow/pull/10724#discussion_r682671015].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13574) [C++] Implement "count(*)" hash aggregate kernel

2021-08-09 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-13574:
-
Fix Version/s: 6.0.0

> [C++] Implement "count(*)" hash aggregate kernel
> 
>
> Key: ARROW-13574
> URL: https://issues.apache.org/jira/browse/ARROW-13574
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The current "count" hash aggregate kernel counts either all non-null or all 
> null values, but doesn't count all values regardless of nullity (i.e. SQL 
> "count(*)" or Pandas size()).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13574) [C++] Implement "count(*)" hash aggregate kernel

2021-08-09 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-13574:
-
Labels: kernel pull-request-available  (was: pull-request-available)

> [C++] Implement "count(*)" hash aggregate kernel
> 
>
> Key: ARROW-13574
> URL: https://issues.apache.org/jira/browse/ARROW-13574
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: kernel, pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The current "count" hash aggregate kernel counts either all non-null or all 
> null values, but doesn't count all values regardless of nullity (i.e. SQL 
> "count(*)" or Pandas size()).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13575) [C++] Implement product aggregate & hash aggregate kernels

2021-08-09 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-13575:
-
Labels: kernel pull-request-available  (was: pull-request-available)

> [C++] Implement product aggregate & hash aggregate kernels
> --
>
> Key: ARROW-13575
> URL: https://issues.apache.org/jira/browse/ARROW-13575
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Like Pandas 
> [prod|https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.prod.html].
>  Note that Pandas has a min_count option.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13222) [C++] Support variable-width types in case_when function

2021-08-09 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-13222:
-
Fix Version/s: 6.0.0

> [C++] Support variable-width types in case_when function
> 
>
> Key: ARROW-13222
> URL: https://issues.apache.org/jira/browse/ARROW-13222
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The initial PR only adds support for fixed-width types. We should also 
> support strings, lists, etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13222) [C++] Support variable-width types in case_when function

2021-08-09 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-13222:
-
Labels: kernel pull-request-available  (was: pull-request-available)

> [C++] Support variable-width types in case_when function
> 
>
> Key: ARROW-13222
> URL: https://issues.apache.org/jira/browse/ARROW-13222
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The initial PR only adds support for fixed-width types. We should also 
> support strings, lists, etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13575) [C++] Implement product aggregate & hash aggregate kernels

2021-08-09 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-13575:
-
Fix Version/s: 6.0.0

> [C++] Implement product aggregate & hash aggregate kernels
> --
>
> Key: ARROW-13575
> URL: https://issues.apache.org/jira/browse/ARROW-13575
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Like Pandas 
> [prod|https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.prod.html].
>  Note that Pandas has a min_count option.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-13508) [C++] Allow custom RetryStrategy objects to be passed to S3FileSystem

2021-08-09 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-13508:
--

Assignee: Neil Burke

> [C++] Allow custom RetryStrategy objects to be passed to S3FileSystem
> -
>
> Key: ARROW-13508
> URL: https://issues.apache.org/jira/browse/ARROW-13508
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neil Burke
>Assignee: Neil Burke
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> There's currently no way to provide a user-defined RetryStrategy object to be 
> used in the construction of S3FS's Aws::S3::S3Client. This makes handling S3 
> errors extremely difficult if the default RetryStrategy doesn't suit your 
> needs.
> Arrow should make the RetryStrategy available as an S3Options parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-13508) [C++] Allow custom RetryStrategy objects to be passed to S3FileSystem

2021-08-09 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-13508.

Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 10877
[https://github.com/apache/arrow/pull/10877]

> [C++] Allow custom RetryStrategy objects to be passed to S3FileSystem
> -
>
> Key: ARROW-13508
> URL: https://issues.apache.org/jira/browse/ARROW-13508
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neil Burke
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> There's currently no way to provide a user-defined RetryStrategy object to be 
> used in the construction of S3FS's Aws::S3::S3Client. This makes handling S3 
> errors extremely difficult if the default RetryStrategy doesn't suit your 
> needs.
> Arrow should make the RetryStrategy available as an S3Options parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12873) [C++][Compute] Support tagging ExecBatches with arbitrary extra information

2021-08-09 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396229#comment-17396229
 ] 

Weston Pace commented on ARROW-12873:
-

If we want strings, would it be sufficient to be able to index node inputs / 
function inputs by name? For example:

{code:cpp}
// In a node, retrieving inputs
auto group_ids = inputs_["group_ids_in"];
auto batch = inputs_["batch_in"];
{code}

{code:cpp}
// In a node, sending outputs
outputs["group_ids_out"]->InputReceived(this, seq, group_ids);
outputs["batch_out"]->InputReceived(this, seq, batch);
{code}

{code:cpp}
// Building the graph
// First argument is a vector of (input_name, node, output_name) for each
// input parameter
compute::MakeAggregateSumNode({{"group_ids_in", grouper, "group_ids_out"}, 
{"batch_in", grouper, "batch_out"}}, ...);
{code}

Then for compute functions you could do something like...

{code:cpp}
CallFunction("aggregate_sum", {{"batch", batch}, {"group_ids", group_ids}}, 
&options, ctx);
{code}

 

> [C++][Compute] Support tagging ExecBatches with arbitrary extra information
> ---
>
> Key: ARROW-12873
> URL: https://issues.apache.org/jira/browse/ARROW-12873
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Major
>
> Ideally, ExecBatches could be tagged with arbitrary optional objects for 
> tracing purposes and to transmit execution hints from one ExecNode to another.
> These should *not* be explicit members like ExecBatch::selection_vector is, 
> since they may not originate from the arrow library. For an example within 
> the arrow project: {{libarrow_dataset}} will be used to produce ScanNodes 
> and WriteNodes, and it's useful to tag scanned batches with their 
> {{Fragment}} of origin. However, adding {{ExecBatch::fragment}} would result 
> in a cyclic dependency.
> To facilitate this tagging capability, we would need a type erased container 
> something like
> {code}
> struct AnySet {
>   void* Get(tag_t tag);
>   void Set(tag_t tag, void* value, FnOnce<void()> destructor);
> };
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12748) [C++] Arithmetic kernels for numeric arrays

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-12748:

Labels: Analytics kernel  (was: Analytics)

> [C++] Arithmetic kernels for numeric arrays
> ---
>
> Key: ARROW-12748
> URL: https://issues.apache.org/jira/browse/ARROW-12748
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ian Cook
>Priority: Major
>  Labels: Analytics, kernel
>
> This is a parent Jira for the buildout of element-wise ("scalar") 
> mathematical or arithmetic functions that operate on numeric arrays and 
> scalars.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13130) [C++][Compute] Add abs, negate kernel for decimal inputs

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13130:

Labels: kernel  (was: )

> [C++][Compute] Add abs, negate kernel for decimal inputs
> 
>
> Key: ARROW-13130
> URL: https://issues.apache.org/jira/browse/ARROW-13130
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: yibocai#1
>Priority: Major
>  Labels: kernel
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12084) [C++][Compute] Add remainder and quotient compute::Function

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-12084:

Labels: kernel  (was: )

> [C++][Compute] Add remainder and quotient compute::Function
> ---
>
> Key: ARROW-12084
> URL: https://issues.apache.org/jira/browse/ARROW-12084
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Major
>  Labels: kernel
>
> In addition to {{divide}}, which returns only the quotient, it'd be useful 
> to have a function which returns both quotient and remainder (these are 
> efficient to compute simultaneously), probably as a 
> {{struct<quotient: T, remainder: T>}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12413) [C++][Compute] Improve power kernel

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-12413:

Labels:   (was: k)

> [C++][Compute] Improve power kernel
> ---
>
> Key: ARROW-12413
> URL: https://issues.apache.org/jira/browse/ARROW-12413
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: yibocai#1
>Priority: Major
>
> See comments at 
> https://github.com/apache/arrow/pull/9841#issuecomment-818526907
> The power kernel is adapted from the add/sub/mul/div kernels, but power is 
> much more complicated than the basic arithmetic operations. Some limitations 
> in the current implementation:
> - Base and exponent must be the same type, which is not reasonable.
> - Null handling is delayed to the end, and the power is always calculated 
> even for nulls (with whatever base/exponent happen to be in the slots). This 
> is a big waste of time for expensive power calculations.
> For checked power, integer overflow is checked on every multiplication. 
> Since the integer upper bound is already known, whether the power will 
> overflow can be calculated directly up front. That would remove the overflow 
> check from each iteration, but would introduce one expensive logarithm 
> operation. This may be worth a try.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12755) [C++][Compute] Add quotient and modulo kernels

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-12755:

Labels: beginner kernel  (was: beginner)

> [C++][Compute] Add quotient and modulo kernels
> --
>
> Key: ARROW-12755
> URL: https://issues.apache.org/jira/browse/ARROW-12755
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ian Cook
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: beginner, kernel
> Fix For: 6.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Add a pair of binary kernels to compute the:
>  * quotient (the result after division, discarding any fractional part, 
> a.k.a. integer division)
>  * mod or modulo (the remainder after division, a.k.a. {{%}} / {{%%}} / 
> modulus).
> The returned array should have the same data type as the input arrays or 
> promote to an appropriate type to avoid loss of precision if the input types 
> differ.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12413) [C++][Compute] Improve power kernel

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-12413:

Labels: k  (was: )

> [C++][Compute] Improve power kernel
> ---
>
> Key: ARROW-12413
> URL: https://issues.apache.org/jira/browse/ARROW-12413
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: yibocai#1
>Priority: Major
>  Labels: k
>
> See comments at 
> https://github.com/apache/arrow/pull/9841#issuecomment-818526907
> The power kernel is adapted from the add/sub/mul/div kernels, but power is 
> much more complicated than the basic arithmetic operations. Some limitations 
> in the current implementation:
> - Base and exponent must be the same type, which is not reasonable.
> - Null handling is delayed to the end, and the power is always calculated 
> even for nulls (with whatever base/exponent happen to be in the slots). This 
> is a big waste of time for expensive power calculations.
> For checked power, integer overflow is checked on every multiplication. 
> Since the integer upper bound is already known, whether the power will 
> overflow can be calculated directly up front. That would remove the overflow 
> check from each iteration, but would introduce one expensive logarithm 
> operation. This may be worth a try.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12413) [C++][Compute] Improve power kernel

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-12413:

Labels: kernel  (was: )

> [C++][Compute] Improve power kernel
> ---
>
> Key: ARROW-12413
> URL: https://issues.apache.org/jira/browse/ARROW-12413
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: yibocai#1
>Priority: Major
>  Labels: kernel
>
> See comments at 
> https://github.com/apache/arrow/pull/9841#issuecomment-818526907
> Power kernel is modified from add/sub/mul/div kernels. But power is much more 
> complicated than basic arithmetic operations. Some limitations in current 
> implementation:
> - Base and exponent must be the same type, which is not reasonable
> - Null handling is delayed to the end, and power is always calculated even 
> for nulls (with whatever base/exponent happen to be in the slots). This is a 
> big waste of time for expensive power calculations.
> For checked power, integer overflow is checked for every multiplication. As 
> the integer upper bound is already known, whether the power will overflow can 
> be calculated directly up front. That would remove the overflow check in each 
> iteration, but would introduce one expensive logarithm operation. Maybe worth 
> a try.





[jira] [Updated] (ARROW-12744) [C++][Compute] Add rounding kernel

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-12744:

Labels: kernel pull-request-available  (was: pull-request-available)

> [C++][Compute] Add rounding kernel
> --
>
> Key: ARROW-12744
> URL: https://issues.apache.org/jira/browse/ARROW-12744
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ian Cook
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 8h 20m
>  Remaining Estimate: 0h
>
> Kernel to round an array of floating point numbers. Should return an array of 
> the same type as the input. Should have an option to control how many digits 
> after the decimal point (default value 0 meaning round to the nearest 
> integer).
> Midpoint values (e.g. 0.5 rounded to nearest integer) should round away from 
> zero (up for positive numbers, down for negative numbers).
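The round-half-away-from-zero semantics described above can be sketched as follows. This is an illustrative sketch, not the kernel's implementation; note that for `ndigits > 0`, binary floating point means some decimal inputs are not exactly representable, so results can differ at the last ulp.

```python
import math

def round_half_away(x: float, ndigits: int = 0) -> float:
    # Midpoints round away from zero (up for positives, down for
    # negatives), unlike Python's round(), which rounds halves to even.
    scale = 10.0 ** ndigits
    return math.copysign(math.floor(abs(x) * scale + 0.5), x) / scale
```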





[jira] [Updated] (ARROW-13345) [C++] Implement logN compute function

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13345:

Labels: kernel pull-request-available  (was: pull-request-available)

> [C++] Implement logN compute function
> -
>
> Key: ARROW-13345
> URL: https://issues.apache.org/jira/browse/ARROW-13345
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Nic Crane
>Assignee: Rommel Quintanilla
>Priority: Minor
>  Labels: kernel, pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Just writing bindings from R functions to the various C++ log functions, but 
> one that we don't have is logN (i.e. where N is a user-supplied value); 
> please could this be implemented?
>  
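The requested logN is a change of base, which a binding could compute from the existing natural log. A minimal sketch of the element-wise math (not Arrow's API):

```python
import math

def logn(x: float, n: float) -> float:
    # A user-supplied base N log is just a change of base:
    # log_n(x) = ln(x) / ln(n).
    return math.log(x) / math.log(n)
```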





[jira] [Updated] (ARROW-13298) [C++] Implement hash_aggregate any/all Boolean kernels

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13298:

Labels: kernel pull-request-available  (was: pull-request-available)

> [C++] Implement hash_aggregate any/all Boolean kernels
> --
>
> Key: ARROW-13298
> URL: https://issues.apache.org/jira/browse/ARROW-13298
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> These would mimic Pandas's 
> [DataFrameGroupBy.all/DataFrameGroupBy.any|https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.all.html].





[jira] [Updated] (ARROW-13434) [R] group_by() with an unnamed expression

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13434:

Labels: pull-request-available query-engine  (was: pull-request-available)

> [R] group_by() with an unnamed expression
> --
>
> Key: ARROW-13434
> URL: https://issues.apache.org/jira/browse/ARROW-13434
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available, query-engine
> Fix For: 6.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> With dplyr, when we group_by with an unnamed expression, a column is added to 
> the dataframe that has the result of the expression.
> {code}
> > example_data %>% 
> +   group_by(int < 4) %>% collect()
> # A tibble: 10 x 8
> # Groups:   int < 4 [3]
>      int   dbl  dbl2 lgl   false chr   fct   `int < 4`
>    <int> <dbl> <dbl> <lgl> <lgl> <chr> <fct> <lgl>
>  1     1   1.1     5 TRUE  FALSE a     a     TRUE
>  2     2   2.1     5 NA    FALSE b     b     TRUE
>  3     3   3.1     5 TRUE  FALSE c     c     TRUE
>  4    NA   4.1     5 FALSE FALSE d     d     NA
>  5     5   5.1     5 TRUE  FALSE e     NA    FALSE
>  6     6   6.1     5 NA    FALSE NA    NA    FALSE
>  7     7   7.1     5 NA    FALSE g     g     FALSE
>  8     8   8.1     5 FALSE FALSE h     h     FALSE
>  9     9  NA       5 FALSE FALSE i     i     FALSE
> 10    10  10.1     5 NA    FALSE j     j     FALSE
> {code}
> Arrow doesn't do this, however, because we (currently) only add columns when 
> the expression is named.
> {code}
> > Table$create(example_data) %>% 
> +   group_by(int < 4) %>% collect()
>  Error: Invalid: No match for FieldRef.Name(int < 4) in int: int32
> dbl: double
> dbl2: double
> lgl: bool
> false: bool
> chr: string
> fct: dictionary 
> {code}
> This isn't a big deal right now since grouped aggregations aren't (quite) 
> here yet, but once we start having support for that, we will have people 
> using examples like this. 





[jira] [Updated] (ARROW-12759) [C++][Compute] Wrap grouped aggregation in an ExecNode

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-12759:

Labels: pull-request-available query-engine  (was: kernel 
pull-request-available)

> [C++][Compute] Wrap grouped aggregation in an ExecNode
> --
>
> Key: ARROW-12759
> URL: https://issues.apache.org/jira/browse/ARROW-12759
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Michal Nowakiewicz
>Priority: Major
>  Labels: pull-request-available, query-engine
> Fix For: 5.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> ARROW-11928 adds ExecNodes, to which GroupByNode should be added so that a 
> dataset scan can terminate in a grouped aggregation.





[jira] [Updated] (ARROW-12759) [C++][Compute] Wrap grouped aggregation in an ExecNode

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-12759:

Labels: kernel pull-request-available  (was: pull-request-available)

> [C++][Compute] Wrap grouped aggregation in an ExecNode
> --
>
> Key: ARROW-12759
> URL: https://issues.apache.org/jira/browse/ARROW-12759
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Michal Nowakiewicz
>Priority: Major
>  Labels: kernel, pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> ARROW-11928 adds ExecNodes, to which GroupByNode should be added so that a 
> dataset scan can terminate in a grouped aggregation.





[jira] [Updated] (ARROW-8928) [C++] Measure microperformance associated with ExecBatchIterator

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-8928:
---
Labels: pull-request-available query-engine  (was: pull-request-available)

> [C++] Measure microperformance associated with ExecBatchIterator
> 
>
> Key: ARROW-8928
> URL: https://issues.apache.org/jira/browse/ARROW-8928
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available, query-engine
> Fix For: 6.0.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> {{arrow::compute::ExecBatch}} uses a vector of {{arrow::Datum}} to contain a 
> collection of ArrayData and Scalar objects for kernel execution. It would be 
> helpful to know how many nanoseconds of overhead are associated with basic 
> interactions with this data structure, e.g. the cost of using our vendored 
> variant, and other such issues.





[jira] [Updated] (ARROW-13295) [C++] Implement hash_aggregate mean/stdev/variance kernels

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13295:

Labels: kernel pull-request-available  (was: pull-request-available)

> [C++] Implement hash_aggregate mean/stdev/variance kernels
> --
>
> Key: ARROW-13295
> URL: https://issues.apache.org/jira/browse/ARROW-13295
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> We have scalar aggregate kernels for these already that can serve as the 
> basis for a hash aggregate implementation.
> Depends on ARROW-12759, as it refactors how hash aggregate kernels are 
> implemented.
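A pure-Python sketch of the per-group math these kernels compute. Population variance (ddof = 0) and skipping nulls are assumptions for illustration; the real kernel would accumulate over a hash table of group ids rather than Python lists.

```python
from collections import defaultdict

def grouped_mean_variance(keys, values):
    # Collect non-null values per key, then compute mean and
    # population variance for each group.
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        if v is not None:
            groups[k].append(v)
    out = {}
    for k, xs in groups.items():
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        out[k] = (mean, var)
    return out
```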





[jira] [Updated] (ARROW-13495) [C++] UBSAN error in BitUtil when writing dataset

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13495:

Labels: pull-request-available query-engine  (was: pull-request-available)

> [C++] UBSAN error in BitUtil when writing dataset
> -
>
> Key: ARROW-13495
> URL: https://issues.apache.org/jira/browse/ARROW-13495
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Reporter: Neal Richardson
>Assignee: Michal Nowakiewicz
>Priority: Major
>  Labels: pull-request-available, query-engine
> Fix For: 6.0.0, 5.0.1
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> https://www.stats.ox.ac.uk/pub/bdr/memtests/gcc-UBSAN/arrow/arrow-Ex.Rout
> {code}
> > write_dataset(airquality, tf3, partitioning = c("Month", "Day"), hive_style 
> > = FALSE)
> /tmp/RtmpWw0Jb4/file21ecfe42e86b84/apache-arrow-5.0.0/cpp/src/arrow/compute/exec/util.cc:34:18:
>  runtime error: store to misaligned address 0x631b48fd for type 
> 'uint16_t', which requires 2 byte alignment
> 0x631b48fd: note: pointer points here
>  00 00 00 03 00 00 00  0b 00 00 00 2a 00 00 00  02 00 00 00 12 00 00 00  2b 
> 00 00 00 3a 00 00 00  13
>  ^ 
> #0 0x7f343e9a7984 in void 
> arrow::util::BitUtil::bits_to_indexes_internal<0, false>(long, int, unsigned 
> char const*, unsigned short const*, int*, unsigned short*) [clone .isra.0] 
> (/data/gannet/ripley/R/packages/tests-gcc-SAN/arrow.Rcheck/arrow/libs/arrow.so+0x1587d984)
> #1 0x7f343e9fca36 in arrow::compute::SwissTable::map(int, unsigned int 
> const*, unsigned int*) 
> (/data/gannet/ripley/R/packages/tests-gcc-SAN/arrow.Rcheck/arrow/libs/arrow.so+0x158d2a36)
> #2 0x7f343efcd989 in arrow::compute::internal::(anonymous 
> namespace)::GrouperFastImpl::Consume(arrow::compute::ExecBatch const&) 
> (/data/gannet/ripley/R/packages/tests-gcc-SAN/arrow.Rcheck/arrow/libs/arrow.so+0x15ea3989)
> #3 0x7f343befae8b in 
> arrow::dataset::KeyValuePartitioning::Partition(std::shared_ptr
>  const&) const [clone .localalias] 
> (/data/gannet/ripley/R/packages/tests-gcc-SAN/arrow.Rcheck/arrow/libs/arrow.so+0x12dd0e8b)
> #4 0x7f343beb2f45 in arrow::dataset::(anonymous 
> namespace)::WriteNextBatch(arrow::dataset::(anonymous 
> namespace)::WriteState*, std::shared_ptr const&, 
> std::shared_ptr) 
> (/data/gannet/ripley/R/packages/tests-gcc-SAN/arrow.Rcheck/arrow/libs/arrow.so+0x12d88f45)
> #5 0x7f343bed06b6 in std::_Function_handler (std::shared_ptr), arrow::dataset::(anonymous 
> namespace)::WriteInternal(arrow::dataset::ScanOptions const&, 
> arrow::dataset::(anonymous namespace)::WriteState*, 
> std::vector, 
> std::allocator > 
> >)::{lambda()#1}::operator()() 
> const::{lambda(std::shared_ptr)#1}>::_M_invoke(std::_Any_data
>  const&, std::shared_ptr&&) 
> (/data/gannet/ripley/R/packages/tests-gcc-SAN/arrow.Rcheck/arrow/libs/arrow.so+0x12da66b6)
> #6 0x7f343c1d79ab in std::_Function_handler (std::shared_ptr), 
> arrow::dataset::FilterAndProjectScanTask::SafeVisit(arrow::internal::Executor*,
>  std::function (std::shared_ptr)>)::{lambda(std::shared_ptr
>  const&)#1}>::_M_invoke(std::_Any_data const&, 
> std::shared_ptr&&) 
> (/data/gannet/ripley/R/packages/tests-gcc-SAN/arrow.Rcheck/arrow/libs/arrow.so+0x130ad9ab)
> #7 0x7f343c0ccc35 in arrow::Status 
> arrow::Iterator 
> >::Visit (std::shared_ptr)>&>(std::function (std::shared_ptr)>&) 
> (/data/gannet/ripley/R/packages/tests-gcc-SAN/arrow.Rcheck/arrow/libs/arrow.so+0x12fa2c35)
> #8 0x7f343bda95ec in 
> arrow::dataset::ScanTask::SafeVisit(arrow::internal::Executor*, 
> std::function)>) 
> (/data/gannet/ripley/R/packages/tests-gcc-SAN/arrow.Rcheck/arrow/libs/arrow.so+0x12c7f5ec)
> #9 0x7f343c011c08 in 
> arrow::dataset::FilterAndProjectScanTask::SafeVisit(arrow::internal::Executor*,
>  std::function)>) 
> (/data/gannet/ripley/R/packages/tests-gcc-SAN/arrow.Rcheck/arrow/libs/arrow.so+0x12ee7c08)
> #10 0x7f343bcdff7a in 
> arrow::internal::FnOnce 
> (arrow::internal::Executor*)>::FnImpl namespace)::WriteInternal(arrow::dataset::ScanOptions const&, 
> arrow::dataset::(anonymous namespace)::WriteState*, 
> std::vector, 
> std::allocator > 
> >)::{lambda()#1}::operator()() 
> const::{lambda(arrow::internal::Executor*)#2}>::invoke(arrow::internal::Executor*&&)
>  
> (/data/gannet/ripley/R/packages/tests-gcc-SAN/arrow.Rcheck/arrow/libs/arrow.so+0x12bb5f7a)
> #11 0x7f343c4ac329 in arrow::Status 
> arrow::internal::SerialExecutor::RunInSerialExecutor arrow::Future, 
> arrow::Status>(arrow::internal::FnOnce 
> (arrow::internal::Executor*)>) 
> (/data/gannet/ripley/R/packages/tests-gcc-SAN/arrow.Rcheck/arrow/libs/arrow.so+0x13382329)
> #12 0x7f343c4ae6d4 in arrow::Future::SyncType 
> arrow::internal::RunSynchronously, 
> arrow::internal

[jira] [Updated] (ARROW-12946) [C++] String swap case kernel

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-12946:

Labels: beginner kernel pull-request-available  (was: beginner 
pull-request-available)

> [C++] String swap case kernel
> -
>
> Key: ARROW-12946
> URL: https://issues.apache.org/jira/browse/ARROW-12946
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ian Cook
>Assignee: Christian Cordova
>Priority: Major
>  Labels: beginner, kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Changes lowercase characters to uppercase and uppercase characters to 
> lowercase, like Python {{str.swapcase()}}
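The element-wise behavior the kernel mirrors, shown with Python's built-in `str` (the Arrow-side function name is not assumed here):

```python
# swapcase flips the case of every cased character, including non-ASCII.
assert "Hello World".swapcase() == "hELLO wORLD"
assert "ärrow".swapcase() == "ÄRROW"
```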





[jira] [Updated] (ARROW-6072) [C++] Implement casting List <-> LargeList

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6072:
---
Labels: kernel pull-request-available  (was: pull-request-available)

> [C++] Implement casting List <-> LargeList
> --
>
> Key: ARROW-6072
> URL: https://issues.apache.org/jira/browse/ARROW-6072
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We should implement bidirectional casts from List to LargeList and vice-versa.
> In the narrowing direction, the offset width should be checked.
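The narrowing check mentioned above amounts to verifying that every 64-bit offset fits in int32. A minimal sketch of that validation logic (not Arrow's implementation):

```python
INT32_MAX = 2**31 - 1

def offsets_fit_in_int32(offsets) -> bool:
    # LargeList uses 64-bit offsets and List 32-bit ones, so the
    # LargeList -> List cast must confirm each offset is representable.
    return all(0 <= off <= INT32_MAX for off in offsets)
```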





[jira] [Updated] (ARROW-13520) [C++] Implement hash_aggregate approximate quantile kernel

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13520:

Labels: kernel pull-request-available  (was: pull-request-available)

> [C++] Implement hash_aggregate approximate quantile kernel
> --
>
> Key: ARROW-13520
> URL: https://issues.apache.org/jira/browse/ARROW-13520
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ian Cook
>Assignee: David Li
>Priority: Major
>  Labels: kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Implement a hash aggregate approximate quantile kernel. We have a scalar 
> aggregate version of this (named {{tdigest}}) using t-digest.





[jira] [Resolved] (ARROW-10373) [C++] ValidateFull() does not validate null_count

2021-08-09 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-10373.

Resolution: Fixed

Issue resolved by pull request 10871
[https://github.com/apache/arrow/pull/10871]

> [C++] ValidateFull() does not validate null_count
> -
>
> Key: ARROW-10373
> URL: https://issues.apache.org/jira/browse/ARROW-10373
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Derek Denny-Brown
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
> Attachments: arrow_null_count.cpp
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> It is possible to corrupt a serialized record batch such that the null_count 
> of the parsed result is invalid.  Attached is a repro where the parsed 
> null_count = -2, but ValidateFull returns success.  I would expect 
> ValidateFull to verify the null_count against the serialized columns.
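The validation being asked for is a recount of nulls from the validity bitmap, compared against the stored field. A sketch of that recount (LSB-first bit order per the Arrow format; the function name is illustrative):

```python
def null_count_from_validity(bitmap: bytes, length: int) -> int:
    # A set bit means the slot is valid; nulls are the unset bits
    # within the first `length` positions.
    valid = sum((bitmap[i // 8] >> (i % 8)) & 1 for i in range(length))
    return length - valid
```

ValidateFull would flag a mismatch between this count and the deserialized `null_count` (such as the -2 in the repro).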





[jira] [Updated] (ARROW-13220) [C++] Add a 'choose' kernel/scalar compute function

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13220:

Labels: kernel pull-request-available  (was: pull-request-available)

> [C++] Add a 'choose' kernel/scalar compute function
> ---
>
> Key: ARROW-13220
> URL: https://issues.apache.org/jira/browse/ARROW-13220
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Emulate SQL's choose or (a very limited subset of) NumPy's choose.
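A sketch of the element-wise semantics: `result[i] = values[indices[i]][i]`. Propagating a null index to a null output is an assumption here, not something the issue specifies.

```python
def choose(indices, *value_lists):
    # Pick, for each row i, the element at row i of the value list
    # selected by indices[i]; a None index yields None.
    return [None if j is None else value_lists[j][i]
            for i, j in enumerate(indices)]
```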





[jira] [Updated] (ARROW-12944) [C++] String capitalize kernel

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-12944:

Labels: beginner kernel pull-request-available  (was: beginner 
pull-request-available)

> [C++] String capitalize kernel
> --
>
> Key: ARROW-12944
> URL: https://issues.apache.org/jira/browse/ARROW-12944
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ian Cook
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: beginner, kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Capitalizes the first character in the string, like Python 
> {{str.capitalize()}} 
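The element-wise behavior the kernel mirrors, shown with Python's built-in `str` (note that `capitalize` also lowercases the remaining characters):

```python
assert "hello world".capitalize() == "Hello world"
assert "hELLO".capitalize() == "Hello"
```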





[jira] [Updated] (ARROW-12540) [C++] Implement cast from date32[day] to utf8

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-12540:

Labels: kernel pull-request-available  (was: pull-request-available)

> [C++] Implement cast from date32[day] to utf8 
> --
>
> Key: ARROW-12540
> URL: https://issues.apache.org/jira/browse/ARROW-12540
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Nic Crane
>Assignee: Rommel Quintanilla
>Priority: Major
>  Labels: kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> I'm writing bindings between the C++ write CSV functionality and R.  When I 
> test my code on a table with no date columns, the tests pass, but when I 
> test the code with a table that includes dates, I get the following error 
> message:
> `Error: NotImplemented: Unsupported cast from date32[day] to utf8 using 
> function cast_string`
> Code that causes error: [https://github.com/apache/arrow/pull/10141]
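The cast being requested is straightforward element-wise formatting: date32 stores days since the UNIX epoch, rendered as an ISO-8601 string. A sketch using the standard library (not the Arrow kernel itself):

```python
from datetime import date, timedelta

def date32_to_utf8(days_since_epoch: int) -> str:
    # date32 is days since 1970-01-01; the utf8 cast formats it ISO-8601.
    return (date(1970, 1, 1) + timedelta(days=days_since_epoch)).isoformat()
```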





[jira] [Updated] (ARROW-13509) [C++] Take compute function should pass through ChunkedArray type to handle empty input arrays

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13509:

Labels: kernel pull-request-available  (was: pull-request-available)

> [C++] Take compute function should pass through ChunkedArray type to handle 
> empty input arrays
> --
>
> Key: ARROW-13509
> URL: https://issues.apache.org/jira/browse/ARROW-13509
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 4.0.0
>Reporter: &res
>Assignee: Percy Camilo Triveño Aucahuasi
>Priority: Minor
>  Labels: kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> I'm trying to explode a table (in the pandas sense: 
> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html)
> As it's not yet supported, I've written some code to do it using a mix of 
> list_flatten and list_parent_indices. It works well, except that it crashes 
> for empty tables.
> {code:python}
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> F0730 15:16:05.164858 13612 chunked_array.cc:48]  Check failed: 
> (chunks_.size()) > (0) cannot construct ChunkedArray from empty vector and 
> omitted type
> *** Check failure stack trace: ***Process finished with exit code 134 
> (interrupted by signal 6: SIGABRT)
> {code}
> Here's a reproducible example:
> {code:python}
> import sys
> import pyarrow as pa
> from pyarrow import compute
> import pandas as pd
> table = pa.Table.from_arrays(
>     [
>         pa.array([101, 102, 103], pa.int32()),
>         pa.array([['a'], ['a', 'b'], ['a', 'b', 'c']], pa.list_(pa.string()))
>     ],
>     names=['key', 'list']
> )
> def explode(table) -> pd.DataFrame:
>     exploded_list = compute.list_flatten(table['list'])
>     indices = compute.list_parent_indices(table['list'])
>     assert indices.type == pa.int32()
>     keys = compute.take(table['key'], indices)  # <--- Crashes here
>     return pa.Table.from_arrays(
>         [keys, exploded_list],
>         names=['key', 'list_element']
>     )
> explode(table).to_pandas().to_markdown(sys.stdout)
> explode(table.slice(0, 0)).to_pandas().to_markdown(sys.stdout) # <--- doesn't 
> work
> {code}
>  
> I've narrowed it down to the following: 
> when list_parent_indices is called on an empty table it returns this empty 
> chunk array:
> {code}
> pa.chunked_array([], pa.int32())
> {code}
> Instead of this chunked array with 1 empty chunk:
> {code}
> pa.chunked_array([pa.array([], pa.int32())])
> {code}
> In turn, take doesn't work with the empty chunked array:
> {code:python}
> compute.take(pa.chunked_array([pa.array([], pa.int32())]),
>  pa.chunked_array([], pa.int32())) # Bad
> compute.take(pa.chunked_array([pa.array([], pa.int32())]),
>  pa.chunked_array([pa.array([], pa.int32())])) # Good
> {code}
> Now, in terms of how to fix it, there are two solutions:
> * take could accept empty chunked array
> * list_parent_indices could return a chunked array with an empty chunk
> PS: the error message isn't accurate. It says "cannot construct ChunkedArray 
> from empty vector and omitted type". But the array being passed has a 
> type (int32) and no chunk. It makes me suspect that something in take strips 
> the type of the empty chunked array.
>  





[jira] [Updated] (ARROW-12321) [R][C++] Arrow opens too many files at once when writing a dataset

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-12321:

Labels: query-engine  (was: )

> [R][C++] Arrow opens too many files at once when writing a dataset
> --
>
> Key: ARROW-12321
> URL: https://issues.apache.org/jira/browse/ARROW-12321
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 3.0.0
>Reporter: Mauricio 'Pachá' Vargas Sepúlveda
>Assignee: Weston Pace
>Priority: Major
>  Labels: query-engine
> Fix For: 6.0.0
>
>
> _Related to:_ https://issues.apache.org/jira/browse/ARROW-12315
> Please see 
> https://drive.google.com/drive/folders/1e7WB36FPYzvdtm46dgAEEFAKAWDQs-e1?usp=sharing
>  where I added the raw data and the output.
> This works:
> {code:java}
> library(data.table)
> library(dplyr)
> library(arrow)
> d <- fread(
> input = 
> "01-raw-data/sitc-rev2/parquet/type-C_r-ALL_ps-2019_freq-A_px-S2_pub-20210216_fmt-csv_ex-20210227.csv",
> colClasses = list(
>   character = "Commodity Code",
>   numeric = c("Trade Value (US$)", "Qty", "Netweight (kg)")
> ))
> d <- d %>%
>   mutate(
> `Reporter ISO` = case_when(
>   `Reporter ISO` %in% c(NA, "", " ") ~ "0-unspecified",
>   TRUE ~ `Reporter ISO`
> ),
> `Partner ISO` = case_when(
>   `Partner ISO` %in% c(NA, "", " ") ~ "0-unspecified",
>   TRUE ~ `Partner ISO`
> )
>   )
> # d %>%
> #   select(Year, `Reporter ISO`, `Partner ISO`) %>%
> #   distinct() %>%
> #   dim()
> d %>%
>   group_by(Year, `Reporter ISO`) %>%
>   write_dataset("parquet", hive_style = F, max_partitions = 1024L)
> {code}
> But, if I add an additional column for partitioning and increase the max 
> partitions to 12808 (to pass exactly the number of partitions that it needs), 
> I get the error:
> {code:java}
> d %>%
>   group_by(Year, `Reporter ISO`) %>%
>   write_dataset("parquet", hive_style = F, max_partitions = 12808)
> Error: IOError: Failed to open local file 
> '/media/pacha/pacha_backup/tradestatistics/yearly-datasets-arrow/01-raw-data/sitc-rev2/parquet/2019/SEN/MOZ/part-5353.parquet'.
>  Detail: [errno 24] Too many open files
> {code}





[jira] [Updated] (ARROW-10439) [C++][Dataset] Add max file size as a dataset writing option

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-10439:

Labels: beginner dataset query-engine  (was: beginner dataset)

> [C++][Dataset] Add max file size as a dataset writing option
> 
>
> Key: ARROW-10439
> URL: https://issues.apache.org/jira/browse/ARROW-10439
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 2.0.0
>Reporter: Ben Kietzman
>Assignee: Weston Pace
>Priority: Major
>  Labels: beginner, dataset, query-engine
> Fix For: 6.0.0
>
>
> This should be specified as a row limit.





[jira] [Updated] (ARROW-12803) [C++] [Dataset] Write dataset with scanner does not support async scan

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-12803:

Labels: query-engine  (was: )

> [C++] [Dataset] Write dataset with scanner does not support async scan
> --
>
> Key: ARROW-12803
> URL: https://issues.apache.org/jira/browse/ARROW-12803
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: query-engine
> Fix For: 6.0.0
>
>
> The file system write uses the legacy Scan method which is not supported by 
> the async scanner.  This should be addressed.  It would be useful, for 
> example, when copying a dataset from S3 to local disk.





[jira] [Updated] (ARROW-13542) [C++][Compute][Dataset] Add dataset::WriteNode for writing rows from an ExecPlan to disk

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13542:

Fix Version/s: 6.0.0

> [C++][Compute][Dataset] Add dataset::WriteNode for writing rows from an 
> ExecPlan to disk
> 
>
> Key: ARROW-13542
> URL: https://issues.apache.org/jira/browse/ARROW-13542
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Weston Pace
>Priority: Major
>  Labels: dataset
> Fix For: 6.0.0
>
>
> This will serve as a sink ExecNode which dumps all the batches it receives to 
> disk. The PR should probably also replace {{FileSystemDataset::Write}} with 
> an ExecPlan-based implementation.





[jira] [Updated] (ARROW-12803) [C++] [Dataset] Write dataset with scanner does not support async scan

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-12803:

Fix Version/s: 6.0.0

> [C++] [Dataset] Write dataset with scanner does not support async scan
> --
>
> Key: ARROW-12803
> URL: https://issues.apache.org/jira/browse/ARROW-12803
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
> Fix For: 6.0.0
>
>
> The file system write uses the legacy Scan method which is not supported by 
> the async scanner.  This should be addressed.  It would be useful, for 
> example, when copying a dataset from S3 to local disk.





[jira] [Updated] (ARROW-13542) [C++][Compute][Dataset] Add dataset::WriteNode for writing rows from an ExecPlan to disk

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13542:

Labels: dataset query-engine  (was: dataset)

> [C++][Compute][Dataset] Add dataset::WriteNode for writing rows from an 
> ExecPlan to disk
> 
>
> Key: ARROW-13542
> URL: https://issues.apache.org/jira/browse/ARROW-13542
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Weston Pace
>Priority: Major
>  Labels: dataset, query-engine
> Fix For: 6.0.0
>
>
> This will serve as a sink ExecNode which dumps all the batches it receives to 
> disk. The PR should probably also replace {{FileSystemDataset::Write}} with 
> an ExecPlan-based implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13560) [R] Allow Scanner$create() to accept filter / project even with arrow_dplyr_querys

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13560:

Labels: pull-request-available  (was: pull-request-available query-engine)

> [R] Allow Scanner$create() to accept filter / project even with 
> arrow_dplyr_querys
> --
>
> Key: ARROW-13560
> URL: https://issues.apache.org/jira/browse/ARROW-13560
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently, when calling {{Scanner$create()}} on an {{arrow_dplyr_query}}, the 
> `projection` and `filter` arguments are silently ignored (since the dplyr 
> query's filters should be applied instead). 
> Accepting them would be helpful for predicate pushdowns. 
> One proposal is at 
> https://gist.github.com/hannesmuehleisen/61d655bb345e0af374d10ca303894eef



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13560) [R] Allow Scanner$create() to accept filter / project even with arrow_dplyr_querys

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13560:

Labels: pull-request-available query-engine  (was: pull-request-available)

> [R] Allow Scanner$create() to accept filter / project even with 
> arrow_dplyr_querys
> --
>
> Key: ARROW-13560
> URL: https://issues.apache.org/jira/browse/ARROW-13560
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available, query-engine
> Fix For: 6.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently, when calling {{Scanner$create()}} on an {{arrow_dplyr_query}}, the 
> `projection` and `filter` arguments are silently ignored (since the dplyr 
> query's filters should be applied instead). 
> Accepting them would be helpful for predicate pushdowns. 
> One proposal is at 
> https://gist.github.com/hannesmuehleisen/61d655bb345e0af374d10ca303894eef



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13560) [R] Allow Scanner$create() to accept filter / project even with arrow_dplyr_querys

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13560:

Fix Version/s: 6.0.0

> [R] Allow Scanner$create() to accept filter / project even with 
> arrow_dplyr_querys
> --
>
> Key: ARROW-13560
> URL: https://issues.apache.org/jira/browse/ARROW-13560
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently, when calling {{Scanner$create()}} on an {{arrow_dplyr_query}}, the 
> `projection` and `filter` arguments are silently ignored (since the dplyr 
> query's filters should be applied instead). 
> Accepting them would be helpful for predicate pushdowns. 
> One proposal is at 
> https://gist.github.com/hannesmuehleisen/61d655bb345e0af374d10ca303894eef



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-13587) [R] Handle --use-LTO override

2021-08-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-13587.
-
Resolution: Fixed

Issue resolved by pull request 10894
[https://github.com/apache/arrow/pull/10894]

> [R] Handle --use-LTO override
> -
>
> Key: ARROW-13587
> URL: https://issues.apache.org/jira/browse/ARROW-13587
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.1, 6.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> We added UseLTO: false to the DESCRIPTION, but a user can override this with 
> R CMD INSTALL --use-LTO, so we can't rely on UseLTO always/on CRAN. Handle 
> what happens if we still end up in LTOland



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-5244) [C++] Review experimental / unstable APIs

2021-08-09 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-5244.
---
Resolution: Fixed

Issue resolved by pull request 10886
[https://github.com/apache/arrow/pull/10886]

> [C++] Review experimental / unstable APIs
> -
>
> Key: ARROW-5244
> URL: https://issues.apache.org/jira/browse/ARROW-5244
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Some C++ APIs are currently marked experimental (or perhaps unstable). We 
> should review them and perhaps promote some of them to stable, or perhaps 
> remove those that have proven unhelpful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13589) [C++] Reconcile ValidateArray and ValidateArrayFull

2021-08-09 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-13589:
--

 Summary: [C++] Reconcile ValidateArray and ValidateArrayFull
 Key: ARROW-13589
 URL: https://issues.apache.org/jira/browse/ARROW-13589
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Antoine Pitrou


Instead of having entirely separate implementations for these functions, it 
seems the expensive checks can simply be toggled using a boolean flag inside a 
common implementation (as is already done for {{Scalar::Validate}}). Whether or 
not this makes the code more readable / shorter / more maintainable must be 
evaluated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-13482) [C++][Compute] Provide a registry for ExecNode implementations

2021-08-09 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-13482.
--
Resolution: Fixed

Issue resolved by pull request 10793
[https://github.com/apache/arrow/pull/10793]

> [C++][Compute] Provide a registry for ExecNode implementations
> --
>
> Key: ARROW-13482
> URL: https://issues.apache.org/jira/browse/ARROW-13482
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available, query-engine
> Fix For: 6.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> ExecNodes' factory functions are confusingly non-uniform, which means a lot 
> of boilerplate when composing them into even a simple ExecPlan. Providing a 
> standard factory interface for these and a registry in which to store them by 
> name will simplify declaration of ExecPlans.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13451) [C++][Compute] Consider removing ScalarAggregateKernel

2021-08-09 Thread Ben Kietzman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396170#comment-17396170
 ] 

Ben Kietzman commented on ARROW-13451:
--

At the same time, it'd be nice to make the helper functions 
{{internal::GetKernels}} etc public and well tested.

See also: https://github.com/apache/arrow/pull/10793#discussion_r685358176

> [C++][Compute] Consider removing ScalarAggregateKernel
> --
>
> Key: ARROW-13451
> URL: https://issues.apache.org/jira/browse/ARROW-13451
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Scalar aggregation does not incur large memory overhead for the associated 
> KernelState objects, so maybe it'd be acceptable to remove explicit scalar 
> aggregation kernels in favor of reusing grouped aggregation kernels with a 
> single group. This would decrease our maintenance burden significantly, and 
> if the benchmarks don't show a regression for single-group aggregation then 
> there's no reason not to.
> Even if there is a performance regression, we could bundle the scalar and 
> grouped aggregate kernels in the same compute::Function and decide between 
> them in Dispatch*, rather than confusingly defining distinct "sum" and 
> "hash_sum" functions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-13132) [C++] Add Scalar validation

2021-08-09 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-13132.

Resolution: Fixed

Issue resolved by pull request 10862
[https://github.com/apache/arrow/pull/10862]

> [C++] Add Scalar validation
> ---
>
> Key: ARROW-13132
> URL: https://issues.apache.org/jira/browse/ARROW-13132
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> In simple cases, scalar validation would probably be a no-op, but some types 
> may deserve some checks (e.g. UTF8 validation).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13588) R empty character attributes not stored

2021-08-09 Thread Charlie Gao (Jira)
Charlie Gao created ARROW-13588:
---

 Summary: R empty character attributes not stored
 Key: ARROW-13588
 URL: https://issues.apache.org/jira/browse/ARROW-13588
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 5.0.0
 Environment: Ubuntu 20.04 R 4.1 release
Reporter: Charlie Gao


I have come across an issue in the process of incorporating arrow in a package 
I develop.

Date-times in the POSIXct format have a 'tzone' attribute that, when created, is 
by default set to "", an empty character vector (not NULL).

This attribute, however, is not stored in the Arrow Feather file. When the file 
is read back, the original and restored data frames are not identical, as the 
reprex below shows.

I assume this is not the intended behaviour? My workaround at the moment is to 
check on reading back and restore the empty string if the tzone attribute does 
not exist.

Just to confirm, this is not an issue when the attribute is not empty - it gets 
stored correctly.

Thanks.

``` r
dates <- as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-02"))
attributes(dates)
#> $class
#> [1] "POSIXct" "POSIXt"
#>
#> $tzone
#> [1] ""
values <- c(1:3)
original <- data.frame(dates, values)
original
#>        dates values
#> 1 2020-01-01      1
#> 2 2020-01-02      2
#> 3 2020-01-02      3

tempfile <- tempfile()
arrow::write_feather(original, tempfile)

restored <- arrow::read_feather(tempfile)

identical(original, restored)
#> [1] FALSE
waldo::compare(original, restored)
#> `attr(old$dates, 'tzone')` is a character vector ('')
#> `attr(new$dates, 'tzone')` is absent

unlink(tempfile)
```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-13530) [C++] Implement cumulative sum compute function

2021-08-09 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li reassigned ARROW-13530:


Assignee: David Li

> [C++] Implement cumulative sum compute function
> ---
>
> Key: ARROW-13530
> URL: https://issues.apache.org/jira/browse/ARROW-13530
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: David Li
>Priority: Major
> Fix For: 6.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-7179) [C++][Compute] Array support for fill_null

2021-08-09 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li closed ARROW-7179.
---
Fix Version/s: (was: 6.0.0)
   5.0.0
 Assignee: David Li  (was: Ben Kietzman)
   Resolution: Duplicate

Comparing the two issues, the kernels look identical, so I'm going to close as 
duplicate.

> [C++][Compute] Array support for fill_null
> --
>
> Key: ARROW-7179
> URL: https://issues.apache.org/jira/browse/ARROW-7179
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Assignee: David Li
>Priority: Major
>  Labels: analytics
> Fix For: 5.0.0
>
>
> Add kernels to support replacing null values in an array with values 
> taken from corresponding slots in another array:
> {code}
> fill_null([1, null, null, 3], [5, 6, null, 8]) -> [1, 6, null, 3]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6870) [C#] Add Support for Dictionary Arrays and Dictionary Encoding

2021-08-09 Thread Eric Erhardt (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Erhardt resolved ARROW-6870.
-
Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 10527
[https://github.com/apache/arrow/pull/10527]

> [C#] Add Support for Dictionary Arrays and Dictionary Encoding
> --
>
> Key: ARROW-6870
> URL: https://issues.apache.org/jira/browse/ARROW-6870
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C#
>Reporter: Daniel Parubotchy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 9h
>  Remaining Estimate: 0h
>
> The C# implementation of Arrow doesn't support dictionary arrays, 
> serialization/deserialization of dictionary batches, or dictionary encoding.
> Dictionary arrays and dictionary encoding could provide a huge performance 
> benefit for certain data sets.
> I propose creating dictionary array types that correspond to the existing 
> array types



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1565) [C++][Compute] Implement TopK/BottomK streaming execution nodes

2021-08-09 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396072#comment-17396072
 ] 

David Li commented on ARROW-1565:
-

Daniel Lemire compared various approaches (modifications of the heap approach 
or modifications of quickselect) and found that the heap wins in his tests: 
https://lemire.me/blog/2017/06/21/top-speed-for-top-k-queries/ 

If we want this to perfectly emulate Pandas's nlargest/nsmallest then there are 
some additional complications (inputs must have a defined order to resolve 
ties, and/or we must be able to track and output duplicate values).

> [C++][Compute] Implement TopK/BottomK streaming execution nodes
> ---
>
> Key: ARROW-1565
> URL: https://issues.apache.org/jira/browse/ARROW-1565
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics, query-engine
> Fix For: 6.0.0
>
>
> Heap-based topk can compute these indices in O(n log k) time



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13587) [R] Handle --use-LTO override

2021-08-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13587:
---
Labels: pull-request-available  (was: )

> [R] Handle --use-LTO override
> -
>
> Key: ARROW-13587
> URL: https://issues.apache.org/jira/browse/ARROW-13587
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0, 5.0.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We added UseLTO: false to the DESCRIPTION, but a user can override this with 
> R CMD INSTALL --use-LTO, so we can't rely on UseLTO always/on CRAN. Handle 
> what happens if we still end up in LTOland



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13587) [R] Handle --use-LTO override

2021-08-09 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-13587:
---

 Summary: [R] Handle --use-LTO override
 Key: ARROW-13587
 URL: https://issues.apache.org/jira/browse/ARROW-13587
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 6.0.0, 5.0.1


We added UseLTO: false to the DESCRIPTION, but a user can override this with R 
CMD INSTALL --use-LTO, so we can't rely on UseLTO always/on CRAN. Handle what 
happens if we still end up in LTOland



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13586) [R][CI] Clean up crossbow R templates

2021-08-09 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-13586:
---

 Summary: [R][CI] Clean up crossbow R templates
 Key: ARROW-13586
 URL: https://issues.apache.org/jira/browse/ARROW-13586
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Continuous Integration
Reporter: Neal Richardson
Assignee: Jonathan Keane
 Fix For: 6.0.0


* "flags" is misnamed
* Other additional env vars could probably be consolidated into that more 
transparently (rather than an alternation of CAPITALIZED and lower_case 
spellings)
* Is there a Jinja way to iterate over a list of params and do things like 
prepend {{-e }}?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12873) [C++][Compute] Support tagging ExecBatches with arbitrary extra information

2021-08-09 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396011#comment-17396011
 ] 

David Li commented on ARROW-12873:
--

To me this is mostly isomorphic to the original proposal, except using Datum 
instead of void*. You could imagine attaching the secondary vector to 
ExecBatch. This also still has the issue that we need to preserve some global 
ordering of metadata and/or know which metadata go with which types of kernels, 
something that using a map-like interface as originally proposed would at least 
avoid.

To your last point, until relatively recently, there was no way to define 
"VarArgs with at least N" parameters without jumping through some hoops. But 
even with that support, there's still the issue of ordering the output 
metadata/parameters between different kinds of kernels.

> [C++][Compute] Support tagging ExecBatches with arbitrary extra information
> ---
>
> Key: ARROW-12873
> URL: https://issues.apache.org/jira/browse/ARROW-12873
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Major
>
> Ideally, ExecBatches could be tagged with arbitrary optional objects for 
> tracing purposes and to transmit execution hints from one ExecNode to another.
> These should *not* be explicit members like ExecBatch::selection_vector is, 
> since they may not originate from the arrow library. For an example within 
> the arrow project: {{libarrow_dataset}} will be used to produce ScanNodes and 
> a WriteNodes and it's useful to tag scanned batches with their {{Fragment}} 
> of origin. However adding {{ExecBatch::fragment}} would result in a cyclic 
> dependency.
> To facilitate this tagging capability, we would need a type erased container 
> something like
> {code}
> struct AnySet {
>   void* Get(tag_t tag);
>   void Set(tag_t tag, void* value, FnOnce destructor);
> };
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13585) [GLib] Add support for C ABI interface

2021-08-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13585:
---
Labels: pull-request-available  (was: )

> [GLib] Add support for C ABI interface
> --
>
> Key: ARROW-13585
> URL: https://issues.apache.org/jira/browse/ARROW-13585
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13585) [GLib] Add support for C ABI interface

2021-08-09 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-13585:


 Summary: [GLib] Add support for C ABI interface
 Key: ARROW-13585
 URL: https://issues.apache.org/jira/browse/ARROW-13585
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)