[jira] [Commented] (ARROW-14035) [C++][Compute] Implement non-hash count_distinct aggregate kernel

2021-09-28 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-14035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421634#comment-17421634
 ] 

Percy Camilo Triveño Aucahuasi commented on ARROW-14035:


Related https://issues.apache.org/jira/browse/ARROW-14158

> [C++][Compute] Implement non-hash count_distinct aggregate kernel
> -
>
> Key: ARROW-14035
> URL: https://issues.apache.org/jira/browse/ARROW-14035
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ian Cook
>Assignee: Percy Camilo Triveño Aucahuasi
>Priority: Critical
>  Labels: kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> ARROW-12728 added a {{hash_count_distinct}} hash aggregate kernel, but there 
> is no non-hash {{count_distinct}} aggregate kernel.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14035) [C++][Compute] Implement non-hash count_distinct aggregate kernel

2021-09-28 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-14035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421478#comment-17421478
 ] 

Percy Camilo Triveño Aucahuasi commented on ARROW-14035:


Draft PR https://github.com/apache/arrow/pull/11257



[jira] [Commented] (ARROW-14035) [C++][Compute] Implement non-hash count_distinct aggregate kernel

2021-09-23 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-14035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419368#comment-17419368
 ] 

Percy Camilo Triveño Aucahuasi commented on ARROW-14035:


Thanks David!



[jira] [Commented] (ARROW-14035) [C++][Compute] Implement non-hash count_distinct aggregate kernel

2021-09-23 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419366#comment-17419366
 ] 

David Li commented on ARROW-14035:
--

{{value_counts}} gives you a histogram where the x-axis is the distinct values 
and the y-axis is the number of occurrences of each value. {{count_distinct}} 
is just {{COUNT(DISTINCT *)}}, i.e. a single number.

Also, {{value_counts}} is a vector kernel, whereas this should be a scalar 
aggregate kernel.
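
To make the distinction concrete, here is a minimal standard-library sketch. It does not use the Arrow API; the function names {{ValueCountsSketch}} and {{CountDistinctSketch}} are hypothetical, and the input is assumed to be a plain vector of integers with no nulls.

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical helper, not an Arrow function.
// Histogram-style result: one (value, occurrences) entry per distinct value,
// so the output grows with the number of distinct values in the input.
std::unordered_map<int64_t, int64_t> ValueCountsSketch(
    const std::vector<int64_t>& values) {
  std::unordered_map<int64_t, int64_t> counts;
  for (int64_t v : values) {
    ++counts[v];
  }
  return counts;
}

// Hypothetical helper, not an Arrow function.
// Scalar aggregate: the whole input collapses to a single number,
// the count of distinct values.
int64_t CountDistinctSketch(const std::vector<int64_t>& values) {
  std::unordered_set<int64_t> seen(values.begin(), values.end());
  return static_cast<int64_t>(seen.size());
}
```

For the input 1, 2, 2, 3, 3, 3 the first function returns three entries (1 once, 2 twice, 3 three times), while the second returns only the number 3.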



[jira] [Commented] (ARROW-14035) [C++][Compute] Implement non-hash count_distinct aggregate kernel

2021-09-23 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-14035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419364#comment-17419364
 ] 

Percy Camilo Triveño Aucahuasi commented on ARROW-14035:


Thanks [~icook]. Another question: what is the difference between 
_value_counts_ and _count_distinct_?

[https://github.com/apache/arrow/blob/master/docs/source/cpp/compute.rst#associative-transforms]

 



[jira] [Commented] (ARROW-14035) [C++][Compute] Implement non-hash count_distinct aggregate kernel

2021-09-22 Thread Ian Cook (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17418866#comment-17418866
 ] 

Ian Cook commented on ARROW-14035:
--

{quote}1. Do we need to compute the same thing of hash_count_distinct but 
without using the hash table from the hash group?
{quote}
Yes
{quote}Are we going to offer non hash version for all hash_x functions too? 
(hash_distinct, hash_count, hash_sum)
{quote}
Yes, I think we should aim for that (or nearly that; there might be a few 
exceptions where it does not make sense). Comparing the lists of aggregation 
functions and hash (grouped) aggregation functions in 
[compute.rst|https://github.com/apache/arrow/blob/master/docs/source/cpp/compute.rst], 
they are mostly the same already, with just a few differences. I think this 
issue and ARROW-13309 are the two most important additions for bringing these 
two lists closer to parity.
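
As a rough illustration of the two lists being compared, the sketch below contrasts a whole-column aggregate with its grouped counterpart. It uses only the C++ standard library, not the Arrow API; {{CountDistinctWholeColumn}} and {{CountDistinctPerGroup}} are hypothetical names, and integer keys and values without nulls are assumed.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a non-grouped aggregate:
// one result for the entire column.
int64_t CountDistinctWholeColumn(const std::vector<int64_t>& values) {
  std::unordered_set<int64_t> seen(values.begin(), values.end());
  return static_cast<int64_t>(seen.size());
}

// Hypothetical stand-in for a grouped ("hash") aggregate:
// one result per group key, where keys[i] is the group of values[i].
std::unordered_map<int64_t, int64_t> CountDistinctPerGroup(
    const std::vector<int64_t>& keys, const std::vector<int64_t>& values) {
  if (keys.size() != values.size()) {
    throw std::invalid_argument("keys and values must have the same length");
  }
  // Track the set of values seen so far within each group.
  std::unordered_map<int64_t, std::unordered_set<int64_t>> seen_per_group;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    seen_per_group[keys[i]].insert(values[i]);
  }
  // Reduce each group's set to its size.
  std::unordered_map<int64_t, int64_t> result;
  for (const auto& entry : seen_per_group) {
    result[entry.first] = static_cast<int64_t>(entry.second.size());
  }
  return result;
}
```

With keys 1, 1, 2, 2, 2 and values 10, 10, 20, 30, 30, the whole-column version returns 3, while the grouped version returns 1 for group 1 and 2 for group 2.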



[jira] [Commented] (ARROW-14035) [C++][Compute] Implement non-hash count_distinct aggregate kernel

2021-09-22 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-14035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17418863#comment-17418863
 ] 

Percy Camilo Triveño Aucahuasi commented on ARROW-14035:


Can you please elaborate on this requirement?
 # Do we need to compute the same thing as hash_distinct, but without using the 
hash table from the hash group?
 # Are we going to offer non-hash versions of all the hash_x functions too 
(hash_distinct, hash_count, hash_sum)?

cc [~icook] @lidavidm
