[ 
https://issues.apache.org/jira/browse/PHOENIX-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16488452#comment-16488452
 ] 

James Taylor commented on PHOENIX-4751:
---------------------------------------

Yes, you're right [~sangudi] - since the aggregation is done after the join, 
the SORT-MERGE-JOIN will be completely done on the client (and hence would 
benefit from being able to do a hash aggregation instead). The 
SpillableGroupByCache is used on the server-side hash aggregation. It would 
work on the client side as well (if you can write to the file system). It 
basically tries to do everything in memory and then past a memory threshold 
will spill to disk.

> Support client-side hash aggregation with SORT_MERGE_JOIN
> ---------------------------------------------------------
>
>                 Key: PHOENIX-4751
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4751
>             Project: Phoenix
>          Issue Type: Improvement
>    Affects Versions: 4.14.0, 4.13.1
>            Reporter: Gerald Sangudi
>            Priority: Major
>
> A GROUP BY that follows a SORT_MERGE_JOIN should be able to use hash 
> aggregation in some cases, for improved performance.
> When a GROUP BY follows a SORT_MERGE_JOIN, the GROUP BY does not use hash 
> aggregation. It instead performs a CLIENT SORT followed by a CLIENT 
> AGGREGATE. The performance can be improved if (a) the GROUP BY output does 
> not need to be sorted, and (b) the GROUP BY input is large enough and has low 
> cardinality.
> The hash aggregation can initially be a hint. Here is an example from Phoenix 
> 4.13.1 that would benefit from hash aggregation if the GROUP BY input is 
> large with low cardinality.
> CREATE TABLE unsalted (
>  keyA BIGINT NOT NULL,
>  keyB BIGINT NOT NULL,
>  val SMALLINT,
>  CONSTRAINT pk PRIMARY KEY (keyA, keyB)
>  );
> EXPLAIN
>  SELECT /*+ USE_SORT_MERGE_JOIN */ 
>  t1.val v1, t2.val v2, COUNT(\*) c 
>  FROM unsalted t1 JOIN unsalted t2 
>  ON (t1.keyA = t2.keyA) 
>  GROUP BY t1.val, t2.val;
>  
> +-------------------------------------------------------------+----------------++------------------+
> |PLAN|EST_BYTES_READ|EST_ROWS_READ| |
> +-------------------------------------------------------------+----------------++------------------+
> |SORT-MERGE-JOIN (INNER) TABLES|null|null| |
> |    CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER UNSALTED|null|null| |
> |AND|null|null| |
> |    CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER UNSALTED|null|null| |
> |CLIENT SORTED BY [TO_DECIMAL(T1.VAL), T2.VAL]|null|null| |
> |CLIENT AGGREGATE INTO DISTINCT ROWS BY [T1.VAL, T2.VAL]|null|null| |
> +-------------------------------------------------------------+----------------++------------------+



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to