[ 
https://issues.apache.org/jira/browse/MADLIB-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikhil updated MADLIB-1367:
---------------------------
    Description: 
As seen in this [JIRA|https://issues.apache.org/jira/browse/MADLIB-1320], 
{{distributed by}} on multiple columns slowed the query down because GPDB 
redistributes data. We did not address the similar issue for the grouping case 
as part of the previous story.

The {{newupdate}} and {{message}} tables are distributed on {{grouping_cols}} 
and {{vertex_id}}. This has to change: we originally assumed that data would 
be distributed by {{grouping_cols}} first and then by {{vertex_id}}, but the 
distribution is actually computed over the array of all the key values.
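
The difference between the two assumptions can be illustrated with a toy 
model. This is only a sketch: the CRC32-based hash, the segment count, and the 
{{grp}} column name are assumptions made for illustration and do not reflect 
GPDB's actual hash function or MADlib's table schema. It shows that hashing 
the combined key values places rows of the same group on several segments, 
whereas hashing the grouping value alone keeps each group on one segment.

```python
import zlib

NUM_SEGMENTS = 4  # hypothetical cluster size, chosen for illustration


def segment_for(key_values, num_segments=NUM_SEGMENTS):
    """Toy stand-in for hash distribution: hash all key values together."""
    payload = "|".join(str(v) for v in key_values).encode("utf-8")
    return zlib.crc32(payload) % num_segments


def segments_used(rows, key_columns):
    """Map each group to the set of segments that hold its rows."""
    placement = {}
    for row in rows:
        seg = segment_for([row[c] for c in key_columns])
        placement.setdefault(row["grp"], set()).add(seg)
    return placement


# Two groups with 50 vertices each ("grp" is a hypothetical grouping column).
rows = [{"grp": g, "vertex_id": v} for g in ("a", "b") for v in range(50)]

# Keyed on the grouping column only: every group stays on a single segment.
by_group_only = segments_used(rows, ["grp"])

# Keyed on (grp, vertex_id): the hash covers the whole key array, so rows of
# one group are scattered over several segments.
by_group_and_vertex = segments_used(rows, ["grp", "vertex_id"])

for grp in sorted(by_group_only):
    print(grp, len(by_group_only[grp]), len(by_group_and_vertex[grp]))
```

In this model, any later step that needs the rows of one group together would 
have to move data between segments when the composite key is used, which is 
the kind of redistribution cost described above.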

Acceptance:
 1. Run a perf test with grouping to reproduce the performance issue.
 2. Fix any perf issue found with grouping.
 3. HITS and PageRank may have similar issues; create follow-on JIRAs for 
them.

  was:
As seen in this [PR|https://github.com/apache/madlib/pull/364], {{distributed 
by}} on multiple columns slowed the query down because GPDB redistributes 
data. We did not address the similar issue for the grouping case as part of 
the previous story.

The {{newupdate}} and {{message}} tables are distributed on {{grouping_cols}} 
and {{vertex_id}}. This has to change: we originally assumed that data would 
be distributed by {{grouping_cols}} first and then by {{vertex_id}}, but the 
distribution is actually computed over the array of all the key values.

Acceptance:
 1. Run a perf test with grouping to reproduce the performance issue.
 2. Fix any perf issue found with grouping.
 3. HITS and PageRank may have similar issues; create follow-on JIRAs for 
them.


> WCC: Improve performance with grouping
> --------------------------------------
>
>                 Key: MADLIB-1367
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1367
>             Project: Apache MADlib
>          Issue Type: Bug
>            Reporter: Nikhil
>            Priority: Major
>             Fix For: v1.17
>
>
> As seen in this [JIRA|https://issues.apache.org/jira/browse/MADLIB-1320], 
> {{distributed by}} on multiple columns slowed the query down because GPDB 
> redistributes data. We did not address the similar issue for the grouping 
> case as part of the previous story.
> The {{newupdate}} and {{message}} tables are distributed on 
> {{grouping_cols}} and {{vertex_id}}. This has to change: we originally 
> assumed that data would be distributed by {{grouping_cols}} first and then 
> by {{vertex_id}}, but the distribution is actually computed over the array 
> of all the key values.
> Acceptance:
>  1. Run a perf test with grouping to reproduce the performance issue.
>  2. Fix any perf issue found with grouping.
>  3. HITS and PageRank may have similar issues; create follow-on JIRAs for 
> them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
