[ https://issues.apache.org/jira/browse/MADLIB-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nikhil updated MADLIB-1367:
---------------------------
    Description: 
As seen in this [JIRA|https://issues.apache.org/jira/browse/MADLIB-1320] {{distributed by}} on multiple columns caused slowness of the query as GPDB redistributes data. We had not addressed similar issue in case of grouping as part of the previous story.

{{newupdate}} and {{message}} tables are distributed on {{grouping_cols}} and {{vertex_id}}. This has to be changed since our original assumption was that data would be distributed by grouping cols first, followed by vertex_id. But instead, the distribution in this case happens over the array of the values of the keys.

Acceptance:
1. Perf test with grouping to repro the performance issue with grouping.
2. Fix possible perf issue with grouping.
3. We may have similar issues in HITS and Pagerank, create follow-on JIRAs for the same.

  was:
As seen in this [PR|https://github.com/apache/madlib/pull/364] {{distributed by}} on multiple columns caused slowness of the query as GPDB redistributes data. We had not addressed similar issue in case of grouping as part of the previous story.

{{newupdate}} and {{message}} tables are distributed on {{grouping_cols}} and {{vertex_id}}. This has to be changed since our original assumption was that data would be distributed by grouping cols first, followed by vertex_id. But instead, the distribution in this case happens over the array of the values of the keys.

Acceptance:
1. Perf test with grouping to repro the performance issue with grouping.
2. Fix possible perf issue with grouping.
3. We may have similar issues in HITS and Pagerank, create follow-on JIRAs for the same.
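The point about the distribution happening over the combined key values can be illustrated with a toy model. This is only a sketch, not GPDB's actual hash function or MADlib's code: {{segment_for}}, {{N_SEGMENTS}} and the sample rows are hypothetical names and data introduced for illustration, and CRC32 stands in for whatever hash the database really uses. It shows that when the whole key tuple is hashed, rows of one group are not kept together, and a row's segment generally differs from the one it would get under a {{vertex_id}}-only layout, which is the situation that can force a redistribution.

```python
import zlib
from collections import defaultdict

N_SEGMENTS = 8  # hypothetical number of segments in the cluster


def segment_for(key, n_segments=N_SEGMENTS):
    """Toy placement: hash the whole key tuple, then reduce modulo the segment count.

    CRC32 is an assumption standing in for the database's real hash function.
    """
    return zlib.crc32(repr(key).encode("utf-8")) % n_segments


# Hypothetical rows: (grouping column value, vertex_id)
rows = [(g, v) for g in range(4) for v in range(100)]

segments_per_group = defaultdict(set)
moved = 0
for g, v in rows:
    composite = segment_for((g, v))  # distributed by (grouping_cols, vertex_id)
    single = segment_for((v,))       # distributed by vertex_id only
    segments_per_group[g].add(composite)
    if composite != single:
        moved += 1

print(f"{moved} of {len(rows)} rows land on a different segment "
      "than under a vertex_id-only layout")
for g, segs in sorted(segments_per_group.items()):
    print(f"group {g} is spread over {len(segs)} segments")
```

In this model the grouping value only perturbs the hash; it does not act as a first-level partition, so the "grouping cols first, then vertex_id" assumption does not hold.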
> WCC: Improve performance with grouping
> --------------------------------------
>
>                 Key: MADLIB-1367
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1367
>             Project: Apache MADlib
>          Issue Type: Bug
>            Reporter: Nikhil
>            Priority: Major
>             Fix For: v1.17
>
>
> As seen in this [JIRA|https://issues.apache.org/jira/browse/MADLIB-1320]
> {{distributed by}} on multiple columns caused slowness of the query as GPDB
> redistributes data. We had not addressed similar issue in case of grouping as
> part of the previous story.
> {{newupdate}} and {{message}} tables are distributed on {{grouping_cols}} and
> {{vertex_id}}. This has to be changed since our original assumption was that
> data would be distributed by grouping cols first, followed by vertex_id. But
> instead, the distribution in this case happens over the array of the values
> of the keys.
> Acceptance:
> 1. Perf test with grouping to repro the performance issue with grouping.
> 2. Fix possible perf issue with grouping.
> 3. We may have similar issues in HITS and Pagerank, create follow-on JIRAs
> for the same.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)