Nikhil created MADLIB-1367:
------------------------------
Summary: WCC: Improve performance with grouping
Key: MADLIB-1367
URL: https://issues.apache.org/jira/browse/MADLIB-1367
Project: Apache MADlib
Issue Type: Bug
Reporter: Nikhil
Fix For: v1.17
As seen in this [PR|https://github.com/apache/madlib/pull/364], using
{{distributed by}} on multiple columns slowed the query down because GPDB
redistributes the data. We did not address the similar issue for grouping as
part of the previous story.
The {{newupdate}} and {{message}} tables are distributed on {{grouping_cols}}
and {{vertex_id}}. This has to change: our original assumption was that data
would be distributed by the grouping columns first and then by {{vertex_id}},
but in this case the distribution actually happens over the array of the key
values.
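The effect described above can be illustrated with a toy model. This is only a sketch and not MADlib or GPDB code: the segment count, the {{segment_for}} helper and the CRC32-based hash are all assumptions made for illustration, and GPDB's real hash function differs. It shows that hashing the combined key values scatters the rows of one group across segments, while hashing the grouping column alone keeps each group on a single segment.

```python
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # hypothetical cluster size, for illustration only


def segment_for(key_values, num_segments=NUM_SEGMENTS):
    """Toy stand-in for hash distribution: hash all key values together."""
    encoded = "|".join(str(v) for v in key_values).encode("utf-8")
    return zlib.crc32(encoded) % num_segments


def segments_per_group(rows, key_fn):
    """Map each group to the set of segments that hold its rows."""
    placement = defaultdict(set)
    for group, vertex_id in rows:
        placement[group].add(segment_for(key_fn(group, vertex_id)))
    return dict(placement)


# Two groups with 50 vertices each (made-up data).
rows = [(g, v) for g in ("a", "b") for v in range(1, 51)]

# Distribution key = (grouping col, vertex_id): hashed as one combined key.
by_both = segments_per_group(rows, lambda g, v: (g, v))

# Distribution key = grouping col only: every row of a group co-locates.
by_group = segments_per_group(rows, lambda g, v: (g,))

for group in sorted(by_both):
    print(group, "combined key:", sorted(by_both[group]),
          "group key only:", sorted(by_group[group]))
```

In this model, a per-group operation over data placed by the combined key has to pull rows from several segments, which is the kind of data movement the perf test in the acceptance criteria is meant to expose.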
Acceptance:
1. Run a perf test with grouping to reproduce the performance issue.
2. Fix any perf issue found with grouping.
3. We may have similar issues in HITS and PageRank; create follow-on JIRAs for
them.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)