[ 
https://issues.apache.org/jira/browse/MADLIB-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16878178#comment-16878178
 ] 

Nandish Jayaram commented on MADLIB-1367:
-----------------------------------------

w/ [~okislal]
The run time consistently degrades when we change
{{distributed by (grouping_cols, vertex_id)}} to
{{distributed by (grouping_cols)}}. The following are the run times with a
smaller dataset:
1) Dataset creation

{code}
drop table if exists vertex, edge;

create table vertex (id int) distributed by (id);
insert into vertex select i from generate_series(1,100000)i;

create table edge (src int, dest int) distributed by (src);

insert into edge
    select random()*100000 as src, random()*100000 as dest
    from generate_series(1,65000);

drop table if exists edge_group;
create table edge_group as select *, 0 as g1, 0 as g2 from edge;
insert into edge_group select *, 0 as g1, 1 as g2 from edge;
insert into edge_group select *,1 as g1, 0 as g2 from edge;
insert into edge_group select *,1 as g1, 1 as g2 from edge;
{code}

2) Query

{code}
drop table if exists out_wcc_116, out_wcc_116_summary;
select madlib.weakly_connected_components(
    'vertex', 'id', 'edge_group', NULL, 'out_wcc_116', 'g1,g2');
{code}

3) Run time with 1.16 master (distributed by both grouping cols and vertex id).

{code}
Time: 35092.122 ms
{code}

4) Run time with changes to distribute by only grouping cols.

{code}
Time: 44585.368 ms
{code}
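For reference, the two timings above imply a slowdown of roughly 27%. A quick check of the arithmetic, using only the numbers reported in steps 3 and 4:

{code}
-- Relative slowdown of step 4 (grouping cols only) versus step 3
-- (grouping cols and vertex id), in percent.
select round((44585.368 - 35092.122) / 35092.122 * 100, 1) as slowdown_pct;
{code}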
So the plan is to leave the distribution key unchanged and revisit this if a
user reports performance problems with grouping in WCC.
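
One possible way to look into the difference, if this is revisited, is to compare how rows spread across segments under each distribution key. The sketch below was not run as part of this comparison. It uses {{edge_group}} from step 1 as a stand-in for the internal {{newupdate}} and {{message}} tables and {{src}} as a stand-in for {{vertex_id}}, and the table names {{edge_group_by_grp}} and {{edge_group_by_grp_src}} are made up for illustration. Since {{(g1, g2)}} has only four distinct values here, hash distribution on those columns alone can place rows on at most four segments.

{code}
-- Hypothetical stand-in tables; MADlib's internal WCC tables are not touched.
drop table if exists edge_group_by_grp, edge_group_by_grp_src;

-- Copy distributed by the grouping columns only.
create table edge_group_by_grp as
    select * from edge_group distributed by (g1, g2);

-- Copy distributed by the grouping columns plus a vertex column.
create table edge_group_by_grp_src as
    select * from edge_group distributed by (g1, g2, src);

-- gp_segment_id is the Greenplum system column identifying the segment
-- that stores each row; uneven counts indicate skew.
select gp_segment_id, count(*) as row_count
from edge_group_by_grp
group by gp_segment_id
order by gp_segment_id;

select gp_segment_id, count(*) as row_count
from edge_group_by_grp_src
group by gp_segment_id
order by gp_segment_id;
{code}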

> WCC: Improve performance with grouping
> --------------------------------------
>
>                 Key: MADLIB-1367
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1367
>             Project: Apache MADlib
>          Issue Type: Bug
>            Reporter: Nikhil
>            Priority: Major
>             Fix For: v1.17
>
>
> As seen in this [JIRA|https://issues.apache.org/jira/browse/MADLIB-1320],
> {{distributed by}} on multiple columns caused slowness of the query as GPDB
> redistributes data. We had not addressed the similar issue in the case of
> grouping as part of the previous story.
> {{newupdate}} and {{message}} tables are distributed on {{grouping_cols}} and 
> {{vertex_id}}. This has to be changed since our original assumption was that 
> data would be distributed by grouping cols first, followed by vertex_id. But 
> instead, the distribution in this case happens over the array of the values 
> of the keys.
> Acceptance:
>  1. Perf test with grouping to repro the performance issue with grouping.
>  2. Fix possible perf issue with grouping.
>  3. We may have similar issues in HITS and Pagerank, create follow-on JIRAs 
> for the same.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
