[ 
https://issues.apache.org/jira/browse/MADLIB-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan updated MADLIB-1368:
------------------------------------
    Fix Version/s:     (was: v1.17)
                   v2.0

> Identify potential performance issues in modules using distributed by clause
> ----------------------------------------------------------------------------
>
>                 Key: MADLIB-1368
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1368
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Graph
>            Reporter: Nandish Jayaram
>            Priority: Major
>             Fix For: v2.0
>
>
> Based on our findings in this JIRA, there may be some performance hits in 
> other modules due to the way we currently use the {{distributed by}} clause. 
> After going through the code, we noticed the following issues that we may 
> want to explore further:
>  *Graph modules:*
>  1. {{apsp.py_in}} This does not misuse distributed by, but we 
> noticed that it creates an index for Postgres.
>  2. {{sssp.py_in}} This does not misuse distributed by, but we 
> noticed that it creates an index for Postgres. The Jira tracking this and the 
> previous issue is https://issues.apache.org/jira/browse/MADLIB-1369
>  3. {{hits.py_in}} Uses distributed by with grouping; this must be changed.
>  4. {{pagerank.py_in}} Uses distributed by with grouping; this must be changed.
>  5. {{wcc.py_in}} Uses distributed by with grouping; this must be changed. The 
> Jira tracking this is https://issues.apache.org/jira/browse/MADLIB-1367
> *Non-Graph modules that use distributed by:*
>  1. {{logistic.py_in}} This is the only module that uses the group iteration 
> controller from {{group_control.py_in}}, which distributes the rel_state table 
> based on the grouping columns. The fix here could be to remove the distributed 
> by clause present in {{group_control.py_in}}.
>  2. {{path.py_in}} A temporary table created in path is distributed on 
> multiple columns; we must check whether that was intentional.
>  3. {{encode_categorical.py_in}} The output table creation query has a 
> distributed by clause which uses the distribution key provided by the user as 
> an input param. What was the intention behind that optional param, or rather 
> what is the expected behavior for a given param value?
>  4. {{bayes.py_in}} There are a couple of distributed by clauses. Check if 
> that was intentional.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
