[jira] [Created] (MADLIB-1368) Identify potential performance issues in modules using distributed by clause

Nandish Jayaram (JIRA) Tue, 02 Jul 2019 14:37:48 -0700

Nandish Jayaram created MADLIB-1368:
---------------------------------------


             Summary: Identify potential performance issues in modules using 
distributed by clause
                 Key: MADLIB-1368
                 URL: https://issues.apache.org/jira/browse/MADLIB-1368
             Project: Apache MADlib
          Issue Type: Improvement
          Components: Module: Graph
            Reporter: Nandish Jayaram
             Fix For: v1.17


Based on our findings in this JIRA, there may be some performance hits in other 
modules due to the way we currently use the {{distributed by}} clause. After 
going through the code, we noticed the following issues that we may want to 
explore:
*Graph modules:*
1. {{apsp.py_in}} Does not misuse {{distributed by}}, but we noticed that it 
creates an index for Postgres.
2. {{sssp.py_in}} Does not misuse {{distributed by}}, but we noticed that it 
creates an index for Postgres.
3. {{hits.py_in}} Uses {{distributed by}} with grouping; must be changed.
4. {{pagerank.py_in}} Uses {{distributed by}} with grouping; must be changed.
5. {{wcc.py_in}} Uses {{distributed by}} with grouping; must be changed. The 
JIRA tracking this is https://issues.apache.org/jira/browse/MADLIB-1367
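
The description does not say why {{distributed by}} with grouping is a problem. One plausible reading, stated here as an assumption, is data skew: if a table is distributed on grouping columns that have only a few distinct values, the rows can land on only a few segments. The sketch below simulates that effect in plain Python. The hash function, the segment count, and the column names {{id}} and {{grp}} are all hypothetical stand-ins; this is not Greenplum's actual distribution hash.

```python
import hashlib
from collections import Counter


def segment_for(key, num_segments):
    """Map a distribution key to a segment id using a stand-in hash (md5)."""
    digest = hashlib.md5(repr(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_segments


def rows_per_segment(rows, key_columns, num_segments):
    """Count how many rows each segment would receive for the given key columns."""
    counts = Counter()
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        counts[segment_for(key, num_segments)] += 1
    return counts


NUM_SEGMENTS = 8  # hypothetical cluster size
# 1000 rows, but only two distinct values in the grouping column.
rows = [{"id": i, "grp": i % 2} for i in range(1000)]

by_group = rows_per_segment(rows, ["grp"], NUM_SEGMENTS)
by_id = rows_per_segment(rows, ["id"], NUM_SEGMENTS)

print("segments holding data when keyed on grp:", len(by_group))
print("segments holding data when keyed on id: ", len(by_id))
```

With two distinct group values, at most two segments can receive rows no matter how many segments exist, whereas a high-cardinality key spreads rows more widely.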

*Non-Graph modules that use distributed by:*
1. {{logistic.py_in}} This is the only module that uses the group iteration 
controller from {{group_control.py_in}}, which distributes the rel_state table 
based on the grouping columns. The fix here could be to remove the 
{{distributed by}} clause present in {{group_control.py_in}}.
2. {{path.py_in}} A temporary table created in path is distributed using 
multiple columns; we must check whether that was intentional.
3. {{encode_categorical.py_in}} The output table creation query has a 
{{distributed by}} clause that uses the distribution key provided by the user 
as an input param. What was the intention behind that optional param, or 
rather, what is the expected behavior for a given param value?
4. {{bayes.py_in}} There are a couple of {{distributed by}} clauses. Check 
whether they were intentional.
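
As a starting point for the audit above, here is a minimal sketch of a helper that lists {{distributed by}} clauses in a module's source text. The function name {{find_distributed_by}} and the sample SQL are hypothetical. The regular expression only matches a clause written on a single line with a simple parenthesized column list, so templated or multi-line clauses would still need a manual look.

```python
import re

# Matches "DISTRIBUTED BY (col1, col2)" case-insensitively on a single line.
DIST_BY = re.compile(r"distributed\s+by\s*\(([^)]*)\)", re.IGNORECASE)


def find_distributed_by(source_text):
    """Return (line_number, [columns]) for each distributed by clause found."""
    hits = []
    for line_no, line in enumerate(source_text.splitlines(), start=1):
        for match in DIST_BY.finditer(line):
            columns = [c.strip() for c in match.group(1).split(",") if c.strip()]
            hits.append((line_no, columns))
    return hits


# Hypothetical SQL, standing in for the contents of a .py_in file.
sample = '''
CREATE TABLE out_tbl AS
SELECT * FROM src
DISTRIBUTED BY (grp_a, grp_b);
CREATE TABLE t2 (id INT) distributed by (id);
'''

for line_no, columns in find_distributed_by(sample):
    print(line_no, columns)
```

Reading each {{.py_in}} file and passing its contents to this function would give a first list of places to review; clauses with more than one column are the ones to compare against the module's grouping columns.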

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
