[
https://issues.apache.org/jira/browse/MADLIB-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Frank McQuillan updated MADLIB-1368:
------------------------------------
Fix Version/s: v2.0 (was: v1.17)
> Identify potential performance issues in modules using distributed by clause
> ----------------------------------------------------------------------------
>
> Key: MADLIB-1368
> URL: https://issues.apache.org/jira/browse/MADLIB-1368
> Project: Apache MADlib
> Issue Type: Improvement
> Components: Module: Graph
> Reporter: Nandish Jayaram
> Priority: Major
> Fix For: v2.0
>
>
> Based on our findings in this JIRA, other modules may also take performance
> hits because of the way we currently use the {{distributed by}} clause.
> After going through the code, we noticed the following issues that we may
> want to explore further:
> *Graph modules:*
> 1. {{apsp.py_in}} This does not misuse distributed by, but we noticed that
> it creates an index for Postgres.
> 2. {{sssp.py_in}} This does not misuse distributed by, but we noticed that
> it creates an index for Postgres. The Jira to track this and the previous
> issue is https://issues.apache.org/jira/browse/MADLIB-1369
> 3. {{hits.py_in}} Uses distributed by with grouping; this must be changed.
> 4. {{pagerank.py_in}} Uses distributed by with grouping; this must be changed.
> 5. {{wcc.py_in}} Uses distributed by with grouping; this must be changed. The
> Jira to track this is https://issues.apache.org/jira/browse/MADLIB-1367
> *Non-Graph modules that use distributed by:*
> 1. {{logistic.py_in}} This is the only module that uses the group iteration
> controller from {{group_control.py_in}}, which distributes the rel_state
> table based on the grouping columns. The fix here could be to remove the
> distributed by clause present in {{group_control.py_in}}.
> 2. {{path.py_in}} A temporary table created in path is distributed on
> multiple columns; we must check whether that was intentional.
> 3. {{encode_categorical.py_in}} The output table creation query has a
> distributed by clause that uses the distribution key provided by the user as
> an input param. What was the intention behind that optional param, or rather,
> what is the expected behavior for a given param value?
> 4. {{bayes.py_in}} There are a couple of distributed by clauses. We must check
> whether they were intentional.
>
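An audit like the one described above could start from a simple text scan. The
following is a minimal sketch, not part of MADlib: the function name
find_distributed_by and the sample query are hypothetical, and it only finds
clauses written literally as "distributed by (...)". Clauses assembled from
variables at runtime would need a manual review.

```python
import re

# Matches a literal "distributed by (col1, col2)" clause, case-insensitively.
DISTRIBUTED_BY = re.compile(r"distributed\s+by\s*\(([^)]*)\)", re.IGNORECASE)


def find_distributed_by(source):
    """Return a list of (line_number, [columns]) for each clause found in source."""
    hits = []
    for match in DISTRIBUTED_BY.finditer(source):
        # Line numbers are 1-based: count newlines before the match start.
        line_number = source.count("\n", 0, match.start()) + 1
        columns = [c.strip() for c in match.group(1).split(",") if c.strip()]
        hits.append((line_number, columns))
    return hits


# Hypothetical query template in the style of a .py_in file (an assumption,
# not copied from any MADlib module).
sample = """
CREATE TABLE {out_table} AS
SELECT {grouping_cols}, id FROM {src}
DISTRIBUTED BY ({grouping_cols}, id)
"""

for line_number, columns in find_distributed_by(sample):
    print(line_number, columns)
```

Clauses that list grouping columns or more than one column are the ones worth
a closer look, since those are the patterns flagged in the list above.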