[ https://issues.apache.org/jira/browse/MADLIB-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882131#comment-16882131 ]
Frank McQuillan commented on MADLIB-1368:
-----------------------------------------

Highlighting findings from WCC grouping https://issues.apache.org/jira/browse/MADLIB-1367

We may move this JIRA out of 1.17 but are leaving it for the time being.

> Identify potential performance issues in modules using distributed by clause
> ----------------------------------------------------------------------------
>
>                 Key: MADLIB-1368
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1368
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Graph
>            Reporter: Nandish Jayaram
>            Priority: Major
>             Fix For: v1.17
>
>
> Based on our findings in this JIRA, there may be some performance hits in
> other modules due to the way we use the {{distributed by}} clause at the
> moment. After going through the code, we noticed the following issues that
> we may want to explore a bit:
>
> *Graph modules:*
> 1. {{apsp.py_in}} This does not use distributed by the wrong way, but we
> noticed it creates an index for Postgres.
> 2. {{sssp.py_in}} This does not use distributed by the wrong way, but we
> noticed it creates an index for Postgres. Jira to track this and the previous
> issue is https://issues.apache.org/jira/browse/MADLIB-1369
> 3. {{hits.py_in}} Uses distributed by with grouping, must be changed.
> 4. {{pagerank.py_in}} Uses distributed by with grouping, must be changed.
> 5. {{wcc.py_in}} Uses distributed by with grouping, must be changed. Jira to
> track this is https://issues.apache.org/jira/browse/MADLIB-1367
>
> *Non-Graph modules that use distributed by:*
> 1. {{logistic.py_in}} This is the only module that uses the group iteration
> controller from {{group_control.py_in}}, which distributes the rel_state
> table based on grouping columns. The fix here could be to remove the
> distributed by clause present in {{group_control.py_in}}.
> 2. {{path.py_in}} A temporary table created in path is distributed using
> multiple columns; we must check if that was intentional.
> 3. {{encode_categorical.py_in}} The output table creation query has a
> distributed by clause which uses the distribution key provided by the user as
> an input param. What was the intention behind that optional param, or rather
> what is the expected behavior for a given param value?
> 4. {{bayes.py_in}} There are a couple of distributed by clauses. Check if
> that was intentional.
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)