Eli Reisman created GIRAPH-247:
----------------------------------

             Summary: Introduce edge based partitioning for InputSplits
                 Key: GIRAPH-247
                 URL: https://issues.apache.org/jira/browse/GIRAPH-247
             Project: Giraph
          Issue Type: Improvement
          Components: graph
    Affects Versions: 0.2.0
            Reporter: Eli Reisman
            Assignee: Eli Reisman
            Priority: Minor
             Fix For: 0.2.0
         Attachments: GIRAPH-247-1.patch

Experiments on larger input data sets while maintaining a low memory profile 
have revealed that typical social graph data is very lumpy, and partitioning by 
vertices can easily overload some unlucky worker nodes that end up with 
partitions containing highly-connected vertices, while other nodes process 
partitions with the same number of vertices but far fewer out-edges per vertex. 
This often results in cascading failures during data load-in, even on tiny data 
sets.

By partitioning using edges (the default I set in 
GiraphJob.MAX_EDGES_PER_PARTITION_DEFAULT is 200,000 per partition, or the old 
default # of vertices, whichever the user's input format reaches first when 
reading InputSplits), I have seen dramatic "de-lumpification" of data, allowing 
the processing of 8x larger data sets before memory problems occur at a given 
configuration setting.
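
As a rough illustration of the "whichever limit is reached first" rule described 
above, here is a minimal, self-contained sketch. It is not the code in the 
attached patch: the class name EdgeBoundedPartitioner, the partition() method, 
and the tiny limits used in main() are all hypothetical and chosen only to make 
the behavior easy to follow. Only the idea of an edge cap alongside the existing 
vertex cap comes from the description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the GIRAPH-247 patch: closes a partition when it
// reaches either an edge cap or a vertex cap, whichever comes first.
public class EdgeBoundedPartitioner {

    /**
     * Groups vertices into partitions in the order they are read.
     *
     * @param outEdgeCounts out-edge count of each vertex, in read order
     * @param maxVertices   vertex cap per partition (the "old" limit)
     * @param maxEdges      edge cap per partition (the new limit)
     * @return partitions, each a list of vertex indices
     */
    public static List<List<Integer>> partition(
            int[] outEdgeCounts, int maxVertices, long maxEdges) {
        List<List<Integer>> partitions = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long edgesInCurrent = 0;

        for (int v = 0; v < outEdgeCounts.length; v++) {
            current.add(v);
            edgesInCurrent += outEdgeCounts[v];

            // Close the partition as soon as either cap is reached.
            if (current.size() >= maxVertices || edgesInCurrent >= maxEdges) {
                partitions.add(current);
                current = new ArrayList<>();
                edgesInCurrent = 0;
            }
        }
        // Keep any leftover vertices as a final, smaller partition.
        if (!current.isEmpty()) {
            partitions.add(current);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Two "hub" vertices (5 and 8 out-edges) among low-degree ones.
        int[] outEdgeCounts = {5, 1, 1, 8, 1, 1, 1, 1};
        // Toy caps: at most 4 vertices or 6 edges per partition.
        System.out.println(partition(outEdgeCounts, 4, 6));
        // prints [[0, 1], [2, 3], [4, 5, 6, 7]]
    }
}
```

In this toy run the two high-degree vertices each close their partition early 
on the edge cap, while the run of low-degree vertices fills a partition up to 
the vertex cap. With a vertex cap alone, the first four vertices (15 edges) 
would have landed in one partition and the last four (4 edges) in another.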

This needs more tuning, but comes with a -Dgiraph.maxEdgesPerPartition option 
that can be set to allow more edges per partition as your data sets grow or 
memory limitations shrink. This might be considered a first attempt. Perhaps 
simply making it configurable whether we default to this type of partitioning 
or the old version would be more compatible with existing users' needs? That 
would not be a hard feature to add to this patch. But I think this method of 
producing partitions has merit for the typical large-scale graph data that 
Giraph is designed to process.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
