[ 
https://issues.apache.org/jira/browse/GIRAPH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413024#comment-13413024
 ] 

Alessandro Presta commented on GIRAPH-247:
------------------------------------------

Looks good to me. Let's see what the committers say.
                
> Introduce edge based partitioning for InputSplits
> -------------------------------------------------
>
>                 Key: GIRAPH-247
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-247
>             Project: Giraph
>          Issue Type: Improvement
>          Components: graph
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-247-1.patch, GIRAPH-247-2.patch
>
>
> Experiments on larger data input sets while maintaining low memory profile 
> has revealed that typical social graph data is very lumpy and partitioning by 
> vertices can easily overload some unlucky worker nodes who end up with 
> partitions containing highly-connected vertices while other nodes process 
> partitions with the same number of vertices but far fewer out-edges per 
> vertex. This often results in cascading failures during data load-in even on 
> tiny data sets.
> By partitioning using edges (the default I set in 
> GiraphJob.MAX_EDGES_PER_PARTITION_DEFAULT is 200,000 per partition, or the 
> old default # of vertices, whichever the user's input format reaches first 
> when reading InputSplits) I have seen dramatic "de-lumpification" of data, 
> allow the processing of 8x larger data sets before memory problems occur at a 
> given configuration setting.
> This needs more tuning, but comes with a -Dgiraph.maxEdgesPerPartition that 
> can be set to more edges/partition as your data sets grow or memory 
> limitations shrink. This might be considered a first attempt, perhaps simply 
> allowing us to default to this type of partitioning or the old version would 
> be more compatible with existing users' needs? That would not be a hard 
> feature to add to this. But I think this method of partition production has 
> merit for typical large-scale graph data that Giraph is designed to process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to