[ https://issues.apache.org/jira/browse/GIRAPH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13412952#comment-13412952 ]

Eli Reisman commented on GIRAPH-247:
------------------------------------

Good point, I'll look at that. I wanted to be sure the count was per-partition, 
and I didn't see a clean way to get that from a freshly created partition 
without calling that. Perhaps the VertexEdgeCount would be available at that 
spot in the code; let me take another look. I will look at GIRAPH-249 too, as 
finding a way to lower the per-worker memory profile is of great interest to me.

> Introduce edge based partitioning for InputSplits
> -------------------------------------------------
>
>                 Key: GIRAPH-247
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-247
>             Project: Giraph
>          Issue Type: Improvement
>          Components: graph
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-247-1.patch
>
>
> Experiments on larger input data sets while maintaining a low memory profile 
> have revealed that typical social graph data is very lumpy, and partitioning by 
> vertices can easily overload some unlucky worker nodes that end up with 
> partitions containing highly connected vertices, while other nodes process 
> partitions with the same number of vertices but far fewer out-edges per 
> vertex. This often results in cascading failures during data load-in, even on 
> tiny data sets.
> By partitioning using edges (the default I set in 
> GiraphJob.MAX_EDGES_PER_PARTITION_DEFAULT is 200,000 edges per partition, or the 
> old default number of vertices, whichever the user's input format reaches first 
> when reading InputSplits), I have seen dramatic "de-lumpification" of the data, 
> allowing the processing of 8x larger data sets before memory problems occur at a 
> given configuration setting.
> This needs more tuning, but it comes with a -Dgiraph.maxEdgesPerPartition option 
> that can be set to more edges per partition as your data sets grow or your memory 
> limitations shrink. This might be considered a first attempt; perhaps simply 
> letting users choose between this type of partitioning and the old version would 
> be more compatible with existing users' needs? That would not be a hard 
> feature to add. But I think this method of partition production has 
> merit for the typical large-scale graph data that Giraph is designed to process.
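The splitting rule described above (close out a partition as soon as either the edge 
cap or the old per-vertex cap is reached, whichever comes first) can be sketched 
roughly as follows. This is an illustrative sketch only, not the GIRAPH-247 patch 
itself; the class and method names (EdgeBasedPartitioner, addVertex, flushPartition) 
and the representation of a vertex as a {vertexId, numEdges} pair are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of edge-based partitioning while reading an InputSplit.
// All names here are made up for the example; they are not Giraph APIs.
public class EdgeBasedPartitioner {
    private final long maxEdgesPerPartition;    // e.g. 200,000 in the patch's default
    private final long maxVerticesPerPartition; // the old per-vertex default

    private long edgeCount = 0;
    private long vertexCount = 0;
    // Each vertex is represented here as a {vertexId, numOutEdges} pair.
    private final List<long[]> current = new ArrayList<>();
    private final List<List<long[]>> partitions = new ArrayList<>();

    public EdgeBasedPartitioner(long maxEdges, long maxVertices) {
        this.maxEdgesPerPartition = maxEdges;
        this.maxVerticesPerPartition = maxVertices;
    }

    /** Add a vertex; start a new partition once either cap is reached first. */
    public void addVertex(long vertexId, long numOutEdges) {
        current.add(new long[] {vertexId, numOutEdges});
        vertexCount++;
        edgeCount += numOutEdges;
        if (edgeCount >= maxEdgesPerPartition
                || vertexCount >= maxVerticesPerPartition) {
            flushPartition();
        }
    }

    /** Seal the current partition and reset the running counts. */
    private void flushPartition() {
        partitions.add(new ArrayList<>(current));
        current.clear();
        edgeCount = 0;
        vertexCount = 0;
    }

    /** Flush any trailing vertices and return the finished partitions. */
    public List<List<long[]>> finish() {
        if (!current.isEmpty()) {
            flushPartition();
        }
        return partitions;
    }
}
```

With this rule, a run of highly connected vertices fills a partition quickly on the 
edge cap, while sparse vertices accumulate until the vertex cap is hit, which is 
what evens out the per-worker load.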

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
