[ 
https://issues.apache.org/jira/browse/GIRAPH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13414474#comment-13414474
 ] 

Eli Reisman commented on GIRAPH-247:
------------------------------------

UPDATE: I'm working on another issue (to be posted soon) that has given me some 
insight into this issue. This fix cannot delumpify the graph entirely, as it 
turns out that each of these Partition objects are sent to their new home when 
flushed as a collection of vertices to be simple combined at the other end 
(including locally) with an existing partition object of the same owner Id.

This means, even if you set the maxEdgesPerPartition or maxVerticesPerPartition 
very low, you are merely altering the granularity of the network messages 
(groups of vertices) being sent to their new owners after being read from 
InputSplits at a given local worker.

It turns out this is still very helpful, but has given me insight into why 
Netty fails when I ratchet up the maximum edges to what seems like a more 
reasonable default, and why it seems to thrive when I set it to my original 
lower setting (100k edges per set of vertices sent out over the wire)

In conclusion: this patch is still a very valuable way to regulate the size of 
outgoing groups of vertices on their way across the wire to a new home during 
INPUT_SUPERSTEP, but the default should be lower (perhaps even 50k total edges 
per message) and it further validates why checking for # of edges and then # of 
vertices in a collection that is to be sent out will catch oversizes messages 
will catch oversized messages before they get too big much more effectively 
than the old "10k vertices or more" check in the existing codebase.

I will upload an updated patch with a lower default that is more in keeping 
with the cluster metrics I'm getting in terms of optimal time and memory use 
during INPUT_SUPERSTEP given this new insight. I think the patch (which not 
doing what the title claims) is still very useful and effective for the reasons 
I've outlined here.

                
> Introduce edge based partitioning for InputSplits
> -------------------------------------------------
>
>                 Key: GIRAPH-247
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-247
>             Project: Giraph
>          Issue Type: Improvement
>          Components: graph
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-247-1.patch, GIRAPH-247-2.patch, 
> GIRAPH-247-3.patch
>
>
> Experiments on larger data input sets while maintaining low memory profile 
> has revealed that typical social graph data is very lumpy and partitioning by 
> vertices can easily overload some unlucky worker nodes who end up with 
> partitions containing highly-connected vertices while other nodes process 
> partitions with the same number of vertices but far fewer out-edges per 
> vertex. This often results in cascading failures during data load-in even on 
> tiny data sets.
> By partitioning using edges (the default I set in 
> GiraphJob.MAX_EDGES_PER_PARTITION_DEFAULT is 200,000 per partition, or the 
> old default # of vertices, whichever the user's input format reaches first 
> when reading InputSplits) I have seen dramatic "de-lumpification" of data, 
> allow the processing of 8x larger data sets before memory problems occur at a 
> given configuration setting.
> This needs more tuning, but comes with a -Dgiraph.maxEdgesPerPartition that 
> can be set to more edges/partition as your data sets grow or memory 
> limitations shrink. This might be considered a first attempt, perhaps simply 
> allowing us to default to this type of partitioning or the old version would 
> be more compatible with existing users' needs? That would not be a hard 
> feature to add to this. But I think this method of partition production has 
> merit for typical large-scale graph data that Giraph is designed to process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to