[ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Avery Ching updated GIRAPH-11:
------------------------------
    Attachment: GIRAPH-11.4.diff

The diff used for reviewboard, based on Jakob's review.

> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.2.diff, GIRAPH-11.3.diff, GIRAPH-11.4.diff,
>                      GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.
> If the user's data is not sorted by vertex id, the user must first run a
> MapReduce or Pig job to generate a sorted dataset. This is often a bit
> inconvenient.
> Giraph graph partitioning is currently range-based, and this approach has
> both advantages and disadvantages. The proposal of this JIRA is to allow
> both range-based and hash-based partitioning and to give the user more
> flexibility.
>
> Design goals for the graph distribution:
> * Allow vertices to be ordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range
>   based)
> * Ability to provide user-specific hints about partitions
>
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
>
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges
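
To make the trade-off between the two schemes concrete, here is a minimal sketch in Java. It is not taken from the attached diffs or from the Giraph code base: the class name PartitionSketch, the methods hashPartition and rangePartition, and the idea of keying each range by its maximum vertex id are all assumptions made for illustration only.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; none of these names come from Giraph. */
final class PartitionSketch {

    /** Hash-based: spreads ids evenly, but neighbouring ids land apart. */
    static int hashPartition(long vertexId, int partitionCount) {
        // floorMod keeps the result in [0, partitionCount) even for negative hashes.
        return Math.floorMod(Long.hashCode(vertexId), partitionCount);
    }

    /**
     * Range-based: each map entry is (largest vertex id in the range, partition).
     * Neighbouring ids stay together, and a range can be split by adding a key.
     */
    static int rangePartition(long vertexId, TreeMap<Long, Integer> maxIdToPartition) {
        // ceilingEntry returns the range with the smallest upper bound >= vertexId.
        Map.Entry<Long, Integer> range = maxIdToPartition.ceilingEntry(vertexId);
        if (range == null) {
            throw new IllegalArgumentException("No range covers vertex id " + vertexId);
        }
        return range.getValue();
    }

    public static void main(String[] args) {
        TreeMap<Long, Integer> ranges = new TreeMap<>();
        ranges.put(99L, 0);             // ids up to 99 go to partition 0
        ranges.put(199L, 1);            // ids 100..199 go to partition 1
        ranges.put(Long.MAX_VALUE, 2);  // everything above goes to partition 2

        for (long id : new long[] {5L, 6L, 150L, 151L}) {
            System.out.println("id " + id
                + " hash=" + hashPartition(id, 4)
                + " range=" + rangePartition(id, ranges));
        }
    }
}
```

In this sketch, adjacent ids such as 5 and 6 fall into different hash partitions but the same range partition, which is the locality difference listed above. A hot range could be split by inserting one more upper bound into the map, whereas the hash scheme would need a different partition count to rebalance.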