[ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101710#comment-13101710 ]
Avery Ching commented on GIRAPH-11:
-----------------------------------

Regarding the difference between hash based and hash range based: it refers to how the hash code is assigned to a partition. The application dev will implement hashCode() for their vertex id, and the assignment of the hashCode() to a partition can then be hashed (i.e. hashCode() % # partitions) or range based ([0-a), [a-b), ... etc). Hope that's more clear. Code will help. It's coming soon, by mid next week I hope.

> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.
> If the user data is not sorted by the vertex id, they must first run a
> MapReduce or Pig job to generate a sorted dataset. This is often a bit
> inconvenient.
> Giraph graph partitioning is currently range based, and there are some
> advantages and disadvantages to this approach. The proposal of this JIRA
> is to allow for both range and hash based partitioning and to provide more
> flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be ordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range
> based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges
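
To make the two ways of assigning a hashCode() to a partition concrete, here is a minimal Java sketch. It is not Giraph code: the class name PartitionSketch, the method names, and the splitPoints array are all hypothetical and chosen only for illustration. It assumes the range boundaries are given as a strictly ascending array of split points, with the first partition taking every hash code below the first split point and the last partition taking everything at or above the last one.

```java
import java.util.Arrays;

/** Illustrative only; these names are assumptions, not part of Giraph. */
final class PartitionSketch {
    private PartitionSketch() {}

    /** Hash based: hashCode() % number of partitions. */
    static int hashPartition(Object vertexId, int numPartitions) {
        // floorMod keeps the result in [0, numPartitions) even when
        // hashCode() is negative, which a plain % would not.
        return Math.floorMod(vertexId.hashCode(), numPartitions);
    }

    /**
     * Range based: splitPoints must be strictly ascending.
     * Partition 0 covers hash codes below splitPoints[0], partition i covers
     * [splitPoints[i-1], splitPoints[i]), and the last partition covers
     * everything at or above the final split point.
     */
    static int rangePartition(Object vertexId, int[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, vertexId.hashCode());
        // Exact hit at index pos: the split point is the inclusive start of
        // partition pos + 1. Miss: binarySearch returns -(insertionPoint) - 1,
        // and the insertion point is the partition index.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }
}
```

With Integer vertex ids (whose hashCode() is the value itself), hashPartition(7, 4) and hashPartition(8, 4) fall into different partitions even though the ids are adjacent, while rangePartition(7, new int[] {10, 20}) and rangePartition(8, new int[] {10, 20}) both fall into the first range. That is the locality versus balance trade-off listed in the issue description.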