[
https://issues.apache.org/jira/browse/GIRAPH-244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13421443#comment-13421443
]
Alessandro Presta commented on GIRAPH-244:
------------------------------------------
I backported the new SSSP benchmark to run some more comparisons.
TL;DR: performance is essentially the same with both EdgeListVertex and
HashMapVertex.
10K edges per vertex is probably not enough to see a significant improvement in
iteration speed.
All in all, I think this is good to go now.
Benchmark results:
ShortestPathsBenchmark, using EdgeListVertex
100K vertices, 10K edges per vertex, 50 workers
trunk + GIRAPH-244:
{code}
12/07/24 06:05:34 INFO mapred.JobClient: Giraph Stats
12/07/24 06:05:34 INFO mapred.JobClient: Aggregate edges=1000000000
12/07/24 06:05:34 INFO mapred.JobClient: Superstep=26
12/07/24 06:05:34 INFO mapred.JobClient: Current workers=50
12/07/24 06:05:34 INFO mapred.JobClient: Last checkpointed superstep=0
12/07/24 06:05:34 INFO mapred.JobClient: Current master task partition=0
12/07/24 06:05:34 INFO mapred.JobClient: Sent messages=0
12/07/24 06:05:34 INFO mapred.JobClient: Aggregate finished vertices=100000
12/07/24 06:05:34 INFO mapred.JobClient: Aggregate vertices=100000
12/07/24 06:05:34 INFO mapred.JobClient: FileSystemCounters
12/07/24 06:05:34 INFO mapred.JobClient: HDFS_FILES_CREATED=104
12/07/24 06:05:34 INFO mapred.JobClient: Map-Reduce Framework
12/07/24 06:05:34 INFO mapred.JobClient: Map input records=51
12/07/24 06:05:34 INFO mapred.JobClient: Total physical memory in bytes=1325149147136
12/07/24 06:05:34 INFO mapred.JobClient: Spilled Records=0
12/07/24 06:05:34 INFO mapred.JobClient: MAP_TASK_WALLCLOCK=15131488
12/07/24 06:05:34 INFO mapred.JobClient: Total cumulative CPU milliseconds=48163170
12/07/24 06:05:34 INFO mapred.JobClient: Total virtual memory in bytes=4487093018624
12/07/24 06:05:34 INFO mapred.JobClient: Map output records=0
real 5m13.062s
user 0m8.228s
sys 0m0.977s
{code}
trunk:
{code}
12/07/24 06:11:29 INFO mapred.JobClient: Giraph Stats
12/07/24 06:11:29 INFO mapred.JobClient: Aggregate edges=1000000000
12/07/24 06:11:29 INFO mapred.JobClient: Superstep=26
12/07/24 06:11:29 INFO mapred.JobClient: Current workers=50
12/07/24 06:11:29 INFO mapred.JobClient: Last checkpointed superstep=0
12/07/24 06:11:29 INFO mapred.JobClient: Current master task partition=0
12/07/24 06:11:29 INFO mapred.JobClient: Sent messages=0
12/07/24 06:11:29 INFO mapred.JobClient: Aggregate finished vertices=100000
12/07/24 06:11:29 INFO mapred.JobClient: Aggregate vertices=100000
12/07/24 06:11:29 INFO mapred.JobClient: FileSystemCounters
12/07/24 06:11:29 INFO mapred.JobClient: HDFS_FILES_CREATED=104
12/07/24 06:11:29 INFO mapred.JobClient: Map-Reduce Framework
12/07/24 06:11:29 INFO mapred.JobClient: Map input records=51
12/07/24 06:11:29 INFO mapred.JobClient: Total physical memory in bytes=1286495846400
12/07/24 06:11:29 INFO mapred.JobClient: Spilled Records=0
12/07/24 06:11:29 INFO mapred.JobClient: MAP_TASK_WALLCLOCK=16066522
12/07/24 06:11:29 INFO mapred.JobClient: Total cumulative CPU milliseconds=49089980
12/07/24 06:11:29 INFO mapred.JobClient: Total virtual memory in bytes=4486483517440
12/07/24 06:11:29 INFO mapred.JobClient: Map output records=0
real 5m35.006s
user 0m7.908s
sys 0m0.947s
{code}
ShortestPathsBenchmark, using HashMapVertex
100K vertices, 10K edges per vertex, 50 workers
trunk + GIRAPH-244:
{code}
12/07/24 06:42:05 INFO mapred.JobClient: Giraph Stats
12/07/24 06:42:05 INFO mapred.JobClient: Aggregate edges=1000000000
12/07/24 06:42:05 INFO mapred.JobClient: Superstep=26
12/07/24 06:42:05 INFO mapred.JobClient: Current workers=50
12/07/24 06:42:05 INFO mapred.JobClient: Last checkpointed superstep=0
12/07/24 06:42:05 INFO mapred.JobClient: Current master task partition=0
12/07/24 06:42:05 INFO mapred.JobClient: Sent messages=0
12/07/24 06:42:05 INFO mapred.JobClient: Aggregate finished vertices=100000
12/07/24 06:42:05 INFO mapred.JobClient: Aggregate vertices=100000
12/07/24 06:42:05 INFO mapred.JobClient: FileSystemCounters
12/07/24 06:42:05 INFO mapred.JobClient: HDFS_FILES_CREATED=104
12/07/24 06:42:05 INFO mapred.JobClient: Map-Reduce Framework
12/07/24 06:42:05 INFO mapred.JobClient: Map input records=51
12/07/24 06:42:05 INFO mapred.JobClient: Total physical memory in bytes=1400065339392
12/07/24 06:42:05 INFO mapred.JobClient: Spilled Records=0
12/07/24 06:42:05 INFO mapred.JobClient: MAP_TASK_WALLCLOCK=14865246
12/07/24 06:42:05 INFO mapred.JobClient: Total cumulative CPU milliseconds=48159780
12/07/24 06:42:05 INFO mapred.JobClient: Total virtual memory in bytes=4478185152512
12/07/24 06:42:05 INFO mapred.JobClient: Map output records=0
real 5m6.939s
user 0m7.738s
sys 0m0.908s
{code}
trunk:
{code}
12/07/24 06:32:45 INFO mapred.JobClient: Giraph Stats
12/07/24 06:32:45 INFO mapred.JobClient: Aggregate edges=1000000000
12/07/24 06:32:45 INFO mapred.JobClient: Superstep=26
12/07/24 06:32:45 INFO mapred.JobClient: Current workers=50
12/07/24 06:32:45 INFO mapred.JobClient: Last checkpointed superstep=0
12/07/24 06:32:45 INFO mapred.JobClient: Current master task partition=0
12/07/24 06:32:45 INFO mapred.JobClient: Sent messages=0
12/07/24 06:32:45 INFO mapred.JobClient: Aggregate finished vertices=100000
12/07/24 06:32:45 INFO mapred.JobClient: Aggregate vertices=100000
12/07/24 06:32:45 INFO mapred.JobClient: FileSystemCounters
12/07/24 06:32:45 INFO mapred.JobClient: HDFS_FILES_CREATED=104
12/07/24 06:32:45 INFO mapred.JobClient: Map-Reduce Framework
12/07/24 06:32:45 INFO mapred.JobClient: Map input records=51
12/07/24 06:32:45 INFO mapred.JobClient: Total physical memory in bytes=1373213544448
12/07/24 06:32:45 INFO mapred.JobClient: Spilled Records=0
12/07/24 06:32:45 INFO mapred.JobClient: MAP_TASK_WALLCLOCK=14675125
12/07/24 06:32:45 INFO mapred.JobClient: Total cumulative CPU milliseconds=49132170
12/07/24 06:32:45 INFO mapred.JobClient: Total virtual memory in bytes=4494050402304
12/07/24 06:32:45 INFO mapred.JobClient: Map output records=0
real 5m4.375s
user 0m7.042s
sys 0m0.902s
{code}
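As a rough cross-check of the "essentially the same" conclusion, the `real` times above can be turned into relative differences. The sketch below is not part of the patch or of Giraph: the class and method names are made up for illustration, and it only parses the time strings copied from the four runs above. Each configuration was run once, so small differences may well be noise.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: compares `real` times as printed by the shell's time command. */
class WallClockComparison {
  // Matches strings such as "5m13.062s" (minutes, then seconds with optional fraction).
  private static final Pattern TIME =
      Pattern.compile("(\\d+)m(\\d+(?:\\.\\d+)?)s");

  /** Converts a string like "5m13.062s" to seconds. */
  static double parseSeconds(String real) {
    Matcher m = TIME.matcher(real.trim());
    if (!m.matches()) {
      throw new IllegalArgumentException("Unexpected time format: " + real);
    }
    return Integer.parseInt(m.group(1)) * 60 + Double.parseDouble(m.group(2));
  }

  /** Positive result: the patched run was faster than trunk by that percentage. */
  static double percentFaster(String patched, String trunk) {
    double p = parseSeconds(patched);
    double t = parseSeconds(trunk);
    return (t - p) / t * 100.0;
  }

  public static void main(String[] args) {
    // Inputs are the `real` values from the benchmark output above.
    System.out.printf("EdgeListVertex: %.1f%%%n",
        percentFaster("5m13.062s", "5m35.006s"));
    System.out.printf("HashMapVertex:  %.1f%%%n",
        percentFaster("5m6.939s", "5m4.375s"));
  }
}
```

With these inputs, the EdgeListVertex run with the patch comes out roughly 6.6% faster than trunk, and the HashMapVertex run comes out under 1% slower.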
> Vertex API redesign
> -------------------
>
> Key: GIRAPH-244
> URL: https://issues.apache.org/jira/browse/GIRAPH-244
> Project: Giraph
> Issue Type: Improvement
> Reporter: Alessandro Presta
> Assignee: Alessandro Presta
> Attachments: GIRAPH-244.patch, GIRAPH-244.patch, GIRAPH-244.patch,
> GIRAPH-244.patch, GIRAPH-244.patch, GIRAPH-244.patch, GIRAPH-244.patch,
> GIRAPH-244.patch, GIRAPH-244.patch, GIRAPH-244.patch, GIRAPH-244.patch,
> GIRAPH-244.patch, GIRAPH-244.patch, GIRAPH-244.patch, GIRAPH-244.patch
>
>
> This is an effort to rationalize the Giraph API. I've put together a few
> issues that we've talked about lately. I'm focusing on making Giraph
> development even more intuitive and less error-prone, and fixing a few
> potential sources of bugs.
> I'm sorry this is a big patch, but most of those issues are intertwined and I
> think this might be easier to review and integrate.
> Here's an account of the changes:
> Vertex API:
> - Renamed BasicVertex to Vertex (as I understand, we used to have both and
> then Vertex was removed).
> - Switched to Iterables instead of Iterators for both edges and messages.
> This makes code more concise for both implementors (no need to call
> .iterator() on collections) and users (can use foreach syntax). See also
> GIRAPH-221.
> - Added SimpleVertex and SimpleMutableVertex classes, where there are no edge
> values and the iterable to be implemented is getNeighbors(). We don’t have
> multiple inheritance, so the only way I could think of was to have
> SimpleVertex extend Vertex, SimpleMutableVertex extend MutableVertex, and
> duplicate the code for the edges iterables.
> Also, due to type erasure, one still has to deal with Edge objects in
> SimpleMutableVertex#initialize. Overall I think this is still an improvement
> over the current situation.
> - Added id and value fields to the base Vertex class. All other classes were
> either writing the same boilerplate again and again, or using primitive
> fields and then creating Writables on the fly (inefficient; there was even a
> TODO about that). If there are any actually useful customizations here, I’ve
> yet to see them.
> Also removed redundant “Vertex” from getters/setters (compare vertex.getId()
> with vertex.getVertexId()).
> - Made halt a private field, and added a wakeUp() method to re-activate a
> vertex. isHalted()/voteToHalt()/wakeUp() are just more semantically-charged
> getters/setters.
> - Renamed number of vertices/edges in graph to getTotalNum*. The previous
> naming (getNumEdges) was arguably confusing. If this one sucks too, please
> suggest a better one.
> - Default implementations of hasEdge(), getEdgeValue(), getNumEdges(),
> readFields(), write(), toString(): the implementor can still optimize when
> there is a good opportunity. Currently we are duplicating a lot of code (see
> GIRAPH-238) and potentially introducing bugs (see GIRAPH-239).
> HashMapVertex:
> - Switched representation from Map<I, Edge<I, E>> to Map<I, E> (GIRAPH-242)
> - Only override methods that can be optimized.
> EdgeListVertex:
> - Switched representation from two sorted lists to one list of Edge<I, E>
> (see GIRAPH-243). Mainly this makes iteration over edges (target id and
> value) linear instead of O(n log n). Mutations are still slow and should
> generally be discouraged.
> - Only override methods that can be optimized.
> Small nits:
> - Our code conventions say we should try to avoid abbreviations, so I
> eliminated a few (req -> request, msg -> message).
> - Consistently refer to the endpoint of an edge as targetVertex (before we
> had a mix of destVertex and targetVertex).
> - You will notice some rearranged imports. That’s just my IDE trying to be
> helpful (see GIRAPH-230).
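To make a few of the points in the description above concrete (Iterables instead of Iterators, the Map<I, E> edge layout, and the private halt flag with isHalted()/voteToHalt()/wakeUp()), here is a minimal, self-contained sketch. SketchVertex is a hypothetical stand-in written for this illustration. It is not Giraph's Vertex class, and its signatures are assumptions based only on the method names mentioned in the description.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical stand-in for illustration only. This is NOT Giraph's Vertex
 * class; method names follow the issue description, signatures are assumed.
 */
class SketchVertex<I, E> {
  private final I id;
  // Edge values keyed directly by target id (the Map<I, E> layout),
  // rather than Map<I, Edge<I, E>>.
  private final Map<I, E> edges = new HashMap<>();
  // Halt state is private; callers go through the methods below.
  private boolean halted = false;

  SketchVertex(I id) {
    this.id = id;
  }

  I getId() { return id; }

  void addEdge(I targetVertexId, E value) { edges.put(targetVertexId, value); }

  boolean hasEdge(I targetVertexId) { return edges.containsKey(targetVertexId); }

  E getEdgeValue(I targetVertexId) { return edges.get(targetVertexId); }

  int getNumEdges() { return edges.size(); }

  // Returning an Iterable lets callers use for-each without calling iterator().
  Iterable<I> getNeighbors() { return edges.keySet(); }

  boolean isHalted() { return halted; }

  void voteToHalt() { halted = true; }

  void wakeUp() { halted = false; }

  public static void main(String[] args) {
    SketchVertex<Long, Double> vertex = new SketchVertex<>(1L);
    vertex.addEdge(2L, 0.5);
    vertex.addEdge(3L, 1.5);

    // For-each over neighbors, looking up each edge value by target id.
    double total = 0;
    for (Long neighbor : vertex.getNeighbors()) {
      total += vertex.getEdgeValue(neighbor);
    }
    System.out.println("Vertex " + vertex.getId() + " total edge weight: " + total);

    vertex.voteToHalt();
    vertex.wakeUp();
    System.out.println("halted after wakeUp: " + vertex.isHalted());
  }
}
```

The sketch uses plain Java types for ids and values to stay runnable on its own; the real classes work with Hadoop Writables, which are left out here.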