[ 
https://issues.apache.org/jira/browse/GIRAPH-244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13421443#comment-13421443
 ] 

Alessandro Presta commented on GIRAPH-244:
------------------------------------------

I backported the new SSSP benchmark in order to do some more comparisons.

The TL;DR is: performance is essentially the same with both EdgeListVertex and
HashMapVertex.
10K edges per vertex is probably not enough to see a significant improvement in
iteration speed.

All in all, I think this is good to go now.

Benchmark results:

ShortestPathsBenchmark, using EdgeListVertex
100K vertices, 10K edges per vertex, 50 workers

trunk + GIRAPH-244:
{code}
12/07/24 06:05:34 INFO mapred.JobClient:   Giraph Stats
12/07/24 06:05:34 INFO mapred.JobClient:     Aggregate edges=1000000000
12/07/24 06:05:34 INFO mapred.JobClient:     Superstep=26
12/07/24 06:05:34 INFO mapred.JobClient:     Current workers=50
12/07/24 06:05:34 INFO mapred.JobClient:     Last checkpointed superstep=0
12/07/24 06:05:34 INFO mapred.JobClient:     Current master task partition=0
12/07/24 06:05:34 INFO mapred.JobClient:     Sent messages=0
12/07/24 06:05:34 INFO mapred.JobClient:     Aggregate finished vertices=100000
12/07/24 06:05:34 INFO mapred.JobClient:     Aggregate vertices=100000
12/07/24 06:05:34 INFO mapred.JobClient:   FileSystemCounters
12/07/24 06:05:34 INFO mapred.JobClient:     HDFS_FILES_CREATED=104
12/07/24 06:05:34 INFO mapred.JobClient:   Map-Reduce Framework
12/07/24 06:05:34 INFO mapred.JobClient:     Map input records=51
12/07/24 06:05:34 INFO mapred.JobClient:     Total physical memory in bytes=1325149147136
12/07/24 06:05:34 INFO mapred.JobClient:     Spilled Records=0
12/07/24 06:05:34 INFO mapred.JobClient:     MAP_TASK_WALLCLOCK=15131488
12/07/24 06:05:34 INFO mapred.JobClient:     Total cumulative CPU milliseconds=48163170
12/07/24 06:05:34 INFO mapred.JobClient:     Total virtual memory in bytes=4487093018624
12/07/24 06:05:34 INFO mapred.JobClient:     Map output records=0

real    5m13.062s
user    0m8.228s
sys     0m0.977s
{code}

trunk:
{code}
12/07/24 06:11:29 INFO mapred.JobClient:   Giraph Stats
12/07/24 06:11:29 INFO mapred.JobClient:     Aggregate edges=1000000000
12/07/24 06:11:29 INFO mapred.JobClient:     Superstep=26
12/07/24 06:11:29 INFO mapred.JobClient:     Current workers=50
12/07/24 06:11:29 INFO mapred.JobClient:     Last checkpointed superstep=0
12/07/24 06:11:29 INFO mapred.JobClient:     Current master task partition=0
12/07/24 06:11:29 INFO mapred.JobClient:     Sent messages=0
12/07/24 06:11:29 INFO mapred.JobClient:     Aggregate finished vertices=100000
12/07/24 06:11:29 INFO mapred.JobClient:     Aggregate vertices=100000
12/07/24 06:11:29 INFO mapred.JobClient:   FileSystemCounters
12/07/24 06:11:29 INFO mapred.JobClient:     HDFS_FILES_CREATED=104
12/07/24 06:11:29 INFO mapred.JobClient:   Map-Reduce Framework
12/07/24 06:11:29 INFO mapred.JobClient:     Map input records=51
12/07/24 06:11:29 INFO mapred.JobClient:     Total physical memory in bytes=1286495846400
12/07/24 06:11:29 INFO mapred.JobClient:     Spilled Records=0
12/07/24 06:11:29 INFO mapred.JobClient:     MAP_TASK_WALLCLOCK=16066522
12/07/24 06:11:29 INFO mapred.JobClient:     Total cumulative CPU milliseconds=49089980
12/07/24 06:11:29 INFO mapred.JobClient:     Total virtual memory in bytes=4486483517440
12/07/24 06:11:29 INFO mapred.JobClient:     Map output records=0

real    5m35.006s
user    0m7.908s
sys     0m0.947s
{code}

ShortestPathsBenchmark, using HashMapVertex
100K vertices, 10K edges per vertex, 50 workers

trunk + GIRAPH-244:
{code}
12/07/24 06:42:05 INFO mapred.JobClient:   Giraph Stats
12/07/24 06:42:05 INFO mapred.JobClient:     Aggregate edges=1000000000
12/07/24 06:42:05 INFO mapred.JobClient:     Superstep=26
12/07/24 06:42:05 INFO mapred.JobClient:     Current workers=50
12/07/24 06:42:05 INFO mapred.JobClient:     Last checkpointed superstep=0
12/07/24 06:42:05 INFO mapred.JobClient:     Current master task partition=0
12/07/24 06:42:05 INFO mapred.JobClient:     Sent messages=0
12/07/24 06:42:05 INFO mapred.JobClient:     Aggregate finished vertices=100000
12/07/24 06:42:05 INFO mapred.JobClient:     Aggregate vertices=100000
12/07/24 06:42:05 INFO mapred.JobClient:   FileSystemCounters
12/07/24 06:42:05 INFO mapred.JobClient:     HDFS_FILES_CREATED=104
12/07/24 06:42:05 INFO mapred.JobClient:   Map-Reduce Framework
12/07/24 06:42:05 INFO mapred.JobClient:     Map input records=51
12/07/24 06:42:05 INFO mapred.JobClient:     Total physical memory in bytes=1400065339392
12/07/24 06:42:05 INFO mapred.JobClient:     Spilled Records=0
12/07/24 06:42:05 INFO mapred.JobClient:     MAP_TASK_WALLCLOCK=14865246
12/07/24 06:42:05 INFO mapred.JobClient:     Total cumulative CPU milliseconds=48159780
12/07/24 06:42:05 INFO mapred.JobClient:     Total virtual memory in bytes=4478185152512
12/07/24 06:42:05 INFO mapred.JobClient:     Map output records=0

real    5m6.939s
user    0m7.738s
sys     0m0.908s
{code}

trunk:
{code}
12/07/24 06:32:45 INFO mapred.JobClient:   Giraph Stats
12/07/24 06:32:45 INFO mapred.JobClient:     Aggregate edges=1000000000
12/07/24 06:32:45 INFO mapred.JobClient:     Superstep=26
12/07/24 06:32:45 INFO mapred.JobClient:     Current workers=50
12/07/24 06:32:45 INFO mapred.JobClient:     Last checkpointed superstep=0
12/07/24 06:32:45 INFO mapred.JobClient:     Current master task partition=0
12/07/24 06:32:45 INFO mapred.JobClient:     Sent messages=0
12/07/24 06:32:45 INFO mapred.JobClient:     Aggregate finished vertices=100000
12/07/24 06:32:45 INFO mapred.JobClient:     Aggregate vertices=100000
12/07/24 06:32:45 INFO mapred.JobClient:   FileSystemCounters
12/07/24 06:32:45 INFO mapred.JobClient:     HDFS_FILES_CREATED=104
12/07/24 06:32:45 INFO mapred.JobClient:   Map-Reduce Framework
12/07/24 06:32:45 INFO mapred.JobClient:     Map input records=51
12/07/24 06:32:45 INFO mapred.JobClient:     Total physical memory in bytes=1373213544448
12/07/24 06:32:45 INFO mapred.JobClient:     Spilled Records=0
12/07/24 06:32:45 INFO mapred.JobClient:     MAP_TASK_WALLCLOCK=14675125
12/07/24 06:32:45 INFO mapred.JobClient:     Total cumulative CPU milliseconds=49132170
12/07/24 06:32:45 INFO mapred.JobClient:     Total virtual memory in bytes=4494050402304
12/07/24 06:32:45 INFO mapred.JobClient:     Map output records=0

real    5m4.375s
user    0m7.042s
sys     0m0.902s
{code}
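
For a rough sense of scale, the wall-clock figures above can be compared directly. The Java sketch below is not part of the patch or of Giraph: the class and method names (BenchmarkDelta, toSeconds, relativeChangePercent) are made up for illustration, and the inputs are simply the "real" times copied from the four runs above.

```java
/** Illustrative only: compares the "real" times reported by the runs above. */
final class BenchmarkDelta {

  /** Parses a time(1)-style duration such as "5m13.062s" into seconds. */
  static double toSeconds(String real) {
    int m = real.indexOf('m');
    double minutes = Double.parseDouble(real.substring(0, m));
    // Drop the trailing 's' before parsing the seconds part.
    double seconds = Double.parseDouble(real.substring(m + 1, real.length() - 1));
    return minutes * 60.0 + seconds;
  }

  /** Relative change of patched vs. baseline, in percent (negative = faster). */
  static double relativeChangePercent(double baseline, double patched) {
    return (patched - baseline) / baseline * 100.0;
  }

  public static void main(String[] args) {
    // Each row: {vertex class, trunk, trunk + GIRAPH-244}, as reported above.
    String[][] runs = {
      {"EdgeListVertex", "5m35.006s", "5m13.062s"},
      {"HashMapVertex", "5m4.375s", "5m6.939s"},
    };
    for (String[] run : runs) {
      double trunk = toSeconds(run[1]);
      double patched = toSeconds(run[2]);
      System.out.printf("%s: trunk=%.3fs patched=%.3fs change=%+.2f%%%n",
          run[0], trunk, patched, relativeChangePercent(trunk, patched));
    }
  }
}
```

With these inputs the patched EdgeListVertex run comes out roughly 6.5% shorter and the patched HashMapVertex run a little under 1% longer. Each configuration was run only once, so differences of this size may well be noise.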
                
> Vertex API redesign
> -------------------
>
>                 Key: GIRAPH-244
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-244
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Alessandro Presta
>            Assignee: Alessandro Presta
>         Attachments: GIRAPH-244.patch, GIRAPH-244.patch, GIRAPH-244.patch, 
> GIRAPH-244.patch, GIRAPH-244.patch, GIRAPH-244.patch, GIRAPH-244.patch, 
> GIRAPH-244.patch, GIRAPH-244.patch, GIRAPH-244.patch, GIRAPH-244.patch, 
> GIRAPH-244.patch, GIRAPH-244.patch, GIRAPH-244.patch, GIRAPH-244.patch
>
>
> This is an effort to rationalize the Giraph API. I've put together a few 
> issues that we've talked about lately. I'm focusing on making Giraph 
> development even more intuitive and less error-prone, and fixing a few 
> potential sources of bugs.
> I'm sorry this is a big patch, but most of those issues are intertwined and I 
> think this might be easier to review and integrate.
> Here's an account of the changes:
> Vertex API:
> - Renamed BasicVertex to Vertex (as I understand, we used to have both and 
> then Vertex was removed).
> - Switched to Iterables instead of Iterators for both edges and messages. 
> This makes code more concise for both implementors (no need to call 
> .iterator() on collections) and users (can use foreach syntax). See also 
> GIRAPH-221.
> - Added SimpleVertex and SimpleMutableVertex classes, where there are no edge 
> values and the iterable to be implemented is getNeighbors(). We don’t have 
> multiple inheritance, so the only way I could think of was to have 
> SimpleVertex extend Vertex, SimpleMutableVertex extend MutableVertex, and 
> duplicate the code for the edges iterables.
> Also, due to type erasure, one still has to deal with Edge objects in 
> SimpleMutableVertex#initialize. Overall I think this is still an improvement 
> over the current situation.
> - Added id and value fields to the base Vertex class. All other classes were 
> either writing the same boilerplate again and again, or using primitive 
> fields and then creating Writables on the fly (inefficient; there was even a 
> TODO about that). If there are any actually useful customizations here, I’ve 
> yet to see them.
> Also removed redundant “Vertex” from getters/setters (compare vertex.getId() 
> with vertex.getVertexId()).
> - Made halt a private field, and added a wakeUp() method to re-activate a 
> vertex. isHalted()/voteToHalt()/wakeUp() are just more semantically-charged 
> getters/setters.
> - Renamed number of vertices/edges in graph to getTotalNum*. The previous 
> naming (getNumEdges) was arguably confusing. If this one sucks too, please 
> suggest a better one.
> - Default implementations of hasEdge(), getEdgeValue(), getNumEdges(), 
> readFields(), write(), toString(): the implementor can still optimize when 
> there is a good opportunity. Currently we are duplicating a lot of code (see 
> GIRAPH-238) and potentially introducing bugs (see GIRAPH-239).
> HashMapVertex:
> - Switched representation from Map<I, Edge<I, E>> to Map<I, E> (GIRAPH-242)
> - Only override methods that can be optimized.
> EdgeListVertex:
> - Switched representation from two sorted lists to one list of Edge<I, E> 
> (see GIRAPH-243). Mainly this makes iteration over edges (target id and 
> value) linear instead of O(n log n). Mutations are still slow and should 
> generally be discouraged.
> - Only override methods that can be optimized.
> Small nits:
> - Our code conventions say we should try to avoid abbreviations, so I 
> eliminated a few (req -> request, msg -> message).
> - Consistently refer to the endpoint of an edge as targetVertex (before we 
> had a mix of destVertex and targetVertex).
> - You will notice some rearranged imports. That’s just my IDE trying to be 
> helpful (see GIRAPH-230).
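
The design described in the quoted issue (edges exposed as an Iterable, generic default implementations in the base class, and subclasses overriding only what their representation makes cheaper) can be sketched as follows. This is a minimal illustration, not the code from the patch: every class here (VertexApiSketch and its nested Edge, Vertex, EdgeListVertex, HashMapVertex) is a hypothetical stand-in, and the method signatures are assumptions chosen to mirror the names mentioned above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; these are not Giraph's actual classes. */
final class VertexApiSketch {

  /** Hypothetical edge: a target vertex id plus an edge value. */
  static final class Edge<I, E> {
    final I targetVertexId;
    final E value;

    Edge(I targetVertexId, E value) {
      this.targetVertexId = targetVertexId;
      this.value = value;
    }
  }

  /** Base class whose defaults are written only in terms of getEdges(). */
  abstract static class Vertex<I, E> {
    /** Returning an Iterable lets callers use for-each directly. */
    abstract Iterable<Edge<I, E>> getEdges();

    abstract void addEdge(I targetVertexId, E value);

    /** Default: linear scan. Subclasses override when they can do better. */
    boolean hasEdge(I targetVertexId) {
      for (Edge<I, E> edge : getEdges()) {
        if (edge.targetVertexId.equals(targetVertexId)) {
          return true;
        }
      }
      return false;
    }

    /** Default: linear scan; returns null when there is no such edge. */
    E getEdgeValue(I targetVertexId) {
      for (Edge<I, E> edge : getEdges()) {
        if (edge.targetVertexId.equals(targetVertexId)) {
          return edge.value;
        }
      }
      return null;
    }

    /** Default: count by iterating. */
    int getNumEdges() {
      int count = 0;
      for (Edge<I, E> ignored : getEdges()) {
        count++;
      }
      return count;
    }
  }

  /** One list of edges: a single linear pass yields both id and value. */
  static final class EdgeListVertex<I, E> extends Vertex<I, E> {
    private final List<Edge<I, E>> edges = new ArrayList<>();

    @Override Iterable<Edge<I, E>> getEdges() { return edges; }
    @Override void addEdge(I id, E value) { edges.add(new Edge<>(id, value)); }
    @Override int getNumEdges() { return edges.size(); }
  }

  /** Map from target id to edge value: lookups avoid the linear scan. */
  static final class HashMapVertex<I, E> extends Vertex<I, E> {
    private final Map<I, E> edges = new HashMap<>();

    @Override Iterable<Edge<I, E>> getEdges() {
      // Materialize Edge objects on demand; the map itself stores only values.
      List<Edge<I, E>> view = new ArrayList<>(edges.size());
      for (Map.Entry<I, E> entry : edges.entrySet()) {
        view.add(new Edge<>(entry.getKey(), entry.getValue()));
      }
      return view;
    }
    @Override void addEdge(I id, E value) { edges.put(id, value); }
    @Override boolean hasEdge(I id) { return edges.containsKey(id); }
    @Override E getEdgeValue(I id) { return edges.get(id); }
    @Override int getNumEdges() { return edges.size(); }
  }

  public static void main(String[] args) {
    Vertex<Long, Double> vertex = new EdgeListVertex<>();
    vertex.addEdge(2L, 1.5);
    vertex.addEdge(3L, 0.5);
    for (Edge<Long, Double> edge : vertex.getEdges()) {
      System.out.println(edge.targetVertexId + " -> " + edge.value);
    }
  }
}
```

In this sketch the list-backed vertex inherits the scanning defaults for hasEdge() and getEdgeValue(), while the map-backed vertex overrides them with direct lookups. Swapping `new EdgeListVertex<>()` for `new HashMapVertex<>()` in main leaves the calling code unchanged.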
