[ 
https://issues.apache.org/jira/browse/GIRAPH-244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13420991#comment-13420991
 ] 

Alessandro Presta commented on GIRAPH-244:
------------------------------------------

Here's a PageRank benchmark run with 1M vertices, 100 edges/vertex, 30 
supersteps and 10 workers:

trunk + GIRAPH-244:
{code}
12/07/23 14:27:19 INFO mapred.JobClient:   Giraph Stats
12/07/23 14:27:19 INFO mapred.JobClient:     Aggregate edges=100000000
12/07/23 14:27:19 INFO mapred.JobClient:     Superstep=31
12/07/23 14:27:19 INFO mapred.JobClient:     Current workers=10
12/07/23 14:27:19 INFO mapred.JobClient:     Last checkpointed superstep=0
12/07/23 14:27:19 INFO mapred.JobClient:     Current master task partition=9
12/07/23 14:27:19 INFO mapred.JobClient:     Sent messages=0
12/07/23 14:27:19 INFO mapred.JobClient:     Aggregate finished vertices=1000000
12/07/23 14:27:19 INFO mapred.JobClient:     Aggregate vertices=1000000
12/07/23 14:27:19 INFO mapred.JobClient:   FileSystemCounters
12/07/23 14:27:19 INFO mapred.JobClient:     HDFS_FILES_CREATED=24
12/07/23 14:27:19 INFO mapred.JobClient:   Map-Reduce Framework
12/07/23 14:27:19 INFO mapred.JobClient:     Map input records=11
12/07/23 14:27:19 INFO mapred.JobClient:     Total physical memory in bytes=295136768000
12/07/23 14:27:19 INFO mapred.JobClient:     Spilled Records=0
12/07/23 14:27:19 INFO mapred.JobClient:     MAP_TASK_WALLCLOCK=5577508
12/07/23 14:27:19 INFO mapred.JobClient:     Total cumulative CPU milliseconds=28367700
12/07/23 14:27:19 INFO mapred.JobClient:     Total virtual memory in bytes=953247084544
12/07/23 14:27:19 INFO mapred.JobClient:     Map output records=0

real    8m43.003s
user    0m6.677s
sys     0m0.877s
{code}

trunk:
{code}
12/07/23 14:36:47 INFO mapred.JobClient:   Giraph Stats
12/07/23 14:36:47 INFO mapred.JobClient:     Aggregate edges=100000000
12/07/23 14:36:47 INFO mapred.JobClient:     Superstep=31
12/07/23 14:36:47 INFO mapred.JobClient:     Current workers=10
12/07/23 14:36:47 INFO mapred.JobClient:     Last checkpointed superstep=0
12/07/23 14:36:47 INFO mapred.JobClient:     Current master task partition=0
12/07/23 14:36:47 INFO mapred.JobClient:     Sent messages=0
12/07/23 14:36:47 INFO mapred.JobClient:     Aggregate finished vertices=1000000
12/07/23 14:36:47 INFO mapred.JobClient:     Aggregate vertices=1000000
12/07/23 14:36:47 INFO mapred.JobClient:   FileSystemCounters
12/07/23 14:36:47 INFO mapred.JobClient:     HDFS_FILES_CREATED=24
12/07/23 14:36:47 INFO mapred.JobClient:   Map-Reduce Framework
12/07/23 14:36:47 INFO mapred.JobClient:     Map input records=11
12/07/23 14:36:47 INFO mapred.JobClient:     Total physical memory in bytes=288768761856
12/07/23 14:36:47 INFO mapred.JobClient:     Spilled Records=0
12/07/23 14:36:47 INFO mapred.JobClient:     MAP_TASK_WALLCLOCK=5754578
12/07/23 14:36:47 INFO mapred.JobClient:     Total cumulative CPU milliseconds=28970270
12/07/23 14:36:47 INFO mapred.JobClient:     Total virtual memory in bytes=953337098240
12/07/23 14:36:47 INFO mapred.JobClient:     Map output records=0

real    8m54.799s
user    0m6.436s
sys     0m0.790s
{code}

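They're pretty close in both memory usage and running time: with the patch, 
wall-clock time is about 2% lower (8m43s vs. 8m55s) and the physical memory 
counter is about 2% higher (295.1 GB vs. 288.8 GB).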
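
As a rough check of that comparison, the relative differences can be computed 
directly from the counters in the two logs above. This is only a sketch: the 
class and method names (BenchmarkDelta, relativeChange, toSeconds) are made up 
for illustration and are not part of Giraph or Hadoop; the numbers are copied 
from the logs.

```java
// Sketch only: hypothetical helper, not part of Giraph or Hadoop.
// All inputs are copied from the "trunk" and "trunk + GIRAPH-244" logs above.
final class BenchmarkDelta {
    /** Relative change of patched with respect to baseline (-0.02 means 2% lower). */
    static double relativeChange(double baseline, double patched) {
        return (patched - baseline) / baseline;
    }

    /** Converts a "XmY.YYYs" reading, as printed by the shell's time, into seconds. */
    static double toSeconds(int minutes, double seconds) {
        return minutes * 60.0 + seconds;
    }

    public static void main(String[] args) {
        // "real" wall-clock time of each run
        double trunkSeconds = toSeconds(8, 54.799);
        double patchedSeconds = toSeconds(8, 43.003);
        // "Total physical memory in bytes" counter
        long trunkPhysicalBytes = 288768761856L;
        long patchedPhysicalBytes = 295136768000L;
        // "Total cumulative CPU milliseconds" counter
        long trunkCpuMillis = 28970270L;
        long patchedCpuMillis = 28367700L;

        System.out.printf("wall clock:      %+.2f%%%n",
            100 * relativeChange(trunkSeconds, patchedSeconds));
        System.out.printf("physical memory: %+.2f%%%n",
            100 * relativeChange(trunkPhysicalBytes, patchedPhysicalBytes));
        System.out.printf("CPU time:        %+.2f%%%n",
            100 * relativeChange(trunkCpuMillis, patchedCpuMillis));
    }
}
```

With these inputs every difference comes out at roughly 2% in one direction or 
the other. Since this is a single run of each build, that is probably within 
run-to-run noise.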

I'm going to run some other benchmarks tomorrow to see how things change, e.g. 
when increasing the number of edges per vertex, since we're using a different 
iteration strategy.
                
> Vertex API redesign
> -------------------
>
>                 Key: GIRAPH-244
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-244
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Alessandro Presta
>            Assignee: Alessandro Presta
>         Attachments: GIRAPH-244.patch, GIRAPH-244.patch, GIRAPH-244.patch, 
> GIRAPH-244.patch, GIRAPH-244.patch, GIRAPH-244.patch, GIRAPH-244.patch, 
> GIRAPH-244.patch, GIRAPH-244.patch, GIRAPH-244.patch, GIRAPH-244.patch, 
> GIRAPH-244.patch
>
>
> This is an effort to rationalize the Giraph API. I've put together a few 
> issues that we've talked about lately. I'm focusing on making Giraph 
> development even more intuitive and less error-prone, and fixing a few 
> potential sources of bugs.
> I'm sorry this is a big patch, but most of those issues are intertwined and I 
> think this might be easier to review and integrate.
> Here's an account of the changes:
> Vertex API:
> - Renamed BasicVertex to Vertex (as I understand, we used to have both and 
> then Vertex was removed).
> - Switched to Iterables instead of Iterators for both edges and messages. 
> This makes code more concise for both implementors (no need to call 
> .iterator() on collections) and users (can use foreach syntax). See also 
> GIRAPH-221.
> - Added SimpleVertex and SimpleMutableVertex classes, where there are no edge 
> values and the iterable to be implemented is getNeighbors(). We don’t have 
> multiple inheritance, so the only way I could think of was to have 
> SimpleVertex extend Vertex, SimpleMutableVertex extend MutableVertex, and 
> duplicate the code for the edges iterables.
> Also, due to type erasure, one still has to deal with Edge objects in 
> SimpleMutableVertex#initialize. Overall I think this is still an improvement 
> over the current situation.
> - Added id and value fields to the base Vertex class. All other classes were 
> either writing the same boilerplate again and again, or using primitive 
> fields and then creating Writables on the fly (inefficient; there was even a 
> TODO about that). If there are any actually useful customizations here, I’ve 
> yet to see them.
> Also removed redundant “Vertex” from getters/setters (compare vertex.getId() 
> with vertex.getVertexId()).
> - Made halt a private field, and added a wakeUp() method to re-activate a 
> vertex. isHalted()/voteToHalt()/wakeUp() are just more semantically-charged 
> getter/setters.
> - Renamed number of vertices/edges in graph to getTotalNum*. The previous 
> naming (getNumEdges) was arguably confusing. If this one sucks too, please 
> suggest a better one.
> - Default implementations of hasEdge(), getEdgeValue(), getNumEdges(), 
> readFields(), write(), toString(): the implementor can still optimize when 
> there is a good opportunity. Currently we are duplicating a lot of code (see 
> GIRAPH-238) and potentially introducing bugs (see GIRAPH-239).
> HashMapVertex:
> - Switched representation from Map<I, Edge<I, E>> to Map<I, E> (GIRAPH-242)
> - Only override methods that can be optimized.
> EdgeListVertex:
> - Switched representation from two sorted lists to one list of Edge<I, E> 
> (see GIRAPH-243). Mainly this makes iteration over edges (target id and 
> value) linear instead of O(n log n). Mutations are still slow and should 
> generally be discouraged.
> - Only override methods that can be optimized.
> Small nits:
> - Our code conventions say we should try to avoid abbreviations, so I 
> eliminated a few (req -> request, msg -> message).
> - Consistently refer to the endpoint of an edge as targetVertex (before we 
> had a mix of destVertex and targetVertex).
> - You will notice some rearranged imports. That’s just my IDE trying to be 
> helpful (see GIRAPH-230).
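
To make a few of the points in the description above concrete (Iterables that 
work with foreach, a Map<I, E> edge representation, and halt as a private 
field behind isHalted()/voteToHalt()/wakeUp()), here is a minimal sketch. 
ToyVertex and ToyVertexDemo are hypothetical stand-ins written for this 
illustration; they are not the classes from the patch and only borrow the 
method names mentioned in the description.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in, NOT the Giraph Vertex class from the patch.
class ToyVertex<I, V, E> {
    private final I id;
    private V value;
    // Target vertex id -> edge value (the Map<I, E> shape described above).
    private final Map<I, E> edges = new LinkedHashMap<>();
    // Private: only reachable through isHalted()/voteToHalt()/wakeUp().
    private boolean halt;

    ToyVertex(I id, V value) {
        this.id = id;
        this.value = value;
    }

    I getId() { return id; }
    V getValue() { return value; }
    void setValue(V value) { this.value = value; }

    void addEdge(I targetVertexId, E edgeValue) { edges.put(targetVertexId, edgeValue); }

    // Returning an Iterable lets callers use foreach directly.
    Iterable<I> getNeighbors() { return edges.keySet(); }

    // Simple implementations backed by the map.
    boolean hasEdge(I targetVertexId) { return edges.containsKey(targetVertexId); }
    E getEdgeValue(I targetVertexId) { return edges.get(targetVertexId); }
    int getNumEdges() { return edges.size(); }

    boolean isHalted() { return halt; }
    void voteToHalt() { halt = true; }
    void wakeUp() { halt = false; }
}

final class ToyVertexDemo {
    public static void main(String[] args) {
        ToyVertex<Long, Double, Float> vertex = new ToyVertex<>(1L, 0.15);
        vertex.addEdge(2L, 1.0f);
        vertex.addEdge(3L, 2.0f);

        // No explicit .iterator() call needed on the caller's side.
        for (Long neighbor : vertex.getNeighbors()) {
            System.out.println(vertex.getId() + " -> " + neighbor
                + " (value " + vertex.getEdgeValue(neighbor) + ")");
        }

        vertex.voteToHalt();
        System.out.println("halted: " + vertex.isHalted());
        vertex.wakeUp();
        System.out.println("halted after wakeUp: " + vertex.isHalted());
    }
}
```

The sketch leaves out everything Writable-related (readFields(), write()) and 
the SimpleVertex/MutableVertex split, so it says nothing about how the patch 
handles serialization or mutation.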


