Hi,

We are looking into running some Giraph benchmarks to compare against a similar programming model and framework that we are working on.
As a start, we are planning to benchmark the following algorithms on data sets with more than a billion edges:

1. Single Source Shortest Path from a given source
2. PageRank
3. Connected Components

We have a small cluster of 16 nodes (8 cores / 16 GB each) to run the benchmarks. Given that, we have a few questions to help us get the best out of Giraph:

1. Which version of Giraph should we use to take advantage of the optimizations (memory optimization/caching, multi-threading, etc.) mentioned here:
   https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920
   Should we use 1.0 or trunk?
2. Are the samples in the Giraph distribution for the above algorithms a good place to start? How can we take advantage of the different optimizations, including aggregators and combiners, for these algorithms?
3. Is there a document I can look at to understand the best practices for implementing optimized vertex-centric code with the latest features, along with deployment guidelines to maximize utilization?

Looking forward to your help.

Thanks,
Alok Kumbhare