Re: Suggestions on problem sizes for giraph performance benchmarking

2012-07-10 Thread Avery Ching
You should try using appropriate memory settings (e.g.
-Dmapred.child.java.opts="-Xms30g -Xmx30g -Xss128k") for a 30 GB heap.
How high you can go depends on how much memory you can get.
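
A full launch command might look roughly like this; the jar name,
example class, and trailing arguments are just placeholders for
illustration:

    hadoop jar giraph-jar-with-dependencies.jar org.apache.giraph.GiraphRunner \
      -Dmapred.child.java.opts="-Xms30g -Xmx30g -Xss128k" \
      org.apache.giraph.examples.SimplePageRankVertex [input/output options]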


Avery

On 7/9/12 5:57 AM, Amani Alonazi wrote:
Actually, I had the same problem of running out of memory with Giraph 
when trying to implement the strongly connected components algorithm. 
My input graph has 1 million nodes and 7 million edges.


I'm using a cluster of 21 computers.


On Mon, Jul 9, 2012 at 3:44 PM, Benjamin Heitmann 
benjamin.heitm...@deri.org wrote:



Hello Stephen,

sorry for the very late reply.

On 28 Jun 2012, at 02:50, Fleischman, Stephen (ISS SCI - Plano TX)
wrote:


Hello Avery and all:

I have a cluster of 10 two-processor/48 GB RAM servers, upon
which we are conducting Hadoop performance characterization
tests.  I plan to use the Giraph PageRank and simple shortest
paths example tests as part of this exercise and would appreciate
guidance on problem sizes for both tests.  I'm looking at paring
down an obfuscated Twitter dataset, and it would save a lot of
time if someone has some knowledge of roughly how time and
memory scale with the number of nodes in a graph.




I can provide some suggestions for the kind of algorithm and data
which currently surpasses the scalability of Giraph.

While the limits of my knowledge of Giraph and Hadoop are probably
also partly to blame for this, please see the recent discussions on
this list, and on JIRA, for other indications that the scalability
of Giraph needs improvement:
* the post by Yuanyuan Tian in the thread "wierd communication
errors" on user@giraph.apache.org
* GIRAPH-234 about GC overhead

https://issues.apache.org/jira/browse/GIRAPH-234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

If you want to stretch the limits of Giraph, then you need to try
an algorithm which is conceptually different from PageRank, and
you need a big data set.
If you use an algorithm with complex application logic (maybe
even domain-specific logic) which needs to be embedded in the
algorithm, then the nodes need to hold a lot of state. In
addition, such algorithms probably send around a lot of messages,
and each message might have a payload which is more complex than a
single floating point number. It also helps to have a graph format
which requires strings on the edges and vertices; the strings are
required for the domain-specific business logic which the graph
algorithm needs to follow.
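
For example, a message type for such an algorithm might look like
this minimal sketch; the class and field names are hypothetical,
only the Hadoop Writable interface is real:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // Hypothetical message: a URI string plus an activation value,
    // i.e. a payload more complex than a single floating point number.
    public class ActivationMessage implements Writable {
        private String sourceUri = "";
        private double activation;

        public ActivationMessage() {}  // no-arg constructor for deserialization

        public ActivationMessage(String sourceUri, double activation) {
            this.sourceUri = sourceUri;
            this.activation = activation;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(sourceUri);
            out.writeDouble(activation);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            sourceUri = in.readUTF();
            activation = in.readDouble();
        }
    }

Every message then carries a string, which multiplies the memory and
network cost compared to a plain double.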

Finally, imagine a data set which takes a long time to load, and
where one run of the algorithm only provides results for one user.
The standard Hadoop paradigm is to throw the graph away once the
job that loaded it finishes.
So if you have hundreds or thousands of users, then you need a way
to execute the algorithm multiple times in parallel.
Again this adds a lot of state, as each vertex will need to hold
one state object for each user whose run has visited the vertex.

In my specific case, I had the following data and algorithm:
Data:
* an RDF graph with 10 million vertices and 40 million edges.
I used my own import code to map the RDF graph to an undirected
graph with a limit of one edge between any two nodes (so it was
not a multi-graph)
* each vertex and each edge uses a string as its identity, to
represent a URI in the RDF graph (required for the business logic
in the algorithm)
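
As an aside, the one-edge-per-pair restriction can be enforced
during import with something as simple as the following sketch; the
names are hypothetical, and it assumes vertex ids fit into an int:

    import java.util.HashSet;
    import java.util.Set;

    // Keeps at most one undirected edge per vertex pair during import.
    public class EdgeDeduper {
        private final Set<Long> seen = new HashSet<Long>();

        // Returns true only the first time the undirected pair (a, b) is seen.
        public boolean addEdge(int a, int b) {
            int lo = Math.min(a, b);
            int hi = Math.max(a, b);
            long key = ((long) lo << 32) | (hi & 0xffffffffL);
            return seen.add(key);
        }
    }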

Algorithm:
* spreading activation.
You can think of it as depth-first search guided by domain-specific
logic.
A short introduction is here:
https://en.wikipedia.org/wiki/Spreading_activation
The Wikipedia article only mentions using spreading activation on
weighted graphs; however, I used it on graphs which have additional
types on the edges.
Using the semantics of the edges to guide the algorithm is an
active research area, so that's why I can't point you to a good
article on that.
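
To make the idea concrete, here is a minimal sketch of a single
firing step over typed edges; the decay factor, the threshold, and
the way edge types gate the spread are illustrative assumptions,
not the exact logic of my implementation:

    import java.util.Set;

    // One node's firing step: if its activation exceeds the threshold,
    // spread decayed activation along the edges whose type is allowed.
    public class SpreadingActivationStep {
        private static final double DECAY = 0.8;      // assumed decay factor
        private static final double THRESHOLD = 0.1;  // assumed firing threshold

        // Returns the activation to send along each eligible edge,
        // or 0.0 if the node does not fire.
        public static double outgoingActivation(double activation,
                                                String[] edgeTypes,
                                                Set<String> allowedTypes) {
            int eligible = 0;
            for (String type : edgeTypes) {
                if (allowedTypes.contains(type)) {
                    eligible++;
                }
            }
            if (activation < THRESHOLD || eligible == 0) {
                return 0.0;
            }
            return activation * DECAY / eligible;  // split among eligible edges
        }
    }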
* parallel execution:
I need to run the algorithm once for every user in the system;
however, loading the data set alone takes around 15 minutes.
So each node has an array of states, one for each user whose run of
the algorithm has visited that node.
I experimented with user counts between 30 and 1000; anything
more did not work for concurrent execution of the algorithm.
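
The per-user state on each vertex then looks roughly like this
sketch (hypothetical names; integer user ids and double activations
are assumptions):

    import java.util.HashMap;
    import java.util.Map;

    // Per-vertex state: one activation value for each user whose run
    // of the algorithm has visited this vertex.
    public class PerUserVertexState {
        private final Map<Integer, Double> activationByUser =
            new HashMap<Integer, Double>();

        public void accumulate(int userId, double incoming) {
            Double current = activationByUser.get(userId);
            activationByUser.put(userId,
                current == null ? incoming : current + incoming);
        }

        public int touchedUsers() {
            return activationByUser.size();
        }
    }

With hundreds of concurrent users these maps grow on every visited
vertex, which is exactly where the memory pressure comes from.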

Infrastructure:
* a single server with 24 Intel Xeon 2.4 GHz CPUs and 96 GB of RAM
* Hadoop 1.0, pseudo-distributed setup
* between 10 and 20 Giraph workers


A few weeks ago I stopped work on my Giraph-based implementation,
as Giraph ran out of memory almost immediately after loading and
initialising the data.
I made sure that the Giraph workers do not run out of 

Announcing: Green-Marl/Giraph

2012-07-10 Thread Jan van der Lugt
Hi everyone,

Some of you might already know Green-Marl (its compatibility was even
mentioned as a new feature for Giraph 0.2), but for those who don't,
let me give you a quick introduction.

Green-Marl is a domain-specific language tailored to graph algorithms. It
has many features that make it possible to write graph algorithms in a very
concise and intuitive way. For example, it has built-in constructs for graphs,
node properties, and traversals. A more complete description is given here:
http://ppl.stanford.edu/papers/asplos12_hong.pdf. Apart from the language,
there is also a Green-Marl compiler that compiles the language to different
targets. C++/OpenMP was initially supported, after which support for
Stanford GPS (a Pregel clone developed at Stanford) was added for a subset
of the features in the Green-Marl language. The big advantage of using a
language like Green-Marl is that it enables you to write implicitly
parallel programs in an intuitive way, with the compiler doing most of the
heavy work, such as generating the messaging and converting pull-based
operations (remote reads) into push-based operations (message sends).
In the last few months I have worked on a few features in Giraph that
enable Green-Marl to target Giraph as a back-end (namely GIRAPH-127,
GIRAPH-192 and GIRAPH-216). Since these have all been merged, Green-Marl
and Giraph should be compatible as of last week!
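
To illustrate the pull-to-push conversion mentioned above, here is a
hand-written Java sketch of the two styles for a PageRank-like update;
this is only an illustration, not actual Green-Marl compiler output:

    // Pull style: a vertex reads its in-neighbors' state directly.
    // Cheap on shared memory, expensive on a Pregel-style system.
    double pullUpdate(double[] rank, int[] inNeighbors, int[] outDegree,
                      double d, int numNodes) {
        double sum = 0.0;
        for (int w : inNeighbors) {
            sum += rank[w] / outDegree[w];
        }
        return (1 - d) / numNodes + d * sum;
    }

    // Push style: each vertex sends rank/outDegree to its out-neighbors,
    // and the receiving vertex simply sums its incoming messages.
    double pushUpdate(double[] incomingMessages, double d, int numNodes) {
        double sum = 0.0;
        for (double m : incomingMessages) {
            sum += m;
        }
        return (1 - d) / numNodes + d * sum;
    }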

Using Green-Marl with Giraph is very simple, as long as you have a working
Hadoop/Giraph environment. Since the Green-Marl compiler is a
source-to-source compiler, it is very easy to use it in an existing set-up.
The steps for using Green-Marl are as follows:

   1. Download the Green-Marl compiler from
   https://github.com/stanford-ppl/Green-Marl, run make_dirs.sh in the
   top-level dir and run make in the src dir. A more detailed explanation is
   given in the GitHub README.
   2. Compile one of the examples (in the top-level directory) like
   this: bin/gm_comp -t=giraph apps/src/pagerank.gm
   3. Move pagerank.java to the Giraph src directory or your own project,
   package it into a .jar and go!
   4. Explore more examples, modify them and start writing your own.
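
As a rough illustration of step 3, running the packaged example could
look something like this; the jar name is a placeholder, and the exact
GiraphRunner options depend on your input and output formats:

    hadoop jar my-green-marl-app.jar org.apache.giraph.GiraphRunner pagerank \
      [input, output and worker options]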

Green-Marl is actively being worked on within Oracle Labs; new features are
being added to the language (such as collection node properties), and
support for these features will be added to the distributed back-ends in
the near future.

In case of questions or suggestions, please send us a message! Feedback is
very much appreciated and people are more than welcome to contribute
everything from small bugfixes to entire back-ends ;-)

- Jan


Apache Giraph BOARD report for 7/25 meeting

2012-07-10 Thread Avery Ching

Status report for the Apache Giraph project - July 2012

Giraph is a Bulk Synchronous Parallel framework for writing programs that
analyze large graphs on a Hadoop cluster. Giraph is similar to Google's
Pregel system.

Project Status
--------------

Releases:
  0.2.0 - expected 7/31
 * Reduce memory consumption
 * Improve support for the Green-Marl project.

The transition to being a full Apache project is nearly complete (still a
few references to incubator on the website).

Community
---------

Activity has picked up on Apache Giraph, more contributors seem to be
gaining interest, and we had 24 commits for the month of June.  We should
try to convert some contributors into committers soon.

Mailing lists:
  116 subscribers on dev
  155 subscribers on user