Hi Everyone,
I have been looking at GraphLab integration with Hadoop while visiting
Yahoo! Research over the summer. It seems increasingly clear with the
convergence of cloud tools to the Apache framework that some form of
integration will be extremely helpful. However, determining the right path
has been surprisingly challenging.
As the presentation (great link) notes, Hadoop is being split into cluster
management/scheduling tools (i.e., YARN) and the MapReduce computational
model (i.e., Hadoop). In that setting GraphLab will, ideally, run on top of
YARN (not Hadoop) and use the exposed distributed file system resources to
load and save graphs. While it is possible to launch GraphLab from within
Hadoop streaming, as a map-only job, this is slightly less clean and appears
to be the old way of integrating external toolkits.
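For concreteness, a map-only streaming launch could be assembled roughly as in the sketch below. Everything specific here is a placeholder I made up for illustration: the jar path, the HDFS paths, and the `graphlab_worker` executable name are not real GraphLab artifacts. The one piece that matters is `-numReduceTasks 0`, the streaming option that turns off the reduce phase so the job is map-only.

```python
import shlex


def streaming_command(jar, input_path, output_path, mapper):
    """Build the argument list for a map-only Hadoop streaming job."""
    return [
        "hadoop", "jar", jar,
        "-numReduceTasks", "0",   # zero reducers: the job is map-only
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,        # executable run once per input split
    ]


# All values below are hypothetical placeholders.
cmd = streaming_command(
    jar="hadoop-streaming.jar",
    input_path="/graphs/in",
    output_path="/graphs/out",
    mapper="graphlab_worker",
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Each mapper process would then have to start (or join) a GraphLab instance itself, which is part of why this route feels less clean than a native YARN launch.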
Does anyone know the status of YARN? Is it ready for launching/managing C++
jobs? I have been looking for a project page but have not been very
successful.
With respect to MRv2, I think there are still some issues that need to be
resolved. For example, I looked at the pipeline:
Hadoop Graph Creation --> Hadoop/YARN GraphLab Launch --> Hadoop Post-Processing
This requires a common data format. I constructed a prototype around Avro
but found that the C++ implementation lacks nested structures, which I
"needed" to cleanly encode the GraphLab data graph. Also, surprisingly,
MRv2 did not seem to have an Avro interface (maybe this is fixed now).
While these problems are not insurmountable, they make it difficult to
cleanly integrate with Hadoop, YARN, and Avro while they are transitioning
to MRv2 (especially when the C++ bindings seem to be updated more slowly).
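To make the "nested structures" point concrete, here is a rough sketch of the kind of schema I mean, written as plain Avro schema JSON built from Python dictionaries. The record and field names (`Vertex`, `Edge`, `out_edges`, and so on) are invented for illustration and are not the actual GraphLab data graph layout. The nesting is the array of `Edge` records inside each `Vertex` record.

```python
import json

# Hypothetical field names; not the actual GraphLab data graph layout.
edge_schema = {
    "type": "record",
    "name": "Edge",
    "fields": [
        {"name": "target", "type": "long"},   # id of the destination vertex
        {"name": "data", "type": "bytes"},    # opaque serialized edge data
    ],
}

vertex_schema = {
    "type": "record",
    "name": "Vertex",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "data", "type": "bytes"},    # opaque serialized vertex data
        # The nested part: an array of Edge records inside each Vertex record.
        {"name": "out_edges", "type": {"type": "array", "items": edge_schema}},
    ],
}

schema_json = json.dumps(vertex_schema, indent=2)
print(schema_json)
```

A flat alternative (separate vertex and edge files joined on vertex id) avoids the nesting but pushes the join onto whichever side reads the data.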
An alternative, potentially cleaner, strategy might be to try to build a
Java (or at least JVM-based) version of GraphLab. This would eliminate the
need for cross-language formats and a JNI interface for Java-based update
functions. However, maintaining two versions of GraphLab could be difficult
and we would probably take a performance hit.
We are now in the process of building a roadmap for the next generation
of GraphLab which will provide more powerful abstractions, improved
scalability, and greater ease-of-use. Any feedback, suggestions, or ideas
would be great!
Joey
PS: We are co-organizing a NIPS'11 workshop (http://biglearn.org) which will
address issues related to large-scale machine learning and will provide an
excellent venue to discuss future plans and system integration.