HBase/Accumulo Input and Output formats
---------------------------------------

                 Key: GIRAPH-153
                 URL: https://issues.apache.org/jira/browse/GIRAPH-153
             Project: Giraph
          Issue Type: New Feature
          Components: bsp
    Affects Versions: 0.1.0
         Environment: Single host OSX 10.6.8 2.2Ghz Intel i7, 8GB
            Reporter: Brian Femiano
         Attachments: AccumuloRootMarker.java, 
AccumuloRootMarkerInputFormat.java, AccumuloRootMarkerOutputFormat.java, 
AccumuloVertexInputFormat.java, AccumuloVertexOutputFormat.java, 
ComputeIsRoot.java, DistributedCacheHelper.java, HBaseVertexInputFormat.java, 
HBaseVertexOutputFormat.java, IdentifyAndMarkRoots.java, SetLongWritable.java, 
SetTextWritable.java, TableRootMarker.java, TableRootMarkerInputFormat.java, 
TableRootMarkerOutputFormat.java

Four abstract classes that wrap their respective delegate input/output formats 
for easy hooks into vertex input format subclasses. I've included some sample 
programs that show two very simple graph algorithms. I have a graph generator 
that builds out a very simple directed structure, starting with a few 'root' 
nodes.

Root nodes are defined as nodes that are not listed as a child anywhere in the 
graph. 

Algorithm 1) AccumuloRootMarker.java  --> Accumulo as read/write source. Every 
vertex starts out assuming it is a root. At superstep 0, each vertex sends a 
message down to each of its children as a non-root notification. After 
superstep 1, the only vertices that have never been messaged are the roots. 
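
The root-flagging idea in Algorithm 1 can be sketched in plain Java without 
any Giraph, Accumulo, or HBase dependencies. This is only an illustration of 
the logic described above, not the code in AccumuloRootMarker.java: the class 
name RootMarkerSketch, the method findRoots, and the adjacency-map 
representation are all assumptions made for this example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, in-memory sketch of the two-superstep root flagging.
// The graph is a map from a vertex id to the ids of its children.
class RootMarkerSketch {

    static Set<String> findRoots(Map<String, List<String>> children) {
        // Superstep 0: every vertex assumes it is a root and sends a
        // non-root notification to each of its children. Here we only
        // record which vertices would receive at least one message.
        Set<String> messaged = new HashSet<>();
        for (List<String> kids : children.values()) {
            messaged.addAll(kids);
        }

        // Superstep 1: a vertex that received a message is not a root.
        // Vertices that appear only as children were messaged by
        // definition, so it is enough to check the map's keys.
        Set<String> roots = new TreeSet<>(children.keySet());
        roots.removeAll(messaged);
        return roots;
    }
}
```

For a graph with edges a->b, a->c, b->d, and e->d, findRoots returns the set 
containing a and e, since neither is listed as a child of any vertex.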

Algorithm 2) TableRootMarker.java --> HBase as read/write source. Expands on A1 
by following the notification logic with root node propagation. Once we've 
marked the appropriate nodes as roots, tell every child which roots it can be 
traced back to via one or more spanning trees. This takes N + 2 supersteps: 2 
for the initial root flagging, plus N, where N is the maximum number of hops 
from any root to any leaf. 
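
The propagation phase of Algorithm 2 can be sketched the same way, as a small 
in-memory simulation of message passing. This is a hedged illustration rather 
than the code in TableRootMarker.java: the names RootPropagationSketch and 
propagateRoots are invented for this example, and the choice to have each root 
list itself among its own roots is an assumption.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: flag roots, then push root ids down to descendants.
class RootPropagationSketch {

    static Map<String, Set<String>> propagateRoots(Map<String, List<String>> children) {
        // Collect every vertex id, including ids that appear only as children.
        Set<String> vertices = new TreeSet<>(children.keySet());
        Set<String> messaged = new HashSet<>();
        for (List<String> kids : children.values()) {
            vertices.addAll(kids);
            messaged.addAll(kids); // root flagging: a messaged vertex is not a root
        }

        Map<String, Set<String>> rootsOf = new TreeMap<>();
        Map<String, Set<String>> inbox = new HashMap<>();
        for (String v : vertices) {
            rootsOf.put(v, new TreeSet<>());
            if (!messaged.contains(v)) {
                // Assumption: a root seeds propagation with its own id.
                inbox.put(v, new TreeSet<>(Collections.singleton(v)));
            }
        }

        // Each loop iteration plays the role of one propagation superstep.
        while (!inbox.isEmpty()) {
            Map<String, Set<String>> next = new HashMap<>();
            for (Map.Entry<String, Set<String>> e : inbox.entrySet()) {
                Set<String> known = rootsOf.get(e.getKey());
                Set<String> fresh = new TreeSet<>(e.getValue());
                fresh.removeAll(known);
                if (fresh.isEmpty()) {
                    continue; // nothing new to record or forward
                }
                known.addAll(fresh);
                List<String> kids =
                        children.getOrDefault(e.getKey(), Collections.<String>emptyList());
                for (String child : kids) {
                    next.computeIfAbsent(child, k -> new TreeSet<>()).addAll(fresh);
                }
            }
            inbox = next;
        }
        return rootsOf;
    }
}
```

For edges a->b, a->c, b->d, and e->d, the result maps d to the roots a and e, 
maps b and c to a alone, and maps each root to itself. Forwarding only newly 
learned root ids keeps the message volume down and lets the loop stop once no 
vertex learns anything new.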

I've included all relevant code plus DistributedCacheHelper.java for recursive 
cache file and archive searches. It is more Hadoop-centric than specific to 
Giraph, but these jobs use it, so I figured I would commit it here. 

These have been tested with the local JobRunner, in pseudo-distributed mode on 
the aforementioned hardware, and fully distributed on EC2. More details are in 
the comments.


