Updated Branches:
  refs/heads/release-1.0.0 b77acf2ef -> 253720172

GIRAPH-617: Website Documentation: Vertex/Edge Input/Output (apresta)


Project: http://git-wip-us.apache.org/repos/asf/giraph/repo
Commit: http://git-wip-us.apache.org/repos/asf/giraph/commit/b408ed1f
Tree: http://git-wip-us.apache.org/repos/asf/giraph/tree/b408ed1f
Diff: http://git-wip-us.apache.org/repos/asf/giraph/diff/b408ed1f

Branch: refs/heads/release-1.0.0
Commit: b408ed1f5ff54ac62fa073d5402515b3d892086d
Parents: b77acf2
Author: Alessandro Presta <[email protected]>
Authored: Fri Apr 12 14:20:02 2013 -0700
Committer: Alessandro Presta <[email protected]>
Committed: Mon Apr 15 11:43:55 2013 -0700

----------------------------------------------------------------------
 src/site/site.xml    |    1 +
 src/site/xdoc/io.xml |   79 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 80 insertions(+), 0 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/giraph/blob/b408ed1f/src/site/site.xml
----------------------------------------------------------------------
diff --git a/src/site/site.xml b/src/site/site.xml
index 223f66e..ebf1056 100644
--- a/src/site/site.xml
+++ b/src/site/site.xml
@@ -70,6 +70,7 @@
     
     <menu name="User Docs" inherit="top">
       <item name="Page Rank Example" href="pagerank.html"/>
+      <item name="Input/output in Giraph" href="io.html"/>
       <item name="Javadoc" href="javadoc_modules.html"/>
       <item name="Presentations" href="presentations.html"/>
       <item name="External Community Wiki" 
href="https://cwiki.apache.org/confluence/display/GIRAPH" />

http://git-wip-us.apache.org/repos/asf/giraph/blob/b408ed1f/src/site/xdoc/io.xml
----------------------------------------------------------------------
diff --git a/src/site/xdoc/io.xml b/src/site/xdoc/io.xml
new file mode 100644
index 0000000..cb752cb
--- /dev/null
+++ b/src/site/xdoc/io.xml
@@ -0,0 +1,79 @@
+<?xml version="1.0" encoding="UTF-8"?>
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+<document xmlns="http://maven.apache.org/XDOC/2.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/XDOC/2.0 
http://maven.apache.org/xsd/xdoc-2.0.xsd">
+  <properties>
+    <title>Input/output in Giraph</title>
+  </properties>
+
+  <body>
+    <section name="Overview">
+      <p>
+        Any real-world Giraph application reads input data from some sort of 
(usually distributed) storage, runs several computation supersteps, and then 
writes back the output. The input data contains a representation of the graph 
and, often, some metadata on the vertices or edges. The output data often 
consists of final vertex values, but can also contain the graph itself, 
possibly modified.          
+      </p>
+      <p>
+        For example, in a standard PageRank implementation, the input will 
contain the edges in the graph, and the output will consist of the final 
PageRank values for all vertices.
+      </p>
+      <p>
+        Giraph itself doesn't dictate a specific form of storage. Instead, the 
user implements a generic API which converts the preferred type of data to and 
from Giraph's main classes (<code>Vertex</code> and <code>Edge</code>). For 
example, one can build on top of Hadoop's TextInputFormat in order to read the 
graph from plain text files stored in HDFS. Giraph provides several examples of 
text-based formats (see the <code>org.apache.giraph.io.formats</code> package 
in <code>giraph-core</code>) and libraries to read from other types of storage 
(<code>giraph-hive</code>, <code>giraph-hbase</code>, etc.).
+      </p>
+    </section>
+    <section name="Main interfaces">
+      <p>
+        <br>
+          Giraph's I/O builds on top of Hadoop's input/output format API. This 
allows us to easily incorporate existing Hadoop formats.
+        </br>
+        There are two main ways the input graph may be laid out: the directed 
edges may be grouped by source vertex (i.e., represented as an adjacency list), 
or they may appear in arbitrary order (as is often the case with relational 
storage, where each record corresponds to an edge). In the first case, any 
metadata for the vertex can be read together with its out-edges. This is 
achieved by implementing <code>VertexInputFormat</code>. In the second case, 
edges will be read by means of an <code>EdgeInputFormat</code>. If there is 
additional data for the vertices, it will be read separately by a 
<code>VertexValueInputFormat</code>.
+      </p>
+      <p>
+        To summarize, <code>VertexInputFormat</code> is usually used by 
itself, whereas <code>EdgeInputFormat</code> may be used in combination with 
<code>VertexValueInputFormat</code>.
+      </p>
+      <p>
+        Output is always done on a per-vertex basis: a 
<code>VertexOutputFormat</code> will specify what data to write for each 
vertex. This usually means (some function of) the vertex value, but nothing 
prevents us from writing back the edges instead.
+      </p>
+      <p>
+        Let's have a quick look at the base classes:
+        <ul>
+          <li>
+            <code>VertexInputFormat&lt;I, V, E&gt;</code>: the 
<code>getSplits()</code> method returns a list of <i>logical</i> splits of the 
input data, given a hint provided by the infrastructure (usually the total 
number of input threads across all workers). The 
<code>createVertexReader()</code> method returns a <code>VertexReader</code> 
for the given split.
+          </li>
+          <li>
+            <code>VertexReader&lt;I, V, E&gt;</code>: this is where the user 
defines how to create vertices from an input split. The interface is similar to 
Hadoop's <code>RecordReader</code>: the infrastructure calls 
<code>nextVertex()</code> to advance and <code>getCurrentVertex()</code> to 
read each vertex, until <code>nextVertex()</code> returns false, meaning the input split is finished.
+          </li>
+          <li>
+            <code>VertexValueInputFormat&lt;I, V&gt;</code>: similar to the 
above, but creates a <code>VertexValueReader</code> instead.
+          </li>
+          <li>
+            <code>VertexValueReader&lt;I, V&gt;</code>: here we are not 
reading edges, so the methods to define are <code>getCurrentVertexId()</code> 
and <code>getCurrentVertexValue()</code>.
+          </li>
+          <li>
+            <code>EdgeInputFormat&lt;I, E&gt;</code>: as usual, splits the 
input via <code>getSplits()</code> and creates an <code>EdgeReader</code> via 
<code>createEdgeReader()</code>.
+          </li>
+          <li>
+            <code>EdgeReader&lt;I, E&gt;</code>: the main methods are 
<code>getCurrentSourceId()</code>, which returns the source vertex id, and 
<code>getCurrentEdge()</code>, which returns an <code>Edge&lt;I, E&gt;</code> 
(i.e., the target vertex id, possibly with an edge value).
+          </li>
+        </ul>
+      </p>
+    </section>
+  </body>
+</document>
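
Illustration (not part of the commit): the edge-based input path described in io.xml can be sketched in plain Java. This is a minimal sketch under stated assumptions, not Giraph code. The classes EdgeReaderSketch, SimpleEdge and TextEdgeReader are hypothetical stand-ins with no Hadoop types or input splits, the "source target value" text layout is invented for the example, and nextEdge() is assumed by analogy with the nextVertex() method the page describes. Only getCurrentSourceId() and getCurrentEdge() are names taken from the documentation itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the reader contract described in
// io.xml. This is NOT Giraph's EdgeReader: no Hadoop types, no input splits.
class EdgeReaderSketch {

  /** Minimal edge: the target vertex id plus an edge value. */
  static final class SimpleEdge {
    final long targetId;
    final double value;

    SimpleEdge(long targetId, double value) {
      this.targetId = targetId;
      this.value = value;
    }
  }

  /** Reads one edge per record; assumed layout is "source target value". */
  static final class TextEdgeReader {
    private final Iterator<String> lines;
    private long currentSourceId;
    private SimpleEdge currentEdge;

    TextEdgeReader(List<String> lines) {
      this.lines = lines.iterator();
    }

    /** Advances to the next edge; returns false once the input is exhausted. */
    boolean nextEdge() {
      if (!lines.hasNext()) {
        return false;
      }
      String[] fields = lines.next().trim().split("\\s+");
      currentSourceId = Long.parseLong(fields[0]);
      currentEdge = new SimpleEdge(
          Long.parseLong(fields[1]), Double.parseDouble(fields[2]));
      return true;
    }

    long getCurrentSourceId() {
      return currentSourceId;
    }

    SimpleEdge getCurrentEdge() {
      return currentEdge;
    }
  }

  /** Drives the reader and groups edges by source vertex (adjacency lists). */
  static Map<Long, List<SimpleEdge>> loadAdjacency(TextEdgeReader reader) {
    Map<Long, List<SimpleEdge>> adjacency = new LinkedHashMap<>();
    while (reader.nextEdge()) {
      adjacency
          .computeIfAbsent(reader.getCurrentSourceId(), id -> new ArrayList<>())
          .add(reader.getCurrentEdge());
    }
    return adjacency;
  }

  public static void main(String[] args) {
    // Edges arrive in arbitrary order, as with relational storage.
    List<String> records = Arrays.asList("1 2 0.5", "3 1 1.0", "1 3 2.0");
    Map<Long, List<SimpleEdge>> adjacency =
        loadAdjacency(new TextEdgeReader(records));
    for (Map.Entry<Long, List<SimpleEdge>> entry : adjacency.entrySet()) {
      System.out.println(
          entry.getKey() + " has " + entry.getValue().size() + " out-edge(s)");
    }
  }
}
```

With the three sample records, vertex 1 ends up with two out-edges and vertex 3 with one. The same advance-then-read loop would apply to a vertex reader, except that each record would already carry the vertex's out-edges, so no grouping step would be needed.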
