Updated Branches: refs/heads/trunk f7c302587 -> 82dad0a17
http://git-wip-us.apache.org/repos/asf/giraph/blob/82dad0a1/src/site/site.xml ---------------------------------------------------------------------- diff --git a/src/site/site.xml b/src/site/site.xml index b299b59..a5931ea 100644 --- a/src/site/site.xml +++ b/src/site/site.xml @@ -78,6 +78,7 @@ <item name="Page Rank Example" href="pagerank.html"/> <item name="Input/Output in Giraph" href="io.html"/> <item name="Hive" href="hive.html"/> + <item name="Gora" href="gora.html"/> <item name="Rexster" href="rexster.html"/> <item name="Aggregators" href="aggregators.html"/> <item name="Out-of-core" href="ooc.html"/> http://git-wip-us.apache.org/repos/asf/giraph/blob/82dad0a1/src/site/xdoc/gora.xml ---------------------------------------------------------------------- diff --git a/src/site/xdoc/gora.xml b/src/site/xdoc/gora.xml new file mode 100644 index 0000000..ea8f1c8 --- /dev/null +++ b/src/site/xdoc/gora.xml @@ -0,0 +1,354 @@ +<?xml version="1.0" encoding="UTF-8"?> + +<!-- + ~ Licensed to the Apache Software Foundation (ASF) under one + ~ or more contributor license agreements. See the NOTICE file + ~ distributed with this work for additional information + ~ regarding copyright ownership. The ASF licenses this file + ~ to you under the Apache License, Version 2.0 (the + ~ "License"); you may not use this file except in compliance + ~ with the License. You may obtain a copy of the License at + ~ + ~ http://www.apache.org/licenses/LICENSE-2.0 + ~ + ~ Unless required by applicable law or agreed to in writing, software + ~ distributed under the License is distributed on an "AS IS" BASIS, + ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + ~ See the License for the specific language governing permissions and + ~ limitations under the License. 
+ --> + +<document xmlns="http://maven.apache.org/XDOC/2.0" + xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" + xsi:schemaLocation="http://maven.apache.org/XDOC/2.0 + http://maven.apache.org/xsd/xdoc-2.0.xsd"> + + <properties> + <title>Giraph Input/Output with Gora</title> + </properties> + + <body> + <section name="Overview"> + The <a class="externalLink" href="http://gora.apache.org/index.html">Apache + Gora</a> project is an open source framework which provides an in-memory + data model and persistence for big data. Gora supports persisting to column + stores, key value stores, document stores and RDBMSs, and + analyzing the data with extensive Apache Hadoop MapReduce support. + <br /> + + The main motivation for integrating these two Apache projects is + to turn Gora-supported NoSQL data stores into + Giraph-processable graphs, and to give Giraph the ability to store its + results in different data stores, letting users focus on the processing itself. + <br /> + + Gora works by defining the data model, i.e. how our data is going to be + stored, using a JSON-like schema inspired by + <a class="externalLink" href="http://avro.apache.org/">Apache Avro</a>, and by + doing the physical mapping to the data store using an XML file. + The former helps us generate data beans which will be read from or written + to different data stores, and the latter helps us define which data + bean should go where. 
+ + In this way, Giraph will be able to read/write data using three files: + <ul> + <li>The generated data beans representing our data model.</li> + <li>The XML mapping file representing our physical mapping.</li> + <li>A file called <code>gora.properties</code> containing + configurations related to which data store Gora will use.</li> + </ul> + The image below shows how this integration works: + <img src="images/Gora-Giraph.svg" alt="Giraph Gora integration"/> + + </section> + <section name="Generating DataBeans"> + The first thing we have to do is define our data model using a JSON-like schema. Here is + a schema resembling graphs stored inside Apache HBase through Gora. The following shows a schema + for a vertex: + <div class="source"><pre class="prettyprint"> +{"type": "record", +"name": "Vertex", +"namespace": "org.apache.giraph.gora.generated", +"fields" : [ + {"name": "vertexId", "type": "long"}, + {"name": "value", "type": "float"}, + {"name": "edges", + "type": { + "type":"array", "items": { + "name": "Edge", + "type": "record", + "namespace": "org.apache.giraph.gora.generated", + "fields": [ + {"name": "vertexId", "type": "long"}, + {"name": "edgeValue", "type": "float"} + ] + } + } + } + ] +}</pre></div> + + The following schema shows what an edge should look like: + <div class="source"><pre class="prettyprint"> + { + "type": "record", + "name": "GEdge", + "namespace": "org.apache.giraph.gora.generated", + "fields" : [ + {"name": "edgeId", "type": "string"}, + {"name": "edgeWeight", "type": "float"}, + {"name": "vertexInId", "type": "string"}, + {"name": "vertexOutId", "type": "string"}, + {"name": "label", "type": "string"} + ] + } + </pre></div> + + Now we are ready to generate our data beans. To do this, we need to use gora-core.jar, which + comes with Giraph. 
The Gora compiler takes three parameters: + <div class="source"><pre class="prettyprint"> + &lt;schema file&gt; - REQUIRED - individual avsc file to be compiled or a directory path containing avsc files + &lt;output dir&gt; - REQUIRED - output directory for generated Java files + &lt;-license id&gt; - the preferred license header to add to the + </pre></div> + + Executing the Gora compiler with the commands below creates the generated data beans + in the given output directory. + + <div class="source"><pre class="prettyprint"> + java -cp gora-core-0.4-SNAPSHOT.jar org.apache.gora.compiler.GoraCompiler vertex.avsc gora-app/src/main/java/ + java -cp gora-core-0.4-SNAPSHOT.jar org.apache.gora.compiler.GoraCompiler edge.avsc gora-app/src/main/java/ + </pre></div> + + <br /> + This results in a Java class that looks similar to this: + <div class="source"><pre class="prettyprint"> + /** + * Class for defining a Giraph-Vertex. + */ + @SuppressWarnings("all") + public class GVertex extends PersistentBase { + /** + * Schema used for the class. + */ + public static final Schema OBJ_SCHEMA = Schema.parse( + "{\"type\":\"record\",\"name\":\"Vertex\"," + + "\"namespace\":\"org.apache.giraph.gora.generated\"," + + "\"fields\":[{\"name\":\"vertexId\",\"type\":\"string\"}," + + "{\"name\":\"value\",\"type\":\"float\"},{\"name\":\"edges\"," + + "\"type\":{\"type\":\"map\",\"values\":\"string\"}}]}"); + + /** + * Vertex Id + */ + private Utf8 vertexId; + + /** + * Gets vertexId + * @return Utf8 vertexId + */ + public Utf8 getVertexId() { + return (Utf8) get(0); + } + + /** + * Sets vertexId + * @param value vertexId + */ + public void setVertexId(Utf8 value) { + put(0, value); + } + . . . + </pre></div> + + Once this logical data modeling is done, the physical mapping between these generated + classes and the actual data repositories has to be made. Gora does this by using an + XML "mapping file". 
+ <br /> + The file below represents a <code>gora-hbase-mapping.xml</code>, i.e. the necessary + information to map our data model into HBase tables. Within the <code>table</code> tags, + the necessary column families are defined. Moreover, within the + <code>class</code> tags, the actual generated Java bean is mapped into the column + families. Inside this, each field should be mapped to its respective column + family and to the HBase qualifier used for storing this field. + <br /> + This mapping file can contain as many mappings as there are generated data beans in our application, + i.e. we can define more <code>table</code> tags with their own <code>class</code> + and <code>field</code> tags. + + <div class="source"><pre class="prettyprint"> + <gora-orm> + <table name="graphGiraph"> + <family name="vertices"/> + </table> + <class name="org.apache.giraph.io.gora.generated.GVertex" keyClass="java.lang.String" table="graphGiraph"> + <field name="vertexId" family="vertices" qualifier="vertexId"/> + <field name="value" family="vertices" qualifier="value"/> + <field name="edges" family="vertices" qualifier="edges"/> + </class> + </gora-orm> + </pre></div> + A more complex file can be found inside the <code>giraph-gora/conf</code> folder. + + </section> + <section name="Preparation"> + Once the data beans have been generated, the <code>gora.properties</code> file + has to be created. This file specifies which data store is going to be used with + Gora, and also contains extra information about that data store. An example of + such a file can be found inside the <code>giraph-gora/conf</code> folder. Following + our example, if Apache HBase is used, <code>gora.properties</code> + should contain the configuration shown below:<br /> + <code> + # FOR HBASE DATASTORE + gora.datastore.default=org.apache.gora.hbase.store.HBaseStore + </code> + Then to be able to use the Gora API the user needs to prepare the Gora environment. 
+ This is no more than having set up one of the data stores Gora supports, having + the data beans generated and the <code>gora.properties</code> file set up. A more + detailed yet simple tutorial can be found + <a href="http://gora.apache.org/current/tutorial.html">here</a>. + + <br /> + The data definition files should be available in the classpath when the + Giraph job is run. All configuration files needed for each specific data + store should also be made available across the cluster. For example, if we were + to use HBase along with Giraph and Gora, then the hbase-site.xml file should be passed + along as well. There are several ways to make these files available, and one common + way to do this is with the <code>-files</code> option. This option would look + similar to this: <br /> + <div class="source"><pre class="prettyprint"> + -files ../conf/gora.properties,../conf/gora-hbase-mapping.xml,../conf/hbase-site.xml + </pre></div><br /> + + Gora also needs to be told which serialization types it will use. These serialization + types could be configured across the cluster, but if that is not desired, then they can be + passed using the <code>-D</code> option of Hadoop. This option would look + similar to this:<br /> + <div class="source"><pre class="prettyprint"> + -Dio.serializations=org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.JavaSerialization + </pre></div><br /> + </section> + + <section name="Configuration Options"> + Now that the data beans have been generated and the Gora environment is ready, + the configuration options for this API have to be known in order to be specified + by the user. 
These configuration options are as follows: <br /> + <table border='0'> + <tr> + <th>label</th> + <th>type</th> + <th>description</th> + </tr> + <tr> + <td>giraph.gora.datastore.class</td> + <td>String</td> + <td>Gora DataStore class to read data from - required.</td> + </tr> + <tr> + <td>giraph.gora.key.class</td> + <td>String</td> + <td>Gora Key class to query the datastore - required.</td> + </tr> + <tr> + <td>giraph.gora.persistent.class</td> + <td>String</td> + <td>Gora Persistent class to read objects from Gora - required.</td> + </tr> + <tr> + <td>giraph.gora.start.key</td> + <td>String</td> + <td>Gora start key to query the datastore.</td> + </tr> + <tr> + <td>giraph.gora.end.key</td> + <td>String</td> + <td>Gora end key to query the datastore.</td> + </tr> + <tr> + <td>giraph.gora.keys.factory.class</td> + <td>String</td> + <td>Keys factory to convert strings into desired keys - required.</td> + </tr> + <tr> + <td>giraph.gora.output.datastore.class</td> + <td>String</td> + <td>Gora DataStore class to write data to - required.</td> + </tr> + <tr> + <td>giraph.gora.output.key.class</td> + <td>String</td> + <td>Gora Key class to write to datastore - required.</td> + </tr> + <tr> + <td>giraph.gora.output.persistent.class</td> + <td>String</td> + <td>Gora Persistent class to write to Gora - required. + </td> + </tr> + </table> + </section> + + <section name="Input/Output Example"> + To make use of the Giraph input API available for Gora, it is required to extend the + classes <code>GoraVertexInputFormat</code> or <code>GoraEdgeInputFormat</code>. + In the first class, the only method that has to be implemented is + <code>transformVertex</code>, which transforms a <code>Gora Object</code> into a + Giraph <code>Vertex</code> object. Likewise, for the second class the methods + that have to be implemented are <code>transformEdge</code>, to convert a + <code>Gora Edge Object</code> into a Giraph <code>Edge</code> object, and + <code>getCurrentSourceId</code>. 
There are two examples of such implementations: + <code>GoraGVertexVertexInputFormat</code> and + <code>GoraGEdgeEdgeInputFormat</code>. One other class that has to be implemented + here is the <code>KeyFactory</code>, because this class is used to transform the keys + passed as strings through the options into the actual Gora key objects used to query + the data store. The default one assumes your key type is a <code>String</code>.<br /> + + On the other hand, to make use of the Giraph output API available for Gora, + it is required to extend the classes <code>GoraVertexOutputFormat</code> or + <code>GoraEdgeOutputFormat</code>. + In the first class, the methods that have to be implemented are + <code>getGoraVertex</code>, to transform a Giraph Vertex object into a + Gora object, and <code>getGoraKey</code>, to determine the key which will represent + such a vertex. Likewise, for the Edge output class the methods + that have to be implemented are <code>getGoraEdge</code>, to convert a Giraph + Edge object into a Gora Edge object, and <code>getGoraKey</code>, to determine the + key which will represent such an edge. There are two examples of such implementations: + <code>GoraGVertexVertexOutputFormat</code> and + <code>GoraGEdgeEdgeOutputFormat</code>. + <br /> + + An example command showing how to put together all these classes and configurations + is shown below. It runs the shortest paths algorithm on the + graph stored in the data store described previously. 
+ <br /> + <code> + export GIRAPH_CORE_JAR=$GIRAPH_CORE_TARGET_DIR/giraph-$GIRAPH_VERSION-for-$HADOOP_VERSION-jar-with-dependencies.jar<br /> + export GIRAPH_EXAMPLES_JAR=$GIRAPH_EXAMPLES_TARGET_DIR/giraph-examples-$GIRAPH_VERSION-for-$HADOOP_VERSION-jar-with-dependencies.jar<br /> + export GIRAPH_GORA_JAR=$GIRAPH_GORA_TARGET_DIR/giraph-gora-$GIRAPH_VERSION-SNAPSHOT-jar-with-dependencies.jar<br /> + export GORA_HBASE_JAR=$GORA_HBASE_TARGET_DIR/gora-hbase-$GORA_VERSION.jar<br /> + export HBASE_JAR=$GORA_DIR/gora-hbase/lib/hbase-0.90.4.jar<br /> + export HADOOP_CLASSPATH=$GIRAPH_CORE_JAR:$GIRAPH_EXAMPLES_JAR:$GIRAPH_GORA_JAR:$GORA_HBASE_JAR<br/><br/> + </code><br /> + + <div class="source"><pre class="prettyprint"> + hadoop jar $GIRAPH_EXAMPLES_JAR org.apache.giraph.GiraphRunner + -files ../conf/gora.properties,../conf/gora-hbase-mapping.xml,../conf/hbase-site.xml + -Dio.serializations=org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.JavaSerialization + -Dgiraph.gora.datastore.class=org.apache.gora.hbase.store.HBaseStore + -Dgiraph.gora.key.class=java.lang.String + -Dgiraph.gora.persistent.class=org.apache.giraph.io.gora.generated.GEdge + -Dgiraph.gora.start.key=0 + -Dgiraph.gora.end.key=10 + -Dgiraph.gora.keys.factory.class=org.apache.giraph.io.gora.utils.KeyFactory + -Dgiraph.gora.output.datastore.class=org.apache.gora.hbase.store.HBaseStore + -Dgiraph.gora.output.key.class=java.lang.String + -Dgiraph.gora.output.persistent.class=org.apache.giraph.io.gora.generated.GEdgeResult + -libjars $GIRAPH_GORA_JAR,$GORA_HBASE_JAR,$HBASE_JAR + org.apache.giraph.examples.SimpleShortestPathsComputation + -eif org.apache.giraph.io.gora.GoraGEdgeEdgeInputFormat + -eof org.apache.giraph.io.gora.GoraGEdgeEdgeOutputFormat + -w 1 + </pre></div><br /> + </section> + </body> +</document>
