Upgrade of mapred --> mapreduce in trunk e.g. Nutch 3.X

Lewis John Mcgibbney Sat, 14 Nov 2015 02:28:29 -0800

Hi Folks,

Mike Joyce and I have been working on a TinkerPop implementation of
Node and NodeDB (generated through WebGraph) which builds a Vertex input
that is used by TinkerPop, and subsequently Gremlin, and is persisted into a
graph database such as TitanDB.
We have analyzed the problem quite a bit and come across the following I/O
formats:
http://tinkerpop.incubator.apache.org/docs/3.0.1-incubating/#script-io-format
I've implemented a PropertyWebGraphVertex writable in Nutch which builds
on NodeDB (and others) to enable us to write out to
the ScriptOutputFormat. Essentially we address the issue of parent-child
vs. child-parent relationships, i.e. Outlinks vs. Inlinks respectively.
The work from there then consists of a process external to Nutch
invoking a Groovy script from within Gremlin to ingest data into TitanDB.
During the course of this work we have realized that having both the mapred
and mapreduce APIs in trunk is NOT OK if we want to move Nutch to accommodate
the architecture described above.
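
To make the Outlinks vs. Inlinks point concrete, here is a minimal Java
sketch of a vertex that carries its links in both directions and is written
out as one line of text. It is only an illustration: PageVertex, toLine and
the tab/space-delimited layout are hypothetical names and choices made for
this example, not the actual PropertyWebGraphVertex code, and the real line
layout used with ScriptOutputFormat is whatever the supplied Groovy script
produces.

```java
import java.util.ArrayList;
import java.util.List;

public class VertexLineSketch {

    /** Hypothetical stand-in for a web graph node holding links in both directions. */
    static final class PageVertex {
        final String url;
        final List<String> outlinks = new ArrayList<>(); // parent -> child
        final List<String> inlinks = new ArrayList<>();  // child -> parent

        PageVertex(String url) {
            this.url = url;
        }
    }

    /**
     * Renders a vertex as "url TAB outlinks TAB inlinks", with the URLs in
     * each list separated by single spaces. The layout is an assumption
     * chosen for this sketch only.
     */
    static String toLine(PageVertex v) {
        return v.url
            + "\t" + String.join(" ", v.outlinks)
            + "\t" + String.join(" ", v.inlinks);
    }

    public static void main(String[] args) {
        PageVertex v = new PageVertex("http://a.example/");
        v.outlinks.add("http://b.example/");
        v.inlinks.add("http://c.example/");
        System.out.println(toLine(v));
    }
}
```

Keeping both directions on the vertex means the ingest script does not have
to reconstruct Inlinks from Outlinks on the graph database side.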


Breath of fresh air and a deep breath...

What do you guys think about branching trunk into a 3.X branch with every
mapred --> mapreduce package addressed?
Mike, Sujen and I talked today. We want to touch base with everyone
on dev@, as this lends itself very much to the work undertaken in
https://issues.apache.org/jira/browse/NUTCH-2097

It does not, however, totally rearrange the codebase. It will
generate a genuine graph output based upon
http://tinkerpop.incubator.apache.org/docs/3.0.1-incubating/#script-io-format
We can have a Gremlin script as part of $NUTCH_HOME/conf which merely
ingests data (along with a config file) into a graph database such as Titan.

What does everyone think?
Thanks
Lewis

-- 
*Lewis*
