Hi Folks,

Mike Joyce and I have been working on a Tinkerpop implementation of Node and NodeDB (generated through WebGraph). It builds a Vertex input that can be used by Tinkerpop, and subsequently Gremlin, and persisted into a graph database such as TitanDB.

We have analyzed the problem quite a bit and came across the following I/O format:
http://tinkerpop.incubator.apache.org/docs/3.0.1-incubating/#script-io-format

I've implemented a PropertyWebGraphVertex writable in Nutch which builds off of NodeDB (and others) to enable us to write out to the ScriptOutputFormat. Essentially, we address the issue of parent-child vs. child-parent relationships, i.e. Outlinks vs. Inlinks respectively. The work from there consists of a process external to Nutch invoking a Groovy script from within Gremlin to ingest the data into TitanDB.

During the course of this work we have realized that having both the mapred and mapreduce APIs in trunk is NOT OK if we want to move Nutch to accommodate the architecture described above.
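To make the Outlinks vs. Inlinks point a bit more concrete, here is a rough sketch of the kind of one-line-per-vertex text record that could be handed to a script-based format. Everything in it is an assumption for illustration only: the class name VertexLineSketch, the toLine method and the tab/space-separated layout are hypothetical, and this is not the actual PropertyWebGraphVertex code. As far as I understand it, the script I/O format leaves the line layout to the Groovy script, so the real layout may well differ.

```java
import java.util.List;

// Hypothetical sketch: renders one page (vertex) together with its
// outlinks (parent -> child) and inlinks (child -> parent) as one line.
public class VertexLineSketch {

    // Assumed layout: <url> TAB out=<url> <url> ... TAB in=<url> <url> ...
    // Real code would need to escape tabs and spaces inside URLs.
    public static String toLine(String url, List<String> outlinks, List<String> inlinks) {
        return url
            + "\t" + "out=" + String.join(" ", outlinks)
            + "\t" + "in=" + String.join(" ", inlinks);
    }

    public static void main(String[] args) {
        // Example page with two outlinks and one inlink (made-up URLs).
        String line = toLine(
            "http://a.example/",
            List.of("http://b.example/", "http://c.example/"),
            List.of("http://d.example/"));
        System.out.println(line);
    }
}
```

A Groovy parse script on the Gremlin side would then split each line on tabs and create one edge per listed URL, with the direction taken from the out= or in= prefix.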
Breath of fresh air and a deep breath...

What do you guys think about branching trunk into a 3.X branch in which every mapred package is migrated to mapreduce?

Mike, Sujen and I talked today. We want to touch base with everyone on dev@, as this lends itself very much to the work undertaken in
https://issues.apache.org/jira/browse/NUTCH-2097

It does not, however, totally rearrange the codebase. It will, however, generate a genuine graph output based upon
http://tinkerpop.incubator.apache.org/docs/3.0.1-incubating/#script-io-format

We can have a Gremlin script as part of $NUTCH_HOME/conf which merely ingests the data (along with a config file) into a graph database such as Titan.

What does everyone think?

Thanks,
Lewis