[ https://issues.apache.org/jira/browse/NUTCH-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341326#comment-16341326 ]
Lewis John McGibbney commented on NUTCH-2369:
---------------------------------------------

Hi [~markus17], the idea here was to export full graph information into something that could be interpreted by [TinkerPop|http://tinkpop.apache.org] and queried using [Gremlin|https://tinkerpop.apache.org/gremlin.html].

> Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph
> ------------------------------------------------------------------------------
>
>                 Key: NUTCH-2369
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2369
>             Project: Nutch
>          Issue Type: Task
>          Components: crawldb, graphgenerator, hostdb, linkdb, segment, storage, tool
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Major
>              Labels: gsoc2017, gsoc2018
>             Fix For: 1.15
>
>
> I've been thinking for quite some time now that a new Tool which writes Nutch
> data out as full graph data would be an excellent addition to the codebase.
> My thoughts involve writing data using TinkerPop's ScriptInputFormat and
> ScriptOutputFormat to create Vertex objects representing Nutch Crawl
> Records.
> http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptInputFormat.html
> http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptOutputFormat.html
> I envisage that each Vertex object would require the CrawlDB, the LinkDB, a
> Segment and possibly the HostDB in order to be fully populated. Graph
> characteristics, e.g. Edges, would come from those existing data structures
> as well.
> It is my intention to propose this as a GSoC project for 2017 and I have
> already talked offline with a potential student [~omkar20895] about him
> participating as the student.
> Essentially, if we were able to create a Graph enabling true traversal, this
> could be a game changer for how Nutch Crawl data is interpreted.
> It is my feeling that this issue most likely also involves an entire upgrade
> of the Hadoop APIs from mapred to mapreduce for the master codebase.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)