[ 
https://issues.apache.org/jira/browse/NUTCH-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2369:
----------------------------------------
    Labels: gsoc2017 gsoc2018  (was: gsoc2017)

> Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph
> ------------------------------------------------------------------------------
>
>                 Key: NUTCH-2369
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2369
>             Project: Nutch
>          Issue Type: Task
>          Components: crawldb, graphgenerator, hostdb, linkdb, segment, 
> storage, tool
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Major
>              Labels: gsoc2017, gsoc2018
>             Fix For: 1.15
>
>
> I've been thinking for quite some time now that a new Tool which writes Nutch 
> data out as full graph data would be an excellent addition to the codebase.
> My thinking involves writing data using TinkerPop's ScriptInputFormat and 
> ScriptOutputFormat to create Vertex objects representing Nutch Crawl 
> Records. 
> http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptInputFormat.html
> http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptOutputFormat.html
> I envisage that each Vertex object would require the CrawlDB, the LinkDB, a 
> Segment and possibly the HostDB in order to be fully populated. Graph 
> characteristics, e.g. Edges, would come from those existing data structures 
> as well.
> It is my intention to propose this as a GSoC project for 2017 and I have 
> already talked offline with a potential student [~omkar20895] about him 
> participating as the student.
> Essentially, if we were able to create a Graph enabling true traversal, this 
> could be a game changer for how Nutch Crawl data is interpreted. It is my 
> feeling that this issue most likely also involves an entire upgrade of the 
> Hadoop APIs from mapred to mapreduce for the master codebase.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
