[ https://issues.apache.org/jira/browse/NUTCH-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338290#comment-16338290 ]
Markus Jelsma commented on NUTCH-2369:
--------------------------------------

How is this different from the current WebGraph package, which is just a hyperlink graph without HostDB or CrawlDB information? It already carries everything LinkDB has to offer and more (except perhaps anchors). If this graph carries HostDB and CrawlDB information, it would probably become very heavy, heavier than the current WebGraph.

Wouldn't it make more sense to have a separate tool that combines data from different DBs, such as the recent usage of HostDB in the generator? It is relatively fast, on demand, and does not require a prebuilt combined database.

Or, for what purpose would you access this new graph? To get information you can already get with the right tool and separate databases?

Or I am completely missing the picture, which happens often enough anyway.

> Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph
> ------------------------------------------------------------------------------
>
>                 Key: NUTCH-2369
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2369
>             Project: Nutch
>          Issue Type: Task
>          Components: crawldb, graphgenerator, hostdb, linkdb, segment,
>                      storage, tool
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Major
>              Labels: gsoc2017, gsoc2018
>             Fix For: 1.15
>
>
> I've been thinking for quite some time now that a new Tool which writes Nutch
> data out as full graph data would be an excellent addition to the codebase.
> My thoughts involve writing data using Tinkerpop's ScriptInputFormat and
> ScriptOutputFormat to create Vertex objects representing Nutch Crawl
> Records.
> http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptInputFormat.html
> http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptOutputFormat.html
> I envisage that each Vertex object would require the CrawlDB, the LinkDB, a
> Segment and possibly the HostDB in order to be fully populated. Graph
> characteristics, e.g. Edges, would come from those existing data structures
> as well.
> It is my intention to propose this as a GSoC project for 2017 and I have
> already talked offline with a potential student [~omkar20895] about him
> participating as the student.
> Essentially, if we were able to create a Graph enabling true traversal, this
> could be a game changer for how Nutch Crawl data is interpreted. It is my
> feeling that this issue most likely also involves an entire upgrade of the
> Hadoop APIs from mapred to mapreduce for the master codebase.
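
For readers weighing the two positions above, here is a minimal sketch of what "populating a vertex from several stores" could look like. It is written in plain Python with in-memory dictionaries standing in for the CrawlDB, LinkDB and HostDB. Every name in it (`Vertex`, `build_graph`, the property keys, the example URLs) is hypothetical and chosen for illustration; it does not use Nutch's or TinkerPop's actual APIs and is not the design proposed in the issue.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class Vertex:
    """One crawled URL with properties merged from several stores (hypothetical)."""
    url: str
    properties: dict = field(default_factory=dict)


def build_graph(crawldb, linkdb, hostdb=None):
    """Join per-URL records into vertices and derive edges from inlinks.

    crawldb: url -> dict of crawl properties (e.g. status)
    linkdb:  target url -> list of (source url, anchor text)
    hostdb:  optional host -> dict of host-level properties
    """
    hostdb = hostdb or {}
    vertices = {}
    for url, record in crawldb.items():
        props = dict(record)
        host = urlparse(url).hostname
        # Host-level data is copied onto every vertex of that host; this
        # duplication is one reason a combined graph grows heavier than a
        # plain hyperlink graph.
        for key, value in hostdb.get(host, {}).items():
            props["host." + key] = value
        vertices[url] = Vertex(url, props)

    edges = []
    for target, inlinks in linkdb.items():
        for source, anchor in inlinks:
            # Keep only edges whose endpoints are both known vertices.
            if source in vertices and target in vertices:
                edges.append((source, target, anchor))
    return vertices, edges


# Toy stand-ins for the three stores (assumed shapes, not Nutch formats).
crawldb = {
    "http://a.example/": {"status": "fetched"},
    "http://b.example/": {"status": "unfetched"},
}
linkdb = {"http://b.example/": [("http://a.example/", "link to b")]}
hostdb = {"a.example": {"fetched": 1}}

vertices, edges = build_graph(crawldb, linkdb, hostdb)
```

The same join could be run on demand by a separate tool, as the comment suggests, rather than being materialized ahead of time; the sketch only shows the shape of the combined record, not where it should live.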