[ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dennis Kubes updated NUTCH-635:
-------------------------------

    Attachment: NUTCH-635-5-20080620.patch

Refactored patch that removes network calls using MapFile.Readers and better simulates a row matrix through inverting and merging inlink scores. This patch works in the general sort-merge-process structure of MapReduce and as such should be significantly faster. The previous jobs were taking far too long to process on a large dataset.

This patch includes the link analysis tool, a tool for updating the crawl db with a new score and clearing scores of urls with no score, an outlink database tool, a new inlink database tool that will keep inlinks consistent with outlinks, and a new scoring plugin which replaces the opic plugin.

The order of tool runs should now be: Inject, Generate, Fetch, UpdateDb, OutlinkDb, InlinkDb, LinkAnalysis, ScoreUpdater, Indexer

> LinkAnalysis Tool for Nutch
> ---------------------------
>
>                 Key: NUTCH-635
>                 URL: https://issues.apache.org/jira/browse/NUTCH-635
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch,
> NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch,
> NUTCH-635-5-20080620.patch
>
>
> This is a basic PageRank-type link analysis tool for Nutch which simulates a
> sparse matrix using inlinks and outlinks and converges after a given number
> of iterations. This tool is meant to replace the current scoring system in
> Nutch with a system that converges instead of exponentially increasing
> scores. Also includes a tool to create an outlinkdb.
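
For readers unfamiliar with the general technique, the following is a minimal sketch of a PageRank-type iteration over inverted outlinks, run for a fixed number of passes. It is not taken from the attached patch, which works as MapReduce jobs over Nutch databases. The function names (`invert_outlinks`, `link_analysis`), the in-memory dictionaries, the damping factor of 0.85, and the uniform initial score are all assumptions made for illustration.

```python
from collections import defaultdict


def invert_outlinks(outlinks):
    """Build an inlink map (target -> list of sources) from an outlink map."""
    inlinks = defaultdict(list)
    for source, targets in outlinks.items():
        for target in targets:
            inlinks[target].append(source)
    return dict(inlinks)


def link_analysis(outlinks, iterations=10, damping=0.85):
    """Run a fixed number of PageRank-style passes over a small link graph.

    `outlinks` maps each url to the list of urls it links to. The damping
    value and the uniform starting score are illustrative assumptions.
    """
    # Collect every url that appears as a source or as a target.
    urls = set(outlinks)
    for targets in outlinks.values():
        urls.update(targets)

    inlinks = invert_outlinks(outlinks)
    n = len(urls)
    scores = {url: 1.0 / n for url in urls}

    for _ in range(iterations):
        new_scores = {}
        for url in urls:
            # Each source splits its current score evenly over its outlinks.
            incoming = sum(
                scores[src] / len(outlinks[src])
                for src in inlinks.get(url, [])
            )
            new_scores[url] = (1.0 - damping) / n + damping * incoming
        scores = new_scores
    return scores


if __name__ == "__main__":
    graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
    for url, score in sorted(link_analysis(graph).items()):
        print(url, round(score, 4))
```

Because every pass redistributes existing score rather than adding to it, the values stay bounded however many iterations are run, which is the property the issue description contrasts with exponentially increasing scores. Urls without outlinks simply pass no score on in this sketch.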