[ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619941#action_12619941 ]
Andrzej Bialecki commented on NUTCH-635:
-----------------------------------------

A few comments on the latest patch:

* Some crucial javadoc is missing, at least the class-level comments, especially for classes that are cmd-line utilities or that support a major functionality.
* Perhaps we don't need a separate Node db; this information can be added directly to the CrawlDb, which could save us the trouble of running the ScoreUpdater.
* Minor thing, but in many classes you use a repeating pattern of creating instances of List, HashSet, ObjWritable, etc. inside the map()/reduce() methods, while they should be created once and reused.
* LinkDatum:
** linkType should be byte, not int - this saves 3 bytes on each entry.
* LinkRank:
** I wonder if we couldn't skip the Counter job, and instead collect the total number of links via Hadoop job counters. I.e. define counters in the Mapper/Reducer of the analysis job, and then after the job is done you can retrieve them from a RunningJob instance. We could then maintain this value in a well-known location on each update of the db, as you do already, except we could skip the additional runCounter(..) job ...
* Loops:
** Loops.Route.readFields(): I think it's better to use Text.readString() instead of DataInput.readUTF(). Or, for that matter, replace the plain Strings with Text, since in many other places in Loops you need to create a Text object out of one of Route's fields anyway.
* LinkUpdater:
** I don't understand why clearScore is set to 0.00001f. What's with the magic number?
* ReprUrlFixer should go into tools.compat
* ResolveUrls uses the ReprUrlFixer log; it should use its own. Besides, this tool is not relevant to this patch, so I think it should be submitted separately.
* The new indexing framework: I like the added flexibility, but the cost of it seems high. Previously we only had to run a single map-red job to create an index; now we have to run at least 6 jobs, each with a large dataset. I vote for splitting the patch and creating a separate issue for this framework, so that we can discuss it further.

> LinkAnalysis Tool for Nutch
> ---------------------------
>
> Key: NUTCH-635
> URL: https://issues.apache.org/jira/browse/NUTCH-635
> Project: Nutch
> Issue Type: New Feature
> Affects Versions: 1.0.0
> Environment: All
> Reporter: Dennis Kubes
> Assignee: Dennis Kubes
> Fix For: 1.0.0
>
> Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch,
> NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch,
> NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch
>
>
> This is a basic pagerank type link analysis tool for nutch which simulates a
> sparse matrix using inlinks and outlinks and converges after a given number
> of iterations. This tool is meant to replace the current scoring system in
> nutch with a system that converges instead of exponentially increasing
> scores. Also includes a tool to create an outlinkdb.
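
To illustrate the LinkDatum remark about linkType, here is a minimal sketch of the size difference between writing the field as an int and as a byte. It uses only java.io rather than the Hadoop Writable interface, and the class name LinkTypeSizeDemo and the INLINK/OUTLINK constants are made up for the example; the actual constants in the patch may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class LinkTypeSizeDemo {

  // Hypothetical link type values, small enough to fit in one byte.
  static final byte INLINK = 1;
  static final byte OUTLINK = 2;

  /** Encodes the link type with writeInt() and returns the encoded size. */
  static int sizeAsInt(int linkType) {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buf)) {
      out.writeInt(linkType);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buf.size();
  }

  /** Encodes the link type with writeByte() and returns the encoded size. */
  static int sizeAsByte(byte linkType) {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buf)) {
      out.writeByte(linkType);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buf.size();
  }

  /** Writes the link type as a single byte and reads it back. */
  static byte roundTrip(byte linkType) {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buf)) {
      out.writeByte(linkType);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    try (DataInputStream in =
        new DataInputStream(new ByteArrayInputStream(buf.toByteArray()))) {
      return in.readByte();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("as int:  " + sizeAsInt(OUTLINK) + " bytes");
    System.out.println("as byte: " + sizeAsByte(OUTLINK) + " byte");
    System.out.println("round trip ok: " + (roundTrip(OUTLINK) == OUTLINK));
  }
}
```

The same substitution would apply inside a Writable's write()/readFields() pair, as long as both sides switch to the byte form together.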