[ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605102#action_12605102 ]
Dennis Kubes commented on NUTCH-635:
------------------------------------

Andrzej Bialecki wrote:

* In OutlinksDb.reduce() you use a simple assignment mostRecent = next. This doesn't work as expected, because the Hadoop iterator reuses the same single instance of Outlinks under the hood, so if you keep a reference to it, its value will mysteriously change under your feet as you call values.next(). This should be replaced with a deep copy (or clone) of the instance, either through a dedicated method of Outlinks or WritableUtils.copy().

Fixed this, thanks. I knew this happened for Writables but wasn't aware that the iterators were implemented the same way.

* You should avoid spurious whitespace changes to existing classes; this makes the reading more difficult (e.g. Outlink.java).

That was a mistake; fixed it.

* In Outlinks.write() I think there's a bug - you write out System.currentTimeMillis() instead of this.timestamp. Is this intentional?

Nope, that was a bug from an earlier version. Fixed.

* In LinkAnalysis.Counter.map(), since you output static values, you should avoid creating new instances and use a pair of static instances.

* By the way, in an implementation of a similar algorithm I used Hadoop Counters to count the totals; this way you avoid storing magic numbers in the db itself (although you still need to preserve them somewhere, so I'd create an additional file with this value ... well, perhaps not so elegant either after all).

This is really just a temp file. I count the urls, write the count to a file using a single reduce task, read it back in the update method of LinkAnalysis, and pass it into the jobs through the conf. Once it is read I delete the file.

* LinkAnalysis.Analyzer.reduce() - you should retrieve config parameters in configure(JobConf), otherwise you pay the price of getting floats from Configuration (which involves repeated parsing via Float.parseFloat()). Also, the HashPartitioner should be created once.
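The iterator-reuse pitfall above can be demonstrated without Hadoop itself. The sketch below is a plain-Java simulation: ReusingIterator and Value are hypothetical stand-ins for Hadoop's reduce-side value iterator and a Writable such as Outlinks, and Value.copy() plays the role of a dedicated copy method or WritableUtils.copy()/clone(). It shows why mostRecent = next silently tracks the last value deserialized instead of the maximum.

```java
import java.util.Iterator;

public class ReuseDemo {
    // Mutable value type standing in for a Hadoop Writable such as Outlinks.
    static class Value {
        long timestamp;
        // Deep copy, analogous to a dedicated copy method on Outlinks
        // or to WritableUtils.copy()/clone().
        Value copy() {
            Value v = new Value();
            v.timestamp = this.timestamp;
            return v;
        }
    }

    // Simulates Hadoop's reduce-side value iterator: a single shared
    // instance is refilled on every next() call.
    static class ReusingIterator implements Iterator<Value> {
        private final long[] stamps;
        private int i = 0;
        private final Value shared = new Value();
        ReusingIterator(long[] stamps) { this.stamps = stamps; }
        public boolean hasNext() { return i < stamps.length; }
        public Value next() { shared.timestamp = stamps[i++]; return shared; }
    }

    // Buggy: mostRecent = next keeps a reference to the shared instance,
    // so it ends up holding whatever value was deserialized last.
    static long mostRecentBuggy(long[] stamps) {
        Value mostRecent = null;
        Iterator<Value> values = new ReusingIterator(stamps);
        while (values.hasNext()) {
            Value next = values.next();
            if (mostRecent == null || next.timestamp > mostRecent.timestamp) {
                mostRecent = next;   // same object every time!
            }
        }
        return mostRecent.timestamp;
    }

    // Fixed: deep-copy before holding a reference across next() calls.
    static long mostRecentFixed(long[] stamps) {
        Value mostRecent = null;
        Iterator<Value> values = new ReusingIterator(stamps);
        while (values.hasNext()) {
            Value next = values.next();
            if (mostRecent == null || next.timestamp > mostRecent.timestamp) {
                mostRecent = next.copy();
            }
        }
        return mostRecent.timestamp;
    }

    public static void main(String[] args) {
        long[] stamps = {100L, 300L, 200L};
        System.out.println(mostRecentBuggy(stamps)); // 200 - wrong, last value wins
        System.out.println(mostRecentFixed(stamps)); // 300 - the true most recent
    }
}
```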
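The "pair of static instances" advice for Counter.map() relies on the fact that the output collector serializes the value at the moment it is emitted, which makes it safe to reuse one mutable instance across calls. A plain-Java sketch of the idiom, with LongValue and Collector as hypothetical stand-ins for Hadoop's LongWritable and OutputCollector:

```java
import java.util.ArrayList;
import java.util.List;

public class StaticOutputDemo {
    // Hypothetical stand-in for Hadoop's LongWritable: mutable holder.
    static class LongValue {
        long value;
        LongValue(long v) { value = v; }
    }

    // Stand-in for OutputCollector: copies the value out immediately on
    // collect(), which is exactly what makes instance reuse safe.
    static class Collector {
        final List<Long> emitted = new ArrayList<>();
        void collect(LongValue v) { emitted.add(v.value); }
    }

    // One shared instance, analogous to a static ONE = new LongWritable(1)
    // that Counter.map() could emit instead of allocating per record.
    private static final LongValue ONE = new LongValue(1L);

    static void map(String url, Collector out) {
        out.collect(ONE);  // no per-record allocation
    }

    public static void main(String[] args) {
        Collector out = new Collector();
        map("http://a.example/", out);
        map("http://b.example/", out);
        System.out.println(out.emitted); // [1, 1]
    }
}
```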
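The configure() point boils down to "parse once, reuse on every reduce call". Below is a minimal plain-Java sketch of that pattern; Conf is a hypothetical stand-in for Hadoop's Configuration (whose getFloat() re-parses the stored string on each call), the property name "link.analyze.damping.factor" is invented for illustration, and reduce() applies the usual PageRank-style damping formula (1 - d) + d * score:

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigureDemo {
    // Hypothetical stand-in for Hadoop's Configuration: getFloat()
    // re-parses the stored string on every call.
    static class Conf {
        private final Map<String, String> props = new HashMap<>();
        void set(String k, String v) { props.put(k, v); }
        float getFloat(String k, float dflt) {
            String s = props.get(k);
            return s == null ? dflt : Float.parseFloat(s);
        }
    }

    static class Analyzer {
        private float dampingFactor;  // parsed once, cached as a field

        // Analogous to Reducer.configure(JobConf) in the old Hadoop API:
        // called once per task, before any reduce() calls.
        void configure(Conf conf) {
            dampingFactor = conf.getFloat("link.analyze.damping.factor", 0.85f);
        }

        // reduce() reads the cached field instead of paying the
        // Float.parseFloat() cost once per key.
        float reduce(float incomingScore) {
            return (1.0f - dampingFactor) + dampingFactor * incomingScore;
        }
    }

    public static void main(String[] args) {
        Conf conf = new Conf();
        conf.set("link.analyze.damping.factor", "0.85");
        Analyzer a = new Analyzer();
        a.configure(conf);
        // ~1.0: a node with incoming score 1.0 keeps score 1.0 under damping
        System.out.println(a.reduce(1.0f));
    }
}
```

The same reasoning applies to the HashPartitioner: construct it once in configure() (or as a field) rather than once per reduce() invocation.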
Well, this is a general comment on the patch - it creates a lot of objects unnecessarily. We can optimize it now or later, whatever you prefer.

I think a bit of both; I fixed the HashPartitioner one. My intention with this first version is to get a workable tool that converges the score and provides workarounds for the common types of link spam, such as reciprocal links and link farms / tightly knit communities. Once it is working we can always optimize the speed later. That being said, the current version is faster than I thought it would be. The current patch does converge, and it handles reciprocal links and some cases of link farms, but it is currently being over-influenced by link loops of three or more sites. Once I have that taken care of I will post a new patch.

> LinkAnalysis Tool for Nutch
> ---------------------------
>
>                 Key: NUTCH-635
>                 URL: https://issues.apache.org/jira/browse/NUTCH-635
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>        Environment: All
>           Reporter: Dennis Kubes
>           Assignee: Dennis Kubes
>            Fix For: 1.0.0
>
>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch
>
>
> This is a basic PageRank-type link analysis tool for Nutch which simulates a sparse matrix using inlinks and outlinks and converges after a given number of iterations. This tool is meant to replace the current scoring system in Nutch with a system that converges instead of exponentially increasing scores. It also includes a tool to create an outlinkdb.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.