[ 
https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605102#action_12605102
 ] 

Dennis Kubes commented on NUTCH-635:
------------------------------------

Andrzej Bialecki Wrote:

    *  in OutlinksDb.reduce() you use a simple assignment mostRecent = next. 
This doesn't work as expected, because the Hadoop iterator reuses the same 
single instance of Outlinks under the hood, so if you keep a reference to it, 
its value will mysteriously change under your feet as you call values.next(). 
This should be replaced with a deep copy (or clone) of the instance, either 
through a dedicated method of Outlinks or WritableUtils.copy().

Fixed this.  Thanks.  I knew it happened for writables but wasn't aware that it 
was implemented the same way in the iterators.
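
For anyone following along, here is a minimal sketch of the deep-copy pattern 
in an old-API reducer.  "Outlinks" and its getTimestamp() accessor are 
placeholders standing in for the patch's writable, not the actual code, and 
WritableUtils.clone() is just one way to make the copy; a dedicated copy 
method on the writable works as well.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch only: "Outlinks" and getTimestamp() are placeholders for the
// patch's writable, not the actual class.
public class OutlinksDbReducerSketch extends MapReduceBase
  implements Reducer<Text, Outlinks, Text, Outlinks> {

  private JobConf conf;

  public void configure(JobConf job) {
    this.conf = job;
  }

  public void reduce(Text key, Iterator<Outlinks> values,
      OutputCollector<Text, Outlinks> output, Reporter reporter)
      throws IOException {

    Outlinks mostRecent = null;
    while (values.hasNext()) {
      Outlinks next = values.next();
      // The iterator hands back the same Outlinks instance on every call to
      // next(), so "mostRecent = next" would change underneath us later.
      // Keep an independent deep copy instead.
      if (mostRecent == null
          || next.getTimestamp() > mostRecent.getTimestamp()) {
        mostRecent = WritableUtils.clone(next, conf);
      }
    }
    if (mostRecent != null) {
      output.collect(key, mostRecent);
    }
  }
}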

    * you should avoid spurious whitespace changes to existing classes, this 
makes the reading more difficult ... (e.g. Outlink.java)

That was a mistake, fixed it.

    * in Outlinks.write() I think there's a bug - you write out 
System.currentTimeMillis() instead of this.timestamp, is this intentional?

Nope, that was a bug from an earlier version of it.  Fixed.

    * in LinkAnalysis.Counter.map() , since you output static values, you 
should avoid creating new instances and use a pair of static instances.

    * by the way, in an implementation of similar algo I used Hadoop Counters 
to count the totals, this way you avoid storing magic numbers in the db itself 
(although you still need to preserve them somewhere, so I'd create an 
additional file with this value ... well, perhaps not so elegant either after 
all ).

This is really just a temp file.  I count the urls and write the total to a 
file using a single reduce task, then read it back in the update method of 
LinkAnalysis and pass it into the jobs through conf.  Once it is read I delete 
the file.
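
For illustration, a rough sketch of that read-the-count-and-pass-it-through-
conf step.  The filesystem handling is generic HDFS code, but the property 
name linkanalysis.num.nodes and the class name are hypothetical, not taken 
from the patch.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NodeCountLoaderSketch {

  // Reads the single url total written by the counting job's lone reduce
  // task, stores it in the Configuration for the later LinkAnalysis jobs,
  // and then removes the temp file.  The property name is hypothetical.
  public static long loadAndSetNodeCount(Configuration conf, Path countFile)
      throws IOException {
    FileSystem fs = countFile.getFileSystem(conf);
    BufferedReader reader =
      new BufferedReader(new InputStreamReader(fs.open(countFile)));
    long numNodes;
    try {
      numNodes = Long.parseLong(reader.readLine().trim());
    } finally {
      reader.close();
    }
    conf.setLong("linkanalysis.num.nodes", numNodes);
    fs.delete(countFile, false);  // temp file, no longer needed
    return numNodes;
  }
}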

    * LinkAnalysis.Analyzer.reduce() - you should retrieve config parameters in 
configure(Job), otherwise you pay the price of getting floats from 
Configuration (which involves repeated creation of Float via 
Float.parseFloat()). Also, HashPartitioner should be created once. Well, this 
is a general comment to this patch - it creates a lot of objects unnecessarily. 
We can optimize it now or later, whatever you prefer.

I think a bit of both.  I fixed the HashPartitioner one.  My intention with 
this first version is to get a workable tool that converges the score and 
provides workarounds for the common types of link spam, such as reciprocal 
links and link farms / tightly knit communities.  Once it is working we can 
always optimize the speed later.  That being said, the current version is 
faster than I thought it would be.  The current patch does converge and it 
handles reciprocal links and some cases of link farms, but it is currently 
being over-influenced by link loops of three or more sites.  Once I have that 
taken care of I will post a new patch.
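
For reference, a minimal sketch of the configure()-time caching suggested 
above: read float parameters from the JobConf once per task and create the 
partitioner once.  The parameter name, the 0.85 default, and the class name 
are made up for illustration and are not the patch's actual code.

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.lib.HashPartitioner;

// Sketch: pull float parameters out of the JobConf once in configure() and
// create the partitioner once, instead of paying for Float.parseFloat() and
// a new HashPartitioner on every reduce() call.
public class AnalyzerSketch extends MapReduceBase {

  // Cached here and reused by reduce() on every record.
  private float dampingFactor;
  private HashPartitioner<Text, FloatWritable> partitioner;

  public void configure(JobConf job) {
    // Parsed once per task here rather than once per record in reduce().
    dampingFactor = job.getFloat("linkanalysis.damping.factor", 0.85f);
    partitioner = new HashPartitioner<Text, FloatWritable>();
  }
}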


> LinkAnalysis Tool for Nutch
> ---------------------------
>
>                 Key: NUTCH-635
>                 URL: https://issues.apache.org/jira/browse/NUTCH-635
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch
>
>
> This is a basic pagerank type link analysis tool for nutch which simulates a 
> sparse matrix using inlinks and outlinks and converges after a given number 
> of iterations.  This tool is meant to replace the current scoring system in 
> nutch with a system that converges instead of exponentially increasing 
> scores.  Also includes a tool to create an outlinkdb.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
