[jira] Commented: (NUTCH-635) LinkAnalysis Tool for Nutch

Andrzej Bialecki (JIRA) Tue, 05 Aug 2008 09:15:06 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619941#action_12619941
 ]


Andrzej Bialecki  commented on NUTCH-635:
-----------------------------------------

A few comments on the latest patch:

* some crucial javadoc is missing: at the very least the class-level comments, 
especially for classes that are cmd-line utilities or that support a major 
functionality.
* perhaps we don't need a separate Node db; this information could be added 
directly to the CrawlDb, which would save us the trouble of running the 
ScoreUpdater.
* minor thing, but in many classes you use a repeating pattern of creating 
instances of List, HashSet, ObjWritable, etc, etc inside the map()/reduce() 
methods, while they should be created once and reused.
* LinkDatum:
** linkType should be byte, not int - this saves 3 bytes on each entry.
* LinkRank:
** I wonder if we couldn't skip the Counter job, and instead collect the total 
number of links via Hadoop job counters. I.e. define counters in Mapper/Reducer 
of the analysis job, and then after the job is done you can retrieve them from 
a RunningJob instance. We could then maintain this value on each update of the 
db in a well-known location, as you do this already, except we could skip this 
additional runCounter(..) job ...
* Loops:
** Loops.Route.readFields(): I think it's better to use Text.readString() 
instead of DataInput.readUTF(). Or for that matter, replace the plain Strings 
with Text, since many times in other places in Loops you need to create a Text 
object anyway, out of one of Route's fields.
* LinkUpdater:
** I don't understand why clearScore is set to 0.00001f. What's with the magic 
number?
* ReprUrlFixer should go into tools.compat
* ResolveUrls uses the ReprUrlFixer log; it should use its own. Besides, this 
tool is not relevant to this patch, so I think it should be submitted separately.
* the new indexing framework: I like the added flexibility, but the cost for 
that seems high. Previously we only had to run a single map-red job to create 
an index, now we have to run at least 6 jobs, each with a large dataset. I vote 
for splitting the patch and creating a separate issue for this framework, so 
that we can discuss it further.
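
To illustrate the LinkDatum point about linkType, here is a minimal standalone 
sketch (plain java.io, not the actual LinkDatum code from the patch) comparing 
the serialized size of a field written with writeInt() versus writeByte(). The 
class name, method names and the OUTLINK constant are hypothetical and only 
serve the illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class LinkTypeSizeDemo {

    // Hypothetical link type value; the real LinkDatum constants are not shown here.
    static final byte OUTLINK = 2;

    /** Number of bytes produced when the link type is serialized as an int. */
    static int sizeAsInt(int linkType) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(linkType); // always writes 4 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }

    /** Number of bytes produced when the link type is serialized as a byte. */
    static int sizeAsByte(byte linkType) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeByte(linkType); // writes a single byte
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }

    public static void main(String[] args) {
        int asInt = sizeAsInt(OUTLINK);
        int asByte = sizeAsByte(OUTLINK);
        System.out.println("int: " + asInt + " bytes, byte: " + asByte
            + " byte, saved per entry: " + (asInt - asByte));
    }
}
```

Since a link db holds one such entry per link, the per-entry difference adds 
up over the whole dataset.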


> LinkAnalysis Tool for Nutch
> ---------------------------
>
>                 Key: NUTCH-635
>                 URL: https://issues.apache.org/jira/browse/NUTCH-635
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, 
> NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, 
> NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch
>
>
> This is a basic pagerank type link analysis tool for nutch which simulates a 
> sparse matrix using inlinks and outlinks and converges after a given number 
> of iterations.  This tool is meant to replace the current scoring system in 
> nutch with a system that converges instead of exponentially increasing 
> scores.  Also includes a tool to create an outlinkdb.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
