Hello Joseph - interesting questions! Answers are inline.

Regards,
Markus
 
 
-----Original message-----
> From:Joseph Naegele <jnaeg...@grierforensics.com>
> Sent: Monday 22nd February 2016 16:10
> To: user@nutch.apache.org
> Subject: ScoringFilters and LinkRank interoperability
> 
> Hi everyone,
> 
>  
> 
> I have a couple questions about Nutch's LinkRank tools. The wiki docs for
> using the WebGraph/LinkRank tools appear to be useful but I have the
> following questions:
> 
>  
> 
> 1.       The docs say, like PageRank, all links start with a common score.
> Does this mean LinkRank is not affected by the results of ScoringFilters?

That depends on which methods of the ScoringFilter interface you implement. If 
you override a method so that it modifies a CrawlDatum's score, the resulting 
score will be affected. Don't do that if you want to use LinkRank.

> 
> 2.       Can I, or should I, use ScoringFilters in addition to LinkRank?
> Essentially, what happens if I do?

No, you shouldn't implement methods that affect the score. The ScoringFilter 
interface can be used for many things, but do not overwrite the score if you use 
LinkRank. LinkRank has a job that writes its scores back to the CrawlDB; that 
job is what you need to get LinkRank scores into the CrawlDB.

> 
> 3.       Can LinkRank operate only on indexed resource links and *not* other
> links (things that aren't indexed, e.g. only HTML pages but not crawled
> images)

WebGraph/LinkRank operates on CrawlDB entries and outlinks only. This means 
that your URLFilters dictate what enters the WebGraph. If your URLFilters 
exclude non-HTML items, they won't enter the WebGraph and are not scored.

> 
>  
> 
> My goal is to score non-indexed resources (e.g. binary file types) as a
> function of indexed resource scores in order to guide the crawl, where
> indexed resources are scored via LinkRank.

This is still possible, but you have to include them in your CrawlDB and 
parsed outlinks. If you do not want to crawl them, you must use different 
URLFilter settings for the generate and fetch jobs, e.g. allow non-HTML 
suffixes at the fetch stage, but prevent them from being crawled by not allowing 
them at the generate stage.

You will then still crawl non-HTML items that hide their MIME-type, e.g. 
example.org/page/download/some_item.php or other crazy crap. There is a patch 
that prevents crawling non-HTML items, but you must crawl them at least once to 
detect the actual MIME-type.
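
To make the two-profile idea concrete, here is a minimal, self-contained Java 
sketch. It does not use any Nutch classes; in a real setup you would express 
this through your URLFilter configuration. The class name StageFilters, the 
suffix list and the two predicates are assumptions made up for illustration.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class StageFilters {
    // Hypothetical list of suffixes treated as "non-HTML"; adjust to taste.
    private static final Pattern BINARY_SUFFIX =
        Pattern.compile("\\.(pdf|zip|exe|jpg|png|gif)$", Pattern.CASE_INSENSITIVE);

    // Permissive profile: lets every URL through, so binaries reach the
    // CrawlDB and the link graph and can receive a score.
    static final Predicate<String> PERMISSIVE = url -> true;

    // Strict profile for the generate stage: drops URLs whose suffix looks
    // binary, so they are never selected for fetching.
    static final Predicate<String> GENERATE = url -> !BINARY_SUFFIX.matcher(url).find();

    // Returns the URLs accepted by the given profile, in their original order.
    static List<String> accept(List<String> urls, Predicate<String> profile) {
        return urls.stream().filter(profile).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "http://example.org/index.html",
            "http://example.org/report.pdf",
            "http://example.org/page/download/some_item.php");

        System.out.println("kept in db:  " + accept(urls, PERMISSIVE));
        System.out.println("to generate: " + accept(urls, GENERATE));
    }
}
```

Note that some_item.php passes the strict profile even if it serves a binary, 
which is exactly the hidden MIME-type case described above: a suffix check 
alone cannot catch it.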

> 
>  
> 
> Thanks,
> 
> Joe
> 
> 
