RE: Two questions about Nutch

Markus Jelsma Wed, 22 Aug 2012 02:19:50 -0700

Hi weishenyun,

See inline:


Markus
 
 
-----Original message-----
> From:weishenyun <wlx198...@yahoo.com.cn>
> Sent: Wed 22-Aug-2012 11:02
> To: d...@nutch.apache.org
> Subject: Two questions about Nutch
> 
> Hi everyone here:
>       I have two questions which confused me for weeks. If anyone here can
> help me, thanks so much!
>       The first one, I know that Nutch won't store the HTTP code at all.
> Instead, it encodes it as a single status byte. If Nutch fetches a bad link
> whose HTTP status is not 200(e.g. 203 307 404 ...) or fetches a link which
> is robots denied or throttled by website because of frequently fetch. How
> can we distinguish between these conditions from that status byte(e.g.
> db_status_gone, db_redir_temp)?

Only in the fetcher can you distinguish between HTTP status codes and non-HTTP 
statuses, such as being denied by robots.txt or a problem with the robots crawl 
delay.
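
To illustrate why the distinction is lost afterwards, here is a minimal sketch 
in plain Java. The enum names, the class name and the mapping itself are 
assumptions made up for this example, not Nutch's actual classes or its exact 
rules. It only shows how several fine-grained fetch outcomes can collapse into 
one coarse status once they are written to the crawl database:

```java
import java.util.EnumMap;
import java.util.Map;

public class FetchStatusSketch {
    // Hypothetical fine-grained outcomes that are only visible while fetching.
    enum FetchOutcome { HTTP_200, HTTP_307, HTTP_404, ROBOTS_DENIED, CRAWL_DELAY_TOO_LONG }

    // Hypothetical coarse statuses kept after the fetch (think db_fetched,
    // db_redir_temp, db_gone).
    enum DbStatus { FETCHED, REDIR_TEMP, GONE }

    // Assumed mapping, for illustration only.
    static final Map<FetchOutcome, DbStatus> TO_DB = new EnumMap<>(FetchOutcome.class);
    static {
        TO_DB.put(FetchOutcome.HTTP_200, DbStatus.FETCHED);
        TO_DB.put(FetchOutcome.HTTP_307, DbStatus.REDIR_TEMP);
        TO_DB.put(FetchOutcome.HTTP_404, DbStatus.GONE);
        TO_DB.put(FetchOutcome.ROBOTS_DENIED, DbStatus.GONE);
        TO_DB.put(FetchOutcome.CRAWL_DELAY_TOO_LONG, DbStatus.GONE);
    }

    static DbStatus toDbStatus(FetchOutcome outcome) {
        return TO_DB.get(outcome);
    }

    public static void main(String[] args) {
        // Three different causes end up with the same stored status.
        for (FetchOutcome outcome : FetchOutcome.values()) {
            System.out.println(outcome + " -> " + toDbStatus(outcome));
        }
    }
}
```

Because the mapping is many-to-one, the original cause cannot be recovered 
from the stored status alone. If you need it, you have to record it in the 
fetcher yourself.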

>       Second, I know a little about Ranking & Scoring mechanism in Nutch. I
> know linkrank algorithm is the main algorithm. The linkrank algorithm is
> just a single score factor in the index system of Nutch, what is other
> factors about index and search in Nutch?

We also use LinkRank to aggregate a score per host and use that host 
score to select a master host when deduplicating hosts. The host among the 
duplicates with the highest score prevails and the others are removed.
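
A rough sketch of that idea in plain Java follows. It is not our actual code: 
the class and method names are hypothetical, and summing the per-URL scores is 
only one assumed way to aggregate them per host.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

public class HostScoreSketch {
    // Aggregate per-URL scores into one score per host (assumption: plain sum).
    static Map<String, Double> hostScores(Map<String, Double> urlScores) {
        Map<String, Double> byHost = new HashMap<>();
        for (Map.Entry<String, Double> entry : urlScores.entrySet()) {
            String host = URI.create(entry.getKey()).getHost();
            byHost.merge(host, entry.getValue(), Double::sum);
        }
        return byHost;
    }

    // Among hosts already known to be duplicates, keep the highest-scoring one.
    static String masterHost(Collection<String> duplicates, Map<String, Double> scores) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String host : duplicates) {
            double score = scores.getOrDefault(host, 0.0);
            if (score > bestScore) {
                best = host;
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up scores, for illustration only.
        Map<String, Double> urlScores = new HashMap<>();
        urlScores.put("http://www.example.com/a", 0.6);
        urlScores.put("http://www.example.com/b", 0.3);
        urlScores.put("http://example.com/", 0.5);

        Map<String, Double> scores = hostScores(urlScores);
        String master = masterHost(Arrays.asList("www.example.com", "example.com"), scores);
        System.out.println("master host: " + master);
    }
}
```

How the duplicate hosts are detected in the first place is out of scope here; 
the sketch assumes that list is already given.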

> The webgraph has not yet been
> ported to the GORA-based API in Nutch 2.0. What is the result if we index
> and search in Nutch 2.0?

You would still get decent or good search results if you configured your 
weights properly. Keep in mind that LinkRank is not meant for scoring URLs 
within a domain or host but across domains, so it is more of an internet-scale 
scoring algorithm.

We don't use LinkRank for our site search services.

> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Two-questions-about-Nutch-tp4002589.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>
