Hi weishenyun,

See inline:
Markus

-----Original message-----
> From: weishenyun <wlx198...@yahoo.com.cn>
> Sent: Wed 22-Aug-2012 11:02
> To: d...@nutch.apache.org
> Subject: Two questions about Nutch
>
> Hi everyone here:
> I have two questions which confused me for weeks. If anyone here can
> help me, thanks so much!
> The first one, I know that Nutch won't store the HTTP code at all.
> Instead, it encodes it as a single status byte. If Nutch fetches a bad link
> whose HTTP status is not 200(e.g. 203 307 404 ...) or fetches a link which
> is robots denied or throttled by website because of frequently fetch. How
> can we distinguish between these conditions from that status byte(e.g.
> db_status_gone, db_redir_temp)?

Only in the fetcher can you distinguish between HTTP status codes and non-HTTP statuses such as being denied by robots or a problem with the robots crawl delay.

> Second, I know a little about Ranking & Scoring mechanism in Nutch. I
> know linkrank algorithm is the main algorithm. The linkrank algorithm is
> just a single score factor in the index system of Nutch, what is other
> factors about index and search in Nutch?

We also use LinkRank to aggregate a score per host, and we use that host score to select a master host when deduplicating hosts. Among the duplicates, the host with the highest score prevails and the others are removed.

> The webgraph has not yet been
> ported to the GORA-based API in Nutch 2.0. What is the result if we index
> and search in Nutch 2.0?

You would still get decent or even good search results if you configure your weights properly. Keep in mind that LinkRank is not meant for scoring URLs within a domain or host but across domains, so it is more of an Internet-scale scoring algorithm. We don't use LinkRank for our site search services.

>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Two-questions-about-Nutch-tp4002589.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>
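
To illustrate the host deduplication mentioned above (aggregate a score per host, then keep the highest-scoring host among a set of duplicates), here is a minimal sketch in Python. It is not Nutch code: the function names `aggregate_host_scores` and `pick_master`, the choice to sum per-URL scores, the alphabetical tie-break, and the example URLs and scores are all assumptions made for illustration only.

```python
from urllib.parse import urlsplit


def aggregate_host_scores(url_scores):
    """Sum per-URL scores into one score per host.

    Summing is an assumption; the thread only says a score is
    aggregated per host, not how.
    """
    totals = {}
    for url, score in url_scores:
        host = urlsplit(url).hostname or ""
        totals[host] = totals.get(host, 0.0) + score
    return totals


def pick_master(duplicate_hosts, host_scores):
    """Return (master, removed) for a group of duplicate hosts.

    The host with the highest aggregated score prevails; ties are
    broken alphabetically (an assumption, to keep the result stable).
    """
    ranked = sorted(duplicate_hosts, key=lambda h: (-host_scores.get(h, 0.0), h))
    return ranked[0], ranked[1:]


# Hypothetical link scores for two hosts that serve the same content.
url_scores = [
    ("http://example.com/a", 0.5),
    ("http://example.com/b", 0.25),
    ("http://www.example.com/a", 0.25),
]
host_scores = aggregate_host_scores(url_scores)
master, removed = pick_master(["www.example.com", "example.com"], host_scores)
```

With these made-up numbers, `example.com` ends up with the higher host score and is kept as the master, while `www.example.com` is the one that would be removed.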