HTTP Header problem

Kirk Gillock Sat, 05 Dec 2009 06:29:58 -0800

Hi fellow Nutch users.

Long time crawler, first time poster. :-)

We're 23m pages into a 100m page crawl, and our preliminary tests have shown that a lot of pages contain our agent name, description, etc., in their page content. That is, sites with a script that displays HTTP headers (typically to show browser information) cause the Nutch crawler to store its own header information within the content of that page. So when we search our index for "Isara" (our agent name) we get thousands of results, and they all have "Isara/Isara-1.0 (A non-profit search engine benefiting charity.; http://www.isara.org; [email protected]", which is built from these properties in our nutch-default.xml file: http.agent.name, http.agent.description, http.agent.url, http.agent.email, and http.agent.version.

I've searched around and haven't found any information on how to stop this from happening. Is there a solution and, if so, will it mean we need to recrawl all those pages, or can we filter the current database? Any suggestions would be greatly appreciated.
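
To make the symptom concrete, here is a minimal sketch in plain Python (not Nutch code) of how the agent string seems to be assembled from the five http.agent.* values, and how a page's text could be checked for an echo of it. The function names and the email address crawler@example.org are hypothetical placeholders, and the "name/version (description; url; email)" layout is only inferred from the string we see in our results.

```python
def build_agent_string(name, version, description, url, email):
    """Compose an agent string shaped like 'name/version (description; url; email)'.

    The layout is an assumption inferred from the string quoted above,
    not taken from the Nutch source.
    """
    agent = name
    if version:
        agent += "/" + version
    # Join only the detail fields that are actually set.
    details = "; ".join(part for part in (description, url, email) if part)
    if details:
        agent += " (" + details + ")"
    return agent


def echoes_agent(page_text, agent):
    """Return True if the page text contains the crawler's own agent string."""
    return agent in page_text


# Hypothetical values; the email address is a placeholder.
agent = build_agent_string(
    "Isara",
    "Isara-1.0",
    "A non-profit search engine benefiting charity.",
    "http://www.isara.org",
    "crawler@example.org",
)
sample_page = "Your browser: " + agent + " | Your IP: ..."
print(echoes_agent(sample_page, agent))
```

A check like this would only flag affected pages; whether they can then be cleaned in place or must be recrawled is exactly what I'm asking about.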


Thank you for developing such an important open-source application,
Kirk Gillock
Isara Charity Foundation
Nong Khai, Thailand
http://www.isara.org
