Re: Extensive web crawl

2008-10-23 Thread Julien Nioche
Hi guys, I just wanted to mention that we recently worked on similar functionality. It is implemented as a custom Nutch indexing plugin and adds a field adult with true|false as values, so that such pages can be filtered out of the search results. This plugin is based on our text classification libra

Re: nutch parsetext missing for some urls

2008-10-23 Thread John Mendenhall
> Maybe you can reproduce the problem in your environment with URLs publicly
> available.
>
> What is the mime type for the documents without titles?

Mime type is text/html. We found the cause of our problem. There was a meta robots element with content of noindex on the pages without titles.
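
For anyone hitting the same symptom, here is a minimal sketch of how one might check a saved page for a meta robots noindex directive before suspecting the parser. It uses only the Python standard library and is not Nutch code; the names RobotsMetaParser and has_noindex are made up for this illustration, and treating "none" as equivalent to noindex is an assumption based on the usual robots meta conventions.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags.

    Hypothetical helper for illustration; not part of Nutch.
    """

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for directive in attr_map.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noindex(html_text):
    """Return True if the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    parser.close()
    # Assumption: "none" is shorthand for "noindex, nofollow".
    return "noindex" in parser.directives or "none" in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
    print(has_noindex(page))
```

Running has_noindex over the fetched HTML of the pages that come back without titles or parse text would show whether they share this directive.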

Crawl and Merge questions

2008-10-23 Thread Alex Basa
Does anyone know which crawl output directories are required after a successful crawl? Are crawldb, indexes, index, linkdb and segments all required for a successful merge? I'm crawling on 5 servers and writing to the SAN. Everything goes fast and fine (up to several million documents). My p

Fwd: Newbie question: How do I build nutch with eclipse?

2008-10-23 Thread [EMAIL PROTECTED]
I am able to build nutch with ant in Eclipse now. And I can run bin/nutch from the Cygwin command line and see the changes. Thanks guys! However, when I tried to run/debug the code within Eclipse, the changes I had made to the code were *ignored*. What I did was put "org.apache.nutch.segment.SegmentReade

Re: nutch parsetext missing for some urls

2008-10-23 Thread Alexander Aristov
Maybe you can reproduce the problem in your environment with URLs publicly available. What is the mime type for the documents without titles?

Alexander

2008/10/21 John Mendenhall <[EMAIL PROTECTED]>
>
> > Can u post some of the urls for which parse text is missing.
>
> I am unable to post the