Hi guys,
I just wanted to mention that I recently worked on similar functionality. It
is implemented as a custom Nutch indexing plugin and adds a field, adult, with
true|false as values so that such documents can be filtered out of the search
results. This plugin is based on our text classification libra
> Maybe you can reproduce the problem in your environment with URLs publicly
> available.
>
> What is the mime type for the documents without titles?
Mime type is text/html.
We found the cause of our problem. The pages without titles had a meta robots
element with its content set to noindex.
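
For anyone hitting the same symptom, here is a rough way to check a saved page
for that directive outside of Nutch. It is only a sketch in plain Python using
the standard library, not Nutch code; the names RobotsMetaParser and
has_noindex are made up for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def has_noindex(html):
    """Return True if the page carries a meta robots noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
    print(has_noindex(page))
```

Running it over the raw HTML of a page that comes back without a title should
show quickly whether the page itself is asking not to be indexed.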
Does anyone know which crawl output directories are required after a successful
crawl? Are crawldb, indexes, index, linkdb and segments all required for a
successful merge?
I'm crawling on 5 servers and writing to the SAN. Everything goes fast and
fine (up to several million documents). My p
I am able to build nutch with ant in Eclipse now. And I can run bin/nutch
from Cygwin command line and see the changes. Thanks guys!
Yet when I tried to run/debug the code within Eclipse, the changes I had made
to the code were *ignored*.
What I did was put "org.apache.nutch.segment.SegmentReade
Maybe you can reproduce the problem in your environment with URLs publicly
available.
What is the mime type for the documents without titles?
Alexander
2008/10/21 John Mendenhall <[EMAIL PROTECTED]>
>
> > Can you post some of the URLs for which parse text is missing?
>
> I am unable to post the