I poked around in the code for a while and added some logging of my own,
which wound up leading me here:
https://issues.apache.org/jira/browse/NUTCH-443

While I could apply a patch, I'm reluctant to get too far from
the "standard" version of Nutch.  Can anybody comment on when the next
release can be expected?

Thanks,
C


On 7/24/07, charlie w <[EMAIL PROTECTED]> wrote:


I'm seeing a problem where pages are fetched, but are not indexed.  I've
pared the crawl down to a very small example using the plain Nutch crawl
tool.  It fails consistently with the same url (among others):
http://new.marketwire.com/2.0/rel.jsp?id=710360.  The url redirects, so a
-depth option is required for the nutch command, and I have modified the
crawl-urlfilter.txt file to allow fetching this file.  It is definitely
fetched, as is the page to which Nutch is redirected.

I've used "nutch readseg -dump ...", and it sure looks to me like the proper
document was fetched.  The content is there, the parsed content is there,
and so on.  The crawl datum looks OK too (to my naive eye).

What is going on here?  Is there any further debugging I can turn on to
try to track this down?

Thanks,
C

_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
