Hi!

We are using Nutch to crawl a bunch of websites and index them to Solr. At
the moment we are in the process of upgrading from Nutch 1.1 to Nutch 1.3
and at the same time moving from one server to two.

Unfortunately, we are stuck on a problem that we never saw in the old
environment. Several of the pages that we fetch contain no content
when they are stored in the segment. The following is an excerpt from
"readseg" on a segment containing such a page:

----

Recno:: 5
URL:: http://www.uu.se/news/news_item.php?typ=pm&id=1381

Content::
Version: -1
url: http://www.uu.se/news/news_item.php?typ=pm&id=1381
base: http://www.uu.se/news/news_item.php?typ=pm&id=1381
contentType: text/html
metadata: Date=Fri, 15 Jul 2011 09:02:38 GMT Content-Length=7195
nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20110715110049
Connection=close Content-Type=text/html Server=Apache
Content:

----
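
Note that the headers report Content-Length=7195, yet nothing follows
"Content:". To find out how many pages are affected, a text dump of the
segment could be scanned for records like this one. Below is a minimal
Python sketch, assuming the dump layout matches the excerpt above. The
function name find_empty_records is hypothetical and not part of Nutch.

```python
import re


def find_empty_records(dump_text):
    """Return (url, declared_length) pairs for records whose "Content:" body
    is empty although the Content-Length header reports a positive size."""
    suspects = []
    # Assumption: every record starts with a "Recno:: N" line, as in the excerpt.
    for record in re.split(r"(?m)^Recno:: \d+[ \t]*$", dump_text)[1:]:
        url = re.search(r"(?m)^URL:: (\S+)", record)
        length = re.search(r"Content-Length=(\d+)", record)
        # Assumption: the body follows a line that is exactly "Content:" and
        # runs until the next "Name::" section header or the end of the record.
        body = re.search(r"(?ms)^Content:[ \t]*\n(.*?)(?=^\w+::|\Z)", record)
        if not (url and length and body):
            continue  # record lacks one of the fields we rely on
        declared = int(length.group(1))
        if declared > 0 and not body.group(1).strip():
            suspects.append((url.group(1), declared))
    return suspects
```

It would be called on the text of a dump file, for example
find_empty_records(open("dump").read()), where "dump" is a placeholder
path. A body containing a line that itself starts with "Name::" would be
cut short, so this is only a rough filter.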

The fetch logs say nothing unusual about retrieving this page:

2011-07-15 11:02:37,500 INFO org.apache.nutch.fetcher.Fetcher: fetching http://www.uu.se/news/news_item.php?typ=pm&id=1381

There seems to be nothing strange about the page itself and a very similar
page (http://www.uu.se/news/news_item.php?id=1421&typ=pm) is crawled and
indexed without any problems.

Does anyone have any idea what might be wrong here?


Best regards,
--Anders Rask
www.findwise.com
