Hi! We are using Nutch to crawl a number of websites and index them into Solr. We are currently upgrading from Nutch 1.1 to Nutch 1.3 and, at the same time, moving from one server to two.
Unfortunately, we are stuck on a problem that we did not see in the old environment. Several of the pages we fetch contain no content when they are stored in the segment. The following is an excerpt from "readseg" on a segment containing such a page:

----
Recno:: 5
URL:: http://www.uu.se/news/news_item.php?typ=pm&id=1381

Content::
Version: -1
url: http://www.uu.se/news/news_item.php?typ=pm&id=1381
base: http://www.uu.se/news/news_item.php?typ=pm&id=1381
contentType: text/html
metadata: Date=Fri, 15 Jul 2011 09:02:38 GMT Content-Length=7195 nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20110715110049 Connection=close Content-Type=text/html Server=Apache
Content:
----

The fetch logs show nothing unusual about retrieving this page:

2011-07-15 11:02:37,500 INFO org.apache.nutch.fetcher.Fetcher: fetching http://www.uu.se/news/news_item.php?typ=pm&id=1381

There seems to be nothing strange about the page itself, and a very similar page (http://www.uu.se/news/news_item.php?id=1421&typ=pm) is crawled and indexed without any problems.

Does anyone have any idea what might be wrong here?

Best regards,
--Anders Rask
www.findwise.com

