Which parser are you using? What does bin/nutch org.apache.nutch.parse.ParserChecker report for that URL? On my side it outputs the content fine with parse-tika enabled.
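A minimal sketch of that check, assuming it is run from the Nutch installation directory and that the -dumpText option is available in your version (it should also print the extracted text). The URL is the one from the report below; the quoting matters because an unquoted '&' would be interpreted by the shell.

```shell
# Assumption: the current directory is the Nutch install root, so bin/nutch exists.
# Single quotes keep the shell from treating '&' as a background operator.
URL='http://www.uu.se/news/news_item.php?typ=pm&id=1381'

if [ -x bin/nutch ]; then
  # Fetch and parse the single page, then print the parse result and extracted text.
  bin/nutch org.apache.nutch.parse.ParserChecker -dumpText "$URL"
else
  echo "bin/nutch not found; change to the Nutch installation directory first" >&2
fi
```

If the parse text shows up here but the segment content is still empty, the problem is more likely in the fetch or configuration on the new servers than in the parser itself.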

On Friday 15 July 2011 15:04:55 Anders Rask wrote:
> Hi!
>
> We are using Nutch to crawl a bunch of websites and index them to Solr. At
> the moment we are in the process of upgrading from Nutch 1.1 to Nutch 1.3
> and in the same time going from one server to two servers.
>
> Unfortunately we are stuck with a problem which we haven't seen in the old
> environment. Several of the pages that we are fetching contain no content
> when they are stored in the segment. The following is an excerpt from
> "readseg" on a segment containing such a page:
>
> ----
>
> Recno:: 5
> URL:: http://www.uu.se/news/news_item.php?typ=pm&id=1381
>
> Content::
> Version: -1
> url: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> base: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> contentType: text/html
> metadata: Date=Fri, 15 Jul 2011 09:02:38 GMT Content-Length=7195
> nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20110715110049
> Connection=close Content-Type=text/html Server=Apache
> Content:
>
> ----
>
> The fetch logs say nothing unusual about retrieving this page:
> 2011-07-15 11:02:37,500 INFO org.apache.nutch.fetcher.Fetcher: fetching
> http://www.uu.se/news/news_item.php?typ=pm&id=1381
>
> There seems to be nothing strange about the page itself and a very similar
> page (http://www.uu.se/news/news_item.php?id=1421&typ=pm) is crawled and
> indexed without any problems.
>
> Anyone have any ideas about what might be wrong here?
>
>
> Best regards,
> --Anders Rask
> www.findwise.com

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

