Judging from the segment, those URLs are fetched and parsed. I think some HTML parse APIs may have changed between your 1.1 and 1.3 versions. If ParserChecker shows the same issue, then it's most likely a parse plugin problem in the new version. Can you check?
> Hi,
>
> If you have a look at your regex-urlfilter.txt it will by default be
> rejecting ? in the URL. Please test with that line edited (or commented
> out) and see if the problem fades.
>
> On Mon, Jul 18, 2011 at 10:11 AM, Anders Rask <[email protected]> wrote:
> > Hi Markus!
> >
> > We are using a custom parser, but I don't think that the problem is in
> > the parsing. I got the same problem when trying the ParserChecker. I
> > also tried the following:
> >
> > I injected the following seeds:
> >
> > http://www.uu.se/news/news_item.php?id=1423&typ=pm
> > http://www.uu.se/news/news_item.php?id=1421&typ=pm
> > http://www.uu.se/news/news_item.php?id=1489&typ=artikel
> > http://www.uu.se/news/news_item.php?id=1407&typ=pm
> > http://www.uu.se/news/news_item.php?id=1234&typ=artikel
> > http://www.uu.se/news/news_item.php?id=1233&typ=artikel
> > http://www.uu.se/news/news_item.php?id=1180&typ=artikel
> > http://www.uu.se/news/news_item.php?typ=pm&id=1381
> > http://www.uu.se/
> >
> > Then generated a segment, fetched that segment and then did a readseg
> > with -noparse, -noparsedata and -noparsetext.
> >
> > I have attached the readseg dump and it shows no content for:
> > http://www.uu.se/news/news_item.php?typ=pm&id=1381
> >
> > Can the problem somehow be in the configurations for the fetcher?
> >
> >
> > Best regards,
> > --Anders Rask
> > www.findwise.com
> >
> >
> > 2011/7/15 Markus Jelsma <[email protected]>
> >
> >> What parser are you using? What does bin/nutch
> >> org.apache.nutch.parse.ParserChecker say? Here it outputs the content
> >> fine with parse-tika enabled.
> >>
> >> On Friday 15 July 2011 15:04:55 Anders Rask wrote:
> >> > Hi!
> >> >
> >> > We are using Nutch to crawl a bunch of websites and index them to
> >> > Solr. At the moment we are in the process of upgrading from Nutch
> >> > 1.1 to Nutch 1.3 and at the same time going from one server to two
> >> > servers.
> >> >
> >> > Unfortunately we are stuck with a problem which we haven't seen in
> >> > the old environment. Several of the pages that we are fetching
> >> > contain no content when they are stored in the segment. The
> >> > following is an excerpt from "readseg" on a segment containing such
> >> > a page:
> >> >
> >> > ----
> >> >
> >> > Recno:: 5
> >> > URL:: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> >> >
> >> > Content::
> >> > Version: -1
> >> > url: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> >> > base: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> >> > contentType: text/html
> >> > metadata: Date=Fri, 15 Jul 2011 09:02:38 GMT Content-Length=7195
> >> > nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20110715110049
> >> > Connection=close Content-Type=text/html Server=Apache
> >> > Content:
> >> >
> >> > ----
> >> >
> >> > The fetch logs say nothing unusual about retrieving this page:
> >> > 2011-07-15 11:02:37,500 INFO org.apache.nutch.fetcher.Fetcher:
> >> > fetching http://www.uu.se/news/news_item.php?typ=pm&id=1381
> >> >
> >> > There seems to be nothing strange about the page itself and a very
> >> > similar page (http://www.uu.se/news/news_item.php?id=1421&typ=pm) is
> >> > crawled and indexed without any problems.
> >> >
> >> > Anyone have any ideas about what might be wrong here?
> >> >
> >> >
> >> > Best regards,
> >> > --Anders Rask
> >> > www.findwise.com
> >>
> >> --
> >> Markus Jelsma - CTO - Openindex
> >> http://www.linkedin.com/in/markus17
> >> 050-8536620 / 06-50258350
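
For anyone who wants to test the regex-urlfilter.txt suggestion quoted in this thread without running a crawl, here is a rough Python sketch of how a first-match-wins list of +/- regex rules would treat the seed URLs. The rule lists are assumptions modelled on the kind of lines found in a stock regex-urlfilter.txt (a "-[?*!@=]" line followed by a catch-all "+."), so check them against your own file. The names compile_rules and accepts are hypothetical, and this is not Nutch's own code.

```python
import re

# ASSUMPTION: rule lines modelled on a stock regex-urlfilter.txt.
# A leading '-' rejects matching URLs, a leading '+' accepts them.
DEFAULT_RULES = ["# skip URLs containing certain characters", "-[?*!@=]", "+."]
# The same file with the '-[?*!@=]' line commented out.
EDITED_RULES = ["# -[?*!@=]", "+."]


def compile_rules(lines):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First rule whose pattern is found anywhere in the URL decides the result."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched: treat the URL as rejected


if __name__ == "__main__":
    seeds = [
        "http://www.uu.se/news/news_item.php?typ=pm&id=1381",
        "http://www.uu.se/news/news_item.php?id=1421&typ=pm",
        "http://www.uu.se/",
    ]
    for name, lines in (("default", DEFAULT_RULES), ("edited", EDITED_RULES)):
        rules = compile_rules(lines)
        for url in seeds:
            print(name, "accept" if accepts(rules, url) else "reject", url)
```

Note that under the assumed default rules both news_item.php URLs are treated the same way, so a filter of this kind would not by itself explain why one of two similar pages ends up without content.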

