As pointed out by Markus, the logs show that the content has been properly fetched. Moreover,

    ./nutch org.apache.nutch.parse.ParserChecker 'http://www.uu.se/news/news_item.php?typ=pm&id=1381'

works fine. Double-check your custom parser; it is likely to be the source of the problem.

BTW: what does your custom parser do? Is it an HtmlParseFilter? If so, which parser are you using for HTML: parse-html or parse-tika?

Julien

On 18 July 2011 10:46, Markus Jelsma wrote:

> Judging from the segment, those URLs are fetched and parsed. I think maybe
> some HTML parse APIs have changed between your 1.1 and 1.2 versions. If
> ParserChecker shows the same issue, then it's most likely a parse plugin
> problem for the new version. Can you check?
>
> > Hi,
> >
> > If you have a look at your regex-urlfilter.txt, it will by default be
> > rejecting ? in the URL. Please test with that line edited (or commented
> > out) and see if the problem fades.
> >
> > On Mon, Jul 18, 2011 at 10:11 AM, Anders Rask wrote:
> > > Hi Markus!
> > >
> > > We are using a custom parser, but I don't think that the problem is in
> > > the parsing. I got the same problem when trying the ParserChecker. I
> > > also tried the following:
> > >
> > > I injected the following seeds:
> > >
> > > http://www.uu.se/news/news_item.php?id=1423&typ=pm
> > > http://www.uu.se/news/news_item.php?id=1421&typ=pm
> > > http://www.uu.se/news/news_item.php?id=1489&typ=artikel
> > > http://www.uu.se/news/news_item.php?id=1407&typ=pm
> > > http://www.uu.se/news/news_item.php?id=1234&typ=artikel
> > > http://www.uu.se/news/news_item.php?id=1233&typ=artikel
> > > http://www.uu.se/news/news_item.php?id=1180&typ=artikel
> > > http://www.uu.se/news/news_item.php?typ=pm&id=1381
> > > http://www.uu.se/
> > >
> > > Then generated a segment, fetched that segment and then did a readseg
> > > with -noparse, -noparsedata and -noparsetext.
> > >
> > > I have attached the readseg dump and it shows no content for:
> > > http://www.uu.se/news/news_item.php?typ=pm&id=1381
> > >
> > > Can the problem somehow be in the configurations for the fetcher?
> > >
> > >
> > > Best regards,
> > > --Anders Rask
> > > www.findwise.com
> > >
> > >
> > > 2011/7/15 Markus Jelsma
> > >
> > >> What parser are you using? What does bin/nutch
> > >> org.apache.nutch.parse.ParserChecker say? Here it outputs the content
> > >> fine with parse-tika enabled.
> > >>
> > >> On Friday 15 July 2011 15:04:55 Anders Rask wrote:
> > >> > Hi!
> > >> >
> > >> > We are using Nutch to crawl a bunch of websites and index them to
> > >> > Solr. At the moment we are in the process of upgrading from Nutch
> > >> > 1.1 to Nutch 1.3 and at the same time going from one server to two
> > >> > servers.
> > >> >
> > >> > Unfortunately we are stuck with a problem which we haven't seen in
> > >> > the old environment. Several of the pages that we are fetching
> > >> > contain no content when they are stored in the segment. The
> > >> > following is an excerpt from "readseg" on a segment containing such
> > >> > a page:
> > >> >
> > >> > ----
> > >> >
> > >> > Recno:: 5
> > >> > URL:: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> > >> >
> > >> > Content::
> > >> > Version: -1
> > >> > url: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> > >> > base: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> > >> > contentType: text/html
> > >> > metadata: Date=Fri, 15 Jul 2011 09:02:38 GMT Content-Length=7195
> > >> > nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20110715110049
> > >> > Connection=close Content-Type=text/html Server=Apache
> > >> > Content:
> > >> >
> > >> > ----
> > >> >
> > >> > The fetch logs say nothing unusual about retrieving this page:
> > >> > 2011-07-15 11:02:37,500 INFO org.apache.nutch.fetcher.Fetcher:
> > >> > fetching http://www.uu.se/news/news_item.php?typ=pm&id=1381
> > >> >
> > >> > There seems to be nothing strange about the page itself and a very
> > >> > similar page (http://www.uu.se/news/news_item.php?id=1421&typ=pm) is
> > >> > crawled and indexed without any problems.
> > >> >
> > >> > Anyone have any ideas about what might be wrong here?
> > >> >
> > >> >
> > >> > Best regards,
> > >> > --Anders Rask
> > >> > www.findwise.com
> > >>
> > >> --
> > >> Markus Jelsma - CTO - Openindex
> > >> http://www.linkedin.com/in/markus17
> > >> 050-8536620 / 06-50258350

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
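
The regex-urlfilter.txt suggestion quoted in this thread can be checked without running a crawl. The following is a minimal Python sketch, not Nutch code: it assumes the filter evaluates rules top to bottom, that the first matching rule decides, and that a URL matching no rule is rejected. The rule list is a simplified stand-in for a stock regex-urlfilter.txt (your file may differ), and `url_accepted` is a hypothetical helper name.

```python
import re

# Simplified rules modeled on a stock regex-urlfilter.txt (assumption: your
# actual file may contain different or additional rules).
# Each rule is (sign, pattern); "-" rejects, "+" accepts, first match wins.
RULES = [
    ("-", r"^(file|ftp|mailto):"),  # skip non-HTTP schemes
    ("-", r"[?*!@=]"),              # skip probable query URLs
    ("+", r"."),                    # accept anything else
]


def url_accepted(url, rules=RULES):
    """Return True if the first rule that matches the URL is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: treat as rejected


url = "http://www.uu.se/news/news_item.php?typ=pm&id=1381"

# With the query-character rule in place, the URL is rejected.
print(url_accepted(url))  # False

# With that rule removed (the equivalent of commenting it out), it passes.
relaxed = [rule for rule in RULES if rule[1] != r"[?*!@=]"]
print(url_accepted(url, relaxed))  # True
```

Note that in this thread the other seeds containing ? were fetched and the fetch log lists the affected URL, so the filter is only one hypothesis to rule out; the sketch does not explain the empty content by itself.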

