Re: Fetched pages has no content

Julien Nioche Mon, 18 Jul 2011 02:53:46 -0700

As pointed out by Markus the logs show that the content has been properly
fetched. Moreover



> ./nutch org.apache.nutch.parse.ParserChecker 'http://www.uu.se/news/news_item.php?typ=pm&id=1381'


works fine. Double-check your custom parser; it is likely to be the source
of the problem.

BTW, what does your custom parser do? Is it an HtmlParseFilter? If so, which
parser are you using for HTML: parse-html or parse-tika?

Julien



On 18 July 2011 10:46, Markus Jelsma <[email protected]> wrote:

> Judging from the segment, those URLs are fetched and parsed. I think maybe
> some HTML parse APIs have changed between your 1.1 and 1.2 versions. If
> ParserChecker shows the same issue then it's most likely a parse plugin
> problem for the new version. Can you check?
>
> > Hi,
> >
> > If you have a look at your regex-urlfilter.txt, it will by default be
> > rejecting ? in the URL. Please test with that line edited (or commented
> > out) and see if the problem fades.
> >
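
One way to try this suggestion without re-crawling is to replay the rules
against the seed URLs offline. The following is only a rough Python sketch of
how such a filter file is usually read: the first rule whose pattern is found
in the URL decides, and a URL matching no rule is rejected. The two rules in
RULES_TEXT are an assumed excerpt of a stock regex-urlfilter.txt, not a copy
of the file in question, and Python's regex dialect is not identical to
Java's, so treat the result as a hint only.

```python
import re

# Assumption: excerpt of a stock conf/regex-urlfilter.txt; check your own copy.
RULES_TEXT = """\
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept anything else
+.
"""


def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The first character is the sign, the rest is the pattern.
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(url, rules):
    """The first rule whose pattern occurs in the URL decides; no match rejects."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


rules = load_rules(RULES_TEXT)
for url in (
    "http://www.uu.se/",
    "http://www.uu.se/news/news_item.php?typ=pm&id=1381",
):
    print("accept" if accepts(url, rules) else "reject", url)
```

With these two assumed rules the query-string URL is rejected and the plain
home page is accepted. If your own file still lets the news_item.php URLs
through, the URL filter is probably not the culprit.
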
> > On Mon, Jul 18, 2011 at 10:11 AM, Anders Rask <[email protected]> wrote:
> > > Hi Markus!
> > >
> > > We are using a custom parser, but I don't think that the problem is in
> > > the parsing. I got the same problem when trying the ParserChecker. I
> > > also tried the following:
> > >
> > > I injected the following seeds:
> > >
> > > http://www.uu.se/news/news_item.php?id=1423&typ=pm
> > > http://www.uu.se/news/news_item.php?id=1421&typ=pm
> > > http://www.uu.se/news/news_item.php?id=1489&typ=artikel
> > > http://www.uu.se/news/news_item.php?id=1407&typ=pm
> > > http://www.uu.se/news/news_item.php?id=1234&typ=artikel
> > > http://www.uu.se/news/news_item.php?id=1233&typ=artikel
> > > http://www.uu.se/news/news_item.php?id=1180&typ=artikel
> > > http://www.uu.se/news/news_item.php?typ=pm&id=1381
> > > http://www.uu.se/
> > >
> > > Then generated a segment, fetched that segment and then did a readseg
> > > with -noparse, -noparsedata and -noparsetext.
> > >
> > > I have attached the readseg dump and it shows no content for:
> > > http://www.uu.se/news/news_item.php?typ=pm&id=1381
> > >
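
For a larger segment it may help to list every record with an empty body
rather than reading the dump by eye. Below is a minimal Python sketch that
assumes the text layout shown in the readseg excerpt further down this
thread: records start with a "Recno::" line, carry a "URL::" line, and put
the raw body after a line reading "Content:". The function name and the
SAMPLE text are made up for illustration; a real dump may contain further
sections, which the sketch treats as ending the body.

```python
import re

# Hypothetical two-record dump in the layout of the excerpt in this thread.
SAMPLE = """\
Recno:: 0
URL:: http://www.uu.se/

Content::
Version: -1
url: http://www.uu.se/
contentType: text/html
Content:
<html><body>placeholder body</body></html>

Recno:: 1
URL:: http://www.uu.se/news/news_item.php?typ=pm&id=1381

Content::
Version: -1
url: http://www.uu.se/news/news_item.php?typ=pm&id=1381
contentType: text/html
Content:

"""


def empty_content_urls(dump_text):
    """Return the URLs whose 'Content:' body is blank in a readseg text dump."""
    urls = []
    # Assumption: every record begins with a 'Recno:: <n>' line.
    for record in re.split(r"(?m)^Recno:: \d+[ \t]*$", dump_text)[1:]:
        url_match = re.search(r"(?m)^URL:: (\S+)", record)
        # The body runs from the 'Content:' line to the next 'Name::' section
        # header line, or to the end of the record if there is none.
        body_match = re.search(
            r"^Content:[ \t]*\n(.*?)(?=^\w+::[ \t]*$|\Z)",
            record,
            re.M | re.S,
        )
        if url_match and body_match and not body_match.group(1).strip():
            urls.append(url_match.group(1))
    return urls


print(empty_content_urls(SAMPLE))
```

To run it on a real dump, read the dump file produced by readseg into a
string and pass that to empty_content_urls instead of SAMPLE.
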
> > > Can the problem somehow be in the configurations for the fetcher?
> > >
> > >
> > > Best regards,
> > > --Anders Rask
> > > www.findwise.com
> > >
> > >
> > > 2011/7/15 Markus Jelsma <[email protected]>
> > >
> > >> What parser are you using? What does bin/nutch
> > >> org.apache.nutch.parse.ParserChecker say? Here it outputs the content
> > >> fine with parse-tika enabled.
> > >>
> > >> On Friday 15 July 2011 15:04:55 Anders Rask wrote:
> > >> > Hi!
> > >> >
> > >> > We are using Nutch to crawl a bunch of websites and index them to
> > >> > Solr. At the moment we are in the process of upgrading from Nutch
> > >> > 1.1 to Nutch 1.3 and at the same time going from one server to two
> > >> > servers.
> > >> >
> > >> > Unfortunately we are stuck with a problem which we haven't seen in
> > >> > the old environment. Several of the pages that we are fetching
> > >> > contain no content when they are stored in the segment. The
> > >> > following is an excerpt from "readseg" on a segment containing such
> > >> > a page:
> > >> >
> > >> > ----
> > >> >
> > >> > Recno:: 5
> > >> > URL:: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> > >> >
> > >> > Content::
> > >> > Version: -1
> > >> > url: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> > >> > base: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> > >> > contentType: text/html
> > >> > metadata: Date=Fri, 15 Jul 2011 09:02:38 GMT Content-Length=7195
> > >> > nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20110715110049
> > >> > Connection=close Content-Type=text/html Server=Apache
> > >> > Content:
> > >> >
> > >> > ----
> > >> >
> > >> > The fetch logs say nothing unusual about retrieving this page:
> > >> > 2011-07-15 11:02:37,500 INFO org.apache.nutch.fetcher.Fetcher:
> > >> > fetching http://www.uu.se/news/news_item.php?typ=pm&id=1381
> > >> >
> > >> > There seems to be nothing strange about the page itself and a very
> > >> > similar page (http://www.uu.se/news/news_item.php?id=1421&typ=pm) is
> > >> > crawled and indexed without any problems.
> > >> >
> > >> > Anyone have any ideas about what might be wrong here?
> > >> >
> > >> >
> > >> > Best regards,
> > >> > --Anders Rask
> > >> > www.findwise.com
> > >>
> > >> --
> > >> Markus Jelsma - CTO - Openindex
> > >> http://www.linkedin.com/in/markus17
> > >> 050-8536620 / 06-50258350
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
