Re: nutch parsetext missing for some urls

Alexander Aristov Thu, 23 Oct 2008 02:14:57 -0700

Maybe you can reproduce the problem in your environment with publicly
available URLs.


What is the MIME type of the documents without titles?
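
If it helps to narrow things down, here is a rough Python sketch for listing
the urls whose parse text is empty in a plain-text segment dump. It assumes
(this is not confirmed anywhere in this thread) that the dump has one record
per "Recno:: N" line, a "URL:: ..." line, and sections introduced by marker
lines such as "ParseText::". The function name and the marker list are
hypothetical; adjust them to whatever your dump actually contains.

```python
import re

# Section markers assumed to appear at the start of a line in the dump.
# This list is an assumption; edit it to match your actual dump output.
MARKERS = r"(?:Recno|URL|CrawlDatum|Content|ParseData|ParseText)::"


def urls_with_empty_parsetext(dump_text):
    """Return the URLs whose ParseText section is missing or blank."""
    empty = []
    # Split the dump into records; drop whatever precedes the first record.
    records = re.split(r"^Recno::.*$", dump_text, flags=re.MULTILINE)[1:]
    for record in records:
        url_match = re.search(r"^URL::\s*(\S+)", record, flags=re.MULTILINE)
        if not url_match:
            continue  # no URL line in this record, nothing to report
        # Capture everything after the ParseText:: line, up to the next
        # marker line or the end of the record.
        text_match = re.search(
            r"^ParseText::[ \t]*\n(.*?)(?=^" + MARKERS + r"|\Z)",
            record,
            flags=re.MULTILINE | re.DOTALL,
        )
        text = text_match.group(1) if text_match else ""
        if not text.strip():
            empty.append(url_match.group(1))
    return empty
```

You could feed it the text produced by dumping a segment (for example with
bin/nutch readseg -dump) and then compare the MIME types of the urls it
returns against those of the urls that do have titles. The exact dump layout
may differ between versions, so check a few records by hand first.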

Alexander

2008/10/21 John Mendenhall <[EMAIL PROTECTED]>

>
> > Can you post some of the urls for which parse text is missing?
>
> I am unable to post the actual urls.  This is a private
> project for which exact urls cannot be shared.
>
> JohnM
>
>
>
>
> > On Tue, Oct 21, 2008 at 6:44 AM, John Mendenhall <[EMAIL PROTECTED]>
> > wrote:
> >
> > > We are using nutch version nutch-2008-07-22_04-01-29.
> > > We have a crawldb with over 1 million urls.
> > >
> > > We have noticed some of the urls in search results
> > > do not have titles.  After some research comparing
> > > urls with titles and urls without titles, the urls
> > > without titles have empty parsetext.
> > >
> > > Why would some urls have empty parsetext?
> > > Is there some place I can look to determine why
> > > parsetext is missing?
> > >
> > > Is the only way to reparse those urls with empty
> > > parsetext to remove the crawl_parse directory for
> > > the corresponding segment and run the nutch parse
> > > command?
> > >
> > > Is there something I should do to guarantee all
> > > urls get a parsetext, and hopefully, a title?
> > >
> > > Thanks in advance for any assistance or pointers
> > > to other resources or ideas.
> > >
> > > JohnM
>
> --
> john mendenhall
> [EMAIL PROTECTED]
> surf utopia
> internet services
>



-- 
Best Regards
Alexander Aristov
