Doh!  Again, I missed that.  Thanks... Just wish it had a better
explanation.


On 6/1/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:

Hi,

On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote:
> Here is another example where it keeps saying it can't parse the content...
>
> SegmentReader: get 'http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir'
> Content::
> Version: 2
> url: http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir
> base: http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir
> contentType:
> metadata: nutch.segment.name=20070601050840 nutch.crawl.score=3.5455807E-5
> Content:
>
> These are the headers:
>
> HTTP/1.1 200 OK
> Date: Fri, 01 Jun 2007 15:38:15 GMT
> Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
> Window-Target: _top
> X-Highwire-SessionId: nh2ukcdpv1.JS1
> Set-Cookie: JServSessionIdroot=nh2ukcdpv1.JS1; path=/
> Transfer-Encoding: chunked
> Content-Type: text/html
>
>
>
> So, that's it..... any ideas?

In both examples Nutch wasn't able to fetch the page. When a url can't
be fetched, the fetcher creates an empty Content record for it. That's
why you can't parse them; there is nothing to parse :).

You can't fetch http://hea.sagepub.com/cgi/alerts and
http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir
because both hosts have robots.txt files that disallow access to those
urls.
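
A quick way to confirm this (assuming the files sit at the standard
/robots.txt path, which the fetcher checks before requesting a page) is
to pull them down with curl and look for Disallow rules covering those
urls, e.g.:

curl http://hea.sagepub.com/robots.txt
curl http://www.annals.org/robots.txt

If a Disallow line matches the /cgi/ paths for the user-agent your
crawler identifies itself as, the fetcher skips the url and leaves the
empty Content record you are seeing.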

>
>
>
> On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote:
> >
> >
> > So, here is one:
> >
> > http://hea.sagepub.com/cgi/alerts
> >
> > Segment Reader reports:
> >
> > Content::
> > Version: 2
> > url: http://hea.sagepub.com/cgi/alerts
> > base: http://hea.sagepub.com/cgi/alerts
> > contentType:
> > metadata: nutch.segment.name=20070601045920 nutch.crawl.score=0.041666668
> > Content:
> >
> > So, I notice when I try to crawl that url specifically, I get a job failed (ArrayIndexOutOfBoundsException: -1).
> >
> > But if I use curl like:
> >
> > curl -G http://hea.sagepub.com/cgi/alerts --dump-header header.txt
> >
> > I get content and the headers are:
> >
> > HTTP/1.1 200 OK
> > Date: Fri, 01 Jun 2007 15:03:28 GMT
> > Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
> > Cache-Control: no-store
> > X-Highwire-SessionId: xlz2cgcww1.JS1
> > Set-Cookie: JServSessionIdroot=xlz2cgcww1.JS1; path=/
> > Transfer-Encoding: chunked
> > Content-Type: text/html
> >
> > So, I'm lost.
> >
> >
> > On 6/1/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi,
> > >
> > > On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote:
> > > > So, I have been having huge problems with parsing.  It seems that many
> > > > urls are being ignored because the parser plugins throw an exception
> > > > saying there is no parser found for what is, reportedly, an
> > > > unresolved contentType.  So, if you look at the exception:
> > > >
> > > >   org.apache.nutch.parse.ParseException: parser not found for contentType= url=http://hea.sagepub.com/cgi/login?uri=%2Fpolicies%2Fterms.dtl
> > > >
> > > > You can see that it says the contentType is "".  But, if you look at
> > > > the headers for this request you can see that the Content-Type header
> > > > is set at "text/html":
> > > >
> > > > HTTP/1.1 200 OK
> > > > Date: Fri, 01 Jun 2007 13:54:19 GMT
> > > > Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
> > > > Cache-Control: no-store
> > > > X-Highwire-SessionId: y1851mbb91.JS1
> > > > Set-Cookie: JServSessionIdroot=y1851mbb91.JS1; path=/
> > > > Transfer-Encoding: chunked
> > > > Content-Type: text/html
> > > >
> > > > Is there something that I have set up wrong?  This happens on a LOT of
> > > > pages/sites.  My current plugins are set at:
> > > >
> > > > protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
> > >
> > > >
> > > >
> > > > Here is another URL:
> > > >
> > > > http://www.bionews.org.uk/
> > > >
> > > >
> > > > Same issue with parsing (parser not found for contentType= url=http://www.bionews.org.uk/), but the headers say:
> > > >
> > > > HTTP/1.0 200 OK
> > > > Server: Lasso/3.6.5 ID/ACGI
> > > > MIME-Version: 1.0
> > > > Content-type: text/html
> > > > Content-length: 69417
> > > >
> > > >
> > > > Any clues?  Does nutch look at the headers or not?
> > >
> > > Can you do a
> > > bin/nutch readseg -get <segment> <url> -noparse -noparsetext -noparsedata -nofetch -nogenerate
> > >
> > > and send the result? This should show us what nutch fetched as content.
> > >
> > >
> > > >
> > > >
> > > > --
> > > > "Conscious decisions by conscious minds are what make reality
real"
> > > >
> > >
> > >
> > > --
> > > Doğacan Güney
> > >
> >
> >
> >
> > --
> > "Conscious decisions by conscious minds are what make reality real"
> >
>
>
>
> --
> "Conscious decisions by conscious minds are what make reality real"
>


--
Doğacan Güney




--
"Conscious decisions by conscious minds are what make reality real"