Hi,

On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote:
> So, I have been having huge problems with parsing.  It seems that many
> urls are being ignored because the parser plugins throw an exception
> saying there is no parser found for what is, reportedly, an
> unresolved contentType.  So, if you look at the exception:
>
>   org.apache.nutch.parse.ParseException: parser not found for
> contentType= url=http://hea.sagepub.com/cgi/login?uri=%2Fpolicies%2Fterms.dtl
>
> You can see that it says the contentType is "".  But, if you look at
> the headers for this request you can see that the Content-Type header
> is set at "text/html":
>
> HTTP/1.1 200 OK
> Date: Fri, 01 Jun 2007 13:54:19 GMT
> Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
> Cache-Control: no-store
> X-Highwire-SessionId: y1851mbb91.JS1
> Set-Cookie: JServSessionIdroot=y1851mbb91.JS1; path=/
> Transfer-Encoding: chunked
> Content-Type: text/html
>
> Is there something that I have set up wrong?  This happens on a LOT of
> pages/sites.  My current plugins are set at:
>
> "protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)"
>
>
> Here is another URL:
>
> http://www.bionews.org.uk/
>
>
> Same issue with parsing (parser not found for contentType=
> url=http://www.bionews.org.uk/), but the header says:
>
> HTTP/1.0 200 OK
> Server: Lasso/3.6.5 ID/ACGI
> MIME-Version: 1.0
> Content-type: text/html
> Content-length: 69417
>
>
> Any clues?  Does nutch look at the headers or not?

Can you run

bin/nutch readseg -get <segment> <url> -noparse -noparsetext -noparsedata -nofetch -nogenerate

and send the result? This should show us what nutch fetched as content.
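
For instance, for the second URL you mentioned it might look roughly like this. The segment path is only a placeholder (an assumption on my part), so substitute one of your own segment directories. The quotes keep the shell from interpreting characters such as ? in URLs like your first one.

```shell
# Hypothetical segment directory; replace with a real one from your crawl.
SEGMENT="crawl/segments/20070601135419"
# One of the URLs that reported an empty contentType.
URL="http://www.bionews.org.uk/"

# Dump only the stored content record for that URL, skipping the other segment parts.
bin/nutch readseg -get "$SEGMENT" "$URL" -noparse -noparsetext -noparsedata -nofetch -nogenerate
```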

>
>
> --
> "Conscious decisions by conscious minds are what make reality real"
>


-- 
Doğacan Güney
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general