I have been having huge problems with parsing.  It seems that many
URLs are being ignored because the parser plugins throw an exception
saying there is no parser found for what is reportedly an
unresolved contentType.  If you look at the exception:

  org.apache.nutch.parse.ParseException: parser not found for
contentType= url=http://hea.sagepub.com/cgi/login?uri=%2Fpolicies%2Fterms.dtl

You can see that it says the contentType is "".  But if you look at
the headers for this request, you can see that the Content-Type header
is set to "text/html":

HTTP/1.1 200 OK
Date: Fri, 01 Jun 2007 13:54:19 GMT
Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
Cache-Control: no-store
X-Highwire-SessionId: y1851mbb91.JS1
Set-Cookie: JServSessionIdroot=y1851mbb91.JS1; path=/
Transfer-Encoding: chunked
Content-Type: text/html

Is there something that I have set up wrong?  This happens on a LOT of
pages/sites.  My current plugins are set to:

protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)


Here is another URL:

http://www.bionews.org.uk/


Same issue with parsing (parser not found for contentType=
url=http://www.bionews.org.uk/), but the header says:

HTTP/1.0 200 OK
Server: Lasso/3.6.5 ID/ACGI
MIME-Version: 1.0
Content-type: text/html
Content-length: 69417
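
For what it's worth, the two responses spell the header name differently
("Content-Type" vs. "Content-type").  HTTP header names are
case-insensitive, so any lookup should treat both spellings the same.
Below is a minimal standalone sketch of such a lookup, run against
header lines like the ones above.  The HeaderCheck class and its
contentType method are hypothetical names made up for illustration; this
is not Nutch code, and I do not know whether Nutch does its lookup this
way.

```java
import java.util.Locale;

// Hypothetical helper, NOT part of Nutch: extracts the Content-Type value
// from a raw HTTP response header block, ignoring the case of the name.
class HeaderCheck {

    static String contentType(String rawHeaders) {
        for (String line : rawHeaders.split("\r?\n")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // status line or blank line, no header here
            }
            String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            if (name.equals("content-type")) {
                return line.substring(colon + 1).trim();
            }
        }
        return ""; // no Content-Type header present
    }

    public static void main(String[] args) {
        String first = "HTTP/1.1 200 OK\n"
                + "Transfer-Encoding: chunked\n"
                + "Content-Type: text/html\n";
        String second = "HTTP/1.0 200 OK\n"
                + "Server: Lasso/3.6.5 ID/ACGI\n"
                + "Content-type: text/html\n";
        System.out.println(contentType(first));
        System.out.println(contentType(second));
    }
}
```

If a lookup like this returns the expected value for both header blocks,
the spelling difference alone would not explain an empty contentType.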


Any clues?  Does Nutch look at the headers or not?


-- 
"Conscious decisions by conscious minds are what make reality real"
