I have been having huge problems with parsing. Many URLs are being ignored because the parser plugins throw an exception saying that no parser was found for what is reportedly an unresolved contentType. If you look at the exception:

  org.apache.nutch.parse.ParseException: parser not found for contentType= url=http://hea.sagepub.com/cgi/login?uri=%2Fpolicies%2Fterms.dtl

you can see that it reports the contentType as "". But if you look at the headers for this request, the Content-Type header is set to "text/html":

  HTTP/1.1 200 OK
  Date: Fri, 01 Jun 2007 13:54:19 GMT
  Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
  Cache-Control: no-store
  X-Highwire-SessionId: y1851mbb91.JS1
  Set-Cookie: JServSessionIdroot=y1851mbb91.JS1; path=/
  Transfer-Encoding: chunked
  Content-Type: text/html

Is there something that I have set up wrong? This happens on a LOT of pages/sites. My current plugins are set to:

  protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)

Here is another URL:

  http://www.bionews.org.uk/

It has the same parsing issue (parser not found for contentType= url=http://www.bionews.org.uk/), but the headers say:

  HTTP/1.0 200 OK
  Server: Lasso/3.6.5 ID/ACGI
  MIME-Version: 1.0
  Content-type: text/html
  Content-length: 69417

Any clues? Does Nutch look at the headers or not?

-- 
"Conscious decisions by conscious minds are what make reality real"

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
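
For reference, here is a minimal Python sketch of reading the content type out of raw header blocks like the two quoted above. It runs outside Nutch and says nothing about how Nutch itself reads headers; the function name content_type_from_headers is hypothetical, made up for this illustration. The one thing it does show is that HTTP header names are case-insensitive, so "Content-Type" from the first server and "Content-type" from the second should resolve to the same value.

```python
def content_type_from_headers(raw_headers: str) -> str:
    """Return the media type from a raw HTTP response header block, or "".

    Hypothetical helper for illustration only; not part of Nutch.
    """
    for line in raw_headers.splitlines():
        # The status line ("HTTP/1.1 200 OK") has no colon and is skipped.
        name, sep, value = line.partition(":")
        # Header names are case-insensitive, so compare in lower case.
        if sep and name.strip().lower() == "content-type":
            # Drop parameters such as "; charset=utf-8".
            return value.split(";", 1)[0].strip().lower()
    return ""


# The two header blocks quoted in this message.
SAGEPUB_HEADERS = """HTTP/1.1 200 OK
Date: Fri, 01 Jun 2007 13:54:19 GMT
Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
Cache-Control: no-store
X-Highwire-SessionId: y1851mbb91.JS1
Set-Cookie: JServSessionIdroot=y1851mbb91.JS1; path=/
Transfer-Encoding: chunked
Content-Type: text/html"""

BIONEWS_HEADERS = """HTTP/1.0 200 OK
Server: Lasso/3.6.5 ID/ACGI
MIME-Version: 1.0
Content-type: text/html
Content-length: 69417"""

if __name__ == "__main__":
    print(content_type_from_headers(SAGEPUB_HEADERS))
    print(content_type_from_headers(BIONEWS_HEADERS))
```

If both blocks resolve to a non-empty type in a check like this, the servers are sending the header, and the empty contentType in the exception would have to come from somewhere between the fetch and the parser lookup.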
