Hi,

On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote:
> So, I have been having huge problems with parsing. It seems that many
> URLs are being ignored because the parser plugins throw an exception
> saying there is no parser found for what is, reportedly, an
> unresolved contentType. So, if you look at the exception:
>
> org.apache.nutch.parse.ParseException: parser not found for
> contentType= url=http://hea.sagepub.com/cgi/login?uri=%2Fpolicies%2Fterms.dtl
>
> You can see that it says the contentType is "". But, if you look at
> the headers for this request you can see that the Content-Type header
> is set to "text/html":
>
> HTTP/1.1 200 OK
> Date: Fri, 01 Jun 2007 13:54:19 GMT
> Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
> Cache-Control: no-store
> X-Highwire-SessionId: y1851mbb91.JS1
> Set-Cookie: JServSessionIdroot=y1851mbb91.JS1; path=/
> Transfer-Encoding: chunked
> Content-Type: text/html
>
> Is there something that I have set up wrong? This happens on a LOT of
> pages/sites. My current plugins are set to:
>
> "protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)"
>
> Here is another URL:
>
> http://www.bionews.org.uk/
>
> Same issue with parsing (parser not found for contentType=
> url=http://www.bionews.org.uk/), but the header says:
>
> HTTP/1.0 200 OK
> Server: Lasso/3.6.5 ID/ACGI
> MIME-Version: 1.0
> Content-type: text/html
> Content-length: 69417
>
> Any clues? Does Nutch look at the headers or not?
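
As a side check that does not involve Nutch at all, the two header dumps quoted above can be run through a small script. This is only a sketch in Python, not Nutch code: the function name content_type and the variable names are made up for illustration. It matches the header name case-insensitively, because the second server sends "Content-type" rather than "Content-Type" and HTTP header names are case-insensitive, so both dumps should resolve to text/html.

```python
def content_type(raw_headers: str) -> str:
    """Return the media type from a raw HTTP header block, or "" if absent.

    Hypothetical helper for illustration only; it is not part of Nutch.
    """
    for line in raw_headers.splitlines()[1:]:  # skip the status line
        name, sep, value = line.partition(":")
        # Header names are case-insensitive, so compare in lower case.
        if sep and name.strip().lower() == "content-type":
            # Drop parameters such as "; charset=utf-8".
            return value.split(";", 1)[0].strip().lower()
    return ""


# Header dumps copied from the quoted message.
sagepub = """HTTP/1.1 200 OK
Date: Fri, 01 Jun 2007 13:54:19 GMT
Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
Cache-Control: no-store
X-Highwire-SessionId: y1851mbb91.JS1
Set-Cookie: JServSessionIdroot=y1851mbb91.JS1; path=/
Transfer-Encoding: chunked
Content-Type: text/html"""

bionews = """HTTP/1.0 200 OK
Server: Lasso/3.6.5 ID/ACGI
MIME-Version: 1.0
Content-type: text/html
Content-length: 69417"""

for label, dump in (("sagepub", sagepub), ("bionews", bionews)):
    print(label, repr(content_type(dump)))
```

If both lines print 'text/html', the servers are sending a usable type, and the empty contentType would have to come from somewhere between the fetch and the parse step.
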

Can you run

bin/nutch readseg -get <segment> <url> -noparse -noparsetext -noparsedata -nofetch -nogenerate

and send the result? This should show us what Nutch fetched as content.

> --
> "Conscious decisions by conscious minds are what make reality real"

--
Doğacan Güney
