Hi,

On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote:
> Here is another example that keeps saying it can't parse it...
>
> SegmentReader: get 'http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir'
> Content::
> Version: 2
> url: http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir
> base: http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir
> contentType:
> metadata: nutch.segment.name=20070601050840 nutch.crawl.score=3.5455807E-5
> Content:
>
> These are the headers:
>
> HTTP/1.1 200 OK
> Date: Fri, 01 Jun 2007 15:38:15 GMT
> Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
> Window-Target: _top
> X-Highwire-SessionId: nh2ukcdpv1.JS1
> Set-Cookie: JServSessionIdroot=nh2ukcdpv1.JS1; path=/
> Transfer-Encoding: chunked
> Content-Type: text/html
>
> So, that's it..... any ideas?
In both examples nutch wasn't able to fetch the page. When a url can't be
fetched, the fetcher creates empty content for it. That's why you can't
parse them: there is nothing to parse :). You can't fetch
http://hea.sagepub.com/cgi/alerts and
http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir
because both hosts have robots.txt files that disallow access to your urls.
You can verify this yourself; see the P.S. at the bottom of this mail.

> On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote:
> >
> > So, here is one:
> >
> > http://hea.sagepub.com/cgi/alerts
> >
> > Segment Reader reports:
> >
> > Content::
> > Version: 2
> > url: http://hea.sagepub.com/cgi/alerts
> > base: http://hea.sagepub.com/cgi/alerts
> > contentType:
> > metadata: nutch.segment.name=20070601045920 nutch.crawl.score=0.041666668
> > Content:
> >
> > So, I notice when I try to crawl that url specifically, I get a job failed
> > (array index out of bounds -1 exception).
> >
> > But if I use curl like:
> >
> > curl -G http://hea.sagepub.com/cgi/alerts --dump-header header.txt
> >
> > I get content and the headers are:
> >
> > HTTP/1.1 200 OK
> > Date: Fri, 01 Jun 2007 15:03:28 GMT
> > Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
> > Cache-Control: no-store
> > X-Highwire-SessionId: xlz2cgcww1.JS1
> > Set-Cookie: JServSessionIdroot=xlz2cgcww1.JS1; path=/
> > Transfer-Encoding: chunked
> > Content-Type: text/html
> >
> > So, I'm lost.
> >
> > On 6/1/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi,
> > >
> > > On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote:
> > > > So, I have been having huge problems with parsing. It seems that many
> > > > urls are being ignored because the parser plugins throw an exception
> > > > saying there is no parser found for what is, reportedly, an
> > > > unresolved contentType. So, if you look at the exception:
> > > >
> > > > org.apache.nutch.parse.ParseException: parser not found for
> > > > contentType= url=http://hea.sagepub.com/cgi/login?uri=%2Fpolicies%2Fterms.dtl
> > > >
> > > > You can see that it says the contentType is "". But, if you look at
> > > > the headers for this request, you can see that the Content-Type header
> > > > is set to "text/html":
> > > >
> > > > HTTP/1.1 200 OK
> > > > Date: Fri, 01 Jun 2007 13:54:19 GMT
> > > > Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
> > > > Cache-Control: no-store
> > > > X-Highwire-SessionId: y1851mbb91.JS1
> > > > Set-Cookie: JServSessionIdroot=y1851mbb91.JS1; path=/
> > > > Transfer-Encoding: chunked
> > > > Content-Type: text/html
> > > >
> > > > Is there something that I have set up wrong? This happens on a LOT of
> > > > pages/sites. My current plugins are set to:
> > > >
> > > > protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
> > > >
> > > > Here is another URL:
> > > >
> > > > http://www.bionews.org.uk/
> > > >
> > > > Same issue with parsing (parser not found for contentType=
> > > > url=http://www.bionews.org.uk/), but the header says:
> > > >
> > > > HTTP/1.0 200 OK
> > > > Server: Lasso/3.6.5 ID/ACGI
> > > > MIME-Version: 1.0
> > > > Content-type: text/html
> > > > Content-length: 69417
> > > >
> > > > Any clues? Does nutch look at the headers or not?
> > >
> > > Can you do a
> > >
> > > bin/nutch readseg -get <segment> <url> -noparse -noparsetext -noparsedata -nofetch -nogenerate
> > >
> > > And send the result?
> > > This should show us what nutch fetched as content.
> > >
> > > > --
> > > > "Conscious decisions by conscious minds are what make reality real"
> > >
> > > --
> > > Doğacan Güney
> >
> > --
> > "Conscious decisions by conscious minds are what make reality real"
>
> --
> "Conscious decisions by conscious minds are what make reality real"

--
Doğacan Güney
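
P.S. If you want to see the robots.txt rules that are blocking you, a quick check is to fetch the files directly (just a sketch; the exact Disallow lines on those hosts may differ, and which ones apply depends on the agent name you configured in http.agent.name in your nutch-site.xml):

  # fetch the robots.txt files and look at the Disallow rules
  curl http://hea.sagepub.com/robots.txt
  curl http://www.annals.org/robots.txt

Look for a Disallow line that matches the paths you are trying to crawl (for example something under /cgi/), either under "User-agent: *" or under your own agent name. If such a rule is there, the fetcher skips the url and you end up with the empty content you saw in the segment dumps.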
