The clue is in: Metadata: _ngt_: 1404918941993_pst_: robots_denied(18), lastModified=0
The server you are hitting prevents robots, see
http://79657.70194.14886.graphicspotting.com/robots.txt

The parsechecker does not check for robots.txt whereas the normal crawl
operations do.

Julien

On 9 July 2014 16:34, Vijay Chakilam <vchaki...@adjuggler.com> wrote:

> Hi,
>
> I am using Nutch 1.7 and I tried to crawl this url:
> http://79657.70194.14886.graphicspotting.com/
>
> Created the seed url file with the url:
> http://79657.70194.14886.graphicspotting.com/
> Crawled the url using the crawl command: bin/nutch crawl url -depth 1
> Ran a readseg to dump the segment:
>
> Here’s the dump:
>
> Recno:: 0
> URL:: http://79657.70194.14886.graphicspotting.com/
>
> CrawlDatum::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Wed Jul 09 11:15:39 EDT 2014
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata: _ngt_: 1404918941993
>
> CrawlDatum::
> Version: 7
> Status: 37 (fetch_gone)
> Fetch time: Wed Jul 09 11:15:46 EDT 2014
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata: _ngt_: 1404918941993_pst_: robots_denied(18), lastModified=0
>
> I don’t see any content, no parse data or text.
>
> I tried to use parsechecker and here’s the output:
>
> vijay$ bin/nutch parsechecker -dumpText http://79657.70194.14886.graphicspotting.com/
> fetching: http://79657.70194.14886.graphicspotting.com/
> parsing: http://79657.70194.14886.graphicspotting.com/
> contentType: text/html
> signature: 9f695936ef3bf29b0d1556df1aec7da8
> ---------
> Url
> ---------------
>
> http://79657.70194.14886.graphicspotting.com/
> ---------
> ParseData
> ---------
>
> Version: 5
> Status: success(1,0)
> Title: 79657.70194.14886.graphicspotting
> Outlinks: 3
> outlink: toUrl: http://79657.70194.14886.graphicspotting.com/../css/style.css anchor:
> outlink: toUrl: http://79657.70194.14886.graphicspotting.com/index.php anchor: 79657.70194.14886.graphicspotting
> outlink: toUrl: http://79657.70194.14886.graphicspotting.com/images/image3017.png anchor:
> Content Metadata: Date=Wed, 09 Jul 2014 15:28:26 GMT Connection=close Content-Type=text/html X-Powered-By=PHP/5.3.3 Server=nginx/1.0.15
> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
> ---------
> ParseText
> ---------
>
> 79657.70194.14886.graphicspotting Welcome to
> 79657.70194.14886.graphicspotting ©2014 79657.70194.14886.graphicspotting.
> All rights reserved
>
> Not sure why I am not able to get any parse data or parse text in readseg,
> whereas parsechecker is able to extract data and text.
>
> Thanks,
> Vijay

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
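
To illustrate why a robots.txt rule can make the fetcher report robots_denied while the page itself still parses fine, here is a minimal Python sketch of a robots.txt check. It uses Python's standard-library parser, not Nutch's own robots handling, so the matching details may differ. The robots.txt bodies and the agent name "my-nutch-agent" are hypothetical: the actual file served by the host and the agent name configured in Nutch are not shown in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt bodies (assumptions, not the real file on the server).
DENY_ALL = """\
User-agent: *
Disallow: /
"""

DENY_PRIVATE_ONLY = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://79657.70194.14886.graphicspotting.com/"
    agent = "my-nutch-agent"  # hypothetical agent name
    print("deny-all rules:    ", is_allowed(DENY_ALL, agent, url))
    print("deny-private rules:", is_allowed(DENY_PRIVATE_ONLY, agent, url))
```

A crawler that honours robots.txt would skip the fetch in the first case, which would be consistent with a segment that has no content or parse data. A tool that skips the robots check would fetch and parse the page regardless.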