Re: Nutch 1.7: No content fetched

2014-07-09 Thread Vijay Chakilam
Thanks for your response Julien. Is there a way I can bypass the robots check in the normal crawl?

Thanks,
Vijay

On Jul 9, 2014, at 11:46 AM, Julien Nioche wrote:
> The clue is in: Metadata: _ngt_: 1404918941993 _pst_: robots_denied(18),
> lastModified=0
>
> The server you are hitting preven

Re: Nutch 1.7: No content fetched

2014-07-09 Thread Julien Nioche
The clue is in:

Metadata: _ngt_: 1404918941993 _pst_: robots_denied(18), lastModified=0

The server you are hitting prevents robots, see http://79657.70194.14886.graphicspotting.com/robots.txt

The parsechecker does not check for robots.txt whereas the normal crawl operations do.

Julien

On 9 J
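
To see why a robots-aware fetch comes back empty while a tool that skips the check succeeds, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not Nutch code. The robots.txt body, the agent name `my-nutch-agent`, and the helper `is_allowed` are all hypothetical, since the actual file served by the site in this thread is not quoted here.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: a robots.txt that denies every crawler. The real file on the
# server discussed in this thread is not shown, so this body is illustrative.
DENY_ALL = """\
User-agent: *
Disallow: /
"""

# ASSUMPTION: a permissive robots.txt, for contrast (empty Disallow allows all).
ALLOW_ALL = """\
User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "http://example.com/"  # placeholder URL, not the site from the thread
    print("deny-all :", is_allowed(DENY_ALL, "my-nutch-agent", url))
    print("allow-all:", is_allowed(ALLOW_ALL, "my-nutch-agent", url))
```

A crawler that performs this check before fetching would skip the page under the deny-all rules, which matches the robots_denied status above. A checker that never consults robots.txt would fetch the page regardless.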

Nutch 1.7: No content fetched

2014-07-09 Thread Vijay Chakilam
Hi,

I am using Nutch 1.7 and I tried to crawl this url: http://79657.70194.14886.graphicspotting.com/

Created the seed url file with the url: http://79657.70194.14886.graphicspotting.com/
Crawled the url using the crawl command: bin/nutch crawl url -depth 1
Ran a readseg to dump the segment: H