Thanks for your response, Julien. Is there a way I can bypass the robots check
in the normal crawl?
Thanks,
Vijay
On Jul 9, 2014, at 11:46 AM, Julien Nioche wrote:
> The clue is in: Metadata: _ngt_: 1404918941993 _pst_: robots_denied(18),
> lastModified=0
>
> The server you are hitting prevents robots, see
> http://79657.70194.14886.graphicspotting.com/robots.txt
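For what it's worth: Nutch 1.7 has no configuration property that turns off robots.txt handling in a normal crawl; the check happens inside the HTTP protocol plugin. Bypassing it in 1.7 would mean patching the source and rebuilding, roughly along these lines (this is a sketch of a local hack, not an official Nutch option; the file path and method signature are from memory, so double-check them against your source tree, and note that ignoring robots.txt is generally considered bad crawling etiquette):

// In src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/
// HttpRobotRulesParser.java, replace the body of getRobotRulesSet() so it
// never fetches robots.txt at all:
public BaseRobotRules getRobotRulesSet(Protocol http, URL url) {
  // EMPTY_RULES is the permissive rule set Nutch falls back to when a
  // site has no robots.txt; returning it unconditionally skips the check.
  return EMPTY_RULES;
}

Later 1.x releases added, I believe, an http.robot.rules.whitelist property for per-host exemptions, but that is not available in 1.7.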
The clue is in: Metadata: _ngt_: 1404918941993 _pst_: robots_denied(18),
lastModified=0
The server you are hitting prevents robots, see
http://79657.70194.14886.graphicspotting.com/robots.txt
The parsechecker does not check for robots.txt, whereas the normal crawl
operations do.
Julien
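To see what the normal crawl path sees, you can check the rules yourself with the same crawler-commons classes that Nutch 1.7 bundles. A rough, self-contained illustration (the agent name here is a placeholder; substitute whatever your http.agent.name is set to):

import java.io.InputStream;
import java.net.URL;
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;
import org.apache.commons.io.IOUtils;

public class RobotsCheck {
  public static void main(String[] args) throws Exception {
    // Fetch the site's robots.txt directly (Nutch does this through
    // its protocol plugin during a normal crawl).
    URL robots = new URL("http://79657.70194.14886.graphicspotting.com/robots.txt");
    byte[] content;
    try (InputStream in = robots.openStream()) {
      content = IOUtils.toByteArray(in);
    }
    // Parse the rules and test the page URL against them, as the
    // fetcher would before downloading it.
    BaseRobotRules rules = new SimpleRobotRulesParser().parseContent(
        robots.toString(), content, "text/plain", "my-test-agent");
    System.out.println("allowed: "
        + rules.isAllowed("http://79657.70194.14886.graphicspotting.com/"));
  }
}

If this prints false for your agent, the fetcher will mark the page robots_denied, exactly as in the dump below.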
Hi,
I am using Nutch 1.7 and I tried to crawl this URL:
http://79657.70194.14886.graphicspotting.com/
Created the seed URL file with the URL:
http://79657.70194.14886.graphicspotting.com/
Crawled the URL using the crawl command: bin/nutch crawl url -depth 1
Ran a readseg to dump the segment; the dump shows:
Metadata: _ngt_: 1404918941993 _pst_: robots_denied(18), lastModified=0
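In case it helps anyone reproducing this: in Nutch 1.7 the dump step is of the form

bin/nutch readseg -dump <segment_dir> <output_dir>

where both paths are placeholders; readseg writes a plain-text file named dump into the output directory. Since the crawl command above was run without -dir, the segments should sit under a timestamped crawl-<date>/segments/ directory, if I remember the default layout correctly.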