The clue is in the metadata of the second CrawlDatum:
_ngt_: 1404918941993_pst_: robots_denied(18), lastModified=0

The server you are hitting disallows robots; see
http://79657.70194.14886.graphicspotting.com/robots.txt

The parsechecker does not check robots.txt, whereas the normal crawl
operations do. That is why parsechecker returned content and the crawl
did not.
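
To see how a robots.txt rule leads to a denial like this one, here is a
minimal sketch using Python's standard urllib.robotparser. It is not Nutch
code. The robots.txt body (DENY_ALL), the agent name "my-nutch-agent" and
the helper is_allowed are assumptions made up for illustration; the real
file on that server may contain different rules.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt body that denies every robot (an assumption,
# not the actual content served by the site).
DENY_ALL = "User-agent: *\nDisallow: /\n"
URL = "http://79657.70194.14886.graphicspotting.com/"

if __name__ == "__main__":
    # A crawler that honours robots.txt would skip the URL in the first case.
    print(is_allowed(DENY_ALL, "my-nutch-agent", URL))
    # With no rules at all, the same check lets the fetch go ahead.
    print(is_allowed("", "my-nutch-agent", URL))
```

A tool that skips this check, as parsechecker does, fetches the page
regardless of what the rules say.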

Julien




On 9 July 2014 16:34, Vijay Chakilam <vchaki...@adjuggler.com> wrote:

> Hi,
>
> I am using Nutch 1.7 and I tried to crawl this url:
> http://79657.70194.14886.graphicspotting.com/
>
> Created the seed url file with the url:
> http://79657.70194.14886.graphicspotting.com/
> Crawled the url using the crawl command: bin/nutch crawl url -depth 1
> Ran a readseg to dump the segment:
>
> Here’s the dump:
>
> Recno:: 0
> URL:: http://79657.70194.14886.graphicspotting.com/
>
> CrawlDatum::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Wed Jul 09 11:15:39 EDT 2014
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata: _ngt_: 1404918941993
>
> CrawlDatum::
> Version: 7
> Status: 37 (fetch_gone)
> Fetch time: Wed Jul 09 11:15:46 EDT 2014
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata: _ngt_: 1404918941993_pst_: robots_denied(18), lastModified=0
>
> I don’t see any content, no parse data or text.
>
> I tried to use parsechecker and here’s the output:
>
> vijay$ bin/nutch parsechecker -dumpText
> http://79657.70194.14886.graphicspotting.com/
> fetching: http://79657.70194.14886.graphicspotting.com/
> parsing: http://79657.70194.14886.graphicspotting.com/
> contentType: text/html
> signature: 9f695936ef3bf29b0d1556df1aec7da8
> ---------
> Url
> ---------------
>
> http://79657.70194.14886.graphicspotting.com/
> ---------
> ParseData
> ---------
>
> Version: 5
> Status: success(1,0)
> Title: 79657.70194.14886.graphicspotting
> Outlinks: 3
>   outlink: toUrl:
> http://79657.70194.14886.graphicspotting.com/../css/style.css anchor:
>   outlink: toUrl: http://79657.70194.14886.graphicspotting.com/index.php
> anchor: 79657.70194.14886.graphicspotting
>   outlink: toUrl:
> http://79657.70194.14886.graphicspotting.com/images/image3017.png anchor:
> Content Metadata: Date=Wed, 09 Jul 2014 15:28:26 GMT Connection=close
> Content-Type=text/html X-Powered-By=PHP/5.3.3 Server=nginx/1.0.15
> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
> ---------
> ParseText
> ---------
>
> 79657.70194.14886.graphicspotting Welcome to
> 79657.70194.14886.graphicspotting ©2014 79657.70194.14886.graphicspotting.
> All rights reserved
>
> Not sure why I am not able to get any parse data or parse text in readseg,
> whereas parsechecker is able to extract data and text.
>
> Thanks,
> Vijay




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
