Thanks for your answers Julien. I tried to use the crawl script, but I am having the same problem. I have set http.redirect.max to 5 and the number of rounds to 1. (I have also tried 2 rounds, but I guess that doesn’t help, since I have already set http.redirect.max to 5 — it should follow any redirects even with 1 round, right?) Here’s the new readseg dump:
Recno:: 0
URL:: http://0-search.proquest.com.alpha2.latrobe.edu.au/

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Thu Jul 17 18:20:54 EDT 2014
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_=1405635658639

Content::
Version: -1
url: http://0-search.proquest.com.alpha2.latrobe.edu.au/
base: http://0-search.proquest.com.alpha2.latrobe.edu.au/
contentType: text/plain
metadata: Date=Thu, 17 Jul 2014 22:21:08 GMT Vary=User-Agent nutch.crawl.score=1.0 Location=https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F _fst_=35 nutch.segment.name=20140717182101 Connection=close Content-Type=text/plain Server=III 150 MIME-version=1.0
Content:

CrawlDatum::
Version: 7
Status: 35 (fetch_redir_temp)
Fetch time: Thu Jul 17 18:21:08 EDT 2014
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_=1405635658639 Content-Type=text/plain _pst_=temp_moved(13), lastModified=0: https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F _rs_=468

Thanks,
Vijay

On Jul 17, 2014, at 5:13 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:

> Hi
>
> On 17 July 2014 22:04, Vijay Chakilam <vchaki...@adjuggler.com> wrote:
>
>> Thanks for your reply Julien. I am not doing any indexing and I don’t
>> have a Solr URL. It looks like the crawl script requires me to specify
>> one. How do I run the crawl script without specifying a Solr URL?
>
> Just comment out the commands related to SOLR in the script and pass it
> a dummy parameter for the SOLR URL.
>
>> Also, I want to crawl just the webpage I specify: a depth of 1.
>> I don’t want to fetch any outlinks.
>
> That can be done by setting db.update.additions.allowed to false in
> nutch-site.xml. No new URLs will be added to the crawldb.
>
>> How does number of rounds relate to depth? Are they the same?
>
> No. They would be the same if there were no redirections and if you were
> putting all unfetched URLs in the segments. If there are more unfetched
> URLs in the crawldb than you are putting in the segments, then you'll
> definitely need several iterations.
>
>> If so, what value should I specify for number of rounds to fetch just
>> the page I specify and also take care of the redirects? Are
>> http.redirect.max and number of rounds related?
>
> Set http.redirect.max to a value > 0 so that the redirection gets tried
> within the same fetch step (i.e. same round).
>
> HTH
>
> Julien
>
>> Thanks,
>> Vijay
>>
>> On Jul 17, 2014, at 4:42 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> The crawl command is deprecated; use the crawl script instead and give
>>> it a number of rounds > 1 so that it has a chance to fetch the
>>> redirection.
>>>
>>> J.
>>>
>>> On 17 July 2014 21:10, Vijay Chakilam <vchaki...@adjuggler.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am trying to crawl the page at
>>>> http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>
>>>> Here’s the parsechecker output:
>>>>
>>>> runtime/local/bin/nutch parsechecker -dumpText http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>> fetching: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>> Fetch failed with protocol status: temp_moved(13), lastModified=0:
>>>> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>>>>
>>>> That looks like a redirection, so I ran parsechecker again for
>>>> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>>>>
>>>> The fetching and parsing were successful this time.
>>>> I have set http.redirect.max to 5 and tried to crawl using nutch crawl:
>>>>
>>>> bin/nutch crawl testurl -depth 1
>>>>
>>>> and did a readseg on the above crawl. Here’s the readseg dump:
>>>>
>>>> Recno:: 0
>>>> URL:: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>
>>>> CrawlDatum::
>>>> Version: 7
>>>> Status: 1 (db_unfetched)
>>>> Fetch time: Wed Jul 16 00:43:28 EDT 2014
>>>> Modified time: Wed Dec 31 19:00:00 EST 1969
>>>> Retries since fetch: 0
>>>> Retry interval: 2592000 seconds (30 days)
>>>> Score: 1.0
>>>> Signature: null
>>>> Metadata: _ngt_: 1405485810821
>>>>
>>>> Content::
>>>> Version: -1
>>>> url: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>> base: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>> contentType: text/plain
>>>> metadata: Date=Wed, 16 Jul 2014 04:43:37 GMT Vary=User-Agent
>>>> nutch.crawl.score=1.0 Location=https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>>>> _fst_=35 nutch.segment.name=20140716004332 Connection=close
>>>> Content-Type=text/plain Server=III 150 MIME-version=1.0
>>>> Content:
>>>>
>>>> CrawlDatum::
>>>> Version: 7
>>>> Status: 35 (fetch_redir_temp)
>>>> Fetch time: Wed Jul 16 00:43:38 EDT 2014
>>>> Modified time: Wed Dec 31 19:00:00 EST 1969
>>>> Retries since fetch: 0
>>>> Retry interval: 2592000 seconds (30 days)
>>>> Score: 1.0
>>>> Signature: null
>>>> Metadata: _ngt_: 1405485810821 Content-Type: text/plain _pst_:
>>>> temp_moved(13), lastModified=0:
>>>> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>>>>
>>>> Not sure why Nutch didn’t fetch any content or parse any data or text
>>>> when crawling the page! Did I miss setting some property? I am sure I
>>>> have increased http.redirect.max to 5. Using parsechecker, I was able
>>>> to get the data and text parsed in two steps, so I think a max
>>>> redirect of 5 should be sufficient.
>>>> I want to understand why parsechecker works and crawl doesn’t.
>>>>
>>>> Thanks,
>>>> Vijay
>>>
>>> --
>>>
>>> Open Source Solutions for Text Engineering
>>>
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>> http://twitter.com/digitalpebble
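
P.S. In case it helps anyone reading along, here is roughly what I now have in conf/nutch-site.xml, based on the two properties discussed in this thread (a sketch only — please check the names against your nutch-default.xml):

```xml
<!-- conf/nutch-site.xml (sketch): follow redirects within the same round,
     and don't add outlinks to the crawldb (depth-1 style crawl) -->
<configuration>
  <property>
    <name>http.redirect.max</name>
    <value>5</value>
  </property>
  <property>
    <name>db.update.additions.allowed</name>
    <value>false</value>
  </property>
</configuration>
```

I then run the crawl script with 1 round and a dummy Solr URL (with the SOLR-related commands commented out as you suggested), along the lines of:

bin/crawl urls testcrawl http://localhost:8983/solr 1

(the exact arguments may differ depending on the Nutch version).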