Thanks for your answers Julien. I tried to use the crawl script, but I am having the same problem. I have set http.redirect.max to 5 and the number of rounds to 1. (I have also tried 2 rounds, but I guess that doesn’t help, since I have already set http.redirect.max to 5 — it should follow any redirects even with 1 round, right?) Here’s the new readseg dump:
Recno:: 0
URL:: http://0-search.proquest.com.alpha2.latrobe.edu.au/

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Thu Jul 17 18:20:54 EDT 2014
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_=1405635658639

Content::
Version: -1
url: http://0-search.proquest.com.alpha2.latrobe.edu.au/
base: http://0-search.proquest.com.alpha2.latrobe.edu.au/
contentType: text/plain
metadata: Date=Thu, 17 Jul 2014 22:21:08 GMT Vary=User-Agent nutch.crawl.score=1.0 Location=https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F _fst_=35 nutch.segment.name=20140717182101 Connection=close Content-Type=text/plain Server=III 150 MIME-version=1.0
Content:

CrawlDatum::
Version: 7
Status: 35 (fetch_redir_temp)
Fetch time: Thu Jul 17 18:21:08 EDT 2014
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_=1405635658639 Content-Type=text/plain _pst_=temp_moved(13), lastModified=0: https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F _rs_=468

Thanks,
Vijay

On Jul 17, 2014, at 5:13 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:

> Hi
>
> On 17 July 2014 22:04, Vijay Chakilam <vchaki...@adjuggler.com> wrote:
>
>> Thanks for your reply Julien. I am not doing any indexing and I don’t
>> have a Solr URL. It looks like the crawl script requires me to specify
>> one. How do I run the crawl script without specifying a Solr URL?
>
> Just comment out the commands related to SOLR in the script and pass it
> a dummy parameter for the SOLR URL.
>
>> Also, I want to crawl just the webpage I specify: a depth of 1.
>> I don’t want to fetch any outlinks.
>
> That can be done by setting db.update.additions.allowed to false in
> nutch-site.xml. No new URLs will be added to the crawldb.
>
>> How does number of rounds relate to depth? Are they the same?
>
> No. They would be the same if there were no redirections and if you were
> putting all unfetched URLs in the segments. If there are more unfetched
> URLs in the crawldb than you are putting in the segments, then you'll
> definitely need several iterations.
>
>> If so, what value should I specify for number of rounds to fetch just
>> the page I specify and also take care of the redirects? Are
>> http.redirect.max and number of rounds related?
>
> Set http.redirect.max to a value > 0 so that the redirection gets tried
> within the same fetch step (i.e. same round).
>
> HTH
>
> Julien
>
>> Thanks,
>> Vijay
>>
>> On Jul 17, 2014, at 4:42 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> The crawl command is deprecated; use the crawl script instead and give
>>> it a number of rounds > 1 so that it has a chance to fetch the
>>> redirection.
>>>
>>> J.
>>>
>>> On 17 July 2014 21:10, Vijay Chakilam <vchaki...@adjuggler.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am trying to crawl the page at
>>>> http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>
>>>> Here’s the parsechecker output:
>>>>
>>>> runtime/local/bin/nutch parsechecker -dumpText http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>> fetching: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>> Fetch failed with protocol status: temp_moved(13), lastModified=0:
>>>> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>>>>
>>>> That looks like a redirection, so I ran parsechecker again for
>>>> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>>>>
>>>> The fetching and parsing were successful this time.
>>>> I have set http.redirect.max to 5 and tried to crawl using nutch crawl:
>>>>
>>>> bin/nutch crawl testurl -depth 1
>>>>
>>>> and did a readseg on the above crawl. Here’s the readseg dump:
>>>>
>>>> Recno:: 0
>>>> URL:: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>
>>>> CrawlDatum::
>>>> Version: 7
>>>> Status: 1 (db_unfetched)
>>>> Fetch time: Wed Jul 16 00:43:28 EDT 2014
>>>> Modified time: Wed Dec 31 19:00:00 EST 1969
>>>> Retries since fetch: 0
>>>> Retry interval: 2592000 seconds (30 days)
>>>> Score: 1.0
>>>> Signature: null
>>>> Metadata: _ngt_: 1405485810821
>>>>
>>>> Content::
>>>> Version: -1
>>>> url: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>> base: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>> contentType: text/plain
>>>> metadata: Date=Wed, 16 Jul 2014 04:43:37 GMT Vary=User-Agent
>>>> nutch.crawl.score=1.0 Location=https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>>>> _fst_=35 nutch.segment.name=20140716004332 Connection=close
>>>> Content-Type=text/plain Server=III 150 MIME-version=1.0
>>>> Content:
>>>>
>>>> CrawlDatum::
>>>> Version: 7
>>>> Status: 35 (fetch_redir_temp)
>>>> Fetch time: Wed Jul 16 00:43:38 EDT 2014
>>>> Modified time: Wed Dec 31 19:00:00 EST 1969
>>>> Retries since fetch: 0
>>>> Retry interval: 2592000 seconds (30 days)
>>>> Score: 1.0
>>>> Signature: null
>>>> Metadata: _ngt_: 1405485810821 Content-Type: text/plain _pst_:
>>>> temp_moved(13), lastModified=0:
>>>> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>>>>
>>>> Not sure why Nutch didn’t fetch any content or parse any data or text
>>>> when crawling the page! Did I miss setting some property? I am sure I
>>>> have increased http.redirect.max to 5. Using parsechecker, I was able
>>>> to get the data and text parsed in two steps, so I think a max
>>>> redirect of 5 should be sufficient.
>>>> I want to understand why parsechecker works and crawl doesn’t.
>>>>
>>>> Thanks,
>>>> Vijay
>>>
>>> --
>>>
>>> Open Source Solutions for Text Engineering
>>>
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>> http://twitter.com/digitalpebble
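
P.S. In case it helps anyone reading along, here is roughly what I now have in conf/nutch-site.xml, based on the two properties discussed in this thread (a sketch only — please check the names against your nutch-default.xml):

```xml
<!-- conf/nutch-site.xml (sketch): follow redirects within the same round,
     and don't add outlinks to the crawldb (depth-1 style crawl) -->
<configuration>
  <property>
    <name>http.redirect.max</name>
    <value>5</value>
  </property>
  <property>
    <name>db.update.additions.allowed</name>
    <value>false</value>
  </property>
</configuration>
```

I then run the crawl script with 1 round and a dummy Solr URL (with the SOLR-related commands commented out as you suggested), along the lines of:

bin/crawl urls testcrawl http://localhost:8983/solr 1

(the exact arguments may differ depending on the Nutch version).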