Reply: Re: Few questions from a newbie

Mike Zuehlke Wed, 26 Jan 2011 06:59:00 -0800

Hi Arjun,

Nutch handles redirects by itself, such as the return codes 301 and 302.


Did you check how many redirects you have to follow until you get a
successful response (HTTP 200)?
I think four redirects are needed to get the content of the given URLs, so
you have to increase the depth of your crawl.
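
To check the number of hops, something like the following rough sketch may help. It follows a redirect chain one hop at a time and counts the hops before a non-redirect status comes back. The names count_redirects, fake_fetch and FAKE_CHAIN, and the example.* URLs, are made up for illustration and are not part of Nutch; the fetch function is passed in so that the logic can be tried without network access. By the reasoning above, the crawl depth would need to be larger than the hop count it reports.

```python
from typing import Callable, Optional, Tuple

# One HTTP response reduced to (status code, Location header or None).
# This shape is an assumption made for the sketch.
Response = Tuple[int, Optional[str]]

REDIRECT_CODES = {301, 302, 303, 307, 308}


def count_redirects(
    url: str,
    fetch: Callable[[str], Response],
    max_hops: int = 10,
) -> Tuple[int, str, int]:
    """Follow redirects hop by hop; return (hops, final_url, final_status).

    Assumes every Location value is an absolute URL.
    Stops early after max_hops redirects to avoid looping forever.
    """
    hops = 0
    while True:
        status, location = fetch(url)
        if status not in REDIRECT_CODES or location is None or hops >= max_hops:
            return hops, url, status
        url = location
        hops += 1


# Hypothetical redirect chain standing in for a real short URL.
FAKE_CHAIN = {
    "http://short.example/a": (301, "http://short.example/b"),
    "http://short.example/b": (302, "http://site.example/page"),
    "http://site.example/page": (200, None),
}


def fake_fetch(url: str) -> Response:
    """Look the URL up in the made-up chain; unknown URLs count as 404."""
    return FAKE_CHAIN.get(url, (404, None))


hops, final_url, status = count_redirects("http://short.example/a", fake_fetch)
print(hops, final_url, status)
```

To use it against real URLs, fake_fetch would have to be replaced by a function that issues a request without following redirects automatically and returns the status code and the Location header.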

Regards
Mike




From:    Arjun Kumar Reddy <charjunkumar.re...@iiitb.net>
To:      user@nutch.apache.org
Date:    26.01.2011 15:43
Subject: Re: Few questions from a newbie



I am developing an application based on Twitter feeds, so 90% of the URLs
will be short URLs.
It is therefore difficult for me to convert all of them to the actual URLs
manually. Is there any other solution for this?


Thanks and regards,
Arjun Kumar Reddy


On Wed, Jan 26, 2011 at 7:09 PM, Estrada Groups <
estrada.adam.gro...@gmail.com> wrote:

> You probably have to literally click on each URL to get the URL it's
> referencing. Those are URL shorteners and probably won't play nicely with a
> crawler because of the redirection.
>
> Adam
>
> Sent from my iPhone
>
> On Jan 26, 2011, at 8:02 AM, Arjun Kumar Reddy <
> charjunkumar.re...@iiitb.net> wrote:
>
> > Hi list,
> >
> > I have given the set of urls as
> >
> > http://is.gd/Jt32Cf
> > http://is.gd/hS3lEJ
> > http://is.gd/Jy1Im3
> > http://is.gd/QoJ8xy
> > http://is.gd/e4ct89
> > http://is.gd/WAOVmd
> > http://is.gd/lhkA69
> > http://is.gd/3OilLD
> > ..... 43 such urls
> >
> > And I have run the crawl command bin/nutch crawl urls/ -dir crawl -depth 3
> >
> > arjun@arjun-ninjas:~/nutch$ bin/nutch readdb crawl/crawldb -stats
> > CrawlDb statistics start: crawl/crawldb
> > Statistics for CrawlDb: crawl/crawldb
> > TOTAL urls: 43
> > retry 0: 43
> > min score: 1.0
> > avg score: 1.0
> > max score: 1.0
> > status 3 (db_gone): 1
> > status 4 (db_redir_temp): 1
> > status 5 (db_redir_perm): 41
> > CrawlDb statistics: done
> >
> > When I try to read the content from the segments, the content block is
> > empty for every record.
> >
> > Can you please tell me where I can get the content of these URLs?
> >
> > Thanks and regards,
> > Arjun Kumar Reddy
>





