Hi Mike,

Actually, in my application I am working on Twitter feeds: I filter the
tweets that contain links and store the contents of those links. I
maintain all such links in the urls file that I give as input to the
Nutch crawler. Here, I am not interested in the inlinks or outlinks of
any particular link.

So at first I gave the depth as 1 and later increased it to 3. But if I
increase the depth, I cannot prevent unwanted crawls of outlinks I don't
need. Is there any other solution for this?
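
One idea I am trying (an untested sketch, and I am assuming the
db.max.outlinks.per.page property behaves as its description says) is to
stop Nutch from recording any outlinks at all, so a higher depth only
follows redirects and never wanders into pages I don't need. In
conf/nutch-site.xml:

  <property>
    <name>db.max.outlinks.per.page</name>
    <value>0</value>
    <description>Process no outlinks per page, so only the seed URLs
    (and their redirect targets) are ever fetched.</description>
  </property>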

I have also changed the number-of-redirects configuration parameter to 4
in the nutch-default.xml file.
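
For reference, this is roughly the snippet I edited (I believe
http.redirect.max is the property that controls this; please correct me
if I have the name wrong):

  <property>
    <name>http.redirect.max</name>
    <value>4</value>
    <description>Follow up to 4 redirects within the same fetch,
    instead of queueing the redirect target for a later round.</description>
  </property>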

Thanks and regards,
Ch. Arjun Kumar Reddy


On Wed, Jan 26, 2011 at 8:28 PM, Mike Zuehlke <mike.zueh...@zanox.com> wrote:

> Hi Arjun,
>
> Nutch handles redirects by itself - like the return codes 301 and 302.
>
> Did you check how many redirects have to be followed until you get an
> HTTP 200 (OK) response?
> I think four redirects are needed to reach the given URL's content, so
> you have to increase the depth of your crawl.
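>
> A quick way to count the hops, if you have curl at hand (just an
> illustration, nothing Nutch-specific), is to follow the chain with -L
> and watch the status lines:
>
>   # -s silent, -I HEAD requests only, -L follow redirects
>   curl -sIL http://is.gd/Jt32Cf
>
> Every 301/302 status line in the output is one redirect; the chain
> ends at the 200.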
>
> Regards
> Mike
>
>
>
>
> From:    Arjun Kumar Reddy <charjunkumar.re...@iiitb.net>
> To:      user@nutch.apache.org
> Date:    26.01.2011 15:43
> Subject: Re: Few questions from a newbie
>
>
>
> I am developing an application based on Twitter feeds, so 90% of the
> URLs will be short URLs.
> It is difficult for me to manually convert all of these to their actual
> URLs. Do we have any other solution for this?
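>
> One workaround I can think of (an untested sketch; the file names are
> only examples) is to expand the short URLs once with curl and feed the
> expanded list to Nutch instead:
>
>   # %{url_effective} prints the final URL after all redirects
>   while read u; do
>     curl -sIL -o /dev/null -w '%{url_effective}\n' "$u"
>   done < urls/short.txt > urls/expanded.txt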
>
>
> Thanks and regards,
> Arjun Kumar Reddy
>
>
> On Wed, Jan 26, 2011 at 7:09 PM, Estrada Groups
> <estrada.adam.gro...@gmail.com> wrote:
>
> > You probably have to literally click on each URL to get the URL it's
> > referencing. Those are URL shorteners and they probably won't play
> > nicely with a crawler because of the redirection.
> >
> > Adam
> >
> > Sent from my iPhone
> >
> > On Jan 26, 2011, at 8:02 AM, Arjun Kumar Reddy
> > <charjunkumar.re...@iiitb.net> wrote:
> >
> > > Hi list,
> > >
> > > I have given the set of urls as
> > >
> > > http://is.gd/Jt32Cf
> > > http://is.gd/hS3lEJ
> > > http://is.gd/Jy1Im3
> > > http://is.gd/QoJ8xy
> > > http://is.gd/e4ct89
> > > http://is.gd/WAOVmd
> > > http://is.gd/lhkA69
> > > http://is.gd/3OilLD
> > > ..... 43 such urls
> > >
> > > And I have run the crawl command: bin/nutch crawl urls/ -dir crawl -depth 3
> > >
> > > arjun@arjun-ninjas:~/nutch$ bin/nutch readdb crawl/crawldb -stats
> > > CrawlDb statistics start: crawl/crawldb
> > > Statistics for CrawlDb: crawl/crawldb
> > > TOTAL urls: 43
> > > retry 0: 43
> > > min score: 1.0
> > > avg score: 1.0
> > > max score: 1.0
> > > status 3 (db_gone): 1
> > > status 4 (db_redir_temp): 1
> > > status 5 (db_redir_perm): 41
> > > CrawlDb statistics: done
> > >
> > > When I try to read the content from the segments, the content block
> > > is empty for every record.
> > >
> > > Can you please tell me where I can get the content of these URLs?
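> > >
> > > In case it helps, this is how I am inspecting the segments (the
> > > segment name below is just a placeholder for the one Nutch created):
> > >
> > >   bin/nutch readseg -dump crawl/segments/<segment> seg_dump
> > >
> > > and the content sections in the resulting seg_dump/dump file are
> > > all empty.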
> > >
> > > Thanks and regards,
> > > Arjun Kumar Reddy
> >
>
