Hi Mike,

Actually, in my application I am working on Twitter feeds: I filter out the
tweets that contain links and store the contents of those links. I maintain
all such links in the urls file, which I give as input to the Nutch crawler.
Here, I am not bothered about the inlinks or outlinks of any particular link.
So, at first I gave the depth as 1 and later increased it to 3. But if I
increase the depth, I cannot prevent the unwanted crawls. So is there any
other solution for this? I have also changed the number-of-redirects
configuration parameter to 4 in the nutch-default.xml file (see the
configuration sketch at the end of this thread).

Thanks and regards,
Ch. Arjun Kumar Reddy

On Wed, Jan 26, 2011 at 8:28 PM, Mike Zuehlke <mike.zueh...@zanox.com> wrote:

> Hi Arjun,
>
> Nutch handles redirects by itself - like the return codes 301 and 302.
>
> Did you check how many redirects you have to follow until you get an
> HTTP 200 (OK)? I think four redirects are needed to get the given URL's
> content, so you have to increase the depth for your crawling.
>
> Regards,
> Mike
>
>
> From: Arjun Kumar Reddy <charjunkumar.re...@iiitb.net>
> To: user@nutch.apache.org
> Date: 26.01.2011 15:43
> Subject: Re: Few questions from a newbie
>
> I am developing an application based on Twitter feeds, so 90% of the URLs
> will be short URLs. It is difficult for me to manually convert all of
> these to their actual URLs. Do we have any other solution for this?
> (See the expansion sketch at the end of this thread.)
>
> Thanks and regards,
> Arjun Kumar Reddy
>
> On Wed, Jan 26, 2011 at 7:09 PM, Estrada Groups <
> estrada.adam.gro...@gmail.com> wrote:
>
> > You probably have to literally click on each URL to get the URL it's
> > referencing. Those are URL shorteners and probably won't play nicely
> > with a crawler because of the redirection.
> >
> > Adam
> >
> > Sent from my iPhone
> >
> > On Jan 26, 2011, at 8:02 AM, Arjun Kumar Reddy <
> > charjunkumar.re...@iiitb.net> wrote:
> >
> > > Hi list,
> > >
> > > I have given the set of URLs as
> > >
> > > http://is.gd/Jt32Cf
> > > http://is.gd/hS3lEJ
> > > http://is.gd/Jy1Im3
> > > http://is.gd/QoJ8xy
> > > http://is.gd/e4ct89
> > > http://is.gd/WAOVmd
> > > http://is.gd/lhkA69
> > > http://is.gd/3OilLD
> > > ..... 43 such urls
> > >
> > > and I have run the crawl command: bin/nutch crawl urls/ -dir crawl -depth 3
> > >
> > > arjun@arjun-ninjas:~/nutch$ bin/nutch readdb crawl/crawldb -stats
> > > CrawlDb statistics start: crawl/crawldb
> > > Statistics for CrawlDb: crawl/crawldb
> > > TOTAL urls: 43
> > > retry 0: 43
> > > min score: 1.0
> > > avg score: 1.0
> > > max score: 1.0
> > > status 3 (db_gone): 1
> > > status 4 (db_redir_temp): 1
> > > status 5 (db_redir_perm): 41
> > > CrawlDb statistics: done
> > >
> > > When I try to read the content from the segments, the content block
> > > is empty for every record.
> > >
> > > Can you please tell me where I can get the content of these URLs?
> > > (See the readseg sketch at the end of this thread.)
> > >
> > > Thanks and regards,
> > > Arjun Kumar Reddy
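A note on the redirect parameter discussed at the top of the thread: it is
http.redirect.max, and overrides conventionally go into conf/nutch-site.xml
rather than editing nutch-default.xml in place. A minimal sketch of the
override, assuming Nutch 1.x (the value 4 follows Mike's estimate of the
hops needed):

    <!-- conf/nutch-site.xml: with http.redirect.max > 0 the fetcher
         follows 301/302 hops immediately within the same round, so a
         small depth can be enough for shortener links. -->
    <configuration>
      <property>
        <name>http.redirect.max</name>
        <value>4</value>
      </property>
    </configuration>

With the limit raised, the db_redir_perm entries should come back as fetched
pages on the next crawl without increasing the depth.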
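On expanding the short URLs up front, as Adam suggests: rather than clicking
each link by hand, the final target of every hop chain can be captured with
curl and collected into a fresh seed list. A minimal shell sketch, assuming
the seeds live in urls/seed.txt (both file names are placeholders, not taken
from the thread):

    # Follow up to 4 redirect hops per link (HEAD requests only) and
    # print the URL curl ends up at; collect the results as new seeds.
    while read short; do
      curl -sI -L --max-redirs 4 -o /dev/null -w '%{url_effective}\n' "$short"
    done < urls/seed.txt > urls/expanded.txt

Crawling urls/expanded.txt with depth 1 then sidesteps the redirects and
should also avoid the unwanted crawls of outlinks.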
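Finally, on the original question of where the page content ends up: once the
fetch returns 200s instead of redirects, a segment dump should show filled
Content:: blocks. A sketch, with the timestamped segment directory left as a
placeholder:

    bin/nutch readseg -dump crawl/segments/<segment-timestamp> dump_out
    less dump_out/dump   # each record has URL::, Content::, ParseText:: sections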