Did you modify the URL filtering rules to allow URLs containing characters such as ? and &? By default, such URLs are filtered out.
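As a rough illustration only: assuming the stock conf/regex-urlfilter.txt is in use and still contains the rule `-[?*!@=]` followed by a catch-all `+.`, the sketch below mimics first-match-wins filtering on the URLs from the crawldb dump. The `RULES` list and the `accepts` helper are hypothetical names for this example, not Nutch code.

```python
import re

# Assumed rules, modelled on the stock conf/regex-urlfilter.txt:
# '-' rejects, '+' accepts, and the first pattern found in the URL wins.
RULES = [
    ("-", re.compile(r"[?*!@=]")),  # skip URLs that look like queries
    ("+", re.compile(r".")),        # accept anything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule matching the URL is an accept rule."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched, so the URL is dropped


urls = [
    "http://developer.att.com/developer/",
    "http://developer.att.com/developer/tier1page.jsp?passedItemId=100006&_requestid=35037",
    "http://developer.att.com/developer/forward.jsp?passedItemId=100006",
]

for url in urls:
    print("accepted" if accepts(url) else "rejected", url)
```

Under these assumed rules, both redirect targets are rejected because they contain ? and =, which would explain why they never show up in the segments. If that is the cause, relaxing that rule in the URL filter configuration and re-crawling should let them through.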
On 17 August 2011 14:01, abhayd <[email protected]> wrote:

> hi
> I have seen similar posts in this forum but still not able to understand how
> redirect is handled..
>
> I m trying to crawl http://developer.att.com/developer/ . After successful
> crawl i dump the crawldb using readdb. I see entries like following. What
> does this mean? Has nutch crawled the redirected page and is it in index?
>
> I tried using readseg command with all the segments under crawl/segments
> directory but i could not find
> http://developer.att.com/developer/tier1page.jsp?passedItemId=100006&_requestid=35037
> url.
>
> heres is my crawl/segments directory listing.
> 20110817001833 20110817002117 20110817003028 20110817003930 20110817004202
> 20110817001844 20110817002556 20110817003532 20110817004105
>
> Any help why redirected page is not crawled?
>
> http://developer.att.com/developer/ Version: 7
> Status: 4 (db_redir_temp)
> Fetch time: Fri Sep 16 00:18:36 CDT 2011
> Modified time: Wed Dec 31 18:00:00 CST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata: _pst_: temp_moved(13), lastModified=0:
> http://developer.att.com/developer/tier1page.jsp?passedItemId=100006&_requestid=35037
>
> http://developer.att.com/developer/100006 Version: 7
> Status: 5 (db_redir_perm)
> Fetch time: Fri Sep 16 00:43:33 CDT 2011
> Modified time: Wed Dec 31 18:00:00 CST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 0.0
> Signature: null
> Metadata: _pst_: moved(12), lastModified=0:
> http://developer.att.com/developer/forward.jsp?passedItemId=100006
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/nutch-redirect-treatment-tp3261546p3261546.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

