Hi Anita. I tried crawling autos.aol.com, and I could find pages similar to what you're looking at in 3 crawls. (I injected http://autos.aol.com/ and added autos.aol.com to my regex filter to allow it.)
e.g.

fetching http://autos.aol.com/bmw-650-2007:8774-photos
fetching http://autos.aol.com/article/general/v2/_a/auto-financing-101/20060818153509990001
fetching http://autos.aol.com/options_trimless?v=8544
fetching http://autos.aol.com/toyota-camry-hybrid-2007:8322-overviewl
fetching http://autos.aol.com/bmw-m-2007:8905-overview
fetching http://autos.aol.com/getaquote?myid=8623
fetching http://autos.aol.com/options_trimless?v=8226
fetching http://autos.aol.com/options_trimless?v=7803
fetching http://autos.aol.com/article/power/v2/_a/2006-dodge-charger-srt8/20061030193309990001
fetching http://autos.aol.com/bmw-x3-2007:8770-specs
fetching http://autos.aol.com/saturn-vue-2007:8371-overview
fetching http://autos.aol.com/aston-martin-vanquish-2006:8115-overview
fetching http://autos.aol.com/options_trimless?v=8394
fetching http://autos.aol.com/jaguar-listings:JA---
fetching http://autos.aol.com/volkswagen-rabbit-2007:8554-overview
fetching http://autos.aol.com/bmw-x5-2007:8817-overview
fetching http://autos.aol.com/audi-a4-2007:8622-specs
fetching http://autos.aol.com/options_trimless?v=8416
fetching http://autos.aol.com/getaquote?myid=8774

The difference is that I am using the latest Nutch (SVN head), and am just using a local store, not Hadoop.

What I would do next if I were you is to check your regex filters to make sure you are not blocking things with a colon ':' in them for some strange reason, and possibly upgrade to the latest and greatest version of Nutch (0.9.1).

Regards,
Ian.

On 18/04/2007, at 5:56 AM, [EMAIL PROTECTED] wrote:

> Hi
>
> I am a new Nutch user, and am using Nutch 8.1 with Hadoop. The domain I am
> trying to crawl is http://autos.aol.com. I am crawling to the depth of 10.
>
> There are certain pages that Nutch could not fetch. An example would be
> http://autos.aol.com/acura-rl-2006:8060-review.
>
> The referring url to this page is
> http://autos.aol.com/acura-rl-2007:8060-review. This url was there
> in the fetch list.
>
> I did a mini crawl pointing directly to
> http://autos.aol.com/acura-rl-2007:8060-review, then the page
> http://autos.aol.com/acura-rl-2006:8060-review gets fetched.
>
> Does anyone have any ideas on why I am seeing this behavior.
>
> Thanks
> Anita Bidari (X55746)

Ian Holsman
[EMAIL PROTECTED]
http://parent-chatter.com -- what do parents know?
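
One way to act on the suggestion to check the regex filters is to replay a few URLs against the filter rules outside of Nutch. The following is only a rough sketch in Python, not Nutch code. It assumes the usual filter-file convention where each rule starts with '+' (accept) or '-' (reject), the first rule that matches anywhere in the URL decides, and a URL that matches no rule is rejected. The two rule sets (SUSPECT_RULES, FIXED_RULES) and the helper names are hypothetical and exist only to show how a colon-blocking rule would behave; substitute the contents of your own filter file.

```python
import re

# Hypothetical rule sets for illustration only; replace with your own file.
SUSPECT_RULES = r"""
# reject any URL with a colon after the host part (hypothetical rule)
-^http://[^/]+/.*:
+^http://([a-z0-9]*\.)*autos\.aol\.com/
-.
"""

FIXED_RULES = r"""
+^http://([a-z0-9]*\.)*autos\.aol\.com/
-.
"""

def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything without a +/- prefix.
        if not line or line.startswith("#") or line[0] not in "+-":
            continue
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(url, rules):
    """First matching rule decides; no match means the URL is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False

if __name__ == "__main__":
    urls = [
        "http://autos.aol.com/acura-rl-2006:8060-review",
        "http://autos.aol.com/acura-rl-2007:8060-review",
    ]
    for name, text in (("suspect", SUSPECT_RULES), ("fixed", FIXED_RULES)):
        rules = parse_rules(text)
        for url in urls:
            verdict = "accepted" if accepts(url, rules) else "rejected"
            print(name, verdict, url)
```

If a URL with a ':' in it comes back rejected under your real rules but accepted once a particular rule is removed, that rule is the likely culprit.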
