Hi Anita.

I tried crawling autos.aol.com, and I could find pages similar to
what you're looking at in 3 crawls. (I injected http://autos.aol.com/
and added autos.aol.com to my regex filter to allow it.)


e.g.
fetching http://autos.aol.com/bmw-650-2007:8774-photos
fetching http://autos.aol.com/article/general/v2/_a/auto-financing-101/20060818153509990001
fetching http://autos.aol.com/options_trimless?v=8544
fetching http://autos.aol.com/toyota-camry-hybrid-2007:8322-overviewl
fetching http://autos.aol.com/bmw-m-2007:8905-overview
fetching http://autos.aol.com/getaquote?myid=8623
fetching http://autos.aol.com/options_trimless?v=8226
fetching http://autos.aol.com/options_trimless?v=7803
fetching http://autos.aol.com/article/power/v2/_a/2006-dodge-charger-srt8/20061030193309990001
fetching http://autos.aol.com/bmw-x3-2007:8770-specs
fetching http://autos.aol.com/saturn-vue-2007:8371-overview
fetching http://autos.aol.com/aston-martin-vanquish-2006:8115-overview
fetching http://autos.aol.com/options_trimless?v=8394
fetching http://autos.aol.com/jaguar-listings:JA---
fetching http://autos.aol.com/volkswagen-rabbit-2007:8554-overview
fetching http://autos.aol.com/bmw-x5-2007:8817-overview
fetching http://autos.aol.com/audi-a4-2007:8622-specs
fetching http://autos.aol.com/options_trimless?v=8416
fetching http://autos.aol.com/getaquote?myid=8774

The differences are that I am using the latest Nutch (SVN head), and
that I am just using a local store, not Hadoop.

What I would do next, if I were you, is check your regex filters to
make sure you are not blocking URLs with a colon ':' in them for
some strange reason, and possibly upgrade to the latest and greatest
version of Nutch (0.9.1).
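
To make the filter check concrete, here is a minimal Python sketch of how
a first-match-wins list of +/- regex rules can silently drop URLs that
contain a colon. The rule set is hypothetical (it is not taken from your
configuration), the names parse_rules and check are made up for
illustration, and Python's re module is standing in for Nutch's Java
regex handling, so treat it as an approximation of the behaviour rather
than what Nutch itself runs.

```python
import re

# Hypothetical rules in the "+regex" / "-regex" style of a URL filter file.
# The accept rule's character class has no ':' in it, which is the kind of
# accidental blocking worth looking for.
RULES = r"""
# skip common static resources
-\.(gif|jpg|png|css|js)$
# accept autos.aol.com, but only paths made of these characters
+^http://autos\.aol\.com/[a-zA-Z0-9/_.-]*$
# reject everything else
-.
"""


def parse_rules(text):
    """Turn rule lines into a list of (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def check(url, rules):
    """Return (accepted, pattern) for the first rule that matches the URL.

    A URL that matches no rule at all is treated as rejected.
    """
    for accept, regex in rules:
        if regex.search(url):  # partial match anywhere in the URL
            return accept, regex.pattern
    return False, None


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in (
        "http://autos.aol.com/acura-rl-2006:8060-review",
        "http://autos.aol.com/article/general",
    ):
        accepted, pattern = check(url, rules)
        print("ACCEPT" if accepted else "REJECT", url, "via", pattern)
```

Running your own filter lines and one of the URLs that fails to fetch
through something like this shows which rule decides the outcome. With
the hypothetical rules above, the colon URL falls through the accept
rule and is rejected by the final catch-all.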

regards
Ian.



On 18/04/2007, at 5:56 AM, [EMAIL PROTECTED] wrote:

> Hi
>
> I am a new Nutch user, and am using Nutch 8.1 with Hadoop. The
> domain I am trying to crawl is http://autos.aol.com . I am crawling
> to a depth of 10.
> There are certain pages that Nutch could not fetch. An example
> would be
> http://autos.aol.com/acura-rl-2006:8060-review .
>
> The referring URL to this page is
> http://autos.aol.com/acura-rl-2007:8060-review . This URL was there
> in the fetch list.
>
> I did a mini crawl pointing directly to
> http://autos.aol.com/acura-rl-2007:8060-review , and then the page
> http://autos.aol.com/acura-rl-2006:8060-review gets fetched.
>
> Does anyone have any ideas on why I am seeing this behavior?
>
>
> Thanks
> Anita Bidari (X55746)

Ian Holsman
[EMAIL PROTECTED]
http://parent-chatter.com -- what do parents know?


