Hi,

Just change the line in crawl-urlfilter.txt from

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

to

+^http://([a-z0-9]*\.)*
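
To see why the change matters, here is a rough sketch that tests the seed URL and the example article URL from the message below against both patterns. It uses Python's re module as a stand-in for the Java regex engine Nutch uses, and it assumes the filter is applied to the seed URL itself with "found anywhere in the URL" semantics. The helper name accepted is made up for illustration and is not part of Nutch.

```python
import re

# Filter from the original message, without the leading '+'
# (in the filter file, '+' marks an accept rule).
ORIGINAL_FILTER = r"^http://www.elcorreo.com/.*?/20110613/.*?.html"

# Relaxed pattern suggested in this reply, also without the '+'.
RELAXED_FILTER = r"^http://([a-z0-9]*\.)*"

# Seed URL and example article URL, both taken from the original message.
SEED = "http://elcorreo.com"
ARTICLE = (
    "http://www.elcorreo.com/vizcaya/20110613/mas-actualidad/"
    "politica/lopez-consta-pactado-bildu-201106131023.html"
)


def accepted(pattern, url):
    """Hypothetical helper: True if the pattern is found in the URL."""
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    for name, pattern in (("original", ORIGINAL_FILTER), ("relaxed", RELAXED_FILTER)):
        for url in (SEED, ARTICLE):
            print(name, accepted(pattern, url), url)
```

Under these assumptions the original filter accepts the article URL but rejects the seed (no "www.", no date, no ".html"), so nothing is left to fetch. The relaxed pattern accepts both, but it also accepts every other http URL, so it no longer limits the crawl to one day.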


> Date: Mon, 13 Jun 2011 13:10:40 +0200
> Subject: No Urls to fetch
> From: [email protected]
> To: [email protected]
> 
> Hello,
> 
> I'm new to Nutch and I'm doing some tests to see how it works. I want to
> crawl a digital newspaper website. To do so, I put the URL I want to crawl,
> *http://elcorreo.com*, in the seed list in the urls directory.
> The thing is that I don't want to crawl all the news on the site, only
> the articles from the current day, so I put a filter in
> *crawl-urlfilter.txt* (for the moment I'm using the
> *crawl* command). The filter I put is:
> 
> +^http://www.elcorreo.com/.*?/20110613/.*?.html
> 
> A correct URL would be for example,
> http://www.elcorreo.com/vizcaya/20110613/mas-actualidad/politica/lopez-consta-pactado-bildu-201106131023.html
> 
> So I think the regular expression is correct, but Nutch doesn't crawl
> anything. It says that there are *No Urls to Fetch - check your seed list
> and URL filters.*
> 
> 
> Am I missing something?
> 
> Thanks,
                                          
