Hi Adelaida,

Assuming that you have been able to successfully crawl the top-level domain http://elcorreo.com, i.e. that you have been able to crawl and create an index, then at least we know that your configuration options are OK.
I assume that you are using 1.2... can you confirm? What does the rest of your crawl-urlfilter.txt look like? Have you been setting any properties in nutch-site.xml which might alter Nutch behaviour? I am not perfect with the syntax for creating filter rules in crawl-urlfilter.txt... can someone confirm that this rule is correct?

On Mon, Jun 13, 2011 at 12:10 PM, Adelaida Lejarazu <alejar...@gmail.com> wrote:

> Hello,
>
> I'm new to Nutch and I'm doing some tests to see how it works. I want to
> crawl a digital newspaper website. To do so, I put the URL I want to crawl,
> http://elcorreo.com, in the urls directory where I keep my seed list.
> The thing is that I don't want to crawl all the news on the site, only the
> articles from the current day, so I added a filter to crawl-urlfilter.txt
> (for the moment I'm using the crawl command). The filter I added is:
>
> +^http://www.elcorreo.com/.*?/20110613/.*?.html
>
> A correct URL would be, for example:
>
> http://www.elcorreo.com/vizcaya/20110613/mas-actualidad/politica/lopez-consta-pactado-bildu-201106131023.html
>
> so I think the regular expression is correct, but Nutch doesn't crawl
> anything. It says: "No URLs to Fetch - check your seed list and URL
> filters."
>
> Am I missing something?
>
> Thanks,

--
Lewis
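One quick way to sanity-check the filter rule outside of Nutch is to try the regex against both the seed URL and a known-good article URL. This is only a sketch using Python's re module (Nutch's URL filters use Java regex, but the semantics are the same for this pattern); note that the seed URL http://elcorreo.com itself does not match the rule, which could be why the injector reports no URLs to fetch:

```python
import re

# The filter line from crawl-urlfilter.txt, minus the leading "+"
pattern = r"^http://www.elcorreo.com/.*?/20110613/.*?.html"

seed_url = "http://elcorreo.com"
article_url = ("http://www.elcorreo.com/vizcaya/20110613/"
               "mas-actualidad/politica/lopez-consta-pactado-bildu-201106131023.html")

# re.match anchors at the start of the string, matching the "^" in the rule
seed_matches = bool(re.match(pattern, seed_url))
article_matches = bool(re.match(pattern, article_url))

print("seed matches filter:", seed_matches)        # False
print("article matches filter:", article_matches)  # True
```

If that is the cause, the usual fix is either to seed with a URL that passes the filter, or to add an extra accept rule for the seed page itself so the crawler can reach the article links from it.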