Yes, it is my only filter.

> You should have at least a filter for the seed page you are accessing in the very first step!

Sorry, but I don't understand what you mean. My seed list contains only http://elcorreo.com, and I have the filter for it.
Regards,
Adelaida

2011/6/13 Hannes Carl Meyer <[email protected]>

> Hi,
>
> is this your only filter? You should have at least a filter for the seed
> page you are accessing in the very first step!
>
> Regards
>
> Hannes
>
> On Mon, Jun 13, 2011 at 1:10 PM, Adelaida Lejarazu <[email protected]> wrote:
>
> > Hello,
> >
> > I'm new to Nutch and I'm doing some tests to see how it works. I want to do
> > some crawling in a digital newspaper webpage. To do so, I put in the urls
> > directory where I have my seed list the URL I want to crawl that is:
> > *http://elcorreo.com*
> > The thing is that I don't want to crawl all the news in the site but only
> > the ones of the current day, so I put a filter in the
> > *crawl-urlfilter.txt* (for the moment I'm using the *crawl* command).
> > The filter I put is:
> >
> > +^http://www.elcorreo.com/.*?/20110613/.*?.html
> >
> > A correct URL would be for example,
> >
> > http://www.elcorreo.com/vizcaya/20110613/mas-actualidad/politica/lopez-consta-pactado-bildu-201106131023.html
> >
> > so, I think the regular expression is correct but Nutch doesn't crawl
> > anything. It says that there are *No Urls to Fetch - check your seed list
> > and URL filters.*
> >
> > Am I missing something ??
> >
> > Thanks,
>
> Hannes C. Meyer
> www.informera.de
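To make Hannes's point concrete, here is a minimal sketch that checks the seed URL and the example article URL against the filter from the thread. It is only an illustration: Nutch evaluates its URL filters with Java regular expressions, and Python's `re` is used here as a stand-in, which behaves the same for this particular pattern. The helper `accepted` and the extra rule `SEED_FILTER` are hypothetical names introduced for the example and are not part of Nutch.

```python
import re

# Include rule from the thread, without the leading "+" that marks it
# as an "accept" rule in crawl-urlfilter.txt.
DAY_FILTER = r"^http://www.elcorreo.com/.*?/20110613/.*?.html"

# Hypothetical extra rule (an assumption, not from the thread) that would
# also admit the seed page, with or without "www." and a trailing slash.
SEED_FILTER = r"^http://(www\.)?elcorreo\.com/?$"


def accepted(url, patterns):
    """Return True if at least one include pattern is found in the URL."""
    return any(re.search(pattern, url) for pattern in patterns)


seed = "http://elcorreo.com"
article = (
    "http://www.elcorreo.com/vizcaya/20110613/mas-actualidad/politica/"
    "lopez-consta-pactado-bildu-201106131023.html"
)

# The seed has no "www." and no date segment, so the day filter rejects it.
print(accepted(seed, [DAY_FILTER]))                # False
# The article URL itself is matched by the day filter.
print(accepted(article, [DAY_FILTER]))             # True
# With a rule for the seed page added, the seed is accepted as well.
print(accepted(seed, [DAY_FILTER, SEED_FILTER]))   # True
```

The regular expression does match the article URL, but the seed never gets past the filter, which would be consistent with the "No Urls to Fetch" message. Whether a rule like `SEED_FILTER` is enough on its own depends on how the site links to its articles, so treat it as a starting point rather than a complete configuration.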

