Please also add +^http://www.elcorreo.com/$ to your filter. Otherwise the seed page itself is excluded: it does not match your date pattern, so there is nothing to fetch in the very first step.
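
To illustrate, here is a rough Python sketch of how the filter list treats the seed page. It is not Nutch's code. It assumes that rules are tried top to bottom, that the first matching rule decides, that a URL matching no rule is rejected, and that patterns may match anywhere in the URL unless anchored. The names compile_rules and accepts are made up for this example.

```python
import re

# Rules written the way they appear in crawl-urlfilter.txt:
# a leading '+' accepts matching URLs, a leading '-' rejects them.
DATE_RULE = "+^http://www.elcorreo.com/.*?/20110613/.*?.html"
SEED_RULE = "+^http://www.elcorreo.com/$"


def compile_rules(lines):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    compiled = []
    for line in lines:
        sign, pattern = line[0], line[1:]
        compiled.append((sign == "+", re.compile(pattern)))
    return compiled


def accepts(url, rules):
    """Assumed semantics: first matching rule wins, no match means rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


only_date = compile_rules([DATE_RULE])
with_seed = compile_rules([SEED_RULE, DATE_RULE])

seed_page = "http://www.elcorreo.com/"
article = ("http://www.elcorreo.com/vizcaya/20110613/mas-actualidad/"
           "politica/lopez-consta-pactado-bildu-201106131023.html")

print(accepts(seed_page, only_date))  # seed page is filtered out
print(accepts(seed_page, with_seed))  # seed page passes with the extra rule
print(accepts(article, with_seed))    # today's article still passes
```

In this sketch the seed exactly as written in your list, http://elcorreo.com (no www, no trailing slash), matches neither rule. Whether URL normalization or a redirect changes that in a real crawl is not modelled here.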

On Mon, Jun 13, 2011 at 1:44 PM, Adelaida Lejarazu <[email protected]> wrote:

> Yes...is my only filter.....
>
> > You should have at least a filter for the seed page you are accessing in
> > the very first step!
>
> Sorry....but, I don´t understand what you are talking about...In my seed
> list I only have http://elcorreo.com and I have the filter to it.
>
> Regards
>
> Adelaida.
>
>
> 2011/6/13 Hannes Carl Meyer <[email protected]>
>
>> Hi,
>>
>> is this your only filter? You should have at least a filter for the seed
>> page you are accessing in the very first step!
>>
>> Regards
>>
>> Hannes
>>
>> On Mon, Jun 13, 2011 at 1:10 PM, Adelaida Lejarazu <[email protected]> wrote:
>>
>> > Hello,
>> >
>> > I´m new to Nutch and I´m doing some tests to see how it works. I want to do
>> > some crawling in a digital newspaper webpage. To do so, I put in the urls
>> > directory where I have my seed list the URL I want to crawl that is:
>> > *http://elcorreo.com*
>> > The thing is that I don´t want to crawl all the news in the site but only
>> > the ones of the current day, so I put a filter in the
>> > *crawl-urlfilter.txt* (for the moment I´m using the
>> > *crawl* command). The filter I put is:
>> >
>> > +^http://www.elcorreo.com/.*?/20110613/.*?.html
>> >
>> > A correct URL would be for example,
>> >
>> > http://www.elcorreo.com/vizcaya/20110613/mas-actualidad/politica/lopez-consta-pactado-bildu-201106131023.html
>> >
>> > so, I think the regular expression is correct but Nutch doesn´t crawl
>> > anything. It says that there are *No Urls to Fetch - check your seed list
>> > and URL filters.*
>> >
>> >
>> > Am I missing something ??
>> >
>> > Thanks,
>> >
>>
>>
>>
>> Hannes C. Meyer
>> www.informera.de
>>

