Yes, it is my only filter.
>You should have at least a filter for the seed page you are accessing in
the very first step!
Sorry, but I don't understand what you mean. My seed list contains only
http://elcorreo.com, and I have the filter for it.
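
For reference, the setup described in this thread can be checked outside Nutch. The sketch below is Python, not Nutch's own Java code. It assumes that a rule applies when its pattern is found in the URL and that a URL matching no "+" rule is dropped. The names DAY_RULE, SEED_RULE and accepted are made up for illustration, and SEED_RULE is only a hypothetical example of "a filter for the seed page", not a line from any shipped configuration.

```python
import re

# The single accept rule from crawl-urlfilter.txt, without the leading "+".
DAY_RULE = re.compile(r"^http://www.elcorreo.com/.*?/20110613/.*?.html")

# Hypothetical extra rule that would let the seed page itself through.
SEED_RULE = re.compile(r"^http://(www\.)?elcorreo\.com/?$")


def accepted(url, rules):
    """Return True if any accept rule is found in the URL (assumed semantics)."""
    return any(rule.search(url) is not None for rule in rules)


seed = "http://elcorreo.com"
article = (
    "http://www.elcorreo.com/vizcaya/20110613/mas-actualidad/politica/"
    "lopez-consta-pactado-bildu-201106131023.html"
)

print(accepted(seed, [DAY_RULE]))             # False: the seed lacks "www." and the date path
print(accepted(article, [DAY_RULE]))          # True: the article URL matches the day rule
print(accepted(seed, [DAY_RULE, SEED_RULE]))  # True once the seed has its own rule
```

Under these assumptions the day filter accepts the article URL but not the seed, which would leave nothing to fetch in the first round.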

Regards

Adelaida.

2011/6/13 Hannes Carl Meyer <[email protected]>

> Hi,
>
> is this your only filter? You should have at least a filter for the seed
> page you are accessing in the very first step!
>
> Regards
>
> Hannes
>
> On Mon, Jun 13, 2011 at 1:10 PM, Adelaida Lejarazu <[email protected]> wrote:
>
> > Hello,
> >
> > I'm new to Nutch and I'm doing some tests to see how it works. I want to
> > crawl a digital newspaper website. To do so, I put the URL I want to
> > crawl, *http://elcorreo.com*, in the seed list in my urls directory.
> > I don't want to crawl all the news on the site, only the articles from
> > the current day, so I put a filter in *crawl-urlfilter.txt* (for the
> > moment I'm using the *crawl* command). The filter is:
> >
> > +^http://www.elcorreo.com/.*?/20110613/.*?.html
> >
> > A correct URL would be, for example:
> >
> > http://www.elcorreo.com/vizcaya/20110613/mas-actualidad/politica/lopez-consta-pactado-bildu-201106131023.html
> >
> > So I think the regular expression is correct, but Nutch doesn't crawl
> > anything. It says: *No Urls to Fetch - check your seed list and URL
> > filters.*
> >
> >
> > Am I missing something?
> >
> > Thanks,
> >
>
>
>
> Hannes C. Meyer
> www.informera.de
>
