Please also add +^http://www.elcorreo.com/$ to your filter.
Otherwise the filter excludes the seed page itself, so there is nothing to fetch.
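
To illustrate, here is a rough Python sketch of how the two rules behave. It assumes the filter works as "first matching rule wins, and a URL that matches no rule is rejected", and it uses Python's re module as a stand-in for the Java regex engine Nutch uses, so treat it as an approximation. The accepts helper and the constant names are made up for this example. Note also that the seed as written in the original message, http://elcorreo.com (without www), matches neither rule in this sketch.

```python
import re


def accepts(rules, url):
    """Hypothetical helper: apply '+'/'-' prefixed regex rules in order.

    Assumption: the first rule whose pattern is found in the URL decides,
    and a URL that matches no rule at all is rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


# The date filter from the original message and the suggested seed rule.
DATE_RULE = r"+^http://www.elcorreo.com/.*?/20110613/.*?.html"
SEED_RULE = r"+^http://www.elcorreo.com/$"

SEED = "http://www.elcorreo.com/"
SEED_AS_WRITTEN = "http://elcorreo.com"
ARTICLE = (
    "http://www.elcorreo.com/vizcaya/20110613/mas-actualidad/politica/"
    "lopez-consta-pactado-bildu-201106131023.html"
)

if __name__ == "__main__":
    print(accepts([DATE_RULE], SEED))                        # False
    print(accepts([DATE_RULE], ARTICLE))                     # True
    print(accepts([DATE_RULE, SEED_RULE], SEED))             # True
    print(accepts([DATE_RULE, SEED_RULE], SEED_AS_WRITTEN))  # False
```

With only the date rule, the seed page is rejected, so the article links on it are never discovered.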


On Mon, Jun 13, 2011 at 1:44 PM, Adelaida Lejarazu <[email protected]> wrote:

> Yes, it is my only filter.
>
> > You should have at least a filter for the seed page you are accessing in
> > the very first step!
> Sorry, but I don't understand what you mean. In my seed list I only have
> http://elcorreo.com, and I have the filter for it.
>
> Regards
>
> Adelaida.
>
>
> 2011/6/13 Hannes Carl Meyer <[email protected]>
>
>> Hi,
>>
>> is this your only filter? You should have at least a filter for the seed
>> page you are accessing in the very first step!
>>
>> Regards
>>
>> Hannes
>>
>> On Mon, Jun 13, 2011 at 1:10 PM, Adelaida Lejarazu <[email protected]> wrote:
>>
>> > Hello,
>> >
>> > I'm new to Nutch and I'm doing some tests to see how it works. I want to
>> > crawl a digital newspaper website. To do so, I put the URL I want to
>> > crawl, *http://elcorreo.com*, in the seed list in my urls directory.
>> > I don't want to crawl all the news on the site, only the articles from
>> > the current day, so I put a filter in *crawl-urlfilter.txt* (for the
>> > moment I'm using the *crawl* command). The filter I put is:
>> >
>> > +^http://www.elcorreo.com/.*?/20110613/.*?.html
>> >
>> > A correct URL would be, for example:
>> >
>> > http://www.elcorreo.com/vizcaya/20110613/mas-actualidad/politica/lopez-consta-pactado-bildu-201106131023.html
>> >
>> > So I think the regular expression is correct, but Nutch doesn't crawl
>> > anything. It says *No Urls to Fetch - check your seed list and URL
>> > filters.*
>> >
>> >
>> > Am I missing something?
>> >
>> > Thanks,
>> >
>>
>>
>>
>> Hannes C. Meyer
>> www.informera.de
>>
>
>
