You may want to escape the dots, at least:

+^http://www\.elcorreo\.com/.*?/20110613/.*?\.html
I'm assuming you have the other rule at the end of your file, for filtering
everything else out:
-.*
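To illustrate why the escaping matters, here is a quick standalone check with Java's regex engine (the same engine Nutch's regex URL filter uses); the second URL is a made-up negative case, not a real address:

```java
import java.util.regex.Pattern;

public class FilterCheck {
    public static void main(String[] args) {
        // With escaped dots, "." only matches a literal dot.
        Pattern escaped = Pattern.compile(
            "^http://www\\.elcorreo\\.com/.*?/20110613/.*?\\.html");
        String article = "http://www.elcorreo.com/vizcaya/20110613/"
            + "mas-actualidad/politica/lopez-consta-pactado-bildu-201106131023.html";
        System.out.println(escaped.matcher(article).find());   // true

        // With unescaped dots, "." matches ANY character, so a bogus
        // host like this one would also slip through the filter:
        Pattern unescaped = Pattern.compile(
            "^http://www.elcorreo.com/.*?/20110613/.*?.html");
        String bogus = "http://wwwXelcorreoYcom/a/20110613/bZhtml";
        System.out.println(unescaped.matcher(bogus).find());   // true
        System.out.println(escaped.matcher(bogus).find());     // false
    }
}
```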

Now this is a problem, because you cannot seed the crawl with
http://elcorreo.com: there is no matching rule, so that URL gets rejected
immediately.
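For reference, a filter file that at least lets the seed through might look like this (a sketch; in Nutch's regex URL filter the first matching rule wins, so order matters):

```
# crawl-urlfilter.txt (sketch) -- first matching rule wins
# accept the seed / front page so the crawl can start
+^http://(www\.)?elcorreo\.com/?$
# accept the day's articles (dots escaped)
+^http://www\.elcorreo\.com/.*?/20110613/.*?\.html
# reject everything else
-.*
```

Even so, the section pages that link to the articles would still be rejected by `-.*`, which is exactly the discovery problem described next.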

Also, unless you have a sitemap page that points you to all the pages you are
looking for, you need other pages with links to the content you are
interested in. And those pages will get crawled/indexed etc...

Therefore I don't see how you can use the crawl URL filter for doing what you want.

You may want to write a special indexer plugin so that unnecessary pages get
dropped from the search index and therefore don't pollute your search results.
But you need to keep those pages and their links for the crawler to work.
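A sketch of that approach, assuming Nutch 1.x's `IndexingFilter` interface (the class name and the hard-coded pattern are illustrative, and a real plugin also needs a `plugin.xml` descriptor): returning `null` from `filter()` drops the page from the index while leaving it, and its outlinks, in the crawl db.

```java
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

/** Hypothetical filter: index only the current day's article pages. */
public class DailyArticleIndexingFilter implements IndexingFilter {

  private static final Pattern ARTICLE = Pattern.compile(
      "^http://www\\.elcorreo\\.com/.*?/20110613/.*?\\.html");

  private Configuration conf;

  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // Keep article pages; returning null drops everything else from
    // the index without removing it from the crawl.
    return ARTICLE.matcher(url.toString()).find() ? doc : null;
  }

  @Override
  public Configuration getConf() { return conf; }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }
}
```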



2011/6/13 Adelaida Lejarazu <[email protected]>

> Yes... it is my only filter.....
> >You should have at least a filter for the seed page you are accessing in
> the very first step!
> Sorry... but I don't understand what you are talking about... In my seed
> list I only have http://elcorreo.com and I have the filter for it.
>
> Regards
>
> Adelaida.
>
> 2011/6/13 Hannes Carl Meyer <[email protected]>
>
> > Hi,
> >
> > is this your only filter? You should have at least a filter for the seed
> > page you are accessing in the very first step!
> >
> > Regards
> >
> > Hannes
> >
> > On Mon, Jun 13, 2011 at 1:10 PM, Adelaida Lejarazu <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > I'm new to Nutch and I'm doing some tests to see how it works. I want
> > > to do some crawling in a digital newspaper webpage. To do so, I put in
> > > the urls directory, where I have my seed list, the URL I want to crawl,
> > > which is: *http://elcorreo.com*
> > > The thing is that I don't want to crawl all the news in the site but
> > > only the ones of the current day, so I put a filter in
> > > *crawl-urlfilter.txt* (for the moment I'm using the *crawl* command).
> > > The filter I put is:
> > >
> > > +^http://www.elcorreo.com/.*?/20110613/.*?.html
> > >
> > > A correct URL would be, for example,
> > >
> > > http://www.elcorreo.com/vizcaya/20110613/mas-actualidad/politica/lopez-consta-pactado-bildu-201106131023.html
> > >
> > > so, I think the regular expression is correct but Nutch doesn't crawl
> > > anything. It says that there are *No Urls to Fetch - check your seed
> > > list and URL filters.*
> > >
> > >
> > > Am I missing something ??
> > >
> > > Thanks,
> > >
> >
> >
> >
> > Hannes C. Meyer
> > www.informera.de
> >
>



-- 
-MilleBii-
