Thanks for your quick response. I will try to answer all the questions:
- I am using Nutch 1.2.
- The rest of the crawl-urlfilter.txt is the default one; I haven't changed
anything else, only added the filter:
+^http://www.elcorreo.com/.*?/20110613/.*?.html
- In nutch-site.xml I have the following:
<configuration>
    <property>
        <name>http.agent.name</name>
        <value>My Spider</value>
    </property>
    <property>
        <name>generate.max.per.host</name>
        <value>-1</value>
    </property>
    <property>
        <name>http.robots.agents</name>
        <value>My Spider,*</value>
        <description>The agent strings we'll look for in robots.txt files,
        comma-separated, in decreasing order of precedence. You should
        put the value of http.agent.name as the first agent name, and keep the
        default * at the end of the list. E.g.: BlurflDev,Blurfl,*
        </description>
    </property>
    <property>
        <name>plugin.includes</name>
        <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>
</configuration>
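As a quick self-check of the filter rule itself, I also tried it against both the seed URL and an article URL outside of Nutch. This is only a rough sketch using Python's `re` module as a stand-in for the Java regex engine Nutch actually uses, so behaviour could differ slightly, but it suggests the seed URL itself does not match the rule:

```python
import re

# The rule from crawl-urlfilter.txt, without the leading '+'
pattern = re.compile(r"^http://www.elcorreo.com/.*?/20110613/.*?.html")

urls = [
    # the seed URL from the urls directory
    "http://elcorreo.com",
    # a valid article URL from the same day
    "http://www.elcorreo.com/vizcaya/20110613/mas-actualidad/politica/"
    "lopez-consta-pactado-bildu-201106131023.html",
]

for url in urls:
    verdict = "accepted" if pattern.search(url) else "rejected"
    print(f"{verdict}: {url}")
```

If the Java engine behaves the same way, the seed `http://elcorreo.com` is rejected (no `www.` and no date path), so unless another `+` rule admits the seed itself, there is nothing to fetch in the first round, which might explain the "No URLs to Fetch" message below.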



2011/6/13 lewis john mcgibbney <[email protected]>

> Hi Adelaida,
>
> Assuming that you have been able to successfully crawl the top level domain
> http://elcorreo.com, i.e. that you have been able to crawl and create an
> index, then at least we know that your configuration options are OK.
>
> I assume that you are using 1.2... can you confirm?
> What does the rest of your crawl-urlfilter.txt look like?
> Have you been setting any properties in nutch-site.xml which might alter
> Nutch behaviour?
>
> I am not perfect with the syntax for creating filter rules in
> crawl-urlfilter...
> can someone confirm that this is correct?
>
> On Mon, Jun 13, 2011 at 12:10 PM, Adelaida Lejarazu <[email protected]>
> wrote:
>
> > Hello,
> >
> > I'm new to Nutch and I'm doing some tests to see how it works. I want to
> > do some crawling on a digital newspaper webpage. To do so, I put the URL I
> > want to crawl in the urls directory where I have my seed list, which is:
> > http://elcorreo.com
> > The thing is that I don't want to crawl all the news on the site but only
> > the ones of the current day, so I put a filter in crawl-urlfilter.txt
> > (for the moment I'm using the crawl command). The filter I put is:
> >
> > +^http://www.elcorreo.com/.*?/20110613/.*?.html
> >
> > A correct URL would be for example,
> >
> >
> http://www.elcorreo.com/vizcaya/20110613/mas-actualidad/politica/lopez-consta-pactado-bildu-201106131023.html
> >
> > so, I think the regular expression is correct but Nutch doesn't crawl
> > anything. It says that there are "No URLs to Fetch - check your seed
> > list and URL filters."
> >
> >
> > Am I missing something?
> >
> > Thanks,
> >
>
>
>
> --
> Lewis
>
