Hi everyone,
I'm fairly new to Nutch, so here are a few questions for you to answer (or for
your amusement):
1. During a Nutch crawl and subsequent re-crawls, does the crawler always pick up
new links on a page, or does it only check the ones it already knows about?
For example, if I set 20 as the limit on the number of links on a
The issue is that I don't want to crawl all the news on the site, only
the ones from the current day, so I put a filter in
*crawl-urlfilter.txt* (for the moment I'm using the
*crawl* command). The filter I put is:
+^http://www.elcorreo.com/.*?/20110613/.*?.html
A correct URL would be, for example,
http://www.elcorreo.com/vizcaya/20110613/mas-actualidad/politica/lopez-consta-pactado-bildu-201106131023.html
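As a quick sanity check, the rule from the thread can be tested outside Nutch. Nutch's urlfilter-regex plugin uses Java regular expressions, but this particular pattern behaves the same under Python's `re` module; note that the unescaped dots (e.g. before `html`) match any character, which happens to be harmless here, though `\.html` would be stricter. This is just a sketch of the matching behaviour, not Nutch itself:

```python
import re

# The '+' rule from crawl-urlfilter.txt, as quoted in the thread.
FILTER = r"^http://www.elcorreo.com/.*?/20110613/.*?.html"

def accepted(url: str) -> bool:
    # A '+' rule in crawl-urlfilter.txt accepts a URL when the pattern matches it.
    return re.search(FILTER, url) is not None

url = ("http://www.elcorreo.com/vizcaya/20110613/"
       "mas-actualidad/politica/lopez-consta-pactado-bildu-201106131023.html")
print(accepted(url))  # True
```

A URL from another day (no `/20110613/` path segment) fails the same check, which is the intended behaviour.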
Thanks for your quick response. I will try to answer all the questions:
- I am using Nutch 1.2.
- The rest of crawl-urlfilter.txt is the one that comes by default... I
haven't changed anything else; I only added the
+^http://www.elcorreo.com/.*?/20110613/.*?.html filter.
- In the nutch
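For what it's worth, rule order matters in crawl-urlfilter.txt: Nutch applies the first pattern that matches a URL and ignores the rest, so the added `+` line has to appear before the final catch-all. A sketch of the relevant ordering, with the default rules abbreviated (an illustration of the structure, not the full stock file):

```
# skip file:, ftp:, and mailto: URLs (default rule)
-^(file|ftp|mailto):

# accept only pages dated 20110613 (the rule added in this thread)
+^http://www.elcorreo.com/.*?/20110613/.*?.html

# skip everything else (default final catch-all)
-.
```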
Hello,
I'm trying to fetch a segment using Hadoop on a single node with Nutch 1.3.
I seem to be struggling with the new runtime configuration. I have Hadoop
up and running and have successfully run the readdb -stats command and
generated a segment, but when I run:
runtime/deploy/bin/nutch fetch
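For context, in deploy mode the fetch step takes the segment path as an argument. A sketch of the sequence described above, with illustrative paths (`crawl/crawldb`, `crawl/segments`, and the timestamped segment name are assumptions, not values from the thread):

```shell
# All commands run from runtime/deploy; Hadoop must be on the PATH so the
# jobs are submitted to the running cluster rather than executed locally.
bin/nutch readdb crawl/crawldb -stats             # inspect the crawldb
bin/nutch generate crawl/crawldb crawl/segments   # create a segment to fetch
bin/nutch fetch crawl/segments/20110613153000     # fetch that segment
```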
Hi Jason,
If you have Hadoop running independently from Nutch, you should use
runtime/deploy/bin. The conf files can go directly in the hadoop/conf dir, or
into the Nutch job file, which you will need to regenerate with 'ant job' so
that it reflects the changes you made in NUTCH/conf.
Julien
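Julien's 'ant job' step can be sketched as follows (NUTCH_HOME and the crawldb path are illustrative assumptions):

```shell
# Edit configuration in $NUTCH_HOME/conf, then rebuild the .job archive so
# that deploy mode ships the updated settings to Hadoop with each job.
cd $NUTCH_HOME
ant job
# Subsequent deploy-mode commands now carry the regenerated configuration:
runtime/deploy/bin/nutch readdb crawl/crawldb -stats
```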
@Julien, @all: thanks for the correction.
@Ken: you know what, I just got the book last week, and I'm in the process
of reading it. While I was reading it, I said "oops, my answer is wrong."
You guys corrected it, fine.
I came to that conclusion because I had only ever used a pseudo-distributed or a
Thanks for the help Julien, I'll just copy the files to the hadoop conf
directory for now while it is a single node.
If I use the job file, do I have to have the Nutch package on each node in
the cluster, or just on the master node?
I'm also curious if it would be possible or practical to declare
On 13 June 2011 21:15, Jason Stubblefield
<mr.jason.stubblefi...@gmail.com> wrote: