Crawling - basic questions.

2011-06-13 Thread tamanjit bindra
Hi everyone, I am kind of a n00b to Nutch, so here are a few questions for you to answer (or for your amusement). 1. During a Nutch crawl and subsequent crawls, does the crawler always pick up new links on a page, or does it just check for old ones? For example, if I set 20 as the limit of the number of links on a
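If the limit in question is the per-page outlink cap, that is normally controlled by the db.max.outlinks.per.page property in nutch-site.xml. A minimal sketch (the property name is the standard one; the value 20 simply mirrors the example above):

    <property>
      <name>db.max.outlinks.per.page</name>
      <value>20</value>
      <description>Maximum number of outlinks processed per page.</description>
    </property>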

No Urls to fetch

2011-06-13 Thread Adelaida Lejarazu
the news on the site but only those of the current day, so I put a filter in the *crawl-urlfilter.txt* (for the moment I'm using the *crawl* command). The filter I put is: +^http://www.elcorreo.com/.*?/20110613/.*?.html A correct URL would be, for example, http://www.elcorreo.com/vizcaya/20110613/mas

Re: No Urls to fetch

2011-06-13 Thread lewis john mcgibbney
is that I don't want to crawl all the news on the site but only those of the current day, so I put a filter in the *crawl-urlfilter.txt* (for the moment I'm using the *crawl* command). The filter I put is: +^http://www.elcorreo.com/.*?/20110613/.*?.html A correct URL would be for example, http

Re: No Urls to fetch

2011-06-13 Thread Hannes Carl Meyer
in the *crawl-urlfilter.txt* (for the moment I'm using the *crawl* command). The filter I put is: +^http://www.elcorreo.com/.*?/20110613/.*?.html A correct URL would be for example, http://www.elcorreo.com/vizcaya/20110613/mas-actualidad/politica/lopez-consta-pactado-bildu-201106131023.html

Re: No Urls to fetch

2011-06-13 Thread Adelaida Lejarazu
Thanks for your quick response. I will try to answer all the questions: - I am using Nutch 1.2. - The rest of the crawl-urlfilter.txt is the one that comes by default...I haven't changed anything else; only added the +^http://www.elcorreo.com/.*?/20110613/.*?.html filter. - In the nutch
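For reference, a crawl-urlfilter.txt along these lines might look roughly like the following. Only the elcorreo.com rule comes from the thread; the surrounding skip rules paraphrase the stock defaults, so treat this as a sketch rather than an exact copy of the Nutch 1.2 file:

    # skip file:, ftp:, and mailto: urls
    -^(file|ftp|mailto):
    # skip URLs containing characters commonly used in queries and sessions
    -[?*!@=]
    # accept only article pages for the current day on elcorreo.com
    +^http://www.elcorreo.com/.*?/20110613/.*?.html
    # skip everything else
    -.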

Nutch 1.3 fetch: No agents listed in 'http.agent.name' property

2011-06-13 Thread Jason Stubblefield
Hello, I'm trying to fetch a segment using Hadoop on a single node with Nutch 1.3. I seem to be struggling with the new runtime configuration. I have Hadoop up and running and have successfully run the readdb -stats command and generated a segment, but when I run: runtime/deploy/bin/nutch fetch
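The error in the subject line usually means that http.agent.name is empty in the configuration the fetch job actually reads. A minimal nutch-site.xml sketch (the property name is standard; the agent value here is only a placeholder):

    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>MyTestCrawler</value>
      </property>
    </configuration>

When running from runtime/deploy, the setting also has to end up in Hadoop's own conf directory or inside the job file, which is what the replies below address.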

Re: Nutch 1.3 fetch: No agents listed in 'http.agent.name' property

2011-06-13 Thread Julien Nioche
Hi Jason, If you have Hadoop running independently from Nutch, you should use runtime/deploy/bin. The conf files can go directly in the hadoop/conf dir or in the Nutch job file, which you will need to regenerate with 'ant job' so that it reflects the changes you made in NUTCH/conf. Julien On 13 June
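In practice the two options described above look roughly like this (the paths and the segment name are illustrative, not taken from the thread):

    # Option 1: let the running Hadoop pick up the Nutch settings directly
    cp $NUTCH_HOME/conf/nutch-site.xml $HADOOP_HOME/conf/

    # Option 2: edit $NUTCH_HOME/conf, then rebuild the job file so it carries the changes
    cd $NUTCH_HOME
    ant job
    runtime/deploy/bin/nutch fetch crawl/segments/20110613123456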

Re: Using multi cores on local machines

2011-06-13 Thread MilleBii
@Julien, @all: Thanks for the correction. @Ken, you know what, I just got the book last week, and I'm in the process of reading it. And whilst I was reading it, I said, oops, my answer is wrong. You guys corrected it, fine. I got to this conclusion because I have only ever used a pseudo/distributed or a

Re: Nutch 1.3 fetch: No agents listed in 'http.agent.name' property

2011-06-13 Thread Jason Stubblefield
Thanks for the help Julien, I'll just copy the files to the hadoop conf directory for now while it is a single node. If I use the job file, do I have to have the Nutch package on each node in the cluster, or just on the master node? I'm also curious whether it would be possible or practical to declare

Re: Nutch 1.3 fetch: No agents listed in 'http.agent.name' property

2011-06-13 Thread Julien Nioche
On 13 June 2011 21:15, Jason Stubblefield mr.jason.stubblefi...@gmail.com wrote: Thanks for the help Julien, I'll just copy the files to the hadoop conf directory for now while it is a single node. If I use the job file, do I have to have the Nutch package on each node in the cluster, or just