Running the Nutch crawler on a three-slave Hadoop cluster. At the third depth, the
reduce tasks fail with the following exception:
2011-05-03 15:36:00,786 WARN org.apache.hadoop.conf.Configuration:
/mnt/hadoop/mapred/local/taskTracker/jobcache/job_201105022308_0101/attempt_201105022308_0101_r_02_0/job.xml:
Hi,
I am crawling one site.
I want to create an index, but I want to index the full content including the
HTML markup. I have searched Google but didn't find any help.
Can anybody provide a clue?
Thanks,
Meenakshi
Hello everyone, I'm a newbie to Nutch... sorry if the question is silly...
I've installed Nutch according to the steps of the official tutorial.
Everything seems OK, and the crawl completes (just with some errors on
specific pages), but I cannot get any results through the browser search.
My cata
Hi Roberto,
By the looks of it, this has to do with correctly defining your searcher.dir
property in nutch-site.xml.
If you have set this property previously with 'file:/path/to/index', then remove
the 'file:' and just try 'path/to/index'.
How are you running Nutch? Although in this case Catalina.
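For example (a sketch only; the path here is illustrative, so substitute your actual crawl directory), the property in nutch-site.xml would look like:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>searcher.dir</name>
    <!-- plain filesystem path, without the 'file:' prefix -->
    <value>/path/to/index</value>
  </property>
</configuration>
```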
Hello,
I would like to crawl wikipedia using Nutch, but as it is too large, I would
only like to crawl pages that are related to a particular subject.
For example, I would like to crawl for webpages of wikipedia that contain the
term "Football". Is this possible using Nutch?
Thank you for your
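One partial approach (a sketch, not confirmed in this thread): Nutch's out-of-the-box filtering works on URLs, not page content, so you can at least confine the crawl to Wikipedia via conf/regex-urlfilter.txt; restricting by a term like "Football" would need content-side filtering at parse or index time.

```
# conf/regex-urlfilter.txt (sketch): only follow English Wikipedia article URLs
+^http://en\.wikipedia\.org/wiki/
# reject everything else
-.
```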
Thank you very much Lewis! It seems ok now...
The property "searcher.dir" was not set at all (the tutorial does not
mention it...).
I edited /var/lib/tomcat6/webapps/nutch/WEB-INF/classes/nutch-site.xml
this way:
<property>
  <name>searcher.dir</name>
  <value>file:/var/apache-nutch-1.1-bin/crawl.test/</value>
</property>
the "file:" pref
On Wed, May 4, 2011 at 5:20 PM, Kelvin wrote:
> Hello,
>
> I would like to crawl wikipedia using Nutch, but as it is too large, I
> would only like to crawl pages that are related to a particular subject.
>
> For example, I would like to crawl for webpages of wikipedia that contain
> the term "Fo
Hi,
I would rather use the wikipedia dumps!
You should have a look at jwpl http://code.google.com/p/jwpl/
BR
Hannes
On Wed, May 4, 2011 at 5:20 PM, Kelvin wrote:
> Hello,
>
> I would like to crawl wikipedia using Nutch, but as it is too large, I
> would only like to crawl pages that are rela
Try running these before your script / crawl command:
$ xmlstarlet edit -L -u "/configuration/property[name='http.agent.name']/value" -v 'test'
conf/nutch-default.xml
$ xmlstarlet sel -t -v "/configuration/property[name='http.agent.name']/value"
conf/nutch-default.xml
After the second you should see printed:
test
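If xmlstarlet is not installed, the same check can be sketched with POSIX tools alone (the file created here is a minimal, illustrative nutch-site.xml, not your real config):

```shell
# Write a minimal, illustrative config file with http.agent.name set
cat > nutch-site-example.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>test</value>
  </property>
</configuration>
EOF

# Pull out the <value> that follows the matching <name> element
agent=$(grep -A1 '<name>http.agent.name</name>' nutch-site-example.xml \
  | sed -n 's:.*<value>\(.*\)</value>.*:\1:p')
echo "$agent"   # prints: test
```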
Hi Gabriele,
Thank you for your help. I am sorry, I am a newbie to Nutch. If I crawl the
whole wikipedia, will the whole wikipedia be stored in the crawldb of my server?
And will this take up a lot of space?
I also need to crawl youtube, to look for videos whose metatags contain
"Football",
Hi Hannes,
Thanks for the suggestion, I will have a look at the wikipedia dumps. What is your
advice on integrating the downloaded data from the wikipedia dumps with Lucene? Can
I use Lucene to index it directly? My initial thought is to get the MySQL
version of the wikipedia dumps, then use Lusql t
On Wed, May 4, 2011 at 6:22 PM, Kelvin wrote:
> Hi Gabriele,
>
> Thank you for your help. I am sorry, I am a newbie to Nutch. If I crawl the
> whole wikipedia, will the whole wikipedia be stored in the crawldb of my
> server?
>
I think so (I'm also a newbie).
>
>
> And this will take up a very b
>Now nutch web search works, but just for one of two sites configured
Just to clarify: are you saying that the pages you configured have been
fetched, processed, and indexed but do not appear when you submit a query, or
that Nutch is failing to fetch one site when you are crawling?
>moreover
Backwards from what you want, but may help. Using the original URL:
bin/nutch readdb output/crawldb -url 'http://example.org/original/url/'
Replace "output" with the name of your crawl output directory. If it was
redirected, the "Metadata" will say "moved" and show you where. If there
were mul
OK, everything seems to work now. I've just created four separate
'conf' and 'url' files (two sites with two language versions each) and
four Tomcat Nutch instances, following this guide:
http://wiki.apache.org/nutch/GettingNutchRunningWithDebian
Thank you again for your help!