Error in Nutch Fetch

2011-05-04 Thread Amin Bandeali
Running the Nutch crawler on a 3-slave Hadoop cluster. At the third depth, the reduces fail with the following exception: 2011-05-03 15:36:00,786 WARN org.apache.hadoop.conf.Configuration: /mnt/hadoop/mapred/local/taskTracker/jobcache/job_201105022308_0101/attempt_201105022308_0101_r_02_0/job.xml:

Full Content with HTML Markup

2011-05-04 Thread Meenakshi Kanaujia
Hi, I am crawling one site. I want to create an index, but I want it to contain the full content with HTML markup. I have searched Google but didn't find any help. Can anybody provide a clue? Thanks, Meenakshi

Newbie: No search result

2011-05-04 Thread Roberto
Hello everyone, I'm a newbie to Nutch... sorry if the question is silly... I've installed Nutch following the steps of the official tutorial. Everything seems OK, and the crawl completes (with only some errors on specific pages), but I cannot get any results through the browser search. My cata

RE: Newbie: No search result

2011-05-04 Thread McGibbney, Lewis John
Hi Roberto, By the looks of it, this has to do with correctly defining your searcher.dir property in nutch-site.xml. If you have previously set this property to 'file:/path/to/index', then remove the 'file:' and just try 'path/to/index'. How are you running Nutch? Although in this case catalina.

Can I custom crawl using Nutch?

2011-05-04 Thread Kelvin
Hello, I would like to crawl wikipedia using Nutch, but as it is too large, I would only like to crawl pages that are related to a particular subject. For example, I would like to crawl for webpages of wikipedia that contain the term "Football". Is this possible using Nutch? Thank you for your

Re: Newbie: No search result

2011-05-04 Thread Roberto
Thank you very much Lewis! It seems OK now... The property "searcher.dir" was not set at all (the tutorial does not mention it...). I edited /var/lib/tomcat6/webapps/nutch/WEB-INF/classes/nutch-site.xml this way: searcher.dir file:/var/apache-nutch-1.1-bin/crawl.test/ the "file:" pref
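The flattened "searcher.dir file:/var/..." text above presumably stood for a Hadoop-style `<property>` entry with a `<name>` and a `<value>` child. As a rough sketch only, the Python snippet below builds such an entry; the helper name `build_site_config` is hypothetical, and the value is simply the one quoted in the message, not a recommendation.

```python
import xml.etree.ElementTree as ET


def build_site_config(properties):
    """Return a Hadoop-style <configuration> document as a string.

    `properties` maps property names to values. This helper is
    illustrative only and is not part of Nutch.
    """
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Value copied from the message above; replace it with your own crawl directory.
xml_text = build_site_config(
    {"searcher.dir": "file:/var/apache-nutch-1.1-bin/crawl.test/"}
)
print(xml_text)
```

The printed element could then be merged by hand into the `nutch-site.xml` used by the web application.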

Re: Can I custom crawl using Nutch?

2011-05-04 Thread Gabriele Kahlout
On Wed, May 4, 2011 at 5:20 PM, Kelvin wrote: > Hello, > > I would like to crawl wikipedia using Nutch, but as it is too large, I > would only like to crawl pages that are related to a particular subject. > > For example, I would like to crawl for webpages of wikipedia that contain > the term "Fo

Re: Can I custom crawl using Nutch?

2011-05-04 Thread Hannes Carl Meyer
Hi, I would rather use the wikipedia dumps! You should have a look at jwpl http://code.google.com/p/jwpl/ BR Hannes On Wed, May 4, 2011 at 5:20 PM, Kelvin wrote: > Hello, > > I would like to crawl wikipedia using Nutch, but as it is too large, I > would only like to crawl pages that are rela

Re: Re: Error: No agents listed in 'http.agent.name' property.

2011-05-04 Thread Gabriele Kahlout
Try prefixing your script/crawling command with: $ xmlstarlet edit -L -u "/configuration/property[name='http.agent.name']"/value -v 'test' conf/nutch-default.xml $ xmlstarlet sel -t -c "/configuration/property[name='http.agent.name']"/value conf/nutch-default.xml After the second command you should see printed: test
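For readers without xmlstarlet, the same update-then-read idea can be sketched in Python with the standard library. This is an assumption-laden illustration: it works on an in-memory sample rather than on `conf/nutch-default.xml`, and the helpers `set_property` and `get_property` are hypothetical names, not Nutch or Hadoop APIs.

```python
import xml.etree.ElementTree as ET

# Small stand-in for a Hadoop-style configuration file (assumed layout).
SAMPLE = """<configuration>
  <property>
    <name>http.agent.name</name>
    <value></value>
  </property>
</configuration>"""


def set_property(root, name, value):
    """Set <value> of the property called `name`; return True if it was found."""
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_el = prop.find("value")
            if value_el is None:
                # Create the <value> child if the property lacks one.
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            return True
    return False


def get_property(root, name):
    """Return the text of <value> for the property called `name`, or None."""
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


root = ET.fromstring(SAMPLE)
updated = set_property(root, "http.agent.name", "test")
print(get_property(root, "http.agent.name"))
```

To apply this to a real file, one would load it with `ET.parse(path)`, call `set_property` on `tree.getroot()`, and write it back with `tree.write(path)`; note that ElementTree does not preserve comments by default.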

Re: Can I custom crawl using Nutch?

2011-05-04 Thread Kelvin
Hi Gabriele, Thank you for your help. I am sorry, I am a newbie to Nutch. If I crawl the whole of Wikipedia, will all of it be stored in the crawldb of my server? And will this take up a very large amount of space? I also need to crawl YouTube, to look for videos whose metatags contain "Football",

Re: Can I custom crawl using Nutch?

2011-05-04 Thread Kelvin
Hi Hannes, Thanks for the suggestion, I will have a look at the Wikipedia dumps. What is your advice on integrating the downloaded data from the Wikipedia dumps with Lucene? Can I use Lucene to index it directly? My initial thought is to get the MySQL version of the Wikipedia dumps, then use Lusql t

Re: Can I custom crawl using Nutch?

2011-05-04 Thread Gabriele Kahlout
On Wed, May 4, 2011 at 6:22 PM, Kelvin wrote: > Hi Gabriele, > > Thank you for your help. I am sorry, I am a newbie to nutch. If I crawl the > whole wikipedia, the whole wikipedia will be stored in the crawldb of my > server? > I think so (I'm also a newbie). > > > And this will take up a very b

RE: Newbie: No search result

2011-05-04 Thread McGibbney, Lewis John
>Now nutch web search works, but just for one of two sites configured. Just to clarify, are you saying that the pages you configured have been fetched, processed and indexed but do not appear when you submit a query, or that Nutch is failing to fetch one site when you are crawling? >moreover

Re: Getting original URL for redirect

2011-05-04 Thread Mark Achee
Backwards from what you want, but may help. Using the original URL: bin/nutch readdb output/crawldb -url 'http://example.org/original/url/' Replace "output" with the name of your crawl output directory. If it was redirected, the "Metadata" will say "moved" and show you where. If there were mul

Re: Newbie: No search result

2011-05-04 Thread Roberto
OK, everything seems to work now. I've just created four separate 'conf' and 'url' files (two sites with two language versions each) and four Tomcat Nutch instances, following this guide: http://wiki.apache.org/nutch/GettingNutchRunningWithDebian Thank you again for your help!