Re: Nutch 2.1 crawling step by step and crawling command differences

2013-03-05 Thread Adriana Farina
Ok, I didn't read that issue in JIRA. Thank you very much, I'll use the crawl script! Sent from my iPhone. On 04 Mar 2013, at 18:35, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, If you look at the crawl script, IIRC there is no way to programmatically obtain the
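For reference, a minimal invocation of the bin/crawl script that ships with Nutch 2.x looks roughly like the following; the seed directory, crawl ID, Solr URL and number of rounds shown here are placeholders, not values from the thread, and the exact argument order should be checked against the usage message of your release:

  bin/crawl urls/ TestCrawl http://localhost:8983/solr/ 2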

Understanding fetch MapReduce job counters and logs

2013-03-05 Thread Amit Sela
Hi all, I am trying to better understand the counters and logging of the fetch MapReduce job executed when crawling. When looking at the job counters in the MapReduce web UI, I note the following counters and values: *Map input records 162,080* moved
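As an aside, the counters shown in the web UI can usually also be pulled from the command line; a sketch, assuming a Hadoop 1.x cluster and a placeholder job id:

  hadoop job -status job_201303050001_0042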

Re: Nutch Incremental Crawl

2013-03-05 Thread David Philip
Hi, I used the less command and checked; it shows the past content, not the modified one. Is there any other cache to clear in the crawl db, or any property to set in nutch-site.xml so that it re-fetches modified content? - Cleared the Tomcat cache - settings: <property> <name>db.fetch.interval.default</name>
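One way to see whether a URL is actually due for re-fetching is to read its entry from the crawldb; a minimal sketch for the 1.x command-line tools, with placeholder paths and URL:

  bin/nutch readdb crawl/crawldb -url http://www.example.com/page.html
  bin/nutch readdb crawl/crawldb -stats

The first command prints the stored fetch time, retry interval and status for that URL, which shows whether db.fetch.interval.default has taken effect.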

Re: Robots.db instead of robots.txt

2013-03-05 Thread Tejas Patil
robots.txt is a global standard accepted by everyone; even Google and Bing use it. I don't think there is any db file format maintained by web servers for the robots information. On Tue, Mar 5, 2013 at 1:29 AM, Raja Kulasekaran cull...@gmail.com wrote: Hi Instead of parsing robots.txt

Re: Robots.db instead of robots.txt

2013-03-05 Thread Raja Kulasekaran
Hi, I meant to move the entire crawl process to the client environment, create robots.db, and fetch only robots.db as indexed data. Raja On Tue, Mar 5, 2013 at 8:27 PM, Tejas Patil tejas.patil...@gmail.com wrote: robots.txt is a global standard accepted by everyone. Even Google and Bing use

Rest API for Nutch 2.x

2013-03-05 Thread Anand Bhagwat
Hi, I already know that Nutch provides command-line tools for crawling and indexing. I also read somewhere that it has a REST API. Do you have any documentation around it, its capabilities, limitations, etc.? Regards, Anand

Re: Robots.db instead of robots.txt

2013-03-05 Thread Tejas Patil
Nutch internally caches the robots rules (it uses a hash map) in every round. It will fetch the robots file for a particular host just once in a given round. This model works out well. If you are creating a separate db for it, then you have to ensure that it is updated frequently to take into

Re: Nutch 1.6 : How to reparse Nutch segments ?

2013-03-05 Thread kiran chitturi
Thanks Tejas. Deleting the 'crawl_parse' directory worked for me today. On Mon, Mar 4, 2013 at 11:15 PM, Tejas Patil tejas.patil...@gmail.com wrote: Yes. After I deleted that directory, the parse operation ran successfully. Even if it's an empty directory, parse won't proceed normally. On Mon,
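For anyone hitting the same problem, a minimal sketch of the workaround described in this thread, with a placeholder segment path (any other existing parse output directories, such as parse_data and parse_text, may need the same treatment):

  rm -r crawl/segments/20130305123456/crawl_parse
  bin/nutch parse crawl/segments/20130305123456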

Find which URL created exception

2013-03-05 Thread raviksingh
Hi, I am new to Nutch. I am using Nutch with MySQL. While trying to crawl http://piwik.org/xmlrpc.php, Nutch throws an exception: Parsing http://piwik.org/xmlrpc.php Call completed java.lang.RuntimeException: job failed: name=update-table, jobid=null at

Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

2013-03-05 Thread kiran chitturi
Tejas, I have a total of 364k files fetched in my last crawl, and I used a topN of 2000 and 2 threads per queue. The gap I have noticed is between 5-8 minutes. I had a total of 180 rounds in my crawl (I had some big crawls at the beginning with a topN of 10k, but after it crashed I changed topN to

Re: Find which URL created exception

2013-03-05 Thread kiran chitturi
Hi! Looking at 'logs/hadoop.log' will give you more information on why the job has failed. To check whether a single URL can be crawled, please use the parseChecker tool [0]. [0] - http://wiki.apache.org/nutch/bin/nutch%20parsechecker I have checked using parseChecker and it worked for me. On Tue,
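For completeness, a parserChecker run against the URL from this thread looks roughly like this (the -dumpText flag is optional and prints the extracted text):

  bin/nutch parsechecker -dumpText http://piwik.org/xmlrpc.php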

Re: Find which URL created exception

2013-03-05 Thread raviksingh
This is the log: The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type text/plain, but they are not mapped to it in the parse-plugins.xml file 2013-03-05 22:06:54,076 WARN parse.ParseUtil -

Re: Parse statistics in Nutch

2013-03-05 Thread kiran chitturi
Thanks Lewis. I will give this a try. On Tue, Mar 5, 2013 at 12:59 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: There are a few things you can do, Kiran. My preference is to use custom counters for successfully and unsuccessfully parsed docs within the ParserJob or equivalent.

Continue Nutch Crawling After Exception

2013-03-05 Thread raviksingh
I am new to Nutch. I have already configured Nutch with MySQL. I have a few questions: 1. Currently I am crawling all the domains from my SEED.TXT. If an exception occurs, the crawling stops and some domains are not crawled, just because of one domain/webpage. Is there a way to force Nutch to

recrawl - will it re-fetch and parse all the URLS again?

2013-03-05 Thread David Philip
Hi all, When I do a full re-crawl, the old URLs that were modified should be updated, correct? That is not happening. Please correct me where I am wrong. Below is the list of steps: - properties set: db.fetch.interval.default=600sec, db.injector.update=true - crawl: bin/nutch crawl urls
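For comparison, the same crawl expressed as explicit steps makes it easier to see whether updatedb actually runs after each re-fetch; a minimal sketch for the 1.x command-line tools (paths and topN are placeholders; the 2.x tools take slightly different arguments):

  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  SEGMENT=$(ls -d crawl/segments/* | tail -1)
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT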

Re: Continue Nutch Crawling After Exception

2013-03-05 Thread Lewis John Mcgibbney
Hi, On Tue, Mar 5, 2013 at 7:22 AM, raviksingh ravisingh.air...@gmail.com wrote: I am new to Nutch. I have already configured Nutch with MySQL. I have a few questions: I would like to start by saying that this is not a great idea. If you read this list you will see why. 1. Currently I am

Re: Rest API for Nutch 2.x

2013-03-05 Thread Lewis John Mcgibbney
Documentation - no. Prior art - yes: http://www.mail-archive.com/user@nutch.apache.org/msg06927.html JIRA issues - NUTCH-932. Please let us know how you get on. Getting some concrete documentation for this would be excellent. Thank you Lewis On Tue, Mar 5, 2013 at 7:33 AM, Anand Bhagwat

keep all pages from a domain in one slice

2013-03-05 Thread Jason S
Hello, I seem to remember seeing a discussion about this in the past but I can't seem to find it in the archives. When using mergesegs -slice, is it possible to keep all the pages from a domain in the same slice? I have just been messing around with this functionality (Nutch 1.6), and it

RE: keep all pages from a domain in one slice

2013-03-05 Thread Markus Jelsma
Hi, You can't do this with -slice, but you can merge segments and filter them. This would mean you'd have to merge the segments for each domain, but that's far too much work. Why do you want to do this? There may be better ways of achieving your goal. -Original message- From: Jason S
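A rough sketch of the merge-and-filter approach described here, assuming Nutch 1.6's SegmentMerger with a per-domain URL filter configured first (the domain and paths are placeholders):

  # conf/regex-urlfilter.txt: keep only the target domain, e.g.
  #   +^https?://([a-z0-9-]+\.)*example\.com/
  #   -.
  bin/nutch mergesegs crawl/merged_example -dir crawl/segments -filter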

Re: Nutch Incremental Crawl

2013-03-05 Thread feng lu
Hi, I used the less command and checked; it shows the past content, not the modified one. Any other cache to clear in the crawl db, or any property to set in nutch-site so that it re-fetches modified content? As far as I know, the crawl db does not use a cache. As Markus says, you can simply

Re: keep all pages from a domain in one slice

2013-03-05 Thread feng lu
Hi, Maybe you can implement the SegmentMergeFilter interface to filter segments during the segment merge. On Wed, Mar 6, 2013 at 6:02 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, You can't do this with -slice, but you can merge segments and filter them. This would mean you'd have to merge

Re: keep all pages from a domain in one slice

2013-03-05 Thread Stubblefield Jason
I have several Solr 3.6 instances that, for various reasons, I don't want to upgrade to 4.0 yet. My index is too big to fit on one machine. I want to be able to slice the crawl so that I can have one slice per Solr shard, but also use the grouping feature in Solr. From what I understand, Solr

Re: keep all pages from a domain in one slice

2013-03-05 Thread Lewis John Mcgibbney
Hi Jason, There is nothing I can see here which concerns Nutch. Try the Solr lists, please. Thank you Lewis On Tuesday, March 5, 2013, Stubblefield Jason mr.jason.stubblefi...@gmail.com wrote: I have several Solr 3.6 instances that, for various reasons, I don't want to upgrade to 4.0 yet. My index