Re: keep all pages from a domain in one slice

2013-03-05 Thread Lewis John Mcgibbney
Hi Jason, There is nothing I can see here which concerns Nutch. Try the Solr lists please. Thank you Lewis On Tuesday, March 5, 2013, Stubblefield Jason <mr.jason.stubblefi...@gmail.com> wrote: > I have several Solr 3.6 instances that, for various reasons, I don't want to upgrade to 4.0 yet. My index

Re: keep all pages from a domain in one slice

2013-03-05 Thread Stubblefield Jason
I have several Solr 3.6 instances that, for various reasons, I don't want to upgrade to 4.0 yet. My index is too big to fit on one machine. I want to be able to slice the crawl so that I can have one slice per Solr shard, but also use the grouping feature on Solr. From what I understand, Solr gr

Re: keep all pages from a domain in one slice

2013-03-05 Thread feng lu
Hi Maybe you can implement the SegmentMergeFilter interface to filter segments during the segment merge. On Wed, Mar 6, 2013 at 6:02 AM, Markus Jelsma wrote: > Hi > > You can't do this with -slice but you can merge segments and filter them. > This would mean you'd have to merge the segments for each do
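A minimal sketch of the kind of filter feng lu suggests, keeping only one host per merge run. The filter(...) signature below is assumed from the Nutch 1.x org.apache.nutch.segment.SegmentMergeFilter interface and should be checked against your source tree; the hard-coded host is purely illustrative (in practice it would come from configuration, and the class would be registered as a plugin).

    import java.net.URL;
    import java.util.Collection;

    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.parse.ParseData;
    import org.apache.nutch.parse.ParseText;
    import org.apache.nutch.protocol.Content;
    import org.apache.nutch.segment.SegmentMergeFilter;

    // Keeps only entries whose host matches KEEP_HOST, so merging all
    // segments with this filter active yields a per-domain segment.
    public class HostSegmentMergeFilter implements SegmentMergeFilter {

      private static final String KEEP_HOST = "example.com"; // hypothetical value

      public boolean filter(Text key, CrawlDatum generateData, CrawlDatum fetchData,
          CrawlDatum sigData, Content content, ParseData parseData,
          ParseText parseText, Collection<CrawlDatum> linked) {
        try {
          String host = new URL(key.toString()).getHost();
          return host.endsWith(KEEP_HOST);   // true = keep the record in the merge
        } catch (Exception e) {
          return false;                      // drop records with unparsable URLs
        }
      }
    }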

Re: Nutch Incremental Crawl

2013-03-05 Thread feng lu
Hi << I used the less command and checked; it shows the past content, not the modified one. Any other cache clearing from the crawl db? Or any property to set in nutch-site so that it does re-fetch modified content? >> As far as I know, the crawl db does not use a cache. As Markus says, you can simply r

RE: keep all pages from a domain in one slice

2013-03-05 Thread Markus Jelsma
Hi You can't do this with -slice but you can merge segments and filter them. This would mean you'd have to merge the segments for each domain. But that's far too much work. Why do you want to do this? There may be better ways of achieving your goal. -Original message- > From: Jason S
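A rough sketch of the merge-and-filter workflow Markus describes, one pass per domain. It assumes the standard mergesegs options (-dir, -filter) and that regex-urlfilter.txt (or a per-run conf dir) is pointed at a single domain before each pass; the domain and paths are examples only.

    # 1. restrict the URL filters to one domain, e.g. in regex-urlfilter.txt:
    #      +^https?://([a-z0-9-]+\.)*example\.com/
    #      -.
    # 2. merge all segments, applying the URL filters during the merge:
    bin/nutch mergesegs crawl/segments-example.com -dir crawl/segments -filter
    # 3. repeat with a different filter for every domain / Solr shard.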

keep all pages from a domain in one slice

2013-03-05 Thread Jason S
Hello, I seem to remember seeing a discussion about this in the past but I can't seem to find it in the archives. When using mergesegs -slice, is it possible to keep all the pages from a domain in the same slice? I have just been messing around with this functionality (Nutch 1.6), and it seem
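For reference, the -slice usage being discussed looks like the following (paths and the slice size are arbitrary); as I understand it, -slice splits purely by number of records, which is why the replies in this thread say it cannot group pages by domain.

    # split the merged output into slices of at most 50,000 entries each:
    bin/nutch mergesegs crawl/segments-merged -dir crawl/segments -slice 50000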

Re: Rest API for Nutch 2.x

2013-03-05 Thread Lewis John Mcgibbney
Documentation - No prior art - yes - http://www.mail-archive.com/user@nutch.apache.org/msg06927.html Jira issues - NUTCH-932 Please let us know how you get on. Getting some concrete documentation for this would be excellent. Thank you Lewis On Tue, Mar 5, 2013 at 7:33 AM, Anand Bhagwat wrote: >

Re: Continue Nutch Crawling After Exception

2013-03-05 Thread Lewis John Mcgibbney
Hi, On Tue, Mar 5, 2013 at 7:22 AM, raviksingh wrote: > I am new to Nutch. I have already configured Nutch with MySQL. I have a few > questions: > I would like to start by saying that this is not a great idea. If you read this list you will see why. > > 1. Currently I am crawling all the domains f

Re: recrawl - will it re-fetch and parse all the URLS again?

2013-03-05 Thread David Philip
Solr URL is http://localhost:8080/solrnutch, versions: Solr 3.6, Nutch 1.6. The commands and log below had a copy-paste problem. -David On Wed, Mar 6, 2013 at 12:03 AM, David Philip wrote: > Hi all, > > When I am doing a full re-crawl, the old urls that are modified should be > updated, correct?

recrawl - will it re-fetch and parse all the URLS again?

2013-03-05 Thread David Philip
Hi all, When I am doing a full re-crawl, the old urls that are modified should be updated, correct? That is not happening. Please correct me where I am wrong. Below is the list of steps: - property set: db.fetch.interval.default=600sec, db.injector.update=true - crawl: bin/nutch crawl urls -
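The two properties from the steps above, as they would look in nutch-site.xml (the values are simply the ones stated in the mail; a 600-second fetch interval is unusually aggressive and presumably only for testing):

    <property>
      <name>db.fetch.interval.default</name>
      <value>600</value>
    </property>
    <property>
      <name>db.injector.update</name>
      <value>true</value>
    </property>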

Continue Nutch Crawling After Exception

2013-03-05 Thread raviksingh
I am new to Nutch. I have already configured Nutch with MySQL. I have a few questions: 1. Currently I am crawling all the domains from my SEED.TXT. If some exception occurs, the crawling stops and some domains are not crawled, just because of one domain/webpage. Is there a way to force Nutch to contin

Re: Parse statistics in Nutch

2013-03-05 Thread kiran chitturi
Thanks Lewis. I will give this a try. On Tue, Mar 5, 2013 at 12:59 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote: > There are a few things you can do Kiran. > My preference is to use custom counters for successfully and unsuccessfully > parsed docs within the ParserJob or equival

Re: Parse statistics in Nutch

2013-03-05 Thread Lewis John Mcgibbney
There are a few things you can do, Kiran. My preference is to use custom counters for successfully and unsuccessfully parsed docs within the ParserJob or equivalent. I would be surprised if this is not already there, however. It is not much trouble to add counters to something like this. We already d
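A minimal sketch of the kind of custom counter Lewis has in mind, using the stock Hadoop counter API inside a mapper. The class name, group, and counter names here are made up for illustration; in practice the increments would live inside ParserJob (or its mapper), and the counters then show up in the MapReduce web UI and in the job client output.

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ParseCountingMapper extends Mapper<Text, Text, Text, Text> {
      @Override
      protected void map(Text url, Text content, Context context)
          throws IOException, InterruptedException {
        try {
          // ... parse the document here ...
          context.getCounter("ParserStatus", "parsed").increment(1);
        } catch (Exception e) {
          context.getCounter("ParserStatus", "failed").increment(1);
        }
      }
    }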

Re: Find which URL created exception

2013-03-05 Thread raviksingh
This is the log: The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type text/plain, but they are not mapped to it in the parse-plugins.xml file 2013-03-05 22:06:54,076 WARN parse.ParseUtil - U
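The warning in that log usually means the mime type is missing from conf/parse-plugins.xml. A snippet along these lines (assuming the stock file layout, where parse-tika is aliased to TikaParser further down in the same file) maps text/plain to Tika:

    <mimeType name="text/plain">
      <plugin id="parse-tika" />
    </mimeType>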

Re: Find which URL created exception

2013-03-05 Thread kiran chitturi
Hi! Looking at 'logs/hadoop.log' will give you more information on why the job has failed. To check if a single URL can be crawled, please use the parseChecker tool [0]. [0] - http://wiki.apache.org/nutch/bin/nutch%20parsechecker I have checked using parseChecker and it worked for me. On Tue, M
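For reference, an invocation along the lines of the wiki page cited above (the -dumpText flag and the URL are just examples, taken from this thread):

    bin/nutch parsechecker -dumpText http://piwik.org/xmlrpc.php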

Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

2013-03-05 Thread kiran chitturi
Tejas, I have a total of 364k files fetched in my last crawl and I used a topN of 2000 and 2 threads per queue. The gap I have noticed is between 5-8 minutes. I had a total of 180 rounds in my crawl (I had some big crawls at the beginning with a topN of 10k but after it crashed I changed topN to 2
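For readers following along, the two knobs mentioned above are the generate step's topN and the fetcher's per-queue thread count; the values below are just the ones stated in the mail, not recommendations.

    # topN caps how many URLs each round generates for fetching:
    bin/nutch generate crawl/crawldb crawl/segments -topN 2000
    # threads per queue is a nutch-site.xml property:
    #   <name>fetcher.threads.per.queue</name>  <value>2</value>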

Find which URL created exception

2013-03-05 Thread raviksingh
Hi, I am new to Nutch. I am using Nutch with MySQL. While trying to crawl http://piwik.org/xmlrpc.php Nutch throws an exception: Parsing http://piwik.org/xmlrpc.php Call completed java.lang.RuntimeException: job failed: name=update-table, jobid=null at or

Re: Nutch 1.6 : How to reparse Nutch segments ?

2013-03-05 Thread kiran chitturi
Thanks Tejas. Deleting the 'crawl_parse' directory worked for me today. On Mon, Mar 4, 2013 at 11:15 PM, Tejas Patil wrote: > Yes. After I deleted that directory, the parse operation ran successfully. Even > if it's an empty directory, parse won't proceed normally. > > > On Mon, Mar 4, 2013 at 8:07
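The workaround from the thread, spelled out (the segment name below is hypothetical):

    # remove the stale parse output from the segment, then re-run the parser
    rm -r crawl/segments/20130305123456/crawl_parse
    bin/nutch parse crawl/segments/20130305123456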

Re: Robots.db instead of robots.txt

2013-03-05 Thread Tejas Patil
Nutch internally caches the robots rules (it uses a hash map) in every round. It will fetch the robots file for a particular host just once in a given round. This model works out well. If you are creating a separate db for it, then you have to ensure that it is updated frequently to take into accou
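A conceptual sketch of the per-round, per-host caching Tejas describes; this is not Nutch's actual code, just an illustration of the idea that each host's robots.txt is fetched once per round and later lookups hit the map.

    import java.util.HashMap;
    import java.util.Map;

    public class RobotsRulesCache {
      private final Map<String, String> rulesByHost = new HashMap<String, String>();

      public String getRules(String host) {
        String rules = rulesByHost.get(host);
        if (rules == null) {
          rules = fetchRobotsTxt(host);   // placeholder for the real HTTP fetch
          rulesByHost.put(host, rules);
        }
        return rules;
      }

      private String fetchRobotsTxt(String host) {
        return "";                        // stub: would fetch http://<host>/robots.txt
      }
    }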

Rest API for Nutch 2.x

2013-03-05 Thread Anand Bhagwat
Hi, I already know that Nutch provides command line tools for crawl and index. I also read somewhere that it has a REST API. Do you have any documentation around it? Its capabilities, limitations, etc.? Regards, Anand

Re: Robots.db instead of robots.txt

2013-03-05 Thread Raja Kulasekaran
Hi, I meant to move the entire crawl process into the client environment, create "robots.db", and fetch only robots.db as indexed data. Raja On Tue, Mar 5, 2013 at 8:27 PM, Tejas Patil wrote: > robots.txt is a global standard accepted by everyone. Even Google and Bing use > it. I don't think t

Re: Robots.db instead of robots.txt

2013-03-05 Thread Tejas Patil
robots.txt is a global standard accepted by everyone. Even Google and Bing use it. I don't think that there is any db file format maintained by web servers for the robots information. On Tue, Mar 5, 2013 at 1:29 AM, Raja Kulasekaran wrote: > Hi > > Instead of parsing the robots.txt file, why don't as

Re: Nutch Incremental Crawl

2013-03-05 Thread David Philip
Hi, I used the less command and checked; it shows the past content, not the modified one. Any other cache clearing from the crawl db? Or any property to set in nutch-site so that it does re-fetch modified content? - Cleared Tomcat cache - settings: db.fetch.interval.default 600 db.inj
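To see what the crawl db actually holds for a page (fetch time, status, signature) rather than relying on cached content, readdb can dump a single entry; the paths and URL below are only examples.

    bin/nutch readdb crawl/crawldb -url http://example.com/page.html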

Understanding fetch MapReduce job counters and logs

2013-03-05 Thread Amit Sela
Hi all, I am trying to better understand the counters and logging of the fetch MapReduce job executed when crawling. When looking at the job counters in the MapReduce web UI, I note the following counters and values: *Map input records 162,080* moved

Re: Nutch 2.1 crawling step by step and crawling command differences

2013-03-05 Thread Adriana Farina
Ok, I didn't read that issue on Jira. Thank you very much, I'll use the crawl script! Sent from my iPhone On Mar 4, 2013, at 18:35, Lewis John Mcgibbney wrote: > Hi, > If you look at the crawl script, IIRC there is no way to programmatically > obtain the generated batchId(s) fr