Re: Nutch generate fetch lists for a single domain (but with multiple urls) crawl

2012-10-18 Thread shri_s_ram
I get it now! Thanks a lot! I was running my crawl command with fetcher.parse set to true, which was causing the problem. On Thu, Oct 18, 2012 at 5:53 PM, Markus Jelsma-2 [via Lucene] < ml-node+s472066n4014609...@n3.nabble.com> wrote: > You would have to check the generator code to make sure. But wh
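For reference, a minimal nutch-site.xml sketch that defers parsing to a separate step (assuming the stock Nutch property fetcher.parse):

  <!-- nutch-site.xml: disable fetch-time parsing so parsing runs as its own step -->
  <property>
    <name>fetcher.parse</name>
    <value>false</value>
  </property>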

RE: Nutch 2.x : ParseUtil failing for some pdf files

2012-10-18 Thread j.sullivan
Kiran, I took a look at your nutch-site.xml and I did not see anything for http.accept. I believe nutch-default.xml does not include application/pdf by default in http.accept, so you may need to add it in your nutch-site.xml. Please take a look at the example below from my nutch-site.xml h
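A rough sketch of such an override in nutch-site.xml; the exact Accept value is an assumption modelled on the nutch-default.xml default, with application/pdf added:

  <!-- nutch-site.xml: advertise PDF in the Accept header sent with HTTP requests -->
  <property>
    <name>http.accept</name>
    <value>text/html,application/xhtml+xml,application/xml;q=0.9,application/pdf,*/*;q=0.8</value>
  </property>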

RE: Nutch generate fetch lists for a single domain (but with multiple urls) crawl

2012-10-18 Thread Markus Jelsma
You would have to check the generator code to make sure. But why would you want to distribute the queue for a single domain to multiple mappers? A single locally running mapper without parsing, even on a low-end machine, can easily fetch 20-40 records per second from the same domain (if it allows you to
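The per-domain fetch rate in that single mapper is bounded mainly by the politeness settings; a sketch of the relevant nutch-site.xml properties (the values shown are illustrative and should only be made more aggressive if the site allows it):

  <!-- nutch-site.xml: politeness settings that bound the fetch rate per host queue -->
  <property>
    <name>fetcher.server.delay</name>
    <value>1.0</value>  <!-- seconds between successive requests to the same host -->
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>1</value>    <!-- concurrent threads per host/domain/IP queue -->
  </property>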

Re: Nutch 2.x : ParseUtil failing for some pdf files

2012-10-18 Thread kiran chitturi
Hi James, I have increased the limit in nutch-site.xml ( https://github.com/salvager/nutch/blob/master/nutch-site.xml) and I have created the webpage table based on the fields here ( http://nlp.solutions.asia/?p=180). The database still shows the parseStatus as '–org.apache.nutch.parse.Pars
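Assuming the limit referred to is http.content.limit, a sketch of the override in nutch-site.xml (the stock default truncates downloaded content at 65536 bytes, which can cut PDFs short before parsing):

  <property>
    <name>http.content.limit</name>
    <value>-1</value>  <!-- -1 disables truncation of downloaded content -->
  </property>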

RE: Nutch generate fetch lists for a single domain (but with multiple urls) crawl

2012-10-18 Thread shri_s_ram
Thanks. But I thought there would be a way around it. Is it even possible to have multiple fetch lists generated for this case by tweaking some parameters? [I am thinking of something like partition.url.mode - byRandom]

RE: Nutch generate fetch lists for a single domain (but with multiple urls) crawl

2012-10-18 Thread Markus Jelsma
Hi - the generator tool partitions URLs by host, domain or IP address, so for a single domain they'll all end up in the same fetch list. Since you're doing only one domain there is no need to run additional mappers. If you want to crawl them as fast as you can (and you are allowed to do that), then use only one mapper
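For reference, a sketch of the property in question as it could appear in nutch-site.xml; the supported modes are byHost, byDomain and byIP, and there is no byRandom mode, so URLs of a single domain always land in one fetch list:

  <property>
    <name>partition.url.mode</name>
    <value>byHost</value>  <!-- byHost (default), byDomain or byIP -->
  </property>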

Re: Same pages crawled more than once and slow crawling

2012-10-18 Thread Julien Nioche
Off topic: we are talking about an issue with the SQL backend in GORA, not the performance of Nutch. Julien On 18 October 2012 20:28, Stefan Scheffler wrote: > Hi, > The reason Nutch is so slow is that all of the steps use Hadoop > jobs, which take a long time to start. There is also

Re: Same pages crawled more than once and slow crawling

2012-10-18 Thread Stefan Scheffler
Hi, The reason Nutch is so slow is that all of the steps use Hadoop jobs, which take a long time to start. There is also a hardcoded 3-second delay somewhere in the Hadoop core, which makes sense in distributed systems but not on single machines. Regards Stefan On 18.10.2012 17:55,

Re: Same pages crawled more than once and slow crawling

2012-10-18 Thread alxsss
Hello, I think the problem is with the storage, not Nutch itself. It looks like generate cannot read the status or fetch time (or gets null values) from MySQL. I had a bunch of issues with MySQL storage and switched to HBase in the end. Alex. -Original Message- From: Sebastian Nagel
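For anyone making the same switch, a rough sketch of the Nutch 2.x setting in nutch-site.xml (it also assumes the gora-hbase dependency is enabled in ivy/ivy.xml and an hbase-site.xml is on the classpath):

  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>  <!-- use HBase instead of the SQL back-end -->
  </property>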

Re: Same pages crawled more than once and slow crawling

2012-10-18 Thread Sebastian Nagel
Hi Luca, > I'm using Nutch 2.1 on Linux and I'm having a similar problem to > http://goo.gl/nrDLV: my Nutch is > fetching the same pages at each round. Um... I failed to reproduce Pierre's problem with - a simpler configuration - HBase as back-end (Pierre and Luca both use MySQL) > Then I ran "bin

Same pages crawled more than once and slow crawling

2012-10-18 Thread Luca Vasarelli
Hello, I'm using Nutch 2.1 on Linux and I'm having a similar problem to http://goo.gl/nrDLV: my Nutch is fetching the same pages at each round. I've built a simple localhost site with 3 pages linked to each other: first.htm -> second.htm -> third.htm. I did these steps: - downloaded nutch 2.1 (source
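For context, a minimal sketch of one Nutch 2.x crawl round run by hand, assuming a seed list in a urls/ directory; option names are the ones shipped with 2.1 and may differ slightly between versions:

  bin/nutch inject urls/          # seed the webpage table
  bin/nutch generate -topN 10     # create a new fetch batch
  bin/nutch fetch -all            # fetch the generated batch
  bin/nutch parse -all            # parse the fetched content
  bin/nutch updatedb              # update statuses and links for the next round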

building from src

2012-10-18 Thread sumarlidason
Good morning, I am working on building Nutch from source on CentOS to be used in conjunction with Solr and Hadoop. So far I have... downloaded the source ( http://www.gtlib.gatech.edu/pub/apache/nutch/2.1/ ), built it with ant successfully, created a bin folder, downloaded the nutch script ( https://s
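A rough sketch of the usual build steps for the 2.1 source tree (assuming ant is installed and ivy can pull dependencies); the ant runtime target creates runtime/local with its own bin/nutch, so a separately downloaded script is normally not needed:

  ant runtime        # resolves dependencies and builds runtime/local and runtime/deploy
  cd runtime/local
  bin/nutch          # with no arguments, prints the list of available commands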

Re: Fetcher Thread

2012-10-18 Thread Ye T Thet
Thanks Markus. I will remember to set thread.per.host to 1. Cheers, Ye On Thu, Oct 18, 2012 at 9:55 PM, Markus Jelsma wrote: > Hi Ye, > > -Original message- > > From:Ye T Thet > > Sent: Thu 18-Oct-2012 15:46 > > To: user@nutch.apache.org > > Subject: Fetcher Thread > > > > Hi Folks,

RE: Fetcher Thread

2012-10-18 Thread Markus Jelsma
Hi Ye, -Original message- > From:Ye T Thet > Sent: Thu 18-Oct-2012 15:46 > To: user@nutch.apache.org > Subject: Fetcher Thread > > Hi Folks, > > I have two questions about the Fetcher Thread in Nutch. The value > fetcher.threads.fetch in the configuration file determines the number of > th
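A sketch of the two properties being discussed, as they could appear in nutch-site.xml; in recent releases the per-host setting is named fetcher.threads.per.queue (older releases used fetcher.threads.per.host, which the thread refers to as thread.per.host), so the exact name depends on the version:

  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>  <!-- total fetcher threads per fetch task -->
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>1</value>   <!-- threads allowed on one host/domain/IP queue at a time -->
  </property>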