Re: Hardware Specifications

Sean Dean Thu, 12 Jun 2008 21:53:28 -0700

So, in the simplest terms, you're sort of agreeing with me: running multiple
Nutch processes on a multi-core processor is more efficient than running one
single process on heavily scaled hardware.

Am I correct with this statement?



----- Original Message ----
From: Otis Gospodnetic <[EMAIL PROTECTED]>
To: nutch-user@lucene.apache.org
Sent: Friday, June 13, 2008 12:16:38 AM
Subject: Re: Hardware Specifications

I'm not sure -- I try to avoid running a single Nutch job at a time, as I
find overlapping jobs more efficient.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: Sean Dean <[EMAIL PROTECTED]>
> To: nutch-user@lucene.apache.org
> Sent: Thursday, June 12, 2008 12:37:19 PM
> Subject: Re: Hardware Specifications
> 
> I see.
>  
> What happens to the utilization when only one job is running? Does it stay
> about equal at a lower overall percentage, or does it move predominantly to
> one core?
> 
> 
> 
> ----- Original Message ----
> From: "[EMAIL PROTECTED]" 
> To: nutch-user@lucene.apache.org
> Sent: Thursday, June 12, 2008 12:17:10 AM
> Subject: Re: Hardware Specifications
> 
> Hm, hm.
> 
> I can't speak for Nutch's search (I don't have it running at the moment),
> but I am looking at a cluster that is running a fetch job and a generate job
> concurrently, and I see both cores on the dual-core server being utilized
> about equally.
> 
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> ----- Original Message ----
> > From: Sean Dean 
> > To: nutch-user@lucene.apache.org
> > Sent: Saturday, June 7, 2008 3:52:33 AM
> > Subject: Re: Hardware Specifications
> > 
> > Hey Otis,
> >  
> > I will first disclose that the OS I'm using for my Nutch implementation is
> > FreeBSD 7 (amd64), which may differ from a standard 64-bit Linux
> > distribution. The JDK, however, is your standard Sun 1.5.0-14 64-bit
> > package.
> >  
> > I find that the JVM does not treat Nutch as something that's truly
> > multithreaded. Whichever task you ask it to do, be it serve results, fetch,
> > inject, update, etc., it will always peg one core and not use anything else
> > (sometimes it will share processing with another core, but this is just the
> > garbage collection thread inside the JVM).
> >  
> > Having smaller indexes (15-20M) on multiple Nutch instances (with 4GB or so
> > of RAM each) doesn't fix this limitation, but it does cheat, in that each
> > instance runs as its own independent JVM. As such, the OS scheduler (in my
> > case FreeBSD's ULE) will execute each instance's operations on the core
> > with the lowest utilization.
> >  
> > When you think about it, this type of setup scales very well horizontally,
> > much like Nutch/Hadoop itself. I find that creating one huge index on the
> > same machine and giving it everything it has in terms of resources has
> > diminishing returns and, as my example points out, never uses it all anyway.
> >  
> > One negative about this setup, though, is detailed in NUTCH-92. This issue
> > alone kills any attempt to scale your search engine for "mainstream"
> > commercial success (e.g. Google).
> > 
> > 
> > 
> > ----- Original Message ----
> > From: "[EMAIL PROTECTED]" 
> > To: nutch-user@lucene.apache.org
> > Sent: Friday, June 6, 2008 12:20:41 PM
> > Subject: Re: Hardware Specifications
> > 
> > Dan, you left out one important "bit" - this is a 64-bit machine?
> > 
> > Sean, out of curiosity... is this really better than running a single JVM
> > instance, single Nutch instance, on a multi-core 64-bit machine with 32GB
> > of RAM and letting the OS switch between cores?
> > 
> > 
> > As for fetching/indexing/searching - you probably don't want to do this on
> > the same set of machines.  Use a set of machines for fetching/indexing, and
> > a set of machines for serving search requests.
> > 
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > 
> > 
> > ----- Original Message ----
> > > From: Sean Dean 
> > > To: nutch-user@lucene.apache.org
> > > Sent: Thursday, June 5, 2008 3:45:41 PM
> > > Subject: Re: Hardware Specifications
> > > 
> > > > Another idea is to set up 8 separate Nutch instances on the same server,
> > > > each with its own 20M index.
> > >  
> > > > The idea behind this is that one core per application will be used,
> > > > although it's not pegged, and the RAM is used in ~4GB chunks (JVM
> > > > setting) for each instance.
> > >  
> > > > This would be used for serving results only, though; you would have to
> > > > disable part or all of this when in fetching mode, but it would give you
> > > > 160M pages and still very good speeds (about 4-5 per second or more, as
> > > > other factors come into play). Keep in mind we use 8 hard drives, each
> > > > associated with its own instance on the server, but as long as the RAID
> > > > FC setup you have is very fast, the results should be comparable (maybe
> > > > even faster).
> > > 
> > > 
> > > ----- Original Message ----
> > > From: Dennis Kubes 
> > > To: nutch-user@lucene.apache.org
> > > Sent: Thursday, June 5, 2008 2:38:04 PM
> > > Subject: Re: Hardware Specifications
> > > 
> > > > In-memory index: 15M pages. On-disk index: slower, but still doable
> > > > where response time isn't critical, ~350M pages, maybe more.
> > > 
> > > Dennis
> > > 
> > > Dan Segel wrote:
> > > > > We have a server that has 30TB of hard drive space connected through
> > > > > fiber, 2 quad-core 2.5GHz CPUs, and 32GB of RAM.  If serving 5
> > > > > searches per second, how many million indexed pages do you think we
> > > > > can achieve?
> > > >
