Thank you, Manu, for that excellent discussion of the topic. I could have
been more detailed about my use case.

We'll be indexing off of the main production servers (either on a master or
in Hadoop; we have yet to build out that piece of the puzzle). We don't store
documents at all; we only store the index data and return a document ID.
Each document is small, maybe 1 KB of text. We do have a few "interesting"
queries in which we do some grouping.

We currently index 100 GB of input data, which will grow 2x or 3x in the near
future.

So based on your experience, it seems likely that we'll be CPU bound (heavy
queries against a static index updated nightly from the master), thus
nullifying the advantage of dual-purposing a box with another CPU bound app.

Very useful discussion. I'll get proper load tests done in time, but this
helps direct my thinking now.
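
For those load tests, a minimal sketch of how the CPU-vs-IO question could be
read off sampled utilization follows. Everything in it is an assumption for
illustration: the (cpu_busy_pct, iowait_pct) sample format, the function name
classify_bottleneck, and the 80% / 20% thresholds are hypothetical, not
anything Solr provides.

```python
from statistics import mean


def classify_bottleneck(samples, busy_threshold=80.0, iowait_threshold=20.0):
    """Classify a load-test run from utilization samples.

    samples: list of (cpu_busy_pct, iowait_pct) tuples collected during the
    test, e.g. from an OS monitoring tool. cpu_busy_pct is user + system
    time and excludes iowait. Both thresholds are illustrative assumptions.
    """
    if not samples:
        raise ValueError("need at least one sample")

    avg_cpu = mean(s[0] for s in samples)
    avg_iowait = mean(s[1] for s in samples)

    # A high average iowait suggests the CPUs are stalled on disk.
    if avg_iowait >= iowait_threshold:
        return "io-bound"
    # High CPU with little iowait suggests the queries are compute-heavy.
    if avg_cpu >= busy_threshold:
        return "cpu-bound"
    return "neither saturated"


if __name__ == "__main__":
    # Made-up numbers, purely to show the call shape.
    print(classify_bottleneck([(92.0, 3.0), (88.0, 5.0)]))
```

If a run under realistic query load comes back "cpu-bound", that would support
not sharing the box with another CPU-heavy app.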

David



-----Original Message-----
From: idokis...@gmail.com [mailto:idokis...@gmail.com] On Behalf Of Manuel
Le Normand
Sent: Monday, March 18, 2013 9:57 AM
To: solr-user@lucene.apache.org
Subject: Re: Is Solr more CPU bound or IO bound?

Your question is typically use-case dependent; the bottleneck will change
from user to user.

These are the two main issues that will affect the answer:
1. How do you index: what is your indexing rate (how many docs a day)? How
big is a typical document? How many documents do you plan on indexing in
total? Do you store fields? Do you calculate their term vectors?
2. What does your retrieval process look like: what query rate is expected?
Are there common queries (taking advantage of the cache)? How complex are the
queries (faceted / highlighted / filtered / how many conditions, NRT)? Do you
plan to retrieve stored fields or only IDs?

After answering all that, there's an iterative game between hardware
configuration and software configuration (how you split your shards, use
your cache, tune your merges and flushes, etc.) that would also affect the
IO / CPU bound answer.

In my use case, for example, the indexing part is IO bound, but as my
indexing rate is well below the rate my machine could provide from the
start, it didn't affect my hardware spec.
After fine-tuning my configuration I discovered my retrieval process was CPU
bound and was directly affecting my avg response time, while the IO rate
in cache usage was quite low.

Try describing your use case in more detail with the above questions so
we'd be able to give you guidelines.

Best,
Manu


On Mon, Mar 18, 2013 at 3:55 AM, David Parks <davidpark...@yahoo.com> wrote:

> I'm spec'ing out some hardware for a first go at our production Solr 
> instance, but I haven't spent enough time loadtesting it yet.
>
>
>
> What I want to ask is how IO intensive Solr is vs. CPU intensive, 
> typically.
>
>
>
> Specifically I'm considering whether to dual-purpose the Solr servers 
> to run Solr and another CPU-only application we have. I know Solr uses 
> a fair amount of CPU, but if it also is very disk intensive it might 
> be a net benefit to have more instances running Solr and share the CPU 
> resources with the other app than to run Solr separate from the other 
> CPU app that wouldn't otherwise use the disk.
>
>
>
> Thoughts on this?
>
>
>
> Thanks,
>
> David
>
>
>
>
