Andrew, no particular hardware setup I'm afraid. This is a general product, so we can't assume anything about the hardware it will run on. Thanks for the tip on multi-core, though.

On 12/06/2011 11:45, Andrew Kane wrote:

In the literature there is some evidence that sharding of in-memory indexes
on multi-core machines might be better.  Has anyone tried this lately?

     http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4228359

Single disk machines (HDD or SSD) would be slower.  Multi-disk or RAID type
setups might have some benefits.  What's your hardware setup?

Andrew.


On Sun, Jun 12, 2011 at 4:10 AM, Itamar Syn-Hershko <ita...@code972.com> wrote:

Thanks.


The whole point of my question was to find out whether and how to do balancing
on the SAME machine. Apparently that's not going to help, and at a certain
point we will just have to prompt the user to buy more hardware...


Out of curiosity, isn't there anything we can do to avoid that? For
instance, using memory-mapped files for the indexes? Anything that would help
us overcome OS limitations of that sort...
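(For what it's worth, the memory-mapped route here corresponds to Lucene's MMapDirectory. A minimal sketch against the 3.x-era API -- the index path is hypothetical and this is just one way to wire it up:)

```java
import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;

public class MMapSearchSketch {
    public static void main(String[] args) throws Exception {
        // Memory-map the index files: hot postings are then served from the
        // OS page cache rather than from JVM-heap buffers.
        Directory dir = new MMapDirectory(new File("/path/to/index")); // hypothetical path
        IndexReader reader = IndexReader.open(dir, true); // open read-only
        IndexSearcher searcher = new IndexSearcher(reader);
        // ... run queries against searcher ...
        searcher.close();
        reader.close();
    }
}
```

Note MMapDirectory is most useful on 64-bit JVMs, where address space is plentiful; on 32-bit systems mapping a large index can exhaust the address space.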


Also, you mention a scheduled job to check for performance degradation; any
idea how serious such a drop should be for sharding to be really beneficial?
Or is that application-specific too?


Itamar.



On 12/06/2011 06:43, Shai Erera wrote:

I agree w/ Erick: there is no cutoff point (index size, for that matter)
above which you should start sharding.

What you can do is create a scheduled job in your system that runs a select
list of queries and monitors their performance. Once performance degrades, it
shards the index, either by splitting it (you can use IndexSplitter under
contrib) or by creating a new shard and directing new documents to it.
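One possible shape for that scheduled job, kept Lucene-free for clarity -- the class name, the 1.5x threshold, and the hourly schedule are all illustrative, and you would wrap your actual IndexSearcher calls in the Callables:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SearchLatencyMonitor {
    private final List<Callable<?>> benchmarkQueries; // wrap your IndexSearcher calls here
    private final long baselineNanos;                 // measured when the index was "healthy"
    private final double degradationFactor;           // e.g. 1.5 = 50% slower than baseline

    public SearchLatencyMonitor(List<Callable<?>> queries, long baselineNanos, double factor) {
        this.benchmarkQueries = queries;
        this.baselineNanos = baselineNanos;
        this.degradationFactor = factor;
    }

    /** Runs all benchmark queries once and returns total wall-clock nanos. */
    public long measure() throws Exception {
        long start = System.nanoTime();
        for (Callable<?> q : benchmarkQueries) q.call();
        return System.nanoTime() - start;
    }

    /** True when the current run is slower than baseline * degradationFactor. */
    public boolean isDegraded(long measuredNanos) {
        return measuredNanos > (long) (baselineNanos * degradationFactor);
    }

    /** Re-measures on a fixed schedule; onDegradation might trigger a split. */
    public void start(ScheduledExecutorService scheduler, Runnable onDegradation) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                if (isDegraded(measure())) onDegradation.run();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, 1, 1, TimeUnit.HOURS);
    }
}
```

The important design point is that the query list and baseline stay fixed, so a latency change reflects index growth rather than query variation.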

I think I read somewhere, not sure if it was in the Solr or ElasticSearch
documentation, about a Balancer object which moves shards around in order
to balance the load on the cluster. You can implement something similar
that tries to balance the index sizes, creates new shards on-the-fly, and
even merges shards if, say, a whole source is removed from the system.

Also, note that the 'largest index size' threshold is really a machine
constraint and not Lucene's. So if you decide that 10 GB is your cutoff, it
is pointless to create 10x10GB shards on the same machine -- searching them
is just like searching a 100GB index w/ 10x10GB segments. Perhaps it's even
worse, because you consume more RAM when the indexes are split (e.g., terms
index, field infos etc.).
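To make the same-machine point concrete: N local shards opened under one MultiReader are searched much like a single index with N segments. A sketch against the 3.x-era API (shard paths hypothetical):

```java
import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class LocalShardsSketch {
    public static void main(String[] args) throws Exception {
        // Open each local shard as its own reader...
        IndexReader shard0 = IndexReader.open(FSDirectory.open(new File("/indexes/shard0")));
        IndexReader shard1 = IndexReader.open(FSDirectory.open(new File("/indexes/shard1")));
        // ...then search them through one MultiReader. The cost profile is
        // effectively that of one index with as many segments, plus the
        // per-index overhead (terms index, field infos) mentioned above.
        IndexSearcher searcher = new IndexSearcher(new MultiReader(shard0, shard1));
        // ... run queries ...
        searcher.close();
    }
}
```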

Shai

On Sun, Jun 12, 2011 at 3:10 AM, Erick Erickson <erickerick...@gmail.com> wrote:
  <<<We can't assume anything about the machine running it,
so testing won't really tell us much>>>

Hmmm, then it's pretty hopeless I think. Problem is that
anything you say about running on a machine with
2G available memory on a single processor is completely
incomparable to running on a machine with 64G of
memory available for Lucene and 16 processors.

There's really no such thing as an "optimum" Lucene index
size, it always relates to the characteristics of the
underlying hardware.

I think the best you can do is actually test on various
configurations, then at least you can say "on configuration
X this is the tipping point".

Sorry there isn't a better answer that I know of, but...

Best
Erick

On Sat, Jun 11, 2011 at 3:37 PM, Itamar Syn-Hershko <ita...@code972.com> wrote:

Hi all,

I know Lucene indexes to be at their optimum up to a certain size - said
to be around several GBs. I haven't found a good discussion of this, but
it's my understanding that at some point it's better to split an index into
parts (a la sharding) than to continue searching on one huge index. I assume
this has to do with OS and IO configurations. Can anyone point me to more
info on this?

We have a product that uses Lucene for various searches, and at the
moment each type of search uses its own Lucene index. We plan to
refactor the way it works and combine all indexes into one - making
the whole system more robust and reducing its memory footprint, among
other things.

Assuming the above is true, we are interested in knowing how to do this
correctly. Initially all our indexes will be merged into one big index, but if
at some index size there is a severe performance degradation, we would like
to handle that correctly by starting a new FSDirectory index to flush into,
or by re-indexing and moving large indexes into their own Lucene index.

Are there any guidelines for measuring or estimating this correctly?
What should we be aware of while considering all this? We can't assume
anything about the machine running it, so testing won't really tell us
much...

Thanks in advance for any input on this,

Itamar.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org







