In the literature there is some evidence that sharding in-memory indexes on multi-core machines can outperform searching a single monolithic index. Has anyone tried this lately?
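(For a concrete picture of what per-core sharding means in Lucene terms, here is a minimal sketch - my reconstruction, not taken from the paper: one in-memory shard per core, each searched on its own thread. It assumes the shards already exist as in-memory Directory instances, e.g. RAMDirectory, and uses plain java.util.concurrent machinery:)

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;

public class ShardedSearchSketch {
  // One searcher per in-memory shard, one thread per shard. A real
  // implementation would keep readers open across queries and merge the
  // per-shard TopDocs by score instead of just summing hit counts.
  public static int countAcrossShards(List<Directory> shards, final Query query)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(
        Runtime.getRuntime().availableProcessors());
    List<Future<Integer>> results = new ArrayList<Future<Integer>>();
    for (final Directory shard : shards) {
      results.add(pool.submit(new Callable<Integer>() {
        public Integer call() throws Exception {
          IndexReader reader = IndexReader.open(shard);
          try {
            return new IndexSearcher(reader).search(query, 10).totalHits;
          } finally {
            reader.close();
          }
        }
      }));
    }
    int total = 0;
    for (Future<Integer> result : results) {
      total += result.get();
    }
    pool.shutdown();
    return total;
  }
}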
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4228359

Single-disk machines (HDD or SSD) would be slower. Multi-disk or RAID-type setups might have some benefits. What's your hardware setup?

Andrew.

On Sun, Jun 12, 2011 at 4:10 AM, Itamar Syn-Hershko <ita...@code972.com> wrote:

> Thanks.
>
> The whole point of my question was to find out whether and how to do
> balancing on the SAME machine. Apparently that's not going to help, and
> at a certain point we will just have to prompt the user to buy more
> hardware...
>
> Out of curiosity, isn't there anything we can do to avoid that? For
> instance, using memory-mapped files for the indexes? Anything that would
> help us overcome OS limitations of that sort...
>
> Also, you mention a scheduled job to check for performance degradation;
> any idea how serious such a drop should be for sharding to be really
> beneficial? Or is it application-specific too?
>
> Itamar.
>
> On 12/06/2011 06:43, Shai Erera wrote:
>
>> I agree w/ Erick, there is no cutoff point (index size, for that matter)
>> above which you start sharding.
>>
>> What you can do is create a scheduled job in your system that runs a
>> select list of queries and monitors their performance. Once it degrades,
>> it shards the index, either by splitting it (you can use IndexSplitter
>> under contrib) or by creating a new shard and directing new documents
>> to it.
>>
>> I think I read somewhere, not sure if it was in Solr or ElasticSearch
>> documentation, about a Balancer object which moves shards around in
>> order to balance the load on the cluster. You can implement something
>> similar which tries to balance the index sizes, creates new shards
>> on-the-fly, and even merges shards if a whole source is suddenly removed
>> from the system, etc.
>>
>> Also, note that the 'largest index size' threshold is really a machine
>> constraint and not Lucene's. So if you decide that 10 GB is your cutoff,
>> it is pointless to create 10x10GB shards on the same machine --
>> searching them is just like searching a 100GB index w/ 10x10GB segments.
>> Perhaps it's even worse, because you consume more RAM when the indexes
>> are split (e.g., terms index, field infos, etc.).
>>
>> Shai
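(A minimal sketch of the scheduled check Shai describes, using a plain ScheduledExecutorService; the probe queries, the one-hour interval, and the 2x-baseline threshold are all illustrative choices, not recommendations:)

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearchLatencyMonitor {
  // Runs a fixed list of representative queries on a schedule and flags
  // the moment their combined latency degrades past a threshold.
  public static void start(final IndexSearcher searcher, final Query[] probes,
                           final long baselineNanos) {
    ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor();
    timer.scheduleAtFixedRate(new Runnable() {
      public void run() {
        try {
          long start = System.nanoTime();
          for (Query probe : probes) {
            searcher.search(probe, 10);
          }
          long elapsed = System.nanoTime() - start;
          if (elapsed > 2 * baselineNanos) {
            // Degraded: split the index (contrib's IndexSplitter) or
            // start directing new documents to a fresh shard.
            System.err.println("Query latency " + elapsed + "ns exceeds 2x"
                + " baseline (" + baselineNanos + "ns) - consider sharding");
          }
        } catch (Exception e) {
          e.printStackTrace();
        }
      }
    }, 1, 1, TimeUnit.HOURS);
  }
}

The only essential part is the comparison against a baseline recorded while the index is healthy; what to do on degradation - split vs. new shard - is the application-specific piece Shai mentions.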
>> On Sun, Jun 12, 2011 at 3:10 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>>
>>> <<<We can't assume anything about the machine running it,
>>> so testing won't really tell us much>>>
>>>
>>> Hmmm, then it's pretty hopeless, I think. Problem is that anything you
>>> say about running on a machine with 2G of available memory on a single
>>> processor is completely incomparable to running on a machine with 64G
>>> of memory available for Lucene and 16 processors.
>>>
>>> There's really no such thing as an "optimum" Lucene index size; it
>>> always relates to the characteristics of the underlying hardware.
>>>
>>> I think the best you can do is actually test on various configurations;
>>> then at least you can say "on configuration X this is the tipping
>>> point".
>>>
>>> Sorry there isn't a better answer that I know of, but...
>>>
>>> Best,
>>> Erick
>>>
>>> On Sat, Jun 11, 2011 at 3:37 PM, Itamar Syn-Hershko <ita...@code972.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I understand Lucene indexes to be at their optimum up to a certain
>>>> size - said to be around several GBs. I haven't found a good discussion
>>>> of this, but it's my understanding that at some point it's better to
>>>> split an index into parts (a la sharding) than to continue searching
>>>> one huge index. I assume this has to do with OS and IO configurations.
>>>> Can anyone point me to more info on this?
>>>>
>>>> We have a product that is using Lucene for various searches, and at
>>>> the moment each type of search is using its own Lucene index. We plan
>>>> on refactoring the way it works and combining all indexes into one -
>>>> making the whole system more robust and giving it a smaller memory
>>>> footprint, among other things.
>>>>
>>>> Assuming the above is true, we are interested in knowing how to do
>>>> this correctly. Initially all our indexes will be combined into one big
>>>> index, but if at some index size there is a severe performance
>>>> degradation, we would like to handle that correctly by starting a new
>>>> FSDirectory index to flush into, or by re-indexing and moving large
>>>> indexes into their own Lucene index.
>>>>
>>>> Are there any guidelines for measuring or estimating this correctly?
>>>> What should we be aware of while considering all that? We can't assume
>>>> anything about the machine running it, so testing won't really tell us
>>>> much...
>>>>
>>>> Thanks in advance for any input on this,
>>>>
>>>> Itamar.
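(Side note on two ideas upthread, in code form: Itamar asks about memory-mapped files - Lucene already ships MMapDirectory for exactly that - and Shai's observation that same-machine shards behave like one big index follows from how they would be searched: typically through a single MultiReader. A minimal sketch against 3.x-era APIs; the shard paths are illustrative:)

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.MMapDirectory;

public class MmapShardsSketch {
  public static IndexSearcher openShards(File[] shardDirs) throws Exception {
    // Memory-map each shard; the OS page cache then serves the hot parts
    // of the index without an explicit Lucene-level cache.
    IndexReader[] readers = new IndexReader[shardDirs.length];
    for (int i = 0; i < shardDirs.length; i++) {
      readers[i] = IndexReader.open(new MMapDirectory(shardDirs[i]));
    }
    // One searcher over all shards - which is why N shards on one
    // machine behave much like a single index with N large segments.
    return new IndexSearcher(new MultiReader(readers));
  }
}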