Distributing Lucene segments across multiple disks.
Hi, I know that SolrCloud allows you to have multiple shards on different machines (or a single machine), but it requires a ZooKeeper installation for things like leader election, leader availability, etc. While SolrCloud may be the ideal solution for my use case eventually, I'd like to know if there's a way I can point my Solr instance to read Lucene segments distributed across different disks attached to the same machine. Thanks! -Deepak
Re: Distributing Lucene segments across multiple disks.
@Greg - Are you suggesting RAID as a replacement for Solr, or making Solr work with RAID? Could you elaborate more on the latter, if that's what you meant? We make use of Solr's advanced text-processing features, which would be hard to replicate using RAID alone. -Deepak

On Wed, Sep 11, 2013 at 12:11 PM, Greg Walters gwalt...@sherpaanalytics.com wrote:

Why not use some form of RAID for your index store? You'd get the performance benefit of multiple disks without the complexity of managing them via Solr. Thanks, Greg

-----Original Message-----
From: Deepak Konidena [mailto:deepakk...@gmail.com]
Sent: Wednesday, September 11, 2013 2:07 PM
To: solr-user@lucene.apache.org
Subject: Re: Distributing lucene segments across multiple disks.

Are you suggesting a multi-core setup, where all the cores share the same schema and the cores lie on different disks? Basically, I'd like to know if I can distribute shards/segments on a single machine (with multiple disks) without the use of ZooKeeper. -Deepak

On Wed, Sep 11, 2013 at 11:55 AM, Upayavira u...@odoko.co.uk wrote:

I think you'll find it hard to distribute different segments between disks, as they are typically stored in the same directory. However, instantiating separate cores on different disks should be straightforward enough, and would give you a performance benefit. I've certainly heard of that being done at Amazon, with a separate EBS volume per core giving some performance improvement. Upayavira

On Wed, Sep 11, 2013, at 07:35 PM, Deepak Konidena wrote:

Hi, I know that SolrCloud allows you to have multiple shards on different machines (or a single machine), but it requires a ZooKeeper installation for things like leader election, leader availability, etc. While SolrCloud may be the ideal solution for my use case eventually, I'd like to know if there's a way I can point my Solr instance to read Lucene segments distributed across different disks attached to the same machine. Thanks! -Deepak
Re: Distributing Lucene segments across multiple disks.
Are you suggesting a multi-core setup, where all the cores share the same schema and the cores lie on different disks? Basically, I'd like to know if I can distribute shards/segments on a single machine (with multiple disks) without the use of ZooKeeper. -Deepak

On Wed, Sep 11, 2013 at 11:55 AM, Upayavira u...@odoko.co.uk wrote:

I think you'll find it hard to distribute different segments between disks, as they are typically stored in the same directory. However, instantiating separate cores on different disks should be straightforward enough, and would give you a performance benefit. I've certainly heard of that being done at Amazon, with a separate EBS volume per core giving some performance improvement. Upayavira

On Wed, Sep 11, 2013, at 07:35 PM, Deepak Konidena wrote:

Hi, I know that SolrCloud allows you to have multiple shards on different machines (or a single machine), but it requires a ZooKeeper installation for things like leader election, leader availability, etc. While SolrCloud may be the ideal solution for my use case eventually, I'd like to know if there's a way I can point my Solr instance to read Lucene segments distributed across different disks attached to the same machine. Thanks! -Deepak
Re: Distributing Lucene segments across multiple disks.
I guess at this point in the discussion, I should probably give some more background on why I am doing what I am doing. Having a single Solr shard (multiple segments) on the same disk is posing severe performance problems under load, in that calls to Solr cause a lot of connection timeouts. When we looked at the Ganglia stats for the Solr box, we saw that while memory, CPU and network usage were quite normal, the I/O wait spiked. We are unsure what caused the I/O wait and why there were no spikes in CPU/memory usage. Since the Solr box is a beefy box (multi-core setup, huge RAM, SSD), we'd like to distribute the segments to multiple locations (disks) and see whether this improves performance under load.

@Greg - Thanks for clarifying that. I just learnt that I can't set them up using RAID, as some of them are SSDs and some others are SATA (spinning) disks.

@Shawn Heisey - Could you elaborate more about the broker core and delegating the requests to other cores? -Deepak

On Wed, Sep 11, 2013 at 1:10 PM, Shawn Heisey s...@elyograg.org wrote:

On 9/11/2013 1:07 PM, Deepak Konidena wrote: Are you suggesting a multi-core setup, where all the cores share the same schema and the cores lie on different disks? Basically, I'd like to know if I can distribute shards/segments on a single machine (with multiple disks) without the use of ZooKeeper.

Sure, you can do it all manually. At that point you would not be using SolrCloud at all, because the way to enable SolrCloud is to tell Solr where ZooKeeper lives. Without SolrCloud, there is no cluster automation at all. There is no collection paradigm; you just have cores. You have to send updates to the correct core; they will not be redirected for you. Similarly, queries will not be load balanced automatically. For Java clients, the CloudSolrServer object can work seamlessly when servers go down. If you're not using SolrCloud, you can't use CloudSolrServer. You would be in charge of creating the shards parameter yourself.
The way that I do this on my index is that I have a broker core that has no index of its own, but its solrconfig.xml has the shards and shards.qt parameters in all the request handler definitions. You can also include the parameter with the query.

You would also have to handle redundancy yourself, either with replication or with independently updated indexes. I use the latter method, because it offers a lot more flexibility than replication.

As mentioned in another reply, setting up RAID with a lot of disks may be better than trying to split your index up onto different filesystems that each reside on different disks. I would recommend RAID 10 for Solr, and it works best if it's hardware RAID and the controller has battery-backed (or NVRAM) cache. Thanks, Shawn
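Shawn's broker-core setup can be sketched roughly like this in solrconfig.xml (the host names, ports, and core names below are hypothetical, not taken from the thread):

```xml
<!-- solrconfig.xml of the broker core, which holds no index of its own.
     Each request handler fans the query out to the real cores via "shards". -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- comma-separated host:port/path entries for the cores holding the index -->
    <str name="shards">localhost:8983/solr/shard1,localhost:8983/solr/shard2</str>
    <!-- handler each shard should use for its piece of the query -->
    <str name="shards.qt">/select</str>
  </lst>
</requestHandler>
```

As Shawn notes, the same shards parameter can instead be passed on the request itself, e.g. ...&shards=localhost:8983/solr/shard1,localhost:8983/solr/shard2.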
Re: Distributing Lucene segments across multiple disks.
@Greg - Thanks for the suggestion. Will pass it along to my folks.

@Shawn - That's the link I was looking for, the non-SolrCloud approach to distributed search. Thanks for passing that along. Will give it a try. As far as RAM usage goes, I believe we set the heap size to about 40% of the RAM, and less than 10% is available for OS caching (since the replica takes another 40%). Why does unallocated RAM help? How does it impact performance under load? -Deepak

On Wed, Sep 11, 2013 at 2:50 PM, Shawn Heisey s...@elyograg.org wrote:

On 9/11/2013 2:57 PM, Deepak Konidena wrote: I guess at this point in the discussion, I should probably give some more background on why I am doing what I am doing. Having a single Solr shard (multiple segments) on the same disk is posing severe performance problems under load, in that calls to Solr cause a lot of connection timeouts. When we looked at the Ganglia stats for the Solr box, we saw that while memory, CPU and network usage were quite normal, the I/O wait spiked. We are unsure what caused the I/O wait and why there were no spikes in CPU/memory usage. Since the Solr box is a beefy box (multi-core setup, huge RAM, SSD), we'd like to distribute the segments to multiple locations (disks) and see whether this improves performance under load. @Greg - Thanks for clarifying that. I just learnt that I can't set them up using RAID, as some of them are SSDs and some others are SATA (spinning) disks. @Shawn Heisey - Could you elaborate more about the broker core and delegating the requests to other cores?

On the broker core - I have a core on my servers that has no index of its own. In the /select handler (and others) I have placed a shards parameter, and many of them also have a shards.qt parameter. The shards parameter is how a non-cloud distributed search is done. http://wiki.apache.org/solr/DistributedSearch

Addressing your first paragraph: You say that you have lots of RAM, but is there a lot of unallocated RAM that the OS can use for caching, or is it mostly allocated to processes, such as the Java heap for Solr? Depending on exactly how your indexes are composed, you need up to 100% of the total index size available as unallocated RAM. With SSD, the requirement is less, but cannot be ignored. I personally wouldn't go below about 25-50% even with SSD, and I'd plan on 50-100% for regular disks. There is some evidence to suggest that you only need unallocated RAM equal to 10% of your index size for caching with SSD, but that is only likely to work if you have a lot of stored (as opposed to indexed) data. If most of your index is unstored, then more would be required. Thanks, Shawn
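Shawn's sizing rules of thumb can be turned into a small back-of-the-envelope calculator. This is only an illustration of the percentages he quotes (using the conservative end of each range); the 100 GB index size below is made up:

```python
def recommended_free_ram_gb(index_size_gb, ssd=False):
    """Rule-of-thumb OS disk-cache headroom for a Solr/Lucene index.

    Per Shawn's guidance in this thread: plan on 50-100% of the index
    size as unallocated RAM for spinning disks, 25-50% for SSD.
    This returns the conservative (high) end of each range.
    """
    factor = 0.5 if ssd else 1.0
    return index_size_gb * factor

# A hypothetical 100 GB index:
print(recommended_free_ram_gb(100))            # spinning disks -> 100.0
print(recommended_free_ram_gb(100, ssd=True))  # SSD -> 50.0
```

RAM used by the Solr heap (or a replica's heap) does not count toward this headroom; only memory left unallocated is available to the OS page cache.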
Re: Distributing Lucene segments across multiple disks.
Very helpful link. Thanks for sharing that. -Deepak

On Wed, Sep 11, 2013 at 4:34 PM, Shawn Heisey s...@elyograg.org wrote:

On 9/11/2013 4:16 PM, Deepak Konidena wrote: As far as RAM usage goes, I believe we set the heap size to about 40% of the RAM, and less than 10% is available for OS caching (since the replica takes another 40%). Why does unallocated RAM help? How does it impact performance under load?

Because once the data is in the OS disk cache, reading it becomes nearly instantaneous; it doesn't need to go out to the disk. Disks are glacial compared to RAM; even SSD has a far slower response time. Any recent operating system does this automatically, including the one from Redmond that we all love to hate. http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html Thanks, Shawn
Order of fields in a search query.
Does the order of fields matter in a Lucene query? For instance, q = A B C. Let's say A appears in a million documents, B in 1, and C in 1000. While the results would be identical irrespective of the order in which you AND A, B and C, will the response times of the following queries differ in any way?

C B A
A B C

Does Lucene/Solr pick the best query execution plan, in terms of both space and time, for a given query? -Deepak
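For what it's worth, Lucene's conjunction scoring effectively reorders the clauses itself: it drives the intersection with the iterator that has the fewest matches, so the textual order of the terms should not change the work done. A toy sketch of that idea over sorted doc-ID lists (illustrative only, not Lucene's actual code, which leapfrogs with postings iterators rather than list membership tests):

```python
def intersect(postings):
    """Intersect sorted doc-ID lists, letting the shortest list lead --
    the same idea a conjunction scorer uses, which is why the order the
    terms appeared in the query string is irrelevant."""
    postings = sorted(postings, key=len)   # rarest term drives the loop
    lead, rest = postings[0], postings[1:]
    out = []
    for doc in lead:
        # Real code would advance iterators (leapfrog), not scan lists.
        if all(doc in p for p in rest):
            out.append(doc)
    return out

a = list(range(1000))       # frequent "term": docs 0..999
c = [3, 500, 999, 1500]     # rare "term"
# Same result whichever order the query listed the terms in:
print(intersect([a, c]) == intersect([c, a]))  # True
```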
Multiple _val_ inside a Lucene query.
One of my previous mails to the group helped me simulate short-circuiting OR behavior (thanks to Yonik) using _val_:def(query(cond1,cond2,..)), where if cond1 is true the query returns without evaluating the subsequent conditions. While it works successfully for a single attribute, I am trying to extend it to achieve the same behavior for multiple attributes. When I try to use multiple _val_ clauses, the query returns an error. How do I make a query with multiple _val_ clauses? -Deepak
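A common cause of parse errors here is leaving the function body unquoted, since it contains commas and parentheses. A hedged sketch of two possible shapes (the parameter names qa/qb, the fields, and the exact escaping are assumptions for illustration, not verified answers from the thread):

```
# Each _val_ clause quoted separately:
q=_val_:"def(query($qa),0)" _val_:"def(query($qb),0)"&qa=attr1:foo&qb=attr2:bar

# Or combined inside a single _val_ clause with sum():
q=_val_:"sum(def(query($qa),0),def(query($qb),0))"&qa=attr1:foo&qb=attr2:bar
```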
Short-circuit OR operator in Lucene/Solr.
I understand that Lucene's AND (&&), OR (||) and NOT (!) operators are shorthand for REQUIRED, OPTIONAL and EXCLUDE respectively, which is why one can't treat them as boolean operators (adhering to boolean algebra). I have been trying to construct a simple OR expression, as follows:

q = +(field1:value1 OR field2:value2)

with a match on either field1 or field2. But since OR merely marks a clause as optional, for documents where both field1:value1 and field2:value2 match, the query returns a score that reflects a match on both clauses. How do I enforce short-circuiting in this context? In other words, how do I implement short-circuiting as in boolean algebra, where an expression A || B || C returns true if A is true, without even looking at whether B or C could be true? -Deepak
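Building on the def(query(...)) trick mentioned in the earlier mail on this list, nested def() calls give a first-match-wins cascade, which is about as close as the standard setup gets to boolean short-circuiting. A hedged sketch (field names and parameter names are made up; note that a pure function query matches all documents, so a filter query may still be needed to restrict the result set):

```
q=_val_:"def(query($qa),def(query($qb),0))"
  &qa=field1:value1
  &qb=field2:value2
```

The intent: score qa's match if it matches, otherwise fall through to qb, otherwise 0, so a document matching both clauses is scored only by the first.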