Re: Improving Solr performance
On Fri, Jan 14, 2011 at 1:56 PM, supersoft wrote:
> The tests are performed with a selfmade program. [...]

May I ask what language the program is written in? The reason for asking is to eliminate the possibility that there is an issue with the threading model, e.g., if you were using Python.

Would it be possible for you to run Apache Bench (ab) against your Solr setup? For example:

# For 10 simultaneous connections
ab -n 100 -c 10 http://localhost:8983/solr/select/?q=my_query1

# For 50 simultaneous connections
ab -n 500 -c 50 http://localhost:8983/solr/select/?q=my_query2

Please pay attention to the meaning of the -n parameter (there is a slight gotcha there: it is the total number of requests across all connections, not the number per connection). See "man ab" for details on usage, or see, for example, http://www.derivante.com/2009/05/05/solr-performance-benchmarks-single-vs-multi-core-index-shards/

> In the last post, I wrote the results of the 100 threads example ordered
> by the response date. The results ordered by the creation date are: [...]

OK, the numbers make more sense now. As someone else has pointed out, your throughput does increase with more simultaneous queries, and there are better ways to do the measurement. Nevertheless, your results are very much at odds with what we see, and I would like to understand the issue.

Regards,
Gora
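A minimal sketch of the kind of per-request latency harness supersoft describes, using Python's concurrent.futures (hypothetical; the query function is stubbed here and would issue a real HTTP GET against /solr/select in practice):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def timed(query_fn):
    # Latency from request creation to response, per request, in ms.
    start = time.monotonic()
    query_fn()
    return (time.monotonic() - start) * 1000.0

def benchmark(query_fn, n_concurrent):
    # Launch n_concurrent requests at once and collect each latency.
    with ThreadPoolExecutor(max_workers=n_concurrent) as pool:
        futures = [pool.submit(timed, query_fn) for _ in range(n_concurrent)]
        return [f.result() for f in futures]

if __name__ == "__main__":
    # Stub standing in for a real Solr query over HTTP.
    fake_query = lambda: time.sleep(0.01)
    print(sorted(benchmark(fake_query, 10)))
```

Unlike ab, this records each request's own creation-to-response time, which is the number the thread discussion is actually about.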
Re: Improving Solr performance
On Thu, 2011-01-13 at 17:40 +0100, supersoft wrote:
> Although most of the queries are cache hits, the performance is still
> dependent on the number of simultaneous queries:
>
> 1 simultaneous query: 3437 ms (cache fails)
Average response time: 3437 ms. Throughput: 0.29 queries/sec.

> 2 simultaneous queries: 594, 954 ms
Average response time: 774 ms. Throughput: 1.29 queries/sec.

> 10 simultaneous queries: 1047, 1313, 1438, 1797, 1922, 2094, 2250, 2500,
> 2938, 3000 ms
Average response time: 2030 ms. Throughput: 4.93 queries/sec.

> 50 simultaneous queries: 1203, 1453, [...]
Average response time: 15478 ms. Throughput: 3.23 queries/sec.

> 100 simultaneous queries: 1297, 1531, 1969, [...]
Average response time: 16285 ms. Throughput: 6.14 queries/sec.

> Is this an expected situation?

Your numbers for 50 queries are strangely low, but the trend throughout your tests indicates that your tests for 1, 2, 10, 50 and 100 threads do not perform the same number of searches. To compare the numbers, you need to let each test perform the same number of searches, and to start each test from exactly the same warmup state. That means restarting Solr and flushing the disk cache, which might require rebooting depending on your setup. It is also recommended that you perform 5-10 searches before you start measuring anything, as the first searches are not representative of general performance.

Going with the numbers as they are, performance actually increases for each thread you add: look at throughput, not response time. This is clearly bogus, but easily explained by the cache.
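Toke's derived figures can be reproduced directly from the posted response times; a sketch for the 10-query run, taking throughput as queries completed divided by the average response time (which matches most of the figures above):

```python
# Response times in ms for the 10-simultaneous-queries run, as posted.
times_ms = [1047, 1313, 1438, 1797, 1922, 2094, 2250, 2500, 2938, 3000]

avg_ms = sum(times_ms) / len(times_ms)
throughput = len(times_ms) / (avg_ms / 1000.0)  # queries/sec

print(round(avg_ms))         # average response time in ms
print(round(throughput, 2))  # queries/sec
```

This yields the 2030 ms average and 4.93 queries/sec quoted for the 10-query case.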
Re: Improving Solr performance
The tests are performed with a selfmade program. The arguments are the number of threads and the path to a file which contains the available queries (in the last test, only one). When each thread is created, it records the current time (in milliseconds), and when it gets the response to the query, the thread logs the difference from that initial time. In the last post, I wrote the results of the 100-threads example ordered by the response date. The results ordered by the creation date are:

100 simultaneous queries: 9265, 11922, 12375, 4109, 4890, 7093, 21875, 8547, 13562, 13219, 1531, 11875, 21281, 31985, 11703, 7391, 32031, 22172, 21469, 13875, 1969, 11406, 8172, 9609, 16953, 13828, 17282, 22141, 16625, 2203, 24985, 2375, 25188, 2891, 5047, 6422, 20860, 7594, 23125, 32281, 32016, 5312, 23125, 11484, 10344, 11500, 18172, 3937, 11547, 13500, 28297, 20594, 24641, 7063, 24797, 12922, 1297, 8984, 20625, 13407, 23203, 32016, 15922, 21875, 8750, 12875, 23203, 26453, 26016, 11797, 31782, 24672, 21625, 7672, 18985, 14672, 22157, 26485, 23328, 9907, 5563, 24625, 14078, 4703, 25844, 12328, 11484, 6437, 25937, 26437, 18484, 13719, 16328, 28687, 23141, 14016, 26437, 13187, 25031, 31969

--
View this message in context: http://lucene.472066.n3.nabble.com/Improving-Solr-performance-tp2210843p2254121.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Improving Solr performance
On Thu, Jan 13, 2011 at 10:10 PM, supersoft wrote:
> On the one hand, I found really interesting those comments about the reasons
> for sharding. The documentation agrees with you about why to split an index
> into several shards (big-size problems), but I don't find any explanation of
> the drawbacks, such as for Access Control Lists. I guess there should be some
> and they can be critical in this design. Any example?
[...]

Can I ask what might be a stupid question? How are you measuring the numbers below, and what do they mean? As your hit ratio is close to 1 (i.e., everything after the first query is coming from the cache), these numbers seem a little strange.

Are these really the times for each of the N simultaneous queries? They seem to be monotonically increasing (though with a couple of strange exceptions), which leads me to suspect that they are some kind of cumulative times. Under this interpretation, for the case of the 10 simultaneous queries, the first one takes 1047 ms, the second 266 ms, the third 125 ms, and so on.

We have run performance tests with pg_bench on an index of size 40GB on a single Solr server with about 6GB of RAM allocated to Solr, and see what I would think of as expected behaviour, i.e., for every fresh query term, the first query takes the longest, and the time for subsequent queries with the same term goes down dramatically, as the result is coming out of the cache. This is at odds with what you describe here, so I have to go back and check that we did not miss something important.
> 1 simultaneous query: 3437 ms (cache fails)
>
> 2 simultaneous queries: 594, 954 ms
>
> 10 simultaneous queries: 1047, 1313, 1438, 1797, 1922, 2094, 2250, 2500,
> 2938, 3000 ms
>
> 50 simultaneous queries: 1203, 1453, 1453, 1437, 1625, 1953, 5688, 12938,
> 14953, 16281, 15984, 16453, 15812, 16469, 16563, 16844, 17703, 16843, 17359,
> 16828, 18235, 18219, 18172, 18203, 17672, 17344, 17453, 18484, 18157, 18531,
> 18297, 18359, 18063, 18516, 18125, 17516, 18562, 18016, 18187, 18610, 18703,
> 18672, 17829, 18344, 18797, 18781, 18265, 18875, 18250, 18812
>
> 100 simultaneous queries: 1297, 1531, 1969, 2203, 2375, 2891, 3937, 4109,
> 4703, 4890, 5047, 5312, 5563, 6422, 6437, 7063, 7093, 7391, 7594, 7672,
> 8172, 8547, 8750, 8984, 9265, 9609, 9907, 10344, 11406, 11484, 11484, 11500,
> 11547, 11703, 11797, 11875, 11922, 12328, 12375, 12875, 12922, 13187, 13219,
> 13407, 13500, 13562, 13719, 13828, 13875, 14016, 14078, 14672, 15922, 16328,
> 16625, 16953, 17282, 18172, 18484, 18985, 20594, 20625, 20860, 21281, 21469,
> 21625, 21875, 21875, 22141, 22157, 22172, 23125, 23125, 23141, 23203, 23203,
> 23328, 24625, 24641, 24672, 24797, 24985, 25031, 25188, 25844, 25937, 26016,
> 26437, 26453, 26437, 26485, 28297, 28687, 31782, 31985, 31969, 32016, 32031,
> 32016, 32281 ms
[...]

Regards,
Gora
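The cumulative-time reading above is easy to check mechanically: if the reported values are running totals, their successive differences should be the individual query times. A sketch using the 10-query figures:

```python
# Reported times for the 10-simultaneous-queries run.
reported = [1047, 1313, 1438, 1797, 1922, 2094, 2250, 2500, 2938, 3000]

# Under the cumulative interpretation, each query's own time is the
# difference from the previous reported value.
per_query = [reported[0]] + [b - a for a, b in zip(reported, reported[1:])]
print(per_query[:3])  # first three individual times under this reading
```

By construction the individual times sum back to the last reported value, so the interpretation is at least self-consistent.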
Re: Improving Solr performance
On the one hand, I found really interesting those comments about the reasons for sharding. The documentation agrees with you about why to split an index into several shards (big-size problems), but I don't find any explanation of the drawbacks, such as for Access Control Lists. I guess there should be some and they can be critical in this design. Any example?

On the other hand, the performance problems. I have configured big caches and I launch a test of simultaneous requests (with the same query) without committing during the test. The caches are initially empty, and after the test:

name: queryResultCache
  lookups: 1129
  hits: 1120
  hitratio: 0.99
  inserts: 16
  evictions: 0
  size: 9
  warmupTime: 0
  cumulative_lookups: 1129
  cumulative_hits: 1120
  cumulative_hitratio: 0.99
  cumulative_inserts: 16
  cumulative_evictions: 0

name: documentCache
  lookups: 6750
  hits: 6440
  hitratio: 0.95
  inserts: 310
  evictions: 0
  size: 310
  warmupTime: 0
  cumulative_lookups: 6750
  cumulative_hits: 6440
  cumulative_hitratio: 0.95
  cumulative_inserts: 310
  cumulative_evictions: 0

Although most of the queries are cache hits, the performance is still dependent on the number of simultaneous queries:

1 simultaneous query: 3437 ms (cache fails)

2 simultaneous queries: 594, 954 ms

10 simultaneous queries: 1047, 1313, 1438, 1797, 1922, 2094, 2250, 2500, 2938, 3000 ms

50 simultaneous queries: 1203, 1453, 1453, 1437, 1625, 1953, 5688, 12938, 14953, 16281, 15984, 16453, 15812, 16469, 16563, 16844, 17703, 16843, 17359, 16828, 18235, 18219, 18172, 18203, 17672, 17344, 17453, 18484, 18157, 18531, 18297, 18359, 18063, 18516, 18125, 17516, 18562, 18016, 18187, 18610, 18703, 18672, 17829, 18344, 18797, 18781, 18265, 18875, 18250, 18812

100 simultaneous queries: 1297, 1531, 1969, 2203, 2375, 2891, 3937, 4109, 4703, 4890, 5047, 5312, 5563, 6422, 6437, 7063, 7093, 7391, 7594, 7672, 8172, 8547, 8750, 8984, 9265, 9609, 9907, 10344, 11406, 11484, 11484, 11500, 11547, 11703, 11797, 11875, 11922, 12328, 12375, 12875, 12922, 13187, 13219, 13407, 13500, 13562, 13719,
13828, 13875, 14016, 14078, 14672, 15922, 16328, 16625, 16953, 17282, 18172, 18484, 18985, 20594, 20625, 20860, 21281, 21469, 21625, 21875, 21875, 22141, 22157, 22172, 23125, 23125, 23141, 23203, 23203, 23328, 24625, 24641, 24672, 24797, 24985, 25031, 25188, 25844, 25937, 26016, 26437, 26453, 26437, 26485, 28297, 28687, 31782, 31985, 31969, 32016, 32031, 32016, 32281 ms

Is this an expected situation? Is there any technique for not being so dependent on the number of simultaneous queries? (For economic reasons, replication across more servers is not an option.)

Thanks in advance (and also thanks for the previous comments)

--
View this message in context: http://lucene.472066.n3.nabble.com/Improving-Solr-performance-tp2210843p2249108.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Improving Solr performance
Any sources to cite for this statement? And are you talking about RAM allocated to the JVM or available for the OS cache?

> Not sure if this was mentioned yet, but if you are doing slave/master
> replication you'll need 2x the RAM at replication time. Just something to
> keep in mind.
>
> -mike
>
> On Mon, Jan 10, 2011 at 5:01 PM, Toke Eskildsen wrote:
> > On Mon, 2011-01-10 at 21:43 +0100, Paul wrote:
> > > > I see from your other messages that these indexes all live on the same
> > > > machine. You're almost certainly I/O bound, because you don't have
> > > > enough memory for the OS to cache your index files. With 100GB of
> > > > total index size, you'll get best results with between 64GB and 128GB
> > > > of total RAM.
> > >
> > > Is that a general rule of thumb? That it is best to have about the
> > > same amount of RAM as the size of your index?
> >
> > There does not seem to be a clear current consensus on hardware to
> > handle I/O problems. I am firmly in the SSD camp, but as you can see from
> > the current thread, other people recommend RAM and/or extra machines.
> >
> > I can say that our tests with RAM and spinning disks showed us that a
> > lot of RAM certainly helps a lot, but also that it takes a considerable
> > amount of time to warm the index before the performance is satisfactory.
> > It might be helped with disk cache tricks, such as copying the whole
> > index to /dev/null before opening it in Solr.
> >
> > > So, with a 5GB index, I should have between 4GB and 8GB of RAM
> > > dedicated to Solr?
> >
> > Not as -Xmx, but free for disk cache, yes. If you follow the RAM ~=
> > index size recommendation.
Re: Improving Solr performance
And I don't think I've seen anyone suggest a separate core just for Access Control Lists. I'm not sure what that would get you. Perhaps a separate store that isn't Solr at all, in some cases.

On 1/10/2011 5:36 PM, Jonathan Rochkind wrote:
> Access Control Lists
Re: Improving Solr performance
On 1/10/2011 5:03 PM, Dennis Gearon wrote:
> What I seem to see suggested here is to use different cores for the things
> you suggested:
> different types of documents
> Access Control Lists
> I wonder how sharding would work in that scenario?

Sharding has nothing to do with that scenario at all. Different cores are essentially _entirely separate_. While it can be convenient to use different cores like this, it means you don't get ANY searches that 'join' over multiple 'kinds' of data in different cores.

Solr is not great at handling heterogeneous data like that. Putting it in separate cores is one solution, although then they are entirely separate. If that works, great. Another solution is putting them in the same index, but using mostly different fields, and perhaps having a 'type' field shared amongst all of your 'kinds' of data, and then always querying with an 'fq' for the right 'kind'. Or, if the fields they use are entirely different, you don't even need the fq, since a query on a certain field will only match a certain 'kind' of document.

Solr is not great at handling complex queries over data with heterogeneous schemata. Solr wants you to flatten all your data into one single set of documents.

Sharding is a way of splitting up a single index (multiple cores are _multiple indexes_) amongst several hosts for performance reasons, mostly when you have a very large index. That is it. The end. If you have multiple cores, that's the same as having multiple Solr indexes (which may or may not happen to be on the same machine). Any one or more of those cores could be sharded if you want. This is a separate issue.
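The "shared 'type' field plus fq" approach amounts to queries like the following sketch (the doc_type field name and its values are hypothetical; Solr caches the fq's matching document set separately from the main query):

```python
from urllib.parse import urlencode

# Restrict an otherwise ordinary query to one 'kind' of document with a
# filter query; the fq result set is cached and reused across queries.
params = {"q": "title:solr", "fq": "doc_type:article"}
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```

Every query for a given 'kind' then shares one cached filter, instead of each 'kind' living in its own core or shard.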
Re: Improving Solr performance
Not sure if this was mentioned yet, but if you are doing slave/master replication you'll need 2x the RAM at replication time. Just something to keep in mind.

-mike

On Mon, Jan 10, 2011 at 5:01 PM, Toke Eskildsen wrote:
> On Mon, 2011-01-10 at 21:43 +0100, Paul wrote:
> > > I see from your other messages that these indexes all live on the same
> > > machine. You're almost certainly I/O bound, because you don't have
> > > enough memory for the OS to cache your index files. With 100GB of total
> > > index size, you'll get best results with between 64GB and 128GB of
> > > total RAM.
> >
> > Is that a general rule of thumb? That it is best to have about the
> > same amount of RAM as the size of your index?
>
> There does not seem to be a clear current consensus on hardware to
> handle I/O problems. I am firmly in the SSD camp, but as you can see from
> the current thread, other people recommend RAM and/or extra machines.
>
> I can say that our tests with RAM and spinning disks showed us that a
> lot of RAM certainly helps a lot, but also that it takes a considerable
> amount of time to warm the index before the performance is satisfactory.
> It might be helped with disk cache tricks, such as copying the whole
> index to /dev/null before opening it in Solr.
>
> > So, with a 5GB index, I should have between 4GB and 8GB of RAM
> > dedicated to Solr?
>
> Not as -Xmx, but free for disk cache, yes. If you follow the RAM ~=
> index size recommendation.
Re: Improving Solr performance
What I seem to see suggested here is to use different cores for the things you suggested:
different types of documents
Access Control Lists

I wonder how sharding would work in that scenario? Me, I plan on:
For security: using a permissions field.
For different schemas: dynamic fields, with enough premade fields to handle it.

The one thing I don't think my approach does well with is statistics.

Dennis Gearon

- Original Message
From: Jonathan Rochkind
To: "solr-user@lucene.apache.org"
Cc: supersoft
Sent: Mon, January 10, 2011 1:08:00 PM
Subject: Re: Improving Solr performance

I see a lot of people using shards to hold "different types of documents", and it almost always seems to be a bad solution. Shards are intended for distributing a large index over multiple hosts -- that's it. Not for some kind of federated search over multiple schemas, not for access control.

Why not put everything in the same index, without shards, and just use an 'fq' limit to restrict each search to the specific documents you'd like to search over? I think that would achieve your goal a lot more simply than shards -- then you use sharding only if and when your index grows so large that you'd like to distribute it over multiple hosts, and when you do so you choose a shard key that will have a more or less equal distribution across shards. Using shards for access control or schema management just leads to headaches.

[Apparently Solr could use some highlighted documentation on what shards are really for, as it seems to be a very common issue on this list: someone trying to use them for something else and then inevitably finding problems with that approach.]

Jonathan

On 1/7/2011 6:48 AM, supersoft wrote:
> The reason for this distribution is the kind of the documents. In spite of
> having the same schema structure (and Solr conf), a document belongs to 1 of
> 5 different kinds.
> Each kind corresponds to a concrete shard, and due to this, the implemented
> client tool avoids searching in all the shards when the user selects just
> one or a few of the kinds. The tool runs a multisharded query of the proper
> shards. I guess this is a right approach, but correct me if I am wrong.
>
> The real problem of this architecture is the correlation between concurrent
> users and response time:
> 1 query: n seconds
> 2 queries: 2*n seconds each query
> 3 queries: 3*n seconds each query
> and so on...
>
> This is a real headache because a single query has an acceptable
> response time, but when many users are accessing the server the
> performance goes down sharply.
Re: Improving Solr performance
On Mon, 2011-01-10 at 21:43 +0100, Paul wrote:
> > I see from your other messages that these indexes all live on the same
> > machine. You're almost certainly I/O bound, because you don't have enough
> > memory for the OS to cache your index files. With 100GB of total index
> > size, you'll get best results with between 64GB and 128GB of total RAM.
>
> Is that a general rule of thumb? That it is best to have about the
> same amount of RAM as the size of your index?

There does not seem to be a clear current consensus on hardware to handle I/O problems. I am firmly in the SSD camp, but as you can see from the current thread, other people recommend RAM and/or extra machines.

I can say that our tests with RAM and spinning disks showed us that a lot of RAM certainly helps a lot, but also that it takes a considerable amount of time to warm the index before the performance is satisfactory. It might be helped with disk cache tricks, such as copying the whole index to /dev/null before opening it in Solr.

> So, with a 5GB index, I should have between 4GB and 8GB of RAM
> dedicated to Solr?

Not as -Xmx, but free for disk cache, yes. If you follow the RAM ~= index size recommendation.
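The disk-cache trick mentioned above (reading the whole index once so the OS page cache holds it, like `cat file > /dev/null`) can be sketched as follows; the index path is hypothetical and would point at your actual index directory:

```python
import os

def warm_page_cache(index_dir):
    # Read every index file end to end. The bytes are discarded, but the
    # OS page cache keeps them, so later Solr reads hit RAM, not disk.
    total = 0
    for name in os.listdir(index_dir):
        path = os.path.join(index_dir, name)
        if os.path.isfile(path):
            with open(path, "rb") as f:
                while chunk := f.read(1 << 20):  # 1 MiB at a time
                    total += len(chunk)
    return total  # bytes read (and now likely cached)

if __name__ == "__main__":
    index_dir = "/path/to/solr/data/index"  # hypothetical path
    if os.path.isdir(index_dir):
        print(warm_page_cache(index_dir))
```

This only helps if the index actually fits in free RAM; otherwise the cache is evicted as fast as it is filled.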
Re: Improving Solr performance
I see a lot of people using shards to hold "different types of documents", and it almost always seems to be a bad solution. Shards are intended for distributing a large index over multiple hosts -- that's it. Not for some kind of federated search over multiple schemas, not for access control.

Why not put everything in the same index, without shards, and just use an 'fq' limit to restrict each search to the specific documents you'd like to search over? I think that would achieve your goal a lot more simply than shards -- then you use sharding only if and when your index grows so large that you'd like to distribute it over multiple hosts, and when you do so you choose a shard key that will have a more or less equal distribution across shards. Using shards for access control or schema management just leads to headaches.

[Apparently Solr could use some highlighted documentation on what shards are really for, as it seems to be a very common issue on this list: someone trying to use them for something else and then inevitably finding problems with that approach.]

Jonathan

On 1/7/2011 6:48 AM, supersoft wrote:
> The reason for this distribution is the kind of the documents. In spite of
> having the same schema structure (and Solr conf), a document belongs to 1 of
> 5 different kinds.
>
> Each kind corresponds to a concrete shard, and due to this, the implemented
> client tool avoids searching in all the shards when the user selects just
> one or a few of the kinds. The tool runs a multisharded query of the proper
> shards. I guess this is a right approach, but correct me if I am wrong.
>
> The real problem of this architecture is the correlation between concurrent
> users and response time:
> 1 query: n seconds
> 2 queries: 2*n seconds each query
> 3 queries: 3*n seconds each query
> and so on...
>
> This is a real headache because a single query has an acceptable
> response time, but when many users are accessing the server the
> performance goes down sharply.
Re: Improving Solr performance
No, it also depends on the queries you execute (sorting is a big consumer) and the number of concurrent users. > Is that a general rule of thumb? That it is best to have about the > same amount of RAM as the size of your index? > > So, with a 5GB index, I should have between 4GB and 8GB of RAM > dedicated to solr?
Re: Improving Solr performance
> I see from your other messages that these indexes all live on the same
> machine. You're almost certainly I/O bound, because you don't have enough
> memory for the OS to cache your index files. With 100GB of total index
> size, you'll get best results with between 64GB and 128GB of total RAM.

Is that a general rule of thumb? That it is best to have about the same amount of RAM as the size of your index?

So, with a 5GB index, I should have between 4GB and 8GB of RAM dedicated to Solr?
Re: Improving Solr performance
These are definitely server-grade machines. There aren't any desktops I know of (that aren't made for HD video editing/rendering) that ever need that kind of memory.

Dennis Gearon

Signature Warning
It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
EARTH has a Right To Life, otherwise we all die.

- Original Message
From: Shawn Heisey
To: solr-user@lucene.apache.org
Sent: Sun, January 9, 2011 4:34:08 PM
Subject: Re: Improving Solr performance

On 1/7/2011 2:57 AM, supersoft wrote:
> I have deployed a 5-sharded infrastructure where: shard1 has 3124422 docs,
> shard2 has 920414 docs, shard3 has 602772 docs, shard4 has 2083492 docs,
> shard5 has 11915639 docs. Indexes total size: 100GB.
>
> The OS is Linux x86_64 (Fedora release 8) with vMem equal to 7872420 and I
> run the server using Jetty (from the Solr example download) with:
> java -Xmx3024M -Dsolr.solr.home=multicore -jar start.jar
>
> The response time for a query is around 2-3 seconds. Nevertheless, if I
> execute several queries at the same time the performance goes down
> immediately: 1 simultaneous query: 2516 ms; 2 simultaneous queries: 4250, 4469 ms;
> 3 simultaneous queries: 5781, 6219, 6219 ms; 4 simultaneous queries: 6484, 7203, 7719, 7781 ms...

I see from your other messages that these indexes all live on the same machine. You're almost certainly I/O bound, because you don't have enough memory for the OS to cache your index files. With 100GB of total index size, you'll get best results with between 64GB and 128GB of total RAM.

Alternatively, you could use SSD to store the indexes instead of spinning hard drives, or put each shard on its own physical machine with RAM appropriately sized for the index. For shard5 on its own machine, at 64GB index size, you might be able to get away with 32GB, but ideally you'd want 48-64GB.
Can you do anything to reduce the index size? Perhaps you are storing fields that you don't need to be returned in the search results. Ideally, you should only include enough information to fully populate a search results grid, and retrieve detail information for an individual document from the original data source instead of Solr. Thanks, Shawn
Re: Improving Solr performance
On 1/7/2011 2:57 AM, supersoft wrote:
> I have deployed a 5-sharded infrastructure where: shard1 has 3124422 docs,
> shard2 has 920414 docs, shard3 has 602772 docs, shard4 has 2083492 docs,
> shard5 has 11915639 docs. Indexes total size: 100GB.
>
> The OS is Linux x86_64 (Fedora release 8) with vMem equal to 7872420 and I
> run the server using Jetty (from the Solr example download) with:
> java -Xmx3024M -Dsolr.solr.home=multicore -jar start.jar
>
> The response time for a query is around 2-3 seconds. Nevertheless, if I
> execute several queries at the same time the performance goes down
> immediately: 1 simultaneous query: 2516 ms; 2 simultaneous queries: 4250, 4469 ms;
> 3 simultaneous queries: 5781, 6219, 6219 ms; 4 simultaneous queries: 6484, 7203, 7719, 7781 ms...

I see from your other messages that these indexes all live on the same machine. You're almost certainly I/O bound, because you don't have enough memory for the OS to cache your index files. With 100GB of total index size, you'll get best results with between 64GB and 128GB of total RAM.

Alternatively, you could use SSD to store the indexes instead of spinning hard drives, or put each shard on its own physical machine with RAM appropriately sized for the index. For shard5 on its own machine, at 64GB index size, you might be able to get away with 32GB, but ideally you'd want 48-64GB.

Can you do anything to reduce the index size? Perhaps you are storing fields that you don't need to be returned in the search results. Ideally, you should only include enough information to fully populate a search results grid, and retrieve detail information for an individual document from the original data source instead of Solr.

Thanks,
Shawn
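As a sanity check on the sizes involved, the shard document counts quoted above sum as follows (a sketch):

```python
# Per-shard document counts as posted by supersoft.
shard_docs = {
    "shard1": 3124422,
    "shard2": 920414,
    "shard3": 602772,
    "shard4": 2083492,
    "shard5": 11915639,
}
total = sum(shard_docs.values())
print(total)  # total documents across the 100GB of index
```

The total is roughly 18.6 million documents, with shard5 alone holding about 64% of them, which lines up with Shawn's 64GB index-size estimate for that shard.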
Re: Improving Solr performance
Are you using the Solr caches? These are configured in solrconfig.xml in each core. Make sure you have at least 50-100 entries configured for each kind of cache. Also, use filter queries: a filter query describes a subset of documents. When you run a series of queries against the same filter query, the second and subsequent queries are much, much faster. All of this is explained in the Solr 1.4 Enterprise Search Server book.

On Fri, Jan 7, 2011 at 7:20 AM, mike anderson wrote:
> Making sure the index can fit in memory (you don't have to allocate that
> much to Solr, just make sure it's available to the OS so it can cache it --
> otherwise you are paging the hard drive, which is why you are probably I/O
> bound) has been the key to our performance. We recently opted to use less
> RAM and store the indices on SSDs; we're still evaluating this approach, but
> so far it seems to be comparable, so I agree with Toke! (We have 18 shards
> and over 100GB of index.)
>
> On Fri, Jan 7, 2011 at 10:07 AM, Toke Eskildsen wrote:
>> On Fri, 2011-01-07 at 10:57 +0100, supersoft wrote:
>> [5 shards, 100GB, ~20M documents]
>> ...
>> [Low performance for concurrent searches]
>>
>> > Using JConsole for monitoring the server Java process I checked that Heap
>> > Memory and the CPU usage don't reach the upper limits, so the server
>> > shouldn't perform as overloaded.
>>
>> If memory and CPU are okay, the culprit is I/O.
>>
>> Solid state drives have more than proven their worth for random access
>> I/O, which is used a lot when searching with Solr/Lucene. SSDs are
>> plug-in replacements for hard drives and they virtually eliminate I/O
>> performance bottlenecks when searching. This also means shortened warm-up
>> requirements and less need for disk caching. Expanding RAM capacity
>> does not scale well and requires extensive warmup. Adding more machines
>> is expensive and often requires architectural changes.
>> With the current prices for SSDs, I consider them the generic first
>> suggestion for improving search performance.
>>
>> Extra spinning disks improve query throughput in general and speed
>> up single queries when the shards are searched in parallel. They do not
>> help much for a single sequential search of shards, as the seek time
>> for a single I/O request is the same regardless of the number of drives.
>> If your current response time for a single user is satisfactory, adding
>> drives is a viable solution for you. I'll still recommend the SSD option
>> though, as it will also lower the response time for a single query.
>>
>> Regards,
>> Toke Eskildsen

--
Lance Norskog
goks...@gmail.com
Re: Improving Solr performance
Making sure the index can fit in memory (you don't have to allocate that much to Solr, just make sure it's available to the OS so it can cache it -- otherwise you are paging the hard drive, which is why you are probably I/O bound) has been the key to our performance. We recently opted to use less RAM and store the indices on SSDs; we're still evaluating this approach, but so far it seems to be comparable, so I agree with Toke! (We have 18 shards and over 100GB of index.)

On Fri, Jan 7, 2011 at 10:07 AM, Toke Eskildsen wrote:
> On Fri, 2011-01-07 at 10:57 +0100, supersoft wrote:
> [5 shards, 100GB, ~20M documents]
> ...
> [Low performance for concurrent searches]
>
> > Using JConsole for monitoring the server Java process I checked that Heap
> > Memory and the CPU usage don't reach the upper limits, so the server
> > shouldn't perform as overloaded.
>
> If memory and CPU are okay, the culprit is I/O.
>
> Solid state drives have more than proven their worth for random access
> I/O, which is used a lot when searching with Solr/Lucene. SSDs are
> plug-in replacements for hard drives and they virtually eliminate I/O
> performance bottlenecks when searching. This also means shortened warm-up
> requirements and less need for disk caching. Expanding RAM capacity
> does not scale well and requires extensive warmup. Adding more machines
> is expensive and often requires architectural changes. With the current
> prices for SSDs, I consider them the generic first suggestion for
> improving search performance.
>
> Extra spinning disks improve query throughput in general and speed
> up single queries when the shards are searched in parallel. They do not
> help much for a single sequential search of shards, as the seek time
> for a single I/O request is the same regardless of the number of drives.
> If your current response time for a single user is satisfactory, adding
> drives is a viable solution for you.
> I'll still recommend the SSD option though, as it will also lower the
> response time for a single query.
>
> Regards,
> Toke Eskildsen
Re: Improving Solr performance
On Fri, 2011-01-07 at 10:57 +0100, supersoft wrote:
[5 shards, 100GB, ~20M documents]
...
[Low performance for concurrent searches]

> Using JConsole for monitoring the server Java process I checked that Heap
> Memory and the CPU usage don't reach the upper limits, so the server
> shouldn't perform as overloaded.

If memory and CPU are okay, the culprit is I/O.

Solid state drives have more than proven their worth for random access I/O, which is used a lot when searching with Solr/Lucene. SSDs are plug-in replacements for hard drives and they virtually eliminate I/O performance bottlenecks when searching. This also means shortened warm-up requirements and less need for disk caching. Expanding RAM capacity does not scale well and requires extensive warmup. Adding more machines is expensive and often requires architectural changes. With the current prices for SSDs, I consider them the generic first suggestion for improving search performance.

Extra spinning disks improve query throughput in general and speed up single queries when the shards are searched in parallel. They do not help much for a single sequential search of shards, as the seek time for a single I/O request is the same regardless of the number of drives. If your current response time for a single user is satisfactory, adding drives is a viable solution for you. I'll still recommend the SSD option though, as it will also lower the response time for a single query.

Regards,
Toke Eskildsen
Re: Improving Solr performance
It sounds like your system is I/O bound, and I suspect (bet, even) that all your index files are on the same disk drive. Also, you have only 8GB of RAM for 100GB of index, so while your Solr instance will cache some stuff and the balance will be used for caching file blocks, there really isn't enough memory there for effective caching. I would suggest you check your machine's performance with something like atop ( http://www.atoptool.nl/ ) to see where your bottlenecks are (check the disk I/O). As I said, I think you are I/O bound, and if all your shards are on the same drive there will be I/O contention when running simultaneous searches. Your solutions are (in rough ascending order of cost):

- make your indices smaller (reduce disk I/O)
- buy more drives and spread your indices across the drives (reduce contention)
- buy more RAM (increase caching)
- buy more machines (more throughput)

Good luck!

François

On Jan 7, 2011, at 4:57 AM, supersoft wrote:
> I have deployed a 5-sharded infrastructure where: shard1 has 3124422 docs,
> shard2 has 920414 docs, shard3 has 602772 docs, shard4 has 2083492 docs,
> shard5 has 11915639 docs. Indexes total size: 100GB.
>
> The OS is Linux x86_64 (Fedora release 8) with vMem equal to 7872420 and I
> run the server using Jetty (from the Solr example download) with: java
> -Xmx3024M -Dsolr.solr.home=multicore -jar start.jar
>
> The response time for a query is around 2-3 seconds. Nevertheless, if I
> execute several queries at the same time the performance goes down
> immediately: 1 simultaneous query: 2516 ms; 2 simultaneous queries: 4250,
> 4469 ms; 3 simultaneous queries: 5781, 6219, 6219 ms; 4 simultaneous
> queries: 6484, 7203, 7719, 7781 ms...
>
> Using JConsole for monitoring the server java process I checked that Heap
> Memory and the CPU usage don't reach the upper limits, so the server
> shouldn't perform as overloaded. Can anyone give me an approach for how I
> should tune the instance so it is not so heavily dependent on the number
> of simultaneous queries?
>
> Thanks in advance
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Improving-Solr-performance-tp2210843p2210843.html
> Sent from the Solr - User mailing list archive at Nabble.com.
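Average latency alone hides part of what the numbers above show; computing throughput alongside it makes the trend visible. A minimal sketch, using the response times reported in this thread and assuming all k queries of a batch start at the same moment (so the batch finishes when the slowest query does):

```python
def summarize(times_ms):
    """Average response time (ms) and throughput (queries/sec) for a
    batch of concurrent queries, assuming they all started together."""
    avg = sum(times_ms) / len(times_ms)
    qps = len(times_ms) / (max(times_ms) / 1000.0)  # batch ends with the slowest query
    return round(avg, 1), round(qps, 2)

# Response times reported in this thread:
print(summarize([2516]))                    # 1 simultaneous query
print(summarize([4250, 4469]))              # 2 simultaneous queries
print(summarize([6484, 7203, 7719, 7781]))  # 4 simultaneous queries
```

With these numbers, average latency roughly doubles per step while throughput barely moves, which is consistent with queries contending for a single serialized resource such as one disk.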
Re: Improving Solr performance
The reason for this distribution is the kind of the documents. In spite of having the same schema structure (and Solr conf), a document belongs to 1 of 5 different kinds. Each kind corresponds to a concrete shard, and because of this the client tool avoids searching all the shards when the user selects just one or a few kinds: it runs a multi-shard query against only the relevant shards. I guess this is a right approach, but correct me if I am wrong. The real problem with this architecture is the correlation between concurrent users and response time: 1 query: n seconds; 2 queries: 2*n seconds each; 3 queries: 3*n seconds each; and so on. This is a real headache, because a single query has an acceptable response time, but when many users access the server the performance degrades sharply. -- View this message in context: http://lucene.472066.n3.nabble.com/Improving-Solr-performance-tp2210843p2211305.html Sent from the Solr - User mailing list archive at Nabble.com.
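The 1:n, 2:2n, 3:3n pattern described above is the signature of a fully serialized shared resource (here, most likely the single disk): each extra concurrent query adds its full cost to every other query, so throughput never improves. An illustrative sketch of that model (the 2.5s single-query time is just an example figure, not a measurement):

```python
def serialized_latency(n_queries, single_query_s):
    """If k concurrent queries fully contend for one resource, each takes
    roughly k * n seconds, so throughput stays flat at 1/n regardless of k."""
    each_s = n_queries * single_query_s
    throughput = n_queries / each_s  # == 1 / single_query_s, independent of k
    return each_s, throughput

for k in (1, 2, 3):
    print(k, serialized_latency(k, 2.5))  # per-query time grows, throughput does not
```

A healthy setup (index cached, or I/O spread over drives) should instead show throughput rising with concurrency until CPU or disk saturates.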
RE: Improving Solr performance
Open a new mail conversation for that. - Grijesh -- View this message in context: http://lucene.472066.n3.nabble.com/Improving-Solr-performance-tp2210843p2211300.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Improving Solr performance
Hi, always an interesting question! Could anyone propose a generic (and approximate) equation: Search_time = F(Nb_of_servers, RAM_size_per_server, CPU_of_servers, Nb_of_shards, Nb_of_documents, Total_size_of_documents or Average_size_of_a_document, Nb_requests_in_minute, Nb_indexed_fields_in_index, ...) ? Regards, --- Hong-Thai

-Original message-
From: Grijesh.singh [mailto:pintu.grij...@gmail.com]
Sent: Friday, January 7, 2011 12:29
To: solr-user@lucene.apache.org
Subject: Re: Improving Solr performance

Shards are used when the index size becomes huge and performance goes down; shards are distributed indexes. But if you put all the shards on the same machine as multiple cores, it will not help performance much. Shards should also hold indexes of roughly equal size. There is also not enough RAM here to perform better: if your whole index can be loaded into the cache, you will get better performance. Also, your indexes are not equally distributed, so the shards have different response times. When working with shards, keep in mind that the main searcher sends the query to all shards, waits for the response from every shard, merges all responses into a single result, and returns it. So if any shard takes longer to respond, your total response time is affected.
- Grijesh -- View this message in context: http://lucene.472066.n3.nabble.com/Improving-Solr-performance-tp2210843p2211228.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Improving Solr performance
Shards are used when the index size becomes huge and performance goes down; shards are distributed indexes. But if you put all the shards on the same machine as multiple cores, it will not help performance much. Shards should also hold indexes of roughly equal size. There is also not enough RAM here to perform better: if your whole index can be loaded into the cache, you will get better performance. Also, your indexes are not equally distributed, so the shards have different response times. When working with shards, keep in mind that the main searcher sends the query to all shards, waits for the response from every shard, merges all responses into a single result, and returns it. So if any shard takes longer to respond, your total response time is affected. - Grijesh -- View this message in context: http://lucene.472066.n3.nabble.com/Improving-Solr-performance-tp2210843p2211228.html Sent from the Solr - User mailing list archive at Nabble.com.
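The distributed-search behavior described above can be sketched as a scatter-gather latency model: the coordinating searcher queries all shards in parallel and cannot return until the slowest shard has answered. The shard times and the merge overhead below are made-up illustration values, not measurements from this thread:

```python
def sharded_query_time(shard_times_ms, merge_ms=50):
    """Scatter-gather: total latency is the slowest shard's response time
    plus the coordinator's merge overhead (an assumed constant here)."""
    return max(shard_times_ms) + merge_ms

# Unevenly sized shards tend to produce uneven latencies; one slow shard
# dominates the whole query.
print(sharded_query_time([400, 150, 100, 300, 2200]))  # -> 2250
```

This is why balancing shard sizes matters: shrinking the four fast shards further buys nothing while the largest shard sets the floor for every multi-shard query.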
Re: Improving Solr performance
1 - Yes, all the shards are on the same machine. 2 - The machine RAM is 7.8GB and I assign 3.4GB to the Solr server. 3 - The shard sizes (GB) are 17, 5, 3, 11, 64. -- View this message in context: http://lucene.472066.n3.nabble.com/Improving-Solr-performance-tp2210843p2211135.html Sent from the Solr - User mailing list archive at Nabble.com.
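Putting those numbers together as a back-of-the-envelope sketch (the figures come from this thread, the arithmetic is only illustrative): with 7.8GB of RAM and a 3.4GB Solr heap, the OS has at most ~4.4GB left to cache 100GB of index, so only a few percent of the index can ever be memory-resident, and the rest is disk seeks:

```python
def cache_fit_fraction(index_gb, ram_gb, heap_gb):
    """Rough fraction of the index the OS page cache can hold, assuming
    all RAM outside the Solr heap is available for file caching."""
    available_gb = max(ram_gb - heap_gb, 0.0)
    return min(available_gb / index_gb, 1.0)

shard_sizes_gb = [17, 5, 3, 11, 64]   # shard sizes reported above
index_gb = sum(shard_sizes_gb)        # 100GB total
print(round(cache_fit_fraction(index_gb, 7.8, 3.4), 3))  # ~4% cacheable at best
```

This back-of-the-envelope number is consistent with the I/O-bound diagnosis elsewhere in the thread: nearly every concurrent query has to hit the disk.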
Re: Improving Solr performance
Some questions:
1 - Are all shards on the same machine?
2 - What is your RAM size?
3 - What is the size of the index on each shard, in GB?
- Grijesh -- View this message in context: http://lucene.472066.n3.nabble.com/Improving-Solr-performance-tp2210843p2210878.html Sent from the Solr - User mailing list archive at Nabble.com.
Improving Solr performance
I have deployed a 5-sharded infrastructure where: shard1 has 3124422 docs, shard2 has 920414 docs, shard3 has 602772 docs, shard4 has 2083492 docs, shard5 has 11915639 docs. Indexes total size: 100GB. The OS is Linux x86_64 (Fedora release 8) with vMem equal to 7872420 and I run the server using Jetty (from the Solr example download) with: java -Xmx3024M -Dsolr.solr.home=multicore -jar start.jar. The response time for a query is around 2-3 seconds. Nevertheless, if I execute several queries at the same time the performance goes down immediately: 1 simultaneous query: 2516 ms; 2 simultaneous queries: 4250, 4469 ms; 3 simultaneous queries: 5781, 6219, 6219 ms; 4 simultaneous queries: 6484, 7203, 7719, 7781 ms... Using JConsole for monitoring the server java process I checked that Heap Memory and the CPU usage don't reach the upper limits, so the server shouldn't perform as overloaded. Can anyone give me an approach for how I should tune the instance so it is not so heavily dependent on the number of simultaneous queries? Thanks in advance -- View this message in context: http://lucene.472066.n3.nabble.com/Improving-Solr-performance-tp2210842p2210842.html Sent from the Solr - User mailing list archive at Nabble.com.