Solr 8.0.0 Performance Test
Hello, Apache Solr community members:

I have a few questions about my load test of Solr 8.

- For Solr 8, the optimize command merges segments down to 2, but not 1. Is that expected behavior? When indexing Wikipedia data, Solr 8 generated multiple segments, so I ran the optimize command from the Admin UI. Solr 8 did reduce the number of segments, but it left two. I wonder whether this is expected or something odd.
- In the situation explained below, Solr 8 (without the use of HTTP/2 or the block-max WAND algorithm) is faster than Solr 7.4.0. What are the likely causes of this performance improvement? Or did I plan the load test badly?

Here is how I came up with these questions. I performed a simple load test on Solr 8 to observe the difference in performance compared to Solr 7.4.0, which is the version I currently use.

My testing environment:
OS: Ubuntu 16.04
Vendor: DELL PowerEdge T410
CPU: Intel(R) Xeon(R) CPU E5620 @ 2.40 GHz, 8 cores
Memory: 16 GB
Hard disk: 3.5-inch SATA (7,200 rpm), 500 GB

The data is from the Japanese Wikipedia dump. After indexing it, both versions of Solr store 2,366,754 documents; the index size is 8.48 GB and the JVM heap is 8 GB. To be able to run the load test several times, only fieldValueCache and fieldCache are enabled; Solr's other caches are turned off.

I use JMeter (5.1.1) to measure average response time and throughput. I know JMeter only sends HTTP/1 requests unless a plugin is used (and I did not use the plugin), so this result should not be affected by HTTP/2. Also, according to a JIRA issue ( https://issues.apache.org/jira/browse/SOLR-13289 ), Solr 8 does not support the block-max WAND algorithm yet, so again this result should not be affected by that algorithm, which makes Lucene faster.

The results from JMeter are attached as a PDF file. According to these results, Solr 8 is somehow superior to Solr 7.4.0, but I have no idea what the likely causes of this difference are.
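For anyone who wants to reproduce a setup like this, caches are configured in solrconfig.xml. A minimal sketch of what "other caches turned off" might look like (the zeroed sizes are my assumption for "turned off", not the poster's actual config):

```xml
<query>
  <!-- Sketch: size="0" effectively disables a cache, so repeated load-test
       runs are not skewed by cache hits. fieldValueCache and fieldCache
       (the two left enabled in the test) are internal and not shown here. -->
  <filterCache class="solr.FastLRUCache" size="0" initialSize="0" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="0" initialSize="0" autowarmCount="0"/>
  <documentCache class="solr.LRUCache" size="0" initialSize="0" autowarmCount="0"/>
</query>
```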
Does anyone have any idea about this?

Sincerely,
Kaya Ota
Re: minimize disc space requirement.
It Depends (tm). No, limiting the background threads won't help much. Here's the issue:

At time T, the segments file contains the current "snapshot" of the index, i.e. the names of all the segments that have been committed. At time T+N, another commit happens. Or, consider an optimize, which for 6x defaults to merging into a single segment. During any merge, _all_ the new segments are written before _any_ old segment is deleted. The very last operation is to rewrite the segments file, but only after all the new segments are flushed. After this point, all the old, no-longer-used segments will be deleted the next time a searcher is opened, but the trigger is opening a new searcher.

To make matters more interesting, say new documents are indexed during the merge process. Those go into new segments that aren't in the totals above. Plus you have transaction logs being written, which are usually pretty small but can grow between commits.

I've used optimize as the example, but it's at least theoretically possible that all the current segments are rewritten into a larger segment as part of a normal merge. This is frankly not very likely with large indexes (say > 20G), but still possible.

Now all that said, on a disk that's hosting multiple replicas from multiple shards and/or multiple collections, the likelihood of all this happening at once (barring someone issuing an optimize for all the collections hosted on the machine) is very low. But what you're risking is an unknown. Lucene/Solr try very hard to prevent bad stuff happening in a "disk full" situation, but given the number of possible code paths that could be affected, benign outcomes can't be guaranteed. So perhaps you can run forever with, say, 25% of the aggregate index size free. Perhaps you'll blow up unexpectedly. There's really no way to say ahead of time.
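To put rough numbers on the "all new segments are written before any old segment is deleted" point, here is a back-of-the-envelope sketch in Python, using the 70 GB shard and ~15% deletions from the original post (the arithmetic is mine, not from the thread):

```python
# Peak disk usage during a full optimize: the old segments and the freshly
# merged segment coexist on disk until a new searcher is opened.
index_gb = 70.0        # current shard size from the original post
deleted_frac = 0.15    # ~15% deleted docs, also from the post

# The merged segment keeps only live documents, so it is somewhat smaller.
merged_gb = index_gb * (1 - deleted_frac)

# Until the old segments are released, both copies occupy disk at once.
peak_gb = index_gb + merged_gb
print(peak_gb)  # 129.5
```

Concurrent indexing and transaction logs add on top of this, which is one reason the usual guidance assumes even more headroom.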
Best,
Erick

> On May 18, 2019, at 8:36 AM, tom_s wrote:
>
> hey,
> im aware that the best practice is to have disk space on your solr servers
> to be 2 times the size of the index. but my goal to minimize this overhead
> and have my index occupy more than 50% of disk space. in our index documents
> have TTL, so documents are deleted every day and it causes background merge
> of segments. can i change the merge policy and make the overhead of
> background merging lower?
> will limiting the number of concurrent merges help (with the maxMergeCount
> parameter)? do you know other methods that will help?
>
> info about my server:
> i use solr 6.5.1. i index 200 docs per hour for each shard. i hard commit
> every 5 minutes. the size of the index in each shard is around 70GB (with
> around 15% deletions).
> i use the following merge policy:
>
> 2
> 4
>
> (the rest of the params are default)
>
> thanks
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: minimize disc space requirement.
Oh, and none of that includes people adding more and more documents to the existing replicas…

> On May 18, 2019, at 10:22 AM, Shawn Heisey wrote:
>
> On 5/18/2019 9:36 AM, tom_s wrote:
>> im aware that the best practice is to have disk space on your solr servers
>> to be 2 times the size of the index. but my goal to minimize this overhead
>> and have my index occupy more than 50% of disk space. in our index documents
>> have TTL, so documents are deleted every day and it causes background merge
>> of segments. can i change the merge policy and make the overhead of
>> background merging lower?
>> will limiting the number of concurrent merges help (with the maxMergeCount
>> parameter)? do you know other methods that will help?
>
> Actually the recommendation is to have enough space for the index to triple,
> not just double. This can happen in the wild.
>
> There are no merge settings that can prevent situations where the index
> doubles in size temporarily due to merging. Chances are that it's going to
> happen eventually to any index.
>
> Thanks,
> Shawn
Re: minimize disc space requirement.
On 5/18/2019 9:36 AM, tom_s wrote:
> im aware that the best practice is to have disk space on your solr servers
> to be 2 times the size of the index. but my goal to minimize this overhead
> and have my index occupy more than 50% of disk space. in our index documents
> have TTL, so documents are deleted every day and it causes background merge
> of segments. can i change the merge policy and make the overhead of
> background merging lower?
> will limiting the number of concurrent merges help (with the maxMergeCount
> parameter)? do you know other methods that will help?

Actually the recommendation is to have enough space for the index to triple, not just double. This can happen in the wild.

There are no merge settings that can prevent situations where the index doubles in size temporarily due to merging. Chances are that it's going to happen eventually to any index.

Thanks,
Shawn
minimize disc space requirement.
Hey,

I'm aware that the best practice is to have disk space on your Solr servers be 2 times the size of the index, but my goal is to minimize this overhead and have my index occupy more than 50% of disk space. In our index, documents have a TTL, so documents are deleted every day, and that causes background merging of segments. Can I change the merge policy to make the overhead of background merging lower? Will limiting the number of concurrent merges help (with the maxMergeCount parameter)? Do you know of other methods that would help?

Info about my server:
I use Solr 6.5.1. I index 200 docs per hour for each shard, and I hard commit every 5 minutes. The size of the index in each shard is around 70 GB (with around 15% deletions). I use the following merge policy:

2
4

(the rest of the params are default)

Thanks

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
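For reference, merge-policy parameters in Solr 6.x are set in solrconfig.xml through a mergePolicyFactory element. The sketch below is a hypothetical reconstruction: the list archive stripped the XML tags from the message above, leaving only the values 2 and 4, so the parameter names here are my guess, not the poster's actual config:

```xml
<!-- Hypothetical reconstruction; only the values 2 and 4 are from the
     original post, the element and parameter names are illustrative. -->
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">2</int>
  <int name="segmentsPerTier">4</int>
</mergePolicyFactory>
```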
Re: Distributed IDF in Alias
In a word, “yes”.

For a time routed alias, you also have to be aware of the nature of your data. Take the canonical example of news stories, and let's assume that every day a new collection is created. Now a hot news story breaks and the news is flooded with the latest story, “Hurricane hits Florida” for instance. The recent news will contain many more mentions of Florida vs. older collections. So the TF/IDF statistics for recent collections will be much different than for old collections.

In the normal SolrCloud case, where routing is done by hashing the uniqueKey, the assumption is that the close-to-random distribution of stories will make the stats on individual shards “close enough”.

Best,
Erick

> On May 17, 2019, at 11:14 PM, SOLR4189 wrote:
>
> I ask my question due to I want to use TRA (Time Routed Aliases). Let's say
> SOLR will open new collection every month. In the beginning of month a new
> collection will be empty almost.
> So IDF will be different between new collection and collection of previous
> month?
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
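The skew described above can be illustrated with the classic IDF formula; this is a toy sketch with invented document counts (not real news or Wikipedia data):

```python
import math

def idf(total_docs, docs_with_term):
    # Classic IDF: rarer terms score higher. Lucene's actual formula
    # differs slightly, but the direction of the skew is the same.
    return math.log(total_docs / docs_with_term)

# A hot story inflates the document frequency of "Florida" in the
# newest collection, so its IDF there is much lower.
recent_idf = idf(10_000, 2_000)  # 20% of recent docs mention the term
older_idf = idf(10_000, 100)     # 1% of older docs do

print(round(recent_idf, 2), round(older_idf, 2))  # 1.61 4.61
```

The same query term therefore contributes very different score weights depending on which collection in the alias a document came from.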
Re: Distributed IDF in Alias
Yes, the IDFs will be different.

You could probably implement a custom component that would take term statistics from the previous collections to pre-populate the stats of the current collection, but this is uncharted territory; there's a lot that could go wrong. E.g., if there's a genuine shift in the term distribution in more recent documents, then you probably would not want the old statistics to skew the more recent results; at the least you would want to use some weighting factor. And at this point, predicting the final term IDFs (and consequently document rankings) becomes quite complicated.

> On 18 May 2019, at 08:14, SOLR4189 wrote:
>
> I ask my question due to I want to use TRA (Time Routed Aliases). Let's say
> SOLR will open new collection every month. In the beginning of month a new
> collection will be empty almost.
> So IDF will be different between new collection and collection of previous
> month?
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
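The weighting-factor idea above could be sketched as a simple linear blend; this is my own toy formula, not anything Solr provides:

```python
def blended_idf(idf_current, idf_previous, weight_previous=0.5):
    # Linear interpolation between the (sparse) current collection's IDF
    # and the previous collection's IDF. weight_previous would shrink as
    # the new collection fills up with documents.
    return (1 - weight_previous) * idf_current + weight_previous * idf_previous

# Early in the month, lean heavily on last month's statistics:
print(blended_idf(4.0, 2.0, weight_previous=0.75))  # 2.5
```

Even this simple version raises the hard questions from the reply: how fast to decay the weight, and what to do when the term distribution genuinely shifts.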