Solr8.0.0 Performance Test

2019-05-18 Thread Kayak28
Hello, Apache Solr community members:

I have a few questions about the load test of Solr8.

- For Solr8, the optimize command merges segments down to 2, but not 1.
Is that expected behavior?
When indexing Wikipedia data, Solr8 generated multiple segments,
so I executed the optimize command from the Admin UI.
Solr8 did reduce the number of segments, but it left two segments.
Hence, I wonder whether this is expected behavior or something is wrong.
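For reference, the same optimize can be issued against the update handler with an explicit segment target, which the Admin UI button does not expose. A minimal sketch of building that request URL (the host and core name "wikipedia" are hypothetical; `optimize` and `maxSegments` are standard update-handler parameters):

```python
from urllib.parse import urlencode

# Hypothetical host and core name; optimize/maxSegments/waitSearcher
# are standard Solr update-handler parameters.
base = "http://localhost:8983/solr/wikipedia/update"
query = urlencode({"optimize": "true", "maxSegments": 1, "waitSearcher": "true"})
url = f"{base}?{query}"
print(url)
```

Sending a GET/POST to that URL would ask Solr to merge down to one segment; whether it actually ends at one can still depend on the merge policy's size limits.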



- In a certain situation (explained below), Solr8 (without the use of
HTTP/2 and the block-max WAND algorithm) is faster than Solr7.4.0. What are
the possible causes of this performance improvement? Or did I plan the
load test badly?

Here is how I arrived at these questions.
I performed a simple load test on Solr8 to observe the difference in
performance, wondering how much faster it is compared to Solr7.4.0, which
is the version I currently use.

My testing environment is below:
OS: Ubuntu 16.04
Vendor: DELL PowerEdge T410
CPU: Intel(R) Xeon(R) E5620 @ 2.40 GHz, 8 cores
Memory: 16GB
Hard Disk: 3.5 Inch SATA (7,200 rpm): 500 GB

The data is from the Japanese Wikipedia dump.

After indexing, both versions of Solr store 2,366,754 documents; the index
size is 8.48 GB and the JVM heap is 8 GB.

To perform the load tests several times, only fieldValueCache and
fieldCache are enabled; Solr's other caches are turned off.

I use JMeter (5.1.1) to measure average response time and throughput.
I know JMeter only sends HTTP/1.1 requests unless a plugin is used (and I
did not use the plugin), so this result should not be affected by HTTP/2.

Also, according to a JIRA issue
(https://issues.apache.org/jira/browse/SOLR-13289), Solr8 does not yet
support the block-max WAND algorithm, so again this result should not be
affected by that algorithm, which makes Lucene faster.

The results from JMeter are attached as a PDF file.

According to these results, Solr8 is somehow superior to Solr7.4.0.

But I have no idea what the possible causes of this difference are.
Does anyone have any idea about this?


Sincerely,
Kaya Ota


Re: minimize disc space requirement.

2019-05-18 Thread Erick Erickson
It Depends (tm).

No, limiting the background threads won’t help much. Here’s the issue:
At time T, the segments file contains the current “snapshot” of the index, i.e. 
the names of all the segments that have been committed.

At time T+N, another commit happens. Or, consider an optimize which for 6x 
defaults to merging into a single segment. During any merge, _all_ the new 
segments are written before _any_ old segment is deleted. The very last 
operation is to rewrite the segments file, but only after all the new segments 
are flushed.

After this point, the next time a searcher is opened all the old, 
no-longer-used segments will be deleted, but the trigger is opening a new 
searcher.

To make matters more interesting, say new documents are indexed during the 
merge process. Those go into new segments that aren’t counted in the totals 
above. Plus you have transaction logs being written, which are usually pretty 
small but can grow between commits.

I’ve used optimize as the example, but it’s at least theoretically possible 
that all the current segments are rewritten into a larger segment as part of a 
normal merge. This is frankly not very likely with large indexes (say > 20G) 
but still possible.

Now all that said, on a disk that’s hosting multiple replicas from multiple 
shards and/or multiple collections, the likelihood of all this happening at 
once (barring someone issuing an optimize for all the collections hosted on the 
machine) is very low. But what you’re risking is an unknown. Lucene/Solr try 
very hard to prevent bad stuff happening on a “disk full” situation, but given 
the number of possible code paths that could be affected it can’t be guaranteed 
to have benign outcomes.

So perhaps you can run forever with, say, 25% of the aggregate index size free. 
Perhaps you’ll blow up unexpectedly and there’s really no way to say ahead of 
time.
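As a back-of-the-envelope sketch of the worst case described above (the 70 GB figure is taken from the original question; everything else is illustrative arithmetic, not a Solr API):

```python
def peak_disk_gb(index_gb: float) -> float:
    """Worst-case transient disk use when the whole index is rewritten.

    All new segments are flushed before any old segment is deleted,
    so the old and new copies coexist at the peak.
    """
    return index_gb * 2.0

# A 70 GB shard can briefly need ~140 GB during an optimize, before
# counting freshly indexed segments and transaction logs written meanwhile.
print(peak_disk_gb(70.0))
```

Which is why the usual advice is to budget 2x, and (per Shawn, elsewhere in this thread) ideally 3x, the steady-state index size.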

Best,
Erick

> On May 18, 2019, at 8:36 AM, tom_s  wrote:
> 
> hey, 
> im aware that the best practice is to have disk space on your solr servers
> to be 2 times the size of the index. but my goal to minimize this overhead
> and have my index occupy more than 50% of disk space. in our index documents
> have TTL, so documents are deleted every day and it causes background merge
> of segments. can i change the merge policy and make the overhead of
> background merging lower?  
> will limiting the number of concurrent merges help(with the maxMergeCount
> parameter)? do you know other methods that will help? 
> 
> info about my server: 
> i use solr 6.5.1 . i index 200/docs per hour for each shard.i hard commit
> every 5 minutes. the size of the index in each shard is around 70GB (with
> around 15% deletions) . 
> i use the following merge policy:
> 
>  2
>  4
> 
> (the rest of the params are default) 
> 
> thanks
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: minimize disc space requirement.

2019-05-18 Thread Erick Erickson
Oh, and none of that includes people adding more and more documents to the 
existing replicas….

> On May 18, 2019, at 10:22 AM, Shawn Heisey  wrote:
> 
> On 5/18/2019 9:36 AM, tom_s wrote:
>> im aware that the best practice is to have disk space on your solr servers
>> to be 2 times the size of the index. but my goal to minimize this overhead
>> and have my index occupy more than 50% of disk space. in our index documents
>> have TTL, so documents are deleted every day and it causes background merge
>> of segments. can i change the merge policy and make the overhead of
>> background merging lower?
>> will limiting the number of concurrent merges help(with the maxMergeCount
>> parameter)? do you know other methods that will help?
> 
> Actually the recommendation is to have enough space for the index to triple, 
> not just double.  This can happen in the wild.
> 
> There are no merge settings that can prevent situations where the index 
> doubles in size temporarily due to merging.  Chances are that it's going to 
> happen eventually to any index.
> 
> Thanks,
> Shawn



Re: minimize disc space requirement.

2019-05-18 Thread Shawn Heisey

On 5/18/2019 9:36 AM, tom_s wrote:

im aware that the best practice is to have disk space on your solr servers
to be 2 times the size of the index. but my goal to minimize this overhead
and have my index occupy more than 50% of disk space. in our index documents
have TTL, so documents are deleted every day and it causes background merge
of segments. can i change the merge policy and make the overhead of
background merging lower?
will limiting the number of concurrent merges help(with the maxMergeCount
parameter)? do you know other methods that will help?


Actually the recommendation is to have enough space for the index to 
triple, not just double.  This can happen in the wild.


There are no merge settings that can prevent situations where the index 
doubles in size temporarily due to merging.  Chances are that it's going 
to happen eventually to any index.


Thanks,
Shawn


minimize disc space requirement.

2019-05-18 Thread tom_s
hey, 
I'm aware that the best practice is to have disk space on your Solr servers
be twice the size of the index, but my goal is to minimize this overhead and
let my index occupy more than 50% of the disk. In our index, documents have
a TTL, so documents are deleted every day, which causes background merging
of segments. Can I change the merge policy to lower the overhead of
background merging?
Will limiting the number of concurrent merges (with the maxMergeCount
parameter) help? Do you know of other methods that would help?

Info about my server:
I use Solr 6.5.1. I index 200 docs per hour for each shard and hard commit
every 5 minutes. The size of the index in each shard is around 70GB (with
around 15% deleted documents).
I use the following merge policy:

  2
  4

(the rest of the params are default) 

thanks





Re: Distributed IDF in Alias

2019-05-18 Thread Erick Erickson
In a word, “yes”. For time routed aliases, you also have to be aware of the 
nature of your data. Take the canonical example of news stories, and let’s 
assume that every day a new collection is created.

Now a hot news story breaks and the news is flooded with the latest story, 
“Hurricane hits Florida” for instance. The recent collections will contain 
many more mentions of Florida vs. older collections. So the TF/IDF statistics 
for recent collections will be much different than for old collections.

In the normal SolrCloud case where routing is done by hashing the uniqueKey, 
the assumption is that the close-to-random distribution of stories will make 
the stats on individual shards “close enough”.
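To make the effect concrete, a small sketch using Lucene's default BM25 IDF formula; the document counts below are invented purely for illustration:

```python
import math

def bm25_idf(doc_count: int, doc_freq: int) -> float:
    # Lucene's BM25 IDF: log(1 + (N - df + 0.5) / (df + 0.5))
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# "florida" in a mature monthly collection vs. a young, hot-news-heavy one
old_collection = bm25_idf(doc_count=1_000_000, doc_freq=50_000)
new_collection = bm25_idf(doc_count=2_000, doc_freq=800)
# The term is relatively rarer (more discriminative) in the old collection,
# so the same query weights it quite differently per collection.
```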

Best,
Erick

> On May 17, 2019, at 11:14 PM, SOLR4189  wrote:
> 
> I ask my question due to I want to use TRA (Time Routed Aliases). Let's say
> SOLR will open new collection every month. In the beginning of month a new
> collection will be empty almost. 
> So IDF will be different between new collection and collection of previous
> month? 
> 
> 
> 



Re: Distributed IDF in Alias

2019-05-18 Thread Andrzej Białecki
Yes, the IDFs will be different. You could probably implement a custom 
component that would take term statistics from the previous collections to 
pre-populate the stats of the current collection, but this is uncharted 
territory and there’s a lot that could go wrong. E.g., if there’s a genuine 
shift in the term distribution in more recent documents, then you probably 
would not want the old statistics to skew the more recent results; at the 
least you would want to use some weighting factor. And at that point, 
predicting the final term IDFs (and consequently document rankings) becomes 
quite complicated.
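As a rough illustration of what such a weighting factor could look like (purely hypothetical; nothing like this ships with Solr), one might rescale last month's document frequencies to the young collection's size and mix them with the freshly observed ones:

```python
def blended_doc_freq(old_df: float, old_n: float,
                     new_df: float, new_n: float,
                     weight: float) -> float:
    """Hypothetical blend of old and new term statistics.

    weight=1.0 trusts only the (rescaled) old stats; weight=0.0 trusts
    only what the new collection has seen so far.
    """
    scaled_old = old_df * (new_n / old_n)  # rescale to the new collection's size
    return weight * scaled_old + (1.0 - weight) * new_df

# e.g. a term with df=50,000 in a 1M-doc collection, vs. df=800 observed
# so far in a 2,000-doc collection, with an even 0.5 weighting
print(blended_doc_freq(50_000, 1_000_000, 800, 2_000, weight=0.5))
```

Even this toy version shows the problem Andrzej describes: the result depends heavily on the weight, and a genuine distribution shift (like the hot-news example above) would argue for trusting the new stats more.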

> On 18 May 2019, at 08:14, SOLR4189  wrote:
> 
> I ask my question due to I want to use TRA (Time Routed Aliases). Let's say
> SOLR will open new collection every month. In the beginning of month a new
> collection will be empty almost. 
> So IDF will be different between new collection and collection of previous
> month? 
> 
> 
> 