Re[2]: Optimize question

2018-04-23 Thread Scott M.
So, basically I made the first mistake by optimizing? At this point, since it 
seems I can't stop these optimizations from running, should I just drop all the 
data and start fresh?
On Mon, Apr 23, 2018 at 01:23 PM, Erick Erickson  wrote:
No, it's not "optimizing on its own". At least it better not be.

As far as your index growing after optimize, that's the little
"gotcha" with optimize, see:
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

This is being addressed in the 7.4 time frame (hopefully), see LUCENE-7976.

Best,
Erick

On Mon, Apr 23, 2018 at 10:13 AM, Scott M.  wrote:
I recently installed Solr 7.1 and configured it to work with Dovecot for 
full-text searching. It works great but after about 2 days of indexing, I've 
pressed the 'Optimize' button. At that point it had collected about 17 million 
documents and it was taking up about 60-70GB of space.

It completed once and the space dropped down to 30-45GB but since then it 
appears to be doing Optimize again on its own, regularly swelling up the total 
space used to double, then it shrinks again, stays a bit that way then it 
starts another optimize!

Logs show:
4/22/2018, 11:04:22 PM
WARN false
DirectUpdateHandler2
Starting optimize... Reading and rewriting the entire index! Use with care.
4/23/2018, 3:18:35 AM
WARN true
DirectUpdateHandler2
Starting optimize... Reading and rewriting the entire index! Use with care.
4/23/2018, 7:33:46 AM
WARN false
DirectUpdateHandler2
Starting optimize... Reading and rewriting the entire index! Use with care.
4/23/2018, 9:48:32 AM
WARN false
DirectUpdateHandler2
Starting optimize... Reading and rewriting the entire index! Use with care.
4/23/2018, 11:25:13 AM
WARN false
DirectUpdateHandler2
Starting optimize... Reading and rewriting the entire index! Use with care.
4/23/2018, 1:00:42 PM
WARN false
DirectUpdateHandler2
Starting optimize... Reading and rewriting the entire index! Use with care.
It's absolutely killing the computer this is running on. Now it just started 
another run...

In the logs all I see is entries like these, and it doesn't say anywhere 
optimize=true

2018-04-23 17:12:31.995 INFO  (qtp947679291-17200) [   x:dovecot] 
o.a.s.u.DirectUpdateHandler2 start 
commit{_version_=1598557836536709120,optimize=false,openSearcher=true,waitSearcher=false,expungeDeletes=false,softCommit=true,prepareCommit=false}
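For reference, here is a minimal sketch (Python with the requests library; the host and port are assumptions, only the core name 'dovecot' comes from this thread) of the difference between the soft commits recorded in the log entry above and an explicit optimize. If some external client or cron job is sending the second kind of request, it will show up in Solr's request log with optimize=true:

    import requests

    SOLR = "http://localhost:8983/solr/dovecot"  # assumed host/port; core name from this thread

    # A plain soft commit: this is what the DirectUpdateHandler2 entry above records
    # (optimize=false, softCommit=true) and it never rewrites the whole index.
    r = requests.get(SOLR + "/update", params={"commit": "true", "softCommit": "true", "wt": "json"})
    r.raise_for_status()

    # An explicit optimize (forced merge). A request like this, from any client,
    # is what produces the "Starting optimize..." warnings in the Solr log.
    r = requests.get(SOLR + "/update", params={"optimize": "true", "maxSegments": "1", "wt": "json"})
    r.raise_for_status()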


Re[2]: Optimize question

2018-04-23 Thread Scott M.
I only have one core, 'dovecot'. This is a pretty standard config. How do I 
stop it from doing all these 'Optimizes'? Is there an automatic process that 
triggers them?
On Mon, Apr 23, 2018 at 01:25 PM, Shawn Heisey  wrote:
On 4/23/2018 11:13 AM, Scott M. wrote:
I recently installed Solr 7.1 and configured it to work with Dovecot for 
full-text searching. It works great but after about 2 days of indexing, I've 
pressed the 'Optimize' button. At that point it had collected about 17 million 
documents and it was taking up about 60-70GB of space.

It completed once and the space dropped down to 30-45GB but since then it 
appears to be doing Optimize again on its own, regularly swelling up the total 
space used to double, then it shrinks again, stays a bit that way then it 
starts another optimize!

Are you running in SolrCloud mode with multiple replicas and/or multiple 
shards?

If so, SolrCloud does optimize a little differently than standalone 
mode.  It will optimize every core in the entire collection, one at a 
time, regardless of which actual core receives the optimize request.  In 
standalone mode, only the specific core you run the command on will be 
optimized.

Thanks,
Shawn
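As a side note, a small sketch (assuming a Solr node at the URL below) of how to tell which mode a node is running in: the Collections API only answers in SolrCloud mode, while the CoreAdmin STATUS call works either way and lists the cores the node is serving:

    import requests

    base = "http://localhost:8983/solr"   # assumed host/port

    # In SolrCloud mode this returns the cluster layout (collections, shards,
    # replicas); a standalone node returns an error response instead.
    cloud = requests.get(base + "/admin/collections",
                         params={"action": "CLUSTERSTATUS", "wt": "json"})
    print("CLUSTERSTATUS HTTP status:", cloud.status_code)

    # Works in either mode and lists the cores hosted by this node,
    # e.g. the single 'dovecot' core from this thread.
    cores = requests.get(base + "/admin/cores", params={"action": "STATUS", "wt": "json"})
    cores.raise_for_status()
    print(sorted(cores.json()["status"].keys()))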


Re: Optimize question

2018-04-23 Thread Shawn Heisey

On 4/23/2018 11:13 AM, Scott M. wrote:

I recently installed Solr 7.1 and configured it to work with Dovecot for 
full-text searching. It works great but after about 2 days of indexing, I've 
pressed the 'Optimize' button. At that point it had collected about 17 million 
documents and it was taking up about 60-70GB of space.

It completed once and the space dropped down to 30-45GB but since then it 
appears to be doing Optimize again on its own, regularly swelling up the total 
space used to double, then it shrinks again, stays a bit that way then it 
starts another optimize!


Are you running in SolrCloud mode with multiple replicas and/or multiple 
shards?


If so, SolrCloud does optimize a little differently than standalone 
mode.  It will optimize every core in the entire collection, one at a 
time, regardless of which actual core receives the optimize request.  In 
standalone mode, only the specific core you run the command on will be 
optimized.


Thanks,
Shawn



Re: Optimize question

2018-04-23 Thread Erick Erickson
No, it's not "optimizing on its own". At least it better not be.

As far as your index growing after optimize, that's the little
"gotcha" with optimize, see:
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

This is being addressed in the 7.4 time frame (hopefully), see LUCENE-7976.

Best,
Erick

On Mon, Apr 23, 2018 at 10:13 AM, Scott M.  wrote:
> I recently installed Solr 7.1 and configured it to work with Dovecot for 
> full-text searching. It works great but after about 2 days of indexing, I've 
> pressed the 'Optimize' button. At that point it had collected about 17 
> million documents and it was taking up about 60-70GB of space.
>
> It completed once and the space dropped down to 30-45GB but since then it 
> appears to be doing Optimize again on its own, regularly swelling up the 
> total space used to double, then it shrinks again, stays a bit that way then 
> it starts another optimize!
>
> Logs show:
> 4/22/2018, 11:04:22 PM
> WARN false
> DirectUpdateHandler2
> Starting optimize... Reading and rewriting the entire index! Use with 
> care.
> 4/23/2018, 3:18:35 AM
> WARN true
> DirectUpdateHandler2
> Starting optimize... Reading and rewriting the entire index! Use with 
> care.
> 4/23/2018, 7:33:46 AM
> WARN false
> DirectUpdateHandler2
> Starting optimize... Reading and rewriting the entire index! Use with 
> care.
> 4/23/2018, 9:48:32 AM
> WARN false
> DirectUpdateHandler2
> Starting optimize... Reading and rewriting the entire index! Use with 
> care.
> 4/23/2018, 11:25:13 AM
> WARN false
> DirectUpdateHandler2
> Starting optimize... Reading and rewriting the entire index! Use with 
> care.
> 4/23/2018, 1:00:42 PM
> WARN false
> DirectUpdateHandler2
> Starting optimize... Reading and rewriting the entire index! Use with 
> care.
> It's absolutely killing the computer this is running on. Now it just started 
> another run...
>
> In the logs all I see is entries like these, and it doesn't say anywhere 
> optimize=true
>
> 2018-04-23 17:12:31.995 INFO  (qtp947679291-17200) [   x:dovecot] 
> o.a.s.u.DirectUpdateHandler2 start 
> commit{_version_=1598557836536709120,optimize=false,openSearcher=true,waitSearcher=false,expungeDeletes=false,softCommit=true,prepareCommit=false}


Optimize question

2018-04-23 Thread Scott M.
I recently installed Solr 7.1 and configured it to work with Dovecot for 
full-text searching. It works great but after about 2 days of indexing, I've 
pressed the 'Optimize' button. At that point it had collected about 17 million 
documents and it was taking up about 60-70GB of space. 

It completed once and the space dropped down to 30-45GB but since then it 
appears to be doing Optimize again on its own, regularly swelling up the total 
space used to double, then it shrinks again, stays a bit that way then it 
starts another optimize!

Logs show:
4/22/2018, 11:04:22 PM
WARN false
DirectUpdateHandler2
Starting optimize... Reading and rewriting the entire index! Use with 
care.
4/23/2018, 3:18:35 AM
WARN true
DirectUpdateHandler2
Starting optimize... Reading and rewriting the entire index! Use with 
care.
4/23/2018, 7:33:46 AM
WARN false
DirectUpdateHandler2
Starting optimize... Reading and rewriting the entire index! Use with 
care.
4/23/2018, 9:48:32 AM
WARN false
DirectUpdateHandler2
Starting optimize... Reading and rewriting the entire index! Use with 
care.
4/23/2018, 11:25:13 AM
WARN false
DirectUpdateHandler2
Starting optimize... Reading and rewriting the entire index! Use with 
care.
4/23/2018, 1:00:42 PM
WARN false
DirectUpdateHandler2
Starting optimize... Reading and rewriting the entire index! Use with 
care.
It's absolutely killing the computer this is running on. Now it just started 
another run...

In the logs all I see is entries like these, and it doesn't say anywhere 
optimize=true

2018-04-23 17:12:31.995 INFO  (qtp947679291-17200) [   x:dovecot] 
o.a.s.u.DirectUpdateHandler2 start 
commit{_version_=1598557836536709120,optimize=false,openSearcher=true,waitSearcher=false,expungeDeletes=false,softCommit=true,prepareCommit=false}


Re: yet another optimize question

2013-06-20 Thread Jack Krupansky
Take a look at using DocValues for facets that are problematic. It not only 
moves the memory off-heap, but stores values in a much more optimal manner.


-- Jack Krupansky
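For illustration only, a sketch of what this suggestion looks like in a much newer Solr than the 3.6.1 discussed in this thread (it assumes a managed schema with the Schema API, and 'manu_exact' is a hypothetical field name); enabling docValues also requires a full reindex:

    import requests

    core_url = "http://localhost:8983/solr/mycore"   # assumed host/port/core

    # Redefine an existing facet field with docValues enabled. Faceting can then
    # use the on-disk docValues structures instead of un-inverting the field
    # onto the Java heap.
    cmd = {"replace-field": {"name": "manu_exact",
                             "type": "string",
                             "indexed": True,
                             "stored": False,
                             "docValues": True}}
    r = requests.post(core_url + "/schema", json=cmd)
    r.raise_for_status()
    # Note: existing documents must be reindexed before the docValues are populated.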

-Original Message- 
From: Toke Eskildsen

Sent: Thursday, June 20, 2013 3:26 AM
To: solr-user@lucene.apache.org
Subject: RE: yet another optimize question

Petersen, Robert [robert.peter...@mail.rakuten.com] wrote:

We actually have hundreds of facet-able fields, but most are specialized
and are only faceted upon if the user has drilled into the particular 
category

to which they are applicable and so they are only indexed for products
in those categories.  I guess it is the facets that eat up so much of our
memory.


As Andre mentions, the problem is that the fc facet method maintains a list 
of values (or pointers to values, if we're talking text) for each document 
in the whole index. Faceting on a field that only has a single value in a 
single document in the whole index still allocates memory linear to the 
total number of documents. You are in the same situation as John Nielsen in 
the thread "Solr using a ridiculous amount of memory" 
http://lucene.472066.n3.nabble.com/Solr-using-a-ridiculous-amount-of-memory-tt4050840.html#none


You could try and change the way you index the facet information to get 
around this waste, but it is quite a lot of work:

http://sbdevel.wordpress.com/2013/04/16/you-are-faceting-itwrong/


It was suggested that if I use facet method = enum for those particular
specialized facets then my memory usage would go down.


If the number of unique values in the individual facets is low, this could 
work. If nothing else, it is very easy to try.


- Toke Eskildsen



RE: yet another optimize question

2013-06-20 Thread Toke Eskildsen
Petersen, Robert [robert.peter...@mail.rakuten.com] wrote:
> We actually have hundreds of facet-able fields, but most are specialized
> and are only faceted upon if the user has drilled into the particular category
> to which they are applicable and so they are only indexed for products
> in those categories.  I guess it is the facets that eat up so much of our
> memory.

As Andre mentions, the problem is that the fc facet method maintains a list of 
values (or pointers to values, if we're talking text) for each document in the 
whole index. Faceting on a field that only has a single value in a single 
document in the whole index still allocates memory linear to the total number 
of documents. You are in the same situation as John Nielsen in the thread "Solr 
using a ridiculous amount of memory" 
http://lucene.472066.n3.nabble.com/Solr-using-a-ridiculous-amount-of-memory-tt4050840.html#none

You could try and change the way you index the facet information to get around 
this waste, but it is quite a lot of work:
http://sbdevel.wordpress.com/2013/04/16/you-are-faceting-itwrong/

> It was suggested that if I use facet method = enum for those particular
> specialized facets then my memory usage would go down.

If the number of unique values in the individual facets is low, this could 
work. If nothing else, it is very easy to try.

- Toke Eskildsen
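To make the "linear to the total number of documents" point concrete, a rough back-of-envelope sketch (illustrative numbers only: roughly 15 million documents is the index size mentioned elsewhere in this thread, and 300 stands in for the "hundreds" of sparse facet fields):

    # facet.method=fc keeps roughly one int (ordinal) per document per faceted
    # field, no matter how sparsely the field is populated.
    max_doc = 15_000_000     # documents in the index (figure from this thread)
    bytes_per_doc = 4        # one int per document, ignoring per-value overhead
    sparse_fields = 300      # assumed stand-in for "hundreds" of dynamic facet fields

    per_field_mb = max_doc * bytes_per_doc / 1024 ** 2
    total_gb = sparse_fields * max_doc * bytes_per_doc / 1024 ** 3
    print(f"~{per_field_mb:.0f} MB per faceted field")                 # ~57 MB
    print(f"~{total_gb:.1f} GB if all {sparse_fields} get faceted on") # ~16.8 GB

That order of magnitude is consistent with the 20GB-to-4GB heap drop André reports elsewhere in this thread after switching the sparse fields to facet.method=enum.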

RE: yet another optimize question

2013-06-19 Thread Petersen, Robert
We actually have hundreds of facet-able fields, but most are specialized and 
are only faceted upon if the user has drilled into the particular category to 
which they are applicable and so they are only indexed for products in those 
categories.  I guess it is the facets that eat up so much of our memory.  It 
was suggested that if I use facet method = enum for those particular 
specialized facets then my memory usage would go down.  I'm going to try that 
out and see how much it helps.

Thanks
Robi

-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Wednesday, June 19, 2013 10:50 AM
To: solr-user@lucene.apache.org
Subject: Re: yet another optimize question

I generally run with an 8GB heap for a system that does no faceting. 32GB does 
seem rather large, but you really should have room for bigger caches.

The Akamai cache will reduce your hit rate a lot. That is OK, because users are 
getting faster responses than they would from Solr. A 5% hit rate may be OK 
since you have that front end HTTP cache.

The Netflix index was updated daily. 

wunder

On Jun 19, 2013, at 10:36 AM, Petersen, Robert wrote:

> Hi Walter,
> 
> I used to have larger settings on our caches but it seemed like I had to make 
> the caches that small to reduce memory usage to keep from getting the dreaded 
> OOM exceptions.  Also our search is behind Akamai with a one hour TTL.  Our 
> slave farm has a load balancer in front of twelve slave servers and our index 
> is being updated constantly, pretty much 24/7.  
> 
> So my question would be how do you run with such big caches without going 
> into the OOM zone?  Was the Netflix index only updated based upon the release 
> schedules of the studios, like once a week?  Our entertainment stores used to 
> be like that before we turned into a marketplace based e-tailer, but now we 
> get new listings from merchants all the time and so have a constant churn of 
> additions and deletions in our index.
> 
> I feel like at 32GB our heap is really huge, but we seem to use almost all of 
> it with these settings.   I am trying out the G1GC on one slave to see if 
> that gets memory usage lower but while it has a different collection pattern 
> in the various spaces it seems like the total memory usage peaks out at about 
> the same level.
> 
> Thanks
> Robi
> 
> -Original Message-
> From: Walter Underwood [mailto:wun...@wunderwood.org] 
> Sent: Tuesday, June 18, 2013 6:57 PM
> To: solr-user@lucene.apache.org
> Subject: Re: yet another optimize question
> 
> Your query cache is far too small. Most of the default caches are too small.
> 
> We run with 10K entries and get a hit rate around 0.30 across four servers. 
> This rate goes up with more queries, down with less, but try a bigger cache, 
> especially if you are updating the index infrequently, like once per day.
> 
> At Netflix, we had a 0.12 hit rate on the query cache, even with an HTTP 
> cache in front of it. The HTTP cache had an 80% hit rate.
> 
> I'd increase your document cache, too. I usually see about 0.75 or better on 
> that.
> 
> wunder
> 
> On Jun 18, 2013, at 10:22 AM, Petersen, Robert wrote:
> 
>> Hi Otis, 
>> 
>> Yes the query results cache is just about worthless.   I guess we have too 
>> diverse of a set of user queries.  The business unit has decided to let bots 
>> crawl our search pages too so that doesn't help either.  I turned it way 
>> down but decided to keep it because my understanding was that it would still 
>> help for users going from page 1 to page 2 in a search.  Is that true?
>> 
>> Thanks
>> Robi
>> 
>> -Original Message-
>> From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
>> Sent: Monday, June 17, 2013 6:39 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: yet another optimize question
>> 
>> Hi Robi,
>> 
>> This goes against the original problem of getting OOMEs, but it looks like 
>> each of your Solr caches could be a little bigger if you want to eliminate 
>> evictions, with the query results one possibly not being worth keeping if 
>> you can't get the hit % up enough.
>> 
>> Otis
>> --
>> Solr & ElasticSearch Support -- http://sematext.com/
>> 
>> 
>> On Mon, Jun 17, 2013 at 2:21 PM, Petersen, Robert 
>>  wrote:
>>> Hi Otis,
>>> 
>>> Right I didn't restart the JVMs except on the one slave where I was 
>>> experimenting with using G1GC on the 1.7.0_21 JRE.   Also some time ago I 
>>> made all our caches small enough to keep us from getting OOMs while still 
>>> having a good hit rate.Our index has about 50 fields which are m

Re: yet another optimize question

2013-06-19 Thread Walter Underwood
I generally run with an 8GB heap for a system that does no faceting. 32GB does 
seem rather large, but you really should have room for bigger caches.

The Akamai cache will reduce your hit rate a lot. That is OK, because users are 
getting faster responses than they would from Solr. A 5% hit rate may be OK 
since you have that front end HTTP cache.

The Netflix index was updated daily. 

wunder

On Jun 19, 2013, at 10:36 AM, Petersen, Robert wrote:

> Hi Walter,
> 
> I used to have larger settings on our caches but it seemed like I had to make 
> the caches that small to reduce memory usage to keep from getting the dreaded 
> OOM exceptions.  Also our search is behind Akamai with a one hour TTL.  Our 
> slave farm has a load balancer in front of twelve slave servers and our index 
> is being updated constantly, pretty much 24/7.  
> 
> So my question would be how do you run with such big caches without going 
> into the OOM zone?  Was the Netflix index only updated based upon the release 
> schedules of the studios, like once a week?  Our entertainment stores used to 
> be like that before we turned into a marketplace based e-tailer, but now we 
> get new listings from merchants all the time and so have a constant churn of 
> additions and deletions in our index.
> 
> I feel like at 32GB our heap is really huge, but we seem to use almost all of 
> it with these settings.   I am trying out the G1GC on one slave to see if 
> that gets memory usage lower but while it has a different collection pattern 
> in the various spaces it seems like the total memory usage peaks out at about 
> the same level.
> 
> Thanks
> Robi
> 
> -Original Message-
> From: Walter Underwood [mailto:wun...@wunderwood.org] 
> Sent: Tuesday, June 18, 2013 6:57 PM
> To: solr-user@lucene.apache.org
> Subject: Re: yet another optimize question
> 
> Your query cache is far too small. Most of the default caches are too small.
> 
> We run with 10K entries and get a hit rate around 0.30 across four servers. 
> This rate goes up with more queries, down with less, but try a bigger cache, 
> especially if you are updating the index infrequently, like once per day.
> 
> At Netflix, we had a 0.12 hit rate on the query cache, even with an HTTP 
> cache in front of it. The HTTP cache had an 80% hit rate.
> 
> I'd increase your document cache, too. I usually see about 0.75 or better on 
> that.
> 
> wunder
> 
> On Jun 18, 2013, at 10:22 AM, Petersen, Robert wrote:
> 
>> Hi Otis, 
>> 
>> Yes the query results cache is just about worthless.   I guess we have too 
>> diverse of a set of user queries.  The business unit has decided to let bots 
>> crawl our search pages too so that doesn't help either.  I turned it way 
>> down but decided to keep it because my understanding was that it would still 
>> help for users going from page 1 to page 2 in a search.  Is that true?
>> 
>> Thanks
>> Robi
>> 
>> -Original Message-
>> From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
>> Sent: Monday, June 17, 2013 6:39 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: yet another optimize question
>> 
>> Hi Robi,
>> 
>> This goes against the original problem of getting OOMEs, but it looks like 
>> each of your Solr caches could be a little bigger if you want to eliminate 
>> evictions, with the query results one possibly not being worth keeping if 
>> you can't get the hit % up enough.
>> 
>> Otis
>> --
>> Solr & ElasticSearch Support -- http://sematext.com/
>> 
>> 
>> On Mon, Jun 17, 2013 at 2:21 PM, Petersen, Robert 
>>  wrote:
>>> Hi Otis,
>>> 
>>> Right I didn't restart the JVMs except on the one slave where I was 
>>> experimenting with using G1GC on the 1.7.0_21 JRE.   Also some time ago I 
>>> made all our caches small enough to keep us from getting OOMs while still 
>>> having a good hit rate.Our index has about 50 fields which are mostly 
>>> int IDs and there are some dynamic fields also.  These dynamic fields can 
>>> be used for custom faceting.  We have some standard facets we always facet 
>>> on and other dynamic facets which are only used if the query is filtering 
>>> on a particular category.  There are hundreds of these fields but since 
>>> they are only for a small subset of the overall index they are very 
>>> sparsely populated with regard to the overall index.  With CMS GC we get a 
>>> sawtooth on the old generation (I guess every replication and commit causes 
its usage to drop down to 10GB or so) and it seems to be the old 
>

RE: yet another optimize question

2013-06-19 Thread Petersen, Robert
Hi Walter,

I used to have larger settings on our caches but it seemed like I had to make 
the caches that small to reduce memory usage to keep from getting the dreaded 
OOM exceptions.  Also our search is behind Akamai with a one hour TTL.  Our 
slave farm has a load balancer in front of twelve slave servers and our index 
is being updated constantly, pretty much 24/7.  

So my question would be how do you run with such big caches without going into 
the OOM zone?  Was the Netflix index only updated based upon the release 
schedules of the studios, like once a week?  Our entertainment stores used to 
be like that before we turned into a marketplace based e-tailer, but now we get 
new listings from merchants all the time and so have a constant churn of 
additions and deletions in our index.

I feel like at 32GB our heap is really huge, but we seem to use almost all of 
it with these settings.   I am trying out the G1GC on one slave to see if that 
gets memory usage lower but while it has a different collection pattern in the 
various spaces it seems like the total memory usage peaks out at about the same 
level.

Thanks
Robi

-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Tuesday, June 18, 2013 6:57 PM
To: solr-user@lucene.apache.org
Subject: Re: yet another optimize question

Your query cache is far too small. Most of the default caches are too small.

We run with 10K entries and get a hit rate around 0.30 across four servers. 
This rate goes up with more queries, down with less, but try a bigger cache, 
especially if you are updating the index infrequently, like once per day.

At Netflix, we had a 0.12 hit rate on the query cache, even with an HTTP cache 
in front of it. The HTTP cache had an 80% hit rate.

I'd increase your document cache, too. I usually see about 0.75 or better on 
that.

wunder

On Jun 18, 2013, at 10:22 AM, Petersen, Robert wrote:

> Hi Otis, 
> 
> Yes the query results cache is just about worthless.   I guess we have too 
> diverse of a set of user queries.  The business unit has decided to let bots 
> crawl our search pages too so that doesn't help either.  I turned it way down 
> but decided to keep it because my understanding was that it would still help 
> for users going from page 1 to page 2 in a search.  Is that true?
> 
> Thanks
> Robi
> 
> -Original Message-
> From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
> Sent: Monday, June 17, 2013 6:39 PM
> To: solr-user@lucene.apache.org
> Subject: Re: yet another optimize question
> 
> Hi Robi,
> 
> This goes against the original problem of getting OOMEs, but it looks like 
> each of your Solr caches could be a little bigger if you want to eliminate 
> evictions, with the query results one possibly not being worth keeping if you 
> can't get the hit % up enough.
> 
> Otis
> --
> Solr & ElasticSearch Support -- http://sematext.com/
> 
> 
> On Mon, Jun 17, 2013 at 2:21 PM, Petersen, Robert 
>  wrote:
>> Hi Otis,
>> 
>> Right I didn't restart the JVMs except on the one slave where I was 
>> experimenting with using G1GC on the 1.7.0_21 JRE.   Also some time ago I 
>> made all our caches small enough to keep us from getting OOMs while still 
>> having a good hit rate.Our index has about 50 fields which are mostly 
>> int IDs and there are some dynamic fields also.  These dynamic fields can be 
>> used for custom faceting.  We have some standard facets we always facet on 
>> and other dynamic facets which are only used if the query is filtering on a 
>> particular category.  There are hundreds of these fields but since they are 
>> only for a small subset of the overall index they are very sparsely 
>> populated with regard to the overall index.  With CMS GC we get a sawtooth 
>> on the old generation (I guess every replication and commit causes its 
>> usage to drop down to 10GB or so) and it seems to be the old generation 
>> which is the main space consumer.  With the G1GC, the memory map looked 
>> totally different!  I was a little lost looking at memory consumption with 
>> that GC.  Maybe I'll try it again now that the index is a bit smaller than 
>> it was last time I tried it.  After four days without running an optimize 
>> now it is 21GB.  BTW our indexing speed is mostly bound by the DB so 
>> reducing the segments might be ok...
>> 
>> Here is a quick snapshot of one slaves memory map as reported by PSI-Probe, 
>> but unfortunately I guess I can't send the history graphics to the solr-user 
>> list to show their changes over time:
>> Name                 Used       Committed   Max         Initial     Group
>> Par Survivor S

Re: yet another optimize question

2013-06-19 Thread Andre Bois-Crettez

Indeed, the actual syntax for a per-field facet method is:

f.mysparefieldname.facet.method=enum

André
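For illustration, a minimal query sketch (Python with requests; the host, port and core are assumptions, and 'mysparefieldname' is the placeholder field name from the message above) showing the per-field override alongside an ordinary facet request:

    import requests

    solr = "http://localhost:8983/solr/mycore"   # assumed host/port/core
    params = {
        "q": "*:*",
        "rows": 0,
        "facet": "true",
        "facet.field": "mysparefieldname",             # a sparsely populated dynamic field
        "f.mysparefieldname.facet.method": "enum",     # per-field override; other fields keep the default (fc)
        "wt": "json",
    }
    resp = requests.get(solr + "/select", params=params)
    resp.raise_for_status()
    counts = resp.json()["facet_counts"]["facet_fields"]["mysparefieldname"]
    print(counts)   # flat [term, count, term, count, ...] list in the default JSON layout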

On 06/18/2013 09:00 PM, Petersen, Robert wrote:

Hi Andre,

Wow that is astonishing!  I will definitely also try that out!  Just set the 
facet method on a per field basis for the less used sparse facet fields eh?  
Thanks for the tip.

Thanks
Robi

-Original Message-
From: Andre Bois-Crettez [mailto:andre.b...@kelkoo.com]
Sent: Tuesday, June 18, 2013 3:03 AM
To: solr-user@lucene.apache.org
Subject: Re: yet another optimize question

Recently we had steadily increasing memory usage and OOM due to facets on 
dynamic fields.
The default facet.method=fc needs to build a large array of maxDoc ints for 
each field (a fieldCache or fieldValueCache entry), whether it is sparsely 
populated or not.

Once you have reduced your number of maxDocs with the merge policy, it can be 
interesting to try facet.method=enum for all the sparsely populated dynamic 
fields.
Despite what is said in the wiki, in our case the performance was similar to 
facet.method=fc; however, the JVM heap usage went down from about 20GB to 4GB.

André

On 06/17/2013 08:21 PM, Petersen, Robert wrote:

Also some time ago I made all our caches small enough to keep us from getting 
OOMs while still having a good hit rate.Our index has about 50 fields which 
are mostly int IDs and there are some dynamic fields also.  These dynamic 
fields can be used for custom faceting.  We have some standard facets we always 
facet on and other dynamic facets which are only used if the query is filtering 
on a particular category.  There are hundreds of these fields but since they 
are only for a small subset of the overall index they are very sparsely 
populated with regard to the overall index.

--
André Bois-Crettez

Search technology, Kelkoo
http://www.kelkoo.com/




--
André Bois-Crettez

Search technology, Kelkoo
http://www.kelkoo.com/




Re: yet another optimize question

2013-06-18 Thread Walter Underwood
Your query cache is far too small. Most of the default caches are too small.

We run with 10K entries and get a hit rate around 0.30 across four servers. 
This rate goes up with more queries, down with less, but try a bigger cache, 
especially if you are updating the index infrequently, like once per day.

At Netflix, we had a 0.12 hit rate on the query cache, even with an HTTP cache 
in front of it. The HTTP cache had an 80% hit rate.

I'd increase your document cache, too. I usually see about 0.75 or better on 
that.

wunder
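A quick way to pull the cache statistics being discussed here (hit ratio, evictions, size) without going through the admin UI or PSI-Probe is the mbeans handler; a small sketch, assuming a core at the URL below (the exact response layout varies between Solr versions):

    import json
    import requests

    core_url = "http://localhost:8983/solr/mycore"   # assumed host/port/core

    # Ask the SolrInfoMBean handler for cache statistics only.
    resp = requests.get(core_url + "/admin/mbeans",
                        params={"stats": "true", "cat": "CACHE", "wt": "json"})
    resp.raise_for_status()

    # Print the raw structure and look for the queryResultCache / documentCache /
    # filterCache entries with their hitratio, evictions and size stats.
    print(json.dumps(resp.json(), indent=2))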

On Jun 18, 2013, at 10:22 AM, Petersen, Robert wrote:

> Hi Otis, 
> 
> Yes the query results cache is just about worthless.   I guess we have too 
> diverse of a set of user queries.  The business unit has decided to let bots 
> crawl our search pages too so that doesn't help either.  I turned it way down 
> but decided to keep it because my understanding was that it would still help 
> for users going from page 1 to page 2 in a search.  Is that true?
> 
> Thanks
> Robi
> 
> -Original Message-
> From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
> Sent: Monday, June 17, 2013 6:39 PM
> To: solr-user@lucene.apache.org
> Subject: Re: yet another optimize question
> 
> Hi Robi,
> 
> This goes against the original problem of getting OOMEs, but it looks like 
> each of your Solr caches could be a little bigger if you want to eliminate 
> evictions, with the query results one possibly not being worth keeping if you 
> can't get the hit % up enough.
> 
> Otis
> --
> Solr & ElasticSearch Support -- http://sematext.com/
> 
> 
> On Mon, Jun 17, 2013 at 2:21 PM, Petersen, Robert 
>  wrote:
>> Hi Otis,
>> 
>> Right I didn't restart the JVMs except on the one slave where I was 
>> experimenting with using G1GC on the 1.7.0_21 JRE.   Also some time ago I 
>> made all our caches small enough to keep us from getting OOMs while still 
>> having a good hit rate.Our index has about 50 fields which are mostly 
>> int IDs and there are some dynamic fields also.  These dynamic fields can be 
>> used for custom faceting.  We have some standard facets we always facet on 
>> and other dynamic facets which are only used if the query is filtering on a 
>> particular category.  There are hundreds of these fields but since they are 
>> only for a small subset of the overall index they are very sparsely 
>> populated with regard to the overall index.  With CMS GC we get a sawtooth 
>> on the old generation (I guess every replication and commit causes its 
>> usage to drop down to 10GB or so) and it seems to be the old generation 
>> which is the main space consumer.  With the G1GC, the memory map looked 
>> totally different!  I was a little lost looking at memory consumption with 
>> that GC.  Maybe I'll try it again now that the index is a bit smaller than 
>> it was last time I tried it.  After four days without running an optimize 
>> now it is 21GB.  BTW our indexing speed is mostly bound by the DB so 
>> reducing the segments might be ok...
>> 
>> Here is a quick snapshot of one slaves memory map as reported by PSI-Probe, 
>> but unfortunately I guess I can't send the history graphics to the solr-user 
>> list to show their changes over time:
>> Name                 Used       Committed   Max         Initial     Group
>> Par Survivor Space   20.02 MB   108.13 MB   108.13 MB   108.13 MB   HEAP
>> CMS Perm Gen         42.29 MB   70.66 MB    82.00 MB    20.75 MB    NON_HEAP
>> Code Cache           9.73 MB    9.88 MB     48.00 MB    2.44 MB     NON_HEAP
>> CMS Old Gen          20.22 GB   30.94 GB    30.94 GB    30.94 GB    HEAP
>> Par Eden Space       42.20 MB   865.31 MB   865.31 MB   865.31 MB   HEAP
>> Total                20.33 GB   31.97 GB    32.02 GB    31.92 GB    TOTAL
>> 
>> And here's our current cache stats from a random slave:
>> 
>> name:queryResultCache
>> class:   org.apache.solr.search.LRUCache
>> version: 1.0
>> description: LRU Cache(maxSize=488, initialSize=6, autowarmCount=6, 
>> regenerator=org.apache.solr.search.SolrIndexSearcher$3@461ff4c3)
>> stats:  lookups : 619
>> hits : 36
>> hitratio : 0.05
>> inserts : 592
>> evictions : 101
>> size : 488
>> warmupTime : 2949
>> cumulative_lookups : 681225
>> cumulative_hits : 73126
>> cumulative_hitratio : 0.10
>> cumulative_inserts : 602396
>> cumulati

RE: yet another optimize question

2013-06-18 Thread Petersen, Robert
Hi Andre,

Wow that is astonishing!  I will definitely also try that out!  Just set the 
facet method on a per field basis for the less used sparse facet fields eh?  
Thanks for the tip.

Thanks
Robi

-Original Message-
From: Andre Bois-Crettez [mailto:andre.b...@kelkoo.com] 
Sent: Tuesday, June 18, 2013 3:03 AM
To: solr-user@lucene.apache.org
Subject: Re: yet another optimize question

Recently we had steadily increasing memory usage and OOM due to facets on 
dynamic fields.
The default facet.method=fc needs to build a large array of maxDoc ints for 
each field (a fieldCache or fieldValueCache entry), whether it is sparsely 
populated or not.

Once you have reduced your number of maxDocs with the merge policy, it can be 
interesting to try facet.method=enum for all the sparsely populated dynamic 
fields.
Despite what is said in the wiki, in our case the performance was similar to 
facet.method=fc; however, the JVM heap usage went down from about 20GB to 4GB.

André

On 06/17/2013 08:21 PM, Petersen, Robert wrote:
> Also some time ago I made all our caches small enough to keep us from getting 
> OOMs while still having a good hit rate.Our index has about 50 fields 
> which are mostly int IDs and there are some dynamic fields also.  These 
> dynamic fields can be used for custom faceting.  We have some standard facets 
> we always facet on and other dynamic facets which are only used if the query 
> is filtering on a particular category.  There are hundreds of these fields 
> but since they are only for a small subset of the overall index they are very 
> sparsely populated with regard to the overall index.
--
André Bois-Crettez

Search technology, Kelkoo
http://www.kelkoo.com/





RE: yet another optimize question

2013-06-18 Thread Petersen, Robert
In reading the newer solrconfig in the example conf folder it seems like it is 
saying this setting '<mergeFactor>10</mergeFactor>' is shorthand for putting 
the below, and that both are the defaults?  It says 'The default since 
Solr/Lucene 3.3 is TieredMergePolicy.' So isn't this setting already in effect 
for me?

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
</mergePolicy>

Thanks
Robi

-Original Message-
From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
Sent: Monday, June 17, 2013 6:36 PM
To: solr-user@lucene.apache.org
Subject: Re: yet another optimize question

Yes, in one of the example solrconfig.xml files this is right above the merge 
factor definition.

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/





On Mon, Jun 17, 2013 at 8:00 PM, Petersen, Robert 
 wrote:
> Hi Upayavira,
>
> You might have gotten it.  Yes we noticed maxdocs was way bigger than 
> numdocs.  There were a lot of files ending in '.del' in the index folder 
> also.  We started on 1.3 also.   I don't currently have any solr config 
> settings for MergePolicy at all.  Am I going to want to put something like 
> this into my index defaults section?
>
> <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>    <int name="maxMergeAtOnce">10</int>
>    <int name="segmentsPerTier">10</int>
> </mergePolicy>
> Thanks
> Robi
>
> -Original Message-
> From: Upayavira [mailto:u...@odoko.co.uk]
> Sent: Monday, June 17, 2013 12:29 PM
> To: solr-user@lucene.apache.org
> Subject: Re: yet another optimize question
>
> The key figures are numdocs vs maxdocs. Maxdocs-numdocs is the number of 
> deleted docs in your index.
>
> This is a 3.6 system you say. But has it been upgraded? I've seen folks 
> who've upgraded from 1.4 or 3.0/3.1 over time, keeping the old config.
> The consequence of this is that they don't get the right config for the 
> TieredMergePolicy, and therefore don't get to use it, seeing the old 
> behaviour which does require periodic optimise.
>
> Upayavira
>
> On Mon, Jun 17, 2013, at 07:21 PM, Petersen, Robert wrote:
>> Hi Otis,
>>
>> Right I didn't restart the JVMs except on the one slave where I was
>> experimenting with using G1GC on the 1.7.0_21 JRE.   Also some time ago I
>> made all our caches small enough to keep us from getting OOMs while still
>> having a good hit rate.Our index has about 50 fields which are mostly
>> int IDs and there are some dynamic fields also.  These dynamic fields 
>> can be used for custom faceting.  We have some standard facets we 
>> always facet on and other dynamic facets which are only used if the 
>> query is filtering on a particular category.  There are hundreds of 
>> these fields but since they are only for a small subset of the 
>> overall index they are very sparsely populated with regard to the 
>> overall index.  With CMS GC we get a sawtooth on the old generation 
>> (I guess every replication and commit causes its usage to drop down 
>> to 10GB or
>> so) and it seems to be the old generation which is the main space 
>> consumer.  With the G1GC, the memory map looked totally different!  I 
>> was a little lost looking at memory consumption with that GC.  Maybe 
>> I'll try it again now that the index is a bit smaller than it was 
>> last time I tried it.  After four days without running an optimize 
>> now it is 21GB.  BTW our indexing speed is mostly bound by the DB so 
>> reducing the segments might be ok...
>>
>> Here is a quick snapshot of one slaves memory map as reported by 
>> PSI-Probe, but unfortunately I guess I can't send the history 
>> graphics to the solr-user list to show their changes over time:
>> Name                 Used       Committed   Max         Initial     Group
>> Par Survivor Space   20.02 MB   108.13 MB   108.13 MB   108.13 MB   HEAP
>> CMS Perm Gen         42.29 MB   70.66 MB    82.00 MB    20.75 MB    NON_HEAP
>> Code Cache           9.73 MB    9.88 MB     48.00 MB    2.44 MB     NON_HEAP
>> CMS Old Gen          20.22 GB   30.94 GB    30.94 GB    30.94 GB    HEAP
>> Par Eden Space       42.20 MB   865.31 MB   865.31 MB   865.31 MB   HEAP
>> Total                20.33 GB   31.97 GB    32.02 GB    31.92 GB    TOTAL
>>
>> And here's our current cache stats from a random slave:
>>
>> name:queryResultCache
>> class:   org.apache.solr.search.LRUCache
>> version: 1.0
>> description: LRU Cache(maxSize=488, initialSize=6, autowarmCount=6,
>> regenerator=org.apache.solr.search.SolrIndexSearcher$3@461ff4c3)
>> stats:  lookups : 619
&

RE: yet another optimize question

2013-06-18 Thread Petersen, Robert
Hi Otis, 

Yes the query results cache is just about worthless.   I guess we have too 
diverse of a set of user queries.  The business unit has decided to let bots 
crawl our search pages too so that doesn't help either.  I turned it way down 
but decided to keep it because my understanding was that it would still help 
for users going from page 1 to page 2 in a search.  Is that true?

Thanks
Robi

-Original Message-
From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
Sent: Monday, June 17, 2013 6:39 PM
To: solr-user@lucene.apache.org
Subject: Re: yet another optimize question

Hi Robi,

This goes against the original problem of getting OOMEs, but it looks like each 
of your Solr caches could be a little bigger if you want to eliminate 
evictions, with the query results one possibly not being worth keeping if you 
can't get the hit % up enough.

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/





On Mon, Jun 17, 2013 at 2:21 PM, Petersen, Robert 
 wrote:
> Hi Otis,
>
> Right I didn't restart the JVMs except on the one slave where I was 
> experimenting with using G1GC on the 1.7.0_21 JRE.   Also some time ago I 
> made all our caches small enough to keep us from getting OOMs while still 
> having a good hit rate.Our index has about 50 fields which are mostly int 
> IDs and there are some dynamic fields also.  These dynamic fields can be used 
> for custom faceting.  We have some standard facets we always facet on and 
> other dynamic facets which are only used if the query is filtering on a 
> particular category.  There are hundreds of these fields but since they are 
> only for a small subset of the overall index they are very sparsely populated 
> with regard to the overall index.  With CMS GC we get a sawtooth on the old 
> generation (I guess every replication and commit causes its usage to drop 
> down to 10GB or so) and it seems to be the old generation which is the main 
> space consumer.  With the G1GC, the memory map looked totally different!  I 
> was a little lost looking at memory consumption with that GC.  Maybe I'll try 
> it again now that the index is a bit smaller than it was last time I tried 
> it.  After four days without running an optimize now it is 21GB.  BTW our 
> indexing speed is mostly bound by the DB so reducing the segments might be 
> ok...
>
> Here is a quick snapshot of one slaves memory map as reported by PSI-Probe, 
> but unfortunately I guess I can't send the history graphics to the solr-user 
> list to show their changes over time:
> Name                 Used       Committed   Max         Initial     Group
> Par Survivor Space   20.02 MB   108.13 MB   108.13 MB   108.13 MB   HEAP
> CMS Perm Gen         42.29 MB   70.66 MB    82.00 MB    20.75 MB    NON_HEAP
> Code Cache           9.73 MB    9.88 MB     48.00 MB    2.44 MB     NON_HEAP
> CMS Old Gen          20.22 GB   30.94 GB    30.94 GB    30.94 GB    HEAP
> Par Eden Space       42.20 MB   865.31 MB   865.31 MB   865.31 MB   HEAP
> Total                20.33 GB   31.97 GB    32.02 GB    31.92 GB    TOTAL
>
> And here's our current cache stats from a random slave:
>
> name:queryResultCache
> class:   org.apache.solr.search.LRUCache
> version: 1.0
> description: LRU Cache(maxSize=488, initialSize=6, autowarmCount=6, 
> regenerator=org.apache.solr.search.SolrIndexSearcher$3@461ff4c3)
> stats:  lookups : 619
> hits : 36
> hitratio : 0.05
> inserts : 592
> evictions : 101
> size : 488
> warmupTime : 2949
> cumulative_lookups : 681225
> cumulative_hits : 73126
> cumulative_hitratio : 0.10
> cumulative_inserts : 602396
> cumulative_evictions : 428868
>
>
>  name:   fieldCache
> class:   org.apache.solr.search.SolrFieldCacheMBean
> version: 1.0
> description: Provides introspection of the Lucene FieldCache, this is 
> **NOT** a cache that is managed by Solr.
> stats:  entries_count : 359
>
>
> name:documentCache
> class:   org.apache.solr.search.LRUCache
> version: 1.0
> description: LRU Cache(maxSize=2048, initialSize=512, autowarmCount=10, 
> regenerator=null)
> stats:  lookups : 12710
> hits : 7160
> hitratio : 0.56
> inserts : 5636
> evictions : 3588
> size : 2048
> warmupTime : 0
> cumulative_lookups : 10590054
> cumulative_hits : 6166913
> cumulative_hitratio : 0.58
> cumulative_inserts : 4423141
> cumulative_evictions : 3714653
>
>
> name:fieldValueCache
> class:   org.apache.solr.search.FastLRUCache
> version: 1.0
> description: Co

Re: yet another optimize question

2013-06-18 Thread Andre Bois-Crettez

Recently we had steadily increasing memory usage and OOM due to facets
on dynamic fields.
The default facet.method=fc needs to build a large array of maxDoc ints
for each field (a fieldCache or fieldValueCache entry), whether it is
sparsely populated or not.

Once you have reduced your number of maxDocs with the merge policy, it
can be interesting to try facet.method=enum for all the sparsely
populated dynamic fields.
Despite what is said in the wiki, in our case the performance was
similar to facet.method=fc; however, the JVM heap usage went down from
about 20GB to 4GB.

André

On 06/17/2013 08:21 PM, Petersen, Robert wrote:

Also some time ago I made all our caches small enough to keep us from getting 
OOMs while still having a good hit rate.Our index has about 50 fields which 
are mostly int IDs and there are some dynamic fields also.  These dynamic 
fields can be used for custom faceting.  We have some standard facets we always 
facet on and other dynamic facets which are only used if the query is filtering 
on a particular category.  There are hundreds of these fields but since they 
are only for a small subset of the overall index they are very sparsely 
populated with regard to the overall index.

--
André Bois-Crettez

Search technology, Kelkoo
http://www.kelkoo.com/




Re: yet another optimize question

2013-06-17 Thread Otis Gospodnetic
0
> description: Concurrent LRU Cache(maxSize=248, initialSize=12, 
> minSize=223, acceptableSize=235, cleanupThread=false, autowarmCount=10, 
> regenerator=org.apache.solr.search.SolrIndexSearcher$2@36e831d6)
> stats:  lookups : 3990
> hits : 3831
> hitratio : 0.96
> inserts : 239
> evictions : 26
> size : 244
> warmupTime : 1
> cumulative_lookups : 5745011
> cumulative_hits : 5496150
> cumulative_hitratio : 0.95
> cumulative_inserts : 351485
> cumulative_evictions : 276308
>
> -Original Message-
> From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com]
> Sent: Saturday, June 15, 2013 5:52 AM
> To: solr-user@lucene.apache.org
> Subject: Re: yet another optimize question
>
> Hi Robi,
>
> I'm going to guess you are seeing smaller heap also simply because you 
> restarted the JVM recently (hm, you don't say you restarted, maybe I'm making 
> this up). If you are indeed indexing continuously then you shouldn't 
> optimize. Lucene will merge segments itself. Lower mergeFactor will force it 
> to do it more often (it means slower indexing, bigger IO hit when segments 
> are merged, more per-segment data that Lucene/Solr need to read from the 
> segment for faceting and such, etc.) so maybe you shouldn't mess with that.  
> Do you know what your caches are like in terms of size, hit %, evictions?  
> We've recently seen people set those to a few hundred K or even higher, which 
> can eat a lot of heap.  We have had luck with G1 recently, too.
> Maybe you can run jstat and see which of the memory pools get filled up and 
> change/increase appropriate JVM param based on that?  How many fields do you 
> index, facet, or group on?
>
> Otis
> --
> Performance Monitoring - http://sematext.com/spm/index.html
> Solr & ElasticSearch Support -- http://sematext.com/
>
>
>
>
>
> On Fri, Jun 14, 2013 at 8:04 PM, Petersen, Robert 
>  wrote:
>> Hi guys,
>>
>> We're on solr 3.6.1 and I've read the discussions about whether to optimize 
>> or not to optimize.  I decided to try not optimizing our index as was 
>> recommended.  We have a little over 15 million docs in our biggest index and 
>> a 32gb heap for our jvm.  So without the optimizes the index folder seemed 
>> to grow in size and quantity of files.  There seemed to be an upper limit 
>> but eventually it hit 300 files consuming 26gb of space and that seemed to 
>> push our slave farm over the edge and we started getting the dreaded OOMs.  
>> We have continuous indexing activity, so I stopped the indexer and manually 
>> ran an optimize which made the index become 9 files consuming 15gb of space 
>> and our slave farm started having acceptable memory usage.  Our merge factor 
>> is 10, we're on java 7.  Before optimizing, I tried on one slave machine to 
>> go with the latest JVM and tried switching from the CMS GC to the G1GC but 
>> it hit OOM condition even faster.  So it seems like I have to continue to 
>> schedule a regular optimize.  Right now it has been a couple of days since 
>> running the optimize and the index is slowly growing bigger, now up to a bit 
>> over 19gb.  What do you guys think?  Did I miss something that would make us 
>> able to run without doing an optimize?
>>
>> Robert (Robi) Petersen
>> Senior Software Engineer
>> Search Department
>
>


Re: yet another optimize question

2013-06-17 Thread Otis Gospodnetic
Yes, in one of the example solrconfig.xml files this is right above
the merge factor definition.

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/





On Mon, Jun 17, 2013 at 8:00 PM, Petersen, Robert
 wrote:
> Hi Upayavira,
>
> You might have gotten it.  Yes we noticed maxdocs was way bigger than 
> numdocs.  There were a lot of files ending in '.del' in the index folder 
> also.  We started on 1.3 also.   I don't currently have any solr config 
> settings for MergePolicy at all.  Am I going to want to put something like 
> this into my index defaults section?
>
> <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>    <int name="maxMergeAtOnce">10</int>
>    <int name="segmentsPerTier">10</int>
> </mergePolicy>
>
> Thanks
> Robi
>
> -Original Message-
> From: Upayavira [mailto:u...@odoko.co.uk]
> Sent: Monday, June 17, 2013 12:29 PM
> To: solr-user@lucene.apache.org
> Subject: Re: yet another optimize question
>
> The key figures are numdocs vs maxdocs. Maxdocs-numdocs is the number of 
> deleted docs in your index.
>
> This is a 3.6 system you say. But has it been upgraded? I've seen folks 
> who've upgraded from 1.4 or 3.0/3.1 over time, keeping the old config.
> The consequence of this is that they don't get the right config for the 
> TieredMergePolicy, and therefore don't get to use it, seeing the old 
> behaviour which does require periodic optimise.
>
> Upayavira
>
> On Mon, Jun 17, 2013, at 07:21 PM, Petersen, Robert wrote:
>> Hi Otis,
>>
>> Right I didn't restart the JVMs except on the one slave where I was
>> experimenting with using G1GC on the 1.7.0_21 JRE.   Also some time ago I
>> made all our caches small enough to keep us from getting OOMs while still
>> having a good hit rate.Our index has about 50 fields which are mostly
>> int IDs and there are some dynamic fields also.  These dynamic fields
>> can be used for custom faceting.  We have some standard facets we
>> always facet on and other dynamic facets which are only used if the
>> query is filtering on a particular category.  There are hundreds of
>> these fields but since they are only for a small subset of the overall
>> index they are very sparsely populated with regard to the overall
>> index.  With CMS GC we get a sawtooth on the old generation (I guess
>> every replication and commit causes its usage to drop down to 10GB or
>> so) and it seems to be the old generation which is the main space
>> consumer.  With the G1GC, the memory map looked totally different!  I
>> was a little lost looking at memory consumption with that GC.  Maybe
>> I'll try it again now that the index is a bit smaller than it was last
>> time I tried it.  After four days without running an optimize now it
>> is 21GB.  BTW our indexing speed is mostly bound by the DB so reducing the 
>> segments might be ok...
>>
>> Here is a quick snapshot of one slaves memory map as reported by
>> PSI-Probe, but unfortunately I guess I can't send the history graphics
>> to the solr-user list to show their changes over time:
>> Name                 Used       Committed   Max         Initial     Group
>> Par Survivor Space   20.02 MB   108.13 MB   108.13 MB   108.13 MB   HEAP
>> CMS Perm Gen         42.29 MB   70.66 MB    82.00 MB    20.75 MB    NON_HEAP
>> Code Cache           9.73 MB    9.88 MB     48.00 MB    2.44 MB     NON_HEAP
>> CMS Old Gen          20.22 GB   30.94 GB    30.94 GB    30.94 GB    HEAP
>> Par Eden Space       42.20 MB   865.31 MB   865.31 MB   865.31 MB   HEAP
>> Total                20.33 GB   31.97 GB    32.02 GB    31.92 GB    TOTAL
>>
>> And here's our current cache stats from a random slave:
>>
>> name:queryResultCache
>> class:   org.apache.solr.search.LRUCache
>> version: 1.0
>> description: LRU Cache(maxSize=488, initialSize=6, autowarmCount=6,
>> regenerator=org.apache.solr.search.SolrIndexSearcher$3@461ff4c3)
>> stats:  lookups : 619
>> hits : 36
>> hitratio : 0.05
>> inserts : 592
>> evictions : 101
>> size : 488
>> warmupTime : 2949
>> cumulative_lookups : 681225
>> cumulative_hits : 73126
>> cumulative_hitratio : 0.10
>> cumulative_inserts : 602396
>> cumulative_evictions : 428868
>>
>>
>>  name:fieldCache
>> class:   org.apache.solr.search.SolrFieldCacheMBean
>> version: 1.0
>> description: Provides introspection of the Lucene FieldCache, this is
>> **NOT** a cache that is managed by Solr.
&

RE: yet another optimize question

2013-06-17 Thread Petersen, Robert
Hi Upayavira,

You might have gotten it.  Yes we noticed maxdocs was way bigger than numdocs.  
There were a lot of files ending in '.del' in the index folder also.  We 
started on 1.3 also.   I don't currently have any solr config settings for 
MergePolicy at all.  Am I going to want to put something like this into my 
index defaults section?

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
   <int name="maxMergeAtOnce">10</int>
   <int name="segmentsPerTier">10</int>
</mergePolicy>

Thanks
Robi

-Original Message-
From: Upayavira [mailto:u...@odoko.co.uk] 
Sent: Monday, June 17, 2013 12:29 PM
To: solr-user@lucene.apache.org
Subject: Re: yet another optimize question

The key figures are numdocs vs maxdocs. Maxdocs-numdocs is the number of 
deleted docs in your index.

This is a 3.6 system you say. But has it been upgraded? I've seen folks who've 
upgraded from 1.4 or 3.0/3.1 over time, keeping the old config.
The consequence of this is that they don't get the right config for the 
TieredMergePolicy, and therefore don't get to use it, seeing the old behaviour 
which does require periodic optimise.

Upayavira
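A small sketch of checking those two figures directly (Python with requests; host, port and core name are assumptions), using the Luke request handler rather than the admin UI:

    import requests

    core_url = "http://localhost:8983/solr/mycore"   # assumed host/port/core

    # The Luke handler reports index-level statistics, including numDocs and maxDoc.
    resp = requests.get(core_url + "/admin/luke", params={"numTerms": "0", "wt": "json"})
    resp.raise_for_status()

    index = resp.json()["index"]
    num_docs, max_doc = index["numDocs"], index["maxDoc"]
    print(f"numDocs={num_docs} maxDoc={max_doc} deleted={max_doc - num_docs}")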

On Mon, Jun 17, 2013, at 07:21 PM, Petersen, Robert wrote:
> Hi Otis,
> 
> Right I didn't restart the JVMs except on the one slave where I was
> experimenting with using G1GC on the 1.7.0_21 JRE.   Also some time ago I
> made all our caches small enough to keep us from getting OOMs while still
> having a good hit rate.Our index has about 50 fields which are mostly
> int IDs and there are some dynamic fields also.  These dynamic fields 
> can be used for custom faceting.  We have some standard facets we 
> always facet on and other dynamic facets which are only used if the 
> query is filtering on a particular category.  There are hundreds of 
> these fields but since they are only for a small subset of the overall 
> index they are very sparsely populated with regard to the overall 
> index.  With CMS GC we get a sawtooth on the old generation (I guess 
> every replication and commit causes its usage to drop down to 10GB or 
> so) and it seems to be the old generation which is the main space 
> consumer.  With the G1GC, the memory map looked totally different!  I 
> was a little lost looking at memory consumption with that GC.  Maybe 
> I'll try it again now that the index is a bit smaller than it was last 
> time I tried it.  After four days without running an optimize now it 
> is 21GB.  BTW our indexing speed is mostly bound by the DB so reducing the 
> segments might be ok...
> 
> Here is a quick snapshot of one slaves memory map as reported by 
> PSI-Probe, but unfortunately I guess I can't send the history graphics 
> to the solr-user list to show their changes over time:
> Name                 Used       Committed   Max         Initial     Group
> Par Survivor Space   20.02 MB   108.13 MB   108.13 MB   108.13 MB   HEAP
> CMS Perm Gen         42.29 MB   70.66 MB    82.00 MB    20.75 MB    NON_HEAP
> Code Cache           9.73 MB    9.88 MB     48.00 MB    2.44 MB     NON_HEAP
> CMS Old Gen          20.22 GB   30.94 GB    30.94 GB    30.94 GB    HEAP
> Par Eden Space       42.20 MB   865.31 MB   865.31 MB   865.31 MB   HEAP
> Total                20.33 GB   31.97 GB    32.02 GB    31.92 GB    TOTAL
> 
> And here's our current cache stats from a random slave:
> 
> name:queryResultCache  
> class:   org.apache.solr.search.LRUCache  
> version: 1.0  
> description: LRU Cache(maxSize=488, initialSize=6, autowarmCount=6,
> regenerator=org.apache.solr.search.SolrIndexSearcher$3@461ff4c3)
> stats:  lookups : 619
> hits : 36
> hitratio : 0.05
> inserts : 592
> evictions : 101
> size : 488
> warmupTime : 2949
> cumulative_lookups : 681225
> cumulative_hits : 73126
> cumulative_hitratio : 0.10
> cumulative_inserts : 602396
> cumulative_evictions : 428868
> 
> 
>  name:fieldCache  
> class:   org.apache.solr.search.SolrFieldCacheMBean  
> version: 1.0  
> description: Provides introspection of the Lucene FieldCache, this is
> **NOT** a cache that is managed by Solr.  
> stats:  entries_count : 359
> 
> 
> name:documentCache  
> class:   org.apache.solr.search.LRUCache  
> version: 1.0  
> description: LRU Cache(maxSize=2048, initialSize=512,
> autowarmCount=10, regenerator=null)
> stats:  lookups : 12710
> hits : 7160
> hitratio : 0.56
> inserts : 5636
> evictions : 3588
> size : 2048
> warmupTime : 0
> cumulative_lookups : 10590054
> cumulative_hits : 6166913
> cumulative_hitratio : 0.58
> cumulative_inserts : 4423141
> cumulative_evictions : 3714653
> 
> 
> name:fieldValueCache  
> class:   org.apache.solr.search.FastLRUCache
> version: 1.0
> description: Concurrent LRU Cache(maxSize=280, initialSize=280,
> minSize=252, acceptableSize=266, cleanupThread=false, autowarmCount=6,
> regenerator=org.apache.solr.search.SolrIndexSearcher$1@143eb77a)
> stats:  lookups : 1725
> hits : 1481
> hitratio : 0.85
> inserts : 122
> evictions : 0
> size : 128
> warmupTime : 4426
> cumulative_lookups : 3449712
> cumulative_hits : 3281805
> cumulative_hitratio : 0.95
> cumulative_inserts : 83261
> cumulative_evictions : 3479
> 
> 
> name:filterCache  
> class:   org.apache.solr.search.FastLRUCache  
> version: 1.0  
> description: Concurrent LRU Cache(maxSize=248, initialSize=12,
> minSize=223, acceptableSize=235, cleanupThread=false, autowarmCount=10,
> regenerator=org.apache.solr.search.SolrIndexSearcher$2@36e831d6)  
> stats:  lookups : 3990 
> hits : 3831 
> hitratio : 0.96 
> inserts : 239 
> evictions : 26 
> size : 244 
> warmupTime : 1 
> cumulative_lookups : 5745011 
> cumulative_hits : 5496150 
> cumulative_hitratio : 0.95 
> cumulative_inserts : 351485 
> cumulative_evictions : 276308
> 
> -Original Message-
> From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
> Sent: Saturday, June 15, 2013 5:52 AM
> To: solr-user@lucene.apache.org
> Subject: Re: yet another optimize question
> 
> Hi Robi,
> 
> I'm going to guess you are seeing smaller heap also simply because you
> restarted the JVM recently (hm, you don't say you restarted, maybe I'm
> making this up). If you are indeed indexing continuously then you
> shouldn't optimize. Lucene will merge segments itself. Lower mergeFactor
> will force it to do it more often (it means slower indexing, bigger IO
> hit when segments are merged, more per-segment data that Lucene/Solr need
> to read from the segment for faceting and such, etc.) so maybe you
> shouldn't mess with that.  Do you know what your caches are like in terms
> of size, hit %, evictions?  We've recently seen people set those to a few
> hundred K or even higher, which can eat a lot of heap.  We have had luck
> with G1 recently, too.
> Maybe you can run jstat and see which of the memory pools get filled up
> and change/increase appropriate JVM param based on that?  How many fields
> do you index, facet, or group on?
> 
> Otis
> --
> Performance Monitoring - http://sematext.com/spm/index.html
> Solr & ElasticSearch Support -- http://sematext.com/
> 
> 
> 
> 
> 
> On Fri, Jun 14, 2013 at 8:04 PM, Petersen, Robert
>  wrote:
> > Hi guys,
> >
> > We're on solr 3.6.1 and I've read the discussions about whether to optimize 
> > or not to optimize.  I decided to try not optimizing our index as was 
> > recommended.  We have a little over 15 million docs in our biggest index 
> > and a 32gb heap for our jvm.  So without the optimizes the index folder 
> > seemed to grow in size and quantity of files.  There seemed to be an upper 
> > limit but eventually it hit 300 files consuming 26gb of space and that 
> > seemed to push our slave farm over the edge and we started getting the 
> > dreaded OOMs.  We have continuous indexing activity, so I stopped the 
> > indexer and manually ran an optimize which made the index become 9 files 
> > consuming 15gb of space and our slave farm started having acceptable memory 
> > usage.  Our merge factor is 10, we're on java 7.  Before optimizing, I 
> > tried on one slave machine to go with the latest JVM and tried switching 
> > from the CMS GC to the G1GC but it hit OOM condition even faster.  So it 
> > seems like I have to continue to schedule a regular optimize.  Right now it 
> > has been a couple of days since running the optimize and the index is 
> > slowly growing bigger, now up to a bit over 19gb.  What do you guys think?  
> > Did I miss something that would make us able to run without doing an 
> > optimize?
> >
> > Robert (Robi) Petersen
> > Senior Software Engineer
> > Search Department
> 
> 


RE: yet another optimize question

2013-06-17 Thread Petersen, Robert
Hi Otis,

Right I didn't restart the JVMs except on the one slave where I was 
experimenting with using G1GC on the 1.7.0_21 JRE.   Also some time ago I made 
all our caches small enough to keep us from getting OOMs while still having a 
good hit rate.  Our index has about 50 fields which are mostly int IDs and 
there are some dynamic fields also.  These dynamic fields can be used for 
custom faceting.  We have some standard facets we always facet on and other 
dynamic facets which are only used if the query is filtering on a particular 
category.  There are hundreds of these fields but since they are only for a 
small subset of the overall index they are very sparsely populated with regard 
to the overall index.  With CMS GC we get a sawtooth on the old generation (I 
guess every replication and commit causes its usage to drop down to 10GB or 
so) and it seems to be the old generation which is the main space consumer.  
With the G1GC, the memory map looked totally different!  I was a little lost 
looking at memory consumption with that GC.  Maybe I'll try it again now that 
the index is a bit smaller than it was last time I tried it.  After four days 
without running an optimize now it is 21GB.  BTW our indexing speed is mostly 
bound by the DB so reducing the segments might be ok...

Here is a quick snapshot of one slave's memory map as reported by PSI-Probe, but 
unfortunately I guess I can't send the history graphics to the solr-user list 
to show their changes over time:
Name                 Used       Committed   Max         Initial     Group
Par Survivor Space   20.02 MB   108.13 MB   108.13 MB   108.13 MB   HEAP
CMS Perm Gen         42.29 MB   70.66 MB    82.00 MB    20.75 MB    NON_HEAP
Code Cache           9.73 MB    9.88 MB     48.00 MB    2.44 MB     NON_HEAP
CMS Old Gen          20.22 GB   30.94 GB    30.94 GB    30.94 GB    HEAP
Par Eden Space       42.20 MB   865.31 MB   865.31 MB   865.31 MB   HEAP
Total                20.33 GB   31.97 GB    32.02 GB    31.92 GB    TOTAL

And here's our current cache stats from a random slave:

name:queryResultCache  
class:   org.apache.solr.search.LRUCache  
version: 1.0  
description: LRU Cache(maxSize=488, initialSize=6, autowarmCount=6, 
regenerator=org.apache.solr.search.SolrIndexSearcher$3@461ff4c3)  
stats:  lookups : 619 
hits : 36 
hitratio : 0.05 
inserts : 592 
evictions : 101 
size : 488 
warmupTime : 2949 
cumulative_lookups : 681225 
cumulative_hits : 73126 
cumulative_hitratio : 0.10 
cumulative_inserts : 602396 
cumulative_evictions : 428868


 name:   fieldCache  
class:   org.apache.solr.search.SolrFieldCacheMBean  
version: 1.0  
description: Provides introspection of the Lucene FieldCache, this is 
**NOT** a cache that is managed by Solr.  
stats:  entries_count : 359


name:documentCache  
class:   org.apache.solr.search.LRUCache  
version: 1.0  
description: LRU Cache(maxSize=2048, initialSize=512, autowarmCount=10, 
regenerator=null)  
stats:  lookups : 12710 
hits : 7160 
hitratio : 0.56 
inserts : 5636 
evictions : 3588 
size : 2048 
warmupTime : 0 
cumulative_lookups : 10590054 
cumulative_hits : 6166913 
cumulative_hitratio : 0.58 
cumulative_inserts : 4423141 
cumulative_evictions : 3714653


name:fieldValueCache  
class:   org.apache.solr.search.FastLRUCache  
version: 1.0  
description: Concurrent LRU Cache(maxSize=280, initialSize=280, 
minSize=252, acceptableSize=266, cleanupThread=false, autowarmCount=6, 
regenerator=org.apache.solr.search.SolrIndexSearcher$1@143eb77a)  
stats:  lookups : 1725 
hits : 1481 
hitratio : 0.85 
inserts : 122 
evictions : 0 
size : 128 
warmupTime : 4426 
cumulative_lookups : 3449712 
cumulative_hits : 3281805 
cumulative_hitratio : 0.95 
cumulative_inserts : 83261 
cumulative_evictions : 3479


name:filterCache  
class:   org.apache.solr.search.FastLRUCache  
version: 1.0  
description: Concurrent LRU Cache(maxSize=248, initialSize=12, minSize=223, 
acceptableSize=235, cleanupThread=false, autowarmCount=10, 
regenerator=org.apache.solr.search.SolrIndexSearcher$2@36e831d6)  
stats:  lookups : 3990 
hits : 3831 
hitratio : 0.96 
inserts : 239 
evictions : 26 
size : 244 
warmupTime : 1 
cumulative_lookups : 5745011 
cumulative_hits : 5496150 
cumulative_hitratio : 0.95 
cumulative_inserts : 351485 
cumulative_evictions : 276308
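
(Aside, using the cumulative numbers pasted above: the hit ratio is just cumulative_hits / cumulative_lookups, which a few lines of Python confirm and which comes out close to the reported hitratio figures. The queryResultCache is barely earning its keep at roughly 0.1, while the filterCache is doing real work at roughly 0.95 despite its heavy eviction count.)

# Numbers copied from the cumulative cache stats pasted above:
# (cumulative_hits, cumulative_lookups, cumulative_evictions)
caches = {
    "queryResultCache": (73126, 681225, 428868),
    "documentCache": (6166913, 10590054, 3714653),
    "filterCache": (5496150, 5745011, 276308),
}
for name, (hits, lookups, evictions) in caches.items():
    print("%-16s cumulative hit ratio %.2f, %d evictions"
          % (name, hits / lookups, evictions))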

-Original Message-
From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
Sent: Saturday, June 15, 2013 5:52 AM
To: solr-user@lucene.apache.org
Subject: Re: yet another optimize question

Hi Robi,

I'm going to guess you are seeing smaller heap also simply because you 
restarted the JVM recently (hm, you don't say you restarted, maybe I'm making 
this up). If you are indeed indexing continuously then you shouldn't optimize.

Re: yet another optimize question

2013-06-15 Thread Otis Gospodnetic
Hi Robi,

I'm going to guess you are seeing smaller heap also simply because you
restarted the JVM recently (hm, you don't say you restarted, maybe I'm
making this up). If you are indeed indexing continuously then you
shouldn't optimize. Lucene will merge segments itself. Lower
mergeFactor will force it to do it more often (it means slower
indexing, bigger IO hit when segments are merged, more per-segment
data that Lucene/Solr need to read from the segment for faceting and
such, etc.) so maybe you shouldn't mess with that.  Do you know what
your caches are like in terms of size, hit %, evictions?  We've
recently seen people set those to a few hundred K or even higher,
which can eat a lot of heap.  We have had luck with G1 recently, too.
Maybe you can run jstat and see which of the memory pools get filled
up and change/increase appropriate JVM param based on that?  How many
fields do you index, facet, or group on?

Otis
--
Performance Monitoring - http://sematext.com/spm/index.html
Solr & ElasticSearch Support -- http://sematext.com/
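
(Not from the thread itself: a minimal sketch of the jstat idea above, in Python. It assumes a JDK-7-era jstat whose -gcutil output carries an "O" column for old-generation occupancy; the pid of the Solr JVM is taken from the command line, and the 85% threshold is only an example.)

import subprocess
import sys
import time

def old_gen_pct(pid):
    # One jstat -gcutil sample: first line is the column header, second the values.
    out = subprocess.check_output(["jstat", "-gcutil", str(pid)], text=True)
    header, values = out.splitlines()[:2]
    cols = dict(zip(header.split(), values.split()))
    return float(cols["O"])  # "O" = old generation occupancy, in percent

pid = int(sys.argv[1])   # pid of the Solr JVM
threshold = 85.0         # arbitrary example threshold
while True:
    pct = old_gen_pct(pid)
    note = "  <-- old gen nearly full" if pct > threshold else ""
    print("old gen: %5.1f%%%s" % (pct, note))
    time.sleep(10)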





On Fri, Jun 14, 2013 at 8:04 PM, Petersen, Robert
 wrote:
> Hi guys,
>
> We're on solr 3.6.1 and I've read the discussions about whether to optimize 
> or not to optimize.  I decided to try not optimizing our index as was 
> recommended.  We have a little over 15 million docs in our biggest index and 
> a 32gb heap for our jvm.  So without the optimizes the index folder seemed to 
> grow in size and quantity of files.  There seemed to be an upper limit but 
> eventually it hit 300 files consuming 26gb of space and that seemed to push 
> our slave farm over the edge and we started getting the dreaded OOMs.  We 
> have continuous indexing activity, so I stopped the indexer and manually ran 
> an optimize which made the index become 9 files consuming 15gb of space and 
> our slave farm started having acceptable memory usage.  Our merge factor is 
> 10, we're on java 7.  Before optimizing, I tried on one slave machine to go 
> with the latest JVM and tried switching from the CMS GC to the G1GC but it 
> hit OOM condition even faster.  So it seems like I have to continue to 
> schedule a regular optimize.  Right now it has been a couple of days since 
> running the optimize and the index is slowly growing bigger, now up to a bit 
> over 19gb.  What do you guys think?  Did I miss something that would make us 
> able to run without doing an optimize?
>
> Robert (Robi) Petersen
> Senior Software Engineer
> Search Department


yet another optimize question

2013-06-14 Thread Petersen, Robert
Hi guys,

We're on solr 3.6.1 and I've read the discussions about whether to optimize or 
not to optimize.  I decided to try not optimizing our index as was recommended. 
 We have a little over 15 million docs in our biggest index and a 32gb heap for 
our jvm.  So without the optimizes the index folder seemed to grow in size and 
quantity of files.  There seemed to be an upper limit but eventually it hit 300 
files consuming 26gb of space and that seemed to push our slave farm over the 
edge and we started getting the dreaded OOMs.  We have continuous indexing 
activity, so I stopped the indexer and manually ran an optimize which made the 
index become 9 files consuming 15gb of space and our slave farm started having 
acceptable memory usage.  Our merge factor is 10, we're on java 7.  Before 
optimizing, I tried on one slave machine to go with the latest JVM and tried 
switching from the CMS GC to the G1GC but it hit OOM condition even faster.  So 
it seems like I have to continue to schedule a regular optimize.  Right now it 
has been a couple of days since running the optimize and the index is slowly 
growing bigger, now up to a bit over 19gb.  What do you guys think?  Did I miss 
something that would make us able to run without doing an optimize?

Robert (Robi) Petersen
Senior Software Engineer
Search Department
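
(Not from the thread itself: the two update-handler commands the discussion keeps returning to, a bounded optimize and a commit with expungeDeletes, sketched in Python against a placeholder core URL. Both XML messages are standard Solr update commands, but maxSegments="5" is only an example value.)

import urllib.request

SOLR_UPDATE = "http://localhost:8983/solr/collection1/update"  # assumption: adjust to your core

def post_xml(message):
    # POST a raw XML update message to the update handler.
    req = urllib.request.Request(SOLR_UPDATE, data=message.encode("utf-8"),
                                 headers={"Content-Type": "text/xml"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

# Bounded optimize: rewrite down to at most 5 segments rather than forcing
# everything into a single segment.
print(post_xml('<optimize maxSegments="5" waitSearcher="false"/>'))

# Lighter-weight alternative: only merge segments that carry deletions.
print(post_xml('<commit expungeDeletes="true" waitSearcher="false"/>'))

The expungeDeletes variant should reclaim most of the space held by deleted documents without the full read-and-rewrite cost of an optimize, which is usually the part that hurts on a box this size.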