Solr index size statistics

2017-12-02 Thread John Davis
Hello,
Is there a way to get index size statistics for a given Solr instance? For
example, broken down by each field stored or indexed. The only things I know of are
running du on the index data files and getting counts per field
indexed/stored; however, each field can be quite different with respect to size.

Thanks
John


Re: JVM GC Issue

2017-12-02 Thread S G
I am a bit curious about the docValues implementation.
I understand that docValues do not use JVM memory and
instead make use of the OS cache - that is why they are more performant.

But to return any response from docValues, the values in the
docValues' column-oriented structures need to be brought
into the JVM's memory, and that will then increase the pressure
on the JVM's memory anyway. So how do docValues actually
help from a memory perspective?

Thanks
SG


On Sat, Dec 2, 2017 at 12:39 AM, Dominique Bejean  wrote:

> Hi, Thank you for the explanations about faceting. I was thinking the hit
> count had a bigger impact on the facet memory lifecycle. Regardless of the
> hit count, there is a query peak at the time the issue occurs. This is modest
> compared to what Solr is supposed to be able to handle, but it should be
> sufficient to explain the growing GC activity. Queries per minute:
>
>  198 10:07
>  208 10:08
>  267 10:09
>  285 10:10
>  244 10:11
>  286 10:12
>  277 10:13
>  252 10:14
>  183 10:15
>  302 10:16
>  299 10:17
>  273 10:18
>  348 10:19
>  468 10:20
>  496 10:21
>  673 10:22
>  496 10:23
>  101 10:24
>
> At the time the issue occurs, we see the CPU activity grow very high. Maybe
> there is a lack of CPU. So, I will suggest all actions that
> will remove pressure on heap memory.
>
>
>- enable docValues
>- divide the cache sizes by 2 in order to go back to the Solr defaults
>- refine the fl parameter, as I know it can be optimized
>
> Concerning the phonetic filter, it will be removed anyway, as a large number of
> the results are really irrelevant. Regards. Dominique
>
>
> Le sam. 2 déc. 2017 à 04:25, Erick Erickson  a
> écrit :
>
> > Dominique:
> >
> > Actually, the memory requirements shouldn't really go up as the number
> > of hits increases. The general algorithm is (say rows=10):
> > Calculate the score of each doc.
> > If the score is zero, ignore it.
> > If the score is > the minimum in my current top 10, replace the lowest
> > scoring doc in my current top 10 with the new doc (a PriorityQueue,
> > last I knew).
> > Else, discard the doc.
> >
> > When all docs have been scored, assemble the return from the top 10
> > (or whatever rows is set to).
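The loop Erick describes can be sketched in a few lines. This is an illustrative sketch with made-up scores, not Lucene's actual PriorityQueue code:

```python
import heapq

def top_n(scored_docs, n=10):
    """Stream through (doc_id, score) pairs, keeping only the n best.

    Memory stays O(n) regardless of the total hit count, which is why a
    large number of hits does not by itself inflate the heap.
    """
    heap = []  # min-heap of (score, doc_id); heap[0] is the weakest keeper
    for doc_id, score in scored_docs:
        if score == 0:
            continue                                  # zero-scoring docs are ignored
        if len(heap) < n:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))  # evict the current minimum
        # else: discard the doc
    # assemble the return from the survivors, best score first
    return sorted(heap, reverse=True)

# 100,000 hits, but only 10 (score, doc_id) pairs are ever retained at once
best = top_n(("doc%d" % i, float(i % 97)) for i in range(100_000))
```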
> >
> > The key here is that most of the Solr index is kept in
> > MMapDirectory/OS space; see Uwe's excellent blog here:
> > http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html.
> > In terms of _searching_, very little of the Lucene index structures
> > are kept in memory.
> >
> > That said, faceting plays a bit loose with the rules. If you have
> > docValues set to true, most of the memory structures are in the OS
> > memory space, not the JVM. If you have docValues set to false, then
> > the "uninverted" structure is built in the JVM heap space.
> >
> > Additionally, the JVM requirements are sensitive to the number of
> > unique values in the field being faceted on. For instance, let's say you
> > faceted by a date field with just facet.field=some_date_field. A
> > bucket would have to be allocated to hold the counts for each and
> > every unique date field, i.e. one for each millisecond in your search,
> > which might be something you're seeing. Conceptually this is just an
> > array[uniqueValues] of ints (longs? I'm not sure). This should be
> > relatively easily testable by omitting the facets while measuring.
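A back-of-the-envelope sketch of that allocation, assuming one 4-byte int counter per unique value as Erick describes (an estimate for intuition, not a measured Solr internal):

```python
def facet_counter_bytes(unique_values, bytes_per_counter=4):
    """Rough size of the count array allocated when faceting on one field."""
    return unique_values * bytes_per_counter

# A date field at millisecond precision: one day of data can hold up to
# 86,400,000 distinct timestamps, each needing its own counter bucket.
one_day_ms = 24 * 60 * 60 * 1000
fine = facet_counter_bytes(one_day_ms)   # hundreds of MB of counters
coarse = facet_counter_bytes(24)         # faceting per hour instead: 96 bytes
```

This is why omitting the facet, or faceting on a coarser-grained copy of the field, is such an effective test.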
> >
> > Where the number of rows _does_ make a difference is in the return
> > packet. Say I have rows=10. In that case I create a single return
> > packet with all 10 docs' "fl" fields. If rows=10,000 then that return
> > packet is obviously 1,000 times as large and must be assembled in
> > memory.
> >
> > I rather doubt the phonetic filter is to blame. But you can test this
> > by just omitting the field containing the phonetic filter in the
> > search query. I've certainly been wrong before.
> >
> > Best,
> > Erick
> >
> > On Fri, Dec 1, 2017 at 2:31 PM, Dominique Bejean
> >  wrote:
> > > Hi,
> > >
> > >
> > > Thank you both for your responses.
> > >
> > >
> > > I just have solr log for the very last period of the CG log.
> > >
> > >
> > > Grep command allows me to count queries per minute with hits > 1000 or
> >
> > > 1 and so with the biggest impact on memory and cpu during faceting
> > >
> > >
> > >> 1000
> > >
> > >  59 11:13
> > >
> > >  45 11:14
> > >
> > >  36 11:15
> > >
> > >  45 11:16
> > >
> > >  59 11:17
> > >
> > >  40 11:18
> > >
> > >  95 11:19
> > >
> > > 123 11:20
> > >
> > > 137 11:21
> > >
> > > 123 11:22
> > >
> > >  86 11:23
> > >
> > >  26 11:24
> > >
> > >  19 11:25
> > >
> > >  17 11:26
> > >
> > >
> > >> 1
> > >
> > >  55 11:19
> > >
> > >  78 11:20
> > >
> > >  48 11:21
> > >
> > > 134 11:22
> > >
> > >  93 11:23
> > >
> > >  10 11:24
> > >
> > >
> > > So we see that at the time the GC starts to go nuts, the count of large
> > > result sets increases.
> > >
> > >
> > > The query field include phonetic filter and 

Re: Having trouble indexing nested docs using "split" feature.

2017-12-02 Thread Shawn Heisey

On 12/2/2017 12:55 PM, David Lee wrote:

{
   "responseHeader":{
     "status":0,
     "QTime":798}}

Though the status indicates there was no error, when I try to query on 
the data using *:*, I get this:


curl 'http://localhost:8983/solr/my_collection/select?q=*:*'
{
   "responseHeader":{
     "zkConnected":true,
     "status":0,
     "QTime":6,
     "params":{
   "q":"*:*"}},
   "response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]
   }}

So it looks like no documents were actually indexed from above. I'm 
trying to determine if this is due to an error in the reference manual, 
or if I haven't set up Solr correctly.


I don't know anything at all about the split feature or the parent/child 
document feature.  I'm going to concentrate on the fact that numFound is 
zero.  With the indexing returning a success response, there should have 
been SOMETHING indexed.


Did you ever do a commit operation?  This can be an explicit operation, 
or there are some ways you can have it happen automatically.  If you 
include a commitWithin parameter on the indexing request, then there 
will be an automatic commit within that many milliseconds from when 
indexing started.  You can configure autoSoftCommit in solrconfig.xml, 
then reload the core/collection or restart Solr.
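As an illustration of the automatic options, the relevant solrconfig.xml section looks roughly like this (the maxTime values are examples, not recommendations):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit: flushes to stable storage; openSearcher=false means it
       does not by itself make changes visible -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit: opens a new searcher so changes become visible -->
  <autoSoftCommit>
    <maxTime>5000</maxTime>
  </autoSoftCommit>
</updateHandler>
```

Alternatively, appending commitWithin=5000 to the update request URL asks Solr to make the documents visible within five seconds of the request.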


Unless there is a commit that opens a new searcher, changes made to the 
index will never be visible to clients.


https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

The article title says "SolrCloud" but all the information is just as 
applicable to standalone mode.


If you *have* done a commit with openSearcher set to true (which is the 
default setting for openSearcher), then we'll need to examine solr.log, 
and you'll need to be sure that the indexing request happened during the 
time the log was created.


Thanks,
Shawn


Re: Solr JVM best practices

2017-12-02 Thread Shawn Heisey

On 12/2/2017 8:43 AM, Dominique Bejean wrote:

I would like to have some advice on best practices related to heap size,
MMap, direct memory, GC algorithm and OS swap.


For the most part, there is no generic advice we can give you for these 
things.  What you need is going to be highly dependent on exactly what 
you are doing with Solr and how much index data you have.  There are no 
formulas for calculating these values based on information about your setup.


Experienced Solr users can make *guesses* if you provide some 
information, but those guesses might turn out to be completely wrong.


https://lucidworks.com/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/


About JVM heap size setting

JVM heap size setting is related to the use case, so there is no advice other
than to reduce it to the minimum possible size in order to avoid GC issues.
Reducing the heap size to its minimum is achieved mainly by:


The max heap size should be as large as you need, and no larger. 
Figuring out what you need may require trial and error on an 
installation that has all the index data and is receiving production 
queries.


On this wiki page, I wrote a small section about one way you MIGHT be 
able to figure out what heap size you need:


https://wiki.apache.org/solr/SolrPerformanceProblems#How_much_heap_space_do_I_need.3F


- Optimize the schema by removing unused fields, and do not index/store
  fields when it is not necessary
- Enable docValues on fields used for faceting, sorting and grouping
- Do not oversize the Solr caches
- Be careful with the rows and fl query parameters


These are good ideas.  But sometimes you find that you need a lot of 
fields, and you need a lot of them to be stored.  The schema and config 
should be designed around what you need Solr to do.  Designing them for 
the lowest possible memory usage might result in a config that doesn't 
do what you want.



About MMap setting

According to the great article “
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html”
from Uwe Schindler, the only tasks that have to be done at the OS settings
level are to check that “ulimit -v” and “ulimit -m” both report “unlimited” and
to increase the vm.max_map_count setting from its default of 65536.


The default directory implementation that recent Solr versions use is 
NRTCachingDirectoryFactory.  This wraps another implementation with a 
small memory cache.  The implementation that is wrapped by default DOES 
use MMAP.
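For reference, the stock solrconfig.xml expresses that wrapping like this (overridable via the solr.directoryFactory system property):

```xml
<!-- NRTCachingDirectoryFactory wraps the MMap-based default directory with a
     small RAM cache for small, recently flushed segments -->
<directoryFactory name="DirectoryFactory"
                  class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>
```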


The amount of memory used for caching MMAP access cannot be configured 
in the application.  The OS handles that caching completely 
automatically, without any configuration at all.  All modern operating 
systems are designed so that the disk cache can use *all* available 
memory in the system.  This is because the cache will instantly give up 
memory if a program requests it.  The cache never keeps memory that 
programs want.



I suppose the best value is related to the available off-heap memory. I 
generally set it to 262144. Is it a good value, or is there a better way to 
determine this value?


Solr doesn't use any off heap memory as far as I'm aware.  There was a 
fork of Solr for a short time named heliosearch, which DID use off-heap 
memory.  Java itself will use some off-heap memory for its own 
operation.  I do not know whether that is configurable, and if so, how 
it's done.



About Direct Memory

According to a response on the Solr mailing list from Uwe Schindler (again), I 
understand that MMapDirectory is not direct memory.

The only place where I have read that the MaxDirectMemorySize JVM setting has to 
be set for Solr is in a Cloudera blog post and on the Solr mailing list when 
using Solr with HDFS.

Is it necessary to change the default MaxDirectMemorySize JVM setting? If 
yes, how do I determine the appropriate value?


I have never heard of this "direct memory."  Solr probably doesn't use 
it.  I really have no idea what happens when the index is in HDFS. 
You'd have to ask somebody who knows Hadoop.



About OS Swap setting

Linux generally starts swapping when less than 30% of the memory is free.
In order to avoid the OS competing with Solr for off-heap memory, I 
usually change the OS swappiness value to 0. Can you confirm it is a good thing?


If the OS starts swapping, performance of everything on the machine is 
going to drop significantly.  Setting swappiness to 0 is probably a good 
idea.  Most Linux distributions default to 60 here, which means the OS 
is going to aggressively start swapping anything it thinks isn't being 
used, even before memory pressure becomes extreme.
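The two OS-level settings discussed in this thread can be persisted in a sysctl fragment like the one below. The 262144 figure is the value Dominique mentions, not a universal recommendation:

```
# /etc/sysctl.d/99-solr.conf -- apply with: sysctl --system
vm.swappiness = 0
vm.max_map_count = 262144
```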



About CMS GC vs G1 GC

The default Solr settings use the CMS GC.

According to the post from Shawn Heisey in the old Solr wiki (
https://wiki.apache.org/solr/ShawnHeisey), can we consider that G1 GC can
definitely be used with Solr for heap sizes over roughly 4 GB?


I've never had any problems with G1, and my experiments suggest that it 
does a better job of reducing GC pauses than CMS does, if 

Re: Having trouble indexing nested docs using "split" feature.

2017-12-02 Thread David Lee

Sorry about the formatting for the first part, hope this is clearer:

{
  "book_id": "1234",
  "book_title": "The Martian Chronicles",
  "author": "Ray Bradbury",
  "reviews": [
    {
      "reviewer": "John Smith",
      "reviewer_background": {
        "highest_rank": "Excellent",
        "latest_review": "10/15/2017 10:15:00.000 CST"
      }
    },
    {
      "reviewer": "Adam Smith",
      "reviewer_background": {
        "highest_rank": "Good",
        "latest_review": "10/10/2017 16:18:00.000 CST"
      }
    }
  ],
  "checkouts": [
    {
      "member_id": "aaabbbccc",
      "member_name": "Sam Jackson"
    },
    {
      "member_id": "bbbcccddd",
      "member_name": "Buddy Jones"
    }
  ]
}


On 12/2/2017 1:55 PM, David Lee wrote:

Hi all,

I've been trying for some time now to find a suitable way to deal with 
json documents that have nested data. By suitable, I mean being able 
to index them and retrieve them so that they are in the same structure 
as when indexed.


I'm using version 7.1 under linux Mint 18.3 with Oracle Java 
1.8.0_151. After untarring the distribution, I ran through the 
"getting started" tutorial from the reference manual where it had me 
create the techproducts index. I then created another collection 
called my_collection so I could run the examples more easily. It used 
the _default schema.


Here is a sample:

{
  "book_id": "1234",
  "book_title": "The Martian Chronicles",
  "author": "Ray Bradbury",
  "reviews": [
    {
      "reviewer": "John Smith",
      "reviewer_background": {
        "highest_rank": "Excellent",
        "latest_review": "10/15/2017 10:15:00.000 CST"
      }
    },
    {
      "reviewer": "Adam Smith",
      "reviewer_background": {
        "highest_rank": "Good",
        "latest_review": "10/10/2017 16:18:00.000 CST"
      }
    }
  ],
  "checkouts": [
    {
      "member_id": "aaabbbccc",
      "member_name": "Sam Jackson"
    },
    {
      "member_id": "bbbcccddd",
      "member_name": "Buddy Jones"
    }
  ]
}


Obviously, I'll need to search at the parent level and child level. I 
started experimenting and tried to use one of the examples from 
"Transforming and Indexing Solr JSON". However, when I tried the first 
example as follows:


curl 'http://localhost:8983/solr/my_collection/update/json/docs'\
'?split=/exams'\
'&f=first:/first'\
'&f=last:/last'\
'&f=grade:/grade'\
'&f=subject:/exams/subject'\
'&f=test:/exams/test'\
'&f=marks:/exams/marks'\
  -H 'Content-type:application/json' -d '
{
   "first": "John",
   "last": "Doe",
   "grade": 8,
   "exams": [
 {
   "subject": "Maths",
   "test"   : "term1",
   "marks"  : 90},
 {
   "subject": "Biology",
   "test"   : "term1",
   "marks"  : 86}
   ]
}'

{
  "responseHeader":{
    "status":0,
    "QTime":798}}

Though the status indicates there was no error, when I try to query on 
the data using *:*, I get this:


curl 'http://localhost:8983/solr/my_collection/select?q=*:*'
{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":6,
    "params":{
  "q":"*:*"}},
  "response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]
  }}

So it looks like no documents were actually indexed from above. I'm 
trying to determine if this is due to an error in the reference 
manual, or if I haven't set up Solr correctly.


I've tried other techniques (not using the split option) like from 
Yonik's site, but those are slightly dated and I was hoping there was 
a more practical approach with the release of Solr 7.


Any assistance would be appreciated.

Thank you.









Having trouble indexing nested docs using "split" feature.

2017-12-02 Thread David Lee

Hi all,

I've been trying for some time now to find a suitable way to deal with 
json documents that have nested data. By suitable, I mean being able to 
index them and retrieve them so that they are in the same structure as 
when indexed.


I'm using version 7.1 under linux Mint 18.3 with Oracle Java 1.8.0_151. 
After untarring the distribution, I ran through the "getting started" 
tutorial from the reference manual where it had me create the 
techproducts index. I then created another collection called 
my_collection so I could run the examples more easily. It used the 
_default schema.


Here is a sample:

{
  "book_id": "1234",
  "book_title": "The Martian Chronicles",
  "author": "Ray Bradbury",
  "reviews": [
    {
      "reviewer": "John Smith",
      "reviewer_background": {
        "highest_rank": "Excellent",
        "latest_review": "10/15/2017 10:15:00.000 CST"
      }
    },
    {
      "reviewer": "Adam Smith",
      "reviewer_background": {
        "highest_rank": "Good",
        "latest_review": "10/10/2017 16:18:00.000 CST"
      }
    }
  ],
  "checkouts": [
    {
      "member_id": "aaabbbccc",
      "member_name": "Sam Jackson"
    },
    {
      "member_id": "bbbcccddd",
      "member_name": "Buddy Jones"
    }
  ]
}


Obviously, I'll need to search at the parent level and child level. I 
started experimenting and tried to use one of the examples from 
"Transforming and Indexing Solr JSON". However, when I tried the first 
example as follows:


curl 'http://localhost:8983/solr/my_collection/update/json/docs'\
'?split=/exams'\
'&f=first:/first'\
'&f=last:/last'\
'&f=grade:/grade'\
'&f=subject:/exams/subject'\
'&f=test:/exams/test'\
'&f=marks:/exams/marks'\
  -H 'Content-type:application/json' -d '
{
   "first": "John",
   "last": "Doe",
   "grade": 8,
   "exams": [
 {
   "subject": "Maths",
   "test"   : "term1",
   "marks"  : 90},
 {
   "subject": "Biology",
   "test"   : "term1",
   "marks"  : 86}
   ]
}'

{
  "responseHeader":{
    "status":0,
    "QTime":798}}

Though the status indicates there was no error, when I try to query on 
the data using *:*, I get this:


curl 'http://localhost:8983/solr/my_collection/select?q=*:*'
{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":6,
    "params":{
  "q":"*:*"}},
  "response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]
  }}

So it looks like no documents were actually indexed from above. I'm 
trying to determine if this is due to an error in the reference manual, 
or if I haven't set up Solr correctly.


I've tried other techniques (not using the split option) like from 
Yonik's site, but those are slightly dated and I was hoping there was a 
more practical approach with the release of Solr 7.


Any assistance would be appreciated.

Thank you.







Re: Solr JVM best practices

2017-12-02 Thread Walter Underwood
We decided to go with modern technology for the new cluster. CMS has been 
around for a very long time, maybe more than ten years.

These are the GC settings where we still use CMS. Instead of setting up a lot 
of ratios, I specify the sizes of the GC areas. That seems a lot clearer to 
me. We did some benchmarking, and increasing the new space to 2G reduced the 
growth of the tenured space. Most of Solr’s allocations have a lifetime of a 
single HTTP request.

-Xms8g
-Xmx8g
-XX:NewSize=2g
-XX:MaxPermSize=256m
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:+ExplicitGCInvokesConcurrent

The last flag was because something was invoking a full GC to get accurate 
memory usage. That was annoying.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Dec 2, 2017, at 8:18 AM, Dominique Bejean  
> wrote:
> 
> Hi Walter,
> 
> Thank you for this response. Did you use CMS before G1 ? Was there any GC
> issues fixed by G1 ?
> 
> Dominique
> 
> 
> Le sam. 2 déc. 2017 à 17:13, Walter Underwood  a
> écrit :
> 
>> We use an 8G heap and G1 with Shawn Heisey’s settings. Java 8, update 131.
>> 
>> This has been solid in production with a 32 node Solr Cloud cluster. We do
>> not do faceting.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Dec 2, 2017, at 7:43 AM, Dominique Bejean 
>> wrote:
>>> 
>>> Hi,
>>> 
>>> I would like to have some advices on best practices related to Heap Size,
>>> MMap, direct memory, GC algorithm and OS Swap.
>>> 
>>> This is a waste subject and sorry for this long question but all these
>>> items are linked in order to have a stable Solr environment.
>>> 
>>> My understanding and questions.
>>> 
>>> About JVM heap size setting
>>> 
>>> JVM heap size setting is related to use case so there is no other advice
>>> than reduce it at the minimum possible size in order to avoid GC issue.
>>> Reduce Heap size at is minimum will be achieved mainly by :
>>> 
>>>  -
>>> 
>>>  Optimize schema by remove unused fields and not index / store fields if
>>>  it is not necessary
>>>  -
>>> 
>>>  Enable docValues on fields used for facetting, sorting and grouping
>>>  -
>>> 
>>>  Not oversize Solr cache
>>>  -
>>> 
>>>  Be careful with rows and fl query parameters
>>> 
>>> 
>>> Any other advice is welcome :)
>>> 
>>> 
>>> About MMap setting
>>> 
>>> According to the great article “
>>> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html”
>>> from Uwe Schindler, the only tasks that have to be done at OS settings
>>> level is check that “ulimit -v” and “ulimit -m” both report “unlimited”
>> and
>>> increase vm.max_map_count setting from is default 65536.
>>> 
>>> I suppose the best value is related to available off heap memory. I
>>> generally set it to 262144. Is it a good value or is there a better way
>> to
>>> determine this value ?
>>> 
>>> 
>>> About Direct Memory
>>> 
>>> According to a response in Solr Maillig list from Uwe Schindler (again),
>> I
>>> understand that the MmapDirectory is not Direct Memory.
>>> 
>>> The only place where I read that MaxDirectMemorySize JVM setting have to
>> be
>>> set for Solr is in Cloudera blog post and in Solr mailing list when using
>>> Solr with HDFS.
>>> 
>>> Is it necessary to change the default MaxDirectMemorySize JVM setting ?
>> If
>>> yes, how to determine the appropriate value ?
>>> 
>>> 
>>> About OS Swap setting
>>> 
>>> Linux generally starts swapping when less than 30% of the memory is free.
>>> In order to avoid OS goes against Solr for off heap memory management,  I
>>> use to change OS swappiness value to 0. Can you confirm it is a good
>> thing ?
>>> 
>>> 
>>> About CMS GC vs G1 GC
>>> 
>>> Default Solr setting use CMS GC.
>>> 
>>> According to the post from Shawn Heisey in the old Solr wiki (
>>> https://wiki.apache.org/solr/ShawnHeisey), can we consider that G1 GC
>> can
>>> definitely be used with Solr for heap size over nearly 4Gb ?
>>> 
>>> 
>>> Regards
>>> 
>>> Dominique
>>> 
>>> --
>>> Dominique Béjean
>>> 06 08 46 12 43
>> 
>> --
> Dominique Béjean
> 06 08 46 12 43



Re: Solr JVM best practices

2017-12-02 Thread Dominique Bejean
Hi Walter,

Thank you for this response. Did you use CMS before G1 ? Was there any GC
issues fixed by G1 ?

Dominique


Le sam. 2 déc. 2017 à 17:13, Walter Underwood  a
écrit :

> We use an 8G heap and G1 with Shawn Heisey’s settings. Java 8, update 131.
>
> This has been solid in production with a 32 node Solr Cloud cluster. We do
> not do faceting.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Dec 2, 2017, at 7:43 AM, Dominique Bejean 
> wrote:
> >
> > Hi,
> >
> > I would like to have some advices on best practices related to Heap Size,
> > MMap, direct memory, GC algorithm and OS Swap.
> >
> > This is a waste subject and sorry for this long question but all these
> > items are linked in order to have a stable Solr environment.
> >
> > My understanding and questions.
> >
> > About JVM heap size setting
> >
> > JVM heap size setting is related to use case so there is no other advice
> > than reduce it at the minimum possible size in order to avoid GC issue.
> > Reduce Heap size at is minimum will be achieved mainly by :
> >
> >   -
> >
> >   Optimize schema by remove unused fields and not index / store fields if
> >   it is not necessary
> >   -
> >
> >   Enable docValues on fields used for facetting, sorting and grouping
> >   -
> >
> >   Not oversize Solr cache
> >   -
> >
> >   Be careful with rows and fl query parameters
> >
> >
> > Any other advice is welcome :)
> >
> >
> > About MMap setting
> >
> > According to the great article “
> > http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html”
> > from Uwe Schindler, the only tasks that have to be done at OS settings
> > level is check that “ulimit -v” and “ulimit -m” both report “unlimited”
> and
> > increase vm.max_map_count setting from is default 65536.
> >
> > I suppose the best value is related to available off heap memory. I
> > generally set it to 262144. Is it a good value or is there a better way
> to
> > determine this value ?
> >
> >
> > About Direct Memory
> >
> > According to a response in Solr Maillig list from Uwe Schindler (again),
> I
> > understand that the MmapDirectory is not Direct Memory.
> >
> > The only place where I read that MaxDirectMemorySize JVM setting have to
> be
> > set for Solr is in Cloudera blog post and in Solr mailing list when using
> > Solr with HDFS.
> >
> > Is it necessary to change the default MaxDirectMemorySize JVM setting ?
> If
> > yes, how to determine the appropriate value ?
> >
> >
> > About OS Swap setting
> >
> > Linux generally starts swapping when less than 30% of the memory is free.
> > In order to avoid OS goes against Solr for off heap memory management,  I
> > use to change OS swappiness value to 0. Can you confirm it is a good
> thing ?
> >
> >
> > About CMS GC vs G1 GC
> >
> > Default Solr setting use CMS GC.
> >
> > According to the post from Shawn Heisey in the old Solr wiki (
> > https://wiki.apache.org/solr/ShawnHeisey), can we consider that G1 GC
> can
> > definitely be used with Solr for heap size over nearly 4Gb ?
> >
> >
> > Regards
> >
> > Dominique
> >
> > --
> > Dominique Béjean
> > 06 08 46 12 43
>
> --
Dominique Béjean
06 08 46 12 43


Re: Solr JVM best practices

2017-12-02 Thread Walter Underwood
We use an 8G heap and G1 with Shawn Heisey’s settings. Java 8, update 131.

This has been solid in production with a 32 node Solr Cloud cluster. We do not 
do faceting.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Dec 2, 2017, at 7:43 AM, Dominique Bejean  
> wrote:
> 
> Hi,
> 
> I would like to have some advices on best practices related to Heap Size,
> MMap, direct memory, GC algorithm and OS Swap.
> 
> This is a waste subject and sorry for this long question but all these
> items are linked in order to have a stable Solr environment.
> 
> My understanding and questions.
> 
> About JVM heap size setting
> 
> JVM heap size setting is related to use case so there is no other advice
> than reduce it at the minimum possible size in order to avoid GC issue.
> Reduce Heap size at is minimum will be achieved mainly by :
> 
>   -
> 
>   Optimize schema by remove unused fields and not index / store fields if
>   it is not necessary
>   -
> 
>   Enable docValues on fields used for facetting, sorting and grouping
>   -
> 
>   Not oversize Solr cache
>   -
> 
>   Be careful with rows and fl query parameters
> 
> 
> Any other advice is welcome :)
> 
> 
> About MMap setting
> 
> According to the great article “
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html”
> from Uwe Schindler, the only tasks that have to be done at OS settings
> level is check that “ulimit -v” and “ulimit -m” both report “unlimited” and
> increase vm.max_map_count setting from is default 65536.
> 
> I suppose the best value is related to available off heap memory. I
> generally set it to 262144. Is it a good value or is there a better way to
> determine this value ?
> 
> 
> About Direct Memory
> 
> According to a response in Solr Maillig list from Uwe Schindler (again), I
> understand that the MmapDirectory is not Direct Memory.
> 
> The only place where I read that MaxDirectMemorySize JVM setting have to be
> set for Solr is in Cloudera blog post and in Solr mailing list when using
> Solr with HDFS.
> 
> Is it necessary to change the default MaxDirectMemorySize JVM setting ? If
> yes, how to determine the appropriate value ?
> 
> 
> About OS Swap setting
> 
> Linux generally starts swapping when less than 30% of the memory is free.
> In order to avoid OS goes against Solr for off heap memory management,  I
> use to change OS swappiness value to 0. Can you confirm it is a good thing ?
> 
> 
> About CMS GC vs G1 GC
> 
> Default Solr setting use CMS GC.
> 
> According to the post from Shawn Heisey in the old Solr wiki (
> https://wiki.apache.org/solr/ShawnHeisey), can we consider that G1 GC can
> definitely be used with Solr for heap size over nearly 4Gb ?
> 
> 
> Regards
> 
> Dominique
> 
> -- 
> Dominique Béjean
> 06 08 46 12 43



Re: JVM GC Issue

2017-12-02 Thread Dominique Bejean
Hi Toke,

Nearly 30% of the requests are setting facet.limit=200

Across 42,000 requests, the number of times each field is used for faceting is:

$ grep  'facet=true' select.log | grep -oP 'facet.field=([^&])*' | sort |
uniq -c | sort -r

 23119 facet.field=category_path

  8643 facet.field=EUR_0_price_decimal

  5560 facet.field=type_pratique_facet

  5560 facet.field=size_facet_facet

  5560 facet.field=marque_facet

  5560 facet.field=is_marketplace_origin_facet

  5560 facet.field=gender_facet

  5560 facet.field=color_facet_facet

  5560 facet.field=club_facet

  5560 facet.field=age_facet

  3290 facet.field=durete_facet

  3290 facet.field=diametre_roues_facet

   169 facet.field=EUR_1_price_decimal

38 facet.field=EUR_4_price_decimal


The largest unique-value counts for these fields are:

category_path 3025

marque_facet 2100

size_facet_facet 1400

type_pratique_facet 166

color_facet_facet 165

Here are 2 typical queries :

2017-11-20 10:13:27.585 INFO  (qtp592179046-15153) [   x:french]
o.a.s.c.S.Request [french]  webapp=/solr path=/select
params={mm=100%25=category_path=age_facet=is_marketplace_origin_facet=type_pratique_facet=gender_facet=color_facet_facet=size_facet_facet=EUR_0_price_decimal=EUR_0_price_decimal=marque_facet=club_facet=diametre_roues_facet=durete_facet=*:*&
json.nl=map=products_id,product_type_static,name_varchar,store_id,website_id,EUR_0_price_decimal=0=(store_id:"1")+AND+(website_id:"1")+AND+(product_status:"1")+AND+(category_id:"3389")+AND+(filter_visibility_int:"2"+OR+filter_visibility_int:"4")=48=is_marketplace_origin_boost_exact:"OUI"^210+is_marketplace_origin_boost:OUI^207+is_marketplace_origin_relative_boost:OUI^203+is_marketplace_origin_boost_exact:"NON"^210+is_marketplace_origin_boost:NON^207+is_marketplace_origin_relative_boost:NON^203+=*:*=200==edismax=textSearch=true=1=true=json=EUR_0_price_decimal=sort_EUR_0_special_price_decimal=true=1511172807}
hits=953 status=0 QTime=26

2017-11-20 10:17:28.193 INFO  (qtp592179046-17115) [   x:french]
o.a.s.c.S.Request [french]  webapp=/solr path=/select
params={mm=100%25=category_path=age_facet=is_marketplace_origin_facet=type_pratique_facet=gender_facet=color_facet_facet=size_facet_facet=EUR_0_price_decimal=marque_facet=club_facet&
json.nl=map=products_id,product_type_static,name_varchar,store_id,website_id,EUR_0_price_decimal=0=(store_id:"1")+AND+(website_id:"1")+AND+(product_status:"1")+AND+(filter_visibility_int:"3"+OR+filter_visibility_int:"4")8=name_boost_exact:"velo"^120+name_boost:"velo"^100+name_relative_boost:velo^80+category_boost:velo^60+is_marketplace_origin_boost_exact:"OUI"^210+is_marketplace_origin_boost:OUI^207+is_marketplace_origin_relative_boost:OUI^203+is_marketplace_origin_boost_exact:"NON"^210+is_marketplace_origin_boost:NON^207+is_marketplace_origin_relative_boost:NON^203+size_facet_boost_exact:"velo"^299+size_facet_boost:velo^296+size_facet_relative_boost:velo^292+marque_boost_exact:"velo"^359+marque_boost:velo^356+marque_relative_boost:velo^352+=velo=200=velo=edismax=textSearch=true=1=true=json=EUR_0_price_decimal=sort_EUR_0_special_price_decimal=true=1511173047}
hits=6761 status=0 QTime=38




Dominique


On Sat, Dec 2, 2017 at 16:23, Toke Eskildsen wrote:

> Dominique Bejean  wrote:
> > Hi, thank you for the explanations about faceting. I was thinking the hit
> > count had a bigger impact on the facet memory lifecycle.
>
> Only if you have a very high facet.limit. Could you provide us with a
> typical query, including all the parameters?
>
> - Toke Eskildsen
>
-- 
Dominique Béjean
06 08 46 12 43


Solr JVM best practices

2017-12-02 Thread Dominique Bejean
Hi,

I would like some advice on best practices related to heap size, MMap,
direct memory, the GC algorithm and OS swap.

This is a vast subject, and sorry for this long question, but all these
items are linked when it comes to having a stable Solr environment.

My understanding and questions.

About JVM heap size setting

The JVM heap size setting depends on the use case, so there is no better
advice than reducing the heap to the minimum possible size in order to
avoid GC issues. Reducing the heap size to its minimum is achieved mainly
by:

   - optimizing the schema: remove unused fields, and do not index/store
     fields unless necessary
   - enabling docValues on fields used for faceting, sorting and grouping
   - not oversizing the Solr caches
   - being careful with the rows and fl query parameters


Any other advice is welcome :)
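To make the docValues and cache points concrete, here is a minimal sketch of what the corresponding configuration could look like. The field name, cache class and sizes are illustrative assumptions, not taken from a real setup:

```xml
<!-- schema.xml: a hypothetical facet field, with docValues enabled and
     stored="false" since the raw value is not needed in responses -->
<field name="category_facet" type="string" indexed="true"
       stored="false" docValues="true"/>

<!-- solrconfig.xml: cache sizes kept close to the Solr defaults -->
<filterCache class="solr.FastLRUCache"
             size="512" initialSize="512" autowarmCount="0"/>
```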


About MMap setting

According to the great article “
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html”
from Uwe Schindler, the only tasks to perform at the OS level are to check
that “ulimit -v” and “ulimit -m” both report “unlimited” and to increase
the vm.max_map_count setting from its default of 65536.

I suppose the best value is related to the available off-heap memory. I
generally set it to 262144. Is this a good value, or is there a better way
to determine it?
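For reference, a sketch of how these checks and the setting change can be done on a Linux host. These are administrative commands to run as root; the value 262144 is the one mentioned above, not an official recommendation:

```shell
# Check that virtual memory and locked memory are unlimited
ulimit -v
ulimit -m

# Inspect the current mmap count limit
sysctl vm.max_map_count

# Raise it for the running system ...
sysctl -w vm.max_map_count=262144

# ... and persist it across reboots
echo "vm.max_map_count=262144" >> /etc/sysctl.conf
```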


About Direct Memory

According to a response on the Solr mailing list from Uwe Schindler
(again), I understand that MMapDirectory does not use direct memory.

The only places where I have read that the MaxDirectMemorySize JVM setting
has to be set for Solr are a Cloudera blog post and the Solr mailing list,
in the context of running Solr on HDFS.

Is it necessary to change the default MaxDirectMemorySize JVM setting? If
yes, how can the appropriate value be determined?


About OS Swap setting

Linux generally starts swapping when less than 30% of the memory is free.
In order to avoid the OS working against Solr's off-heap memory usage, I
usually change the OS swappiness value to 0. Can you confirm this is a good
practice?
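A sketch of the corresponding setting, assuming a Linux host dedicated to Solr (to be run as root). Note that on recent kernels a value of 1 is sometimes recommended instead of 0, since 0 can make the OOM killer more aggressive:

```shell
# Check the current swappiness (commonly 60 by default)
sysctl vm.swappiness

# Reduce it so the kernel avoids swapping as long as possible
sysctl -w vm.swappiness=0

# Persist across reboots
echo "vm.swappiness=0" >> /etc/sysctl.conf
```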


About CMS GC vs G1 GC

The default Solr settings use the CMS GC.

According to the post from Shawn Heisey on the old Solr wiki (
https://wiki.apache.org/solr/ShawnHeisey), can we consider that G1 GC can
definitely be used with Solr for heap sizes over roughly 4 GB?
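For what it is worth, a hedged example of the kind of G1 settings that have circulated for Solr, set through the GC_TUNE variable in solr.in.sh. The exact flags and values here are illustrative, not a recommendation:

```shell
# solr.in.sh: illustrative G1 settings for a heap of a few GB
GC_TUNE="-XX:+UseG1GC \
  -XX:+ParallelRefProcEnabled \
  -XX:G1HeapRegionSize=8m \
  -XX:MaxGCPauseMillis=250"
```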


Regards

Dominique

-- 
Dominique Béjean
06 08 46 12 43


Re: JVM GC Issue

2017-12-02 Thread Toke Eskildsen
Dominique Bejean  wrote:
> Hi, thank you for the explanations about faceting. I was thinking the hit
> count had a bigger impact on the facet memory lifecycle.

Only if you have a very high facet.limit. Could you provide us with a typical 
query, including all the parameters?

- Toke Eskildsen


Re: JVM GC Issue

2017-12-02 Thread Dominique Bejean
Hi, thank you for the explanations about faceting. I was thinking the hit
count had a bigger impact on the facet memory lifecycle. Regardless of the
hit count, there is a query peak at the time the issue occurs. It is modest
relative to what Solr is supposed to be able to handle, but it should be
sufficient to explain the growing GC activity (queries per minute):

198 10:07   208 10:08   267 10:09   285 10:10   244 10:11   286 10:12
277 10:13   252 10:14   183 10:15   302 10:16   299 10:17   273 10:18
348 10:19   468 10:20   496 10:21   673 10:22   496 10:23   101 10:24

At the time the issue occurs, the CPU activity also grows very high. Maybe
there is a lack of CPU. So, I will suggest all the actions that remove
pressure on the heap memory:


   - enable docValues on the fields used for faceting
   - divide the cache sizes by 2, to go back to the Solr defaults
   - refine the fl parameter, as I know it can be optimized

Concerning the phonetic filter: it will be removed anyway, as a large
number of the results it matches are really irrelevant. Regards. Dominique
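For the record, per-minute counts like the ones above can be extracted from solr.log with standard tools. This is a self-contained sketch using a fabricated three-line log; the log format is an assumption based on the snippets earlier in this thread:

```shell
# Build a tiny fake Solr log (format assumed from this thread)
cat > /tmp/solr_sample.log <<'EOF'
2017-11-20 10:16:12.001 INFO  (qtp-1) [   x:french] o.a.s.c.S.Request hits=1500 status=0 QTime=26
2017-11-20 10:16:40.002 INFO  (qtp-2) [   x:french] o.a.s.c.S.Request hits=200 status=0 QTime=10
2017-11-20 10:17:05.003 INFO  (qtp-3) [   x:french] o.a.s.c.S.Request hits=6761 status=0 QTime=38
EOF

# Count queries per minute whose hit count exceeds 1000
awk '/hits=/ {
  split($0, f, "hits=");          # isolate the hit count
  split(f[2], g, " ");
  if (g[1] + 0 > 1000)            # keep only large result sets
    count[substr($2, 1, 5)]++;    # bucket by HH:MM
}
END { for (m in count) printf "%d %s\n", count[m], m }' /tmp/solr_sample.log | sort -k2
# prints one "count HH:MM" line per minute: 1 10:16 and 1 10:17
```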


On Sat, Dec 2, 2017 at 04:25, Erick Erickson wrote:

> Dominique:
>
> Actually, the memory requirements shouldn't really go up as the number
> of hits increases. The general algorithm is (say rows=10):
> Calculate the score of each doc.
> If the score is zero, ignore it.
> If the score is > the minimum in my current top 10, replace the lowest
> scoring doc in my current top 10 with the new doc (a PriorityQueue,
> last I knew).
> Else discard the doc.
>
> When all docs have been scored, assemble the return from the top 10
> (or whatever rows is set to).
>
> The key here is that most of the Solr index is kept in
> MMapDirectory/OS space, see Uwe's excellent blog here:
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html.
> In terms of _searching_, very little of the Lucene index structures
> are kept in memory.
>
> That said, faceting plays a bit loose with the rules. If you have
> docValues set to true, most of the memory structures are in the OS
> memory space, not the JVM. If you have docValues set to false, then
> the "uninverted" structure is built in the JVM heap space.
>
> Additionally, the JVM requirements are sensitive to the number of
> unique values in field being faceted on. For instance, let's say you
> faceted by a date field with just facet.field=some_date_field. A
> bucket would have to be allocated to hold the counts for each and
> every unique date field, i.e. one for each millisecond in your search,
> which might be something you're seeing. Conceptually this is just an
> array[uniqueValues] of ints (longs? I'm not sure). This should be
> relatively easily testable by omitting the facets while measuring.
>
> Where the number of rows _does_ make a difference is in the return
> packet. Say I have rows=10. In that case I create a single return
> packet with all 10 docs "fl" field. If rows = 10,000 then that return
> packet is obviously 1,000 times as large and must be assembled in
> memory.
>
> I rather doubt the phonetic filter is to blame. But you can test this
> by just omitting the field containing the phonetic filter in the
> search query. I've certainly been wrong before.
>
> Best,
> Erick
>
> On Fri, Dec 1, 2017 at 2:31 PM, Dominique Bejean
>  wrote:
> > Hi,
> >
> >
> > Thank you both for your responses.
> >
> >
> > I just have solr log for the very last period of the CG log.
> >
> >
> > A grep command allows me to count the queries per minute with hits > 1000 or >
> > 1, i.e. those with the biggest impact on memory and CPU during faceting
> >
> >
> >> 1000
> >
> >  59 11:13
> >
> >  45 11:14
> >
> >  36 11:15
> >
> >  45 11:16
> >
> >  59 11:17
> >
> >  40 11:18
> >
> >  95 11:19
> >
> > 123 11:20
> >
> > 137 11:21
> >
> > 123 11:22
> >
> >  86 11:23
> >
> >  26 11:24
> >
> >  19 11:25
> >
> >  17 11:26
> >
> >
> >> 1
> >
> >  55 11:19
> >
> >  78 11:20
> >
> >  48 11:21
> >
> > 134 11:22
> >
> >  93 11:23
> >
> >  10 11:24
> >
> >
> > So we see that at the time the GC starts to become nuts, the large
> > result set counts increase.
> >
> >
> > The query field includes a phonetic filter and the results are really not
> > relevant due to this. I will suggest to:
> >
> > 1/ remove the phonetic filter, in order to have fewer irrelevant results
> > and so get smaller result sets
> >
> > 2/ enable docValues on the fields used for faceting
> >
> >
> > I expect this to decrease GC requirements and stabilize the GC.
> >
> >
> > Regards
> >
> >
> > Dominique
> >
> >
> >
> >
> >
> > On Fri, Dec 1, 2017 at 18:17, Erick Erickson wrote:
> >
> >> Your autowarm counts are rather high, but as Toke says this doesn't
> >> seem outrageous.
> >>
> >> I have seen situations where Solr is running close to the limits of
> >> its heap and GC only reclaims a tiny bit of memory each time. When you
> >> say "full GC with no memory reclaimed", is that really no memory _at
> >> all_? Or "almost no memory"?
> >>