On Mon, 2013-10-21 at 16:57 +0200, Lemke, Michael SZ/HZA-ZSW wrote:
> QTime enum:
>          1st call: 1200
>  subsequent calls: 200

Those numbers seem fine.

> QTime fc:
>        never returns, webserver restarts itself after 30 min with 100% CPU 
> load

It might be that it dies due to garbage collection. But since more
memory (which your test server presumably has) just leads to the
too-many-values error, there isn't much to do about that.

> QTime=41205              facet.prefix=            q=frequent_word          
> numFound=44532
> 
> Same query repeated:
> QTime=225810             facet.prefix=            q=ottomotor              
> numFound=909
> QTime=199839             facet.prefix=            q=ottomotor              
> numFound=909

I am stumped on this, sorry. I do not understand why the 'ottomotor'
query can take 5 times as long as the 'frequent_word' one.

> QTime=185948             facet.prefix=            q=ottomotor              
> numFound=909
> 
> QTime=3344               facet.prefix=d           q=ottomotor              
> numFound=909

Fits with expectations.

> >- Documents in your index
> 13,434,414
> 
> >- Unique values in the CONTENT field
> Not sure how to get this.  In luke I find
> 21,797,514 term count CONTENT

Those are the relevant numbers for faceting. There is a limit of 2^24
(16M) terms for facet.method=enum, although I am a bit unsure whether
that applies to the whole index or per segment.

Come to think of it, if you have a multi-segmented index, you might want
to try facet.method=fcs. It should have faster startup than fc and
better performance than enum for fields with a large number of unique
values. Memory requirements should be between fc and enum.
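To illustrate, a request with facet.method=fcs could be built like this (a sketch only: the host, core name, facet.limit and rows values are hypothetical; the field name CONTENT and the query term are taken from this thread):

```python
from urllib.parse import urlencode

# Hypothetical host and core; adjust to your installation.
params = {
    "q": "ottomotor",
    "rows": 0,                # only the facet counts are needed
    "facet": "true",
    "facet.field": "CONTENT",
    "facet.limit": 10,
    "facet.method": "fcs",    # per-segment field cache faceting
}
url = "http://localhost:8983/solr/collection1/select?" + urlencode(params)
print(url)
```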

> >- Xmx
> The maximum the system allows me to get: 1612m
> 
> Maybe I have a hopelessly under-dimensioned server for this sort of things?

Well, 1612m should be enough for the faceting in itself; it is the
startup that is the killer.

A rule of thumb for fc is that the internal structure takes at least
#docs*log(#references) + #references*log(#unique_values) bytes

If your content field is a description, let's say that each description
has 40 words, which gives us 500M references from documents to facet
values. This translates to
13M*log(500M) + 500M*log(22M) bytes ~= 13M*29 + 500M*25 bytes ~= 380MB.
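The log factors plugged in above can be sanity-checked: ceil(log2(n)) is the number of bits needed to address n distinct values (a quick sketch; the counts 500M and 22M are the estimates from this thread):

```python
import math

def bits_needed(n):
    """Bits required to address n distinct values: ceil(log2(n))."""
    return math.ceil(math.log2(n))

print(bits_needed(500_000_000))  # log(#references)    -> 29
print(bits_needed(22_000_000))   # log(#unique_values) -> 25
print(2**24)                     # the enum term limit -> 16,777,216 (16M)
```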

Taking into account that building the structure has an overhead of 2-3
times that, we are approaching the memory limit of 1612m. If the index
is updated, a new facet structure is built all over again while the old
structure is still in memory.


If you need better performance on your large field I would suggest, in
order of priority:

- facet.method=fcs
- facet.method=fcs with DocValues
- Shard your index and use facet.method=fc
- SOLR-2412 (https://issues.apache.org/jira/browse/SOLR-2412)

SOLR-2412 is a last resort, but it does have the same speed as
facet.method=fc, just without the 16M unique values limitation.
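For the DocValues option, a schema sketch might look as follows. Note this is an assumption-laden illustration: DocValues require a non-tokenized field type, so a tokenized full-text field like CONTENT would need a docValues-enabled sibling field (and a full reindex), and a plain copyField would copy the raw untokenized value, not individual terms. The field name below is hypothetical.

```xml
<!-- Hypothetical schema.xml fragment: a docValues-enabled sibling field.
     "string" is non-tokenized, which is what DocValues require;
     populating it with per-word values needs custom indexing logic. -->
<field name="CONTENT_dv" type="string" indexed="true" stored="false"
       multiValued="true" docValues="true"/>
```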

Regards,
Toke Eskildsen, State and University Library, Denmark
