On Mon, 2013-04-15 at 10:25 +0200, John Nielsen wrote:

> The FieldCache is the big culprit. We do a huge amount of faceting so
> it seems right.

Yes, you wrote that earlier. The mystery is that the math does not check
out with the description you have given us.

> Unfortunately I am super swamped at work so I have precious little
> time to work on this, which is what explains my silence.

No problem, we've all been there.
> 
[Band aid: More memory]

> The extra memory helped a lot, but it still OOM with about 180 clients
> using it.

You stated earlier that you had a "solr cluster" and that your total(?)
index size was 35GB, with each "register" being between "15k" and "30k".
I am using the quotes to signify that it is unclear what you mean. Is your
cluster multiple machines (I'm guessing no), multiple Solr instances,
cores, shards, or maybe just a single instance prepared for later
distribution? Is a register a core, a shard or simply a logical part
(one client's data) of the index?

If each client has their own core or shard, that would mean that each
client uses more than 25GB/180 ~= 142MB of heap to access 35GB/180
~= 200MB of index. That sounds quite high and you would need a very
heavy facet to reach that.

If you could grep "UnInverted" from the Solr log file and paste the
entries here, that would help to clarify things.


Another explanation for the large amount of memory presents itself if
you use a single index: If each of your clients facets on at least one
field specific to the client ("client123_persons" or something like
that), then your memory usage goes through the roof.

Assuming an index with 10M documents, each with 5 references to a modest
10K unique values in a facet field, the simplified formula
  #documents*log2(#references) + #references*log2(#unique_values) bits
tells us that this takes at least 110MB with field cache based faceting.

180 clients @ 110MB ~= 20GB. As that is a theoretical low, we can at
least double that. This fits neatly with your new heap of 64GB.
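
For what it is worth, here is a back-of-the-envelope sketch of that
estimate in Java (the numbers are the assumed ones from above, not
measurements from your setup):

  public class FacetMemoryEstimate {
      public static void main(String[] args) {
          long documents = 10000000L;       // assumed index size
          long references = documents * 5;  // 5 references per document
          long uniqueValues = 10000L;       // assumed unique facet values

          double bits = documents * log2(references)
                      + references * log2(uniqueValues);
          double mbPerField = bits / 8 / 1024 / 1024;

          System.out.printf("~%.0f MB per client-specific facet field%n",
                            mbPerField);
          System.out.printf("~%.1f GB for 180 such fields (clients)%n",
                            180 * mbPerField / 1024);
      }

      private static double log2(double x) {
          return Math.log(x) / Math.log(2);
      }
  }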


If my guessing is correct, you can solve your memory problems very
easily by sharing _all_ the facet fields between your clients.
This should bring your memory usage down to a few GB.

You are probably already restricting their searches to their own data by
filtering, so this should not influence the returned facet values and
counts, as compared to separate fields.
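
In SolrJ that would look something like the sketch below; the field
names "client_id" and "persons" are just placeholders for whatever you
actually use:

  import org.apache.solr.client.solrj.SolrQuery;

  public class SharedFacetQuery {
      public static void main(String[] args) {
          SolrQuery query = new SolrQuery("*:*");
          // Restrict the search to one client's documents...
          query.addFilterQuery("client_id:123");
          // ...but facet on a field shared by all clients, so only one
          // field cache entry is ever built for it.
          query.setFacet(true);
          query.addFacetField("persons");
          System.out.println(query); // prints the generated request parameters
      }
  }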

This is very similar to the thread "Facets with 5000 facet fields" BTW.

> Today I finally managed to set up a test core so I can begin to play
> around with docValues.

If you are using a single index with the individual-facet-fields-for-
each-client approach, DocValues will also have scaling issues, as the
number of values (of which the majority will be null) will be
  #clients*#documents*#facet_fields
This means that adding a new client will be progressively more
expensive.
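
To make the growth concrete, a small sketch (again with assumed
numbers: 10M documents and one client-specific facet field per client):

  public class DocValuesScaling {
      public static void main(String[] args) {
          long documents = 10000000L;  // assumed index size
          long fieldsPerClient = 1L;   // assumed one facet field per client
          for (long clients : new long[] {10, 50, 180}) {
              long valueSlots = clients * documents * fieldsPerClient;
              System.out.printf("%d clients -> %,d value slots (mostly null)%n",
                                clients, valueSlots);
          }
      }
  }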

On the other hand, if you use a lot of small shards, DocValues should
work for you.

Regards,
Toke Eskildsen

