On Thu, 2013-04-18 at 11:59 +0200, John Nielsen wrote:
> Yes, that's right. No search from any given client ever returns
> anything from another client.

Great. That makes the 1 core/client solution feasible.

[No sort & facet warmup is performed]

[Suggestion 1: Reduce the number of sort fields by mapping]

[Suggestion 3: 1 core/customer]

> If I understand the fieldCache mechanism correctly (which I can see
> that I don't), the data used for faceting and sorting is saved in the
> fieldCache using a key comprised of the fields used for said
> faceting/sorting. That data only contains the data which is actually
> used for the operation. This is what the fq queries are for.
> 
You are missing an essential part: Both the facet and the sort
structures need to hold one reference for each document
_in_the_full_index_, even when the document does not have any values in
the fields.

It might help to visualize the structures as arrays of values with docID
as index: String[] myValues = new String[1400000] takes up 1.4M * 32 bit
(or more on a 64-bit machine) = 5.6MB, even when every entry is empty.

Note: Neither String-objects, nor Java references are used for the real
facet- and sort-structures, but the principle is quite the same.
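
As a rough illustration of the array picture above, here is a minimal
sketch. The class and method names are made up for this mail and are not
part of Solr or Lucene, the 4 bytes per reference is an assumption
(32-bit or compressed references), and the array header is ignored.

```java
class DenseRefArray {
    // Hypothetical helper: bytes used by the slots of an array that has
    // one reference per document, whether or not the slot holds a value.
    static long referenceArrayBytes(long maxDoc, int bytesPerReference) {
        return maxDoc * bytesPerReference;
    }

    public static void main(String[] args) {
        int maxDoc = 1_400_000;                  // documents in the full index
        String[] myValues = new String[maxDoc];  // every slot starts out as null
        myValues[42] = "only value";             // one document has a value

        // The array length is fixed by the index size, not by the values.
        System.out.println(myValues.length);                 // 1400000
        System.out.println(referenceArrayBytes(maxDoc, 4));  // 5600000
    }
}
```

The point is only that the cost is paid per document in the index, not
per document that has a value.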

> So if I generate a core for each client, I would have a client
> specific fieldCache containing the data from that client. Wouldn't I
> just split up the same data into several cores?

The same terms, yes, but not the same references.

Let's say your customer has 10K documents in the index and that there
are 100 unique values, each 10 bytes long, in each group.

As each group holds its own separate structure, we use the old formula
to get the memory overhead:

#documents*log2(#unique_terms*average_term_length) +
#unique_terms*average_term_length

1.4M*log2(100*(10*8)) + 100*(10*8) bit = 1.2MB + 1KB.

Note how the values themselves are just 1KB, while the nearly empty
reference list takes 1.2MB.


Compare this to a dedicated core with just the 10K documents:
10K*log2(100*(10*8)) + 100*(10*8) bit = 8.5KB + 1KB.

The terms take up exactly the same space, but the heap requirement for
the references is reduced by 99%.
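
For anyone who wants to plug in their own numbers, here is a minimal
sketch of the arithmetic above. The class and method names are made up
for illustration and are not part of Solr or Lucene. It evaluates the
reference term both as the formula is written (log2 of the total term
size in bits, 13 bits per reference when rounded up) and with log2 of
the number of unique terms (7 bits per reference). The rounded 1.2MB and
8.5KB figures above appear to correspond to the latter. Either way the
relative saving is the same, since the reference list scales with the
number of documents in the index.

```java
class FieldCacheEstimate {
    // Whole bits needed to address `distinctTargets` positions: ceil(log2(n)).
    static int bitsPerReference(long distinctTargets) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(distinctTargets - 1));
    }

    // Bytes for one packed reference per document in the index.
    static long referenceBytes(long docs, int bitsPerRef) {
        return (docs * bitsPerRef + 7) / 8;
    }

    // Bytes for the term data itself.
    static long termBytes(long uniqueTerms, long avgTermLengthBytes) {
        return uniqueTerms * avgTermLengthBytes;
    }

    public static void main(String[] args) {
        long uniqueTerms = 100, avgLen = 10;
        long sharedDocs = 1_400_000, dedicatedDocs = 10_000;

        int asWritten = bitsPerReference(uniqueTerms * avgLen * 8); // 13
        int perTerm = bitsPerReference(uniqueTerms);                // 7

        System.out.println(termBytes(uniqueTerms, avgLen));            // 1000
        System.out.println(referenceBytes(sharedDocs, asWritten));     // 2275000
        System.out.println(referenceBytes(dedicatedDocs, asWritten));  // 16250
        System.out.println(referenceBytes(sharedDocs, perTerm));       // 1225000
        System.out.println(referenceBytes(dedicatedDocs, perTerm));    // 8750
    }
}
```

In both variants the dedicated core needs 10K/1.4M, or about 0.7%, of the
reference space of the shared index.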

Now, 25GB for 180 clients means about 140MB/client with your current
setup. I do not know the memory overhead of running a core, but since
Solr can run fine with 32MB for small indexes, the per-core overhead
should be smaller than that. You will of course have to experiment and
measure.


- Toke Eskildsen, State and University Library, Denmark

