Erick,

I admit this is a bit of premature optimization. *There will be no updates to
my index, so no worries about ageing out or garbage collection.*
Let me check my understanding: when we talk about the filterCache, it just
stores the document IDs of the matching documents in the cache, right?
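
Conceptually, I picture each filterCache entry as a mapping from the filter
query to the set of matching internal doc IDs, something like the sketch
below. This is just my mental model (made-up class names, not Solr's actual
implementation), and it also matches what you describe to Toke below: small
result sets kept as a plain list of doc IDs, larger ones as a bitset.

// Conceptual sketch only -- invented names, not Solr's real classes.
// It just illustrates "filter query -> set of matching doc IDs".
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class FilterCacheSketch {

    /** A cached filter result: either an explicit doc-ID list (small sets)
     *  or a bitset with one bit per document in the core (large sets). */
    interface DocSet {
        boolean contains(int docId);
    }

    /** Small sets: a few bytes per matching document. */
    static final class SmallDocSet implements DocSet {
        private final int[] sortedDocIds;
        SmallDocSet(int[] sortedDocIds) { this.sortedDocIds = sortedDocIds; }
        public boolean contains(int docId) {
            return java.util.Arrays.binarySearch(sortedDocIds, docId) >= 0;
        }
    }

    /** Large sets: roughly maxDoc/8 bytes, regardless of how many docs match. */
    static final class BitSetDocSet implements DocSet {
        private final BitSet bits;
        BitSetDocSet(BitSet bits) { this.bits = bits; }
        public boolean contains(int docId) { return bits.get(docId); }
    }

    // The cache itself: filter query string -> matching doc IDs.
    private final Map<String, DocSet> cache = new HashMap<>();

    void put(String fq, DocSet docs) { cache.put(fq, docs); }
    DocSet get(String fq) { return cache.get(fq); }
}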

My setup is as follows. There are 16 nodes in my SolrCloud cluster, each with
64 GB of RAM, of which I am allocating 45 GB to Solr. I have a collection
(say Products, containing around 100 million docs) that I created with 64
shards, replication factor 2, and 8 shards per node. Each shard gets around
1.6 million documents. So my math for the filterCache for a specific filter
is:


   - an average filter query string will be about 20 bytes, so 1,000 (the
   distinct number of states) x 20 bytes = ~20 KB for the cache keys
   - assuming the union of DocIds across all the values of a given filter
   equals the total number of DocIds in the index, and each Solr core holds
   around 1.6 million documents: 1,600,000 x 8 bytes (per DocId) = 12.8 MB
   per core
   - with 8 Solr cores per node: 8 x 12.8 MB = *~102.4 MB*

This is the cache size for a single filter field on a single node. Given the
heap size I have allocated, I don't think this should be an issue.
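
To make the arithmetic above explicit, here is a quick back-of-the-envelope
calculation using the same numbers (the 20 bytes per cached query string and
8 bytes per DocId are my assumptions, not measured values):

// Rough filterCache sizing with the assumptions from this thread.
public class FilterCacheSizing {
    public static void main(String[] args) {
        long distinctStates   = 1_000;       // distinct values of the state field
        long bytesPerQueryKey = 20;          // assumed size of one cached fq string
        long docsPerCore      = 1_600_000;   // ~1.6 million docs per shard/core
        long bytesPerDocId    = 8;           // assumed cost per cached DocId
        long coresPerNode     = 8;           // 8 shards hosted on each node

        long keyBytes     = distinctStates * bytesPerQueryKey;     // ~20 KB
        long docIdBytes   = docsPerCore * bytesPerDocId;           // ~12.8 MB per core
        long perNodeBytes = coresPerNode * docIdBytes + keyBytes;  // ~102.4 MB per node

        System.out.printf("keys: ~%.0f KB, per core: ~%.1f MB, per node: ~%.1f MB%n",
                keyBytes / 1e3, docIdBytes / 1e6, perNodeBytes / 1e6);

        // Note: a bitset representation costs ~maxDoc/8 bytes per cached entry
        // (as Toke describes below), i.e. about 1,600,000 / 8 = ~200 KB per
        // entry in a 1.6 M-doc core.
    }
}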

Thanks,
Manohar

On Fri, Dec 26, 2014 at 10:56 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Manohar:
>
> Please approach this cautiously. You state that you have "hundreds of
> states".
> Every 100 states will use roughly 1.2G of your filter cache. Just for this
> field. Plus it'll fill up the cache and they may soon be aged out anyway.
> Can you really afford the space? Is it really a problem that needs to be
> solved at this point? This _really_ sounds like premature optimization
> to me as you haven't
> demonstrated that there's an actual problem you're solving.
>
> OTOH, of course, if you're experimenting to better understand all the
> ins and outs
> of the process that's another thing entirely ;)....
>
> Toke:
>
> I don't know the complete algorithm, but if the number of docs that
> satisfy the fq is "small enough",
> then just the internal Lucene doc IDs are stored rather than a bitset.
> What exactly "small enough" is
> I don't know off the top of my head. And I've got to assume looking
> stuff up in a list is slower than
> indexing into a bitset so I suspect "small enough" is very small....
>
> On Fri, Dec 26, 2014 at 3:00 AM, Manohar Sripada <manohar...@gmail.com>
> wrote:
> > Thanks Toke for the explanation, I will experiment with
> > f.state.facet.method=enum
> >
> > Thanks,
> > Manohar
> >
> > On Fri, Dec 26, 2014 at 4:09 PM, Toke Eskildsen <t...@statsbiblioteket.dk>
> > wrote:
> >
> >> Manohar Sripada [manohar...@gmail.com] wrote:
> >> > I have 100 million documents in my index. The maxDoc here is the
> maximum
> >> > Documents in each shard, right? How is it determined that each entry
> will
> >> > occupy maxDoc/8 approximately.
> >>
> >> Assuming that it is random whether a document is part of the result set
> or
> >> not, the most efficient representation is 1 bit/doc (this is often
> called a
> >> bitmap or bitset). So the total number of bits will be maxDoc, which is
> the
> >> same as maxDoc/8 bytes.
> >>
> >> Of course, result sets are rarely random, so it is possible to have
> other
> >> and more compact representations. I do not know how that plays out in
> >> Lucene. Hopefully somebody else can help here.
> >>
> >> > If I have to add facet.method=enum every time in the query, how
> should I
> >> > specify for each field separately?
> >>
> >> f.state.facet.method=enum
> >>
> >> See https://wiki.apache.org/solr/SimpleFacetParameters#Parameters
> >>
> >> - Toke Eskildsen
> >>
>
