Hi Erick,

I believe I've mentioned everything that is relevant :).

However, just to give more background: assume on the order of 300 million 
documents, and multiple concurrent users running searches. I may front Lucene 
with Elasticsearch, and ES essentially issues Lucene TermFilters. My filters 
are broad in nature, so you can assume that any time I filter on a tag, the 
filter can easily match millions of documents.

The only filter that uses a BitSet in Lucene works with internal document IDs. 
I would have liked this bitset approach to work against some other regular 
numeric long field so that we could scale, but that does not seem likely if the 
alternative is feeding an ArrayList of Longs to TermFilters.
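
To make that concrete, this is roughly what the ArrayList-of-Longs route looks 
like in plain Lucene 4.x. The field name is made up, and I am assuming the long 
ID was indexed as a plain string term:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queries.TermsFilter;
    import org.apache.lucene.search.Filter;

    // acceptedIds: hypothetical collection of the long IDs the filter should accept
    List<Term> terms = new ArrayList<Term>();
    for (long id : acceptedIds) {
        // "docLongId" is an illustrative field name; assumes the ID is indexed as a string term
        terms.add(new Term("docLongId", Long.toString(id)));
    }
    // Millions of accepted IDs become millions of Term objects before the
    // filter even runs, which is exactly the scaling wall described above.
    Filter filter = new TermsFilter(terms);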

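Concretely, the kind of filter I was hoping for would look something like the 
per-segment sketch below: it reads the long field through FieldCache and 
translates accepted values into a doc-ID bitset. The class and field names are 
made up, and this is an untested sketch of the idea, not a claim about an 
existing Lucene API:

    import java.io.IOException;
    import java.util.Set;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.search.BitsFilteredDocIdSet;
    import org.apache.lucene.search.DocIdSet;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.util.Bits;
    import org.apache.lucene.util.FixedBitSet;

    public class LongSetFilter extends Filter {
        private final String field;          // the regular numeric long field
        private final Set<Long> acceptedIds; // hypothetical set of accepted long IDs

        public LongSetFilter(String field, Set<Long> acceptedIds) {
            this.field = field;
            this.acceptedIds = acceptedIds;
        }

        @Override
        public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs) throws IOException {
            final int maxDoc = context.reader().maxDoc();
            // FieldCache exposes the segment's long values, indexed by internal doc ID
            final FieldCache.Longs values = FieldCache.DEFAULT.getLongs(context.reader(), field, false);
            final FixedBitSet bits = new FixedBitSet(maxDoc);
            for (int doc = 0; doc < maxDoc; doc++) {
                if (acceptedIds.contains(values.get(doc))) {
                    bits.set(doc); // still ends up keyed by internal doc ID per segment
                }
            }
            return BitsFilteredDocIdSet.wrap(bits, acceptDocs);
        }
    }
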
Hope that makes the scenario clearer. Please let me know your thoughts.
 
-----------------------
Thanks n Regards,
Sandeep Ramesh Khanzode


On Tuesday, August 12, 2014 8:41 PM, Erick Erickson <erickerick...@gmail.com> 
wrote:
 


bq: Unless I can cache these filters in memory, the cost of constructing this 
filter at query time is not practical

Why do you say that? Do you have evidence? Because lots and lots of Solr 
installations do exactly this and they run fine.
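
In plain Lucene terms, the pattern those installations rely on is just filter 
caching. A minimal sketch, using the "tag"/"Finance" example from your mail and 
a hypothetical userQuery:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.queries.TermsFilter;
    import org.apache.lucene.search.CachingWrapperFilter;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.FilteredQuery;
    import org.apache.lucene.search.Query;

    // Build the broad tag filter once...
    Filter tagFilter = new TermsFilter(new Term("tag", "Finance"));
    // ...and wrap it so the computed doc-ID set is cached per segment and
    // only recomputed for segments that have changed. That reuse is why the
    // per-query construction cost mostly disappears in practice.
    Filter cached = new CachingWrapperFilter(tagFilter);
    Query filtered = new FilteredQuery(userQuery, cached); // userQuery: your main query

(Solr's filterCache does the equivalent for fq clauses.)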

So I suspect there's something you're not telling us about your setup. Are you, 
say, soft committing often? Do you have autowarming specified? 

You're not going to be able to key your filters off some other field in the 
document. Internally, Lucene uses the internal doc ID as an index into the 
bitset. That's baked in at very low levels and isn't going to change, AFAIK.
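
To see why, here's essentially what a bitset-producing filter does for each 
segment (a sketch against the 4.x Filter API; the class name is made up):

    import java.io.IOException;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.DocsEnum;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BitsFilteredDocIdSet;
    import org.apache.lucene.search.DocIdSet;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.util.Bits;
    import org.apache.lucene.util.FixedBitSet;

    public class TagFilter extends Filter {
        private final Term tag; // e.g. new Term("tag", "Finance")

        public TagFilter(Term tag) {
            this.tag = tag;
        }

        @Override
        public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs) throws IOException {
            // One bit per document in the segment; the bit index IS the internal doc ID
            FixedBitSet bits = new FixedBitSet(context.reader().maxDoc());
            DocsEnum docs = context.reader().termDocsEnum(tag);
            if (docs != null) {
                for (int doc = docs.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = docs.nextDoc()) {
                    bits.set(doc);
                }
            }
            return BitsFilteredDocIdSet.wrap(bits, acceptDocs);
        }
    }

There's simply no slot in that contract for an application-level long ID; 
internal doc IDs are what the iterators and bitsets speak.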

Best,
Erick



On Mon, Aug 11, 2014 at 11:53 PM, Sandeep Khanzode 
<sandeep_khanz...@yahoo.com.invalid> wrote:

Hi,
> 
>The current use of BitSets in Lucene filters is limited to docIDs, i.e., I can 
>only construct a filter out of a BitSet if I have the document IDs handy.
>
>However, every update/delete, i.e., any CRUD modification, changes these IDs, 
>and I have to redo the whole process to fetch the latest docIDs.
>
>Assume a scenario where I need to tag millions of documents with tags like 
>"Finance", "IT", "Legal", etc.
>
>Unless I can cache these filters in memory, the cost of constructing this 
>filter at query time is not practical. If I could map the documents to numeric 
>long identifiers and put those in a bitmap, I could cache them, because the 
>size reduces drastically. However, I cannot use such a numeric long identifier 
>in Lucene filters because it is not a docID but just another regular field.
>
>Please help with this scenario. Thanks,
>
>-----------------------
>Thanks n Regards,
>Sandeep Ramesh Khanzode
