BitSet in Filters

2014-08-12 Thread Sandeep Khanzode
Hi,
 
The current usage of BitSets in filters in Lucene is limited to applying only 
to docIDs, i.e., I can only construct a filter out of a BitSet if I have the 
document IDs handy.

However, with every update or delete (i.e., any CRUD modification), these IDs 
change, and I have to redo the whole process to fetch the latest docIDs.

Assume a scenario where I need to tag millions of documents with tags like 
Finance, IT, Legal, etc.

Unless I can cache these filters in memory, the cost of constructing this 
filter at run time per query is not practical. If I could map the documents to 
a numeric long identifier and put those identifiers in a bitmap, I could then 
cache them, because the size reduces drastically. However, I cannot use this 
numeric long identifier in Lucene filters because it is not a docID but just 
another regular field.

Please help with this scenario. Thanks,

---
Thanks n Regards,
Sandeep Ramesh Khanzode

Re: BitSet in Filters

2014-08-12 Thread Erick Erickson
bq: Unless I can cache these filters in memory, the cost of constructing
this filter at run time per query is not practical

Why do you say that? Do you have evidence? Because lots and lots of Solr
installations do exactly this and they run fine.

So I suspect there's something you're not telling us about your setup. Are
you, say, soft committing often? Do you have autowarming specified?

You're not going to be able to keep your filters based on some other field
in the document. Internally, Lucene uses the internal doc ID as an index
into the bitset. That's baked in at very low levels and isn't going to
change, AFAIK.
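
To illustrate what "doc ID as an index into the bitset" means, here is a
minimal sketch against the Lucene 4.x API; markDoc is just a throwaway
example name:

    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.util.FixedBitSet;

    // One bitset per segment, sized to that segment's docID space.
    static FixedBitSet markDoc(AtomicReader segmentReader, int docId) {
      FixedBitSet bits = new FixedBitSet(segmentReader.maxDoc());
      bits.set(docId);     // accept a document by its segment-local docID
      return bits;         // bits.get(docId) is then a plain array lookup
    }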

Best,
Erick



Re: BitSet in Filters

2014-08-12 Thread Sandeep Khanzode
Hi Erick,

I have mentioned everything that is relevant, I believe :).

However, just to give more background: assume documents on the order of more 
than 300 million, and multiple concurrent users running searches. I may front 
Lucene with Elasticsearch, and ES basically calls Lucene TermFilters. My 
filters are broad in nature, so you can take it that any time I filter on a 
tag, it would easily run into millions of documents to be accepted by the 
filter.

The only filter in Lucene that uses a BitSet works with document IDs. I would 
have wanted this bitset approach to work on some other regular numeric long 
field so that we can scale; that does not seem likely if I have to use an 
ArrayList of Longs for TermFilters.
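
For concreteness, a rough sketch of the kind of filter I mean, assuming
Lucene 4.x's TermsFilter from the queries module; filterForTag, taggedIds,
and the docNumId field are just placeholder names:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queries.TermsFilter;
    import org.apache.lucene.search.Filter;

    // Build a filter over an indexed identifier field. With millions of
    // matching ids, this term list itself becomes the scaling problem.
    static Filter filterForTag(Iterable<String> taggedIds) {
      List<Term> terms = new ArrayList<Term>();
      for (String id : taggedIds) {
        terms.add(new Term("docNumId", id)); // "docNumId": assumed field name
      }
      return new TermsFilter(terms);
    }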

Hope that makes the scenario clearer. Please let me know your thoughts.
 
---
Thanks n Regards,
Sandeep Ramesh Khanzode



RE: BitSet in Filters

2014-08-12 Thread Uwe Schindler
Hi,

in general you cannot cache Filters, but you can cache their DocIdSets 
(CachingWrapperFilter, for example, does exactly this). Lucene queries are 
executed per segment: when you index new documents or update existing ones, 
Lucene creates new index segments. Older ones *never* change, so a DocIdSet 
(e.g., implemented by FixedBitSet) can be linked to a specific segment of the 
index that never changes. Only deletions may be added, but that's transparent 
to the filter: the deletions (given as acceptDocs to getDocIdSet) and the 
cached BitSet just need to be ANDed together (by the way, deletions in Lucene 
are just a filter, too).
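
As a small illustration of this per-segment view, a sketch against the
Lucene 4.x reader API; inspectSegments is just an example name:

    import java.io.IOException;
    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.util.Bits;

    static void inspectSegments(Directory dir) throws IOException {
      DirectoryReader reader = DirectoryReader.open(dir);
      try {
        // Each leaf is one immutable segment with its own docID space.
        for (AtomicReaderContext leaf : reader.leaves()) {
          AtomicReader segment = leaf.reader();
          Bits liveDocs = segment.getLiveDocs(); // null means no deletions
          // A DocIdSet cached for this segment stays valid for the
          // segment's lifetime; deletions are applied on top at search
          // time via these liveDocs/acceptDocs bits.
        }
      } finally {
        reader.close();
      }
    }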

Of course, after a while Lucene merges segments using its MergePolicy, because 
otherwise there would be too many of them. In that case several smaller 
segments (preferably those with many deletions) get merged into larger ones by 
the indexer. This is the only case in which some *new* DocIdSets need to be 
created. Large segments are unlikely to be merged unless they have many 
deletions (caused by updates, which internally delete the old version of a 
document, or by explicit deletions). This approach is used by Solr and 
Elasticsearch; CachingWrapperFilter is an example of how to do this in your 
own code.

To implement this:
- Don't cache a bitset for the whole index; that would indeed force you to 
recalculate the bitsets over and over.
- In YourFilter.getDocIdSet(), look up whether the coreCacheKey of the given 
AtomicReaderContext.reader() is already in your cache. If yes, reuse the 
cached DocIdSet; deletions are not relevant, you just have to apply them via 
BitsFilteredDocIdSet.wrap(cachedDocIdSet, acceptDocs). If it's not in the 
cache, recalculate the bitset for the given AtomicReaderContext (not the whole 
index), cache it, and return it as a DocIdSet instance. (See the sketch 
below.)
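
Here is a minimal sketch of such a filter, assuming the Lucene 4.x Filter
API; CachedTagFilter and computeDocIdSet are just example names, and the
actual per-segment bitset computation is left to you:

    import java.io.IOException;
    import java.util.Collections;
    import java.util.Map;
    import java.util.WeakHashMap;
    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.search.BitsFilteredDocIdSet;
    import org.apache.lucene.search.DocIdSet;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.util.Bits;
    import org.apache.lucene.util.FixedBitSet;

    public class CachedTagFilter extends Filter {

      // Keyed by the segment's core cache key; the WeakHashMap drops an
      // entry automatically once the segment core is closed (e.g., after
      // it has been merged away).
      private final Map<Object, DocIdSet> cache =
          Collections.synchronizedMap(new WeakHashMap<Object, DocIdSet>());

      @Override
      public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs)
          throws IOException {
        AtomicReader reader = context.reader();
        Object key = reader.getCoreCacheKey(); // stable per segment core
        DocIdSet cached = cache.get(key);
        if (cached == null) {
          cached = computeDocIdSet(reader);    // per segment, not whole index
          cache.put(key, cached);
        }
        // Apply the current deletions on top of the cached, immutable set.
        return BitsFilteredDocIdSet.wrap(cached, acceptDocs);
      }

      // Example helper: build the bitset for one segment, e.g. by walking
      // the postings of your tag field and setting each matching docID.
      private DocIdSet computeDocIdSet(AtomicReader reader) throws IOException {
        FixedBitSet bits = new FixedBitSet(reader.maxDoc());
        // ... bits.set(docId) for every matching document in this segment ...
        return bits;
      }
    }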

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org