IndexWriter.deleteDocuments(Query query)

2009-03-31 Thread John Wang
Hi guys: IndexWriter.deleteDocuments(Query query) api is not really making sense to me. Wouldn't IndexWriter.deleteDocuments(DocIdSet set) be better? Since we don't really care about scoring for this call. Also, can we expose IndexWriter.deleteDocuments(int[] docids)? Using the c

Re: IndexWriter.deleteDocuments(Query query)

2009-03-31 Thread Yonik Seeley
On Tue, Mar 31, 2009 at 3:41 PM, John Wang wrote: > Also, can we expose  IndexWriter.deleteDocuments(int[] docids)? Exposing internal ids from the IndexWriter may not be a good idea given that they are transient. -Yonik http://www.lucidimagination.com --

Re: IndexWriter.deleteDocuments(Query query)

2009-03-31 Thread John Wang
I fail to see the difference of exposing the api to allow for a Query instance to be passed in vs a DocIdSet. In this specific case, Query is essentially a factory to produce a DocIdSetIterator (or Scorer). Isn't that what DocIdSet is? Thanks -John On Tue, Mar 31, 2009 at 12:57 PM, Yonik Seeley wrot

Re: IndexWriter.deleteDocuments(Query query)

2009-03-31 Thread Yonik Seeley
On Tue, Mar 31, 2009 at 4:58 PM, John Wang wrote: > I fail to see the difference of exposing the api to allow for a Query > instance to be passed in vs a DocIdSet. I was commenting specifically on your idea to allow deletion by int[] (docids) on the IndexWriter. DocIdSet is a different issue - i

Re: IndexWriter.deleteDocuments(Query query)

2009-03-31 Thread John Wang
So do you think it is a good addition/change to the current api now? -John On Tue, Mar 31, 2009 at 2:18 PM, Yonik Seeley wrote: > On Tue, Mar 31, 2009 at 4:58 PM, John Wang wrote: > > I fail to see the difference of exposing the api to allow for a Query > > instance to be passed in vs a DocIdSe

Re: IndexWriter.deleteDocuments(Query query)

2009-04-01 Thread Michael McCandless
John, I think this has the same problem as exposing delete by docID, ie, how would you produce that docIdSet? We could consider delete by Filter instead, since that exposes the necessary getDocIdSet(IndexReader) method. Or, with near real-time search, we could enhance it to allow deletions via t

Re: IndexWriter.deleteDocuments(Query query)

2009-04-01 Thread John Wang
Hi Michael: Let me first share what I am doing w.r.t. deleting by docid: I have a customized index reader that stores a mapping of docid -> uid in the payload (something Michael Bush and Ning Li suggested a while back). That mapping is loaded at IndexReader load time and is shared by searche

Re: IndexWriter.deleteDocuments(Query query)

2009-04-01 Thread Yonik Seeley
On Wed, Apr 1, 2009 at 4:02 AM, Michael McCandless wrote: > I think this has the same problem as exposing delete by docID, ie, how > would you produce that docIdSet? Whoops, right. I was going by memory that there was a get(IndexReader) type method there... but that's on Filter of course. -Yon

Re: IndexWriter.deleteDocuments(Query query)

2009-04-01 Thread Michael McCandless
> For me at least, IndexWriter.deleteDocument(int) would be useful. I completely agree: delete-by-docID in IndexWriter would be a great feature. Long ago I became convinced of that. Where this feature always gets stuck (search the lists -- it's gotten stuck a lot) is how to implement it? At any

Re: IndexWriter.deleteDocuments(Query query)

2009-04-01 Thread Jason Rutherglen
John, We looked at implementing delete by doc id for LUCENE-1516, however it seemed to be something that, if enough people wanted it, we could implement as a later patch. The implementation involves maintaining a genealogy of SegmentReaders within IndexWriter so that deletes to a reader that has

Re: IndexWriter.deleteDocuments(Query query)

2009-04-01 Thread John Wang
Thanks Michael for the info. I do guarantee there are no modifications between when "MySpecialIndexReader" is loaded and when I iterate and find the deleted docids. I was, however, not aware that docids move when IndexWriter is opened. I thought only when docs are added and when it is committed.

Re: IndexWriter.deleteDocuments(Query query)

2009-04-01 Thread Michael McCandless
On Wed, Apr 1, 2009 at 2:04 PM, John Wang wrote: > My test is essentially this. I took out the reader.deleteDocuments call from > both scenarios. I took an index of 5m docs. a batch of 1 randomly > generated uids. > > Compared the following scenarios: > 1) > * open index reader > * for each uid i

Re: IndexWriter.deleteDocuments(Query query)

2009-04-01 Thread John Wang
Hi Michael: 1) Yes, we use TermDocs, exactly what IndexWriter.deleteDocuments(Term) is doing under the cover. 2) We iterate the docid->uid mapping, for each docid, get the corresponding uid and check that to see if that is in the deleted set. If so, add the docid to the list. There is no ui
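
The scan described in step 2 can be sketched in plain Java with no Lucene dependency. This is only an illustration under assumptions: the `uidByDocid` array stands in for the docid -> uid mapping loaded from payloads, uids are assumed to be ints, and the class and method names are hypothetical rather than taken from the code discussed in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the docid -> uid scan; not the original code.
class UidScanSketch {

    // uidByDocid[docid] holds the uid stored for that docid
    // (assumed to have been loaded once when the reader was opened).
    static List<Integer> findDocidsToDelete(int[] uidByDocid, Set<Integer> deletedUids) {
        List<Integer> docids = new ArrayList<>();
        for (int docid = 0; docid < uidByDocid.length; docid++) {
            // One membership check per document in the index.
            if (deletedUids.contains(uidByDocid[docid])) {
                docids.add(docid);
            }
        }
        return docids;
    }

    public static void main(String[] args) {
        int[] uidByDocid = {105, 42, 77, 9};
        Set<Integer> deleted = new HashSet<>(Arrays.asList(42, 9));
        System.out.println(findDocidsToDelete(uidByDocid, deleted));
    }
}
```

In this sketch the loop visits every docid once, so its cost grows with the size of the index rather than with the number of uids in the batch.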

Re: IndexWriter.deleteDocuments(Query query)

2009-04-01 Thread Michael McCandless
On Wed, Apr 1, 2009 at 5:22 PM, John Wang wrote: > Hi Michael: > >    1) Yes, we use TermDocs, exactly what IndexWriter.deleteDocuments(Term) > is doing under the cover. This part I understand :) >    2) We iterate the docid->uid mapping, for each docid, get the > corresponding uid and check that

Re: IndexWriter.deleteDocuments(Query query)

2009-04-01 Thread John Wang
a code snippet is worth 1000 words :)

private static final Term UID_TERM = new Term("uid_payload", "_UID");

private static class SinglePayloadTokenStream extends TokenStream {
    private Token token = new Token(UID_TERM.text(), 0, 0);
    private byte[] buffer = new byte[4];
    private boolean

Re: IndexWriter.deleteDocuments(Query query)

2009-04-02 Thread Michael McCandless
On Wed, Apr 1, 2009 at 6:37 PM, John Wang wrote: > a code snippet is worth 1000 words :) Hear, hear! OK, now I understand the difference. With approach 1, for each of N UIDs you use a TermDocs to find the postings for that UID, and retrieve the one docID corresponding to that UID. You retrieve

Re: IndexWriter.deleteDocuments(Query query)

2009-04-02 Thread John Wang
Hi Michael: Thanks for looking into this. Approach 2 has a dependency on how fast the delete set performs a check on a given id, approach one doesn't. After replacing my delete set with a simple bitset, approach 2 gets a 25-30% improvement. I understand if the delete set is small, appr
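
One possible reading of "a simple bitset" is a `java.util.BitSet` indexed by uid. The sketch below assumes uids are small non-negative ints, which the thread does not state; the class and method names are hypothetical and this is not the code that was benchmarked.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical illustration of a bitset-backed delete set; not the original code.
class BitSetDeleteSetSketch {

    // Build the delete set: bit i is set when uid i should be deleted.
    // Assumes every uid is a non-negative int.
    static BitSet toBitSet(int[] deletedUids) {
        BitSet bits = new BitSet();
        for (int uid : deletedUids) {
            bits.set(uid);
        }
        return bits;
    }

    // Scan the docid -> uid mapping and collect the docids whose uid is marked.
    static int[] findDocidsToDelete(int[] uidByDocid, BitSet deletedUids) {
        int[] found = new int[uidByDocid.length];
        int count = 0;
        for (int docid = 0; docid < uidByDocid.length; docid++) {
            // BitSet.get is a bit test, with no hashing or boxing involved.
            if (deletedUids.get(uidByDocid[docid])) {
                found[count++] = docid;
            }
        }
        return Arrays.copyOf(found, count);
    }

    public static void main(String[] args) {
        int[] uidByDocid = {105, 42, 77, 9};
        BitSet deleted = toBitSet(new int[] {42, 9});
        System.out.println(Arrays.toString(findDocidsToDelete(uidByDocid, deleted)));
    }
}
```

The scan still touches every docid; only the per-document membership check changes, which is the part of approach 2 that depends on the delete set implementation.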

Re: IndexWriter.deleteDocuments(Query query)

2009-04-02 Thread Michael McCandless
On Thu, Apr 2, 2009 at 2:26 PM, John Wang wrote: > Hi Michael: >    Thanks for looking into this. > >    Approach 2 has a dependency on how fast the delete set performs a check > on a given id, approach one doesn't. After replacing my delete set with a > simple bitset, approach 2 gets a 25-30% imp