[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841407#action_12841407 ]
Michael Busch commented on LUCENE-2293:
---------------------------------------
bq. Yes, I think each DW will have to record its own buffered delete
Term/Query, mapping to its docID at the time the delete arrived.
I think in the future deletes in DW could work like this:
- DW of course keeps track of a private sequence ID, which gets incremented by
the add, delete, and update calls
- DW has a getReader() call; the returned reader can search the RAM buffer
- when DW.getReader() gets called, the new reader remembers the current
seqID at the time it was opened - let's call it RAMReader.seqID; if such a
reader gets reopened, only its seqID gets updated
- we keep a growing int array of size DW's maxDoc, which replaces the
usual deletes bitset
- when DW.updateDocument() or .deleteDocument() needs to delete a doc, we do
that right away, before inverting the new doc. We can do that by running a
query with a RAMReader to find all docs that must be deleted. Instead of
flipping a bit in a bitset, for each hit we now record when it was
deleted:
{code}
// init each slot in the deletes array with NOT_DELETED
static final int NOT_DELETED = Integer.MAX_VALUE;
...
Arrays.fill(deletes, NOT_DELETED);
...
public void deleteDocument(Query q) {
  // reopen RAMReader, run query q using RAMReader
  for each hit {
    int hitDocId = ...
    // record at which seqID this doc got deleted, unless it
    // was already deleted earlier
    if (deletes[hitDocId] == NOT_DELETED) {
      deletes[hitDocId] = DW.seqID;
    }
  }
  ...
  DW.seqID++;
}
{code}
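To make that bookkeeping a bit more concrete, here is a small self-contained sketch of the writer side (the class and method names like RAMDeletes and deleteDocs are made up for illustration and are not actual DocumentsWriter code; running the delete query against the RAM buffer is left out):
{code}
import java.util.Arrays;

// Sketch only: models the shared deletes[] array plus the seqID counter.
class RAMDeletes {
  static final int NOT_DELETED = Integer.MAX_VALUE;

  private int[] deletes = new int[0];
  int seqID;  // in the real proposal, add/update calls would increment this too

  // grow deletes[] as DW's maxDoc grows; new slots start out as NOT_DELETED
  void grow(int maxDoc) {
    if (maxDoc > deletes.length) {
      int old = deletes.length;
      deletes = Arrays.copyOf(deletes, maxDoc);
      Arrays.fill(deletes, old, maxDoc, NOT_DELETED);
    }
  }

  // called with the docIDs that the delete query matched
  void deleteDocs(int... hitDocIds) {
    for (int doc : hitDocIds) {
      if (deletes[doc] == NOT_DELETED) {  // only the first delete is recorded
        deletes[doc] = seqID;
      }
    }
    seqID++;
  }

  // a RAMReader is just a snapshot of seqID; deletes[] itself is shared
  int openReaderSeqID() {
    return seqID;
  }

  // the check a RAMReader does while iterating postings
  boolean isDeleted(int doc, int readerSeqID) {
    return deletes[doc] < readerSeqID;
  }
}
{code}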
Now no matter how often you (re)open RAMReaders, they can share the deletes
array; no cloning like with the BitSet approach would be necessary.
When the RAMReader iterates posting lists, it's as simple as this to treat
deleted docs correctly. Instead of doing this in RAMTermDocs.next():
{code}
if (deletedDocsBitSet.get(doc)) {
  // skip this doc
}
{code}
we can now do:
{code}
if (deletes[doc] < ramReader.seqID) {
  // skip this doc
}
{code}
Here is an example:
1. Add 3 docs with DW.addDocument()
2. User opens ramReader_a
3. Delete doc 1
4. User opens ramReader_b
After 1: DW.seqID = 2; deletes[] = {MAX_VALUE, MAX_VALUE, MAX_VALUE}
After 2: ramReader_a.seqID = 2
After 3: DW.seqID = 3; deletes[] = {MAX_VALUE, 2, MAX_VALUE}
After 4: ramReader_b.seqID = 3
Note that both ramReader_a and ramReader_b share the same deletes[] array. Now
when ramReader_a is used to read posting lists, it will not treat doc 1 as
deleted, because (deletes[1] < ramReader_a.seqID) = (2 < 2) = false. But
ramReader_b will see it as deleted, because (deletes[1] < ramReader_b.seqID) =
(2 < 3) = true.
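As a quick sanity check, here is a tiny standalone snippet (plain arrays, nothing Lucene-specific) that reproduces exactly these numbers:
{code}
public class SharedDeletesExample {
  static final int NOT_DELETED = Integer.MAX_VALUE;

  public static void main(String[] args) {
    // state after the four steps above: doc 1 was deleted at seqID 2
    int[] deletes = {NOT_DELETED, 2, NOT_DELETED};
    int ramReaderA_seqID = 2;  // opened before the delete
    int ramReaderB_seqID = 3;  // opened after the delete

    // the per-doc check from RAMTermDocs.next()
    System.out.println("ramReader_a sees doc 1 as deleted: "
        + (deletes[1] < ramReaderA_seqID));  // prints false
    System.out.println("ramReader_b sees doc 1 as deleted: "
        + (deletes[1] < ramReaderB_seqID));  // prints true
  }
}
{code}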
What do you think about this approach for the future when we have a searchable
DW buffer?
> IndexWriter has hard limit on max concurrency
> ---------------------------------------------
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 3.1
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> to which I probably should have attached a //nocommit the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5. This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads. So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput. So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter. Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort. The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread). We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states. Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running. This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic": today, if N threads are indexing, you roughly get
> in-time-order assignment of docIDs. But with this change, all of one
> thread state would get 0..N docIDs, the next thread state would get
> N+1...M docIDs, etc. However, a single thread would still get
> monotonic assignment of docIDs.
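To illustrate the docID assignment the description above talks about, here is a minimal sketch (hypothetical names, not the actual DocumentsWriter code): docIDs stay monotonic within a thread state, but globally they end up in per-segment blocks rather than in arrival-time order.
{code}
// Sketch only, under the per-thread docID proposal quoted above.
class PerThreadDocIDs {
  static class ThreadState {
    int nextDocID;                       // private docID stream, always monotonic
    int addDocument() { return nextDocID++; }
  }

  int globalMaxDoc;                      // grows as thread states are flushed

  // flushing a thread state as its own segment assigns it a doc base;
  // private docID d in that state becomes global docID docBase + d,
  // and merging across segments is left to the merge scheduler (e.g. CMS)
  int flush(ThreadState state) {
    int docBase = globalMaxDoc;
    globalMaxDoc += state.nextDocID;
    state.nextDocID = 0;                 // the thread state starts a fresh stream
    return docBase;
  }
}
{code}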