[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841407#action_12841407 ]
Michael Busch commented on LUCENE-2293:
---------------------------------------
bq. Yes, I think each DW will have to record its own buffered delete
Term/Query, mapping to its docID at the time the delete arrived.
I think in the future deletes in DW could work like this:
- DW of course keeps track of a private sequence ID, which gets incremented by
the add, delete, and update calls
- DW has a getReader() call; the returned reader can search the RAM buffer
- when DW.getReader() gets called, the new reader remembers the current
seqID at the time it was opened - let's call it RAMReader.seqID; if such a
reader gets reopened, only its seqID gets updated
- we keep a growing int array of size DW's maxDoc, which replaces the
usual deletes bitset
- when DW.updateDocument() or .deleteDocument() needs to delete a doc, we do
that right away, before inverting the new doc. We can do that by running a
query with a RAMReader to find all docs that must be deleted. Instead of
flipping a bit in a bitset, for each hit we now record when it was
deleted:
{code}
// init each slot in the deletes array with NOT_DELETED
static final int NOT_DELETED = Integer.MAX_VALUE;
...
Arrays.fill(deletes, NOT_DELETED);
...
public void deleteDocument(Query q) {
  // reopen RAMReader, run query q using RAMReader
  for each hit {
    int hitDocId = ...
    // record at which seqID this doc got deleted, unless it
    // was already deleted earlier
    if (deletes[hitDocId] == NOT_DELETED) {
      deletes[hitDocId] = DW.seqID;
    }
  }
  ...
  DW.seqID++;
}
{code}
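To make that bookkeeping a bit more concrete, here is a small self-contained sketch of the writer side (the class and method names like RAMDeletes and deleteDocs are made up for illustration and are not actual DocumentsWriter code; running the delete query against the RAM buffer is left out):
{code}
import java.util.Arrays;

// Sketch only: models the shared deletes[] array plus the seqID counter.
class RAMDeletes {
  static final int NOT_DELETED = Integer.MAX_VALUE;

  private int[] deletes = new int[0];
  int seqID;  // in the real proposal, add/update calls would increment this too

  // grow deletes[] as DW's maxDoc grows; new slots start out as NOT_DELETED
  void grow(int maxDoc) {
    if (maxDoc > deletes.length) {
      int old = deletes.length;
      deletes = Arrays.copyOf(deletes, maxDoc);
      Arrays.fill(deletes, old, maxDoc, NOT_DELETED);
    }
  }

  // called with the docIDs that the delete query matched
  void deleteDocs(int... hitDocIds) {
    for (int doc : hitDocIds) {
      if (deletes[doc] == NOT_DELETED) {  // only the first delete is recorded
        deletes[doc] = seqID;
      }
    }
    seqID++;
  }

  // a RAMReader is just a snapshot of seqID; deletes[] itself is shared
  int openReaderSeqID() {
    return seqID;
  }

  // the check a RAMReader does while iterating postings
  boolean isDeleted(int doc, int readerSeqID) {
    return deletes[doc] < readerSeqID;
  }
}
{code}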
Now no matter how often you (re)open RAMReaders, they can share the deletes
array; no cloning like with the BitSet approach would be necessary.
When the RAMReader iterates posting lists, it's as simple as this to treat
deleted docs correctly. Instead of doing this in RAMTermDocs.next():
{code}
if (deletedDocsBitSet.get(doc)) {
  // skip this doc
}
{code}
we can now do:
{code}
if (deletes[doc] < ramReader.seqID) {
  // skip this doc
}
{code}
Here is an example:
1. Add 3 docs with DW.addDocument()
2. User opens ramReader_a
3. Delete doc 1
4. User opens ramReader_b
After 1: DW.seqID = 2; deletes[] = {MAX_VALUE, MAX_VALUE, MAX_VALUE}
After 2: ramReader_a.seqID = 2
After 3: DW.seqID = 3; deletes[] = {MAX_VALUE, 2, MAX_VALUE}
After 4: ramReader_b.seqID = 3
Note that both ramReader_a and ramReader_b share the same deletes[] array. Now
when ramReader_a is used to read posting lists, it will not treat doc 1 as
deleted, because (deletes[1] < ramReader_a.seqID) = (2 < 2) = false. But
ramReader_b will see it as deleted, because (deletes[1] < ramReader_b.seqID) =
(2 < 3) = true.
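As a quick sanity check, here is a tiny standalone snippet (plain arrays, nothing Lucene-specific) that reproduces exactly these numbers:
{code}
public class SharedDeletesExample {
  static final int NOT_DELETED = Integer.MAX_VALUE;

  public static void main(String[] args) {
    // state after the four steps above: doc 1 was deleted at seqID 2
    int[] deletes = {NOT_DELETED, 2, NOT_DELETED};
    int ramReaderA_seqID = 2;  // opened before the delete
    int ramReaderB_seqID = 3;  // opened after the delete

    // the per-doc check from RAMTermDocs.next()
    System.out.println("ramReader_a sees doc 1 as deleted: "
        + (deletes[1] < ramReaderA_seqID));  // prints false
    System.out.println("ramReader_b sees doc 1 as deleted: "
        + (deletes[1] < ramReaderB_seqID));  // prints true
  }
}
{code}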
What do you think about this approach for the future when we have a searchable
DW buffer?
> IndexWriter has hard limit on max concurrency
> ---------------------------------------------
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 3.1
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> to which I probably should have attached a //nocommit the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5. This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads. So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput. So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter. Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort. The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread). We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states. Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running. This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic": today, if N threads are indexing, you roughly get
> in-time-order assignment of docIDs. But with this change, all of one
> thread state would get 0..N docIDs, the next thread state would get
> N+1...M docIDs, etc. However, a single thread would still get
> monotonic assignment of docIDs.
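To illustrate the docID assignment the description above talks about, here is a minimal sketch (hypothetical names, not the actual DocumentsWriter code): docIDs stay monotonic within a thread state, but globally they end up in per-segment blocks rather than in arrival-time order.
{code}
// Sketch only, under the per-thread docID proposal quoted above.
class PerThreadDocIDs {
  static class ThreadState {
    int nextDocID;                       // private docID stream, always monotonic
    int addDocument() { return nextDocID++; }
  }

  int globalMaxDoc;                      // grows as thread states are flushed

  // flushing a thread state as its own segment assigns it a doc base;
  // private docID d in that state becomes global docID docBase + d,
  // and merging across segments is left to the merge scheduler (e.g. CMS)
  int flush(ThreadState state) {
    int docBase = globalMaxDoc;
    globalMaxDoc += state.nextDocID;
    state.nextDocID = 0;                 // the thread state starts a fresh stream
    return docBase;
  }
}
{code}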