[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845726#action_12845726
]

Michael Busch commented on LUCENE-2312:
---

{quote}
A quick and easy way to solve this is to use a read write lock
on the byte pool?
{quote}

If you use an RW lock, the writer thread will block all reader threads while
it's making changes. In a real-time search environment the writer thread will be
making changes all the time, so I'm sure the contention will kill performance.
An RW lock is only faster than a mutual exclusion lock if writes are infrequent,
as mentioned in the javadocs of ReadWriteLock.java.
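
For reference, the RW-lock variant being argued against might look roughly like the sketch below. LockedBytePool and its methods are made-up names for illustration, not Lucene classes. The point is only that every append takes the exclusive write lock, so a writer that appends constantly keeps shutting out all readers.

```java
import java.util.Arrays;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical byte pool guarded by a read-write lock (not a Lucene class).
class LockedBytePool {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private byte[] bytes = new byte[16];
    private int length = 0;

    // Writer path: holds the exclusive write lock, so all readers wait.
    void append(byte b) {
        lock.writeLock().lock();
        try {
            if (length == bytes.length) {
                bytes = Arrays.copyOf(bytes, bytes.length * 2);
            }
            bytes[length++] = b;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Reader path: readers share the read lock with each other,
    // but block whenever a writer holds the write lock.
    byte get(int index) {
        lock.readLock().lock();
        try {
            if (index < 0 || index >= length) {
                throw new IndexOutOfBoundsException("index " + index);
            }
            return bytes[index];
        } finally {
            lock.readLock().unlock();
        }
    }

    int size() {
        lock.readLock().lock();
        try {
            return length;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

With one indexing thread calling append for every posting byte, each get from a search thread has to wait for a gap between write-lock acquisitions, which is the contention described above.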

Search on IndexWriter's RAM Buffer
--

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845729#action_12845729
]

Jason Rutherglen commented on LUCENE-2312:
--

{quote}but my goal here is to implement a non-blocking and
lock-free algorithm. So my idea was to make use of a very
subtle behavior of volatile variables. {quote}

You're talking about having a per-thread write buffer byte
array that, on search, gets copied into a read-only array, or
gets magically transformed into a volatile byte array? (Do
volatile byte arrays work? I couldn't find a clear answer on the
net; maybe it's stated in the Goetz book.) If volatile byte
arrays do work, an option to test would be a byte buffer pool
that uses volatile byte arrays?


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845731#action_12845731
 ] 

Michael Busch commented on LUCENE-2312:
---

{quote}
Do volatile byte arrays work
{quote}

I'm not sure what you mean by volatile byte arrays?

 Search on IndexWriter's RAM Buffer
 --

 Key: LUCENE-2312
 URL: https://issues.apache.org/jira/browse/LUCENE-2312
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: 3.0.1
Reporter: Jason Rutherglen
Assignee: Michael Busch
 Fix For: 3.1


 In order to offer users near realtime search, without incurring
 an indexing performance penalty, we can implement search on
 IndexWriter's RAM buffer. This is the buffer that is filled in
 RAM as documents are indexed. Currently the RAM buffer is
 flushed to the underlying directory (usually disk) before being
 made searchable.
 Today's Lucene-based NRT systems must incur the cost of merging
 segments, which can slow indexing.
 Michael Busch has good suggestions regarding how to handle deletes using max
 doc ids.
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
 The area that isn't fully fleshed out is the terms dictionary,
 which needs to be sorted prior to queries executing. Currently
 IW implements a specialized hash table. Michael B has a
 suggestion here:
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845735#action_12845735
]

Jason Rutherglen commented on LUCENE-2312:
--

{quote}This makes the reference to the array volatile, not the
slots in the array{quote}

That's no good! :)
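
To make the quoted point concrete: declaring an array field volatile only affects reads and writes of the field itself, not of its elements. If per-slot visibility is wanted, the JDK's AtomicIntegerArray gives volatile semantics to each element. A small sketch follows; the class and member names are invented for illustration, and ints are used because java.util.concurrent.atomic has no byte-array counterpart.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical holder contrasting the two kinds of array access.
class VolatileSlots {
    // Only the reference is volatile: assigning a new array to 'block' is
    // published to other threads, but 'block[i] = x' is a plain element write.
    volatile byte[] block = new byte[8];

    // Each get/set on an AtomicIntegerArray element has volatile semantics.
    final AtomicIntegerArray slots = new AtomicIntegerArray(8);

    // Plain element write: no visibility guarantee for other threads.
    void writePlain(int i, byte v) {
        block[i] = v;
    }

    // Volatile element write.
    void writeVolatile(int i, int v) {
        slots.set(i, v);
    }

    // Volatile element read.
    int readVolatile(int i) {
        return slots.get(i);
    }
}
```

Paying for a volatile access on every element is likely too expensive for posting data, which may be why the discussion turns to publishing whole blocks instead.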

{quote}If you use a RW lock then the writer thread will block
all reader threads while it's making changes{quote}

We probably need to implement more fine-grained locking, perhaps
using volatile booleans instead of RW locks. Fine-grained
meaning on the byte array/block level. I think this would imply
that changes are not visible until a given byte block is more or
less flushed? This is different from the design that's been
implied, where we'd read from byte arrays as they're being
written to. We probably don't need to read from and write to the
same byte array concurrently (that might not be feasible?).

The performance win here is probably going to be the fact that
we avoid segment merges.


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845745#action_12845745
]

Michael Busch commented on LUCENE-2312:
---

The tricky part is to make sure that a reader always sees a consistent snapshot
of the index. At the same time a reader must not follow pointers to
non-published locations (e.g. array blocks).

I think I have a lock-free solution working, which only syncs at certain
intervals so as not to prevent JVM optimizations - but I need more time to think
through all the combinations and corner cases.

It's getting late now - need to sleep!


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-16 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845777#action_12845777
]

Michael McCandless commented on LUCENE-2312:

bq. The tricky part is to make sure that a reader always sees a consistent
snapshot of the index. At the same time a reader must not follow pointers to
non-published locations (e.g. array blocks).

Right, I'm just not familiar specifically with what JMM says about one thread
writing to a byte[] and another thread reading it.

In general, for our usage, the reader threads will never read into an area that
has not yet been written to. So that works in our favor (they can't cache
those bytes if they didn't read them). EXCEPT the CPU will have loaded the
bytes on a word boundary and so if our reader thread reads only 1 byte, and no
more (because this is now the end of the posting), the CPU may very well have
pulled in the following 7 bytes (for example) and then illegally (according to
our needs) cache them.

We'd better write some serious tests for this... including reader threads that
just enum the postings for a single rarish term over and over while writer
threads are indexing docs that occasionally have that term. I think that's the
worst case for JMM violation since the #bytes cached is small.

It's too bad there isn't higher-level control over CPU caching from Java. E.g.,
in our usage it would be enough to call something like a System.flushCPUCache
whenever a thread enters a newly reopened reader: when accessing postings via a
given reader we want point-in-time searching anyway, so any bytes cached by the
CPU are perfectly fine. We only need a CPU cache flush when a reader is reopened.
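
One way the JMM can give this kind of guarantee without any cache-flush call is to piggyback on a volatile field. The writer fills the byte[] with plain writes and then stores the new limit into a volatile int. A reader that loads that limit and only reads bytes below it is covered by the happens-before rule for volatiles. The sketch below assumes a single writer thread and an append-only, fixed-size block. PublishedBlock is a hypothetical name, and this is not necessarily the scheme Michael B. has in mind.

```java
// Hypothetical single-writer, append-only block; readers may run concurrently.
class PublishedBlock {
    private final byte[] bytes;
    private volatile int limit = 0; // number of bytes that are safe to read
    private int writePos = 0;       // touched by the writer thread only

    PublishedBlock(int capacity) {
        bytes = new byte[capacity];
    }

    // Writer: plain write, not yet guaranteed visible to readers.
    void write(byte b) {
        bytes[writePos++] = b;
    }

    // Writer: the volatile store publishes every byte written before it.
    void publish() {
        limit = writePos;
    }

    // Reader: the volatile load makes bytes[0..limit) visible to this thread.
    int snapshotLimit() {
        return limit;
    }

    // Reader: only reads below a limit it obtained earlier via snapshotLimit().
    byte read(int index, int snapshotLimit) {
        if (index < 0 || index >= snapshotLimit) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return bytes[index];
    }
}
```

A reader would call snapshotLimit() once when it is (re)opened and use that value for its whole lifetime, which matches the point-in-time behaviour described above.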


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-16 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845778#action_12845778
]

Michael McCandless commented on LUCENE-2312:

{quote}
The prototype I'm experimenting with has a fixed length postings format for the
in-memory representation (in TermsHash). Basically every posting has 4 bytes,
so I can use int[] arrays (instead of the byte[] pools). The first 3 bytes are
used for an absolute docID (not delta-encoded). This limits the max in-memory
segment size to 2^24 docs. The 1 remaining byte is used for the position. With
a max doc length of 140 characters you can fit every possible position in a
byte - what a luxury! If a term occurs multiple times in the same doc, then
the TermDocs just skips multiple occurrences with the same docID and increments
the freq. Again, the same term doesn't occur often in super short docs.

The int[] slices also don't have forward pointers, like in Lucene's TermsHash,
but backwards pointers. In real-time search you often want a strongly
time-biased ranking. A PostingList object has a pointer that points to the last
posting (this statement is not 100% correct for visibility reasons across
threads, but we can imagine it this way for now). A TermDocs can now traverse
the postinglists in opposite order. Skipping can be done by following pointers
to previous slices directly, or by binary search within a slice.
{quote}
This sounds nice!

This would be a custom indexing chain for docs guaranteed not to be over 255
positions in length, right?
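
The fixed-length posting described in the quote can be sketched as plain bit packing. The layout below (docID in the upper 24 bits, position in the low 8 bits) and the name PackedPosting are assumptions for illustration; the quote only says 3 bytes for the docID and 1 byte for the position.

```java
// Hypothetical helper for a 4-byte posting: 24-bit absolute docID + 8-bit position.
final class PackedPosting {
    static final int MAX_DOC_ID = (1 << 24) - 1; // in-memory segment limit of 2^24 docs
    static final int MAX_POSITION = 255;         // one byte for the position

    private PackedPosting() {}

    // Packs docID into the upper 24 bits and position into the low 8 bits.
    static int encode(int docID, int position) {
        if (docID < 0 || docID > MAX_DOC_ID) {
            throw new IllegalArgumentException("docID out of range: " + docID);
        }
        if (position < 0 || position > MAX_POSITION) {
            throw new IllegalArgumentException("position out of range: " + position);
        }
        return (docID << 8) | position;
    }

    // Unsigned shift, so docIDs that set the sign bit decode correctly.
    static int docID(int posting) {
        return posting >>> 8;
    }

    static int position(int posting) {
        return posting & 0xFF;
    }
}
```

Because docIDs are absolute rather than delta-encoded, any slot of the int[] can be decoded on its own, which is what makes binary search within a slice possible.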


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-16 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845780#action_12845780
]

Michael McCandless commented on LUCENE-2312:

bq. In thinking about the terms dictionary, we're going to run into concurrency
issues right if we just use TreeMap?

Right, we need a concurrent data structure here. It's OK if there've been
changes to this shared data structure since a reader was opened -- that reader
knows its max doc id and so it can skip a term if the first doc id in that term
is > that max.
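
A minimal sketch of that idea, assuming the JDK's ConcurrentSkipListMap as the concurrent sorted structure. RamTermsDict, its methods, and the choice to store only the first doc id per term are illustrative assumptions, not the design agreed on in this issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sorted terms dictionary shared by a writer and many readers.
class RamTermsDict {
    // term -> doc id of the first posting for that term
    private final ConcurrentSkipListMap<String, Integer> firstDocByTerm =
        new ConcurrentSkipListMap<>();

    // Writer: remember the first doc a term appeared in; later docs don't change it.
    void addTerm(String term, int docID) {
        firstDocByTerm.putIfAbsent(term, docID);
    }

    // Reader: terms in sorted order, skipping terms that first appeared
    // after the reader's max doc id.
    List<String> termsVisibleTo(int readerMaxDocID) {
        List<String> visible = new ArrayList<>();
        for (Map.Entry<String, Integer> e : firstDocByTerm.entrySet()) {
            if (e.getValue() > readerMaxDocID) {
                continue; // term is newer than this reader
            }
            visible.add(e.getKey());
        }
        return visible;
    }
}
```

Iteration over a ConcurrentSkipListMap is weakly consistent, so a reader can walk the terms while the writer keeps inserting; the max doc id check is what keeps the reader's view point-in-time.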


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845943#action_12845943
]

Jason Rutherglen commented on LUCENE-2312:
--

I thought we're moving away from byte block pooling and we're
going to try relying on garbage collection? Does a volatile
object[] publish changes to all threads? Probably not, again
it'd just be the pointer.

In the case of posting/termdocs iteration, I'm more concerned
that the lastDocID be volatile than with the byte array
containing extra data. Extra docs are OK in the byte array
because we'll simply stop iterating when we've reached the last
doc. Though with our system we shouldn't even run into this:
a byte array is copied and published; perhaps
the master byte array is still being written to and the same
byte array (by id or something) is published again? Then we'd
have multiple versions of byte arrays. That could be bad.

Because there is one DW per thread, there's only one document
being indexed at a time. There's no writer concurrency. This
leaves reader concurrency. However after each doc, we *could*
simply flush all bytes related to the doc. Any new docs must
simply start writing to new byte arrays? The problem with this
is, unless the byte arrays are really small, we'll have a lot of
extra data around, well, unless the byte arrays are trimmed
before publication. Or we can simply RW lock (or some other
analogous thing) individual byte arrays, not publish them after
each doc, then only publish them when get reader is called. To
clarify, the RW lock (or flag) would only be per byte array, in
fact, all writing to the byte array could necessarily cease on
flush, and new byte arrays allocated. The published byte array
could point to the next byte array.

I think we simply need a way to publish byte arrays to all
threads? Michael B. can you post something of what you have so
we can get an idea of how your system will work (ie, mainly what
the assumptions are)?

We do need to strive for correctness of data, and perhaps
performance will be slightly impacted (though compared with our
current NRT we'll have an overall win).


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845950#action_12845950
]

Jason Rutherglen commented on LUCENE-2312:
--

{quote}The tricky part is to make sure that a reader always sees
a consistent snapshot of the index. At the same time a reader
must not follow pointers to non-published locations (e.g. array
blocks). {quote}

Right. In what case, in the term enum / term docs chain of doc
scoring, would a reader potentially try to follow a pointer to a
byte array that doesn't exist? I think we're strictly preventing
it via last doc ids? Also, when we flush, I think we need to
block further doc writing (via an RW lock?) and wait for any
currently writing docs to complete, then forcibly publish the
byte arrays, then release the write lock? This way we always
have published data that's consistent for readers (eg, the
inverted index can be read completely, and there won't be any
wild writes still occurring to a byte array that's been
published).


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845969#action_12845969
]

Michael Busch commented on LUCENE-2312:
---

{quote}
I thought we're moving away from byte block pooling and we're
going to try relying on garbage collection? Does a volatile
object[] publish changes to all threads? Probably not, again
it'd just be the pointer.
{quote}

So far we were only considering moving away from pooling of (Raw)PostingList
objects. Pooling byte blocks might have more performance impact - they're more
heavyweight.


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845971#action_12845971
]

Jason Rutherglen commented on LUCENE-2312:
--

To clarify the above comment, DW's update doc method would acquire a mutex.
The flush bytes method would also acquire that mutex when it copies existing
writeable bytes over to the readable bytes thing (pool?).


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845978#action_12845978
]

Michael Busch commented on LUCENE-2312:
---

{quote}
I think we simply need a way to publish byte arrays to all
threads? Michael B. can you post something of what you have so
we can get an idea of how your system will work (ie, mainly what
the assumptions are)?
{quote}

It's kinda complicated to explain and currently differs a lot from Lucene's
TermsHash classes. I'd prefer to wait a little bit until I have verified that my
solution works.

I think here we should really tackle LUCENE-2324 first - it's a prereq. Wanna
help with that, Jason?


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846261#action_12846261
]

Jason Rutherglen commented on LUCENE-2312:
--

I think the easiest way to test out the concurrency is to add a
flush method to ByteBlockPool. Then allocate a read only version
of the buffers array (not copying the byte arrays, just the 1st
dimension pointers). The only issue is to rework the code to
read from the read only array, and write to the write only
array...
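
Roughly what such a flush could look like, as a sketch only. SnapshotBytePool is a hypothetical stand-in, not the real ByteBlockPool, and the method names are assumptions.

```java
import java.util.Arrays;

// Hypothetical stand-in for ByteBlockPool; only the idea of flush() is shown.
class SnapshotBytePool {
    private byte[][] buffers = new byte[2][];
    private int bufferCount = 0;

    // Writer: add a new block, growing the first dimension when it is full.
    byte[] nextBuffer(int blockSize) {
        if (bufferCount == buffers.length) {
            buffers = Arrays.copyOf(buffers, buffers.length * 2);
        }
        byte[] block = new byte[blockSize];
        buffers[bufferCount++] = block;
        return block;
    }

    // Copies only the first-dimension pointers; the byte[] blocks are shared
    // with the writer, not duplicated.
    byte[][] flush() {
        return Arrays.copyOf(buffers, bufferCount);
    }
}
```

The copy by itself gives no cross-thread visibility. The returned array would still have to reach reader threads through some safe publication point (for example a volatile field), and a reader must stop at the bytes that were written before the flush, since the last block is still being appended to.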


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845255#action_12845255
]

Michael McCandless commented on LUCENE-2312:

Yes, commit should flush and sync all doc writers, and rollback must abort all of
them.

bq. I also have a separate indexing chain prototype working with searchable RAM
buffer (single-threaded)

Yay!

bq. but slightly different postinglist format (some docs nowadays only have 140
characters ).

New sponsor, eh? ;)

But, yes, I suspect an indexer chain optimized to tiny docs can get sizable
gains.

What change to the postings format? Is the change only in the RAM
buffer or also in the index? If it's in the index... we should
probably do this under flex.

bq. It seems really fast. I spent a long time thinking about lock-free
algorithms and data structures, so indexing performance should be completely
independent of the search load (in theory). I need to think a bit more about
how to make it work with normal documents and Lucene's current in-memory
format.

Sounds like awesome progress!! Want some details over here :)


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845257#action_12845257
]

Michael McCandless commented on LUCENE-2312:

{quote}
I got the basics of the term enum working, it can be completed
fairly easily. So I moved on to term docs... There we got some
work to do? Because we're not storing the skip lists in the ram
buffer, currently. I guess we'll need a new
FreqProxTermsWriterPerField that stores the skip lists as
they're being written? How will that work? Doesn't the
multi-level skip list assume a set number of docs?
{quote}

Sounds like you and Michael should sync up!

Good question on skipping -- for first cut we can have no skipping
(and just scan)? Skipping may not be that important in practice,
unless RAM buffer becomes truly immense. Of course, the tinier the
docs the more important skipping will be...


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845374#action_12845374
]

Jason Rutherglen commented on LUCENE-2312:
--

{quote}Good question on skipping - for first cut we can have no
skipping (and just scan)? {quote}

True.

One immediate thought is to have a set skip interval (what was
it before when we had single level?), and for now at least have
a single-level skip list. That way we can grow the posting list with
docs, and the skip list at the same time. If the interval is
constant there won't be a need to rebuild the skip list.
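
A sketch of that single-level scheme, using plain int arrays instead of Lucene's byte slices. SkippingPostings and its methods are hypothetical names; the only thing taken from the comment above is the constant interval, which lets the skip list grow by appending.

```java
import java.util.Arrays;

// Hypothetical in-memory posting list with a single-level skip list.
class SkippingPostings {
    private final int skipInterval;      // must be > 0
    private int[] docIDs = new int[8];
    private int[] skipDocs = new int[4]; // docID of every skipInterval-th posting
    private int count = 0;
    private int skipCount = 0;

    SkippingPostings(int skipInterval) {
        this.skipInterval = skipInterval;
    }

    // Append a doc (docIDs must be strictly increasing); the skip list grows in step.
    void add(int docID) {
        if (count == docIDs.length) {
            docIDs = Arrays.copyOf(docIDs, count * 2);
        }
        if (count % skipInterval == 0) {
            if (skipCount == skipDocs.length) {
                skipDocs = Arrays.copyOf(skipDocs, skipCount * 2);
            }
            skipDocs[skipCount++] = docID; // entry k points at posting k * skipInterval
        }
        docIDs[count++] = docID;
    }

    // Returns the index of the first posting with docID >= target, or -1 if none.
    int advance(int target) {
        int block = 0;
        // Jump over whole blocks while the next block still starts at or below target.
        while (block + 1 < skipCount && skipDocs[block + 1] <= target) {
            block++;
        }
        // Scan linearly from the start of the chosen block.
        for (int i = block * skipInterval; i < count; i++) {
            if (docIDs[i] >= target) {
                return i;
            }
        }
        return -1;
    }
}
```

Existing skip entries are never rewritten when docs are added, so nothing has to be rebuilt as the posting list grows.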


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845404#action_12845404
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

Pre-advanced apology for permanently damaging (well I guess it
can be deleted) the look and feel of this issue with a thwack of
code, however I don't want to post the messy patch, and I'm
guessing there's something small as to why the postings
iteration on the freq byte slice reader isn't happening
correctly (ie, it's returning 0).

{code}
public class DWTermDocs implements TermDocs {
  final FreqProxTermsWriterPerField field;
  final int numPostings;
  final CharBlockPool charPool;
  FreqProxTermsWriter.PostingList posting;
  char[] text;
  int textOffset;
  private int postingUpto = -1;
  final ByteSliceReader freq = new ByteSliceReader();
  final ByteSliceReader prox = new ByteSliceReader();

  int docID;
  int termFreq;

  DWTermDocs(FreqProxTermsWriterPerField field,
             FreqProxTermsWriter.PostingList posting) throws IOException {
    this.field = field;
    this.charPool = field.perThread.termsHashPerThread.charPool;
    //this.numPostings = field.termsHashPerField.numPostings;
    this.numPostings = 1;
    this.posting = posting;
    // nextTerm is called only once to
    // set the term docs pointer at the
    // correct position
    nextTerm();
  }

  boolean nextTerm() throws IOException {
    postingUpto++;
    if (postingUpto == numPostings)
      return false;

    docID = 0;

    // Locate the term text: high bits select the char block, low bits the offset
    text = charPool.buffers[posting.textStart >> DocumentsWriter.CHAR_BLOCK_SHIFT];
    textOffset = posting.textStart & DocumentsWriter.CHAR_BLOCK_MASK;

    // Stream 0 holds doc/freq data, stream 1 holds positions
    field.termsHashPerField.initReader(freq, posting, 0);
    if (!field.fieldInfo.omitTermFreqAndPositions)
      field.termsHashPerField.initReader(prox, posting, 1);

    // Should always be true
    boolean result = nextDoc();
    assert result;

    return true;
  }

  public boolean nextDoc() throws IOException {
    if (freq.eof()) {
      if (posting.lastDocCode != -1) {
        // Return last doc
        docID = posting.lastDocID;
        if (!field.omitTermFreqAndPositions)
          termFreq = posting.docFreq;
        posting.lastDocCode = -1;
        return true;
      } else
        // EOF
        return false;
    }
    final int code = freq.readVInt();
    if (field.omitTermFreqAndPositions)
      docID += code;
    else {
      // Low bit set means freq == 1; the remaining bits are the docID delta
      docID += code >>> 1;
      if ((code & 1) != 0)
        termFreq = 1;
      else
        termFreq = freq.readVInt();
    }
    assert docID != posting.lastDocID;
    return true;
  }
  // (remaining TermDocs methods are not part of this excerpt)
}
{code}


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845432#action_12845432
 ] 

Michael McCandless commented on LUCENE-2312:


I don't see anything obviously wrong -- you excised this code from the same 
code that's used when merging the postings during flush?


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845448#action_12845448
]

Jason Rutherglen commented on LUCENE-2312:
--

The code is from FreqProxFieldMergeState, which accepts a
FreqProxTermsWriterPerField in its constructor. One difference is that
instead of operating on an array of posting lists, the code
above assumes one posting list.

The numPostings was always 0 when testing
{code}this.numPostings = field.termsHashPerField.numPostings;{code}
In the code above it's hard coded to 1.

Maybe there's some initialization that's not happening correctly?


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845464#action_12845464
]

Michael McCandless commented on LUCENE-2312:

Ahh, I think it's because you're not calling compactPostings/sortPostings in
the THPF, right?

Those methods collapse the hash table in-place (ie move all the nulls out), and
sort.
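
As a rough illustration of what "collapse in place, then sort" does to a hash table, here is a sketch on a plain String[] with nulls for empty slots. PostingsCompactor is a made-up name, and the real THPF methods operate on posting objects rather than strings.

```java
import java.util.Arrays;

// Hypothetical illustration of "compact, then sort" on an open-addressed hash table.
final class PostingsCompactor {
    private PostingsCompactor() {}

    // Moves all non-null entries to the front (in place), sorts that prefix,
    // and returns the number of live entries.
    static int compactAndSort(String[] table) {
        int upto = 0;
        for (int i = 0; i < table.length; i++) {
            if (table[i] != null) {
                table[upto++] = table[i];
            }
        }
        Arrays.fill(table, upto, table.length, null); // clear the stale tail
        Arrays.sort(table, 0, upto);
        return upto;
    }
}
```

After this call the entries no longer sit at their hash positions, so the array cannot keep serving lookups for ongoing indexing.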

So you have to re-work the code to not do that and instead use whatever
structure you have for visiting terms in sorted order. Then stepping through
the docs should just work, but, you gotta stop at the max docID, right?

Hmm... what does JMM say about byte arrays? If one thread is writing to the
byte array, can any other thread see those changes?


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845493#action_12845493
]

Jason Rutherglen commented on LUCENE-2312:
--

{quote}Ahh, I think it's because you're not calling
compactPostings/sortPostings in the THPF, right?

Those methods collapse the hash table in-place (ie move all the
nulls out), and sort.{quote}

Yep, got that part.

{quote}So you have to re-work the code to not do that and
instead use whatever structure you have for visiting terms in
sorted order. Then stepping through the docs should just work,
but, you gotta stop at the max docID, right?{quote}

Right, the terms in sorted order is working... The freq
ByteSliceReader is reading nothing however (zeroes). Either it's
init'ed to the wrong position, or there's nothing in there? Or
something else.


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845503#action_12845503
]

Jason Rutherglen commented on LUCENE-2312:
--

Also wanted to add that the PostingList lastDocID is correct.


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845663#action_12845663
]

Jason Rutherglen commented on LUCENE-2312:
--

I have a test case showing the term docs working... I'm going to try to add the
term positions methods.


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845679#action_12845679
]

Jason Rutherglen commented on LUCENE-2312:
--

Basic term positions working, need to figure out how to do lazy loading
payloads...


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845680#action_12845680
]

Jason Rutherglen commented on LUCENE-2312:
--

In thinking about the terms dictionary, we're going to run into concurrency
issues if we just use TreeMap, right? Can't we simply use the lock-free
ConcurrentSkipListMap? Yeah, it's part of Java 6, but why reinvent the
wheel?
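
As a rough illustration of the ConcurrentSkipListMap idea, here is a minimal sketch. The class name, the String term keys, and the integer posting-list ids are all hypothetical stand-ins (Lucene's real term representation and ordering differ); the only point is that one thread can add terms while others iterate them in sorted order without locking.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

/** Hypothetical sorted terms dictionary; not Lucene's actual data structure. */
final class SortedTermsSketch {
    // term text -> hypothetical posting-list id, kept in sorted term order
    private final ConcurrentSkipListMap<String, Integer> terms =
        new ConcurrentSkipListMap<String, Integer>();

    /** Registers a term if absent and returns the posting-list id that won. */
    int addTerm(String term, int postingListId) {
        Integer existing = terms.putIfAbsent(term, postingListId);
        return existing == null ? postingListId : existing;
    }

    /** Returns up to limit terms that are >= start, in sorted order. */
    List<String> termsFrom(String start, int limit) {
        List<String> result = new ArrayList<String>();
        // The iterator is weakly consistent: it never throws
        // ConcurrentModificationException while a writer thread adds terms.
        for (String term : terms.tailMap(start, true).keySet()) {
            if (result.size() >= limit) {
                break;
            }
            result.add(term);
        }
        return result;
    }

    public static void main(String[] args) {
        SortedTermsSketch dict = new SortedTermsSketch();
        dict.addTerm("lucene", 0);
        dict.addTerm("buffer", 1);
        dict.addTerm("search", 2);
        System.out.println(dict.termsFrom("c", 10)); // [lucene, search]
    }
}
```

Note that a weakly consistent iterator may or may not show terms added after iteration started, so this alone does not give a reader a point-in-time view.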


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845696#action_12845696
]

Jason Rutherglen commented on LUCENE-2312:
--

Payloads work (non-lazy loading); however, ByteSliceReader doesn't implement a
seek method, so I think we simply need to load each payload as we increment
nextPosition? The cost shouldn't be too much because we're simply copying
small byte arrays (in the heap).


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-15 Thread Michael Busch (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845703#action_12845703
]

Michael Busch commented on LUCENE-2312:
---

{quote}
Sounds like awesome progress!! Want some details over here :)
{quote}

Sorry for not being very specific. The prototype I'm experimenting with has a
fixed length postings format for the in-memory representation (in TermsHash).
Basically every posting has 4 bytes, so I can use int[] arrays (instead of the
byte[] pools). The first 3 bytes are used for an absolute docID (not
delta-encoded). This limits the max in-memory segment size to 2^24 docs. The 1
remaining byte is used for the position. With a max doc length of 140
characters you can fit every possible position in a byte - what a luxury! :)
If a term occurs multiple times in the same doc, then the TermDocs just skips
multiple occurrences with the same docID and increments the freq. Again, the
same term doesn't occur often in super short docs.
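
To make the 4-byte layout concrete, here is a minimal sketch of how a 24-bit absolute docID and an 8-bit position could share one int. The class and method names are hypothetical and the bit order (docID in the high 3 bytes) is an assumption; the comment above does not say how the prototype arranges the bytes.

```java
/** Hypothetical helper: packs a 24-bit docID and an 8-bit position into one int. */
final class PackedPosting {
    static final int MAX_DOC_ID = (1 << 24) - 1;   // 3 bytes for the absolute docID
    static final int MAX_POSITION = (1 << 8) - 1;  // 1 byte for the position

    private PackedPosting() {}

    static int encode(int docId, int position) {
        if (docId < 0 || docId > MAX_DOC_ID) {
            throw new IllegalArgumentException("docID out of range: " + docId);
        }
        if (position < 0 || position > MAX_POSITION) {
            throw new IllegalArgumentException("position out of range: " + position);
        }
        return (docId << 8) | position;
    }

    static int docId(int posting) {
        return posting >>> 8;       // unsigned shift: the top bit may be set
    }

    static int position(int posting) {
        return posting & 0xFF;
    }

    /** Counts consecutive postings, starting at index start, that share one docID. */
    static int freq(int[] postings, int start, int end) {
        int doc = docId(postings[start]);
        int freq = 1;
        while (start + freq < end && docId(postings[start + freq]) == doc) {
            freq++;
        }
        return freq;
    }

    public static void main(String[] args) {
        int posting = encode(1000000, 139);
        System.out.println(docId(posting) + "," + position(posting)); // 1000000,139
    }
}
```

The freq method mirrors the idea of deriving the frequency by skipping repeated docIDs instead of storing it.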

The int[] slices also don't have forward pointers, like in Lucene's TermsHash,
but backwards pointers. In real-time search you often want a strongly
time-biased ranking. A PostingList object has a pointer that points to the
last posting (this statement is not 100% correct for visibility reasons across
threads, but we can imagine it this way for now). A TermDocs can now traverse
the postinglists in opposite order. Skipping can be done by following pointers
to previous slices directly, or by binary search within a slice.


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-15 Thread Michael Busch (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845712#action_12845712
]

Michael Busch commented on LUCENE-2312:
---

{quote} Hmm... what does JMM say about byte arrays? If one thread is writing
to the byte array, can any other thread see those changes?
{quote}

This is exactly the right question to ask here. Thread-safety is by far the
most complicated aspect of this feature. Jason, I'm not sure if you have
already figured out how to ensure that changes made by the writer
thread are visible to the reader threads?

Thread-safety in our case boils down to safe publication. We don't need
locking to coordinate writing of multiple threads, because of LUCENE-2324. But
we need to make sure that the reader threads see all changes they need to see
at the right time, in the right order. This is IMO very hard, but we all like
challenges :)

The JMM gives no guarantee whatsoever about which changes made by another
thread a thread will see - or whether it will ever see them - unless proper
publication is ensured by either synchronization or volatile/atomic variables.

So e.g. if a writer thread executes the following statements:
{code}
public static int a, b;

...

a = 1; b = 2;

a = 5; b = 6;
{code}

and a reader thread does:
{code}
System.out.println(a + "," + b);
{code}

The thing to remember is that the output might be: 1,6! Another reader thread
with the following code:
{code}
while (b != 6) {
  // ... do something
}
{code}
might furthermore NEVER terminate without synchronization/volatile/atomic.

The reason is that the JVM is allowed to perform any reorderings to utilize
modern CPUs, memory, caches, etc. if not forced otherwise.

To ensure safe publication of data written by a thread we could use
synchronization, but my goal here is to implement a non-blocking and
lock-free algorithm. So my idea was to make use of a very subtle behavior
of volatile variables. I will take a simple explanation of the JMM from Brian
Goetz' awesome book Java Concurrency in Practice, in which he describes the
JMM in simple happens-before rules. I will mention only three of those rules,
because they are enough to describe the volatile behavior I'd like to mention
here (p. 341):

*Program order rule:* Each action in a thread _happens-before_ every action in
that thread that comes later in the program order.

*Volatile variable rule:* A write to a volatile field _happens-before_ every
subsequent read of that same field.

*Transitivity:* If A _happens-before_ B, and B _happens-before_ C, then A
_happens-before_ C.

Based on these three rules you can see that writing to a volatile variable v
by one thread t1 and subsequent reading of the same volatile variable v by
another thread t2 publishes ALL changes of t1 that happened-before the write
to v and the change of v itself. So this write/read of v means crossing a
memory barrier and forcing everything that t1 might have written to caches to
be flushed to the RAM. That's why a volatile write can actually be pretty
expensive.
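
A minimal sketch of how the three rules combine, assuming a single writer thread. The class is hypothetical and is not the prototype's code: array slots are written with plain stores and then published by one volatile write, and a reader that first reads the volatile counter is guaranteed to see every slot below it.

```java
/** Hypothetical single-writer buffer: plain int[] writes published by a volatile counter. */
final class PublishedIntBuffer {
    private final int[] data;        // elements are plain, non-volatile writes
    private volatile int published;  // number of elements visible to readers

    PublishedIntBuffer(int capacity) {
        data = new int[capacity];
    }

    /** Must only be called by the single writer thread. */
    void append(int value) {
        int next = published;  // only the writer ever changes this field
        data[next] = value;    // 1. plain write (program order rule)
        published = next + 1;  // 2. volatile write publishes the write above
    }

    /** Volatile read; safe from any thread. */
    int size() {
        return published;
    }

    /** Only valid for index < a value this thread previously got from size(). */
    int get(int index) {
        return data[index];
    }

    public static void main(String[] args) throws InterruptedException {
        final PublishedIntBuffer buffer = new PublishedIntBuffer(1000);
        Thread writer = new Thread(new Runnable() {
            public void run() {
                for (int i = 0; i < 1000; i++) {
                    buffer.append(i * 2);
                }
            }
        });
        writer.start();
        // Reader: every element below the size it observed must be visible.
        int seen = buffer.size();
        for (int i = 0; i < seen; i++) {
            if (buffer.get(i) != i * 2) {
                throw new AssertionError("stale read at " + i);
            }
        }
        writer.join();
        System.out.println(buffer.size()); // 1000 once the writer has finished
    }
}
```

This only works because a published slot is never modified again. A writer could also do the volatile write once per document rather than once per element, which would be one way to amortize its cost.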

Note that this behavior has only worked the way I just described since
Java 1.5. The change in the behavior of volatile variables from 1.4 to 1.5
was very, very subtle!

The way I'm trying to make use of this behavior is actually similar to how we
lazily sync Lucene's files with the filesystem: I want to delay the cache-RAM
write-through as much as possible, which increases the probability of getting
the sync for free! Still fleshing out the details, but I wanted to share this
information with you guys already, because it might invalidate a lot of assumptions
you might have when developing the code. Some of this stuff was actually new
to me, maybe you all know it already. And if anything that I wrote here is
incorrect, please let me know!

Btw: IMO, if there's only one Java book you can ever read, then read Goetz'
book! It's great. He also says somewhere in the book about lock-free
algorithms: "Don't try this at home!" - so, let's do it! :)


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845721#action_12845721
]

Jason Rutherglen commented on LUCENE-2312:
--

Just to clarify, I think Mike's referring to ParallelArray?

http://gee.cs.oswego.edu/dl/jsr166/dist/extra166ydocs/extra166y/ParallelArray.html

There's AtomicIntegerArray:
http://www.melclub.net/java/_atomic_integer_array_8java_source.html
which underneath uses the sun.misc.Unsafe class for volatile array
access. Could this be reused for an AtomicByteArray class (why
isn't there one of these already?).
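
One way an AtomicByteArray could be layered on the existing AtomicIntegerArray, shown as a sketch only. The class below is hypothetical (the JDK has no such class): it packs four bytes per int and uses a compare-and-set loop so that a write to one byte cannot clobber its three neighbours.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

/** Hypothetical byte array with volatile semantics, backed by AtomicIntegerArray. */
final class AtomicByteArraySketch {
    private final AtomicIntegerArray words; // four bytes packed into each int
    private final int length;

    AtomicByteArraySketch(int length) {
        this.length = length;
        this.words = new AtomicIntegerArray((length + 3) >>> 2);
    }

    int length() {
        return length;
    }

    /** Volatile read of one byte. */
    byte get(int index) {
        checkIndex(index);
        int shift = (index & 3) << 3;  // bit offset of this byte within its int
        return (byte) (words.get(index >>> 2) >>> shift);
    }

    /** Atomically replaces one byte without disturbing the other three in the word. */
    void set(int index, byte value) {
        checkIndex(index);
        int wordIndex = index >>> 2;
        int shift = (index & 3) << 3;
        int mask = 0xFF << shift;
        int newBits = (value & 0xFF) << shift;
        int old;
        do {
            old = words.get(wordIndex);
        } while (!words.compareAndSet(wordIndex, old, (old & ~mask) | newBits));
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
    }

    public static void main(String[] args) {
        AtomicByteArraySketch bytes = new AtomicByteArraySketch(10);
        bytes.set(5, (byte) -3);
        System.out.println(bytes.get(5)); // -3
    }
}
```

Every write here is a volatile-strength operation, so this trades the simplicity of plain byte[] writes for a per-byte cost that would need measuring.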

A quick and easy way to solve this is to use a read write lock
on the byte pool? Remember when we'd sync on each read bytes
call to the underlying random access file in FSDirectory (eg,
now we're using NIOFSDir which can be a good concurrent
throughput improvement). Let's try the RW lock and examine the
results? I guess the issue is we're not writing in blocks of
bytes, we're actually writing byte by byte and need to read byte
by byte concurrently? This sounds like a fairly typical thing to
do?
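
For reference, the read-write-lock variant might look roughly like the sketch below. The class is a hypothetical stand-in for the byte pool, not Lucene's ByteBlockPool; it shows only where the locks would go, and whether the resulting contention is acceptable would have to be measured.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical byte pool guarded by a read-write lock. */
final class LockedBytePool {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final byte[] bytes;
    private int used; // guarded by lock

    LockedBytePool(int capacity) {
        bytes = new byte[capacity];
    }

    /** Writer side: appends a single byte under the exclusive write lock. */
    void append(byte b) {
        lock.writeLock().lock();
        try {
            bytes[used++] = b;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Reader side: copies bytes from start into dest; returns how many were copied. */
    int read(int start, byte[] dest) {
        lock.readLock().lock();
        try {
            int n = Math.min(dest.length, used - start);
            if (n <= 0) {
                return 0;
            }
            System.arraycopy(bytes, start, dest, 0, n);
            return n;
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        LockedBytePool pool = new LockedBytePool(16);
        pool.append((byte) 1);
        pool.append((byte) 2);
        pool.append((byte) 3);
        byte[] dest = new byte[2];
        System.out.println(pool.read(1, dest) + " " + dest[0] + " " + dest[1]); // 2 2 3
    }
}
```

Reading in blocks rather than byte by byte keeps the number of lock acquisitions on the reader side down.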


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-14 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845057#action_12845057
]

Michael McCandless commented on LUCENE-2312:

Michael are you also going to [first] tackle truly separating the RAM segments?
I think we need this first ...

bq. Mike, Why does DocFieldConsumers have DocFieldConsumer one and two? How is
this class used? Thanks.

This is so we can make a tee in the indexing chain. Here's the default chain
(copied out of comment in DW):
{code}
DocConsumer / DocConsumerPerThread
  --> code: DocFieldProcessor / DocFieldProcessorPerThread
    --> DocFieldConsumer / DocFieldConsumerPerThread / DocFieldConsumerPerField
      --> code: DocFieldConsumers / DocFieldConsumersPerThread / DocFieldConsumersPerField
      --> code: DocInverter / DocInverterPerThread / DocInverterPerField
        --> InvertedDocConsumer / InvertedDocConsumerPerThread / InvertedDocConsumerPerField
          --> code: TermsHash / TermsHashPerThread / TermsHashPerField
            --> TermsHashConsumer / TermsHashConsumerPerThread / TermsHashConsumerPerField
              --> code: FreqProxTermsWriter / FreqProxTermsWriterPerThread / FreqProxTermsWriterPerField
              --> code: TermVectorsTermsWriter / TermVectorsTermsWriterPerThread / TermVectorsTermsWriterPerField
        --> InvertedDocEndConsumer / InvertedDocConsumerPerThread / InvertedDocConsumerPerField
          --> code: NormsWriter / NormsWriterPerThread / NormsWriterPerField
      --> code: StoredFieldsWriter / StoredFieldsWriterPerThread / StoredFieldsWriterPerField
{code}

The tee is so the doc fields can go to both DocInverter (for creating postings
and term vectors) and to the stored fields writer.


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-14 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845059#action_12845059
]

Michael McCandless commented on LUCENE-2312:

bq. IW flush could become thread dependent

Right, we want this -- different RAM segments should be flushed at different
times. This gives us better concurrency since IO/CPU resource consumption will
now be more interleaved. While one RAM segment is flushing, the others are
still indexing.

{quote}
A new term will first check the hash table for existence (as
currently), if it's not in the term hash table only then will it
be added to the btree (btw, binary search is O(log N) on
average?) This way we're avoiding the somewhat costlier btree
existence check per token.
{quote}

Yes, we could have btree on-the-side but still use hash for mapping (vs using
btree alone). Hash will be faster lookups... btree could be created/updated on
demand first time something needs to .next() through the TermsEnum.

{quote}
The algorithm for flushing doc writers based on RAM
consumption can simply be, on exceed, flush the doc writer
consuming the most RAM
{quote}

Sounds good :) The challenge will be balancing things... eg if during the time
1 RAM segment is flushed, the others are able to consume more RAM that was
freed up by flushing this one RAM segment, you've got a problem... or maybe at
that point you go and flush the next one now using the most RAM, so it'd self
balance with time.

This will mean the RAM usage is able to flare up above the high water mark...
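
The "flush whichever doc writer is using the most RAM, and repeat if still over" policy can be sketched as follows. RamSegment is a hypothetical stand-in for a per-thread doc writer, and its flush() just zeroes a counter; in a real system other threads keep indexing during the flush, which is exactly the balancing problem described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical flush policy: repeatedly flush the RAM segment using the most memory. */
final class FlushPolicySketch {
    /** Stand-in for a per-thread doc writer and its RAM segment. */
    static final class RamSegment {
        final String name;
        long bytesUsed;

        RamSegment(String name, long bytesUsed) {
            this.name = name;
            this.bytesUsed = bytesUsed;
        }

        void flush() {
            bytesUsed = 0; // a real flush would write the segment to the directory
        }
    }

    private FlushPolicySketch() {}

    /** Flushes the largest segments until total usage is within the limit. */
    static List<String> flushUntilUnder(List<RamSegment> segments, long ramLimitBytes) {
        List<String> flushed = new ArrayList<String>();
        while (totalBytes(segments) > ramLimitBytes) {
            RamSegment largest = null;
            for (RamSegment s : segments) {
                if (largest == null || s.bytesUsed > largest.bytesUsed) {
                    largest = s;
                }
            }
            if (largest == null || largest.bytesUsed == 0) {
                break; // nothing left to flush
            }
            flushed.add(largest.name);
            largest.flush();
        }
        return flushed;
    }

    static long totalBytes(List<RamSegment> segments) {
        long total = 0;
        for (RamSegment s : segments) {
            total += s.bytesUsed;
        }
        return total;
    }

    public static void main(String[] args) {
        List<RamSegment> segments = Arrays.asList(
            new RamSegment("dw0", 10), new RamSegment("dw1", 30), new RamSegment("dw2", 20));
        System.out.println(flushUntilUnder(segments, 15)); // [dw1, dw2]
    }
}
```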

{quote}
I gutted the PerThread classes, then realized, it's all too
intertwined. I'd rather get something working, than spend an
excessive amount of time rearranging code that already works.
{quote}

For starters I would keep the *PerThread, but create multiple DWs? Ie,
removing the PerThread layer doesn't have to happen at first.

Or we could do the nuclear option -- make a new indexing chain.


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-14 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845061#action_12845061
]

Michael McCandless commented on LUCENE-2312:

bq. IW commitMerge calls docWriter's remapDeletes, a synchronized method to
prevent concurrent updates. I'm not sure how we should efficiently block calls
to the different DW's.

Yeah this is because when we buffer a delete Term/Query, the docID we store
against it is absolute. It *seems* like it could/should be relative (ie,
within the RAM segment), then remapping wouldn't be needed when a merge
commits. I think?

bq. _mergeInit calls docWriter getDocStoreSegment - unsure what to change

It wouldn't anymore once we have private RAM segments: we would no longer share
doc stores across segments, meaning merging will always merge doc stores and
there's no need to call that method nor have all the logic in SegmentMerger to
determine whether doc store merging is required.

This will necessarily be a perf hit when building up a large index from
scratch in a single IW session. Today that index creates one large set of doc
stores and never has to merge it while building. This is the biggest perf
downside to this change, I think.

But maybe the perf loss will not be so bad, because of bulk merging, in the
case when all docs always add the same fields in the same order. Or... if we
could fix Lucene to always bind the same field name to the same field number
(LUCENE-1737) then we'd always bulk-merge regardless of which order the app
adds fields to docs.

bq. Some of the config settings (such as maxBufferedDocs) can simply be removed
from DW, and instead accessed via WriterConfig

Ahh, you mean push IWC down to DW? That sounds great.


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-14 Thread Jason Rutherglen (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845149#action_12845149
]

Jason Rutherglen commented on LUCENE-2312:
--

Mike, rollback is pausing all threads and calling doc writer abort. This
should probably happen across all (per thread) doc writers?


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-14 Thread Michael Busch (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845155#action_12845155
]

Michael Busch commented on LUCENE-2312:
---

Well, we need to keep our transactional semantics. So I assume while a flush
will happen per doc writer independently, a commit will trigger all (per
thread) doc writers to flush. Then a rollback also has to abort all per thread
doc writers.


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-14 Thread Michael Busch (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845157#action_12845157
]

Michael Busch commented on LUCENE-2312:
---

{quote}
Michael are you also going to [first] tackle truly separating the RAM segments?
I think we need this first ...
{quote}

Yeah I agree. I started working on a patch for separating the doc writers
already.

I also have a separate indexing chain prototype working with searchable RAM
buffer (single-threaded), but slightly different postinglist format (some docs
nowadays only have 140 characters ;) ). It seems really fast. I spent a long
time thinking about lock-free algorithms and data structures, so indexing
performance should be completely independent of the search load (in theory). I
need to think a bit more about how to make it work with normal documents and
Lucene's current in-memory format.


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-14 Thread Jason Rutherglen (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845179#action_12845179
]

Jason Rutherglen commented on LUCENE-2312:
--

I got the basics of the term enum working, it can be completed
fairly easily. So I moved on to term docs... There we got some
work to do? Because we're not storing the skip lists in the ram
buffer, currently. I guess we'll need a new
FreqProxTermsWriterPerField that stores the skip lists as
they're being written? How will that work? Doesn't the
multi-level skip list assume a set number of docs?


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844891#action_12844891
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

From LUCENE-2293: {quote}(b-tree, or, simply sort-on-demand the
first time a query needs it, though that cost increases the
larger your RAM segments get, ie, not incremental to the # docs
you just added){quote}

For the terms dictionary, perhaps a terms array (this could be a
RawPostingList[], or an array of objects with pointers to a
RawPostingList with some helper methods like getTerm and
compareTo), is kept in sorted order, we then binary search and
insert new RawPostingLists/terms into the array. We *could*
implement a 2 dimensional array, allowing us to make a per
reader copy of the 1st dimension of array. This would maintain
transactional consistency (ie, a reader's array isn't changing
as a term enum is traversing in another thread). 

{quote}Also, we have to solve what happens to a reader using a
RAM segment that's been flushed. Perhaps we don't reuse RAM at
that point, ie, rely on GC to reclaim once all readers using
that RAM segment have closed.{quote}

I don't think we have a choice here? 

 Search on IndexWriter's RAM Buffer
 --

 Key: LUCENE-2312
 URL: https://issues.apache.org/jira/browse/LUCENE-2312
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: 3.0.1
Reporter: Jason Rutherglen
 Fix For: 3.0.2


 In order to offer users near-realtime search, without incurring
 an indexing performance penalty, we can implement search on
 IndexWriter's RAM buffer. This is the buffer that is filled in
 RAM as documents are indexed. Currently the RAM buffer is
 flushed to the underlying directory (usually disk) before being
 made searchable.
 Today's Lucene-based NRT systems must incur the cost of merging
 segments, which can slow indexing.
 Michael Busch has good suggestions regarding how to handle deletes using max
 doc ids.
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
 The area that isn't fully fleshed out is the terms dictionary,
 which needs to be sorted prior to queries executing. Currently
 IW implements a specialized hash table. Michael B has a
 suggestion here:
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-13 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844898#action_12844898
]

Michael McCandless commented on LUCENE-2312:

{quote}
For the terms dictionary, perhaps a terms array (this could be a
RawPostingList[], or an array of objects with pointers to a
RawPostingList with some helper methods like getTerm and
compareTo), is kept in sorted order, we then binary search and
insert new RawPostingLists/terms into the array. We could
implement a 2 dimensional array, allowing us to make a per
reader copy of the 1st dimension of array. This would maintain
transactional consistency (ie, a reader's array isn't changing
as a term enum is traversing in another thread).
{quote}

I don't think we can do term insertion into an array -- that's O(N^2)
insertion cost -- we should use a btree instead.

Also, we could store the first docID stored into the term, too -- this
way we could have an ordered collection of terms, that's shared across
several open readers even as changes are still being made, but each
reader skips a given term if its first docID is greater than the
maxDoc it's searching. That'd give us point in time searching even
while we add terms with time...
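
A sketch of the firstDocID idea, under stated assumptions: the class and method names are hypothetical, terms are plain Strings, docIDs are assigned in increasing order, and maxDoc is treated as exclusive (one greater than the largest visible docID, as with IndexReader.maxDoc()), so a term is skipped when its first docID is at or beyond the reader's maxDoc.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

/** Hypothetical shared, sorted term collection with per-reader point-in-time filtering. */
final class PointInTimeTermsSketch {
    // term -> docID of the first document that contained the term
    private final ConcurrentSkipListMap<String, Integer> firstDocIds =
        new ConcurrentSkipListMap<String, Integer>();

    /** Writer side: only the first docID seen for a term is remembered. */
    void addOccurrence(String term, int docId) {
        firstDocIds.putIfAbsent(term, docId);
    }

    /** Reader side: sorted terms that already existed for a reader opened at maxDoc. */
    List<String> visibleTerms(int maxDoc) {
        List<String> visible = new ArrayList<String>();
        for (Map.Entry<String, Integer> entry : firstDocIds.entrySet()) {
            if (entry.getValue() < maxDoc) { // skip terms first added after this reader opened
                visible.add(entry.getKey());
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        PointInTimeTermsSketch terms = new PointInTimeTermsSketch();
        terms.addOccurrence("apple", 0);
        terms.addOccurrence("pear", 3);
        terms.addOccurrence("apple", 5);
        terms.addOccurrence("fig", 7);
        System.out.println(terms.visibleTerms(4)); // [apple, pear]
        System.out.println(terms.visibleTerms(8)); // [apple, fig, pear]
    }
}
```

Several readers with different maxDoc values can share the one map; each simply filters during iteration. The postings of a visible term would still need to be cut off at maxDoc separately.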

{quote}
bq. Also, we have to solve what happens to a reader using a RAM segment that's
been flushed. Perhaps we don't reuse RAM at that point, ie, rely on GC to
reclaim once all readers using that RAM segment have closed.

I don't think we have a choice here?
{quote}

I think we do have a choice.

EG we could force the reader to cutover to the newly flushed segment
(which should be identical to the RAM segment), eg by making [say] a
DelegatingSegmentReader.

Still... we'd probably have to not re-use in that case, since there
can be queries in-flight stepping through the RAM postings, and, we
have no way to accurately detect they are done. But at least with
this approach we wouldn't tie up RAM indefinitely...

Or maybe we simply state that the APP must aggressively close NRT
readers with time else memory use grows and grows... but I don't
really like that. We don't have such a restriction today...


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845030#action_12845030
]

Jason Rutherglen commented on LUCENE-2312:
--

Mike, Why does DocFieldConsumers have DocFieldConsumer one and two? How is
this class used? Thanks.


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-13 Thread Michael Busch (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845031#action_12845031
]

Michael Busch commented on LUCENE-2312:
---

{quote}
Also, we could store the first docID stored into the term, too - this
way we could have an ordered collection of terms, that's shared across
several open readers even as changes are still being made, but each
reader skips a given term if its first docID is greater than the
maxDoc it's searching. That'd give us point in time searching even
while we add terms with time...
{quote}

Exactly. This is what I meant in my comment:
https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

But I mistakenly said lastDocID; of course firstDocID is correct.


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-13 Thread Michael Busch (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845032#action_12845032
 ] 

Michael Busch commented on LUCENE-2312:
---

I'll try to tackle this one!

 Search on IndexWriter's RAM Buffer
 --

 Key: LUCENE-2312
 URL: https://issues.apache.org/jira/browse/LUCENE-2312
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: 3.0.1
Reporter: Jason Rutherglen
Assignee: Michael Busch
 Fix For: 3.0.2



[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845036#action_12845036
]

Jason Rutherglen commented on LUCENE-2312:
--

A few notes so far:

* IW flush could become thread dependent (eg, it'll only flush
for the current doc writer) or maybe it should flush all doc
writers? Close will shut down and flush all doc writers.

* A new term will first check the hash table for existence (as
currently), if it's not in the term hash table only then will it
be added to the btree (btw, binary search is O(log N) on
average?) This way we're avoiding the somewhat costlier btree
existence check per token.

* The algorithm for flushing doc writers based on RAM
consumption can simply be, on exceed, flush the doc writer
consuming the most RAM?

* I gutted the PerThread classes, then realized, it's all too
intertwined. I'd rather get *something* working, than spend an
excessive amount of time rearranging code that already works.


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer