[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845712#action_12845712 ]

Michael Busch commented on LUCENE-2312:
---------------------------------------

{quote} Hmm... what does JMM say about byte arrays? If one thread is writing
to the byte array, can any other thread see those changes? 
{quote}

This is exactly the right question to ask here. Thread-safety is by far the
most complicated aspect of this feature. Jason, I'm not sure if you have
already figured out how to ensure visibility of changes made by the writer
thread to the reader threads?

Thread-safety in our case boils down to safe publication. We don't need
locking to coordinate writing of multiple threads, because of LUCENE-2324. But
we need to make sure that the reader threads see all changes they need to see
at the right time, in the right order. This is IMO very hard, but we all like
challenges :)

The JMM gives no guarantee whatsoever about which changes made by one thread
another thread will see - or whether it will ever see them - unless proper
publication is ensured, either by synchronization or by volatile/atomic
variables.

So e.g. if a writer thread executes the following statements:
{code}
public static int a, b; // note: neither field is volatile

...

a = 1; b = 2;

a = 5; b = 6;
{code}

and a reader thread does:
{code}
System.out.println(a + "," + b);
{code}

The thing to remember is that the output might be: 1,6! Another reader thread
with the following code: 
{code}
while (b != 6) {
  .. do something 
}
{code}
might furthermore NEVER terminate without synchronization/volatile/atomic
variables.

The reason is that the JVM is allowed to perform arbitrary reorderings to
better utilize modern CPUs, memory, caches, etc., unless it is forced to do
otherwise.
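To make the two failure modes above concrete, here is a minimal runnable sketch (class and field names are mine, not from any patch): declaring {{b}} volatile guarantees that the reader's busy-wait loop terminates, and - by the happens-before rules discussed below - that the earlier plain write to {{a}} is visible too, so the reader prints 5,6 rather than a stale mix:

```java
public class VolatileVisibility {
    static int a = 0;          // plain field, published only via the volatile write to b
    static volatile int b = 0; // volatile: writes to it are guaranteed visible to readers

    static String run() {
        StringBuilder out = new StringBuilder();
        Thread reader = new Thread(() -> {
            while (b != 6) {
                // busy-wait; if b were NOT volatile, this loop might never terminate
            }
            // The volatile read that observed b == 6 happens-after the volatile
            // write b = 6, which happens-after a = 5 (program order). By
            // transitivity, a = 5 is guaranteed visible here.
            out.append(a).append(",").append(b);
        });
        reader.start();
        a = 1; b = 2;
        a = 5; b = 6; // the volatile write to b publishes the preceding write to a
        try {
            reader.join(); // join also establishes happens-before for 'out'
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints 5,6
    }
}
```

Note that without the volatile modifier the 1,6 output and the non-terminating loop are both legal JMM outcomes, even though they may be hard to reproduce on any particular JVM/CPU combination.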

To ensure safe publication of data written by a thread we could use
synchronization, but my goal here is to implement a non-blocking, lock-free
algorithm. So my idea was to make use of a very subtle behavior of volatile
variables. I will borrow a simple explanation of the JMM from Brian Goetz'
awesome book "Java Concurrency in Practice", in which he describes the JMM in
terms of simple happens-before rules. I will mention only three of those
rules, because they are enough to describe the volatile behavior I'd like to
highlight here (p. 341):

*Program order rule:* Each action in a thread _happens-before_ every action in
that thread that comes later in the program order.

*Volatile variable rule:* A write to a volatile field _happens-before_ every
subsequent read of that same field.

*Transitivity:* If A happens-before B, and B _happens-before_ C, then A
_happens-before_ C.

Based on these three rules you can see that a write to a volatile variable v
by one thread t1, followed by a read of the same volatile variable v by
another thread t2, publishes ALL changes of t1 that happened-before the write
to v, as well as the change of v itself. This write/read pair on v crosses a
memory barrier and forces everything that t1 might have written to caches to
be flushed to RAM. That's why a volatile write can actually be pretty
expensive.
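This piggybacking on a single volatile write is exactly what answers the byte-array question from the quote at the top. A hedged sketch of the pattern (this is my illustration, not the actual patch code): the writer fills a plain byte array and then advances a volatile "watermark"; readers read the watermark first and only touch positions below it, which the happens-before rules make safe:

```java
public class SafePublication {
    static final byte[] buffer = new byte[1024]; // plain, non-volatile array
    static volatile int published = 0; // watermark: bytes below this index are safe to read

    // Writer thread: fill the buffer, THEN advance the watermark.
    static void write(byte value, int pos) {
        buffer[pos] = value;   // plain write, not yet guaranteed visible
        published = pos + 1;   // volatile write publishes everything before it
    }

    // Reader thread: read the watermark FIRST, then only bytes below it.
    static byte read(int pos) {
        int limit = published; // volatile read crosses the memory barrier
        if (pos >= limit) {
            throw new IllegalStateException("position " + pos + " not yet published");
        }
        return buffer[pos];    // visible by transitivity through 'published'
    }

    public static void main(String[] args) {
        write((byte) 7, 0);
        System.out.println(read(0)); // prints 7
    }
}
```

The crucial detail is the ordering discipline: the writer must touch the volatile field last, and the reader must touch it first. Reading {{buffer[pos]}} for a position at or above the watermark would be a data race, since the array elements themselves carry no volatile semantics.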

Note that this behavior has actually only worked as I just described since
Java 1.5. The semantics of volatile variables changed in a very, very subtle
way from 1.4 to 1.5!

The way I'm trying to make use of this behavior is actually similar to how we
lazily sync Lucene's files with the filesystem: I want to delay the cache->RAM
write-through as much as possible, which increases the probability of getting
the sync for free! I'm still fleshing out the details, but I wanted to share
this information with you already, because it might invalidate a lot of
assumptions you might have when developing the code. Some of this stuff was
actually new to me; maybe you all know it already. And if anything I wrote
here is incorrect, please let me know!

Btw: IMO, if there's only one Java book you ever read, make it Goetz' book!
It's great. Somewhere in it he also says about lock-free algorithms: "Don't
try this at home!" - so, let's do it! :)

> Search on IndexWriter's RAM Buffer
> ----------------------------------
>
>                 Key: LUCENE-2312
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2312
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>    Affects Versions: 3.0.1
>            Reporter: Jason Rutherglen
>            Assignee: Michael Busch
>             Fix For: 3.1
>
>
> In order to offer users near-realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

-- 
This message is automatically generated by JIRA.