Re: DefaultIndexAccessor

Mark Miller Mon, 04 Feb 2008 16:47:20 -0800

I replied to the wrong thread -- sorry about that:

You still have to be careful if you want to alternate a search and a
write. If you are loading a lot of docs this way, you would want to hold
the Writer to batch the docs, but while you are holding it, you will not
have a fresh view of the index - so you could add the same doc twice if
it came twice in a batch. The only way to be sure you avoid this is to
reopen readers after you add every doc. This is just not going to be a
fast way of doing things...but if you have a high mergefactor, the new
reopen method will probably make it *much* faster. Or if you are sure
that the batch won't contain duplicates, you can batch load.
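A rough sketch of the duplicate problem described above, and of one
possible guard that is not part of IndexAccessor or Lucene: remember the
keys already seen in the current batch, in addition to checking the
(possibly stale) index view. The class BatchDeduper, the method
keysToAdd, and the existsInIndex predicate are all hypothetical names
made up for illustration, and the sketch assumes each doc can be
identified by a single String key.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical helper for illustration only; not an IndexAccessor or Lucene class.
final class BatchDeduper {
    // Returns the keys from the batch that should actually be added.
    // existsInIndex stands in for a lookup against the last opened Reader,
    // which cannot see docs added earlier in this same batch.
    static List<String> keysToAdd(List<String> batchKeys, Predicate<String> existsInIndex) {
        Set<String> seenInBatch = new HashSet<>();
        List<String> toAdd = new ArrayList<>();
        for (String key : batchKeys) {
            if (existsInIndex.test(key)) {
                continue; // already in the index as of the last reopen
            }
            if (!seenInBatch.add(key)) {
                continue; // came twice in this batch; the stale view would miss it
            }
            toAdd.add(key);
        }
        return toAdd;
    }
}
```

This only helps when a single thread (or a shared, synchronized set)
sees the whole batch; it does not replace reopening Readers when other
writers may be adding the same keys.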


Cam Bazz wrote:

Hello Mark,

Thank you for your lengthy and valuable clarification. Here is my case:
before adding to the index, I must check whether a document with the
same key (actually, a double key) already exists, and before deleting a
document I must ensure it exists in the index.

Currently I am doing it with my custom caching routine. It works quite well
up to 32M documents, but after that something happens and it really slows
down.

I will experiment with your implementation, as soon as I can. It is very
cool by the way. Will it be included in the next release?

Best,
-C.B.

On Feb 4, 2008 7:15 PM, Mark Miller <[EMAIL PROTECTED]> wrote:

The purpose of IndexAccessor is to coordinate Readers/Writers for a
Lucene index. Readers and Writers in Lucene are multi-threaded in that
multiple threads may use them at the same time, but they must/should be
shared and there are special rules (You cannot delete with a Reader
while a Writer is working on the index). Also, you need to refresh
Reader views every so often; this is expensive (though usually much less
so with the new reopen method).

IndexAccessor enforces the rules and controls Reader refreshing. Instead
of worrying about caching or index interaction rules, you just ask for
your Reader/Writer, use it to search or add a doc, and then return it.
The rest is taken care of for you.

This is done by keeping a cached Writer and Searcher(s) that all threads
share. References to the Searchers are counted so that after a Writer is
returned (and no other thread has a reference to the Writer),
IndexAccessor waits for all of the current Searchers to come back and
then reopens their Readers.
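
A minimal sketch of the reference counting described above, assuming a
single thread performs the reopen. This is not the IndexAccessor code:
SearcherRefCounter and its methods are hypothetical names, and an int
"generation" stands in for the Searcher so the example has no Lucene
dependency.

```java
// Hypothetical sketch; names are illustrative, not IndexAccessor's API.
final class SearcherRefCounter {
    private int refs = 0;            // Searcher references currently checked out
    private int generation = 0;      // stands in for "which Reader view is current"
    private boolean reopening = false;

    // Hand out the current view; callers wait while a reopen is in progress.
    synchronized int acquire() {
        while (reopening) {
            await();
        }
        refs++;
        return generation;
    }

    // Return a reference and wake up a reopen that may be waiting for it.
    synchronized void release() {
        if (refs == 0) {
            throw new IllegalStateException("release without acquire");
        }
        refs--;
        notifyAll();
    }

    // Called after the Writer is released: wait for every outstanding
    // reference to come back, then switch to the new view.
    synchronized int reopenWhenIdle() {
        reopening = true;
        try {
            while (refs > 0) {
                await();
            }
            return ++generation;
        } finally {
            reopening = false;
            notifyAll();
        }
    }

    // Must be called while holding this object's monitor.
    private void await() {
        try {
            wait();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }
}
```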

In this regard, you get a similar setup to what Solr might give: from
any thread you just add docs and run searches -- you don't have to worry
about refreshing Readers or sharing Writers/Readers or one thread
deleting with a Reader while another thread tries to write with a Writer.

This setup allows you to do other cool things, like warm Searchers
before putting them into action. That's what the code I am posting soon
will be capable of - when the Readers are reopened, search requests will
still be handled by the old Readers while the new Searchers run a sample
query with optional sort fields. This will make sure the Reader is open
and its sort caches are loaded before the first thread tries to use it.
Much faster response to applications.
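
The warming idea can be sketched roughly as follows: open the
replacement, warm it, and only then publish it, so requests keep hitting
the old one in the meantime. This is an illustration under assumptions,
not the code being posted: WarmingSwapper, reopenWithWarming, and the
reopen/warm callbacks are hypothetical, and the type parameter S stands
in for a Searcher.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical sketch; S stands in for a Searcher type.
final class WarmingSwapper<S> {
    private final AtomicReference<S> current;

    WarmingSwapper(S initial) {
        current = new AtomicReference<>(initial);
    }

    // Search requests are served by whatever is currently published.
    S get() {
        return current.get();
    }

    // Open the replacement and warm it (for example, run a sample query
    // with the sort fields) before publishing it. Returns the previous
    // instance so the caller can close it once its references drain.
    S reopenWithWarming(Supplier<S> reopen, Consumer<S> warm) {
        S fresh = reopen.get();
        warm.accept(fresh);              // old instance is still being served here
        return current.getAndSet(fresh); // publish only after warming finished
    }
}
```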

You must open a new Reader or reopen a Reader to see recently added
docs...IndexAccessor provides no real way around that. But it does make
the reopening much easier -- and your application that just wants to add
docs and search at will from multiple threads, won't have to worry about
it.

You can bail out here, or if you want further clarification I will
include an alternate attempt at what IndexAccessor is below.

- Mark

----------------------------------------------------------------------------------------------------
When accessing a Lucene index from multiple threads, there are a variety
of issues that you must address.

1. The Readers/Writer should be shared across threads.
2. Readers must periodically be refreshed, either by creating new
instances or using the new reopen method.
3. A Reader that writes needs to be properly coordinated with a Writer,
e.g. they cannot be used at the same time.

IndexAccessor addresses each of these issues.

How it works:

A single Writer is shared among threads that try to concurrently
retrieve and use a Writer. Once all of these threads release their
reference to the Writer, it is closed and upon the next request a new
one is created.
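
As a rough illustration of the shared Writer lifecycle just described
(open on first request, close when the last reference is released, open
a fresh one on the next request), here is a sketch with hypothetical
names. SharedWriterHolder and its opener/closer callbacks are
assumptions for the example; W stands in for the Writer type.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical sketch; not the IndexAccessor implementation.
final class SharedWriterHolder<W> {
    private final Supplier<W> opener;  // how to open a Writer (assumed)
    private final Consumer<W> closer;  // how to close it (assumed)
    private W writer;                  // null while no thread holds a Writer
    private int refs;

    SharedWriterHolder(Supplier<W> opener, Consumer<W> closer) {
        this.opener = opener;
        this.closer = closer;
    }

    // All concurrent callers get the same Writer instance.
    synchronized W acquire() {
        if (writer == null) {
            writer = opener.get();
        }
        refs++;
        return writer;
    }

    // The last thread to release closes the Writer.
    synchronized void release() {
        if (refs == 0) {
            throw new IllegalStateException("release without acquire");
        }
        refs--;
        if (refs == 0) {
            closer.accept(writer);
            writer = null; // the next acquire opens a new one
        }
    }
}
```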

A single Searcher for each Similarity is also shared across threads.
Upon first request, a new Searcher is created. This Searcher is then
returned upon every request. A count of every Searcher reference
retrieved is maintained.

When all references to a Writer are released, the Writer is closed and
after waiting for all of the Searchers to be returned, the Searchers are
reopened. Without warming enabled, new requests for Searchers/Readers
must wait for this reopen to complete. If warming is enabled, the old
Searchers/Readers continue handling Searcher requests until the Readers
have been reopened and any requested sort caches have been loaded.

If you ask for a writing Reader, you will not get it until a Writer is
released and vice versa.

The result is that you can freely use Writers/Readers/Searchers from any
thread without considering thread interactions. ***

If you want to add docs, just ask for a Writer, add the docs, and
release the Writer. If you want to search, get a Searcher, search,
and release the Searcher. You don't have to worry about reopening
Readers or coordinating access.

***
You still do have to consider things like hogging the Writer/Readers -
if you don't occasionally release them, things will not stay very
interactive.
The best method is to just get the object, use it, and then return it in
a finally block. Batch load multiple docs, but if you're just randomly
adding a doc, get the Writer, add it, and then release the Writer in a
finally block. If you are batch loading a million docs and you want to
be able to see them as they are added: get the Writer and add 10,000
docs (or something), release the Writer, get the Writer and add 10,000
docs, etc.
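
The get/use/release-in-finally pattern with chunked batch loading might
look roughly like this. The sketch is deliberately generic and makes no
claim about the actual IndexAccessor method names: ChunkedLoader and the
getWriter, releaseWriter, and addDoc callbacks are hypothetical
stand-ins for whatever the accessor and Writer really expose.

```java
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical sketch; W stands in for the Writer type, D for a document.
final class ChunkedLoader {
    // Adds docs in chunks, getting the Writer before each chunk and
    // releasing it in a finally block afterwards so Readers can be
    // reopened between chunks. Returns the number of chunks written.
    static <W, D> int load(List<D> docs, int chunkSize,
                           Supplier<W> getWriter,
                           Consumer<W> releaseWriter,
                           BiConsumer<W, D> addDoc) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        int chunks = 0;
        for (int start = 0; start < docs.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, docs.size());
            W writer = getWriter.get();
            try {
                for (D doc : docs.subList(start, end)) {
                    addDoc.accept(writer, doc);
                }
            } finally {
                releaseWriter.accept(writer); // always give the Writer back
            }
            chunks++;
        }
        return chunks;
    }
}
```

With a chunk size of 10,000 and a million docs this would get and
release the Writer 100 times; a smaller chunk size trades load speed for
fresher search results.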

Cam Bazz wrote:

Hello Mark,

I have been reading the code - and honestly I have not understood how it
works. I was hoping that this was a solution to the case when you are
adding documents - in a multithreaded way, it allows other non-writer
threads to be able to see documents added without refreshing the
indexsearcher - by using some caching mechanism.

Could you elaborate what IndexAccessor does and how it does it a little
bit more?

Best Regards,
-C.B.

On Feb 4, 2008 3:06 PM, Mark Miller <[EMAIL PROTECTED]> wrote:

IndexAccessor-1.26.2008.zip is the latest one. I will be dating a zip
from now on.

I hope to post new code with the warming either tonight or tomorrow
night. I would be ecstatic to have some help vetting that.

Also, I am thinking of making a change so that when you release the
Writer the thread that releases does not block until reopen. I think the
original author did this so that if you add a doc with a thread and then
immediately search from the same thread, you are guaranteed to find the
doc. However, this guarantee did not hold -- if another thread had a
reference to the Writer and a new thread grabbed a Writer and then
quickly released before the first thread, you will have added a doc but
it will not be visible until the first thread releases its reference to
the Writer...since the concept is not enforced anyway, you might as well
not block for the final thread that releases the Writer either. Instead
I will grab a thread from a thread pool to do the reopening with that
thread, and return right after closing the Writer. The result is that
you cannot add a doc and search and expect to find it without waiting a
second or two. But this way things will be consistent, and an app that
adds docs will be a bit more responsive....eg it won't hang as Readers
are being reopened.

I also have to bring the AccessProvider classes back. No easy way to use
your own custom Readers without it...I shouldn't have stripped it out.

- Mark



Cam Bazz wrote:

Hello,

Regarding https://issues.apache.org/jira/browse/LUCENE-1026 , this seems
very interesting. I have read the discussion on the page, but I could
not figure out which set of files is the latest.
Is it the IndexAccessor-1.26.2008.zip file?

I will read through the code, make my own tests, and send some
feedback.

Best.
-C.B.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

