Maybe I did something wrong, or maybe it indeed does not help, but pushing
data into Lucene was not any faster than before.
I would like to remove my project-specific baggage and try to rephrase
my question by means of a simple example.
Suppose a Lucene document is used to count events of certain types. For
each type of event I have one document. Whenever a new event arrives, I
must read the respective document from the index, increment the count,
delete the document from the index and write the new one into the index.
In addition, assume the distribution of events is a typical Zipf
distribution, i.e. a small number of event types occurs rather
frequently, while other types may appear just once.
What is the most efficient sequence of Lucene operations for such a
scenario?
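
To make the example concrete, here is a minimal sketch of one counting
step, assuming a 3.x-era API; TYPE and COUNT are field names made up for
the example, and IndexWriter.updateDocument(Term, Document) stands in
for the delete-then-add pair:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// One counting step: look up the per-type document, increment its
// count, and replace it. updateDocument() deletes all documents
// matching the term and adds the new one in a single call.
void countEvent(IndexWriter writer, IndexSearcher searcher, String type)
    throws IOException {
  TopDocs hits = searcher.search(new TermQuery(new Term("TYPE", type)), 1);
  long count = 1;
  if (hits.totalHits > 0) {
    Document old = searcher.doc(hits.scoreDocs[0].doc);
    count = Long.parseLong(old.get("COUNT")) + 1;
  }
  Document doc = new Document();
  doc.add(new Field("TYPE", type, Field.Store.YES,
      Field.Index.NOT_ANALYZED));
  doc.add(new Field("COUNT", Long.toString(count), Field.Store.YES,
      Field.Index.NO));
  writer.updateDocument(new Term("TYPE", type), doc);
}

Given the Zipf distribution, the same few hot documents go through this
cycle over and over, which is why I hoped Lucene could keep them in RAM
instead of rewriting them on disk each time.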
Harald.
On 07.08.2012 15:39, Harald Kirsch wrote:
Hello Simon,
ok, I'll try this out. Just to be sure: I was after a way to update
documents before they are even written to disk, but this seems not to be
the Lucene way. From what you propose I understand that this approach
tries to keep documents from being written until the moment they
actually need to be changed.
If I need to keep some kind of map myself anyway, I wonder whether I
should not just cache the documents themselves rather than only their
sequence IDs. Once they are "old" enough, I migrate them into the index.
For the sequence IDs I would need a retirement strategy too.
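
For illustration, the retirement strategy I have in mind would look
roughly like this; a sketch only, where MAX_CACHED and flushToIndex()
are made-up names, not code I actually run:

import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.lucene.document.Document;

// Sketch: a size-bounded LRU cache of pending documents. When the
// cache overflows, the least recently touched document is "old"
// enough and gets migrated into the index.
class DocumentCache extends LinkedHashMap<String, Document> {
  private static final int MAX_CACHED = 10000; // made-up threshold

  DocumentCache() {
    super(16, 0.75f, true); // access order: hot documents stay cached
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<String, Document> eldest) {
    if (size() > MAX_CACHED) {
      flushToIndex(eldest.getKey(), eldest.getValue());
      return true;
    }
    return false;
  }

  private void flushToIndex(String id, Document doc) {
    // hypothetical: writer.updateDocument(new Term(IDFIELD, id), doc);
  }
}

With access order, frequently updated documents stay cached while rarely
touched ones retire quickly.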
It was exactly this additional caching that I hoped to avoid. :-(
Harald.
On 06.08.2012 13:55, Simon Willnauer wrote:
hey harald,
On Mon, Aug 6, 2012 at 1:22 PM, Harald Kirsch
<harald.kir...@raytion.com> wrote:
Hi,
in my application I have to write tons of small documents to the index,
but with a twist. Many of the documents are actually aggregations of
pieces of information that appear in a data stream, usually close
together, but nevertheless interleaved with information for other
documents.

When information a1 for my document A arrives, I create my A-object,
store it with index.addDocument() and forget about it. Later, when a2
arrives, I fetch A from the index, delete it from the index, update it,
and store its updated version. To fetch it from the index, I use a
reader retrieved with IndexReader.openIfChanged(). So for one piece of
information I have roughly the following sequence:
IndexReader newReader = IndexReader.openIfChanged(reader);
if (newReader != null) reader = newReader;      // refresh if needed
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs hits = searcher.search(new TermQuery(new Term(IDFIELD, id)), 1);
Document doc;
if (hits.totalHits > 0) {                       // document already available
  doc = searcher.doc(hits.scoreDocs[0].doc);
  // ... update document object ...
  writer.deleteDocuments(new Term(IDFIELD, id));
} else {
  doc = new Document();                         // create document object
}
writer.addDocument(doc);
The overall speed is not too bad, but I wonder if more is possible. I
changed RAMBufferSizeMB from the default 16 to 200 but saw no
improvement in speed.

I would think that keeping documents in RAM for some time, so that many
updates happen in RAM rather than being written to disk, would improve
the overall running time.

Any hints on how to configure and use Lucene to improve the speed
without layering my own caching on top of it?
if you re-open a reader from an IW (near-real-time), you flush documents
to disk each time you reopen the NRT reader. With high update rates that
likely means you don't keep stuff in memory for very long, so increasing
the RAM buffer size won't help much.

What I would try to exploit is the fact that you only need to open a new
reader if the document (or its latest update) you are looking for has
not been flushed to disk yet, i.e. is not in a reader you have already
opened. Lucene ships with some handy tools that help you implement this.
I'd likely use org.apache.lucene.search.NRTManager, which exposes the
methods of IW (update/add/delete) and returns a sequence ID that you can
later use to request an NRT reader. Let's say you have document X
indexed with sequence ID 15 and you now want to update it: you look up
the ID of doc X in a hashmap or something like that to get the
last-changed sequence ID, then you ask the NRTManager to refresh the
searcher it holds right now with NRTManager#waitForGeneration(15). If
the generation is already refreshed it returns immediately; otherwise it
waits until the reader is opened. Then you can just acquire a new
searcher and check the document.
something like this:
String id = doc.getId();           // application-level document id
Long seqId = mapping.get(id);      // generation of the last change, if any
if (seqId != null) {
  // make sure the searcher covers that generation
  nrtManager.waitForGeneration(seqId);
}
IndexSearcher s = nrtManager.acquire();
try {
  IndexReader reader = s.getIndexReader();
  // do something with the up-to-date reader
} finally {
  nrtManager.release(s);           // always release the searcher
}
From time to time you can prune the mapping of sequence IDs that are
already flushed.
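
A rough sketch of the pruning, assuming you maintain lastVisibleGen
yourself (e.g. the largest generation you have successfully waited for;
it is not an NRTManager API):

import java.util.Iterator;
import java.util.Map;

// Entries at or below lastVisibleGen are already searchable, so
// waitForGeneration() would return immediately for them anyway and
// the map entry can be dropped.
Iterator<Map.Entry<String, Long>> it = mapping.entrySet().iterator();
while (it.hasNext()) {
  if (it.next().getValue() <= lastVisibleGen) {
    it.remove();
  }
}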
hope that helps
simon
Harald.
--
Harald Kirsch
Raytion GmbH
Kaiser-Friedrich-Ring 74
40547 Duesseldorf
Fon +49-211-550266-0
Fax +49-211-550266-19
http://www.raytion.com