Well, closing/opening an index is MUCH less expensive than
rebuilding the whole thing, so I don't understand part of your
statements....
It *may* (but I haven't tried it) be possible to flush the writer
rather
than
close/open it. But, you MUST close/reopen the reader you search with
even if flush works like I think it does.
But it's also possible to use a two tiered approach. 1G isn't all
that
big.
Could
you read it into a RAMDir and use that for your searches? Then,
when you
add
data, you add it to *both* indexes, but close/open the RAMdir for
searching.
It's also possible to keep the RAMdir as the delta between the
FSdir and
"current" states of your index. Add to both and search both. Although
deletes may be a problem here.
You haven't specified how often you expect changes, though. 100/
second?
1/minute? How real is "real time"? You could do something like warm
up
a new reader in the background whenever you decided you needed to be
absolutely up to date and swap your "live" reader for the newly
warmed up
one whenever you deemed it wise.
Or you could just close/open your reader after each modification,
fire off
a
couple of warmup queries at it and let the users live with slow
responses
if they happen to search before your warm-up queries completed.
The point is that there are many options, but to suggest the best
one, we
need some throughput numbers and a better definition of what "real
time"
means. Is a one minute delay acceptable? 10 seconds? a millisecond?
the answer defines the scope of reasonable solutions.....
Best
Erick
On 8/10/07, Antonello Provenzano <[email protected]> wrote:
Kai,
The context I'm going to work with requires a continuous addition of
documents to the indexes, since it's user-driven content, and this
would require the content to be always up-to-date.
This is the problem I'm facing, since I cannot rebuild a 1Gb (at
least) index every time a user inserts a new entry into the
database.
I know Digg, for instance, is using Lucene as search engine: since
the
amount of data they're dealing with is much higher than mine, I
would
like to understand the way they used to implement this kind of
solution.
Thank you again.
Antonello
On 8/10/07, Kai Hu <[email protected]> wrote:
Antonello,
You are right,I think lucene indexsearcher will search the
old
information if IndexWriter was not closed(I think lucene release the
Lock
here),so I only add a few documents every time from buffer to
implement
index "real time".
kai
发件人: [email protected] [mailto:[email protected]
m] 代表
Antonello Provenzano
发送时间: 2007年8月10日 星期五 17:59
收件人: [email protected]
主题: Re: 答复: Lucene in large database contexts
Kai,
Thanks. The problem I see it's that although I can add a Document
through IndexWriter or IndexModifier, this won't be searchable
until
the index is closed and, possibly, optimized, since the score of
the
document in the index context must be re-calculated on the basis of
the whole context.
Is this assumption true? or am I completely wrong?
Cheers.
Antonello
On 8/10/07, Kai Hu <[email protected]> wrote:
Hi, Antonello
You can use IndexWriter.addDocument(Document document) to
add
single document,same to update,delete operation.
kai
-----邮件原件-----
发件人: Antonello Provenzano [mailto:[email protected]]
发送时间: 2007年8月10日 星期五 17:09
收件人: [email protected]
主题: Lucene in large database contexts
Hi There!
I've been working for a while on the implementation of a website
oriented to contents that would contain millions of entries,
most of
them indexable (such as descriptions, texts, names, etc.).
The ideal solution to make them searchable would be to use
Lucene as
index and search engine.
The reason I'm posting the mailing list is the following: since
all
the entries will be stored in a database (most likely MySQL InnoDB
or
Oracle), what's the best technique to implement a system that
indexes
in "real time" (eg. when an entry is inserted into the databsse)
the
content and make it searchable? Based on my understanding of
Lucene,
such this thing is not possible, since the index must be re-
created
to
be able to search the indexed contents. Is this true?
Eventually, could anyone point me to a working example about how
to
implement such a similar context?
Thank you for the support.
Antonello
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
---
------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]