Right, there would need to be a snapshot taken of all terms when IndexWriter.getReader() is called.

This snapshot would 1) hold a frozen int docFreq per term, and 2) sort the terms so TermEnum can just step through them. (We might be able to delay this sorting until the first time something asks for it.) Also, it must merge this data across all threads, since each thread holds its own per-field hash. I've got a rough start at coding this up...
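
Roughly, that snapshot could look something like this sketch (class and field names are made up, it ignores the per-field split for brevity, and it is not the actual DocumentsWriter internals):

import java.util.*;

// Point-in-time view of the buffered terms: docFreq is frozen at
// snapshot time, and sorting is deferred until a TermEnum needs it.
class FrozenTermsSnapshot {
  private final Map<String,Integer> docFreqByTerm = new HashMap<String,Integer>();
  private String[] sortedTerms;   // built lazily, on first use

  // Merge each indexing thread's term hash into one map; the docFreqs
  // add up because each thread indexed different documents.
  FrozenTermsSnapshot(List<Map<String,Integer>> perThreadTermHashes) {
    for (Map<String,Integer> threadTerms : perThreadTermHashes) {
      for (Map.Entry<String,Integer> e : threadTerms.entrySet()) {
        Integer df = docFreqByTerm.get(e.getKey());
        docFreqByTerm.put(e.getKey(), df == null ? e.getValue() : df + e.getValue());
      }
    }
  }

  int docFreq(String term) {
    Integer df = docFreqByTerm.get(term);
    return df == null ? 0 : df;
  }

  // Delay the sort until the first TermEnum actually steps through.
  synchronized String[] sortedTerms() {
    if (sortedTerms == null) {
      sortedTerms = docFreqByTerm.keySet().toArray(new String[docFreqByTerm.size()]);
      Arrays.sort(sortedTerms);
    }
    return sortedTerms;
  }
}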

The costs are clearly growing, in order to keep the "point in time" feature of this RAMIndexReader, but I think they are still well contained unless you have a really huge RAM buffer.

Flushing is still tricky because we cannot recycle the byte block buffers until all in-flight TermDocs/TermPositions iterations have finished. Alternatively, I may just allocate new byte blocks and let the old ones be GC'd on their own once the running iterations finish.
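
The GC route is simpler, but if we do want to recycle, simple reference counting could work; a sketch with made-up names:

// Each reader using the buffer holds a reference; the blocks are
// recycled only when the writer and all in-flight iterations have
// released theirs.
class ByteBlockPool {
  private int refCount = 1;            // the writer's own reference

  synchronized void incRef() {
    refCount++;
  }

  synchronized void decRef() {
    if (--refCount == 0)
      recycleBlocks();                 // safe: nothing can still be iterating
  }

  private void recycleBlocks() {
    // return the byte blocks to the free list for the next segment
  }
}

The writer would decRef() at flush, and each RAM reader would incRef() on open and decRef() when its iterations complete, so whichever finishes last triggers the recycle.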

Mike

Jason Rutherglen wrote:

Hi Mike,

There would be a new sorted list or something to replace the hashtable?  Seems like an issue that is not solved.

Jason

On Tue, Sep 9, 2008 at 5:29 AM, Michael McCandless
<[EMAIL PROTECTED]> wrote:

This would just tap into the live hashtable that DocumentsWriter* maintain for the posting lists... except the docFreq will need to be copied away on reopen, I think.

Mike

Jason Rutherglen wrote:

Term dictionary?  I'm curious how that would be solved?

On Mon, Sep 8, 2008 at 3:04 PM, Michael McCandless
<[EMAIL PROTECTED]> wrote:

Yonik Seeley wrote:

I think it's quite feasible, but it'd still have a "reopen" cost in that any buffered delete by term or query would have to be "materialized" into docIDs on reopen.  Though, if this somehow turns out to be a problem, in the future we could do this materializing immediately, instead of buffering, if we already have a reader open.
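
Materializing would basically just be walking the deleted term's TermDocs and recording the docIDs, something like this sketch (materializeDeletes and the wrapper class are made-up names around the existing TermDocs API):

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

class DeleteMaterializer {
  // Turn one buffered delete-by-term into concrete docIDs at reopen time.
  void materializeDeletes(IndexReader reader, Term deletedTerm, BitSet deletedDocs) throws IOException {
    TermDocs td = reader.termDocs(deletedTerm);
    try {
      while (td.next())
        deletedDocs.set(td.doc());   // mark this docID deleted
    } finally {
      td.close();
    }
  }
}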

Right... it seems like re-using readers internally is something we
could already be doing in IndexWriter.

True.

Flushing is somewhat tricky because any open RAM readers would then have to cut over to the newly flushed segment once the flush completes, so that the RAM buffer can be recycled for the next segment.

Re-use of a RAM buffer doesn't seem like such a big deal.

But, how would you maintain a static view of an index...?

IndexReader r1 = indexWriter.getCurrentIndex();
indexWriter.addDocument(...);
IndexReader r2 = indexWriter.getCurrentIndex();

I assume r1 will have a view of the index before the document was
added, and r2 after?

Right, getCurrentIndex would return a MultiReader that includes a SegmentReader for each segment in the index, plus a "RAMReader" that searches the RAM buffer.  That RAMReader is a tiny shell class that would basically just record the max docID it's allowed to go up to (the docID as of when it was opened), and stop enumerating docIDs (eg in the TermDocs) when it hits a docID beyond that limit.
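
The TermDocs side of that shell might be as small as this (a sketch only; PointInTimeTermDocs is a made-up name wrapping whatever iterator the RAM buffer exposes):

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

// Delegates to the live buffer's TermDocs, but stops enumerating once
// it passes the docID recorded when this reader was opened.
class PointInTimeTermDocs implements TermDocs {
  private final TermDocs in;     // iterator over the live RAM buffer
  private final int maxDocID;    // max docID visible to this reader

  PointInTimeTermDocs(TermDocs in, int maxDocID) {
    this.in = in;
    this.maxDocID = maxDocID;
  }

  public boolean next() throws IOException {
    return in.next() && in.doc() <= maxDocID;
  }

  public boolean skipTo(int target) throws IOException {
    return in.skipTo(target) && in.doc() <= maxDocID;
  }

  public int read(int[] docs, int[] freqs) throws IOException {
    int count = in.read(docs, freqs);
    // truncate anything indexed after this reader was opened
    while (count > 0 && docs[count-1] > maxDocID)
      count--;
    return count;
  }

  public int doc() { return in.doc(); }
  public int freq() { return in.freq(); }
  public void seek(Term term) throws IOException { in.seek(term); }
  public void seek(TermEnum termEnum) throws IOException { in.seek(termEnum); }
  public void close() throws IOException { in.close(); }
}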

For reading stored fields and term vectors, which are now flushed immediately to disk, we need to somehow get an IndexInput from the IndexOutputs that IndexWriter holds open on these files.  Or, maybe, just open new IndexInputs?
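
Eg, given the writer's Directory, something like this, assuming the Directory implementation lets us read a file that still has an open IndexOutput (the segment name here is just an example):

// Open fresh read handles on the stored fields files for segment _0.
IndexInput fieldsData = directory.openInput("_0.fdt");    // fields data
IndexInput fieldsIndex = directory.openInput("_0.fdx");   // fields index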

Another thing that will help is if users could get their hands on the sub-readers of a multi-segment reader.  Right now that is hidden in MultiSegmentReader and makes updating anything incrementally difficult.

Besides what's handled by MultiSegmentReader.reopen already, what else do you need to incrementally update?

Mike
