On Mon, Mar 26, 2012 at 6:59 PM, Han Jiang <jiangha...@gmail.com> wrote:
> Hi all,
>
> I was trying to figure out the control flow of IndexWriter and
> IndexSearcher, in order to get a better understanding of the idea behind
> Codec implementation.
>
> However, there seem to be some questions related to the code, which I
> find inconvenient to discuss here.
>
> Maybe it is better to explain how much I understand, and ask for your
> comments?
> Here is what I understand:
>
> Index time:
> --First of all, IndexWriter should get a Codec configuration from an
> IndexWriterConfig.
> --When IndexWriter.addDocument is called, an instance of
> DocumentsWriterPerThread will be created.
> --It then passes the codec information through the indexing chain, and makes
> an instance of FreqProxTermsWriterPerField, on which flush() is called.
> --Then, based on the codec information, we create an instance of
> TermsConsumer; after this, we iterate over each termID, get the corresponding
> PostingsConsumer, and save the information of each document.
> --Here, by subclassing "TermsConsumer" and "PostingsConsumer", we can make
> IndexWriter create the index with a new postings format.

That sounds about right!

But, it's best to think of FreqProxTermsWriter/PerField as having its
own "private" in-memory postings format, and then, on flush, it
re-parses its in-memory postings and feeds them to the codec
(Fields/Terms/PostingsConsumer) for writing to the index.
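That flush-time hand-off can be sketched with a self-contained toy (this is
NOT the real Lucene API; the interface names are only loosely modeled on
TermsConsumer/PostingsConsumer): the in-memory postings are re-walked in term
order, and each term's documents are pushed into a per-term consumer.

```java
import java.util.*;

public class FlushSketch {

    interface TermsSink {                       // cf. TermsConsumer
        PostingsSink startTerm(String term);
        void finishTerm(String term, int docFreq);
    }

    interface PostingsSink {                    // cf. PostingsConsumer
        void addDoc(int docID, int termFreq);
    }

    // "Flush": walk the in-memory term -> (docID, freq) postings and feed
    // them to the codec-supplied sinks, which would write the on-disk format.
    static void flush(SortedMap<String, List<int[]>> inMemory, TermsSink terms) {
        for (Map.Entry<String, List<int[]>> e : inMemory.entrySet()) {
            PostingsSink postings = terms.startTerm(e.getKey());
            for (int[] docAndFreq : e.getValue()) {
                postings.addDoc(docAndFreq[0], docAndFreq[1]);
            }
            terms.finishTerm(e.getKey(), e.getValue().size());
        }
    }

    // Records what a sink would receive, for demonstration.
    static String flushToString(SortedMap<String, List<int[]>> inMemory) {
        StringBuilder out = new StringBuilder();
        flush(inMemory, new TermsSink() {
            public PostingsSink startTerm(String term) {
                out.append(term).append(':');
                return (docID, tf) -> out.append(' ').append(docID).append('/').append(tf);
            }
            public void finishTerm(String term, int docFreq) {
                out.append(" df=").append(docFreq).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        SortedMap<String, List<int[]>> mem = new TreeMap<>();
        mem.put("lucene", Arrays.asList(new int[]{0, 2}, new int[]{3, 1}));
        mem.put("codec",  Arrays.asList(new int[]{1, 1}));
        System.out.println(flushToString(mem));
        // codec: 1/1 df=1;lucene: 0/2 3/1 df=2;
    }
}
```

A real codec's consumers would encode these pushes into the postings files
instead of a string.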

> Query time:
> --Now, let's take Phrase Search as an example.
> --When IndexSearcher.search(phraseQuery, topN) is called, an instance of
> PhraseWeight will be created to wrap the query terms.
> --Then, IndexSearcher will create tasks to call
> PhraseWeight.scorer(), inside which Terms and TermsEnum instances will
> be fetched from the corresponding AtomicReader.
> --With the help of TermsEnum, for every phrase term, the related docs and
> positions will be fetched through a DocsAndPositionsEnum, and the results
> are thus generated.
> --Here, by subclassing "TermsEnum" and the related "*Enum" classes, we can
> make IndexSearcher (or IndexReader) understand our postings format.

Sounds right!
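The position-matching step at the end can be sketched as a self-contained toy
(not Lucene code): for two consecutive phrase terms inside one document, the
phrase matches if some position p of the first term has p+1 among the second
term's positions -- which is what the scorer derives from the per-term
DocsAndPositionsEnums.

```java
import java.util.*;

public class PhraseSketch {
    // Toy exact-phrase check for two adjacent terms within a single doc,
    // given each term's sorted position lists.
    static boolean phraseMatch(int[] positionsA, int[] positionsB) {
        Set<Integer> second = new HashSet<>();
        for (int p : positionsB) second.add(p);
        for (int p : positionsA) {
            if (second.contains(p + 1)) return true;  // A at p, B at p+1
        }
        return false;
    }

    public static void main(String[] args) {
        // doc: "open source search engine" -> "source"@1, "search"@2
        System.out.println(phraseMatch(new int[]{1}, new int[]{2}));  // true
        System.out.println(phraseMatch(new int[]{1}, new int[]{3}));  // false
    }
}
```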

> And, here I have some questions:
>
> 1. Will multiple AtomicReaders be created if I run a search on an index with
> several segments? If not, when will there be multiple AtomicReaders? And to
> further the question, what is the idea behind introducing AtomicReader and
> CompositeReader in Lucene 4?

Right, it's one atomic reader (SegmentReader) per segment.

We split composite/atomic readers in 4.0 so they'd be strongly typed
(they have different methods and before the split they'd throw
UnsupportedOperationExceptions from a number of methods, which was
messy).

> 2. I must have missed something during query time, since the subtype of
> PostingsReaderBase is just absent from what I explained. Is it created when
> an instance of AtomicReader is fetched from the context? Where can I find
> the related code?

PostingsWriter/ReaderBase is what our default terms dictionaries
(Block/TreeTermsWriter/Reader) interact with.

So, eg the Lucene40PostingsWriter/Reader subclass PostingsWriter/ReaderBase.

> 3. The wiki page here says we should provide an arbitrary skipDocs bit set
> during enumeration. Then, does the posting list itself remain unchanged,
> even if I call deleteDocuments()? Will deleted documents still remain in the
> postings file, even after segments get merged?

Deleted docs are simply marked in a bit set (the liveDocs bits), and
the postings files themselves are unchanged.

So when the postings reader enumerates the postings, it must check
the provided liveDocs (if they're not null) to confirm each doc is not
deleted.

Merging is the exception: deleted docs are not copied over during a
merge, so the merged segment's postings no longer contain them.
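The effect can be shown with a self-contained toy (plain Java, not Lucene
code): the postings stay fixed, and deletes only clear bits that the
enumerator consults while iterating.

```java
import java.util.*;

public class LiveDocsSketch {
    // Toy enumeration: deletes only flip bits in a "liveDocs" set; the
    // postings still contain the deleted docIDs, so the enumerator must
    // skip any doc whose bit is cleared. A null liveDocs means no deletes.
    static List<Integer> enumerate(int[] postings, BitSet liveDocs) {
        List<Integer> visible = new ArrayList<>();
        for (int docID : postings) {
            if (liveDocs == null || liveDocs.get(docID)) {
                visible.add(docID);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        int[] postings = {0, 2, 5, 7};  // unchanged "on disk"
        BitSet live = new BitSet();
        live.set(0, 8);                 // all docs live...
        live.clear(5);                  // ...then doc 5 is deleted
        System.out.println(enumerate(postings, live));  // [0, 2, 7]
    }
}
```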

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
