[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Michael McCandless (JIRA) Mon, 05 May 2008 05:34:31 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594225#action_12594225
 ]


Michael McCandless commented on LUCENE-1278:
--------------------------------------------


It looks like the .tii file is also storing the int[] docIDs (as an
inlined byte blob)?  I don't think that should be necessary.

This change adds a posting list like the frq file, except that it
stores only docIDs (no freq information), is stored inline in the term
dict, and includes a reader that materializes the full doc list as an
int[] instead of offering only an iterator-like (nextDoc()) interface.

I think these changes would fit cleanly into what's been proposed for
flexible indexing.  EG, case 1a talks about storing only docID in a
posting list, here:

    http://wiki.apache.org/jakarta-lucene/FlexibleIndexing

And recent discussions on the dev list around how to be flexible as to
which index file(s) (one or many) things are stored in, eg:

   http://www.mail-archive.com/[email protected]/msg15681.html

should allow you to store this data inlined into the terms dict, or as
a separate file.

Some other initial comments/questions:

  * I think this would bloat the index because the docIDs are being
    double stored (in the terms dict, and, in the frq file).  Would
    you propose changing the frq file to not store the docID when the
    term dict is doing so?

  * Why store the byte blob in the term dict, and not a separate (new)
    index file?  We lose locality for cases where one wants to iterate
    through terms without loading these docs (eg RangeQuery).

  * Could you, instead, make a reader that reads in the full byte blob
    from the frq file for a term, and then processes that into the
    int[]?  This would require no change to indexing & the index
    format, and wouldn't waste space double-storing the docIDs.

  * I'm worried about how well this scales up.  For very common terms,
    allocating, decoding and then holding the full list of docIDs
    entirely in RAM can become extremely costly.  Also, I don't have a clear
    sense of how apps would use the returned int[].  For example,
    would the int[] for many terms need to remain resident at the same
    time?  (Eg when running a RangeQuery).  If so, that compounds the
    scale challenge.
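
To make the third point above concrete, here is a minimal sketch of
decoding one term's frq entries into an int[].  It assumes the
documented frq encoding (each entry is a VInt whose upper bits are the
delta from the previous docID and whose low bit, when set, means
freq == 1; otherwise the freq follows as another VInt) and it ignores
skip data.  The names FrqBlobDecoder, decodeDocs and readVInt are
hypothetical, not existing Lucene APIs, and the byte[] is assumed to
have been read from the frq file already.

```java
import java.util.Arrays;

public class FrqBlobDecoder {

    /** Reads one VInt: 7 data bits per byte, low-order group first, high bit = "more bytes follow". */
    private static int readVInt(byte[] buf, int[] pos) {
        byte b = buf[pos[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Decodes docFreq entries of one term's frq data into absolute docIDs, discarding freqs. */
    public static int[] decodeDocs(byte[] blob, int docFreq) {
        int[] docs = new int[docFreq];
        int[] pos = {0};                   // read cursor into blob
        int doc = 0;
        for (int i = 0; i < docFreq; i++) {
            int code = readVInt(blob, pos);
            doc += code >>> 1;             // upper bits: delta from the previous docID
            if ((code & 1) == 0) {
                readVInt(blob, pos);       // even code: an explicit freq follows, skip it
            }
            docs[i] = doc;
        }
        return docs;
    }

    public static void main(String[] args) {
        // Hand-encoded entries: doc 5 (freq 1), doc 7 (freq 3), doc 300 (freq 1).
        byte[] blob = { 0x0B, 0x04, 0x03, (byte) 0xCB, 0x04 };
        System.out.println(Arrays.toString(decodeDocs(blob, 3)));
    }
}
```

Note the result costs 4 bytes per docID, so a term matching 10 million
documents would need roughly 40 MB for its int[] alone, which is the
scale concern in the last point.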



> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: lucene.1278.5.4.2008.patch, lucene.1278.5.5.2008.patch
>
>
> Add optional storing of document numbers in term dictionary.  String index 
> field cache and range filter creation will be faster.  
> Example read code:
> {noformat}
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
> do {
>   Term term = termEnum.term();
>   if (term == null || term.field() != field) break;
>   int[] docs = termEnum.docs();
> } while (termEnum.next());
> {noformat}
> Example write code:
> {noformat}
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, 
> Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


