Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

Mark Miller Tue, 13 Oct 2009 14:33:00 -0700

I've added missing enums classes, but everything else is looking good so
far.


Michael McCandless (JIRA) wrote:
>     [ 
> https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765234#action_12765234
>  ] 
>
> Michael McCandless commented on LUCENE-1458:
> --------------------------------------------
>
> OK I think I've committed Mark's last patch onto this branch:
>
>   https://svn.apache.org/repos/asf/lucene/java/branches/flex_1458
>
> and I also branched the 2.9 back-compat branch and committed the last back 
> compat patch:
>
>   
> https://svn.apache.org/repos/asf/lucene/java/branches/flex_1458_2_9_back_compat_tests
>
> Mark can you check it out & see if I missed anything?
>
>   
>> Further steps towards flexible indexing
>> ---------------------------------------
>>
>>                 Key: LUCENE-1458
>>                 URL: https://issues.apache.org/jira/browse/LUCENE-1458
>>             Project: Lucene - Java
>>          Issue Type: New Feature
>>          Components: Index
>>    Affects Versions: 2.9
>>            Reporter: Michael McCandless
>>            Assignee: Michael McCandless
>>            Priority: Minor
>>         Attachments: LUCENE-1458-back-compat.patch, 
>> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
>> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
>> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
>> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
>> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
>> LUCENE-1458.tar.bz2
>>
>>
>> I attached a very rough checkpoint of my current patch, to get early
>> feedback.  All tests pass, though back compat tests don't pass due to
>> changes to package-private APIs plus certain bugs in tests that
>> happened to work (eg call TermPostions.nextPosition() too many times,
>> which the new API asserts against).
>> [Aside: I think, when we commit changes to package-private APIs such
>> that back-compat tests don't pass, we could go back, make a branch on
>> the back-compat tag, commit changes to the tests to use the new
>> package private APIs on that branch, then fix nightly build to use the
>> tip of that branch?o]
>> There's still plenty to do before this is committable! This is a
>> rather large change:
>>   * Switches to a new more efficient terms dict format.  This still
>>     uses tii/tis files, but the tii only stores term & long offset
>>     (not a TermInfo).  At seek points, tis encodes term & freq/prox
>>     offsets absolutely instead of with deltas delta.  Also, tis/tii
>>     are structured by field, so we don't have to record field number
>>     in every term.
>> .
>>     On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
>>     -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
>> .
>>     RAM usage when loading terms dict index is significantly less
>>     since we only load an array of offsets and an array of String (no
>>     more TermInfo array).  It should be faster to init too.
>> .
>>     This part is basically done.
>>   * Introduces modular reader codec that strongly decouples terms dict
>>     from docs/positions readers.  EG there is no more TermInfo used
>>     when reading the new format.
>> .
>>     There's nice symmetry now between reading & writing in the codec
>>     chain -- the current docs/prox format is captured in:
>> {code}
>> FormatPostingsTermsDictWriter/Reader
>> FormatPostingsDocsWriter/Reader (.frq file) and
>> FormatPostingsPositionsWriter/Reader (.prx file).
>> {code}
>>     This part is basically done.
>>   * Introduces a new "flex" API for iterating through the fields,
>>     terms, docs and positions:
>> {code}
>> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
>> {code}
>>     This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
>>     old API on top of the new API to keep back-compat.
>>     
>> Next steps:
>>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>>     fix any hidden assumptions.
>>   * Expose new API out of IndexReader, deprecate old API but emulate
>>     old API on top of new one, switch all core/contrib users to the
>>     new API.
>>   * Maybe switch to AttributeSources as the base class for TermsEnum,
>>     DocsEnum, PostingsEnum -- this would give readers API flexibility
>>     (not just index-file-format flexibility).  EG if someone wanted
>>     to store payload at the term-doc level instead of
>>     term-doc-position level, you could just add a new attribute.
>>   * Test performance & iterate.
>>     
>
>   


-- 
- Mark

http://www.lucidimagination.com




---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

Reply via email to