Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

Michael McCandless Thu, 08 Oct 2009 12:50:48 -0700

Well, it's the usual process... pull together a big patch, open an issue, etc.


Because it's probably a large amount of code (I think?), you'll need to
submit a software grant
(http://www.apache.org/licenses/software-grant.txt).

Mike

On Thu, Oct 8, 2009 at 2:58 PM, John Wang <[email protected]> wrote:
> Awesome!
>
> Mike, can you let us know what the process is and the timeline?
>
> Thanks
>
> -John
>
> On Thu, Oct 8, 2009 at 11:48 AM, Michael McCandless
> <[email protected]> wrote:
>>
>> +1!
>>
>> Mike
>>
>> On Thu, Oct 8, 2009 at 2:41 PM, John Wang <[email protected]> wrote:
>> > Hi guys:
>> >
>> >      What are your thoughts about contributing Kamikaze as a Lucene
>> > contrib package? We just finished porting Kamikaze to Lucene 2.9. The new
>> > 2.9 API allows us some more code tuning and optimization improvements.
>> >
>> >      We will be releasing Kamikaze, so it might be a good time to add it
>> > to the Lucene contrib package if there is interest.
>> >
>> > Thanks
>> >
>> > -John
>> >
>> > On Thu, Sep 24, 2009 at 6:20 AM, Uwe Schindler <[email protected]> wrote:
>> >>
>> >> By the way: In the last RC of Lucene 2.9 we added a new method to
>> >> DocIdSet called isCacheable(). It is used by e.g. CachingWrapperFilter to
>> >> determine whether a DocIdSet is easily cacheable or must be copied to an
>> >> OpenBitSetDISI. The default is false, so all custom DocIdSets are copied
>> >> to OpenBitSetDISI by CachingWrapperFilter, even if not needed. If a
>> >> DocIdSet does not do disk IO and has a fast iterator, like e.g. the
>> >> FieldCache ones in FieldCacheRangeFilter, it should return true (see
>> >> CHANGES.txt). Maybe this should also be added to Kamikaze, which is a
>> >> really nice project! Especially filter DocIdSets should pass this method
>> >> to their delegate (see FilteredDocIdSet in Lucene).
>> >>
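The delegation described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not Lucene code: SketchDocIdSet, InMemoryDocIdSet and EvenOnlyDocIdSet are hypothetical stand-ins, and an int[] accessor replaces the real iterator for brevity.

```java
import java.util.Arrays;

// Minimal sketch; every type here is a hypothetical stand-in, not Lucene's API.
class IsCacheableSketch {

    /** Stand-in for a doc id set; it exposes a sorted array instead of an iterator. */
    static abstract class SketchDocIdSet {
        abstract int[] docIds();

        /** Conservative default, as described in the mail: not cacheable. */
        boolean isCacheable() {
            return false;
        }
    }

    /** Held entirely in memory with cheap iteration, so it reports true. */
    static class InMemoryDocIdSet extends SketchDocIdSet {
        private final int[] ids;

        InMemoryDocIdSet(int[] sortedIds) {
            this.ids = sortedIds.clone();
        }

        @Override
        int[] docIds() {
            return ids.clone();
        }

        @Override
        boolean isCacheable() {
            return true;
        }
    }

    /** A filtering wrapper: keeps only even ids and forwards isCacheable(). */
    static class EvenOnlyDocIdSet extends SketchDocIdSet {
        private final SketchDocIdSet delegate;

        EvenOnlyDocIdSet(SketchDocIdSet delegate) {
            this.delegate = delegate;
        }

        @Override
        int[] docIds() {
            return Arrays.stream(delegate.docIds()).filter(id -> id % 2 == 0).toArray();
        }

        @Override
        boolean isCacheable() {
            // The wrapper adds no IO of its own, so the delegate's answer applies.
            return delegate.isCacheable();
        }
    }

    public static void main(String[] args) {
        SketchDocIdSet wrapped = new EvenOnlyDocIdSet(new InMemoryDocIdSet(new int[] {1, 2, 4, 7}));
        System.out.println(Arrays.toString(wrapped.docIds()) + " cacheable=" + wrapped.isCacheable());
    }
}
```

Without the forwarding override, the wrapper would fall back to the default of false and be copied even though its delegate is cheap to cache.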
>> >> -----
>> >> Uwe Schindler
>> >> H.-H.-Meier-Allee 63, D-28213 Bremen
>> >> http://www.thetaphi.de
>> >> eMail: [email protected]
>> >>
>> >>
>> >> > -----Original Message-----
>> >> > From: John Wang (JIRA) [mailto:[email protected]]
>> >> > Sent: Thursday, September 24, 2009 3:14 PM
>> >> > To: [email protected]
>> >> > Subject: [jira] Commented: (LUCENE-1458) Further steps towards
>> >> > flexible
>> >> > indexing
>> >> >
>> >> >
>> >> >     [ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759112#action_12759112 ]
>> >> >
>> >> > John Wang commented on LUCENE-1458:
>> >> > -----------------------------------
>> >> >
>> >> > Just an FYI: Kamikaze was originally started as our sandbox for Lucene
>> >> > contributions until 2.4 was ready (we needed the DocIdSet/Iterator
>> >> > abstraction that was migrated from Solr).
>> >> >
>> >> > It has three components:
>> >> >
>> >> > 1) P4Delta
>> >> > 2) Logical boolean operations on DocIdSet/Iterators (I created a JIRA
>> >> > ticket and a patch for Lucene a while ago with performance numbers. It
>> >> > is significantly faster than DisjunctionScorer)
>> >> > 3) An algorithm to determine which DocIdSet implementation to use given
>> >> > some parameters, e.g. minId, maxId, id count etc. It learns and adjusts
>> >> > from the application behavior if not all parameters are given.
>> >> >
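To illustrate what logical boolean operations over doc id sets involve, here is a minimal sketch of AND and OR over sorted, duplicate-free doc id lists. It is a plain merge written for this note, not Kamikaze's implementation, and it works on int arrays rather than on DocIdSet iterators; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Minimal sketch; not Kamikaze code. Inputs must be sorted and duplicate-free.
class DocIdBooleanOpsSketch {

    /** AND: doc ids present in both inputs. */
    static int[] and(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;                 // a is behind, advance it
            } else if (a[i] > b[j]) {
                j++;                 // b is behind, advance it
            } else {
                out[n++] = a[i];     // match in both inputs
                i++;
                j++;
            }
        }
        return Arrays.copyOf(out, n);
    }

    /** OR: doc ids present in either input, each emitted once. */
    static int[] or(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                out[n++] = a[i++];
            } else if (a[i] > b[j]) {
                out[n++] = b[j++];
            } else {
                out[n++] = a[i];     // same id in both, emit once
                i++;
                j++;
            }
        }
        while (i < a.length) out[n++] = a[i++];
        while (j < b.length) out[n++] = b[j++];
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        int[] a = {1, 3, 5, 8};
        int[] b = {3, 4, 5, 9};
        System.out.println(Arrays.toString(and(a, b)));
        System.out.println(Arrays.toString(or(a, b)));
    }
}
```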
>> >> > So please feel free to incorporate anything you see fit or move it to
>> >> > contrib.
>> >> >
>> >> >
>> >> > > Further steps towards flexible indexing
>> >> > > ---------------------------------------
>> >> > >
>> >> > >                 Key: LUCENE-1458
>> >> > >                 URL:
>> >> > > https://issues.apache.org/jira/browse/LUCENE-1458
>> >> > >             Project: Lucene - Java
>> >> > >          Issue Type: New Feature
>> >> > >          Components: Index
>> >> > >    Affects Versions: 2.9
>> >> > >            Reporter: Michael McCandless
>> >> > >            Assignee: Michael McCandless
>> >> > >            Priority: Minor
>> >> > >         Attachments: LUCENE-1458-back-compat.patch,
>> >> > > LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
>> >> > > LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>> >> > > LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>> >> > > LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
>> >> > > LUCENE-1458.tar.bz2
>> >> > >
>> >> > >
>> >> > > I attached a very rough checkpoint of my current patch, to get
>> >> > > early
>> >> > > feedback.  All tests pass, though back compat tests don't pass due
>> >> > > to
>> >> > > changes to package-private APIs plus certain bugs in tests that
>> >> > > happened to work (eg call TermPositions.nextPosition() too many
>> >> > > times,
>> >> > > which the new API asserts against).
>> >> > > [Aside: I think, when we commit changes to package-private APIs
>> >> > > such
>> >> > > that back-compat tests don't pass, we could go back, make a branch
>> >> > > on
>> >> > > the back-compat tag, commit changes to the tests to use the new
>> >> > > package private APIs on that branch, then fix nightly build to use
>> >> > > the
>> >> > > tip of that branch?]
>> >> > > There's still plenty to do before this is committable! This is a
>> >> > > rather large change:
>> >> > >   * Switches to a new more efficient terms dict format.  This still
>> >> > >     uses tii/tis files, but the tii only stores term & long offset
>> >> > >     (not a TermInfo).  At seek points, tis encodes term & freq/prox
>> >> > >     offsets absolutely instead of as deltas.  Also, tis/tii
>> >> > >     are structured by field, so we don't have to record field
>> >> > > number
>> >> > >     in every term.
>> >> > >
>> >> > >     On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99
>> >> > > MB
>> >> > >     -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
>> >> > >
>> >> > >     RAM usage when loading terms dict index is significantly less
>> >> > >     since we only load an array of offsets and an array of String
>> >> > > (no
>> >> > >     more TermInfo array).  It should be faster to init too.
>> >> > >
>> >> > >     This part is basically done.
>> >> > >   * Introduces modular reader codec that strongly decouples terms
>> >> > > dict
>> >> > >     from docs/positions readers.  EG there is no more TermInfo used
>> >> > >     when reading the new format.
>> >> > >
>> >> > >     There's nice symmetry now between reading & writing in the
>> >> > > codec
>> >> > >     chain -- the current docs/prox format is captured in:
>> >> > > {code}
>> >> > > FormatPostingsTermsDictWriter/Reader
>> >> > > FormatPostingsDocsWriter/Reader (.frq file) and
>> >> > > FormatPostingsPositionsWriter/Reader (.prx file).
>> >> > > {code}
>> >> > >     This part is basically done.
>> >> > >   * Introduces a new "flex" API for iterating through the fields,
>> >> > >     terms, docs and positions:
>> >> > > {code}
>> >> > > FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
>> >> > > {code}
>> >> > >     This replaces TermEnum/Docs/Positions.  SegmentReader emulates
>> >> > > the
>> >> > >     old API on top of the new API to keep back-compat.
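The idea of writing offsets absolutely at seek points and as deltas elsewhere can be sketched like this. It is only an illustration of the general technique under assumed parameters: the interval of 4, the long[] layout and all names are hypothetical and do not reflect the actual tis/tii file format.

```java
// Minimal sketch of delta coding with absolute values at seek points.
// INTERVAL and the array layout are assumptions, not the real file format.
class SeekPointSketch {

    /** Every INTERVAL-th entry is a seek point (hypothetical value). */
    static final int INTERVAL = 4;

    /** Encodes increasing offsets: absolute at seek points, delta otherwise. */
    static long[] encode(long[] offsets) {
        long[] out = new long[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            out[i] = (i % INTERVAL == 0) ? offsets[i] : offsets[i] - offsets[i - 1];
        }
        return out;
    }

    /** Decodes one entry by starting at the nearest preceding seek point. */
    static long decodeAt(long[] encoded, int index) {
        int seek = index - (index % INTERVAL);
        long value = encoded[seek];          // absolute, no earlier state needed
        for (int i = seek + 1; i <= index; i++) {
            value += encoded[i];             // apply deltas up to the target
        }
        return value;
    }

    public static void main(String[] args) {
        long[] offsets = {10, 25, 40, 100, 130, 131, 200, 260, 300};
        long[] encoded = encode(offsets);
        System.out.println(decodeAt(encoded, 5));
    }
}
```

Because a seek point carries the full value, a reader can jump straight to it and decode forward without replaying everything before it.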
>> >> > >
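The nesting of the enumeration chain described above (fields, then terms, then docs, then positions) can be pictured with a small stand-in. This sketch uses nested sorted maps purely to show the traversal order; it does not use or imitate the signatures of FieldProducer, TermsEnum, DocsEnum or PostingsEnum, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch of a field -> term -> doc -> position walk; not the flex API.
class FlexEnumerationSketch {

    /** Walks field -> term -> docId -> positions in the maps' iteration order. */
    static List<String> walk(Map<String, Map<String, Map<Integer, int[]>>> index) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Map<String, Map<Integer, int[]>>> field : index.entrySet()) {
            for (Map.Entry<String, Map<Integer, int[]>> term : field.getValue().entrySet()) {
                for (Map.Entry<Integer, int[]> doc : term.getValue().entrySet()) {
                    for (int pos : doc.getValue()) {
                        out.add(field.getKey() + ":" + term.getKey()
                                + " doc=" + doc.getKey() + " pos=" + pos);
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, int[]> docs = new TreeMap<>();
        docs.put(7, new int[] {0, 3});
        docs.put(2, new int[] {5});
        Map<String, Map<Integer, int[]>> terms = new TreeMap<>();
        terms.put("lucene", docs);
        Map<String, Map<String, Map<Integer, int[]>>> index = new TreeMap<>();
        index.put("body", terms);
        walk(index).forEach(System.out::println);
    }
}
```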
>> >> > > Next steps:
>> >> > >   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>> >> > >     fix any hidden assumptions.
>> >> > >   * Expose new API out of IndexReader, deprecate old API but
>> >> > > emulate
>> >> > >     old API on top of new one, switch all core/contrib users to the
>> >> > >     new API.
>> >> > >   * Maybe switch to AttributeSources as the base class for
>> >> > > TermsEnum,
>> >> > >     DocsEnum, PostingsEnum -- this would give readers API
>> >> > > flexibility
>> >> > >     (not just index-file-format flexibility).  EG if someone wanted
>> >> > >     to store payload at the term-doc level instead of
>> >> > >     term-doc-position level, you could just add a new attribute.
>> >> > >   * Test performance & iterate.
>> >> >
>> >> > --
>> >> > This message is automatically generated by JIRA.
>> >> > -
>> >> > You can reply to this email to add a comment to the issue online.
>> >> >
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: [email protected]
>> >> > For additional commands, e-mail: [email protected]
>> >>
>> >>
>> >>
>> >>
>> >
>> >
>>
>>
>
>

