Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

Mark Miller Thu, 08 Oct 2009 13:10:34 -0700

Yup - you need one for anything developed outside of Apache.

Michael McCandless wrote:
> Well, it's the usual process... pull together a big patch, open an issue, etc.
>
> Probably because it's a large amount of code (I think?) you'll need to
> submit a software grant
> (http://www.apache.org/licenses/software-grant.txt).
>
> Mike
>
> On Thu, Oct 8, 2009 at 2:58 PM, John Wang <john.w...@gmail.com> wrote:
>   
>> Awesome!
>>
>> Mike, can you let us know what the process is and the time line?
>>
>> Thanks
>>
>> -John
>>
>> On Thu, Oct 8, 2009 at 11:48 AM, Michael McCandless
>> <luc...@mikemccandless.com> wrote:
>>     
>>> +1!
>>>
>>> Mike
>>>
>>> On Thu, Oct 8, 2009 at 2:41 PM, John Wang <john.w...@gmail.com> wrote:
>>>       
>>>> Hi guys:
>>>>
>>>>      What are your thoughts about contributing Kamikaze as a Lucene contrib
>>>> package? We just finished porting Kamikaze to Lucene 2.9. The new 2.9 API
>>>> allows us some more code tuning and optimization improvements.
>>>>
>>>>      We will be releasing Kamikaze, so it might be a good time to add it to
>>>> the Lucene contrib package if there is interest.
>>>>
>>>> Thanks
>>>>
>>>> -John
>>>>
>>>> On Thu, Sep 24, 2009 at 6:20 AM, Uwe Schindler <u...@thetaphi.de> wrote:
>>>>         
>>>>> By the way: In the last RC of Lucene 2.9 we added a new method to DocIdSet
>>>>> called isCacheable(). It is used by e.g. CachingWrapperFilter to determine
>>>>> whether a DocIdSet is easily cacheable or must be copied to an
>>>>> OpenBitSetDISI. The default is false, so all custom DocIdSets are copied to
>>>>> OpenBitSetDISI by CachingWrapperFilter, even if not needed. If a DocIdSet
>>>>> does not do disk IO and has a fast iterator, like e.g. the FieldCache ones
>>>>> in FieldCacheRangeFilter, it should return true; see CHANGES.txt. Maybe this
>>>>> should also be added to Kamikaze, which is a really nice project!
>>>>> Especially filter DocIdSets should pass this method to their delegate (see
>>>>> FilteredDocIdSet in Lucene).
>>>>>
>>>>> -----
>>>>> Uwe Schindler
>>>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>>>> http://www.thetaphi.de
>>>>> eMail: u...@thetaphi.de
>>>>>
>>>>>
>>>>>           
>>>>>> -----Original Message-----
>>>>>> From: John Wang (JIRA) [mailto:j...@apache.org]
>>>>>> Sent: Thursday, September 24, 2009 3:14 PM
>>>>>> To: java-dev@lucene.apache.org
>>>>>> Subject: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
>>>>>>
>>>>>>
>>>>>>     [ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759112#action_12759112 ]
>>>>>>
>>>>>> John Wang commented on LUCENE-1458:
>>>>>> -----------------------------------
>>>>>>
>>>>>> Just an FYI: Kamikaze was originally started as our sandbox for Lucene
>>>>>> contributions until 2.4 was ready (we needed the DocIdSet/Iterator
>>>>>> abstraction that was migrated from Solr).
>>>>>>
>>>>>> It has three components:
>>>>>>
>>>>>> 1) P4Delta
>>>>>> 2) Logical boolean operations on DocIdSet/Iterators (I created a jira
>>>>>> ticket and a patch for Lucene a while ago with performance numbers. It is
>>>>>> significantly faster than DisjunctionScorer)
>>>>>> 3) An algorithm to determine which DocIdSet implementations to use given
>>>>>> some parameters, e.g. minId, maxId, id count etc. It learns and adjusts
>>>>>> from the application behavior if not all parameters are given.
>>>>>>
>>>>>> So please feel free to incorporate anything you see fit or move it to
>>>>>> contrib.
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> Further steps towards flexible indexing
>>>>>>> ---------------------------------------
>>>>>>>
>>>>>>>                 Key: LUCENE-1458
>>>>>>>                 URL: https://issues.apache.org/jira/browse/LUCENE-1458
>>>>>>>             Project: Lucene - Java
>>>>>>>          Issue Type: New Feature
>>>>>>>          Components: Index
>>>>>>>    Affects Versions: 2.9
>>>>>>>            Reporter: Michael McCandless
>>>>>>>            Assignee: Michael McCandless
>>>>>>>            Priority: Minor
>>>>>>>         Attachments: LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
>>>>>>>                      LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>>>>>>>                      LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>>>>>>>                      LUCENE-1458.patch, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
>>>>>>>                      LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2
>>>>>>             
>>>>>>> I attached a very rough checkpoint of my current patch, to get early
>>>>>>> feedback.  All tests pass, though back compat tests don't pass due to
>>>>>>> changes to package-private APIs plus certain bugs in tests that
>>>>>>> happened to work (eg call TermPositions.nextPosition() too many times,
>>>>>>> which the new API asserts against).
>>>>>>> [Aside: I think, when we commit changes to package-private APIs such
>>>>>>> that back-compat tests don't pass, we could go back, make a branch on
>>>>>>> the back-compat tag, commit changes to the tests to use the new
>>>>>>> package private APIs on that branch, then fix nightly build to use the
>>>>>>> tip of that branch?]
>>>>>>> There's still plenty to do before this is committable! This is a
>>>>>>> rather large change:
>>>>>>>   * Switches to a new more efficient terms dict format.  This still
>>>>>>>     uses tii/tis files, but the tii only stores term & long offset
>>>>>>>     (not a TermInfo).  At seek points, tis encodes term & freq/prox
>>>>>>>     offsets absolutely instead of with deltas.  Also, tis/tii
>>>>>>>     are structured by field, so we don't have to record field number
>>>>>>>     in every term.
>>>>>>> .
>>>>>>>     On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
>>>>>>>     -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
>>>>>>> .
>>>>>>>     RAM usage when loading terms dict index is significantly less
>>>>>>>     since we only load an array of offsets and an array of String (no
>>>>>>>     more TermInfo array).  It should be faster to init too.
>>>>>>> .
>>>>>>>     This part is basically done.
>>>>>>>   * Introduces modular reader codec that strongly decouples terms dict
>>>>>>>     from docs/positions readers.  EG there is no more TermInfo used
>>>>>>>     when reading the new format.
>>>>>>> .
>>>>>>>     There's nice symmetry now between reading & writing in the codec
>>>>>>>     chain -- the current docs/prox format is captured in:
>>>>>>> {code}
>>>>>>> FormatPostingsTermsDictWriter/Reader
>>>>>>> FormatPostingsDocsWriter/Reader (.frq file) and
>>>>>>> FormatPostingsPositionsWriter/Reader (.prx file).
>>>>>>> {code}
>>>>>>>     This part is basically done.
>>>>>>>   * Introduces a new "flex" API for iterating through the fields,
>>>>>>>     terms, docs and positions:
>>>>>>> {code}
>>>>>>> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
>>>>>>> {code}
>>>>>>>     This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
>>>>>>>     old API on top of the new API to keep back-compat.
>>>>>>>
>>>>>>> Next steps:
>>>>>>>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>>>>>>>     fix any hidden assumptions.
>>>>>>>   * Expose new API out of IndexReader, deprecate old API but emulate
>>>>>>>     old API on top of new one, switch all core/contrib users to the
>>>>>>>     new API.
>>>>>>>   * Maybe switch to AttributeSources as the base class for TermsEnum,
>>>>>>>     DocsEnum, PostingsEnum -- this would give readers API flexibility
>>>>>>>     (not just index-file-format flexibility).  EG if someone wanted
>>>>>>>     to store payload at the term-doc level instead of
>>>>>>>     term-doc-position level, you could just add a new attribute.
>>>>>>>   * Test performance & iterate.
>>>>>>>               
>>>>>> --
>>>>>> This message is automatically generated by JIRA.
>>>>>> -
>>>>>> You can reply to this email to add a comment to the issue online.
>>>>>>
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>>>>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>>>>>             
>>>>>
>>>>>
>>>>>           
>>>>         
>>>
>>>       
>>     
>
>
>



-- 
- Mark

http://www.lucidimagination.com




