RE: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

Uwe Schindler Thu, 24 Sep 2009 06:20:59 -0700

By the way: in the last RC of Lucene 2.9 we added a new method to DocIdSet
called isCacheable(). It is used by e.g. CachingWrapperFilter to determine
whether a DocIdSet can be cached as-is or must first be copied to an
OpenBitSetDISI. The default is false, so CachingWrapperFilter copies all
custom DocIdSets to an OpenBitSetDISI, even when that is not needed. If a
DocIdSet does no disk IO and has a fast iterator, like e.g. the FieldCache
ones in FieldCacheRangeFilter, it should return true (see CHANGES.txt). Maybe
this should also be added to Kamikaze, which is a really nice project! In
particular, filtering DocIdSets should forward this method to their delegate
(see FilteredDocIdSet in Lucene).
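
To make the delegation point concrete, here is a minimal, self-contained
sketch. The types in it (SketchDocIdSet, IntArrayDocIdSet, EvenOnlyDocIdSet,
toCacheable) are hypothetical stand-ins invented for illustration, not the
real Lucene or Kamikaze classes. Only the contract is taken from the text
above: isCacheable() defaults to false, an in-memory set returns true, and a
wrapper forwards the call to its delegate.

```java
import java.util.Arrays;

class IsCacheableSketch {

    /** Hypothetical stand-in for a doc id set; NOT Lucene's DocIdSet. */
    static abstract class SketchDocIdSet {
        /** Returns the matching doc ids in increasing order. */
        abstract int[] docIds();

        /** Defaults to false, mirroring the default described in the message. */
        boolean isCacheable() {
            return false;
        }
    }

    /** In-memory set: no disk IO and cheap to iterate, so it reports true. */
    static final class IntArrayDocIdSet extends SketchDocIdSet {
        private final int[] ids;

        IntArrayDocIdSet(int... ids) {
            this.ids = ids.clone();
        }

        @Override
        int[] docIds() {
            return ids.clone();
        }

        @Override
        boolean isCacheable() {
            return true;
        }
    }

    /** Filtering wrapper: keeps even ids only and forwards isCacheable(). */
    static final class EvenOnlyDocIdSet extends SketchDocIdSet {
        private final SketchDocIdSet delegate;

        EvenOnlyDocIdSet(SketchDocIdSet delegate) {
            this.delegate = delegate;
        }

        @Override
        int[] docIds() {
            return Arrays.stream(delegate.docIds()).filter(id -> id % 2 == 0).toArray();
        }

        @Override
        boolean isCacheable() {
            // The wrapper adds no IO of its own, so the delegate decides.
            return delegate.isCacheable();
        }
    }

    /** Caching helper: reuse a cacheable set, otherwise copy it into memory. */
    static SketchDocIdSet toCacheable(SketchDocIdSet set) {
        return set.isCacheable() ? set : new IntArrayDocIdSet(set.docIds());
    }

    public static void main(String[] args) {
        SketchDocIdSet wrapped = new EvenOnlyDocIdSet(new IntArrayDocIdSet(1, 2, 3, 4));
        System.out.println("wrapper cacheable: " + wrapped.isCacheable());
        System.out.println("reused as-is: " + (toCacheable(wrapped) == wrapped));
    }
}
```

If the wrapper did not forward the call, it would inherit the false default
and the caching helper would copy it every time, even though the underlying
set is already cheap to keep around.
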


-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]


> -----Original Message-----
> From: John Wang (JIRA) [mailto:[email protected]]
> Sent: Thursday, September 24, 2009 3:14 PM
> To: [email protected]
> Subject: [jira] Commented: (LUCENE-1458) Further steps towards flexible
> indexing
> 
> 
>     [ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759112#action_12759112 ]
> 
> John Wang commented on LUCENE-1458:
> -----------------------------------
> 
> Just an FYI: Kamikaze was originally started as our sandbox for Lucene
> contributions until 2.4 was ready (we needed the DocIdSet/Iterator
> abstraction that was migrated from Solr).
> 
> It has three components:
> 
> 1) P4Delta
> 2) Logical boolean operations on DocIdSet/Iterators (I created a jira
> ticket and a patch for Lucene a while ago with performance numbers. It is
> significantly faster than DisjunctionScorer)
> 3) An algorithm to determine which DocIdSet implementation to use given some
> parameters, e.g. minId, maxId, id count etc. It learns and adjusts from the
> application behavior if not all parameters are given.
> 
> So please feel free to incorporate anything you see fit or move it to
> contrib.
> 
> 
> > Further steps towards flexible indexing
> > ---------------------------------------
> >
> >                 Key: LUCENE-1458
> >                 URL: https://issues.apache.org/jira/browse/LUCENE-1458
> >             Project: Lucene - Java
> >          Issue Type: New Feature
> >          Components: Index
> >    Affects Versions: 2.9
> >            Reporter: Michael McCandless
> >            Assignee: Michael McCandless
> >            Priority: Minor
> >         Attachments: LUCENE-1458-back-compat.patch, LUCENE-1458-back-
> compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-
> 1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
> LUCENE-1458.patch, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-
> 1458.tar.bz2, LUCENE-1458.tar.bz2
> >
> >
> > I attached a very rough checkpoint of my current patch, to get early
> > feedback.  All tests pass, though back compat tests don't pass due to
> > changes to package-private APIs plus certain bugs in tests that
> > happened to work (eg call TermPositions.nextPosition() too many times,
> > which the new API asserts against).
> > [Aside: I think, when we commit changes to package-private APIs such
> > that back-compat tests don't pass, we could go back, make a branch on
> > the back-compat tag, commit changes to the tests to use the new
> > package private APIs on that branch, then fix nightly build to use the
> > tip of that branch?]
> > There's still plenty to do before this is committable! This is a
> > rather large change:
> >   * Switches to a new more efficient terms dict format.  This still
> >     uses tii/tis files, but the tii only stores term & long offset
> >     (not a TermInfo).  At seek points, tis encodes term & freq/prox
> >     offsets absolutely instead of with deltas.  Also, tis/tii
> >     are structured by field, so we don't have to record field number
> >     in every term.
> > .
> >     On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> >     -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> > .
> >     RAM usage when loading terms dict index is significantly less
> >     since we only load an array of offsets and an array of String (no
> >     more TermInfo array).  It should be faster to init too.
> > .
> >     This part is basically done.
> >   * Introduces modular reader codec that strongly decouples terms dict
> >     from docs/positions readers.  EG there is no more TermInfo used
> >     when reading the new format.
> > .
> >     There's nice symmetry now between reading & writing in the codec
> >     chain -- the current docs/prox format is captured in:
> > {code}
> > FormatPostingsTermsDictWriter/Reader
> > FormatPostingsDocsWriter/Reader (.frq file) and
> > FormatPostingsPositionsWriter/Reader (.prx file).
> > {code}
> >     This part is basically done.
> >   * Introduces a new "flex" API for iterating through the fields,
> >     terms, docs and positions:
> > {code}
> > FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> > {code}
> >     This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> >     old API on top of the new API to keep back-compat.
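
One rough way to picture the enumeration chain quoted above (fields, then
terms, then docs, then positions) is four nested loops, each level obtained
from the one above it. The sketch below is only an assumption-laden
illustration: it uses plain in-memory sorted maps as hypothetical stand-ins
and does not use the patch's FieldProducer, TermsEnum, DocsEnum, or
PostingsEnum classes, whose actual signatures are not shown in this message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FlexChainSketch {

    /** Creates an empty toy index: field -> term -> docId -> positions. */
    static Map<String, Map<String, Map<Integer, int[]>>> newIndex() {
        // Sorted maps give a stable, ordered enumeration at every level.
        return new TreeMap<>();
    }

    /** Records that term occurs in field of docId at the given positions. */
    static void add(Map<String, Map<String, Map<Integer, int[]>>> index,
                    String field, String term, int docId, int... positions) {
        index.computeIfAbsent(field, f -> new TreeMap<>())
             .computeIfAbsent(term, t -> new TreeMap<>())
             .put(docId, positions);
    }

    /** Walks fields -> terms -> docs -> positions and lists every posting. */
    static List<String> walk(Map<String, Map<String, Map<Integer, int[]>>> index) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Map<String, Map<Integer, int[]>>> field : index.entrySet()) {
            for (Map.Entry<String, Map<Integer, int[]>> term : field.getValue().entrySet()) {
                for (Map.Entry<Integer, int[]> doc : term.getValue().entrySet()) {
                    for (int pos : doc.getValue()) {
                        out.add(field.getKey() + ":" + term.getKey()
                                + " doc=" + doc.getKey() + " pos=" + pos);
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Map<Integer, int[]>>> index = newIndex();
        add(index, "body", "lucene", 0, 3, 7);
        add(index, "body", "index", 1, 2);
        add(index, "title", "lucene", 0, 0);
        walk(index).forEach(System.out::println);
    }
}
```

Because each level is reached only through the level above it, the per-field
structure of the terms dict maps naturally onto the outermost loop.
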
> >
> > Next steps:
> >   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> >     fix any hidden assumptions.
> >   * Expose new API out of IndexReader, deprecate old API but emulate
> >     old API on top of new one, switch all core/contrib users to the
> >     new API.
> >   * Maybe switch to AttributeSources as the base class for TermsEnum,
> >     DocsEnum, PostingsEnum -- this would give readers API flexibility
> >     (not just index-file-format flexibility).  EG if someone wanted
> >     to store payload at the term-doc level instead of
> >     term-doc-position level, you could just add a new attribute.
> >   * Test performance & iterate.
> 
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]



