Merge away - still sleeping over here. Would love to take another look, but I don't know when, so no use waiting on me.

- Mark

http://www.lucidimagination.com (mobile)

On Oct 6, 2009, at 5:54 AM, "Michael McCandless (JIRA)" <j...@apache.org> wrote:


[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762573#action_12762573 ]

Michael McCandless commented on LUCENE-1458:
--------------------------------------------

Whoa, thanks for the sudden sprint, Mark!

bq. Come on old man, stop clinging to emacs

Hey!  I'm not so old :) But yeah I still cling to emacs.  Hey, I know
people who still cling to vi!

{quote}
I didn't really look at the code, but some stuff I noticed:

java 6 in pfor Arrays.copy

skiplist stuff in codecs still have package of index - not sure what is going on there - changed them

in IndexWriter:
+ // Mark: read twice?
segmentInfos.read(directory);
+ segmentInfos.read(directory, codecs);
{quote}

Excellent catches!  None of these are right.
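
For the IndexWriter double-read in particular, my guess at the fix is
simply dropping the old call and keeping only the codec-aware overload
from the patch -- a minimal sketch, not committed code:

{code}
// Hedged sketch of the fix: read the segments file once, through the
// codec-aware overload added on the flex branch.
segmentInfos.read(directory, codecs);
{code}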

bq. (since you don't include contrib in the tar)

Gak, sorry.  I have a bunch of mods there, cutting over to flex API.

bq. You left getEnum(IndexReader reader) in the MultiTerm queries, but not in PrefixQuery - just checkin'.

Woops, for back compat I think we need to leave it in (it's a
protected method), deprecated.  I'll put it back if you haven't.
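
Roughly what putting it back might look like, assuming the pre-flex
PrefixTermEnum is still on the branch (just a sketch, not tested):

{code}
// Sketch only: restore the old protected hook in PrefixQuery, deprecated,
// so existing subclasses that override or call it keep compiling.
/** @deprecated Use the new flex enum API instead. */
protected FilteredTermEnum getEnum(IndexReader reader) throws IOException {
  return new PrefixTermEnum(reader, getPrefix());
}
{code}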

bq. I guess TestBackwardsCompatibility.java has been removed from trunk or something? kept it here for now.

Eek, it shouldn't have been removed -- but indeed it's gone.  When did
that happen?  We should fix that (separately from this issue!).

Do you have more fixes coming?  If so, I'll let you sprint some more; otherwise, I'll merge your changes in, add the contrib & back-compat branch mods, and post a new patch!  Thanks :)


Further steps towards flexible indexing
---------------------------------------

               Key: LUCENE-1458
               URL: https://issues.apache.org/jira/browse/LUCENE-1458
           Project: Lucene - Java
        Issue Type: New Feature
        Components: Index
  Affects Versions: 2.9
          Reporter: Michael McCandless
          Assignee: Michael McCandless
          Priority: Minor
Attachments: LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2


I attached a very rough checkpoint of my current patch, to get early
feedback.  All tests pass, though the back-compat tests don't pass due
to changes to package-private APIs, plus certain bugs in tests that
happened to work before (eg calling TermPositions.nextPosition() too
many times, which the new API asserts against).
[Aside: I think, when we commit changes to package-private APIs such
that back-compat tests don't pass, we could go back, make a branch on
the back-compat tag, commit changes to the tests to use the new
package private APIs on that branch, then fix nightly build to use the
tip of that branch?]
There's still plenty to do before this is committable! This is a
rather large change:
 * Switches to a new more efficient terms dict format.  This still
   uses tii/tis files, but the tii only stores term & long offset
   (not a TermInfo).  At seek points, tis encodes term & freq/prox
   offsets absolutely instead of as deltas.  Also, tis/tii
   are structured by field, so we don't have to record field number
   in every term.
.
   On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
   -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
.
   RAM usage when loading terms dict index is significantly less
   since we only load an array of offsets and an array of String (no
   more TermInfo array).  It should be faster to init too.
.
   This part is basically done.  (A small encoding sketch follows this list.)
 * Introduces modular reader codec that strongly decouples terms dict
   from docs/positions readers.  EG there is no more TermInfo used
   when reading the new format.
.
   There's nice symmetry now between reading & writing in the codec
   chain -- the current docs/prox format is captured in:
{code}
FormatPostingsTermsDictWriter/Reader
FormatPostingsDocsWriter/Reader (.frq file) and
FormatPostingsPositionsWriter/Reader (.prx file).
{code}
   This part is basically done.
 * Introduces a new "flex" API for iterating through the fields,
   terms, docs and positions:
{code}
FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
{code}
   This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
   old API on top of the new API to keep back-compat.  (An
   illustrative walk of this chain is sketched after the list.)
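
To make the "absolute at seek points, deltas in between" idea from the
first bullet concrete, here's a tiny self-contained sketch; the interval
of 128 and the plain DataOutputStream are stand-ins for illustration,
not the actual format code:

{code}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class SeekPointOffsetSketch {
  static final int INDEX_INTERVAL = 128; // assumption; the real interval differs

  // Write freq-file offsets: absolute at seek points so a reader can jump
  // straight there, deltas elsewhere so the numbers stay small.
  static void writeOffsets(long[] freqOffsets, DataOutputStream out) throws IOException {
    long last = 0;
    for (int i = 0; i < freqOffsets.length; i++) {
      if (i % INDEX_INTERVAL == 0) {
        out.writeLong(freqOffsets[i]);        // absolute encoding at a seek point
      } else {
        out.writeLong(freqOffsets[i] - last); // delta encoding between seek points
      }
      last = freqOffsets[i];
    }
  }

  public static void main(String[] args) throws IOException {
    writeOffsets(new long[] {0, 17, 40, 90, 130},
                 new DataOutputStream(new ByteArrayOutputStream()));
  }
}
{code}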
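
To make the new enum chain from the third bullet concrete, here's an
illustrative walk of it using toy interfaces; every name and method
below is an assumption about the eventual flex API shape, not code from
the patch:

{code}
// Illustrative only: toy interfaces mirroring the proposed chain, plus a
// consumer walking them.  Method names are assumptions, not the patch's API.
interface FieldProducer { TermsEnum terms(String field); }
interface TermsEnum     { String next(); DocsEnum docs(); }
interface DocsEnum      { int NO_MORE_DOCS = Integer.MAX_VALUE;
                          int next(); int freq(); PostingsEnum positions(); }
interface PostingsEnum  { int nextPosition(); }

class FlexChainWalk {
  static void walk(FieldProducer fields, String field) {
    TermsEnum terms = fields.terms(field);
    while (terms.next() != null) {                    // each term in the field
      DocsEnum docs = terms.docs();
      while (docs.next() != DocsEnum.NO_MORE_DOCS) {  // each doc for the term
        PostingsEnum postings = docs.positions();
        for (int i = 0; i < docs.freq(); i++) {       // each position in the doc
          int pos = postings.nextPosition();
        }
      }
    }
  }
}
{code}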

Next steps:
 * Plug in new codecs (pulsing, pfor) to exercise the modularity /
   fix any hidden assumptions.
 * Expose new API out of IndexReader, deprecate old API but emulate
   old API on top of new one, switch all core/contrib users to the
   new API.
 * Maybe switch to AttributeSources as the base class for TermsEnum,
   DocsEnum, PostingsEnum -- this would give readers API flexibility
   (not just index-file-format flexibility).  EG if someone wanted
   to store a payload at the term-doc level instead of the
   term-doc-position level, they could just add a new attribute (a
   hedged sketch follows this list).
 * Test performance & iterate.
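
For the AttributeSources idea above, here's a hedged sketch of what a
term-doc-level payload attribute could look like, reusing the existing
org.apache.lucene.util Attribute/AttributeImpl machinery from the
analysis API; the names DocPayloadAttribute/getDocPayload are made up
for illustration:

{code}
import org.apache.lucene.util.Attribute;
import org.apache.lucene.util.AttributeImpl;

// Hypothetical attribute: a payload attached per term-doc instead of per position.
interface DocPayloadAttribute extends Attribute {
  byte[] getDocPayload();
  void setDocPayload(byte[] payload);
}

class DocPayloadAttributeImpl extends AttributeImpl implements DocPayloadAttribute {
  private byte[] payload;
  public byte[] getDocPayload() { return payload; }
  public void setDocPayload(byte[] payload) { this.payload = payload; }
  @Override public void clear() { payload = null; }
  @Override public void copyTo(AttributeImpl target) {
    ((DocPayloadAttribute) target).setDocPayload(payload);
  }
  @Override public boolean equals(Object other) {
    return other instanceof DocPayloadAttributeImpl
        && java.util.Arrays.equals(((DocPayloadAttributeImpl) other).payload, payload);
  }
  @Override public int hashCode() {
    return java.util.Arrays.hashCode(payload);
  }
}
{code}

If DocsEnum extended AttributeSource, a codec could then expose this via
something like docsEnum.addAttribute(DocPayloadAttribute.class) without
touching the core API -- again, purely illustrative.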

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

