[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783488#action_12783488 ]

Uwe Schindler edited comment on LUCENE-1458 at 11/29/09 10:16 PM:
------------------------------------------------------------------

bq. A partial solution for you which does work with tokenstreams, you could use 
indexablebinarystring which won't change between any unicode sort order... (it 
will not encode in any unicode range where there is a difference between the 
UTF-8/UTF32 and UTF-16). With this you could just compare bytes also, but you 
still would not have the "full 8 bits per byte"

This would not change anything; it would only make the format incompatible. With 
7 bits/char, the current UTF-8-encoded index is the smallest possible one: even 
IndexableBinaryString would cost more bytes in the index, because if you used 14 
of the 16 bits/char, most chars would take 3 bytes in the index (because of 
UTF-8) vs. 2 bytes for the same payload with the current encoding. Only the 
char[]/String representation would take less space than it does now. See the 
discussion with Yonik about this and why we chose 7 bits/char; en-/decoding is 
also much faster.
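
As a quick back-of-the-envelope check (my own sketch; the 0x800 offset is only 
for illustration, IndexableBinaryStringTools uses a different mapping):

{code}
import java.io.UnsupportedEncodingException;

public class EncodingSizeDemo {
  // Pack a 64-bit value into chars carrying bitsPerChar payload bits each,
  // shifting each char by charOffset to place it in a given UTF-8 range.
  static String pack(long value, int bitsPerChar, int charOffset) {
    StringBuilder sb = new StringBuilder();
    for (int shift = 0; shift < 64; shift += bitsPerChar) {
      int payload = (int) ((value >>> shift) & ((1L << bitsPerChar) - 1));
      sb.append((char) (payload + charOffset));
    }
    return sb.toString();
  }

  public static void main(String[] args) throws UnsupportedEncodingException {
    long v = 0x123456789ABCDEFL;
    // 7 bits/char: 10 chars, all < 0x80 -> 1 byte each in UTF-8 = 10 bytes
    System.out.println(pack(v, 7, 0).getBytes("UTF-8").length);
    // 14 bits/char: 5 chars, all >= 0x800 -> 3 bytes each in UTF-8 = 15 bytes
    System.out.println(pack(v, 14, 0x800).getBytes("UTF-8").length);
  }
}
{code}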

For the TokenStreams: the idea is to create an additional attribute, 
BinaryTermAttribute, that holds a byte[]. If a token stream uses this attribute 
instead of TermAttribute, the indexer would write the bytes directly to the 
index. NumericTokenStream could use this attribute and encode the numbers 
directly to byte[] with the full 8 bits/byte. The new AttributeSource API was 
created exactly for such customizations (this was not possible with Token).
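
A rough sketch of what such an attribute could look like (the name and methods 
are only a proposal, nothing like this exists yet):

{code}
import org.apache.lucene.util.Attribute;

/**
 * Proposed (sketch only): lets a TokenStream hand raw bytes to the
 * indexer instead of a char[] term.
 */
public interface BinaryTermAttribute extends Attribute {
  /** Sets the bytes of the current token. */
  void setBytes(byte[] bytes, int offset, int length);
  byte[] bytes();
  int length();
}
{code}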

> Further steps towards flexible indexing
> ---------------------------------------
>
>                 Key: LUCENE-1458
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1458
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.9
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, 
> UnicodeTestCase.patch, UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (e.g. calling TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
>     uses tii/tis files, but the tii only stores term & long offset
>     (not a TermInfo).  At seek points, tis encodes term & freq/prox
>     offsets absolutely instead of as deltas.  Also, tis/tii
>     are structured by field, so we don't have to record field number
>     in every term.
> .
>     On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
>     -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
>     RAM usage when loading terms dict index is significantly less
>     since we only load an array of offsets and an array of String (no
>     more TermInfo array).  It should be faster to init too.
> .
>     This part is basically done.
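> .
>     To illustrate the seek-point idea (my own simplified sketch with
>     an illustrative interval, not the actual file format):
> {code}
> // Entries at multiples of INDEX_INTERVAL store their offset absolutely,
> // so a reader can seek straight to them; all other entries store a
> // delta against the previous entry to stay small.
> class SeekPointSketch {
>   static final int INDEX_INTERVAL = 128; // illustrative value
> 
>   static long encode(int termNum, long offset, long prevOffset) {
>     if (termNum % INDEX_INTERVAL == 0) {
>       return offset;               // absolute at a seek point
>     }
>     return offset - prevOffset;    // delta everywhere else
>   }
> }
> {code}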
>   * Introduces modular reader codec that strongly decouples terms dict
>     from docs/positions readers.  EG there is no more TermInfo used
>     when reading the new format.
> .
>     There's nice symmetry now between reading & writing in the codec
>     chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
>     This part is basically done.
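> .
>     Schematically, the decoupling might look like this (interface
>     names are only illustrative; the actual classes are the ones
>     listed above):
> {code}
> // Sketch: the terms dict only hands out offsets; separate readers
> // decode the .frq/.prx streams from there, so either side can be
> // swapped independently.
> interface TermsDictReader {
>   long freqOffset(String field, String term);  // where the postings start
> }
> interface DocsReader {
>   void seek(long freqOffset);                  // decode docs/freqs from there
> }
> {code}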
>   * Introduces a new "flex" API for iterating through the fields,
>     terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
>     This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
>     old API on top of the new API to keep back-compat.
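> .
>     Hypothetical shapes for the new chain (a sketch only -- method
>     names are guesses from the enum names above, not the committed
>     API):
> {code}
> // Each level hands out the next: fields -> terms -> docs -> positions.
> interface FieldProducer { TermsEnum terms(String field); }
> interface TermsEnum     { boolean next(); DocsEnum docs(); }
> interface DocsEnum      { boolean next(); int doc(); PostingsEnum positions(); }
> interface PostingsEnum  { boolean next(); int position(); }
> {code}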
>     
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>     fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
>     old API on top of new one, switch all core/contrib users to the
>     new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
>     DocsEnum, PostingsEnum -- this would give readers API flexibility
>     (not just index-file-format flexibility).  EG if someone wanted
>     to store a payload at the term-doc level instead of the
>     term-doc-position level, you could just add a new attribute (see
>     the sketch after this list).
>   * Test performance & iterate.
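> .
>     A sketch of that last idea (purely hypothetical -- nothing like
>     this is in the patch):
> {code}
> import org.apache.lucene.util.Attribute;
> 
> // Hypothetical attribute: one payload per term-doc instead of per position.
> public interface DocPayloadAttribute extends Attribute {
>   byte[] payload();
> }
> // A codec storing doc-level payloads would expose it on its DocsEnum:
> //   DocPayloadAttribute pl = docsEnum.addAttribute(DocPayloadAttribute.class);
> {code}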
