[jira] Updated: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values

Uwe Schindler (JIRA) Tue, 07 Apr 2009 03:19:41 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Uwe Schindler updated LUCENE-1582:
----------------------------------

    Attachment: LUCENE-1582.patch

New patch. In my opinion, it is now stable.

New features/changes:
- Attribute "ShiftAttribute" for the new TokenStream API. This makes it 
possible to write consumers of the TokenStream that, for example, index the 
values to different fields depending on the shift value. This only works with 
the new API (as the old Token does not have a field for that).
- Tests for the TokenStreams
- Added the missing initialization of Token in the old TokenStream API
- Reverted the prefix decoder from CharSequence back to String (performance 
was 5% worse during FieldCache filling)
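
As a rough illustration of what a shift-aware consumer could look like, here 
is a minimal Java sketch. It does not use the Lucene API: the class 
TrieTokenSketch, the tokens() and fieldFor() helpers, and the "#shift" field 
naming are all hypothetical, and sign handling for negative values is left 
out. It only shows the idea of one token per precision level, each carrying 
its shift.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for one token of a trie-encoded value; not a Lucene class. */
final class TrieTokenSketch {
    final int shift;   // number of low-order bits dropped for this token
    final long prefix; // the value with those bits removed

    TrieTokenSketch(int shift, long prefix) {
        this.shift = shift;
        this.prefix = prefix;
    }

    /** Emits one token per precision level, starting at full precision (shift 0). */
    static List<TrieTokenSketch> tokens(long value, int valueBits, int precisionStep) {
        List<TrieTokenSketch> out = new ArrayList<>();
        for (int shift = 0; shift < valueBits; shift += precisionStep) {
            out.add(new TrieTokenSketch(shift, value >>> shift));
        }
        return out;
    }

    /** A consumer could pick the target field from the shift value (naming is an assumption). */
    static String fieldFor(String baseName, int shift) {
        return shift == 0 ? baseName : baseName + "#shift" + shift;
    }

    public static void main(String[] args) {
        for (TrieTokenSketch t : tokens(0x1234L, 16, 4)) {
            System.out.println(fieldFor("price", t.shift) + " <- 0x" + Long.toHexString(t.prefix));
        }
    }
}
```

With the old Token class there is no place to carry the shift, which is why a 
consumer like this needs the attribute-based API.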

I think it is ready for commit. I did some further performance tests with an 
index of 10 million indexed trie values:
- Reusing the token streams makes only a marginal difference, at most a 10% 
improvement
- Filling the FieldCache is really fast now; the use of CharSequence was a bad 
idea (nicer API-wise but not for performance, the well-known Java interface 
problem)

I did some statistics on this large index: The average number of terms for 
RangeFilters is 450 for 8bit and 70 for 4bit. This is exactly what I have 
seen with 10000 docs in the test cases and 500000 docs in our PANGAEA index. 
This confirms that the number of terms is *not* related to index size, only 
to the precision step.
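
To make the last point concrete, here is a simplified Java sketch of a trie 
range split that only counts terms. It is an illustration written for this 
message, not the code in the patch: the class and method names are 
hypothetical, and it assumes non-negative values that fit in valueBits 
(at most 32) bits. The count is computed from the range bounds and the 
precision step alone; no index is involved at all.

```java
/** Simplified trie range split that only counts terms; not the code from the patch. */
final class TrieRangeCountSketch {

    /** Number of distinct prefix terms between min and max (inclusive) at the given shift. */
    private static long termsAt(long min, long max, int shift) {
        return (max >>> shift) - (min >>> shift) + 1;
    }

    /**
     * Counts the terms needed to cover [min, max].
     * Assumes 0 <= min, max < 2^valueBits and valueBits <= 32.
     */
    static long countTerms(long min, long max, int valueBits, int precisionStep) {
        if (min > max) {
            return 0;
        }
        long count = 0;
        for (int shift = 0; ; shift += precisionStep) {
            long diff = 1L << (shift + precisionStep);
            long mask = ((1L << precisionStep) - 1L) << shift;
            // Does the range start or end in the middle of a block of the next level?
            boolean hasLower = (min & mask) != 0L;
            boolean hasUpper = (max & mask) != mask;
            long nextMin = (hasLower ? min + diff : min) & ~mask;
            long nextMax = (hasUpper ? max - diff : max) & ~mask;
            if (shift + precisionStep >= valueBits || nextMin > nextMax) {
                // Coarsest level reached, or nothing left for a coarser level.
                return count + termsAt(min, max, shift);
            }
            if (hasLower) {
                count += termsAt(min, min | mask, shift); // ragged lower edge
            }
            if (hasUpper) {
                count += termsAt(max & ~mask, max, shift); // ragged upper edge
            }
            min = nextMin;
            max = nextMax;
        }
    }

    public static void main(String[] args) {
        // Same range, two precision steps; the result never depends on a document count.
        System.out.println("8bit: " + countTerms(1000L, 2000000L, 32, 8));
        System.out.println("4bit: " + countTerms(1000L, 2000000L, 32, 4));
    }
}
```

For example, with 8-bit values and a 4-bit precision step, the range 
0x13..0x5A needs 13 + 11 full-precision terms for the two ragged edges plus 
3 coarse terms for the middle, 27 in total instead of 72 individual values.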

I will do some further speed tests comparing the prefix-encoded FieldCache with 
the conventional int cache using Integer.parseInt(). I expect a big 
improvement because of the simple encoding.

I will also compare the indexing time with the old API and the new tokenizers.

Mike: If you think the changes in FieldCache are OK, can you commit just the 
FieldCache changes?

> Make TrieRange completely independent from Document/Field with TokenStream of 
> prefix encoded values
> ---------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1582
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1582
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/*
>    Affects Versions: 2.9
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 2.9
>
>         Attachments: LUCENE-1582.patch, LUCENE-1582.patch, LUCENE-1582.patch, 
> LUCENE-1582.patch
>
>
> TrieRange currently has the following problems:
> - To add a field that uses a trie encoding, you can manually add each term 
> to the index or use a helper method from TrieUtils. The helper method has the 
> problem that it uses a fixed field configuration
> - TrieUtils currently creates by default a helper field containing the lower 
> precision terms to enable sorting (limitation of one term/document for 
> sorting)
> - trieCodeLong/Int() unnecessarily creates String[] and char[] arrays, which 
> is heavy on GC if you index a lot of numeric values. Also a lot of char[] to 
> String copying is involved.
> This issue should improve this:
> - trieCodeLong/Int() returns a TokenStream. During encoding, all char[] 
> arrays are reused by the Token API, additional String[] arrays for the encoded 
> result are not created, instead the TokenStream enumerates the trie values.
> - Trie fields can be added to Documents during indexing using the standard 
> API: new Field(name,TokenStream,...), so no extra util method is needed. By 
> using token filters, one could also add payloads and so on and customize 
> everything.
> The drawback is: Sorting would not work anymore. To enable sorting, a 
> (sub-)issue can extend the FieldCache to stop iterating the terms as soon as 
> a lower precision one is enumerated by TermEnum. I will create a "hack" patch 
> for TrieUtils-use only, that uses a non-checked Exception in the Parser to 
> stop iteration. With LUCENE-831, a more generic API for this type can be used 
> (custom parser/iterator implementation for FieldCache). I will attach the 
> field cache patch (with the temporary solution, until FieldCache is 
> reimplemented) as a separate patch file, or maybe open another issue for it.
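
The "stop iteration with a non-checked Exception" idea from the description 
can be sketched in plain Java as follows. Everything here is hypothetical and 
only meant to show the control flow: the exception name, the Parser 
interface, and the "shift:value" term layout are assumptions, not what the 
patch or FieldCache actually uses. It also assumes that terms are enumerated 
in an order where all full-precision terms come first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class StopFillSketch {

    /** Hypothetical unchecked signal; the name is an assumption, not taken from the patch. */
    static final class StopFillException extends RuntimeException {
        StopFillException() {
            super(null, null, false, false); // no stack trace, it is only used as control flow
        }
    }

    /** Hypothetical parser interface, standing in for a field cache parser. */
    interface Parser {
        long parse(String term);
    }

    /** Parses terms of the made-up form "shift:value" and aborts on the first lower precision term. */
    static final Parser FULL_PRECISION_ONLY = term -> {
        int sep = term.indexOf(':');
        int shift = Integer.parseInt(term.substring(0, sep));
        if (shift > 0) {
            throw new StopFillException();
        }
        return Long.parseLong(term.substring(sep + 1));
    };

    /** Fills the "cache" until the parser signals that the remaining terms can be skipped. */
    static List<Long> fill(List<String> sortedTerms, Parser parser) {
        List<Long> values = new ArrayList<>();
        try {
            for (String term : sortedTerms) {
                values.add(parser.parse(term));
            }
        } catch (StopFillException stop) {
            // All remaining terms are lower precision; nothing more to cache.
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(fill(Arrays.asList("0:3", "0:7", "4:0"), FULL_PRECISION_ONLY));
    }
}
```

The caller's loop needs no knowledge of the trie encoding; only the parser 
decides when to stop, which is what makes this usable as a temporary hack 
until a more generic iterator API exists.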

