[
https://issues.apache.org/jira/browse/LUCENE-602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558409#action_12558409
]
Grant Ingersoll commented on LUCENE-602:
----------------------------------------
I think, if I understand the problem correctly, that the new TeeTokenFilter and
SinkTokenizer could also solve this problem, right Chuck?
> [PATCH] Filtering tokens for position and term vector storage
> -------------------------------------------------------------
>
> Key: LUCENE-602
> URL: https://issues.apache.org/jira/browse/LUCENE-602
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Affects Versions: 2.1
> Reporter: Chuck Williams
> Priority: Minor
> Attachments: TokenSelectorAllWithParallelWriter.patch,
> TokenSelectorSoloAll.patch
>
>
> This patch provides a new TokenSelector mechanism to select tokens of
> interest and creates two new IndexWriter configuration parameters:
> termVectorTokenSelector and positionsTokenSelector.
> termVectorTokenSelector, if non-null, selects which index tokens will be
> stored in term vectors. If positionsTokenSelector is non-null, then any
> tokens it rejects will have only their first position stored in each document
> (one position must be stored so that the doc freq stays correct and the
> token is not garbage collected in merges).
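
Illustrative sketch (not part of the patch): the selector behaviour described above can be modelled in plain Java without any Lucene classes. The TokenSelector interface, its accept method, the marker character used to recognise reverse tokens, and the helper names are all assumptions made for this sketch; the actual patch may define them differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the patch's TokenSelector; the real signature may differ.
interface TokenSelector {
    boolean accept(String termText);
}

final class SelectorSketch {
    // Assumption for this sketch only: reverse tokens start with a marker character.
    static final char REVERSE_MARKER = '\u0001';

    // Selector that rejects reverse tokens and accepts everything else.
    static final TokenSelector REJECT_REVERSE =
        term -> term.isEmpty() || term.charAt(0) != REVERSE_MARKER;

    // Maps each term to the positions that would be stored for one document.
    // A null selector keeps every position; a rejected term keeps only its
    // first position, so the term still counts toward the doc freq.
    static Map<String, List<Integer>> positionsToStore(List<String> tokens,
                                                       TokenSelector positionsSelector) {
        Map<String, List<Integer>> postings = new LinkedHashMap<>();
        for (int pos = 0; pos < tokens.size(); pos++) {
            String term = tokens.get(pos);
            List<Integer> positions = postings.computeIfAbsent(term, k -> new ArrayList<>());
            boolean keepAll = positionsSelector == null || positionsSelector.accept(term);
            if (keepAll || positions.isEmpty()) {
                positions.add(pos);
            }
        }
        return postings;
    }

    // Distinct terms that would go into the term vector.
    // A null selector keeps every term.
    static List<String> termVectorTerms(List<String> tokens,
                                        TokenSelector termVectorSelector) {
        List<String> kept = new ArrayList<>();
        for (String term : tokens) {
            boolean accepted = termVectorSelector == null || termVectorSelector.accept(term);
            if (accepted && !kept.contains(term)) {
                kept.add(term);
            }
        }
        return kept;
    }
}
```

With REJECT_REVERSE passed for both parameters, a reverse token that occurs several times in a document keeps a single position and is left out of the term vector, while ordinary tokens are stored in full.
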
> This mechanism provides a simple solution to the problem of minimizing index
> size overhead caused by storing extra tokens that facilitate queries, in those
> cases where the mere existence of the extra tokens is sufficient. For
> example, in my test data using reverse tokens to speed prefix wildcard
> matching, I obtained the following index overheads:
> 1. With no TokenSelectors: 60% larger with reverse tokens than without
> 2. With termVectorTokenSelector rejecting reverse tokens: 36% larger
> 3. With both positionsTokenSelector and termVectorTokenSelector rejecting
> reverse tokens: 25% larger
> The same effect can be obtained with a separate field that has one
> occurrence of each reverse token and no term vectors, but this can be hard or
> impossible to do, and it is a performance problem because it requires either
> rereading the content or storing all the tokens for subsequent processing.
> The solution with TokenSelectors is very easy to use and fast.
> Otis, thanks for leaving a comment in QueryParser.jj with the correct
> production to enable prefix wildcards! With this, it is a straightforward
> matter to override the wildcard query factory method and use reverse tokens
> effectively.
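
Illustrative sketch (not part of the patch): the reverse-token idea mentioned above turns a leading-wildcard pattern such as *ing into a prefix match over reversed terms. The version below uses plain strings only; the marker character, class name, and method names are assumptions made for illustration, and no Lucene API is called.

```java
final class ReverseTokenSketch {
    // Assumption for this sketch only: reverse tokens start with a marker
    // character so they cannot collide with ordinary terms.
    static final char REVERSE_MARKER = '\u0001';

    // Builds the reverse token that would be indexed alongside the original term.
    static String reverseToken(String term) {
        return REVERSE_MARKER + new StringBuilder(term).reverse().toString();
    }

    // Rewrites a pattern of the form "*suffix" into a prefix pattern over
    // reverse tokens. Any other pattern is returned unchanged.
    static String rewriteLeadingWildcard(String pattern) {
        boolean leadingStarOnly = pattern.length() > 1
                && pattern.charAt(0) == '*'
                && pattern.indexOf('*', 1) < 0
                && pattern.indexOf('?') < 0;
        if (!leadingStarOnly) {
            return pattern;
        }
        return reverseToken(pattern.substring(1)) + "*";
    }

    // Checks an indexed term against a prefix pattern that ends in '*'.
    static boolean matchesPrefixPattern(String indexedTerm, String prefixPattern) {
        String prefix = prefixPattern.substring(0, prefixPattern.length() - 1);
        return indexedTerm.startsWith(prefix);
    }
}
```

A query for *ing is rewritten to the marker followed by gni*, which matches the reverse token of "indexing" by prefix and does not match the reverse token of "index". In a real setup the rewrite would live in an overridden wildcard query factory method of the query parser, as the description suggests.
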