[ https://issues.apache.org/jira/browse/LUCENE-602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558409#action_12558409 ]
Grant Ingersoll commented on LUCENE-602:
----------------------------------------

If I understand the problem correctly, I think the new TeeTokenFilter and SinkTokenizer could also solve this problem. Right, Chuck?

> [PATCH] Filtering tokens for position and term vector storage
> -------------------------------------------------------------
>
>                 Key: LUCENE-602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>   Affects Versions: 2.1
>            Reporter: Chuck Williams
>            Priority: Minor
>         Attachments: TokenSelectorAllWithParallelWriter.patch,
>                      TokenSelectorSoloAll.patch
>
>
> This patch provides a new TokenSelector mechanism to select tokens of
> interest and creates two new IndexWriter configuration parameters:
> termVectorTokenSelector and positionsTokenSelector.
> termVectorTokenSelector, if non-null, selects which index tokens will be
> stored in term vectors. If positionsTokenSelector is non-null, then any
> tokens it rejects will have only their first position stored in each
> document (one position must be stored to keep the doc freq correct, so
> that the token is not garbage collected in merges).
> This mechanism provides a simple solution to the problem of minimizing
> the index size overhead caused by storing extra tokens that facilitate
> queries, in those cases where the mere existence of the extra tokens is
> sufficient. For example, in my test data using reverse tokens to speed
> prefix wildcard matching, I obtained the following index overheads:
>   1. With no TokenSelectors: 60% larger with reverse tokens than without
>   2. With termVectorTokenSelector rejecting reverse tokens: 36% larger
>   3. With both positionsTokenSelector and termVectorTokenSelector
>      rejecting reverse tokens: 25% larger
> It is possible to obtain the same effect by using a separate field that
> has one occurrence of each reverse token and no term vectors, but this
> can be hard or impossible to do, and it is a performance problem because
> it requires either rereading the content or storing all the tokens for
> subsequent processing.
> The solution with TokenSelectors is very easy to use and fast.
> Otis, thanks for leaving a comment in QueryParser.jj with the correct
> production to enable prefix wildcards! With this, it is a straightforward
> matter to override the wildcard query factory method and use reverse
> tokens effectively.
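
For readers following the thread, here is a minimal sketch of the reverse-token idea described in the quoted issue. It is plain Java with no Lucene dependency, and it is not the patch's code: the TokenSelector interface shown here, its accept method, the marker character, and all helper names are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ReverseTokenSketch {
    // Hypothetical marker prepended to reversed tokens so they
    // cannot collide with ordinary terms in the same field.
    static final char REVERSE_MARKER = '\u0001';

    // Hypothetical stand-in for the patch's TokenSelector; the real
    // interface is not shown in this thread.
    interface TokenSelector {
        boolean accept(String term);
    }

    // Rejects reversed tokens, e.g. to keep them out of term vectors.
    static final TokenSelector REJECT_REVERSED =
        term -> term.isEmpty() || term.charAt(0) != REVERSE_MARKER;

    // Builds the reversed form of a term, tagged with the marker.
    static String reverseToken(String term) {
        return REVERSE_MARKER + new StringBuilder(term).reverse().toString();
    }

    // Emits each term followed by its reversed form, roughly what an
    // analyzer that adds reverse tokens would produce.
    static List<String> withReverseTokens(List<String> terms) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            out.add(term);
            out.add(reverseToken(term));
        }
        return out;
    }

    // Keeps only the tokens that the selector accepts.
    static List<String> select(List<String> tokens, TokenSelector selector) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (selector.accept(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Rewrites a leading-wildcard pattern "*suffix" into a prefix that
    // can be matched against the reversed tokens.
    static String leadingWildcardToPrefix(String pattern) {
        if (!pattern.startsWith("*")
                || pattern.indexOf('*', 1) >= 0
                || pattern.indexOf('?') >= 0) {
            throw new IllegalArgumentException("expected a pattern of the form *suffix");
        }
        return reverseToken(pattern.substring(1));
    }

    public static void main(String[] args) {
        List<String> indexed = withReverseTokens(Arrays.asList("lucene", "index"));
        String prefix = leadingWildcardToPrefix("*cene");
        for (String token : indexed) {
            if (token.startsWith(prefix)) {
                System.out.println("prefix match on a reversed token");
            }
        }
        // Only the forward tokens survive the selector.
        System.out.println(select(indexed, REJECT_REVERSED));
    }
}
```

The point of the sketch is that a reversed token only needs to exist in the term dictionary for the prefix match to work, which is why the issue argues that its positions and term vector entries can be dropped.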