[jira] Updated: (SOLR-11) DeDupTokenFilter{Factory}

Yonik Seeley (JIRA) Wed, 05 Jul 2006 14:47:18 -0700

     [ http://issues.apache.org/jira/browse/SOLR-11?page=all ]


Yonik Seeley updated SOLR-11:
-----------------------------

    Attachment: ArrayQueue.java

I looked it over quickly, and it looks fine to me!

A few weeks ago I did some more premature optimization, writing a circular 
queue (power-of-two based) that's about twice as fast as a LinkedList for our 
typical usage.  I was intending it for use in BufferedTokenFilter if 
insertion/removal of tokens in the middle of the buffers turned out to be 
unneeded (or rare, since it could still be implemented).
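
For readers without the attachment, here is a minimal sketch of what a power-of-two circular queue can look like. It is not the attached ArrayQueue.java; the class name RingQueueSketch and its methods are hypothetical. The point it illustrates is that with a power-of-two capacity, index wrap-around becomes a bit mask rather than a modulo, and no per-element node objects are allocated.

```java
import java.util.NoSuchElementException;

/** Hypothetical sketch of a power-of-two ring buffer; not the attached ArrayQueue.java. */
final class RingQueueSketch<T> {
    private Object[] buf;   // length is always a power of two
    private int head;       // index of the oldest element
    private int size;       // number of stored elements

    RingQueueSketch(int minCapacity) {
        int cap = 1;
        while (cap < minCapacity) cap <<= 1;   // round up to a power of two
        buf = new Object[cap];
    }

    /** Appends at the tail, doubling the array when it is full. */
    void addLast(T item) {
        if (size == buf.length) grow();
        buf[(head + size) & (buf.length - 1)] = item;   // mask instead of modulo
        size++;
    }

    /** Removes and returns the oldest element. */
    @SuppressWarnings("unchecked")
    T removeFirst() {
        if (size == 0) throw new NoSuchElementException();
        T item = (T) buf[head];
        buf[head] = null;                        // let the element be collected
        head = (head + 1) & (buf.length - 1);
        size--;
        return item;
    }

    int size() { return size; }

    boolean isEmpty() { return size == 0; }

    /** Copies elements in queue order into an array twice as large. */
    private void grow() {
        Object[] bigger = new Object[buf.length << 1];
        for (int i = 0; i < size; i++) {
            bigger[i] = buf[(head + i) & (buf.length - 1)];
        }
        buf = bigger;
        head = 0;
    }
}
```

Adding and removing at the ends is constant time (amortized for adds); inserting or removing in the middle would require shifting elements, which is why this layout only pays off when such operations are unneeded or rare.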

Anyway, I'm attaching it here for lack of a better place.  I support committing 
the current BufferedTokenFilter as-is, since I doubt the LinkedList 
implementation will be any kind of bottleneck.

> DeDupTokenFilter{Factory}
> -------------------------
>
>          Key: SOLR-11
>          URL: http://issues.apache.org/jira/browse/SOLR-11
>      Project: Solr
>         Type: Wish

>   Components: search
>     Reporter: Hoss Man
>     Assignee: Hoss Man
>  Attachments: ArrayQueue.java, 
> SOLR-11-BufferedTokenStream-RemoveDuplicatesTokenFilter.patch, 
> solr.analysis.RemoveDuplicateTokensFilter.java, 
> solr.analysys.RemomveDuplicateTokensFilter.linkedhashmap.java
>
> I recently noticed a situation in which my Query analyzer was producing the 
> same Token more than once, resulting in it getting two equally boosted 
> clauses in the resulting query.  In my specific case, I was using the same 
> synonym file for multiple fields (some stemmed, some not) and two synonyms for 
> a word stemmed to the same root, which meant that particular word was worth 
> twice as much as any of the other variations of the synonym -- but I can 
> imagine other situations where this might come up, both at index time and at 
> query time, particularly when using SynonymFilter in combination with the 
> WordDelimiter filter.
> It occurred to me that a DeDupFilter would be handy.  In its simplest form it 
> would drop any Token it gets where the startOffset, endOffset, termText, and 
> type are all identical to the previous token and the positionIncrement is 0.  
> A more robust implementation might support init options indicating that only 
> certain combinations of those things should be used to determine equality 
> (i.e. just termText, just termText and positionIncrement=0, etc.) but in 
> this case, an option might also be necessary to determine which of the Tokens 
> should be propagated (the first or the last).
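
The "simplest form" rule in the description above can be sketched roughly as follows. This is only an illustration and not any of the attached patches: Tok is a hypothetical stand-in for the analysis Token (it is not the Lucene class), and DeDupSketch works on a plain list rather than a token stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Hypothetical stand-in for an analysis token; not the Lucene Token class. */
final class Tok {
    final String termText;
    final String type;
    final int startOffset;
    final int endOffset;
    final int positionIncrement;

    Tok(String termText, String type, int startOffset, int endOffset, int positionIncrement) {
        this.termText = termText;
        this.type = type;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.positionIncrement = positionIncrement;
    }
}

/** Sketch of the simplest rule from the issue description. */
final class DeDupSketch {
    /** True when cur repeats prev: same offsets, text, and type, with positionIncrement 0. */
    static boolean isDuplicate(Tok prev, Tok cur) {
        return prev != null
                && cur.positionIncrement == 0
                && cur.startOffset == prev.startOffset
                && cur.endOffset == prev.endOffset
                && Objects.equals(cur.termText, prev.termText)
                && Objects.equals(cur.type, prev.type);
    }

    /** Returns the tokens with such duplicates dropped, keeping the first of each run. */
    static List<Tok> dedup(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        Tok prev = null;
        for (Tok t : in) {
            if (isDuplicate(prev, t)) {
                continue;   // drop the repeat; prev stays the token that was kept
            }
            out.add(t);
            prev = t;
        }
        return out;
    }
}
```

Note that this only compares against the immediately preceding token, as the description says, so duplicates separated by a different token at the same position would not be caught; it also always keeps the first token, which is one of the two choices the description leaves open.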

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
