[ 
https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942642#comment-14942642
 ] 

Michael McCandless commented on LUCENE-6664:
--------------------------------------------

bq. How arbitrary are the token graphs here?

Well, the token graphs produced by {{SynonymGraphFilter}} (this patch) are not 
really any more expressive than what existing tokenizers already produce (e.g. 
Kuromoji/JapaneseTokenizer).  They are acyclic, and are enumerated under 
constraints similar to those of the automaton API, i.e. you must add all tokens 
leaving a given position before moving to the next position.
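
A rough sketch of that enumeration constraint, for readers who want to see it concretely.  This is plain Python, not Lucene code: the {{Token}} tuple, the {{to_arcs}} helper and the token order shown are assumptions made for illustration.  The fields only mirror the meaning of Lucene's position increment and position length attributes.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    term: str
    pos_inc: int  # how far to advance from the previous token's start position
    pos_len: int  # how many positions this token spans


def to_arcs(tokens: List[Token]) -> List[Tuple[int, int, str]]:
    """Turn a token stream into (from_position, to_position, term) arcs.

    pos_inc >= 0 means start positions never move backwards, so every token
    leaving a position is emitted before the stream moves on.  pos_len >= 1
    means every arc points forward, so the graph cannot contain a cycle.
    """
    arcs = []
    position = -1  # the first token normally has pos_inc=1 and lands on 0
    for tok in tokens:
        if tok.pos_inc < 0:
            raise ValueError(f"{tok.term}: position increment must be >= 0")
        if tok.pos_len < 1:
            raise ValueError(f"{tok.term}: position length must be >= 1")
        position += tok.pos_inc
        arcs.append((position, position + tok.pos_len, tok.term))
    return arcs


# Hypothetical stream for the input "dns server" with the synonym
# "dns -> domain name system" applied and the original token kept.
tokens = [
    Token("dns", 1, 3),     # side path spanning three positions
    Token("domain", 0, 1),  # same start position as "dns"
    Token("name", 1, 1),
    Token("system", 1, 1),
    Token("server", 1, 1),
]
arcs = to_arcs(tokens)
for arc in arcs:
    print(arc)
```

Here "dns" becomes a single arc from position 0 to 3, running alongside the three-arc path "domain name system", and "server" leaves position 3 on both paths.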

I think what must be controversial about this patch is that this new filter can 
create new positions, which is necessary to fix bugs in the old 
{{SynonymFilter}} and correctly handle syns that expand to more tokens than 
they consumed, e.g. "dns -> domain name system".  Apart from that, you cannot 
distinguish the output of {{SynonymGraphFilter}} from that of e.g. 
{{JapaneseTokenizer}}: they both produce graphs with side paths.

Probably LUCENE-5012 is the only realistic way to move forward here: the 
synonym filter on that branch fixes the bugs that {{SynonymGraphFilter}} (this 
patch) also fixes, and then fixes additional bugs so that it can consume an 
incoming graph as well.  E.g. on that branch you could apply synonyms to the 
graph output from {{JapaneseTokenizer}}.

> Replace SynonymFilter with SynonymGraphFilter
> ---------------------------------------------
>
>                 Key: LUCENE-6664
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6664
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-6664.patch, LUCENE-6664.patch, LUCENE-6664.patch, 
> LUCENE-6664.patch, usa.png, usa_flat.png
>
>
> Spinoff from LUCENE-6582.
> I created a new SynonymGraphFilter (to replace the current buggy
> SynonymFilter), that produces correct graphs (does no "graph
> flattening" itself).  I think this makes it simpler.
> This means you must add the FlattenGraphFilter yourself, if you are
> applying synonyms during indexing.
> Index-time syn expansion is a necessarily "lossy" graph transformation
> when multi-token (input or output) synonyms are applied, because the
> index does not store {{posLength}}, so there will always be phrase
> queries that should match but do not, and then phrase queries that
> should not match but do.
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
> goes into detail about this.
> However, with this new SynonymGraphFilter, if instead you do synonym
> expansion at query time (and don't do the flattening), and you use
> TermAutomatonQuery (future: somehow integrated into a query parser),
> or maybe just "enumerate all paths and make union of PhraseQuery", you
> should get 100% correct matches (not sure about "proper" scoring
> though...).
> This new syn filter still cannot consume an arbitrary graph.
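
To illustrate two points from the quoted description (index-time expansion is lossy because {{posLength}} is not stored, and query time can "enumerate all paths"), here is a small self-contained sketch.  It is plain Python rather than Lucene code; {{GRAPH}}, {{all_paths}}, {{phrase_matches}} and {{drop_pos_length}} are hypothetical names.  {{drop_pos_length}} only models an index that keeps positions but not position lengths; it is not a model of {{FlattenGraphFilter}}.

```python
from collections import defaultdict
from typing import List, Tuple

Arc = Tuple[int, int, str]  # (from_position, to_position, term)

# Assumed graph for "dns server" with "dns -> domain name system" applied.
GRAPH: List[Arc] = [
    (0, 3, "dns"),
    (0, 1, "domain"),
    (1, 2, "name"),
    (2, 3, "system"),
    (3, 4, "server"),
]


def all_paths(arcs: List[Arc], start: int, end: int) -> List[List[str]]:
    """Enumerate every term sequence leading from start to end."""
    out_arcs = defaultdict(list)
    for src, dst, term in arcs:
        out_arcs[src].append((dst, term))
    paths: List[List[str]] = []

    def walk(node: int, prefix: List[str]) -> None:
        if node == end:
            paths.append(prefix)
            return
        for dst, term in out_arcs[node]:
            walk(dst, prefix + [term])

    walk(start, [])
    return paths


def phrase_matches(arcs: List[Arc], phrase: List[str]) -> bool:
    """True if some run of consecutive arcs spells out the phrase."""

    def follow(node: int, rest: List[str]) -> bool:
        if not rest:
            return True
        return any(
            follow(dst, rest[1:])
            for src, dst, term in arcs
            if src == node and term == rest[0]
        )

    return any(follow(src, phrase) for src in {a[0] for a in arcs})


def drop_pos_length(arcs: List[Arc]) -> List[Arc]:
    """Model an index without posLength: every token spans one position."""
    return [(src, src + 1, term) for src, _dst, term in arcs]


# Query-time view: each path could become one phrase in a union of phrases.
print(all_paths(GRAPH, 0, 4))

# Index-time view once position lengths are gone.
indexed = drop_pos_length(GRAPH)
print(phrase_matches(GRAPH, ["dns", "server"]), phrase_matches(indexed, ["dns", "server"]))
print(phrase_matches(GRAPH, ["dns", "name"]), phrase_matches(indexed, ["dns", "name"]))
```

With the full graph, "dns server" matches and "dns name" does not.  Once position lengths are dropped the two results swap, which is one instance of each kind of wrong phrase match the description mentions.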



