[jira] [Commented] (LUCENE-6664) Replace SynonymFilter with SynonymGraphFilter

Robert Muir (JIRA) Sun, 04 Oct 2015 04:55:09 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942650#comment-14942650
 ]


Robert Muir commented on LUCENE-6664:
-------------------------------------

I thought I already explained my concerns well. I guess I will try yet again...


* I think it's a huge, huge break to modify the semantics of existing token 
attributes like position increment to mean something very different (nodeID); 
that seems like an awkward fit, wedged in.
* If position increment/length no longer have meaning and are just confusing 
ways of storing a node ID, that's a huge usability issue: let's use a clean 
slate of attributes instead of abusing the existing ones in that way...
* I'm concerned about complexity: today the simple case (posInc + posLen) is 
hard enough to understand, but doable. I feel like going this route pushes the 
complexity onto the simple case, and that worries me a lot. A lot of people 
simply will not have graphs! Should all of our analysis components really 
be required to support them? Maybe we should really be putting the effort into 
the ability to use alternative analysis APIs, so that we can have a "graph 
analysis" API that works totally differently from the TokenStream one, and 
so on.
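
For reference, here is a minimal sketch of the "simple case" (posInc + posLen) mentioned above. It is plain Python rather than Lucene code: the `Token` type and the `to_arcs` helper are hypothetical names made up for illustration, and the "usa" example is assumed from the attachment names on this issue. It relies on the usual reading of these attributes: a token's position is the running sum of position increments, and the token spans posLen positions from there.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    """Hypothetical stand-in for a token's term, posInc and posLen."""
    term: str
    pos_inc: int      # position increment relative to the previous token
    pos_len: int = 1  # number of positions this token spans


def to_arcs(tokens: List[Token]) -> List[Tuple[int, int, str]]:
    """Turn (posInc, posLen) pairs into (from_position, to_position, term) arcs."""
    arcs = []
    pos = -1  # so that a first token with posInc=1 lands on position 0
    for tok in tokens:
        pos += tok.pos_inc
        arcs.append((pos, pos + tok.pos_len, tok.term))
    return arcs


# "usa" stacked on top of "united states of america": posInc=0 puts "united"
# at the same position as "usa", and posLen=4 makes "usa" span the whole phrase.
tokens = [
    Token("usa", 1, 4),
    Token("united", 0),
    Token("states", 1),
    Token("of", 1),
    Token("america", 1),
]

for arc in to_arcs(tokens):
    print(arc)
```

Under this reading the node IDs of the graph are derived from the attributes; they are not stored in them, which is the distinction the first two points above are about.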

> Replace SynonymFilter with SynonymGraphFilter
> ---------------------------------------------
>
>                 Key: LUCENE-6664
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6664
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-6664.patch, LUCENE-6664.patch, LUCENE-6664.patch, 
> LUCENE-6664.patch, usa.png, usa_flat.png
>
>
> Spinoff from LUCENE-6582.
> I created a new SynonymGraphFilter (to replace the current buggy
> SynonymFilter), that produces correct graphs (does no "graph
> flattening" itself).  I think this makes it simpler.
> This means you must add the FlattenGraphFilter yourself, if you are
> applying synonyms during indexing.
> Index-time syn expansion is a necessarily "lossy" graph transformation
> when multi-token (input or output) synonyms are applied, because the
> index does not store {{posLength}}, so there will always be phrase
> queries that should match but do not, and then phrase queries that
> should not match but do.
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
> goes into detail about this.
> However, with this new SynonymGraphFilter, if instead you do synonym
> expansion at query time (and don't do the flattening), and you use
> TermAutomatonQuery (future: somehow integrated into a query parser),
> or maybe just "enumerate all paths and make union of PhraseQuery", you
> should get 100% correct matches (not sure about "proper" scoring
> though...).
> This new syn filter still cannot consume an arbitrary graph.
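
The "enumerate all paths" idea and the lossiness of dropping posLength, both described in the issue text above, can be sketched roughly as follows. This is plain Python, not Lucene code: `enumerate_paths` is a hypothetical helper, the "usa" graph is an assumed example, and the sketch assumes an acyclic graph whose arcs always point forward. Each resulting path is what would become one phrase in a union of phrase queries.

```python
from typing import Dict, List, Tuple

Arc = Tuple[int, int, str]  # (from_node, to_node, term)


def enumerate_paths(arcs: List[Arc], start: int, end: int) -> List[List[str]]:
    """Return every term sequence leading from start to end (forward arcs only)."""
    outgoing: Dict[int, List[Tuple[int, str]]] = {}
    for src, dst, term in arcs:
        outgoing.setdefault(src, []).append((dst, term))

    paths: List[List[str]] = []

    def walk(node: int, prefix: List[str]) -> None:
        if node == end:
            paths.append(prefix)
            return
        for dst, term in outgoing.get(node, []):
            walk(dst, prefix + [term])

    walk(start, [])
    return paths


# Correct graph: "usa" spans four positions alongside the four-word phrase.
graph = [
    (0, 4, "usa"),
    (0, 1, "united"),
    (1, 2, "states"),
    (2, 3, "of"),
    (3, 4, "america"),
]
correct = enumerate_paths(graph, 0, 4)

# Simulate an index that does not store posLength: every arc spans one position.
flat = [(src, src + 1, term) for src, _, term in graph]
lossy = enumerate_paths(flat, 0, 4)

print(correct)
print(lossy)
```

With the position lengths kept, the two paths are "usa" and "united states of america". With them dropped, "usa states of america" shows up as a path that should not exist, and "usa" alone no longer reaches the end node, which matches the two kinds of phrase-query errors the description mentions.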



