[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier

Jack Krupansky (JIRA) Tue, 21 May 2013 15:10:23 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663471#comment-13663471 ]


Jack Krupansky commented on LUCENE-5012:
----------------------------------------

bq. WordAutomatonQuery

Sounds quite promising.

Back to the query parsers... They would present a term or quoted string (and, 
hopefully, eventually a sequence of terms when the query parser sees only 
white space between them, an issue Robert filed long ago) and invoke analysis. 
Then what? Sometimes a single term or a clean sausage string comes out, and a 
TermQuery or a simple BooleanQuery or PhraseQuery needs to be generated. But if 
synonym-like filtering has generated a graph, the query parser would hand the 
graph directly to WordAutomatonQuery, if I understand correctly. The question 
then becomes how to tell that a WordAutomatonQuery is needed, unless 
WordAutomatonQuery automatically detects the TermQuery and 
BooleanQuery/PhraseQuery cases as internal optimizations. (Well, I don't expect 
that WordAutomatonQuery would know how to choose BooleanQuery vs. PhraseQuery, 
unless it has a "phrase" flag.)

In short, it would be nice if this issue produced, at least partially, the 
logic for choosing among TermQuery, BooleanQuery, PhraseQuery, and 
WordAutomatonQuery: either code that actually generates the final query, or at 
least example code that documents the design pattern a query parser needs in 
order to consume a "query phrase" graph.

In other words, the query parsers should not simply do a "next" for the entire 
output of query term analysis. A new design pattern is needed.
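
As a rough illustration of the pattern asked for above (a sketch, not code from 
the attached patch), a parser could buffer the whole analysis output, classify 
its shape, and only then pick a query type. The Token class, the Shape enum, 
and the classify/queryKind helpers below are hypothetical names; posInc and 
posLen stand in for the values Lucene exposes through 
PositionIncrementAttribute and PositionLengthAttribute. Treating any token with 
posLen > 1 as "a graph" is an assumption, and WordAutomatonQuery is only the 
query proposed in this issue, so it appears here as a plain string.

```java
import java.util.List;

public class QueryShapeSketch {

    /** Hypothetical stand-in for one analyzed token. */
    public static final class Token {
        public final String term;
        public final int posInc; // position increment relative to the previous token
        public final int posLen; // number of positions the token spans

        public Token(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    public enum Shape { EMPTY, SINGLE_TERM, SAUSAGE, GRAPH }

    /** Classify the fully buffered analysis output of one query term or phrase. */
    public static Shape classify(List<Token> tokens) {
        if (tokens.isEmpty()) {
            return Shape.EMPTY;
        }
        for (Token t : tokens) {
            // Assumption: a token spanning more than one position means a real graph.
            if (t.posLen > 1) {
                return Shape.GRAPH;
            }
        }
        return tokens.size() == 1 ? Shape.SINGLE_TERM : Shape.SAUSAGE;
    }

    /** Name the query type a parser might build; "quoted" plays the role of a phrase flag. */
    public static String queryKind(Shape shape, boolean quoted) {
        switch (shape) {
            case SINGLE_TERM:
                return "TermQuery";
            case SAUSAGE:
                return quoted ? "PhraseQuery" : "BooleanQuery";
            case GRAPH:
                return "WordAutomatonQuery"; // proposed in this issue, not an existing class
            default:
                return "none";
        }
    }
}
```

The point of the sketch is that the decision cannot be made token by token: the 
parser has to see the entire output before it knows which query to generate.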

Also, at index time, the output of analysis is consumed as a single sausage 
stream, using "next" and the token position increment, so multiple multi-word 
synonyms have traditionally been somewhat mangled. There may not be a clean 
solution for the current index term posting format, but at a minimum we should 
reconsider how the output of index-time term analysis is consumed, and flag 
potential future improvements for posting multiple multi-term phrases at the 
same token position.
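
One way to see the information loss at index time is a small sketch that 
consumes tokens the way described above: it accumulates position increments 
and ignores position length. The Token class and the index/adjacent helpers 
are hypothetical names, and the token stream in the usage note is an assumed 
example, not the output of any particular filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SausageIndexSketch {

    /** Hypothetical stand-in for one analyzed token. */
    public static final class Token {
        public final String term;
        public final int posInc; // position increment relative to the previous token
        public final int posLen; // positions spanned; dropped by index() below

        public Token(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Map each absolute position to the terms posted there; posLen is ignored. */
    public static Map<Integer, List<String>> index(List<Token> tokens) {
        Map<Integer, List<String>> postings = new TreeMap<>();
        int position = -1;
        for (Token t : tokens) {
            position += t.posInc;
            postings.computeIfAbsent(position, p -> new ArrayList<>()).add(t.term);
        }
        return postings;
    }

    /** True if "second" is posted exactly one position after "first". */
    public static boolean adjacent(Map<Integer, List<String>> postings,
                                   String first, String second) {
        for (Map.Entry<Integer, List<String>> e : postings.entrySet()) {
            List<String> next = postings.get(e.getKey() + 1);
            if (e.getValue().contains(first) && next != null && next.contains(second)) {
                return true;
            }
        }
        return false;
    }
}
```

For the assumed stream domain(posInc=1), dns(posInc=0, posLen=3), name, 
service, lookup, the postings put "dns" and "domain" at position 0 and "lookup" 
at position 3. Because the span of "dns" is lost, adjacent(postings, "dns", 
"name") is true while adjacent(postings, "dns", "lookup") is false, which is 
the reverse of what the graph says.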

In any case, thanks for moving the multiple multi-term synonym ball forward!

                
> Make graph-based TokenFilters easier
> ------------------------------------
>
>                 Key: LUCENE-5012
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5012
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-5012.patch
>
>
> SynonymFilter has two limitations today:
>   * It cannot create positions, so eg dns -> domain name service
>     creates blatantly wrong highlights (SOLR-3390, LUCENE-4499 and
>     others).
>   * It cannot consume a graph, so e.g. if you try to apply synonyms
>     after Kuromoji tokenizer I'm not sure what will happen.
> I've thought about how to fix these issues but it's really quite
> difficult with the current PosInc/PosLen graph representation, so I'd
> like to explore an alternative approach.

