[
https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663471#comment-13663471
]
Jack Krupansky commented on LUCENE-5012:
----------------------------------------
bq. WordAutomatonQuery
Sounds quite promising.
Back to the query parsers... They would present a term or quoted string to
analysis - and eventually, I hope, a sequence of terms when the query parser
sees that only whitespace separates them (an issue Robert filed long ago).
Then what? Sometimes a single term or a clean sausage string comes out, and a
TermQuery, simple BooleanQuery, or PhraseQuery needs to be generated. But if
synonym-like filtering has generated a graph, the query parser would hand that
graph directly to WordAutomatonQuery, if I understand correctly. The question
then becomes how the parser can tell that a WordAutomatonQuery is needed -
unless WordAutomatonQuery automatically detects the TermQuery and
BooleanQuery/PhraseQuery cases as internal optimizations. (Well, I don't
expect WordAutomatonQuery to know how to choose between BooleanQuery and
PhraseQuery, unless it has a "phrase" flag.)
In short, it would be nice if this issue produced, directly or at least
partially, enough logic for that Term vs. Boolean vs. Phrase vs. WordAutomaton
query generation: either code that actually generates the final query, or at
least some example code that documents the design pattern a query parser needs
for consuming a "query phrase" graph.
In other words, the query parsers should not simply do a "next" for the entire
output of query term analysis. A new design pattern is needed.
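
To make the decision concrete, here is a rough sketch of what such a pattern
might look like. It is not Lucene code: Tok, Kind, and choose are hypothetical
names invented for illustration, Tok is a simplified stand-in for an analyzed
token (term text, position increment, position length), and the rule "any
token with a position length above 1 means a graph" is an assumption about how
a parser could detect the graph case. WordAutomatonQuery itself is only the
query proposed on this issue.

```java
import java.util.List;

/** Hypothetical sketch only; none of these names are Lucene APIs. */
public class QueryShapeSketch {

    /** Simplified stand-in for one analyzed token. */
    public static final class Tok {
        final String term;
        final int posInc; // position increment relative to the previous token
        final int posLen; // how many positions the token spans
        public Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** The query family a parser would build for the analyzed text. */
    public enum Kind { TERM, BOOLEAN, PHRASE, WORD_AUTOMATON }

    /**
     * Buffers the whole analysis output and looks at its shape before
     * choosing a query, instead of consuming tokens one "next" at a time.
     */
    public static Kind choose(List<Tok> toks, boolean quoted) {
        if (toks.isEmpty()) {
            throw new IllegalArgumentException("analysis produced no tokens");
        }
        boolean graph = false;
        int positions = 0;
        for (Tok t : toks) {
            if (t.posLen > 1) {
                graph = true; // assumed graph signal: a token spanning several positions
            }
            positions += t.posInc;
        }
        if (graph) {
            return Kind.WORD_AUTOMATON; // hand the graph to the proposed query
        }
        if (toks.size() == 1) {
            return Kind.TERM;
        }
        if (positions <= 1) {
            return Kind.BOOLEAN; // single-word synonyms stacked at one position
        }
        // A clean sausage over several positions: the "phrase" flag decides.
        return quoted ? Kind.PHRASE : Kind.BOOLEAN;
    }
}
```

The quoted flag plays the role of the "phrase" flag mentioned above; whether
that decision lives in the query parser or inside WordAutomatonQuery is left
open here.
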
Also, at index time, the output of analysis is consumed as a single sausage
stream, using "next" and the token position increment, so multiple multi-word
synonyms have traditionally come out somewhat mangled. There may not be a
clean solution for the current index term posting format, but at a minimum we
should reconsider how the output of index-time term analysis is consumed and
flag potential future improvements for posting multiple multi-term phrases at
the same token position.
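
As a small illustration of the index-time concern, the sketch below consumes
a token graph the sausage way: it accumulates position increments and never
reads the position length. The Tok class and flatten method are hypothetical
names, and the token values for "dns lookup" with a "domain name service"
synonym are made up for the example; this is not the indexer's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch only; not Lucene's indexing chain. */
public class SausagePostingSketch {

    /** Simplified stand-in for one analyzed token. */
    public static final class Tok {
        final String term;
        final int posInc; // position increment relative to the previous token
        final int posLen; // how many positions the token spans
        public Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Assigns each term an absolute position from posInc alone; posLen is dropped. */
    public static List<String> flatten(List<Tok> toks) {
        List<String> postings = new ArrayList<>();
        int pos = -1;
        for (Tok t : toks) {
            pos += t.posInc;
            postings.add(t.term + "@" + pos);
        }
        return postings;
    }

    public static void main(String[] args) {
        // "dns lookup", with "dns" also expanded to "domain name service".
        // In the graph, "dns" spans three positions, so "lookup" follows it directly.
        List<Tok> toks = Arrays.asList(
            new Tok("dns", 1, 3),
            new Tok("domain", 0, 1),
            new Tok("name", 1, 1),
            new Tok("service", 1, 1),
            new Tok("lookup", 1, 1));
        // prints [dns@0, domain@0, name@1, service@2, lookup@3]
        System.out.println(flatten(toks));
    }
}
```

Once the position length is gone, "dns" sits at position 0 and "lookup" at
position 3, so an exact phrase match on "dns lookup" no longer sees the two
terms as adjacent.
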
In any case, thanks for moving the multiple multi-term synonym ball forward!
> Make graph-based TokenFilters easier
> ------------------------------------
>
> Key: LUCENE-5012
> URL: https://issues.apache.org/jira/browse/LUCENE-5012
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-5012.patch
>
>
> SynonymFilter has two limitations today:
> * It cannot create positions, so e.g. dns -> domain name service
> creates blatantly wrong highlights (SOLR-3390, LUCENE-4499 and
> others).
> * It cannot consume a graph, so e.g. if you try to apply synonyms
> after Kuromoji tokenizer I'm not sure what will happen.
> I've thought about how to fix these issues but it's really quite
> difficult with the current PosInc/PosLen graph representation, so I'd
> like to explore an alternative approach.