[
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866665#action_12866665
]
Ivan Provalov commented on LUCENE-2458:
---------------------------------------
Robert has asked me to post our test results on the Chinese Collection. We used
the following data collection from TREC:
http://trec.nist.gov/data/qrels_noneng/index.html
    qrels.trec6.29-54.chinese.gz
    qrels.1-28.chinese.gz
http://trec.nist.gov/data/topics_noneng
    TREC-6 Chinese topics (.gz)
    TREC-5 Chinese topics (.gz)
Mandarin Data Collection
    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2000T52
Analyzer Name      Plain analyzers   Added PositionFilter (only at query time)
ChineseAnalyzer    0.028             0.264
CJKAnalyzer        0.027             0.284
SmartChinese       0.027             0.265
IKAnalyzer         0.028             0.259
(Note: IKAnalyzer has its own IKQueryParser, which yields an average precision
of 0.084)
Thanks,
Ivan Provalov
> queryparser shouldn't generate phrasequeries based on term count
> ----------------------------------------------------------------
>
> Key: LUCENE-2458
> URL: https://issues.apache.org/jira/browse/LUCENE-2458
> Project: Lucene - Java
> Issue Type: Bug
> Components: QueryParser
> Reporter: Robert Muir
> Priority: Critical
>
> The current method in the queryparser to generate phrasequeries is wrong:
> The Query Syntax documentation
> (http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:
> {noformat}
> A Phrase is a group of words surrounded by double quotes such as "hello
> dolly".
> {noformat}
> But as we know, this isn't actually true.
> Instead, the terms are first divided on whitespace, then the analyzer term
> count is used as some sort of "heuristic" to determine if it's a phrase query
> or not.
> This assumption is a disaster for languages that don't use whitespace
> separation: CJK, compounding European languages like German, Finnish, etc. It
> also makes it difficult for people to use n-gram analysis techniques. In these
> cases you get bad relevance (MAP improves nearly *10x* if you use a
> PositionFilter at query-time to "turn this off" for Chinese).
> Even for English, this undocumented behavior is bad. Perhaps in some cases
> it's being abused as some heuristic to "second guess" the tokenizer and piece
> back things it shouldn't have split, but for large collections, doing things
> like generating phrasequeries because StandardTokenizer split a compound on a
> dash can cause serious performance problems. Instead, people should analyze
> their text with the appropriate methods, and QueryParser should only generate
> phrase queries when the syntax asks for one.
> The PositionFilter in contrib can be seen as a workaround, but it's pretty
> obscure and people are not familiar with it. The result is that we have bad
> out-of-box behavior for many languages, and bad performance for others on
> some inputs.
> I propose instead that we change the grammar to actually look for double
> quotes to determine when to generate a phrase query, consistent with the
> documentation.
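
For readers who want to see the difference concretely, here is a minimal
sketch of the two behaviors described in the issue. It does not use Lucene:
the class PhraseHeuristicSketch, the toy analyze() method (lowercase, split on
anything that is not a letter or digit) and the string notation for the
resulting queries are all hypothetical stand-ins chosen for illustration, not
the actual QueryParser code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; none of these names exist in Lucene.
final class PhraseHeuristicSketch {

    // Toy analyzer: lowercases and splits on any non-letter, non-digit run,
    // so "wi-fi" becomes the two terms [wi, fi].
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Behavior criticized in the issue: split on whitespace first, then treat
    // any chunk that analyzes to more than one term as a phrase query.
    // Double quotes in the input are not interpreted by this method.
    static String legacy(String query) {
        List<String> clauses = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            List<String> terms = analyze(chunk);
            if (terms.size() > 1) {
                clauses.add("\"" + String.join(" ", terms) + "\"");
            } else if (terms.size() == 1) {
                clauses.add(terms.get(0));
            }
        }
        return String.join(" ", clauses);
    }

    // Proposed behavior: a phrase is generated only for text the user
    // wrapped in double quotes; everything else becomes plain terms.
    static String proposed(String query) {
        List<String> clauses = new ArrayList<>();
        boolean quoted = false;
        // Splitting on '"' alternates between unquoted and quoted segments.
        for (String segment : query.split("\"", -1)) {
            List<String> terms = analyze(segment);
            if (quoted && !terms.isEmpty()) {
                clauses.add("\"" + String.join(" ", terms) + "\"");
            } else {
                clauses.addAll(terms);
            }
            quoted = !quoted;
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(legacy("wi-fi router"));
        System.out.println(proposed("wi-fi router"));
        System.out.println(proposed("\"wi-fi router\" cheap"));
    }
}
```

In this sketch the unquoted input wi-fi router turns into a phrase clause
under the term-count heuristic simply because the toy analyzer split on the
dash, while the quote-driven version produces a phrase only for the input
that actually contains double quotes.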