[ https://issues.apache.org/jira/browse/SOLR-3723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431817#comment-13431817 ]
Yonik Seeley commented on SOLR-3723:
------------------------------------

bq. I am strongly -1 against breaking tons of languages for some sketchy "optimization" of english.

It's certainly not just for english. It also doesn't seem sketchy at all - it seems to make perfect sense.

> Improve OOTB behavior: English word-splitting should default to
> autoGeneratePhraseQueries=true
> ----------------------------------------------------------------------------------------------
>
> Key: SOLR-3723
> URL: https://issues.apache.org/jira/browse/SOLR-3723
> Project: Solr
> Issue Type: Improvement
> Components: Schema and Analysis
> Affects Versions: 3.4, 3.5, 3.6, 4.0-ALPHA, 3.6.1
> Reporter: Jack Krupansky
>
> Digging through the Jira and revision history, I discovered that back at the
> end of May 2011, a change was made to Solr that fairly significantly degrades
> the OOTB behavior for English Solr queries, namely for word-splitting of
> terms with embedded punctuation, so that they end up, by default, doing the
> OR of the sub-terms, rather than doing the obvious phrase query of the
> sub-terms.
> Just a couple of examples:
> 1. CD-ROM => CD OR ROM rather than “CD ROM”
> 2. 1,000 => 1 OR 000 rather than “1 000” (when using the WordDelimiterFilter
> innocently added to text_general or text_en)
> 3. out-of-the-box => out OR of OR the OR box rather than “out of the box”
> 4. 3.6 => 3 OR 6 rather than "3 6" (when using WordDelimiterFilter innocently
> added to text_general or text_en)
> 5. docid-001 => docid OR 001 rather than "DOCID 001"
> All of those queries will give surprising and unexpected results.
> Note: The hyphen issue is present in StandardTokenizer, even if WDF is not
> used. Side note: The full behavior of StandardTokenizer should be more fully
> documented on the Analyzers wiki.
> Back to the history of the change, there was a lot of lively discussion on
> SOLR-2015 - add a config hook for autoGeneratePhraseQueries.
> And the actual change to default to the behavior described above was
> SOLR-2519 - improve defaults for text_* field types.
> (Consider the entire discussion in those two issues incorporated here for
> reference. Anyone wishing to participate in discussion on this issue would be
> well-advised to study those two issues first.)
> I gather that the original motivation was for non-European languages, and
> that even some European languages might search better without auto-phrase
> generation, but the decision to default English terms to NOT automatically
> generate phrase queries and to generate OR queries instead is rather
> surprising and unexpected and outright undesirable, as my examples above show.
> I had been aware of the behavior for quite some time, but I had thought it
> was simply a lingering bug so I paid little attention to it, until I stumbled
> across this autoGeneratePhraseQueries "feature" while looking at the query
> parser code. I can understand the need to disable automatic phrase queries
> for SOME languages, but to disable it by default for English seems rather
> bizarre, as my simple use cases above show.
> Even if no action is taken on this Jira, I feel that it is important that
> there be a wider awareness of the significant and unexpected impact from
> SOLR-2519, and that what had seemed like buggy behavior was done
> intentionally.
> Unless there has been a change of heart since SOLR-2015/2519, I guess we are
> stuck with the default TextField behavior, but at least we could improve the
> example schema in several ways:
> 1. The English text field types should have autoGeneratePhraseQueries=true.
> If a user innocently adds a word delimiter to text_en, for example, they need
> to know that autoGeneratePhraseQueries=true is needed. Better to preempt that
> confusion and put the attribute in now. In fact, hyphenated terms fail as I
> have noted above, so the addition is needed even if a WDF is not added.
> 2. Add commentary about the impact of autoGeneratePhraseQueries=true/false -
> in terms of use case examples, as above. Specifically note the ones that will
> break if the feature is disabled.
> Another, more controversial change will be:
> 3. Change text_general to autoGeneratePhraseQueries=true so that English will
> be treated reasonably by default. I suspect that most European languages will
> be at least "okay". A comment will note that this field attribute should be
> removed or set to false for non-whitespace languages, or that an alternative
> field type should be used. I suspect that the first thing any non-whitespace
> language application will want to do is pick the text field type that has
> analysis that makes the most sense for them, so I see no need to mess up
> English for no good reason.
> Make no mistake, #3 is the primary and only real goal of this OOTB
> improvement. Maybe "text_general" could be kept as is for reference as the
> purported "general" text field type (except that it doesn't work well for
> English, as shown above), and maybe there should be a "text_default" that I
> would propose should be a literal copy of text_en with commentary to direct
> users to the other choices for language.
> I would note that text_ja already has autoGeneratePhraseQueries=false, so I'm
> not sure why the default in the TextField code had to be changed to false.
> Any languages for which automatic phrase query generation is problematic
> should be attributed similarly. But, now that it is wired into the schema
> defaults, we may be stuck with it.
> I was rather surprised that SOLR-2519 actually changed the default in
> TextField rather than simply set the attribute as appropriate for the various
> text field types.
> There are probably also a couple of places in the wikis where the surprising
> behavior should be noted. There is literally no wiki documentation for this
> important feature.
> There are only two references to autoGeneratePhraseQueries, with no
> discussion of exactly what this feature does or what the downside is if it
> is disabled.
> In the past, there was no need to document the treatment of embedded word
> delimiters (well, okay, the poor handling for non-whitespace languages SHOULD
> have been documented), but now there is no documentation of the degradation
> of what was a default and implicit feature that a lot of people assume should
> be automatic.
> And, I would propose that the 4.0 CHANGES.TXT very clearly highlight the
> kinds of use cases that unsuspecting users may not realize were BROKEN by the
> commit of SOLR-2519 that is masked under the innocent phrasing of "improve
> defaults for text_* field types". How many users seriously understood that a
> query with embedded dashes and commas behaves differently as a result of that
> change?
> I am contemplating whether to suggest that the WordDelimiterFilter should
> also be part of the default text field type. Right now, it is hidden off in
> text_en_splitting.
> I think stemming should also be part of the default English field type. The
> whole point of the "example" schema is to show off the best of Lucene/Solr.
> I'm not quite ready to propose that English be the default language supported
> by the example schema, but I am 99.999% certain that we should focus it on
> European, Roman, Latin languages. Non-European languages are indeed
> important, and should probably have their own schema. text_general was a good
> idea, but in hindsight it appears to have not been such a great idea in light
> of the word-splitting problems I have highlighted above.
> Maybe I would propose that text_general be left as is, but that we add
> text_default which is a copy of text_en (which would have WDF and stemming
> added) and fields use text_default as their type.
> That way, it would be clear
> what is going on and users could sensibly see what needs to happen if they
> wish to switch default languages.
> After discussion settles, a revised final proposal will be composed. And some
> specific and non-controversial issues may be split into separate Jira issues.
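
For readers who want to see the difference described in the issue's examples (CD-ROM, 1,000, out-of-the-box, 3.6, docid-001), here is a toy Python sketch. It is not Solr or Lucene code: `split_subterms` and `build_query` are hypothetical helpers, and the regex split on non-alphanumeric characters is only an assumed stand-in for whatever the configured analyzer (StandardTokenizer, WordDelimiterFilter) actually does. It only models the choice between OR-ing the sub-terms and generating a phrase query.

```python
import re


def split_subterms(term):
    """Toy word-splitting: break a term on runs of non-alphanumeric characters.

    Assumption: this only approximates analysis; real tokenizers and filters
    have many more rules.
    """
    return [part for part in re.split(r"[^0-9A-Za-z]+", term) if part]


def build_query(term, auto_generate_phrase_queries):
    """Return a query string for a single whitespace-delimited term.

    With the flag on, multiple sub-terms become a quoted phrase; with it off,
    they are OR-ed together, which is the behavior the issue objects to.
    """
    parts = split_subterms(term)
    if len(parts) <= 1:
        # Nothing was split, so the flag makes no difference.
        return term
    if auto_generate_phrase_queries:
        return '"' + " ".join(parts) + '"'
    return " OR ".join(parts)


if __name__ == "__main__":
    for term in ["CD-ROM", "1,000", "out-of-the-box", "3.6", "docid-001"]:
        print(term, "=>", build_query(term, False), "|", build_query(term, True))
```

In this model, the flag only matters when analysis yields more than one sub-term for a term the user typed without spaces, which is why the proposal above concerns field types whose analysis chain splits on punctuation.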