Where are the actual relevance measurements showing degradation? For every example you have, I can give you a counter-example, including whole languages that flat out won't work at all.
Anyone who *wants* a phrase query can ask for one with double quotes. If you force this option on, users have no way to turn it off. I'm strongly opposed. I couldn't care less about English.

On Wed, Aug 8, 2012 at 8:13 PM, Jack Krupansky <j...@basetechnology.com> wrote:
> Digging through the Jira and revision history, I discovered that back at the
> end of May 2011, a change was made to Solr that fairly significantly
> degrades the OOTB behavior for Solr queries, namely for word-splitting of
> terms with embedded punctuation, so that they end up, by default, doing the
> OR of the sub-terms, rather than doing the obvious phrase query of the
> sub-terms.
>
> Just a couple of examples:
>
> CD-ROM => CD OR ROM rather than "CD ROM"
> 1,000 => 1 OR 000 rather than "1 000" (when using the WordDelimiterFilter)
> out-of-the-box => out OR of OR the OR box rather than "out of the box"
> 3.6 => 3 OR 6 rather than "3 6" (when using WordDelimiterFilter)
> docid-001 => docid OR 001 rather than "DOCID 001"
>
> All of those queries will give surprising and unexpected results.
>
> Back to the history of the change, there was a lot of lively discussion on
> SOLR-2015 - add a config hook for autoGeneratePhraseQueries:
> https://issues.apache.org/jira/browse/SOLR-2015
>
> And the actual change to default to the behavior described above was
> SOLR-2519 - improve defaults for text_* field types:
> https://issues.apache.org/jira/browse/SOLR-2519
>
> I gather that the original motivation was for non-European languages, and
> that even some European languages might search better without auto-phrase
> generation, but the decision to default English terms to NOT automatically
> generate phrase queries and to generate OR queries instead is rather
> surprising and unexpected and outright undesirable, as my examples above
> show.
>
> I had been aware of the behavior for quite some time, but I had thought it
> was simply a lingering bug so I paid little attention to it, until I
> stumbled across this autoGeneratePhraseQueries "feature" while looking at
> the query parser code. I can understand the need to disable automatic phrase
> queries for SOME languages, but to disable it by default for English seems
> rather bizarre, as my simple use cases above show.
>
> I'll file this as a Jira, but I wanted to call wider attention to it in case
> others were as unaware as me that what had seemed like buggy behavior was
> done intentionally.
>
> Unless there has been a change of heart since SOLR-2015/2519, I guess we are
> stuck with the default TextField behavior, but at least we could improve the
> example schema in several ways:
>
> 1. The English text field types should have autoGeneratePhraseQueries=true.
> 2. Add commentary about the impact of autoGeneratePhraseQueries=true/false -
> in terms of use case examples, as above. Specifically note the ones that
> will break if the feature is disabled.
>
> Another, more controversial change will be:
>
> 3. Change text_general to autoGeneratePhraseQueries=true so that English
> will be treated reasonably by default. I suspect that most European
> languages will be at least "okay". A comment will note that this field
> attribute should be removed or set to false for non-whitespace languages, or
> that an alternative field type should be used. I suspect that the first
> thing any non-whitespace language application will want to do is pick the
> text field type that has analysis that makes the most sense for them, so I
> see no need to mess up English for no good reason.
>
> Make no mistake, #3 is the primary and only real goal of this OOTB
> improvement. Maybe "text_general" could be kept as is for reference as the
> purported "general" text field type (except that it doesn't work well for
> English,
> as shown above), and maybe there should be a "text_default" that I
> would propose should be text_en with commentary to direct users to the other
> choices for language.
>
> I would note that text_ja already has autoGeneratePhraseQueries=false, so
> I'm not sure why the default in the TextField code had to be changed to
> false. Any languages for which automatic phrase query generation is
> problematic should be attributed similarly. But, now that it is wired into
> the schema defaults, we may be stuck with it.
>
> I was rather surprised that SOLR-2519 actually changed the default in
> TextField rather than simply set the attribute as appropriate for the
> various text field types.
>
> There are probably also a couple of places in the wikis where the surprising
> behavior should be noted.
>
> And, I would propose that the 4.0 CHANGES.TXT very clearly highlight the
> kinds of use cases that unsuspecting users may not realize were BROKEN by
> the commit of SOLR-2519 that is masked under the innocent phrasing of
> "improve defaults for text_* field types". How many users seriously
> understood that a query with embedded dashes and commas behaves differently
> as a result of that change?
>
> I am contemplating whether to suggest that the WordDelimiterFilter should
> also be part of the default text field type. Right now, it is hidden off in
> text_en_splitting.
>
> I'll file the Jira tomorrow. Feel free to hold off comments until the Jira
> appears.
>
> -- Jack Krupansky
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>

--
lucidimagination.com
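
For readers following the thread, the two behaviors being argued over in the quoted examples can be sketched in a few lines of Python. This is only an illustration, not Solr's query parser or the WordDelimiterFilter: the function names `split_subterms` and `build_query` are hypothetical, the splitting rule (break on any non-alphanumeric character) is a simplifying assumption, and the default operator is assumed to be OR as in the examples.

```python
import re


def split_subterms(term):
    """Split a term on embedded punctuation (assumed rule: any non-alphanumeric run)."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", term) if t]


def build_query(term, auto_generate_phrase_queries):
    """Render the query a split term would turn into, as a plain string.

    With the flag on, the sub-terms become a quoted phrase query;
    with it off, they are joined with OR (assumed default operator).
    """
    subterms = split_subterms(term)
    if len(subterms) <= 1:
        # Nothing was split, so the flag makes no difference.
        return term
    if auto_generate_phrase_queries:
        return '"' + " ".join(subterms) + '"'
    return " OR ".join(subterms)


if __name__ == "__main__":
    for term in ["CD-ROM", "1,000", "out-of-the-box", "3.6", "docid-001"]:
        print(term, "=>", build_query(term, False), "vs", build_query(term, True))
```

The sketch covers only the punctuation case. It says nothing about languages written without whitespace, where the analyzer itself produces multiple tokens from a single query word, which is the case raised in the reply above.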