[jira] [Updated] (SOLR-3723) Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true

Jack Krupansky (JIRA) Thu, 09 Aug 2012 06:14:24 -0700

     [ https://issues.apache.org/jira/browse/SOLR-3723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jack Krupansky updated SOLR-3723:
---------------------------------

    Description: 
Digging through the Jira and revision history, I discovered that back at the 
end of May 2011, a change was made to Solr that fairly significantly degrades 
the OOTB behavior for English Solr queries, namely for word-splitting of terms 
with embedded punctuation, so that they end up, by default, doing the OR of the 
sub-terms, rather than doing the obvious phrase query of the sub-terms.

A few examples:

1. CD-ROM => CD OR ROM rather than "CD ROM"

2. 1,000 => 1 OR 000 rather than "1 000" (when using the WordDelimiterFilter 
innocently added to text_general or text_en)

3. out-of-the-box => out OR of OR the OR box rather than "out of the box"

4. 3.6 => 3 OR 6 rather than "3 6" (when using WordDelimiterFilter innocently 
added to text_general or text_en)

5. docid-001 => docid OR 001 rather than "docid 001"

All of those queries will give surprising and unexpected results.
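
The difference between the two behaviors can be sketched in a few lines of 
Python. This is only a toy model, not Solr's query parser or analysis chain: 
the function names split_subterms and build_query are hypothetical, and the 
regular expression splits on every non-alphanumeric character, which is closer 
to what a WordDelimiterFilter would do than to StandardTokenizer alone.

```python
import re


def split_subterms(term):
    """Split a term on embedded punctuation (toy stand-in for word splitting)."""
    return [part for part in re.split(r"[^0-9A-Za-z]+", term) if part]


def build_query(term, auto_generate_phrase_queries):
    """Build a query string for one whitespace-delimited term.

    Hypothetical helper: mimics the effect of the autoGeneratePhraseQueries
    setting, not the real implementation.
    """
    parts = split_subterms(term)
    if len(parts) <= 1:
        # Nothing was split, so the setting makes no difference.
        return term
    if auto_generate_phrase_queries:
        # Sub-terms must appear adjacent and in order.
        return '"' + " ".join(parts) + '"'
    # Any one sub-term is enough to match.
    return " OR ".join(parts)


if __name__ == "__main__":
    for term in ["CD-ROM", "1,000", "out-of-the-box", "3.6", "docid-001"]:
        print(term, "=>", build_query(term, False), "vs", build_query(term, True))
```

With the flag off, a document mentioning only "ROM" matches a search for 
CD-ROM, which is the surprising result described above.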

Note: The hyphen issue is present in StandardTokenizer, even if WDF is not 
used. Side note: The full behavior of StandardTokenizer should be more fully 
documented on the Analyzers wiki.

Back to the history of the change, there was a lot of lively discussion on 
SOLR-2015 - add a config hook for autoGeneratePhraseQueries.

And the actual change to default to the behavior described above was SOLR-2519 
- improve defaults for text_* field types.

(Consider the entire discussion in those two issues incorporated here for 
reference. Anyone wishing to participate in discussion on this issue would be 
well-advised to study those two issues first.)

I gather that the original motivation was for non-European languages, and that 
even some European languages might search better without auto-phrase 
generation, but the decision to default English terms to NOT automatically 
generate phrase queries and to generate OR queries instead is rather surprising 
and unexpected and outright undesirable, as my examples above show.

I had been aware of the behavior for quite some time, but I had thought it was 
simply a lingering bug so I paid little attention to it, until I stumbled 
across this autoGeneratePhraseQueries "feature" while looking at the query 
parser code. I can understand the need to disable automatic phrase queries for 
SOME languages, but to disable it by default for English seems rather bizarre, 
as my simple use cases above show.

Even if no action is taken on this Jira, I feel that it is important that there 
be a wider awareness of the significant and unexpected impact from SOLR-2519, 
and that what had seemed like buggy behavior was done intentionally.

Unless there has been a change of heart since SOLR-2015/2519, I guess we are 
stuck with the default TextField behavior, but at least we could improve the 
example schema in several ways:

1. The English text field types should have autoGeneratePhraseQueries=true. If 
a user innocently adds a word delimiter to text_en, for example, they need to 
know that autoGeneratePhraseQueries=true is needed. Better to preempt that 
confusion and put the attribute in now. In fact, hyphenated terms fail as I 
have noted above, so the addition is needed even if a WDF is not added.

2. Add commentary about the impact of autoGeneratePhraseQueries=true/false - in 
terms of use case examples, as above. Specifically note the ones that will 
break if the feature is disabled.

Another, more controversial change will be:

3. Change text_general to autoGeneratePhraseQueries=true so that English will 
be treated reasonably by default. I suspect that most European languages will 
be at least "okay". A comment will note that this field attribute should be 
removed or set to false for non-whitespace languages, or that an alternative 
field type should be used. I suspect that the first thing any non-whitespace 
language application will want to do is pick the text field type that has 
analysis that makes the most sense for them, so I see no need to mess up 
English for no good reason.

Make no mistake, #3 is the primary and only real goal of this OOTB 
improvement. Maybe "text_general" could be kept as is for reference as the 
purported "general" text field type (except that it doesn't work well for 
English, as shown above), and maybe there should be a "text_default" that I 
would propose should be a literal copy of text_en with commentary to direct 
users to the other choices for language.

I would note that text_ja already has autoGeneratePhraseQueries=false, so I'm 
not sure why the default in the TextField code had to be changed to false. Any 
languages for which automatic phrase query generation is problematic should be 
attributed similarly. But, now that it is wired into the schema defaults, we 
may be stuck with it.

I was rather surprised that SOLR-2519 actually changed the default in TextField 
rather than simply set the attribute as appropriate for the various text field 
types.
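
To make the distinction concrete, here is a minimal sketch of how a 
code-level default and a per-type attribute interact. It is a toy model under 
stated assumptions, not Solr's TextField code: the FIELD_TYPES table and the 
auto_phrase function are hypothetical, and the text_en entry shows the change 
proposed in item 1 rather than the current example schema.

```python
# Code-level default after SOLR-2519, as described above.
TEXTFIELD_DEFAULT = False

# Hypothetical field type declarations; only explicitly set attributes appear.
FIELD_TYPES = {
    "text_general": {},                                  # inherits the default
    "text_en": {"autoGeneratePhraseQueries": True},      # proposed in item 1
    "text_ja": {"autoGeneratePhraseQueries": False},     # already explicit
}


def auto_phrase(field_type, default=TEXTFIELD_DEFAULT):
    """Return the effective setting: explicit attribute first, else the default."""
    return FIELD_TYPES[field_type].get("autoGeneratePhraseQueries", default)


if __name__ == "__main__":
    for name in FIELD_TYPES:
        print(name, auto_phrase(name, default=False), auto_phrase(name, default=True))
```

In this model text_ja resolves to false whichever default is in force, so only 
field types that leave the attribute unset are affected by the default.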

There are probably also a couple of places in the wikis where the surprising 
behavior should be noted. There is effectively no wiki documentation for this 
important feature: only two references to autoGeneratePhraseQueries, with no 
discussion of exactly what this feature does or what the downside is if it is 
disabled.

In the past, there was no need to document the treatment of embedded word 
delimiters (well, okay, the poor handling for non-whitespace languages SHOULD 
have been documented), but now there is no documentation of the degradation of 
what was a default and implicit feature that a lot of people assume should be 
automatic.

And, I would propose that the 4.0 CHANGES.TXT very clearly highlight the kinds 
of use cases that unsuspecting users may not realize were BROKEN by the commit 
of SOLR-2519 that is masked under the innocent phrasing of "improve defaults 
for text_* field types". How many users seriously understood that a query with 
embedded dashes and commas behaves differently as a result of that change?

I am contemplating whether to suggest that the WordDelimiterFilter should also 
be part of the default text field type. Right now, it is hidden off in 
text_en_splitting.

I think stemming should also be part of the default English field type. The 
whole point of the "example" schema is to show off the best of Lucene/Solr.

I'm not quite ready to propose that English be the default language supported 
by the example schema, but I am 99.999% certain that we should focus it on 
European, Roman, Latin languages. Non-European languages are indeed important, 
and should probably have their own schema. text_general was a good idea, but in 
hindsight it appears to have not been such a great idea in light of the 
word-splitting problems I have highlighted above.

Maybe I would propose that text_general be left as is, but that we add 
text_default which is a copy of text_en (which would have WDF and stemming 
added) and fields use text_default as their type. That way, it would be clear 
what is going on and users could sensibly see what needs to happen if they wish 
to switch default languages.

After discussion settles, a revised final proposal will be composed. And some 
specific and non-controversial issues may be split into separate Jira issues.



    
> Improve OOTB behavior: English word-splitting should default to 
> autoGeneratePhraseQueries=true
> ----------------------------------------------------------------------------------------------
>
>                 Key: SOLR-3723
>                 URL: https://issues.apache.org/jira/browse/SOLR-3723
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.4, 3.5, 3.6, 4.0-ALPHA, 3.6.1
>            Reporter: Jack Krupansky
>
