[jira] [Commented] (SOLR-3723) Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true

Yonik Seeley (JIRA) Thu, 09 Aug 2012 06:52:22 -0700

    [ https://issues.apache.org/jira/browse/SOLR-3723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431808#comment-13431808 ]


Yonik Seeley commented on SOLR-3723:
------------------------------------

bq. Note: The hyphen issue is present in StandardTokenizer, even if WDF is not 
used.

Ouch!  I hadn't realized that.
I just verified that with our stock setup, a query for F-22 finds anything with 
an F or with a 22 in the document.  I agree this is bad default behavior.
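
To make the F-22 case concrete, here is a minimal sketch of the two behaviors. It is a toy model and not Solr's query parser: the helper names (split_subterms, build_query, matches) are hypothetical, and it assumes that sub-terms are produced by splitting on any non-alphanumeric character and lowercasing, which only approximates what StandardTokenizer or WordDelimiterFilter do.

```python
import re


def split_subterms(text):
    """Split on non-alphanumeric characters and lowercase the pieces.

    Assumption: a rough stand-in for tokenizer/WDF splitting, not the real rules.
    """
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]


def build_query(term, auto_generate_phrase_queries):
    """Return a phrase of the sub-terms if the flag is set, otherwise an OR of them."""
    parts = split_subterms(term)
    if len(parts) <= 1:
        return "".join(parts)
    if auto_generate_phrase_queries:
        return '"' + " ".join(parts) + '"'
    return " OR ".join(parts)


def matches(doc, term, auto_generate_phrase_queries):
    """Toy matcher: phrase means adjacent and in order; OR means any sub-term occurs."""
    doc_tokens = split_subterms(doc)
    parts = split_subterms(term)
    if not parts:
        return False
    if auto_generate_phrase_queries:
        n = len(parts)
        return any(doc_tokens[i:i + n] == parts
                   for i in range(len(doc_tokens) - n + 1))
    return any(p in doc_tokens for p in parts)


if __name__ == "__main__":
    docs = ["The F-22 Raptor", "Section F of the report", "22 pages long"]
    for flag in (False, True):
        print(flag, build_query("F-22", flag),
              [d for d in docs if matches(d, "F-22", flag)])
```

In this model the OR form matches all three sample documents, while the phrase form matches only the one that actually contains F-22.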

                
> Improve OOTB behavior: English word-splitting should default to 
> autoGeneratePhraseQueries=true
> ----------------------------------------------------------------------------------------------
>
>                 Key: SOLR-3723
>                 URL: https://issues.apache.org/jira/browse/SOLR-3723
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.4, 3.5, 3.6, 4.0-ALPHA, 3.6.1
>            Reporter: Jack Krupansky
>
> Digging through the Jira and revision history, I discovered that back at the 
> end of May 2011, a change was made to Solr that fairly significantly degrades 
> the OOTB behavior for English Solr queries, namely for word-splitting of 
> terms with embedded punctuation, so that they end up, by default, doing the 
> OR of the sub-terms, rather than doing the obvious phrase query of the 
> sub-terms.
> Just a couple of examples:
> 1. CD-ROM => CD OR ROM rather than "CD ROM"
> 2. 1,000 => 1 OR 000 rather than "1 000" (when using the WordDelimiterFilter 
> innocently added to text_general or text_en)
> 3. out-of-the-box => out OR of OR the OR box rather than "out of the box"
> 4. 3.6 => 3 OR 6 rather than "3 6" (when using WordDelimiterFilter innocently 
> added to text_general or text_en)
> 5. docid-001 => docid OR 001 rather than "docid 001"
> All of those queries will give surprising and unexpected results.
> Note: The hyphen issue is present in StandardTokenizer, even if WDF is not 
> used. Side note: The full behavior of StandardTokenizer should be more fully 
> documented on the Analyzers wiki.
> Back to the history of the change, there was a lot of lively discussion on 
> SOLR-2015 - add a config hook for autoGeneratePhraseQueries.
> And the actual change to default to the behavior described above was 
> SOLR-2519 - improve defaults for text_* field types.
> (Consider the entire discussion in those two issues incorporated here for 
> reference. Anyone wishing to participate in discussion on this issue would be 
> well-advised to study those two issues first.)
> I gather that the original motivation was for non-European languages, and 
> that even some European languages might search better without auto-phrase 
> generation, but the decision to default English terms to NOT automatically 
> generate phrase queries and to generate OR queries instead is rather 
> surprising and unexpected and outright undesirable, as my examples above show.
> I had been aware of the behavior for quite some time, but I had thought it 
> was simply a lingering bug so I paid little attention to it, until I stumbled 
> across this autoGeneratePhraseQueries "feature" while looking at the query 
> parser code. I can understand the need to disable automatic phrase queries 
> for SOME languages, but to disable it by default for English seems rather 
> bizarre, as my simple use cases above show.
> Even if no action is taken on this Jira, I feel that it is important that 
> there be a wider awareness of the significant and unexpected impact from 
> SOLR-2519, and that what had seemed like buggy behavior was done 
> intentionally.
> Unless there has been a change of heart since SOLR-2015/2519, I guess we are 
> stuck with the default TextField behavior, but at least we could improve the 
> example schema in several ways:
> 1. The English text field types should have autoGeneratePhraseQueries=true. 
> If a user innocently adds a word delimiter to text_en, for example, they need 
> to know that autoGeneratePhraseQueries=true is needed. Better to preempt that 
> confusion and put the attribute in now. In fact, hyphenated terms fail as I 
> have noted above, so the addition is needed even if a WDF is not added.
> 2. Add commentary about the impact of autoGeneratePhraseQueries=true/false - 
> in terms of use case examples, as above. Specifically note the ones that will 
> break if the feature is disabled.
> Another, more controversial change will be:
> 3. Change text_general to autoGeneratePhraseQueries=true so that English will 
> be treated reasonably by default. I suspect that most European languages will 
> be at least "okay". A comment will note that this field attribute should be 
> removed or set to false for non-whitespace languages, or that an alternative 
> field type should be used. I suspect that the first thing any non-whitespace 
> language application will want to do is pick the text field type that has 
> analysis that makes the most sense for them, so I see no need to mess up 
> English for no good reason.
> Make no mistake, #3 is the primary and only real goal of this OOTB 
> improvement. Maybe "text_general" could be kept as is for reference as the 
> purported "general" text field type (except that it doesn't work well for 
> English, as shown above), and maybe there should be a "text_default" that I 
> would propose should be a literal copy of text_en with commentary to direct 
> users to the other choices for language.
> I would note that text_ja already has autoGeneratePhraseQueries=false, so I'm 
> not sure why the default in the TextField code had to be changed to false. 
> Any languages for which automatic phrase query generation is problematic 
> should be attributed similarly. But, now that it is wired into the schema 
> defaults, we may be stuck with it.
> I was rather surprised that SOLR-2519 actually changed the default in 
> TextField rather than simply set the attribute as appropriate for the various 
> text field types.
> There are probably also a couple of places in the wikis where the surprising 
> behavior should be noted. There is literally no wiki documentation for this 
> important feature. There are only two references to 
> autoGeneratePhraseQueries, with no discussion of exactly what this feature 
> does or what the downside is if it is disabled.
> In the past, there was no need to document the treatment of embedded word 
> delimiters (well, okay, the poor handling for non-whitespace languages SHOULD 
> have been documented), but now there is no documentation of the degradation 
> of what was a default and implicit feature that a lot of people assume should 
> be automatic.
> And, I would propose that the 4.0 CHANGES.TXT very clearly highlight the 
> kinds of use cases that unsuspecting users may not realize were BROKEN by the 
> commit of SOLR-2519 that is masked under the innocent phrasing of "improve 
> defaults for text_* field types". How many users seriously understood that a 
> query with embedded dashes and commas would behave differently as a result of 
> that change?
> I am contemplating whether to suggest that the WordDelimiterFilter should 
> also be part of the default text field type. Right now, it is hidden off in 
> text_en_splitting.
> I think stemming should also be part of the default English field type. The 
> whole point of the "example" schema is to show-off the best of Lucene/Solr.
> I'm not quite ready to propose that English be the default language supported 
> by the example schema, but I am 99.999% certain that we should focus it on 
> European, Roman, Latin languages. Non-European languages are indeed 
> important, and should probably have their own schema. text_general was a good 
> idea, but in hindsight it appears to have not been such a great idea in light 
> of the word-splitting problems I have highlighted above.
> Maybe I would propose that text_general be left as is, but that we add 
> text_default which is a copy of text_en (which would have WDF and stemming 
> added) and fields use text_default as their type. That way, it would be clear 
> what is going on and users could sensibly see what needs to happen if they 
> wish to switch default languages.
> After discussion settles, a revised final proposal will be composed. And some 
> specific and non-controversial issues may be split into separate Jira issues.
