autoGeneratePhraseQueries sort of silently set to false

2012-02-23 Thread Naomi Dushay
Another thing I noticed when upgrading from Solr 1.4 to Solr 3.5 had to do with 
results for hyphenated words such as aaa-bbb.  Erik Hatcher pointed me to the 
autoGeneratePhraseQueries attribute now available on fieldType definitions in 
schema.xml.  This is a great feature, and everything is peachy if you start 
with Solr 3.4.  But many of us started earlier and are upgrading, and that's a 
different story.

It was surprising to me that

a.  the default for this new feature caused different search results than Solr 
1.4 

b.  it wasn't documented clearly, IMO

http://wiki.apache.org/solr/SchemaXml   makes no mention of it


In the schema.xml example, there is this at the top:

<!-- attribute "name" is the name of this schema and is only used for display purposes.
   Applications should change this to reflect the nature of the search collection.
   version="1.4" is Solr's version number for the schema syntax and semantics.  It should
   not normally be changed by applications.
   1.0: multiValued attribute did not exist, all fields are multiValued by nature
   1.1: multiValued attribute introduced, false by default
   1.2: omitTermFreqAndPositions attribute introduced, true by default except for text fields.
   1.3: removed optional field compress feature
   1.4: default auto-phrase (QueryParser feature) to off
 -->

And there was this in a couple of field definitions:

<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">

But that was it.



RE: autoGeneratePhraseQueries sort of silently set to false

2012-02-23 Thread Burton-West, Tom
Seems like a change in default behavior like this should be included in the 
changes.txt for Solr 3.5.
Not sure how to do that.

Tom




Re: autoGeneratePhraseQueries sort of silently set to false

2012-02-23 Thread Erik Hatcher
there's this (for 3.1, but in the 3.x CHANGES.txt):

* SOLR-2015: Add a boolean attribute autoGeneratePhraseQueries to TextField.
  autoGeneratePhraseQueries="true" (the default) causes the query parser to
  generate phrase queries if multiple tokens are generated from a single
  non-quoted analysis string.  For example WordDelimiterFilter splitting text:pdp-11
  will cause the parser to generate text:"pdp 11" rather than (text:PDP OR text:11).
  Note that autoGeneratePhraseQueries="true" tends to not work well for non whitespace
  delimited languages. (yonik)

with a ton of useful, though back and forth, commentary here: 
https://issues.apache.org/jira/browse/SOLR-2015
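
To make the difference in that CHANGES entry concrete, here is a rough sketch in plain Java. It is not Solr's query parser: the class AutoPhraseSketch and its methods are made up for illustration, and the tokenizer is only a crude stand-in for WordDelimiterFilter (it just splits on non-alphanumeric characters). It shows how one unquoted chunk that yields several tokens becomes either a phrase or a set of OR'ed terms.

```java
import java.util.ArrayList;
import java.util.List;

public class AutoPhraseSketch {
    // Crude stand-in for WordDelimiterFilter (an assumption, not the real filter):
    // split the chunk on runs of non-alphanumeric characters.
    static List<String> splitTokens(String chunk) {
        List<String> tokens = new ArrayList<>();
        for (String part : chunk.split("[^A-Za-z0-9]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Builds a query string for one unquoted chunk of user input.
    static String buildQuery(String field, String chunk, boolean autoGeneratePhraseQueries) {
        List<String> tokens = splitTokens(chunk);
        if (tokens.isEmpty()) {
            return "";
        }
        if (tokens.size() == 1) {
            // A single token is a plain term query either way.
            return field + ":" + tokens.get(0);
        }
        if (autoGeneratePhraseQueries) {
            // Multiple tokens from one chunk become a phrase query.
            return field + ":\"" + String.join(" ", tokens) + "\"";
        }
        // Otherwise each token becomes its own optional clause.
        List<String> clauses = new ArrayList<>();
        for (String token : tokens) {
            clauses.add(field + ":" + token);
        }
        return "(" + String.join(" OR ", clauses) + ")";
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("text", "pdp-11", true));
        System.out.println(buildQuery("text", "pdp-11", false));
    }
}
```

The phrase form only matches documents where the tokens appear next to each other, which is why hyphenated searches can return noticeably different result sets depending on the setting.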

Note that the default, as Naomi pointed out so succinctly, depends on the 
*schema* version setting (look at the version attribute on the schema element 
in your schema.xml).  The code is simply this:

if (schema.getVersion() > 1.3f) {
  autoGeneratePhraseQueries = false;
} else {
  autoGeneratePhraseQueries = true;
}

on TextField.  Specifying autoGeneratePhraseQueries explicitly on a field type 
overrides whatever the default may be.
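
Putting the two rules together (version-based default, explicit attribute wins), a minimal self-contained sketch might look like the following. The class AutoPhraseDefault and the resolve method are hypothetical names for illustration only, and the map simply stands in for the attributes on a fieldType element; none of this is Solr's actual API.

```java
import java.util.HashMap;
import java.util.Map;

public class AutoPhraseDefault {
    // Hypothetical helper: returns the effective autoGeneratePhraseQueries
    // value for a field type, given the schema version and the attributes
    // written on the fieldType element.
    static boolean resolve(float schemaVersion, Map<String, String> fieldTypeArgs) {
        String explicit = fieldTypeArgs.get("autoGeneratePhraseQueries");
        if (explicit != null) {
            // An explicit setting overrides whatever the default may be.
            return Boolean.parseBoolean(explicit);
        }
        // Same rule as the snippet above: schema versions after 1.3 default to off.
        return !(schemaVersion > 1.3f);
    }

    public static void main(String[] args) {
        Map<String, String> noAttribute = new HashMap<>();
        Map<String, String> explicitTrue = new HashMap<>();
        explicitTrue.put("autoGeneratePhraseQueries", "true");

        System.out.println(resolve(1.3f, noAttribute));   // old schema version, no attribute
        System.out.println(resolve(1.4f, noAttribute));   // new schema version, no attribute
        System.out.println(resolve(1.4f, explicitTrue));  // new schema version, explicit true
    }
}
```

So a schema declared as version 1.4 gets the new default unless the field type says otherwise, while one left at 1.3 or lower keeps phrase generation on.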

Erik






RE: autoGeneratePhraseQueries sort of silently set to false

2012-02-23 Thread Burton-West, Tom
Thanks Erik,

The 3.1 changes document the ability to set this, with the default being true.
However, apparently between 3.4 and 3.5 the default was changed to false.
Since this changes the behavior of any field type where 
autoGeneratePhraseQueries is not explicitly set, it could easily surprise users 
who upgrade to 3.5. 
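
For anyone wanting to check which of their field types would be affected, something along these lines could list the text field types that rely on the default. This is only a sketch under assumptions: the class ImplicitAutoPhraseAudit and its method are invented for illustration, it only looks at elements spelled fieldType with class exactly "solr.TextField", and it uses the JDK's built-in XML parser rather than anything from Solr.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ImplicitAutoPhraseAudit {
    // Returns the names of solr.TextField types in the given schema XML that
    // do not set autoGeneratePhraseQueries explicitly.
    static List<String> implicitTypes(String schemaXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(schemaXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse schema XML", e);
        }
        NodeList types = doc.getElementsByTagName("fieldType");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < types.getLength(); i++) {
            Element type = (Element) types.item(i);
            boolean isTextField = "solr.TextField".equals(type.getAttribute("class"));
            if (isTextField && !type.hasAttribute("autoGeneratePhraseQueries")) {
                result.add(type.getAttribute("name"));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Tiny made-up schema fragment, just to exercise the method.
        String schema = "<schema name=\"example\" version=\"1.4\"><types>"
                + "<fieldType name=\"text\" class=\"solr.TextField\"/>"
                + "<fieldType name=\"text_ja\" class=\"solr.TextField\""
                + " autoGeneratePhraseQueries=\"false\"/>"
                + "<fieldType name=\"string\" class=\"solr.StrField\"/>"
                + "</types></schema>";
        System.out.println(implicitTypes(schema));
    }
}
```

Any type it reports picks up whatever default the schema version implies, so those are the ones worth setting explicitly before upgrading.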

That's why I think the change in default behavior (i.e. when not explicitly 
set) should be called out explicitly in the changes.txt for 3.5.

True, everyone should read the notes in the example schema.xml, but I think it 
would help if the change were also noted in changes.txt.

Is it possible to revise the changes.txt for 3.5?

Do you by any chance know where the change in the default behavior was 
discussed?  I know it has been a contentious issue.

Tom
