[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml

2011-05-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040298#comment-13040298
 ] 

Michael McCandless commented on SOLR-2519:
--

bq. I think we need to stop kidding ourselves about example/default and just 
recognize that 99.999% of users just use the example as their default 
configuration. Guys, the example is the default, there is simply no argument, 
this is the reality! So I think we should present reasonable field type names 
such as text_en etc. Please don't waste any more of our time trying to convince 
users that the default is actually an example, it's a default.

OK I agree.  So I'll rename the fields back to text_XX (instead of 
text_example_XX).

bq. 3. The aggressive analysis is totally unnecessary and gives bad results, 
this is not 1985... Let's drop the porter stemmer and the stopwords list and 
replace them with less aggressive defaults such as s-stemmer and a commongrams 
configuration.

Sounds great!  Can you post the analyzer XML for this?  Kinda out of my 
league at this point :)

bq. 4. I do not think the default query parser should be the lucene one, if we 
have a fancy one (edismax?) that happily handles user input without 
exceptions... why not just default to the best we have to offer?!

+1

Robert maybe you can take the patch and iterate w/ these changes...?


> Improve the defaults for the "text" field type in default schema.xml
> 
>
>                 Key: SOLR-2519
>                 URL: https://issues.apache.org/jira/browse/SOLR-2519
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.2, 4.0
>
>         Attachments: SOLR-2519.patch, SOLR-2519.patch, SOLR-2519.patch
>
>
> Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5
> The text fieldType in schema.xml is unusable for non-whitespace
> languages, because it has the dangerous auto-phrase feature (of
> Lucene's QP -- see LUCENE-2458) enabled.
> Lucene leaves this off by default, as does ElasticSearch
> (http://www.elasticsearch.org/).
> Furthermore, the "text" fieldType uses WhitespaceTokenizer when
> StandardTokenizer is a better cross-language default.
> Until we have language specific field types, I think we should fix
> the "text" fieldType to work well for all languages, by:
>   * Switching from WhitespaceTokenizer to StandardTokenizer
>   * Turning off auto-phrase
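
For readers following along, the two changes listed above might look roughly
like this in schema.xml. This is only a sketch, not the attached patch: the
positionIncrementGap value and the LowerCaseFilter are assumptions added to
make the fragment complete.

```xml
<!-- Sketch only: a "text" fieldType with the two proposed changes applied. -->
<!-- positionIncrementGap and the lowercase filter are illustrative assumptions. -->
<fieldType name="text" class="solr.TextField"
           positionIncrementGap="100"
           autoGeneratePhraseQueries="false">  <!-- auto-phrase turned off -->
  <analyzer>
    <!-- StandardTokenizer instead of WhitespaceTokenizer -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With autoGeneratePhraseQueries="false", a query term that the analyzer splits
into several tokens is no longer turned into an implicit phrase query, which
is the behavior that hurts languages written without whitespace.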

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml

2011-05-27 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040129#comment-13040129
 ] 

Robert Muir commented on SOLR-2519:
---

A few opinions:

1. First of all, I am +1 to the patch. I think it's an improvement overall; 
however, I think it might be worthwhile to discuss the issues below.

2. I think we need to stop kidding ourselves about example/default and just 
recognize that 99.999% of users just use the example as their default 
configuration. Guys, the example is the default, there is simply no argument, 
this is the reality!  So I think we should present reasonable field type names 
such as text_en etc. Please don't waste any more of our time trying to convince 
users that the default is actually an example, it's a default.

3. The aggressive analysis is totally unnecessary and gives bad results, this 
is not 1985... Let's drop the porter stemmer and the stopwords list and replace 
them with less aggressive defaults such as s-stemmer and a commongrams 
configuration.
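
A minimal sketch of what such a less aggressive English chain might look like,
assuming the s-stemmer means solr.EnglishMinimalStemFilterFactory and that the
common-words list is kept in a file called stopwords.txt (the file name, the
filter order, and the type name text_en are assumptions, not taken from any
patch on this issue):

```xml
<!-- Sketch only: CommonGrams + minimal (plural-only) stemming in place of
     StopFilter + Porter. File name and filter order are assumptions. -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- keeps common words and additionally indexes bigrams such as "the_cat" -->
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <!-- light English stemming: mostly strips plural endings -->
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- query-side counterpart: prefers the bigrams when a common word is present -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>
```

The idea is that common words are no longer thrown away, so phrases made
mostly of stopwords remain searchable, while the bigrams keep phrase queries
on those words cheap.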

4. I do not think the default query parser should be the lucene one, if we have 
a fancy one (edismax?) that happily handles user input without exceptions... 
why not just default to the best we have to offer?!
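
As an illustration of point 4, switching the default parser would probably be
a small change in solrconfig.xml. The handler name and the qf field below are
placeholders chosen for the example, not values from the shipped config:

```xml
<!-- Sketch only: make edismax the default query parser for a search handler. -->
<!-- Handler name "search" and the qf field "text" are hypothetical. -->
<requestHandler name="search" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">text</str>
  </lst>
</requestHandler>
```

Clients could still ask for the lucene parser per request by passing
defType=lucene, so nothing is taken away from users who want the strict
syntax.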





[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml

2011-05-19 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036104#comment-13036104
 ] 

Michael McCandless commented on SOLR-2519:
--

+1 to naming these fields text_example_XXX.  That's a great idea Jan.  I'll do 
that in my next patch...




Re: [jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml

2011-05-18 Thread Erick Erickson
+1. I've seen far too many implementations of Solr that blindly use
the example configurations and then wonder why the results are
surprising (WordDelimiterFilterFactory by itself has confused more
people than I can recollect).

Although, just to contradict myself, I guess if people don't really
look at the configs, they deserve the consequences...

And to contra-contradict myself, at least that would give us a clue on
the user's list about where to look first!

Erick

2011/5/18 Jan Høydahl (JIRA) :
>
>    [ 
> https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035796#comment-13035796
>  ]
>
> Jan Høydahl commented on SOLR-2519:
> ---
>
> Largely agree with @Hoss' suggestion. But I think it would be wise to 
> emphasize that the example schema is just that - an *example* - encouraging 
> people to create new fieldTypes instead of editing the example ones. It's not 
> a problem for "int", "date" etc, but for text I always encourage our 
> customers and students to stay away from the FieldTypes in the example and 
> make their own versions instead.
>
> One way to further encourage this best practice is naming all text FieldTypes 
> clearly as examples, e.g.
>
> {code}
> 
> 
> {code}
>
> We must realize that a lot of non-American users out there are already 
> customizing their schemas with the naming pattern "text_", which means 
> you'll find "text_en", "text_it", "text_no" in a lot of installations. 
> Therefore it would be unwise to introduce new FieldTypes which clash with 
> those names out of the box in version 3.2; thus, include _example in the type 
> name.
>
> When upgrading, I always leave all the example field types intact, and add my 
> custom ones separately, clearly marked by comments for easy copy/paste. I 
> believe this to be a fairly common practice, and wanted as well, which would 
> give no clashes for the above example.
>
> With this example naming practice, we can be pretty sure that if people talk 
> about the fieldType "text_example_en" on the lists, they mean the default 
> example type, but if they talk about "text_en", it's something they've 
> customized themselves (if so by simply renaming the example). There will be more 
> mental resistance for people to start modifying something with "_example" in 
> it without also changing the name.
>



[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml

2011-05-18 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035796#comment-13035796
 ] 

Jan Høydahl commented on SOLR-2519:
---

Largely agree with @Hoss' suggestion. But I think it would be wise to emphasize 
that the example schema is just that - an *example* - encouraging people to 
create new fieldTypes instead of editing the example ones. It's not a problem 
for "int", "date" etc, but for text I always encourage our customers and 
students to stay away from the FieldTypes in the example and make their own 
versions instead.

One way to further encourage this best practice is naming all text FieldTypes 
clearly as examples, e.g. 

{code}


{code}

We must realize that a lot of non-American users out there are already 
customizing their schemas with the naming pattern "text_", which means 
you'll find "text_en", "text_it", "text_no" in a lot of installations. 
Therefore it would be unwise to introduce new FieldTypes which clash with 
those names out of the box in version 3.2; thus, include _example in the type 
name.

When upgrading, I always leave all the example field types intact, and add my 
custom ones separately, clearly marked by comments for easy copy/paste. I 
believe this to be a fairly common practice, and wanted as well, which would 
give no clashes for the above example.
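
The practice described above could look something like the following in a
schema.xml. This is a hypothetical sketch: the analyzer chains are
placeholders, and "text_no" simply stands in for any site-specific type.

```xml
<!-- Shipped example type: left exactly as delivered, so it can be diffed
     against the next release. Chain shown here is a placeholder. -->
<fieldType name="text_example_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- ===== CUSTOM TYPES START (copy this block when upgrading) ===== -->
<fieldType name="text_no" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Norwegian"/>
  </analyzer>
</fieldType>
<!-- ===== CUSTOM TYPES END ===== -->
```

Because the shipped names carry the _example marker, a custom "text_no" or
"text_en" cannot collide with anything a later release adds.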

With this example naming practice, we can be pretty sure that if people talk 
about the fieldType "text_example_en" on the lists, they mean the default 
example type, but if they talk about "text_en", it's something they've 
customized themselves (if so by simply renaming the example). There will be more 
mental resistance for people to start modifying something with "_example" in it 
without also changing the name.




[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml

2011-05-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034203#comment-13034203
 ] 

Robert Muir commented on SOLR-2519:
---

As someone frustrated by this (but who would ultimately like to move past it 
and try to help with solr's intl), I just wanted to say +1 to Hoss Man's 
proposal.

My only suggestion on what he said is that I would greatly prefer text_en over 
text_western or whatever for these reasons:
1. the stemming and stopwords and crap here are English.
2. for other western languages, even if you swap these out to be, say, French or 
Italian (which is the seemingly obvious way to cut over), the whole 
WDF+autophrase is still a huge trap (see 
http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance 
for an example). In this case, ElisionFilter can be used to avoid it.
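
To make the ElisionFilter remark concrete, a French type without WDF might be
sketched as follows. The type name and the articles file name are assumptions
for illustration; the file would list elided articles such as l, d, qu, one
per line.

```xml
<!-- Sketch only: strip elided articles instead of splitting on the apostrophe. -->
<!-- "text_fr" and "contractions_fr.txt" are hypothetical names. -->
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- lowercase first so that "L'avion" and "l'avion" are treated alike -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- removes the leading article: l'avion is indexed as avion -->
    <filter class="solr.ElisionFilterFactory" articles="contractions_fr.txt"/>
  </analyzer>
</fieldType>
```

Since the article is removed rather than split off as its own token, a query
for l'avion stays a single-term query and never becomes an auto-generated
phrase.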




[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034185#comment-13034185
 ] 

Michael McCandless commented on SOLR-2519:
--

bq. Bottom line: it's less confusing to remove <fieldType/> and add new ones 
with new names than to make radical changes to existing ones.

Ahh, this makes great sense!

I really like your proposal Hoss, and that's a great point about emails to the 
mailing lists.

So we'd have no more text fieldType.  Just text_en (what text now is) and 
text_general (basically just StandardAnalyzer, but maybe move/absorb "textgen" 
over).

Over time we can add in more language specific text_XX fieldTypes...
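
As a rough sketch of what "basically just StandardAnalyzer" could mean as a
schema.xml type (the stopword file name is an assumption, and whether
text_general should remove stopwords at all is left open here):

```xml
<!-- Sketch only: an approximation of StandardAnalyzer as a Solr fieldType. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- stopwords.txt is an assumed file name -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>
```

There is no stemmer and no WordDelimiterFilter in this chain, which is what
keeps it usable when the language of the content is unknown.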




[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml

2011-05-16 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034176#comment-13034176
 ] 

Hoss Man commented on SOLR-2519:


bq. Also: existing users would be unaffected by this? They've already copied 
over / edited their own schema.xml? This is mainly about new users?

The trap we've seen with this type of thing in the past (ie: the numeric 
fields) is that people who tend to use the example configs w/o changing them 
much refer to the example field types by name when talking about them on the 
mailing list, not considering that those names can have different meanings 
depending on version.

if we make radical changes to a {{<fieldType/>}} but leave the name alone, it 
could confuse a lot of people, ie: "i tried using the 'text' field but it 
didn't work"; "which version of solr are you using?"; "Solr 4.1"; "that should 
work, what exactly does your schema look like"; "..."; "that's the schema from 
3.6"; "yeah, i started with 3.6 and then upgraded to 4.1 later", etc...

Bottom line: it's less confusing to *remove* {{<fieldType/>}} and add new ones 
with new names than to make radical changes to existing ones.




[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml

2011-05-16 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034172#comment-13034172
 ] 

Hoss Man commented on SOLR-2519:


I feel like we are conflating two issues here: the "default" behavior of 
TextField, and the example configs.

i don't have any strong opinions about changing the default behavior of 
TextField when {{autoGeneratePhraseQueries}} is not specified in the 
{{<fieldType/>}}, but if we do make such a change, it should be contingent on 
the schema version property (which we should bump) so that people who upgrade 
will get consistent behavior with their existing configs (TextField.init 
already has an example of this for when we changed the default of {{omitNorms}}).
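
To illustrate the version-contingent default, a schema.xml might look like
the sketch below. The version number 1.4 is an assumption standing in for
"whatever the bumped version would be", and the type shown is a placeholder.

```xml
<!-- Sketch only. Assumption: a schema declaring the bumped version gets the
     new default (auto-phrase off); a schema still declaring the older
     version keeps the old default (auto-phrase on). -->
<schema name="example" version="1.4">
  <types>
    <!-- An explicit attribute overrides the version-dependent default,
         so this type behaves the same before and after an upgrade. -->
    <fieldType name="text_en" class="solr.TextField"
               positionIncrementGap="100"
               autoGeneratePhraseQueries="true">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>
  </types>
</schema>
```

Under this scheme, someone who carries an old schema.xml forward unchanged
sees no behavior change until they deliberately raise the version attribute.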

as far as the example configs: i agree with yonik, that changing "text" at this 
point might be confusing ... i think the best way to iterate moving forward 
would probably be:

* rename {{}} and {{}} to something 
that makes their purpose more clear (text_en, or text_western, or 
text_european, or some other more general descriptive word for the types of 
languages where it makes sense) and switch all existing {{<field/>}} 
declarations that currently use field type "text" to use this new name.

* add a new {{<fieldType/>}} which is designed (and 
documented) to be a general purpose field type when the language is unknown (it 
may make sense to fix/repurpose the existing {{}} 
for this, since it already suggests that's what it's for)

* Audit all {{<field/>}} declarations that use "text_en" (or whatever name was 
chosen above) and the existing sample data for those fields to see if it makes 
more sense to change them to "text_general". also change any where based on 
usage it shouldn't matter.

The end result being that we have no {{<fieldType/>}} named "text" in the 
example configs, so people won't get it confused with previous versions, and 
we'll have a new {{<fieldType/>}} that works as well as possible with all 
languages which we use as much as possible with the example data.
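
After such an audit, the field declarations in the example schema might end
up looking something like this. The field names and the choice of type for
each are purely hypothetical; the point is only that nothing refers to a type
named "text" any more.

```xml
<!-- Sketch only: hypothetical field declarations after the rename/audit. -->
<fields>
  <!-- content whose language is not known in advance -->
  <field name="name"     type="text_general" indexed="true" stored="true"/>
  <field name="comments" type="text_general" indexed="true" stored="true"/>
  <!-- sample data that is known to be English prose -->
  <field name="features" type="text_en"      indexed="true" stored="true" multiValued="true"/>
</fields>
```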









[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034158#comment-13034158
 ] 

Michael McCandless commented on SOLR-2519:
--

It's also spooky that "text" fieldType has different index
time vs query time analyzers?  Ie, WDF is configured differently.
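
For context, the asymmetry being described is roughly of the following shape.
This is an illustrative reconstruction, not a verbatim copy of the shipped
example schema, and it omits the other filters in the chain:

```xml
<!-- Sketch only: same filter, different options at index and query time. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- index side: emit the parts AND the catenation,
         e.g. wi-fi is indexed as wi, fi and wifi -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- query side: parts only, e.g. wi-fi becomes wi, fi -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The query side produces several tokens from one whitespace-delimited term,
which is exactly the situation in which auto-phrase turns the term into a
phrase query.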




[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034154#comment-13034154
 ] 

Michael McCandless commented on SOLR-2519:
--

bq. I think maybe there's a misconception that the fieldType named "text" was 
meant to be generic for all languages.

Regardless of what the original intention was, "text" today has become
the generic text fieldType new users use on starting with Solr.  I
mean, it has the perfect name for that :)

bq. As I said in the thread, if I had to do it over again, I would have named 
it "text_en" because that's what its purpose was.

Hindsight is 20/20... but, we can still fix this today.  We shouldn't
lock ourselves into poor defaults.

Especially, as things improve and we get better analyzers, etc., we
should be free to improve the defaults in schema.xml to take advantage
of these improvements.

bq. But at this point, it seems like the best way forward is to leave "text" as 
an english fieldType and simply add other fieldTypes that can support other 
languages.

I think this is a dangerous approach -- the name (ie, missing _en if
in fact it has such English-specific configuration) is misleading and
traps new users.

Ideally, in the future, we wouldn't even have a "text" fieldType, only
text_XX per-language examples and then maybe something like
text_general, which you use if you cannot find your language.

{quote}
Some downsides I see to this patch (i.e. trying to make the 'text' fieldType 
generic):

The current WordDelimiterFilter options on the fieldType feel like a trap for 
non-whitespace-delimited languages. WDF is configured to index catenations as 
well as splits... so all of the tokens (words?) that are split out are also 
catenated together and indexed (which seems like it could lead to some truly 
huge tokens erroneously being indexed.)
{quote}
Ahh good point.  I think we should remove WDF altogether from the
generic "text" fieldType.

{quote}
You left the English stemmer on the "text" fieldType... but if it's supposed to 
be generic, couldn't this be bad for some other western languages where it 
could cause stemming collisions of words not related to each other?
{quote}

+1, we should remove the stemming too from "text".

bq. Taking into account all the existing users (and all the existing 
documentation, examples, tutorial, etc), I favor a more conservative approach 
of adding new fieldTypes rather than radically changing the behavior of 
existing ones.

Can you point to specific examples (docs, examples, tutorial)?  I'd
like to understand how much work it is to fix these...

My feeling is we should simply do the work here (I'll sign up to it)
and fix any places that actually rely on the specifics of "text"
fieldType, eg autophrase.

We shouldn't avoid fixing things well because it's gonna be more work
today, especially if someone (me) is signing up to do it.

Also: existing users would be unaffected by this?  They've already
copied over / edited their own schema.xml?  This is mainly about new
users?





[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml

2011-05-16 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034120#comment-13034120
 ] 

Yonik Seeley commented on SOLR-2519:


I think maybe there's a misconception that the fieldType named "text" was meant 
to be generic for all languages.  As I said in the thread, if I had to do it 
over again, I would have named it "text_en" because that's what its purpose 
was.  But at this point, it seems like the best way forward is to leave "text" 
as an English fieldType and simply add other fieldTypes that can support other 
languages.

Some downsides I see to this patch (i.e. trying to make the 'text' fieldType 
generic):
- The current WordDelimiterFilter options on the fieldType feel like a trap for 
non-whitespace-delimited languages.  WDF is configured to index catenations as 
well as splits... so all of the tokens (words?) that are split out are also 
catenated together and indexed (which seems like it could lead to some truly 
huge tokens erroneously being indexed.)
- You left the English stemmer on the "text" fieldType... but if it's supposed 
to be generic, couldn't this be bad for some other Western languages where it 
could cause stemming collisions of words not related to each other?
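
To make the WDF point concrete, here is a sketch of the kind of
WordDelimiterFilter configuration being described.  The attribute names are
real WordDelimiterFilterFactory options, but the specific values are assumed
for illustration and are not quoted from the patch.

```xml
<!-- Illustrative settings only: splits AND catenations are both indexed. -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1" catenateAll="0"
        splitOnCaseChange="1"/>
```

With generateWordParts="1" and catenateWords="1", an input like "wi-fi" is 
indexed as "wi" and "fi" and also as the catenation "wifi".  The concern above 
is the same mechanism applied to a long run of word parts, where the catenated 
token could become very large.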

Taking into account all the existing users (and all the existing documentation, 
examples, tutorial, etc), I favor a more conservative approach of adding new 
fieldTypes rather than radically changing the behavior of existing ones.

Random question: what are the implications of changing from WhitespaceTokenizer 
to StandardTokenizer, esp w.r.t. WDF?




[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034101#comment-13034101
 ] 

Michael McCandless commented on SOLR-2519:
--

I think the attached patch is a good starting point. It fixes the
generic "text" fieldType to have good all around defaults for all
languages, so that non-whitespace languages work fine.

Then, I think we should iteratively add in custom languages over time
(as separate issues).  We can, e.g., add text_en_autophrase, text_en,
text_zh, etc.  We should at least do a first sweep of the nice analyzers
module and add fieldTypes for them.

This way we will eventually get to the ideal future when we have
text_XX coverage for many languages.
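
As a rough sketch of what such per-language types could look like (the
names follow the text_XX idea above; the analyzer chain, including the
choice of stemmer, is an assumption for illustration only):

```xml
<!-- Hypothetical per-language types; analyzer chains are assumed. -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Same chain, but with auto-phrase left on for those who rely on it. -->
<fieldType name="text_en_autophrase" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Keeping the language-specific pieces in separately named types leaves the
generic "text" type free of English-only analysis.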

