[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040298#comment-13040298 ]

Michael McCandless commented on SOLR-2519:
------------------------------------------

bq. I think we need to stop kidding ourselves about example/default and just recognize that 99.999% of users just use the example as their default configuration. Guys, the example is the default; there is simply no argument, this is the reality! So I think we should present reasonable field type names such as text_en etc. Please don't waste any more of our time trying to convince users that the default is actually an example; it's a default.

OK I agree. So I'll rename the fields back to text_XX (instead of text_example_XX).

bq. 3. The aggressive analysis is totally unnecessary and gives bad results, this is not 1985... Let's drop the porter stemmer and the stopwords list and replace them with less aggressive defaults such as s-stemmer and a commongrams configuration.

Sounds great! Can you post the analyzer XML for this? Kinda out of my league at this point :)

bq. 4. I do not think the default query parser should be the lucene one, if we have a fancy one (edismax?) that happily handles user input without exceptions... why not just default to the best we have to offer?!

+1

Robert maybe you can take the patch and iterate w/ these changes...?

> Improve the defaults for the "text" field type in default schema.xml
> --------------------------------------------------------------------
>
>                 Key: SOLR-2519
>                 URL: https://issues.apache.org/jira/browse/SOLR-2519
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.2, 4.0
>
>         Attachments: SOLR-2519.patch, SOLR-2519.patch, SOLR-2519.patch
>
>
> Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5
> The text fieldType in schema.xml is unusable for non-whitespace
> languages, because it has the dangerous auto-phrase feature (of
> Lucene's QP -- see LUCENE-2458) enabled.
> Lucene leaves this off by default, as does ElasticSearch
> (http://www.elasticsearch.org/).
> Furthermore, the "text" fieldType uses WhitespaceTokenizer when
> StandardTokenizer is a better cross-language default.
> Until we have language specific field types, I think we should fix
> the "text" fieldType to work well for all languages, by:
> * Switching from WhitespaceTokenizer to StandardTokenizer
> * Turning off auto-phrase

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
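
For readers unfamiliar with the query-parser suggestion (point 4 in the comment above), here is a minimal sketch of what defaulting to edismax could look like in solrconfig.xml. It is not taken from any attached patch; the handler name and the `qf` field are assumptions chosen for illustration.

```xml
<!-- solrconfig.xml: illustrative sketch only, not from SOLR-2519.patch -->
<requestHandler name="search" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <!-- select the extended dismax parser instead of the "lucene" parser -->
    <str name="defType">edismax</str>
    <!-- hypothetical field to search when the user query names none -->
    <str name="qf">text</str>
  </lst>
</requestHandler>
```

Setting `defType` under `defaults` keeps the choice overridable per request, so existing clients that pass their own `defType` would not be affected.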
[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040129#comment-13040129 ]

Robert Muir commented on SOLR-2519:
-----------------------------------

A few opinions:

1. First of all, I am +1 to the patch. I think it's an improvement overall; however, I think it might be worthwhile to discuss the issues below.

2. I think we need to stop kidding ourselves about example/default and just recognize that 99.999% of users just use the example as their default configuration. Guys, the example is the default; there is simply no argument, this is the reality! So I think we should present reasonable field type names such as text_en etc. Please don't waste any more of our time trying to convince users that the default is actually an example; it's a default.

3. The aggressive analysis is totally unnecessary and gives bad results, this is not 1985... Let's drop the porter stemmer and the stopwords list and replace them with less aggressive defaults such as s-stemmer and a commongrams configuration.

4. I do not think the default query parser should be the lucene one, if we have a fancy one (edismax?) that happily handles user input without exceptions... why not just default to the best we have to offer?!
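
As a rough illustration of point 3 above, the following schema.xml sketch swaps the Porter stemmer and stopword removal for a minimal plural stemmer and common-grams. It is an assumption about what such a configuration might look like, not the analyzer XML from the issue; the `commonwords.txt` file name is hypothetical, and the filter order (common-grams before stemming, so that list words are matched unstemmed) is a judgment call.

```xml
<!-- schema.xml: illustrative sketch only; commonwords.txt is a hypothetical file -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index bigrams for common words alongside the single terms,
         rather than discarding the common words as stopwords -->
    <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
    <!-- light plural ("S") stemmer in place of the Porter stemmer -->
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- query-side counterpart of the common-grams filter -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>
```

The index and query chains differ only in the common-grams filter, which is how that filter pair is meant to be used.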
[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036104#comment-13036104 ]

Michael McCandless commented on SOLR-2519:
------------------------------------------

+1 to naming these fields text_example_XXX. That's a great idea Jan. I'll do that in my next patch...
Re: [jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
+1. I've seen far too many implementations of Solr that blindly use the example configurations and then wonder why the results are surprising (WordDelimiterFilterFactory by itself has confused more people than I can recollect).

Although, just to contradict myself, I guess if people don't really look at the configs, they deserve the consequences...

And to contra-contradict myself, at least that would give us a clue on the user's list about where to look first!

Erick

2011/5/18 Jan Høydahl (JIRA):
>
> Jan Høydahl commented on SOLR-2519:
> -----------------------------------
>
> Largely agree with @Hoss' suggestion. But I think it would be wise to
> emphasize that the example schema is just that - an *example* - encouraging
> people to create new fieldTypes instead of editing the example ones. It's not
> a problem for "int", "date" etc, but for text I always encourage our
> customers and students to stay away from the FieldTypes in the example and
> make their own versions instead.
>
> One way to further encourage this best practice is naming all text FieldTypes
> clearly as examples, e.g.
>
> {code}
>
> {code}
>
> We must realize that a lot of non-American users out there are already
> customizing their schemas with the naming pattern "text_", which means
> you'll find "text_en", "text_it", "text_no" in a lot of installations.
> Therefore it would be unwise to introduce new FieldTypes which clash with
> those names out of the box in version 3.2; hence the _example in the type
> name.
>
> When upgrading, I always leave all the example field types intact, and add my
> custom ones separately, clearly marked by comments for easy copy/paste. I
> believe this to be a fairly common practice, and a desirable one, which would
> give no clashes for the above example.
>
> With this example naming practice, we can be pretty sure that if people talk
> about the fieldType "text_example_en" on the lists, they mean the default
> example type, but if they talk about "text_en", it's something they've
> customized themselves (if only by simply renaming the example). There will be
> more mental resistance for people to start modifying something with "_example"
> in it without also changing the name.
[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035796#comment-13035796 ]

Jan Høydahl commented on SOLR-2519:
-----------------------------------

Largely agree with @Hoss' suggestion. But I think it would be wise to emphasize that the example schema is just that - an *example* - encouraging people to create new fieldTypes instead of editing the example ones. It's not a problem for "int", "date" etc, but for text I always encourage our customers and students to stay away from the FieldTypes in the example and make their own versions instead.

One way to further encourage this best practice is naming all text FieldTypes clearly as examples, e.g.

{code}

{code}

We must realize that a lot of non-American users out there are already customizing their schemas with the naming pattern "text_", which means you'll find "text_en", "text_it", "text_no" in a lot of installations. Therefore it would be unwise to introduce new FieldTypes which clash with those names out of the box in version 3.2; hence the _example in the type name.

When upgrading, I always leave all the example field types intact, and add my custom ones separately, clearly marked by comments for easy copy/paste. I believe this to be a fairly common practice, and a desirable one, which would give no clashes for the above example.

With this example naming practice, we can be pretty sure that if people talk about the fieldType "text_example_en" on the lists, they mean the default example type, but if they talk about "text_en", it's something they've customized themselves (if only by simply renaming the example). There will be more mental resistance for people to start modifying something with "_example" in it without also changing the name.
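
The naming practice described in this comment can be sketched as follows. This is a hypothetical illustration, not the content of the comment's own {code} block (which did not survive in this archive); the analyzer chains and the "text_no" type are assumptions.

```xml
<!-- schema.xml: illustrative sketch only -->

<!-- Shipped example type: the "_example" infix marks it as stock,
     and it is left untouched when upgrading -->
<fieldType name="text_example_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Site-specific type (hypothetical): kept separate so its name cannot
     clash with a type introduced by a later Solr release -->
<fieldType name="text_no" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Norwegian"/>
  </analyzer>
</fieldType>
```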
[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034203#comment-13034203 ]

Robert Muir commented on SOLR-2519:
-----------------------------------

As someone frustrated by this (but who would ultimately like to move past it and try to help with solr's intl), I just wanted to say +1 to Hoss Man's proposal.

My only suggestion on what he said is that I would greatly prefer text_en over text_western or whatever for these reasons:

1. the stemming and stopwords and crap here are english.

2. for other western languages, even if you swap these out to be say, french or italian (which is the seemingly obvious way to cut over), the whole WDF+autophrase is still a huge trap (see http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance for an example). in this case ElisionFilter can be used to avoid it.
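
To make the ElisionFilter remark concrete, here is a minimal sketch of a French field type that strips elided articles instead of relying on WordDelimiterFilter plus auto-phrase. The type name and the `contractions_fr.txt` file are assumptions for illustration; lowercasing is placed first so that the article list only needs lowercase entries.

```xml
<!-- schema.xml: illustrative sketch only; contractions_fr.txt is a hypothetical file -->
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- remove leading elided articles (l', d', ...) so the token is
         indexed as the bare word rather than split into two terms -->
    <filter class="solr.ElisionFilterFactory" articles="contractions_fr.txt"/>
  </analyzer>
</fieldType>
```

Because no token is split into an article and a word, the query parser has no multi-token sequence to turn into a phrase query for such terms.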
[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034185#comment-13034185 ]

Michael McCandless commented on SOLR-2519:
------------------------------------------

bq. Bottom line: it's less confusing to remove and add new ones with new names than to make radical changes to existing ones.

Ahh, this makes great sense! I really like your proposal Hoss, and that's a great point about emails to the mailing lists.

So we'd have no more text fieldType. Just text_en (what text now is) and text_general (basically just StandardAnalyzer, but maybe move/absorb "textgen" over).

Over time we can add in more language specific text_XX fieldTypes...
[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034176#comment-13034176 ]

Hoss Man commented on SOLR-2519:
--------------------------------

bq. Also: existing users would be unaffected by this? They've already copied over / edited their own schema.xml? This is mainly about new users?

The trap we've seen with this type of thing in the past (ie: the numeric fields) is that people who tend to use the example configs w/o changing them much refer to the example field types by name when talking about them on the mailing list, not considering that those names can have different meanings depending on version.

if we make radical changes to a {{}} but leave the name alone, it could confuse a lot of people, ie: "i tried using the 'text' field but it didn't work"; "which version of solr are you using?"; "Solr 4.1"; "that should work, what exactly does your schema look like"; "..."; "that's the schema from 3.6"; "yeah, i started with 3.6 and then upgraded to 4.1 later", etc...

Bottom line: it's less confusing to *remove* {{}} and add new ones with new names than to make radical changes to existing ones.
[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034172#comment-13034172 ]

Hoss Man commented on SOLR-2519:
--------------------------------

I feel like we are conflating two issues here: the "default" behavior of TextField, and the example configs.

i don't have any strong opinions about changing the default behavior of TextField when {{autoGeneratePhraseQueries}} is not specified in the {{}} but if we do make such a change, it should be contingent on the schema version property (which we should bump) so that people who upgrade will get consistent behavior with their existing configs (TextField.init already has an example of this for when we changed the default of {{omitNorms}})

as far as the example configs: i agree with yonik, that changing "text" at this point might be confusing ... i think the best way to iterate moving forward would probably be:

* rename {{}} and {{}} to something that makes their purpose more clear (text_en, or text_western, or text_european, or some other more general descriptive word for the types of languages where it makes sense) and switch all existing {{}} declarations that currently use field type "text" to use this new name.
* add a new {{}} which is designed (and documented) to be a general purpose field type when the language is unknown (it may make sense to fix/repurpose the existing {{}} for this, since it already suggests that's what it's for)
* Audit all {{}} declarations that use "text_en" (or whatever name was chosen above) and the existing sample data for those fields to see if it makes more sense to change them to "text_general". also change any where, based on usage, it shouldn't matter.

The end result being that we have no {{}} named "text" in the example configs, so people won't get it confused with previous versions, and we'll have a new {{}} that works as well as possible with all languages which we use as much as possible with the example data.
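
The plan in this comment (bump the schema version, retire the type named "text", add a general-purpose type, and repoint field declarations) might end up looking something like the sketch below. The version value, the field names, and the analyzer chains are all assumptions made for illustration and are not taken from the patch.

```xml
<!-- schema.xml: illustrative sketch only -->
<!-- version value is an assumed bump; behavior defaults would key off it -->
<schema name="example" version="1.4">
  <types>
    <!-- the former English-specific "text" type under a descriptive name -->
    <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100"
               autoGeneratePhraseQueries="true">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- new general-purpose type for when the language is unknown -->
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>

  <fields>
    <!-- hypothetical fields: no declaration refers to a type named "text" -->
    <field name="name" type="text_general" indexed="true" stored="true"/>
    <field name="features" type="text_en" indexed="true" stored="true" multiValued="true"/>
  </fields>
</schema>
```

Setting `autoGeneratePhraseQueries` explicitly on the renamed type keeps its behavior independent of whatever default the schema version implies.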
[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034158#comment-13034158 ]

Michael McCandless commented on SOLR-2519:
------------------------------------------

It's also spooky that "text" fieldType has different index time vs query time analyzers? Ie, WDF is configured differently.
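
For context, an index/query split of the kind this comment refers to looks roughly like the sketch below: WordDelimiterFilter catenates word parts at index time but not at query time. The attribute values are illustrative and are not copied from the example schema under discussion.

```xml
<!-- schema.xml: illustrative sketch only; attribute values are assumptions -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- index time: emit the split parts AND the catenated form,
         e.g. "wi-fi" yields wi, fi and wifi -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- query time: only the split parts, no catenation -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```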
[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034154#comment-13034154 ]

Michael McCandless commented on SOLR-2519:
------------------------------------------

bq. I think maybe there's a misconception that the fieldType named "text" was meant to be generic for all languages.

Regardless of what the original intention was, "text" today has become the generic text fieldType new users use on starting with Solr. I mean, it has the perfect name for that :)

bq. As I said in the thread, if I had to do it over again, I would have named it "text_en" because that's what its purpose was.

Hindsight is 20/20... but, we can still fix this today. We shouldn't lock ourselves into poor defaults.

Especially, as things improve and we get better analyzers, etc., we should be free to improve the defaults in schema.xml to take advantage of these improvements.

bq. But at this point, it seems like the best way forward is to leave "text" as an english fieldType and simply add other fieldTypes that can support other languages.

I think this is a dangerous approach -- the name (ie, missing _en if in fact it has such English-specific configuration) is misleading and traps new users.

Ideally, in the future, we wouldn't even have a "text" fieldType, only text_XX per-language examples and then maybe something like text_general, which you use if you cannot find your language.

{quote}
Some downsides I see to this patch (i.e. trying to make the 'text' fieldType generic):
The current WordDelimiterFilter options on the fieldType feel like a trap for non-whitespace-delimited languages. WDF is configured to index catenations as well as splits... so all of the tokens (words?) that are split out are also catenated together and indexed (which seems like it could lead to some truly huge tokens erroneously being indexed.)
{quote}

Ahh good point. I think we should remove WDF altogether from the generic "text" fieldType.

{quote}
You left the english stemmer on the "text" fieldType... but if it's supposed to be generic, couldn't this be bad for some other western languages where it could cause stemming collisions of words not related to each other?
{quote}

+1, we should remove the stemming too from "text".

bq. Taking into account all the existing users (and all the existing documentation, examples, tutorial, etc), I favor a more conservative approach of adding new fieldTypes rather than radically changing the behavior of existing ones.

Can you point to specific examples (docs, examples, tutorial)? I'd like to understand how much work it is to fix these...

My feeling is we should simply do the work here (I'll sign up to it) and fix any places that actually rely on the specifics of "text" fieldType, eg autophrase.

We shouldn't avoid fixing things well because it's gonna be more work today, especially if someone (me) is signing up to do it.

Also: existing users would be unaffected by this? They've already copied over / edited their own schema.xml? This is mainly about new users?
[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034120#comment-13034120 ]

Yonik Seeley commented on SOLR-2519:
------------------------------------

I think maybe there's a misconception that the fieldType named "text" was meant to be generic for all languages. As I said in the thread, if I had to do it over again, I would have named it "text_en" because that's what its purpose was. But at this point, it seems like the best way forward is to leave "text" as an english fieldType and simply add other fieldTypes that can support other languages.

Some downsides I see to this patch (i.e. trying to make the 'text' fieldType generic):
- The current WordDelimiterFilter options on the fieldType feel like a trap for non-whitespace-delimited languages. WDF is configured to index catenations as well as splits... so all of the tokens (words?) that are split out are also catenated together and indexed (which seems like it could lead to some truly huge tokens erroneously being indexed.)
- You left the english stemmer on the "text" fieldType... but if it's supposed to be generic, couldn't this be bad for some other western languages where it could cause stemming collisions of words not related to each other?

Taking into account all the existing users (and all the existing documentation, examples, tutorial, etc), I favor a more conservative approach of adding new fieldTypes rather than radically changing the behavior of existing ones.

Random question: what are the implications of changing from WhitespaceTokenizer to StandardTokenizer, esp w.r.t. WDF?
[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034101#comment-13034101 ]

Michael McCandless commented on SOLR-2519:
------------------------------------------

I think the attached patch is a good starting point.

It fixes the generic "text" fieldType to have good all around defaults for all languages, so that non-whitespace languages work fine.

Then, I think we should iteratively add in custom languages over time (as separate issues). We can eg add text_en_autophrase, text_en, text_zh, etc. We should at least do a first sweep of the nice analyzers module and add fieldTypes for them.

This way we will eventually get to the ideal future when we have text_XX coverage for many languages.
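
The two changes named in the issue description (StandardTokenizer instead of WhitespaceTokenizer, and auto-phrase turned off) can be sketched as a minimal field type. This is an assumption about the shape of such a type, not the content of SOLR-2519.patch; the lowercase filter is included only as a typical companion.

```xml
<!-- schema.xml: illustrative sketch only, not the attached patch -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <analyzer>
    <!-- StandardTokenizer segments text by Unicode word-break rules,
         so it does not depend on whitespace between words -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With `autoGeneratePhraseQueries="false"`, a single query term that the analyzer splits into several tokens is not silently turned into a phrase query.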