[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845214#action_12845214 ]

Uwe Schindler commented on SOLR-1677:

I also added support for instantiating Lucene Analyzers directly, which broke with the 3.0 upgrade. The new code now prefers a one-arg Version ctor and falls back to the no-arg one. The only thing that is not working at the moment is the -Aware stuff, as SolrResourceLoader.newInstance() was not usable.

Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

Key: SOLR-1677
URL: https://issues.apache.org/jira/browse/SOLR-1677
Project: Solr
Issue Type: Sub-task
Components: Schema and Analysis
Reporter: Uwe Schindler
Attachments: SOLR-1677-lucenetrunk-branch.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch

Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards compatibility with old indexes created using older versions of Lucene. The most important example is StandardTokenizer, which changed its behaviour with posIncr and incorrect host token types in 2.4 and also in 2.9.

In Lucene 3.0 this matchVersion ctor parameter is mandatory, and in 3.1, with much more Unicode support, almost every Tokenizer/TokenFilter needs this Version parameter. In 2.9, the deprecated old ctors without Version take LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.

This patch adds basic support for the Lucene Version property to the base factories. Subclasses can then use the decoded luceneMatchVersion enum (in 3.0) / parameter (in 2.9) for constructing TokenStreams. The code currently contains a helper map to decode the version strings, but in 3.0 it can be replaced by Version.valueOf(String), as Version is a Java 5 enum. The default value is Version.LUCENE_24 (as this is the default for the no-version ctors in Lucene).

This patch also removes unneeded conversions to CharArraySet from StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed to match Lucene 3.0.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
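The two mechanisms described in this message (decoding the configured version string with a LUCENE_24 default, and preferring a one-arg Version ctor before falling back to the no-arg one) could look roughly like the sketch below. It is not the code from the patch: the Version enum, VersionedAnalyzer, LegacyAnalyzer, parseVersion and newInstance are hypothetical stand-ins so that the example compiles without Lucene or Solr on the classpath.

```java
import java.lang.reflect.Constructor;

// Self-contained sketch. Version, VersionedAnalyzer and LegacyAnalyzer are
// hypothetical stand-ins, not the real Lucene or Solr classes.
class VersionCtorFallback {

    enum Version { LUCENE_24, LUCENE_29, LUCENE_30 }

    // Has a one-arg Version constructor (the preferred case).
    static class VersionedAnalyzer {
        final Version matchVersion;
        public VersionedAnalyzer(Version matchVersion) { this.matchVersion = matchVersion; }
    }

    // Only has a no-arg constructor (the fallback case).
    static class LegacyAnalyzer {
        public LegacyAnalyzer() {}
    }

    // Decode the configured string; a missing value falls back to LUCENE_24,
    // the default named in the issue description.
    static Version parseVersion(String configured) {
        if (configured == null || configured.trim().isEmpty()) {
            return Version.LUCENE_24;
        }
        return Version.valueOf(configured.trim());
    }

    // Prefer a constructor that takes a Version; fall back to the no-arg one.
    static <T> T newInstance(Class<T> clazz, Version matchVersion) {
        try {
            try {
                Constructor<T> withVersion = clazz.getConstructor(Version.class);
                return withVersion.newInstance(matchVersion);
            } catch (NoSuchMethodException e) {
                return clazz.getConstructor().newInstance();
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + clazz.getName(), e);
        }
    }

    public static void main(String[] args) {
        Version v = parseVersion("LUCENE_29");
        VersionedAnalyzer versioned = newInstance(VersionedAnalyzer.class, v);
        LegacyAnalyzer legacy = newInstance(LegacyAnalyzer.class, v);
        System.out.println(versioned.matchVersion);            // prints LUCENE_29
        System.out.println(legacy.getClass().getSimpleName()); // prints LegacyAnalyzer
    }
}
```

Catching NoSuchMethodException separately keeps the fallback limited to the "no such ctor" case; any other reflective failure is reported instead of being silently retried.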
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828835#action_12828835 ]

Hoss Man commented on SOLR-1677:

bq. I guess I could care less what the default is, if you care about such things you shouldn't be using the defaults and instead specifying this yourself in the schema, and Version has no effect.

...which is all well and good, but it just reiterates the need for really good documentation about what is impacted by changing a global Version setting -- otherwise users might be depending on a default behavior that is going to change when Version is bumped, and they may not even realize it.

Bear in mind: these are just the nuances that people need to worry about when considering a switch from 2.4 to 2.9 to 3.0 ... there will likely be a lot more of these over time.

And just to be as crystal clear as I possibly can:
* my concern is purely about how to document this stuff.
* I do in fact agree that a global luceneMatchVersion option is a good idea
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805167#action_12805167 ]

Hoss Man commented on SOLR-1677:

bq. And here are the JIRA issues for stemming bugs, since you didn't take my hint to go and actually read them.

sigh. I read both those issues when you filed them, and I agreed with your assessment that they are bugs we should fix -- if I had thought you were wrong I would have said so in the issue comments.

But that doesn't change the fact that sometimes people depend on buggy behavior -- and sometimes those people depend on the buggy behavior without even realizing it. Bug fixes in a stemmer might make it more correct according to the stemmer algorithm specification, or the language semantics, but in some peculiar use cases an application might find the correct implementation less useful than the previous buggy version.

This is one reason why things like CHANGES.txt are important: to draw attention to what has changed between two versions of a piece of software, so people can make informed decisions about what they should test in their own applications when they upgrade things under the covers. luceneMatchVersion should be no different. We should try to find a simple way to tell people "when you switch from luceneMatchVersion=X to luceneMatchVersion=Y, here are the bug fixes you will get", so they know what to test to determine if they are adversely affected by one of those bug fixes in some way (and can find their own workaround).

bq. Perhaps you should come up with a better example than stemming, as you don't know what you are talking about.

1) It's true, I frequently don't know what I'm talking about ... this issue was a prime example, and I thank you, Uwe, and Miller for helping me realize that I was completely wrong in my understanding of the intended purpose of o.a.l.Version, and that a global setting for it in Solr makes total sense -- but that doesn't make my concerns about documenting the effects of that global setting any less valid.

2) Perhaps you should read the StopFilter example I already posted in my last comment...

{quote}
bq. Robert mentioned in an earlier comment that StopFilter's position increment behavior changes depending on the luceneMatchVersion -- what if an existing Solr 1.3 user notices a bug in some Tokenizer, and adds {{<luceneMatchVersion>3.0</luceneMatchVersion>}} to his schema.xml to fix it. Without clear documentation on _everything_ that is affected when doing that, he may not realize that StopFilter changed at all -- and even though the position increment behavior may now be more correct, it might drastically change the results he gets when using dismax with a particular qs or ps value. Hence my point that this becomes a serious documentation concern: finding a way to make it clear to users what they need to consider when modifying luceneMatchVersion.
{quote}
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805187#action_12805187 ]

Robert Muir commented on SOLR-1677:

bq. 2) Perhaps you should read the StopFilter example I already posted in my last comment...

https://issues.apache.org/jira/browse/LUCENE-2094?focusedCommentId=12783932&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12783932

As far as this one goes, I specifically commented before on this not being 'hidden' by Version (with Solr users in mind) but instead being its own option that every user should consider, regardless of defaults.

For the StopFilter posInc the user should think it through. It's pretty strange, as I mention in my comment, that a definite article like 'the' gets a posInc bump in one language but not another, simply because it happens to be separated by a space.

I guess I could care less what the default is: if you care about such things you shouldn't be using the defaults but instead specifying this yourself in the schema, and then Version has no effect.

I can't really defend the whole StopFilter posInc thing, as again I don't think it makes a whole lot of sense. Maybe it works well for English, I guess; I won't argue about it.
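The point that an explicitly configured option makes Version irrelevant for that option can be sketched as follows. The names here (StopFilterDefaults, versionDefault, enablePositionIncrements) are made up for illustration, and the rule that the default switches on from 2.9 onward is an assumption of this sketch rather than something stated in the thread.

```java
// Self-contained sketch with a hypothetical Version enum; not the Lucene API.
class StopFilterDefaults {

    enum Version {
        LUCENE_24, LUCENE_29, LUCENE_30;

        boolean onOrAfter(Version other) {
            return compareTo(other) >= 0;
        }
    }

    // Assumed rule for this sketch: position increments default to "on"
    // for match versions 2.9 and later.
    static boolean versionDefault(Version matchVersion) {
        return matchVersion.onOrAfter(Version.LUCENE_29);
    }

    // explicit == null means the schema did not set the option, so the
    // version-derived default applies. Any explicit value wins outright.
    static boolean enablePositionIncrements(Boolean explicit, Version matchVersion) {
        return explicit != null ? explicit : versionDefault(matchVersion);
    }

    public static void main(String[] args) {
        System.out.println(enablePositionIncrements(null, Version.LUCENE_24));          // prints false
        System.out.println(enablePositionIncrements(null, Version.LUCENE_30));          // prints true
        System.out.println(enablePositionIncrements(Boolean.FALSE, Version.LUCENE_30)); // prints false
    }
}
```

With this shape, bumping the match version only changes behavior for users who left the option unset.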
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802973#action_12802973 ]

Hoss Man commented on SOLR-1677:

bq. I think I am slightly offended with some of your statements about 'subjective opinion of the Lucene Community' and 'they should do relevancy testing which use some language-specific stemmer whose behavior changed in a small but significant way'.

That was not at all my intention, I'm sorry about that. I was in fact trying to speak entirely in generalities and theoretical examples.

The point I was trying to make is that the types of bug fixes we make in Lucene are not mathematical absolutes -- we're not fixing bugs where 1+1=3. Even if everyone on java-dev and java-user agrees that behavior A is broken and behavior B is correct, that is still (to me) a subjective opinion -- 1000 men's trash may be one man's treasure, and there could be users out there who have come to expect/rely on behavior A.

I tried to use a stemmer as an example because it's the type of class where making behavior more correct (ie: making the stemming match the semantics of the language more accurately) doesn't necessarily improve the perceived behavior for all users -- someone could be very happy with the sloppy stemming in the 3.1 version of a (hypothetical) EsperantoStemmer because it gives him really loose matches. And if you (or anyone else) put in a lot of hard work making that stemmer better by all conceivable metrics in 3.4, then I've got no problem telling that person "Sorry dude, if you don't want those fixes don't upgrade", or "here are some other suggestions for getting 'loose' matching on that field".

My concern is that there may be people who don't even realize they are depending on behavior like this. Without an easy way for users to understand what objects have improved/fixed behavior between luceneMatchVersion=X and luceneMatchVersion=Y, they won't know the full list of things they should be considering/testing when they do change luceneMatchVersion.

bq. I'm also not that worried that users won't know what changed - they will just know that they are in the same boat as those downloading Lucene latest greatest for the first time.

But that's not true: a person downloading for the first time won't have any preconceived expectations of how something will behave. That's a very different boat from a person upgrading, who is going to expect things that were working to keep working -- those things may have actually been bugs in earlier versions, but if they _seemed_ to be working for their use cases, it's going to feel like it's broken when the behavior changes.

For a user who is consciously upgrading I'm ok with that. But when there is no easy way of knowing what behavior will change as a result of setting luceneMatchVersion=X, that doesn't feel fair to the user.

Robert mentioned in an earlier comment that StopFilter's position increment behavior changes depending on the luceneMatchVersion -- what if an existing Solr 1.3 user notices a bug in some Tokenizer, and adds {{<luceneMatchVersion>3.0</luceneMatchVersion>}} to his schema.xml to fix it. Without clear documentation on _everything_ that is affected when doing that, he may not realize that StopFilter changed at all -- and even though the position increment behavior may now be more correct, it might drastically change the results he gets when using dismax with a particular qs or ps value. Hence my point that this becomes a serious documentation concern: finding a way to make it clear to users what they need to consider when modifying luceneMatchVersion.

bq. I'm still all for allowing Version per component for experts use. But man, I wouldn't want to be in the boat, managing all my components as they mimic various bugs/bad behavior for various components.

But if the example configs only show a global setting that isn't directly linked to any of the individual object configurations, then normal users won't have any idea what could have/use individual luceneMatchVersion settings anyway (even if they wanted to manage it piecemeal).

Like I said: I've come around to the idea of having/advocating a global value. Once I got past my mistaken thinking of Version as controlling alternate versions (as Miller very clearly put it) I started to understand what you are all saying and I agree with you: a single global value is a good idea. My concern is just how to document things so that people don't get confused when they do need to change it.
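To make the StopFilter concern concrete, here is a small sketch of how the positions of surviving tokens differ depending on whether removed stop words still advance the position counter. It is a toy whitespace analyzer written for this illustration (StopPositions and positions are hypothetical names), not Lucene's StopFilter, and it assumes each surviving token occurs only once in the input.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy illustration only; this is not Lucene's StopFilter.
class StopPositions {

    // Maps each surviving token to its position. When enablePositionIncrements
    // is true, every removed stop word leaves a gap in the position sequence.
    static Map<String, Integer> positions(String text, Set<String> stopWords,
                                          boolean enablePositionIncrements) {
        Map<String, Integer> result = new LinkedHashMap<>();
        int position = -1;
        int skipped = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (stopWords.contains(token)) {
                skipped++;
                continue;
            }
            position += 1 + (enablePositionIncrements ? skipped : 0);
            skipped = 0;
            result.put(token, position);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("of", "the"));
        String text = "quick fox of the north";
        System.out.println(positions(text, stops, true));  // prints {quick=0, fox=1, north=4}
        System.out.println(positions(text, stops, false)); // prints {quick=0, fox=1, north=2}
    }
}
```

In the second case "fox" and "north" are adjacent, so an exact phrase query for "fox north" would line up with the indexed positions. In the first case they are three positions apart, which is why a fixed qs or ps slop value can start returning different results after the setting changes.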
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802979#action_12802979 ]

Robert Muir commented on SOLR-1677:

bq. The point I was trying to make is that the types of bug fixes we make in Lucene are not mathematical absolutes - we're not fixing bugs where 1+1=3.

You are wrong, they are absolutes. And here are the JIRA issues for stemming bugs, since you didn't take my hint to go and actually read them.

LUCENE-2055: I used the snowball tests against these stemmers which claim to implement 'snowball algorithm', and they fail. This is an absolute, and the fix is to instead use snowball.

LUCENE-2203: I used the snowball tests against these stemmers and they failed. Here is Martin Porter's confirmation that these are bugs: http://article.gmane.org/gmane.comp.search.snowball/1139

Perhaps you should come up with a better example than stemming, as you don't know what you are talking about.
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802614#action_12802614 ]

Mark Miller commented on SOLR-1677:

If you are thinking of VERSION as alternate versions, I can see your point. But I can't imagine that's what VERSION is for.

{quote}
everyone else seems to have a very fixed view that these Version based changes are genuine improvements/bug-fixes, w/o any expectation that clients might/could subjectively decide "I want the old behavior", and that older Versions are supported purely for back-compatibility.
{quote}

I don't think Version is meant to be used so that users can choose how things operate - personally I do see it as purely a way to get bad behavior for back compatibility. If that's not the case, we should not use Version in Lucene, we should make a Class2. Then you pick which you want.

To me, Version is for fixing bugs or things that are clearly not the right way of doing things. Not a choice list. If more than one choice makes sense, that should be done without Version. Personally that's all that makes sense to me. Perhaps it will be abused, but personally I'd push back. Version is not a functionality selector - it's a way to handle back compat for bugs and clear improvements - stuff we plan and hope to drop into a big black hole forever. Not options that make sense and that we plan to keep around for users to mull over.

I'm also not that worried that users won't know what changed - they will just know that they are in the same boat as those downloading Lucene latest greatest for the first time. Likely the best boat to be in when it comes to this stuff.

If they want to manage things piecemeal, I'm still all for allowing Version per component for experts use. But man, I wouldn't want to be in the boat, managing all my components as they mimic various bugs/bad behavior for various components.

When I download the latest Solr and do a fresh install, I want it to have all of the latest Lucene bugs fixed (not the case currently). When I have an old install, I want to be able to change one setting and reindex to get all known bugs fixed (currently not the case - heck, it's not even possible to run Solr currently with all the known Lucene bugs fixed).
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802020#action_12802020 ]

Mark Miller commented on SOLR-1677:

In my opinion this should be real simple.

Having to specify a Lucene version for each component is not simple - it's beyond most users. I think it's beyond me (laugh as you see fit).

Having to accept Lucene 2.4 behavior by default because of Solr back compat issues is also weak.

A new user should get all the bug fixes of the latest Lucene with minimal effort. Hopefully no effort. Older users should be able to get the newest with minimal effort as well - not having to go one by one through each component and upgrading it. I can't imagine juggling all these versions for each component - that's ugly enough in Lucene - it shouldn't infect Solr for the average case.

Personally, I do think there should be a global default. And I think right next to it, it should say: if you change this, you must reindex. No worries about action at a distance. The action is to get the latest and greatest Lucene has to offer rather than older buggy or back compat behavior. Reindex, get latest greatest. Don't reindex and you're on your own. Solr might rip your head off.

We should also offer per component for real experts, but I wouldn't be meddling that way myself unless in a bind.

Solr should be real simple about this - and the latest Solr should use the latest bug fixes from Lucene, with previous configs out there defaulting to 2.4 compatibility.

I abbreviated the heck out of my arguments and thinking, but damn it, that's what I think :)
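The lookup order proposed here (a per-component setting for experts, otherwise the global default, otherwise 2.4 compatibility for older configs) can be sketched in a few lines. MatchVersionResolver, resolve and the string form of the version names are assumptions made for this illustration; only the "luceneMatchVersion" parameter name comes from the thread.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed lookup order; all class and method names are hypothetical.
class MatchVersionResolver {

    enum Version { LUCENE_24, LUCENE_29, LUCENE_30 }

    // componentArgs: the attributes of one tokenizer/filter factory.
    // globalSetting: the single config-wide value, or null if an older
    // config does not declare one.
    static Version resolve(Map<String, String> componentArgs, String globalSetting) {
        String perComponent = componentArgs.get("luceneMatchVersion");
        if (perComponent != null) {
            return Version.valueOf(perComponent);  // expert override
        }
        if (globalSetting != null) {
            return Version.valueOf(globalSetting); // the one setting most users touch
        }
        return Version.LUCENE_24;                  // older configs keep 2.4 behavior
    }

    public static void main(String[] args) {
        Map<String, String> plain = new HashMap<>();
        Map<String, String> pinned = new HashMap<>();
        pinned.put("luceneMatchVersion", "LUCENE_29");

        System.out.println(resolve(plain, null));         // prints LUCENE_24
        System.out.println(resolve(plain, "LUCENE_30"));  // prints LUCENE_30
        System.out.println(resolve(pinned, "LUCENE_30")); // prints LUCENE_29
    }
}
```

Whatever value ends up being resolved, the caveat above still applies: changing it only makes sense together with a reindex.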
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12800471#action_12800471 ]

Hoss Man commented on SOLR-1677:

bq. And I also can't see anyone really spending time to aggressively ensure that the example schema etc is all up to date

I think you are vastly underestimating how much work is spent reviewing the example schema.xml prior to releases. It would be trivial to search/replace luceneMatchVersion=X with luceneMatchVersion=Y anytime the current version of Version was updated in Lucene-Java.

bq. the hardcoded 2.4 behavior is the action at a distance, because if i do not specify Version in my configuration file, then i get this very old behavior.

I don't follow you at all -- you have identified no action, or distance in your example. When i say i'm worried about scary action at a distance, i'm talking about editing some thing A in a config file, and having it result in changed behavior (action) in things B, C and D that do not directly refer to A in any way (distance). Furthermore these changes in behavior are silent (thus scary).

If I have {{<fieldType name="A"/>}} and much later in the config {{<field name="B" type="A"/>}} then editing A results in an action on B at a distance -- but this should not surprise me at all because B explicitly references A. Having a global {{<luceneMatchVersion/>}} tag that affects the behavior of a variety of different things when it's modified leads to situations where people might change that value, triggering changes in many components w/o a clear idea of what might have changed -- so they don't even know what things they should focus on testing for correctness after making that change.

The existing {{<schema version="X"/>}} property also leads to action at a distance type situations -- but that is a lot less scary to me because at least with it there is a uniform set of changes to *all* schema objects between any two versions, so it's easy to document what changes when you go from 1.1 to 1.2, or 1.2 to 1.3 ... but with luceneMatchVersion the potential changes are unique to every individual Class that cares about it.

{quote}
If this is really your concern, then i have an alternative i propose.
* No default anywhere, not even in the code
* Version is mandatory if the thing requires it
{quote}

This is something Uwe and i both discussed in previous comments...

https://issues.apache.org/jira/browse/SOLR-1677?focusedCommentId=12796872#action_12796872
https://issues.apache.org/jira/browse/SOLR-1677?focusedCommentId=12796937#action_12796937

...as i said: i'm fine with this idea in theory -- as a long term plan -- but there has to be a gradual migration process for people. ie: it can be required on certain objects in a future release, but for at least the next release it needs to be possible to not specify the luceneMatchVersion on all of these objects, and when people use them w/o specifying, they can log big fat warnings on init that it is defaulting to 2.4, and that they should set the property explicitly if that's what they want.

bq. I still do not want it in schema.xml, as Version is a global Lucene thing!

Uwe: I think you are misunderstanding the reason for a distinction between solrconfig.xml and schema.xml in Solr. If (for the sake of argument) luceneMatchVersion really should be a global Lucene thing then that is precisely why it should be in schema.xml.

schema.xml is for configuration that is inherently part of the index, and must be consistent regardless of who/how/why that index is being used. solrconfig.xml is where settings are put that are specific to how a particular instance of an index is being used. If a setting is in solrconfig.xml, then it should be possible for that setting to be completely different on different solr instances that use the exact same schema.xml -- even if they use cloned copies of the same index directory. (ie: master/slave distinctions in replication; peer slaves with distinct handler/cache settings to serve distinct use cases; etc...)

That's the reason why nothing that hangs off of IndexSchema is currently allowed to be SolrCoreAware, or get access to the SolrConfig object (and why the SolrResourceLoader abstraction was created) ... nothing about the SolrCore instance should be allowed to influence the resulting index, because that index may later be used on a different instance with a different config. As i mentioned before: solrconfig.xml can depend on schema.xml, but schema.xml can not depend on solrconfig.xml.

So if a global luceneMatchVersion can affect the behavior of an analyzer or FieldType in a way that is persisted as part of the index -- and other classes (like QueryParser in Robert's example) need to make sure to use the same luceneMatchVersion to behave correctly with that index, then that setting needs to be in the schema.xml so it is
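Editor's illustration: the gradual migration suggested in the comment above (an omitted luceneMatchVersion still works, falls back to 2.4, and logs a loud warning) could look roughly like the following Java sketch. It is not taken from the patch. The Version enum is a stand-in for org.apache.lucene.util.Version, and MatchVersionResolver, resolve(), and the use of java.util.logging are hypothetical choices made only for this example.

```java
import java.util.Map;
import java.util.logging.Logger;

// Stand-in for org.apache.lucene.util.Version so the sketch compiles without Lucene.
enum Version { LUCENE_20, LUCENE_24, LUCENE_29, LUCENE_30 }

// Hypothetical helper; neither the class nor the method exists in Solr.
final class MatchVersionResolver {
    static final String KEY = "luceneMatchVersion";
    static final Version LEGACY_DEFAULT = Version.LUCENE_24;
    private static final Logger LOG = Logger.getLogger(MatchVersionResolver.class.getName());

    private MatchVersionResolver() {}

    // Returns the explicitly configured version, or the legacy default plus a warning.
    static Version resolve(String factoryName, Map<String, String> initArgs) {
        String raw = initArgs.get(KEY);
        if (raw == null) {
            LOG.warning(factoryName + " does not specify " + KEY + "; defaulting to "
                + LEGACY_DEFAULT + ". Set it explicitly if that is what you want.");
            return LEGACY_DEFAULT;
        }
        // Throws IllegalArgumentException for an unknown version name.
        return Version.valueOf(raw);
    }
}
```

A later release could turn the warning branch into a hard error for the classes that require a Version, which is the "mandatory" end state both sides of the thread mention.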
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12800484#action_12800484 ]

Marvin Humphrey commented on SOLR-1677:

bq. I'd still like to clarify this whole issue of whether Lucene-Java, as a project, has an expectation that client applications will always use a consistent value for Version when constructing objects that interact with an index

Yes. The whole point is to avoid Analyzer mismatches. Say a stoplist was modified between Lucene versions. Sure, you can hack it and ask for an old match version, so you get a stoplist other than the one that was used to build the index... but why would you want to?

bq. Are there any threads/docs about the expectations of Version homo/hetero-genousness in Lucene-Java?

The original thread from last May, I guess... which culminated in LUCENE-1684:

http://markmail.org/thread/egqe6rm4c4om7swv

It's very long, though.

Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
---
Key: SOLR-1677
URL: https://issues.apache.org/jira/browse/SOLR-1677
Project: Solr
Issue Type: Sub-task
Components: Schema and Analysis
Reporter: Uwe Schindler
Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch

Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards compatibility with old indexes created using older versions of Lucene. The most important example is StandardTokenizer, which changed its behaviour with posIncr and incorrect host token types in 2.4 and also in 2.9. In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with much more Unicode support, almost every Tokenizer/TokenFilter needs this Version parameter. In 2.9, the deprecated old ctors without Version take LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.

This patch adds basic support for the Lucene Version property to the base factories. Subclasses can then use the decoded luceneMatchVersion enum (in 3.0) / Parameter (in 2.9) for constructing TokenStreams. The code currently contains a helper map to decode the version strings, but in 3.0 it can be replaced by Version.valueOf(String), as Version is a Java 5 enum there. The default value is Version.LUCENE_24 (as this is the default for the no-version ctors in Lucene).

This patch also removes unneeded conversions to CharArraySet from StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed to match Lucene 3.0.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
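Editor's illustration: the issue description mentions two ways to decode a version string, a helper map and Version.valueOf(String). The Java sketch below shows both side by side under stated assumptions. The Version enum is a stand-in for org.apache.lucene.util.Version, MatchVersionParser and its methods are hypothetical names, and the actual patch may structure this differently.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for org.apache.lucene.util.Version so the sketch compiles without Lucene.
enum Version { LUCENE_20, LUCENE_24, LUCENE_29, LUCENE_30 }

// Hypothetical helper; class and method names are illustrative only.
final class MatchVersionParser {
    // Default named in the issue description for factories without an explicit version.
    static final Version DEFAULT = Version.LUCENE_24;

    // Explicit lookup table: the approach needed when Version is not an enum (the 2.9 case).
    private static final Map<String, Version> BY_NAME = new HashMap<>();
    static {
        BY_NAME.put("LUCENE_20", Version.LUCENE_20);
        BY_NAME.put("LUCENE_24", Version.LUCENE_24);
        BY_NAME.put("LUCENE_29", Version.LUCENE_29);
        BY_NAME.put("LUCENE_30", Version.LUCENE_30);
    }

    private MatchVersionParser() {}

    static Version parseWithMap(String name) {
        if (name == null) {
            return DEFAULT;
        }
        Version v = BY_NAME.get(name);
        if (v == null) {
            throw new IllegalArgumentException("Unknown luceneMatchVersion: " + name);
        }
        return v;
    }

    // With an enum, the table is redundant: valueOf does the same lookup
    // and also throws IllegalArgumentException for unknown names.
    static Version parseWithValueOf(String name) {
        return name == null ? DEFAULT : Version.valueOf(name);
    }
}
```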
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12800514#action_12800514 ]

Hoss Man commented on SOLR-1677:

{quote}
Yes. The whole point is to avoid Analyzer mismatches. Say a stoplist was modified between Lucene versions. Sure, you can hack it and ask for an old match version, so you get a stoplist other than the one that was used to build the index... but why would you want to?
{quote}

...but that's no different than using StopFilter(someStopWordSet) at indexing and StopFilter(someOtherStopWordSet) at query time -- Solr happily lets you do that with its index/query analyzers ... you may have a very good reason for doing that.

Likewise you may have an existing field using the default stopwords list from Version.LUCENE_24 that you don't want to change because you want clients that search on that field to continue to get the same behavior, but when you add a new field you want it to have the current default stopwords because it's queried by entirely different clients. That's no different than saying i want PorterStemmer on fieldA and SnowBall2Stemmer on fieldB.

The implication i got from Robert was that there was (or would soon be) expectations in Lucene-Java code that if one object was told to use Version.X it would be assumed that every other object in the application was using Version.X.

To me that's the crux of the whole issue: If that _is_ the expectation Lucene-Java has, then we _should_ have a single global config for luceneMatchVersion and not support per-object configuration. If that _is not_ the expectation, then we _should not_ have a global luceneMatchVersion.
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12800518#action_12800518 ]

Robert Muir commented on SOLR-1677:

bq. The implication i got from Robert was that there was (or would soon be) expectations in Lucene-Java code that if one object was told to use Version.X it would be assumed that every other object in the application was using Version.X.

Hoss, I didn't mean to imply any such thing, just that i don't see any tests (or the framework for testing such behavior), so even if it's officially supported, in my opinion it does not exist.

For example, as far as analysis goes, my personal opinion is that in any given package (say one language, or whatever), we will test the entire Analyzer against Version X, and will test that back compat works for Version Y, Z, etc. But i personally can't see myself ensuring that all the underlying tokenstreams (maybe this language uses 5, let's say) work across all the permutations of different versions { X, Y, Z } you can apply; it's simply asking too much.
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798916#action_12798916 ]

Hoss Man commented on SOLR-1677:

bq. I don't think Version is intended so you can use X.Y on this part and Y.Z on this part and have any chance of anything working, for example it controls position increments on stopfilter but also in queryparser, if you use wacky combinations, things might not work.

How is that any different from letting users pass any Analyzer they want to the QueryParser constructor? There's no guarantee that anything will ever work if you do something crazy (like uppercase all terms when indexing, and lowercase all terms when searching). But lucene exposes that to the developer and lets them make the choice -- likewise Solr happily lets you configure a query analyzer that's completely different from your index analyzer -- if that's what you want, that's what you get: being able to set different Version params should be no different.

If the QueryParser you are using says that version=X.Y will only work with StopFilter if it's version=X.Y as well that's fine -- but maybe you've solved that problem a completely different way with a completely alternate implementation of StopFilter (that doesn't care about version). The user should be in control.

bq. sometimes things interact in ways we cannot detect automatically

which is why i think it's a bad idea to have a global default for this ... there may be situations where people explicitly want different behavior in different instances (ie: in this field i want the legacy 2.4 StopFilter behavior, but in this field i want the current 2.9 stop filter behavior) and having a default will mask the ability to do this, and make it easy to inadvertently break it.

bq. its my understanding that things like this are why Version was created in the first place.

My understanding is vastly different than yours ... All the discussions i remember about it were along the lines of preventing Class proliferation -- that people didn't like the idea of creating StandardAnalyzer2 just because StandardAnalyzer had some behavior that was considered buggy but couldn't be removed - so now there is a constructor arg instead, and static constants that let you pick a fixed behavior, or a constant that lets you pick current no matter what it is -- so applications that always want the current recommended behavior can just upgrade a jar and get it.

But I don't remember any implication that it was expected that every object would have the same Version settings as every other object -- if that was the intention then shouldn't there be a standard interface for Versionable or VersionAware objects so they can test compatibility with one another (ie: QueryParser and Analyzers that might wrap StopFilter) ? ... or a {{public static void setCurrentOperatingVersion(Version)}} method in the Version class, instead of letting each constructor take in an independent value?

FWIW: Even though I'm still convinced that having any sort of global default value for luceneMatchVersion is a bad idea -- and i'm going to keep trying to convince other people as well -- I want to make some comments about how i think it should be implemented if we do wind up doing it (just in case i get hit by a bus)

Making the Base*Factory analysis classes SolrCoreAware is really overkill for this -- there was a real conscious choice not to let things declared in schema.xml be SolrCoreAware, because it pulls back the curtain and exposes a lot of plumbing related APIs in a way that could make it hard to refactor away SolrCore functionality later. The list of plugin types that can be made SolrCoreAware is deliberately small, and confined to plugins that are already exposed to the full SolrCore API at some other time in their life cycle -- being SolrCoreAware just gives them access to the core during initialization.

If there is really going to be one uber-default global luceneMatchVersion then i think the place it makes the most sense to declare something like this is in the schema.xml -- many different solrconfig.xml files might be used with the same schema.xml, so if we're expecting that the typical behavior is to set this once and have it just work it should propagate from the IndexSchema object to the SolrCore and not vice-versa.

My suggestion for how to implement this would be...

# Add a new luceneMatchVersion attribute to the existing <schema/> tag.
# Add a new getLuceneMatchVersion() to the IndexSchema class ... SolrCore can use this to get the default.
# When init()ing new objects, include the key=value pair of {{luceneMatchVersion=schema.getLuceneMatchVersion()}} to the init method of the object if it's not already an init param for that particular instance.

This would eliminate the need to make any of the Analysis Factories
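Editor's illustration: step 3 of the suggestion above (add the schema-wide luceneMatchVersion to an object's init params only when that object does not already set one) can be sketched in Java as follows. This is a minimal sketch under stated assumptions: InitArgDefaults and withSchemaDefault are hypothetical names, the schema-level value is passed in as a plain string standing in for the proposed schema.getLuceneMatchVersion(), and nothing here is taken from the actual patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; not part of Solr.
final class InitArgDefaults {
    static final String KEY = "luceneMatchVersion";

    private InitArgDefaults() {}

    // Returns a copy of initArgs in which KEY is filled from the schema-level
    // value unless the instance already declares its own.
    static Map<String, String> withSchemaDefault(Map<String, String> initArgs,
                                                 String schemaMatchVersion) {
        Map<String, String> merged = new HashMap<>(initArgs);
        if (!merged.containsKey(KEY)) {
            merged.put(KEY, schemaMatchVersion);
        }
        return merged;
    }
}
```

Because the factory only ever sees its own init params, it needs no reference to SolrCore or SolrConfig, which is the point the comment makes about not having to make the factories SolrCoreAware.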
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798921#action_12798921 ]

Robert Muir commented on SOLR-1677:

{quote}
which is why i think it's a bad idea to have a global default for this ... there may be situations where people explicitly want different behavior in different instances (ie: in this field i want the legacy 2.4 StopFilter behavior, but in this field i want the current 2.9 stop filter behavior) and having a default will mask the ability to do this, and make it easy to inadvertently break it.
{quote}

but this patch does argue for a global default, which is 2.4; it's just hardcoded inside the java code.

bq. The user should be in control.

You argue against yourself when you say this, but prevent the user from changing this hardcoded 2.4 default.
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798930#action_12798930 ]

Hoss Man commented on SOLR-1677:

bq. You argue against yourself when you say this, but prevent the user from changing this hardcoded 2.4 default.

WTF?!?! ... now i feel like you are just messing with my head.

I've never argued that the user shouldn't be allowed to change the behavior of any class away from the (hardcoded) 2.4 behavior -- i've tried to be very clear that my objection was only to the new global default setting that would have action at a distance for all of these Version dependent classes w/o any obvious indication of what it would affect.

To be as clear as i possibly know how: I am completely in favor of this new syntax added by Uwe's patch...

{code:title=src/test/test-files/solr/conf/schema-luceneMatchVersion.xml}
<fieldtype name="text20" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" luceneMatchVersion="LUCENE_20"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" luceneMatchVersion="LUCENE_24"/>
    <filter class="solr.EnglishPorterFilterFactory"/>
  </analyzer>
</fieldtype>
{code}

...and this is the *only* new syntax added by Uwe's patch that i am opposed to...

{code:title=src/test/test-files/solr/conf/solrconfig.xml}
<luceneMatchVersion>LUCENE_29</luceneMatchVersion>
{code}
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798936#action_12798936 ]

Robert Muir commented on SOLR-1677:

bq. WTF?!?! ... now i feel like you are just messing with my head.

I am really not trying to. I guess we have just put in some recent work that only takes effect with Version = something recent, and it would be a shame if it were never used because we made this too difficult, and it simply falls back on 2.4 and works without this parameter so no one bothers.

And I also can't see anyone really spending time to aggressively ensure that the example schema etc is all up to date (personally i would try to help, it is difficult though with lucene and solr so out of sync)

{quote}
I've never argued that the user shouldn't be allowed to change the behavior of any class away from the (hardcoded) 2.4 behavior - i've tried to be very clear that my objection was only to the new global default setting that would have action at a distance for all of these Version dependent classes w/o any obvious indication of what it would affect.
{quote}

the hardcoded 2.4 behavior is the action at a distance, because if i do not specify Version in my configuration file, then i get this very old behavior.

If this is really your concern, then i have an alternative i propose.
* No default anywhere, not even in the code
* Version is mandatory if the thing requires it
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798937#action_12798937 ]

Uwe Schindler commented on SOLR-1677:

{quote}
My suggestion for how to implement this would be...

# Add a new luceneMatchVersion attribute to the existing <schema/> tag.
# Add a new getLuceneMatchVersion() to the IndexSchema class ... SolrCore can use this to get the default.
# When init()ing new objects, include the key=value pair of {{luceneMatchVersion=schema.getLuceneMatchVersion()}} to the init method of the object if it's not already an init param for that particular instance.

This would eliminate the need to make any of the Analysis Factories SolrCoreAware (or even ResourceLoaderAware) just to know what the luceneMatchVersion should be -- the Base*Factories could still contain a {{protected Version luceneMatchVersion}} set by the base init() method that subclasses could use as needed.

NOTE: This still doesn't solve the "Analyzers must have no-arg constructors" part of the issue -- but it doesn't make it worse. We can make IndexSchema pass this.getLuceneMatchVersion() to any Analyzer with a single arg Version constructor fairly easily. If/When we provide a more general mechanism for passing constructor args to Analyzers, any Version params could be defaulted just like with the factory init() methods.
{quote}

That was my proposal a few comments above. But: I still do not want it in schema.xml, as Version is a global Lucene thing! But the behaviour would be the same: The schema code can get the version from somewhere and pass it down to all schema components as you propose.

The "Analyzers must have a no-arg ctor" problem is easy: use reflection and look first for a ctor that takes a Version; if it exists, use it and pass the version from the init/schema/config arg; if it does not exist, use the no-arg ctor. We already have this in Lucene's benchmark contrib since 3.0.
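Editor's illustration: the reflection fallback described in this comment (prefer a constructor that takes a Version, otherwise use the no-arg constructor) might be sketched in Java like this. It is only a sketch of the technique, not the code from the patch or from Lucene's benchmark contrib. The Version enum stands in for org.apache.lucene.util.Version, and VersionedAnalyzer, LegacyAnalyzer, and AnalyzerInstantiator are hypothetical classes defined here so that both paths can be exercised.

```java
import java.lang.reflect.Constructor;

// Stand-in for org.apache.lucene.util.Version so the sketch compiles without Lucene.
enum Version { LUCENE_24, LUCENE_29, LUCENE_30 }

// Hypothetical analyzer with a one-arg Version constructor.
class VersionedAnalyzer {
    final Version matchVersion;
    public VersionedAnalyzer(Version matchVersion) { this.matchVersion = matchVersion; }
}

// Hypothetical analyzer that only has a no-arg constructor.
class LegacyAnalyzer {
    public LegacyAnalyzer() {}
}

// Hypothetical helper; not part of Solr or Lucene.
final class AnalyzerInstantiator {
    private AnalyzerInstantiator() {}

    static <T> T newInstance(Class<T> clazz, Version matchVersion) {
        try {
            try {
                // Preferred path: a public constructor taking a single Version.
                Constructor<T> ctor = clazz.getConstructor(Version.class);
                return ctor.newInstance(matchVersion);
            } catch (NoSuchMethodException noVersionCtor) {
                // Fallback path: the public no-arg constructor.
                return clazz.getConstructor().newInstance();
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + clazz.getName(), e);
        }
    }
}
```

One design point worth noting: the fallback silently ignores the requested version for classes without a Version constructor, so a real implementation might want to log that case.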
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796854#action_12796854 ]

Hoss Man commented on SOLR-1677:

bq. User Carl isn't helpful, user Carl is an idiot.

Oh come on now ... that's not really a fair criticism of the example: there are plenty of legitimate ways to use (some) TokenFilters only at search time and I specifically structured my example to point out potential problems in cases just like that -- Carl was very clear that if you used FooTokenFilterFactory in an index analyzer you'll need to reindex.

But fine, I'll amend my example to do it your way...

{panel}
...
Bob Asks his question (see previous example)

User Carl is on vacation and never sees Bob's email

User Dwight helpfully replies...

bq. That was identified as a bug with FooTokenFilter that was fixed in Lucene 3.1, but the default behavior was left as is for backcompatibility. If you change your <luceneAnalyzerVersionDefault/> value to 3.1 (or 3.2) you'll get the newer/better behavior - but you _must_ reindex all of your data after you make this change.

Bob makes the change to 3.2 that Dwight recommended, reindexes all of his data, and is happy to see that now his queries work and everything seems fine.

What Bob doesn't realize (and what Dwight wasn't aware of) is that elsewhere in his schema.xml file, Bob is also using the YakTokenizerFactory on a different field (yakField), and the behavior of the YakTokenizer changed in Lucene 3.0. This change is generally considered better behavior than YakTokenizer had before, but in combination with another TokenFilter Bob is using on the yakField it causes behavior that is not what Bob wants. Now some types of queries that use the yakField are failing, and *failing silently*.
{panel}

You could now argue that User Dwight is an idiot because he didn't warn Bob that other Analyzers/Tokenizers/TokenFilters might be affected. But that just leads us to scenarios that re-iterate my point that this type of global value is something that would be dangerous to ever change

{panel}
...
Bob Asks his question (see previous examples)

User Carl has unsubscribed from the solr-user list (because a Bill Murray look-a-like hurt his feelings) and never sees Bob's email.

User Dwight is on vacation and never sees Bob's email.

User Ernest helpfully replies...

{quote}
That was identified as a bug with FooTokenFilter that was fixed in Lucene 3.1, but the default behavior was left as is for backcompatibility. If you change your <luceneAnalyzerVersionDefault/> value to 3.1 (or 3.2) you'll get the newer/better behavior -- *But this is Very VERY Dangerous: It could potentially affect the behavior of other analyzers you are using. You need to check the javadocs for each and every Analyzer, Tokenizer, and TokenFilter you use to see what their behavior is with various values of the Version property before you make a change like this. Personally I never change the value of <luceneAnalyzerVersionDefault/> once i have an existing schema.xml file. Instead i suggest you add {{luceneVersion="3.2"}} to your {{<filter class="solr.FooTokenFilterFactory" />}} declaration so that you know you are only changing the behavior you want to change.

BTW: You _must_ reindex all of your data after doing either of these things in order for it to work.
{quote}

Bob follows Ernest's advice, and everything is fine .. but Bob is left wondering what the point is of a config option that's so dangerous to change, and wishes there was an easy way to know which of his Analyzers and Factories are depending on that scary global value.
{panel}

At the end of the day it just seems like a bigger risk than a feature ... I feel like i must still be misunderstanding the motivation you guys have for adding it, because it really seems like it boils down to "easier than having the property 2.9 set on every analyzer/factory".

I guess i ultimately have no stringent objection to a global schema.xml setting like this existing as an expert level feature (for people who want really compact config files i guess), I just don't want to see it used in the example schema.xml file(s) where it's likely to screw novice users over.
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796862#action_12796862 ]

Robert Muir commented on SOLR-1677:

bq. Oh come on now ... that's not really a fair criticism of the example: there are plenty of legitimate ways to use (some) TokenFilters only at search time and I specifically structured my example to point out potential problems in cases just like that - Carl was very clear that if you used FooTokenFilterFactory in an index analyzer you'll need to reindex.

I disagree. Version applies to all of lucene (even more than tokenstreams), so for Carl to imply that you don't need to reindex after bumping Version simply because you aren't using X or Y or Z, for that he should be renamed Oscar.

bq. You could now argue that User Dwight is an idiot because he didn't warn Bob that other Analyzers/Tokenizers/TokenFilters might be affected. But that just leads us to scenarios that reiterate my point that this type of global value is something that would be dangerous to ever change

Yeah, I guess I don't think he is an idiot. I just think he is a moron for suggesting such a thing without warning of the consequences.

bq. Personally I never change the value of {{<luceneAnalyzerVersionDefault/>}} once I have an existing schema.xml file. Instead I suggest you add luceneVersion=3.2 to your {{<filter class="solr.FooTokenFilterFactory" />}} declaration so that you know you are only changing the behavior you want to change.

Good for Ernest, I guess he is probably using Windows 3.1 still too because he doesn't want to upgrade ever. And that only works if Ernest also carefully reads Lucene CHANGES, reads all the Solr source code, and knows which solr features are tied to which lucene features, because it's not obvious at all: e.g. solr's snowball factory doesn't use lucene's snowball, etc.

bq. At the end of the day it just seems like a bigger risk than a feature ... I feel like I must still be misunderstanding the motivation you guys have for adding it, because it really seems like it boils down to "easier than having the property 2.9 set on every analyzer/factory"

Yes, you are right: personally I don't want all users to be stuck with Version.LUCENE_24 forever.

Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
---
Key: SOLR-1677
URL: https://issues.apache.org/jira/browse/SOLR-1677
Project: Solr
Issue Type: Sub-task
Components: Schema and Analysis
Reporter: Uwe Schindler
Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch

Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards compatibility with old indexes created using older versions of Lucene. The most important example is StandardTokenizer, which changed its behaviour with posIncr and incorrect host token types in 2.4 and also in 2.9.

In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with much more Unicode support, almost every Tokenizer/TokenFilter needs this Version parameter. In 2.9, the deprecated old ctors without Version take LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.

This patch adds basic support for the Lucene Version property to the base factories. Subclasses then can use the luceneMatchVersion decoded enum (in 3.0) / Parameter (in 2.9) for constructing Tokenstreams. The code currently contains a helper map to decode the version strings, but in 3.0 it can be replaced by Version.valueOf(String), as the Version is a subclass of Java5 enums. The default value is Version.LUCENE_24 (as this is the default for the no-version ctors in Lucene).

This patch also removes unneeded conversions to CharArraySet from StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed to match Lucene 3.0.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796872#action_12796872 ]

Uwe Schindler commented on SOLR-1677:

In my opinion, it should be possible to set the default in solrconfig.xml, because there is currently no requirement to set a version for all TS components. In the shipped solrconfig.xml, this default is the version of the shipped lucene, so new users can use the default config and extend it as taught in all the courses and books about solr. They do not need to care about the version. If they upgrade their lucene version, their config stays on the previous setting and they are fine. If they want to change some of the components (like query parser, index writer, index reader -- flex!!!), they can do it locally. So Bob could make the change the way Ernest proposed.

If we do not have a default, all users will stay stuck with lucene 2.4, because they do not care about version (it is not required, because it defaults to 2.4 for BW compatibility). So lots of configs will never use the new unicode features of Lucene 3.1. And then suddenly Lucene 4.0 comes out, all support for Lucene 3 is removed, and all users cry. With a default version set to 2.4, they will then get a runtime error in Lucene 4.0, saying that Version.LUCENE_24 is no longer available as an enum constant.

If you really do not want to have a default version in config (not schema, because it applies to *all* lucene components), then you should go the same way as Lucene 3.0: require a matchVersion for all components. As there may be tokenstream components that are not from lucene, make this attribute mandatory in the schema only for lucene streams. This can be done with my initial patch, too: if the matchVersion property is missing, the matchVersion will be NULL and the factory should throw an IAE if it requires one. In my original patch, only the parsing code would have to be moved out of the factory into a util class in solr (it may also be possible to parse x.y-style versions there). The problem here: users upgrading from solr 1.4 will suddenly get errors, because their configs become invalid.
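The parsing idea from the last paragraph could look roughly like the sketch below. It is only an illustration under stated assumptions: {{MatchVersion}} is a hypothetical stand-in for o.a.lucene.util.Version (so the code compiles without Lucene on the classpath), and {{MatchVersionParser}}, {{parse}} and {{require}} are made-up names, not code from any attached patch. It accepts both constant names and x.y-style strings, returns null when the attribute is missing, and lets a factory that needs a version throw an IAE.

```java
import java.util.Locale;

/** Hypothetical stand-in for o.a.lucene.util.Version; not the real Lucene class. */
enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30, LUCENE_31 }

/** Hypothetical util class holding the parsing code outside of the factories. */
final class MatchVersionParser {
    private MatchVersionParser() {}

    /**
     * Parses "LUCENE_29" or "2.9" style strings.
     * Returns null when the attribute is absent, so the caller decides what to do.
     */
    static MatchVersion parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return null;
        }
        String s = raw.trim().toUpperCase(Locale.ROOT);
        // Turn an x.y-style value such as "2.9" into the constant name "LUCENE_29".
        if (s.matches("\\d+\\.\\d+")) {
            s = "LUCENE_" + s.replace(".", "");
        }
        try {
            return MatchVersion.valueOf(s);
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("Unknown luceneMatchVersion: '" + raw + "'", e);
        }
    }

    /** For components that cannot work without a version: a missing value is an error. */
    static MatchVersion require(String raw, String componentName) {
        MatchVersion v = parse(raw);
        if (v == null) {
            throw new IllegalArgumentException(componentName + " requires a luceneMatchVersion attribute");
        }
        return v;
    }
}
```

A non-lucene tokenstream factory would simply call {{parse}} and ignore a null result, while a lucene-backed factory would call {{require}}.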
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796937#action_12796937 ]

Hoss Man commented on SOLR-1677:

bq. Version applies to all of lucene (even more than tokenstreams), so for Carl to imply that you don't need to reindex by bumping Version simply because you aren't using X or Y or Z, for that he should be renamed Oscar.

Ok, fair enough ... I was supposing in that example that since I called it {{<luceneAnalyzerVersionDefault/>}} it was clearly specific to analysis objects in schema.xml and didn't affect any of the other things Version is used for (which would be specified in solrconfig.xml).

bq. i guess he is probably using Windows 3.1 still too because he doesn't want to upgrade ever.

No, he uses an OS where he can upgrade individual things individually with clear implications -- he sets {{luceneMatchVersion=2.9}} on each and every {{<analyzer/>}}, {{<tokenizer/>}} and {{<filter/>}} that he declares in his schema so that he knows exactly what behavior is changing when he modifies any of them.

bq. personally I don't want all users to be stuck with Version.LUCENE_24 forever.

I still must be missing something ... why would all users be stuck with Version.LUCENE_24 forever?

I'm not advocating that we don't allow a way to specify Version, I'm saying that having a global value for it that affects things opaquely sounds dangerous -- we should certainly have a way for people to specify the Version they want on each of the objects that care, but it shouldn't be global.

The luceneMatchVersion property that Uwe added to BaseTokenizerFactory and BaseTokenFilterFactory in his patch seems perfect to me, it's just the {{SolrCoreAware}} / {{core.getSolrConfig().luceneMatchVersion}} that I think is a bad idea.

If we modify the {{<analyzer/>}} initialization to allow constructor args as Erik suggested (I'm pretty sure there's already code in Solr to do this, we just aren't using it for Analyzers) then we should be good to go for everything in schema.xml.

If anything declared in solrconfig.xml starts caring about Version (QParser, SolrIndexWriter, etc...) then likewise it should get a luceneMatchVersion init property as well.

No one will ever be stuck with LUCENE_24, but they won't be surprised by behavior changes either.

bq. If we do not have a default, all users will stay stuck with lucene 2.4, because they do not care about version (it is not required, because it defaults to 2.4 for BW compatibility). So lots of configs will never use the new unicode features of Lucene 3.1.

I don't believe that. Almost every solr user on the planet starts with the example configs. If the example configs start specifying luceneMatchVersion=2.9 on every analyzer and factory then people will care about Version just as much as they care about the stopwords.txt file that ships with solr -- that may be not at all, or it may be a lot, but it will be up to them, and it will be obvious to them, because it's right there in the declaration where they can see it, and easy for them to reference and recognize that changing that value will affect things.

bq. If you really do not want to have a default version in config (not schema, because it applies to all lucene components), then you should go the way like Lucene 3.0: Require a matchVersion for all components.

I'm totally on board with that idea in the long run -- but there are ways to get there gradually that are back compatible with existing configs.

Individual factories that care about luceneMatchVersion should absolutely start warning on startup when it is unset (or doesn't match the current value of Version.LUCENE_CURRENT): the warning should say that newer/better behavior may be available if users set luceneMatchVersion, and provide a URL for a wiki page somewhere where more detail is available.

The Analyzer init code can do likewise if it sees an {{<analyzer class="..."/>}} being inited w/ a constructor that takes in a Version which is using an old value.
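As a rough illustration of the startup warning proposed above, a factory could run a check like the one sketched here. Everything in it is an assumption made for the example: {{SketchVersion}} stands in for o.a.lucene.util.Version, {{latest()}} plays the role of "the newest version the shipped jars know about", and {{WIKI_URL}} is a placeholder rather than a real wiki page.

```java
import java.util.Optional;

/** Hypothetical stand-in for o.a.lucene.util.Version, ordered oldest to newest. */
enum SketchVersion {
    LUCENE_24, LUCENE_29, LUCENE_30, LUCENE_31;

    /** The newest constant known to this (hypothetical) release. */
    static SketchVersion latest() {
        return values()[values().length - 1];
    }
}

/** Hypothetical helper a factory could call from its init code. */
final class MatchVersionAdvisor {
    // Placeholder only; the comment above does not name an actual wiki page.
    static final String WIKI_URL = "https://example.invalid/LuceneMatchVersion";

    private MatchVersionAdvisor() {}

    /** Returns a warning to log at startup, or empty if the configuration is current. */
    static Optional<String> startupWarning(String factoryName, SketchVersion configured) {
        if (configured == null) {
            return Optional.of(factoryName + ": luceneMatchVersion is not set, using "
                    + SketchVersion.LUCENE_24 + " behavior; newer behavior may be available, see "
                    + WIKI_URL);
        }
        if (configured.compareTo(SketchVersion.latest()) < 0) {
            return Optional.of(factoryName + ": luceneMatchVersion=" + configured
                    + " is older than " + SketchVersion.latest()
                    + "; newer behavior may be available (reindex required), see " + WIKI_URL);
        }
        return Optional.empty();
    }
}
```

The check only reports; it never changes the configured value, which keeps existing configs back compatible.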
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796965#action_12796965 ]

Robert Muir commented on SOLR-1677:

{quote}
No, he uses an OS where he can upgrade individual things individually with clear implications - he sets luceneMatchVersion=2.9 on each and every {{<analyzer/>}}, {{<tokenizer/>}} and {{<filter/>}} that he declares in his schema so that he knows exactly what behavior is changing when he modifies any of them.
{quote}

Yeah, but this isn't how Version works in lucene either, please see below.

{quote}
I'm not advocating that we don't allow a way to specify Version, I'm saying that having a global value for it that affects things opaquely sounds dangerous - we should certainly have a way for people to specify the Version they want on each of the objects that care, but it shouldn't be global. The luceneMatchVersion property that Uwe added to BaseTokenizerFactory and BaseTokenFilterFactory in his patch seems perfect to me, it's just the SolrCoreAware / core.getSolrConfig().luceneMatchVersion that I think is a bad idea.
{quote}

And I disagree: I think that the per-tokenfilter matchVersion should be the expert use, with the default global Version being the standard use.

I don't think Version is intended to let you use X.Y on this part and Y.Z on that part and have any chance of anything working. For example, it controls position increments in stopfilter but also in queryparser; if you use wacky combinations, things might not work.

And I personally don't see anyone putting effort into supporting this either, because it's enough work to supply the back compat for previous versions, let alone some cross product of all possible versions. This is too much.

Sometimes things interact in ways we cannot detect automatically (such as the query parser phrasequery / stopfilter thing). It's my understanding that things like this are why Version was created in the first place.
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796136#action_12796136 ]

Robert Muir commented on SOLR-1677:

{quote}
User Carl helpfully replies...

That was identified as a bug with FooTokenFilter that was fixed in Lucene 3.1, but the default behavior was left as is for backcompatibility. If you change your {{<luceneAnalyzerVersionDefault/>}} value to 3.1 (or 3.2) you'll get the newer/better behavior - but if you used FooTokenFilterFactory in an index analyzer you'll need to reindex.
{quote}

User Carl isn't helpful, user Carl is an idiot. The javadoc of Version in lucene clearly says:

{noformat}
 * <p><b>WARNING</b>: When changing the version parameter
 * that you supply to components in Lucene, do not simply
 * change the version at search-time, but instead also adjust
 * your indexing code to match, and re-index.
{noformat}

User Carl could also tell Bob that it's ok to index with ArabicAnalyzer and query with ChineseAnalyzer. This kind of stupid theoretical situation isn't any kind of valid logical argument against having a default value for this.
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796087#action_12796087 ]

Hoss Man commented on SOLR-1677:

bq. The problem is the default value. If you leave out the version parameter instance-wise, you will get 2.4. And because of that all solr users will get stuck with that version and will never upgrade (because they leave the default and do not specify a different value).

That feels like a misleading statement ... the Version property on these objects is really more about getting the recommended behavior as of a particular version of Lucene ... saying that users will be stuck with that version is like saying users will be stuck with StandardAnalyzer instead of getting NewHotnessAnalyzer because they have to edit their config to use the newer/better analyzer -- Lucene-Java has opted to use a Version property on existing classes instead of adding new classes, but it's still conceptually the same thing: they get the behavior they've always gotten, unless they change their config to get something different.

Besides which: 99.9% of Solr users copy the example config when they first start using Solr: we can set a version property on every Analyzer/Factory used in the example schema.xml and update them all when we upgrade the Lucene jars just as easily as we can update a single global value (it's a search+replaceAll instead of a search+replace).

bq. Why are you so against a default value?

My concern is that it introduces action at a distance -- and not in a good way.

Here's the scenario that seems guaranteed to happen quite a bit if we add some new {{<luceneAnalyzerVersionDefault/>}} syntax to schema.xml...

{panel}
{{<luceneAnalyzerVersionDefault>2.9</luceneAnalyzerVersionDefault>}} is added to the example schema.xml, and users start using it as a result of copying/modifying the example configs.

Time passes, new bugs are fixed, and the example configs evolve to contain {{<luceneAnalyzerVersionDefault>3.4</luceneAnalyzerVersionDefault>}}

A little while after that, User Bob emails solr-user with a question like...

{quote}
Hey, I'm using FooTokenFilterFactory and I noticed that at query time I see behaviorX when it really seems like I should see BehaviorY
{quote}

User Carl helpfully replies...

{quote}
That was identified as a bug with FooTokenFilter that was fixed in Lucene 3.1, but the default behavior was left as is for backcompatibility. If you change your {{<luceneAnalyzerVersionDefault/>}} value to 3.1 (or 3.2) you'll get the newer/better behavior -- but if you used FooTokenFilterFactory in an _index_ analyzer you'll need to reindex.
{quote}

Bob makes the change to 3.2 that Carl recommended, and is happy to see that his queries now work. He only uses FooTokenFilterFactory at _query_ time, so he doesn't bother to reindex, and everything seems fine.

What Bob doesn't realize (and what Carl wasn't aware of) is that elsewhere in his schema.xml file, Bob is also using the YakTokenizerFactory on a different field (yakField), and the behavior of the YakTokenizer changed in Lucene 3.0. Now _some_ documents/queries that use yakField are failing -- and *failing silently.*
{panel}

Things just get a lot simpler when all of the configuration for an Analyzer, TokenizerFactory, or Tokenizer is explicit in its declaration -- indirect initialization is fine, as long as it's obvious. I.e.: {{<field/>}} declarations referencing fieldTypes by name -- it's easy to fuck up a bunch of fields by making a single change to one fieldType, but at least you can grep for the name of the fieldType to see all the fields you are affecting.

Even if Carl knows/remembers to warn Bob that changing {{<luceneAnalyzerVersionDefault/>}} might change/break other things in his schema.xml, the situation doesn't get much better: unless Bob (or Carl) skims the code for every Analyzer, Tokenizer, and TokenFilter used in Bob's schema, they can't be sure what might get affected by making a small increase to the global luceneAnalyzerVersion setting ... which means the only safe thing for Bob to do is to set the property individually in the one place he really wants to make the change.

So why have the global in the first place? It really just seems like more trouble than it's worth.
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795746#action_12795746 ]

Uwe Schindler commented on SOLR-1677:

The problem is the default value. If you leave out the version parameter instance-wise, you will get 2.4. And because of that all solr users will get stuck with that version and will never upgrade (because they leave the default and do not specify a different value). Because of backwards compatibility, we are limited to this version number as default value.

The schema/config global version is the global default used by all instances that do not specify a different value. That way we can ship the default solrconfig/schema.xml with the latest possible lucene version, but users upgrading will keep their default value.

I repeat: with instance-wise config, nobody will ever use it for new analyzers. With a global default, there is only *one* place that sets the version, which is also valid for user-added tokenizer chains.

For the SolrCore problem: for analyzers the idea is that the default Version constant is automatically passed to all tokenizers in the param map. Local values overwrite the key in the map. But this would only apply to the analyzers. Other usages of Version at other places (QP, IW) still need SolrCore. But we can move the SolrCoreAware to the schema classes and not make every TokenFilter/Tokenizer SolrCoreAware.
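The param-map idea from the last paragraph (the global default is injected into every factory's args, and a locally declared value wins) can be sketched as below. This is only an illustration: {{FactoryArgs}} and {{withDefaultVersion}} are hypothetical names, and the real patch may wire this up differently.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper that prepares the args map handed to a tokenizer/filter factory. */
final class FactoryArgs {
    static final String KEY = "luceneMatchVersion";

    private FactoryArgs() {}

    /**
     * Returns a copy of the locally declared args with the global default version
     * filled in only if the declaration did not set its own value.
     */
    static Map<String, String> withDefaultVersion(Map<String, String> localArgs, String globalDefault) {
        Map<String, String> merged = new HashMap<>(localArgs);
        if (globalDefault != null) {
            // A local value overrides the global default, never the other way round.
            merged.putIfAbsent(KEY, globalDefault);
        }
        return merged;
    }
}
```

With this shape the factories themselves only ever read their own args map and do not need to be SolrCoreAware; only the schema code that builds the map has to know the global value.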
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795760#action_12795760 ]

Robert Muir commented on SOLR-1677:

bq. But as I said: I don't see any compelling need for a schema global Version anyway (let alone an instance wide global that applies to both solrconfig.xml and schema.xml)

Just like Uwe says, this is the problem with having no default.

If the default Version is going to be 2.4, I would like a global setting so that I get bugfixes and improvements, because a few things have happened to this code since 2.4.

I also do not want to list it 10,000 times, but it's not enough to make the default Version the latest to fix this problem. I want my config to be wired to '2.9' or whatever, so that when upgrading, everything continues to work.

Why are you so against a default value?
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795728#action_12795728 ]

Hoss Man commented on SOLR-1677:

bq. Is that true? Many times so far, but Version is not limited to such things. It can be used for far more than how to read/write the index properly.

Perhaps, but that would be a very different usage. Even if Lucene-Java uses the same o.a.l.util.Version class for driving both Analyzers/Tokenizers/TokenFilters and IndexWriters/MergeScheduler/QueryParser, those are very different things in Solr land: in a replication setup, two different instances might use very different Version values for the IndexWriter/MergeScheduler/QueryParser (configured in solrconfig.xml) but they should have identical schema.xml files and identical (versioned) analyzer settings.

But as I said: I don't see any compelling need for a schema global Version anyway (let alone an instance wide global that applies to both solrconfig.xml and schema.xml).
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12794440#action_12794440 ]

Erik Hatcher commented on SOLR-1677:

Another comment on this... Solr also supports using an Analyzer directly, but only ones with zero-arg constructors. It would be nice if this Version support also allowed Analyzers (say SmartChineseAnalyzer) to be used directly. I don't think this patch accounts for this case, does it?
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12794447#action_12794447 ]

Uwe Schindler commented on SOLR-1677:

Thanks for the hint. This means Solr can instantiate an analyzer via reflection and uses the zero-arg ctor, which is no longer available. So with Lucene 3.0 it will no longer work at all. As I do not have much experience with hacking Solr, I did not notice this.

In my own project I have the same mechanism. There I inspect the loaded class via reflection and use the ctor that takes a Version; if that is not available, I use the empty ctor.
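The reflection approach described in this comment (use the Version ctor if the class has one, otherwise the empty ctor) could look roughly like the sketch below. It is only an illustration, not the code from the patch: the nested Version enum is a stand-in for o.a.lucene.util.Version so that the example compiles without Lucene, and AnalyzerInstantiationSketch, VersionedAnalyzer, LegacyAnalyzer and newInstance are hypothetical names.

```java
class AnalyzerInstantiationSketch {

    // Stand-in for org.apache.lucene.util.Version (assumption, keeps the sketch self-contained).
    public enum Version { LUCENE_24, LUCENE_29, LUCENE_30 }

    // Hypothetical analyzer that only offers a one-arg Version constructor.
    public static class VersionedAnalyzer {
        public final Version matchVersion;
        public VersionedAnalyzer(Version matchVersion) { this.matchVersion = matchVersion; }
    }

    // Hypothetical analyzer that only offers a no-arg constructor.
    public static class LegacyAnalyzer {
        public LegacyAnalyzer() {}
    }

    /** Prefers a public one-arg Version constructor and falls back to the public no-arg one. */
    static <T> T newInstance(Class<T> clazz, Version matchVersion) {
        try {
            try {
                return clazz.getConstructor(Version.class).newInstance(matchVersion);
            } catch (NoSuchMethodException noVersionCtor) {
                // No Version constructor declared: try the empty constructor instead.
                return clazz.getConstructor().newInstance();
            }
        } catch (ReflectiveOperationException e) {
            // Neither constructor exists, or the chosen constructor failed.
            throw new IllegalStateException("Cannot instantiate " + clazz.getName(), e);
        }
    }

    public static void main(String[] args) {
        VersionedAnalyzer versioned = newInstance(VersionedAnalyzer.class, Version.LUCENE_29);
        LegacyAnalyzer legacy = newInstance(LegacyAnalyzer.class, Version.LUCENE_29);
        System.out.println(versioned.matchVersion + " / " + legacy.getClass().getSimpleName());
    }
}
```

Note that the match version is silently ignored for classes that only have the empty ctor, so in a real setup one might want to log that case.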
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793599#action_12793599 ]

Mark Miller commented on SOLR-1677:

bq. it should be in schema.xml, as it pertains to the index itself and how to read/write to the index properly and not to the particularities of how a particular solr installation might be using that data

Is that true? Many times so far, but Version is not limited to such things. It can be used for far more than how to read/write the index properly.
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793228#action_12793228 ]

Hoss Man commented on SOLR-1677:

{quote}
* As a first hack the solrConfig schema has a new element luceneMatchVersion that contains a solr-wide default luceneMatchVersion value that is used as default for QueryParser, Analyzers if not specified different
* On the analyzer side, BaseTokenizerFactory and BaseTokenFilterFactory now extend SolrCoreAware (and I also allowed these classes to be SolrCoreAware) and get the SolrConfig.
{quote}

I'd really prefer that nothing like this make it into Solr.

One: we've worked pretty hard to make sure that nothing in the analysis code is SolrCoreAware -- the goal was to try and keep the schema-related code reusable w/o risk of factories adding tendrils that reach deep into the other Solr code (it's only a matter of time until someone starts refactoring all of the schema-related code out of Solr and into a Lucene contrib). If we really want to add a new global setting for the default match version, it should be in schema.xml, as it pertains to the index itself and how to read/write to the index properly and not to the particularities of how a particular Solr installation might be using that data (schema.xml = the nature of the data; solrconfig.xml = the usage of the data).

Two: I really question the need for a configurable default across all analysis factories. This seems like the type of thing that's going to be changed rarely if ever, and when it is changed each field will need to be considered very carefully to decide whether the new behavior is desired over the old. I suspect the only time anyone is going to upgrade all factories at once is when we rev the Lucene jars and update the example configs -- in that case (and in the case of a user who is happy to blow away all of their data and take the newest, regardless of what it is, for every analyzer) a search and replace seems perfectly appropriate.
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793022#action_12793022 ]

Robert Muir commented on SOLR-1677:

Hello Uwe, I would like to be able to specify the default, at some global level, for all tokenstreams. For example, if I were setting up a new Solr configuration, I would want to say 'give me 3.1 support for all tokenstreams by default'?

Key: SOLR-1677
URL: https://issues.apache.org/jira/browse/SOLR-1677
Project: Solr
Issue Type: Sub-task
Components: Schema and Analysis
Reporter: Uwe Schindler
Attachments: SOLR-1677.patch
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793047#action_12793047 ]

Uwe Schindler commented on SOLR-1677:

bq. for example, if i was setting up a new solr configuration, i would want to say 'give me 3.1 support for all tokenstreams by default' ?

I have no idea how to define global properties in schema.xml that apply to all factories. If this is possible, the LUCENE_24 else clause and the default value can be changed to the global default (which itself defaults to Version.LUCENE_24). In this case the parser map (for Lucene 2.9/Java 1.4) on the version enum should also move to a more central place.

Key: SOLR-1677
URL: https://issues.apache.org/jira/browse/SOLR-1677
Project: Solr
Issue Type: Sub-task
Components: Schema and Analysis
Reporter: Uwe Schindler
Attachments: SOLR-1677.patch, SOLR-1677.patch
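The fallback order discussed in this comment (a per-factory value, else a global default, which itself defaults to LUCENE_24) can be sketched as follows. This is an assumption-laden illustration rather than the patch's code: the nested Version enum stands in for o.a.lucene.util.Version, enum valueOf is used in place of the Java 1.4 helper map, and MatchVersionResolutionSketch, parse and resolve are hypothetical names.

```java
import java.util.Locale;

class MatchVersionResolutionSketch {

    // Stand-in for org.apache.lucene.util.Version (assumption, keeps the sketch self-contained).
    enum Version { LUCENE_24, LUCENE_29, LUCENE_30 }

    // Last-resort default, matching the default of the no-version ctors mentioned in the issue.
    static final Version HARD_DEFAULT = Version.LUCENE_24;

    /** Decodes a string such as "LUCENE_29"; returns null when nothing was configured. */
    static Version parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            return null;
        }
        // valueOf throws IllegalArgumentException for unknown version names.
        return Version.valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    /** Per-factory setting wins, then the global default, then LUCENE_24. */
    static Version resolve(String factoryArg, String globalDefault) {
        Version perFactory = parse(factoryArg);
        if (perFactory != null) {
            return perFactory;
        }
        Version global = parse(globalDefault);
        return global != null ? global : HARD_DEFAULT;
    }

    public static void main(String[] args) {
        System.out.println(resolve("LUCENE_29", "LUCENE_30"));
        System.out.println(resolve(null, "LUCENE_30"));
        System.out.println(resolve(null, null));
    }
}
```

Keeping the decoding in one central helper like this is one way to read the suggestion to move the parser map to a more central place; where such a helper would live in Solr is left open here.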