[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-03-15 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845214#action_12845214
 ] 

Uwe Schindler commented on SOLR-1677:
-

I also added support for instantiating Lucene Analyzers directly, that broke 
with the 3.0-upgrade. The new code now prefers a one-arg-Version-ctor and falls 
back to the no-arg one. The only thing that is not working at the moment is the 
-Aware stuff, as SolrResourceLoader.newInstance() was not useable.
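
The constructor preference described above can be sketched with plain reflection. This is only an illustration, not the code from the patch: the nested Version enum and the two analyzer classes are hypothetical stand-ins, so the example runs without Lucene on the classpath.

```java
import java.lang.reflect.Constructor;

class VersionCtorFallback {
    // Hypothetical stand-in for org.apache.lucene.util.Version.
    enum Version { LUCENE_24, LUCENE_29, LUCENE_30 }

    // Illustrative analyzer with a one-arg Version constructor.
    public static class VersionedAnalyzer {
        public final Version matchVersion;
        public VersionedAnalyzer(Version matchVersion) { this.matchVersion = matchVersion; }
    }

    // Illustrative analyzer that only offers a no-arg constructor.
    public static class LegacyAnalyzer {
        public LegacyAnalyzer() { }
    }

    // Prefer a public (Version) constructor; fall back to the public no-arg one.
    static <T> T newInstance(Class<T> clazz, Version matchVersion) {
        try {
            try {
                Constructor<T> versioned = clazz.getConstructor(Version.class);
                return versioned.newInstance(matchVersion);
            } catch (NoSuchMethodException noVersionCtor) {
                return clazz.getConstructor().newInstance();
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + clazz.getName(), e);
        }
    }

    public static void main(String[] args) {
        VersionedAnalyzer a = newInstance(VersionedAnalyzer.class, Version.LUCENE_30);
        LegacyAnalyzer b = newInstance(LegacyAnalyzer.class, Version.LUCENE_30);
        System.out.println(a.matchVersion + " / " + b.getClass().getSimpleName());
    }
}
```

Trying the Version constructor first means a class that offers both constructors always receives the configured version, and the no-arg path is only used for classes that have no Version-aware constructor at all.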

 Add support for o.a.lucene.util.Version for BaseTokenizerFactory and 
 BaseTokenFilterFactory
 ---

 Key: SOLR-1677
 URL: https://issues.apache.org/jira/browse/SOLR-1677
 Project: Solr
  Issue Type: Sub-task
  Components: Schema and Analysis
Reporter: Uwe Schindler
 Attachments: SOLR-1677-lucenetrunk-branch.patch, SOLR-1677.patch, 
 SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch


 Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards 
 compatibility with old indexes created using older versions of Lucene. The 
 most important example is StandardTokenizer, which changed its behaviour with 
 posIncr and incorrect host token types in 2.4 and also in 2.9.
 In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with 
 much more Unicode support, almost every Tokenizer/TokenFilter needs this 
 Version parameter. In 2.9, the deprecated old ctors without Version take 
 LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.
 This patch adds basic support for the Lucene Version property to the base 
 factories. Subclasses then can use the luceneMatchVersion decoded enum (in 
 3.0) / Parameter (in 2.9) for constructing Tokenstreams. The code currently 
 contains a helper map to decode the version strings, but in 3.0 it can be 
 replaced by Version.valueOf(String), as the Version is a subclass of Java5 
 enums. The default value is Version.LUCENE_24 (as this is the default for the 
 no-version ctors in Lucene).
 This patch also removes unneeded conversions to CharArraySet from 
 StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed 
 to match Lucene 3.0.
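
A minimal sketch of the two decoding approaches mentioned in the description, assuming the configured string is the constant name (for example "LUCENE_29"). The Version enum here is a hypothetical stand-in, and the exact strings the patch accepts are not stated in this thread.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class MatchVersionParser {
    // Hypothetical stand-in for org.apache.lucene.util.Version.
    enum Version { LUCENE_24, LUCENE_29, LUCENE_30 }

    // Default named in the description: matches the no-version ctors.
    static final Version DEFAULT_VERSION = Version.LUCENE_24;

    // Helper-map approach, usable even when Version is not a Java 5 enum.
    private static final Map<String, Version> BY_NAME = new HashMap<>();
    static {
        for (Version v : Version.values()) {
            BY_NAME.put(v.name(), v);
        }
    }

    static Version parseWithMap(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_VERSION;
        }
        Version v = BY_NAME.get(raw.trim().toUpperCase(Locale.ROOT));
        if (v == null) {
            throw new IllegalArgumentException("Unknown Lucene version: " + raw);
        }
        return v;
    }

    // Enum approach: valueOf(String) replaces the map entirely.
    static Version parseWithValueOf(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_VERSION;
        }
        return Version.valueOf(raw.trim().toUpperCase(Locale.ROOT));
    }
}
```

Both methods fall back to LUCENE_24 when nothing is configured and reject unknown names with an IllegalArgumentException, so a misspelled version fails loudly instead of silently selecting the default.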

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-02-02 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12828835#action_12828835
 ] 

Hoss Man commented on SOLR-1677:


bq. I guess I could care less what the default is, if you care about such 
things you shouldn't be using the defaults and instead specifying this yourself 
in the schema, and Version has no effect.

...which is all well and good, but it just reiterates the need for really good 
documentation about what is impacted by changing a global Version setting -- 
otherwise users might be depending on a default behavior that is going to 
change when Version is bumped, and they may not even realize it.

Bear in mind: these are just the nuances that people need to worry about when 
considering a switch from 2.4 to 2.9 to 3.0 ... there will likely be a lot more 
of these over time.

And just to be as crystal clear as i possibly can:
* my concern is purely about how to document this stuff.
* i do in fact agree that a global luceneMatchVersion option is a good idea




[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-26 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12805167#action_12805167
 ] 

Hoss Man commented on SOLR-1677:


bq. And here are the JIRA issues for stemming bugs, since you didnt take my 
hint to go and actually read them.

sigh.  I read both those issues when you filed them, and I agreed with your 
assessment that they are bugs we should fix -- if i had thought you were wrong 
i would have said so in the issue comments.

But that doesn't change the fact that sometimes people depend on buggy behavior 
-- and sometimes those people depend on the buggy behavior without even 
realizing it.  Bug fixes in a stemmer might make it more correct according to 
the stemmer algorithm specification, or the language semantics, but in some 
peculiar use cases an application might find the correct implementation less 
useful than the previous buggy version.

This is one reason why things like CHANGES.txt are important: to draw attention 
to what has changed between two versions of a piece of software, so people can 
make informed decisions about what they should test in their own applications 
when they upgrade things under the covers.  luceneMatchVersion should be no 
different.  We should try to find a simple way to tell people "when you 
switch from luceneMatchVersion=X to luceneMatchVersion=Y, here are the bug fixes 
you will get", so they know what to test to determine if they are adversely 
affected by that bug fix in some way (and can find their own workaround).

bq. Perhaps you should come up with a better example than stemming, as you 
don't know what you are talking about.

1) It's true, I frequently don't know what i'm talking about ... this issue was 
a prime example, and i thank you, Uwe, and Miller for helping me realize that i 
was completely wrong in my understanding about the intended purpose of 
o.a.l.Version, and that a global setting for it in Solr makes total sense -- 
But that doesn't make my concerns about documenting the effects of that global 
setting any less valid.

2) Perhaps you should read the StopFilter example i already posted in my last 
comment...

{quote}
bq. Robert mentioned in an earlier comment that StopFilter's position increment 
behavior changes depending on the luceneMatchVersion -- what if an existing 
Solr 1.3 user notices a bug in some Tokenizer, and adds 
{{<luceneMatchVersion>3.0</luceneMatchVersion>}} to his schema.xml to fix it?  
Without clear documentation on _everything_ that is affected when doing that, he 
may not realize that StopFilter changed at all -- and even though the position 
increment behavior may now be more correct, it might drastically change the 
results he gets when using dismax with a particular qs or ps value.  Hence my 
point that this becomes a serious documentation concern: finding a way to make 
it clear to users what they need to consider when modifying luceneMatchVersion.
{quote}




[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-26 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12805187#action_12805187
 ] 

Robert Muir commented on SOLR-1677:
---

bq. 2) Perhaps you should read the StopFilter example i already posted in my 
last comment...

https://issues.apache.org/jira/browse/LUCENE-2094?focusedCommentId=12783932page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12783932

as far as this one goes, i specifically commented before on this not being 
'hidden' by Version (with Solr users in mind) but instead its own option that 
every user should consider, regardless of defaults.

For the stopfilter posInc the user should think it through; it's pretty strange, 
like i mention in my comment, that a definite article like 'the' gets a posInc 
bump in one language but not another, simply because it happens to be separated 
by a space.

I guess I could care less what the default is, if you care about such things 
you shouldn't be using the defaults and instead specifying this yourself in the 
schema, and Version has no effect. I can't really defend the whole stopfilter 
posInc thing, as again i think it doesn't make a whole lot of sense, maybe it 
works good for english I guess, I won't argue about it.





[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-20 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12802973#action_12802973
 ] 

Hoss Man commented on SOLR-1677:



bq. I think I am slightly offended with some of your statements about 
'subjective opinion of the Lucene Community' and 'they should do relevancy 
testing which use some language-specific stemmer whose behavior changed in a 
small but significant way'.

That was not at all my intention, i'm sorry about that.  I was in fact trying 
to speak entirely in generalities and theoretical examples.

The point I was trying to make is that the types of bug fixes we make in Lucene 
are no mathematical absolutes -- we're not fixing bugs where 1+1=3.  Even if 
everyone on java-dev and java-user agrees that behavior A is broken and 
behavior B is correct, that is still (to me) a subjective opinion -- 1000 men's 
trash may be one man's treasure, and there could be users out there who have 
come to expect/rely on behavior A.

I tried to use a stemmer as an example because it's the type of class where 
making behavior more correct (ie: making the stemming match the semantics of 
the language more accurately) doesn't necessarily improve the perceived 
behavior for all users -- someone could be very happy with the sloppy 
stemming in the 3.1 version of a (hypothetical) EsperantoStemmer because it 
gives him really loose matches.  And if you (or anyone else) put in a lot of 
hard work making that stemmer better by all conceivable metrics in 3.4, then 
i've got no problem telling that person "Sorry dude, if you don't want those 
fixes don't upgrade, or here are some other suggestions for getting 'loose' 
matching on that field."

My concern is that there may be people who don't even realize they are 
depending on behavior like this.  Without an easy way for users to understand 
what objects have improved/fixed behavior between luceneMatchVersion=X and 
luceneMatchVersion=Y they won't know the full list of things they should be 
considering/testing when they do change luceneMatchVersion.

bq. I'm also not that worried that users won't know what changed - they will 
just know that they are in the same boat as those downloading Lucene latest 
greatest for the first time.

But that's not true: a person downloading for the first time won't have any 
preconceived expectations of how something will behave; that's a very 
different boat from a person upgrading, who is going to expect things that were 
working to keep working -- those things may have actually been bugs in earlier 
versions, but if they _seemed_ to be working for their use cases, it's going to 
feel like it's broken when the behavior changes.  For a user who is consciously 
upgrading i'm ok with that, but when there is no easy way of knowing what 
behavior will change as a result of setting luceneMatchVersion=X, that doesn't 
feel fair to the user.

Robert mentioned in an earlier comment that StopFilter's position increment 
behavior changes depending on the luceneMatchVersion -- what if an existing 
Solr 1.3 user notices a bug in some Tokenizer, and adds 
{{<luceneMatchVersion>3.0</luceneMatchVersion>}} to his schema.xml to fix it?  
Without clear documentation on _everything_ that is affected when doing that, he 
may not realize that StopFilter changed at all -- and even though the position 
increment behavior may now be more correct, it might drastically change the 
results he gets when using dismax with a particular qs or ps value.  Hence my 
point that this becomes a serious documentation concern: finding a way to make 
it clear to users what they need to consider when modifying luceneMatchVersion.
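
To make the StopFilter concern concrete, here is a simplified, Lucene-free model of how token positions differ depending on whether removed stop words still bump the position of the next token. The class and method names are hypothetical, and the real StopFilter works on TokenStreams rather than on a whitespace split; this only illustrates why phrase matching with a given slop (qs/ps) can change.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class StopPositionsDemo {
    // Returns token -> position after stop word removal.
    // Assumes each kept token occurs once (enough for this illustration).
    static Map<String, Integer> positions(String text, Set<String> stopWords,
                                          boolean enablePositionIncrements) {
        Map<String, Integer> result = new LinkedHashMap<>();
        int position = -1;
        int skipped = 0; // stop words removed since the last kept token
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (stopWords.contains(token)) {
                skipped++;
                continue;
            }
            // With increments enabled, the removed words leave a gap.
            position += 1 + (enablePositionIncrements ? skipped : 0);
            skipped = 0;
            result.put(token, position);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("in", "the"));
        System.out.println(positions("fox in the box", stop, true));  // {fox=0, box=3}
        System.out.println(positions("fox in the box", stop, false)); // {fox=0, box=1}
    }
}
```

In this model an exact phrase query for "fox box" finds the two terms adjacent only when increments are disabled; with increments enabled they are three positions apart, so the same query needs extra slop to match. That is the kind of silent change in results a user could see after switching the version.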

bq. I'm still all for allowing Version per component for experts use. But man, 
I wouldn't want to be in the boat, managing all my components as they mimic 
various bugs/bad behavior for various components.

But if the example configs only show a global setting that isn't directly 
linked to any of the individual object configurations, then "normal" users 
won't have any idea what could have/use individual luceneMatchVersion settings 
anyway (even if they wanted to manage it piecemeal).

Like i said: i've come around to the idea of having/advocating a global value.  
Once i got past my mistaken thinking of Version as controlling "alternate 
versions" (as Miller very clearly put it) I started to understand what you are 
all saying and i agree with you: a single global value is a good idea.

My concern is just how to document things so that people don't get confused 
when they do need to change it.



[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-20 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12802979#action_12802979
 ] 

Robert Muir commented on SOLR-1677:
---

bq. The point I was trying to make is that the types of bug fixes we make in 
Lucene are no mathematical absolutes - we're not fixing bugs where 1+1=3.

You are wrong, they are absolutes.
And here are the JIRA issues for stemming bugs, since you didnt take my hint to 
go and actually read them.

LUCENE-2055: I used the snowball tests against these stemmers which claim to 
implement 'snowball algorithm', and they fail. This is an absolute, and the fix 
is to instead use snowball.
LUCENE-2203: I used the snowball tests against these stemmers and they failed. 
Here is Martin Porter's confirmation that these are bugs: 
http://article.gmane.org/gmane.comp.search.snowball/1139

Perhaps you should come up with a better example than stemming, as you don't 
know what you are talking about.  




[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-19 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12802614#action_12802614
 ] 

Mark Miller commented on SOLR-1677:
---

If you are thinking of VERSION as alternate versions, I can see your point.

But I can't imagine that's what VERSION is for.

{quote} everyone else seems to have a very fixed view that these Version based 
changes are genuine improvements/bug-fixes, w/o any expectation that clients 
might/could subjectively decide "i want the old behavior" and that older 
Versions are supported purely for back-compatibility. {quote}

I don't think Version is meant to be used so that users can choose how things 
operate - personally I do see it as purely a way to get bad behavior for back 
compatibility. If that's not the case, we should not use Version in Lucene, we 
should make a Class2. Then you pick which you want. To me, Version is for 
fixing bugs or things that are clearly not the right way of doing things. Not a 
choice list. If more than one choice makes sense, that should be done without 
Version. Personally that's all that makes sense to me. Perhaps it will be 
abused, but personally I'd push back. Version is not a functionality selector - 
it's a way to handle back compat for bugs and clear improvements - stuff we plan 
and hope to drop into a big black hole forever. Not options that make sense 
and we plan to keep around for users to mull over.

I'm also not that worried that users won't know what changed - they will just 
know that they are in the same boat as those downloading Lucene latest greatest 
for the first time. Likely the best boat to be in when it comes to this stuff. 
If they want to manage things piecemeal, I'm still all for allowing Version 
per component for experts use. But man, I wouldn't want to be in the boat, 
managing all my components as they mimic various bugs/bad behavior for various 
components.

When I download the latest Solr and do a fresh install, I want it to have all 
of the latest Lucene bugs fixed (not the case currently). When I have an old 
install, I want to be able to change one setting and reindex to get all known 
bugs fixed (currently not the case - heck, it's not even possible to run Solr 
currently with all the known Lucene bugs fixed).




[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-18 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12802020#action_12802020
 ] 

Mark Miller commented on SOLR-1677:
---

In my opinion this should be real simple. Having to specify a Lucene version 
for each component is not simple - it's beyond most users. I think it's beyond me 
(laugh as you see fit). Having to accept Lucene 2.4 behavior by default because 
of Solr back compat issues is also weak. A new user should get all the bug 
fixes of the latest Lucene with minimal effort. Hopefully no effort. Older 
users should be able to get the newest with minimal effort as well - not having 
to go one by one through each component and upgrading it. I can't imagine 
juggling all these versions for each component - that's ugly enough in Lucene - 
it shouldn't infect Solr for the average case.

Personally, I do think there should be a global default. And I think right next 
to it, it should say, if you change this, you must reindex. No worries about 
action at a distance. The action is to get the latest and greatest Lucene has 
to offer rather than older buggy or back compat behavior. Reindex, get latest 
greatest. Don't reindex and you're on your own. Solr might rip your head off.

We should also offer per component for real experts, but I wouldn't be meddling 
that way myself unless in a bind. Solr should be real simple about this - and 
the latest Solr should use the latest bug fixes from Lucene, with previous 
configs out there defaulting to 2.4 compatibility.

I abbreviated the heck out of my arguments and thinking, but damn it that's what 
I think :)




[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-14 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12800471#action_12800471
 ] 

Hoss Man commented on SOLR-1677:


bq. And I also can't see anyone really spending time to aggressively ensure 
that the example schema etc is all up to date

I think you are vastly underestimating how much work is spent reviewing the 
example schema.xml prior to releases.  It would be trivial to search/replace 
luceneMatchVersion=X with luceneMatchVersion=Y anytime the current 
version of Version was updated in Lucene-Java

bq. the hardcoded 2.4 behavior is the action at a distance, because if i do not 
specify Version in my configuration file, then i get this very old behavior.

I don't follow you at all -- you have identified no action, or distance in your 
example.

When i say i'm worried about scary action at a distance, i'm talking about 
editing some thing A in a config file, and having it result in changed behavior 
(action) in things B, C and D that do not directly refer to A in any way 
(distance).  Further more these changes in behavior are silent (thus scary).

If I have {{<fieldType name="A"/>}} and much later in the config {{<field 
name="B" type="A"/>}} then editing A results in an action on B at a distance -- 
but this should not surprise me at all because B explicitly references A.

Having a global {{<luceneMatchVersion/>}} tag that affects the behavior of a 
variety of different things when it's modified leads to situations where people 
might change that value, triggering changes in many components w/o a clear idea 
of what might have changed -- so they don't even know what things they should 
focus on testing for correctness after making that change.

The existing {{<schema version="X"/>}} property also leads to "action at a 
distance" type situations -- but that is a lot less scary to me because at least 
with it there is a uniform set of changes to *all* schema objects between any 
two versions, so it's easy to document what changes when you go from 1.1 to 
1.2, or 1.2 to 1.3 ... but with luceneMatchVersion the potential changes are 
unique to every individual Class that cares about it.

{quote}
If this is really your concern, then i have an alternative i propose.

* No default anywhere, not even in the code
* Version is mandatory if the thing requires it
{quote}

This is something Uwe and i both discussed in previous comments...

https://issues.apache.org/jira/browse/SOLR-1677?focusedCommentId=12796872#action_12796872
https://issues.apache.org/jira/browse/SOLR-1677?focusedCommentId=12796937#action_12796937

...as i said: i'm fine with this idea in theory -- as a long term plan -- but 
there has to be a gradual migration process for people. ie: it can be required 
on certain objects in a future release, but for at least the next release it 
needs to be possible to not specify the luceneMatchVersion on all of these 
objects, and when people use them w/o specifying it, they can log big fat 
warnings on init that it is defaulting to 2.4, and that they should set the 
property explicitly if that's what they want.
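
The migration behavior described above could be sketched roughly as follows. This is only an illustration, not code from the patch: MatchVersion is a stand-in for org.apache.lucene.util.Version so the sketch compiles without Lucene, and VersionResolver and its warning text are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for org.apache.lucene.util.Version (assumption, keeps the sketch self-contained).
enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30 }

// Hypothetical helper: resolves the version for one component and records a
// loud warning whenever it has to fall back to the legacy default.
final class VersionResolver {
  final List<String> warnings = new ArrayList<>();

  MatchVersion resolve(String componentName, String configured) {
    if (configured == null || configured.trim().isEmpty()) {
      warnings.add("WARNING: " + componentName
          + " has no luceneMatchVersion; defaulting to LUCENE_24."
          + " Set the property explicitly if that is what you want.");
      return MatchVersion.LUCENE_24;
    }
    // Enum.valueOf throws IllegalArgumentException for unknown names.
    return MatchVersion.valueOf(configured.trim());
  }
}
```

A real implementation would presumably log through Solr's logger at init time instead of collecting strings; the list is used here only so the behavior is easy to check.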



bq. I still do not want it in schema.xml, as Version is a global Lucene thing!

Uwe: I think you are misunderstanding the reason for a distinction between 
solrconfig.xml and schema.xml in Solr.  If (for the sake of argument) 
luceneMatchVersion really should be a global Lucene thing then that is 
precisely why it should be in schema.xml.

schema.xml is for configuration that is inherently part of the index, and must 
be consistent regardless of who/how/why that index is being used.  
solrconfig.xml is where settings are put that are specific to how a 
particular instance of an index is being used.   If a setting is in 
solrconfig.xml, then it should be possible for that setting to be completely 
different on different solr instances that use the exact same schema.xml -- 
even if they use cloned copies of the same index directory. (ie: master/slave 
distinctions in replication; peer slaves with distinct handler/cache settings 
to serve distinct use cases; etc...)

That's the reason why nothing that hangs off of IndexSchema is currently 
allowed to be SolrCoreAware, or get access to the SolrConfig object (and the 
SolrResourceLoader abstraction was created) ... nothing about the SolrCore 
instance should be allowed to influence the resulting index, because that 
index may later be used on a different instance with a different config.

As i mentioned before: solrconfig.xml can depend on schema.xml, but schema.xml 
can not depend on solrconfig.xml

So if a global luceneMatchVersion can affect the behavior of an analyzer or 
FieldType in a way that is persisted as part of the index -- and other 
classes (like QueryParser in Robert's example) need to make sure to use the 
same luceneMatchVersion to behave correctly with that index, then that setting 
needs to be in the schema.xml so it is 

[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-14 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12800484#action_12800484
 ] 

Marvin Humphrey commented on SOLR-1677:
---

> I'd still like to clarify this whole issue of whether Lucene-Java, as a project,
> has an expectation that client applications will always use a consistent value
> for Version when constructing objects that interact with an index

Yes. The whole point is to avoid Analyzer mismatches.

Say a stoplist was modified between Lucene versions. Sure, you can hack it
and ask for an old match version, so you get a stoplist other than the one that
was used to build the index... but why would you want to?

> Are there any threads/docs about the expectations of Version
> homo/hetero-geneousness in Lucene-Java?

The original thread from last May, I guess... which culminated in LUCENE-1684:

http://markmail.org/thread/egqe6rm4c4om7swv

It's very long, though.




[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-14 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12800514#action_12800514
 ] 

Hoss Man commented on SOLR-1677:


{quote}
Yes. The whole point is to avoid Analyzer mismatches.

Say a stoplist was modified between Lucene versions. Sure, you can hack it
and ask for an old match version, so you get a stoplist other than the one that
was used to build the index... but why would you want to?
{quote}

...but that's no different than using StopFilter(someStopWordSet) at indexing 
and StopFilter(someOtherStopWordSet) at query time -- Solr happily lets you do 
that with its index/query analyzers ... you may have a very good reason for 
doing that.  Likewise you may have an existing field using the default 
stopwords list from Version.LUCENE_24 that you don't want to change because you 
want clients that search on that field to continue to get the same behavior, 
but when you add a new field you want it to have the current default stopwords 
because it's queried by entirely different clients.

That's no different than saying i want PorterStemmer on fieldA and 
SnowBall2Stemmer on fieldB.

The implication i got from Robert was that there was (or would soon be) 
expectations in Lucene-Java code that if one object was told to use Version.X 
it would be assumed that every other object in the application was using 
Version.X.

To me that's the crux of the whole issue:  If that _is_ the expectation 
Lucene-Java has, then we _should_ have a single global config for 
luceneMatchVersion and not support per-object configuration.  If that _is not_ 
the expectation, then we _should not_ have a global luceneMatchVersion.




[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-14 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12800518#action_12800518
 ] 

Robert Muir commented on SOLR-1677:
---

bq. The implication i got from Robert was that there was (or would soon be) 
expectations in Lucene-Java code that if one object was told to use Version.X 
it would be assumed that every other object in the application was using 
Version.X.

Hoss, I didn't mean to imply any such thing, just that i don't see any tests 
(or the framework for testing such behavior), so even if it's officially 
supported, in my opinion it does not exist.

For example, as far as analysis goes, my personal opinion is that in any given 
package (say one language, or whatever), we will test the entire Analyzer 
against Version X, and will test that back compat works for Version Y, Z, etc. 

But i personally can't see myself ensuring that all the underlying tokenstreams 
(maybe this language uses 5, let's say) work across all the permutations of 
different versions { X, Y, Z } you can apply; it's simply asking too much.
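
To put a rough number on the example above: with 5 token streams and 3 versions, testing every mixed assignment means 3^5 = 243 configurations, against 3 when the whole Analyzer is tested with one Version at a time. A tiny sketch of that count (VersionMatrix and its method are illustrative names only):

```java
// Hypothetical helper that counts version assignments for an analysis chain.
final class VersionMatrix {
  private VersionMatrix() {}

  /** Ways to assign one of `versions` independently to each of `components`. */
  static long mixedCombinations(int components, int versions) {
    long total = 1;
    for (int i = 0; i < components; i++) {
      total *= versions;
    }
    return total;
  }

  public static void main(String[] args) {
    // 5 token streams, versions { X, Y, Z }
    System.out.println(mixedCombinations(5, 3)); // prints 243
  }
}
```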




[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-11 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798916#action_12798916
 ] 

Hoss Man commented on SOLR-1677:


bq. I don't think Version is intended so you can use X.Y on this part and Y.Z 
on this part and have any chance of anything working, for example it controls 
position increments on stopfilter but also in queryparser, if you use wacky 
combinations, things might not work.

How is that any different from letting users pass any Analyzer they want to the 
QueryParser constructor?  There's no guarantee that anything will ever work if 
you do something crazy (like uppercase all terms when indexing, and lowercase 
all terms when searching). But lucene exposes that to the developer and lets 
them make the choice -- likewise Solr happily lets you configure a query 
analyzer that's completely different from your index analyzer -- if that's what 
you want, that's what you get: being able to set different Version params 
should be no different.  If the QueryParser you are using says that version=X.Y 
will only work with StopFilter if it's version=X.Y as well that's fine -- but 
maybe you've solved that problem a completely different way with a completely 
alternate implementation of StopFilter (that doesn't care about version).  The 
user should be in control.

bq. sometimes things interact in ways we cannot detect automatically

which is why i think it's a bad idea to have a global default for this ... 
there may be situations where people explicitly want different behavior in 
different instances (ie: in this field i want the legacy 2.4 StopFilter 
behavior, but in this field i want the current 2.9 stop filter behavior) and 
having a default will mask the ability to do this, and make it easy to 
inadvertently break it.

bq. its my understanding that things like this are why Version was created in 
the first place.

My understanding is vastly different than yours ... All the discussions i 
remember about it were along the lines of preventing Class proliferation -- 
that people didn't like the idea of creating StandardAnalyzer2 just because 
StandardAnalyzer had some behavior that was considered buggy but couldn't be 
removed -- so now there is a constructor arg instead, and static constants that 
let you pick a fixed behavior, or a constant that lets you pick current no 
matter what it is -- so applications that always want the current recommended 
behavior can just upgrade a jar and get it.

But I don't remember any implication that it was expected that every object 
would have the same Version settings as every other object -- if that was the 
intention then shouldn't there be a standard interface for Versionable or 
VersionAware objects so they can test compatibility with one another (ie: 
QueryParser and Analyzers that might wrap StopFilter) ? ... or a {{public 
static void setCurrentOperatingVersion(Version)}} method in the Version class, 
instead of letting each constructor take in an independent value?
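
For illustration only, the kind of interface being asked about might look like the sketch below. No such interface is described anywhere in this thread as existing in Lucene; VersionAware, isCompatibleWith and the MatchVersion stand-in enum are all hypothetical names, and "compatible" is assumed here to mean "exactly the same version".

```java
// Stand-in for org.apache.lucene.util.Version (assumption, keeps the sketch self-contained).
enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30 }

// Hypothetical interface: lets two version-dependent objects
// (say a QueryParser and a StopFilter) compare their settings.
interface VersionAware {
  MatchVersion getMatchVersion();

  // Simplest possible rule: compatible only when the versions are identical.
  default boolean isCompatibleWith(VersionAware other) {
    return getMatchVersion() == other.getMatchVersion();
  }
}
```

With something like this, a QueryParser could at least detect that the Analyzer it was handed was built with a different Version instead of misbehaving silently.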



FWIW: Even though I'm still convinced that having any sort of global default 
value for luceneMatchVersion is a bad idea -- and i'm going to keep trying to 
convince other people as well -- I want to make some comments about how i think 
it should be implemented if we do wind up doing it (just in case i get hit by a 
bus)

Making the Base*Factory analysis classes SolrCoreAware is really overkill for 
this -- there was a real conscious choice not to let things declared in 
schema.xml be SolrCoreAware, because it pulls back the curtain and exposes a 
lot of plumbing related APIs in a way that could make it hard to refactor away 
SolrCore functionality later.  The list of plugin types that can be made 
SolrCoreAware is deliberately small, and confined to plugins that are already 
exposed to the full SolrCore API at some other time in their life cycle -- 
being SolrCoreAware just gives them access to the core during initialization.

If there is really going to be one uber-default global luceneMatchVersion 
then i think the place it makes the most sense to declare something like this 
is in the schema.xml -- many different solrconfig.xml files might be used with 
the same schema.xml, so if we're expecting that the typical behavior is to 
set this once and have it just work it should propagate from the IndexSchema 
object to the SolrCore and not vice-versa.

My suggestion for how to implement this would be...

# Add a new luceneMatchVersion attribute to the existing <schema/> tag.
# Add a new getLuceneMatchVersion() to the IndexSchema class ... SolrCore can 
use this to get the default.
# When init()ing new objects, include the key=value pair of 
{{luceneMatchVersion=schema.getLuceneMatchVersion()}} to the init method of 
the object if it's not already an init param for that particular instance.
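
Step 3 above could be sketched as follows. This is a minimal illustration under stated assumptions: init args are modeled as a plain Map<String,String>, and InitArgs / withDefaultVersion are hypothetical names, not Solr API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: injects the schema-wide default into a component's
// init args only when that component did not set luceneMatchVersion itself.
final class InitArgs {
  static final String KEY = "luceneMatchVersion";

  private InitArgs() {}

  /** Returns a copy of args; an explicit per-instance value always wins. */
  static Map<String, String> withDefaultVersion(Map<String, String> args, String schemaDefault) {
    Map<String, String> result = new HashMap<>(args);
    result.putIfAbsent(KEY, schemaDefault);
    return result;
  }
}
```

Because the default only arrives through the init params, a factory never needs to know where it came from; it just reads luceneMatchVersion in its init() method as it would any other parameter.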

This would eliminate the need to make any of the Analysis Factories 

[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798921#action_12798921
 ] 

Robert Muir commented on SOLR-1677:
---

{quote}
which is why i think it's a bad idea to have a global default for this ... 
there may be situations where people explicitly want different behavior in 
different instances (ie: in this field i want the legacy 2.4 StopFilter 
behavior, but in this field i want the current 2.9 stop filter behavior) and 
having a default will mask the ability to do this, and make it easy to 
inadvertently break it.
{quote}

but this patch does argue for a global default, which is 2.4; it's just 
hardcoded inside the java code.

bq. The user should be in control.

You argue against yourself when you say this, but prevent the user from 
changing this hardcoded 2.4 default.






[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-11 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798930#action_12798930
 ] 

Hoss Man commented on SOLR-1677:


bq. You argue against yourself when you say this, but prevent the user from 
changing this hardcoded 2.4 default.

WTF?!?! ... now i feel like you are just messing with my head.

I've never argued that the user shouldn't be allowed to change the behavior of 
any class away from the (hardcoded) 2.4 behavior -- i've tried to be very clear 
that my objection was only to the new global default setting that would have 
action at a distance for all of these Version dependent classes w/o any obvious 
indication of what it was affecting.

To be as clear as i possibly know how: I am completely in favor of this new 
syntax added by Uwe's patch...

{code:title=src/test/test-files/solr/conf/schema-luceneMatchVersion.xml}
  <fieldtype name="text20" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory" luceneMatchVersion="LUCENE_20"/>
      <filter class="solr.StandardFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" luceneMatchVersion="LUCENE_24"/>
      <filter class="solr.EnglishPorterFilterFactory"/>
    </analyzer>
  </fieldtype>
{code}

...and this is the *only* new syntax added by Uwe's patch that i am opposed 
to...

{code:title=src/test/test-files/solr/conf/solrconfig.xml}
<luceneMatchVersion>LUCENE_29</luceneMatchVersion>
{code}




[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798936#action_12798936
 ] 

Robert Muir commented on SOLR-1677:
---

bq. WTF?!?! ... now i feel like you are just messing with my head.

I am really not trying to. I guess we have just put in some recent work that 
only takes effect with Version = something recent, and it would be a shame if 
it were never used because we made this too difficult, and it simply falls back 
on 2.4 and works without this parameter so no one bothers.

And I also can't see anyone really spending time to aggressively ensure that 
the example schema etc is all up to date (personally i would try to help, it is 
difficult though with lucene and solr so out of sync)

{quote}
I've never argued that the user shouldn't be allowed to change the behavior of 
any class away from the (hardcoded) 2.4 behavior - i've tried to be very clear 
that my objection was only to the new global default setting that would have 
action at a distance for all of these Version dependent classes w/o any obvious 
indication of what it was affecting.
{quote}

the hardcoded 2.4 behavior is the action at a distance, because if i do not 
specify Version in my configuration file, then i get this very old behavior.

If this is really your concern, then i have an alternative i propose.
* No default anywhere, not even in the code
* Version is mandatory if the thing requires it





[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-11 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798937#action_12798937
 ] 

Uwe Schindler commented on SOLR-1677:
-

{quote}
My suggestion for how to implement this would be...

# Add a new luceneMatchVersion attribute to the existing <schema/> tag.
# Add a new getLuceneMatchVersion() to the IndexSchema class ... SolrCore can 
use this to get the default.
# When init()ing new objects, include the key=value pair of 
{{luceneMatchVersion=schema.getLuceneMatchVersion()}} to the init method of 
the object if it's not already an init param for that particular instance.

This would eliminate the need to make any of the Analysis Factories 
SolrCoreAware (or even ResourceLoaderAware) just to know what the 
luceneMatchVersion should be -- the Base*Factories could still contain a 
{{protected Version luceneMatchVersion}} set by the base init() method that 
subclasses could use as needed.

NOTE: This still doesn't solve the "Analyzers must have no-arg constructors" 
part of the issue -- but it doesn't make it worse.  We can make IndexSchema 
pass this.getLuceneMatchVersion() to any Analyzer with a single arg Version 
constructor fairly easily.  If/When we provide a more general mechanism for 
passing constructor args to Analyzers, any Version params could be defaulted 
just like with the factory init() methods.
{quote}

That was my proposal a few comments above. But: I still do not want it in 
schema.xml, as Version is a global Lucene thing! But the behaviour would be the 
same: The schema code can get the version from somewhere and pass it down to 
all schema components as you propose.

The "Analyzers must have no-arg ctor" part is easy: use reflection and look 
first for a ctor that takes a Version; if one exists, use it and pass the 
init/schema/config arg; if not, use the no-arg ctor. We already have this in 
Lucene's benchmark contrib since 3.0.
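
A minimal sketch of that reflection lookup is shown below. It is not the code from the patch or from the benchmark contrib: MatchVersion stands in for org.apache.lucene.util.Version so the example compiles without Lucene, and VersionedAnalyzer, LegacyAnalyzer and AnalyzerInstantiator are hypothetical names.

```java
// Stand-in for org.apache.lucene.util.Version (assumption, keeps the sketch self-contained).
enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30 }

// Hypothetical analyzer with a one-arg Version ctor.
class VersionedAnalyzer {
  final MatchVersion matchVersion;
  public VersionedAnalyzer(MatchVersion matchVersion) { this.matchVersion = matchVersion; }
}

// Hypothetical analyzer with only a no-arg ctor.
class LegacyAnalyzer {
  public LegacyAnalyzer() {}
}

final class AnalyzerInstantiator {
  private AnalyzerInstantiator() {}

  /** Prefers a public (MatchVersion) ctor and falls back to the public no-arg ctor. */
  static <T> T newInstance(Class<T> clazz, MatchVersion version) {
    try {
      try {
        return clazz.getConstructor(MatchVersion.class).newInstance(version);
      } catch (NoSuchMethodException e) {
        // No Version ctor on this class: use the no-arg one instead.
        return clazz.getConstructor().newInstance();
      }
    } catch (ReflectiveOperationException e) {
      // Neither ctor exists, or construction itself failed.
      throw new IllegalStateException("Cannot instantiate " + clazz.getName(), e);
    }
  }
}
```

The version passed in would come from wherever the default ends up being configured; the lookup itself does not depend on that choice.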




[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-05 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796854#action_12796854
 ] 

Hoss Man commented on SOLR-1677:


bq. User Carl isn't helpful, user Carl is an idiot.

Oh come on now ... that's not really a fair criticism of the example: there are 
plenty of legitimate ways to use (some) TokenFilters only at search time and I 
specifically structured my example to point out potential problems in cases 
just like that -- Carl was very clear that if you used FooTokenFilterFactory 
in an index analyzer you'll need to reindex.


But fine, I'll amend my example to do it your way...


{panel}
...
Bob Asks his question (see previous example)

User Carl is on vacation and never sees Bob's email

User Dwight helpfully replies...

bq. That was identified as a bug with FooTokenFilter that was fixed in Lucene 
3.1, but the default behavior was left as is for backcompatibility. If you 
change your <luceneAnalyzerVersionDefault/> value to 3.1 (or 3.2) you'll get 
the newer/better behavior - but you _must_ reindex all of your data after you 
make this change.

Bob makes the change to 3.2 that Dwight recommended, reindexes all of his data, 
and is happy to see now his queries work and everything seems fine.

What Bob doesn't realize (and what Dwight wasn't aware of) is that elsewhere in 
his schema.xml file, Bob is also using the YakTokenizerFactory on a different 
field (yakField), and the behavior of the YakTokenizer changed in Lucene 3.0.  
This change is generally considered better behavior than YakTokenizer had 
before, but in combination with another TokenFilter Bob is using on the 
yakField it causes behavior that is not what Bob wants.  Now some types of 
queries that use the yakField are failing, and *failing silently*.

{panel}

You could now argue that User Dwight is an idiot because he didn't warn Bob 
that other Analyzers/Tokenizers/TokenFilters might be affected.  But that just 
leads us to scenarios that reiterate my point that this type of global value 
is something that would be dangerous to ever change.

{panel}
...
Bob Asks his question (see previous examples)

User Carl has unsubscribed from the solr-user list (because a Bill Murray 
look-a-like hurt his feelings) and never sees Bob's email.

User Dwight is on vacation and never sees Bob's email.

User Ernest helpfully replies...

{quote}
That was identified as a bug with FooTokenFilter that was fixed in Lucene 3.1, 
but the default behavior was left as is for backcompatibility. If you change 
your <luceneAnalyzerVersionDefault/> value to 3.1 (or 3.2) you'll get the 
newer/better behavior -- *But this is Very VERY Dangerous: It could potentially 
affect the behavior of other analyzers you are using.  You need to check the 
javadocs for each and every Analyzer, Tokenizer, and TokenFilter you use to see 
what their behavior is with various values of the Version property before you 
make a change like this.

Personally I never change the value of <luceneAnalyzerVersionDefault/> once i 
have an existing schema.xml file.  Instead i suggest you add 
{{luceneVersion="3.2"}} to your {{<filter class="solr.FooTokenFilterFactory" 
/>}} declaration so that you know you are only changing the behavior you want 
to change.

BTW: You _must_ reindex all of your data after doing either of these things in 
order for it to work.
{quote}

Bob follows Ernest's advice, and everything is fine ... but Bob is left 
wondering what the point is of a config option that's so dangerous to change, 
and wishes there was an easy way to know which of his Analyzers and Factories 
are depending on that scary global value.

{panel}

At the end of the day it just seems like a bigger risk than a feature ... I 
feel like i must still be misunderstanding the motivation you guys have for 
adding it, because it really seems like it boils down to being easier than 
having the property 2.9 set on every analyzer/factory.

I guess i ultimately have no stringent objection to a global schema.xml setting 
like this existing as an expert level feature (for people who want really 
compact config files i guess), I just don't want to see it used in the example 
schema.xml file(s) where it's likely to screw novice users over.




[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796862#action_12796862
 ] 

Robert Muir commented on SOLR-1677:
---

bq. Oh come on now ... that's not really a fair criticism of the example: there 
are plenty of legitimate ways to use (some) TokenFilters only at search time 
and I specifically structured my example to point out potential problems in 
cases just like that - Carl was very clear that if you used 
FooTokenFilterFactory in an index analyzer you'll need to reindex.

I disagree, Version applies to all of lucene (even more than tokenstreams), so 
for Carl to imply that you don't need to reindex by bumping Version simply 
because you aren't using X or Y or Z, for that he should be renamed Oscar.

bq. You could now argue that User Dwight is an idiot because he didn't warn Bob 
that other Analyzers/Tokenizers/TokenFilters might be affected. But that just 
leads us to scenarios that reiterate my point that this type of global value 
is something that would be dangerous to ever change

Yeah, I guess I don't think he is an idiot. I just think he is a moron for 
suggesting such a thing without warning of the consequences.

bq. Personally I never change the value of <luceneAnalyzerVersionDefault/> once 
i have an existing schema.xml file. Instead i suggest you add 
luceneVersion="3.2" to your <filter class="solr.FooTokenFilterFactory" /> 
declaration so that you know you are only changing the behavior you want to 
change.

Good for Ernest, i guess he is probably still using Windows 3.1 too, because he 
doesn't want to upgrade ever. Unless Ernest also carefully reads Lucene CHANGES, 
reads all the Solr source code, and knows which solr features are tied to 
which lucene features, because it's not obvious at all: i.e. solr's snowball 
factory doesn't use lucene's snowball, etc etc.

bq. At the end of the day it just seems like a bigger risk than a feature ... I 
feel like i must still be misunderstanding the motivation you guys have for 
adding it, because it really seems like it boils down to being easier than 
having the property 2.9 set on every analyzer/factory

Yes you are right, personally I don't want all users to be stuck with 
Version.LUCENE_24 forever. 


 Add support for o.a.lucene.util.Version for BaseTokenizerFactory and 
 BaseTokenFilterFactory
 ---

 Key: SOLR-1677
 URL: https://issues.apache.org/jira/browse/SOLR-1677
 Project: Solr
  Issue Type: Sub-task
  Components: Schema and Analysis
Reporter: Uwe Schindler
 Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, 
 SOLR-1677.patch


 Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards 
 compatibility with old indexes created using older versions of Lucene. The 
 most important example is StandardTokenizer, which changed its behaviour with 
 posIncr and incorrect host token types in 2.4 and also in 2.9.
 In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with 
 much more Unicode support, almost every Tokenizer/TokenFilter needs this 
 Version parameter. In 2.9, the deprecated old ctors without Version take 
 LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.
 This patch adds basic support for the Lucene Version property to the base 
 factories. Subclasses then can use the luceneMatchVersion decoded enum (in 
 3.0) / Parameter (in 2.9) for constructing Tokenstreams. The code currently 
 contains a helper map to decode the version strings, but in 3.0 is can be 
 replaced by Version.valueOf(String), as the Version is a subclass of Java5 
 enums. The default value is Version.LUCENE_24 (as this is the default for the 
 no-version ctors in Lucene).
 This patch also removes unneeded conversions to CharArraySet from 
 StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed 
 to match Lucene 3.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-05 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796872#action_12796872
 ] 

Uwe Schindler commented on SOLR-1677:
-

In my opinion, it should be possible to set the default in solrconfig.xml, 
because there is currently no requirement to set a version for all TS 
components. In the shipped solrconfig.xml, this default is the version of the 
shipped lucene, so new users can use the default config and extend it as taught 
in all courses and books about solr. They do not need to care about the version. 

If they upgrade their lucene version, their config stays on the previous 
setting and they are fine. If they want to change some of the components (like 
query parser, index writer, index reader -- flex!!!), they can do it locally. 
So Bob could make the change the way Ernest proposed.

If we do not have a default, all users will stay stuck with lucene 2.4, because 
they do not care about version (it is not required, because it defaults to 2.4 
for BW compatibility). So lots of configs will never use the new unicode 
features of Lucene 3.1. And suddenly Lucene 4.0 comes out and all support for 
Lucene < 3 is removed, then all users cry. With a default version set to 2.4, 
they will then get a runtime error in Lucene 4.0, saying that Version.LUCENE_24 
is no longer available as enum constant.

If you really do not want to have a default version in config (not schema, 
because it applies to *all* lucene components), then you should go the same way 
as Lucene 3.0: Require a matchVersion for all components. As there may be 
tokenstream components not from lucene, make this attribute in the schema 
mandatory only for lucene-streams. This can be done by my initial patch, too: 
if the matchVersion property is missing then the matchVersion will be NULL and 
the factory should throw an IAE if it is required. In my original patch, only 
the parsing code would need to be moved out of the factory into a util class in 
solr. It may also be possible to parse x.y-style versions.
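
A minimal sketch of the parsing helper described above, under stated assumptions: `MatchVersion` is a hypothetical stand-in enum (the real type is o.a.lucene.util.Version), and the names `MatchVersionParser`, `parse` and `require` are invented for illustration, not taken from the patch. It accepts both the enum name and the x.y form, returns null when the attribute is missing, and lets a factory that needs the value throw an IllegalArgumentException.

```java
import java.util.Locale;

/** Hypothetical stand-in for org.apache.lucene.util.Version; constants are illustrative only. */
enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30, LUCENE_31 }

/** Hypothetical util class; not part of Solr or of the attached patch. */
final class MatchVersionParser {
    private MatchVersionParser() {}

    /**
     * Accepts either the enum name ("LUCENE_29") or the x.y form ("2.9").
     * Returns null when the attribute is absent so the caller can decide
     * whether a missing value is an error.
     */
    static MatchVersion parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return null;
        }
        String s = raw.trim().toUpperCase(Locale.ROOT);
        if (s.matches("\\d+\\.\\d+")) {
            // "2.9" becomes "LUCENE_29"
            s = "LUCENE_" + s.replace(".", "");
        }
        try {
            return MatchVersion.valueOf(s);
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("Unknown luceneMatchVersion: '" + raw + "'", e);
        }
    }

    /** For factories wrapping Lucene components: a missing version is a config error. */
    static MatchVersion require(String raw, String factoryName) {
        MatchVersion v = parse(raw);
        if (v == null) {
            throw new IllegalArgumentException(
                factoryName + " requires the luceneMatchVersion attribute");
        }
        return v;
    }
}
```

Factories for non-Lucene tokenstream components would call `parse` and tolerate null; only factories that really need a Version would call `require`.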

The problem here: Users upgrading from solr 1.4 will suddenly get errors, 
because their configs become invalid.




[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-05 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796937#action_12796937
 ] 

Hoss Man commented on SOLR-1677:




bq. Version applies to all of lucene (even more than tokenstreams), so for Carl 
to imply that you don't need to reindex by bumping Version simply because you 
aren't using X or Y or Z, for that he should be renamed Oscar.

Ok, fair enough ... i was supposing in that example that since i called it 
{{<luceneAnalyzerVersionDefault/>}} it was clearly specific to analysis objects 
in schema.xml and didn't affect any of the other things Version is used for 
(which would be specified in solrconfig.xml).

bq. i guess he is probably using Windows 3.1 still too because he doesn't want 
to upgrade ever.

No, he uses an OS where he can upgrade individual things individually with clear 
implications -- he sets {{luceneMatchVersion="2.9"}} on each and every 
{{<analyzer/>}}, {{<tokenizer/>}} and {{<filter/>}} that he declares in his 
schema so that he knows exactly what behavior is changing when he modifies any 
of them.

bq. personally I don't want all users to be stuck with Version.LUCENE_24 
forever. 

I still must be missing something? ... why would all users be stuck with 
Version.LUCENE_24 forever?   

I'm not advocating that we don't allow a way to specify Version, i'm saying 
that having a global value for it that affects things opaquely sounds dangerous 
-- we should certainly have a way for people to specify the Version they want 
on each of the objects that care, but it shouldn't be global.  The 
luceneMatchVersion property that Uwe added to BaseTokenizerFactory and 
BaseTokenFilterFactory in his patch seems perfect to me, it's just the 
{{SolrCoreAware}} / {{core.getSolrConfig().luceneMatchVersion}} that i think is 
a bad idea.

If we modify the <analyzer/> initialization to allow constructor args as Erik 
suggested (I'm pretty sure there's already code in Solr to do this, we just 
aren't using it for Analyzers) then we should be good to go for everything in 
schema.xml.

If anything declared in solrconfig.xml starts caring about Version (QParser, 
SolrIndexWriter, etc...) then likewise it should get a luceneMatchVersion 
init property as well.  No one will ever be stuck with LUCENE_24, but they 
won't be surprised by behavior changes either.

bq. If we do not have a default, all users will keep stuck with lucene 2.4, 
because they do not care about version (it is not required, because it defaults 
to 2.4 for BW compatibility). So lots of configs will never use the new unicode 
features of Lucene 3.1.

I don't believe that.  Almost every solr user on the planet starts with the 
example configs.  If the example configs start specifying 
luceneMatchVersion="2.9" on every analyzer and factory then people will care 
about Version just as much as they care about the stopwords.txt file that ships 
with solr -- that may be not at all, or it may be a lot, but it will be up to 
them, and it will be obvious to them, because it's right there in the 
declaration where they can see it, and easy for them to reference and recognize 
that changing that value will affect things.

bq. If you really do not want to have a default version in config (not schema, 
because it applies to all lucene components), then you should go the way like 
Lucene 3.0: Require a matchVersion for all components.

I'm totally on board with that idea in the long run -- but there are ways to 
get there gradually that are back compatible with existing configs.  Individual 
factories that care about luceneMatchVersion should absolutely start warning on 
startup that newer/better behavior may be available if luceneMatchVersion is 
unset (or doesn't match the current value of Version.LUCENE_CURRENT), and 
provide a URL for a wiki page somewhere where more detail is available.  The 
Analyzer init code can do likewise if it sees an 
{{<analyzer class="..."/>}} being inited w/ a constructor that takes in a 
Version which is using an old value.
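
As a rough sketch of that startup check (all names here are hypothetical, including `MatchVersionAudit` and the wiki URL passed in by the caller; this is not code from Solr or from the patch), the logic could look roughly like this: one warning per component whose luceneMatchVersion is unset or older than the newest version available.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; illustrates the warn-on-startup idea only. */
final class MatchVersionAudit {
    private MatchVersionAudit() {}

    /**
     * configured maps a component name to its luceneMatchVersion string,
     * or to null when the attribute was not set in the config.
     * latest is the newest version string the running Lucene supports.
     */
    static List<String> warnings(Map<String, String> configured, String latest, String docsUrl) {
        List<String> out = new ArrayList<String>();
        for (Map.Entry<String, String> e : configured.entrySet()) {
            String v = e.getValue();
            if (v == null) {
                out.add(e.getKey() + ": luceneMatchVersion is not set, legacy behavior is in effect."
                        + " See " + docsUrl);
            } else if (!v.equals(latest)) {
                out.add(e.getKey() + ": luceneMatchVersion=" + v + " but " + latest
                        + " is available. See " + docsUrl);
            }
        }
        return out;
    }
}
```

A real implementation would log these instead of returning them; returning a list just keeps the sketch easy to test.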



[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796965#action_12796965
 ] 

Robert Muir commented on SOLR-1677:
---

{quote}
No, he uses an OS where he can upgrade individual things individually with clear 
implications - he sets luceneMatchVersion="2.9" on each and every <analyzer/>, 
<tokenizer/> and <filter/> that he declares in his schema so that he knows 
exactly what behavior is changing when he modifies any of them.
{quote}

Yeah, but this isn't how Version works in lucene either, please see below.

{quote}
I'm not advocating that we don't allow a way to specify Version, i'm saying 
that having a global value for it that affects things opaquely sounds dangerous 
- we should certainly have a way for people to specify the Version they want on 
each of the objects that care, but it shouldn't be global. The 
luceneMatchVersion property that Uwe added to BaseTokenizerFactory and 
BaseTokenFilterFactory in his patch seems perfect to me, it's just the 
SolrCoreAware / core.getSolrConfig().luceneMatchVersion that i think is a bad 
idea.
{quote}

And I disagree, I think that the per-tokenfilter matchVersion should be the 
expert use, with the default global Version being the standard use. 

I don't think Version is intended to let you use X.Y on this part and Y.Z on 
that part and have any chance of anything working. For example, it controls 
position increments in stopfilter but also in queryparser; if you use wacky 
combinations, things might not work.
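
To make the stopfilter / queryparser interaction concrete, here is a toy model, not Lucene code: `PositionIncrementDemo`, its stop word list and its exact-position phrase check are all assumptions made for illustration. It only shows why the index side and the query side have to agree on whether a removed stop word leaves a position gap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy model of stop word removal with and without position gaps. */
final class PositionIncrementDemo {
    private static final Set<String> STOP = new HashSet<String>(Arrays.asList("in", "the"));

    private PositionIncrementDemo() {}

    /**
     * Returns "term@position" entries for the non-stop words in text.
     * keepGaps=true: a removed stop word still advances the position.
     * keepGaps=false: the gap is closed, as if the stop word never existed.
     */
    static List<String> analyze(String text, boolean keepGaps) {
        List<String> out = new ArrayList<String>();
        int pos = 0;
        for (String tok : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (STOP.contains(tok)) {
                if (keepGaps) {
                    pos++;
                }
                continue;
            }
            out.add(tok + "@" + pos);
            pos++;
        }
        return out;
    }

    /** Simplified phrase check: terms and positions must line up exactly. */
    static boolean phraseMatches(String doc, boolean indexKeepsGaps,
                                 String query, boolean queryKeepsGaps) {
        return analyze(doc, indexKeepsGaps).equals(analyze(query, queryKeepsGaps));
    }
}
```

With the same setting on both sides the phrase "fox in box" matches itself; with gaps kept at index time but closed at query time, the positions differ (box at 2 versus box at 1) and the match silently fails.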

And I personally don't see anyone putting effort into supporting this either, 
because it's enough to supply the back compat for previous versions, but not 
some cross product of all possible versions. This is too much. Sometimes things 
interact in ways we cannot detect automatically (such as the query parser 
phrasequery / stopfilter thing); it's my understanding that things like this are 
why Version was created in the first place.





[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-04 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796136#action_12796136
 ] 

Robert Muir commented on SOLR-1677:
---

{quote}
User Carl helpfully replies...

That was identified as a bug with FooTokenFilter that was fixed in Lucene 
3.1, but the default behavior was left as is for backcompatibility. If you 
change your <luceneAnalyzerVersionDefault/> value to 3.1 (or 3.2) you'll get 
the newer/better behavior - but if you used FooTokenFilterFactory in an index 
analyzer you'll need to reindex.
{quote}

User Carl isn't helpful, user Carl is an idiot.

The javadoc of Version in lucene clearly says:
{noformat}
 * <p><b>WARNING</b>: When changing the version parameter
 * that you supply to components in Lucene, do not simply
 * change the version at search-time, but instead also adjust
 * your indexing code to match, and re-index.
{noformat}

User Carl could also tell Bob that it's ok to index with ArabicAnalyzer and 
query with ChineseAnalyzer. This kind of stupid theoretical situation isn't any 
kind of valid logical argument against having a default value for this.





[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-03 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796087#action_12796087
 ] 

Hoss Man commented on SOLR-1677:



bq. The problem is the default value. If you leave out the version parameter 
instance-wise, you will get 2.4. And because of that all solr users will get 
stuck with that version and will never upgrade (because they leave the default 
and do not specify a different value).

That feels like a misleading statement ... the Version property on these 
objects is really more about getting the recommended behavior as of a 
particular version of Lucene ... saying that users will be stuck with that 
version is like saying users will be stuck with StandardAnalyzer instead of 
getting NewHotnessAnalyzer because they have to edit their config to use the 
newer/better analyzer -- Lucene-Java has opted to use a Version property on 
existing classes instead of adding new classes, but it's still conceptually the 
same thing: they get the behavior they've always gotten, unless they change 
their config to get something different.

Besides which: 99.9% of Solr users copy the example config when they first 
start using Solr: we can set a version property on every Analyzer/Factory 
used in the example schema.xml and update them all when we upgrade the Lucene 
jars just as easily as we can update a single global value (it's a 
search+replaceAll instead of a search+replace)


bq. Why are you so against a default value? 

My concern is that it introduces action at a distance -- and not in a good way.

Here's the scenario that seems guaranteed to happen quite a bit if we add some 
new {{<luceneAnalyzerVersionDefault/>}} syntax to schema.xml...

{panel}

{{<luceneAnalyzerVersionDefault>2.9</luceneAnalyzerVersionDefault>}} is added 
to the example schema.xml, and users start using it as a result of 
copying/modifying the example configs.  Time passes, new bugs are fixed, and 
the example configs evolve to contain 
{{<luceneAnalyzerVersionDefault>3.4</luceneAnalyzerVersionDefault>}}.

A little while after that, User Bob emails solr-user with a question like...

{quote}
Hey, I'm using FooTokenFilterFactory and i noticed that at query time i see 
behaviorX when it really seems like i should see BehaviorY 
{quote}

User Carl helpfully replies...

{quote}
That was identified as a bug with FooTokenFilter that was fixed in Lucene 3.1, 
but the default behavior was left as is for backcompatibility.  If you change 
your {{<luceneAnalyzerVersionDefault/>}} value to 3.1 (or 3.2) you'll get the 
newer/better behavior -- but if you used FooTokenFilterFactory in an _index_ 
analyzer you'll need to reindex.
{quote}

Bob makes the change to 3.2 that Carl recommended, and is happy to see that his 
queries now work.  He only uses FooTokenFilterFactory at _query_ time, so he 
doesn't bother to reindex, and everything seems fine.

What Bob doesn't realize (and what Carl wasn't aware of) is that elsewhere in 
his schema.xml file, Bob is also using the YakTokenizerFactory on a different 
field (yakField), and the behavior of the YakTokenizer changed in Lucene 3.0. 
Now _some_ documents/queries that use yakField are failing -- and *failing 
silently.*

{panel}

Things just get a lot simpler when all of the configuration for an Analyzer, 
TokenizerFactory, or Tokenizer is explicit in its declaration -- indirect 
initialization is fine, as long as it's obvious.  Ie: <field/> declarations 
referencing fieldTypes by name -- It's easy to fuck up a bunch of fields by 
making a single change to one fieldType, but at least you can grep for the name 
of the fieldType to see all the fields you are affecting.  

Even if Carl knows/remembers to warn Bob that changing 
{{<luceneAnalyzerVersionDefault/>}} might change/break other things in his 
schema.xml the situation doesn't get much better: Unless Bob (or Carl) skims the 
code for every Analyzer, Tokenizer, and TokenFilter used in Bob's schema, they 
can't be sure what might get affected by making a small increase to the 
global luceneAnalyzerVersion setting ... which means the only safe thing for 
Bob to do is to set the property individually on the one place he really wants 
to make the change.

So why have the global in the first place?  It really just seems like more 
trouble than it's worth.


[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-01 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12795746#action_12795746
 ] 

Uwe Schindler commented on SOLR-1677:
-

The problem is the default value. If you leave out the version parameter 
instance-wise, you will get 2.4. And because of that all solr users will get 
stuck with that version and will never upgrade (because they leave the default 
and do not specify a different value). Because of backwards compatibility, we 
are limited to this version number as default value.

The schema/config global version is the global default used by all instances 
that do not specify a different value. That way we can ship the default 
solrconfig/schema.xml with the latest possible lucene version, but users 
upgrading will keep their default value.

I repeat: with instance-wise config, nobody will ever use it for new analyzers. 
With a global default, there is only *one* place that sets the version, which 
is also valid for user-added tokenizer chains.

For the SolrCore problem: For analyzers the idea is that the default Version 
constant is automatically passed to all tokenizers in the param map. 
Local values overwrite the key in the map. But this would only 
apply to the analyzers. Other usages of Version at other places (QP, IW) still 
need SolrCore. But we can move the SolrCoreAware to the schema classes and not 
make every TokenFilter/Tokenizer SolrCoreAware.
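
The param map idea can be sketched as follows. This is only an illustration under assumptions: `AnalyzerArgs` and `withDefault` are hypothetical names, and the key string simply reuses the luceneMatchVersion property name from this issue. The global default is filled in only when the factory's own attributes do not already carry the key, so a local value always wins.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper showing "global default, local override" for factory args. */
final class AnalyzerArgs {
    // Key name follows the luceneMatchVersion property discussed in this issue.
    static final String KEY = "luceneMatchVersion";

    private AnalyzerArgs() {}

    /**
     * Returns a copy of the factory's own attributes with the global default
     * added only when the factory did not set the key itself.
     * The input map is left unmodified.
     */
    static Map<String, String> withDefault(Map<String, String> localArgs, String globalDefault) {
        Map<String, String> merged = new HashMap<String, String>(localArgs);
        if (globalDefault != null && !merged.containsKey(KEY)) {
            merged.put(KEY, globalDefault);
        }
        return merged;
    }
}
```

Doing the merge in the schema loading code rather than in each factory is what would keep the individual TokenFilter/Tokenizer factories from needing SolrCoreAware.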




[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-01 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12795760#action_12795760
 ] 

Robert Muir commented on SOLR-1677:
---

bq. But as i said: i don't see any compelling need for a schema global 
Version anyway (let alone an instance wide global that applies to both 
solrconfig.xml and schema.xml)

Just as Uwe says, this is the problem with having no default.

If the default Version is going to be 2.4, I would like a global setting so 
that I get bugfixes and improvements, because a few things have happened to 
this code since 2.4.

I also do not want to list it 10,000 times, but it's not enough to make the 
default Version the latest to fix this problem.

I want my config to be wired to '2.9' or whatever, so that when upgrading, 
everything continues to work. Why are you so against a default value? 




[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2009-12-31 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12795728#action_12795728
 ] 

Hoss Man commented on SOLR-1677:


bq. Is that true? Many times so far, but Version is not limited to such things. 
It can be used for far more than how to read/write the index properly.

Perhaps, but that would be a very different usage. Even if Lucene-Java uses 
the same o.a.l.util.Version class for driving Analyzers/Tokenizers/TokenFilters 
and IndexWriters/MergeScheduler/QueryParser, those are very different 
things in Solr land: in a replication setup, two different instances might 
use very different Version values for the 
IndexWriter/MergeScheduler/QueryParser (configured in solrconfig.xml) but they 
should have identical schema.xml files and identical (versioned) analyzer 
settings.

But as i said: i don't see any compelling need for a schema global Version 
anyway (let alone an instance wide global that applies to both solrconfig.xml 
and schema.xml)






[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2009-12-24 Thread Erik Hatcher (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12794440#action_12794440
 ] 

Erik Hatcher commented on SOLR-1677:


Another comment on this... Solr also supports using an Analyzer directly, but 
only ones with zero-arg constructors.  It would be nice if this Version support 
also allowed Analyzers (say SmartChineseAnalyzer) to be used directly.  I 
don't think this patch accounts for this case, does it?




[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2009-12-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12794447#action_12794447
 ] 

Uwe Schindler commented on SOLR-1677:
-

Thanks for the hint. This means it can instantiate an analyzer via reflection 
and uses the zero-arg ctor, which is no longer available. So with Lucene 3.0 it 
will no longer work at all. As I do not have much experience with hacking Solr, 
I did not notice this.

In my own project I have the same mechanism: I do a reflection analysis of 
the loaded class and use the ctor that takes a Version or, if that is not 
available, the empty ctor.
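
The reflection mechanism described above could look roughly like the following sketch. It is not the code from the patch or from Uwe's project: Version, VersionedAnalyzer and LegacyAnalyzer are hypothetical stand-ins for the Lucene classes, and AnalyzerInstantiator.newInstance is an assumed helper name.

```java
final class AnalyzerInstantiator {
    /** Hypothetical stand-in for org.apache.lucene.util.Version. */
    enum Version { LUCENE_24, LUCENE_29, LUCENE_30 }

    /** Stand-in for an analyzer that only offers a one-arg Version ctor. */
    static class VersionedAnalyzer {
        final Version matchVersion;
        public VersionedAnalyzer(Version matchVersion) {
            this.matchVersion = matchVersion;
        }
    }

    /** Stand-in for an analyzer that only offers a no-arg ctor. */
    static class LegacyAnalyzer {
        public LegacyAnalyzer() {}
    }

    /**
     * Prefers a public ctor taking a Version and falls back to the public
     * no-arg ctor. Fails with IllegalStateException if neither is usable.
     */
    static <T> T newInstance(Class<T> clazz, Version matchVersion) {
        try {
            try {
                return clazz.getConstructor(Version.class).newInstance(matchVersion);
            } catch (NoSuchMethodException e) {
                // No Version ctor: try the zero-arg ctor instead.
                return clazz.getConstructor().newInstance();
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + clazz.getName(), e);
        }
    }
}
```

Calling newInstance(VersionedAnalyzer.class, Version.LUCENE_30) passes the version through, while newInstance(LegacyAnalyzer.class, Version.LUCENE_30) silently ignores it, which is the trade-off of this kind of fallback.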




[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2009-12-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793599#action_12793599
 ] 

Mark Miller commented on SOLR-1677:
---

bq. it should be in schema.xml, as it pertains to the index itself and how to 
read/write to the index properly and not to the particularities of how a 
particular solr installation might be using that data 

Is that true? Many times so far, but Version is not limited to such things. It 
can be used for far more than how to read/write the index properly.




[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2009-12-21 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793228#action_12793228
 ] 

Hoss Man commented on SOLR-1677:


{quote}
* As a first hack the solrConfig schema has a new element luceneMatchVersion 
that contains a solr-wide default luceneMatchVersion value that is used as 
default for QueryParser, Analyzers if not specified different
* On the analyzer side, BaseTokenizerFactory and BaseTokenFilterFactory now 
extend SolrCoreAware (and I also allowed these classes to be SolrCoreAware) and 
get the SolrConfig.
{quote}

I'd really prefer that nothing like this make it into solr.

One: we've worked pretty hard to make sure that nothing in the analysis code is 
SolrCoreAware -- the goal was to try and keep the schema related code reusable 
w/o risk of factories adding tendrils that reach deep into the other solr code 
(it's only a matter of time until someone starts refactoring all of the schema 
related code out of Solr and into a Lucene contrib).

If we really want to add a new global setting for the default match version, 
it should be in schema.xml, as it pertains to the index itself and how to 
read/write to the index properly and not to the particularities of how a 
particular solr installation might be using that data (schema.xml = the nature 
of the data; solrconfig.xml = the usage of the data).

Two: I really question the need for a configurable default across all analysis 
factories.  This seems like the type of thing that's going to be changed rarely 
if ever, and when it is changed each field will need to be considered very 
carefully to decide whether the new behavior is desired over the old.

I suspect the only time anyone is going to upgrade all factories at once is 
when we rev lucene jars and update the example configs -- in that case (and in 
the case of a user who is happy to blow away all of their data and take the 
newest, regardless of what it is, for every analyzer) a search and replace seems 
perfectly appropriate.





[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2009-12-20 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793022#action_12793022
 ] 

Robert Muir commented on SOLR-1677:
---

Hello Uwe, I would like to be able to specify the default, at some global 
level, for all tokenstreams.

for example, if i was setting up a new solr configuration, i would want to say 
'give me 3.1 support for all tokenstreams by default' ?




[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2009-12-20 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793047#action_12793047
 ] 

Uwe Schindler commented on SOLR-1677:
-

bq. for example, if i was setting up a new solr configuration, i would want to 
say 'give me 3.1 support for all tokenstreams by default' ?

I have no idea how to define global properties in schema.xml that apply to all 
factories. If this is possible, the LUCENE_24 else clause and the default value 
can be changed to the global default (which itself defaults to 
Version.LUCENE_24). In this case the parser map (for Lucene 2.9/Java 1.4) on 
the version enum should also move to a more central place.
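
As a rough illustration of that fallback chain (explicit per-factory value, then a global default, then LUCENE_24), here is a sketch under stated assumptions. The Version enum is a hypothetical stand-in for o.a.lucene.util.Version, MatchVersionResolver is an invented name, and using "luceneMatchVersion" as the key in the factory's argument map is an assumption rather than something this thread confirms.

```java
import java.util.Map;

final class MatchVersionResolver {
    /** Hypothetical stand-in for org.apache.lucene.util.Version. */
    enum Version { LUCENE_24, LUCENE_29, LUCENE_30 }

    /** Last-resort default, matching the no-version ctors in Lucene. */
    static final Version HARD_DEFAULT = Version.LUCENE_24;

    /**
     * Resolution order: the factory's own argument wins, then the global
     * default (may be null if none is configured), then LUCENE_24.
     */
    static Version resolve(Map<String, String> factoryArgs, Version globalDefault) {
        String explicit = factoryArgs.get("luceneMatchVersion"); // assumed key name
        if (explicit != null) {
            return Version.valueOf(explicit);
        }
        return globalDefault != null ? globalDefault : HARD_DEFAULT;
    }
}
```

With such a chain a configuration could be pinned once at the global level, while a single field type could still override the value for its own factories.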
