[
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802973#action_12802973
]
Hoss Man commented on SOLR-1677:
--------------------------------
bq. I think I am slightly offended with some of your statements about
'subjective opinion of the Lucene Community' and 'they should do relevancy
testing which use some language-specific stemmer whose behavior changed in a
small but significant way'.
That was not at all my intention, i'm sorry about that. I was in fact trying
to speak entirely in generalities and theoretical examples.
The point I was trying to make is that the types of bug fixes we make in Lucene
are not mathematical absolutes -- we're not fixing bugs where 1+1=3. Even if
everyone on java-dev and java-user agrees that behavior A is broken and
behavior B is correct, that is still (to me) a subjective opinion -- 1000 men's
trash may be one man's treasure, and there could be users out there who have
come to expect/rely on behavior A.
I tried to use a stemmer as an example because it's the type of class where
making behavior more correct (ie: making the stemming match the semantics of
the language more accurately) doesn't necessarily improve the perceived
behavior for all users -- someone could be very happy with the "sloppy
stemming" in the 3.1 version of a (hypothetical) EsperantoStemmer because it
gives him really "loose" matches. And if you (or anyone else) put in a lot of
hard work making that stemmer "better" by all conceivable metrics in 3.4, then
i've got no problem telling that person "Sorry dude, if you don't want those
fixes don't upgrade, or here are some other suggestions for getting 'loose'
matching on that field."
My concern is that there may be people who don't even realize they are
depending on behavior like this. Without an easy way for users to understand
what objects have improved/fixed behavior between luceneMatchVersion=X and
luceneMatchVersion=Y they won't know the full list of things they should be
considering/testing when they do change luceneMatchVersion.
bq. I'm also not that worried that users won't know what changed - they will
just know that they are in the same boat as those downloading Lucene latest
greatest for the first time.
But that's not true: a person downloading for the first time won't have any
preconceived expectations of how something will behave; that's a very
different boat from a person upgrading, who is going to expect things that were
working to keep working -- those things may have actually been bugs in earlier
versions, but if they _seemed_ to be working for their use cases, it's going to
feel like it's broken when the behavior changes. For a user who is consciously
upgrading i'm ok with that, but when there is no easy way of knowing what
behavior will change as a result of setting luceneMatchVersion=X that doesn't
feel fair to the user.
Robert mentioned in an earlier comment that StopFilter's position increment
behavior changes depending on the luceneMatchVersion -- what if an existing
Solr 1.3 user notices a bug in some Tokenizer, and adds
{{<luceneMatchVersion>3.0</luceneMatchVersion>}} to his schema.xml to fix it?
Without clear documentation on _everything_ that is affected when doing that, he
may not realize that StopFilter changed at all -- and even though the position
increment behavior may now be more correct, it might drastically change the
results he gets when using dismax with a particular qs or ps value. Hence my
point that this becomes a serious documentation concern: finding a way to make
it clear to users what they need to consider when modifying luceneMatchVersion.
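To make the StopFilter scenario concrete, here is a rough sketch of how one
version setting can silently shift token positions. It does not use the real
Lucene classes: MatchVersion, StopPositionSketch, and the 2.9 cutoff are all
stand-ins assumed for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class StopPositionSketch {
    // Hypothetical stand-in for o.a.lucene.util.Version; NOT the real class.
    enum MatchVersion {
        V_2_4, V_2_9, V_3_0;

        boolean onOrAfter(MatchVersion other) {
            return compareTo(other) >= 0;
        }
    }

    // Drops stop words and returns "term@position" for each surviving token.
    static List<String> filter(List<String> tokens, Set<String> stopWords, MatchVersion v) {
        // Assumption for this sketch: newer versions keep a position gap
        // where a stop word was removed, older versions close the gap.
        boolean keepGaps = v.onOrAfter(MatchVersion.V_2_9);
        List<String> out = new ArrayList<>();
        int position = -1;
        int pendingGap = 0;
        for (String token : tokens) {
            if (stopWords.contains(token)) {
                if (keepGaps) {
                    pendingGap++;
                }
                continue;
            }
            position += 1 + pendingGap;
            pendingGap = 0;
            out.add(token + "@" + position);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("the", "quick", "fox", "of", "doom");
        Set<String> stops = Set.of("the", "of");
        System.out.println(filter(tokens, stops, MatchVersion.V_2_4)); // [quick@0, fox@1, doom@2]
        System.out.println(filter(tokens, stops, MatchVersion.V_3_0)); // [quick@1, fox@2, doom@4]
    }
}
```

In the older mode "fox" and "doom" are adjacent, so an exact phrase query for
them would match; in the newer mode they are two positions apart, so the same
query would need slop to match. That is the kind of side effect a user fixing
an unrelated Tokenizer bug would not think to test for.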
bq. I'm still all for allowing Version per component for experts use. But man,
I wouldn't want to be in the boat, managing all my components as they mimic
various bugs/bad behavior for various components.
But if the example configs only show a global setting that isn't directly
"linked" to any of the individual object configurations, then normal users
won't have any idea what could have/use individual luceneMatchVersion settings
anyway (even if they wanted to manage it piecemeal).
Like i said: i've come around to the idea of having/advocating a global value.
Once i got past my mistaken thinking of "Version" as controlling "alternate
versions" (as Miller very clearly put it) I started to understand what you are
all saying and i agree with you: a single global value is a good idea.
My concern is just how to document things so that people don't get confused
when they do need to change it.
> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and
> BaseTokenFilterFactory
> -------------------------------------------------------------------------------------------
>
> Key: SOLR-1677
> URL: https://issues.apache.org/jira/browse/SOLR-1677
> Project: Solr
> Issue Type: Sub-task
> Components: Schema and Analysis
> Reporter: Uwe Schindler
> Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch,
> SOLR-1677.patch
>
>
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards
> compatibility with old indexes created using older versions of Lucene. The
> most important example is StandardTokenizer, which changed its behaviour with
> posIncr and incorrect host token types in 2.4 and also in 2.9.
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with
> much more Unicode support, almost every Tokenizer/TokenFilter needs this
> Version parameter. In 2.9, the deprecated old ctors without Version take
> LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base
> factories. Subclasses then can use the luceneMatchVersion decoded enum (in
> 3.0) / Parameter (in 2.9) for constructing Tokenstreams. The code currently
> contains a helper map to decode the version strings, but in 3.0 it can be
> replaced by Version.valueOf(String), as the Version is a subclass of Java5
> enums. The default value is Version.LUCENE_24 (as this is the default for the
> no-version ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from
> StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed
> to match Lucene 3.0.
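
The version decoding described above could look roughly like the following
sketch. It is not the code from the attached patch: MatchVersion stands in
for o.a.lucene.util.Version, and the accepted string forms ("LUCENE_29" and
"2.9") are assumptions made for illustration. Only the LUCENE_24 default is
taken from the description.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class VersionParseSketch {
    // Hypothetical stand-in for o.a.lucene.util.Version; NOT the real class.
    enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30 }

    // Helper map from accepted strings to constants.
    private static final Map<String, MatchVersion> BY_NAME = new HashMap<>();
    static {
        for (MatchVersion v : MatchVersion.values()) {
            BY_NAME.put(v.name(), v); // e.g. "LUCENE_29"
            String digits = v.name().substring("LUCENE_".length()); // e.g. "29"
            BY_NAME.put(digits.charAt(0) + "." + digits.substring(1), v); // e.g. "2.9"
        }
    }

    // Decodes a configured version string; a missing value falls back to
    // LUCENE_24, mirroring the default named in the issue description.
    static MatchVersion parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return MatchVersion.LUCENE_24;
        }
        MatchVersion v = BY_NAME.get(raw.trim().toUpperCase(Locale.ROOT));
        if (v == null) {
            throw new IllegalArgumentException("Unknown luceneMatchVersion: " + raw);
        }
        return v;
    }

    public static void main(String[] args) {
        System.out.println(parse("3.0"));       // LUCENE_30
        System.out.println(parse("lucene_29")); // LUCENE_29
        System.out.println(parse(null));        // LUCENE_24
    }
}
```

Where the version type is a Java enum, the map lookup for the constant-name
form can be replaced by the enum's own valueOf(String), as the description
notes; the dotted form would still need a mapping like the one above.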