[
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802609#action_12802609
]
Hoss Man commented on SOLR-1677:
--------------------------------
I'm definitely of two minds on this.
On the one hand...
Robert's clarification of his concerns convinces me that we don't need a global
setting. The issue of multiple related components in an analysis chain (ie:
EsperantoTokenizer, EsperantoStopFilter, and EsperantoStemmerFilter) not being
well tested in Lucene-Java when those components use different Version
properties doesn't seem like a compelling argument because we've never made any
claims that any combination of analysis components will work together. People
can easily construct Analyzers in their schema.xml that make no sense and
don't work at all; we'll never be able to solve that problem for everyone.
Worrying about people mismatching version numbers doesn't seem any different
than worrying about them using inconsistent stopword files between an index
analyzer and a query analyzer on the same field: buyer beware.
On the other hand...
I view the Version property of all these Lucene-Java classes as an
implementation detail of the generalized ideal of providing multiple solutions
for a similar problem that have subtly different behavior. To my mind: Adding
a version property to StandardTokenizer is just an alternate approach to
deprecating StandardTokenizer and providing a new StandardTokenizer2 where the
behavior is "improved" based on the subjective opinion of the Lucene community.
The Version property approach is easier to maintain in the Lucene source tree,
but still requires roughly the same amount of work on the part of client app
maintainers when upgrading: consider whether you think the "improved" behavior
is better for your application, and modify your code as needed. I've been
looking at how this should be supported in Solr with that perspective, putting
the schema.xml owner in the role of the client app maintainer.
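To make the "alternate implementations" view concrete, here is a minimal sketch of one component whose behavior is selected by a version argument. Everything in it is an assumption for illustration: `MatchVersion` is a hypothetical stand-in for o.a.lucene.util.Version, and `filter` is a toy stop-word filter, not the real StopFilter. It only mirrors the position-preservation change discussed in this issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class VersionedStopFilterSketch {
    /** Hypothetical stand-in for o.a.lucene.util.Version; not the real class. */
    enum MatchVersion { V_2_4, V_2_9 }

    /** A surviving token and its position increment relative to the previous survivor. */
    static final class Token {
        final String text;
        final int posIncr;
        Token(String text, int posIncr) { this.text = text; this.posIncr = posIncr; }
        @Override public String toString() { return text + "/" + posIncr; }
    }

    /** Drops stop words; only V_2_9 and later record the gap that a dropped word leaves. */
    static List<Token> filter(List<String> words, Set<String> stopWords, MatchVersion version) {
        boolean preservePositions = version.compareTo(MatchVersion.V_2_9) >= 0;
        List<Token> out = new ArrayList<>();
        int skipped = 0; // stop words dropped since the last emitted token
        for (String word : words) {
            if (stopWords.contains(word)) {
                skipped++;
                continue;
            }
            out.add(new Token(word, preservePositions ? skipped + 1 : 1));
            skipped = 0;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = List.of("the", "quick", "and", "the", "dead");
        Set<String> stops = Set.of("the", "and");
        System.out.println(filter(words, stops, MatchVersion.V_2_4)); // old behavior
        System.out.println(filter(words, stops, MatchVersion.V_2_9)); // "improved" behavior
    }
}
```

Whether the second output is an improvement or just a different choice depends on the application, which is the judgment the schema.xml owner would have to make.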
But I'm realizing now that I'm clearly in the minority in viewing these
multiple versions as "alternate implementations" ... everyone else seems to
have a very fixed view that these Version based changes are genuine
improvements/bug-fixes, w/o any expectation that clients might/could subjectively
decide "I want the old behavior" and that older "Versions" are supported purely
for back-compatibility.
If that's how Version is really going to be used in Lucene-Java moving forward,
then I can definitely understand the push for having it globally configured in
Solr for simplification.
----
I won't fight you guys on this ... if I'm the only one that feels like a global
value is bad, then I concede that probably says more about me than about the
idea.
But I'm still really worried about the problem of (opaque) action at a
distance, and the difficulties in understanding what effects there will be when
changing the luceneMatchVersion property from one value to another.
This comment from Mark illustrates what scares me the most...
bq. it should say, if you change this, you must reindex. No worries about
action at a distance. The action is to get the latest and greatest Lucene has
to offer rather than older buggy or back compat behavior.
...that mindset, that as long as you reindex you'll be fine, totally downplays
the fact that changes will happen in places the user may not realize. w/o a
clear way of knowing what exactly is changing when you modify that (global)
value, users will have no idea what to look for when they "upgrade" it. They
won't have any visibility into the full set of behavior changes to
expect as a result of that update, to know what they should test to make sure
it still works the way they need it to.
If they read in a mailing list thread that they need to switch from
{{<luceneMatchVersion>2.4</luceneMatchVersion>}} to
{{<luceneMatchVersion>2.9</luceneMatchVersion>}} and completely reindex in
order to get positions to be preserved in StopFilterFactory, that doesn't help
them realize that they should do relevancy testing on fieldA and fieldB which
use some language specific stemmer whose behavior changed in a small but
significant way.
As a user, that's the nightmare scenario I don't want to have to deal with:
grepping through every class in Lucene-Java that has a Version property to see
which ones have different behavior between the luceneMatchVersion property I'm
currently using and the luceneMatchVersion property I've been told I should
upgrade to in order to fix a bug ... just so I know what things I need to test
after I make my change.
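As a rough sketch of the visibility that would avoid that search: a table of version-dependent changes that can be queried for a given upgrade. This is purely hypothetical. No such registry is described in this issue, `MatchVersion`, `Change`, and `affected` are invented names, and the entries are illustrative (the first two echo changes mentioned in this issue, the third is made up).

```java
import java.util.ArrayList;
import java.util.List;

public class VersionChangeReportSketch {
    /** Hypothetical stand-in for o.a.lucene.util.Version, declared in release order. */
    enum MatchVersion { V_2_4, V_2_9, V_3_0 }

    /** One documented behavior change: which component changed, and at which version. */
    static final class Change {
        final String component;
        final MatchVersion since;
        final String note;
        Change(String component, MatchVersion since, String note) {
            this.component = component;
            this.since = since;
            this.note = note;
        }
    }

    /** Illustrative entries only; a real list would have to be maintained by hand. */
    static final List<Change> CHANGES = List.of(
        new Change("StopFilterFactory", MatchVersion.V_2_9, "positions preserved"),
        new Change("StandardTokenizer", MatchVersion.V_2_9, "tokenization behavior changed"),
        new Change("EsperantoStemmerFilter", MatchVersion.V_3_0, "made-up entry"));

    /** Components whose behavior changes when moving from 'from' (exclusive) to 'to' (inclusive). */
    static List<String> affected(MatchVersion from, MatchVersion to) {
        List<String> out = new ArrayList<>();
        for (Change c : CHANGES) {
            if (c.since.compareTo(from) > 0 && c.since.compareTo(to) <= 0) {
                out.add(c.component);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // What should be retested when switching luceneMatchVersion from 2.4 to 2.9?
        System.out.println(affected(MatchVersion.V_2_4, MatchVersion.V_2_9));
    }
}
```

The lookup itself is trivial; the hard part is keeping such a list complete and accurate.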
I guess this will just be a documentation problem, but it seems like a
pretty fucking big one.
> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and
> BaseTokenFilterFactory
> -------------------------------------------------------------------------------------------
>
> Key: SOLR-1677
> URL: https://issues.apache.org/jira/browse/SOLR-1677
> Project: Solr
> Issue Type: Sub-task
> Components: Schema and Analysis
> Reporter: Uwe Schindler
> Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch,
> SOLR-1677.patch
>
>
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards
> compatibility with old indexes created using older versions of Lucene. The
> most important example is StandardTokenizer, which changed its behaviour with
> posIncr and incorrect host token types in 2.4 and also in 2.9.
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with
> much more Unicode support, almost every Tokenizer/TokenFilter needs this
> Version parameter. In 2.9, the deprecated old ctors without Version take
> LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base
> factories. Subclasses then can use the luceneMatchVersion decoded enum (in
> 3.0) / Parameter (in 2.9) for constructing Tokenstreams. The code currently
> contains a helper map to decode the version strings, but in 3.0 it can be
> replaced by Version.valueOf(String), as the Version is a subclass of Java5
> enums. The default value is Version.LUCENE_24 (as this is the default for the
> no-version ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from
> StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed
> to match Lucene 3.0.