[
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796087#action_12796087
]
Hoss Man commented on SOLR-1677:
--------------------------------
bq. The problem is the default value. If you leave out the version parameter
instance-wise, you will get 2.4. And because of that all solr users will get
stuck with that version and will never upgrade (because they leave the default
and do not specify a different value).
That feels like a misleading statement ... the "Version" property on these
objects is really more about getting the "recommended" behavior as of a
particular version of Lucene ... saying that users will be "stuck with that
version" is like saying users will be "stuck with StandardAnalyzer" instead of
getting "NewHotnessAnalyzer" because they have to edit their config to use the
newer/better analyzer -- Lucene-Java has opted to use a Version property on
existing classes instead of adding new classes, but it's still conceptually the
same thing: they get the behavior they've always gotten, unless they change
their config to get something different.
Besides which: 99.9% of Solr users copy the example config when they first
start using Solr: we can set a "version" property on every Analyzer/Factory
used in the example schema.xml and update them all when we upgrade the Lucene
jars just as easily as we can update a single "global" value (it's a
search+replaceAll instead of a search+replace)
bq. Why are you so against a default value?
My concern is that it introduces action at a distance -- and not in a good way.
Here's the scenario that seems guaranteed to happen quite a bit if we add some
new {{<luceneAnalyzerVersionDefault/>}} syntax to schema.xml...
{panel}
{{<luceneAnalyzerVersionDefault>2.9</luceneAnalyzerVersionDefault>}} is added
to the example schema.xml, and users start using it as a result of
copying/modifying the example configs. Time passes, new bugs are fixed, and
the example configs evolve to contain
{{<luceneAnalyzerVersionDefault>3.4</luceneAnalyzerVersionDefault>}}
A little while after that, User Bob emails solr-user with a question like...
{quote}
Hey, I'm using FooTokenFilterFactory and i noticed that at query time i see
behaviorX when it really seems like i should see BehaviorY
{quote}
User Carl helpfully replies...
{quote}
That was identified as a bug with FooTokenFilter that was fixed in Lucene 3.1,
but the default behavior was left as is for backcompatibility. If you change
your {{<luceneAnalyzerVersionDefault/>}} value to 3.1 (or 3.2) you'll get the
newer/better behavior -- but if you used FooTokenFilterFactory in an _index_
analyzer you'll need to reindex.
{quote}
Bob makes the change to 3.2 that Carl recommended, and is happy to see that his
queries now work. He only uses FooTokenFilterFactory at _query_ time, so he
doesn't bother to reindex, and everything seems fine.
What Bob doesn't realize (and what Carl wasn't aware of) is that elsewhere in
his schema.xml file, Bob is also using the YakTokenizerFactory on a different
field (yakField), and the behavior of the YakTokenizer changed in Lucene 3.0.
Now _some_ documents/queries that use yakField are failing -- and *failing
silently.*
{panel}
Things just get a lot simpler when all of the configuration for an Analyzer,
TokenizerFactory, or Tokenizer is explicit in its declaration -- indirect
initialization is fine, as long as it's obvious. E.g.: <field/> declarations
reference fieldTypes by name -- it's easy to fuck up a bunch of fields by
making a single change to one fieldType, but at least you can grep for the name
of the fieldType to see all the fields you are affecting.
Even if "Carl" knows/remembers to warn "Bob" that changing
{{<luceneAnalyzerVersionDefault/>}} might change/break other things in his
schema.xml, the situation doesn't get much better: unless Bob (or Carl) skims the
code for every Analyzer, Tokenizer, and TokenFilter used in Bob's schema, they
can't be sure what might get affected by making a small increase to the
"global" luceneAnalyzerVersion setting ... which means the only safe thing for
Bob to do is to set the property individually in the one place he really wants to
make the change.
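To make the Bob/Carl scenario concrete, here is a rough sketch in plain Java. It does not use any Solr or Lucene API: the class {{VersionDefaultSketch}}, the methods {{effectiveVersion}} and {{affectedBy}}, and the map standing in for schema.xml are all hypothetical names invented for illustration, and versions are treated as opaque strings. It only assumes the resolution rule discussed here, i.e. an explicit per-component version wins and anything without one falls back to the global default.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: not Solr or Lucene code.
public class VersionDefaultSketch {

    // Assumed rule: an explicit per-component version wins,
    // otherwise the component inherits the global default.
    static String effectiveVersion(String explicitVersion, String globalDefault) {
        return explicitVersion != null ? explicitVersion : globalDefault;
    }

    // Lists every component whose effective version changes when the
    // global default moves from oldDefault to newDefault.
    static List<String> affectedBy(Map<String, String> explicitVersions,
                                   String oldDefault, String newDefault) {
        List<String> affected = new ArrayList<>();
        for (Map.Entry<String, String> e : explicitVersions.entrySet()) {
            String before = effectiveVersion(e.getValue(), oldDefault);
            String after = effectiveVersion(e.getValue(), newDefault);
            if (!before.equals(after)) {
                affected.add(e.getKey());
            }
        }
        return affected;
    }

    public static void main(String[] args) {
        // Stand-in for Bob's schema: component name -> explicit version (null = none).
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("FooTokenFilterFactory", null);
        schema.put("YakTokenizerFactory", null);

        // Bob bumps the global default to fix Foo; Yak changes too.
        System.out.println(affectedBy(schema, "2.9", "3.2"));
        // prints [FooTokenFilterFactory, YakTokenizerFactory]

        // With an explicit version on Yak, the same bump only touches Foo.
        schema.put("YakTokenizerFactory", "2.9");
        System.out.println(affectedBy(schema, "2.9", "3.2"));
        // prints [FooTokenFilterFactory]
    }
}
```

The point of the sketch is that the list returned by {{affectedBy}} is exactly what Bob cannot see by reading his schema.xml: nothing in the Yak declaration mentions a version, so nothing tells him it depends on the global value.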
So why have the "global" in the first place? It really just seems like more
trouble than it's worth.
> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and
> BaseTokenFilterFactory
> -------------------------------------------------------------------------------------------
>
> Key: SOLR-1677
> URL: https://issues.apache.org/jira/browse/SOLR-1677
> Project: Solr
> Issue Type: Sub-task
> Components: Schema and Analysis
> Reporter: Uwe Schindler
> Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch,
> SOLR-1677.patch
>
>
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards
> compatibility with old indexes created using older versions of Lucene. The
> most important example is StandardTokenizer, which changed its behaviour with
> posIncr and incorrect host token types in 2.4 and also in 2.9.
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with
> much more Unicode support, almost every Tokenizer/TokenFilter needs this
> Version parameter. In 2.9, the deprecated old ctors without Version take
> LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base
> factories. Subclasses then can use the luceneMatchVersion decoded enum (in
> 3.0) / Parameter (in 2.9) for constructing Tokenstreams. The code currently
> contains a helper map to decode the version strings, but in 3.0 it can be
> replaced by Version.valueOf(String), as the Version is a subclass of Java5
> enums. The default value is Version.LUCENE_24 (as this is the default for the
> no-version ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from
> StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed
> to match Lucene 3.0.