Another example: long ago Lucene allowed pos=-1 to be indexed, and it
caused all sorts of problems.  We also stopped allowing positions close to
Integer.MAX_VALUE (https://issues.apache.org/jira/browse/LUCENE-6382).  Yet
another is negative vInts, which are possible but horribly inefficient
(https://issues.apache.org/jira/browse/LUCENE-3738).
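
To make the vInt cost concrete, here is a minimal sketch of a
7-bits-per-byte variable-length int encoding.  The class and method names
are made up for illustration and this is not Lucene's actual code; it only
shows why a negative int, whose sign bit is set, always needs the maximum
of five bytes.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative sketch only; VIntSketch and encodeVInt are hypothetical names. */
final class VIntSketch {

    /** Encodes i with 7 payload bits per byte; the high bit means "more bytes follow". */
    static byte[] encodeVInt(int i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(5);
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80); // low 7 bits plus continuation flag
            i >>>= 7;                     // unsigned shift, so negatives terminate too
        }
        out.write(i);                     // final byte, continuation flag clear
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Print the encoded length for a few small values and one negative value.
        for (int value : new int[] {0, 127, 128, 16384, -1}) {
            System.out.println(value + " -> " + encodeVInt(value).length + " byte(s)");
        }
    }
}
```

Small non-negative values fit in one or two bytes, while any negative
value takes five, which is more than the four bytes of a plain int.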

We do need to be free to fix these problems and then know after N+2
releases that no index can have the issue.

I like the idea of providing an "expert" / best-effort / limited way of
carrying forward such ancient indices, but I think the huge challenge for
someone using that tool on an important index will be enumerating the list
of issues that might "matter" (the 3 Adrien listed + the 3 I listed above
is a start for this list) and taking appropriate steps to "correct" the
index if so.  E.g. on a norms encoding change, somehow these expert tools
must decode norms the old way, encode them the new way, and then rewrite
the norms files.  Or if the index has pos=-1, changing that to pos=0.  Or
if it has negative vInts, ... etc.
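
As a rough illustration of the simplest of these fix-ups, the sketch below
rewrites a list of positions so that any negative value becomes 0.
Everything in it (the class name, the method name, working on a plain
int[] rather than on postings files) is hypothetical; a real tool would
have to do this while rewriting the postings, and would need to decide
whether moving a token from -1 to 0 is acceptable for its phrase queries.

```java
/** Illustrative sketch only; not part of any Lucene API. */
final class PositionFixupSketch {

    /** Returns a copy of positions with every negative entry replaced by 0. */
    static int[] clampNegativePositions(int[] positions) {
        int[] fixed = positions.clone(); // leave the caller's array untouched
        for (int i = 0; i < fixed.length; i++) {
            if (fixed[i] < 0) {
                fixed[i] = 0;
            }
        }
        return fixed;
    }
}
```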

Or maybe the "special" DirectoryReader only reads stored fields?  And so
you would enumerate your _source and reindex into the latest format ...

> Something like https://issues.apache.org/jira/browse/LUCENE-8277 would
> help make it harder to introduce corrupt data in an index.

+1

Every time we catch something like "don't allow pos = -1 into the index" we
need to somehow remember to go and add the same check in addIndexes.
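
One way to avoid forgetting is to route both paths through a single check.
The sketch below is only that, a sketch: the class, both entry points and
the upper bound are hypothetical stand-ins, not Lucene's actual code or
limits.

```java
/** Illustrative sketch only; names and the bound are assumptions. */
final class PositionCheckSketch {

    // Assumed headroom below Integer.MAX_VALUE; check what your Lucene version enforces.
    static final int MAX_POSITION = Integer.MAX_VALUE - 128;

    /** The single place where position invariants are enforced. */
    static void checkPosition(int position) {
        if (position < 0) {
            throw new IllegalArgumentException("position must be >= 0 (got " + position + ")");
        }
        if (position > MAX_POSITION) {
            throw new IllegalArgumentException(
                "position must be <= " + MAX_POSITION + " (got " + position + ")");
        }
    }

    /** Stand-in for the normal indexing path. */
    static void indexPositions(int[] positions) {
        for (int position : positions) {
            checkPosition(position);
        }
    }

    /** Stand-in for the addIndexes path: same check, no separate copy to forget. */
    static void addExternalPositions(int[] positions) {
        for (int position : positions) {
            checkPosition(position);
        }
    }
}
```

With this shape, a new invariant added to checkPosition is picked up by
both paths at once.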

Mike McCandless

http://blog.mikemccandless.com


On Fri, Jan 25, 2019 at 3:52 AM Adrien Grand <[email protected]> wrote:

> Agreed with Michael that setting expectations is going to be
> important. The thing that I would like to make sure is that we would
> never refrain from moving Lucene forward because of this feature. In
> particular, lucene-core should be free to make assumptions that are
> valid for N and N-1 indices without worrying about the fact that we
> have this super-expert feature that allows opening older indices. Here
> are some assumptions that I have in mind which have not always been
> true:
>  - norms might be encoded in a different way (this changed in 7)
>  - all index files have a checksum (only true since Lucene 5)
>  - offsets are always going forward (only enforced since Lucene 7)
>
> This means that carrying indices over by just merging them with the
> new version to move them to a new codec won't work all the time. For
> instance if your index has backward offsets and new codecs assume that
> offsets are going forward, then merging might fail or corrupt offsets
> - I'd like to make sure that we would not consider this a bug.
>
> Erick, I don't think this feature would be suitable for "robust index
> upgrades". To me it is really a best effort and shouldn't be trusted
> too much.
>
> I think some users will be tempted to wrap old readers to make them
> look good and then add them back to an index using addIndexes?
> Something like https://issues.apache.org/jira/browse/LUCENE-8277 would
> help make it harder to introduce corrupt data in an index.
>
> On Wed, Jan 23, 2019 at 3:11 PM Simon Willnauer
> <[email protected]> wrote:
> >
> > Hey folks,
> >
> > tl;dr; I want to be able to open an indexreader on an old index if the
> > SegmentInfo version is supported and all segment codecs are available.
> > Today that's not possible even if I port old formats to current
> > versions.
> >
> > Our BWC policy for quite a while has been N-1 major versions. That's
> > good and I think we should keep it that way. Only recently, because of
> > changes to how we encode/decode norms, we also hard-enforce the
> > index-version-created in several places, as well as the version a
> > segment was written with. These are great enforcements and I
> > understand why. My request here is whether we can find consensus on
> > somehow allowing (via a special DirectoryReader, for instance) such an
> > index to be opened read-only, without the guarantee that our
> > high-level APIs decode norms correctly, for instance. This would be
> > enough to, for instance, consume stored fields etc. for reindexing or,
> > if users are aware, do their own norms decoding in the codec. I am
> > happy to work on a proposal for how this would work. It would still
> > enforce no writing or anything like this. I am also all for putting
> > such a reader into misc and marking it experimental.
> >
> > simon
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
>
>
> --
> Adrien
>
