Re: [DISCUSS] Opening old indices for reading

2019-03-08 Thread Cassandra Targett
I have a question about Simon’s commit that he discussed in an earlier mail to 
this thread, found at 
https://github.com/apache/lucene-solr/commit/0c4c885214ef30627a01e320f9c861dc2521b752

I see the commit diffs and files changed in GitHub at the above URL, but one 
odd thing about it is that it doesn’t refer to any branch and a scan of the 
code doesn’t show these changes at all. I looked for branches and PRs and 
didn’t find anything that jumped out at me. There also weren’t any 
notifications to the commits@lucene.a.o list about these changes.

So, were the changes really made? Was it just intended as some code for 
discussion, or was it meant to be in master branch? If the former, how does one 
make a commit without a branch? If the change was intended to be in master, 
though, it seems something has gone awry and we should try to fix it.

Cassandra
On Jan 31, 2019, 8:23 AM -0600, Adrien Grand wrote:
> This looks reasonable to me.
>
> On Tue, Jan 29, 2019 at 4:23 PM Simon Willnauer
>  wrote:
> >
> > thanks folks,
> >
> > these are all good points. I created a first cut of what I had in mind
> > [1] . It's relatively simple and from a java visibility perspective
> > the only change that a user can take advantage of is this [2] and this
> > [3] respectively. This would allow opening indices back to Lucene 7.0
> > given that the codecs and postings formats are available. From a
> > documentation perspective I added [4]. This is a pure read-only change
> > and doesn't allow opening these indices for writing. You can't merge
> > them, nor would you be able to open an index writer on top of them. I
> > still need to add support to CheckIndex, but that's basically what it
> > is.
> >
> > lemme know what you think,
> >
> > simon
> > [1] 
> > https://github.com/apache/lucene-solr/commit/0c4c885214ef30627a01e320f9c861dc2521b752
> > [2] 
> > https://github.com/apache/lucene-solr/commit/0c4c885214ef30627a01e320f9c861dc2521b752#diff-e0352098b027d6f41a17c068ad8d7ef0R689
> > [3] 
> > https://github.com/apache/lucene-solr/commit/0c4c885214ef30627a01e320f9c861dc2521b752#diff-e3ccf9ee90355b10f2dd22ce2da6c73cR306
> > [4] 
> > https://github.com/apache/lucene-solr/commit/0c4c885214ef30627a01e320f9c861dc2521b752#diff-1bedf4d0d52ff88ef8a16a6788ad7684R86
> >
> > On Fri, Jan 25, 2019 at 3:14 PM Michael McCandless
> >  wrote:
> > >
> > > Another example is long ago Lucene allowed pos=-1 to be indexed and it 
> > > caused all sorts of problems. We also stopped allowing positions close to 
> > > Integer.MAX_VALUE (https://issues.apache.org/jira/browse/LUCENE-6382). 
> > > Yet another is allowing negative vInts which are possible but horribly 
> > > inefficient (https://issues.apache.org/jira/browse/LUCENE-3738).
> > >
> > > We do need to be free to fix these problems and then know after N+2 
> > > releases that no index can have the issue.
> > >
> > > I like the idea of providing "expert" / best effort / limited way of 
> > > carrying forward such ancient indices, but I think the huge challenge for 
> > > someone using that tool on an important index will be enumerating the 
> > > list of issues that might "matter" (the 3 Adrien listed + the 3 I listed 
> > > above is a start for this list) and taking appropriate steps to "correct" 
> > > the index if so. E.g. on a norms encoding change, somehow these expert 
> > > tools must decode norms the old way, encode them the new way, and then 
> > > rewrite the norms files. Or if the index has pos=-1, changing that to 
> > > pos=0. Or if it has negative vInts, ... etc.
> > >
> > > Or maybe the "special" DirectoryReader only reads stored fields? And so 
> > > you would enumerate your _source and reindex into the latest format ...
> > >
> > > > Something like https://issues.apache.org/jira/browse/LUCENE-8277 would
> > > > help make it harder to introduce corrupt data in an index.
> > >
> > > +1
> > >
> > > Every time we catch something like "don't allow pos = -1 into the index"
> > > we need to somehow remember to go and add the check in addIndexes as well.
> > >
> > > Mike McCandless
> > >
> > > http://blog.mikemccandless.com
> > >
> > >
> > > On Fri, Jan 25, 2019 at 3:52 AM Adrien Grand  wrote:
> > > >
> > > > Agreed with Michael that setting expectations is going to be
> > > > important. The thing that I would like to make sure is that we would
> > > > never refrain from moving Lucene forward because of this feature. In
> > > > particular, lucene-core should be free to make assumptions that are
> > > > valid for N and N-1 indices without worrying about the fact that we
> > > > have this super-expert feature that allows opening older indices. Here
> > > > are some assumptions that I have in mind which have not always been
> > > > true:
> > > > - norms might be encoded in a different way (this changed in 7)
> > > > - all index files have a checksum (only true since Lucene 5)
> > > > - offsets are always going forward (only enforced since Lucene 7)
> > > >
> > > > This means that carrying 

Re: [DISCUSS] Opening old indices for reading

2019-01-31 Thread Adrien Grand
This looks reasonable to me.

On Tue, Jan 29, 2019 at 4:23 PM Simon Willnauer
 wrote:
>
> thanks folks,
>
> these are all good points. I created a first cut of what I had in mind
> [1] . It's relatively simple and from a java visibility perspective
> the only change that a user can take advantage of is this [2] and this
> [3] respectively. This would allow opening indices back to Lucene 7.0
> given that the codecs and postings formats are available. From a
> documentation perspective I added [4]. This is a pure read-only change
> and doesn't allow opening these indices for writing. You can't merge
> them, nor would you be able to open an index writer on top of them. I
> still need to add support to CheckIndex, but that's basically what it
> is.
>
> lemme know what you think,
>
> simon
> [1] 
> https://github.com/apache/lucene-solr/commit/0c4c885214ef30627a01e320f9c861dc2521b752
> [2] 
> https://github.com/apache/lucene-solr/commit/0c4c885214ef30627a01e320f9c861dc2521b752#diff-e0352098b027d6f41a17c068ad8d7ef0R689
> [3] 
> https://github.com/apache/lucene-solr/commit/0c4c885214ef30627a01e320f9c861dc2521b752#diff-e3ccf9ee90355b10f2dd22ce2da6c73cR306
> [4] 
> https://github.com/apache/lucene-solr/commit/0c4c885214ef30627a01e320f9c861dc2521b752#diff-1bedf4d0d52ff88ef8a16a6788ad7684R86
>
> On Fri, Jan 25, 2019 at 3:14 PM Michael McCandless
>  wrote:
> >
> > Another example is long ago Lucene allowed pos=-1 to be indexed and it 
> > caused all sorts of problems.  We also stopped allowing positions close to 
> > Integer.MAX_VALUE (https://issues.apache.org/jira/browse/LUCENE-6382).  Yet 
> > another is allowing negative vInts which are possible but horribly 
> > inefficient (https://issues.apache.org/jira/browse/LUCENE-3738).
> >
> > We do need to be free to fix these problems and then know after N+2 
> > releases that no index can have the issue.
> >
> > I like the idea of providing "expert" / best effort / limited way of 
> > carrying forward such ancient indices, but I think the huge challenge for 
> > someone using that tool on an important index will be enumerating the list 
> > of issues that might "matter" (the 3 Adrien listed + the 3 I listed above 
> > is a start for this list) and taking appropriate steps to "correct" the 
> > index if so.  E.g. on a norms encoding change, somehow these expert tools 
> > must decode norms the old way, encode them the new way, and then rewrite 
> > the norms files.  Or if the index has pos=-1, changing that to pos=0.  Or 
> > if it has negative vInts, ... etc.
> >
> > Or maybe the "special" DirectoryReader only reads stored fields?  And so 
> > you would enumerate your _source and reindex into the latest format ...
> >
> > > Something like https://issues.apache.org/jira/browse/LUCENE-8277 would
> > > help make it harder to introduce corrupt data in an index.
> >
> > +1
> >
> > Every time we catch something like "don't allow pos = -1 into the index" we
> > need to somehow remember to go and add the check in addIndexes as well.
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Fri, Jan 25, 2019 at 3:52 AM Adrien Grand  wrote:
> >>
> >> Agreed with Michael that setting expectations is going to be
> >> important. The thing that I would like to make sure is that we would
> >> never refrain from moving Lucene forward because of this feature. In
> >> particular, lucene-core should be free to make assumptions that are
> >> valid for N and N-1 indices without worrying about the fact that we
> >> have this super-expert feature that allows opening older indices. Here
> >> are some assumptions that I have in mind which have not always been
> >> true:
> >>  - norms might be encoded in a different way (this changed in 7)
> >>  - all index files have a checksum (only true since Lucene 5)
> >>  - offsets are always going forward (only enforced since Lucene 7)
> >>
> >> This means that carrying indices over by just merging them with the
> >> new version to move them to a new codec won't work all the time. For
> >> instance if your index has backward offsets and new codecs assume that
> >> offsets are going forward, then merging might fail or corrupt offsets
> >> - I'd like to make sure that we would not consider this a bug.
> >>
> >> Erick, I don't think this feature would be suitable for "robust index
> >> upgrades". To me it is really a best effort and shouldn't be trusted
> >> too much.
> >>
> >> I think some users will be tempted to wrap old readers to make them
> >> look good and then add them back to an index using addIndexes?
> >> Something like https://issues.apache.org/jira/browse/LUCENE-8277 would
> >> help make it harder to introduce corrupt data in an index.
> >>
> >> On Wed, Jan 23, 2019 at 3:11 PM Simon Willnauer
> >>  wrote:
> >> >
> >> > Hey folks,
> >> >
> >> > tl;dr; I want to be able to open an indexreader on an old index if the
> >> > SegmentInfo version is supported and all segment codecs are available.
> >> > Today that's not possible even if 

Re: [DISCUSS] Opening old indices for reading

2019-01-29 Thread Simon Willnauer
thanks folks,

these are all good points. I created a first cut of what I had in mind
[1] . It's relatively simple and from a java visibility perspective
the only change that a user can take advantage of is this [2] and this
[3] respectively. This would allow opening indices back to Lucene 7.0
given that the codecs and postings formats are available. From a
documentation perspective I added [4]. This is a pure read-only change
and doesn't allow opening these indices for writing. You can't merge
them, nor would you be able to open an index writer on top of them. I
still need to add support to CheckIndex, but that's basically what it
is.
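
A minimal sketch of the kind of gate described above, assuming the check
boils down to "every segment was written by at least some minimum major
version, and its codec is available". This is a plain-Java model, not the
code from the commit: OldIndexGate, Segment, firstProblem and the
minSupportedMajor parameter are all hypothetical names, and the codec
names used with it are just sample strings.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical model of a read-only "can we open this old index?" check. */
final class OldIndexGate {

    /** Minimal stand-in for per-segment metadata (not a Lucene class). */
    static final class Segment {
        final String name;
        final int writtenMajor;
        final String codecName;

        Segment(String name, int writtenMajor, String codecName) {
            this.name = name;
            this.writtenMajor = writtenMajor;
            this.codecName = codecName;
        }
    }

    /**
     * Returns null when every segment passes both checks, otherwise a
     * human-readable reason for the first segment that fails.
     */
    static String firstProblem(List<Segment> segments,
                               int minSupportedMajor,
                               Set<String> availableCodecs) {
        for (Segment s : segments) {
            // Check 1: the segment must not predate the supported floor.
            if (s.writtenMajor < minSupportedMajor) {
                return s.name + ": written by major version " + s.writtenMajor
                        + ", older than the minimum " + minSupportedMajor;
            }
            // Check 2: something must be able to decode the segment's files.
            if (!availableCodecs.contains(s.codecName)) {
                return s.name + ": codec " + s.codecName + " is not available";
            }
        }
        return null;
    }
}
```

With a floor of 7, a segment written by major version 6 is rejected even
if a codec for it has been ported forward, and a segment written by 7 is
rejected if its codec is missing. Nothing in this check grants write
access; it only decides whether a reader may be opened at all.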

lemme know what you think,

simon
[1] 
https://github.com/apache/lucene-solr/commit/0c4c885214ef30627a01e320f9c861dc2521b752
[2] 
https://github.com/apache/lucene-solr/commit/0c4c885214ef30627a01e320f9c861dc2521b752#diff-e0352098b027d6f41a17c068ad8d7ef0R689
[3] 
https://github.com/apache/lucene-solr/commit/0c4c885214ef30627a01e320f9c861dc2521b752#diff-e3ccf9ee90355b10f2dd22ce2da6c73cR306
[4] 
https://github.com/apache/lucene-solr/commit/0c4c885214ef30627a01e320f9c861dc2521b752#diff-1bedf4d0d52ff88ef8a16a6788ad7684R86

On Fri, Jan 25, 2019 at 3:14 PM Michael McCandless
 wrote:
>
> Another example is long ago Lucene allowed pos=-1 to be indexed and it caused 
> all sorts of problems.  We also stopped allowing positions close to 
> Integer.MAX_VALUE (https://issues.apache.org/jira/browse/LUCENE-6382).  Yet 
> another is allowing negative vInts which are possible but horribly 
> inefficient (https://issues.apache.org/jira/browse/LUCENE-3738).
>
> We do need to be free to fix these problems and then know after N+2 releases 
> that no index can have the issue.
>
> I like the idea of providing "expert" / best effort / limited way of carrying 
> forward such ancient indices, but I think the huge challenge for someone 
> using that tool on an important index will be enumerating the list of issues 
> that might "matter" (the 3 Adrien listed + the 3 I listed above is a start 
> for this list) and taking appropriate steps to "correct" the index if so.  
> E.g. on a norms encoding change, somehow these expert tools must decode norms 
> the old way, encode them the new way, and then rewrite the norms files.  Or 
> if the index has pos=-1, changing that to pos=0.  Or if it has negative 
> vInts, ... etc.
>
> Or maybe the "special" DirectoryReader only reads stored fields?  And so you 
> would enumerate your _source and reindex into the latest format ...
>
> > Something like https://issues.apache.org/jira/browse/LUCENE-8277 would
> > help make it harder to introduce corrupt data in an index.
>
> +1
>
> Every time we catch something like "don't allow pos = -1 into the index" we
> need to somehow remember to go and add the check in addIndexes as well.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Fri, Jan 25, 2019 at 3:52 AM Adrien Grand  wrote:
>>
>> Agreed with Michael that setting expectations is going to be
>> important. The thing that I would like to make sure is that we would
>> never refrain from moving Lucene forward because of this feature. In
>> particular, lucene-core should be free to make assumptions that are
>> valid for N and N-1 indices without worrying about the fact that we
>> have this super-expert feature that allows opening older indices. Here
>> are some assumptions that I have in mind which have not always been
>> true:
>>  - norms might be encoded in a different way (this changed in 7)
>>  - all index files have a checksum (only true since Lucene 5)
>>  - offsets are always going forward (only enforced since Lucene 7)
>>
>> This means that carrying indices over by just merging them with the
>> new version to move them to a new codec won't work all the time. For
>> instance if your index has backward offsets and new codecs assume that
>> offsets are going forward, then merging might fail or corrupt offsets
>> - I'd like to make sure that we would not consider this a bug.
>>
>> Erick, I don't think this feature would be suitable for "robust index
>> upgrades". To me it is really a best effort and shouldn't be trusted
>> too much.
>>
>> I think some users will be tempted to wrap old readers to make them
>> look good and then add them back to an index using addIndexes?
>> Something like https://issues.apache.org/jira/browse/LUCENE-8277 would
>> help make it harder to introduce corrupt data in an index.
>>
>> On Wed, Jan 23, 2019 at 3:11 PM Simon Willnauer
>>  wrote:
>> >
>> > Hey folks,
>> >
>> > tl;dr; I want to be able to open an indexreader on an old index if the
>> > SegmentInfo version is supported and all segment codecs are available.
>> > Today that's not possible even if I port old formats to current
>> > versions.
>> >
>> > Our BWC policy for quite a while has been N-1 major versions. That's
>> > good and I think we should keep it that way. Only recently, because of
>> > changes to how we encode/decode norms, we also hard-enforce the
>> > 

Re: [DISCUSS] Opening old indices for reading

2019-01-25 Thread Michael McCandless
Another example is long ago Lucene allowed pos=-1 to be indexed and it
caused all sorts of problems.  We also stopped allowing positions close to
Integer.MAX_VALUE (https://issues.apache.org/jira/browse/LUCENE-6382).  Yet
another is allowing negative vInts which are possible but horribly
inefficient (https://issues.apache.org/jira/browse/LUCENE-3738).
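
To illustrate why negative vInts are so inefficient, here is a minimal
sketch of a variable-length int encoder of the usual shape: 7 payload bits
per byte, with the high bit meaning "more bytes follow". VIntSketch and
encode are hypothetical names for this example, and this is a stand-alone
model rather than Lucene's own writer.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative variable-length int encoder (not a Lucene class). */
final class VIntSketch {

    /** Encodes value using 7 bits per byte, least significant group first. */
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = value;
        // While bits above the low 7 remain, emit 7 bits plus a continuation flag.
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7; // unsigned shift, so negative inputs still terminate
        }
        out.write(i); // final byte, continuation flag clear
        return out.toByteArray();
    }
}
```

Small non-negative values take one byte (encode(1).length is 1), while any
negative value has its sign bit set and therefore always needs the maximum
of five bytes (encode(-1).length is 5), which is more than the four bytes
of a fixed-width int.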

We do need to be free to fix these problems and then know after N+2
releases that no index can have the issue.

I like the idea of providing "expert" / best effort / limited way of
carrying forward such ancient indices, but I think the huge challenge for
someone using that tool on an important index will be enumerating the list
of issues that might "matter" (the 3 Adrien listed + the 3 I listed above
is a start for this list) and taking appropriate steps to "correct" the
index if so.  E.g. on a norms encoding change, somehow these expert tools
must decode norms the old way, encode them the new way, and then rewrite
the norms files.  Or if the index has pos=-1, changing that to pos=0.  Or
if it has negative vInts, ... etc.

Or maybe the "special" DirectoryReader only reads stored fields?  And so
you would enumerate your _source and reindex into the latest format ...
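
A rough sketch of that "stored fields only" workflow, assuming the old
reader can at least report which doc ids are live and hand back stored
fields per doc id. The reader and the new index are modelled here as plain
functional interfaces, so StoredFieldsReindexSketch, reindex and all the
parameter names are hypothetical and not Lucene APIs.

```java
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.IntFunction;
import java.util.function.IntPredicate;

/** Hypothetical stored-fields-only reindex loop (not a Lucene class). */
final class StoredFieldsReindexSketch {

    /**
     * Walks doc ids 0..maxDoc-1, skips deleted documents, and passes each
     * live document's stored fields to the sink (standing in for a writer
     * on a fresh index). Returns the number of documents copied.
     */
    static int reindex(int maxDoc,
                       IntPredicate isLive,
                       IntFunction<Map<String, String>> storedFields,
                       Consumer<Map<String, String>> sink) {
        int copied = 0;
        for (int docId = 0; docId < maxDoc; docId++) {
            if (!isLive.test(docId)) {
                continue; // deleted in the old index, nothing to carry over
            }
            sink.accept(storedFields.apply(docId));
            copied++;
        }
        return copied;
    }
}
```

The point of the shape is that only stored fields cross the boundary:
norms, positions and offsets are all regenerated by the current version
when the sink re-analyzes the documents, so none of the old encodings need
to be understood.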

> Something like https://issues.apache.org/jira/browse/LUCENE-8277 would
> help make it harder to introduce corrupt data in an index.

+1

Every time we catch something like "don't allow pos = -1 into the index" we
need to somehow remember to go and add the check in addIndexes as well.

Mike McCandless

http://blog.mikemccandless.com


On Fri, Jan 25, 2019 at 3:52 AM Adrien Grand  wrote:

> Agreed with Michael that setting expectations is going to be
> important. The thing that I would like to make sure is that we would
> never refrain from moving Lucene forward because of this feature. In
> particular, lucene-core should be free to make assumptions that are
> valid for N and N-1 indices without worrying about the fact that we
> have this super-expert feature that allows opening older indices. Here
> are some assumptions that I have in mind which have not always been
> true:
>  - norms might be encoded in a different way (this changed in 7)
>  - all index files have a checksum (only true since Lucene 5)
>  - offsets are always going forward (only enforced since Lucene 7)
>
> This means that carrying indices over by just merging them with the
> new version to move them to a new codec won't work all the time. For
> instance if your index has backward offsets and new codecs assume that
> offsets are going forward, then merging might fail or corrupt offsets
> - I'd like to make sure that we would not consider this a bug.
>
> Erick, I don't think this feature would be suitable for "robust index
> upgrades". To me it is really a best effort and shouldn't be trusted
> too much.
>
> I think some users will be tempted to wrap old readers to make them
> look good and then add them back to an index using addIndexes?
> Something like https://issues.apache.org/jira/browse/LUCENE-8277 would
> help make it harder to introduce corrupt data in an index.
>
> On Wed, Jan 23, 2019 at 3:11 PM Simon Willnauer
>  wrote:
> >
> > Hey folks,
> >
> > tl;dr; I want to be able to open an indexreader on an old index if the
> > SegmentInfo version is supported and all segment codecs are available.
> > Today that's not possible even if I port old formats to current
> > versions.
> >
> > Our BWC policy for quite a while has been N-1 major versions. That's
> > good and I think we should keep it that way. Only recently, because of
> > changes to how we encode/decode norms, we also hard-enforce the
> > index-version-created in several places, as well as the version a
> > segment was written with. These are great enforcements and I
> > understand why. My request here is whether we can find consensus on
> > allowing some way (a special DirectoryReader, for instance) to open
> > such an index for reading only, without the guarantees our high-level
> > APIs normally provide, e.g. that norms are decoded correctly. This
> > would be enough to, for instance, consume stored fields etc. for
> > reindexing or, if users are aware of the issue, to do their own norms
> > decoding in the codec. I am happy to work on a proposal for how this
> > would work. It would still not allow writing or anything like that. I
> > am also all for putting such a reader into misc and marking it
> > experimental.
> >
> > simon
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
>
>
> --
> Adrien
>
>
>


Re: [DISCUSS] Opening old indices for reading

2019-01-25 Thread Adrien Grand
Agreed with Michael that setting expectations is going to be
important. The thing that I would like to make sure is that we would
never refrain from moving Lucene forward because of this feature. In
particular, lucene-core should be free to make assumptions that are
valid for N and N-1 indices without worrying about the fact that we
have this super-expert feature that allows opening older indices. Here
are some assumptions that I have in mind which have not always been
true:
 - norms might be encoded in a different way (this changed in 7)
 - all index files have a checksum (only true since Lucene 5)
 - offsets are always going forward (only enforced since Lucene 7)

This means that carrying indices over by just merging them with the
new version to move them to a new codec won't work all the time. For
instance if your index has backward offsets and new codecs assume that
offsets are going forward, then merging might fail or corrupt offsets
- I'd like to make sure that we would not consider this a bug.
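
As a concrete picture of the offsets assumption, here is a minimal sketch
of a check for "offsets are always going forward", assuming that means
start offsets never decrease from one token to the next and each end
offset is at least its start offset. OffsetCheckSketch and
firstBackwardOffset are hypothetical names, and the exact checks enforced
in lucene-core may differ.

```java
/** Illustrative offsets validation for one field of one document (not a Lucene class). */
final class OffsetCheckSketch {

    /**
     * Each entry of offsets is {startOffset, endOffset} for one token, in
     * token order. Returns the index of the first token that breaks the
     * assumed invariants, or -1 if all tokens are fine.
     */
    static int firstBackwardOffset(int[][] offsets) {
        int lastStart = 0;
        for (int i = 0; i < offsets.length; i++) {
            int start = offsets[i][0];
            int end = offsets[i][1];
            // Negative, inverted, or backward-moving offsets are all rejected.
            if (start < 0 || end < start || start < lastStart) {
                return i;
            }
            lastStart = start;
        }
        return -1;
    }
}
```

An old index may contain token streams for which this returns a
non-negative index. A codec written with the invariant in mind is free to
encode start offsets as non-negative deltas, which is one way merging such
an index could fail or silently corrupt offsets.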

Erick, I don't think this feature would be suitable for "robust index
upgrades". To me it is really a best effort and shouldn't be trusted
too much.

I think some users will be tempted to wrap old readers to make them
look good and then add them back to an index using addIndexes?
Something like https://issues.apache.org/jira/browse/LUCENE-8277 would
help make it harder to introduce corrupt data in an index.

On Wed, Jan 23, 2019 at 3:11 PM Simon Willnauer
 wrote:
>
> Hey folks,
>
> tl;dr; I want to be able to open an indexreader on an old index if the
> SegmentInfo version is supported and all segment codecs are available.
> Today that's not possible even if I port old formats to current
> versions.
>
> Our BWC policy for quite a while has been N-1 major versions. That's
> good and I think we should keep it that way. Only recently, because of
> changes to how we encode/decode norms, we also hard-enforce the
> index-version-created in several places, as well as the version a
> segment was written with. These are great enforcements and I
> understand why. My request here is whether we can find consensus on
> allowing some way (a special DirectoryReader, for instance) to open
> such an index for reading only, without the guarantees our high-level
> APIs normally provide, e.g. that norms are decoded correctly. This
> would be enough to, for instance, consume stored fields etc. for
> reindexing or, if users are aware of the issue, to do their own norms
> decoding in the codec. I am happy to work on a proposal for how this
> would work. It would still not allow writing or anything like that. I
> am also all for putting such a reader into misc and marking it
> experimental.
>
> simon
>
>


-- 
Adrien




Re: [DISCUSS] Opening old indices for reading

2019-01-24 Thread Michael Sokolov
+1 it makes sense to me; real world problems sometimes require messy
solutions. I guess the alternative is everybody develops their own suite of
tools and it is hard to share.

Some caution is warranted though, I think; even with misc/experimental
caveats, these tools will only be useful if people can understand what to
expect from them, so it should be explicit what guarantees can be offered.
I don't know what they will be exactly, but perhaps: stored fields and doc
values fields can be retrieved/iterated over, while search results might
differ due to ranking differences or early termination relying on new index
structures? Maybe naming/defining these as having a limited scope like
disaster recovery or migration would give a hint that they should not be
used as some kind of adapter in a production system for old indexes.
I expect explaining what these tools are for to a wider audience will
deserve some care.

-Mike

On Wed, Jan 23, 2019 at 3:30 PM Erick Erickson 
wrote:

> +1, A lot of this was discussed on SOLR-12259, we should probably link
> any Lucene JIRAs for this back to that one to make an easy trail to
> follow.
>
> One thing I'd thought of is whether we should merge segments during
> this operation. If we're going to rewrite the entire index anyway,
> does it make sense to combine segments into max-sized segments a-la
> TieredMergePolicy?
>
> I'm not thinking of anything fancy at all here; there's no "cost" to
> calculate, for instance. Just:
> 1> go through the list of segments, adding to a OneMerge until it's as
> big as it can be.
> 2> repeat until you have a list of OneMerges that contain all the
> original segments.
>
> How big "as big as it can be" is TBD; TMP uses 5G. Could be a param, I
> suppose.
>
> Erick
>
>
> On Wed, Jan 23, 2019 at 9:24 AM Andrzej Białecki  wrote:
> >
> > +1. I think that even with these caveats (read-only, some data may
> > require re-interpretation) it would still be a great help for accessing
> > legacy data, for which the original source may no longer exist.
> >
> > > On 23 Jan 2019, at 15:11, Simon Willnauer wrote:
> > >
> > > Hey folks,
> > >
> > > tl;dr; I want to be able to open an indexreader on an old index if the
> > > SegmentInfo version is supported and all segment codecs are available.
> > > Today that's not possible even if I port old formats to current
> > > versions.
> > >
> > > Our BWC policy for quite a while has been N-1 major versions. That's
> > > good and I think we should keep it that way. Only recently, because of
> > > changes to how we encode/decode norms, we also hard-enforce the
> > > index-version-created in several places, as well as the version a
> > > segment was written with. These are great enforcements and I
> > > understand why. My request here is whether we can find consensus on
> > > allowing some way (a special DirectoryReader, for instance) to open
> > > such an index for reading only, without the guarantees our high-level
> > > APIs normally provide, e.g. that norms are decoded correctly. This
> > > would be enough to, for instance, consume stored fields etc. for
> > > reindexing or, if users are aware of the issue, to do their own norms
> > > decoding in the codec. I am happy to work on a proposal for how this
> > > would work. It would still not allow writing or anything like that. I
> > > am also all for putting such a reader into misc and marking it
> > > experimental.
> > >
> > > simon
> > >
> > >
> >
> >
> >
>
>
>


Re: [DISCUSS] Opening old indices for reading

2019-01-23 Thread Erick Erickson
+1, A lot of this was discussed on SOLR-12259, we should probably link
any Lucene JIRAs for this back to that one to make an easy trail to
follow.

One thing I'd thought of is whether we should merge segments during
this operation. If we're going to rewrite the entire index anyway,
does it make sense to combine segments into max-sized segments a-la
TieredMergePolicy?

I'm not thinking of anything fancy at all here; there's no "cost" to
calculate, for instance. Just:
1> go through the list of segments, adding to a OneMerge until it's as
big as it can be.
2> repeat until you have a list of OneMerges that contain all the
original segments.

How big "as big as it can be" is TBD; TMP uses 5G. Could be a param, I
suppose.
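
The two steps above can be sketched as a simple greedy pass over segment
sizes. This is a plain-Java model under stated assumptions: segments are
represented only by their size in bytes, a group stands in for a OneMerge,
and GreedySegmentGrouping, group and maxGroupBytes are hypothetical names
rather than anything in TieredMergePolicy.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative greedy grouping of segments into size-capped merges (not a Lucene class). */
final class GreedySegmentGrouping {

    /**
     * Walks the segments in order, filling the current group until the next
     * segment would push it past maxGroupBytes, then starts a new group.
     * Every input segment ends up in exactly one group; a segment larger
     * than the cap gets a group of its own.
     */
    static List<List<Long>> group(List<Long> segmentSizes, long maxGroupBytes) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : segmentSizes) {
            // Step 1: the current group is "as big as it can be", so close it.
            if (!current.isEmpty() && currentBytes + size > maxGroupBytes) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        // Step 2 ends once the last, possibly partial, group is recorded.
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}
```

For sizes 2, 2, 2, 5, 1 and a cap of 5 this yields the groups [2, 2], [2],
[5] and [1]. Because there is no cost function, the result depends on the
order in which segments are visited; sorting by size first would pack more
tightly, at the price of being slightly less simple.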

Erick


On Wed, Jan 23, 2019 at 9:24 AM Andrzej Białecki  wrote:
>
> +1. I think that even with these caveats (read-only, some data may require 
> re-interpretation) it would still be a great help for accessing legacy data, 
> for which the original source may no longer exist.
>
> > On 23 Jan 2019, at 15:11, Simon Willnauer  wrote:
> >
> > Hey folks,
> >
> > tl;dr; I want to be able to open an indexreader on an old index if the
> > SegmentInfo version is supported and all segment codecs are available.
> > Today that's not possible even if I port old formats to current
> > versions.
> >
> > Our BWC policy for quite a while has been N-1 major versions. That's
> > good and I think we should keep it that way. Only recently, because of
> > changes to how we encode/decode norms, we also hard-enforce the
> > index-version-created in several places, as well as the version a
> > segment was written with. These are great enforcements and I
> > understand why. My request here is whether we can find consensus on
> > allowing some way (a special DirectoryReader, for instance) to open
> > such an index for reading only, without the guarantees our high-level
> > APIs normally provide, e.g. that norms are decoded correctly. This
> > would be enough to, for instance, consume stored fields etc. for
> > reindexing or, if users are aware of the issue, to do their own norms
> > decoding in the codec. I am happy to work on a proposal for how this
> > would work. It would still not allow writing or anything like that. I
> > am also all for putting such a reader into misc and marking it
> > experimental.
> >
> > simon
> >
> >
>
>
>




Re: [DISCUSS] Opening old indices for reading

2019-01-23 Thread Andrzej Białecki
+1. I think that even with these caveats (read-only, some data may require 
re-interpretation) it would still be a great help for accessing legacy data, 
for which the original source may no longer exist.

> On 23 Jan 2019, at 15:11, Simon Willnauer  wrote:
> 
> Hey folks,
> 
> tl;dr; I want to be able to open an indexreader on an old index if the
> SegmentInfo version is supported and all segment codecs are available.
> Today that's not possible even if I port old formats to current
> versions.
> 
> Our BWC policy for quite a while has been N-1 major versions. That's
> good and I think we should keep it that way. Only recently, because of
> changes to how we encode/decode norms, we also hard-enforce the
> index-version-created in several places, as well as the version a
> segment was written with. These are great enforcements and I
> understand why. My request here is whether we can find consensus on
> allowing some way (a special DirectoryReader, for instance) to open
> such an index for reading only, without the guarantees our high-level
> APIs normally provide, e.g. that norms are decoded correctly. This
> would be enough to, for instance, consume stored fields etc. for
> reindexing or, if users are aware of the issue, to do their own norms
> decoding in the codec. I am happy to work on a proposal for how this
> would work. It would still not allow writing or anything like that. I
> am also all for putting such a reader into misc and marking it
> experimental.
> 
> simon
> 
> 





[DISCUSS] Opening old indices for reading

2019-01-23 Thread Simon Willnauer
Hey folks,

tl;dr; I want to be able to open an indexreader on an old index if the
SegmentInfo version is supported and all segment codecs are available.
Today that's not possible even if I port old formats to current
versions.

Our BWC policy for quite a while has been N-1 major versions. That's
good and I think we should keep it that way. Only recently, because of
changes to how we encode/decode norms, we also hard-enforce the
index-version-created in several places, as well as the version a
segment was written with. These are great enforcements and I
understand why. My request here is whether we can find consensus on
allowing some way (a special DirectoryReader, for instance) to open
such an index for reading only, without the guarantees our high-level
APIs normally provide, e.g. that norms are decoded correctly. This
would be enough to, for instance, consume stored fields etc. for
reindexing or, if users are aware of the issue, to do their own norms
decoding in the codec. I am happy to work on a proposal for how this
would work. It would still not allow writing or anything like that. I
am also all for putting such a reader into misc and marking it
experimental.

simon
