[jira] [Commented] (LUCENE-9324) Give IDs to SegmentCommitInfo

2020-04-22 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089975#comment-17089975
 ] 

David Smiley commented on LUCENE-9324:
--

I really appreciate that explanation Simon; thanks!

> Give IDs to SegmentCommitInfo
> -
>
> Key: LUCENE-9324
> URL: https://issues.apache.org/jira/browse/LUCENE-9324
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Adrien Grand
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: master (9.0), 8.6
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> We already have IDs in SegmentInfo, which are useful to uniquely identify 
> segments. Having IDs on SegmentCommitInfo would be useful too in order to 
> compare commits for equality and make snapshots incremental on generational 
> files too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9324) Give IDs to SegmentCommitInfo

2020-04-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17086441#comment-17086441
 ] 

ASF subversion and git services commented on LUCENE-9324:
-

Commit 2d63a9d1208bbf950135b90496268b0a40e119b5 in lucene-solr's branch 
refs/heads/branch_8x from Simon Willnauer
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=2d63a9d ]

LUCENE-9324: Add an ID to SegmentCommitInfo (#1434)

We already have IDs on SegmentInfo, as well as on SegmentInfos, which are useful
to uniquely identify segments and entire commits. Having IDs on
SegmentCommitInfo is useful too, in order to compare commits for equality and
make snapshots incremental on generational files. This change adds a unique ID
to SegmentCommitInfo starting from Lucene 8.6. Older segments won't have an ID
until the segment receives an update or a delete, even if they have been opened
and / or committed by Lucene 8.6 or above.
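
Because older segments may carry no commit-level ID, code that compares two commits has to allow for a third outcome besides "same" and "different". The following is a minimal sketch of that idea and not Lucene code: the class name, the enum, and the use of null to model an absent ID are assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: compares two commit-level IDs
// where either one may be absent.
final class CommitIdFallbackSketch {

    enum Match { SAME, DIFFERENT, UNKNOWN }

    // A null argument models a segment that has no ID yet, e.g. one written
    // before 8.6 that has not received an update or a delete since.
    static Match compare(byte[] idA, byte[] idB) {
        if (idA == null || idB == null) {
            // Without both IDs nothing can be concluded from IDs alone.
            return Match.UNKNOWN;
        }
        return Arrays.equals(idA, idB) ? Match.SAME : Match.DIFFERENT;
    }
}
```

A caller that gets UNKNOWN would have to fall back to whatever comparison it used before IDs existed, such as treating the generational files as changed.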





[jira] [Commented] (LUCENE-9324) Give IDs to SegmentCommitInfo

2020-04-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17086438#comment-17086438
 ] 

ASF subversion and git services commented on LUCENE-9324:
-

Commit 113043b1ed2ac95de17f6bdd203f6050ff6ca1f7 in lucene-solr's branch 
refs/heads/master from Simon Willnauer
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=113043b ]

LUCENE-9324: Add an ID to SegmentCommitInfo (#1434)

We already have IDs on SegmentInfo, as well as on SegmentInfos, which are useful
to uniquely identify segments and entire commits. Having IDs on
SegmentCommitInfo is useful too, in order to compare commits for equality and
make snapshots incremental on generational files.
This change adds a unique ID to SegmentCommitInfo starting from Lucene 8.6.
Older segments won't have an ID until the segment receives an update or a
delete, even if they have been opened and / or committed by Lucene 8.6 or above.




[jira] [Commented] (LUCENE-9324) Give IDs to SegmentCommitInfo

2020-04-16 Thread Simon Willnauer (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084682#comment-17084682
 ] 

Simon Willnauer commented on LUCENE-9324:
-

I am trying to give a bit more context to this issue. Today we have 
_SegmentInfo_, which represents a segment once it's written to disk, for 
instance at flush or merge time. _SegmentInfo_ carries a randomly generated ID 
that can be used to verify whether two segments are the same. Since we use 
incremental numbers for segment naming, it's likely that two IndexWriters 
produce segments with very similar contents and the same name. Yet the 
_SegmentInfo_ IDs would be different. In addition to this ID we also have 
checksums on files, which can be used alongside the ID to verify identity but 
should not be treated as proof of identity by themselves, since they are very 
weak checksums. 
Now segments also get _updated_, for instance when a document is marked as 
deleted or the segment receives a doc values update. The only thing that 
changes is the delete or update generation, which allows two IndexWriters that 
opened two copies of a segment (with the same segment ID) to each produce a new 
delGen or dvGen that looks identical from the outside but is actually 
different. This is a problem that we see quite frequently in Elasticsearch, and 
we'd like to prevent it or have a better tool in our hands to distinguish 
_SegmentCommitInfo_ instances from one another. If we had an ID on 
_SegmentCommitInfo_ that changes each time one of these generations changes, we 
could much more easily tell whether only the updated files (which are often 
very small) need to be replaced in order to recover an index. 

The plan is to implement this in a very similar fashion to what we did for 
_SegmentInfo_, but also to invalidate the ID once any of the generations 
changes, in order to force a new _SegmentCommitInfo_ ID for the new generation. 
Yet the IDs would not be the same if two IndexWriters start from the same 
segment and make an identical change to it, i.e. it's not a replacement for a 
strong hash function.
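
To make the scenario concrete, here is a small self-contained model of the behaviour described above. It is a sketch under stated assumptions and not Lucene's implementation: CommitRecord, copy(), advanceDelGen(), advanceDocValuesGen() and sameCommit() are hypothetical names, the 16-byte ID length is chosen for illustration, and the generation counters simply start at 0.

```java
import java.security.SecureRandom;
import java.util.Arrays;

// Hypothetical model of a per-commit segment record; not a Lucene class.
final class CommitIdSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    // Returns 16 random bytes; the length is an assumption of this sketch.
    static byte[] randomId() {
        byte[] id = new byte[16];
        RANDOM.nextBytes(id);
        return id;
    }

    static final class CommitRecord {
        final byte[] segmentId; // fixed once the segment is written
        long delGen;            // delete generation (0 = none, sketch convention)
        long dvGen;             // doc values update generation
        byte[] commitId;        // regenerated whenever a generation advances

        CommitRecord(byte[] segmentId) {
            this(segmentId, 0, 0, randomId());
        }

        private CommitRecord(byte[] segmentId, long delGen, long dvGen, byte[] commitId) {
            this.segmentId = segmentId.clone();
            this.delGen = delGen;
            this.dvGen = dvGen;
            this.commitId = commitId.clone();
        }

        // Models a byte-for-byte copy of the segment opened by another writer.
        CommitRecord copy() {
            return new CommitRecord(segmentId, delGen, dvGen, commitId);
        }

        // A delete bumps the generation and invalidates the old commit ID.
        void advanceDelGen() {
            delGen++;
            commitId = randomId();
        }

        // A doc values update does the same for its own generation.
        void advanceDocValuesGen() {
            dvGen++;
            commitId = randomId();
        }
    }

    // Two records describe the same commit only if both IDs match.
    static boolean sameCommit(CommitRecord a, CommitRecord b) {
        return Arrays.equals(a.segmentId, b.segmentId)
            && Arrays.equals(a.commitId, b.commitId);
    }
}
```

In this model, two copies of a record compare as the same commit until each side calls advanceDelGen(). Afterwards both report the same delGen, yet sameCommit() returns false because each side drew its own random ID. That also shows the limitation mentioned above: identical changes made independently still produce different IDs, so the ID can confirm sameness but cannot stand in for a content hash.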



