mikemccand commented on code in PR #12872:
URL: https://github.com/apache/lucene/pull/12872#discussion_r1418947126
##########
lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:
##########
@@ -873,13 +876,26 @@ public int compare(String a, String b) {
if (0 == result.numBadSegments) {
result.clean = true;
} else {
- msg(
- infoStream,
- "WARNING: "
- + result.numBadSegments
- + " broken segments (containing "
- + result.totLoseDocCount
- + " documents) detected");
+ if (result.numBadSegmentsWithoutSegmentInfo > 0) {
+ msg(
+ infoStream,
+ "WARNING: "
+ + result.numBadSegments
+ + " broken segments (containing "
Review Comment:
Maybe say `containing at least`?
##########
lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:
##########
@@ -664,7 +667,7 @@ public int compare(String a, String b) {
// failure (fastFail == false). so if we get here, we should // always
have a valid lastCommit:
assert lastCommit != null;
- if (lastCommit == null) {
+ if (lastCommit == null) { // TODO: Remove/move unreachable if statement?
Review Comment:
Hmm this might still happen? E.g. some transient `IOException` or an Access
Denied or so?
##########
lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:
##########
@@ -957,6 +974,9 @@ private Status.SegmentInfoStatus testSegment(
SegmentReader reader = null;
try {
+ if (info.info.maxDoc() == 0) {
+ throw new CheckIndexException(" this segment has missing or corrupt
segment info");
Review Comment:
Remove the extra space before `this`?
Also, can we include specifics about which segment we were unable to read
the `.si` for in the exception message?
##########
lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java:
##########
@@ -389,13 +389,39 @@ private static void parseSegmentInfos(
}
long totalDocs = 0;
+ SegmentInfo info;
for (int seg = 0; seg < numSegments; seg++) {
String segName = input.readString();
byte[] segmentID = new byte[StringHelper.ID_LENGTH];
input.readBytes(segmentID, 0, segmentID.length);
Codec codec = readCodec(input);
- SegmentInfo info =
- codec.segmentInfoFormat().read(directory, segName, segmentID,
IOContext.READ);
+ try {
+ info = codec.segmentInfoFormat().read(directory, segName, segmentID,
IOContext.READ);
+ } catch (FileNotFoundException | NoSuchFileException e) {
Review Comment:
Hmm can we upgrade this to catch any "normal" (not crazy things like OOME)
`Throwable`? E.g. if the `.si` file had some bits scrambled or what not we
will get a more exotic exception or maybe an exception because the checksum is
mismatched.
Can we somehow return the root cause exception here and include it in
`CheckIndexException` that we raise above?
##########
lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:
##########
@@ -957,6 +974,9 @@ private Status.SegmentInfoStatus testSegment(
SegmentReader reader = null;
try {
+ if (info.info.maxDoc() == 0) {
Review Comment:
Ahh I see, you now separately handle the `maxDoc == 0` case, here. But,
what if an index has an illegal 0 doc segment, and an intact `.si`? We will
now (incorrectly) state that it is missing its `SegmentInfo`? Maybe we could
add a separate boolean somewhere to track `segmentInfoMissing` or so?
##########
lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java:
##########
@@ -389,13 +389,39 @@ private static void parseSegmentInfos(
}
long totalDocs = 0;
+ SegmentInfo info;
for (int seg = 0; seg < numSegments; seg++) {
String segName = input.readString();
byte[] segmentID = new byte[StringHelper.ID_LENGTH];
input.readBytes(segmentID, 0, segmentID.length);
Codec codec = readCodec(input);
- SegmentInfo info =
- codec.segmentInfoFormat().read(directory, segName, segmentID,
IOContext.READ);
+ try {
+ info = codec.segmentInfoFormat().read(directory, segName, segmentID,
IOContext.READ);
+ } catch (FileNotFoundException | NoSuchFileException e) {
+ // https://github.com/apache/lucene/issues/7820 - If we throw a
CorruptIndex exception here,
+ // a missing segment info will cause parseSegmentInfos() to
short-circuit and stop reading
+ // the rest of the segments. So, when encountering a segment with a
missing
+ // segment info (.si) file, we initialize an empty segment info
(containing 0 docs)
+ // to keep track of the segment name, so that we can continue reading
other segments, and
+ // the segment with the missing .si file can be later validated or
exorcised by CheckIndex()
+ info =
Review Comment:
Hmm I don't like that this will suppress corruption when `SegmentInfos` is
being loaded NOT from `CheckIndex`? This is the wrong layer to have such
leniency? This method should throw the exception back up to the caller and
caller should handle properly?
Could we instead 1) create a subclass of `CorruptIndexException`, maybe
`CorruptSegmentInfoException` or so, 2) require that caller pass in the segment
name, and root cause exception, when creating it, and 3) catch any `Throwable`
here and re-throw that subclass?
Then we can fix `CheckIndex` to specifically catch that, pull the segment
name out, and do its thing.
##########
lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:
##########
@@ -948,7 +965,7 @@ private Status.SegmentInfoStatus testSegment(
segInfoStat.maxDoc = info.info.maxDoc();
final Version version = info.info.getVersion();
- if (info.info.maxDoc() <= 0) {
+ if (info.info.maxDoc() < 0) {
Review Comment:
Hmm why did you have to remove the `0` case? Lucene considers this
corruption (IW should never flush or merge to a 0 doc segment).
##########
lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:
##########
@@ -4270,11 +4290,21 @@ public int doCheck(Options opts) throws IOException,
InterruptedException {
+ result.totLoseDocCount
+ " documents would be lost, if -exorcise were specified\n");
} else {
- opts.out.println("WARNING: " + result.totLoseDocCount + " documents
will be lost\n");
- opts.out.println(
- "NOTE: will write new segments file in 5 seconds; this will remove
"
- + result.totLoseDocCount
- + " docs from the index. YOU WILL LOSE DATA. THIS IS YOUR LAST
CHANCE TO CTRL+C!");
+ if (result.numBadSegmentsWithoutSegmentInfo > 0) {
+ opts.out.println(
+ "WARNING: " + result.totLoseDocCount + " or more documents will
be lost\n");
Review Comment:
Can we change the wording to `at least ...`? I think that's stronger / more
alarming on casual read than ` or more`.
Or maybe `more than XX documents will be lost`, since we (almost certainly)
know the lost segment(s) have `maxDoc > 0`?
Whichever wording we use, let's be consistent in the thrown exception
message above too.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]