Nhat Nguyen created LUCENE-10518:
------------------------------------

             Summary: FieldInfos consistency check can refuse to open Lucene 8 index
                 Key: LUCENE-10518
                 URL: https://issues.apache.org/jira/browse/LUCENE-10518
             Project: Lucene - Core
          Issue Type: Bug
          Components: core/index
    Affects Versions: 8.10.1
            Reporter: Nhat Nguyen

A field-infos consistency check introduced in Lucene 9 (LUCENE-9334) can refuse to open a Lucene 8 index. Lucene 8 can create a partial FieldInfo when it hits a non-aborting exception (for example [term is too long|https://github.com/apache/lucene-solr/blob/6a6484ba396927727b16e5061384d3cd80d616b2/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L944]) while processing the fields of a document. We don't have this problem in Lucene 9 because we process fields in two phases, with the [first phase|https://github.com/apache/lucene/blob/10ebc099c846c7d96f4ff5f9b7853df850fa8442/lucene/core/src/java/org/apache/lucene/index/IndexingChain.java#L589-L614] processing only FieldInfos.

The issue can be reproduced with this snippet.

{code:java}
// Runs inside a LuceneTestCase subclass (newDirectory, expectThrows and fail come from there).
public void testWriteIndexOn8x() throws Exception {
  FieldType KeywordField = new FieldType();
  KeywordField.setTokenized(false);
  KeywordField.setOmitNorms(true);
  KeywordField.setIndexOptions(IndexOptions.DOCS);
  KeywordField.freeze();

  try (Directory dir = newDirectory()) {
    IndexWriterConfig config = new IndexWriterConfig();
    config.setCommitOnClose(false);
    config.setMergePolicy(NoMergePolicy.INSTANCE);
    try (IndexWriter writer = new IndexWriter(dir, config)) {
      // first segment
      writer.addDocument(new Document()); // an empty doc
      Document d1 = new Document();
      byte[] chars = new byte[IndexWriter.MAX_STORED_STRING_LENGTH + 1];
      Arrays.fill(chars, (byte) 'a');
      d1.add(new Field("field", new BytesRef(chars), KeywordField));
      d1.add(new BinaryDocValuesField("field", new BytesRef(chars)));
      // d1 is rejected with a non-aborting exception
      expectThrows(IllegalArgumentException.class, () -> writer.addDocument(d1));
      writer.flush();

      // second segment
      Document d2 = new Document();
      d2.add(new Field("field", new BytesRef("hello world"), KeywordField));
      d2.add(new SortedDocValuesField("field", new BytesRef("hello world")));
      writer.addDocument(d2);
      writer.flush();
      writer.commit();

      // Check for doc values types consistency
      Map<String, DocValuesType> docValuesTypes = new HashMap<>();
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        for (LeafReaderContext leaf : reader.leaves()) {
          for (FieldInfo fi : leaf.reader().getFieldInfos()) {
            DocValuesType current = docValuesTypes.putIfAbsent(fi.name, fi.getDocValuesType());
            if (current != null && current != fi.getDocValuesType()) {
              fail("cannot change DocValues type from " + current
                  + " to " + fi.getDocValuesType()
                  + " for field \"" + fi.name + "\"");
            }
          }
        }
      }
    }
  }
}
{code}

I would like to propose that we:
- Backport the two-phase fields processing from Lucene 9 to Lucene 8. The patch should be small and contained.
- Introduce an option in Lucene 9 to skip checking field-infos consistency (i.e., behave like Lucene 8 when the option is enabled).

/cc [~mayya] and [~jpountz]
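
To make the difference between the two processing strategies concrete, here is a minimal, Lucene-free sketch. It is only an illustration under stated assumptions: {{ToyField}}, {{singlePass}}, {{twoPhase}}, {{failingDoc}} and the toy {{MAX_TERM_LENGTH}} are hypothetical names, not Lucene APIs, and the real indexing chain does considerably more work. The sketch only shows why collecting the schema of every field before indexing any value avoids recording a partial schema when a non-aborting exception is thrown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TwoPhaseSketch {
    /** Hypothetical stand-in for one field instance of a document (not a Lucene class). */
    static final class ToyField {
        final String name;
        final boolean indexed;
        final String docValuesType; // "NONE", "BINARY", "SORTED", ...
        final int termLength;

        ToyField(String name, boolean indexed, String docValuesType, int termLength) {
            this.name = name;
            this.indexed = indexed;
            this.docValuesType = docValuesType;
            this.termLength = termLength;
        }
    }

    static final int MAX_TERM_LENGTH = 8; // toy limit, chosen only for this sketch

    /** Merges doc-values types for one field name; "NONE" yields to any concrete type. */
    static String mergeType(String a, String b) {
        if (a.equals("NONE")) return b;
        if (b.equals("NONE")) return a;
        if (!a.equals(b)) throw new IllegalArgumentException("cannot change type from " + a + " to " + b);
        return a;
    }

    /** Simulates indexing one value; an over-long term is a non-aborting failure. */
    static void index(ToyField f) {
        if (f.indexed && f.termLength > MAX_TERM_LENGTH) {
            throw new IllegalArgumentException("term is too long");
        }
    }

    /** Single pass: the schema is updated field by field, interleaved with indexing. */
    static Map<String, String> singlePass(List<ToyField> doc) {
        Map<String, String> schema = new LinkedHashMap<>();
        try {
            for (ToyField f : doc) {
                schema.merge(f.name, f.docValuesType, TwoPhaseSketch::mergeType);
                index(f); // may throw before later fields reach the schema
            }
        } catch (IllegalArgumentException e) {
            // the document is rejected, but the schema recorded so far is kept
        }
        return schema;
    }

    /** Two phases: the full schema is collected first, values are indexed afterwards. */
    static Map<String, String> twoPhase(List<ToyField> doc) {
        Map<String, String> schema = new LinkedHashMap<>();
        for (ToyField f : doc) { // phase 1: schema only
            schema.merge(f.name, f.docValuesType, TwoPhaseSketch::mergeType);
        }
        try {
            for (ToyField f : doc) { // phase 2: values
                index(f);
            }
        } catch (IllegalArgumentException e) {
            // the document is rejected, but the schema is already complete
        }
        return schema;
    }

    /** A document shaped like d1 in the report: an over-long indexed term plus binary doc values. */
    static List<ToyField> failingDoc() {
        List<ToyField> doc = new ArrayList<>();
        doc.add(new ToyField("field", true, "NONE", MAX_TERM_LENGTH + 1));
        doc.add(new ToyField("field", false, "BINARY", MAX_TERM_LENGTH + 1));
        return doc;
    }

    public static void main(String[] args) {
        System.out.println("single pass: " + singlePass(failingDoc())); // {field=NONE}
        System.out.println("two phase:   " + twoPhase(failingDoc()));   // {field=BINARY}
    }
}
```

In the single-pass variant the field ends up recorded without a doc-values type, so a later document could attach a different type to it. In the two-phase variant the type is fixed before any value is indexed.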