[
https://issues.apache.org/jira/browse/LUCENE-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shai Erera updated LUCENE-5189:
-------------------------------
Attachment: LUCENE-5189.patch
Patch adds some nocommits and tests that expose some problems:
+Problem 1+
If you run the test with {{-Dtests.method=testSegmentMerges -Dtests.seed=7651E2AEEBC55BDF}}, you'll hit an exception:
{noformat}
NOTE: reproduce with: ant test -Dtestcase=TestNumericDocValuesUpdates -Dtests.method=testSegmentMerges -Dtests.seed=7651E2AEEBC55BDF -Dtests.locale=en_AU -Dtests.timezone=Etc/GMT+11 -Dtests.file.encoding=UTF-8
Aug 29, 2013 11:57:35 AM com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
WARNING: Will linger awaiting termination of 1 leaked thread(s).
Aug 29, 2013 11:57:35 AM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[Lucene Merge Thread #0,6,TGRP-TestNumericDocValuesUpdates]
org.apache.lucene.index.MergePolicy$MergeException: java.lang.AssertionError: formatName=Lucene45 prevValue=Memory
    at __randomizedtesting.SeedInfo.seed([7651E2AEEBC55BDF]:0)
    at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
Caused by: java.lang.AssertionError: formatName=Lucene45 prevValue=Memory
    at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.getInstance(PerFieldDocValuesFormat.java:133)
    at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.addNumericField(PerFieldDocValuesFormat.java:105)
    at org.apache.lucene.index.ReadersAndLiveDocs.writeLiveDocs(ReadersAndLiveDocs.java:389)
    at org.apache.lucene.index.ReadersAndLiveDocs.getReader(ReadersAndLiveDocs.java:178)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3732)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3401)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
{noformat}
What happens is that the test uses RandomCodec, which picks MemoryDVF for writing that field. Later, when ReadersAndLiveDocs applies updates to that field, it uses SI.codec, which is no longer RandomCodec but Lucene45Codec (or in this case Facet45Codec, based on Codec.forName("Lucene45")), and its DVF returns Lucene45DVF for that field, because Lucene45Codec always returns that. During search, PerFieldDVF.FieldsReader does not rely on the Codec at all; instead it looks up an attribute in FieldInfo which tells it the DVFormat.name, and then calls DVF.forName. For writing, however, it relies on the Codec.
I am not sure how to resolve this. I don't think ReadersAndLiveDocs is doing anything wrong -- per-field is not exposed on the Codec API, so it shouldn't assume it needs to do any per-field stuff. But on the other hand, Lucene45Codec instances return the per-field DVF based on what the instance says, and don't look at the FieldInfo attributes the way PerFieldDVF.FieldsReader does. Any ideas?
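To make the asymmetry concrete, here is a minimal, self-contained sketch of the two resolution paths. These are stand-in classes, not the real Lucene ones (the attribute key and registry are invented for illustration): the read path resolves the format from the name recorded in FieldInfo attributes, while the write path asks the current segment codec, so the two can disagree after the codec instance changes.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in sketch (NOT real Lucene code) of the read/write asymmetry described above.
public class PerFieldAsymmetry {
    // Registry playing the role of DocValuesFormat.forName().
    static final Map<String, String> FORMATS = new HashMap<>();
    static {
        FORMATS.put("Memory", "MemoryDVF");
        FORMATS.put("Lucene45", "Lucene45DVF");
    }

    // Read path: resolves the format from the name recorded in FieldInfo attributes.
    static String readerFormat(Map<String, String> fieldInfoAttributes) {
        return FORMATS.get(fieldInfoAttributes.get("PerFieldDVF.format"));
    }

    // Write path: asks the current segment codec, ignoring FieldInfo attributes.
    static String writerFormat(String segmentCodecDefault) {
        return FORMATS.get(segmentCodecDefault);
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new HashMap<>();
        attrs.put("PerFieldDVF.format", "Memory"); // field was originally written with MemoryDVF

        String read = readerFormat(attrs);         // resolves to "MemoryDVF" -- correct
        String write = writerFormat("Lucene45");   // resolves to "Lucene45DVF" -- the mismatch that trips the assert
        System.out.println(read + " vs " + write);
    }
}
```

The sketch only shows why the assert (formatName=Lucene45, prevValue=Memory) fires: the writer's answer is derived from a different source of truth than the reader's.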
+Problem 2+
Robert thought of this use case: if you have a sparse DocValues field 'f', such that, say, in segment 1 only doc1 has a value but in segment 2 none of the documents has a value, you cannot really update documents in segment 2, because the FieldInfos for that segment won't list the field as having DocValues at all. For now, I catch that case in ReadersAndLiveDocs and throw an exception.
The workaround is to make sure you always have values for a field in a segment,
by e.g. always setting some default value. But this is ugly and exposes
internal stuff (e.g. segments) to users. Also, it's bad because e.g. if
segments 1+2 are merged, you suddenly *can* update documents that were in
segment2 before.
A way to solve it is to gen FieldInfos as well. That would additionally allow supporting the addition of new fields through field updates, though that's optional and we can still choose to forbid it. If we gen FieldInfos though, the changes I've made to SegmentInfos (recording per-field dvGen) need to be reverted, so it's important that we come to a resolution about this in this issue. This is somewhat of a corner case (sparse fields), but I don't like the fact that users can trip on exceptions that depend on whether or not the segment was merged...
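A tiny stand-in sketch (again, not real Lucene code) of why this behavior is surprising from the user's point of view: whether the update is legal depends on per-segment FieldInfos, which the user never sees, and a merge silently changes the answer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Stand-in sketch (NOT real Lucene code) of the sparse-field update problem.
public class SparseFieldUpdate {
    // Each segment's FieldInfos record which fields have DocValues.
    static boolean canUpdate(Set<String> segmentDocValuesFields, String field) {
        return segmentDocValuesFields.contains(field);
    }

    public static void main(String[] args) {
        Set<String> segment1 = new HashSet<>(Arrays.asList("f")); // doc1 in segment 1 has a value for 'f'
        Set<String> segment2 = new HashSet<>();                   // no document in segment 2 has 'f'

        System.out.println(canUpdate(segment1, "f")); // true  -- update succeeds
        System.out.println(canUpdate(segment2, "f")); // false -- the exception case caught in ReadersAndLiveDocs

        // After merging segments 1+2, the merged segment's FieldInfos list 'f',
        // so documents that came from segment 2 suddenly become updatable.
        Set<String> merged = new HashSet<>(segment1);
        merged.addAll(segment2);
        System.out.println(canUpdate(merged, "f"));   // true
    }
}
```

Genning FieldInfos would remove this per-segment dependence, at the cost of reverting the per-field dvGen recording in SegmentInfos.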
+Problem 3+
FieldInfos.Builder neglects to update the globalFieldNumbers.docValuesType map if it updates a FieldInfo's DocValuesType. It's an easy fix, and I added a test to the numeric updates. If someone has an idea how to reproduce this outside of the numeric-updates scope, I'll be happy to handle it in a separate issue. The problem is that if you add two fields with the same name to a document, one as a posting and one as DV, globalFieldNumbers does not record that field name in its docValuesType map. This map, however, is currently used only by IW.updateNumericDVField and by an assert in FieldInfos.Builder, therefore I wasn't able to reproduce it outside of the numeric-updates scope.
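The shape of the bug, as a stand-in sketch (not the real FieldInfos.Builder; the method and field names here are invented): the per-field state is updated when the DV type arrives on a second occurrence of the same field name, but the global map never learns about it.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in sketch (NOT real Lucene code) of the FieldInfos.Builder bug described above.
public class GlobalFieldNumbersSketch {
    final Map<String, String> docValuesType = new HashMap<>(); // global: field name -> DV type
    String perFieldType;                                       // the FieldInfo's own DV type

    // Buggy behavior (what the patch fixes): the FieldInfo is updated,
    // but globalFieldNumbers.docValuesType is left untouched.
    void addOrUpdateBuggy(String name, String dvType) {
        perFieldType = dvType;              // FieldInfo updated...
        // docValuesType.put(name, dvType); // ...but this update was missing
    }

    public static void main(String[] args) {
        GlobalFieldNumbersSketch b = new GlobalFieldNumbersSketch();
        b.addOrUpdateBuggy("f", null);      // field 'f' first added as a posting, no DV
        b.addOrUpdateBuggy("f", "NUMERIC"); // same name added again as a DV field

        System.out.println(b.perFieldType);           // NUMERIC -- the FieldInfo knows
        System.out.println(b.docValuesType.get("f")); // null -- the global map never learned about 'f'
    }
}
```

Since the global map is only consulted by IW.updateNumericDVField and one assert, only the numeric-updates path observes the stale entry, which is why the repro lives in that test.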
> Numeric DocValues Updates
> -------------------------
>
> Key: LUCENE-5189
> URL: https://issues.apache.org/jira/browse/LUCENE-5189
> Project: Lucene - Core
> Issue Type: New Feature
> Components: core/index
> Reporter: Shai Erera
> Assignee: Shai Erera
> Attachments: LUCENE-5189.patch, LUCENE-5189.patch, LUCENE-5189.patch,
> LUCENE-5189.patch
>
>
> In LUCENE-4258 we started to work on incremental field updates, however the
> amount of changes is immense and hard to follow/consume. The reason is that
> we targeted postings, stored fields, DV etc., all from the get-go.
> I'd like to start afresh here, with numeric-dv-field updates only. There are
> a couple of reasons for that:
> * NumericDV fields should be easier to update, if e.g. we write all the
> values of all the documents in a segment for the updated field (similar to
> how livedocs work, and previously norms).
> * It's a fairly contained issue, attempting to handle just one data type to
> update, yet requires many changes to core code which will also be useful for
> updating other data types.
> * It has value in and of itself, and we don't need to allow updating all the
> data types in Lucene at once ... we can do that gradually.
> I have some working patch already which I'll upload next, explaining the
> changes.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]