[
https://issues.apache.org/jira/browse/LUCENE-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shai Erera updated LUCENE-5189:
-------------------------------
Attachment: LUCENE-5189.patch
Patch adds some nocommits and tests that expose some problems:
+Problem 1+
If you run the test with {{-Dtests.method=testSegmentMerges -Dtests.seed=7651E2AEEBC55BDF}}, you'll hit an exception:
{noformat}
NOTE: reproduce with: ant test -Dtestcase=TestNumericDocValuesUpdates -Dtests.method=testSegmentMerges -Dtests.seed=7651E2AEEBC55BDF -Dtests.locale=en_AU -Dtests.timezone=Etc/GMT+11 -Dtests.file.encoding=UTF-8
Aug 29, 2013 11:57:35 AM com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
WARNING: Will linger awaiting termination of 1 leaked thread(s).
Aug 29, 2013 11:57:35 AM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[Lucene Merge Thread #0,6,TGRP-TestNumericDocValuesUpdates]
org.apache.lucene.index.MergePolicy$MergeException: java.lang.AssertionError: formatName=Lucene45 prevValue=Memory
    at __randomizedtesting.SeedInfo.seed([7651E2AEEBC55BDF]:0)
    at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
Caused by: java.lang.AssertionError: formatName=Lucene45 prevValue=Memory
    at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.getInstance(PerFieldDocValuesFormat.java:133)
    at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.addNumericField(PerFieldDocValuesFormat.java:105)
    at org.apache.lucene.index.ReadersAndLiveDocs.writeLiveDocs(ReadersAndLiveDocs.java:389)
    at org.apache.lucene.index.ReadersAndLiveDocs.getReader(ReadersAndLiveDocs.java:178)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3732)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3401)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
{noformat}
What happens is that the test uses RandomCodec, which picks MemoryDVF for writing that field. Later, when ReadersAndLiveDocs applies updates to that field, it uses SI.codec, which is no longer RandomCodec but Lucene45Codec (or in this case Facet45Codec, based on Codec.forName("Lucene45")), and its DVF returns Lucene45DVF for that field, because Lucene45Codec always returns that. During search, PerFieldDVF.FieldsReader does not rely on the Codec at all; instead it looks up an attribute in FieldInfo which tells it the DVFormat.name, and then calls DVF.forName. For writing, however, it relies on the Codec.
I am not sure how to resolve this. I don't think ReadersAndLiveDocs is doing anything wrong -- per-field is not exposed on the Codec API, so it shouldn't assume it needs to do any per-field stuff. But on the other hand, Lucene45Codec instances return the per-field DVF based on what the instance says, and don't look at the FieldInfo attributes the way PerFieldDVF.FieldsReader does. Any ideas?
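To make the asymmetry concrete, here is a minimal, self-contained sketch of the two resolution paths. These are stand-in classes, not the real Lucene ones (the attribute key and registry are invented for illustration): the read path resolves the format from the name recorded in FieldInfo attributes, while the write path asks the current segment codec, so the two can disagree after the codec instance changes.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in sketch (NOT real Lucene code) of the read/write asymmetry described above.
public class PerFieldAsymmetry {
    // Registry playing the role of DocValuesFormat.forName().
    static final Map<String, String> FORMATS = new HashMap<>();
    static {
        FORMATS.put("Memory", "MemoryDVF");
        FORMATS.put("Lucene45", "Lucene45DVF");
    }

    // Read path: resolves the format from the name recorded in FieldInfo attributes.
    static String readerFormat(Map<String, String> fieldInfoAttributes) {
        return FORMATS.get(fieldInfoAttributes.get("PerFieldDVF.format"));
    }

    // Write path: asks the current segment codec, ignoring FieldInfo attributes.
    static String writerFormat(String segmentCodecDefault) {
        return FORMATS.get(segmentCodecDefault);
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new HashMap<>();
        attrs.put("PerFieldDVF.format", "Memory"); // field was originally written with MemoryDVF

        String read = readerFormat(attrs);         // resolves to "MemoryDVF" -- correct
        String write = writerFormat("Lucene45");   // resolves to "Lucene45DVF" -- the mismatch that trips the assert
        System.out.println(read + " vs " + write);
    }
}
```

The sketch only shows why the assert (formatName=Lucene45, prevValue=Memory) fires: the writer's answer is derived from a different source of truth than the reader's.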
+Problem 2+
Robert thought of this use case: if you have a sparse DocValues field 'f', such that, say, in segment 1 only doc1 has a value but in segment 2 none of the documents has a value, you cannot really update documents in segment 2, because the FieldInfos for that segment won't list the field as having DocValues at all. For now, I catch that case in ReadersAndLiveDocs and throw an exception.
The workaround is to make sure you always have values for a field in a segment,
by e.g. always setting some default value. But this is ugly and exposes
internal stuff (e.g. segments) to users. Also, it's bad because e.g. if
segments 1+2 are merged, you suddenly *can* update documents that were in
segment2 before.
A way to solve it is to gen FieldInfos as well. That would additionally allow supporting the addition of new fields through field updates, though that's optional and we can still choose to forbid it. If we gen FieldInfos though, the changes I've made to SegmentInfos (recording per-field dvGen) need to be reverted, so it's important that we come to a resolution about this in this issue. This is somewhat of a corner case (sparse fields), but I don't like the fact that users can trip on exceptions that depend on whether or not the segment was merged...
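A tiny stand-in sketch (again, not real Lucene code) of why this behavior is surprising from the user's point of view: whether the update is legal depends on per-segment FieldInfos, which the user never sees, and a merge silently changes the answer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Stand-in sketch (NOT real Lucene code) of the sparse-field update problem.
public class SparseFieldUpdate {
    // Each segment's FieldInfos record which fields have DocValues.
    static boolean canUpdate(Set<String> segmentDocValuesFields, String field) {
        return segmentDocValuesFields.contains(field);
    }

    public static void main(String[] args) {
        Set<String> segment1 = new HashSet<>(Arrays.asList("f")); // doc1 in segment 1 has a value for 'f'
        Set<String> segment2 = new HashSet<>();                   // no document in segment 2 has 'f'

        System.out.println(canUpdate(segment1, "f")); // true  -- update succeeds
        System.out.println(canUpdate(segment2, "f")); // false -- the exception case caught in ReadersAndLiveDocs

        // After merging segments 1+2, the merged segment's FieldInfos list 'f',
        // so documents that came from segment 2 suddenly become updatable.
        Set<String> merged = new HashSet<>(segment1);
        merged.addAll(segment2);
        System.out.println(canUpdate(merged, "f"));   // true
    }
}
```

Genning FieldInfos would remove this per-segment dependence, at the cost of reverting the per-field dvGen recording in SegmentInfos.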
+Problem 3+
FieldInfos.Builder neglects to update the globalFieldNumbers.docValuesType map if it updates a FieldInfo's DocValuesType. It's an easy fix, and I added a test to the numeric updates. If someone has an idea how to reproduce this outside of the numeric-updates scope, I'll be happy to handle it in a separate issue. The problem is that if you add two fields with the same name to a document, one as a posting and one as DV, globalFieldNumbers does not record that field name in its docValuesType map. This map, however, is currently used only by IW.updateNumericDVField and by an assert in FieldInfos.Builder, therefore I wasn't able to reproduce it outside of the numeric-updates scope.
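The shape of the bug, as a stand-in sketch (not the real FieldInfos.Builder; the method and field names here are invented): the per-field state is updated when the DV type arrives on a second occurrence of the same field name, but the global map never learns about it.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in sketch (NOT real Lucene code) of the FieldInfos.Builder bug described above.
public class GlobalFieldNumbersSketch {
    final Map<String, String> docValuesType = new HashMap<>(); // global: field name -> DV type
    String perFieldType;                                       // the FieldInfo's own DV type

    // Buggy behavior (what the patch fixes): the FieldInfo is updated,
    // but globalFieldNumbers.docValuesType is left untouched.
    void addOrUpdateBuggy(String name, String dvType) {
        perFieldType = dvType;              // FieldInfo updated...
        // docValuesType.put(name, dvType); // ...but this update was missing
    }

    public static void main(String[] args) {
        GlobalFieldNumbersSketch b = new GlobalFieldNumbersSketch();
        b.addOrUpdateBuggy("f", null);      // field 'f' first added as a posting, no DV
        b.addOrUpdateBuggy("f", "NUMERIC"); // same name added again as a DV field

        System.out.println(b.perFieldType);           // NUMERIC -- the FieldInfo knows
        System.out.println(b.docValuesType.get("f")); // null -- the global map never learned about 'f'
    }
}
```

Since the global map is only consulted by IW.updateNumericDVField and one assert, only the numeric-updates path observes the stale entry, which is why the repro lives in that test.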
> Numeric DocValues Updates
> -------------------------
>
> Key: LUCENE-5189
> URL: https://issues.apache.org/jira/browse/LUCENE-5189
> Project: Lucene - Core
> Issue Type: New Feature
> Components: core/index
> Reporter: Shai Erera
> Assignee: Shai Erera
> Attachments: LUCENE-5189.patch, LUCENE-5189.patch, LUCENE-5189.patch,
> LUCENE-5189.patch
>
>
> In LUCENE-4258 we started to work on incremental field updates, however the
> amount of changes is immense and hard to follow/consume. The reason is that
> we targeted postings, stored fields, DV etc., all from the get-go.
> I'd like to start afresh here, with numeric-dv-field updates only. There are
> a couple of reasons for that:
> * NumericDV fields should be easier to update, if e.g. we write all the
> values of all the documents in a segment for the updated field (similar to
> how livedocs work, and previously norms).
> * It's a fairly contained issue, attempting to handle just one data type to
> update, yet requires many changes to core code which will also be useful for
> updating other data types.
> * It has value in and of itself, and we don't need to allow updating all the
> data types in Lucene at once ... we can do that gradually.
> I have some working patch already which I'll upload next, explaining the
> changes.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]