[jira] Commented: (LUCENE-1789) getDocValues should provide a MultiReader DocValues abstraction

Michael McCandless (JIRA) Fri, 07 Aug 2009 09:03:38 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740604#action_12740604
 ]


Michael McCandless commented on LUCENE-1789:
--------------------------------------------


bq. Correct: a change like this could cause 2.9 to introduce a time based 
performance hit from the added method call to resolve the sub(reader|docvalue) 
on each method call ... but if we don't have a change like this, 2.9 could 
introduce a memory based performance hit from the other FieldCache changes as 
it client code accessing DocValues for the top level reader will create a 
duplication of the whole array.

True, and of the two, I agree a hidden time cost is the lesser evil.

But I'd prefer to not hide the cost, ie, encourage/force an explicit
choice when users upgrade to 2.9.  If we can't think of some realistic
way to do that, then I agree we should go forward with the current
approach...

bq. Incidently: I'm willing to believe you that the time based perf hit would 
be high, but my instinct is that it wouldn't be that bad: the DocValues API 
already introduces at least one method call per doc lookup (two depending on 
datatype). adding a second method call to delegate to a sub-DocValues isntance 
doesn't seem that bad (especially since a new MultDocValues class could get the 
subReader list and compute the docId offsets on init, and then reuse them on 
each method call)

It's the added binary search in ReaderUtil.subSearcher that worries
me.

{quote} 
bq. In the core I think we should always switch "up high".

(In case there is any confusion: wasn't suggesting that we stop using "up high" 
switching on DocValues in code included in the Lucene dist, i was suggesting 
that if someone uses DocValues directly in their code (against a top level 
reader) then we help them out by giving them the "down low" switching ... so 
"expected" usages wouldn't pay the added time based hit, just "unexpected" 
usages (which would be saved from the memory hit))
{quote}

Understood.  We are only talking about external usages of these APIs,
and even then, exceptionally advance usage.  Ie, users who make their
own ValueSourceQuery and then run it against an IndexSearcher will be
fine.  It's only people who directly invoke getValues, w/ some random
reader, that hit the hidden cost.

{quote}
bq. But, if we make the proposed change here, the app could in fact just keep 
working off the top-level values (eg if the ctor in their class is pulling 
these values), thinking everything is fine when in fact there is a sizable, 
silent perf hit.

I agree ... but unless i'm missing something about the code on the trunk, that 
situation already exists: the developer might switch to using the Collector 
API, but nothing about the current trunk will prevent/warn him that this...

ValueSource vs = new ValueSource("aFieldIAlsoSortOn");
IndexReader r = getCurrentReaderThatCouldBeAMultiReader();
DocValues vals = vs.getDocValues(r);
...could have a sizable, silent, memory perf hit in 2.9

(ValueSource.getValues has a javadoc indicating that caching will be done on 
the IndexReader passed in, but your comment suggests that if 2.9 were released 
today (with hte current trunk) people upgrading would have some obvious way of 
noticing that they need to pass a sub reader to getValues)
{quote}

How about this: we add a new param to the ctors of the value sources,
called (say) acceptMultiReader.  It has 3 values:

  - NO means an exception is thrown on seeing a top reader (where "top
    reader" means any reader whose getSequentialSubReaders is
    non-null)

  - YES_BURN_TIME means accept the top reader and make a
     MultiDocValues

  - YES_BURN_MEMORY means use the top reader against the field cache

We deprecate the existing ctors, so on moving to 3.0 you have to make
an explicit choice, but default it to YES_BURN_TIME.

One benefit of making the choice explicit is for those apps that have
memory to burn they may in fact choose to burn it.

Would this give a clean migration path forward?


> getDocValues should provide a MultiReader DocValues abstraction
> ---------------------------------------------------------------
>
>                 Key: LUCENE-1789
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1789
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Hoss Man
>            Priority: Minor
>             Fix For: 2.9
>
>
> When scoring a ValueSourceQuery, the scoring code calls 
> ValueSource.getValues(reader) on *each* leaf level subreader -- so DocValue 
> instances are backed by the individual FieldCache entries of the subreaders 
> -- but if Client code were to inadvertently  called getValues() on a 
> MultiReader (or DirectoryReader) they would wind up using the "outer" 
> FieldCache.
> Since getValues(IndexReader) returns DocValues, we have an advantage here 
> that we don't have with FieldCache API (which is required to provide direct 
> array access). getValues(IndexReader) could be implimented so that *IF* some 
> a caller inadvertently passes in a reader with non-null subReaders, getValues 
> could generate a DocValues instance for each of the subReaders, and then wrap 
> them in a composite "MultiDocValues".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] Commented: (LUCENE-1789) getDocValues should provide a MultiReader DocValues abstraction

Reply via email to