I dug into this a bit and ran the test on the current 4.x branch.
With the default DVFormat as of 4.5 (which leaves values on disk)
I get this:
Scoring 25000000 documents with direct buffers (without square root) took 178
Scoring 25000000 documents with direct buffers (without square root) took 165
Scoring 25000000 documents with direct buffers (without square root) took 179
Scoring 25000000 documents with direct buffers (and a square root) took 276
Scoring 25000000 documents with direct buffers (and a square root) took 270
Scoring 25000000 documents with direct buffers (and a square root) took 269
Scoring 25000000 documents with a lucene value (without square root) source took 1361
Scoring 25000000 documents with a lucene value (without square root) source took 1347
Scoring 25000000 documents with a lucene value (without square root) source took 1350
Scoring 25000000 documents with a lucene value (without square root) source took 1353
Scoring 25000000 documents with a lucene value (without square root) source took 1354
Scoring 25000000 documents with a lucene value (and a square root) source took 1364
Scoring 25000000 documents with a lucene value (and a square root) source took 1358
Scoring 25000000 documents with a lucene value (and a square root) source took 1358
Scoring 25000000 documents with a lucene value (and a square root) source took 1408
Scoring 25000000 documents with a lucene value (and a square root) source took 1362
And with "Memory" DVFormat (default before 4.5) I get this:
Scoring 25000000 documents with direct buffers (without square root) took 180
Scoring 25000000 documents with direct buffers (without square root) took 163
Scoring 25000000 documents with direct buffers (without square root) took 180
Scoring 25000000 documents with direct buffers (and a square root) took 277
Scoring 25000000 documents with direct buffers (and a square root) took 280
Scoring 25000000 documents with direct buffers (and a square root) took 269
Scoring 25000000 documents with a lucene value (without square root) source took 1001
Scoring 25000000 documents with a lucene value (without square root) source took 593
Scoring 25000000 documents with a lucene value (without square root) source took 592
Scoring 25000000 documents with a lucene value (without square root) source took 645
Scoring 25000000 documents with a lucene value (without square root) source took 592
Scoring 25000000 documents with a lucene value (and a square root) source took 643
Scoring 25000000 documents with a lucene value (and a square root) source took 645
Scoring 25000000 documents with a lucene value (and a square root) source took 646
Scoring 25000000 documents with a lucene value (and a square root) source took 649
Scoring 25000000 documents with a lucene value (and a square root) source took 646
So the disk-seek + decode is costly ... To force the "Memory" DVFormat I just
made this simple codec:
static class MyCodec extends Lucene45Codec {
  @Override
  public DocValuesFormat getDocValuesFormatForField(String field) {
    return new MemoryDocValuesFormat();
  }
}
And set that in the IndexWriterConfig.
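For reference, the wiring is just a couple of lines (a minimal sketch; the analyzer and directory are whatever your test already uses):

  IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45,
      new StandardAnalyzer(Version.LUCENE_45));
  iwc.setCodec(new MyCodec());   // per-field DocValuesFormat now comes from MyCodec
  IndexWriter writer = new IndexWriter(dir, iwc);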
Using Memory, I checked: it looks like even with random floats you're
using 27 or 28 bits per value from packed ints, which is likely the source of
the slowness. So then I changed the codec to this:
static class MyCodec extends Lucene45Codec {
  @Override
  public DocValuesFormat getDocValuesFormatForField(String field) {
    return new MemoryDocValuesFormat(PackedInts.FASTEST);
  }
}
This tells the Memory DVFormat that you're willing to waste RAM to get
faster decode speed. However, it had no effect! Apparently, Memory
DVFormat only uses that parameter when the number of unique values is
small ... so I temporarily hacked the code to use 32 bits per value
(i.e., int[]):
Scoring 25000000 documents with direct buffers (without square root) took 179
Scoring 25000000 documents with direct buffers (without square root) took 165
Scoring 25000000 documents with direct buffers (without square root) took 179
Scoring 25000000 documents with direct buffers (and a square root) took 274
Scoring 25000000 documents with direct buffers (and a square root) took 268
Scoring 25000000 documents with direct buffers (and a square root) took 269
Scoring 25000000 documents with a lucene value (without square root) source took 1193
Scoring 25000000 documents with a lucene value (without square root) source took 411
Scoring 25000000 documents with a lucene value (without square root) source took 407
Scoring 25000000 documents with a lucene value (without square root) source took 405
Scoring 25000000 documents with a lucene value (without square root) source took 407
Scoring 25000000 documents with a lucene value (and a square root) source took 423
Scoring 25000000 documents with a lucene value (and a square root) source took 418
Scoring 25000000 documents with a lucene value (and a square root) source took 415
Scoring 25000000 documents with a lucene value (and a square root) source took 420
Scoring 25000000 documents with a lucene value (and a square root) source took 414
So, ~1.46x faster: ~405 msec vs ~592 msec.
You should be able to do even better with a custom DVFormat, because
even when forcing 32 bits per value it's still doing block decode, so
every get must first locate the block and then look up the value within it.
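For example, just to illustrate the difference (this is not a real DVFormat, only a sketch of caching the per-segment values into a plain int[] so each lookup is a single array read; it assumes the field fits in RAM at 4 bytes/doc and that the stored longs are floatToRawIntBits values, as in this test; loadRawInts is a made-up helper name):

  static int[] loadRawInts(AtomicReader reader, String field) throws IOException {
    NumericDocValues dv = reader.getNumericDocValues(field);  // null check omitted
    int[] raw = new int[reader.maxDoc()];
    for (int doc = 0; doc < raw.length; doc++) {
      raw[doc] = (int) dv.get(doc);  // narrow back to the original 32-bit float bits
    }
    return raw;
  }

  // then in the scorer: float v = Float.intBitsToFloat(raw[doc]);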
Mike McCandless
http://blog.mikemccandless.com
On Tue, Oct 8, 2013 at 1:27 PM, <[email protected]> wrote:
> Hi David,
>
> We tried that and still didn't come close to DirectBuffer speed. It was only
> about 20% faster. I've attached updated numbers.
>
> We looked through the Lucene code and determined that the costly part is very
> likely loading each byte of an int out of the byte array. There are much
> faster (in fact, native) operations available for reading a whole word or
> float at one time, if we could get access to the DirectBuffer behind the
> DocValues implementation. But when Lucene loads the byte array into Java
> heap memory, that ability is lost.
>
> Karl
>
>
> -----Original Message-----
> From: ext David Smiley (@MITRE.org) [mailto:[email protected]]
> Sent: Tuesday, October 08, 2013 11:52 AM
> To: [email protected]
> Subject: Re: FW: Is there a really performant way to store a full 32-bit int
> in doc values?
>
> Hi Karl!
>
> I suggest that you put the point data you need in BinaryDocValues, that is,
> both the x & y in the same byte[] chunk. I've done this for a Solr
> integration in https://issues.apache.org/jira/browse/SOLR-5170
>
> ~ David
>
>
> karl.wright-2 wrote
>> Hi All (and especially Robert),
>>
>> Lucene NumericDocValues seems to operate slower than we would expect.
>> In our application, we're using it for storing coordinate values,
>> which we retrieve to compute a distance. While doing timings trying
>> to determine the impact of including a sqrt in the calculation, we
>> noted that the lucene overhead itself overwhelmed pretty much anything
>> we did in the ValueSource.
>>
>> One of our engineers did performance testing (code attached, hope it gets
>> through), which shows what we are talking about. Please see the thread
>> below. The question is: why is lucene 2.5x slower than direct
>> buffer access in this case? And is there anything we can do in the
>> Lucene paradigm to get our performance back closer to the direct buffer case?
>>
>> Karl
>>
>> -----Original Message-----
>> From: Ziech Christian (HERE/Berlin)
>> Sent: Tuesday, October 08, 2013 9:08 AM
>> To: Wright Karl (HERE/Cambridge)
>> Subject: AW: Is there a really performant way to store a full 32-bit
>> int in doc values?
>>
>> Hi,
>>
>> I have now tested the approach of using the NumericDocValues
>> directly, and it indeed helps by about 20% compared to the original Lucene
>> numbers - Lucene is still 2.5x slower than using a DirectBuffer alone, but it
>> helps.
>> The funny thing is that with lucene the square root is almost
>> meaningless, which can be explained by the CPU computing the
>> square root while other things are computed; since it doesn't
>> need the result for a while in my micro-benchmark, it can happily do
>> other things in the meantime. Since we also have a lot of other query
>> aspects we'd get that gain either way, I assume, so budgeting about
>> 30-50ms for the square root when scoring 25M documents should be
>> about accurate. So what is lucene doing that causes it to be 3 times slower
>> than the naive approach?
>> And why is that impact so much bigger than the one of a simple square root
>> (which slows things down by ~20% when assuming the 30ms with more complex
>> actions)? I mean 20% vs 200% is an order of magnitude!
>> As a side note: storing the values as an int when using a DirectBuffer
>> doesn't seem helpful - I assume because we have to cast the int to
>> float later either way.
>>
>> BR
>> Christian
>>
>> PS: The new numbers are:
>> Scoring 25000000 documents with direct float buffers (without square root) took 190
>> Scoring 25000000 documents with direct float buffers (without square root) took 171
>> Scoring 25000000 documents with direct float buffers (without square root) took 172
>> Scoring 25000000 documents with direct float buffers (and a square root) took 281
>> Scoring 25000000 documents with direct float buffers (and a square root) took 280
>> Scoring 25000000 documents with direct float buffers (and a square root) took 266
>> Scoring 25000000 documents with a lucene float value source (without square root) took 1045
>> Scoring 25000000 documents with a lucene float value source (without square root) took 625
>> Scoring 25000000 documents with a lucene float value source (without square root) took 630
>> Scoring 25000000 documents with a lucene float value source (and a square root) took 661
>> Scoring 25000000 documents with a lucene float value source (and a square root) took 670
>> Scoring 25000000 documents with a lucene float value source (and a square root) took 665
>> Scoring 25000000 documents with direct int buffers (without square root) took 218
>> Scoring 25000000 documents with direct int buffers (without square root) took 219
>> Scoring 25000000 documents with direct int buffers (without square root) took 204
>> Scoring 25000000 documents with a lucene numeric values (without square root) source took 1123
>> Scoring 25000000 documents with a lucene numeric values (without square root) source took 500
>> Scoring 25000000 documents with a lucene numeric values (without square root) source took 499
>> Scoring 25000000 documents with a lucene numeric values (and a square root) source took 531
>> Scoring 25000000 documents with a lucene numeric values (and a square root) source took 531
>> Scoring 25000000 documents with a lucene numeric values (and a square root) source took 535
>>
>>
>> ________________________________________
>> Von: Wright Karl (HERE/Cambridge)
>> Gesendet: Montag, 7. Oktober 2013 09:22
>> An: Ziech Christian (HERE/Berlin)
>> Betreff: FW: Is there a really performant way to store a full 32-bit
>> int in doc values?
>>
>> -----Original Message-----
>> From: ext Michael McCandless [mailto:lucene@]
>> Sent: Monday, October 07, 2013 8:28 AM
>> To: Wright Karl (HERE/Cambridge)
>> Subject: Re: Is there a really performant way to store a full 32-bit
>> int in doc values?
>>
>> Well, it is a micro-benchmark ... so it'd be better to test in the
>> wider/full context of the application?
>>
>> I'm also a little worried that you go through ValueSource instead of
>> interacting directly with the NumericDocValues instance; it's just an
>> additional level of indirection that may confuse hotspot. But it
>> really ought not be so bad ...
>>
>> Under the hood we encode a float to an int using
>> Float.floatToRawIntBits; it could be that this doesn't work well w/
>> the compression we then do on the ints by default? I'm curious which
>> impl the Lucene45DocValuesConsumer is using in your case. Looks like
>> you are using random floats, so I'd expect it's using DELTA_COMPRESSED.
>>
>> It'd be a simple test to just make your own DVFormat using raw 32 bit
>> ints, to see how much that helps.
>>
>> But, yes, I would just email the list and see if there are other ideas?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Mon, Oct 7, 2013 at 7:14 AM, <karl.wright@> wrote:
>>> Hi Mike,
>>>
>>>
>>>
>>> Before I post to the general list, do you see any problem with our
>>> testing methodology?
>>>
>>>
>>>
>>> Basically, we conclude that by far the most expensive thing is
>>> retrieving the NumericDocValue value. This currently overwhelms any
>>> expensive operations we might do in the scoring ourselves, which is
>>> why we're looking for potential improvements in that area.
>>>
>>>
>>>
>>> Do you agree with the assessment?
>>>
>>> Karl
>>>
>>>
>>>
>>> From: Ziech Christian (HERE/Berlin)
>>> Sent: Friday, October 04, 2013 11:09 PM
>>> To: Wright Karl (HERE/Cambridge)
>>> Subject: AW: Is there a really performant way to store a full 32-bit
>>> int in doc values?
>>>
>>>
>>>
>>> Hi,
>>>
>>> maybe it's best if I share where I got my numbers from - I have
>>> written a small test (which originally should only test the
>>> Math.sqrt() impact for 10M scorings).
>>>
>>> The output is (I looped over the search invocation to give lucene a
>>> chance to load everything):
>>> Scoring 25000000 documents with direct buffers (without square root) took 203
>>> Scoring 25000000 documents with direct buffers (without square root) took 179
>>> Scoring 25000000 documents with direct buffers (without square root) took 172
>>> Scoring 25000000 documents with direct buffers (and a square root) took 292
>>> Scoring 25000000 documents with direct buffers (and a square root) took 289
>>> Scoring 25000000 documents with direct buffers (and a square root) took 289
>>> Scoring 25000000 documents with a lucene value (without square root) source took 1045
>>> Scoring 25000000 documents with a lucene value (without square root) source took 656
>>> Scoring 25000000 documents with a lucene value (without square root) source took 660
>>> Scoring 25000000 documents with a lucene value (without square root) source took 658
>>> Scoring 25000000 documents with a lucene value (without square root) source took 663
>>> Scoring 25000000 documents with a lucene value (and a square root) source took 711
>>> Scoring 25000000 documents with a lucene value (and a square root) source took 710
>>> Scoring 25000000 documents with a lucene value (and a square root) source took 713
>>> Scoring 25000000 documents with a lucene value (and a square root) source took 711
>>> Scoring 25000000 documents with a lucene value (and a square root) source took 714
>>>
>>> So the impact of a square root is roughly 110ms, while the impact of
>>> using the lucene function values is far higher (between 300-350ms
>>> depending on the run). Interestingly, the square root impact is not as
>>> high on the lucene function query for some reason (most likely java
>>> or the cpu can just optimize the very simple scorer best).
>>>
>>> I measured the values with both an FSDirectory and a RAMDirectory,
>>> which essentially yield the same performance. Do you see any problem
>>> with the attached code?
>>>
>>> BR
>>> Christian
>>>
>>> ________________________________
>>>
>>> Von: Wright Karl (HERE/Cambridge)
>>> Gesendet: Freitag, 4. Oktober 2013 20:56
>>> An: Ziech Christian (HERE/Berlin)
>>> Betreff: FW: Is there a really performant way to store a full 32-bit
>>> int in doc values?
>>>
>>>
>>> FYI
>>> Karl
>>>
>>> Sent from my Windows Phone
>>>
>>> ________________________________
>>>
>>> From: ext Michael McCandless
>>> Sent: 10/4/2013 4:51 PM
>>> To: Wright Karl (HERE/Cambridge)
>>> Subject: Re: Is there a really performant way to store a full 32-bit
>>> int in doc values?
>>>
>>> Hmmm, that's interesting that you see decode cost is too high. Are
>>> you sure?
>>>
>>> Can you email the list? I'm sure Rob will have suggestions. The
>>> worst case is you make a custom DV format that stores things raw.
>>>
>>> 4.5 has a new default DocValuesFormat with more compression, but with
>>> values stored on disk by default (cached by the OS if you have the
>>> RAM) ... I wonder how that would compare to what you're using now.
>>>
>>> I think the simplest thing to do is to instantiate the
>>> Lucene42DocValuesConsumer (renamed to MemoryDVConsumer in 4.5),
>>> passing a very high acceptableOverheadRatio? This should cause the
>>> packed ints to be upgraded to a byte[], short[], int[], long[]. If this
>>> is still not fast enough then I suspect a custom DVFormat that just
>>> uses int[] directly (avoiding the abstractions of packed ints) is
>>> your best shot.
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Fri, Oct 4, 2013 at 8:46 AM, <karl.wright@> wrote:
>>>>
>>>>
>>>> Hi Mike,
>>>>
>>>>
>>>>
>>>> We're using docvalues to store geocoordinates in meters in X,Y,Z
>>>> space, and discovering that they are taking more time to unpack than
>>>> we'd like. I was surprised to find no raw representation available
>>>> for docvalues right now - otherwise, a fixed 4-byte representation
>>>> would have been ideal. Would you have any suggestions?
>>>>
>>>>
>>>>
>>>> Karl
>>>>
>>>>
>>
>>
>>
>> LuceneFloatSourceTest.java (16K)
>> <http://lucene.472066.n3.nabble.com/attachment/4094104/0/LuceneFloatSourceTest.java>
>
>
>
>
>
> -----
> Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
>
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]