Re: Off-heap binary doc values

2024-12-09 Thread Ignacio Vera
I have run the luceneutil benchmark with higher iteration and repeat counts, but the results are still very noisy, which I attribute to running the benchmarks on a laptop. The results always show some of the faceting tasks speeding up while others slow down slightly. One run that clearly shows a slo

Re: Off-heap binary doc values

2024-12-07 Thread Adrien Grand
FWIW I have also seen some users store sparse vectors or bloom filters in binary doc values. In both cases, the serialized size may be non-negligible while not all bytes are needed, so this change would likely help. Having the binary sort and faceting tasks not show a big slowdown would be good as th

Re: Off-heap binary doc values

2024-12-05 Thread Ignacio Vera
@Chris: Agreeing on an off-heap BytesRef thingy would be a great step forward. @Mike: Yes, there are other use cases. One that is close to my heart is the geo use case, where in many cases you don't need to read all the bytes and geometries can be big. In Lucene there are some interesting usages in

Re: Off-heap binary doc values

2024-12-05 Thread Michael Sokolov
That makes sense to me too in the abstract. At Amazon we also have interesting BDV fields we have to decode on the fly, so this looks attractive for that reason (not just faceting). I would say though that it would be easier to evaluate the fitness for purpose (faceting) if we had some examples of

Re: Off-heap binary doc values

2024-12-05 Thread Chris Hegarty
Hi Ignacio, I completely agree with the idea of having a BytesRef-like thing that can be off-heap. For a while now I’ve been thinking about how we could evolve BytesRef so as to not expose its on-heap representation. Having a separate primitive is probably a better way to go. -Chris.

Off-heap binary doc values

2024-12-05 Thread Ignacio Vera
Hello, I have been working with the idea of reading binary doc values off-heap for a while. The idea behind it is that binary doc values are often used for faceting, where structured data is encoded at write time and decoded at read time. It feels wasteful to have to read the data on-heap before dec