On Mon, Nov 15, 2021 at 1:14 PM Robert Muir <rcm...@gmail.com> wrote:
> On Mon, Nov 15, 2021 at 12:57 PM Michael McCandless
> <luc...@mikemccandless.com> wrote:
> >
> > I think for PR 420 (https://github.com/apache/lucene/pull/420) we are
> > (confusingly!) not really seeing performance benefits -- the taxonomy index got
> > a bit bigger, and loading the parent arrays was no faster? So Patrick closed
> > that one.
>
> I'm confused about this (sorry, I am not up to speed), but are we not
> able to offload today's very large arrays to docvalues (e.g. mmap)
> with the change? Wasn't that the original motivation, that the memory
> usage was somewhat trappy? I wouldn't expect to see performance
> benefits over today's on-heap arrays that are read from payloads or
> whatever; instead it would be a memory benefit?

Yeah, I love that idea, but that's not what Patrick's PR explored (yet?). His PR explored switching away from custom token positions to NumericDocValues to store the same data (ordinal -> parent mapping), but it still loaded all of those into massive heap-resident int[].

I agree it would be awesome to try avoiding those big int[] and reading live from NumericDocValues during faceting! It would require some re-work of the faceting code to e.g. sort the ordinals so as to (efficiently) visit them in forward iterator-friendly order. But that is a different change, and probably we should not hold 9.0 for it?

Mike McCandless

http://blog.mikemccandless.com
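To illustrate the forward-iteration constraint discussed above: doc-values readers like NumericDocValues are forward-only iterators, so resolving an arbitrary set of ordinal -> parent lookups directly from doc values requires sorting the ordinals first. Below is a minimal, hypothetical sketch (not the real Lucene API; `ForwardOnlyValues`, `advanceTo`, and `parentsOf` are invented stand-ins) showing that reordering step:

```java
import java.util.Arrays;

public class ForwardLookupSketch {

    // Hypothetical stand-in for a forward-only per-ordinal values cursor
    // (in the same spirit as NumericDocValues): it can only advance.
    static class ForwardOnlyValues {
        private final long[] values; // values[ord] = parent ordinal
        private int current = -1;

        ForwardOnlyValues(long[] values) { this.values = values; }

        long advanceTo(int ord) {
            if (ord <= current) {
                // A real forward-only iterator cannot seek backwards.
                throw new IllegalStateException("cannot seek backwards");
            }
            current = ord;
            return values[ord];
        }
    }

    // Resolve parents for an arbitrary (unsorted, duplicate-free) set of
    // ordinals by sorting first, so every lookup is a forward advance.
    static long[] parentsOf(ForwardOnlyValues dv, int[] ordinals) {
        int[] sorted = ordinals.clone();
        Arrays.sort(sorted);
        long[] parents = new long[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            parents[i] = dv.advanceTo(sorted[i]);
        }
        return parents;
    }

    public static void main(String[] args) {
        // Toy taxonomy: parent of ordinal i is stored at position i.
        long[] parentMap = {0, 0, 1, 1, 2};
        ForwardOnlyValues dv = new ForwardOnlyValues(parentMap);
        // Ordinals arrive unsorted; parentsOf sorts them before advancing.
        long[] parents = parentsOf(dv, new int[] {4, 2, 3});
        System.out.println(Arrays.toString(parents)); // [1, 1, 2]
    }
}
```

This is only the access-pattern idea, sketched under stated assumptions; actual faceting code would also have to aggregate counts per sorted ordinal rather than in arrival order.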