[ 
https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808968#comment-13808968
 ] 

Gilad Barkai commented on LUCENE-5316:
--------------------------------------

Thanks Mike, you're right on the money.

The ram consumption is indeed an issue here.
I'm not sure that the parent array is used during search at all... and perhaps 
could be removed (looking into that one as well).
The youngestChild/olderSibing arrays should and could be replaces with either a 
more compact RAM representation, or at extreme cases, even on-disk.

For a better RAM representation, the idea of a map from ord -> int[] of it's 
children is a start.
In such a case, we benefit from not holding a 'youngerChild' int for each 
ordinal in the flat dimension - as they have no children.
2nd, we could benefit from the locality of ref, as all the children are near by 
and not spread over an array of millions. The non-huge-flat dimensions will no 
longer suffer because of the other dimensions.
Also, it would make the worst case of NRT the same as the current update 
(O(Taxo-size)) but might be very small if only a few ordinals were added, as 
only their 'family' would be reallocated and managed, rather than the entire 
array.

At a further phase - a compression could be allowed into that int[] of children 
- we know the children are in ascending order, and could only encode the DGaps, 
figure the largest DGAP and use packed ints instead of the int[]. It would add 
some (I hope) minor CPU consumption to the loop, but would benefit greatly when 
it comes to RAM consumption. 
I hope that all the logic could be encapsulated in the {{ChildrenIterator}} and 
the user will benefit from a clean API and better RAM utilization.

I'll post a patch shortly, which covers the very first part - hiding the 
implementation detail of children arrays (making 
TaxoReader.getParallelTaxoArrays protected to begin with), and moving 
{{TopKFacetResultHandler}} to use {{ChildrenIterator}}. 

Currently debugging some nasty loop-as-a-recursion related bug :)

> Taxonomy tree traversing improvement
> ------------------------------------
>
>                 Key: LUCENE-5316
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5316
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Gilad Barkai
>            Priority: Minor
>
> The taxonomy traversing is done today utilizing the 
> {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays 
> which hold for each ordinal it's (array #1) youngest child and (array #2) 
> older sibling.
> This is a compact way of holding the tree information in memory, but it's not 
> perfect:
> * Large (8 bytes per ordinal in memory)
> * Exposes internal implementation
> * Utilizing these arrays for tree traversing is not straight forward
> * Lose reference locality while traversing (the array is accessed in 
> increasing only entries, but they may be distant from one another)
> * In NRT, a reopen is always (not worst case) done at O(Taxonomy-size)
> This issue is about making the traversing more easy, the code more readable, 
> and open it for future improvements (i.e memory footprint and NRT cost) - 
> without changing any of the internals. 
> A later issue(s?) could be opened to address the gaps once this one is done.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to