[jira] [Commented] (LUCENE-5316) Taxonomy tree traversing improvement

Shai Erera (JIRA) Sat, 02 Nov 2013 11:55:12 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812112#comment-13812112
 ]


Shai Erera commented on LUCENE-5316:
------------------------------------

Using NO_PARENTS is not that simple a decision, since the counts of the parents 
will be wrong if more than one category of that dimension is added to a 
document. If it's a flat dimension, and you don't care about the dimension's 
count, that may be fine. But if it's a hierarchical dimension, the counts of 
the inner taxonomy nodes will be wrong in that case.
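
To make the over-counting concrete, here is a minimal sketch using a toy model rather than the actual facet module classes. The class name, the `PARENTS` array and both methods are hypothetical; the only assumption borrowed from the taxonomy is that a child always has a larger ordinal than its parent. One method counts as if each document stored every category together with its ancestors (deduplicated per document), the other stores only the given ordinals and rolls child counts up into the parent afterwards.

```java
import java.util.TreeSet;

/** Toy model for illustration only; this is not the Lucene facet API. */
class OrdPolicySketch {
    // PARENTS[ord] is the parent ordinal, -1 marks the root.
    // 0 = root, 1 = Author (dimension), 2 = Author/Mark, 3 = Author/Lisa
    static final int[] PARENTS = {-1, 0, 1, 1};

    /** Each document contributes at most 1 to every category and ancestor it touches. */
    static int[] countWithParents(int[][] docs) {
        int[] counts = new int[PARENTS.length];
        for (int[] doc : docs) {
            TreeSet<Integer> ords = new TreeSet<>();
            for (int ord : doc) {
                // Walk up to, but not including, the root.
                for (int o = ord; o > 0; o = PARENTS[o]) {
                    ords.add(o);
                }
            }
            for (int o : ords) {
                counts[o]++;
            }
        }
        return counts;
    }

    /** Only the given ordinals are counted per document; parents are rolled up afterwards. */
    static int[] countNoParents(int[][] docs) {
        int[] counts = new int[PARENTS.length];
        for (int[] doc : docs) {
            for (int ord : doc) {
                counts[ord]++;
            }
        }
        // Roll up from high to low ordinals, so a node's count is complete
        // before it is added to its parent (children have larger ordinals).
        for (int ord = PARENTS.length - 1; ord > 0; ord--) {
            int parent = PARENTS[ord];
            if (parent > 0) {
                counts[parent] += counts[ord];
            }
        }
        return counts;
    }
}
```

With two documents, `{2}` and `{2, 3}`, the first method reports 2 for Author (two documents), while the rolled-up variant reports 3, because the second document is counted once per child.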

While indexing as NO_PARENTS does exercise the API more, I think it's wrong to 
test it here. NO_PARENTS should be used only for hierarchical dimensions, in 
order to save space in the category list and eventually (hopefully) speed 
things up, since fewer bytes are read and decoded during search. But for flat 
dimensions, it adds the rollupValues cost. If we make the search code smart 
enough to detect that this is a flat dimension, we'd save that cost (no need to 
roll up), but I think in general you should tweak OrdPolicy to NO_PARENTS only 
for hierarchical dimensions. I wonder what the perf numbers will be if you use 
NO_PARENTS only for the hierarchical dims - that's what we recommend users do, 
so I think that's what we should benchmark.

I'll review the patch later.

> Taxonomy tree traversing improvement
> ------------------------------------
>
>                 Key: LUCENE-5316
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5316
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Gilad Barkai
>            Priority: Minor
>         Attachments: LUCENE-5316.patch
>
>
> The taxonomy traversing is done today utilizing the 
> {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays 
> which hold for each ordinal its (array #1) youngest child and (array #2) 
> older sibling.
> This is a compact way of holding the tree information in memory, but it's not 
> perfect:
> * Large (8 bytes per ordinal in memory)
> * Exposes internal implementation
> * Utilizing these arrays for tree traversing is not straightforward
> * Lose reference locality while traversing (the array is accessed only at 
> increasing entries, but they may be distant from one another)
> * In NRT, a reopen is always (not worst case) done at O(Taxonomy-size)
> This issue is about making the traversal easier, the code more readable, 
> and opening it up for future improvements (i.e. memory footprint and NRT 
> cost) - without changing any of the internals. 
> A later issue(s?) could be opened to address the gaps once this one is done.
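
The youngest-child / older-sibling layout described in the quoted issue can be sketched as follows. This is a self-contained toy and not the `ParallelTaxonomyArrays` implementation: the class name, the `build` and `childrenOf` helpers and the `INVALID` marker are hypothetical, and the sketch assumes that ordinal 0 is the root and that every child has a larger ordinal than its parent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of the two taxonomy-size int arrays; not the Lucene classes. */
class ChildSiblingSketch {
    static final int INVALID = -1;

    /** Returns {youngestChild, olderSibling}, derived from a parents array. */
    static int[][] build(int[] parents) {
        int n = parents.length;
        int[] youngestChild = new int[n];
        int[] olderSibling = new int[n];
        Arrays.fill(youngestChild, INVALID);
        Arrays.fill(olderSibling, INVALID);
        // Ordinals are visited in increasing order, so the last child seen
        // for a parent is its youngest one.
        for (int ord = 1; ord < n; ord++) {
            int parent = parents[ord];
            olderSibling[ord] = youngestChild[parent]; // previous youngest is now the older sibling
            youngestChild[parent] = ord;
        }
        return new int[][] {youngestChild, olderSibling};
    }

    /** Lists the direct children of ord, youngest first, by following the sibling chain. */
    static List<Integer> childrenOf(int ord, int[] youngestChild, int[] olderSibling) {
        List<Integer> children = new ArrayList<>();
        for (int c = youngestChild[ord]; c != INVALID; c = olderSibling[c]) {
            children.add(c);
        }
        return children;
    }
}
```

Two `int` entries per ordinal account for the 8 bytes per ordinal mentioned in the list, and the loop in `childrenOf` is the kind of traversal code that callers currently have to write against the raw arrays.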



--
This message was sent by Atlassian JIRA
(v6.1#6144)

