[jira] [Commented] (LUCENE-5316) Taxonomy tree traversing improvement

Shai Erera (JIRA) Sun, 24 Nov 2013 13:03:07 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13831050#comment-13831050
 ]


Shai Erera commented on LUCENE-5316:
------------------------------------

bq. we could hold in an int[][]

* If we do that, we still add an array pointer for every ordinal, which is much 
bigger than an int (I think it's 8 bytes?)
* We'll still call taxoReader.getChildren(ord) for all the leaf nodes from the 
calling rollup code, so yes - we'll save some ifs and hash, but I'm not sure if 
overall we'll still win...

bq. We could compute the children upon taxonomy opening and be done with it

I don't object to it. I think usually if you work w/ a TaxoReader, you intend 
to do faceted search or traverse the taxonomy, so it makes sense to initialize 
it in the ctor.

bq. Today (trunk) reopen is very costly as it's in the O(TaxonomySize)

That's not entirely true, let's clarify it. While reopening a taxonomy and 
computing the new PTA, we do two things:

* grow the parents[], children[] arrays - means arraycopy.
* process the parents of the new categories only

So it's not really O(TaxoSize). I don't think arraycopy is that expensive, 
compared to computing the parents[] / children[] / siblings[] (mostly 
parents[], since it reads a posting list). The expensive part is done only on 
the new segments in the taxonomy (the new category documents).  So the taxonomy 
is NRT is almost truly NRT.

> Taxonomy tree traversing improvement
> ------------------------------------
>
>                 Key: LUCENE-5316
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5316
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Gilad Barkai
>            Priority: Minor
>         Attachments: LUCENE-5316.patch, LUCENE-5316.patch, LUCENE-5316.patch, 
> LUCENE-5316.patch, LUCENE-5316.patch
>
>
> The taxonomy traversing is done today utilizing the 
> {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays 
> which hold for each ordinal it's (array #1) youngest child and (array #2) 
> older sibling.
> This is a compact way of holding the tree information in memory, but it's not 
> perfect:
> * Large (8 bytes per ordinal in memory)
> * Exposes internal implementation
> * Utilizing these arrays for tree traversing is not straight forward
> * Lose reference locality while traversing (the array is accessed in 
> increasing only entries, but they may be distant from one another)
> * In NRT, a reopen is always (not worst case) done at O(Taxonomy-size)
> This issue is about making the traversing more easy, the code more readable, 
> and open it for future improvements (i.e memory footprint and NRT cost) - 
> without changing any of the internals. 
> A later issue(s?) could be opened to address the gaps once this one is done.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-5316) Taxonomy tree traversing improvement

Reply via email to