[ https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818443#comment-13818443 ]

Shai Erera commented on LUCENE-5316:
------------------------------------

Mike, I looked at the results of the different runs, and the QPS column (e.g. 
QPS Base) varies dramatically between runs. I'm not talking about "base" vs. 
"comp", but "base" vs. itself across all runs. E.g. in the last run, where you 
compared ALL_BUT_DIM and NO_PARENTS, AndHighLow of ALL_BUT_DIM was 70.5 vs. 
41.14 respectively, while in the ALL_BUT_DIM run after Gilad's patch with 
getChildren() returning null, it's 103.19.

Can you explain the large differences? Note that I don't compare to the 
"easy" run (w/ 7 dims only), as it does not run the same thing. But I wonder if 
the changes in absolute QPS hint at some instability (maybe temporal) in 
the machine or the test. Still, since "comp" is consistently slower than "base", 
I think it ultimately shows the abstraction hurts us, but I'd feel better if 
the test were more stable across runs.

Separately, I'm torn about what we should do here. On one hand the abstraction 
hurts us, but on the other hand, dropping it eliminates any chance of doing 
anything smart in the taxonomy's in-memory representation. For example, if a 
dimension is flat and some taxonomy implementation manages to assign successive 
ordinals to its children, we don't even need to materialize all children in an 
int[]; rather, we can hold a start/end range (à la 
SortedSetDocValuesReaderState.OrdRange) and implement ChildrenIterator on top. 
If we commit to an int[] in the API, it immediately kills any chance to 
optimize that further in the future (e.g. even PackedInts).
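To make the idea concrete, here is a minimal sketch of a range-backed iterator. 
ChildrenIterator is the abstraction under discussion; the exact interface shape 
and the RangeChildrenIterator class are my assumptions, not code from the patch:

```java
// Hypothetical sketch: when a dimension's children occupy a contiguous
// ordinal range [start, end), the iterator can just count through the
// range instead of materializing an int[] of children.
interface ChildrenIterator {
    int NO_MORE_ORDS = -1;

    /** Returns the next child ordinal, or NO_MORE_ORDS when exhausted. */
    int next();
}

final class RangeChildrenIterator implements ChildrenIterator {
    private final int end; // exclusive
    private int current;

    RangeChildrenIterator(int start, int end) {
        this.current = start;
        this.end = end;
    }

    @Override
    public int next() {
        return current < end ? current++ : NO_MORE_ORDS;
    }
}
```

The point is that callers only see ChildrenIterator, so a future implementation 
could be backed by an int[], an OrdRange-style pair, or something packed, 
without an API change.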

I know Gilad is making progress w/ returning an int[] per ord, so I wonder what 
the performance will be with it. I really wish we could make that API abstraction 
without losing much - it feels like the right thing to do ... and I'd hate to do 
it knowing that we lose :).

> Taxonomy tree traversing improvement
> ------------------------------------
>
>                 Key: LUCENE-5316
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5316
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Gilad Barkai
>            Priority: Minor
>         Attachments: LUCENE-5316.patch, LUCENE-5316.patch, LUCENE-5316.patch
>
>
> Taxonomy traversal is done today utilizing the 
> {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays 
> which hold, for each ordinal, its youngest child (array #1) and its 
> older sibling (array #2).
> This is a compact way of holding the tree information in memory, but it's not 
> perfect:
> * Large (8 bytes per ordinal in memory)
> * Exposes the internal implementation
> * Utilizing these arrays for tree traversal is not straightforward
> * Loses reference locality while traversing (the arrays are accessed in one 
> direction only, but consecutive accesses may be distant from one another)
> * In NRT, a reopen is always (not just in the worst case) done in 
> O(taxonomy-size)
> This issue is about making the traversal easier and the code more readable, 
> and opening it up for future improvements (i.e. memory footprint and NRT 
> cost) - without changing any of the internals. 
> Later issue(s) could be opened to address the gaps once this one is done.
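To illustrate the encoding the issue description refers to: with a 
youngest-child array and an older-sibling array, enumerating a node's children 
means starting at its youngest child and walking the sibling chain. The array 
names echo ParallelTaxonomyArrays, but the helper method and sample data below 
are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

class TaxonomyArraysDemo {
    // Sentinel marking "no child" / "no older sibling".
    static final int INVALID_ORDINAL = -1;

    // children[ord] = ord's youngest child; siblings[ord] = ord's older
    // sibling. Collecting a node's children walks the sibling chain.
    static List<Integer> childrenOf(int ord, int[] children, int[] siblings) {
        List<Integer> result = new ArrayList<>();
        for (int child = children[ord]; child != INVALID_ORDINAL; child = siblings[child]) {
            result.add(child);
        }
        return result;
    }
}
```

For a tiny made-up taxonomy where ordinal 0 is the root with children 1, 2, 
and 3, children[0] would be 3 (the youngest), siblings[3] would be 2, and 
siblings[2] would be 1; the walk visits the children from youngest to oldest, 
which is the scattered access pattern the bullet list above complains about.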



--
This message was sent by Atlassian JIRA
(v6.1#6144)
