[
https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812315#comment-13812315
]
Shai Erera commented on LUCENE-5316:
------------------------------------
{quote}
I'm not sure .getChildren() should be encapsulated with a TaxononyTree object
if it's the only API.
Do you think other API should be put there as well?
{quote}
When I wrote it, I thought it should have a parent(int ord) API as well, but
then I changed my mind. I agree that we don't need a TaxoTree with a single API
at the moment. Let's stick w/ getChildren. If we'll want to allow overriding
how children are represented, we may want to have an internal TaxoTree object,
that's not exposed on the public API, but apps can provide their own so we
delegate getChildren to it.
{quote}
I feel that this issue is big enough to stand for itself, while the children
map is also large enough? It would contain code for the iterations but also
some code in the taxo-reopen phase, that will replace (hopefully ) the PTA
reopen code.
{quote}
I'm not sure, but I don't weigh the issues by the sizes of their patch. If
after all this hassle, it turns out that even the map approach severely hurts
performance, there's no reason to make any changes to the API. I hope that this
won't be the case, but judging from the benchmark Mike ran, this change doesn't
look good. I still hope it will look better when we test NO_PARENTS only for
the hierarchical dimensions, but we shouldn't make changes to the API if
net/net we'll only lose. The RAM consumption of the arrays is not that
worrying, since taxonomies are not that large (even a huge taxonomy w/ 2.5M
nodes consumes only 30MB!).
So IMO this issue should be about (and it can be done in steps):
# Abstract the children API and verify this alone doesn't hurt performance. If
so, that's great. We should also remove PTA entirely from the API in that case.
# Experiment with a different in-memory representation - this can be done as a
hack even in the current TaxoIndexArrays -- when computeChildrenSiblings is
called, build a the map. That way you don't need to interfere w/ any reopen
logic, it's all internal to TaxoIndexArrays.
#* If we discover that this rep loses performance, we can decide if we want to
keep the API abstraction or drop it altogether. But I feel that we need to at
least experiment w/ an alternative representation before we change the API.
Those API abstractions hurt the performance in the past - let's not make the
same mistake again.
> Taxonomy tree traversing improvement
> ------------------------------------
>
> Key: LUCENE-5316
> URL: https://issues.apache.org/jira/browse/LUCENE-5316
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Gilad Barkai
> Priority: Minor
> Attachments: LUCENE-5316.patch, LUCENE-5316.patch
>
>
> The taxonomy traversing is done today utilizing the
> {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays
> which hold for each ordinal it's (array #1) youngest child and (array #2)
> older sibling.
> This is a compact way of holding the tree information in memory, but it's not
> perfect:
> * Large (8 bytes per ordinal in memory)
> * Exposes internal implementation
> * Utilizing these arrays for tree traversing is not straight forward
> * Lose reference locality while traversing (the array is accessed in
> increasing only entries, but they may be distant from one another)
> * In NRT, a reopen is always (not worst case) done at O(Taxonomy-size)
> This issue is about making the traversing more easy, the code more readable,
> and open it for future improvements (i.e memory footprint and NRT cost) -
> without changing any of the internals.
> A later issue(s?) could be opened to address the gaps once this one is done.
--
This message was sent by Atlassian JIRA
(v6.1#6144)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]