[
https://issues.apache.org/jira/browse/LUCENE-4619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13530205#comment-13530205
]
Michael McCandless commented on LUCENE-4619:
--------------------------------------------
Maybe if we rename CDB to FacetsDocumentBuilder, move it to oal.document, make
it a single method call for the user (FDB.addFields), that's good enough
progress for the common case for now?
I still don't like this field/dimension duality: it feels like the facet module
is "hiding" what should be separate fields, within a single Lucene field. If I
need to store these fields (because I want to present them in the UI), I'm
already adding them as separate fields.
I think doc.add(new FacetField(...)) is more intuitive than fdb.addFields(doc,
....) for the common/basic use case... but at least improving CDB here would
be progress.
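To make the comparison concrete, here is a rough sketch of the two call shapes.
Every type in it (SketchDoc, SketchField, FacetFieldSketch,
FacetsDocBuilderSketch) is a hypothetical stand-in invented for illustration,
not the facet module's API, and the taxonomy-index write is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins only; none of these are real Lucene classes.
interface SketchField {
  String describe();
}

final class SketchDoc {
  final List<SketchField> fields = new ArrayList<>();

  void add(SketchField field) {
    fields.add(field);
  }
}

// Shape 1: a facet is just another field that the user adds to the document.
final class FacetFieldSketch implements SketchField {
  private final String[] path;

  FacetFieldSketch(String... path) {
    this.path = path;
  }

  @Override
  public String describe() {
    return "facet:" + String.join("/", path);
  }
}

// Shape 2: a builder that adds the facet fields to the document for the user.
final class FacetsDocBuilderSketch {
  void addFields(SketchDoc doc, String[]... paths) {
    for (String[] path : paths) {
      doc.add(new FacetFieldSketch(path));
    }
  }
}

final class FacetApiShapes {
  // doc.add(new FacetField(...)) style.
  static SketchDoc viaField() {
    SketchDoc doc = new SketchDoc();
    doc.add(new FacetFieldSketch("Author", "Shai"));
    doc.add(new FacetFieldSketch("Date", "2012"));
    return doc;
  }

  // fdb.addFields(doc, ...) style.
  static SketchDoc viaBuilder() {
    SketchDoc doc = new SketchDoc();
    new FacetsDocBuilderSketch().addFields(
        doc, new String[] {"Author", "Shai"}, new String[] {"Date", "2012"});
    return doc;
  }
}
```

Both paths end up with the same fields on the document; the difference is only
in which object the user talks to.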
> Create a specialized path for facets counting
> ---------------------------------------------
>
> Key: LUCENE-4619
> URL: https://issues.apache.org/jira/browse/LUCENE-4619
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Shai Erera
> Attachments: LUCENE-4619.patch
>
>
> Mike and I have been discussing that on several issues (LUCENE-4600,
> LUCENE-4602) and on GTalk ... it looks like the current API abstractions may
> be responsible for some of the performance loss that we see, compared to
> specialized code.
> During our discussion, we've decided to target a specific use case, facets
> counting, and work on it top-to-bottom, reusing as much code as possible.
> Specifically, we'd like to implement a FacetsCollector/Accumulator which can
> do only counting (i.e. respects only CountFacetRequest), with no sampling,
> partitions, or complements. The API allows us to do so very cleanly, and in
> the context of that issue, we'd like to do the following:
> * Implement a FacetsField which takes a TaxonomyWriter, FacetIndexingParams
> and CategoryPath (List, Iterable, whatever) and adds the needed information
> to both the taxonomy index as well as the search index.
> ** That API is similar in nature to CategoryDocumentBuilder, only easier to
> consume -- it's just another field that you add to the Document.
> ** We'll have two extensions for it: PayloadFacetsField and
> DocValuesFacetsField, so that we can benchmark the two approaches.
> Eventually, we believe, one of them will be eliminated, and we'll remain w/
> just one (hopefully the DV one).
> * Implement a FacetsAccumulator/Collector which takes a bunch of
> CountFacetRequests and returns the top-counts.
> ** Aggregations are done in-collection, rather than post. Note that we have
> LUCENE-4600 open for exploring that. Either we finish this exploration here,
> or do it there. Just FYI that the issue exists.
> ** Reuses the CategoryListIterator, IntDecoder and Aggregator code. I'll open
> a separate issue to explore improving that API to be bulk, and then we can
> decide if this specialized Collector should use those abstractions, or be
> really optimized for the facet counting case.
> * At the moment, this path will assume that a document holds multiple
> dimensions, but only one value from each (i.e. no Author/Shai, Author/Mike
> for a document), and therefore use OrdPolicy.NO_PARENTS.
> ** Later, we'd like to explore how to have this specialized path handle the
> ALL_PARENTS case too, as it shouldn't be so hard to do.
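
As a rough illustration of the counting-only path with NO_PARENTS described in
the issue: only the leaf ordinals are recorded per document, counts accumulate
in a plain int[] during collection, and parent counts are filled in afterwards
by walking a parent array. The class name, the int[] parents layout, and the
assumption that a parent always has a smaller ordinal than its children are
all made up for this sketch; it is not the facet module's code.

```java
// Hypothetical sketch; not the facet module's actual collector.
final class CountingSketch {

  // docLeafOrds[d] holds the leaf category ordinals stored for matching doc d.
  // parents[ord] is the parent ordinal of ord; the root is ordinal 0 with
  // parent -1. Assumes a parent always has a smaller ordinal than its children.
  static int[] count(int[][] docLeafOrds, int[] parents) {
    int[] counts = new int[parents.length];

    // In-collection aggregation: bump the count of each stored leaf ordinal.
    for (int[] ords : docLeafOrds) {
      for (int ord : ords) {
        counts[ord]++;
      }
    }

    // Roll counts up to the parents, highest ordinal first, so that a child
    // is fully counted before it is added to its parent. The root is skipped.
    for (int ord = parents.length - 1; ord > 0; ord--) {
      int parent = parents[ord];
      if (parent > 0) {
        counts[parent] += counts[ord];
      }
    }
    return counts;
  }
}
```

With a toy taxonomy 0=root, 1=Author, 2=Author/Shai, 3=Author/Mike, 4=Date,
5=Date/2012 (parents = {-1, 0, 1, 1, 0, 4}), three documents storing {2, 5},
{3, 5} and {2} give Author a count of 3, one per document. A single document
storing both {2, 3} would give Author a count of 2, which is why this sketch,
like the path described in the issue, assumes one value per dimension.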