[
https://issues.apache.org/jira/browse/LUCENE-4647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13541894#comment-13541894
]
Shai Erera edited comment on LUCENE-4647 at 1/1/13 6:15 PM:
------------------------------------------------------------
Attached patch kinda overhauls how facets are indexed. The concept stays the
same, but I sort of rewrote it all. Here's how the code in the patch works:
* {{FacetFields}} (previously {{CategoryDocumentBuilder}}) takes an
{{Iterable<CategoryPath>}} and indexes them in two steps:
*# {{DrillDownStream}} indexes the drill-down tokens, e.g. {{$facets:Author}},
{{$facets:Author/Bob}}
*# {{CategoryListBuilder}} encodes the category ordinals into a
{{Map<String,BytesRef>}} (more on this later), which is later set as the
payload of {{$facets:$fulltree$}}
** Both these steps work per {{CategoryListParams}} (in case the application
wishes to index groups of facets in different category lists)
* {{AssociationsFacetFields}} (previously {{EnhancementsDocumentBuilder}})
extends {{FacetFields}} and takes a {{CategoryAssociationsContainer}} (which
implements {{Iterable<CategoryPath>}}) and holds a mapping from a
{{CategoryPath}} to {{CategoryAssociation}}
** {{AssociationsDrillDownStream}} extends {{DrillDownStream}} and adds
association values to the drill-down tokens' payloads
** {{AssociationsCategoryListBuilder}} extends {{CategoryListBuilder}} and
encodes {{<category,association-value>}} pairs into their own {{BytesRef}}
* {{CategoryAssociation}} replaces {{CategoryEnhancement}} and simplifies the
usage (end-user wise) a lot !
** Two implementations {{CategoryIntAssociation}} and
{{CategoryFloatAssociation}} support the previously {{AssociationEnhancement}}
+ {{AssociationInt/FloatProperty}} and allow associating an int/float value
with a category
** Every {{CategoryAssociation}} impl is responsible for serialization of its
value to a {{ByteArrayDataOutput}} (and de-serialize from
{{ByteArrayDataInput}})
** Every implementation needs to specify its {{categoryListID}}, since it
determines the term payload under which the association values are encoded
** The two {{FacetRequests}}, {{AssociationIntSumFacetRequest}} and
{{AssociationFloatSumFacetRequest}}, work with {{CategoryAssociation}} rather
than the enhancement
** {{EnhancementsIndexingParams}} were removed, and togeher with them all the
'attributes', 'enhancements' and 'streaming' packages
* The {{Map<String,BytesRef>}} easily supports partitions and associations:
** When simple categories are indexed, no partitions, a single entry exists in
the map
** When simple categories are indexed with partitions, an entry per partition
exists in the map, e.g. {{$facets:$fulltree$1}}, {{$facets:$fulltree$2}} etc.
** When associations are indexed, the map contains the ordinals list (as
described above) and the association values per
{{CategoryAssociation.getCategoryListID()}}, so e.g. an int association is
encoded into a different {{BytesRef}} than a float one
I chose to implement it all from scratch because the code was very intertwined
and complicated, much because of a very complicated desing for enhancements. At
least to me, the code is now much simpler. Migrating facets from this code to
DocValues should be an easy task - all that needs to be done is to write the
output of {{CategoryListBuilder}} to a DocValues field, rather than a
TokenStream payload.
The patch is huge, but mostly because of all the code that's been removed. When
you review it, focus on the classes mentioned above.
*NOTE:* the new associations code breaks backwards compatibility with old
indexes:
# Previously both the int and float associations were written to the same
associations list as integers, and the float one used {{Float.intBitsToFloat}}
and vice versa. Now they are written to two separate lists
# Previously the code supported multiple enhancements which affected how they
were encoded (e.g. {{#ENHANCEMENTS, #ENH_LENGTHS, #ENH_BYTES}}). But we always
had only one enhancement ({{AssociationEnhancement}}), so that encoding was
really redundant.
# In order to support multiple {{CategoryAssociations}} per {{CategoryPath}},
one can easily write a {{CompoundAssociation}} and take care of its
serialization.
Since enhancements/associations are quite an advanced feature, I think this
break makes sense. We can always add a backwards layer later if someone
complains (and cannot reindex). Keeping the previous code, which was prepared
to handle multiple {{CategoryEnhancement}} types, while only one enhancement
was available to our users did not make sense to me.
was (Author: shaie):
Attached patch kinda overhauls how facets are indexed. The concept stays
the same, but I sort of rewrote it all. Here's how the code in the patch works:
* {{FacetFields}} (previously {{CategoryDocumentBuilder}}) takes an
{{Iterable<CategoryPath>}} and indexes them in two steps:
*# {{DrillDownStream}} indexes the drill-down tokens, e.g. {{$facets:Author}},
{{$facets:Author/Bob}}
*# {{CategoryListBuilder}} encodes the category ordinals into a
{{Map<String,BytesRef>}} (more on this later), which is later set as the
payload of {{$facets:$fulltree$}}
** Both these steps work per {{CategoryListParams}} (in case the application
wishes to index groups of facets in different category lists)
* {{AssociationsFacetFields}} (previously {{EnhancementsDocumentBuilder}})
extends {{FacetFields}} and takes a {{CategoryAssociationsContainer}} (which
implements {{Iterable<CategoryPath>}}) and holds a mapping from a
{{CategoryPath}} to {{CategoryAssociation}}
** {{AssociationsDrillDownStream}} extends {{DrillDownStream}} and adds
association values to the drill-down tokens' payloads
** {{AssociationsCategoryListBuilder}} extends {{CategoryListBuilder}} and
encodes {{category,association-value}} pairs into their own {{BytesRef}}
* {{CategoryAssociation}} replaces {{CategoryEnhancement}} and simplifies the
usage (end-user wise) a lot !
** Two implementations {{CategoryIntAssociation}} and
{{CategoryFloatAssociation}} support the previously {{AssociationEnhancement}}
+ {{AssociationInt/FloatProperty}} and allow associating an int/float value
with a category
** Every {{CategoryAssociation}} impl is responsible for serialization of its
value to a {{ByteArrayDataOutput}} (and de-serialize from {{ByteArrayDataInput}}
** Every implementation needs to specify its {{categoryListID}}, since it
determines the term payload under which the association values are encoded
** The two {{FacetRequests}}, {{AssociationIntSumFacetRequest}} and
{{AssociationFloatSumFacetRequest}}, work with {{CategoryAssociation}} rather
than the enhancement
** {{EnhancementsIndexingParams}} were removed, and togeher with them all the
'attributes', 'enhancements' and 'streaming' packages
* The {{Map<String,BytesRef>}} easily supports partitions and associations:
** When simple categories are indexed, no partitions, a single entry exists in
the map
** When simple categories are indexed with partitions, an entry per partition
exists in the map, e.g. {{$facets:$fulltree$1}}, {{$facets:$fulltree$2}} etc.
** When associations are indexed, the map contains the ordinals list (as
described above) and the association values per
{{CategoryAssociation.getCategoryListID()}}, so e.g. an int association is
encoded into a different {{BytesRef}} than a float one
I chose to implement it all from scratch because the code was very intertwined
and complicated, much because of a very complicated desing for enhancements. At
least to me, the code is now much simpler. Migrating facets from this code to
DocValues should be an easy task - all that needs to be done is to write the
output of {{CategoryListBuilder}} to a DocValues field, rather than a
TokenStream payload.
The patch is huge, but mostly because of all the code that's been removed. When
you review it, focus on the classes mentioned above.
*NOTE:* the new associations code breaks backwards compatibility with old
indexes:
# Previously both the int and float associations were written to the same
associations list as integers, and the float one used {{Float.intBitsToFloat}}
and vice versa. Now they are written to two separate lists
# Previously the code supported multiple enhancements which affected how they
were encoded (e.g. {{#ENHANCEMENTS, #ENH_LENGTHS, #ENH_BYTES}}). But we always
had only one enhancement ({{AssociationEnhancement}}), so that encoding was
really redundant.
# In order to support multiple {{CategoryAssociations}} per {{CategoryPath}},
one can easily write a {{CompoundAssociation}} and take care of its
serialization.
Since enhancements/associations are quite an advanced feature, I think this
break makes sense. We can always add a backwards layer later if someone
complains (and cannot reindex). Keeping the previous code, which was prepared
to handle multiple {{CategoryEnhancement}} types, while only one enhancement
was available to our users did not make sense to me.
> Simplify CategoryDocumentBuilder
> --------------------------------
>
> Key: LUCENE-4647
> URL: https://issues.apache.org/jira/browse/LUCENE-4647
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Shai Erera
> Assignee: Shai Erera
> Attachments: LUCENE-4647.patch
>
>
> CategoryDocumentBuilder is used to add facet fields to a document. Today the
> usage is not so straightforward, and I'd like to simplify it. First, to
> improve usage but also to make cutover to DocValues easier.
> This clsas does two operations: (1) adds drill-down terms and (2) creates the
> fulltree payload. Today, since it does it all on terms, there's a hairy
> TokenStream which does both these operations in one go. For simplicity, I'd
> like to break this into two steps:
> # Write a TokenStream which adds the drill-down terms
> #* For no associations, terms should be indexed w/ DOCS_ONLY (see
> LUCENE-4623).
> #* With associations, drill-down terms have payload too.
> #* So EnhancementsDocumentBuilder should be able to extend this stream.
> # Write some API which can be used to build the fulltree payload (i.e.
> populate a byte[]). Currently that byte[] will be written to a payload and
> later to DV.
> Hopefully, I'd like to have FacetsDocumentBuilder (maybe just FacetsBuilder?)
> which only handles things with no associations, and EnhancementsDocBuilder
> which extends whatever is needed to add associations.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]