[ https://issues.apache.org/jira/browse/LUCENE-4647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13541894#comment-13541894 ]
Shai Erera edited comment on LUCENE-4647 at 1/1/13 6:15 PM: ------------------------------------------------------------ Attached patch kinda overhauls how facets are indexed. The concept stays the same, but I sort of rewrote it all. Here's how the code in the patch works: * {{FacetFields}} (previously {{CategoryDocumentBuilder}}) takes an {{Iterable<CategoryPath>}} and indexes them in two steps: *# {{DrillDownStream}} indexes the drill-down tokens, e.g. {{$facets:Author}}, {{$facets:Author/Bob}} *# {{CategoryListBuilder}} encodes the category ordinals into a {{Map<String,BytesRef>}} (more on this later), which is later set as the payload of {{$facets:$fulltree$}} ** Both these steps work per {{CategoryListParams}} (in case the application wishes to index groups of facets in different category lists) * {{AssociationsFacetFields}} (previously {{EnhancementsDocumentBuilder}}) extends {{FacetFields}} and takes a {{CategoryAssociationsContainer}} (which implements {{Iterable<CategoryPath>}}) and holds a mapping from a {{CategoryPath}} to {{CategoryAssociation}} ** {{AssociationsDrillDownStream}} extends {{DrillDownStream}} and adds association values to the drill-down tokens' payloads ** {{AssociationsCategoryListBuilder}} extends {{CategoryListBuilder}} and encodes {{<category,association-value>}} pairs into their own {{BytesRef}} * {{CategoryAssociation}} replaces {{CategoryEnhancement}} and simplifies the usage (end-user wise) a lot ! ** Two implementations {{CategoryIntAssociation}} and {{CategoryFloatAssociation}} support the previously {{AssociationEnhancement}} + {{AssociationInt/FloatProperty}} and allow associating an int/float value with a category ** Every {{CategoryAssociation}} impl is responsible for serialization of its value to a {{ByteArrayDataOutput}} (and de-serialize from {{ByteArrayDataInput}}) ** Every implementation needs to specify its {{categoryListID}}, since it determines the term payload under which the association values are encoded ** The two {{FacetRequests}}, {{AssociationIntSumFacetRequest}} and {{AssociationFloatSumFacetRequest}}, work with {{CategoryAssociation}} rather than the enhancement ** {{EnhancementsIndexingParams}} were removed, and togeher with them all the 'attributes', 'enhancements' and 'streaming' packages * The {{Map<String,BytesRef>}} easily supports partitions and associations: ** When simple categories are indexed, no partitions, a single entry exists in the map ** When simple categories are indexed with partitions, an entry per partition exists in the map, e.g. {{$facets:$fulltree$1}}, {{$facets:$fulltree$2}} etc. ** When associations are indexed, the map contains the ordinals list (as described above) and the association values per {{CategoryAssociation.getCategoryListID()}}, so e.g. an int association is encoded into a different {{BytesRef}} than a float one I chose to implement it all from scratch because the code was very intertwined and complicated, much because of a very complicated desing for enhancements. At least to me, the code is now much simpler. Migrating facets from this code to DocValues should be an easy task - all that needs to be done is to write the output of {{CategoryListBuilder}} to a DocValues field, rather than a TokenStream payload. The patch is huge, but mostly because of all the code that's been removed. When you review it, focus on the classes mentioned above. *NOTE:* the new associations code breaks backwards compatibility with old indexes: # Previously both the int and float associations were written to the same associations list as integers, and the float one used {{Float.intBitsToFloat}} and vice versa. Now they are written to two separate lists # Previously the code supported multiple enhancements which affected how they were encoded (e.g. {{#ENHANCEMENTS, #ENH_LENGTHS, #ENH_BYTES}}). But we always had only one enhancement ({{AssociationEnhancement}}), so that encoding was really redundant. # In order to support multiple {{CategoryAssociations}} per {{CategoryPath}}, one can easily write a {{CompoundAssociation}} and take care of its serialization. Since enhancements/associations are quite an advanced feature, I think this break makes sense. We can always add a backwards layer later if someone complains (and cannot reindex). Keeping the previous code, which was prepared to handle multiple {{CategoryEnhancement}} types, while only one enhancement was available to our users did not make sense to me. was (Author: shaie): Attached patch kinda overhauls how facets are indexed. The concept stays the same, but I sort of rewrote it all. Here's how the code in the patch works: * {{FacetFields}} (previously {{CategoryDocumentBuilder}}) takes an {{Iterable<CategoryPath>}} and indexes them in two steps: *# {{DrillDownStream}} indexes the drill-down tokens, e.g. {{$facets:Author}}, {{$facets:Author/Bob}} *# {{CategoryListBuilder}} encodes the category ordinals into a {{Map<String,BytesRef>}} (more on this later), which is later set as the payload of {{$facets:$fulltree$}} ** Both these steps work per {{CategoryListParams}} (in case the application wishes to index groups of facets in different category lists) * {{AssociationsFacetFields}} (previously {{EnhancementsDocumentBuilder}}) extends {{FacetFields}} and takes a {{CategoryAssociationsContainer}} (which implements {{Iterable<CategoryPath>}}) and holds a mapping from a {{CategoryPath}} to {{CategoryAssociation}} ** {{AssociationsDrillDownStream}} extends {{DrillDownStream}} and adds association values to the drill-down tokens' payloads ** {{AssociationsCategoryListBuilder}} extends {{CategoryListBuilder}} and encodes {{category,association-value}} pairs into their own {{BytesRef}} * {{CategoryAssociation}} replaces {{CategoryEnhancement}} and simplifies the usage (end-user wise) a lot ! ** Two implementations {{CategoryIntAssociation}} and {{CategoryFloatAssociation}} support the previously {{AssociationEnhancement}} + {{AssociationInt/FloatProperty}} and allow associating an int/float value with a category ** Every {{CategoryAssociation}} impl is responsible for serialization of its value to a {{ByteArrayDataOutput}} (and de-serialize from {{ByteArrayDataInput}} ** Every implementation needs to specify its {{categoryListID}}, since it determines the term payload under which the association values are encoded ** The two {{FacetRequests}}, {{AssociationIntSumFacetRequest}} and {{AssociationFloatSumFacetRequest}}, work with {{CategoryAssociation}} rather than the enhancement ** {{EnhancementsIndexingParams}} were removed, and togeher with them all the 'attributes', 'enhancements' and 'streaming' packages * The {{Map<String,BytesRef>}} easily supports partitions and associations: ** When simple categories are indexed, no partitions, a single entry exists in the map ** When simple categories are indexed with partitions, an entry per partition exists in the map, e.g. {{$facets:$fulltree$1}}, {{$facets:$fulltree$2}} etc. ** When associations are indexed, the map contains the ordinals list (as described above) and the association values per {{CategoryAssociation.getCategoryListID()}}, so e.g. an int association is encoded into a different {{BytesRef}} than a float one I chose to implement it all from scratch because the code was very intertwined and complicated, much because of a very complicated desing for enhancements. At least to me, the code is now much simpler. Migrating facets from this code to DocValues should be an easy task - all that needs to be done is to write the output of {{CategoryListBuilder}} to a DocValues field, rather than a TokenStream payload. The patch is huge, but mostly because of all the code that's been removed. When you review it, focus on the classes mentioned above. *NOTE:* the new associations code breaks backwards compatibility with old indexes: # Previously both the int and float associations were written to the same associations list as integers, and the float one used {{Float.intBitsToFloat}} and vice versa. Now they are written to two separate lists # Previously the code supported multiple enhancements which affected how they were encoded (e.g. {{#ENHANCEMENTS, #ENH_LENGTHS, #ENH_BYTES}}). But we always had only one enhancement ({{AssociationEnhancement}}), so that encoding was really redundant. # In order to support multiple {{CategoryAssociations}} per {{CategoryPath}}, one can easily write a {{CompoundAssociation}} and take care of its serialization. Since enhancements/associations are quite an advanced feature, I think this break makes sense. We can always add a backwards layer later if someone complains (and cannot reindex). Keeping the previous code, which was prepared to handle multiple {{CategoryEnhancement}} types, while only one enhancement was available to our users did not make sense to me. > Simplify CategoryDocumentBuilder > -------------------------------- > > Key: LUCENE-4647 > URL: https://issues.apache.org/jira/browse/LUCENE-4647 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet > Reporter: Shai Erera > Assignee: Shai Erera > Attachments: LUCENE-4647.patch > > > CategoryDocumentBuilder is used to add facet fields to a document. Today the > usage is not so straightforward, and I'd like to simplify it. First, to > improve usage but also to make cutover to DocValues easier. > This clsas does two operations: (1) adds drill-down terms and (2) creates the > fulltree payload. Today, since it does it all on terms, there's a hairy > TokenStream which does both these operations in one go. For simplicity, I'd > like to break this into two steps: > # Write a TokenStream which adds the drill-down terms > #* For no associations, terms should be indexed w/ DOCS_ONLY (see > LUCENE-4623). > #* With associations, drill-down terms have payload too. > #* So EnhancementsDocumentBuilder should be able to extend this stream. > # Write some API which can be used to build the fulltree payload (i.e. > populate a byte[]). Currently that byte[] will be written to a payload and > later to DV. > Hopefully, I'd like to have FacetsDocumentBuilder (maybe just FacetsBuilder?) > which only handles things with no associations, and EnhancementsDocBuilder > which extends whatever is needed to add associations. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org