[ 
https://issues.apache.org/jira/browse/LUCENE-4647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13541894#comment-13541894
 ] 

Shai Erera edited comment on LUCENE-4647 at 1/1/13 6:15 PM:
------------------------------------------------------------

Attached patch kinda overhauls how facets are indexed. The concept stays the 
same, but I sort of rewrote it all. Here's how the code in the patch works:

* {{FacetFields}} (previously {{CategoryDocumentBuilder}}) takes an 
{{Iterable<CategoryPath>}} and indexes them in two steps:
*# {{DrillDownStream}} indexes the drill-down tokens, e.g. {{$facets:Author}}, 
{{$facets:Author/Bob}}
*# {{CategoryListBuilder}} encodes the category ordinals into a 
{{Map<String,BytesRef>}} (more on this later), which is later set as the 
payload of {{$facets:$fulltree$}}
** Both these steps work per {{CategoryListParams}} (in case the application 
wishes to index groups of facets in different category lists)

* {{AssociationsFacetFields}} (previously {{EnhancementsDocumentBuilder}}) 
extends {{FacetFields}} and takes a {{CategoryAssociationsContainer}} (which 
implements {{Iterable<CategoryPath>}}) and holds a mapping from a 
{{CategoryPath}} to {{CategoryAssociation}}
** {{AssociationsDrillDownStream}} extends {{DrillDownStream}} and adds 
association values to the drill-down tokens' payloads
** {{AssociationsCategoryListBuilder}} extends {{CategoryListBuilder}} and 
encodes {{<category,association-value>}} pairs into their own {{BytesRef}}

* {{CategoryAssociation}} replaces {{CategoryEnhancement}} and simplifies the 
usage (end-user wise) a lot !
** Two implementations {{CategoryIntAssociation}} and 
{{CategoryFloatAssociation}} support the previously {{AssociationEnhancement}} 
+ {{AssociationInt/FloatProperty}} and allow associating an int/float value 
with a category
** Every {{CategoryAssociation}} impl is responsible for serialization of its 
value to a {{ByteArrayDataOutput}} (and de-serialize from 
{{ByteArrayDataInput}})
** Every implementation needs to specify its {{categoryListID}}, since it 
determines the term payload under which the association values are encoded
** The two {{FacetRequests}}, {{AssociationIntSumFacetRequest}} and 
{{AssociationFloatSumFacetRequest}}, work with {{CategoryAssociation}} rather 
than the enhancement
** {{EnhancementsIndexingParams}} were removed, and togeher with them all the 
'attributes', 'enhancements' and 'streaming' packages

* The {{Map<String,BytesRef>}} easily supports partitions and associations:
** When simple categories are indexed, no partitions, a single entry exists in 
the map
** When simple categories are indexed with partitions, an entry per partition 
exists in the map, e.g. {{$facets:$fulltree$1}}, {{$facets:$fulltree$2}} etc.
** When associations are indexed, the map contains the ordinals list (as 
described above) and the association values per 
{{CategoryAssociation.getCategoryListID()}}, so e.g. an int association is 
encoded into a different {{BytesRef}} than a float one

I chose to implement it all from scratch because the code was very intertwined 
and complicated, much because of a very complicated desing for enhancements. At 
least to me, the code is now much simpler. Migrating facets from this code to 
DocValues should be an easy task - all that needs to be done is to write the 
output of {{CategoryListBuilder}} to a DocValues field, rather than a 
TokenStream payload.

The patch is huge, but mostly because of all the code that's been removed. When 
you review it, focus on the classes mentioned above.

*NOTE:* the new associations code breaks backwards compatibility with old 
indexes:
# Previously both the int and float associations were written to the same 
associations list as integers, and the float one used {{Float.intBitsToFloat}} 
and vice versa. Now they are written to two separate lists
# Previously the code supported multiple enhancements which affected how they 
were encoded (e.g. {{#ENHANCEMENTS, #ENH_LENGTHS, #ENH_BYTES}}). But we always 
had only one enhancement ({{AssociationEnhancement}}), so that encoding was 
really redundant.
# In order to support multiple {{CategoryAssociations}} per {{CategoryPath}}, 
one can easily write a {{CompoundAssociation}} and take care of its 
serialization.

Since enhancements/associations are quite an advanced feature, I think this 
break makes sense. We can always add a backwards layer later if someone 
complains (and cannot reindex). Keeping the previous code, which was prepared 
to handle multiple {{CategoryEnhancement}} types, while only one enhancement 
was available to our users did not make sense to me.
                
      was (Author: shaie):
    Attached patch kinda overhauls how facets are indexed. The concept stays 
the same, but I sort of rewrote it all. Here's how the code in the patch works:

* {{FacetFields}} (previously {{CategoryDocumentBuilder}}) takes an 
{{Iterable<CategoryPath>}} and indexes them in two steps:
*# {{DrillDownStream}} indexes the drill-down tokens, e.g. {{$facets:Author}}, 
{{$facets:Author/Bob}}
*# {{CategoryListBuilder}} encodes the category ordinals into a 
{{Map<String,BytesRef>}} (more on this later), which is later set as the 
payload of {{$facets:$fulltree$}}
** Both these steps work per {{CategoryListParams}} (in case the application 
wishes to index groups of facets in different category lists)

* {{AssociationsFacetFields}} (previously {{EnhancementsDocumentBuilder}}) 
extends {{FacetFields}} and takes a {{CategoryAssociationsContainer}} (which 
implements {{Iterable<CategoryPath>}}) and holds a mapping from a 
{{CategoryPath}} to {{CategoryAssociation}}
** {{AssociationsDrillDownStream}} extends {{DrillDownStream}} and adds 
association values to the drill-down tokens' payloads
** {{AssociationsCategoryListBuilder}} extends {{CategoryListBuilder}} and 
encodes {{category,association-value}} pairs into their own {{BytesRef}}

* {{CategoryAssociation}} replaces {{CategoryEnhancement}} and simplifies the 
usage (end-user wise) a lot !
** Two implementations {{CategoryIntAssociation}} and 
{{CategoryFloatAssociation}} support the previously {{AssociationEnhancement}} 
+ {{AssociationInt/FloatProperty}} and allow associating an int/float value 
with a category
** Every {{CategoryAssociation}} impl is responsible for serialization of its 
value to a {{ByteArrayDataOutput}} (and de-serialize from {{ByteArrayDataInput}}
** Every implementation needs to specify its {{categoryListID}}, since it 
determines the term payload under which the association values are encoded
** The two {{FacetRequests}}, {{AssociationIntSumFacetRequest}} and 
{{AssociationFloatSumFacetRequest}}, work with {{CategoryAssociation}} rather 
than the enhancement
** {{EnhancementsIndexingParams}} were removed, and togeher with them all the 
'attributes', 'enhancements' and 'streaming' packages

* The {{Map<String,BytesRef>}} easily supports partitions and associations:
** When simple categories are indexed, no partitions, a single entry exists in 
the map
** When simple categories are indexed with partitions, an entry per partition 
exists in the map, e.g. {{$facets:$fulltree$1}}, {{$facets:$fulltree$2}} etc.
** When associations are indexed, the map contains the ordinals list (as 
described above) and the association values per 
{{CategoryAssociation.getCategoryListID()}}, so e.g. an int association is 
encoded into a different {{BytesRef}} than a float one

I chose to implement it all from scratch because the code was very intertwined 
and complicated, much because of a very complicated desing for enhancements. At 
least to me, the code is now much simpler. Migrating facets from this code to 
DocValues should be an easy task - all that needs to be done is to write the 
output of {{CategoryListBuilder}} to a DocValues field, rather than a 
TokenStream payload.

The patch is huge, but mostly because of all the code that's been removed. When 
you review it, focus on the classes mentioned above.

*NOTE:* the new associations code breaks backwards compatibility with old 
indexes:
# Previously both the int and float associations were written to the same 
associations list as integers, and the float one used {{Float.intBitsToFloat}} 
and vice versa. Now they are written to two separate lists
# Previously the code supported multiple enhancements which affected how they 
were encoded (e.g. {{#ENHANCEMENTS, #ENH_LENGTHS, #ENH_BYTES}}). But we always 
had only one enhancement ({{AssociationEnhancement}}), so that encoding was 
really redundant.
# In order to support multiple {{CategoryAssociations}} per {{CategoryPath}}, 
one can easily write a {{CompoundAssociation}} and take care of its 
serialization.

Since enhancements/associations are quite an advanced feature, I think this 
break makes sense. We can always add a backwards layer later if someone 
complains (and cannot reindex). Keeping the previous code, which was prepared 
to handle multiple {{CategoryEnhancement}} types, while only one enhancement 
was available to our users did not make sense to me.
                  
> Simplify CategoryDocumentBuilder
> --------------------------------
>
>                 Key: LUCENE-4647
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4647
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>         Attachments: LUCENE-4647.patch
>
>
> CategoryDocumentBuilder is used to add facet fields to a document. Today the 
> usage is not so straightforward, and I'd like to simplify it. First, to 
> improve usage but also to make cutover to DocValues easier.
> This clsas does two operations: (1) adds drill-down terms and (2) creates the 
> fulltree payload. Today, since it does it all on terms, there's a hairy 
> TokenStream which does both these operations in one go. For simplicity, I'd 
> like to break this into two steps:
> # Write a TokenStream which adds the drill-down terms
> #* For no associations, terms should be indexed w/ DOCS_ONLY (see 
> LUCENE-4623).
> #* With associations, drill-down terms have payload too.
> #* So EnhancementsDocumentBuilder should be able to extend this stream.
> # Write some API which can be used to build the fulltree payload (i.e. 
> populate a byte[]). Currently that byte[] will be written to a payload and 
> later to DV.
> Hopefully, I'd like to have FacetsDocumentBuilder (maybe just FacetsBuilder?) 
> which only handles things with no associations, and EnhancementsDocBuilder 
> which extends whatever is needed to add associations. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to