Thanks Mike/Gautam/Rob!

I've created LUCENE-9952 to track the work of making FacetResult#value
consistently report doc count (as the taxonomy-based implementations
do).

Rob (or anyone else), do you think there's value in _also_ reporting
a "field count"? It sounds like you may have some use cases for this.
Should we cut a separate issue to track adding a "field count" concept
to FacetResult?
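
To make the two notions concrete, here is a minimal sketch in plain Java.
It does not use Lucene's facet API; the class name FacetCountSketch and
the methods fieldCounts/docCounts are hypothetical, and modelling each
doc as a long[] of its values is an assumption made purely for
illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustration only: not Lucene code. Each long[] stands in for the
// values of one multi-valued document.
public class FacetCountSketch {

    /** "Field count": every occurrence counts, so one doc may contribute more than once. */
    public static Map<Long, Integer> fieldCounts(List<long[]> docs) {
        Map<Long, Integer> counts = new HashMap<>();
        for (long[] values : docs) {
            for (long v : values) {
                counts.merge(v, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** "Doc count": each doc contributes at most once per distinct value. */
    public static Map<Long, Integer> docCounts(List<long[]> docs) {
        Map<Long, Integer> counts = new HashMap<>();
        for (long[] values : docs) {
            Set<Long> seen = new HashSet<>(); // values already counted for this doc
            for (long v : values) {
                if (seen.add(v)) {
                    counts.merge(v, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Doc 0 holds the value 1 twice; doc 1 holds it once.
        List<long[]> docs = List.of(new long[] {1, 1, 2}, new long[] {1});
        System.out.println("field counts: " + fieldCounts(docs));
        System.out.println("doc counts:   " + docCounts(docs));
    }
}
```

With those two docs, value 1 gets a field count of 3 but a doc count of
2, while value 2 gets 1 under both definitions.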

Cheers,
-Greg

On Mon, May 10, 2021 at 8:05 AM Rob Audenaerde <[email protected]> wrote:
>
> Hi all,
>
> We use the facets a lot to generate all kinds of nice aggregates on our data,
> and we also needed to make the distinction between FIELD_COUNT and
> DOCUMENT_COUNT, where the former increases for each multi-value, and the
> latter only once for each document that contains that field at least once.
>
> Maybe that distinction can be worded/implemented somehow to make it all more 
> consistent?
>
> On Mon, May 10, 2021 at 5:02 PM Gautam Worah <[email protected]> wrote:
>>
>> Hi Greg,
>>
>> I think your understanding is correct. I tried to create test cases for 
>> FastTaxonomyFacetCounts (inherits from IntTaxonomyFacets) and 
>> LongValueFacetCounts.
>>
>> FastTaxonomyFacetCounts treats common values in a document as a single 
>> entity and returns the count of a dim+path as the number of documents that 
>> contain these fields.
>> On the other hand, LongValueFacetCounts treats values as unique and returns 
>> the number of instances of the dim+path value (each doc can be counted more 
>> than once).
>>
>> > In the case of single-value docs, this would
>> > also represent the total number of documents containing a value for
>> > the given dim+path, which seems fairly useful
>>
>> +1
>>
>> I think the <each doc can be counted more than once> logic also has merit.
>> For example, you could probably use it for counting the number of times a 
>> movie has been watched in a person->list of movies watched schema.
>>
>> I don't have any specific thoughts on the inconsistency issue because it 
>> seems that LongValueFacetCounts and IntTaxonomyFacets were designed for 
>> different purposes?
>> The latter supports hierarchical values, needs an explicit specification for 
>> multi values and supports the getSpecificValue API.
>> It does seem odd that different groups of taxonomy classes treat counts 
>> slightly differently.
>>
>> As a side note:
>> I think we can make the
>> org.apache.lucene.facet.TestLongValueFacetCounts#testRandomMultiValued test
>> case more robust by forcing it to use at least one duplicate multi-value?
>>
>> Thanks
>> - Gautam
>>
>>
>> On Sun, May 9, 2021 at 9:40 AM Greg Miller <[email protected]> wrote:
>>>
>>> Hi folks-
>>>
>>> I'm trying to make sure I have a proper understanding of what
>>> FacetResult#value is meant to represent, particularly in multi-valued
>>> doc scenarios. Apologies if I'm missing something obvious, but it
>>> seems that either my understanding is incorrect, or we have a bug in
>>> how we count multi-value docs. This is particularly relevant to me at
>>> the moment since I'm working on a couple facet-related changes, and I
>>> want to make sure I've got a proper understanding of this field.
>>> Thanks!
>>>
>>> From the Javadocs:
>>> /**
>>>  * Total value for this path (sum of all child counts, or sum of all
>>>  * child values), even those not included in the topN.
>>>  */
>>> public final Number value;
>>>
>>> So from the Javadocs, it seems this is simply the sum of all values
>>> for the given dim+path. In the case of single-value docs, this would
>>> also represent the total number of documents containing a value for
>>> the given dim+path, which seems fairly useful (i.e., it might be nice
>>> to know how many documents contain a value for a given facet
>>> dim+path). On the other hand, if docs can be multi-valued, this seems
>>> somewhat less useful. If this is truly the sum of the values for the
>>> given dim+path, each document can contribute more than one count, so
>>> the user can no longer interpret this as the number of documents that
>>> have at least one value for the facet dim+path. It seems as though it
>>> would be more useful to provide the number of documents with a given
>>> dim+path value instead of just the total count, but this is where I'm
>>> probably just misunderstanding something.
>>>
>>> Finally, looking at the way taxonomy facets are counted, it looks like
>>> this value is populated with the total number of documents, and
>>> populated with -1 in multi-value cases where an accurate doc count
>>> can't be provided (see IntTaxonomyFacets L:228 for example). This
>>> isn't consistent with the implementation in LongValueFacetCounts
>>> though, which will always populate the total of all values, ignoring
>>> single- vs. multi-valued cases (see LongValueFacetCounts L:163). It
>>> appears the implementation in SortedSetDocValuesFacetCounts will also
>>> "double count" multi-value cases similar to LongValueFacetCounts.
>>>
>>> So... which do we think it is? Is it meant to be the total number of
>>> docs, or the total of all values? Can anyone shed some light on this?
>>> Thanks a bunch!
>>>
>>> Cheers,
>>> -Greg
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
