Hi all,

We use facets heavily to generate all kinds of aggregates on our data, and
we also needed to distinguish between FIELD_COUNT and DOCUMENT_COUNT: the
former increases once for each value of a multi-valued field, while the
latter increases only once for each document that contains the field at
least once.
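
To make the two modes concrete, here is a minimal sketch in plain Java. It
does not use any Lucene API; the class and method names are made up for
this example, and each document is modelled simply as the list of values
it holds for one facet field.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration only: not part of Lucene.
final class FacetCountModes {

    // FIELD_COUNT: every occurrence of a value counts, so a document that
    // holds the same value twice contributes 2 to that value.
    static Map<String, Integer> fieldCount(List<List<String>> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> values : docs) {
            for (String value : values) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    // DOCUMENT_COUNT: a document contributes at most 1 per distinct value,
    // no matter how often the value repeats inside that document.
    static Map<String, Integer> documentCount(List<List<String>> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> values : docs) {
            Set<String> distinct = new HashSet<>(values);
            for (String value : distinct) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

For single-valued documents both methods return the same numbers; they
only diverge once a document repeats a value.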

Maybe that distinction can be worded/implemented somehow to make it all
more consistent?

On Mon, May 10, 2021 at 5:02 PM Gautam Worah <worah.gau...@gmail.com> wrote:

> Hi Greg,
>
> I think your understanding is correct. I tried to create test cases
> <https://github.com/gautamworah96/lucene/commit/042878117308f76629a27b0bcf83e25f074dc8b1>
> for FastTaxonomyFacetCounts (inherits from IntTaxonomyFacets) and
> LongValueFacetCounts.
>
> FastTaxonomyFacetCounts treats common values in a document as a single
> entity and returns the count of a dim+path as the number of documents that
> contain these fields.
> On the other hand, LongValueFacetCounts treats values as unique and
> returns the number of instances of the dim+path value (each doc can be
> counted more than once).
>
> > In the case of single-value docs, this would
> > also represent the total number of documents containing a value for
> > the given dim+path, which seems fairly useful
>
> +1
>
> I think the <each doc can be counted more than once> logic also has merit.
> For example, you could probably use it for counting the number of times a
> movie has been watched in a person->list of movies watched schema.
>
> I don't have any specific thoughts on the inconsistency issue because it
> seems that LongValueFacetCounts and IntTaxonomyFacets were designed for
> different purposes?
> The latter supports hierarchical values, needs an explicit specification
> for multi values and supports the getSpecificValue API.
> It does seem odd that different groups of taxonomy classes treat counts
> slightly differently.
>
> As a side note:
> I think we can make the
> org.apache.lucene.facet.TestLongValueFacetCounts#testRandomMultiValued test
> case more robust by forcing it to use at least one duplicate multi-value?
>
> Thanks
> - Gautam
>
>
> On Sun, May 9, 2021 at 9:40 AM Greg Miller <gsmil...@gmail.com> wrote:
>
>> Hi folks-
>>
>> I'm trying to make sure I have a proper understanding of what
>> FacetResult#value is meant to represent, particularly in multi-valued
>> doc scenarios. Apologies if I'm missing something obvious, but it
>> seems that either my understanding is incorrect, or we have a bug in
>> how we count multi-value docs. This is particularly relevant to me at
>> the moment since I'm working on a couple facet-related changes, and I
>> want to make sure I've got a proper understanding of this field.
>> Thanks!
>>
>> From the Javadocs:
>> /**
>>  * Total value for this path (sum of all child counts, or sum of all
>>  * child values), even those not included in the topN.
>>  */
>> public final Number value;
>>
>> So from the Javadocs, it seems this is simply the sum of all values
>> for the given dim+path. In the case of single-value docs, this would
>> also represent the total number of documents containing a value for
>> the given dim+path, which seems fairly useful (i.e., it might be nice
>> to know how many documents contain a value for a given facet
>> dim+path). On the other hand, if docs can be multi-valued, this seems
>> somewhat less useful. If this is truly the sum of the values for the
>> given dim+path, each document can contribute more than one count, so
>> the user can no longer interpret this as the number of documents that
>> have at least one value for the facet dim+path. It seems as though it
>> would be more useful to provide the number of documents with a given
>> dim+path value instead of just the total count, but this is where I'm
>> probably just misunderstanding something.
>>
>> Finally, looking at the way taxonomy facets are counted, it looks like
>> this value is populated with the total number of documents, and
>> populated with -1 in multi-value cases where an accurate doc count
>> can't be provided (see IntTaxonomyFacets L:228 for example). This
>> isn't consistent with the implementation in LongValueFacetCounts
>> though, which will always populate the total of all values, ignoring
>> single- vs. multi-valued cases (see LongValueFacetCounts L:163). It
>> appears the implementation in SortedSetDocValuesFacetCounts will also
>> "double count" multi-value cases similar to LongValueFacetCounts.
>>
>> So... which do we think it is? Is it meant to be the total number of
>> docs, or the total of all values? Can anyone shed some light on this?
>> Thanks a bunch!
>>
>> Cheers,
>> -Greg
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>
