Thanks Mike/Gautam/Rob! I've created LUCENE-9952 to track the work of making FacetResult#value consistently report doc count (as the taxonomy-based implementations do).
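For anyone skimming the thread, here is a minimal sketch of the two counting semantics under discussion. It is plain Java and does not use any Lucene API; the class and method names (FacetCountSketch, docCount, fieldCount) are hypothetical and exist only for illustration. Each inner list stands in for the values one document holds for a multi-valued facet field.

```java
import java.util.List;

// Hypothetical illustration only; this is not Lucene code.
final class FacetCountSketch {

  /** "Doc count": each document contributes at most 1, however often the value repeats. */
  static int docCount(List<List<Long>> docs, long value) {
    int count = 0;
    for (List<Long> docValues : docs) {
      if (docValues.contains(value)) {
        count++;
      }
    }
    return count;
  }

  /** "Field count": every occurrence of the value contributes 1, so a doc can count more than once. */
  static int fieldCount(List<List<Long>> docs, long value) {
    int count = 0;
    for (List<Long> docValues : docs) {
      for (long v : docValues) {
        if (v == value) {
          count++;
        }
      }
    }
    return count;
  }

  public static void main(String[] args) {
    // Three documents; the first one holds the value 7 twice.
    List<List<Long>> docs = List.of(
        List.of(7L, 7L, 3L),
        List.of(7L),
        List.of(3L));
    System.out.println("doc count for 7:   " + docCount(docs, 7L));   // prints 2
    System.out.println("field count for 7: " + fieldCount(docs, 7L)); // prints 3
  }
}
```

With single-valued docs the two numbers always agree, which is why the difference only shows up in the multi-valued case.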
Rob (or anyone else), do you think there's value in _also_ reporting "field count"? Sounds like you may have some use-cases for this. Should we cut a separate issue to track adding a "field count" concept to FacetResult?

Cheers,
-Greg

On Mon, May 10, 2021 at 8:05 AM Rob Audenaerde <[email protected]> wrote:
>
> Hi all,
>
> We use the facets a lot to generate all kinds of nice aggregates on our data,
> and we also needed to make the distinction between FIELD_COUNT and
> DOCUMENT_COUNT, where the former increases for each multi-value, and the
> latter only once for each document that contains that field at least once.
>
> Maybe that distinction can be worded/implemented somehow to make it all more
> consistent?
>
> On Mon, May 10, 2021 at 5:02 PM Gautam Worah <[email protected]> wrote:
>>
>> Hi Greg,
>>
>> I think your understanding is correct. I tried to create test cases for
>> FastTaxonomyFacetCounts (inherits from IntTaxonomyFacets) and
>> LongValueFacetCounts.
>>
>> FastTaxonomyFacetCounts treats common values in a document as a single
>> entity and returns the count of a dim+path as the number of documents that
>> contain these fields.
>> On the other hand, LongValueFacetCounts treats values as unique and returns
>> the number of instances of the dim+path value (each doc can be counted more
>> than once).
>>
>> > In the case of single-value docs, this would
>> also represent the total number of documents containing a value for
>> the given dim+path, which seems fairly useful
>> +1
>>
>> I think the <each doc can be counted more than once> logic also has merit.
>> For example, you could probably use it for counting the number of times a
>> movie has been watched in a person->list of movies watched schema.
>>
>> I don't have any specific thoughts on the inconsistency issue because it
>> seems that LongValueFacetCounts and IntTaxonomyFacets were designed for
>> different purposes?
>> The latter supports hierarchical values, needs an explicit specification for
>> multi values and supports the getSpecificValue API.
>> It does seem odd that different groups of taxonomy classes treat counts
>> slightly differently.
>>
>> As a side note:
>> I think we can make the
>> org.apache.lucene.facet.TestLongValueFacetCounts#testRandomMultiValued test
>> case more robust by forcing it to use at least one duplicate multi-value?
>>
>> Thanks
>> - Gautam
>>
>>
>> On Sun, May 9, 2021 at 9:40 AM Greg Miller <[email protected]> wrote:
>>>
>>> Hi folks-
>>>
>>> I'm trying to make sure I have a proper understanding of what
>>> FacetResult#value is meant to represent, particularly in multi-valued
>>> doc scenarios. Apologies if I'm missing something obvious, but it
>>> seems that either my understanding is incorrect, or we have a bug in
>>> how we count multi-value docs. This is particularly relevant to me at
>>> the moment since I'm working on a couple facet-related changes, and I
>>> want to make sure I've got a proper understanding of this field.
>>> Thanks!
>>>
>>> From the Javadocs:
>>> /**
>>> * Total value for this path (sum of all child counts, or sum of all
>>> child values), even those not
>>> * included in the topN.
>>> */
>>> public final Number value;
>>>
>>> So from the Javadocs, it seems this is simply the sum of all values
>>> for the given dim+path. In the case of single-value docs, this would
>>> also represent the total number of documents containing a value for
>>> the given dim+path, which seems fairly useful (i.e., it might be nice
>>> to know how many documents contain a value for a given facet
>>> dim+path). On the other hand, if docs can be multi-valued, this seems
>>> somewhat less useful. If this is truly the sum of the values for the
>>> given dim+path, each document can contribute more than one count, so
>>> the user can no longer interpret this as the number of documents that
>>> have at least one value for the facet dim+path.
>>> It seems as though it
>>> would be more useful to provide the number of documents with a given
>>> dim+path value instead of just the total count, but this is where I'm
>>> probably just misunderstanding something.
>>>
>>> Finally, looking at the way taxonomy facets are counted, it looks like
>>> this value is populated with the total number of documents, and
>>> populated with -1 in multi-value cases where an accurate doc count
>>> can't be provided (see IntTaxonomyFacets L:228 for example). This
>>> isn't consistent with the implementation in LongValueFacetCounts
>>> though, which will always populate the total of all values, ignoring
>>> single- vs. multi-valued cases (see LongValueFacetCounts L:163). It
>>> appears the implementation in SortedSetDocValuesFacetCounts will also
>>> "double count" multi-value cases similar to LongValueFacetCounts.
>>>
>>> So... which do we think it is? Is it meant to be the total number of
>>> docs, or the total of all values? Can anyone shed some light on this?
>>> Thanks a bunch!
>>>
>>> Cheers,
>>> -Greg
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
