Hi all,

We use facets a lot to generate all kinds of aggregates on our data, and we also needed to distinguish between FIELD_COUNT and DOCUMENT_COUNT: the former increases once for every value of a multi-valued field, while the latter increases only once for each document that contains the field at least once.
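
To make the two counting modes concrete, here is a minimal sketch in plain Java. It does not use Lucene's facet API; the class and method names (FacetCountModes, fieldCount, documentCount) and the sample values are made up for illustration. Each inner list stands for the values one document holds for a single field.

```java
import java.util.List;

public class FacetCountModes {

    /** FIELD_COUNT-style total: every value counts, so a multi-valued doc contributes several times. */
    public static int fieldCount(List<List<String>> valuesPerDoc) {
        int total = 0;
        for (List<String> values : valuesPerDoc) {
            total += values.size();
        }
        return total;
    }

    /** DOCUMENT_COUNT-style total: a doc counts once if it has at least one value for the field. */
    public static int documentCount(List<List<String>> valuesPerDoc) {
        int total = 0;
        for (List<String> values : valuesPerDoc) {
            if (!values.isEmpty()) {
                total++;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Three hypothetical docs: one multi-valued, one single-valued, one without the field.
        List<List<String>> valuesPerDoc = List.of(
            List.of("fantasy", "sci-fi"),
            List.of("fantasy"),
            List.<String>of());

        System.out.println("FIELD_COUNT=" + fieldCount(valuesPerDoc));       // FIELD_COUNT=3
        System.out.println("DOCUMENT_COUNT=" + documentCount(valuesPerDoc)); // DOCUMENT_COUNT=2
    }
}
```

With only single-valued documents the two totals coincide; they diverge as soon as one document carries more than one value.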
Maybe that distinction can be worded/implemented somehow to make it all more consistent?

On Mon, May 10, 2021 at 5:02 PM Gautam Worah <worah.gau...@gmail.com> wrote:

> Hi Greg,
>
> I think your understanding is correct. I tried to create test cases
> <https://github.com/gautamworah96/lucene/commit/042878117308f76629a27b0bcf83e25f074dc8b1>
> for FastTaxonomyFacetCounts (inherits from IntTaxonomyFacets) and
> LongValueFacetCounts.
>
> FastTaxonomyFacetCounts treats common values in a document as a single
> entity and returns the count of a dim+path as the number of documents that
> contain these fields.
> On the other hand, LongValueFacetCounts treats values as unique and
> returns the number of instances of the dim+path value (each doc can be
> counted more than once).
>
> > In the case of single-value docs, this would
> > also represent the total number of documents containing a value for
> > the given dim+path, which seems fairly useful
> +1
>
> I think the <each doc can be counted more than once> logic also has merit.
> For example, you could probably use it for counting the number of times a
> movie has been watched in a person->list of movies watched schema.
>
> I don't have any specific thoughts on the inconsistency issue because it
> seems that LongValueFacetCounts and IntTaxonomyFacets were designed for
> different purposes?
> The latter supports hierarchical values, needs an explicit specification
> for multi values and supports the getSpecificValue API.
> It does seem odd that different groups of taxonomy classes treat counts
> slightly differently.
>
> As a side note:
> I think we can make the
> org.apache.lucene.facet.TestLongValueFacetCounts#testRandomMultiValued test
> case more robust by forcing it to use atleast one duplicate multi-value?
>
> Thanks
> - Gautam
>
>
> On Sun, May 9, 2021 at 9:40 AM Greg Miller <gsmil...@gmail.com> wrote:
>
>> Hi folks-
>>
>> I'm trying to make sure I have a proper understanding of what
>> FacetResult#value is meant to represent, particularly in multi-valued
>> doc scenarios. Apologies if I'm missing something obvious, but it
>> seems that either my understanding is incorrect, or we have a bug in
>> how we count multi-value docs. This is particularly relevant to me at
>> the moment since I'm working on a couple facet-related changes, and I
>> want to make sure I've got a proper understanding of this field.
>> Thanks!
>>
>> From the Javadocs:
>> /**
>> * Total value for this path (sum of all child counts, or sum of all
>> child values), even those not
>> * included in the topN.
>> */
>> public final Number value;
>>
>> So from the Javadocs, it seems this is simply the sum of all values
>> for the given dim+path. In the case of single-value docs, this would
>> also represent the total number of documents containing a value for
>> the given dim+path, which seems fairly useful (i.e., it might be nice
>> to know how many documents contain a value for a given facet
>> dim+path). On the other hand, if docs can be multi-valued, this seems
>> somewhat less useful. If this is truly the sum of the values for the
>> given dim+path, each document can contribute more than one count, so
>> the user can no longer interpret this as the number of documents that
>> have at least one value for the facet dim+path. It seems as though it
>> would be more useful to provide the number of documents with a given
>> dim+path value instead of just the total count, but this is where I'm
>> probably just misunderstanding something.
>>
>> Finally, looking at the way taxonomy facets are counted, it looks like
>> this value is populated with the total number of documents, and
>> populated with -1 in multi-value cases where an accurate doc count
>> can't be provided (see IntTaxonomyFacets L:228 for example). This
>> isn't consistent with the implementation in LongValueFacetCounts
>> though, which will always populate the total of all values, ignoring
>> single- vs. multi-valued cases (see LongValueFacetCounts L:163). It
>> appears the implementation in SortedSetDocValuesFacetCounts will also
>> "double count" multi-value cases similar to LongValueFacetCounts.
>>
>> So... which do we think it is? Is it meant to be the total number of
>> docs, or the total of all values? Can anyone shed some light on this?
>> Thanks a bunch!
>>
>> Cheers,
>> -Greg
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org