Thanks Rob! For now, I've focused on making the javadoc a bit clearer and fixing the bug, but it's an interesting thought to see if there's a better way to make this extensible. I don't think it would be particularly challenging to do, so if there's enough interest in that functionality (i.e., making the calculation of FacetResult#value more extensible), I think we could open a separate issue to track it?

Cheers,
-Greg

On Tue, May 11, 2021 at 12:23 AM Rob Audenaerde <[email protected]> wrote:
>
> Hi Greg,
>
> Honestly, I think the implementation should just focus on getting the
> document-counts correct and consistent in all places, and having the javadoc
> explain what is happening. (That said, it would be nice if the
> implementations are easy to extend, so that ppl who want to implement
> field_count can implement this easily themselves :)
>
> In our solution, we use the AssociatedFacetFields a lot and do aggregates on
> them (like sum, max, avg), but I extended the base Facets class to implement
> this.
>
> -Rob
>
> On Mon, May 10, 2021 at 6:36 PM Greg Miller <[email protected]> wrote:
>>
>> Thanks Mike/Gautam/Rob!
>>
>> I've created LUCENE-9952 to track the work of making FacetResult#value
>> consistently report doc count (as the taxonomy-based implementations
>> do).
>>
>> Rob (or anyone else), do you think there's value in _also_ reporting
>> "field count?" Sounds like you may have some use-cases for this.
>> Should we cut a separate issue to track adding a "field count" concept
>> to FacetResult?
>>
>> Cheers,
>> -Greg
>>
>> On Mon, May 10, 2021 at 8:05 AM Rob Audenaerde <[email protected]>
>> wrote:
>> >
>> > Hi all,
>> >
>> > We use the facets a lot to generate all kinds of nice aggregates on our
>> > data, and we also needed to make the distinction between FIELD_COUNT and
>> > DOCUMENT_COUNT, where the former increases for each multi-value, and the
>> > latter only once for each document that contains that field at least once.
>> >
>> > Maybe that distinction can be worded/implemented somehow to make it all
>> > more consistent?
>> >
>> > On Mon, May 10, 2021 at 5:02 PM Gautam Worah <[email protected]>
>> > wrote:
>> >>
>> >> Hi Greg,
>> >>
>> >> I think your understanding is correct. I tried to create test cases for
>> >> FastTaxonomyFacetCounts (inherits from IntTaxonomyFacets) and
>> >> LongValueFacetCounts.
>> >>
>> >> FastTaxonomyFacetCounts treats common values in a document as a single
>> >> entity and returns the count of a dim+path as the number of documents
>> >> that contain these fields.
>> >> On the other hand, LongValueFacetCounts treats values as unique and
>> >> returns the number of instances of the dim+path value (each doc can be
>> >> counted more than once).
>> >>
>> >> > In the case of single-value docs, this would
>> >> > also represent the total number of documents containing a value for
>> >> > the given dim+path, which seems fairly useful
>> >> +1
>> >>
>> >> I think the <each doc can be counted more than once> logic also has merit.
>> >> For example, you could probably use it for counting the number of times a
>> >> movie has been watched in a person->list of movies watched schema.
>> >>
>> >> I don't have any specific thoughts on the inconsistency issue because it
>> >> seems that LongValueFacetCounts and IntTaxonomyFacets were designed for
>> >> different purposes?
>> >> The latter supports hierarchical values, needs an explicit specification
>> >> for multi values and supports the getSpecificValue API.
>> >> It does seem odd that different groups of taxonomy classes treat counts
>> >> slightly differently.
>> >>
>> >> As a side note:
>> >> I think we can make the
>> >> org.apache.lucene.facet.TestLongValueFacetCounts#testRandomMultiValued
>> >> test case more robust by forcing it to use at least one duplicate
>> >> multi-value?
>> >>
>> >> Thanks
>> >> - Gautam
>> >>
>> >>
>> >> On Sun, May 9, 2021 at 9:40 AM Greg Miller <[email protected]> wrote:
>> >>>
>> >>> Hi folks-
>> >>>
>> >>> I'm trying to make sure I have a proper understanding of what
>> >>> FacetResult#value is meant to represent, particularly in multi-valued
>> >>> doc scenarios. Apologies if I'm missing something obvious, but it
>> >>> seems that either my understanding is incorrect, or we have a bug in
>> >>> how we count multi-value docs.
>> >>> This is particularly relevant to me at
>> >>> the moment since I'm working on a couple facet-related changes, and I
>> >>> want to make sure I've got a proper understanding of this field.
>> >>> Thanks!
>> >>>
>> >>> From the Javadocs:
>> >>> /**
>> >>>  * Total value for this path (sum of all child counts, or sum of all
>> >>>  * child values), even those not included in the topN.
>> >>>  */
>> >>> public final Number value;
>> >>>
>> >>> So from the Javadocs, it seems this is simply the sum of all values
>> >>> for the given dim+path. In the case of single-value docs, this would
>> >>> also represent the total number of documents containing a value for
>> >>> the given dim+path, which seems fairly useful (i.e., it might be nice
>> >>> to know how many documents contain a value for a given facet
>> >>> dim+path). On the other hand, if docs can be multi-valued, this seems
>> >>> somewhat less useful. If this is truly the sum of the values for the
>> >>> given dim+path, each document can contribute more than one count, so
>> >>> the user can no longer interpret this as the number of documents that
>> >>> have at least one value for the facet dim+path. It seems as though it
>> >>> would be more useful to provide the number of documents with a given
>> >>> dim+path value instead of just the total count, but this is where I'm
>> >>> probably just misunderstanding something.
>> >>>
>> >>> Finally, looking at the way taxonomy facets are counted, it looks like
>> >>> this value is populated with the total number of documents, and
>> >>> populated with -1 in multi-value cases where an accurate doc count
>> >>> can't be provided (see IntTaxonomyFacets L:228 for example). This
>> >>> isn't consistent with the implementation in LongValueFacetCounts
>> >>> though, which will always populate the total of all values, ignoring
>> >>> single- vs. multi-valued cases (see LongValueFacetCounts L:163).
>> >>> It appears the implementation in SortedSetDocValuesFacetCounts will also
>> >>> "double count" multi-value cases similar to LongValueFacetCounts.
>> >>>
>> >>> So... which do we think it is? Is it meant to be the total number of
>> >>> docs, or the total of all values? Can anyone shed some light on this?
>> >>> Thanks a bunch!
>> >>>
>> >>> Cheers,
>> >>> -Greg
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: [email protected]
>> >>> For additional commands, e-mail: [email protected]
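
To make the distinction discussed in this thread concrete, here is a minimal plain-Java sketch of the two counting semantics: "doc count" (each document counted at most once per value) versus "field count" (every occurrence counted, so a multi-valued document can contribute more than once). It deliberately does not use Lucene. The class CountSemantics and the methods docCount and fieldCount are hypothetical names made up for illustration, and documents are modelled simply as lists of long values, which is an assumption rather than how any of the facet implementations store data.

```java
import java.util.List;

public class CountSemantics {

    // "Doc count": a document contributes at most 1, no matter how many
    // times it repeats the value.
    public static long docCount(List<List<Long>> docs, long value) {
        return docs.stream()
                .filter(doc -> doc.contains(Long.valueOf(value)))
                .count();
    }

    // "Field count": every occurrence of the value is counted, so a
    // multi-valued document can contribute more than once.
    public static long fieldCount(List<List<Long>> docs, long value) {
        return docs.stream()
                .flatMap(List::stream)
                .filter(v -> v.longValue() == value)
                .count();
    }

    public static void main(String[] args) {
        // Hypothetical data: three documents, the first one holds the value 7 twice.
        List<List<Long>> docs = List.of(
                List.of(7L, 7L, 3L),
                List.of(7L),
                List.of(3L));

        System.out.println("doc count for 7:   " + docCount(docs, 7L));   // 2
        System.out.println("field count for 7: " + fieldCount(docs, 7L)); // 3
    }
}
```

For single-valued documents the two numbers coincide, which matches the observation in the thread that the discrepancy only shows up in multi-valued cases.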
