Re: FacetResult#value semantics?

Greg Miller Tue, 11 May 2021 12:03:39 -0700

Thanks Rob! For now, I've focused on making the javadoc a bit more
clear and fixing the bug, but it's an interesting thought to see if
there's a better way to make this extensible. I don't think it would
be particularly challenging to do this, so if there's enough interest
in that functionality (i.e., making the calculation of
FacetResult#value more extensible) I think we could open a separate
issue to track?
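
For anyone following along, here is a minimal plain-Java sketch of the two
counting semantics discussed in this thread. It uses no Lucene APIs, and the
class and method names (FacetCountSketch, fieldCount, docCount) are
hypothetical, chosen only for illustration. It is not how the facet
implementations actually compute FacetResult#value; it only shows how the
two numbers diverge once a doc repeats a value.

```java
import java.util.List;

/** Plain-Java sketch (no Lucene APIs); all names here are hypothetical. */
public class FacetCountSketch {

  /** Field count: every occurrence counts, so one doc can contribute more than once. */
  public static int fieldCount(List<long[]> docs, long value) {
    int count = 0;
    for (long[] docValues : docs) {
      for (long v : docValues) {
        if (v == value) {
          count++;
        }
      }
    }
    return count;
  }

  /** Doc count: each doc contributes at most once, however often the value repeats. */
  public static int docCount(List<long[]> docs, long value) {
    int count = 0;
    for (long[] docValues : docs) {
      for (long v : docValues) {
        if (v == value) {
          count++;
          break; // stop after the first match in this doc
        }
      }
    }
    return count;
  }

  public static void main(String[] args) {
    List<long[]> docs = List.of(
        new long[] {10, 10, 20}, // doc 0: value 10 appears twice
        new long[] {10},         // doc 1
        new long[] {20, 30});    // doc 2
    System.out.println("field count for 10: " + fieldCount(docs, 10)); // 3
    System.out.println("doc count for 10:   " + docCount(docs, 10));   // 2
  }
}
```

With single-valued docs the two methods always agree, which is why the
difference only shows up in the multi-valued case.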


Cheers,
-Greg

On Tue, May 11, 2021 at 12:23 AM Rob Audenaerde
<[email protected]> wrote:
>
> Hi Greg,
>
> Honestly, I think the implementation should just focus on getting the
> document counts correct and consistent in all places, and having the javadoc
> explain what is happening. (That said, it would be nice if the
> implementations were easy to extend, so that people who want to implement
> field_count can implement this easily themselves :)
>
> In our solution, we use the AssociatedFacetFields a lot and do aggregates on
> them (like sum, max, avg), but I extended the base Facets class to implement
> this.
>
> -Rob
>
> On Mon, May 10, 2021 at 6:36 PM Greg Miller <[email protected]> wrote:
>>
>> Thanks Mike/Gautam/Rob!
>>
>> I've created LUCENE-9952 to track the work of making FacetResult#value
>> consistently report doc count (as the taxonomy-based implementations
>> do).
>>
>> Rob (or anyone else), do you think there's value in _also_ reporting
>> "field count"? Sounds like you may have some use-cases for this.
>> Should we cut a separate issue to track adding a "field count" concept
>> to FacetResult?
>>
>> Cheers,
>> -Greg
>>
>> On Mon, May 10, 2021 at 8:05 AM Rob Audenaerde <[email protected]> 
>> wrote:
>> >
>> > Hi all,
>> >
>> > We use the facets a lot to generate all kinds of nice aggregates on our
>> > data, and we also needed to make the distinction between FIELD_COUNT and
>> > DOCUMENT_COUNT, where the former increases for each multi-value, and the
>> > latter only once for each document that contains that field at least once.
>> >
>> > Maybe that distinction can be worded/implemented somehow to make it all 
>> > more consistent?
>> >
>> > On Mon, May 10, 2021 at 5:02 PM Gautam Worah <[email protected]> 
>> > wrote:
>> >>
>> >> Hi Greg,
>> >>
>> >> I think your understanding is correct. I tried to create test cases for 
>> >> FastTaxonomyFacetCounts (inherits from IntTaxonomyFacets) and 
>> >> LongValueFacetCounts.
>> >>
>> >> FastTaxonomyFacetCounts treats common values in a document as a single 
>> >> entity and returns the count of a dim+path as the number of documents 
>> >> that contain these fields.
>> >> On the other hand, LongValueFacetCounts treats values as unique and 
>> >> returns the number of instances of the dim+path value (each doc can be 
>> >> counted more than once).
>> >>
>> >> > In the case of single-value docs, this would
>> >> > also represent the total number of documents containing a value for
>> >> > the given dim+path, which seems fairly useful
>> >>
>> >> +1
>> >>
>> >> I think the <each doc can be counted more than once> logic also has merit.
>> >> For example, you could probably use it for counting the number of times a 
>> >> movie has been watched in a person->list of movies watched schema.
>> >>
>> >> I don't have any specific thoughts on the inconsistency issue because it 
>> >> seems that LongValueFacetCounts and IntTaxonomyFacets were designed for 
>> >> different purposes?
>> >> The latter supports hierarchical values, needs an explicit specification 
>> >> for multi values and supports the getSpecificValue API.
>> >> It does seem odd that different groups of taxonomy classes treat counts 
>> >> slightly differently.
>> >>
>> >> As a side note:
>> >> I think we can make the
>> >> org.apache.lucene.facet.TestLongValueFacetCounts#testRandomMultiValued
>> >> test case more robust by forcing it to use at least one duplicate
>> >> multi-value?
>> >>
>> >> Thanks
>> >> - Gautam
>> >>
>> >>
>> >> On Sun, May 9, 2021 at 9:40 AM Greg Miller <[email protected]> wrote:
>> >>>
>> >>> Hi folks-
>> >>>
>> >>> I'm trying to make sure I have a proper understanding of what
>> >>> FacetResult#value is meant to represent, particularly in multi-valued
>> >>> doc scenarios. Apologies if I'm missing something obvious, but it
>> >>> seems that either my understanding is incorrect, or we have a bug in
>> >>> how we count multi-value docs. This is particularly relevant to me at
>> >>> the moment since I'm working on a couple facet-related changes, and I
>> >>> want to make sure I've got a proper understanding of this field.
>> >>> Thanks!
>> >>>
>> >>> From the Javadocs:
>> >>> /**
>> >>>  * Total value for this path (sum of all child counts, or sum of all
>> >>>  * child values), even those not included in the topN.
>> >>>  */
>> >>> public final Number value;
>> >>>
>> >>> So from the Javadocs, it seems this is simply the sum of all values
>> >>> for the given dim+path. In the case of single-value docs, this would
>> >>> also represent the total number of documents containing a value for
>> >>> the given dim+path, which seems fairly useful (i.e., it might be nice
>> >>> to know how many documents contain a value for a given facet
>> >>> dim+path). On the other hand, if docs can be multi-valued, this seems
>> >>> somewhat less useful. If this is truly the sum of the values for the
>> >>> given dim+path, each document can contribute more than one count, so
>> >>> the user can no longer interpret this as the number of documents that
>> >>> have at least one value for the facet dim+path. It seems as though it
>> >>> would be more useful to provide the number of documents with a given
>> >>> dim+path value instead of just the total count, but this is where I'm
>> >>> probably just misunderstanding something.
>> >>>
>> >>> Finally, looking at the way taxonomy facets are counted, it looks like
>> >>> this value is populated with the total number of documents, and
>> >>> populated with -1 in multi-value cases where an accurate doc count
>> >>> can't be provided (see IntTaxonomyFacets L:228 for example). This
>> >>> isn't consistent with the implementation in LongValueFacetCounts
>> >>> though, which will always populate the total of all values, ignoring
>> >>> single- vs. multi-valued cases (see LongValueFacetCounts L:163). It
>> >>> appears the implementation in SortedSetDocValuesFacetCounts will also
>> >>> "double count" multi-value cases similar to LongValueFacetCounts.
>> >>>
>> >>> So... which do we think it is? Is it meant to be the total number of
>> >>> docs, or the total of all values? Can anyone shed some light on this?
>> >>> Thanks a bunch!
>> >>>
>> >>> Cheers,
>> >>> -Greg
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: [email protected]
>> >>> For additional commands, e-mail: [email protected]
>> >>>
>>