vasiliy-mikhailov opened a new pull request, #693:
URL: https://github.com/apache/commons-collections/pull/693

   ## Problem
   
   `SetOperations.cosineSimilarity` computes the product of the two 
cardinalities as an `int` before passing it to `Math.sqrt`:
   
   ```java
   // Given that the cardinality is an int then the product as a double will not
   // overflow, we can use one sqrt:
   return numerator == 0 ? 0 : numerator / Math.sqrt(cardinality(first) * 
cardinality(second));
   ```
   
   The comment states the product is meant to be computed as a `double` so it 
will not overflow, but without a cast the multiplication happens in `int`. For 
large inputs (each cardinality > 46341) the product exceeds `Integer.MAX_VALUE` 
and wraps to a negative value, so `Math.sqrt` returns `NaN` and both 
`cosineSimilarity` and `cosineDistance` become `NaN`.
   
   For example, two identical extractors of cardinality 50000 give `50000 * 
50000 = 2_500_000_000`, which overflows to a negative `int`; the similarity 
should be `1.0` but is `NaN`.
   
   ## Fix
   
   Cast the first operand to `double` so the product is computed in `double`, 
as the comment already intended. Both the `BitMapExtractor` and `BloomFilter` 
overloads had the same issue and are fixed.
   
   ## Test
   
   Adds `testCosineSimilarityLargeCardinalityNoOverflow` to 
`SetOperationsTest`, building extractors with cardinality 50000 and asserting 
`cosineSimilarity == 1.0` / `cosineDistance == 0.0`. It fails on `master` 
(`expected: <1.0> but was: <NaN>`) and passes with this change; the existing 
`SetOperationsTest` suite stays green.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to