>I’d favour dropping the round and adding it to the Changes.xml via a Jira
>ticket so it is noted if someone upgrades. They can always restore
>functionality to as-it-was by doing a round on the output of the class.
+1
>I’ve already made the test using the python distance.jaccard function from the
>distance library in the PR for Text-155. So changing the test is simple. It’s
>just the decision on whether to do it.
I think we can aim at implementing this for 1.7 (which from the looks of it
will have several bug fixes & improvements!).
Cheers
Bruno
On Friday, 8 March 2019, 10:54:32 am NZDT, Alex Herbert
<[email protected]> wrote:
Hi Bruno,
> On 7 Mar 2019, at 21:18, Bruno P. Kinoshita <[email protected]> wrote:
>
> Hi Alex,
> Can't recall why it was done that way. When the initial code for the edit
> distances was created, some Java libraries like Simmetrics,
> java-string-similarity, Lucene, and also R/Python code were used to verify
> the output of the edit distances.
> Maybe we used Math.round just to get a test passing, which I agree should
> have been documented.
> But even better if we just drop the Math.round and instead update the tests
> with that assertEquals(expected, actual, threshold) method, with a good
> enough threshold.
> What do you think?
I’d favour dropping the round and adding it to the Changes.xml via a Jira
ticket so it is noted if someone upgrades. They can always restore
functionality to as-it-was by doing a round on the output of the class.
If I understand the metric correctly (intersection over union), a difference
in the 3rd decimal place would require the union of the two character sets to
be above 200, i.e. strings that together contain over 200 unique characters,
e.g.
A) 0/200 = 0
B) 1/200 = 0.005
C) 2/200 = 0.01
In this case results A and C can be distinguished, but B and C cannot, because
B rounds up to 0.01.
So in practical terms it would not make a difference unless using a large
character set. For ASCII strings there is no difference.
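The arithmetic above can be checked with a small standalone sketch. It is not
the Commons Text source: the class name, the helper names, and the empty-input
behaviour are made up for illustration, and the rounding is assumed to be of
the form Math.round(x * 100d) / 100d.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; not the Commons Text implementation.
final class JaccardRoundingSketch {

    // Jaccard similarity over the sets of characters: |intersection| / |union|.
    static double jaccard(CharSequence left, CharSequence right) {
        Set<Character> a = toSet(left);
        Set<Character> b = toSet(right);
        Set<Character> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 0.0; // assumption of this sketch for two empty inputs
        }
        Set<Character> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return (double) intersection.size() / union.size();
    }

    // Collects the distinct characters of a sequence.
    static Set<Character> toSet(CharSequence cs) {
        Set<Character> set = new HashSet<>();
        for (int i = 0; i < cs.length(); i++) {
            set.add(cs.charAt(i));
        }
        return set;
    }

    // Assumed form of the rounding to 2 decimal places under discussion.
    static double roundTo2(double value) {
        return Math.round(value * 100d) / 100d;
    }

    public static void main(String[] args) {
        // Cases A, B and C: 0, 1 and 2 shared characters out of a union of 200.
        for (int shared = 0; shared <= 2; shared++) {
            double raw = shared / 200.0;
            System.out.println(shared + "/200 raw=" + raw
                    + " rounded=" + roundTo2(raw));
        }
    }
}
```

Running main prints the raw and rounded values for A, B and C side by side.
Anyone who wants the old behaviour after the round is dropped could apply
something like roundTo2 to the unrounded result themselves.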
I’ve already made the test using the python distance.jaccard function from the
distance library in the PR for Text-155. So changing the test is simple. It’s
just the decision on whether to do it.
Alex
> Cheers
> Bruno
>
> On Friday, 8 March 2019, 4:49:52 am NZDT, Alex Herbert
><[email protected]> wrote:
>
> A quick question about the JaccardSimilarity class:
>
> Q. Why does it round the similarity to 2 decimal places?
>
> This is not documented.
>
> It is also done in the complementary JaccardDistance class.
>
> Looking at the history in git it seems to have always been that way.
> First commit was 2016-11-27.
>
> Thanks,
>
> Alex
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>