It makes perfect sense to use Jaccard distances for float values; see for example http://www.ncbi.nlm.nih.gov/pubmed/16794951
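For concreteness, the generalized (weighted) form that the code below computes, assuming nonnegative vectors, is

    d(a, b) = 1 - sum_i min(a_i, b_i) / sum_i max(a_i, b_i)

which reduces to the usual set-based Jaccard dissimilarity when a and b are 0/1 vectors.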
Nevertheless, the problem is just one of implementation; the time spent should be comparable with Euclidean. The issue I mean is that the generic implementation used in the Distances package is a problem for this distance, because a simple loop is much faster. I presume there is an optimization issue, as the difference in time with Euclidean is many orders of magnitude larger than what can be expected from the complexity. The funny thing is that min and max also seem to be part of the problem, as can be seen in the following:

function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})
    num = 0.
    den = 0.
    for i in 1:length(a)
        @inbounds ai = a[i]
        @inbounds bi = b[i]
        num = num + min(ai, bi)
        den = den + max(ai, bi)
    end
    1. - num/den
end

function testDistances2(v1::Array{Float64,1}, v2::Array{Float64,1})
    for i in 1:50000
        myjaccard2(v1, v2)
    end
end

On my machine:

julia> @time testDistances2(v1, v2)
  3.217329 seconds (200.01 M allocations: 2.981 GB, 19.91% gc time)

function myjaccard5(a::Array{Float64,1}, b::Array{Float64,1})
    num = 0.
    den = 0.
    for i in 1:length(a)
        @inbounds ai = a[i]
        @inbounds bi = b[i]
        abs_m = abs(ai - bi)
        abs_p = abs(ai + bi)
        num += abs_p - abs_m
        den += abs_p + abs_m
    end
    1. - num/den
end

function testDistances5(a::Array{Float64,1}, b::Array{Float64,1})
    for i in 1:5000
        myjaccard5(a, b)
    end
end

julia> @time testDistances5(v1, v2)
  0.166979 seconds (4 allocations: 160 bytes)

We see that using abs is faster. (For nonnegative entries, abs_p - abs_m = 2*min(ai, bi) and abs_p + abs_m = 2*max(ai, bi), so the factor of two cancels in the ratio. Also note that testDistances2 runs 50000 iterations while testDistances5 runs 5000, so per call the abs version is roughly twice as fast here.) I am not opening a pull request because I would expect a good implementation to be only 2 or 3 times slower than Euclidean, and I do not have that yet; one more variant to try is sketched at the end of this message.

On Monday, June 13, 2016 at 1:43:00 PM UTC+2, Kristoffer Carlsson wrote:
>
> It seems weird to me that you guys want to call Jaccard distance with
> float arrays. AFAIK Jaccard distance measures the distance between two
> distinct samples from a pair of sets, so basically between two Vector{Bool};
> see:
> http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html
>
> "Computes the Jaccard-Needham dissimilarity between two boolean 1-D
> arrays."
>
> Is there some more general formulation of it that extends to vectors in a
> continuous vector space?
>
> And, to note, Jaccard is type stable for inputs of Vector{Bool} in
> Distances.jl.
>
> On Monday, June 13, 2016 at 3:53:14 AM UTC+2, jean-pierre both wrote:
>>
>> I encountered a problem in my application with Distances.Jaccard compared
>> with Distances.Euclidean: it was very slow.
>>
>> For example, with 2 Float64 vectors of size 11520 I get the following:
>>
>> julia> D = Euclidean()
>> Distances.Euclidean()
>> julia> @time for i in 1:500
>>            evaluate(D, v1, v2)
>>        end
>>   0.002553 seconds (500 allocations: 7.813 KB)
>>
>> and with Jaccard:
>>
>> julia> D = Jaccard()
>> Distances.Jaccard()
>> julia> @time for i in 1:500
>>            evaluate(D, v1, v2)
>>        end
>>   1.995046 seconds (40.32 M allocations: 703.156 MB, 9.68% gc time)
>>
>> With a simple loop for computing Jaccard:
>>
>> function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})
>>     num = 0
>>     den = 0
>>     for i in 1:length(a)
>>         num = num + min(a[i], b[i])
>>         den = den + max(a[i], b[i])
>>     end
>>     1. - num/den
>> end
>> myjaccard2 (generic function with 1 method)
>>
>> julia> @time for i in 1:500
>>            myjaccard2(v1, v2)
>>        end
>>   0.451582 seconds (23.04 M allocations: 351.592 MB, 20.04% gc time)
>>
>> I do not see the problem in the Jaccard distance implementation in the
>> Distances package.
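P.S. Here is a minimal sketch of one more variant I would try (not tested against Distances.jl; the name myjaccard_ifelse is mine, and it assumes v1, v2 are nonnegative Float64 vectors of equal length without NaN). Replacing min/max with ifelse keeps the loop branch-free, and @simd lets Julia reassociate the two reductions. Running @code_warntype myjaccard2(v1, v2) is also a quick way to check whether the accumulators stay Float64, since the allocation counts above suggest something is not specializing.

function myjaccard_ifelse(a::Array{Float64,1}, b::Array{Float64,1})
    # Assumes equal lengths and nonnegative entries (no NaN).
    num = 0.
    den = 0.
    @simd for i in 1:length(a)
        @inbounds ai = a[i]
        @inbounds bi = b[i]
        c = ai < bi
        num += ifelse(c, ai, bi)   # min(ai, bi); ifelse evaluates both arguments, so no branch
        den += ifelse(c, bi, ai)   # max(ai, bi)
    end
    1. - num/den
end

It can be timed the same way as testDistances5 above.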