Hi,

The fix is really great. Thank you for the analysis and the fix.

On Monday, June 13, 2016 at 8:19:26 PM UTC+2, Kristoffer Carlsson wrote:
>
> Please try https://github.com/JuliaStats/Distances.jl/pull/44
>
> On Monday, June 13, 2016 at 8:14:01 PM UTC+2, Mauro wrote:
>>
>> > function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})
>> >     num = 0.
>> >     den = 0.
>> >     for I in 1:length(a)
>> >         @inbounds ai = a[I]
>> >         @inbounds bi = b[I]
>> >         num = num + min(ai,bi)
>> >         den = den + max(ai,bi)
>> >     end
>> >     1. - num/den
>> > end
>> >
>> > function testDistances2(v1::Array{Float64,1}, v2::Array{Float64,1})
>> >     for i in 1:50000
>> >         myjaccard2(v1,v2)
>> >     end
>> > end
>>
>> I recommend using the returned values for something, otherwise the
>> compiler sometimes eliminates the loop (but not here):
>>
>> julia> function testDistances2(v1::Array{Float64,1}, v2::Array{Float64,1})
>>            out = 0.0
>>            for i in 1:50000
>>                out += myjaccard2(v1,v2)
>>            end
>>            out
>>        end
>>
>> > @time testDistances2(v1,v2)
>> > machine 3.217329 seconds (200.01 M allocations: 2.981 GB, 19.91% gc time)
>>
>> I cannot reproduce this. When I run it I get no allocations:
>>
>> julia> v2 = rand(10^4);
>>
>> # warm-up
>> julia> @time testDistances2(v1,v2)
>>   3.604478 seconds (8.15 k allocations: 401.797 KB, 0.42% gc time)
>> 24999.00112162811
>>
>> julia> @time testDistances2(v1,v2)
>>   3.647563 seconds (5 allocations: 176 bytes)
>> 24999.00112162811
>>
>> What version of Julia are you running? I am on 0.4.5.
>>
>> > function myjaccard5(a::Array{Float64,1}, b::Array{Float64,1})
>> >     num = 0.
>> >     den = 0.
>> >     for I in 1:length(a)
>> >         @inbounds ai = a[I]
>> >         @inbounds bi = b[I]
>> >         abs_m = abs(ai-bi)
>> >         abs_p = abs(ai+bi)
>> >         num += abs_p - abs_m
>> >         den += abs_p + abs_m
>> >     end
>> >     1. - num/den
>> > end
>> >
>> > function testDistances5(a::Array{Float64,1}, b::Array{Float64,1})
>> >     for i in 1:5000
>> >         myjaccard5(a,b)
>> >     end
>> > end
>> >
>> > end
>> >
>> > julia> @time testDistances5(v1,v2)
>> >   0.166979 seconds (4 allocations: 160 bytes)
>> >
>> > We see that using abs is faster.
>> >
>> > I am not opening a pull request because I would expect a good
>> > implementation to be only 2 or 3 times slower than Euclidean, and I
>> > have not reached that yet.
>> >
>> > On Monday, June 13, 2016 at 1:43:00 PM UTC+2, Kristoffer Carlsson wrote:
>> >>
>> >> It seems weird to me that you guys want to call Jaccard distance with
>> >> float arrays. AFAIK Jaccard distance measures the distance between two
>> >> distinct samples from a pair of sets, so basically between two
>> >> Vector{Bool}, see:
>> >>
>> >> http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html
>> >>
>> >> "Computes the Jaccard-Needham dissimilarity between two boolean 1-D
>> >> arrays."
>> >>
>> >> Is there some more general formulation of it that extends to vectors
>> >> in a continuous vector space?
>> >>
>> >> And, to note, Jaccard is type stable for inputs of Vector{Bool} in
>> >> Distances.jl.
>> >>
>> >> On Monday, June 13, 2016 at 3:53:14 AM UTC+2, jean-pierre both wrote:
>> >>>
>> >>> In my application I found Distances.Jaccard to be very slow
>> >>> compared with Distances.Euclidean.
>> >>>
>> >>> For example, with 2 Float64 vectors of size 11520
>> >>> I get the following:
>> >>>
>> >>> julia> D=Euclidean()
>> >>> Distances.Euclidean()
>> >>> julia> @time for i in 1:500
>> >>>            evaluate(D,v1,v2)
>> >>>        end
>> >>>   0.002553 seconds (500 allocations: 7.813 KB)
>> >>>
>> >>> and with Jaccard:
>> >>>
>> >>> julia> D=Jaccard()
>> >>> Distances.Jaccard()
>> >>> julia> @time for i in 1:500
>> >>>            evaluate(D,v1,v2)
>> >>>        end
>> >>>   1.995046 seconds (40.32 M allocations: 703.156 MB, 9.68% gc time)
>> >>>
>> >>> With a simple loop for computing Jaccard:
>> >>>
>> >>> function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})
>> >>>     num = 0
>> >>>     den = 0
>> >>>     for i in 1:length(a)
>> >>>         num = num + min(a[i],b[i])
>> >>>         den = den + max(a[i],b[i])
>> >>>     end
>> >>>     1. - num/den
>> >>> end
>> >>> myjaccard2 (generic function with 1 method)
>> >>>
>> >>> julia> @time for i in 1:500
>> >>>            myjaccard2(v1,v2)
>> >>>        end
>> >>>   0.451582 seconds (23.04 M allocations: 351.592 MB, 20.04% gc time)
>> >>>
>> >>> I do not see what the problem is in the Jaccard distance
>> >>> implementation in the Distances package.
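
A short note on why the abs-based loop (myjaccard5) can stand in for the min/max loop (myjaccard2). This is only a sketch, and it assumes that every pair satisfies a_i + b_i >= 0, which holds for nonnegative vectors such as those returned by rand. The thread does not state that assumption; it is added here. Under it, |a_i + b_i| = a_i + b_i, and the two accumulated sums are exactly twice the min and max sums, so the factor of 2 cancels in the ratio:

```latex
\begin{align*}
\min(a_i, b_i) &= \tfrac{1}{2}\bigl(a_i + b_i - |a_i - b_i|\bigr) \\
\max(a_i, b_i) &= \tfrac{1}{2}\bigl(a_i + b_i + |a_i - b_i|\bigr) \\
1 - \frac{\sum_i \bigl(|a_i + b_i| - |a_i - b_i|\bigr)}
         {\sum_i \bigl(|a_i + b_i| + |a_i - b_i|\bigr)}
  &= 1 - \frac{2 \sum_i \min(a_i, b_i)}{2 \sum_i \max(a_i, b_i)}
   = 1 - \frac{\sum_i \min(a_i, b_i)}{\sum_i \max(a_i, b_i)}
\end{align*}
```

If some a_i + b_i is negative, |a_i + b_i| no longer equals a_i + b_i, and the two loops can return different values.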