Ok, I merged it. Enjoy.

On Tuesday, June 14, 2016 at 8:53:26 PM UTC+2, jean-pierre both wrote:
>
> Hi,
>
> The fix works really well. Thank you for the analysis and for the fix.
>
> On Monday, June 13, 2016 at 8:19:26 PM UTC+2, Kristoffer Carlsson wrote:
>>
>> Please try https://github.com/JuliaStats/Distances.jl/pull/44
>>
>> On Monday, June 13, 2016 at 8:14:01 PM UTC+2, Mauro wrote:
>>>
>>> > function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1}) 
>>> >     num = 0. 
>>> >     den = 0. 
>>> >     for I in 1:length(a) 
>>> >         @inbounds ai = a[I] 
>>> >         @inbounds bi = b[I] 
>>> >         num = num + min(ai,bi) 
>>> >         den = den + max(ai,bi) 
>>> >     end 
>>> >     1. - num/den 
>>> > end 
>>> > 
>>> > 
>>> > 
>>> > function testDistances2(v1::Array{Float64,1}, v2::Array{Float64,1}) 
>>> >     for i in 1:50000 
>>> >         myjaccard2(v1,v2) 
>>> >     end 
>>> > end 
>>>
>>> I recommend using the values returned for something, otherwise the 
>>> compiler sometimes eliminates the loop (but not here): 
>>>
>>> julia> function testDistances2(v1::Array{Float64,1}, v2::Array{Float64,1}) 
>>>            out = 0.0 
>>>            for i in 1:50000 
>>>                out += myjaccard2(v1,v2) 
>>>            end 
>>>            out 
>>>        end 
>>>
>>> > @time testDistances2(v1,v2) 
>>> > machine   3.217329 seconds (200.01 M allocations: 2.981 GB, 19.91% gc time) 
>>>
>>> I cannot reproduce this; when I run it I get no allocations: 
>>>
>>> julia> v2 = rand(10^4); 
>>>
>>> # warm-up 
>>> julia> @time testDistances2(v1,v2) 
>>>   3.604478 seconds (8.15 k allocations: 401.797 KB, 0.42% gc time) 
>>> 24999.00112162811 
>>>
>>> julia> @time testDistances2(v1,v2) 
>>>   3.647563 seconds (5 allocations: 176 bytes) 
>>> 24999.00112162811 
>>>
>>> What version of Julia are you running? I am on 0.4.5. 
>>>
>>> > function myjaccard5(a::Array{Float64,1}, b::Array{Float64,1}) 
>>> >     num = 0. 
>>> >     den = 0. 
>>> >     for I in 1:length(a) 
>>> >         @inbounds ai = a[I] 
>>> >         @inbounds bi = b[I] 
>>> >         abs_m = abs(ai-bi) 
>>> >         abs_p = abs(ai+bi) 
>>> >         num += abs_p - abs_m 
>>> >         den += abs_p + abs_m 
>>> >     end 
>>> >     1. - num/den 
>>> > end 
>>> > 
>>> > 
>>> > function testDistances5(a::Array{Float64,1}, b::Array{Float64,1}) 
>>> >     for i in 1:5000 
>>> >         myjaccard5(a,b) 
>>> >     end 
>>> > end 
>>> > 
>>> > end 
>>> > 
>>> > 
>>> > julia> @time testDistances5(v1,v2) 
>>> >   0.166979 seconds (4 allocations: 160 bytes) 
>>> > 
>>> > 
>>> > 
>>> > We see that using abs is faster. 
>>> > 
>>> > I have not opened a pull request because I would expect a good 
>>> > implementation to be 2 or 3 times slower than Euclidean, and I have 
>>> > not reached that yet. 
>>> > 
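Note on the abs formulation: a short derivation of why myjaccard5 returns the same value as myjaccard2. It assumes every entry of a and b is nonnegative. That is an assumption, since the thread does not state it, although rand does produce values in [0, 1).

```latex
For $a_i, b_i \ge 0$:
\begin{align*}
|a_i + b_i| - |a_i - b_i| &= 2\min(a_i, b_i), \\
|a_i + b_i| + |a_i - b_i| &= 2\max(a_i, b_i),
\end{align*}
so
\[
1 - \frac{\sum_i \bigl(|a_i + b_i| - |a_i - b_i|\bigr)}
         {\sum_i \bigl(|a_i + b_i| + |a_i - b_i|\bigr)}
  = 1 - \frac{\sum_i \min(a_i, b_i)}{\sum_i \max(a_i, b_i)} .
\]
```

The factor of 2 cancels in the ratio. With entries of mixed sign the two formulations can differ.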
>>> > On Monday, June 13, 2016 at 1:43:00 PM UTC+2, Kristoffer Carlsson wrote: 
>>> >> 
>>> >> It seems weird to me that you guys want to call Jaccard distance with 
>>> >> float arrays. AFAIK Jaccard distance measures the distance between two 
>>> >> distinct samples from a pair of sets so basically between two 
>>> >> Vector{Bool}, see: 
>>> >> 
>>> >> http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html
>>> >> 
>>> >> "Computes the Jaccard-Needham dissimilarity between two boolean 1-D 
>>> >> arrays." 
>>> >> 
>>> >> Is there some more general formulation of it that extends to vectors in 
>>> >> a continuous vector space? 
>>> >> 
>>> >> And, to note, Jaccard is type stable for inputs of Vector{Bool} in 
>>> >> Distances.jl. 
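One generalization that matches the loops quoted in this thread, offered as a sketch and not as the definition Distances.jl uses: for vectors with nonnegative real entries, replace set intersection and union by elementwise min and max.

```latex
\[
d_J(a, b) = 1 - \frac{\sum_i \min(a_i, b_i)}{\sum_i \max(a_i, b_i)},
\qquad a_i, b_i \ge 0 .
\]
For $a_i, b_i \in \{0, 1\}$, $\min$ acts as logical AND and $\max$ as
logical OR, so
\[
d_J(a, b) = 1 - \frac{|A \cap B|}{|A \cup B|},
\]
where $A$ and $B$ are the index sets on which $a$ and $b$ equal 1.
```

On Boolean inputs this reduces to the usual set-based Jaccard distance.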
>>> >> 
>>> >> On Monday, June 13, 2016 at 3:53:14 AM UTC+2, jean-pierre both wrote: 
>>> >>> 
>>> >>> 
>>> >>> 
>>> >>> In my application, Distances.Jaccard turned out to be very slow 
>>> >>> compared with Distances.Euclidean. 
>>> >>> 
>>> >>> For example, with two Float64 vectors of size 11520, 
>>> >>> I get the following: 
>>> >>> julia> D=Euclidean() 
>>> >>> Distances.Euclidean() 
>>> >>> julia> @time for i in 1:500 
>>> >>>        evaluate(D,v1,v2) 
>>> >>>        end 
>>> >>>   0.002553 seconds (500 allocations: 7.813 KB) 
>>> >>> 
>>> >>> and with Jaccard 
>>> >>> 
>>> >>> julia> D=Jaccard() 
>>> >>> Distances.Jaccard() 
>>> >>> julia> @time for i in 1:500 
>>> >>>               evaluate(D,v1,v2) 
>>> >>>               end 
>>> >>>   1.995046 seconds (40.32 M allocations: 703.156 MB, 9.68% gc time) 
>>> >>> 
>>> >>> With a simple loop for computing Jaccard: 
>>> >>> 
>>> >>> 
>>> >>> function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1}) 
>>> >>>            num = 0 
>>> >>>            den = 0 
>>> >>>            for i in 1:length(a) 
>>> >>>                    num = num + min(a[i],b[i]) 
>>> >>>                    den = den + max(a[i],b[i]) 
>>> >>>            end 
>>> >>>                1. - num/den 
>>> >>>        end 
>>> >>> myjaccard2 (generic function with 1 method) 
>>> >>> 
>>> >>> julia> @time for i in 1:500 
>>> >>>               myjaccard2(v1,v2) 
>>> >>>               end 
>>> >>>   0.451582 seconds (23.04 M allocations: 351.592 MB, 20.04% gc time) 
>>> >>> 
>>> >>> I do not see what the problem is in the Jaccard distance 
>>> >>> implementation in the Distances package. 
>>> >>> 
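A possible explanation for the allocations in the simple loop, stated as an assumption and not as a confirmed diagnosis: num and den start as the Int literal 0 and become Float64 after the first addition, so their type changes inside the loop. The sketch below keeps both accumulators at a single floating-point type. The name jaccard_sketch is hypothetical and not part of Distances.jl, and the `where` syntax assumes a recent Julia (1.x), not the 0.4 series used in this thread.

```julia
# Hypothetical helper, not part of Distances.jl.
# Assumes nonnegative entries and two vectors of equal length.
function jaccard_sketch(a::AbstractVector{T}, b::AbstractVector{T}) where {T<:Real}
    length(a) == length(b) ||
        throw(DimensionMismatch("a and b must have the same length"))
    F = float(T)      # e.g. Float64 for Float64, Int, or Bool inputs
    num = zero(F)     # accumulators start as floats and stay floats
    den = zero(F)
    @inbounds for i in eachindex(a, b)
        num += min(a[i], b[i])
        den += max(a[i], b[i])
    end
    return 1 - num / den
end
```

Example call: `jaccard_sketch([1.0, 2.0, 0.0], [0.0, 2.0, 3.0])` computes 1 - 2/6. Initializing with `zero(float(T))` instead of `0` is the only change that matters here; the loop body is the same as in the simple version.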
>>> >> 
>>>
>>
