Profiling shows that incrementing integers by 1 (i += 1) is the bottleneck.

The other statements within the same loop take much less time.

In my performance-optimizing zeal, I over-typed the hell out of everything 
to try to squeeze out the last ounce of performance.
Some of this zeal did help in other parts of the code, but now I am 
struggling to make sense of spending most of the time incrementing by 1.
I suspect the problem is the over-typing, because I seem to recall having a 
less strongly typed version that ran consistently 2-3 times faster for the 
default Int (but not for other Int types). It was late at night, so I don't 
recall the details!

I am pretty confident the increment variables are typed, so there should not 
be any undue casts.

Any ideas?

Here is what my code conceptually looks like:

> # Global static type declaration ahead seems to have helped (as opposed to
> # deriving from eltype of underlying array at the beginning of the function
> # being profiled).
> IdType = Int # Int64
> DType = Int
> function my_fct(dt1, dt2)
>   # Convert is for sure unnecessary for default Int types but more rigorous,
>   # and necessary in some parts of the code when experimenting with other
>   # IdType & DType types.
>   # oneIdType is used to make sure I increment with a value of the proper
>   # type, again useless with IdType = Int.
>   const oneIdType = convert(IdType, 1)
>   const zeroIdType = convert(IdType, 0)
>   i::IdType = zeroIdType
>   i2Match::IdType = zeroIdType
>   i2Lower::IdType = zeroIdType
>   i2Upper::IdType = oneIdType
>   ...
>     # Critical loop.
>     i2Match = i2Lower
>     while i2Match < i2Upper
>       @inbounds i2MatchD2 = dt2D2[i2Match]
>       if i1D <= i2MatchD2
>         i += oneIdType # SLOW!
>         @inbounds i2MatchD1 = dt2D1[i2Match]
>         @inbounds resid1[i] = i1id1
>         ...
>       end
>       i2Match += oneIdType # SLOW!
>     end
>   ...
> end


The variables not declared above are 1-dimensional arrays of the appropriate 
type, basically all Int in this configuration.
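
One way to check the over-typing suspicion in isolation might be to time the 
typed increment on its own, once with a non-const type alias (like IdType 
above) and once with a const alias. This is only a sketch: LooseId, FixedId, 
count_loose and count_fixed are made-up names that are not part of the 
attached code, and the timings will depend on the Julia version.

```julia
# Hypothetical stand-ins for the IdType alias: one non-const, one const.
LooseId = Int          # non-const global: its value may change at any time
const FixedId = Int    # const global: its value is known when compiling

# Same increment pattern as the critical loop, typed via the non-const alias.
# (Plain locals are used; recent Julia versions reject `const` on locals.)
function count_loose(n)
    one_id = convert(LooseId, 1)
    i::LooseId = convert(LooseId, 0)
    k::LooseId = convert(LooseId, 0)
    while k < n
        i += one_id
        k += one_id
    end
    return i
end

# Identical body, typed via the const alias.
function count_fixed(n)
    one_id = convert(FixedId, 1)
    i::FixedId = convert(FixedId, 0)
    k::FixedId = convert(FixedId, 0)
    while k < n
        i += one_id
        k += one_id
    end
    return i
end

# Call each once first so compilation time is not part of the measurement.
count_loose(10)
count_fixed(10)

@time count_loose(10^6)
@time count_fixed(10^6)
```

If the const variant turns out to be much faster, the non-const global alias, 
rather than the increment itself, would be the first thing to look at.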

Enclosed is the full stand-alone code if anyone cares to try it.
On my machines, one function call takes 0.05 to 0.1 sec, depending heavily 
on garbage collection, so profiling 100 runs takes about 10 sec.
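
For anyone who wants to reproduce that kind of measurement, a profiling run 
of 100 calls could look roughly like the sketch below. It uses the standard 
Profile module; work is a made-up stand-in for the real function, which is 
only in the attachment.

```julia
using Profile

# Hypothetical stand-in for the function being profiled.
function work(n::Int)
    s = 0
    for k in 1:n
        s += k % 7
    end
    return s
end

work(10)            # run once so that compilation is not profiled
Profile.clear()     # discard samples from any earlier run

# Collect samples over 100 calls, as described above.
@profile for _ in 1:100
    work(10^6)
end

Profile.print()     # sample counts per source line, as a call tree
```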

Thanks.

Patrick

Attachment: crossJoinFilter.jl