Re: Slow image convolution

2012-11-10 Thread Cedric Greevey
On Fri, Nov 9, 2012 at 11:57 AM, Yakovlev Roman  wrote:

> what a mess if it is a function it's huuge did you try split it to useful
> chunks ? it's just unreadable and that's why you cann't spot anything i
> guess..
>

I can't split it without ending up with boxing at the function call
boundaries, since I'd have to pass more than four parameters to the
sub-functions.

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Re: Slow image convolution

2012-11-09 Thread Jim foo.bar
I thought 'definline' addresses exactly that...cases where one might 
want to pull out small pieces of first class functionality without 
sacrificing performance and without macros (which are not 
first-class)...am I mistaken?


Jim

ps: the only similarly big function that I've ever written (still 
smaller than this one) was for a GUI of mine and basically the Java 
interop caused it to be that big - it's not doing anything complicated 
like this one. I could not maintain such a function like the one shown 
here



On 09/11/12 17:32, Andy Fingerhut wrote:
Having worked on Clojure benchmarks on the Computer Language 
Benchmarks Game web site, that is sometimes the kind of Clojure code 
one needs to write if you want it to be as fast as it can be.


http://shootout.alioth.debian.org/u64q/compare.php?lang=clojure

Also, splitting it out into smaller chunks, if by that you mean 
functions, requires care in Clojure to avoid boxing and unboxing of 
primitive numbers.  It is possible, but can be an additional hurdle 
when trying to optimize code.  Macros can be better in that regard.


--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en


Re: Slow image convolution

2012-11-09 Thread Andy Fingerhut
Having worked on Clojure benchmarks on the Computer Language Benchmarks Game 
web site, that is sometimes the kind of Clojure code one needs to write if you 
want it to be as fast as it can be.

http://shootout.alioth.debian.org/u64q/compare.php?lang=clojure

Also, splitting it out into smaller chunks, if by that you mean functions, 
requires care in Clojure to avoid boxing and unboxing of primitive numbers.  It 
is possible, but can be an additional hurdle when trying to optimize code.  
Macros can be better in that regard.

Another comment, Cedric: Consider writing that part of your code in Java and 
call it from Clojure.  Java is to Clojure as in-line assembler is to C, except 
that it isn't in-line :-)

Andy

On Nov 9, 2012, at 8:57 AM, Yakovlev Roman wrote:

> what a mess if it is a function it's huuge did you try split it to useful 
> chunks ? it's just unreadable and that's why you cann't spot anything i 
> guess..
> 
> пятница, 9 ноября 2012 г., 0:48:20 UTC+4 пользователь Cedric Greevey написал:
> I have the following code to perform a complicated image convolution. It 
> takes 10-15 seconds with output dimensions 256x256 and samples 6. No 
> reflection warnings, and using unchecked math doesn't speed it up any. I've 
> tried to ensure it uses primitive math inside the loops, aside from 
> generating the outer loop's values. What cached-load-chunk does shouldn't 
> matter much, but in most cases it should boil down to a map lookup inside a 
> swap! and a couple of atom derefs and function calls. The bottleneck is 
> likely in the math somewhere, and likely something is being boxed, though 
> I've primitive-coerced every numerical let and loop value and avoided more 
> than two arguments per arithmetic op.
> 
> Can anyone spot anything I haven't that could be causing boxed arithmetic 
> inside the loops?
> 
>   (let [^BufferedImage out (BufferedImage. width height 
> BufferedImage/TYPE_INT_BGR)
> half-w (double (/ width 2.0))
> half-h (double (/ height 2.0))
> lnk (double (- (* (Math/log 0.1) (inc l10-mag))
>(/ (Math/log (+ (* half-w half-w) (* half-h half-h))) 
> 2)))
> cconv (double (/ chunk-width (* 2.0 Math/PI)))
> c-base-ln (double (* chunk-base-mag (Math/log 0.1)))
> x-offset (double (* chunk-width (+ 0.5 (/ deg-rot 360
> fmtstr (str "%0" chunk-name-digits "d")
> cached-load-chunk (cached-objects
> (fn [chunk-num]
>   (maybe-load-image
> (str chunk-name-base
>  (format fmtstr (inc chunk-num)) 
> ".png"]
> (doseq [oy (range height) ox (range width)]
>   (let [ox (int ox)
> oy (int oy)
> ix (double (- ox half-w))
> iy (double (- oy half-h))
> samples (double samples)
> smo (int (dec samples))]
> (loop [sx (int 0) sy (int 0) ns (double 0)
>r (double 0) g (double 0) b (double 0)]
>   (if (== sy samples)
> (.setRGB out ox oy (int (+
>   (int (/ b ns))
>   (+ (* 256 (int (/ g ns)))
>  (* 65536 (int (/ r ns)))
> (let [nsx (int (if (== sx smo) 0 (inc sx)))
>   nsy (int (if (zero? nsx) (inc sy) sy))
>   ix (double (+ ix (/ (+ sx (rand)) samples)))
>   iy (double (+ iy (/ (+ sy (rand)) samples)))]
>   (if (and (== ix 0) (== iy 0))
> (recur nsx nsy ns r g b)
> (let [lnz-P (double (+ lnk (/ (Math/log (+ (* ix ix) (* iy 
> iy))) 2)))
>   argz-P (double (+ Math/PI (Math/atan2 iy ix)))
>   ; atan2 output range is -pi...pi and oy increases down 
> the screen
>   x (double (* argz-P cconv))
>   y (double (* (- c-base-ln lnz-P) cconv))
>   ;
>   chunk-num (int (/ y chunk-height))
>   chunk-x (int (rem (int (+ x x-offset)) chunk-width))
>   chunk-y (int (- y (* chunk-height chunk-num)))
>   ^BufferedImage chunk (cached-load-chunk chunk-num)
>   pixel (int (if chunk (.getRGB chunk chunk-x chunk-y) 0))
>   r (double (+ r (/ (bit-and pixel 0x00ff) 65536)))
>   g (double (+ g (/ (bit-and pixel 0xff00) 256)))
>   b (double (+ b (bit-and pixel 0x00ff)))]
>   (recur nsx nsy (inc ns) r g b
> out))
> 
> 
> -- 
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with your 
> first post.
> To unsubscribe from t

Re: Slow image convolution

2012-11-09 Thread Yakovlev Roman
what a mess if it is a function it's huuge did you try split it to useful 
chunks ? it's just unreadable and that's why you cann't spot anything i 
guess..

пятница, 9 ноября 2012 г., 0:48:20 UTC+4 пользователь Cedric Greevey 
написал:
>
> I have the following code to perform a complicated image convolution. It 
> takes 10-15 seconds with output dimensions 256x256 and samples 6. No 
> reflection warnings, and using unchecked math doesn't speed it up any. I've 
> tried to ensure it uses primitive math inside the loops, aside from 
> generating the outer loop's values. What cached-load-chunk does shouldn't 
> matter much, but in most cases it should boil down to a map lookup inside a 
> swap! and a couple of atom derefs and function calls. The bottleneck is 
> likely in the math somewhere, and likely something is being boxed, though 
> I've primitive-coerced every numerical let and loop value and avoided more 
> than two arguments per arithmetic op.
>
> Can anyone spot anything I haven't that could be causing boxed arithmetic 
> inside the loops?
>
>   (let [^BufferedImage out (BufferedImage. width height 
> BufferedImage/TYPE_INT_BGR)
> half-w (double (/ width 2.0))
> half-h (double (/ height 2.0))
> lnk (double (- (* (Math/log 0.1) (inc l10-mag))
>(/ (Math/log (+ (* half-w half-w) (* half-h 
> half-h))) 2)))
> cconv (double (/ chunk-width (* 2.0 Math/PI)))
> c-base-ln (double (* chunk-base-mag (Math/log 0.1)))
> x-offset (double (* chunk-width (+ 0.5 (/ deg-rot 360
> fmtstr (str "%0" chunk-name-digits "d")
> cached-load-chunk (cached-objects
> (fn [chunk-num]
>   (maybe-load-image
> (str chunk-name-base
>  (format fmtstr (inc chunk-num)) 
> ".png"]
> (doseq [oy (range height) ox (range width)]
>   (let [ox (int ox)
> oy (int oy)
> ix (double (- ox half-w))
> iy (double (- oy half-h))
> samples (double samples)
> smo (int (dec samples))]
> (loop [sx (int 0) sy (int 0) ns (double 0)
>r (double 0) g (double 0) b (double 0)]
>   (if (== sy samples)
> (.setRGB out ox oy (int (+
>   (int (/ b ns))
>   (+ (* 256 (int (/ g ns)))
>  (* 65536 (int (/ r ns)))
> (let [nsx (int (if (== sx smo) 0 (inc sx)))
>   nsy (int (if (zero? nsx) (inc sy) sy))
>   ix (double (+ ix (/ (+ sx (rand)) samples)))
>   iy (double (+ iy (/ (+ sy (rand)) samples)))]
>   (if (and (== ix 0) (== iy 0))
> (recur nsx nsy ns r g b)
> (let [lnz-P (double (+ lnk (/ (Math/log (+ (* ix ix) (* iy 
> iy))) 2)))
>   argz-P (double (+ Math/PI (Math/atan2 iy ix)))
>   ; atan2 output range is -pi...pi and oy increases 
> down the screen
>   x (double (* argz-P cconv))
>   y (double (* (- c-base-ln lnz-P) cconv))
>   ;
>   chunk-num (int (/ y chunk-height))
>   chunk-x (int (rem (int (+ x x-offset)) chunk-width))
>   chunk-y (int (- y (* chunk-height chunk-num)))
>   ^BufferedImage chunk (cached-load-chunk chunk-num)
>   pixel (int (if chunk (.getRGB chunk chunk-x chunk-y) 
> 0))
>   r (double (+ r (/ (bit-and pixel 0x00ff) 65536)))
>   g (double (+ g (/ (bit-and pixel 0xff00) 256)))
>   b (double (+ b (bit-and pixel 0x00ff)))]
>   (recur nsx nsy (inc ns) r g b
> out))
>
>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Re: Slow image convolution

2012-11-09 Thread Andy Fingerhut
I haven't determined exactly why that particular code is slow for you, but I 
have looked at it for a few minutes and have some suggestions:

1. In Clojure 1.3 and later, the default integer type for primitive arithmetic 
is long, not int.  Declaring things int means that they are really longs in the 
byte code, with type casts between long and int where needed.  Type casts 
between long and int are not slow, I'd guess, but wanted to mention this FYI.

2. I haven't followed all of the changes you have made to the code, but looked 
at the version in your original message of the thread.  There chunk-width and 
chunk-height are used in the inner loop, but have no let binding to cause them 
to be primitives, therefore are more likely to be boxed numbers and slow down 
the operations in which they are involved.  Maybe you already found and fixed 
this in your "I've now checked for Var lookups" message, but I can't tell from 
your message.

3. I have some hope that if I can find a decent Java decompiler, it will help 
in tracking these things down, by decompiling the .class files that Clojure can 
be compiled to back into Java source code.  If one is experienced in reading 
Java byte code, then a simpler Java disassembler is sufficient, but I'm not 
sure I want to learn to read Java byte codes that well.

I've tried the free JD-GUI, which produces output that looks like Java, but the 
tool seems to not handle Clojure loops very well.  I've gotten better results 
in a couple of small tests with the commercial AndroChef Java Decompiler 1.0, 
using a few of its 10 free trials.  I may purchase a copy for $24.99 after a 
few more trials if it looks good.  Seems to be Windows and GUI only.

>From a *brief* look at the Java decompiled by AndroChef for your original 
>code, wrapped in a function with untyped arguments for all of the unbound 
>symbols (see https://gist.github.com/4046744 ), it appears that perhaps 
>operations on a mix of doubles and longs don't use primitive arithmetic.  That 
>is a guess, mind you.  I haven't tried out smaller experiments to see whether 
>that is true in general, or whether it is something about this particular 
>function that causes that.

Andy


On Nov 8, 2012, at 6:50 PM, Cedric Greevey wrote:

> I can now add that the 30 taken by the cache lookup + .getRGB call is almost 
> entirely consumed by the latter. BufferedImage.getRGB is 50% slower than 
> rand. There's still about 380 billion cycles being spent on only around 3 
> billion numerical ops plus 100 million each of log and atan, though. Seems 
> excessive.
> 
> 
> 
> On Thu, Nov 8, 2012 at 8:38 PM, Cedric Greevey  wrote:
> So 100 million (rand) calls take 20 seconds. From temporarily changing 
> ^BufferedImage chunk (chunk-cache chunk-num) to ^BufferedImage chunk nil, I 
> determined that 100 million of the *combination* of the cache lookup (with no 
> misses) and the .getRGB call took 30. That still leaves just over two minutes 
> with only the arithmetic left in the loop. Either something's getting boxed 
> or it's the trig calls.
> 
> 
> 
> On Thu, Nov 8, 2012 at 8:18 PM, Cedric Greevey  wrote:
> (rand) is expensive -- removing the two (rand)s knocks about 40 seconds off 
> it, nearly 1/5 the total time. I'll try replacing them with lookup from a 
> precalculated grid of randoms -- long-range correlations shouldn't matter 
> here.
> 
> 
> 
> 
> On Thu, Nov 8, 2012 at 8:00 PM, Cedric Greevey  wrote:
> On Thu, Nov 8, 2012 at 3:48 PM, Cedric Greevey  wrote:
> I have the following code to perform a complicated image convolution. It 
> takes 10-15 seconds with output dimensions 256x256 and samples 6. No 
> reflection warnings, and using unchecked math doesn't speed it up any. I've 
> tried to ensure it uses primitive math inside the loops, aside from 
> generating the outer loop's values. What cached-load-chunk does shouldn't 
> matter much, but in most cases it should boil down to a map lookup inside a 
> swap! and a couple of atom derefs and function calls. The bottleneck is 
> likely in the math somewhere, and likely something is being boxed, though 
> I've primitive-coerced every numerical let and loop value and avoided more 
> than two arguments per arithmetic op.
> 
> Can anyone spot anything I haven't that could be causing boxed arithmetic 
> inside the loops?
> 
> I've now checked for Var lookups (none outside the caching function, and now 
> none inside either) and checked the caching code itself (there's a .get on a 
> closed-over ConcurrentHashMap, a null check, a .get on a SoftReference, and 
> another null check, on each lookup, if there isn't a cache miss on that 
> lookup; plus a couple more method calls for the IFn invokes and an ivar fetch 
> to get the ConcurrentHashMap reference).
> 
> In the absence of cache misses I'm still seeing ~3.5 *minutes* at 1280x720 
> with 10 samples (= about 100 million iterations total of the inner loop). The 
> arithmetic in there is 23 floating-point ops, five compares, a log, a

Re: Slow image convolution

2012-11-08 Thread Cedric Greevey
So 100 million (rand) calls take 20 seconds. From temporarily changing
^BufferedImage chunk (chunk-cache chunk-num) to ^BufferedImage chunk nil, I
determined that 100 million of the *combination* of the cache lookup (with
no misses) and the .getRGB call took 30. That still leaves just over two
minutes with only the arithmetic left in the loop. Either something's
getting boxed or it's the trig calls.


On Thu, Nov 8, 2012 at 8:18 PM, Cedric Greevey  wrote:

> (rand) is expensive -- removing the two (rand)s knocks about 40 seconds
> off it, nearly 1/5 the total time. I'll try replacing them with lookup from
> a precalculated grid of randoms -- long-range correlations shouldn't matter
> here.
>
>
>
>
> On Thu, Nov 8, 2012 at 8:00 PM, Cedric Greevey  wrote:
>
>> On Thu, Nov 8, 2012 at 3:48 PM, Cedric Greevey wrote:
>>
>>> I have the following code to perform a complicated image convolution. It
>>> takes 10-15 seconds with output dimensions 256x256 and samples 6. No
>>> reflection warnings, and using unchecked math doesn't speed it up any. I've
>>> tried to ensure it uses primitive math inside the loops, aside from
>>> generating the outer loop's values. What cached-load-chunk does shouldn't
>>> matter much, but in most cases it should boil down to a map lookup inside a
>>> swap! and a couple of atom derefs and function calls. The bottleneck is
>>> likely in the math somewhere, and likely something is being boxed, though
>>> I've primitive-coerced every numerical let and loop value and avoided more
>>> than two arguments per arithmetic op.
>>>
>>> Can anyone spot anything I haven't that could be causing boxed
>>> arithmetic inside the loops?
>>>
>>
>> I've now checked for Var lookups (none outside the caching function, and
>> now none inside either) and checked the caching code itself (there's a .get
>> on a closed-over ConcurrentHashMap, a null check, a .get on a
>> SoftReference, and another null check, on each lookup, if there isn't a
>> cache miss on that lookup; plus a couple more method calls for the IFn
>> invokes and an ivar fetch to get the ConcurrentHashMap reference).
>>
>> In the absence of cache misses I'm still seeing ~3.5 *minutes* at
>> 1280x720 with 10 samples (= about 100 million iterations total of the inner
>> loop). The arithmetic in there is 23 floating-point ops, five compares, a
>> log, an atan, and three bitwise ANDs. Without the log and atan a hundred
>> million of that inner loop should take a second on this box. I very much
>> doubt the log and the atan are 209 times slower than the rest of it
>> combined. So there's three likely culprits: boxing, two calls to (rand),
>> and the BufferedImage .getRGB method call (on a 6000x2198 24bpp image,
>> though its size should matter not), unless trig is really that slow.
>>
>>
>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Re: Slow image convolution

2012-11-08 Thread Cedric Greevey
(rand) is expensive -- removing the two (rand)s knocks about 40 seconds off
it, nearly 1/5 the total time. I'll try replacing them with lookup from a
precalculated grid of randoms -- long-range correlations shouldn't matter
here.



On Thu, Nov 8, 2012 at 8:00 PM, Cedric Greevey  wrote:

> On Thu, Nov 8, 2012 at 3:48 PM, Cedric Greevey  wrote:
>
>> I have the following code to perform a complicated image convolution. It
>> takes 10-15 seconds with output dimensions 256x256 and samples 6. No
>> reflection warnings, and using unchecked math doesn't speed it up any. I've
>> tried to ensure it uses primitive math inside the loops, aside from
>> generating the outer loop's values. What cached-load-chunk does shouldn't
>> matter much, but in most cases it should boil down to a map lookup inside a
>> swap! and a couple of atom derefs and function calls. The bottleneck is
>> likely in the math somewhere, and likely something is being boxed, though
>> I've primitive-coerced every numerical let and loop value and avoided more
>> than two arguments per arithmetic op.
>>
>> Can anyone spot anything I haven't that could be causing boxed arithmetic
>> inside the loops?
>>
>
> I've now checked for Var lookups (none outside the caching function, and
> now none inside either) and checked the caching code itself (there's a .get
> on a closed-over ConcurrentHashMap, a null check, a .get on a
> SoftReference, and another null check, on each lookup, if there isn't a
> cache miss on that lookup; plus a couple more method calls for the IFn
> invokes and an ivar fetch to get the ConcurrentHashMap reference).
>
> In the absence of cache misses I'm still seeing ~3.5 *minutes* at 1280x720
> with 10 samples (= about 100 million iterations total of the inner loop).
> The arithmetic in there is 23 floating-point ops, five compares, a log, an
> atan, and three bitwise ANDs. Without the log and atan a hundred million of
> that inner loop should take a second on this box. I very much doubt the log
> and the atan are 209 times slower than the rest of it combined. So there's
> three likely culprits: boxing, two calls to (rand), and the BufferedImage
> .getRGB method call (on a 6000x2198 24bpp image, though its size should
> matter not), unless trig is really that slow.
>
>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Re: Slow image convolution

2012-11-08 Thread Cedric Greevey
On Thu, Nov 8, 2012 at 3:48 PM, Cedric Greevey  wrote:

> I have the following code to perform a complicated image convolution. It
> takes 10-15 seconds with output dimensions 256x256 and samples 6. No
> reflection warnings, and using unchecked math doesn't speed it up any. I've
> tried to ensure it uses primitive math inside the loops, aside from
> generating the outer loop's values. What cached-load-chunk does shouldn't
> matter much, but in most cases it should boil down to a map lookup inside a
> swap! and a couple of atom derefs and function calls. The bottleneck is
> likely in the math somewhere, and likely something is being boxed, though
> I've primitive-coerced every numerical let and loop value and avoided more
> than two arguments per arithmetic op.
>
> Can anyone spot anything I haven't that could be causing boxed arithmetic
> inside the loops?
>

I've now checked for Var lookups (none outside the caching function, and
now none inside either) and checked the caching code itself (there's a .get
on a closed-over ConcurrentHashMap, a null check, a .get on a
SoftReference, and another null check, on each lookup, if there isn't a
cache miss on that lookup; plus a couple more method calls for the IFn
invokes and an ivar fetch to get the ConcurrentHashMap reference).

In the absence of cache misses I'm still seeing ~3.5 *minutes* at 1280x720
with 10 samples (= about 100 million iterations total of the inner loop).
The arithmetic in there is 23 floating-point ops, five compares, a log, an
atan, and three bitwise ANDs. Without the log and atan a hundred million of
that inner loop should take a second on this box. I very much doubt the log
and the atan are 209 times slower than the rest of it combined. So there's
three likely culprits: boxing, two calls to (rand), and the BufferedImage
.getRGB method call (on a 6000x2198 24bpp image, though its size should
matter not), unless trig is really that slow.

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en