Re: Slow image convolution
On Fri, Nov 9, 2012 at 11:57 AM, Yakovlev Roman wrote: > what a mess if it is a function it's huuge did you try split it to useful > chunks ? it's just unreadable and that's why you cann't spot anything i > guess.. > I can't split it without ending up with boxing at the function call boundaries, since I'd have to pass more than four parameters to the sub-functions. -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en
Re: Slow image convolution
I thought 'definline' addresses exactly that...cases where one might want to pull out small pieces of first class functionality without sacrificing performance and without macros (which are not first-class)...am I mistaken? Jim ps: the only similarly big function that I've ever written (still smaller than this one) was for a GUI of mine and basically the Java interop caused it to be that big - it's not doing anything complicated like this one. I could not maintain such a function like the one shown here On 09/11/12 17:32, Andy Fingerhut wrote: Having worked on Clojure benchmarks on the Computer Language Benchmarks Game web site, that is sometimes the kind of Clojure code one needs to write if you want it to be as fast as it can be. http://shootout.alioth.debian.org/u64q/compare.php?lang=clojure Also, splitting it out into smaller chunks, if by that you mean functions, requires care in Clojure to avoid boxing and unboxing of primitive numbers. It is possible, but can be an additional hurdle when trying to optimize code. Macros can be better in that regard. -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en
Re: Slow image convolution
Having worked on Clojure benchmarks on the Computer Language Benchmarks Game web site, that is sometimes the kind of Clojure code one needs to write if you want it to be as fast as it can be. http://shootout.alioth.debian.org/u64q/compare.php?lang=clojure Also, splitting it out into smaller chunks, if by that you mean functions, requires care in Clojure to avoid boxing and unboxing of primitive numbers. It is possible, but can be an additional hurdle when trying to optimize code. Macros can be better in that regard. Another comment, Cedric: Consider writing that part of your code in Java and call it from Clojure. Java is to Clojure as in-line assembler is to C, except that it isn't in-line :-) Andy On Nov 9, 2012, at 8:57 AM, Yakovlev Roman wrote: > what a mess if it is a function it's huuge did you try split it to useful > chunks ? it's just unreadable and that's why you cann't spot anything i > guess.. > > пятница, 9 ноября 2012 г., 0:48:20 UTC+4 пользователь Cedric Greevey написал: > I have the following code to perform a complicated image convolution. It > takes 10-15 seconds with output dimensions 256x256 and samples 6. No > reflection warnings, and using unchecked math doesn't speed it up any. I've > tried to ensure it uses primitive math inside the loops, aside from > generating the outer loop's values. What cached-load-chunk does shouldn't > matter much, but in most cases it should boil down to a map lookup inside a > swap! and a couple of atom derefs and function calls. The bottleneck is > likely in the math somewhere, and likely something is being boxed, though > I've primitive-coerced every numerical let and loop value and avoided more > than two arguments per arithmetic op. > > Can anyone spot anything I haven't that could be causing boxed arithmetic > inside the loops? > > (let [^BufferedImage out (BufferedImage. width height > BufferedImage/TYPE_INT_BGR) > half-w (double (/ width 2.0)) > half-h (double (/ height 2.0)) > lnk (double (- (* (Math/log 0.1) (inc l10-mag)) >(/ (Math/log (+ (* half-w half-w) (* half-h half-h))) > 2))) > cconv (double (/ chunk-width (* 2.0 Math/PI))) > c-base-ln (double (* chunk-base-mag (Math/log 0.1))) > x-offset (double (* chunk-width (+ 0.5 (/ deg-rot 360 > fmtstr (str "%0" chunk-name-digits "d") > cached-load-chunk (cached-objects > (fn [chunk-num] > (maybe-load-image > (str chunk-name-base > (format fmtstr (inc chunk-num)) > ".png"] > (doseq [oy (range height) ox (range width)] > (let [ox (int ox) > oy (int oy) > ix (double (- ox half-w)) > iy (double (- oy half-h)) > samples (double samples) > smo (int (dec samples))] > (loop [sx (int 0) sy (int 0) ns (double 0) >r (double 0) g (double 0) b (double 0)] > (if (== sy samples) > (.setRGB out ox oy (int (+ > (int (/ b ns)) > (+ (* 256 (int (/ g ns))) > (* 65536 (int (/ r ns))) > (let [nsx (int (if (== sx smo) 0 (inc sx))) > nsy (int (if (zero? nsx) (inc sy) sy)) > ix (double (+ ix (/ (+ sx (rand)) samples))) > iy (double (+ iy (/ (+ sy (rand)) samples)))] > (if (and (== ix 0) (== iy 0)) > (recur nsx nsy ns r g b) > (let [lnz-P (double (+ lnk (/ (Math/log (+ (* ix ix) (* iy > iy))) 2))) > argz-P (double (+ Math/PI (Math/atan2 iy ix))) > ; atan2 output range is -pi...pi and oy increases down > the screen > x (double (* argz-P cconv)) > y (double (* (- c-base-ln lnz-P) cconv)) > ; > chunk-num (int (/ y chunk-height)) > chunk-x (int (rem (int (+ x x-offset)) chunk-width)) > chunk-y (int (- y (* chunk-height chunk-num))) > ^BufferedImage chunk (cached-load-chunk chunk-num) > pixel (int (if chunk (.getRGB chunk chunk-x chunk-y) 0)) > r (double (+ r (/ (bit-and pixel 0x00ff) 65536))) > g (double (+ g (/ (bit-and pixel 0xff00) 256))) > b (double (+ b (bit-and pixel 0x00ff)))] > (recur nsx nsy (inc ns) r g b > out)) > > > -- > You received this message because you are subscribed to the Google > Groups "Clojure" group. > To post to this group, send email to clojure@googlegroups.com > Note that posts from new members are moderated - please be patient with your > first post. > To unsubscribe from t
Re: Slow image convolution
what a mess if it is a function it's huuge did you try split it to useful chunks ? it's just unreadable and that's why you cann't spot anything i guess.. пятница, 9 ноября 2012 г., 0:48:20 UTC+4 пользователь Cedric Greevey написал: > > I have the following code to perform a complicated image convolution. It > takes 10-15 seconds with output dimensions 256x256 and samples 6. No > reflection warnings, and using unchecked math doesn't speed it up any. I've > tried to ensure it uses primitive math inside the loops, aside from > generating the outer loop's values. What cached-load-chunk does shouldn't > matter much, but in most cases it should boil down to a map lookup inside a > swap! and a couple of atom derefs and function calls. The bottleneck is > likely in the math somewhere, and likely something is being boxed, though > I've primitive-coerced every numerical let and loop value and avoided more > than two arguments per arithmetic op. > > Can anyone spot anything I haven't that could be causing boxed arithmetic > inside the loops? > > (let [^BufferedImage out (BufferedImage. width height > BufferedImage/TYPE_INT_BGR) > half-w (double (/ width 2.0)) > half-h (double (/ height 2.0)) > lnk (double (- (* (Math/log 0.1) (inc l10-mag)) >(/ (Math/log (+ (* half-w half-w) (* half-h > half-h))) 2))) > cconv (double (/ chunk-width (* 2.0 Math/PI))) > c-base-ln (double (* chunk-base-mag (Math/log 0.1))) > x-offset (double (* chunk-width (+ 0.5 (/ deg-rot 360 > fmtstr (str "%0" chunk-name-digits "d") > cached-load-chunk (cached-objects > (fn [chunk-num] > (maybe-load-image > (str chunk-name-base > (format fmtstr (inc chunk-num)) > ".png"] > (doseq [oy (range height) ox (range width)] > (let [ox (int ox) > oy (int oy) > ix (double (- ox half-w)) > iy (double (- oy half-h)) > samples (double samples) > smo (int (dec samples))] > (loop [sx (int 0) sy (int 0) ns (double 0) >r (double 0) g (double 0) b (double 0)] > (if (== sy samples) > (.setRGB out ox oy (int (+ > (int (/ b ns)) > (+ (* 256 (int (/ g ns))) > (* 65536 (int (/ r ns))) > (let [nsx (int (if (== sx smo) 0 (inc sx))) > nsy (int (if (zero? nsx) (inc sy) sy)) > ix (double (+ ix (/ (+ sx (rand)) samples))) > iy (double (+ iy (/ (+ sy (rand)) samples)))] > (if (and (== ix 0) (== iy 0)) > (recur nsx nsy ns r g b) > (let [lnz-P (double (+ lnk (/ (Math/log (+ (* ix ix) (* iy > iy))) 2))) > argz-P (double (+ Math/PI (Math/atan2 iy ix))) > ; atan2 output range is -pi...pi and oy increases > down the screen > x (double (* argz-P cconv)) > y (double (* (- c-base-ln lnz-P) cconv)) > ; > chunk-num (int (/ y chunk-height)) > chunk-x (int (rem (int (+ x x-offset)) chunk-width)) > chunk-y (int (- y (* chunk-height chunk-num))) > ^BufferedImage chunk (cached-load-chunk chunk-num) > pixel (int (if chunk (.getRGB chunk chunk-x chunk-y) > 0)) > r (double (+ r (/ (bit-and pixel 0x00ff) 65536))) > g (double (+ g (/ (bit-and pixel 0xff00) 256))) > b (double (+ b (bit-and pixel 0x00ff)))] > (recur nsx nsy (inc ns) r g b > out)) > > -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en
Re: Slow image convolution
I haven't determined exactly why that particular code is slow for you, but I have looked at it for a few minutes and have some suggestions: 1. In Clojure 1.3 and later, the default integer type for primitive arithmetic is long, not int. Declaring things int means that they are really longs in the byte code, with type casts between long and int where needed. Type casts between long and int are not slow, I'd guess, but wanted to mention this FYI. 2. I haven't followed all of the changes you have made to the code, but looked at the version in your original message of the thread. There chunk-width and chunk-height are used in the inner loop, but have no let binding to cause them to be primitives, therefore are more likely to be boxed numbers and slow down the operations in which they are involved. Maybe you already found and fixed this in your "I've now checked for Var lookups" message, but I can't tell from your message. 3. I have some hope that if I can find a decent Java decompiler, it will help in tracking these things down, by decompiling the .class files that Clojure can be compiled to back into Java source code. If one is experienced in reading Java byte code, then a simpler Java disassembler is sufficient, but I'm not sure I want to learn to read Java byte codes that well. I've tried the free JD-GUI, which produces output that looks like Java, but the tool seems to not handle Clojure loops very well. I've gotten better results in a couple of small tests with the commercial AndroChef Java Decompiler 1.0, using a few of its 10 free trials. I may purchase a copy for $24.99 after a few more trials if it looks good. Seems to be Windows and GUI only. >From a *brief* look at the Java decompiled by AndroChef for your original >code, wrapped in a function with untyped arguments for all of the unbound >symbols (see https://gist.github.com/4046744 ), it appears that perhaps >operations on a mix of doubles and longs don't use primitive arithmetic. That >is a guess, mind you. I haven't tried out smaller experiments to see whether >that is true in general, or whether it is something about this particular >function that causes that. Andy On Nov 8, 2012, at 6:50 PM, Cedric Greevey wrote: > I can now add that the 30 taken by the cache lookup + .getRGB call is almost > entirely consumed by the latter. BufferedImage.getRGB is 50% slower than > rand. There's still about 380 billion cycles being spent on only around 3 > billion numerical ops plus 100 million each of log and atan, though. Seems > excessive. > > > > On Thu, Nov 8, 2012 at 8:38 PM, Cedric Greevey wrote: > So 100 million (rand) calls take 20 seconds. From temporarily changing > ^BufferedImage chunk (chunk-cache chunk-num) to ^BufferedImage chunk nil, I > determined that 100 million of the *combination* of the cache lookup (with no > misses) and the .getRGB call took 30. That still leaves just over two minutes > with only the arithmetic left in the loop. Either something's getting boxed > or it's the trig calls. > > > > On Thu, Nov 8, 2012 at 8:18 PM, Cedric Greevey wrote: > (rand) is expensive -- removing the two (rand)s knocks about 40 seconds off > it, nearly 1/5 the total time. I'll try replacing them with lookup from a > precalculated grid of randoms -- long-range correlations shouldn't matter > here. > > > > > On Thu, Nov 8, 2012 at 8:00 PM, Cedric Greevey wrote: > On Thu, Nov 8, 2012 at 3:48 PM, Cedric Greevey wrote: > I have the following code to perform a complicated image convolution. It > takes 10-15 seconds with output dimensions 256x256 and samples 6. No > reflection warnings, and using unchecked math doesn't speed it up any. I've > tried to ensure it uses primitive math inside the loops, aside from > generating the outer loop's values. What cached-load-chunk does shouldn't > matter much, but in most cases it should boil down to a map lookup inside a > swap! and a couple of atom derefs and function calls. The bottleneck is > likely in the math somewhere, and likely something is being boxed, though > I've primitive-coerced every numerical let and loop value and avoided more > than two arguments per arithmetic op. > > Can anyone spot anything I haven't that could be causing boxed arithmetic > inside the loops? > > I've now checked for Var lookups (none outside the caching function, and now > none inside either) and checked the caching code itself (there's a .get on a > closed-over ConcurrentHashMap, a null check, a .get on a SoftReference, and > another null check, on each lookup, if there isn't a cache miss on that > lookup; plus a couple more method calls for the IFn invokes and an ivar fetch > to get the ConcurrentHashMap reference). > > In the absence of cache misses I'm still seeing ~3.5 *minutes* at 1280x720 > with 10 samples (= about 100 million iterations total of the inner loop). The > arithmetic in there is 23 floating-point ops, five compares, a log, a
Re: Slow image convolution
So 100 million (rand) calls take 20 seconds. From temporarily changing ^BufferedImage chunk (chunk-cache chunk-num) to ^BufferedImage chunk nil, I determined that 100 million of the *combination* of the cache lookup (with no misses) and the .getRGB call took 30. That still leaves just over two minutes with only the arithmetic left in the loop. Either something's getting boxed or it's the trig calls. On Thu, Nov 8, 2012 at 8:18 PM, Cedric Greevey wrote: > (rand) is expensive -- removing the two (rand)s knocks about 40 seconds > off it, nearly 1/5 the total time. I'll try replacing them with lookup from > a precalculated grid of randoms -- long-range correlations shouldn't matter > here. > > > > > On Thu, Nov 8, 2012 at 8:00 PM, Cedric Greevey wrote: > >> On Thu, Nov 8, 2012 at 3:48 PM, Cedric Greevey wrote: >> >>> I have the following code to perform a complicated image convolution. It >>> takes 10-15 seconds with output dimensions 256x256 and samples 6. No >>> reflection warnings, and using unchecked math doesn't speed it up any. I've >>> tried to ensure it uses primitive math inside the loops, aside from >>> generating the outer loop's values. What cached-load-chunk does shouldn't >>> matter much, but in most cases it should boil down to a map lookup inside a >>> swap! and a couple of atom derefs and function calls. The bottleneck is >>> likely in the math somewhere, and likely something is being boxed, though >>> I've primitive-coerced every numerical let and loop value and avoided more >>> than two arguments per arithmetic op. >>> >>> Can anyone spot anything I haven't that could be causing boxed >>> arithmetic inside the loops? >>> >> >> I've now checked for Var lookups (none outside the caching function, and >> now none inside either) and checked the caching code itself (there's a .get >> on a closed-over ConcurrentHashMap, a null check, a .get on a >> SoftReference, and another null check, on each lookup, if there isn't a >> cache miss on that lookup; plus a couple more method calls for the IFn >> invokes and an ivar fetch to get the ConcurrentHashMap reference). >> >> In the absence of cache misses I'm still seeing ~3.5 *minutes* at >> 1280x720 with 10 samples (= about 100 million iterations total of the inner >> loop). The arithmetic in there is 23 floating-point ops, five compares, a >> log, an atan, and three bitwise ANDs. Without the log and atan a hundred >> million of that inner loop should take a second on this box. I very much >> doubt the log and the atan are 209 times slower than the rest of it >> combined. So there's three likely culprits: boxing, two calls to (rand), >> and the BufferedImage .getRGB method call (on a 6000x2198 24bpp image, >> though its size should matter not), unless trig is really that slow. >> >> > -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en
Re: Slow image convolution
(rand) is expensive -- removing the two (rand)s knocks about 40 seconds off it, nearly 1/5 the total time. I'll try replacing them with lookup from a precalculated grid of randoms -- long-range correlations shouldn't matter here. On Thu, Nov 8, 2012 at 8:00 PM, Cedric Greevey wrote: > On Thu, Nov 8, 2012 at 3:48 PM, Cedric Greevey wrote: > >> I have the following code to perform a complicated image convolution. It >> takes 10-15 seconds with output dimensions 256x256 and samples 6. No >> reflection warnings, and using unchecked math doesn't speed it up any. I've >> tried to ensure it uses primitive math inside the loops, aside from >> generating the outer loop's values. What cached-load-chunk does shouldn't >> matter much, but in most cases it should boil down to a map lookup inside a >> swap! and a couple of atom derefs and function calls. The bottleneck is >> likely in the math somewhere, and likely something is being boxed, though >> I've primitive-coerced every numerical let and loop value and avoided more >> than two arguments per arithmetic op. >> >> Can anyone spot anything I haven't that could be causing boxed arithmetic >> inside the loops? >> > > I've now checked for Var lookups (none outside the caching function, and > now none inside either) and checked the caching code itself (there's a .get > on a closed-over ConcurrentHashMap, a null check, a .get on a > SoftReference, and another null check, on each lookup, if there isn't a > cache miss on that lookup; plus a couple more method calls for the IFn > invokes and an ivar fetch to get the ConcurrentHashMap reference). > > In the absence of cache misses I'm still seeing ~3.5 *minutes* at 1280x720 > with 10 samples (= about 100 million iterations total of the inner loop). > The arithmetic in there is 23 floating-point ops, five compares, a log, an > atan, and three bitwise ANDs. Without the log and atan a hundred million of > that inner loop should take a second on this box. I very much doubt the log > and the atan are 209 times slower than the rest of it combined. So there's > three likely culprits: boxing, two calls to (rand), and the BufferedImage > .getRGB method call (on a 6000x2198 24bpp image, though its size should > matter not), unless trig is really that slow. > > -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en
Re: Slow image convolution
On Thu, Nov 8, 2012 at 3:48 PM, Cedric Greevey wrote: > I have the following code to perform a complicated image convolution. It > takes 10-15 seconds with output dimensions 256x256 and samples 6. No > reflection warnings, and using unchecked math doesn't speed it up any. I've > tried to ensure it uses primitive math inside the loops, aside from > generating the outer loop's values. What cached-load-chunk does shouldn't > matter much, but in most cases it should boil down to a map lookup inside a > swap! and a couple of atom derefs and function calls. The bottleneck is > likely in the math somewhere, and likely something is being boxed, though > I've primitive-coerced every numerical let and loop value and avoided more > than two arguments per arithmetic op. > > Can anyone spot anything I haven't that could be causing boxed arithmetic > inside the loops? > I've now checked for Var lookups (none outside the caching function, and now none inside either) and checked the caching code itself (there's a .get on a closed-over ConcurrentHashMap, a null check, a .get on a SoftReference, and another null check, on each lookup, if there isn't a cache miss on that lookup; plus a couple more method calls for the IFn invokes and an ivar fetch to get the ConcurrentHashMap reference). In the absence of cache misses I'm still seeing ~3.5 *minutes* at 1280x720 with 10 samples (= about 100 million iterations total of the inner loop). The arithmetic in there is 23 floating-point ops, five compares, a log, an atan, and three bitwise ANDs. Without the log and atan a hundred million of that inner loop should take a second on this box. I very much doubt the log and the atan are 209 times slower than the rest of it combined. So there's three likely culprits: boxing, two calls to (rand), and the BufferedImage .getRGB method call (on a 6000x2198 24bpp image, though its size should matter not), unless trig is really that slow. -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en