Re: need help making shell script use two CPUs/cores
Carl Johnson put forth on 1/24/2011 5:07 PM:

Stan Hoeppner s...@hardwarefreak.com writes: Now we have 4 CPUs on two memory channels. If not for caches, you'd see no speedup past 2 ImageMagick processes. Which is pretty much the behavior identified by another OP with an Athlon II x4 system--almost zero speedup from 2 to 4 processes.

I think you are referring to the data that I posted for my Athlon II x4 system, but that is *NOT* what the data showed. I thought that the data clearly showed pretty good scaling up to 4 processors, so I don't know what you are seeing that everybody else is missing. I will copy some of the data below, but basically it showed that total time was almost cut in half when it went from 1 to 2 processors, and again when it went from 2 to 4 processors.

Processors  Time (seconds)
P1          66
P2          36
P4          20

Perfect scaling here would be a run time of 16.5 seconds with 4 processes/cores on this particular sample set of photos. 20 is 1/4th of 80, and closer to 1/3rd of 66. This isn't close to linear scaling, although it is a little better than I expected from this particular CPU. One can clearly see the effects of memory contention at only 2 processes, and that trend continues out to 4 processes, getting progressively worse, as expected. Past four processes, likely at 5 and beyond, you'll see little, and then no, scaling at all.

I must admit I am a bit surprised that a quad-core AMD with only 512KB of L2 per core, and no L3, scales as well as it does to 4 processes. The images in this sample test are relatively small, though, so cache size probably isn't much in play. With larger images I'm guessing we'd see less scaling than we do here, as many more reads to main memory would be required to fetch the pixel data, and thus memory contention among the 4 cores would be much higher.

-- Stan

--
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/4d3edc1a.3090...@hardwarefreak.com
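The scaling dispute above comes down to arithmetic on three numbers. A quick awk sketch (run times hard-coded from Carl's table) computes the speedup and parallel efficiency the two sides are arguing about:

```shell
# Compute speedup (T1/Tn) and parallel efficiency (speedup/n) from the
# run times Carl posted: 66s, 36s, and 20s at 1, 2, and 4 processes.
awk 'BEGIN {
    t[1] = 66; t[2] = 36; t[4] = 20          # seconds, from the thread
    for (n = 1; n <= 4; n *= 2) {
        s = t[1] / t[n]
        printf "%d procs: %.2fx speedup, %.0f%% efficiency\n", n, s, 100 * s / n
    }
}'
```

By this measure the X4 holds roughly 92% efficiency at 2 processes and roughly 82% at 4, which supports Carl's reading (good but sub-linear scaling) while still showing the memory contention Stan describes.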
Re: need help making shell script use two CPUs/cores
Stan Hoeppner wrote:

Bob Proulx put forth: Here is some raw data from another test using GraphicsMagick from Debian Sid on an Intel Core2 Quad CPU Q9400 @ 2.66GHz.

#CPUs    real    user   sys
 1 ...  32.17  100.15  2.29
 2 ...  28.02  102.09  2.25
 3 ...  26.96  101.41  2.02
 4 ...  26.18   99.85  2.10
 5 ...  26.03   98.58  2.27
 6 ...  27.07   97.32  2.17
 7 ...  27.74  100.09  2.03
 8 ...  26.76   97.83  1.99
 9 ...  27.24   97.31  2.88
10 ...  26.27   99.05  2.76
11 ...  26.35   99.30  1.84
12 ...  25.91   97.63  2.08

So, I'm not understanding how we have a quad core CPU with 12 CPUs. Is #CPUs here your xargs -P argument in the script you posted in response to my question that started this thread?

Sorry, yes, I made a mistake in posting those headings. Where I said #CPUs it was the xargs -P parallelization argument: running a different number of conversion processes in parallel.

Why bother going up to 12 processes with a quad core chip? Anything over 4 processes/threads won't gain you anything, as your results above demonstrate.

I went to 12 because it would demonstrate the behavior three times past the number of cores. If I had only a dual core I would have gone only to 6. But I would have gone to 6 for one core too, since three doesn't generate a smooth enough scatter plot for me. I didn't want to spend too much time analyzing the problem to set up a statistically designed experiment; I just wanted to quickly perform the test, so I plugged in 12 there and moved on. Surely that would be enough. I didn't think I would need to rigorously defend that quick choice against a panel. At some point more parallelism will actually slow things down; I didn't reach that point.

And the same thing using ImageMagick on the same system.

#CPUs    real   user   sys
 1 ...  24.69  62.60  2.87
 2 ...  19.28  63.17  2.50
 3 ...  17.82  60.34  2.65
 4 ...  17.48  58.86  2.55
 5 ...  16.60  58.11  2.34
 6 ...  15.85  58.03  2.38
 7 ...  15.61  58.09  2.44
 8 ...  15.36  57.68  2.48
 9 ...  15.48  57.76  2.38
10 ...  15.38  57.76  2.28
11 ...  15.36  57.97  2.27
12 ...  15.73  58.76  2.17

Watching the individual cpu load I observe that while the 1 cpu case did consume one cpu fully, the other three were also showing quite a bit of activity too.

ImageMagick will use threads on larger images. To keep it from threading, in order for your testing to make more sense, use smaller images.

I couldn't find anything in the ImageMagick documentation that described its threading behavior. Where did I miss that useful information? For the images I used your set of benchmark photos that we have been discussing in this thread. With three running, all four cpus were looking pretty much 100% consumed. I was timing the shell's for loop, the xargs, and the convert processes all together.

If you are converting images large enough that the threading kicks in, there's little reason to use multiple processes at that point. We'd already discussed this. Were you simply trying to confirm that with these tests?

I expected that on this machine the memory backplane wouldn't have enough bandwidth to support all four processors; I expected it to brown out before getting to four. Having a quad-core sounds great, but just having four cores doesn't mean all of them can be used at the same time to advantage. I expected the extra cores would get starved, and so the curve would drop off sooner than four.

I also tried running this same test on some slower hardware. I have gotten spoiled by the faster machine. The benchmark is still running on my slower machines. :-) I am not going to wait for it to finish.

What are the CPU specs of this older machine?

I tested this on an Intel Celeron 2.4GHz machine with 2.5G ram. Unfortunately I see now that I have lost the saved data from that test. (Drat! I know what I did, but I would need to run the test again to regenerate it.) But an entire run to six parallel conversions there, as I recall, took over thirty minutes of total time to complete, and as I recall worked out to being twenty times slower. Don't hold me to those numbers, as I would need to capture the actual data again to be sure, and I don't want to spend the time to do that. But it was slower, much slower. (This is actually my main web server and normally does image conversions when I upload photos. This information is probably going to motivate me to set up a task queue to speed up my image conversions there.)

Bob
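The sweep Bob describes (timing the same batch at -P 1 through 12) can be sketched along these lines. The image set, the resize geometry, and the resized-* output naming are illustrative assumptions, not his exact script:

```shell
#!/bin/sh
# Time the same conversion batch at increasing xargs -P values,
# printing one "n ... seconds" line per run, like the tables above.
# Writing to resized-* (an assumed naming scheme) keeps the inputs
# untouched so every run converts identical data.
for n in $(seq 1 12); do
    start=$(date +%s)
    ls *.JPG | xargs -P "$n" -I {} convert {} -resize 1024 resized-{}
    echo "$n ... $(( $(date +%s) - start ))s"
done
```

One design note: timing the whole pipeline per -P value, rather than each convert, is what makes the results comparable to the real/user/sys tables quoted in the thread.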
Re: need help making shell script use two CPUs/cores
Bob Proulx put forth on 1/24/2011 12:21 PM:

Stan Hoeppner wrote: Why bother going up to 12 processes with a quad core chip? Anything over 4 processes/threads won't gain you anything, as your results above demonstrate.

I went to 12 because it would demonstrate the behavior three times past the number of cores. If I had only a dual core I would have gone only to 6. But I would have gone to 6 for one core too, since three doesn't generate a smooth enough scatter plot for me. I didn't want to spend too much time analyzing the problem to set up a statistically designed experiment; I just wanted to quickly perform the test, so I plugged in 12 there and moved on. Surely that would be enough. I didn't think I would need to rigorously defend that quick choice against a panel.

But you'll run out of memory bandwidth before you hit 4 processes, especially if your 4-way chip has no L3 cache, such as the Athlon II x4 chips. Going all the way out to 12 processes seems a bit silly. Even with something like one of Intel's Core i7s with a monster L3 cache, you'll exhaust your memory and cache b/w well before you have (#cores*1.5) processes.

At some point by doing more parallelism things will actually be slowed down by it. I didn't reach that point.

This will probably only occur if you run out of memory and have to swap. The overhead of the Linux task scheduler is tiny--we're talking microseconds per task switch. And as I mentioned, you're already thrashing your caches at 4 processes, so beyond that point everything is purely memory b/w constrained. That bandwidth is finite and static, so no matter how many processes you run (unless you run more processes than you have images) you probably won't see any slowdown past 4 processes.

ImageMagick will use threads on larger images. To keep it from threading, in order for your testing to make more sense, use smaller images. I couldn't find anything in the ImageMagick documentation that described its threading behavior. Where did I miss that useful information?

See: http://www.imagemagick.org/script/architecture.php

For the images I used your set of benchmark photos that we have been discussing in this thread.

Hmmm. If you were seeing threading with a single process with those images, this would lead me to believe the Lenny ImageMagick version doesn't support threads. You're running the Squeeze package, correct? I'm running:

$ identify -version
Version: ImageMagick 6.3.7 11/17/10 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2008 ImageMagick Studio LLC

According to the docs I should see something like:

Features: OpenMP OpenCL

but I don't.

I expected that on this machine the memory backplane wouldn't have enough memory bandwidth to support all four processors.

None of them do. Recall when the first socket 939 AMD chips hit the market, with all mobos having dual channel memory as the controller was on the CPU? One core with dual memory channels, and many applications saw huge performance gains. Now we have 4 CPUs on two memory channels. If not for caches, you'd see no speedup past 2 ImageMagick processes. Which is pretty much the behavior identified by another OP with an Athlon II x4 system--almost zero speedup from 2 to 4 processes.

I expect it to brown out before getting to four. Having a quad-core sounds great, but just having four cores doesn't mean all of them can be used at the same time to advantage. I expect that the extra cores will get starved, and so the curve will drop off sooner than four.

This is always the case. No multicore CPU has enough memory channels to keep all cores fed on a byte/OP basis. This is no secret. It's been well discussed for many years now.

I also tried running this same test on some slower hardware. I have gotten spoiled by the faster machine. The benchmark is still running on my slower machines. :-) I am not going to wait for it to finish.

What are the CPU specs of this older machine?

I tested this on an Intel Celeron 2.4GHz machine with 2.5G ram.

My test server is a dual Celeron 550 with only 384MB and it doesn't take anywhere near 30 minutes for that set of test images. IIRC it only took a few minutes.

-- Stan

Archive: http://lists.debian.org/4d3dcfd0.5090...@hardwarefreak.com
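To separate process-level parallelism (xargs -P) from ImageMagick's own OpenMP threading, you can check the build's feature line and, on threaded builds, pin convert to one thread. The photo.JPG file names below are placeholders:

```shell
# Does this ImageMagick build advertise OpenMP (threaded) support?
# Builds without a "Features: ... OpenMP" line, like the 6.3.7 output
# quoted above, are single-threaded to begin with.
if identify -version | grep -q 'Features:.*OpenMP'; then
    echo "threaded build: limit threads before benchmarking processes"
else
    echo "single-threaded build"
fi

# On OpenMP builds, either of these confines convert to one thread,
# so that xargs -P is the only source of parallelism:
MAGICK_THREAD_LIMIT=1 convert photo.JPG -resize 1024 photo-small.JPG
convert -limit thread 1 photo.JPG -resize 1024 photo-small.JPG
```

With threading pinned to 1, the per-process scaling tables in this thread measure only memory and cache contention, not a mix of threads and processes.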
Re: need help making shell script use two CPUs/cores
Stan Hoeppner s...@hardwarefreak.com writes: Now we have 4 CPUs on two memory channels. If not for caches, you'd see no speedup past 2 ImageMagick processes. Which is pretty much the behavior identified by another OP with an Athlon II x4 system--almost zero speedup from 2 to 4 processes.

I think you are referring to the data that I posted for my Athlon II x4 system, but that is *NOT* what the data showed. I thought that the data clearly showed pretty good scaling up to 4 processors, so I don't know what you are seeing that everybody else is missing. I will copy some of the data below, but basically it showed that total time was almost cut in half when it went from 1 to 2 processors, and again when it went from 2 to 4 processors.

Processors  Time (seconds)
P1          66
P2          36
P4          20

--
Carl Johnson  ca...@peak.org

Archive: http://lists.debian.org/87y669zysj.fsf@oak.localnet
Re: need help making shell script use two CPUs/cores
Carl Johnson wrote:

#CPUs  time  theoretical    time - theoretical      gain/CPU (theoretical)
1      66
2      36    66/2 = 33      36 - 33   = 3   (+9%)   1 - 1/2   = 1/2
3      25    66/3 = 22      25 - 22   = 3   (+14%)  1/2 - 1/3 = 1/6
4      20    66/4 = 16.5    20 - 16.5 = 3.5 (+21%)  1/3 - 1/4 = 1/12

I liked that analysis. Here is some raw data from another test using GraphicsMagick from Debian Sid on an Intel Core2 Quad CPU Q9400 @ 2.66GHz.

#CPUs    real    user   sys
 1 ...  32.17  100.15  2.29
 2 ...  28.02  102.09  2.25
 3 ...  26.96  101.41  2.02
 4 ...  26.18   99.85  2.10
 5 ...  26.03   98.58  2.27
 6 ...  27.07   97.32  2.17
 7 ...  27.74  100.09  2.03
 8 ...  26.76   97.83  1.99
 9 ...  27.24   97.31  2.88
10 ...  26.27   99.05  2.76
11 ...  26.35   99.30  1.84
12 ...  25.91   97.63  2.08

And the same thing using ImageMagick on the same system.

#CPUs    real   user   sys
 1 ...  24.69  62.60  2.87
 2 ...  19.28  63.17  2.50
 3 ...  17.82  60.34  2.65
 4 ...  17.48  58.86  2.55
 5 ...  16.60  58.11  2.34
 6 ...  15.85  58.03  2.38
 7 ...  15.61  58.09  2.44
 8 ...  15.36  57.68  2.48
 9 ...  15.48  57.76  2.38
10 ...  15.38  57.76  2.28
11 ...  15.36  57.97  2.27
12 ...  15.73  58.76  2.17

Watching the individual cpu load I observe that while the 1 cpu case did consume one cpu fully, the other three were also showing quite a bit of activity too. There was already quite a bit of parallelism happening before adding the second cpu, and third, and so forth. With three running, all four cpus were looking pretty much 100% consumed. I was timing the shell's for loop, the xargs, and the convert processes all together.

I also tried running this same test on some slower hardware. I have gotten spoiled by the faster machine. The benchmark is still running on my slower machines. :-) I am not going to wait for it to finish.

Bob
Re: need help making shell script use two CPUs/cores
Bob Proulx put forth on 1/23/2011 8:16 PM:

Apparently I've missed some of the thread since my earlier participation.

Carl Johnson wrote:

#CPUs  time  theoretical    time - theoretical      gain/CPU (theoretical)
1      66
2      36    66/2 = 33      36 - 33   = 3   (+9%)   1 - 1/2   = 1/2
3      25    66/3 = 22      25 - 22   = 3   (+14%)  1/2 - 1/3 = 1/6
4      20    66/4 = 16.5    20 - 16.5 = 3.5 (+21%)  1/3 - 1/4 = 1/12

I liked that analysis. Here is some raw data from another test using GraphicsMagick from Debian Sid on an Intel Core2 Quad CPU Q9400 @ 2.66GHz.

#CPUs    real    user   sys
 1 ...  32.17  100.15  2.29
 2 ...  28.02  102.09  2.25
 3 ...  26.96  101.41  2.02
 4 ...  26.18   99.85  2.10
 5 ...  26.03   98.58  2.27
 6 ...  27.07   97.32  2.17
 7 ...  27.74  100.09  2.03
 8 ...  26.76   97.83  1.99
 9 ...  27.24   97.31  2.88
10 ...  26.27   99.05  2.76
11 ...  26.35   99.30  1.84
12 ...  25.91   97.63  2.08

So, I'm not understanding how we have a quad core CPU with 12 CPUs. Is #CPUs here your xargs -P argument in the script you posted in response to my question that started this thread? Why bother going up to 12 processes with a quad core chip? Anything over 4 processes/threads won't gain you anything, as your results above demonstrate.

And the same thing using ImageMagick on the same system.

#CPUs    real   user   sys
 1 ...  24.69  62.60  2.87
 2 ...  19.28  63.17  2.50
 3 ...  17.82  60.34  2.65
 4 ...  17.48  58.86  2.55
 5 ...  16.60  58.11  2.34
 6 ...  15.85  58.03  2.38
 7 ...  15.61  58.09  2.44
 8 ...  15.36  57.68  2.48
 9 ...  15.48  57.76  2.38
10 ...  15.38  57.76  2.28
11 ...  15.36  57.97  2.27
12 ...  15.73  58.76  2.17

Watching the individual cpu load I observe that while the 1 cpu case did consume one cpu fully, the other three were also showing quite a bit of activity too.

ImageMagick will use threads on larger images. To keep it from threading, in order for your testing to make more sense, use smaller images.

There was already quite a bit of parallelism happening before adding the second cpu, and third, and so forth.

See above. BTW, you weren't adding the 2nd CPU. You were merely spawning more _processes_, no?

With three running, all four cpus were looking pretty much 100% consumed. I was timing the shell's for loop, the xargs, and the convert processes all together.

If you are converting images large enough that the threading kicks in, there's little reason to use multiple processes at that point. We'd already discussed this. Were you simply trying to confirm that with these tests?

I also tried running this same test on some slower hardware. I have gotten spoiled by the faster machine. The benchmark is still running on my slower machines. :-) I am not going to wait for it to finish.

What are the CPU specs of this older machine?

-- Stan

Archive: http://lists.debian.org/4d3ce747.5050...@hardwarefreak.com
Re: need help making shell script use two CPUs/cores
Stan Hoeppner s...@hardwarefreak.com writes: Carl Johnson put forth on 1/13/2011 11:34 AM:

Processors  Time (seconds)
P1          66
P2          36
P3          25
P4          20
P5          20
P6          20
P7          20
P8          20

Your numbers bear out exactly what I predicted. Look at the decrease in run time from 1 to 2, 2 to 3, and from 3 to 4 processes:

#CPUs  Decremental run time  Fractional gain per CPU
2      30s                   1/2
3      11s                   1/6th
4      5s                    1/13th

You can clearly see the effects of serious memory contention when 3 cores are pegged. Bringing the 4th core into the mix yields almost nothing compared to three cores, cutting only 5 seconds from a 66 second run time.

I seem to be looking at it in a different way, because the numbers don't seem that much different from what I would expect.

#CPUs  time  theoretical    time - theoretical      gain/CPU (theoretical)
1      66
2      36    66/2 = 33      36 - 33   = 3   (+9%)   1 - 1/2   = 1/2
3      25    66/3 = 22      25 - 22   = 3   (+14%)  1/2 - 1/3 = 1/6
4      20    66/4 = 16.5    20 - 16.5 = 3.5 (+21%)  1/3 - 1/4 = 1/12

--
Carl Johnson  ca...@peak.org

Archive: http://lists.debian.org/87ei8e2foa.fsf@oak.localnet
Re: need help making shell script use two CPUs/cores
On Sun, Jan 09, 2011 at 10:05:43AM -0600, Stan Hoeppner wrote: I'm not very skilled at writing shell scripts.

#! /bin/sh
for k in $(ls *.JPG); do convert $k -resize 1024 $k; done

I use the above script to batch re-size digital camera photos after I dump them to my web server. It takes a very long time with lots of new photos as the server is fairly old, even though it is a 2-way SMP, because the script only runs one convert process at a time serially, only taking advantage of one CPU. The convert program is part of the imagemagick toolkit. How can I best modify this script so that it splits the overall job in half, running two simultaneous convert processes, one on each CPU? Having such a script should cut the total run time in half, or nearly so, which would really be great.

Not really 2, but... either use make to run a controlled number of processes, or just do something along the lines of:

for k in $(ls *.JPG); do convert $k -resize 1024 $k >log.txt 2>log2.txt & done

Archive: http://lists.debian.org/20110114154648.ga16...@debian.org
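The approach the rest of the thread converges on uses xargs -P to cap the number of simultaneous converts at the CPU count. A minimal sketch (the find pattern and -P 2 value mirror the 2-way server described above; adjust -P to your core count):

```shell
#!/bin/sh
# Resize JPEGs in place, running at most 2 convert processes at a
# time (one per CPU).  -print0/-0 keeps file names with spaces intact.
find . -maxdepth 1 -name '*.JPG' -print0 |
    xargs -0 -P 2 -I {} convert {} -resize 1024 {}
```

Unlike backgrounding every convert at once with &, xargs -P never exceeds the requested process count, so a directory of hundreds of photos won't fork hundreds of converts and push the box into swap.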
Re: need help making shell script use two CPUs/cores
Carl Johnson put forth on 1/13/2011 11:34 AM:

Processors  Time (seconds)
P1          66
P2          36
P3          25
P4          20
P5          20
P6          20
P7          20
P8          20

I am sure the time would have increased if the system had run out of memory and had to start swapping. The system is not completely idle since I am running a KDE 4.4.5 desktop and VirtualBox with two guest OSs (Debian and NetBSD). I suspect it would have closer to linear scaling if the system had been completely idle.

Your numbers bear out exactly what I predicted. Look at the decrease in run time from 1 to 2, 2 to 3, and from 3 to 4 processes:

#CPUs  Decremental run time  Fractional gain per CPU
2      30s                   1/2
3      11s                   1/6th
4      5s                    1/13th

You can clearly see the effects of serious memory contention when 3 cores are pegged. Bringing the 4th core into the mix yields almost nothing compared to three cores, cutting only 5 seconds from a 66 second run time.

I'm anxious to see someone's results for a Phenom II X2 with the 6MB L3 cache to verify my prediction there. That's a tougher prediction, though, as I haven't modeled the cache behavior of ImageMagick's convert program, and the data above shows it seems to be very memory b/w heavy. Such a test would definitely be very revealing of the effectiveness of the Phenom II X2's L3 cache, given what we've seen so far.

-- Stan

Archive: http://lists.debian.org/4d30d0f0.7000...@hardwarefreak.com
Re: need help making shell script use two CPUs/cores
Bob Proulx put forth on 1/12/2011 2:48 PM: That makes a lot of sense to me. And also when cpu time divides by 1/N, where N is the number of processes, then if you have more convert processes running, that task will effectively get more total time than the other tasks will. A little bit more here and a little bit less there on the other tasks running. If you had two converts running and one mail task, the mail task would get 1/3rd and the two converts would get 2/3rds, as opposed to one convert and one mail task with 1/2 and 1/2.

The math isn't quite that simple, as it's a 2-way SMP box, and Linux can't perfectly schedule compute-intensive and non-compute-intensive processes across CPUs. And context switching on a 550 MHz CPU with only 128K of L2 cache is going to be expensive when two compute-intensive tasks are running.

I commend you on keeping that machine running. My main mail and web server was, until the motherboard died very recently, a 400 MHz P2. I was sad to see it go since it had been such a good performer for so many years.

I'll be _very_ sad when this one dies. The Abit BP6 is the only dual Celeron motherboard ever made. It is legendary among overclockers due to the SMP nature, the fact that Celerons were 1/3rd the price of PIIs at the time, and that 333s were easily bumped to 500, and 366s easily bumped to 550--a 50% increase in clock speed, usually achievable with stock heat sinks. No modern chip will do that AFAIK. Thus you got _more_ performance than an equivalent dual PII workstation, which topped out at 450 MHz, for less than 1/3rd the price. This single board prompted Intel to disable the SMP circuitry on all future Celerons; they'd left it enabled assuming no one would actually build such a board. Abit did, and using the venerable Intel 440BX northbridge no less. :)

The BP6 also had a lot of features most other boards were lacking at that time (1999), including jumperless BIOS configuration of CPU FSB and multiplier, independent CPU voltage adjustment, a Winbond voltage/thermal/fan-speed monitoring chip, a dual channel 4 port HighPoint UDMA/66 chip yielding a board with 8 HDD capability (4 at UDMA/66), which no other board offered, and an SMBus header, which _NO_ other consumer board had at that time. The BP6 was the most exotic high end board on the market for at least a couple of years. It used a low end chip, but was faster than anything else at the time. Too bad Abit is no more: http://www.theinquirer.net/inquirer/news/1051283/the-abit-obit

-- Stan

Archive: http://lists.debian.org/4d2ec415.9080...@hardwarefreak.com
Re: need help making shell script use two CPUs/cores
Stan Hoeppner s...@hardwarefreak.com writes: Depending on the size of the photos one is converting, if they're relatively small like my 8.3MP 1.8MB jpegs, I'd think something like a dual core Phenom II X2 w/ 6MB L3 cache and 21.4 GB/s memory b/w would likely continue to scale, with reduced overall script run time, up to 4 parallel convert processes, maybe more, due to the excess of L3 cache and the 10.7 GB/s available to each core. Conversely, I'd think that a quad core Athlon II X4 with no L3 cache and only 512KB of L2 cache per core, with each core receiving effectively only 5.3 GB/s of b/w, would not scale as effectively to core_count*2 parallel processes as the Phenom II X2 would. In fact, due to 4 cores with little cache sharing the same 21.4 GB/s of memory b/w, the quad core Athlon II would probably start seeing a decline in run-time reduction going from 2 processes to 4 as twice as many cores compete for memory access, tailing off dramatically as the process count is increased to 5 and up. Just a guess. Anyone have such systems to test with? :)

I have an Athlon II X4 620 (2.6 GHz), so I ran your test. It is somewhat different since I am currently running FreeBSD and didn't want to reboot to get back into debian, and I have GraphicsMagick instead of ImageMagick, but that shouldn't change the basic results. The results were that the time decreased up to 4 processes, but remained unchanged after that.

Processors  Time (seconds)
P1          66
P2          36
P3          25
P4          20
P5          20
P6          20
P7          20
P8          20

I am sure the time would have increased if the system had run out of memory and had to start swapping. The system is not completely idle since I am running a KDE 4.4.5 desktop and VirtualBox with two guest OSs (Debian and NetBSD). I suspect it would have closer to linear scaling if the system had been completely idle.

--
Carl Johnson  ca...@peak.org

Archive: http://lists.debian.org/87oc7k4si5.fsf@oak.localnet
Re: need help making shell script use two CPUs/cores
On Tue, 11 Jan 2011 15:58:45 -0600, Stan Hoeppner wrote:

Camaleón put forth on 1/11/2011 9:38 AM: I supposed you wouldn't care much about getting a script to run faster with all the available cores occupied if you had a modern (4 years) cpu and plenty of speedy ram, because the routine you wanted to run shouldn't take much time... unless you were going to process thousands of images :-)

That's a bit ironic. You're suggesting the solution is to upgrade to a new system with a faster processor and memory.

Why did you get that impression? No, I said I thought you were running a resource-scarce machine, so in order to simulate your environment I made the tests under my VM... nothing more.

However, all the newer processors have 2, 4, 6, 8, or 12 cores. So upgrading simply for single process throughput would waste all the other cores, which was the exact situation I found myself in.

But of course! I would not even think of upgrading the whole computer just to get one concrete task done a few seconds faster.

The ironic part is that parallelizing the script to maximize performance on my system will also do the same for the newer chips, but to an even greater degree on those with 4, 6, 8, or 12 cores. Due to the fact that convert doesn't eat 100% of a core's time during its run, and the idle time in between one process finishing and xargs starting another, one could probably run 16-18 parallel convert processes on a 12 core Magny Cours with this script before run times stop decreasing.

I think the script should also work very well with single-core cpus.

The script works. It cut my run time by over 50%. I'm happy. As I said, this system's processing power is complete overkill 99% of the time. It works beautifully with pretty much everything I've thrown at it, for 8 years now. If I _really_ wanted to maximize the speed of this photo resizing task I'd install Win32 ImageMagick on my 2GHz Athlon XP workstation with its dual channel memory nForce2 mobo, convert them on the workstation, and copy them to the server. However, absolute maximum performance of this task was not, and is not, my goal. My goal was to make use of the second CPU, which was sitting idle in the server, to speed up the task completion. That goal was accomplished. :)

Yeah, and tests are there to demonstrate the gain. Running more processes than real cores seems fine, did you try it?

Define fine. Fine = system not hogging all resources. I had run 4 (on this 2 core machine) and run time was a few seconds faster than with 2 processes, 3 seconds IIRC. Running 8 processes pushed the system into swap and run time increased dramatically. Given that 4 processes were only a few seconds faster than two, yet consumed twice as much memory, the best overall number of processes to run on this system is two.

Maybe the best number of processes is system-dependent (old processors could work better with a conservative value, but newer ones can gain some extra seconds with a higher one without experiencing any significant penalty).

Greetings,

-- Camaleón

Archive: http://lists.debian.org/pan.2011.01.12.09.56...@gmail.com
Re: need help making shell script use two CPUs/cores
Stan Hoeppner wrote:

Bob Proulx put forth: ...when otherwise it would be waiting for the disk. I believe what you are seeing above is the result of being able to compute during that small block-on-I/O, wait-for-the-disk interval.

That's gotta be a very small iowait interval. So small, in fact, it doesn't show up in top at all. I've watched top a few times during these runs and I never see iowait.

I would expect it to be very small. So small that you won't see it by eye when looking at it with top. Motion pictures run at 24 frames per second; that is quite good enough for your eye to see it as continuous motion, but to a computer 1/24th of a second is a long time. I don't think you will be able to observe this by looking at it with top and a one second update interval.

I assumed the gain was simply because, watching top, each convert process doesn't actually fully peg the cpu during the entire process run life. Running one or two more processes in parallel with the first two simply gives the kernel scheduler the opportunity to run another process during those idle ticks.

Uhm... But that is pretty much exactly what I said! :-) "Doesn't actually fully peg the cpu" is because eventually it will need to block on I/O from the disk. The process will run until it either blocks or is interrupted at the end of its timeslice. Do you propose other reasons for the process not to fully peg the cpu than I/O waits?

There is also the time gap between a process exiting and xargs starting up the next one.

But what would be the cause of that gap? Waiting on disk to load the executable? (Actually it should be cached in the filesystem buffer cache and not have to wait for the disk.) AFAIK there isn't any gap there. (Actually, as long as there is another convert process in memory, the next one will start very quickly by being able to reuse the same in-memory code pages.)

I have no idea how much time that takes. But all the little bits add up in the total execution time of all 35 processes.

Yes. All of the little bits add up, and I believe that accounts for the decrease in total wall-clock time from start to finish. A small but measurable value. And I think we were in agreement about everything else. :-)

Bob
Re: need help making shell script use two CPUs/cores
Camaleón put forth on 1/12/2011 3:56 AM: On Tue, 11 Jan 2011 15:58:45 -0600, Stan Hoeppner wrote: Camaleón put forth on 1/11/2011 9:38 AM: I supposed you wouldn't care much in getting a script to run faster with all the available core occupied if you had a modern (4 years) cpu and plenty of speedy ram because the routine you wanted to run it should not take many time... unless you were going to process thousand of images :-) That's a bit ironic. You're suggesting the solution is to upgrade to a new system with a faster processor and memory. Why did you get that impression? No, I said I thought you were running a resource-scarce machine so in order to simulate your environment I made the tests under my VM... nothing more. My bad Camaleón. I misunderstood what you said. My apologies. However, all the newer processors have 2, 4, 6, 8, or 12 cores. So upgrading simply for single process throughput would waste all the other cores, which was the exact situation I found myself in. But of course! I would not even think in upgrade the whole computer just to get one concrete task done a few more seconds faster. This depends on the task, of course. It my case it just wouldn't make sense, just as you say. I've managed some systems that we'd upgrade every two years because of a single application that never seemed to have enough horsepower under the hood. HPC compute centers seem to follow this trend. There's never enough cycles or enough nodes for many of them. The ironic part is that parallelizing the script to maximize performance on my system will also do the same for the newer chips, but to an even greater degree on those with 4, 6, 8, or 12 cores. Due to the fact that convert doesn't eat 100% of a core's time during its run, and the idle time in between one process finishing and xargs starting another, one could probably run 16-18 parallel convert processes on a 12 core Magny Cours with this script before run times stop decreasing. 
I think the script should also work very well with single-core cpus. This might depend on the hardware, but as I mentioned, it looks like the convert program doesn't use 100% CPU during its run, so yes, using the xargs script to fire up two concurrent convert processes with the kernel time slicing would probably decrease overall run time to some degree. Yeah, and tests are there to demonstrate the gain. Which is always a big plus. No guess work. :) I had run 4 (2 core machine) and run time was a few seconds faster than 2 processes, 3 seconds IIRC. Running 8 processes pushed the system into swap and run time increased dramatically. Given that 4 processes were only a few seconds faster than two, yet consumed twice as much memory, the best overall number of processes to run on this system is two. Maybe the best number of processes is system-dependent (old processors could work better with a conservative value but newer ones can get some extra seconds with a higher one without experiencing any significant penalty). I don't have the machines here to confirm that hypothesis, but knowledge and experience tell me you're exactly correct. The reasons why you're correct are tied mostly to available L2/L3 cache bandwidth, and memory size and bandwidth. On my SUT, one convert process at its peak easily consumes more than half the memory bandwidth, which is why I only see a 50% reduction in run time using 2 processes, one running on each CPU, instead of a 100% reduction. Each 550 MHz Celeron CPU only has 128KB of L2 cache. System memory bandwidth of the 440BX chipset is only 800 MB/s. Depending on the size of the photos one is converting, if they're relatively small like my 8.3MP 1.8MB jpegs, I'd think something like a dual core Phenom II X2 w/ 6MB L3 cache and 21.4 GB/s memory b/w would likely continue to scale with reduced overall script run time up to 4 parallel convert processes, maybe more, due to the excess of L3 cache and the 10.7 GB/s available to each core.
Conversely, I'd think that a quad core Athlon II X4 with no L3 cache and only 512KB L2 cache per core, with each core receiving effectively only 5.3 GB/s of b/w, would not scale as effectively to core_count*2 parallel processes as the Phenom II X2 would. In fact, due to 4 cores with little cache sharing the same 21.4 GB/s of memory b/w, the quad core Athlon II would probably start seeing diminishing run time gains going from 2 processes to 4 as twice as many cores compete for memory access, tailing off dramatically as the process count is increased to 5 and up. Just a guess. Anyone have such systems to test with? :) -- Stan -- To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org Archive: http://lists.debian.org/4d2dfd15.8000...@hardwarefreak.com
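Since the thread's conclusion is that the ideal -P value is system-dependent, here is a hedged sketch (my own addition, not a command anyone in the thread ran) that derives the process count from the core count instead of hard-coding it. `nproc` is from GNU coreutils; the fallback of 2 and the leading `echo` dry run (so it can be tried without ImageMagick) are assumptions:

```shell
#!/bin/sh
# Hedged sketch: derive the xargs -P value from the machine's core
# count, since the sweet spot appears to be system-dependent.
# `nproc` is GNU coreutils; fall back to 2 if it is absent.
JOBS=$(nproc 2>/dev/null || echo 2)
echo "would run $JOBS parallel convert processes"
# Dry run: the leading "echo" prints each convert command instead of
# executing it; drop it to really resize the images.
for k in *.JPG; do [ -e "$k" ] && echo "$k"; done \
    | xargs -I{} -P"$JOBS" echo convert {} -resize 3072 {}
```

Per the memory-bandwidth argument above, on a chip like the Athlon II X4 one would likely cap this at the physical core count rather than core_count*2.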
Re: need help making shell script use two CPUs/cores
Bob Proulx put forth on 1/12/2011 1:11 PM: Stan Hoeppner wrote: Bob Proulx put forth: when otherwise it would be waiting for the disk. I believe what you are seeing above is the result of being able to compute during that small block on I/O wait for the disk interval. That's gotta be a very small iowait interval. So small, in fact, it doesn't show up in top at all. I've watched top a few times during these runs and I never see iowait. I would expect it to be very small. So small that you won't see it by eye when looking at it with top. Motion pictures run at 24 frames per second. That is quite good enough for your eye to see it as continuous motion. But to a computer 1/24th of a second is a long time. I don't think you will be able to observe this by looking at it with top and a one second update interval. My point wasn't that not seeing it meant that it wasn't happening. I'm sure I'd have seen something had I run iostat. But being that small, with a total script run time of over a minute, how does the IO wait time come into play, to any significant degree, if the total IO wait is maybe 2 seconds? (apt analogy btw--good for others who may not have understood otherwise) I assumed the gain was simply because, watching top, each convert process doesn't actually fully peg the cpu during the entire process run life. Running one or two more processes in parallel with the first two simply gives the kernel scheduler the opportunity to run another process during those idle ticks. Uhm... But that is pretty much exactly what I said! :-) Doesn't actually fully peg the cpu is because eventually it will need to block on I/O from the disk. The process will run until it either blocks or is interrupted at the end of its timeslice. Do you propose other reasons for the process not to fully peg the cpu than for I/O waits? Yes, I do. I've not looked at the code so I can't say for sure. 
However, watching top (yes, not that accurate) during the runs showed periods of multiple seconds where each convert process was only running at 60% CPU. Then it would bump back to 100%. IIRC this happened multiple times. Considering this is an image processing program, I would _assume_ the entire image file is loaded into memory upon startup. After processing is complete the image file is written out. I don't see why the process would be accessing disk during its run, especially with these small 1.8 MB jpg files. Thus, I am guessing that there are a couple of code routines in the conversion process that just don't peg the CPU. Or, is it possible that memory contention between the two CPUs causes this less than 100% CPU usage reported in top, and when each is running 100% CPU that most of the workload is actually in that tiny 128KB L2 cache? I'm not a top expert. If a process blocks on memory wait does the kernel still report the process as 100% CPU or lower? Anyway, these are the two possible reasons I propose for the less than 100% CPU usage of the convert processes. I'm making educated guesses here, not stating fact. There is also the time gap between a process exiting and xargs starting up the next one. But what would be the cause of that gap? Waiting on disk to load the executable? (Actually it should be cached into filesystem buffer cache and not have to wait for the disk.) AFAIK there isn't any gap there. (Actually as long as there is another convert process in memory then the next one will start very quickly by being able to reuse the same memory code pages.) As you said, top's 1 second interval, and the manner in which it displays what is happening, may be masking what's really going on. What I've stated was looking at the %CPU for each process, not the summary area %CPU. Likely what I described as a gap was merely one convert PID dying and another starting up at another location further up the screen.
With each of these things occurring in a different frame that would explain the appearance of a time gap. So, I'd say I was wrong in describing that as a time gap. I'd have to do some testing with other tools to absolutely verify all of this. Frankly I'd rather not waste the time on it at this point. You solved my original problem Bob! Thanks again. That was the important takeaway here. Now we're into minutiae (which can be fun, but I'm spending way too much time on debian-user email the last few days). I have no idea how much time that takes. But all the little bits add up in the total execution time of all 35 processes. Yes. All of the little bits add up and I believe account for the decrease in total wall-clock time from start to finish. A small but measurable value. And I think we were in agreement about everything else. :-) Yep. Chalk all this up to incorrect data due to insufficient frame rate. :) Ahh, something else I just realized. Feel free to slap me if you like. :) Given this is a production mx mail and web server, it's very likely that daemons awoke and ate some CPU without causing a highlight change in top.
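For anyone who wants to chase this further than top's one-second frames allow, a minimal Linux-specific sketch (my own suggestion, not a tool Stan or Bob actually ran): the kernel accumulates iowait in /proc/stat as jiffies since boot, so sampling that counter twice and differencing misses nothing between screen refreshes, unlike watching top by eye:

```shell
#!/bin/sh
# Minimal sketch: read the cumulative iowait jiffies from the first
# line of /proc/stat. Fields are: cpu user nice system idle iowait...
# Sample it before and after a run and subtract to get total iowait,
# with no 1-second display interval hiding anything.
read -r cpu user nice system idle iowait rest < /proc/stat
echo "iowait jiffies since boot: $iowait"
```

`pidstat`/`iostat` from the sysstat package would give the same answer per process, assuming sysstat is installed.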
Re: need help making shell script use two CPUs/cores
Stan Hoeppner wrote: Frankly I'd rather not waste the time on it at this point. You solved my original problem Bob! Thanks again. That was the important takeaway here. Now we're into minutiae (which can be fun but I'm spending way too much time on debian-user email the last few days) Glad to have been able to help with your original problem! And I agree, I am spending way too much time here too. Need to get other work done. :-) Ahh, something else I just realized. Feel free to slap me if you like. :) I missed that too. Given this is a production mx mail and web server, it's very likely that daemons awoke and ate some CPU without causing a highlight change in top. Since I was intensely watching the convert processes, I may not have noticed, or simply ignored them. That's a better explanation for the less than 100% CPU per convert process than anything else, and far more likely. smtpd, imapd, lighttpd, etc. are frequently firing and eating little bits of CPU. This is a personal server so the traffic is small, but nonetheless daemons are firing regularly. Postfix alone fires 3 or 4 daemons when mail arrives. None of these eat much CPU time, but they all add up. That makes a lot of sense to me. And when cpu time divides by 1/N, where N is the number of runnable processes, then if you have more convert processes running, that task effectively gets more total time than the other tasks. A little bit more here and a little bit less there on the other tasks running. If you had two converts running and one mail task then the mail task would get 1/3rd and the two converts would get 2/3rds. As opposed to one convert and one mail task with 1/2 and 1/2. And context switching on a 550 MHz CPU with only 128K L2 cache is going to be expensive when two compute intensive tasks are running. I commend you on keeping that machine running. My main mail and web server was, until the motherboard died very recently, a 400 MHz P2.
I was sad to see it go since it had been such a good performer for so many years. Bob
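Bob's 1/N share arithmetic above can be made concrete with a quick calculation (the percentages are just the fractions from his two-converts-plus-one-mail-task example, nothing measured):

```shell
# With an even 1/N scheduler split across 3 runnable tasks, the mail
# task gets 1/3 of the CPU and the two converts get 2/3 combined.
awk 'BEGIN {
    n = 3                                  # runnable tasks
    printf "mail task share:        %.1f%%\n", 100 / n
    printf "convert share combined: %.1f%%\n", 100 * 2 / n
}'
# prints 33.3% and 66.7%
```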
Re: need help making shell script use two CPUs/cores
Camaleón put forth on 1/10/2011 2:11 PM: Didn't you run any test? Okay... (now downloading the sample images) Yes, of course. I just didn't capture results to file. And it's usually better if people see their own results instead of someone else's copy/paste. 2. On your dual processor, or dual core system, execute: for k in *.JPG; do echo $k; done | xargs -I{} -P2 convert {} -resize 3072 {} I used a VM to get the closest environment as you seem to have (a low resource machine) and the above command (timed) gives: I'm not sure what you mean by resources in this context. My box has plenty of resources for the task we're discussing. Each convert process, IIRC, was using 80MB on my system. Only two can run simultaneously. So why queue up 4 or more processes? That just eats memory uselessly for zero decrease in total run time. real 1m44.038s user 2m5.420s sys 1m17.561s It uses 2 convert processes so the files are being run in pairs. And you can even get the job done faster if using -P8: real 1m25.255s user 2m1.792s sys 0m43.563s That's an unexpected result. I would think running #cores*2^x with an increasing x value would start yielding lower total run times within a few multiples of #cores. No need to have a quad core with HT. Nice :-) Use some of the other convert options on large files and you'll want those extra two real cores. ;) Now, to compare the xargs -P parallel process performance to standard serial performance, clear the temp dir and copy the original files over again. Now execute: for k in *.JPG; do convert $k -resize 3072 $k; done This gives: real 2m30.007s user 2m11.908s sys 1m42.634s Which is ~0.46s. of plus delay. Not that bad. You mean 46s not 0.46s. 104s vs 150s = 44% decrease in run time. This _should_ be closer to a 90-100% decrease in a perfect world. In this case there is insufficient memory bandwidth to feed all the processors. I just made two runs on the same set of photos but downsized them to 800x600 to keep the run time down.
(I had you upscale them to 3072x2048 as your CPUs are much newer) $ time for k in *.JPG; do convert $k -resize 800 $k; done real 1m16.542s user 1m11.872s sys 0m4.104s $ time for k in *.JPG; do echo $k; done | xargs -I{} -P2 convert {} -resize 800 {} real 0m41.188s user 1m14.837s sys 0m4.812s 41s vs 77s = 53% decrease in run time. In this case there is insufficient memory bandwidth as well. The Intel BX chipset supports a single channel of PC100 memory for a raw bandwidth of 800MB/s. Image manipulation programs will eat all available memory b/w. On my system, running two such processes allows ~400MB/s to each processor socket, starving the convert program of memory access. To get close to _linear_ scaling in this scenario, one would need something like an 8 core AMD Magny Cours system with quad memory channels, or whatever the Intel platform is with quad channels. One would run with xargs -P2, allowing each process ~12GB/s of memory bandwidth. This should yield a 90-100% decrease in run time. Running more processes than real cores seems fine, did you try it? Define fine. Please post the specs of your SUT, both CPU/mem subsystem and OS environment details (what hypervisor and guest). (SUT is IBM speak for System Under Test). Linux is pretty efficient at scheduling multiple processes among cores in multiprocessor and/or multi-core systems and achieving near linear performance scaling. This is one reason why fork and forget is such a popular method used for parallel programming. All you have to do is fork many children and the kernel takes care of scheduling the processes to run simultaneously. Yep. It handles the processes quite nicely. Are you new to the concept of parallel processing and what CPU process scheduling is? -- Stan
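One caveat worth noting on the `for k in *.JPG; do echo $k; done | xargs ...` form (my addition, not something raised in the thread): filenames containing spaces get split at the newline pipe. A NUL-delimited variant avoids that; the leading `echo` here makes it a dry run so it can be tried without ImageMagick installed:

```shell
#!/bin/sh
# Hedged variant of the same pipeline: find -print0 plus xargs -0
# keeps filenames with spaces intact, which echo | xargs would split.
# The leading "echo" makes this a dry run; drop it to really convert.
find . -maxdepth 1 -name '*.JPG' -print0 \
    | xargs -0 -r -I{} -P2 echo convert {} -resize 800 {}
```

`-r` (GNU no-run-if-empty) keeps xargs from firing when no JPGs match.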
Re: [OT]: Re: need help making shell script use two CPUs/cores
Dan Serban put forth on 1/10/2011 7:52 PM: On Mon, 10 Jan 2011 12:04:19 -0600 Stan Hoeppner s...@hardwarefreak.com wrote: [snip] http://www.hardwarefreak.com/server-pics/ Which gallery system are you using? I quite like it. That's the result of Curator: http://furius.ca/curator/ I've been using it for 7+ years. Debian dropped the package sometime back, before Etch IIRC. Last time I installed it I grabbed it from SourceForge. It's a python app so you need python and you'll need the imagemagick tools. Unfortunately its functions are written in a manner that psyco can't optimize. It's plenty fast though if you're doing a directory structure with only a couple hundred pic files or less. My server is pretty old, 550MHz, and I've got a couple of dirs with thousands of image files. It takes over 12 hours to process them. It processes all subdirs under a dir. I've found no option to disable this. Thus, be mindful of the way you set up your directory structures. Even if nothing in a subdir has changed since the last run, curator will still process all subdirs. It's pretty fast at doing so, but if you have 100 subdirs with 100 files in each that's 10,000 image files to be looked at, and bumps up the run time. With any modern 2-3GHz x86 AMD/Intel CPU you prolly don't need to worry about the speed of curator. I've never run it on a modern chip, just my lowly, but uber cool, vintage Abit BP6 dual Celeron 3...@550 server, which is the server in those photos. I have a tendency to hang onto systems as long as they're still useful. At one time it was my workstation/gaming rig. Those dual Celerons are now idle 99% of the time, and the machine is usually plenty fast for any interactive command line or batch work I need to do. Of note, if you've been reading this thread, you'll notice I use this script and ImageMagick's convert utility to resize my camera photos before running curator on them, since I can now resize them almost twice as fast, running 2 parallel convert processes.
-- Stan
Re: need help making shell script use two CPUs/cores
On Tue, 11 Jan 2011 07:13:47 -0600, Stan Hoeppner wrote: Camaleón put forth on 1/10/2011 2:11 PM: I used a VM to get the closest environment as you seem to have (a low resource machine) and the above command (timed) gives: I'm not sure what you mean by resources in this context. My box has plenty of resources for the task we're discussing. Each convert process, IIRC, was using 80MB on my system. Only two can run simultaneously. So why queue up 4 or more processes? That just eats memory uselessly for zero decrease in total run time. I supposed you wouldn't care much about getting a script to run faster with all the available cores occupied if you had a modern (4 years) cpu and plenty of speedy ram, because the routine you wanted to run should not take much time... unless you were going to process thousands of images :-) (...) I just made two runs on the same set of photos but downsized them to 800x600 to keep the run time down. (I had you upscale them to 3072x2048 as your CPUs are much newer) $ time for k in *.JPG; do convert $k -resize 800 $k; done real 1m16.542s user 1m11.872s sys 0m4.104s $ time for k in *.JPG; do echo $k; done | xargs -I{} -P2 convert {} -resize 800 {} real 0m41.188s user 1m14.837s sys 0m4.812s 41s vs 77s = 53% decrease in run time. In this case there is insufficient memory bandwidth as well. The Intel BX chipset supports a single channel of PC100 memory for a raw bandwidth of 800MB/s. Image manipulation programs will eat all available memory b/w. On my system, running two such processes allows ~400MB/s to each processor socket, starving the convert program of memory access. To get close to _linear_ scaling in this scenario, one would need something like an 8 core AMD Magny Cours system with quad memory channels, or whatever the Intel platform is with quad channels. One would run with xargs -P2, allowing each process ~12GB/s of memory bandwidth. This should yield a 90-100% decrease in run time.
Running more processes than real cores seems fine, did you try it? Define fine. Fine = system not hogging all resources. Please post the specs of your SUT, both CPU/mem subsystem and OS environment details (what hypervisor and guest). (SUT is IBM speak for System Under Test). I didn't know the meaning of that SUT term... The test was run in a laptop (Toshiba Tecra A7) with an Intel Core Duo T2400 (in brief, 2M Cache, 1.83 GHz, 667 MHz FSB, full specs¹) and 4 GiB of ram (DDR2). VM is Virtualbox (4.0) with Windows XP Pro as host and Debian Squeeze as guest. VM was set up to use the 2 cores and 1.5 GiB of system ram. Disk controller is emulated via ich6. Linux is pretty efficient at scheduling multiple processes among cores in multiprocessor and/or multi-core systems and achieving near linear performance scaling. This is one reason why fork and forget is such a popular method used for parallel programming. All you have to do is fork many children and the kernel takes care of scheduling the processes to run simultaneously. Yep. It handles the processes quite nicely. Are you new to the concept of parallel processing and what CPU process scheduling is? No... I guess this is quite similar to the way most daemons behave when running in background and launching several instances (like amavisd-new does), but I didn't think there was a direct relation between the number of running daemons/processes and the cores available in the CPU. I mean, I thought the kernel would automatically handle all the resources available the best it can, regardless of the number of cores in use. ¹http://ark.intel.com/Product.aspx?id=27235 Greetings, -- Camaleón
Re: need help making shell script use two CPUs/cores
Stan Hoeppner wrote: Camaleón put forth: real 1m44.038s user 2m5.420s sys 1m17.561s It uses 2 convert processes so the files are being run in pairs. And you can even get the job done faster if using -P8: real 1m25.255s user 2m1.792s sys 0m43.563s That's an unexpected result. I would think running #cores*2^x with an increasing x value would start yielding lower total run times within a few multiples of #cores. If you have enough memory (which is critical) then increasing the number of processes above the number of compute units *a little bit* is okay and increases overall throughput. You are processing image data. That is a large amount of disk data and won't ever be completely cached. At some point the process will block on I/O waiting for the disk. Perhaps not often but enough. At that moment the cpu will be idle until the disk block becomes available. When you are running four processes on your two cpu machine that means there will always be another process in the run queue ready to go while waiting for the disk. That allows processing to continue when otherwise it would be waiting for the disk. I believe what you are seeing above is the result of being able to compute during that small block on I/O wait for the disk interval. On the negative side having more processes in the run queue does consume a little more overhead for process scheduling. And launching a lot of processes consumes resources. So it definitely doesn't make sense to launch one process per image. But being above the number of cpus does help a small amount. Another negative is that other tasks then suffer. With excess compute capacity you always have some cpu time for the desktop side of life. Moving windows, rendering web pages, other user tasks, delivering email. Sometimes squeezing that last percentage point out of something can really kill your interactive experience and end up frustrating you more. So as a hint I wouldn't push too hard on it. No need to have a quad core with HT.
Nice :-) My benchmarks show that hyperthreading (fake cpus) actually slows down single thread processes such as image conversions. HT seems like a marketing breakthrough to me. Although having the effective extra registers available may benefit a highly threaded application. I just don't have any performance critical highly threaded applications. I am sure they exist somewhere along with unicorns and other good sources of sparkles. Bob
Re: need help making shell script use two CPUs/cores
Camaleón wrote: No... I guess this is quite similar to the way most of the daemons do when running in background and launch several instances (like amavisd-new does) That is an optimization to help with the latency overhead associated with forking processes. In order to reduce the response time to react to an external event, such as arrival of email or a web page request, many daemons pre-fork copies ahead of time so that they will be ready and waiting. Those processes don't consume cpu time while waiting. They do consume memory and cpu scheduling queue resources. But pre-forked, ready to go, and waiting, they just sit there until there is something to do. But when there is I/O and they have something to do, they can get going on it very quickly since they are already loaded in memory. This reduces response latency. Bob
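The pre-fork pattern Bob describes can be sketched in plain shell (a toy of my own construction, not how amavisd-new or Postfix is actually implemented): a worker is started before any work exists and blocks in read, costing no CPU, until a job arrives on a FIFO:

```shell
#!/bin/sh
# Toy pre-fork sketch: the worker is launched ahead of time and sleeps
# in read() on the FIFO, consuming no CPU until a job shows up.
fifo="./prefork.fifo"
mkfifo "$fifo"
worker() {
    while read -r job; do
        echo "handled: $job"
    done < "$fifo"
}
worker &                            # "pre-forked": idle but ready
echo "resize photo.jpg" > "$fifo"   # work arrives; low startup latency
wait
rm -f "$fifo"
```

Real pre-forking servers keep a pool of such workers alive across many requests, which is where the latency win comes from.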
Re: need help making shell script use two CPUs/cores
Camaleón put forth on 1/11/2011 9:38 AM: I supposed you wouldn't care much about getting a script to run faster with all the available cores occupied if you had a modern (4 years) cpu and plenty of speedy ram, because the routine you wanted to run should not take much time... unless you were going to process thousands of images :-) That's a bit ironic. You're suggesting the solution is to upgrade to a new system with a faster processor and memory. However, all the newer processors have 2, 4, 6, 8, or 12 cores. So upgrading simply for single process throughput would waste all the other cores, which was the exact situation I found myself in. The ironic part is that parallelizing the script to maximize performance on my system will also do the same for the newer chips, but to an even greater degree on those with 4, 6, 8, or 12 cores. Given that convert doesn't eat 100% of a core's time during its run, and the idle time in between one process finishing and xargs starting another, one could probably run 16-18 parallel convert processes on a 12 core Magny Cours with this script before run times stop decreasing. The script works. It cut my run time by over 50%. I'm happy. As I said, this system's processing power is complete overkill 99% of the time. It works beautifully with pretty much everything I've thrown at it, for 8 years now. If I _really_ wanted to maximize the speed of this photo resizing task I'd install Win32 ImageMagick on my 2GHz Athlon XP workstation with its dual channel memory nForce2 mobo, convert them on the workstation, and copy them to the server. However, absolute maximum performance of this task was not, and is not, my goal. My goal was to make use of the second CPU, which was sitting idle in the server, to speed up the task completion. That goal was accomplished. :) Running more processes than real cores seems fine, did you try it? Define fine. Fine = system not hogging all resources.
I had run 4 (2 core machine) and run time was a few seconds faster than 2 processes, 3 seconds IIRC. Running 8 processes pushed the system into swap and run time increased dramatically. Given that 4 processes were only a few seconds faster than two, yet consumed twice as much memory, the best overall number of processes to run on this system is two. I didn't know the meaning of that SUT term... I like using it. It's good shorthand. I wish more people used it, or were familiar with it, so I wouldn't have to define it every time I use it. :) The test was run in a laptop (Toshiba Tecra A7) with an Intel Core Duo T2400 (in brief, 2M Cache, 1.83 GHz, 667 MHz FSB, full specs¹) and 4 GiB of ram (DDR2) VM is Virtualbox (4.0) with Windows XP Pro as host and Debian Squeeze as guest. VM was set up to use the 2 cores and 1.5 GiB of system ram. Disk controller is emulated via ich6. I wonder how much faster convert would run on bare metal on that laptop. Are you new to the concept of parallel processing and what CPU process scheduling is? No... I guess this is quite similar to the way most daemons behave when running in background and launching several instances (like amavisd-new does), but I didn't think there was a direct relation between the number of running daemons/processes and the cores available in the CPU. I mean, I thought the kernel would automatically handle all the resources available the best it can, regardless of the number of cores in use. This is correct. But the kernel can't take a single process and make it run across all cores, maximizing performance. For this, the process must be written to create threads, forks, or children. The kernel will then run each of these on a different processor core. This is why Imagemagick convert needs to be parallelized when batching many photos. If you don't parallelize it, the kernel can't schedule it across all cores. The docs say it will use threads but only with large files.
Apparently 8.2 megapixel JPGs aren't large, as the threading has never kicked in for me. By using xargs for parallelization, we create x number of concurrent processes. The kernel then schedules each one on a different cpu core. -- Stan
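The effect Stan describes, xargs creating N processes and the kernel landing each on its own core, can be seen with a toy stand-in (sleep substitutes for convert here; an illustration of mine, not a command from the thread): two 2-second jobs under -P2 finish in roughly 2 seconds of wall clock, not 4:

```shell
#!/bin/sh
# Toy demo: two 2-second "jobs" run under xargs -P2 overlap, so wall
# clock time is ~2s rather than the ~4s a serial loop would take.
start=$(date +%s)
printf '2\n2\n' | xargs -I{} -P2 sleep {}
end=$(date +%s)
echo "parallel elapsed: $((end - start))s"
```

Swap `-P2` for `-P1` and the elapsed time roughly doubles, which is the whole point of the parallelized script.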
Re: need help making shell script use two CPUs/cores
Bob writes: They do consume memory and cpu scheduling queue resources. Very little, due to shared memory and copy-on-write. -- John Hasler
Re: need help making shell script use two CPUs/cores
Bob writes: Another negative is that other tasks then suffer. That's what group scheduling is for. -- John Hasler
Re: [OT]: Re: need help making shell script use two CPUs/cores
On Tue, 11 Jan 2011 09:18:48 -0600 Stan Hoeppner s...@hardwarefreak.com wrote: Dan Serban put forth on 1/10/2011 7:52 PM: On Mon, 10 Jan 2011 12:04:19 -0600 Stan Hoeppner s...@hardwarefreak.com wrote: [snip] http://www.hardwarefreak.com/server-pics/ Which gallery system are you using? I quite like it. That's the result of Curator: http://furius.ca/curator/ I've been using it for 7+ years. Debian dropped the package sometime back, before Etch IIRC. Last time I installed it I grabbed it from SourceForge. It's a python app so you need python and you'll need the imagemagick tools. It's a nice looking interface, simple is what I like. Unfortunately its functions are written in a manner that psyco can't optimize. It's plenty fast though if you're doing a directory structure with only a couple hundred pic files or less. My server is pretty old, 550MHz, and I've got a couple of dirs with thousands of image files. It takes over 12 hours to process them. It processes all subdirs under a dir. I've found no option to disable this. Thus, be mindful of the way you setup your directory structures. Even if nothing in a subdir has changed since the last run, curator will still process all subdirs. It's pretty fast at doing so, but if you have 100 subdirs with 100 files in each that's 10,000 image files to be looked at, and bumps up the run time. Indeed, I find that simple services always seem to end up eating a lot more resources than originally thought. With any modern 2-3GHz x86 AMD/Intel CPU you prolly don't need to worry about the speed of curator. I've never run it on a modern chip, just my lowly, but uber cool, vintage Abit BP6 dual Celeron 3...@550 server, which is the server in those photos. I have a tendency to hang onto systems as long as they're still useful. At one time it was my workstation/gaming rig. Those dual Celerons are now idle 99%| of the time, and the machine is usually plenty fast for any interactive command line or batch work I need to do. 
I commend your spirit. I have collections of such hardware, but in my incessant need to have more power, and less power usage, half of this stuff gets retired. I wish I could find a good cause to give it to, but the linux/debian zealot in me refuses to just give it away to the dark side :/, if it'll run windows, I want you to give me money for it. Heh. I have a dual proc p3 1ghz motherboard. Pretty much worthless now, though it did a hell of a job running internal email and web/db services. Of note, if you've been reading this thread, you'll notice I use this script and ImageMagick's convert utility to resize my camera photos before running curator on them, since I can now resize them almost twice as fast, running 2 parallel convert processes. I certainly have followed the thread and have learned that xargs allows you to parallel process commands. Something my 20 years of linux adventures haven't taught me until yesterday.
Re: need help making shell script use two CPUs/cores
Bob Proulx put forth on 1/11/2011 3:08 PM: Stan Hoeppner wrote: Camaleón put forth: real 1m44.038s user 2m5.420s sys 1m17.561s It uses 2 convert processes so the files are being run in pairs. And you can even get the job done faster if using -P8: real 1m25.255s user 2m1.792s sys 0m43.563s That's an unexpected result. I would think running #cores*2^x with an increasing x value would start yielding lower total run times within a few multiples of #cores. If you have enough memory (which is critical) then increasing the number of processes above the number of compute units *a little bit* is okay and increases overall throughput. You are processing image data. That is a large amount of disk data and won't ever be completely cached. At some point the process will Not really. Each file, in my case, started as a 1.8MB jpeg. The disk throughput on my server is ~80MB/s. Read latency is about 15-20ms on average. In my recent example workload there were 35 such images. block on I/O waiting for the disk. Perhaps not often but enough. At that moment the cpu will be idle until the disk block becomes available. When you are running four processes on your two cpu machine that means there will always be another process in the run queue ready to go while waiting for the disk. That allows processing to continue when otherwise it would be waiting for the disk. I believe what you are seeing above is the result of being able to compute during that small block on I/O wait for the disk interval. That's gotta be a very small iowait interval. So small, in fact, it doesn't show up in top at all. I've watched top a few times during these runs and I never see iowait. I assumed the gain was simply because, watching top, each convert process doesn't actually fully peg the cpu during the entire process run life. Running one or two more processes in parallel with the first two simply gives the kernel scheduler the opportunity to run another process during those idle ticks.
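Stan's own figures back up the "very small iowait" claim with simple arithmetic: 35 files at 1.8 MB each is 63 MB, and at the quoted ~80 MB/s that is under a second of total read time for the whole batch. A quick check using only numbers from the post:

```shell
# Back-of-envelope from the figures in the post: total bytes over
# disk rate gives the entire batch's read time.
awk 'BEGIN {
    files = 35; mb_each = 1.8; mb_per_s = 80
    printf "total read time: %.2f s\n", files * mb_each / mb_per_s
}'
# prints: total read time: 0.79 s
```

Against a script run time of over a minute, under a second of disk time is indeed too small to register in top.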
There is also the time gap between a process exiting and xargs starting up the next one. I have no idea how much time that takes. But all the little bits add up in the total execution time of all 35 processes. On the negative side having more processes in the run queue does consume a little more overhead for process scheduling. And launching a lot of processes consumes resources. So it definitely doesn't make sense to launch one process per image. But being above the number of cpus does help a small amount. Totally agree. That amount of decreased run time is small enough on my system that I don't bother with 3 processes. I only parallelize 2, as the extra ~80MB of memory consumed by the 3rd is better consumed by smtpd, imapd, httpd than saving me 5-10 seconds of execution time for the batch photo resize. This is a server after all. ;) Another negative is that other tasks then suffer. With excess compute capacity you always have some cpu time for the desktop side of life. Moving windows, rendering web pages, other user tasks, delivering email. Sometimes squeezing that last percentage point out of something can really kill your interactive experience and end up frustrating you more. So as a hint I wouldn't push too hard on it. In my case those other tasks aren't interactive, but they exist nonetheless, as mentioned above. My benchmarks show that hyperthreading (fake cpus) actually slows down single-threaded processes such as image conversions. HT seems like a marketing breakthrough to me. Although having the effective extra registers available may benefit a highly threaded application. I just don't have any performance critical highly threaded applications. I am sure they exist somewhere along with unicorns and other good sources of sparkles. This has been my experience as well. SMT traditionally doesn't work well when you oversubscribe more compute bound processes than a machine has physical cores.
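The effect being discussed here, where a little oversubscription fills in idle ticks but a lot of it just adds scheduling overhead, is easy to measure without ImageMagick at all. A rough sketch (the busy-loop stand-in job and the batch size of 8 are my own assumptions, not anything from the thread; substitute the real convert command to measure the actual workload):

```shell
#!/bin/sh
# Time the same batch of 8 CPU-bound stand-in jobs at several -P levels.
run_batch() {
    procs=$1
    start=$(date +%s)
    # each "job" is a small shell busy loop standing in for convert
    seq 8 | xargs -I{} -P"$procs" sh -c \
        'i=0; while [ "$i" -lt 200000 ]; do i=$((i+1)); done'
    end=$(date +%s)
    echo "P$procs: $((end - start))s"
}
run_batch 1 >  times.txt
run_batch 2 >> times.txt
run_batch 4 >> times.txt
cat times.txt
```

On a 2-core box the P2 line should come in near half the P1 time, while P4 should show little further gain, which is exactly the shape of the numbers reported above.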
This was discovered relatively quickly after Intel's HT CPUs hit the market. Folks began running one s...@home process per virtual CPU on dual socket Xeon boxen, 4 processes total, and their elapsed time per process increased substantially vs running one process per socket. -- Stan -- To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org Archive: http://lists.debian.org/4d2cecef.1020...@hardwarefreak.com
Re: need help making shell script use two CPUs/cores
John Hasler put forth on 1/11/2011 4:12 PM: Bob writes: They do consume memory and cpu scheduling queue resources. Very little, due to shared memory and copy-on-write. In this case I don't think all that much memory is shared. Each process' data portion is different as each processes a different picture file. -- Stan Archive: http://lists.debian.org/4d2cee26.4010...@hardwarefreak.com
Re: need help making shell script use two CPUs/cores
Bob writes: They do consume memory and cpu scheduling queue resources. I wrote: Very little, due to shared memory and copy-on-write. Stan writes: In this case I don't think all that much memory is shared. Each process' data portion is different as each processes a different picture file. I was referring to pre-forking. Pre-forked processes share text and also share data while waiting for work. Thus they consume little in the way of resources until they have something to do. -- John Hasler Archive: http://lists.debian.org/87sjwz53q4@thumper.dhh.gt.org
Re: need help making shell script use two CPUs/cores
Karl Vogel put forth on 1/9/2011 6:04 PM: On Sun, 09 Jan 2011 10:05:43 -0600, Stan Hoeppner s...@hardwarefreak.com said: S #! /bin/sh S for k in $(ls *.JPG); do convert $k -resize 1024 $k; done Someone was ragging on you to let the shell do the file expansion. I like your way better because most scripting shells aren't smart enough to realize that when there aren't any .JPG files, I don't want the script to echo '*.JPG' as if that's actually useful. This doesn't matter to me as I only use this script on a single temp directory after I dump the camera files into it. The camera, a Fujifilm FinePix A820 8.3MP, saves its files in all upper case. S I use the above script to batch re-size digital camera photos after I S dump them to my web server. It takes a very long time with lots of new S photos as the server is fairly old, even though it is a 2-way SMP, S because the script only runs one convert process at a time serially, S only taking advantage of one CPU. First things first: are you absolutely certain that running two parallel jobs will exercise both CPUs? I've seen SMP systems that don't exactly live up to truth-in-advertising. If you stuff two convert jobs in the background and then run top (or the moral equivalent) do you SEE both CPUs being worked? See my response to Bob. And see Bob's response to you. :) The issue you describe was resolved with a few patches many years ago, and only reared its ugly head on processors with SMT (HT) enabled. The kernel scheduler work lagged behind the hardware releases of IBM's SMT and Intel's HT. The chips were on the market a while before regular distro release cycles caught up. So early adopters of SMT chips saw the problem you describe. As Bob noted, in most situations, simply turning SMT off fixed the problem instantly. For those who don't know the acronyms, SMT stands for Simultaneous Multi-threading, which is the textbook term for this technology. 
Intel gave their SMT implementation a catchy marketing name, HyperThreading, as they seem to do with every product, sadly. Second: do you have taskset installed? If the work isn't being divided up the way you like, you can bind a process to a desired core: http://planet.admon.org/how-to-bind-a-certain-process-to-specified-core/ cpusets (see also cpumemsets), the kernel feature that taskset manipulates, is overkill for managing process scheduling on a 2-way box, and wouldn't yield much, if any, benefit. In fact, if I were to attempt using it with my piddly workloads, I'd likely be far less efficient at manually scheduling tasks than the kernel. In fact, I can guarantee you this. :) And last: if you're not using something like LVM, can you do anything to make sure you're not hitting the same disk? If all your new photos are on the same drive, any CPU savings you get from parallel processing will probably be erased by disk contention. Better yet, do you have enough memory to do the processing on a RAM-backed filesystem? Apparently you've never used Imagemagick's convert utility, or any other image manipulation tools, or not on an older ~550MHz machine with tiny L2 cache (by today's standards). Image manipulation programs are always CPU bound, rarely, if ever, IO bound. I'd say never but I'm sure there is a rare corner case out there somewhere. It's odd isn't it, that I have pretty intimate knowledge of the things above, yet am handicapped WRT shell scripting? Nobody knows everything, and I'm sure glad lists such as debian-users exist to fill in the knowledge gaps. :) -- Stan Archive: http://lists.debian.org/4d2ac7b7.3010...@hardwarefreak.com
Re: need help making shell script use two CPUs/cores
On Sun, 09 Jan 2011 14:39:56 -0600, Stan Hoeppner wrote: Camaleón put forth on 1/9/2011 12:12 PM: Better if you check it, but I dunno how to get the compile options for the lenny package... where is this defined, in source or diff packages? You're taking this thread down the wrong path. I asked for assistance writing a simple script to do what I want it to do. Accomplishing that will fix all of my problems WRT Imagemagick. I didn't ask for help in optimizing or fixing the Lenny i386 Imagemagick package. ;) I read it as how to speed up the execution of a batch script that has to deal with resizing big images and usually you get some gains if the program to run was compiled to work with threads in mind. Anyway, how are you going to take any advantage of multi-threading capabilities if the program you are going to run was not compiled with this flag enabled? I think you're missing something. Go back and read my original post. If you still don't understand, maybe refresh yourself on Linux process scheduling. Good. It would be nice to see the results when you finally got it working the way you like ;-) Greetings, -- Camaleón Archive: http://lists.debian.org/pan.2011.01.10.14.08...@gmail.com
Re: need help making shell script use two CPUs/cores
Camaleón put forth on 1/10/2011 8:08 AM: On Sun, 09 Jan 2011 14:39:56 -0600, Stan Hoeppner wrote: Camaleón put forth on 1/9/2011 12:12 PM: Better if you check it, but I dunno how to get the compile options for the lenny package... where is this defined, in source or diff packages? You're taking this thread down the wrong path. I asked for assistance writing a simple script to do what I want it to do. Accomplishing that will fix all of my problems WRT Imagemagick. I didn't ask for help in optimizing or fixing the Lenny i386 Imagemagick package. ;) I read it as how to speed up the execution of a batch script that has to deal with resizing big images and usually you get some gains if the program to run was compiled to work with threads in mind. I said lots of small images, IIRC. Regardless, threading isn't simply turned on with a compile time argument. A program must be written specifically to create master and worker threads. Implementation is somewhat similar to exec and fork, compared to serial programming anyway, though the IPC semantics are different. It's a safe bet that the programs in the Lenny i386 Imagemagick package do have the threading support. The following likely explains why _I_ wasn't seeing the threading. From: http://www.imagemagick.org/Usage/api/#speed For small images using the IM multi-thread capabilities will not give you any advantage, though on a large busy server it could be detrimental. But for large images the OpenMP multi-thread capabilities can produce a definite speed advantage as it uses more CPUs to complete the individual image processing operations. It would be nice to know their definition of small images. Good. It would be nice to see the results when you finally got it working the way you like ;-) Bob's xargs suggestion got it working instantly many hours ago. I'm not sure of the results you refer to. Are you looking for something like watch top output for Cpu0 and Cpu1? See for yourself. 1.
wget all the 35 .JPG files from this URL: http://www.hardwarefreak.com/server-pics/ copy them all to a working temp dir 2. On your dual processor, or dual core system, execute: for k in *.JPG; do echo $k; done | xargs -I{} -P2 convert {} -resize 3072 {} For a quad core system, change -P2 to -P4. You may want to wrap it with the time command. 3. Immediately execute top and watch Cpu0/1/2/3 in the summary area. You'll see pretty linear parallel scaling of the convert processes. Also note memory consumption doubles with each doubling of the process count. Now, to compare the xargs -P parallel process performance to standard serial performance, clear the temp dir and copy the original files over again. Now execute: for k in *.JPG; do convert $k -resize 3072 $k; done and launch top. You'll see only a single convert process running. Again, you can wrap this with the time command if you like to compare total run times. What you'll find is nearly linear scaling as the number of convert processes is doubled, up to the point #processes equals #cores. Running more processes than cores merely eats memory wastefully and increases total processing time. Linux is pretty efficient at scheduling multiple processes among cores in multiprocessor and/or multi-core systems and achieving near linear performance scaling. This is one reason why fork and forget is such a popular method used for parallel programming. All you have to do is fork many children and the kernel takes care of scheduling the processes to run simultaneously. -- Stan Archive: http://lists.debian.org/4d2b4a23.3020...@hardwarefreak.com
Re: need help making shell script use two CPUs/cores
On Mon, 10 Jan 2011 12:04:19 -0600, Stan Hoeppner wrote: Camaleón put forth on 1/10/2011 8:08 AM: Good. It would be nice to see the results when you finally got it working the way you like ;-) Bob's xargs suggestion got it working instantly many hours ago. I'm not sure of the results you refer to. Are you looking for something like watch top output for Cpu0 and Cpu1? See for yourself. Didn't you run any test? Okay... (now downloading the sample images) 2. On your dual processor, or dual core system, execute: for k in *.JPG; do echo $k; done | xargs -I{} -P2 convert {} -resize 3072 {} I used a VM to get the closest environment as you seem to have (a low resource machine) and the above command (timed) gives: real 1m44.038s user 2m5.420s sys 1m17.561s It uses 2 convert processes so the files are being run in pairs. And you can even get the job done faster if using -P8: real 1m25.255s user 2m1.792s sys 0m43.563s No need to have a quad core with HT. Nice :-) Now, to compare the xargs -P parallel process performance to standard serial performance, clear the temp dir and copy the original files over again. Now execute: for k in *.JPG; do convert $k -resize 3072 $k; done This gives: real 2m30.007s user 2m11.908s sys 1m42.634s Which is ~46s of added delay. Not that bad. and launch top. You'll see only a single convert process running. Again, you can wrap this with the time command if you like to compare total run times. What you'll find is nearly linear scaling as the number of convert processes is doubled, up to the point #processes equals #cores. Running more processes than cores merely eats memory wastefully and increases total processing time. Running more processes than real cores seems fine, did you try it? Linux is pretty efficient at scheduling multiple processes among cores in multiprocessor and/or multi-core systems and achieving near linear performance scaling. This is one reason why fork and forget is such a popular method used for parallel programming.
All you have to do is fork many children and the kernel takes care of scheduling the processes to run simultaneously. Yep. It handles the processes quite nicely. Greetings, -- Camaleón Archive: http://lists.debian.org/pan.2011.01.10.20.11...@gmail.com
[OT]: Re: need help making shell script use two CPUs/cores
On Mon, 10 Jan 2011 12:04:19 -0600 Stan Hoeppner s...@hardwarefreak.com wrote: [snip] http://www.hardwarefreak.com/server-pics/ Which gallery system are you using? I quite like it. Archive: http://lists.debian.org/20110110175242.5cb05...@ws82.int.tlc
need help making shell script use two CPUs/cores
I'm not very skilled at writing shell scripts. #! /bin/sh for k in $(ls *.JPG); do convert $k -resize 1024 $k; done I use the above script to batch re-size digital camera photos after I dump them to my web server. It takes a very long time with lots of new photos as the server is fairly old, even though it is a 2-way SMP, because the script only runs one convert process at a time serially, only taking advantage of one CPU. The convert program is part of the imagemagick toolkit. How can I best modify this script so that it splits the overall job in half, running two simultaneous convert processes, one on each CPU? Having such a script should cut the total run time in half, or nearly so, which would really be great. -- Stan Archive: http://lists.debian.org/4d29dcd7.6090...@hardwarefreak.com
Re: need help making shell script use two CPUs/cores
On Sun, Jan 09, 2011 at 10:05:43AM -0600, Stan Hoeppner wrote: #! /bin/sh for k in $(ls *.JPG); do convert $k -resize 1024 $k; done I use the above script to batch re-size digital camera photos after I dump them to my web server. It takes a very long time with lots of new photos as the server is fairly old, even though it is a 2-way SMP, because the script only runs one convert process at a time serially, only taking advantage of one CPU. The convert program is part of the imagemagick toolkit. How can I best modify this script so that it splits the overall job in half, running two simultaneous convert processes, one on each CPU? Having such a script should cut the total run time in half, or nearly so, which would really be great. You need parallel: http://ftp.gnu.org/gnu/parallel/ From their home page (http://freshmeat.net/projects/parallel): GNU parallel is a shell tool for executing jobs in parallel locally or using remote computers. A job is typically a single command or a small script that has to be run for each of the lines in the input. The typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables. If you use xargs today you will find GNU parallel very easy to use, as GNU parallel is written to have the same options as xargs. If you write loops in shell, you will find GNU parallel may be able to replace most of the loops and make them run faster by running several jobs in parallel. If you use ppss or pexec you will find GNU parallel will often make the command easier to read. GNU parallel makes sure output from the commands is the same output as you would get had you run the commands sequentially. This makes it possible to use output from GNU parallel as input for other programs.
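As a concrete sketch of the drop-in claim above (hedged: the `parallel` package must be installed, and the plain serial fallback loop is my own addition, not from the post), the resize job from the thread would look like this. `echo` stands in for `convert` so the generated plan can be inspected before anything is touched:

```shell
#!/bin/sh
# Same job as the xargs -P version, expressed with GNU parallel.
# Falls back to a serial loop when parallel isn't installed.
printf '%s\n' a.JPG b.JPG > files.txt        # stand-in file list
if command -v parallel >/dev/null 2>&1; then
    parallel -j2 echo convert {} -resize 1024 {} < files.txt > plan.txt
else
    while read -r f; do
        echo convert "$f" -resize 1024 "$f"
    done < files.txt > plan.txt
fi
cat plan.txt
```

Remove the `echo` (and feed real filenames) to perform the conversions; `parallel` also accepts the file list directly on the command line via `::: *.JPG` instead of stdin.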
Re: need help making shell script use two CPUs/cores
On Sun, 09 Jan 2011 10:05:43 -0600, Stan Hoeppner wrote: I'm not very skilled at writing shell scripts. #! /bin/sh for k in $(ls *.JPG); do convert $k -resize 1024 $k; done I use the above script to batch re-size digital camera photos after I dump them to my web server. It takes a very long time with lots of new photos as the server is fairly old, even though it is a 2-way SMP, because the script only runs one convert process at a time serially, only taking advantage of one CPU. The convert program is part of the imagemagick toolkit. How can I best modify this script so that it splits the overall job in half, running two simultaneous convert processes, one on each CPU? Having such a script should cut the total run time in half, or nearly so, which would really be great. http://www.imagemagick.org/Usage/api/#speed The above doc provides hints on how to speed-up image magick operations. Note that multi-threading should be automatically used where possible, as per this paragraph: *** # IM by default uses multiple threads for image processing operations. That means you can have the computer do two or more separate threads of image processing, it will be faster than a single CPU machine. *** I'm afraid you will have to find out whether your IM package was compiled with multi-threading capabilities. Greetings, -- Camaleón Archive: http://lists.debian.org/pan.2011.01.09.16.59...@gmail.com
Re: need help making shell script use two CPUs/cores
Camaleón put forth on 1/9/2011 10:59 AM: http://www.imagemagick.org/Usage/api/#speed The above doc provides hints on how to speed-up image magick operations. Note that multi-threading should be automatically used where possible, as per this paragraph: *** # IM by default uses multiple threads for image processing operations. That means you can have the computer do two or more separate threads of image processing, it will be faster than a single CPU machine. *** I'm afraid you will have to find out whether your IM package was compiled with multi-threading capabilities. I'm using the i386 Lenny package. Obviously it wasn't, or it would be working, and it is not. No script ideas Camaleón? You're not a script kiddie? -- Stan Archive: http://lists.debian.org/4d29ed90.3080...@hardwarefreak.com
Re: need help making shell script use two CPUs/cores
On Sun, 09 Jan 2011 11:17:04 -0600, Stan Hoeppner wrote: Camaleón put forth on 1/9/2011 10:59 AM: *** # IM by default uses multiple threads for image processing operations. That means you can have the computer do two or more separate threads of image processing, it will be faster than a single CPU machine. *** I'm afraid you will have to find out whether your IM package was compiled with multi-threading capabilities. I'm using the i386 Lenny package. Obviously it wasn't, or it would be working, and it is not. Better if you check it, but I dunno how to get the compile options for the lenny package... where is this defined, in source or diff packages? No script ideas Camaleón? You're not a script kiddie? He, he.. not at all :-) Anyway, how are you going to take any advantage of multi-threading capabilities if the program you are going to run was not compiled with this flag enabled? Greetings, -- Camaleón Archive: http://lists.debian.org/pan.2011.01.09.18.12...@gmail.com
Re: need help making shell script use two CPUs/cores
Camaleón put forth on 1/9/2011 12:12 PM: Better if you check it, but I dunno how to get the compile options for the lenny package... where is this defined, in source or diff packages? You're taking this thread down the wrong path. I asked for assistance writing a simple script to do what I want it to do. Accomplishing that will fix all of my problems WRT Imagemagick. I didn't ask for help in optimizing or fixing the Lenny i386 Imagemagick package. ;) No script ideas Camaleón? You're not a script kiddie? He, he.. not at all :-) Anyway, how are you going to take any advantage of multi-threading capabilities if the program you are going to run was not compiled with this flag enabled? I think you're missing something. Go back and read my original post. If you still don't understand, maybe refresh yourself on Linux process scheduling. -- Stan Archive: http://lists.debian.org/4d2a1d1c.6070...@hardwarefreak.com
Re: need help making shell script use two CPUs/cores
Stan Hoeppner wrote: I'm not very skilled at writing shell scripts. #! /bin/sh for k in $(ls *.JPG); do convert $k -resize 1024 $k; done First off don't use ls to list files matching a pattern. Instead let the shell match the pattern. #! /bin/sh for k in *.JPG; do convert $k -resize 1024 $k; done I never like to resize in place. Because then if I mess things up I can lose resolution. So I recommend doing it to a named resolution file. Do anything you like but this would be the way I would go.
for k in *.JPG; do
    convert $k -resize 1024 $(basename $k .JPG).1024.jpg
done
And not wanting to do the same work again and again:
for k in *.JPG; do
    base=$(basename $k .JPG)
    test -f $base.1024.jpg && continue   # skip if already done
    convert $k -resize 1024 $base.1024.jpg
done
I use the above script to batch re-size digital camera photos after I dump them to my web server. It takes a very long time with lots of new photos as the server is fairly old, even though it is a 2-way SMP, because the script only runs one convert process at a time serially, only taking advantage of one CPU. The convert program is part of the imagemagick toolkit. How can I best modify this script so that it splits the overall job in half, running two simultaneous convert processes, one on each CPU? Having such a script should cut the total run time in half, or nearly so, which would really be great. GNU xargs has an extension to run jobs in parallel. This is already installed on your system. (But won't work on other Unix systems.) for k in *.JPG; do echo $k; done | xargs -I{} -P4 echo convert {} -resize 1024 {} Verify that does what you want and then remove the echo. unfortunately that simple approach is harder to do with my renaming scheme. So I would probably write a helper script that did the options to convert and renamed the file and so forth.
for k in *.JPG; do
    base=$(basename $k .JPG)
    test -f $base.1024.jpg && continue   # skip if already done
    echo $k
done | xargs -L1 -P4 echo my-convert-helper
And my-convert-helper could take the argument and apply the options in the order needed and so forth. Adjust 4 in the above to be the number of jobs you want to run on your multicore system. Note that in Sid in the latest coreutils there is a new command 'nproc' to print out the number of cores. Or you could get it from grep. grep -c ^processor /proc/cpuinfo All of the above is off the top of my head and needs to be tested. YMMV. But hopefully it will give you some ideas. HTH, Bob
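Bob's two ways of counting cores can be combined into one small guard so that the -P value tracks the machine instead of being hard-coded (a sketch; the final xargs line is left commented out because it assumes ImageMagick is installed):

```shell
#!/bin/sh
# Pick the parallelism level from the machine itself.
if command -v nproc >/dev/null 2>&1; then
    jobs=$(nproc)                              # coreutils >= 8.1
else
    jobs=$(grep -c ^processor /proc/cpuinfo)   # fallback Bob mentions
fi
echo "$jobs" > njobs.txt                       # record the result
echo "would run $jobs convert processes"
# for k in *.JPG; do echo "$k"; done | \
#     xargs -I{} -P"$jobs" convert {} -resize 1024 {}
```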
Re: need help making shell script use two CPUs/cores
On Sun, 09 Jan 2011 10:05:43 -0600, Stan Hoeppner s...@hardwarefreak.com said: S #! /bin/sh S for k in $(ls *.JPG); do convert $k -resize 1024 $k; done Someone was ragging on you to let the shell do the file expansion. I like your way better because most scripting shells aren't smart enough to realize that when there aren't any .JPG files, I don't want the script to echo '*.JPG' as if that's actually useful. S I use the above script to batch re-size digital camera photos after I S dump them to my web server. It takes a very long time with lots of new S photos as the server is fairly old, even though it is a 2-way SMP, S because the script only runs one convert process at a time serially, S only taking advantage of one CPU. First things first: are you absolutely certain that running two parallel jobs will exercise both CPUs? I've seen SMP systems that don't exactly live up to truth-in-advertising. If you stuff two convert jobs in the background and then run top (or the moral equivalent) do you SEE both CPUs being worked? Second: do you have taskset installed? If the work isn't being divided up the way you like, you can bind a process to a desired core: http://planet.admon.org/how-to-bind-a-certain-process-to-specified-core/ And last: if you're not using something like LVM, can you do anything to make sure you're not hitting the same disk? If all your new photos are on the same drive, any CPU savings you get from parallel processing will probably be erased by disk contention. Better yet, do you have enough memory to do the processing on a RAM-backed filesystem? -- Karl Vogel I don't speak for the USAF or my company If you're searching for the cause of a ghastly noise and find out that it's not the cat, leave the area immediately. --how to survive a horror movie Archive: http://lists.debian.org/2011011500.46f09b...@kev.msw.wpafb.af.mil
Re: need help making shell script use two CPUs/cores
Karl Vogel wrote: Stan Hoeppner said: S for k in $(ls *.JPG); do convert $k -resize 1024 $k; done Someone was ragging on you to let the shell do the file expansion. I like your way better because most scripting shells aren't smart enough to realize that when there aren't any .JPG files, I don't want the script to echo '*.JPG' as if that's actually useful. :-) I thought about saying something about .JPG instead of .jpg. Unix is all about lower case after all. But I restrained myself. :-) :-) $ for i in *.doesnotexist; do echo $i; done *.doesnotexist As to your comment about shells passing the glob off when it doesn't match, that is a good comment. That wasn't really too much in my style of coding and so had missed it. Thanks for keeping me honest! Just to push on it a little bit more I could do this: set -- *.doesnotexist; for i in "$@"; do echo $i; done That avoids the problem and still avoids spawning another process. And depending upon what I was doing I might do something completely different. First things first: are you absolutely certain that running two parallel jobs will exercise both CPUs? I've seen SMP systems that don't exactly live up to truth-in-advertising. If you stuff two convert jobs in the background and then run top (or the moral equivalent) do you SEE both CPUs being worked? When was that in terms of kernel versions? Was Intel hyperthreading also involved? Because your description matches very closely running a two core system with Intel hyperthreading on an older Linux kernel. Here is the problem I know about. On a dual cpu system with hyperthreading the Linux kernel saw four cores and numbered them 0, 1, 2, 3. But of course zero and one were on one core and two and three were on the other core. The first process would run on, say, zero. The second process would run on the next core, say, one. To Linux of that day it thought it had allocated those processes onto different cpus.
But of course both were running on the same cpu, each getting half of it and taking twice as long to run, and the other cpu was idle. A big problem. I always disabled Intel hyperthreading to avoid that problem. It was more trouble than it was worth. Also my benchmarks showed that HT would slightly slow down single threaded simulation processes (mostly Spice and other simulations) that we were running. But as far as I know this has now been addressed and the Linux kernel now knows about hyperthreaded cpus. It seems that with recent kernels that the cpu allocation works okay even in the presence of fake cpus through hyperthreading. So I think the problem you describe is now behind us. Bob
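One more belt-and-suspenders variant of the unmatched-glob discussion (my own sketch, not from Bob's mail): an existence test inside the loop filters out the literal '*.JPG' string that an unmatched glob leaves behind, without spawning any extra process:

```shell
#!/bin/sh
# In an empty directory the glob stays literal; test -e filters it out.
mkdir -p glob-demo                    # deliberately contains no .JPG files
( cd glob-demo
  for k in *.JPG; do
      [ -e "$k" ] || continue         # no match -> literal pattern, skip it
      echo "would convert $k"
  done ) > glob-demo/out.txt          # stays empty when nothing matches
```

The same guard works unchanged inside the xargs pipelines elsewhere in the thread, so an empty photo directory no longer produces a bogus `convert '*.JPG'` invocation.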
Re: need help making shell script use two CPUs/cores
On Jan 9, 2011 3:09 PM, Stan Hoeppner s...@hardwarefreak.com wrote: shawn wilson put forth on 1/9/2011 11:43 AM: On Jan 9, 2011 12:17 PM, Stan Hoeppner s...@hardwarefreak.com wrote: Camaleón put forth on 1/9/2011 10:59 AM: http://www.imagemagick.org/Usage/api/#speed The above doc provides hints on how to speed-up image magick operations. Note that multi-threading should be automatically used where possible, as per this paragraph: *** # IM by default uses multiple threads for image processing operations. That means you can have the computer do two or more separate threads of image processing, it will be faster than a single CPU machine. *** I'm afraid you will have to find out whether your IM package was compiled with multi-threading capabilities. I'm using the i386 Lenny package. Obviously it wasn't, or it would be working, and it is not. No script ideas Camaleón? You're not a script kiddie? If parallel does actually have the same args as xargs then you should be able to convert this fairly easily: find -type f -iname '*.jpg' -print0 | xargs -0 -i{} convert {} -resize 1024 {} I don't quite follow this Shawn. Will this command line simply simultaneously launch one convert process for each jpg file in the directory? I.e. if I have 500 photos in the directory will this command line simply fire up 500 simultaneous convert processes? I think your question has been answered. However what that does is find all jpg files with a case insensitive match (iname vs name). The -print0 is pretty specific to xargs (though you could probably just -print and pipe it through and do {} in xargs with the same effect). xargs takes that input and knows about find's -print0 with the -0 switch, and -i{} tells it to use {} as a place holder for what it gets as input. I prefer to use find for searching for files because it is fast, very customizable and has a nice File::Find template builder with find2perl.
Also, since Bob pointed out that xargs has the -P option, you might just use that along with find's searching; I don't think you'll get much better results than that (using -type f with find will speed things up a bit too).
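Putting the two suggestions together (find's NUL-delimited output plus xargs's -P) gives something like the sketch below. The leading echo is kept so the commands are only printed for inspection, and the convert arguments are assumed from earlier in the thread:

```shell
# Preview four-way-parallel convert jobs without running them:
# find emits NUL-delimited filenames, xargs -0 consumes them safely
# (spaces in names survive), and -P4 caps concurrency at four jobs.
find . -type f -iname '*.jpg' -print0 \
    | xargs -0 -P4 -I{} echo convert {} -resize 1024 {}
```

Once the printed commands look right, removing the word "echo" makes it run for real.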
Re: need help making shell script use two CPUs/cores
Bob Proulx put forth on 1/9/2011 3:12 PM:

GNU xargs has an extension to run jobs in parallel. This is already installed on your system. (But won't work on other Unix systems.)

for k in *.JPG; do echo $k; done | xargs -I{} -P4 echo convert {} -resize 1024 {}

Verify that does what you want and then remove the echo.

Thank you Bob. This is EXACTLY what I was looking for. It does exactly what I want, in the precise way I want. And it's such a simple modification of my original script, not requiring a big rewrite. Excellent. :)

I'm using -P2 as the target system is an old dual processor server, two single core CPUs. I made three timed runs against 11 camera photo files, first using -P4, then -P2, then the original script. The two process run was 5 seconds faster and consumed half as much memory as the 4 process run, and the -P2 overall run time was almost exactly half that of the original script. Very excellent indeed.

Thanks again Bob. You rock. :)

-- Stan

-- To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org Archive: http://lists.debian.org/4d2a78d7.5000...@hardwarefreak.com
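One caveat with the loop in Bob's one-liner: `echo $k` word-splits filenames containing whitespace. A whitespace-safe variant using printf's NUL-delimited output together with xargs -0 (an embellishment not in the original mail) might look like:

```shell
# Same idea as the thread's pipeline, but NUL-delimited so filenames
# with spaces survive intact; -P2 matches the dual-CPU target box.
# Drop the "echo" once the previewed commands look right.
printf '%s\0' *.JPG | xargs -0 -P2 -I{} echo convert {} -resize 1024 {}
```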
Re: need help making shell script use two CPUs/cores
Unfortunately that simple approach is harder to do with my renaming scheme. So I would probably write a helper script that did the options to convert, renamed the file, and so forth.

for k in *.JPG; do
  base=$(basename $k .JPG)
  test -f $base.1024.jpg && continue  # skip if already done
  echo $k
done | xargs -L1 -P4 echo my-convert-helper

And my-convert-helper could take the argument and apply the options in the order needed, and so forth.

If you want to use the renaming form of the command (which I also tend to prefer), then I think that using a Makefile makes a lot of sense (and GNU make's -j argument lets you specify parallel behavior).

        Stefan
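Stefan's Makefile idea could be sketched roughly as below. GNU make is assumed for -j and $(wildcard), and the .1024.jpg target suffix is an assumption matching the helper-script fragment above; each resized image becomes its own target, so a re-run skips files already converted:

```shell
# Build a throwaway Makefile in a temp dir and dry-run it (-n) to show
# the convert commands that "make -j4" would run four at a time.
dir=$(mktemp -d)
touch "$dir/a.JPG"
cat > "$dir/Makefile" <<'EOF'
SRC := $(wildcard *.JPG)
OUT := $(SRC:.JPG=.1024.jpg)

all: $(OUT)

# The recipe line below must begin with a TAB character.
%.1024.jpg: %.JPG
	convert $< -resize 1024 $@
EOF
make -C "$dir" -j4 -n
```

Dropping the -n runs the conversions for real, and -j4 gives the same four-way parallelism as xargs -P4.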
Re: need help making shell script use two CPUs/cores
In 2011011500.46f09b...@kev.msw.wpafb.af.mil, Karl Vogel wrote:

On Sun, 09 Jan 2011 10:05:43 -0600, Stan Hoeppner s...@hardwarefreak.com said:

S #! /bin/sh
S for k in $(ls *.JPG); do convert $k -resize 1024 $k; done

Someone was ragging on you to let the shell do the file expansion. I like your way better, because most scripting shells aren't smart enough to realize that when there aren't any .JPG files, I don't want the script to echo '*.JPG' as if that's actually useful.

$(ls *.ext) splits into arguments at each run of shell whitespace in the ls output. *.ext splits into arguments at the end of each filename. If you want to do the right thing, independent of the characters in the filenames and the value of the IFS environment variable, use the latter.

TL;DR: *.ext works when filenames contain spaces; $(ls *.ext) doesn't.

-- Boyd Stephen Smith Jr. b...@iguanasuicide.net http://iguanasuicide.net/
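Boyd's word-splitting point is easy to demonstrate by counting arguments both ways for a filename that contains a space (the temp-dir setup is illustrative, not from the original mail):

```shell
# A file whose name contains a space shows the difference between the
# glob and the $(ls ...) command substitution.
dir=$(mktemp -d)
cd "$dir"
touch 'two words.ext' plain.ext

set -- *.ext                 # glob: one argument per file
echo "glob gives $# arguments"      # prints 2

set -- $(ls *.ext)           # ls output is re-split on whitespace
echo "\$(ls) gives $# arguments"    # prints 3: 'two words.ext' broke apart
```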