Re: problem with parallel foreach
On Thursday, 14 May 2015 at 17:12:07 UTC, John Colvin wrote: Would it be OK if I showed some parts of this code as examples in my DConf talk in 2 weeks? Sure!!!
Re: problem with parallel foreach
On Thursday, 14 May 2015 at 10:46:53 UTC, Gerald Jansen wrote:

John Colvin's improvements to my D program seem to have resolved the problem (http://forum.dlang.org/post/ydgmzhlspvvvrbeem...@forum.dlang.org and http://dpaste.dzfl.pl/114d5a6086b7). I have rerun my tests and now the picture is a bit different (see tables below). In the middle table I have used gnu parallel in combination with a slightly modified version of the D program which runs a single trait (specified in argv[1]). This approach runs the jobs as completely isolated processes, but at the extra cost of re-reading the common data for each trait. The elapsed time is very similar with the parallel foreach in the D program or using gnu parallel (for this particular program and these data run on this server...). I'm guessing the program is now essentially limited by disk I/O, so this is about as good as it gets. So, just to wrap up:
- there is a nice speed improvement over the Python program :-)
- one needs to learn a fair bit to fully benefit from D's potential
- thanks for all the help!

Gerald Jansen

Jobs __ time for D parallel foreach w. JC mods __
 1    4.71user   0.56system  0:05.28elapsed   99%CPU
 2    6.59user   0.96system  0:05.48elapsed  137%CPU
 4   11.45user   1.94system  0:07.24elapsed  184%CPU
 8   20.30user   5.18system  0:13.16elapsed  193%CPU
16   68.48user  13.87system  0:27.21elapsed  302%CPU
27   99.66user  18.73system  0:42.34elapsed  279%CPU

Jobs __ gnu parallel + D program for single job __
 1    4.71user   0.56system  0:05.28elapsed   99%CPU  (as above)
 2    9.66user   1.28system  0:05.76elapsed  189%CPU
 4   18.86user   3.85system  0:08.15elapsed  278%CPU
 8   40.76user   7.53system  0:14.69elapsed  328%CPU
16  135.76user  20.68system  0:31.06elapsed  503%CPU
27  189.43user  28.26system  0:47.75elapsed  455%CPU

Jobs __ time for python version __
 1    45.39user   1.52system  0:46.88elapsed   100%CPU
 2    77.76user   2.42system  0:47.16elapsed   170%CPU
 4   141.28user   4.37system  0:48.77elapsed   298%CPU
 8   280.45user   8.80system  0:56.00elapsed   516%CPU
16   926.05user  20.48system  1:31.36elapsed  1036%CPU
27  1329.09user  27.18system  2:11.79elapsed  1029%CPU

Would it be OK if I showed some parts of this code as examples in my DConf talk in 2 weeks?
Re: problem with parallel foreach
John Colvin's improvements to my D program seem to have resolved the problem (http://forum.dlang.org/post/ydgmzhlspvvvrbeem...@forum.dlang.org and http://dpaste.dzfl.pl/114d5a6086b7). I have rerun my tests and now the picture is a bit different (see tables below). In the middle table I have used gnu parallel in combination with a slightly modified version of the D program which runs a single trait (specified in argv[1]). This approach runs the jobs as completely isolated processes, but at the extra cost of re-reading the common data for each trait. The elapsed time is very similar with the parallel foreach in the D program or using gnu parallel (for this particular program and these data run on this server...). I'm guessing the program is now essentially limited by disk I/O, so this is about as good as it gets. So, just to wrap up:
- there is a nice speed improvement over the Python program :-)
- one needs to learn a fair bit to fully benefit from D's potential
- thanks for all the help!

Gerald Jansen

Jobs __ time for D parallel foreach w. JC mods __
 1    4.71user   0.56system  0:05.28elapsed   99%CPU
 2    6.59user   0.96system  0:05.48elapsed  137%CPU
 4   11.45user   1.94system  0:07.24elapsed  184%CPU
 8   20.30user   5.18system  0:13.16elapsed  193%CPU
16   68.48user  13.87system  0:27.21elapsed  302%CPU
27   99.66user  18.73system  0:42.34elapsed  279%CPU

Jobs __ gnu parallel + D program for single job __
 1    4.71user   0.56system  0:05.28elapsed   99%CPU  (as above)
 2    9.66user   1.28system  0:05.76elapsed  189%CPU
 4   18.86user   3.85system  0:08.15elapsed  278%CPU
 8   40.76user   7.53system  0:14.69elapsed  328%CPU
16  135.76user  20.68system  0:31.06elapsed  503%CPU
27  189.43user  28.26system  0:47.75elapsed  455%CPU

Jobs __ time for python version __
 1    45.39user   1.52system  0:46.88elapsed   100%CPU
 2    77.76user   2.42system  0:47.16elapsed   170%CPU
 4   141.28user   4.37system  0:48.77elapsed   298%CPU
 8   280.45user   8.80system  0:56.00elapsed   516%CPU
16   926.05user  20.48system  1:31.36elapsed  1036%CPU
27  1329.09user  27.18system  2:11.79elapsed  1029%CPU
Re: problem with parallel foreach
On Wednesday, 13 May 2015 at 03:19:17 UTC, thedeemon wrote: In case of Python's parallel.Pool() separate processes do the work without any synchronization issues. In case of D's std.parallelism it's just threads inside one process and they do fight for some locks, thus this result. Okay, so to do something equivalent I would need to use std.process. My next question is how to pass the common data to the sub-processes. In the Python approach I guess this is automatically looked after by pickling serialization. Is there something similar in D? Alternatively, would the use of std.mmfile to temporarily store the common data be a reasonable approach?
Re: problem with parallel foreach
On 13/05/2015 2:59 a.m., Gerald Jansen wrote:

I am a data analyst trying to learn enough D to decide whether to use D for a new project rather than Python + Fortran. I have recoded a non-trivial Python program to do some simple parallel data processing (using the map function in Python's multiprocessing module and parallel foreach in D). I was very happy that my D version ran considerably faster than the Python version when running a single job, but was soon dismayed to find that the performance of my D version deteriorates rapidly beyond a handful of jobs, whereas the time for the Python version increases linearly with the number of jobs per cpu core. The server has 4 quad-core Xeons and abundant memory compared to my needs for this task, even though there are several million records in each dataset. The basic structure of the D program is:

import std.parallelism; // and other modules

void main() {
    // ...
    // read common data and store in arrays
    // ...
    foreach (job; parallel(jobs, 1)) {
        runJob(job, arr1, arr2.dup);
    }
}

void runJob(string job, in int[] arr1, int[] arr2) {
    // read file of job-specific data and modify the arr2 copy
    // write job-specific output data file
}

The output of /usr/bin/time is as follows:

Lang  Jobs     User   System  Elapsed   %CPU
Py       1    45.17     1.44  0:46.65     99
D        1     8.44     1.17  0:09.24    104
Py       2    79.24     2.16  0:48.90    166
D        2    19.41    10.14  0:17.96    164
Py      30  1255.17    58.38  2:39.54    823  * Pool(12)
D       30   421.61  4565.97  6:33.73   1241

(Note that the Python program was somewhat optimized with numpy vectorization and a bit of numba jit compilation.) The system time varies widely between repetitions for D with multiple jobs (eg. from 3.8 to 21.5 seconds for 2 jobs). Clearly my simple approach with parallel foreach has some problem(s). Any suggestions?

Gerald Jansen

I managed to rewrite most of the core IO part to use ranges. However, runTrait I could not rewrite just by reading it. To do so, you would focus on the line being read instead of the trait. There are many lines of input data. Very few traits. There are a couple more places where IO is being performed in runTrait. Again, turn them into ranges like I've done. But more importantly, split up the functions that perform the logic in runTrait as much as possible. Small pieces of code can be optimized more easily than large blobs. Anyway, that is my attempt at getting it faster.

/**
Assign unknown parent groups (UPG) - trait specific version
UPG defined in 5-year intervals for sires and dams of animals with a
domestic or foreign animal ID (eg. IT vs XX).
*/
// History:
// 2015.05.12 GJ - original version in D

/*import std.stdio, std.path, std.conv, std.string, std.datetime;
import std.algorithm, std.parallelism;
import std.math: ceil;
import core.memory : GC;*/

struct ControlVars // TODO: read this from a control file
{
    import std.stdio : File;

    File log;
    string osmdata = ".";
    int ped_cycles = 2;
    int ped_cc_start = 3;
    string ped_country = "IT";
    int ped_cutoff = 1980;
    int ped_upg_size = 50;
    bool dereg_bulls_only = false;
}

//--

void main() {
    /** pedupg main - store pednum.csv and process traits in parallel; */
    import std.stdio : stdout, File;
    import std.path : buildPath;
    import std.parallelism : parallel;
    import core.memory : GC;

    ControlVars g = ControlVars(/*stdout*/ File("out.txt", "w"));

    GC.disable;
    GC.reserve(1024 * 1024 * 1024);

    datedMessage("pedupg started ...", g.log);
    auto suffix = "";
    auto logfile = buildPath(g.osmdata, "log", "pedupg" ~ suffix ~ ".log");
    int nPed = 1 + getPednumMax(g.osmdata);
    g.log.writefln("%8d nPed (i.e. pednum_max + 1)", nPed);

    // read and store pedigree info
    FileReader fileReader = FileReader(buildPath(g.osmdata, "wrk", "pednum.csv"), g);
    // TODO: re-add
    //log.writefln("%8d records read from %s", j, pednum);

    auto traits = getRunTraits(g);

    foreach (fl; fileReader.parallel) {
        import std.stdio : writeln;
        import std.conv : text;
        g.log.write(text(fl) ~ "\n");
    }

    datedMessage("pedupg all done.", g.log);
}

struct FileLine {
    int sire;
    int dam;
    short byear;
    bool ccode;
}

struct FileReader {
    private {
        import std.stdio : File;
        import std.traits : ReturnType;

        ControlVars g;
        FileLine last;
        bool done;

        ReturnType!(File.byLine!()) fileReader;
    }

    this(string filename, ControlVars g) {
        this.g = g;
        fileReader = File(filename, "r").byLine
Re: problem with parallel foreach
On Tuesday, 12 May 2015 at 18:14:56 UTC, Gerald Jansen wrote: On Tuesday, 12 May 2015 at 16:35:23 UTC, Rikki Cattermole wrote: On 13/05/2015 4:20 a.m., Gerald Jansen wrote: At the risk of great embarassment ... here's my program: http://dekoppel.eu/tmp/pedupg.d Would it be possible to give us some example data? I might give it a go to try rewriting it tomorrow. http://dekoppel.eu/tmp/pedupgLarge.tar.gz (89 Mb) Contains two largish datasets in a directory structure expected by the program. I only see 2 traits in that example, so it's hard for anyone to explore your scaling problem, seeing as there are a maximum of 2 tasks.
Re: problem with parallel foreach
On Wednesday, 13 May 2015 at 09:01:05 UTC, Gerald Jansen wrote: On Wednesday, 13 May 2015 at 03:19:17 UTC, thedeemon wrote: In case of Python's parallel.Pool() separate processes do the work without any synchronization issues. In case of D's std.parallelism it's just threads inside one process and they do fight for some locks, thus this result. Okay, so to do something equivalent I would need to use std.process. My next question is how to pass the common data to the sub-processes. In the Python approach I guess this is automatically looked after by pickling serialization. Is there something similar in D? Alternatively, would the use of std.mmfile to temporarily store the common data be a reasonable approach?

Assuming you're on a POSIX-compliant platform, you would just take advantage of fork()'s shared memory model and pipes - i.e., read the data, then fork in a loop to process it, then use pipes to communicate. It ran about 3x faster for me by doing this, and obviously scales with the workloads you have (the provided data only seems to have 2). If you could provide a larger dataset and the python implementation, that would be great. I'm actually surprised and disappointed that there isn't a fork()-backend to std.process OR std.parallel. You have to use stdc
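Below is a minimal sketch of the fork-and-pipes approach described above (not the actual code from this thread; the trait names, the `common` array and the per-job "work" are placeholders): the common data is read once in the parent, each forked child sees it copy-on-write, and a pipe carries each child's result back. Mixing fork() with a multi-threaded D runtime needs care, so this sketch forks before any threads are started.

---
import core.sys.posix.unistd : fork, pipe, close, read, write, _exit;
import core.sys.posix.sys.wait : waitpid;
import std.stdio : writeln;

void main()
{
    int[] common = [1, 2, 3, 4];        // stands in for the shared arrays
    string[] jobs = ["mil", "fat"];     // hypothetical trait names

    int[2][] fds;
    fds.length = jobs.length;           // one pipe per job

    foreach (i, job; jobs)
    {
        pipe(fds[i]);
        if (fork() == 0)                // child: sees `common` via copy-on-write
        {
            close(fds[i][0]);
            long result = cast(long)(common.length * (i + 1));  // pretend work
            write(fds[i][1], &result, result.sizeof);
            _exit(0);
        }
        close(fds[i][1]);               // parent keeps only the read end
    }

    foreach (i, job; jobs)              // collect one result per job
    {
        long result;
        read(fds[i][0], &result, result.sizeof);
        close(fds[i][0]);
        writeln(job, ": ", result);
    }

    int status;
    while (waitpid(-1, &status, 0) > 0) {}  // reap the children
}
---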
Re: problem with parallel foreach
On Wednesday, 13 May 2015 at 11:33:55 UTC, John Colvin wrote: On Tuesday, 12 May 2015 at 18:14:56 UTC, Gerald Jansen wrote: On Tuesday, 12 May 2015 at 16:35:23 UTC, Rikki Cattermole wrote: On 13/05/2015 4:20 a.m., Gerald Jansen wrote: At the risk of great embarrassment ... here's my program: http://dekoppel.eu/tmp/pedupg.d Would it be possible to give us some example data? I might give it a go to try rewriting it tomorrow. http://dekoppel.eu/tmp/pedupgLarge.tar.gz (89 Mb) Contains two largish datasets in a directory structure expected by the program. I only see 2 traits in that example, so it's hard for anyone to explore your scaling problem, seeing as there are a maximum of 2 tasks. Either way, a few small changes were enough to cut the runtime by a factor of ~6 in the single-threaded case and improve the scaling a bit, although the printing to output files still looks like a bit of a bottleneck. http://dpaste.dzfl.pl/80cd36fd6796 The key thing was reducing the number of allocations (more std.algorithm.splitter copying to static arrays, less std.array.split) and avoiding File.byLine. Other people in this thread have mentioned alternatives to it that may be faster/have lower memory usage; I just read the whole files into memory and then lazily split them with std.algorithm.splitter. I ended up with some blank lines coming through, so I added if(line.empty) continue; in a few places; you might want to look more carefully at that, it could be my mistake. The use of std.array.appender for `info` is just good practice, but it doesn't make much difference here.
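For readers who don't want to open the dpaste link, here is a rough sketch of the pattern described above (a simplified illustration, not the actual dpaste code; the column count and the comma separator are assumptions): read the whole file once, split it lazily, skip blank lines, and put each line's fields into a fixed-size static array instead of allocating a new array per line.

---
import std.algorithm : splitter;
import std.conv : to;
import std.file : readText;

void processFile(string path)
{
    auto text = readText(path);              // one allocation for the whole file
    foreach (line; text.splitter('\n'))
    {
        if (line.length == 0)
            continue;                        // the trailing newline yields a blank line

        string[8] fields;                    // assumes at most 8 columns per line
        size_t n = 0;
        foreach (field; line.splitter(','))  // assumes comma-separated columns
        {
            if (n == fields.length) break;
            fields[n++] = field;
        }

        auto id = fields[0].to!int;          // parse what is needed, no per-line split() array
        // ... use the remaining fields ...
    }
}
---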
Re: problem with parallel foreach
On Wednesday, 13 May 2015 at 11:33:55 UTC, John Colvin wrote: On Tuesday, 12 May 2015 at 18:14:56 UTC, Gerald Jansen wrote: On Tuesday, 12 May 2015 at 16:35:23 UTC, Rikki Cattermole wrote: On 13/05/2015 4:20 a.m., Gerald Jansen wrote: At the risk of great embarassment ... here's my program: http://dekoppel.eu/tmp/pedupg.d Would it be possible to give us some example data? I might give it a go to try rewriting it tomorrow. http://dekoppel.eu/tmp/pedupgLarge.tar.gz (89 Mb) Contains two largish datasets in a directory structure expected by the program. I only see 2 traits in that example, so it's hard for anyone to explore your scaling problem, seeing as there are a maximum of 2 tasks. The problem is already evident with 2 traits: the Elapsed time is about doubled for the D version whereas it is practically unchanged for the Python version. But just for fun here are 4 traits: http://dekoppel.eu/tmp/pedupgLarge.tar.gz (109 Mb) If you need even more traits, you can just make copies of the wrk/mil directory, make empty directories with the same name in (eg. log/mi4) and add the names on the first line of file wrk/run_traits. To run a single trait, just remove all names except mil from that file.
Re: problem with parallel foreach
On Wednesday, 13 May 2015 at 14:28:52 UTC, Gerald Jansen wrote: On Wednesday, 13 May 2015 at 13:40:33 UTC, John Colvin wrote: On Wednesday, 13 May 2015 at 11:33:55 UTC, John Colvin wrote: On Tuesday, 12 May 2015 at 18:14:56 UTC, Gerald Jansen wrote: On Tuesday, 12 May 2015 at 16:35:23 UTC, Rikki Cattermole wrote: On 13/05/2015 4:20 a.m., Gerald Jansen wrote: At the risk of great embarassment ... here's my program: http://dekoppel.eu/tmp/pedupg.d Would it be possible to give us some example data? I might give it a go to try rewriting it tomorrow. http://dekoppel.eu/tmp/pedupgLarge.tar.gz (89 Mb) Contains two largish datasets in a directory structure expected by the program. I only see 2 traits in that example, so it's hard for anyone to explore your scaling problem, seeing as there are a maximum of 2 tasks. Either way, a few small changes were enough to cut the runtime by a factor of ~6 in the single-threaded case and improve the scaling a bit, although the printing to output files still looks like a bit of a bottleneck. http://dpaste.dzfl.pl/80cd36fd6796 The key thing was reducing the number of allocations (more std.algorithm.splitter copying to static arrays, less std.array.split) and avoiding File.byLine. Other people in this thread have mentioned alternatives to it that may be faster/have lower memory usage, I just read the whole files in to memory and then lazily split them with std.algorithm.splitter. I ended up with some blank lines coming through, so i added if(line.empty) continue; in a few places, you might want to look more carefully at that, it could be my mistake. The use of std.array.appender for `info` is just good practice, but it doesn't make much difference here. Wow, I'm impressed with the effort you guys (John, Rikki, others) are making to teach me some efficiency tricks. I guess this is one of the strengths of D: its community. I'm studying your various contributions closely! The empty line comes from the very last line on the files, which also end with a newline (as per normal practice?). Yup, that would be it. I added a bit of buffered writing and it actually seems to scale quite well for me now. http://dpaste.dzfl.pl/710afe8b6df5
Re: problem with parallel foreach
On Wednesday, 13 May 2015 at 14:11:25 UTC, Gerald Jansen wrote: On Wednesday, 13 May 2015 at 11:33:55 UTC, John Colvin wrote: On Tuesday, 12 May 2015 at 18:14:56 UTC, Gerald Jansen wrote: On Tuesday, 12 May 2015 at 16:35:23 UTC, Rikki Cattermole wrote: On 13/05/2015 4:20 a.m., Gerald Jansen wrote: At the risk of great embarassment ... here's my program: http://dekoppel.eu/tmp/pedupg.d Would it be possible to give us some example data? I might give it a go to try rewriting it tomorrow. http://dekoppel.eu/tmp/pedupgLarge.tar.gz (89 Mb) Contains two largish datasets in a directory structure expected by the program. I only see 2 traits in that example, so it's hard for anyone to explore your scaling problem, seeing as there are a maximum of 2 tasks. The problem is already evident with 2 traits: the Elapsed time is about doubled for the D version whereas it is practically unchanged for the Python version. But just for fun here are 4 traits: http://dekoppel.eu/tmp/pedupgLarge.tar.gz (109 Mb) If you need even more traits, you can just make copies of the wrk/mil directory, make empty directories with the same name in (eg. log/mi4) and add the names on the first line of file wrk/run_traits. To run a single trait, just remove all names except mil from that file. http://dekoppel.eu/tmp/pedupgLarge4.tar.gz (109 Mb)
Re: problem with parallel foreach
On Wednesday, 13 May 2015 at 13:40:33 UTC, John Colvin wrote: On Wednesday, 13 May 2015 at 11:33:55 UTC, John Colvin wrote: On Tuesday, 12 May 2015 at 18:14:56 UTC, Gerald Jansen wrote: On Tuesday, 12 May 2015 at 16:35:23 UTC, Rikki Cattermole wrote: On 13/05/2015 4:20 a.m., Gerald Jansen wrote: At the risk of great embarassment ... here's my program: http://dekoppel.eu/tmp/pedupg.d Would it be possible to give us some example data? I might give it a go to try rewriting it tomorrow. http://dekoppel.eu/tmp/pedupgLarge.tar.gz (89 Mb) Contains two largish datasets in a directory structure expected by the program. I only see 2 traits in that example, so it's hard for anyone to explore your scaling problem, seeing as there are a maximum of 2 tasks. Either way, a few small changes were enough to cut the runtime by a factor of ~6 in the single-threaded case and improve the scaling a bit, although the printing to output files still looks like a bit of a bottleneck. http://dpaste.dzfl.pl/80cd36fd6796 The key thing was reducing the number of allocations (more std.algorithm.splitter copying to static arrays, less std.array.split) and avoiding File.byLine. Other people in this thread have mentioned alternatives to it that may be faster/have lower memory usage, I just read the whole files in to memory and then lazily split them with std.algorithm.splitter. I ended up with some blank lines coming through, so i added if(line.empty) continue; in a few places, you might want to look more carefully at that, it could be my mistake. The use of std.array.appender for `info` is just good practice, but it doesn't make much difference here. Wow, I'm impressed with the effort you guys (John, Rikki, others) are making to teach me some efficiency tricks. I guess this is one of the strengths of D: its community. I'm studying your various contributions closely! The empty line comes from the very last line on the files, which also end with a newline (as per normal practice?).
Re: problem with parallel foreach
On Wednesday, 13 May 2015 at 12:16:19 UTC, weaselcat wrote: On Wednesday, 13 May 2015 at 09:01:05 UTC, Gerald Jansen wrote: On Wednesday, 13 May 2015 at 03:19:17 UTC, thedeemon wrote: In case of Python's parallel.Pool() separate processes do the work without any synchronization issues. In case of D's std.parallelism it's just threads inside one process and they do fight for some locks, thus this result. Okay, so to do something equivalent I would need to use std.process. My next question is how to pass the common data to the sub-processes. In the Python approach I guess this is automatically looked after by pickling serialization. Is there something similar in D? Alternatively, would the use of std.mmfile to temporarily store the common data be a reasonable approach? Assuming you're on a POSIX-compliant platform, you would just take advantage of fork()'s shared memory model and pipes - i.e., read the data, then fork in a loop to process it, then use pipes to communicate. It ran about 3x faster for me by doing this, and obviously scales with the workloads you have (the provided data only seems to have 2). If you could provide a larger dataset and the python implementation, that would be great. I'm actually surprised and disappointed that there isn't a fork()-backend to std.process OR std.parallel. You have to use stdc

Okay, more studying... The python implementation is part of a larger package so it would be a fair bit of work to provide a working version. Anyway, the salient bits are like this:

from parallel import Pool

def run_job(args):
    (job, arr1, arr2) = args
    # ... do the work for each dataset

def main():
    # ... read common data and store in numpy arrays ...
    pool = Pool()
    pool.map(run_job, [(job, arr1, arr2) for job in jobs])
Re: problem with parallel foreach
On Wednesday, 13 May 2015 at 14:43:50 UTC, John Colvin wrote: On Wednesday, 13 May 2015 at 14:28:52 UTC, Gerald Jansen wrote: On Wednesday, 13 May 2015 at 13:40:33 UTC, John Colvin wrote: On Wednesday, 13 May 2015 at 11:33:55 UTC, John Colvin wrote: On Tuesday, 12 May 2015 at 18:14:56 UTC, Gerald Jansen wrote: On Tuesday, 12 May 2015 at 16:35:23 UTC, Rikki Cattermole wrote: On 13/05/2015 4:20 a.m., Gerald Jansen wrote: At the risk of great embarassment ... here's my program: http://dekoppel.eu/tmp/pedupg.d Would it be possible to give us some example data? I might give it a go to try rewriting it tomorrow. http://dekoppel.eu/tmp/pedupgLarge.tar.gz (89 Mb) Contains two largish datasets in a directory structure expected by the program. I only see 2 traits in that example, so it's hard for anyone to explore your scaling problem, seeing as there are a maximum of 2 tasks. Either way, a few small changes were enough to cut the runtime by a factor of ~6 in the single-threaded case and improve the scaling a bit, although the printing to output files still looks like a bit of a bottleneck. http://dpaste.dzfl.pl/80cd36fd6796 The key thing was reducing the number of allocations (more std.algorithm.splitter copying to static arrays, less std.array.split) and avoiding File.byLine. Other people in this thread have mentioned alternatives to it that may be faster/have lower memory usage, I just read the whole files in to memory and then lazily split them with std.algorithm.splitter. I ended up with some blank lines coming through, so i added if(line.empty) continue; in a few places, you might want to look more carefully at that, it could be my mistake. The use of std.array.appender for `info` is just good practice, but it doesn't make much difference here. Wow, I'm impressed with the effort you guys (John, Rikki, others) are making to teach me some efficiency tricks. I guess this is one of the strengths of D: its community. I'm studying your various contributions closely! The empty line comes from the very last line on the files, which also end with a newline (as per normal practice?). Yup, that would be it. I added a bit of buffered writing and it actually seems to scale quite well for me now. http://dpaste.dzfl.pl/710afe8b6df5 Fixed the file reading spare '\n' problem, added some comments. http://dpaste.dzfl.pl/114d5a6086b7
Re: problem with parallel foreach
On 05/12/2015 08:19 PM, thedeemon wrote: In case of Python's parallel.Pool() separate processes do the work without any synchronization issues. In case of D's std.parallelism it's just threads inside one process and they do fight for some locks, thus this result. Right. To do the same in D, one must use fibers. Here is a draft of my understanding of them: http://ddili.org/ders/d.en/fibers.html As noted there as well, for highest performance with fibers, one likely needs to use an M:N threading model. Ali
Re: problem with parallel foreach
On Wednesday, 13 May 2015 at 06:59:02 UTC, Ali Çehreli wrote: In case of Python's parallel.Pool() separate processes do the work without any synchronization issues. In case of D's std.parallelism it's just threads inside one process and they do fight for some locks, thus this result. Right. To do the same in D, one must use fibers. No, to do the same one must use separate OS processes. Fibers won't help you against parallel threads fighting for GC IO locks.
Re: problem with parallel foreach
On Tuesday, 12 May 2015 at 16:46:42 UTC, thedeemon wrote: On Tuesday, 12 May 2015 at 14:59:38 UTC, Gerald Jansen wrote: The output of /usr/bin/time is as follows:

Lang  Jobs     User   System  Elapsed   %CPU
Py       2    79.24     2.16  0:48.90    166
D        2    19.41    10.14  0:17.96    164
Py      30  1255.17    58.38  2:39.54    823  * Pool(12)
D       30   421.61  4565.97  6:33.73   1241

The fact that most of the time is spent in the System department is quite important. I suspect there are too many system calls from line-wise reading and writing the files. How many lines are read and written there?

About 3.5 million lines read by main(), 0.5 to 2 million lines read and 3.5 million lines written by runTraits (aka runJob). I have smaller datasets that I test on my laptop with a single quad-core i7 which sometimes show little increase in System time and other times have a marked increase, but not nearly as exaggerated as in the large datasets on the server.

Gerald
Re: problem with parallel foreach
At the risk of great embarrassment ... here's my program: http://dekoppel.eu/tmp/pedupg.d

As per Rikki's first suggestion (thanks) I added

import core.memory : GC;

void main() {
    GC.disable;
    GC.reserve(1024 * 1024 * 1024);
    ...
}

to no avail. Thanks for all the help so far.

Gerald

ps. I am using GDC 4.9.2 and don't have DMD on that server
Re: problem with parallel foreach
On Tuesday, 12 May 2015 at 14:59:38 UTC, Gerald Jansen wrote: The output of /usr/bin/time is as follows:

Lang  Jobs     User   System  Elapsed   %CPU
Py       2    79.24     2.16  0:48.90    166
D        2    19.41    10.14  0:17.96    164
Py      30  1255.17    58.38  2:39.54    823  * Pool(12)
D       30   421.61  4565.97  6:33.73   1241

The fact that most of the time is spent in the System department is quite important. I suspect there are too many system calls from line-wise reading and writing the files. How many lines are read and written there?
Re: problem with parallel foreach
On 05/12/2015 08:35 AM, Gerald Jansen wrote: I could put it somewhere if that would help. Please do so. We all want to learn to avoid such issues. Thank you, Ali
Re: problem with parallel foreach
On Tuesday, 12 May 2015 at 16:35:23 UTC, Rikki Cattermole wrote: On 13/05/2015 4:20 a.m., Gerald Jansen wrote: At the risk of great embarassment ... here's my program: http://dekoppel.eu/tmp/pedupg.d Would it be possible to give us some example data? I might give it a go to try rewriting it tomorrow. http://dekoppel.eu/tmp/pedupgLarge.tar.gz (89 Mb) Contains two largish datasets in a directory structure expected by the program.
Re: problem with parallel foreach
On 13/05/2015 4:20 a.m., Gerald Jansen wrote: At the risk of great embarrassment ... here's my program: http://dekoppel.eu/tmp/pedupg.d

As per Rikki's first suggestion (thanks) I added

import core.memory : GC;

void main() {
    GC.disable;
    GC.reserve(1024 * 1024 * 1024);
    ...
}

to no avail. Thanks for all the help so far. Gerald

ps. I am using GDC 4.9.2 and don't have DMD on that server

Well, at least we now know that it isn't caused by memory allocation/deallocation!

Would it be possible to give us some example data? I might give it a go to try rewriting it tomorrow.
Re: problem with parallel foreach
On Tuesday, 12 May 2015 at 18:14:56 UTC, Gerald Jansen wrote: On Tuesday, 12 May 2015 at 16:35:23 UTC, Rikki Cattermole wrote: On 13/05/2015 4:20 a.m., Gerald Jansen wrote: At the risk of great embarassment ... here's my program: http://dekoppel.eu/tmp/pedupg.d Would it be possible to give us some example data? I might give it a go to try rewriting it tomorrow. http://dekoppel.eu/tmp/pedupgLarge.tar.gz (89 Mb) Contains two largish datasets in a directory structure expected by the program. I haven't had time to read code closely. But if you disable the logging does that change things? If so, how about having the logging done asynchronously in another thread? And are you using optimization on gdc ?
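If the logging does turn out to matter, here is a sketch of the "log asynchronously from another thread" idea using std.concurrency (hypothetical code, not part of the original program): workers hand finished log lines to a single logger thread, so file writes never serialize the parallel loop.

---
import std.concurrency : receive, send, spawn;
import std.stdio : File;

void loggerThread(string path)
{
    auto log = File(path, "w");
    bool done = false;
    while (!done)
    {
        receive(
            (string msg) { log.writeln(msg); },  // the only thread touching the file
            (bool stop)  { done = stop; }
        );
    }
}

void main()
{
    auto logger = spawn(&loggerThread, "pedupg.log");
    logger.send("pedupg started ...");
    // ... run the parallel work; each task calls logger.send(its message) ...
    logger.send(true);                           // tell the logger to finish
}
---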
Re: problem with parallel foreach
On Tuesday, 12 May 2015 at 19:10:13 UTC, Laeeth Isharc wrote: On Tuesday, 12 May 2015 at 18:14:56 UTC, Gerald Jansen wrote: On Tuesday, 12 May 2015 at 16:35:23 UTC, Rikki Cattermole wrote: On 13/05/2015 4:20 a.m., Gerald Jansen wrote: At the risk of great embarassment ... here's my program: http://dekoppel.eu/tmp/pedupg.d Would it be possible to give us some example data? I might give it a go to try rewriting it tomorrow. http://dekoppel.eu/tmp/pedupgLarge.tar.gz (89 Mb) Contains two largish datasets in a directory structure expected by the program. I haven't had time to read code closely. But if you disable the logging does that change things? If so, how about having the logging done asynchronously in another thread? And are you using optimization on gdc ? Also try byLineFast eg http://forum.dlang.org/thread/umkcjntsxchskljyg...@forum.dlang.org#post-20130516144627.50da:40unknown I don't know if std.csv CSVReader would be faster than parsing yourself, but worth trying. Some tricks here, also: http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html
Re: problem with parallel foreach
On Tuesday, 12 May 2015 at 17:45:54 UTC, thedeemon wrote: On Tuesday, 12 May 2015 at 17:02:19 UTC, Gerald Jansen wrote: About 3.5 million lines read by main(), 0.5 to 2 million lines read and 3.5 million lines written by runTraits (aka runJob).

Each GC allocation in D is a locking operation (and disabling GC doesn't help here at all), and probably each writeln too, so when multiple threads try to write millions of lines such an issue is easy to hit. I would look for a way to write those lines without allocations and locking, and also reduce the total number of system calls by buffering data and doing fewer f.writef's.

Your advice is appreciated but quite disheartening. I was hoping for something (nearly) as easy to use as Python's parallel.Pool() map(), given that this is essentially an embarrassingly parallel problem. Avoidance of GC allocation and self-written buffered IO functions seems a bit much to ask of a newcomer to a language.
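In practice "buffering data" here can be as simple as formatting a whole job's output into one in-memory buffer and writing it with a single call, instead of one writef per line. A sketch (the field layout and names are made up, not taken from the original program):

---
import std.array : appender;
import std.format : formattedWrite;
import std.stdio : File;

void writeJobOutput(string path, const int[] ids, const double[] values)
{
    auto buf = appender!(char[])();
    buf.reserve(ids.length * 32);              // rough guess at the output size
    foreach (i, id; ids)
        buf.formattedWrite("%d %.4f\n", id, values[i]);

    File(path, "w").rawWrite(buf.data);        // one large write per job
}
---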
Re: problem with parallel foreach
On Tuesday, 12 May 2015 at 19:14:23 UTC, Laeeth Isharc wrote: But if you disable the logging does that change things? There is only a tiny bit of logging happening. And are you using optimization on gdc ? gdc -Ofast -march=native -frelease Also try byLineFast eg http://forum.dlang.org/thread/umkcjntsxchskljyg...@forum.dlang.org#post-20130516144627.50da:40unknown Thx, I'll have a look. Performance is good for a single dataset so I thought regular byLine would be okay. I don't know if std.csv CSVReader would be faster than parsing yourself, but worth trying. No, my initial experience with CSVReader was that it was not very fast: http://forum.dlang.org/post/wklmolsqcmnagluid...@forum.dlang.org . Some tricks here, also: http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html Thanks again. I am having doubts about d-is-for-data-science. The learning curve is very steep compared to my experience with R/Python/(Julia). But I'm trying...
Re: problem with parallel foreach
On Tuesday, 12 May 2015 at 18:14:56 UTC, Gerald Jansen wrote: On Tuesday, 12 May 2015 at 16:35:23 UTC, Rikki Cattermole wrote: On 13/05/2015 4:20 a.m., Gerald Jansen wrote: At the risk of great embarrassment ... here's my program: http://dekoppel.eu/tmp/pedupg.d Would it be possible to give us some example data? I might give it a go to try rewriting it tomorrow. http://dekoppel.eu/tmp/pedupgLarge.tar.gz (89 Mb) Contains two largish datasets in a directory structure expected by the program.

Profiling shows that your program spends most of the time reading the data. I see a considerable speed boost with the following one-line patch (plus imports):

- foreach (line; File(pednum, "r").byLine()) {
+ foreach (line; (cast(string)read(pednum)).split('\n')) {
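For anyone trying the patch, here is a compilable sketch of the new loop with the imports it needs (`pednum` is the path variable from the original program; the per-line parsing is elided):

---
import std.file : read;
import std.array : split;

void readPedigree(string pednum)
{
    foreach (line; (cast(string) read(pednum)).split('\n'))
    {
        if (line.length == 0)
            continue;          // the file's trailing newline produces one empty line
        // ... parse the line exactly as before ...
    }
}
---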
Re: problem with parallel foreach
On Tuesday, 12 May 2015 at 20:58:16 UTC, Vladimir Panteleev wrote: On Tuesday, 12 May 2015 at 18:14:56 UTC, Gerald Jansen wrote: On Tuesday, 12 May 2015 at 16:35:23 UTC, Rikki Cattermole wrote: On 13/05/2015 4:20 a.m., Gerald Jansen wrote: At the risk of great embarrassment ... here's my program: http://dekoppel.eu/tmp/pedupg.d Would it be possible to give us some example data? I might give it a go to try rewriting it tomorrow. http://dekoppel.eu/tmp/pedupgLarge.tar.gz (89 Mb) Contains two largish datasets in a directory structure expected by the program. Profiling shows that your program spends most of the time reading the data. I see a considerable speed boost with the following one-line patch (plus imports):

- foreach (line; File(pednum, "r").byLine()) {
+ foreach (line; (cast(string)read(pednum)).split('\n')) {

Nice, thanks. Making that replacement in three places in the program resulted in roughly a 30% speedup at the cost of about 30% more memory in this specific case. Unfortunately it didn't help with the performance deterioration problem with parallel foreach.
Re: problem with parallel foreach
On Tuesday, 12 May 2015 at 20:50:45 UTC, Gerald Jansen wrote: Your advice is appreciated but quite disheartening. I was hoping for something (nearly) as easy to use as Python's parallel.Pool() map(), given that this is essentially an embarassingly parallel problem. Avoidance of GC allocation and self-written buffered IO functions seems a bit much to ask of a newcomer to a language. You're right, these are issues of D's standard library that are not easy for a newcomer to tackle. In case of Python's parallel.Pool() separate processes do the work without any synchronization issues. In case of D's std.parallelism it's just threads inside one process and they do fight for some locks, thus this result.
Re: problem with parallel foreach
On 05/12/2015 07:59 AM, Gerald Jansen wrote: the performance of my D version deteriorates rapidly beyond a handful of jobs whereas the time for the Python version increases linearly with the number of jobs per cpu core. It may be related to GC collections. If it hasn't been changed recently, every allocation from GC triggers a collection cycle. D's current GC being a stop-the-world kind, you lose all benefit of parallel processing when that happens. Without seeing runJob, even arr2.dup may be having such an effect on the performance. Ali
problem with parallel foreach
I am a data analyst trying to learn enough D to decide whether to use D for a new project rather than Python + Fortran. I have recoded a non-trivial Python program to do some simple parallel data processing (using the map function in Python's multiprocessing module and parallel foreach in D). I was very happy that my D version ran considerably faster than the Python version when running a single job, but was soon dismayed to find that the performance of my D version deteriorates rapidly beyond a handful of jobs, whereas the time for the Python version increases linearly with the number of jobs per cpu core. The server has 4 quad-core Xeons and abundant memory compared to my needs for this task, even though there are several million records in each dataset. The basic structure of the D program is:

import std.parallelism; // and other modules

void main() {
    // ...
    // read common data and store in arrays
    // ...
    foreach (job; parallel(jobs, 1)) {
        runJob(job, arr1, arr2.dup);
    }
}

void runJob(string job, in int[] arr1, int[] arr2) {
    // read file of job-specific data and modify the arr2 copy
    // write job-specific output data file
}

The output of /usr/bin/time is as follows:

Lang  Jobs     User   System  Elapsed   %CPU
Py       1    45.17     1.44  0:46.65     99
D        1     8.44     1.17  0:09.24    104
Py       2    79.24     2.16  0:48.90    166
D        2    19.41    10.14  0:17.96    164
Py      30  1255.17    58.38  2:39.54    823  * Pool(12)
D       30   421.61  4565.97  6:33.73   1241

(Note that the Python program was somewhat optimized with numpy vectorization and a bit of numba jit compilation.) The system time varies widely between repetitions for D with multiple jobs (eg. from 3.8 to 21.5 seconds for 2 jobs). Clearly my simple approach with parallel foreach has some problem(s). Any suggestions?

Gerald Jansen
Re: problem with parallel foreach
On Tuesday, 12 May 2015 at 14:59:38 UTC, Gerald Jansen wrote:

I am a data analyst trying to learn enough D to decide whether to use D for a new project rather than Python + Fortran. I have recoded a non-trivial Python program to do some simple parallel data processing (using the map function in Python's multiprocessing module and parallel foreach in D). I was very happy that my D version ran considerably faster than the Python version when running a single job, but was soon dismayed to find that the performance of my D version deteriorates rapidly beyond a handful of jobs, whereas the time for the Python version increases linearly with the number of jobs per cpu core. The server has 4 quad-core Xeons and abundant memory compared to my needs for this task, even though there are several million records in each dataset. The basic structure of the D program is:

import std.parallelism; // and other modules

void main() {
    // ...
    // read common data and store in arrays
    // ...
    foreach (job; parallel(jobs, 1)) {
        runJob(job, arr1, arr2.dup);
    }
}

void runJob(string job, in int[] arr1, int[] arr2) {
    // read file of job-specific data and modify the arr2 copy
    // write job-specific output data file
}

The output of /usr/bin/time is as follows:

Lang  Jobs     User   System  Elapsed   %CPU
Py       1    45.17     1.44  0:46.65     99
D        1     8.44     1.17  0:09.24    104
Py       2    79.24     2.16  0:48.90    166
D        2    19.41    10.14  0:17.96    164
Py      30  1255.17    58.38  2:39.54    823  * Pool(12)
D       30   421.61  4565.97  6:33.73   1241

(Note that the Python program was somewhat optimized with numpy vectorization and a bit of numba jit compilation.) The system time varies widely between repetitions for D with multiple jobs (eg. from 3.8 to 21.5 seconds for 2 jobs). Clearly my simple approach with parallel foreach has some problem(s). Any suggestions?

Gerald Jansen

Have you tried adjusting the workUnitSize argument to parallel? It should probably be 1 for such large individual tasks.
Re: problem with parallel foreach
On Tuesday, 12 May 2015 at 15:11:01 UTC, John Colvin wrote: On Tuesday, 12 May 2015 at 14:59:38 UTC, Gerald Jansen wrote:

I am a data analyst trying to learn enough D to decide whether to use D for a new project rather than Python + Fortran. I have recoded a non-trivial Python program to do some simple parallel data processing (using the map function in Python's multiprocessing module and parallel foreach in D). I was very happy that my D version ran considerably faster than the Python version when running a single job, but was soon dismayed to find that the performance of my D version deteriorates rapidly beyond a handful of jobs, whereas the time for the Python version increases linearly with the number of jobs per cpu core. The server has 4 quad-core Xeons and abundant memory compared to my needs for this task, even though there are several million records in each dataset. The basic structure of the D program is:

import std.parallelism; // and other modules

void main() {
    // ...
    // read common data and store in arrays
    // ...
    foreach (job; parallel(jobs, 1)) {
        runJob(job, arr1, arr2.dup);
    }
}

void runJob(string job, in int[] arr1, int[] arr2) {
    // read file of job-specific data and modify the arr2 copy
    // write job-specific output data file
}

The output of /usr/bin/time is as follows:

Lang  Jobs     User   System  Elapsed   %CPU
Py       1    45.17     1.44  0:46.65     99
D        1     8.44     1.17  0:09.24    104
Py       2    79.24     2.16  0:48.90    166
D        2    19.41    10.14  0:17.96    164
Py      30  1255.17    58.38  2:39.54    823  * Pool(12)
D       30   421.61  4565.97  6:33.73   1241

(Note that the Python program was somewhat optimized with numpy vectorization and a bit of numba jit compilation.) The system time varies widely between repetitions for D with multiple jobs (eg. from 3.8 to 21.5 seconds for 2 jobs). Clearly my simple approach with parallel foreach has some problem(s). Any suggestions?

Gerald Jansen

Have you tried adjusting the workUnitSize argument to parallel? It should probably be 1 for such large individual tasks.

Ignore me, I missed that you already had done that.
Re: problem with parallel foreach
Thanks Ali. I have tried putting GC.disable() in both main and runJob, but the timing behaviour did not change. The python version works in a similar fashion and also has automatic GC. I tend to think that is not the (biggest) problem. The program is long and newbie-ugly ... but I could put it somewhere if that would help. Gerald On Tuesday, 12 May 2015 at 15:24:45 UTC, Ali Çehreli wrote: On 05/12/2015 07:59 AM, Gerald Jansen wrote: the performance of my D version deteriorates rapidly beyond a handful of jobs whereas the time for the Python version increases linearly with the number of jobs per cpu core. It may be related to GC collections. If it hasn't been changed recently, every allocation from GC triggers a collection cycle. D's current GC being a stop-the-world kind, you lose all benefit of parallel processing when that happens. Without seeing runJob, even arr2.dup may be having such an effect on the performance. Ali
Re: problem with parallel foreach
On 13/05/2015 2:59 a.m., Gerald Jansen wrote:

I am a data analyst trying to learn enough D to decide whether to use D for a new project rather than Python + Fortran. I have recoded a non-trivial Python program to do some simple parallel data processing (using the map function in Python's multiprocessing module and parallel foreach in D). I was very happy that my D version ran considerably faster than the Python version when running a single job, but was soon dismayed to find that the performance of my D version deteriorates rapidly beyond a handful of jobs, whereas the time for the Python version increases linearly with the number of jobs per cpu core. The server has 4 quad-core Xeons and abundant memory compared to my needs for this task, even though there are several million records in each dataset. The basic structure of the D program is:

import std.parallelism; // and other modules

void main() {
    // ...
    // read common data and store in arrays
    // ...
    foreach (job; parallel(jobs, 1)) {
        runJob(job, arr1, arr2.dup);
    }
}

void runJob(string job, in int[] arr1, int[] arr2) {
    // read file of job-specific data and modify the arr2 copy
    // write job-specific output data file
}

The output of /usr/bin/time is as follows:

Lang  Jobs     User   System  Elapsed   %CPU
Py       1    45.17     1.44  0:46.65     99
D        1     8.44     1.17  0:09.24    104
Py       2    79.24     2.16  0:48.90    166
D        2    19.41    10.14  0:17.96    164
Py      30  1255.17    58.38  2:39.54    823  * Pool(12)
D       30   421.61  4565.97  6:33.73   1241

(Note that the Python program was somewhat optimized with numpy vectorization and a bit of numba jit compilation.) The system time varies widely between repetitions for D with multiple jobs (eg. from 3.8 to 21.5 seconds for 2 jobs). Clearly my simple approach with parallel foreach has some problem(s). Any suggestions?

Gerald Jansen

A couple of things come to mind at the start of the main function:

---
import core.memory : GC;

GC.disable;
GC.reserve(1024 * 1024 * 1024);
---

That will reserve 1 GB of RAM for the GC to work with. It will also stop the GC from trying to collect. I would HIGHLY recommend that each worker thread have a preallocated amount of memory, and that that memory is then used to hold the data. Basically you are just being sloppy memory-allocation-wise. Try your best to move your processing into @nogc-annotated functions, then make them allocate as little as possible. Remember you can pass around slices of arrays freely without memory allocation, as long as they are not mutated!
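One way to get a preallocated amount of memory per worker with std.parallelism is worker-local storage. A sketch (the buffer size, trait names and runJob signature are made up):

---
import std.parallelism : parallel, taskPool;

void main()
{
    string[] jobs = ["mil", "fat", "pro"];    // hypothetical trait names

    // one 64 MiB scratch buffer per worker thread, allocated up front
    auto scratch = taskPool.workerLocalStorage(new ubyte[](64 * 1024 * 1024));

    foreach (job; parallel(jobs, 1))
    {
        ubyte[] buf = scratch.get;            // this thread's own buffer
        // runJob would slice and reuse buf instead of allocating per line
        // runJob(job, buf);
    }
}
---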