Re: [ccp4bb] Cluster Design
Hello Frank,

Here at the CLS we have 8-CPU AMD quad-core 2.9 GHz machines (i.e. 32 cores per node). There are newer models capable of 48 cores. Each node has four Gb network ports bonded through EtherChannel. Our file server is a Sun 7310 storage server, also with four bonded Gb network ports. We use NFSv4.

We also use XDS extensively, and I have spent some time trying to optimize the system. In addition to the comments on the XDS wiki and in Kay's reply, I have made a few observations:

- I determine the optimum number of CPUs and number of jobs to provide in XDS.INP using the following Python snippet:

    import math

    # DELPHI and delta (oscillation range per frame) are both in degrees
    min_cpus = int(round(DELPHI / delta))
    stride = int(math.ceil(num_frames / float(total_cores)))
    jobs = (num_frames // min_cpus) // stride
    num_cpus = 1 + (num_frames // jobs) // stride

  For example, if you have a cluster with 96 cores and you want to process 360 frames with a delta of 1 deg (with DELPHI at 5 deg), this gives you

    maximum_number_of_cpus=6
    maximum_number_of_jobs=18

  So you get 18 jobs, each with 4 batches. Each batch uses 5 cores, each of which processes one frame of the batch, which maximises CPU usage without overcommitting the cores. Each core therefore processes 4 images before the integration job is complete.

  I find that raising maximum_number_of_cpus much above the number of cores actually used in each batch is counterproductive. I simply add 1 because a few batches will sometimes get an extra frame if the number of frames is not an integral multiple of the number of jobs.

I hope this helps.

Michel Fodje
Canadian Macromolecular Crystallography Facility, Canadian Light Source

-----Original Message-----
From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of Frank Murphy
Sent: April-19-11 6:06 AM
To: CCP4BB@JISCMAIL.AC.UK
Subject: [ccp4bb] Cluster Design

Dear All,

Here at NE-CAT, we make extensive use of XDS in a parallel environment. We are looking to purchase some new hardware, so I am soliciting your opinions.

Our current cluster is made up of 16 nodes, each with two quad-core processors running at 2.2 GHz (I believe). We run with hyperthreading on, so 8 physical and 16 virtual cores per node. Our benchmarking with XDS (see https://rapd.nec.aps.anl.gov/wiki/RAPD_NecatStats for an example) shows a diminishing return on increasing MAXIMUM_NUMBER_OF_PROCESSORS beyond the number of physical cores, and we are wondering if this is due to the test, the processor, the RAM, or XDS.

In short, will going to two six-core processors speed up processing when using up to 12 for MAXIMUM_NUMBER_OF_PROCESSORS?

Please do not feel the need to constrain the discussion to XDS, as we use our cluster for pretty much all the common crystallographic tasks.

Thanks in advance,
Frank Murphy
Beamline Scientist, NE-CAT
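For readers who want to try the arithmetic from Michel's snippet, here is a minimal sketch that wraps it in a function and prints the two XDS.INP lines. The function names, the max(1, ...) guards against zero values, and the mapping of maximum_number_of_cpus onto the MAXIMUM_NUMBER_OF_PROCESSORS keyword are assumptions made for illustration; they are not taken from the message.

```python
import math


def xds_parallel_settings(num_frames, total_cores, delta, delphi=5.0):
    """Return (jobs, cpus) using the arithmetic from the snippet above.

    delta and delphi are in degrees. The max(1, ...) guards are an
    addition for this sketch so unusual inputs cannot divide by zero.
    """
    # Frames per DELPHI batch, i.e. cores that can be kept busy per batch.
    min_cpus = max(1, int(round(delphi / delta)))
    # Frames each core has to handle if all cores share the data set.
    stride = int(math.ceil(num_frames / float(total_cores)))
    jobs = max(1, (num_frames // min_cpus) // stride)
    # One extra CPU covers batches that receive an additional frame.
    cpus = 1 + (num_frames // jobs) // stride
    return jobs, cpus


def xds_inp_lines(jobs, cpus):
    """Format the two keyword lines (hypothetical helper for this sketch)."""
    return [
        "MAXIMUM_NUMBER_OF_JOBS= %d" % jobs,
        "MAXIMUM_NUMBER_OF_PROCESSORS= %d" % cpus,
    ]


if __name__ == "__main__":
    # The example from the message: 96 cores, 360 frames, 1 deg per frame.
    jobs, cpus = xds_parallel_settings(360, 96, 1.0)
    print("\n".join(xds_inp_lines(jobs, cpus)))
```

With the numbers from the message (96 cores, 360 frames, delta of 1 deg, DELPHI of 5 deg) this gives 18 jobs and 6 CPUs, matching the values quoted above.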
Re: [ccp4bb] Cluster Design
Hi Frank,

sorry, my first response was not very specific to your situation!

I studied your list and graphs and would just like to point out that in "2.1 Testing Filesystem Performance" you may be severely overcommitting the CPU resources, if this was for one machine only with 16 (=8+8) cores. (But maybe you were really only interested in filesystem performance; we are generally happy with NFSv4 but haven't tried anything else.)

If testing with one machine, I'd try for XDS

    JOBS  PROCESSORS
       1  16
       2   8
       4   4
       8   2
      16   1

and then, if "4 4" was best, for example

       5   4
       6   4

We use several 48-core AMD machines (4*12-core 6176SE 2.3 GHz CPUs), and we are happy with them for general work. We also use Intel machines (2*6-core X5670 2.93 GHz Xeon, plus Hyperthreading), which give a higher single-CPU performance, but of course 48 cores are nice in some situations (for XDS, e.g. 6 to 8 JOBS of 8 PROCESSORS each). OTOH more machines may mean more rack space and more administration.

HTH,
Kay

--
Kay Diederichs  http://strucbio.biologie.uni-konstanz.de
email: kay.diederi...@uni-konstanz.de  Tel +49 7531 88 4049  Fax 3183
Fachbereich Biologie, Universität Konstanz, Box M647, D-78457 Konstanz
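The test grid Kay suggests for a 16-core machine consists of all JOBS/PROCESSORS pairs whose product equals the core count. A small sketch for generating such a grid for any core count follows; the function name is hypothetical, and the mildly overcommitted follow-up combinations (such as "5 4" and "6 4") would still be chosen by hand.

```python
def job_processor_grid(total_cores):
    """Return (jobs, processors) pairs with jobs * processors == total_cores."""
    return [
        (jobs, total_cores // jobs)
        for jobs in range(1, total_cores + 1)
        if total_cores % jobs == 0
    ]


if __name__ == "__main__":
    print("JOBS  PROCESSORS")
    for jobs, processors in job_processor_grid(16):
        print("%4d  %4d" % (jobs, processors))
```

For 16 cores this lists the five combinations from the message, from 1 job of 16 processors to 16 jobs of 1 processor each.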
Re: [ccp4bb] Cluster Design
Hi Frank,

the following are some recommendations for increasing the processing speed of XDS. You can find them (and add to them!) at http://strucbio.biologie.uni-konstanz.de/xdswiki/index.php/Performance . Only item 7 is specific to a cluster. In order of effect:

1. XDS scales well (i.e. the wallclock time for data processing goes down when the number of available cores is increased) in the COLSPOT, IDXREF, INTEGRATE and CORRECT steps when using the MAXIMUM_NUMBER_OF_PROCESSORS keyword. This triggers program-level parallelization, using OpenMP threads.

2. The program scales very well in the COLSPOT and INTEGRATE steps when using the MAXIMUM_NUMBER_OF_JOBS keyword. This triggers shell-level parallelization.

3. Combining both keywords gives the highest performance in my experience (see [[1]] for an example). As a rough guide, I'd choose them to be approximately equal; an even number should be chosen for MAXIMUM_NUMBER_OF_PROCESSORS because that fits better with usual hardware.

4. Some overcommitting of resources (i.e. MAXIMUM_NUMBER_OF_PROCESSORS * MAXIMUM_NUMBER_OF_JOBS > number of cores) is beneficial; you'll have to play with these two parameters.

5. The next thing to consider is DELPHI together with OSCILLATION_RANGE: if DELPHI is an integer multiple of MAXIMUM_NUMBER_OF_PROCESSORS * OSCILLATION_RANGE, that would be good because it nicely balances the usage of the threads. For this purpose, you may want to change (raise) the value of DELPHI (default is 5 degrees). If you are doing fine-slicing then mis-balancing of threads is not an issue, but for those users who want to collect 1° frames (which I think is not the best way nowadays ...) it should be a consideration.

6. Performance-wise, I/O also plays a role, because as soon as you run 24 or so processes a single Gigabit Ethernet connection may be limiting. OTOH shell-level parallelization smooths the load.

7. XDS with the MAXIMUM_NUMBER_OF_JOBS keyword can use several machines. This requires some setup, as explained at the bottom of http://www.mpimf-heidelberg.mpg.de/~kabsch/xds/html_doc/downloading.html .

8. Hyperthreading (SMT), if available on Intel CPUs, is beneficial. A "virtual" core has only about 20% of the performance of a "physical" core, but it comes at no cost: you just have to switch it on in the BIOS of the machine.

9. The 64-bit binaries are generally a bit faster than the 32-bit binaries (but that's not specific to XDS).

HTH,
Kay
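Item 5 in Kay's list amounts to a divisibility check: the number of frames in one DELPHI batch (DELPHI / OSCILLATION_RANGE) should be a whole multiple of MAXIMUM_NUMBER_OF_PROCESSORS. A minimal sketch of that check follows, together with a helper that raises DELPHI to the next balanced value. Both function names are hypothetical, and the 5 degree floor is simply the default mentioned in the list; exact fractions are used so that values such as 0.1 degrees do not produce rounding surprises.

```python
import math
from fractions import Fraction


def is_balanced(delphi, processors, oscillation_range):
    """True if DELPHI is an integer multiple of processors * oscillation_range."""
    frames_per_batch = Fraction(str(delphi)) / Fraction(str(oscillation_range))
    return frames_per_batch % processors == 0


def next_balanced_delphi(processors, oscillation_range, minimum=5):
    """Smallest balanced DELPHI (degrees) that is not below `minimum`."""
    step = processors * Fraction(str(oscillation_range))
    multiples = max(1, math.ceil(Fraction(str(minimum)) / step))
    return float(multiples * step)


if __name__ == "__main__":
    # 1 degree frames on 8 threads: the default DELPHI of 5 is not balanced.
    print(is_balanced(5, 8, 1.0))
    print(next_balanced_delphi(8, 1.0))
```

In the 1° case with 8 processors, the default DELPHI of 5 leaves three threads idle per batch, and the helper suggests raising DELPHI to 8.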