And just because I roll like that, I turned the debug back off, ran it again, and got this:
# Using mpi-io calls.
[E 13:45:48.408547] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 3.
[E 13:45:48.418717] msgpair failed, will retry: Operation cancelled (possibly due to timeout)
[E 13:50:48.602454] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 26.
nr_procs = 2, nr_iter = 1, blk_sz = 16777216, coll = 0
# total_size = 33554432
# Write: min_t = 0.033078, max_t = 302.189165, mean_t = 151.111121, var_t = 45649.150432
# Read: min_t = 0.029111, max_t = 0.030349, mean_t = 0.029730, var_t = 0.000001
Write bandwidth = 0.111038 Mbytes/sec
Read bandwidth = 1105.618442 Mbytes/sec
real 333.09
user 0.00
sys 0.00

Now with collectives:

time -p mpiexec -n 2 -npernode 1 /home/bradles/software/anl-io-test/bin/anl-io-test-mx -C -f pvfs2:/tmp/bradles-pav/mount/anl-io-data
# Using mpi-io calls.
[E 13:51:51.427632] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 3.
[E 13:51:51.437759] msgpair failed, will retry: Operation cancelled (possibly due to timeout)
=>> PBS: job killed: walltime 640 exceeded limit 600

So it will work if given enough time, but something is going on during some of those writes that is killing performance, and I have no idea what it could be. A 10,000-fold difference between read and write performance. Yikes.
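
(Sanity-checking those numbers, assuming the reported bandwidths are simply total_size / max_t:

    33554432 bytes / 302.189165 s = ~0.111 Mbytes/sec for the write
    33554432 bytes / 0.030349 s   = ~1105.6 Mbytes/sec for the read

so the ratio really is roughly 10,000:1.)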
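
In case it helps to see what the benchmark is asking ROMIO to do, here is a rough sketch of the write path. This is only an illustrative sketch, not the actual anl-io-test/mpi-io-test source: it assumes each rank writes one contiguous blk_sz block at offset rank * blk_sz, and it reuses the file name and sizes from the runs above. The -C run presumably corresponds to the collective branch, the default run to the independent one.

    /* Rough sketch (not the actual anl-io-test source): each rank writes one
     * contiguous BLK_SZ block at offset rank * BLK_SZ, either independently
     * or collectively. The sizes match the runs above
     * (2 procs x 16777216 bytes = 33554432 bytes total). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    #define BLK_SZ 16777216   /* blk_sz from the runs above */

    int main(int argc, char **argv)
    {
        int rank, coll;
        MPI_File fh;
        double t0, t1;
        char *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        coll = (argc > 1);       /* pass any argument to mimic -C */
        buf = malloc(BLK_SZ);

        /* The "pvfs2:" prefix tells ROMIO to use its PVFS2 driver directly. */
        MPI_File_open(MPI_COMM_WORLD, "pvfs2:/tmp/bradles-pav/mount/anl-io-data",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        if (coll)
            MPI_File_write_at_all(fh, (MPI_Offset)rank * BLK_SZ, buf, BLK_SZ,
                                  MPI_BYTE, MPI_STATUS_IGNORE);
        else
            MPI_File_write_at(fh, (MPI_Offset)rank * BLK_SZ, buf, BLK_SZ,
                              MPI_BYTE, MPI_STATUS_IGNORE);
        t1 = MPI_Wtime();

        printf("rank %d: wrote %d bytes in %f s\n", rank, BLK_SZ, t1 - t0);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }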

Thanks for any insight you can provide, Scott.

Cheers,
Brad

On Thu, Mar 5, 2009 at 1:52 PM, Bradley Settlemyer <[email protected]> wrote:
> Heh, the job works whenever I do that:
>
> http://www.parl.clemson.edu/~bradles/downloads/anl-io-bm-mx-16-2.o168456
>
> However, this run had a really slow write in the second instance:
>
> http://www.parl.clemson.edu/~bradles/downloads/anl-io-bm-mx-16-2.o168495
>
> Both include debug from two procs (on separate nodes). Hope that is okay.
>
> Cheers,
> Brad
>
> On Thu, Mar 5, 2009 at 1:23 PM, Scott Atchley <[email protected]> wrote:
>> Brad,
>>
>> Can you rerun the job with PVFS2_DEBUGMASK=network exported?
>>
>> Scott
>>
>> On Mar 5, 2009, at 12:39 PM, Bradley Settlemyer wrote:
>>
>>> Scott and Rob,
>>>
>>> PAV is the PVFS auto-volume service; it allows me to start PVFS for a
>>> job on the compute nodes I've scheduled. Effectively, it's a remote
>>> configuration tool that takes a config file and configures and starts
>>> the PVFS servers on a subset of my job's nodes.
>>>
>>> Additional requested info:
>>>
>>> MX version:
>>> [brad...@node0394:bradles-pav:1009]$ mx_info
>>> MX Version: 1.2.7
>>> MX Build: w...@node0002:/home/wolf/rpm/BUILD/mx-1.2.7 Wed Dec 3 09:21:26 EST 2008
>>> 1 Myrinet board installed.
>>> The MX driver is configured to support a maximum of:
>>>   16 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
>>> ===================================================================
>>> Instance #0: 313.6 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
>>>   Status: Running, P0: Link Up
>>>   Network: Myrinet 10G
>>>
>>>   MAC Address: 00:60:dd:47:23:4e
>>>   Product code: 10G-PCIE-8A-C
>>>   Part number: 09-03327
>>>   Serial number: 338892
>>>   Mapper: 00:60:dd:47:21:dd, version = 0x00000063, configured
>>>   Mapped hosts: 772
>>>
>>> PVFS2 is version 2.7.1, built with MX turned on and TCP turned off.
>>> I can copy files out of the file system, but writing to the file
>>> system is precarious. Data gets written, but then it seems to hang or
>>> something. Here is my job output using mpi-io-test:
>>>
>>> time -p mpiexec -n 2 -npernode 1 /home/bradles/software/anl-io-test/bin/anl-io-test-mx -f pvfs2:/tmp/bradles-pav/mount/anl-io-data
>>> # Using mpi-io calls.
>>> [E 12:21:32.047891] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 3.
>>> [E 12:21:32.058035] msgpair failed, will retry: Operation cancelled (possibly due to timeout)
>>> [E 12:26:32.217723] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 56.
>>> [E 12:26:32.227774] msgpair failed, will retry: Operation cancelled (possibly due to timeout)
>>> =>> PBS: job killed: walltime 610 exceeded limit 600
>>>
>>> This is writing 32 MB into a file. The data seems to all be there
>>> (the file size is 33554432), but I guess the writes never return.
>>> I don't know how to diagnose what is the matter. Any help is much
>>> appreciated.
>>>
>>> Thanks,
>>> Brad
>>>
>>> On Thu, Mar 5, 2009 at 9:41 AM, Scott Atchley <[email protected]> wrote:
>>>>
>>>> On Mar 5, 2009, at 8:46 AM, Robert Latham wrote:
>>>>
>>>>> On Wed, Mar 04, 2009 at 07:15:24PM -0500, Bradley Settlemyer wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I am trying to use PAV to run PVFS with the MX protocol. I've
>>>>>> updated PAV so that the servers start and ping correctly, but when
>>>>>> I try to run an MPI code I get client timeouts, as if the client
>>>>>> cannot contact the servers.
>>>>>>
>>>>>> Lots of this stuff:
>>>>>>
>>>>>> [E 19:11:02.573509] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 3.
>>>>>> [E 19:11:02.583659] msgpair failed, will retry: Operation cancelled (possibly due to timeout)
>>>>
>>>> Brad, which version of MX and PVFS2?
>>>>
>>>>> OK, so the PVFS utilities are all hunky-dory? Not just pvfs2-ping, but
>>>>> pvfs2-cp and pvfs2-ls too?
>>>>>
>>>>> On Jazz, I usually configure MPICH2 to communicate over TCP and have
>>>>> the PVFS system interface communicate over MX. This keeps the
>>>>> situation fairly simple, but of course you get awful MPI performance.
>>>>>
>>>>> Does MX still have the "ports" restriction that GM has? I wonder if
>>>>> MPI communication is getting in the way of PVFS communication...
>>>>>
>>>>> In short, I don't exactly know what's wrong myself; just tossing out
>>>>> some theories.
>>>>>
>>>>> ==rob
>>>>
>>>> Rob, MX is limited to 8 endpoints per NIC. One can use mx_info to get the
>>>> number:
>>>>
>>>>   8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
>>>>
>>>> This can be increased to 16 with a module parameter.
>>>>
>>>> Generally, you want no more than one endpoint per process and one
>>>> process per core for MPI. When you want to use MPI-IO over PVFS2, each
>>>> process will need two endpoints (one for MPI and one for PVFS2), so if
>>>> you have eight cores you should increase the max endpoints to 16.
>>>>
>>>> Generally, I would not want to limit my MPI to TCP and my IO to MX,
>>>> especially if my TCP is over gigabit Ethernet. Unless your IO can
>>>> exceed the link rate, there will be plenty of bandwidth left over for
>>>> MPI, and your latency will stay much lower than over TCP.
>>>>
>>>> What is PAV?
>>>>
>>>> Scott
