And just because I roll like that, I turned the debug back off, ran it
again, and got this:

# Using mpi-io calls.
[E 13:45:48.408547] job_time_mgr_expire: job time out: cancelling bmi
operation, job_id: 3.
[E 13:45:48.418717] msgpair failed, will retry: Operation cancelled
(possibly due to timeout)
[E 13:50:48.602454] job_time_mgr_expire: job time out: cancelling bmi
operation, job_id: 26.
nr_procs = 2, nr_iter = 1, blk_sz = 16777216, coll = 0
# total_size = 33554432
# Write: min_t = 0.033078, max_t = 302.189165, mean_t = 151.111121,
var_t = 45649.150432
# Read:  min_t = 0.029111, max_t = 0.030349, mean_t = 0.029730, var_t = 0.000001
Write bandwidth = 0.111038 Mbytes/sec
Read bandwidth = 1105.618442 Mbytes/sec
real 333.09
user 0.00
sys 0.00
Now with collectives:
time -p mpiexec -n 2 -npernode 1
/home/bradles/software/anl-io-test/bin/anl-io-test-mx -C -f
pvfs2:/tmp/bradles-pav/mount/anl-io-data
# Using mpi-io calls.
[E 13:51:51.427632] job_time_mgr_expire: job time out: cancelling bmi
operation, job_id: 3.
[E 13:51:51.437759] msgpair failed, will retry: Operation cancelled
(possibly due to timeout)
=>> PBS: job killed: walltime 640 exceeded limit 600

So it will work if given enough time.  But something is going on during
some of those writes that is killing performance.  I have no idea what
it could be.  A 10,000-fold difference between read and write
performance.  Yikes.
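
For what it's worth, those bandwidth figures appear to be just
total_size / max_t, so the gap really is the 302-second worst-case write
versus the 0.03-second read.  A quick sanity check (assuming decimal
megabytes, which is what the reported numbers match):

awk 'BEGIN { printf "write: %f Mbytes/sec\n", 33554432/302.189165/1e6 }'
awk 'BEGIN { printf "read:  %f Mbytes/sec\n", 33554432/0.030349/1e6 }'
awk 'BEGIN { printf "ratio: %.0fx\n",         302.189165/0.030349 }'

The ratio comes out around 10,000x, and the 302-second write lines up
suspiciously well with the five-minute gap between the two
job_time_mgr_expire messages above.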

Thanks for any insight you can provide, Scott.

Cheers,
Brad


On Thu, Mar 5, 2009 at 1:52 PM, Bradley Settlemyer
<[email protected]> wrote:
> Heh, the job works whenever I do that:
>
> http://www.parl.clemson.edu/~bradles/downloads/anl-io-bm-mx-16-2.o168456
>
> However, this run had a really slow write in the second instance:
>
> http://www.parl.clemson.edu/~bradles/downloads/anl-io-bm-mx-16-2.o168495
>
> Both include debug output from two procs (on separate nodes).  Hope that is okay.
>
> Cheers,
> Brad
>
> On Thu, Mar 5, 2009 at 1:23 PM, Scott Atchley <[email protected]> wrote:
>> Brad,
>>
>> Can you rerun the job with PVFS2_DEBUGMASK=network exported?
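>>
>> Something like this ahead of the run should do it (assuming your
>> mpiexec forwards the environment to the remote procs):
>>
>> export PVFS2_DEBUGMASK=network
>> time -p mpiexec -n 2 -npernode 1 \
>>     /home/bradles/software/anl-io-test/bin/anl-io-test-mx -f \
>>     pvfs2:/tmp/bradles-pav/mount/anl-io-data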
>>
>> Scott
>>
>> On Mar 5, 2009, at 12:39 PM, Bradley Settlemyer wrote:
>>
>>> Scott and Rob,
>>>
>>>  PAV is the pvfs autovolume service; it allows me to start pvfs for a
>>> job on the compute nodes I've scheduled.  Effectively, it's a remote
>>> configuration tool that takes a config file and configures and starts
>>> the pvfs servers on a subset of my job's nodes.
>>>
>>> Additional requested info . . .
>>> MX version:
>>> [brad...@node0394:bradles-pav:1009]$ mx_info
>>> MX Version: 1.2.7
>>> MX Build: w...@node0002:/home/wolf/rpm/BUILD/mx-1.2.7 Wed Dec  3
>>> 09:21:26 EST 2008
>>> 1 Myrinet board installed.
>>> The MX driver is configured to support a maximum of:
>>>        16 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
>>> ===================================================================
>>> Instance #0:  313.6 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
>>>        Status:         Running, P0: Link Up
>>>        Network:        Myrinet 10G
>>>
>>>        MAC Address:    00:60:dd:47:23:4e
>>>        Product code:   10G-PCIE-8A-C
>>>        Part number:    09-03327
>>>        Serial number:  338892
>>>        Mapper:         00:60:dd:47:21:dd, version = 0x00000063, configured
>>>        Mapped hosts:   772
>>>
>>>
>>> PVFS2 is version 2.7.1, built with MX turned on and TCP turned off.
>>> I can copy files out of the file system, but writing to the file
>>> system is precarious.  Data gets written, but the write seems to hang
>>> or something.  Here is my job output using mpi-io-test:
>>>
>>> time -p mpiexec -n 2 -npernode 1
>>> /home/bradles/software/anl-io-test/bin/anl-io-test-mx -f
>>> pvfs2:/tmp/bradles-pav/mount/anl-io-data
>>> # Using mpi-io calls.
>>> [E 12:21:32.047891] job_time_mgr_expire: job time out: cancelling bmi
>>> operation, job_id: 3.
>>> [E 12:21:32.058035] msgpair failed, will retry: Operation cancelled
>>> (possibly due to timeout)
>>> [E 12:26:32.217723] job_time_mgr_expire: job time out: cancelling bmi
>>> operation, job_id: 56.
>>> [E 12:26:32.227774] msgpair failed, will retry: Operation cancelled
>>> (possibly due to timeout)
>>> =>> PBS: job killed: walltime 610 exceeded limit 600
>>>
>>> This is writing 32 MB into a file.  The data all seems to be there
>>> (the file size is 33554432), but the writes apparently never return.
>>> I don't know how to diagnose what the problem is.  Any help is much
>>> appreciated.
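>>>
>>> (The file size can be double-checked with something along these lines,
>>> assuming pvfs2-ls resolves the PAV volume through the pvfs2tab:)
>>>
>>> pvfs2-ls -l /tmp/bradles-pav/mount/anl-io-data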
>>>
>>> Thanks
>>> Brad
>>>
>>>
>>>
>>> On Thu, Mar 5, 2009 at 9:41 AM, Scott Atchley <[email protected]> wrote:
>>>>
>>>> On Mar 5, 2009, at 8:46 AM, Robert Latham wrote:
>>>>
>>>>> On Wed, Mar 04, 2009 at 07:15:24PM -0500, Bradley Settlemyer wrote:
>>>>>>
>>>>>> Hello
>>>>>>
>>>>>>  I am trying to use PAV to run pvfs with the MX protocol.  I've
>>>>>> updated pav so that the servers start and ping correctly.  But when I
>>>>>> try to run an MPI code, I'm getting client timeouts, as if the client
>>>>>> cannot contact the servers:
>>>>>>
>>>>>> Lots of this stuff:
>>>>>>
>>>>>> [E 19:11:02.573509] job_time_mgr_expire: job time out: cancelling bmi
>>>>>> operation, job_id: 3.
>>>>>> [E 19:11:02.583659] msgpair failed, will retry: Operation cancelled
>>>>>> (possibly due to timeout)
>>>>
>>>> Brad, which version of MX and PVFS2?
>>>>
>>>>> OK, so pvfs utilities are all hunky-dory? not just pvfs2-ping but
>>>>> pvfs2-cp and pvfs2-ls?
>>>>>
>>>>> On Jazz, I usually configure MPICH2 to communicate over TCP and have
>>>>> the PVFS system interface communicate over MX.  This keeps the
>>>>> situation fairly simple, but of course you get awful MPI performance.
>>>>>
>>>>> Does MX still have the "ports" restriction that GM has?  I wonder if
>>>>> MPI communication is getting in the way of PVFS communication...
>>>>>
>>>>> In short, I don't exactly know what's wrong myself.  Just tossing out
>>>>> some theories.
>>>>>
>>>>> ==rob
>>>>
>>>> Rob, MX is limited to 8 endpoints per NIC. One can use mx_info to get the
>>>> number:
>>>>
>>>> 8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
>>>>
>>>> This can be increased to 16 with a module parameter.
>>>>
>>>> Generally, you want no more than one endpoint per process and one
>>>> process per core for MPI. When you want to use MPI-IO over PVFS2, each
>>>> process will need two endpoints (one for MPI and one for PVFS2). If you
>>>> have eight cores, you should increase the max endpoints to 16.
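>>>>
>>>> For example, spelling out the budget:
>>>>
>>>>   8 procs/node x (1 MPI endpoint + 1 PVFS2 endpoint) = 16 endpoints per NIC
>>>>
>>>> which the default limit of 8 cannot cover.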
>>>>
>>>> Generally, I would not want to limit my MPI to TCP and my IO to MX,
>>>> especially if my TCP is over gigabit Ethernet. Unless your IO can
>>>> exceed the link rate, there will be plenty of bandwidth left over for
>>>> MPI, and your latency will stay much lower than with TCP.
>>>>
>>>> What is PAV?
>>>>
>>>> Scott
>>>>
>>>
>>
>>
>

_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
