On Wed, Oct 22, 2008 at 07:06:17PM -0500, Mohamad Chaarawi wrote:
> Hey all,
>
> I have successfully configured and installed PVFS2 on our cluster. I
> managed to get the pvfs2 servers and clients running properly. The mount
> point is set fine, and i can create/delete files properly.
> Operating System: OpenSuSe 11.0
>
> OpenMPI (trunk) used configured with:
> ./configure CFLAGS=-I/opt/pvfs2-2.7.1/include/
> LDFLAGS=-L/opt/pvfs2-2.7.1/lib/ LIBS=-lpvfs2 -lpthread
> --prefix=/home/mschaara/OMPI-PVFS2 --with-openib=/usr
> --with-slurm=/opt/SLURM
> --with-io-romio-flags=--with-file-system=pvfs2+ufs+nfs

As far as PVFS is concerned, OMPI and MPICH2 do the same things: both
are based on ROMIO.

> pvfs-2.7.1:
> ./configure --with-kernel=/usr/src/linux-2.6.25.11/
> --prefix=/opt/pvfs2-2.7.1 --enable-shared
>
> However when i run an MPI program that open a PVFS2 file and Writes_all,
> one of the PVFS2 servers crashes. I attached the test file that im running
> (test_write_all.c). If i run the test file with 1,2,or 3 processes, it
> gives the correct output. However with more than 3 processes it gives the
> following error:

How many servers do you have running in this test?

> When i Login in to the node (shark07) the server would not be running, If
> is start the server again on that node, pvfs2 would be fine again (testing
> by pvfs2-ping).
> I saw this in the pvfs2-server.log:
> [E 10/22 18:55] src/common/misc/state-machine-fns.c line 289: Error:
> state machine returned SM_ACTION_TERMINATE but didn't reach terminate
> [E 10/22 18:55] [bt]
> /opt/pvfs2-2.7.1/sbin/pvfs2-server(PINT_state_machine_next+0x1d5)
> [0x41f1b5]
> [E 10/22 18:55] [bt]
> /opt/pvfs2-2.7.1/sbin/pvfs2-server(PINT_state_machine_continue+0x1e)
> [0x41ec0e]
> [E 10/22 18:55] [bt]
> /opt/pvfs2-2.7.1/sbin/pvfs2-server(main+0xe3e) [0x4122be]
> [E 10/22 18:55] [bt] /lib64/libc.so.6(__libc_start_main+0xe6)
> [0x7f4640020436]
> [E 10/22 18:55] [bt] /opt/pvfs2-2.7.1/sbin/pvfs2-server
> [0x40f939]
> [D 10/22 18:55] server_state_machine_terminate 0x7881b0
>
> and this in var/log/messages:
> shark07 kernel: pvfs2-server[14842]: segfault at 7f6ae09c7ec0 ip
> 7f6ae09c7ec0 sp 7fffea083628 error 15 in
> libgcc_s.so.1[7f6ae09c7000+1000]
>
> So any idea what might be wrong with my configuration on pvfs2, or OMPI?
> Or might be a bug somewhere?

I have two thoughts:

- Your backtrace shows you linked with /lib64, and you're running
  OpenSuse.  I presume then that you're running in a bi-arch
  environment.  Could you have possibly built pvfs2-server as a 32 bit
  executable but ended up linking it with 64 bit libraries?
  I have to confess that this theory is a bit of a longshot...

- When you built Open MPI you might have compiled against some oddball
  pvfs2.h header file or linked with an incompatible libpvfs2.  Do you
  have any other pvfs installations on your system?  Are you sure?
  Check the configure output: was configure able to find pvfs2-config?
  Check your mpicc wrapper script: is it including links to the
  expected libpvfs2?

I've run your test code on my (32 bit) laptop (4 procs, one server)
and on a 64 bit Ubuntu system (4 procs, 4 servers) and did not see a
segfault.  Thanks for sending along a testcase, but I'm afraid I'm not
going to be able to help very much if I can't reproduce the crash on
my end.

Sometimes I get weird behavior when the PVFS + MPI + application
software stack gets out of sync: the one other suggestion I can make
is to 'make clean' and rebuild everything, in case symbols from an
earlier iteration are somehow floating around (they shouldn't be, but
sometimes it happens).

==rob

-- 
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Lab, IL USA                 B29D F333 664A 4280 315B
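
P.S. Regarding the bi-arch theory above, one quick way to rule it in or out is to look at the ELF class byte of the server binary and of the libraries it loads. The following is only a rough sketch, not anything that ships with PVFS: the function name elf_class is made up for illustration, and it reads nothing beyond the first five bytes of the file (the ELF magic plus the EI_CLASS byte, where 1 means 32-bit and 2 means 64-bit).

```python
import sys

ELF_MAGIC = b"\x7fELF"
# EI_CLASS values from the ELF header: 1 = ELFCLASS32, 2 = ELFCLASS64
ELF_CLASSES = {1: "32-bit", 2: "64-bit"}


def elf_class(path):
    """Hypothetical helper: return '32-bit' or '64-bit' for an ELF file,
    'unknown' for an unrecognized class byte, or None if not ELF."""
    with open(path, "rb") as f:
        ident = f.read(5)
    if len(ident) < 5 or ident[:4] != ELF_MAGIC:
        return None
    return ELF_CLASSES.get(ident[4], "unknown")


if __name__ == "__main__":
    # Print the class of every file named on the command line.
    for path in sys.argv[1:]:
        print(path, elf_class(path) or "not an ELF file")
```

Running it on /opt/pvfs2-2.7.1/sbin/pvfs2-server and on the shared libraries under /opt/pvfs2-2.7.1/lib/ should report the same class for all of them; a mix of 32-bit and 64-bit would support the theory. The standard file(1) utility reports the same information if it is installed.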
