On Wed, Oct 22, 2008 at 07:06:17PM -0500, Mohamad Chaarawi wrote:
> Hey all,
> 
> I have successfully configured and installed PVFS2 on our cluster. I
> managed to get the pvfs2 servers and clients running properly. The mount
> point is set fine, and I can create/delete files properly.
> Operating System: OpenSuSe 11.0
> 
> OpenMPI (trunk) configured with:
>     ./configure CFLAGS=-I/opt/pvfs2-2.7.1/include/
> LDFLAGS=-L/opt/pvfs2-2.7.1/lib/ LIBS="-lpvfs2 -lpthread"
> --prefix=/home/mschaara/OMPI-PVFS2 --with-openib=/usr
> --with-slurm=/opt/SLURM
> --with-io-romio-flags=--with-file-system=pvfs2+ufs+nfs

As far as PVFS is concerned, OMPI and MPICH2 do the same things: both
are based on ROMIO.

> pvfs-2.7.1:
>     ./configure --with-kernel=/usr/src/linux-2.6.25.11/
> --prefix=/opt/pvfs2-2.7.1 --enable-shared
> 
> However, when I run an MPI program that opens a PVFS2 file and does a
> write_all, one of the PVFS2 servers crashes. I attached the test file I'm
> running (test_write_all.c). If I run the test with 1, 2, or 3 processes, it
> gives the correct output. However, with more than 3 processes it gives the
> following error:

How many servers do you have running in this test?

> When I log in to the node (shark07), the server is no longer running. If I
> start the server again on that node, pvfs2 is fine again (verified with
> pvfs2-ping).
> I saw this in the pvfs2-server.log:
>     [E 10/22 18:55] src/common/misc/state-machine-fns.c line 289: Error:
> state machine returned SM_ACTION_TERMINATE but didn't reach terminate
>     [E 10/22 18:55]         [bt]
> /opt/pvfs2-2.7.1/sbin/pvfs2-server(PINT_state_machine_next+0x1d5)
> [0x41f1b5]
>     [E 10/22 18:55]         [bt]
> /opt/pvfs2-2.7.1/sbin/pvfs2-server(PINT_state_machine_continue+0x1e)
> [0x41ec0e]
>     [E 10/22 18:55]         [bt]
> /opt/pvfs2-2.7.1/sbin/pvfs2-server(main+0xe3e) [0x4122be]
>     [E 10/22 18:55]         [bt] /lib64/libc.so.6(__libc_start_main+0xe6)
> [0x7f4640020436]
>     [E 10/22 18:55]         [bt] /opt/pvfs2-2.7.1/sbin/pvfs2-server
> [0x40f939]
>     [D 10/22 18:55] server_state_machine_terminate 0x7881b0
> 
> and this in /var/log/messages:
>     shark07 kernel: pvfs2-server[14842]: segfault at 7f6ae09c7ec0 ip
> 7f6ae09c7ec0 sp 7fffea083628 error 15 in
> libgcc_s.so.1[7f6ae09c7000+1000]
> 
> So, any idea what might be wrong with my configuration of pvfs2 or OMPI?
> Or might this be a bug somewhere?

I have two thoughts: 

- Your backtrace shows you linked with /lib64, and you're running
  OpenSuse.  I presume then that you're running in a bi-arch
  environment.  Could you have possibly built pvfs2-server as a 32-bit
  executable but ended up linking it with 64-bit libraries?  I have to
  confess that this theory is a bit of a long shot...

- When you built OpenMPI you might have compiled against some oddball
  pvfs2.h header file or linked with an incompatible libpvfs2.  Do you
  have any other pvfs installations on your system?  Are you sure?
  Check the configure output: was configure able to find pvfs2-config?
  Check your mpicc wrapper script: is it linking against the
  expected libpvfs2?
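Both checks above can be scripted.  A quick sketch (the install prefix
matches the one in your configure line, and `-showme:link` assumes
OpenMPI's wrapper; adjust paths for your setup):

```shell
#!/bin/sh
# Sanity checks for the two theories above; paths follow the original
# post's install prefix and may need adjusting.

BIN=/opt/pvfs2-2.7.1/sbin/pvfs2-server
if [ -x "$BIN" ]; then
    file "$BIN"                               # word size of the server binary
    ldd "$BIN" | grep -E 'libpvfs2|libgcc'    # libraries it actually loads
else
    echo "$BIN not present on this machine"
fi

# OpenMPI's wrapper can report its link line; look for the pvfs2 bits.
if command -v mpicc >/dev/null 2>&1; then
    mpicc -showme:link | tr ' ' '\n' | grep -i pvfs2 \
        || echo "no pvfs2 flags in mpicc link line"
else
    echo "mpicc not in PATH"
fi
```

If `file` reports "ELF 32-bit" but `ldd` resolves to /lib64, that would
confirm the mixed-architecture theory.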

I've run your test code on my (32-bit) laptop (4 procs, one server)
and on a 64-bit Ubuntu system (4 procs, 4 servers) and did not see a
segfault.  Thanks for sending along a test case, but I'm afraid I'm
not going to be able to help very much if I can't reproduce the crash
on my end.
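For list readers who don't have the attachment, the failing pattern is
roughly the following.  This is my own minimal sketch of a collective
write test, not Mohamad's actual test_write_all.c; the file path, buffer
size, and offsets are made up:

```c
/* Minimal collective-write sketch -- compile with mpicc, run under
 * mpiexec.  A reconstruction for discussion, not the attached test. */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    char buf[1024];
    MPI_File fh;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    memset(buf, 'a' + (rank % 26), sizeof(buf));

    /* The "pvfs2:" prefix forces ROMIO's pvfs2 driver; a plain path
     * also works when the file lives under the PVFS2 mount point. */
    MPI_File_open(MPI_COMM_WORLD, "pvfs2:/mnt/pvfs2/testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank collectively writes its own 1 KB block at a
     * rank-based offset. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * sizeof(buf),
                          buf, (int)sizeof(buf), MPI_CHAR, &status);

    MPI_File_close(&fh);
    if (rank == 0)
        printf("wrote %d x %zu bytes\n", nprocs, sizeof(buf));
    MPI_Finalize();
    return 0;
}
```

In the report above, runs with up to 3 processes succeed and the server
dies once a 4th writer joins the collective, which is why I asked about
the server count.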

Sometimes I get weird behavior when the PVFS + MPI + application
software stack gets out of sync.  The one other suggestion I can make
is to 'make clean' and rebuild everything, in case symbols from an
earlier iteration are somehow floating around (they shouldn't be, but
sometimes it happens).

==rob

-- 
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Lab, IL USA                 B29D F333 664A 4280 315B
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
