[OMPI users] orted seg fault when using MPI_Comm_spawn on more than one host
Hi,

I am using OpenMPI 1.8.4 on one Ubuntu 14.04 machine and five Ubuntu 12.04 machines. I am using ssh to launch MPI jobs, and I'm able to run simple programs like 'mpirun -np 8 --host localhost,pachy1 hostname' and get the expected output (pachy1 being an entry in my /etc/hosts file).

I started using MPI_Comm_spawn in my app with the intent of NOT calling mpirun to launch the program that calls MPI_Comm_spawn (my attempt at using the singleton MPI_INIT pattern described in 10.5.2 of the MPI 3.0 standard). The app needs to launch an MPI job of a given size from a given hostfile, and the job needs to report some info back to the app, so MPI_Comm_spawn seemed like my best bet. The app is only rarely going to be used this way, which is why mpirun is not being used to launch the app that is the parent in the MPI_Comm_spawn operation.

This pattern works fine if the only entries in the hostfile are 'localhost'. However, if I add a host that isn't local, I get a segmentation fault from the orted process. I distilled my example down as small as I could; I've attached the C code of the master and the hostfile I'm using. Here's the output:

evan@lasarti:~/devel/toy_progs/mpi_spawn$ ./master ~/mpi/test_distributed.hostfile
[lasarti:32020] [[21014,1],0] FORKING HNP: orted --hnp --set-sid --report-uri 14 --singleton-died-pipe 15 -mca state_novm_select 1 -mca ess_base_jobid 1377173504
[lasarti:32022] *** Process received signal ***
[lasarti:32022] Signal: Segmentation fault (11)
[lasarti:32022] Signal code: Address not mapped (1)
[lasarti:32022] Failing at address: (nil)
[lasarti:32022] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7f07af039340]
[lasarti:32022] [ 1] /opt/openmpi-1.8.4/lib/libopen-pal.so.6(opal_hwloc191_hwloc_get_obj_by_depth+0x32)[0x7f07aea227c2]
[lasarti:32022] [ 2] /opt/openmpi-1.8.4/lib/libopen-pal.so.6(opal_hwloc_base_get_nbobjs_by_type+0x90)[0x7f07ae9f5430]
[lasarti:32022] [ 3] /opt/openmpi-1.8.4/lib/openmpi/mca_rmaps_round_robin.so(orte_rmaps_rr_byobj+0x134)[0x7f07ab2fb154]
[lasarti:32022] [ 4] /opt/openmpi-1.8.4/lib/openmpi/mca_rmaps_round_robin.so(+0x12c6)[0x7f07ab2fa2c6]
[lasarti:32022] [ 5] /opt/openmpi-1.8.4/lib/libopen-rte.so.7(orte_rmaps_base_map_job+0x21a)[0x7f07af299f7a]
[lasarti:32022] [ 6] /opt/openmpi-1.8.4/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0x6e4)[0x7f07ae9e7034]
[lasarti:32022] [ 7] /opt/openmpi-1.8.4/lib/libopen-rte.so.7(orte_daemon+0xdff)[0x7f07af27a86f]
[lasarti:32022] [ 8] orted(main+0x47)[0x400877]
[lasarti:32022] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f07aec84ec5]
[lasarti:32022] [10] orted[0x4008cb]
[lasarti:32022] *** End of error message ***

If I launch 'master.c' using mpirun, I don't get a segmentation fault, but it doesn't seem to launch the processes on anything more than localhost, no matter what hostfile I give it.

For what it's worth, I fully expected to debug some path issues regarding the binary I wanted to launch with MPI_Comm_spawn when I used this distributed, but this error at first glance doesn't appear to have anything to do with that. I'm sure this is something silly I'm doing wrong, but I don't really know how to debug this further given this error.

Evan

P.S. Only including the zipped config.log since the "ompi_info -v ompi full --parsable" command I got from http://www.open-mpi.org/community/help/ doesn't seem to work anymore.
master.c:

#include "mpi.h"
#include <assert.h>

int main(int argc, char **argv)
{
    int rc;

    MPI_Init(&argc, &argv);

    MPI_Info the_info;
    rc = MPI_Info_create(&the_info);
    assert(rc == MPI_SUCCESS);

    // I tried both (with appropriately different argv[1])...same result.
#if 1
    rc = MPI_Info_set(the_info, "hostfile", argv[1]);
    assert(rc == MPI_SUCCESS);
#else
    rc = MPI_Info_set(the_info, "host", argv[1]);
    assert(rc == MPI_SUCCESS);
#endif

    MPI_Comm the_group;
    rc = MPI_Comm_spawn("hostname", MPI_ARGV_NULL, 8, the_info, 0,
                        MPI_COMM_WORLD, &the_group, MPI_ERRCODES_IGNORE);
    assert(rc == MPI_SUCCESS);

    MPI_Finalize();
    return 0;
}

test_distributed.hostfile:
localhost
pachy1

Attachment: config.log.tar.bz2
Re: [OMPI users] orted seg fault when using MPI_Comm_spawn on more than one host
Hi Ralph,

Good to know you've reproduced it. I was experiencing this using both the hostfile and host key. A simple comm_spawn was working for me as well, but it was only launching locally, and I'm pretty sure each node only has 4 slots given past behavior (the mpirun -np 8 example I gave in my first email launches on both hosts). Is there a way to specify the hosts I want to launch on without the hostfile or host key so I can test remote launch?

And to the "hostname" response...no wonder it was hanging! I just constructed that as a basic example. In my real use I'm launching something that calls MPI_Init.

Evan
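For reference, the child side of my real use is essentially a program like the one below. This is just a minimal sketch of the pattern, not my actual code; the point is that the spawned processes call MPI_Init and then talk back to the parent over the intercommunicator returned by MPI_Comm_get_parent.

/* child.c - minimal sketch of a spawned worker (hypothetical example) */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank = 0;
    MPI_Comm parent;

    MPI_Init(&argc, &argv);              /* spawned processes must call this */
    MPI_Comm_get_parent(&parent);        /* intercommunicator back to the spawning app */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (parent != MPI_COMM_NULL) {
        /* report something back to the parent, e.g. our rank */
        MPI_Send(&rank, 1, MPI_INT, 0, 0, parent);
    }

    MPI_Finalize();
    return 0;
}

The parent would post a matching MPI_Recv on the intercommunicator returned by MPI_Comm_spawn to collect whatever the children report.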
Re: [OMPI users] orted seg fault when using MPI_Comm_spawn on more than one host
Setting these environment variables did indeed change the way mpirun maps things, and I didn't have to specify a hostfile. However, setting these for my MPI_Comm_spawn code still resulted in the same segmentation fault.

Evan

On Tue, Feb 3, 2015 at 10:09 AM, Ralph Castain wrote:
> If you add the following to your environment, you should run on multiple nodes:
>
> OMPI_MCA_rmaps_base_mapping_policy=node
> OMPI_MCA_orte_default_hostfile=
>
> The first tells OMPI to map-by node. The second passes in your default hostfile so you don't need to specify it as an Info key.
>
> HTH
> Ralph
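For the singleton case (no mpirun), I exported the same two variables in the shell before running ./master. The sketch below is a rough in-code equivalent, under the assumption that Open MPI picks up OMPI_MCA_* variables from the process environment during init; the hostfile path is just a placeholder.

/* Sketch: setting Ralph's suggested MCA parameters from inside the singleton.
 * Assumes Open MPI reads OMPI_MCA_* from the environment at init time.
 * The hostfile path below is a placeholder. */
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    setenv("OMPI_MCA_rmaps_base_mapping_policy", "node", 1);
    setenv("OMPI_MCA_orte_default_hostfile",
           "/home/evan/mpi/test_distributed.hostfile", 1);

    MPI_Init(&argc, &argv);
    /* ... MPI_Comm_spawn as before ... */
    MPI_Finalize();
    return 0;
}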
Re: [OMPI users] orted seg fault when using MPI_Comm_spawn on more than one host
Yes, I did. I replaced the info argument of MPI_Comm_spawn with MPI_INFO_NULL.

On Tue, Feb 3, 2015 at 5:54 PM, Ralph Castain wrote:
> When running your comm_spawn code, did you remove the Info key code? You wouldn't need to provide a hostfile or hosts any more, which is why it should resolve that problem.
>
> I agree that providing either hostfile or host as an Info key will cause the program to segfault - I'm working on that issue.
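For completeness, here is roughly what the modified master looks like with the Info key code gone. This is a sketch, not the exact code; "worker" is a stand-in name for the real child binary.

#include <assert.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    MPI_Comm intercomm;

    MPI_Init(&argc, &argv);

    /* No hostfile/host Info keys; mapping now comes from the OMPI_MCA_*
     * environment variables. "worker" is a placeholder binary name. */
    int rc = MPI_Comm_spawn("worker", MPI_ARGV_NULL, 8, MPI_INFO_NULL,
                            0, MPI_COMM_WORLD, &intercomm,
                            MPI_ERRCODES_IGNORE);
    assert(rc == MPI_SUCCESS);

    MPI_Finalize();
    return 0;
}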
Re: [OMPI users] orted seg fault when using MPI_Comm_spawn on more than one host
Indeed, I simply commented out all the MPI_Info stuff, which you essentially did by passing a dummy argument. I'm still not able to get it to succeed.

So here we go, my results defy logic. I'm sure this could be my fault...I've only been an occasional user of OpenMPI and MPI in general over the years, and I've never used MPI_Comm_spawn before this project. I tested simple_spawn like so:

mpicc simple_spawn.c -o simple_spawn
./simple_spawn

When my default hostfile points to a file that just lists localhost, this test completes successfully. If it points to my hostfile with localhost and 5 remote hosts, here's the output:

evan@lasarti:~/devel/toy_progs/mpi_spawn$ mpicc simple_spawn.c -o simple_spawn
evan@lasarti:~/devel/toy_progs/mpi_spawn$ ./simple_spawn
[pid 5703] starting up!
0 completed MPI_Init
Parent [pid 5703] about to spawn!
[lasarti:05703] [[14661,1],0] FORKING HNP: orted --hnp --set-sid --report-uri 14 --singleton-died-pipe 15 -mca state_novm_select 1 -mca ess_base_jobid 960823296
[lasarti:05705] *** Process received signal ***
[lasarti:05705] Signal: Segmentation fault (11)
[lasarti:05705] Signal code: Address not mapped (1)
[lasarti:05705] Failing at address: (nil)
[lasarti:05705] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7fc185dcf340]
[lasarti:05705] [ 1] /opt/openmpi-v1.8.4-54-g07f735a/lib/libopen-rte.so.7(orte_rmaps_base_compute_bindings+0x650)[0x7fc186033bb0]
[lasarti:05705] [ 2] /opt/openmpi-v1.8.4-54-g07f735a/lib/libopen-rte.so.7(orte_rmaps_base_map_job+0x939)[0x7fc18602fb99]
[lasarti:05705] [ 3] /opt/openmpi-v1.8.4-54-g07f735a/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0x6e4)[0x7fc18577dcc4]
[lasarti:05705] [ 4] /opt/openmpi-v1.8.4-54-g07f735a/lib/libopen-rte.so.7(orte_daemon+0xdf8)[0x7fc186010438]
[lasarti:05705] [ 5] orted(main+0x47)[0x400887]
[lasarti:05705] [ 6] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fc185a1aec5]
[lasarti:05705] [ 7] orted[0x4008db]
[lasarti:05705] *** End of error message ***

You can see from the message that this particular run IS from the latest snapshot, though the failure happens on v1.8.4 as well. I didn't bother installing the snapshot on the remote nodes, though. Should I do that? It looked to me like this error happened well before we got to a remote node, so that's why I didn't.

Your thoughts?

Evan

On Tue, Feb 3, 2015 at 7:40 PM, Ralph Castain wrote:
> I confess I am sorely puzzled. I replaced the Info key with MPI_INFO_NULL, but still had to pass a bogus argument to master since you still have the Info_set code in there - otherwise, info_set segfaults due to a NULL argv[1]. Doing that (and replacing "hostname" with an MPI example code) makes everything work just fine.
>
> I've attached one of our example comm_spawn codes that we test against - it also works fine with the current head of the 1.8 code base. I confess that some changes have been made since 1.8.4 was released, and it is entirely possible that this was a problem in 1.8.4 and has since been fixed.
>
> So I'd suggest trying with the nightly 1.8 tarball and seeing if it works for you. You can download it from here:
>
> http://www.open-mpi.org/nightly/v1.8/
>
> HTH
> Ralph
Re: [OMPI users] orted seg fault when using MPI_Comm_spawn on more than one host
Hi Ralph,

Thanks for addressing this issue. I tried downloading your fork from that pull request, and the seg fault appears to be gone. However, I didn't install it on my remote machine before testing, and I got this error:

bash: /opt/ompi-release-cmr-singlespawn/bin/orted: No such file or directory

(along with the usual complaints about ORTE not being able to start one of the daemons). On both machines I have openmpi installed to a directory in /opt, and /opt/openmpi is a symlink to whatever installation I want to use...then my paths point to the symlink. I went to the remote machine and simply changed the name of the directory to match the other one, and I just got a version mismatch error...a much more expected error.

I'm not familiar with the OMPI source, but does this have to do with the prefix issue you mentioned in the pull request? Should it handle symlinks? Apologies if I'm misguided.

Evan

On Thu, Feb 5, 2015 at 9:51 AM, Ralph Castain wrote:
> Okay, I tracked this down - thanks for your patience! I have a fix pending review. You can track it here:
>
> https://github.com/open-mpi/ompi-release/pull/179
[OMPI users] Problems Using PVFS2 with OpenMPI
I am unable to use PVFS2 with OpenMPI in a simple test program. My configuration is given below. I'm running on RHEL5 with GigE (probably not important).

OpenMPI 1.4 (had same issue with 1.3.3) is configured with

./configure --prefix=/work/rd/evan/archives/openmpi/openmpi/1.4/enable_pvfs \
  --enable-mpi-threads --with-io-romio-flags="--with-filesystems=pvfs2+ufs+nfs"

PVFS 2.8.1 is configured to install in the default location (/usr/local) with

./configure --with-mpi=/work/rd/evan/archives/openmpi/openmpi/1.4/enable_pvfs

I build and install these (in this order) and set up my PVFS2 space using the instructions at pvfs.org. I am able to use this space using the /usr/local/bin/pvfs2-ls types of commands. I am simply running a 2-server config (2 data servers, and the same 2 hosts are metadata servers). As I say, manually, this all seems fine (even when I'm not root). It may be relevant that I am *not* using the kernel interface for PVFS2, as I am just trying to get a better understanding of how this works.

It is perhaps relevant that I have not had to explicitly tell OpenMPI where I installed PVFS. I have told PVFS where I installed OpenMPI, though. This does seem slightly odd, but there does not appear to be a way of telling OpenMPI this information. Perhaps it is not needed.

In any event, I then build my test program against this OpenMPI, and in that program I have the following call sequence (i is 0, and mntPoint is the path to my pvfs2 mount point -- I also tried prefixing "pvfs2:" to the front of this as I read somewhere that that was optional).

sprintf(aname, "%s/%d.fdm", mntPoint, i);
for(int j = 0; j < numFloats; j++)
    buf[j] = (float)i;

int retval = MPI_SUCCESS;
if(MPI_SUCCESS == (retval = MPI_File_open(MPI_COMM_SELF, aname,
                                          MPI_MODE_RDWR|MPI_MODE_CREATE|MPI_MODE_UNIQUE_OPEN,
                                          MPI_INFO_NULL, &fh)))
{
    MPI_File_write(fh, (void*)buf, numFloats, MPI_FLOAT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}
else
{
    int errBufferLen;
    char errBuffer[MPI_MAX_ERROR_STRING];
    MPI_Error_string(retval, errBuffer, &errBufferLen);
    fprintf(stdout, "%d: open error on %s with code %s\n", rank, aname, errBuffer);
}

This will only execute on one of my ranks (the way I'm running it). No matter what I try, the MPI_File_open call fails with an MPI_ERR_ACCESS error code. This suggests a permission problem, but I am able to manually cp and rm from the pvfs2 space without problem, so I am not at all clear on what the permission problem is. My access flags look fine to me (the MPI_MODE_UNIQUE_OPEN flag makes no difference in this case as I'm only opening a single file anyway). If I write this file to shared NFS storage, all is "fine" (obviously, I do not consider that a permanent solution, though).

Does anyone have any idea why this is not working? Alternately or in addition, does anyone have step-by-step instructions for how to build and set up PVFS2 with OpenMPI, as well as an example program? This is the first time I've attempted this, so I may well be doing something wrong.

Thanks in advance,
Evan
Re: [OMPI users] Problems Using PVFS2 with OpenMPI
I had been using an older variant of the needed flag for building romio (because the newer one was failing, as the preceding suggests). I made this change and built with the correct romio flag. I next needed to fix the way pvfs2 builds so that it uses -fPIC. Interestingly, about 95% of pvfs2 builds with this flag by default, but the final 5% does not. It needs to. With that fixed, built, and installed, I was able to rebuild openmpi correctly. My test program now works like a charm.

I will give the *precise* steps I needed to build pvfs2 2.8.1 with openmpi 1.4 here for the record...

1. Determine where openmpi will be installed. I'm not certain that it needs to actually be installed there for this to work. If so, you will need to install openmpi twice. The first time, it clearly need not be built entirely correctly for pvfs2 (it can't be, because step 2 is a prerequisite for that), but probably building something without the "--with-io-romio-flags=..." should do, if this actually must be installed at all. I'm betting it is not required, but as I say, I have not verified this. It certainly works if it has been pre-installed as I just indicated.

2. Build pvfs2 correctly (I get conflicting info on whether the "--with-mpi=..." is needed, but FWIW, this is how I built it; it installs into /usr/local, which is its default location):

cd <pvfs2 source dir>
setenv CFLAGS -fPIC
./configure --with-mpi=/work/rd/evan/archives/openmpi/openmpi/1.4/enable_pvfs \
  --enable-verbose-build
make all
make install
exit

3. Build openmpi correctly. This is straightforward at this point. Also, the --enable-mpi-threads is not required for pvfs2 to work, but I happen to also want this flag:

cd <openmpi source dir>
./configure --prefix=/work/rd/evan/archives/openmpi/openmpi/1.4/enable_pvfs \
  --enable-mpi-threads --with-io-romio-flags="--with-file-system=pvfs2+ufs+nfs"
make all
make install
exit

... and that's it. Hopefully, the next person who needs to figure this out will be helped by these instructions.

Evan

This seems to have done the trick.

Edgar Gabriel wrote:
> I don't know whether it's relevant for this problem or not, but a couple of weeks ago we also found that we had to apply the following patch to compile ROMIO with OpenMPI over pvfs2. There is an additional header pvfs2-compat.h included in the ROMIO version of MPICH, but it is somehow missing in the OpenMPI version.
>
> --- a/ompi/mca/io/romio/romio/adio/ad_pvfs2/ad_pvfs2.h Thu Sep 03 11:55:51 2009 -0500
> +++ b/ompi/mca/io/romio/romio/adio/ad_pvfs2/ad_pvfs2.h Mon Sep 21 10:16:27 2009 -0500
> @@ -11,6 +11,10 @@
>  #include "adio.h"
>  #ifdef HAVE_PVFS2_H
>  #include "pvfs2.h"
> +#endif
> +
> +#ifdef PVFS2_VERSION_MAJOR
> +#include "pvfs2-compat.h"
>  #endif
>
> Thanks
> Edgar

Rob Latham wrote:
> On Tue, Jan 12, 2010 at 02:15:54PM -0800, Evan Smyth wrote:
>> OpenMPI 1.4 (had same issue with 1.3.3) is configured with
>> ./configure --prefix=/work/rd/evan/archives/openmpi/openmpi/1.4/enable_pvfs \
>>   --enable-mpi-threads --with-io-romio-flags="--with-filesystems=pvfs2+ufs+nfs"
>>
>> PVFS 2.8.1 is configured to install in the default location (/usr/local) with
>> ./configure --with-mpi=/work/rd/evan/archives/openmpi/openmpi/1.4/enable_pvfs
>
> In addition to Jeff's request for the build logs, do you have 'pvfs2-config' in your path?
>
>> I build and install these (in this order) and setup my PVFS2 space using instructions at pvfs.org. I am able to use this space using the /usr/local/bin/pvfs2-ls types of commands. I am simply running a 2-server config (2 data servers and the same 2 hosts are metadata servers). As I say, manually, this all seems fine (even when I'm not root). It may be relevant that I am *not* using the kernel interface for PVFS2 as I am just trying to get a better understanding of how this works.
>
> That's a good piece of information. I run in that configuration often, so we should be able to make this work.
>
>> It is perhaps relevant that I have not had to explicitly tell OpenMPI where I installed PVFS. I have told PVFS where I installed OpenMPI, though. This does seem slightly odd but there does not appear to be a way of telling OpenMPI this information. Perhaps it is not needed.
>
> PVFS needs an MPI library only to build MPI-based testcases. The servers, client libraries, and utilities do not use MPI.
>
>> In any event, I then build my test program against this OpenMPI and in that program I have the following call sequence (i is 0 and where mntPoint is the path to my pvfs2 mount point -- I also tried prefixing a "pvfs2:" in the front of this as I read somewhere that that was optional).
>
> In this case, since you do not have the PVFS file system mounted, the 'pvfs2:' prefix is mandatory. Otherwise, the MPI-IO library will try to look for a directory that does not exist.
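For anyone finding this later, the application-side fix amounts to just prefixing the file name. Below is a minimal sketch under the assumption that the rest of my test program is unchanged; "/mnt/pvfs2" is a placeholder for the actual PVFS2 path from my pvfs2tab.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    char aname[256];
    MPI_File fh;

    MPI_Init(&argc, &argv);

    /* The file system is not kernel-mounted, so the "pvfs2:" prefix tells
     * ROMIO to resolve the path through the PVFS2 library (via pvfs2tab).
     * "/mnt/pvfs2" is a placeholder for my actual PVFS2 path. */
    snprintf(aname, sizeof(aname), "pvfs2:%s/%d.fdm", "/mnt/pvfs2", 0);

    int retval = MPI_File_open(MPI_COMM_SELF, aname,
                               MPI_MODE_RDWR | MPI_MODE_CREATE | MPI_MODE_UNIQUE_OPEN,
                               MPI_INFO_NULL, &fh);
    if (retval == MPI_SUCCESS)
        MPI_File_close(&fh);
    else
        fprintf(stderr, "open failed on %s\n", aname);

    MPI_Finalize();
    return 0;
}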
[OMPI users] openmpi equivalent to mpich serv_p4 daemon
I had been using MPICH and its serv_p4 daemon to speed startup times. I've decided to try OpenMPI (primarily for the fault-tolerance features) and would like to know what the equivalent of the serv_p4 daemon is. It appears as though the orted daemon may be what I am after, but I don't quite understand it.

I used to run serv_p4 with a specific port number and then pass a -p4ssport flag to mpirun. The daemon would remain running on each node, and each new mpirun job would simply communicate directly through a port with the already running instance of the daemon on that machine, which saved mpirun from having to launch an rsh. This was great for reducing startup and run times due to rsh issues.

The orted daemon does support a -persistent flag which seems relevant, but I cannot find a real usage example. I expect that most of the readers will find this to be a trivial problem, but I'm hoping someone can give me an openmpi equivalent usage example.

Thanks in advance,
Evan

--
Evan Smyth
e...@dreamworks.com
Dreamworks Animation
818.695.4105, Riverside 146