Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-07-28 Thread Edgar Gabriel
hm, this actually looks correct. The question now is basically why the intermediate hand-shake performed by the processes with rank 0 on the inter-communicator is not finishing. I am wondering whether this could be related to a problem reported in another thread (Processes stuck after MPI_Waitall() in

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-07-28 Thread Grzegorz Maj
I've attached gdb to the client which has just connected to the grid. Its bt is almost exactly the same as the server's one:
#0  0x428066d7 in sched_yield () from /lib/libc.so.6
#1  0x00933cbf in opal_progress () at ../../opal/runtime/opal_progress.c:220
#2  0x00d460b8 in opal_condition_wait

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-07-27 Thread Edgar Gabriel
based on your output shown here, there is absolutely nothing wrong (yet). Both processes are in the same function and do what they are supposed to do. However, I am fairly sure that the client process whose bt you show is already part of current_intracomm. Could you try to create a bt of the

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-07-27 Thread Ralph Castain
This slides outside of my purview - I would suggest you post this question with a different subject line specifically mentioning failure of intercomm_merge to work so it attracts the attention of those with knowledge of that area. On Jul 27, 2010, at 9:30 AM, Grzegorz Maj wrote: > So now I

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-07-27 Thread Grzegorz Maj
So now I have a new question. When I run my server and a lot of clients on the same machine, everything looks fine. But when I try to run the clients on several machines, the most frequent scenario is:
* server is started on machine A
* X (= 1, 4, 10, ..) clients are started on machine B and they

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-07-26 Thread Ralph Castain
No problem at all - glad it works! On Jul 26, 2010, at 7:58 AM, Grzegorz Maj wrote: > Hi, > I'm very sorry, but the problem was on my side. My installation > process was not always taking the newest sources of openmpi. In this > case it hasn't installed the version with the latest patch. Now I >

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-07-26 Thread Grzegorz Maj
Hi, I'm very sorry, but the problem was on my side. My installation process was not always taking the newest sources of openmpi, so in this case it hadn't installed the version with the latest patch. Now I think everything works fine - I could run over 130 processes with no problems. I'm sorry again

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-07-21 Thread Ralph Castain
We're having some problems replicating this once my patches are applied. Can you send us your configure cmd? Just the output from "head config.log" will do for now. Thanks! On Jul 20, 2010, at 9:09 AM, Grzegorz Maj wrote: > My start script looks almost exactly the same as the one published by

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-07-20 Thread Grzegorz Maj
My start script looks almost exactly the same as the one published by Edgar, i.e. the processes are starting one by one with no delay. 2010/7/20 Ralph Castain : > Grzegorz: something occurred to me. When you start all these processes, how > are you staggering their wireup? Are

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-07-20 Thread Ralph Castain
Grzegorz: something occurred to me. When you start all these processes, how are you staggering their wireup? Are they flooding us, or are you time-shifting them a little? On Jul 19, 2010, at 10:32 AM, Edgar Gabriel wrote: > Hm, so I am not sure how to approach this. First of all, the test
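A crude way to time-shift the clients is a per-client delay before the connect call. Below is a minimal sketch under stated assumptions: CLIENT_ID is a hypothetical environment variable carrying the client's index, and port.txt is an arbitrary file through which the server's port string is assumed to be distributed; neither comes from this thread.

```c
/* Sketch of a client that staggers its connection attempt.
 * CLIENT_ID and port.txt are illustrative assumptions, not part of the
 * setup actually used in this thread. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const char *id = getenv("CLIENT_ID");
    sleep(id ? (unsigned)atoi(id) : 0);   /* client k connects ~k seconds later */

    char port[MPI_MAX_PORT_NAME] = "";
    FILE *f = fopen("port.txt", "r");     /* server's port name, published out of band */
    if (!f || !fgets(port, sizeof(port), f)) {
        fprintf(stderr, "cannot read port name\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    fclose(f);
    port[strcspn(port, "\n")] = '\0';

    MPI_Comm intercomm, intracomm;
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
    MPI_Intercomm_merge(intercomm, 1 /* high group */, &intracomm);

    /* from here the client takes part in the accept/merge rounds for later
       clients and then in the actual work on the merged intracomm */

    MPI_Comm_free(&intracomm);
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}
```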

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-07-19 Thread Edgar Gabriel
Hm, so I am not sure how to approach this. First of all, the test case works for me. I used up to 80 clients, and for both optimized and non-optimized compilation. I ran the tests with trunk (not with 1.4 series, but the communicator code is identical in both cases). Clearly, the patch from Ralph

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-07-13 Thread Ralph Castain
As far as I can tell, it appears the problem is somewhere in our communicator setup. The people knowledgeable on that area are going to look into it later this week. I'm creating a ticket to track the problem and will copy you on it. On Jul 13, 2010, at 6:57 AM, Ralph Castain wrote: > > On

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-07-13 Thread Ralph Castain
On Jul 13, 2010, at 3:36 AM, Grzegorz Maj wrote: > Bad news.. > I've tried the latest patch with and without the prior one, but it > hasn't changed anything. I've also tried using the old code but with > the OMPI_DPM_BASE_MAXJOBIDS constant changed to 80, but it also didn't > help. > While

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-07-13 Thread Grzegorz Maj
Bad news.. I've tried the latest patch with and without the prior one, but it hasn't changed anything. I've also tried using the old code but with the OMPI_DPM_BASE_MAXJOBIDS constant changed to 80, but it also didn't help. While looking through the sources of openmpi-1.4.2 I couldn't find any

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-07-12 Thread Ralph Castain
Dug around a bit and found the problem!! I have no idea who did this or why, but somebody set a limit of 64 separate jobids in the dynamic init called by ompi_comm_set, which builds the intercommunicator. Unfortunately, they hard-wired the array size but never checked that size before
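The failure mode described here, a fixed-size table filled without a bounds check, is roughly the pattern in the sketch below. The names MAX_JOBIDS, jobids and add_jobid_* are invented for illustration and are not the actual Open MPI identifiers (the thread refers to OMPI_DPM_BASE_MAXJOBIDS).

```c
/* Illustrative sketch only -- not the actual Open MPI source. It shows the
 * class of bug described above: a hard-wired array of job IDs filled without
 * a size check, so the 65th distinct jobid writes past the end of the array
 * and can corrupt memory or segfault later. */
#include <stdio.h>
#include <stdint.h>

#define MAX_JOBIDS 64                 /* hard-wired limit */

static uint32_t jobids[MAX_JOBIDS];
static int num_jobids = 0;

/* buggy pattern: no bounds check before storing a new jobid */
static void add_jobid_buggy(uint32_t jobid)
{
    jobids[num_jobids++] = jobid;     /* overflows once num_jobids == MAX_JOBIDS */
}

/* fixed pattern: check (or grow the array) before storing */
static int add_jobid_checked(uint32_t jobid)
{
    if (num_jobids >= MAX_JOBIDS) {
        fprintf(stderr, "too many jobids (limit %d)\n", MAX_JOBIDS);
        return -1;
    }
    jobids[num_jobids++] = jobid;
    return 0;
}

int main(void)
{
    (void)add_jobid_buggy;            /* shown only to illustrate the bug */
    /* with add_jobid_buggy, the 65th new jobid would write out of bounds */
    for (uint32_t j = 0; j < 70; j++) {
        if (add_jobid_checked(j) != 0) break;
    }
    printf("stored %d jobids\n", num_jobids);
    return 0;
}
```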

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-07-12 Thread Grzegorz Maj
1024 is not the problem: changing it to 2048 hasn't changed anything. Following your advice I've run my process under gdb. Unfortunately I didn't get anything more than:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xf7e4c6c0 (LWP 20246)]
0xf7f39905 in ompi_comm_set ()

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-07-07 Thread Ralph Castain
I would guess the #files limit of 1024. However, if it behaves the same way when spread across multiple machines, I would suspect it is somewhere in your program itself. Given that the segfault is in your process, can you use gdb to look at the core file and see where and why it fails? On Jul

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-07-07 Thread Grzegorz Maj
2010/7/7 Ralph Castain : > > On Jul 6, 2010, at 8:48 AM, Grzegorz Maj wrote: > >> Hi Ralph, >> sorry for the late response, but I couldn't find free time to play >> with this. Finally I've applied the patch you prepared. I've launched >> my processes in the way you've described

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-07-07 Thread Ralph Castain
On Jul 6, 2010, at 8:48 AM, Grzegorz Maj wrote: > Hi Ralph, > sorry for the late response, but I couldn't find free time to play > with this. Finally I've applied the patch you prepared. I've launched > my processes in the way you've described and I think it's working as > you expected. None of

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-07-06 Thread Grzegorz Maj
Hi Ralph, sorry for the late response, but I couldn't find free time to play with this. Finally I've applied the patch you prepared. I've launched my processes in the way you've described and I think it's working as you expected. None of my processes runs the orted daemon and they can perform MPI

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-04-24 Thread Ralph Castain
Actually, OMPI is distributed with a daemon that does pretty much what you want. Check out "man ompi-server". I originally wrote that code to support cross-application MPI publish/subscribe operations, but we can utilize it here too. Have to blame me for not making it more publicly known. The
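For reference, the publish/subscribe machinery mentioned here maps onto the standard MPI-2 name-service calls. The sketch below is only an illustration: it assumes an ompi-server instance is reachable by both independently started applications (for example by pointing mpirun at it with its --ompi-server option), and the service name "my_grid" is made up.

```c
/* Minimal sketch of MPI-2 name publish/lookup between two applications.
 * Assumes a name server such as ompi-server is reachable by both sides;
 * the service name "my_grid" is an arbitrary example. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    char port[MPI_MAX_PORT_NAME];
    MPI_Comm intercomm;

    if (argc > 1 && strcmp(argv[1], "server") == 0) {
        MPI_Open_port(MPI_INFO_NULL, port);
        MPI_Publish_name("my_grid", MPI_INFO_NULL, port);  /* register with the name server */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
        MPI_Unpublish_name("my_grid", MPI_INFO_NULL, port);
        MPI_Close_port(port);
    } else {
        MPI_Lookup_name("my_grid", MPI_INFO_NULL, port);   /* ask the name server for the port */
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
    }

    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}
```

With a name server in place, the port string never has to be copied around by hand; MPI_Lookup_name retrieves whatever the other side published.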

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-04-24 Thread Krzysztof Zarzycki
Hi Ralph, I'm Krzysztof and I'm working with Grzegorz Maj on this, our small project/experiment. We would definitely like to give your patch a try. But could you please explain your solution a little more? You would still like to start one mpirun per MPI grid, and then have processes started by us

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-04-23 Thread Ralph Castain
In thinking about this, my proposed solution won't entirely fix the problem - you'll still wind up with all those daemons. I believe I can resolve that one as well, but it would require a patch. Would you like me to send you something you could try? Might take a couple of iterations to get it

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-04-23 Thread Ralph Castain
Hmmm... I -think- this will work, but I cannot guarantee it:
1. launch one process (can just be a spinner) using mpirun with the following option: mpirun -report-uri file, where "file" is some filename that mpirun can create and insert its contact info into. This can be a relative or
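The "spinner" in step 1 can be trivial; below is a hedged sketch, with contact.txt as an arbitrary file name (the remaining steps of this recipe, which point the independently started processes at that contact file, are cut off in this preview).

```c
/* Sketch of the "spinner" process from step 1. Launched with something like
 *     mpirun -np 1 -report-uri ./contact.txt ./spinner
 * so that mpirun writes its contact info into contact.txt (the file name is
 * an arbitrary choice). The process itself only has to stay alive. */
#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    for (;;) {
        sleep(60);   /* do nothing; keep mpirun and its contact info available */
    }
    /* not reached: the job is torn down externally when no longer needed */
}
```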

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-04-23 Thread Grzegorz Maj
To be more precise: by 'server process' I mean some process that I could run once on my system and that could help in creating those groups. My typical scenario is:
1. run N separate processes, each without mpirun
2. connect them into an MPI group
3. do some job
4. exit all N processes
5. goto 1

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-04-23 Thread Grzegorz Maj
Thank you, Ralph, for your explanation. And, apart from that descriptors issue, is there any other way to solve my problem, i.e. to run a number of processes separately, without mpirun, and then to collect them into an MPI intracomm group? If I, for example, would need to run some 'server process'

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-04-17 Thread Grzegorz Maj
Yes, I know. The problem is that I need to use a special mechanism for running my processes, provided by the environment in which I'm working, and unfortunately I can't use mpirun. 2010/4/18 Ralph Castain : > Guess I don't understand why you can't use mpirun - all it does is start

Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-04-17 Thread Ralph Castain
Guess I don't understand why you can't use mpirun - all it does is start things, provide a means to forward io, etc. It mainly sits there quietly without using any cpu unless required to support the job. Sounds like it would solve your problem. Otherwise, I know of no way to get all these

[OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-04-17 Thread Grzegorz Maj
Hi, I'd like to dynamically create a group of processes communicating via MPI. Those processes need to be run without mpirun and create an intracommunicator after startup. Any ideas on how to do this efficiently? I came up with a solution in which the processes are connecting one by one using
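The truncated sentence above presumably refers to the MPI_Comm_accept / MPI_Comm_connect / MPI_Intercomm_merge pattern discussed throughout this thread. A minimal sketch of the accepting side of such a one-by-one scheme follows; the port name is handed out through an arbitrarily named file port.txt, NUM_CLIENTS is illustrative, and error handling is mostly omitted.

```c
/* Hedged sketch of the accepting side of a "grow the group one client at a
 * time" scheme. Each accepted client is folded into current_intracomm with
 * MPI_Intercomm_merge; every process already in current_intracomm has to take
 * part in each subsequent accept/merge round. */
#include <mpi.h>
#include <stdio.h>

#define NUM_CLIENTS 4                     /* illustrative number of clients */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    char port[MPI_MAX_PORT_NAME];
    MPI_Open_port(MPI_INFO_NULL, port);

    FILE *f = fopen("port.txt", "w");     /* hand the port name to the clients */
    if (!f) MPI_Abort(MPI_COMM_WORLD, 1);
    fprintf(f, "%s\n", port);
    fclose(f);

    MPI_Comm current_intracomm = MPI_COMM_SELF;
    for (int i = 0; i < NUM_CLIENTS; i++) {
        MPI_Comm intercomm, merged;
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, current_intracomm, &intercomm);
        MPI_Intercomm_merge(intercomm, 0 /* low group */, &merged);
        MPI_Comm_free(&intercomm);
        if (current_intracomm != MPI_COMM_SELF) MPI_Comm_free(&current_intracomm);
        current_intracomm = merged;
    }

    /* current_intracomm now spans the server and all accepted clients */

    MPI_Close_port(port);
    MPI_Finalize();
    return 0;
}
```

The clients do the mirror image (connect, merge, then keep joining the later accept/merge rounds); the hand-shake by the rank-0 processes inside those calls appears to be where the hangs reported earlier in this thread occur.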