On Feb 3, 2014, at 2:01 PM, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> 
wrote:

> Hi Ralph,
> 
> On 02/03/2014 04:20 PM, Ralph Castain wrote:
>> On Feb 3, 2014, at 1:13 PM, Eric Chamberland 
>> <eric.chamberl...@giref.ulaval.ca> wrote:
>> 
>>> On 02/03/2014 03:59 PM, Ralph Castain wrote:
>>>> Very strange - even if you kill the job with SIGTERM, or have processes 
>>>> that segfault, OMPI should clean itself up and remove those session 
>>>> directories. Granted, the 1.6 series isn't as good about doing so as the 
>>>> 1.7 series, but to date it has at least done pretty well.
>>> Ok, one more piece of information here that may matter: all sequential 
>>> tests are launched *without* mpiexec...  I don't know if the "cleanup" 
>>> phase is done by mpiexec or by the binaries...
>> Ah, yes that would be a source of the problem! We can't guarantee cleanup if 
>> you just kill the procs or they segfault *unless* mpiexec is used to launch 
>> the job. What are you using to launch? Most resource managers provide an 
>> "epilog" capability for precisely this purpose as all MPIs would display the 
>> same issue.
> For the sequential jobs, we just launch the tests on the "command line"... no 
> resource manager is ever used.  For the jobs which require more than 1 
> process, we have "mpiexec -n ..." added to the command line...

Understood. FWIW, if those sequential jobs call "MPI_Init", then they will 
create a session directory tree. I've been removing that in the 1.7 series so 
it only gets created when needed, but not in the 1.6 series.

> 
>>> which should delete files that shouldn't exist... ;-)
>>> 
>>> But, IMHO, I still think OpenMPI should "choose" another directory name if 
>>> it can't create it because a poor file exists!
>> We could do that - but now we get into the bottomless pit of trying every 
>> possible combination of directory names, and ensuring that every process 
>> comes up with the same answer! Remember, the session dir is where the shared 
>> memory regions rendezvous, so every process on a node would have to find the 
>> same place
> ok.  Just for my knowledge: that means if I launch 2 processes on a single 
> node and they have to communicate, they will do it by the files in /tmp?

They won't communicate via the files - they just use the files as a rendezvous 
point to exchange shared memory region pointers.
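
To illustrate the distinction, here is a minimal sketch of a file used only as a rendezvous point. It is not Open MPI's code: it uses Python's standard multiprocessing.shared_memory module as a stand-in, and the file name "rendezvous" and the function names are hypothetical. Both sides run in one process here to keep the example short; in practice they would be separate processes on the same node.

```python
import os
import tempfile
from multiprocessing import shared_memory


def publish_region(session_dir, payload):
    """Create a shared-memory region and advertise its name in a file."""
    region = shared_memory.SharedMemory(create=True, size=len(payload))
    region.buf[:len(payload)] = payload
    # The file carries only the region's name and length, never the data.
    with open(os.path.join(session_dir, "rendezvous"), "w") as f:
        f.write(f"{region.name} {len(payload)}")
    return region


def attach_region(session_dir):
    """Read the advertised name, attach to the region, and copy the data out."""
    with open(os.path.join(session_dir, "rendezvous")) as f:
        name, length = f.read().split()
    region = shared_memory.SharedMemory(name=name)
    data = bytes(region.buf[:int(length)])
    region.close()
    return data


def demo():
    # A temporary directory stands in for the agreed-upon session directory.
    with tempfile.TemporaryDirectory() as session_dir:
        owner = publish_region(session_dir, b"hello")
        try:
            return attach_region(session_dir)
        finally:
            owner.close()
            owner.unlink()
```

Both sides must agree on session_dir in advance, which is why every process on a node has to compute the same directory name.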

> 
>>> How can all users be aware that they have to cleanup such files?
>> Given how long 1.6.x has been out there, and that this is about the only 
>> time I've heard of a problem, I'm not sure this is a general enough issue to 
>> merit the concern
> Ok.  I just verified on 8 other computers/architectures that are running 
> the same tests: only 1 has files at the directory level of 
> /tmp/openmpi-sessions-${USER}*
> Since we have been doing this kind of testing for many years, I also agree 
> it is not a widespread issue...  But it just occurred 2 times in the last 3 
> days!!! :-/

Bummer :-(
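
For anyone who wants to run the same check on their own machines, a rough sketch follows. It assumes the layout described in this thread, i.e. that everything matching /tmp/openmpi-sessions-${USER}* and every entry directly inside it should be a directory. The function name is hypothetical, and the script only reports entries, it does not delete anything.

```python
import glob
import os


def find_blocking_entries(tmp_root, user):
    """List non-directory entries where session directories are expected."""
    prefix = os.path.join(tmp_root, f"openmpi-sessions-{user}")
    # Escape the literal part so special characters are not treated as globs.
    pattern = glob.escape(prefix) + "*"
    blocking = []
    for top in sorted(glob.glob(pattern)):
        if not os.path.isdir(top):
            # A file sits where the top-level session directory should be.
            blocking.append(top)
            continue
        for name in sorted(os.listdir(top)):
            entry = os.path.join(top, name)
            if not os.path.isdir(entry):
                # A file sits where a per-job subdirectory is expected.
                blocking.append(entry)
    return blocking
```

Calling find_blocking_entries("/tmp", getpass.getuser()) (with getpass imported) would list the candidates for manual cleanup.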

>> 
>>> Maybe a good compromise would be to have the error message say that a 
>>> file with the same name as the chosen directory exists?
>> I can make that change - good suggestion.
> ok, thanks!
> 
>> 
>>> Or add a new entry to the FAQ to help users find the workaround you 
>>> proposed... ;-)
>> we can try to do that too
> 
> If I may suggest a way to test the behavior of 1.7.x... what about this: 
> have a test case that creates a bunch of files (from 0 to 65536) in 
> /tmp/openmpi-sessions-${USER}... before launching an executable without 
> mpirun... >:)

Ick - it will actually only conflict if/when the pids wrap, so it's a pretty 
rare issue.
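
The underlying failure mode can be reproduced without Open MPI at all. The sketch below is only an illustration: the numeric file names and the function name are hypothetical, and it uses 8 files in a scratch directory rather than 65536 in /tmp. It shows that creating a directory fails when a regular file already holds the name, and that the error path can tell the two cases apart, which is what a more explicit error message would need to do.

```python
import os
import tempfile


def create_session_dir(path):
    """Try to create a directory; explain the failure if a file is in the way."""
    try:
        os.mkdir(path)
        return "created"
    except FileExistsError:
        if os.path.isdir(path):
            return "already a directory"
        return f"blocked: a non-directory entry already exists at {path}"


def demo():
    with tempfile.TemporaryDirectory() as base:
        # Stale files with numeric names, standing in for leftover entries.
        for n in range(8):
            open(os.path.join(base, str(n)), "w").close()
        blocked = create_session_dir(os.path.join(base, "3"))
        fresh = create_session_dir(os.path.join(base, "8"))
        return blocked.startswith("blocked"), fresh
```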

> 
> Anyway, thanks a lot!
> 
> Eric
> 
