Thanks Eric,

The goal of the patch is simply to avoid outputting info that is not needed (by
both orted and a.out)
/* since you run ./a.out directly, an orted is forked under the hood */
so the patch is really optional, though convenient.


Cheers,

Gilles

On Wednesday, September 14, 2016, Eric Chamberland <
eric.chamberl...@giref.ulaval.ca> wrote:

>
>
> On 14/09/16 01:36 AM, Gilles Gouaillardet wrote:
>
>> Eric,
>>
>>
>> can you please provide more information on how your tests are launched ?
>>
>>
> Yes!
>
> do you
>>
>> mpirun -np 1 ./a.out
>>
>> or do you simply
>>
>> ./a.out
>>
>>
> For all sequential tests, we do ./a.out.
>
>
>> do you use a batch manager ? if yes, which one ?
>>
>
> No.
>
>
>> do you run one test per job ? or multiple tests per job ?
>>
>
> In this automated compilation run, up to 16 tests are launched together.
>
>
>> how are these tests launched ?
>>
> For the sequential ones, the special thing is that they are launched via a
> Python Popen call, which launches "time", which in turn launches the code.
>
> So the "full" command line is:
>
> /usr/bin/time -v -o /users/cmpbib/compilations/lorien/linux_dernier_ompi_leap/TV2016-09-14_03h03m15sEDT/opt/Test.Laplacien/Time.Laplacien3D.Dirichlet.mixte_tetra_prismetri.scalhier.txt
> /pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.Laplacien.opt
> mpi_v=2 verbose=True Beowulf=False outilMassif=False outilPerfRecord=False
> verifValgrind=False outilPerfStat=False outilCallgrind=False
> RepertoireDestination=/users/cmpbib/compilations/lorien/linux_dernier_ompi_leap/TV2016-09-14_03h03m15sEDT/opt/Test.Laplacien
> RepertoireTest=/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/TestValidation/Ressources/opt/Test.Laplacien
> Prefixe=Laplacien3D.Dirichlet.mixte_tetra_prismetri.scalhier
>
>
>
>>
>> do the test that crashes use MPI_Comm_spawn ?
>>
>> I am surprised by the process name [[9325,5754],0], which suggests that
>> MPI_Comm_spawn was called 5753 times (!)
>>
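>> (A rough sketch of how such a name unpacks, assuming the usual layout
>> where the upper 16 bits of a jobid hold the job family and the lower 16
>> bits hold the per-family job sequence number; the macros below are
>> illustrative stand-ins, not the actual ORTE accessors:)
>>
>> #include <stdint.h>
>> #include <stdio.h>
>>
>> /* illustrative stand-ins for the real ORTE jobid accessors */
>> #define JOB_FAMILY(jobid)   (((jobid) >> 16) & 0xFFFFu)
>> #define LOCAL_JOBID(jobid)  ((jobid) & 0xFFFFu)
>>
>> int main(void)
>> {
>>     /* the name Eric reported: [[9325,5754],0] */
>>     uint32_t jobid = (9325u << 16) | 5754u;
>>
>>     /* local job 1 is the initial launch, so local job 5754 would
>>        imply 5753 dynamic spawns -- hence the surprise */
>>     printf("family %u, local job %u\n",
>>            (unsigned)JOB_FAMILY(jobid), (unsigned)LOCAL_JOBID(jobid));
>>     return 0;
>> }
>>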
>>
>> can you also run
>>
>> hostname
>>
>> on the 'lorien' host ?
>>
>>
> [eric@lorien] Scripts (master $ u+1)> hostname
> lorien
>
> if you configure'd Open MPI with --enable-debug, can you
>>
> Yes.
>
>
>> export OMPI_MCA_plm_base_verbose=5
>>
>> then run one test and post the logs ?
>>
>>
> Hmmm, strange?
>
> [lorien:93841] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path
> NULL
> [lorien:93841] plm:base:set_hnp_name: initial bias 93841 nodename hash
> 1366255883
> [lorien:93841] plm:base:set_hnp_name: final jobfam 22260
> [lorien:93841] [[22260,0],0] plm:rsh_setup on agent ssh : rsh path NULL
> [lorien:93841] [[22260,0],0] plm:base:receive start comm
> [lorien:93841] [[22260,0],0] plm:base:launch [22260,1] registered
> [lorien:93841] [[22260,0],0] plm:base:launch job [22260,1] is not a
> dynamic spawn
> [lorien:93841] [[22260,0],0] plm:base:receive stop comm
>
> ~
>
>>
>> from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
>> produce job family 5576 (but you get 9325)
>>
>> The discrepancy could be explained by the use of a batch manager and/or
>> a full hostname I am unaware of.
>>
>>
>> orte_plm_base_set_hnp_name() generates a 16-bit job family from the
>> (32-bit hash of the) hostname and the mpirun (32-bit?) pid.
>>
>> So strictly speaking, it is possible that two jobs launched on the same
>> node are assigned the same 16-bit job family.
>>
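>> (To make the collision scenario concrete, here is a small stand-alone
>> sketch; the hash and the mixing formula below are only placeholders for
>> whatever set_hnp_name really does -- the point is just the truncation
>> to 16 bits:)
>>
>> #include <stdint.h>
>> #include <stdio.h>
>>
>> /* placeholder hash, NOT the one Open MPI uses */
>> static uint32_t hash_hostname(const char *name)
>> {
>>     uint32_t h = 5381;
>>     for (; *name != '\0'; name++) {
>>         h = h * 33u + (uint32_t)(unsigned char)*name;
>>     }
>>     return h;
>> }
>>
>> /* placeholder mixing: only the 16-bit truncation matters here */
>> static uint16_t jobfam(const char *hostname, uint32_t pid)
>> {
>>     return (uint16_t)((hash_hostname(hostname) + pid) & 0xFFFFu);
>> }
>>
>> int main(void)
>> {
>>     /* two launchers on the same node whose pids differ only above the
>>        low 16 bits collide on the job family under this mixing; more
>>        generally, with ~2200 tests per night an accidental 16-bit match
>>        is not out of the question */
>>     printf("pid 93841  -> jobfam %u\n", (unsigned)jobfam("lorien", 93841u));
>>     printf("pid 159377 -> jobfam %u\n", (unsigned)jobfam("lorien", 159377u));
>>     return 0;
>> }
>>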
>>
>> the easiest way to detect this could be to
>>
>> - edit orte/mca/plm/base/plm_base_jobid.c
>>
>> and replace
>>
>>     OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>>                          "plm:base:set_hnp_name: final jobfam %lu",
>>                          (unsigned long)jobfam));
>>
>> with
>>
>>     OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>>                          "plm:base:set_hnp_name: final jobfam %lu",
>>                          (unsigned long)jobfam));
>>
>> configure Open MPI with --enable-debug and rebuild
>>
>> and then
>>
>> export OMPI_MCA_plm_base_verbose=4
>>
>> and run your tests.
>>
>>
>> When the problem occurs, you will be able to check which pids produced
>> the faulty jobfam, and that could hint at a conflict.
>>
> Does this give the same output as with export
> OMPI_MCA_plm_base_verbose=5 without the patch?
>
> If so, because everything is automated, applying a patch is "harder" for me
> than doing a simple export OMPI_MCA_plm_base_verbose=5, so maybe I could just
> add OMPI_MCA_plm_base_verbose=5 to all tests and wait until it hangs?
>
> Thanks!
>
> Eric
>
>
>
>> Cheers,
>>
>>
>> Gilles
>>
>>
>> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>>
>>> Hi,
>>>
>>> It is the third time this has happened in the last 10 days.
>>>
>>> While running nightly tests (~2200), we have one or two tests that
>>> fail at the very beginning with this strange error:
>>>
>>> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received
>>> unexpected process identifier [[9325,0],0] from [[5590,0],0]
>>>
>>> But I can't reproduce the problem right now... i.e., if I launch this
>>> test alone "by hand", it is successful... and the same test was
>>> successful yesterday...
>>>
>>> Is there some kind of "race condition" that can happen on the creation
>>> of "tmp" files if many tests run together on the same node? (we are
>>> oversubscribing even sequential runs...)
>>>
>>> Here are the build logs:
>>>
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>>>
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>>>
>>>
>>> Thanks,
>>>
>>> Eric
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
