Re: [Paraview] 3.98 MPI_Finalize out of order in pvbatch
Yeah, it's a strange one. The clues we have at this point are the 5 ctests that have been failing on the Nautilus dashboard, http://open.cdash.org/viewTest.php?onlyfailed&buildid=2719388, and the fact that 3.14.1 doesn't have the issue (it's still running OK). I'll see if I can narrow down when the issue started, and if the MPI binaries have debugging symbols I'll see if I can walk through them.

On 12/18/2012 11:42 AM, Utkarsh Ayachit wrote:
> That's really odd. Looking at the call stacks, it looks like the code on both
> processes is at the right location: both are calling MPI_Finalize(). I verified
> that MPI_Finalize() does indeed get called once (by adding a breakpoint on
> MPI_Finalize in pvbatch). Burlen, can you peek into the files (finalize.c,
> adi.c, etc.) to see if we can spot why the two processes diverge?
>
> Utkarsh
>
> On Fri, Dec 7, 2012 at 3:13 PM, Burlen Loring wrote:
>> #5 0x2b073a2e3c04 in PMPI_Finalize () at finalize.c:27
Re: [Paraview] 3.98 MPI_Finalize out of order in pvbatch
That's really odd. Looking at the call stacks, it looks like the code on both processes is at the right location: both are calling MPI_Finalize(). I verified that MPI_Finalize() does indeed get called once (by adding a breakpoint on MPI_Finalize in pvbatch). Burlen, can you peek into the files (finalize.c, adi.c, etc.) to see if we can spot why the two processes diverge?

Utkarsh

On Fri, Dec 7, 2012 at 3:13 PM, Burlen Loring wrote:
> #5 0x2b073a2e3c04 in PMPI_Finalize () at finalize.c:27
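For reference, the same check can be made without a debugger by interposing on MPI_Finalize through the standard PMPI profiling interface. The sketch below is not part of ParaView; the file name, build line, and log format are illustrative assumptions. Built as a shared library and LD_PRELOADed under pvbatch, it logs every MPI_Finalize call per rank, so a second call would show up immediately.

    // finalize_trace.cxx -- hypothetical PMPI interposer (not ParaView code).
    // Build (assuming an MPI C++ compiler wrapper is available):
    //   mpicxx -fPIC -shared finalize_trace.cxx -o libfinalize_trace.so
    // Run:
    //   LD_PRELOAD=./libfinalize_trace.so mpiexec -np 2 pvbatch test.py
    #include <mpi.h>
    #include <cstdio>

    extern "C" int MPI_Finalize()
    {
      int rank = -1;
      // Still legal here: MPI has not actually been finalized yet.
      PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
      std::fprintf(stderr, "rank %d: entering MPI_Finalize\n", rank);
      int rc = PMPI_Finalize();
      std::fprintf(stderr, "rank %d: MPI_Finalize returned %d\n", rank, rc);
      return rc;
    }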
Re: [Paraview] 3.98 MPI_Finalize out of order in pvbatch
On Fri, Dec 7, 2012 at 12:13 PM, Burlen Loring wrote:
> Hi Kyle et al.
>
> Below are stack traces where PV is hung. I'm stumped by this and can get no
> foothold. I still have one chance if we can get valgrind to run with MPI on
> Nautilus. But it's a long shot; valgrinding pvbatch on my local system throws
> many hundreds of errors, and I'm not sure which of those are valid reports.
>
> PV 3.14.1 doesn't hang in pvbatch, so I'm wondering if anyone knows of a
> change in 3.98 that may account for the new hang?
>
> Burlen
>
> rank 0
> #0  0x2b0762b3f590 in gru_get_next_message () from /usr/lib64/libgru.so.0
> #1  0x2b073a2f4bd2 in MPI_SGI_grudev_progress () at grudev.c:1780
> #2  0x2b073a31cc25 in MPI_SGI_progress_devices () at progress.c:93
> #3  MPI_SGI_progress () at progress.c:207
> #4  0x2b073a3244eb in MPI_SGI_request_finalize () at req.c:1548
> #5  0x2b073a2b8bee in MPI_SGI_finalize () at adi.c:667
> #6  0x2b073a2e3c04 in PMPI_Finalize () at finalize.c:27
> #7  0x2b073969d96f in vtkProcessModule::Finalize () at /sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/ParaViewCore/ClientServerCore/Core/vtkProcessModule.cxx:229
> #8  0x2b0737bb0f9e in vtkInitializationHelper::Finalize () at /sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/ParaViewCore/ServerManager/SMApplication/vtkInitializationHelper.cxx:145
> #9  0x00403c50 in ParaViewPython::Run (processType=4, argc=2, argv=0x7fff06195c88) at /sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/CommandLineExecutables/pvpython.h:124
> #10 0x00403cd5 in main (argc=2, argv=0x7fff06195c88) at /sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/CommandLineExecutables/pvbatch.cxx:21
>
> rank 1
> #0  0x2b07391bde70 in __nanosleep_nocancel () from /lib64/libpthread.so.0
> #1  0x2b073a32c898 in MPI_SGI_millisleep (milliseconds=<optimized out>) at sleep.c:34
> #2  0x2b073a326365 in MPI_SGI_slow_request_wait (request=0x7fff061959f8, status=0x7fff061959d0, set=0x7fff061959f4, gen_rc=0x7fff061959f0) at req.c:1460
> #3  0x2b073a2c6ef3 in MPI_SGI_slow_barrier (comm=1) at barrier.c:275
> #4  0x2b073a2b8bf8 in MPI_SGI_finalize () at adi.c:671
> #5  0x2b073a2e3c04 in PMPI_Finalize () at finalize.c:27
> #6  0x2b073969d96f in vtkProcessModule::Finalize () at /sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/ParaViewCore/ClientServerCore/Core/vtkProcessModule.cxx:229
> #7  0x2b0737bb0f9e in vtkInitializationHelper::Finalize () at /sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/ParaViewCore/ServerManager/SMApplication/vtkInitializationHelper.cxx:145
> #8  0x00403c50 in ParaViewPython::Run (processType=4, argc=2, argv=0x7fff06195c88) at /sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/CommandLineExecutables/pvpython.h:124
> #9  0x00403cd5 in main (argc=2, argv=0x7fff06195c88) at /sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/CommandLineExecutables/pvbatch.cxx:21

Hi Burlen,

Thanks for getting these. I'll take a closer look today and see what I can find.

-kyle

> On 12/04/2012 05:15 PM, Burlen Loring wrote:
>> Hi Kyle,
>>
>> I was wrong about MPI_Finalize being invoked twice; I had misread the code.
>> I'm not sure why pvbatch is hanging in MPI_Finalize on Nautilus. I haven't
>> been able to find anything in the debugger. This is new for 3.98.
>>
>> Burlen
>>
>> On 12/03/2012 07:36 AM, Kyle Lutz wrote:
>>> Hi Burlen,
>>>
>>> On Thu, Nov 29, 2012 at 1:27 PM, Burlen Loring wrote:
>>>> it looks like pvserver is also impacted, hanging after the GUI disconnects.
>>>>
>>>> On 11/28/2012 12:53 PM, Burlen Loring wrote:
>>>>> Hi All,
>>>>>
>>>>> some parallel tests have been failing for some time on Nautilus.
>>>>> http://open.cdash.org/viewTest.php?onlyfailed&buildid=2684614
>>>>>
>>>>> There are MPI calls made after finalize which cause deadlock issues on
>>>>> SGI MPT. It affects pvbatch for sure. The following snippet shows the
>>>>> bug; the bug report is here: http://paraview.org/Bug/view.php?id=13690
>>>>>
>>>>> //
>>>>> bool vtkProcessModule::Finalize()
>>>>> {
>>>>>   ...
>>>>>   vtkProcessModule::GlobalController->Finalize(1); <--- mpi_finalize called here
>>>
>>> This shouldn't be calling MPI_Finalize() as the finalizedExternally
>>> argument is 1 and in vtkMPIController::Finalize():
>>>
>>>   if (finalizedExternally == 0)
>>>     {
>>>     MPI_Finalize();
>>>     }
>>>
>>> So my guess is that it's being invoked elsewhere.
>>>
>>>>>   ...
>>>>>
>>>>> #ifdef PARAVIEW_USE_MPI
>>>>>   if (vtkProcessModule::FinalizeMPI)
>>>>>     {
>>>>>     MPI_Barrier(MPI_COMM_WORLD); <-- barrier after mpi_finalize
>>>>>     MPI_Finalize();              <-- second mpi_finalize
Re: [Paraview] 3.98 MPI_Finalize out of order in pvbatch
Hi Kyle et al.

Below are stack traces where PV is hung. I'm stumped by this and can get no foothold. I still have one chance if we can get valgrind to run with MPI on Nautilus. But it's a long shot; valgrinding pvbatch on my local system throws many hundreds of errors, and I'm not sure which of those are valid reports.

PV 3.14.1 doesn't hang in pvbatch, so I'm wondering if anyone knows of a change in 3.98 that may account for the new hang?

Burlen

rank 0
#0  0x2b0762b3f590 in gru_get_next_message () from /usr/lib64/libgru.so.0
#1  0x2b073a2f4bd2 in MPI_SGI_grudev_progress () at grudev.c:1780
#2  0x2b073a31cc25 in MPI_SGI_progress_devices () at progress.c:93
#3  MPI_SGI_progress () at progress.c:207
#4  0x2b073a3244eb in MPI_SGI_request_finalize () at req.c:1548
#5  0x2b073a2b8bee in MPI_SGI_finalize () at adi.c:667
#6  0x2b073a2e3c04 in PMPI_Finalize () at finalize.c:27
#7  0x2b073969d96f in vtkProcessModule::Finalize () at /sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/ParaViewCore/ClientServerCore/Core/vtkProcessModule.cxx:229
#8  0x2b0737bb0f9e in vtkInitializationHelper::Finalize () at /sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/ParaViewCore/ServerManager/SMApplication/vtkInitializationHelper.cxx:145
#9  0x00403c50 in ParaViewPython::Run (processType=4, argc=2, argv=0x7fff06195c88) at /sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/CommandLineExecutables/pvpython.h:124
#10 0x00403cd5 in main (argc=2, argv=0x7fff06195c88) at /sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/CommandLineExecutables/pvbatch.cxx:21

rank 1
#0  0x2b07391bde70 in __nanosleep_nocancel () from /lib64/libpthread.so.0
#1  0x2b073a32c898 in MPI_SGI_millisleep (milliseconds=<optimized out>) at sleep.c:34
#2  0x2b073a326365 in MPI_SGI_slow_request_wait (request=0x7fff061959f8, status=0x7fff061959d0, set=0x7fff061959f4, gen_rc=0x7fff061959f0) at req.c:1460
#3  0x2b073a2c6ef3 in MPI_SGI_slow_barrier (comm=1) at barrier.c:275
#4  0x2b073a2b8bf8 in MPI_SGI_finalize () at adi.c:671
#5  0x2b073a2e3c04 in PMPI_Finalize () at finalize.c:27
#6  0x2b073969d96f in vtkProcessModule::Finalize () at /sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/ParaViewCore/ClientServerCore/Core/vtkProcessModule.cxx:229
#7  0x2b0737bb0f9e in vtkInitializationHelper::Finalize () at /sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/ParaViewCore/ServerManager/SMApplication/vtkInitializationHelper.cxx:145
#8  0x00403c50 in ParaViewPython::Run (processType=4, argc=2, argv=0x7fff06195c88) at /sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/CommandLineExecutables/pvpython.h:124
#9  0x00403cd5 in main (argc=2, argv=0x7fff06195c88) at /sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/CommandLineExecutables/pvbatch.cxx:21

On 12/04/2012 05:15 PM, Burlen Loring wrote:
> Hi Kyle,
>
> I was wrong about MPI_Finalize being invoked twice; I had misread the code.
> I'm not sure why pvbatch is hanging in MPI_Finalize on Nautilus. I haven't
> been able to find anything in the debugger. This is new for 3.98.
>
> Burlen
>
> On 12/03/2012 07:36 AM, Kyle Lutz wrote:
>> Hi Burlen,
>>
>> On Thu, Nov 29, 2012 at 1:27 PM, Burlen Loring wrote:
>>> it looks like pvserver is also impacted, hanging after the GUI disconnects.
>>>
>>> On 11/28/2012 12:53 PM, Burlen Loring wrote:
>>>> Hi All,
>>>>
>>>> some parallel tests have been failing for some time on Nautilus.
>>>> http://open.cdash.org/viewTest.php?onlyfailed&buildid=2684614
>>>>
>>>> There are MPI calls made after finalize which cause deadlock issues on
>>>> SGI MPT. It affects pvbatch for sure. The following snippet shows the
>>>> bug; the bug report is here: http://paraview.org/Bug/view.php?id=13690
>>>>
>>>> //
>>>> bool vtkProcessModule::Finalize()
>>>> {
>>>>   ...
>>>>   vtkProcessModule::GlobalController->Finalize(1); <--- mpi_finalize called here
>>
>> This shouldn't be calling MPI_Finalize() as the finalizedExternally
>> argument is 1 and in vtkMPIController::Finalize():
>>
>>   if (finalizedExternally == 0)
>>     {
>>     MPI_Finalize();
>>     }
>>
>> So my guess is that it's being invoked elsewhere.
>>
>>>>   ...
>>>>
>>>> #ifdef PARAVIEW_USE_MPI
>>>>   if (vtkProcessModule::FinalizeMPI)
>>>>     {
>>>>     MPI_Barrier(MPI_COMM_WORLD); <-- barrier after mpi_finalize
>>>>     MPI_Finalize();              <-- second mpi_finalize
>>>>     }
>>>> #endif
>>
>> I've made a patch which should prevent this section of code from ever being
>> called twice by setting the FinalizeMPI flag to false after calling
>> MPI_Finalize(). Can you take a look here:
>>
>> http://review.source.kitware.com/#/t/1808/
>>
>> and let me know if that helps the issue. Otherwise, would you be able to set
>> a breakpoint on MPI_Finalize() and get a backtrace of where it gets invoked
>> for the second time? That would be very helpful in tracking down the problem.
>>
>> Thanks,
>> Kyle
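One hedged reading of these traces: rank 0 is still draining outstanding requests inside MPI_SGI_request_finalize while rank 1 is already waiting in finalize's internal barrier, which is the shape you would expect if some communication was left pending when MPI_Finalize was entered. The MPI standard requires every pending operation to be completed before finalize. A minimal sketch of that cleanup, with hypothetical names (not ParaView code):

    #include <mpi.h>
    #include <vector>

    // Hypothetical shutdown helper: complete (or cancel) every outstanding
    // request before MPI_Finalize so the library has nothing left to drain.
    void CompletePendingRequests(std::vector<MPI_Request>& pending)
    {
      for (size_t i = 0; i < pending.size(); ++i)
      {
        if (pending[i] == MPI_REQUEST_NULL)
        {
          continue; // already completed and freed
        }
        // If the matching send/receive may never be posted, cancel first;
        // MPI_Wait then either completes the operation or confirms the cancel.
        MPI_Cancel(&pending[i]);
        MPI_Wait(&pending[i], MPI_STATUS_IGNORE);
      }
      pending.clear();
    }

    // ... at shutdown:
    //   CompletePendingRequests(pendingRequests);
    //   MPI_Finalize();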
Re: [Paraview] 3.98 MPI_Finalize out of order in pvbatch
Hi Kyle,

I was wrong about MPI_Finalize being invoked twice; I had misread the code. I'm not sure why pvbatch is hanging in MPI_Finalize on Nautilus. I haven't been able to find anything in the debugger. This is new for 3.98.

Burlen

On 12/03/2012 07:36 AM, Kyle Lutz wrote:
> Hi Burlen,
>
> On Thu, Nov 29, 2012 at 1:27 PM, Burlen Loring wrote:
>> it looks like pvserver is also impacted, hanging after the GUI disconnects.
>>
>> On 11/28/2012 12:53 PM, Burlen Loring wrote:
>>> Hi All,
>>>
>>> some parallel tests have been failing for some time on Nautilus.
>>> http://open.cdash.org/viewTest.php?onlyfailed&buildid=2684614
>>>
>>> There are MPI calls made after finalize which cause deadlock issues on
>>> SGI MPT. It affects pvbatch for sure. The following snippet shows the
>>> bug; the bug report is here: http://paraview.org/Bug/view.php?id=13690
>>>
>>> //
>>> bool vtkProcessModule::Finalize()
>>> {
>>>   ...
>>>   vtkProcessModule::GlobalController->Finalize(1); <--- mpi_finalize called here
>
> This shouldn't be calling MPI_Finalize() as the finalizedExternally
> argument is 1 and in vtkMPIController::Finalize():
>
>   if (finalizedExternally == 0)
>     {
>     MPI_Finalize();
>     }
>
> So my guess is that it's being invoked elsewhere.
>
>>>   ...
>>>
>>> #ifdef PARAVIEW_USE_MPI
>>>   if (vtkProcessModule::FinalizeMPI)
>>>     {
>>>     MPI_Barrier(MPI_COMM_WORLD); <-- barrier after mpi_finalize
>>>     MPI_Finalize();              <-- second mpi_finalize
>>>     }
>>> #endif
>
> I've made a patch which should prevent this section of code from ever being
> called twice by setting the FinalizeMPI flag to false after calling
> MPI_Finalize(). Can you take a look here:
>
> http://review.source.kitware.com/#/t/1808/
>
> and let me know if that helps the issue. Otherwise, would you be able to set
> a breakpoint on MPI_Finalize() and get a backtrace of where it gets invoked
> for the second time? That would be very helpful in tracking down the problem.
>
> Thanks,
> Kyle
Re: [Paraview] 3.98 MPI_Finalize out of order in pvbatch
Hi Burlen,

On Thu, Nov 29, 2012 at 1:27 PM, Burlen Loring wrote:
> it looks like pvserver is also impacted, hanging after the GUI disconnects.
>
> On 11/28/2012 12:53 PM, Burlen Loring wrote:
>> Hi All,
>>
>> some parallel tests have been failing for some time on Nautilus.
>> http://open.cdash.org/viewTest.php?onlyfailed&buildid=2684614
>>
>> There are MPI calls made after finalize which cause deadlock issues on SGI
>> MPT. It affects pvbatch for sure. The following snippet shows the bug; the
>> bug report is here: http://paraview.org/Bug/view.php?id=13690
>>
>> //
>> bool vtkProcessModule::Finalize()
>> {
>>   ...
>>   vtkProcessModule::GlobalController->Finalize(1); <--- mpi_finalize called here

This shouldn't be calling MPI_Finalize() as the finalizedExternally argument is 1 and in vtkMPIController::Finalize():

  if (finalizedExternally == 0)
    {
    MPI_Finalize();
    }

So my guess is that it's being invoked elsewhere.

>>   ...
>>
>> #ifdef PARAVIEW_USE_MPI
>>   if (vtkProcessModule::FinalizeMPI)
>>     {
>>     MPI_Barrier(MPI_COMM_WORLD); <-- barrier after mpi_finalize
>>     MPI_Finalize();              <-- second mpi_finalize
>>     }
>> #endif

I've made a patch which should prevent this section of code from ever being called twice by setting the FinalizeMPI flag to false after calling MPI_Finalize(). Can you take a look here:

http://review.source.kitware.com/#/t/1808/

and let me know if that helps the issue. Otherwise, would you be able to set a breakpoint on MPI_Finalize() and get a backtrace of where it gets invoked for the second time? That would be very helpful in tracking down the problem.

Thanks,
Kyle
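For reference, a minimal sketch of the kind of guard described above, written as a standalone helper with assumed names (it is not the actual change under review): clear the bookkeeping flag once finalization has happened, and additionally ask MPI itself via MPI_Finalized() so the barrier/finalize pair can never run after MPI has already been shut down.

    #include <mpi.h>

    // Hypothetical sketch; 'finalizeMPIFlag' stands in for the
    // vtkProcessModule::FinalizeMPI flag discussed above.
    static bool finalizeMPIFlag = true;

    void FinalizeMPIAtMostOnce()
    {
      if (!finalizeMPIFlag)
      {
        return; // already handled by an earlier call
      }
      int alreadyFinalized = 0;
      MPI_Finalized(&alreadyFinalized); // legal to call even after MPI_Finalize
      if (!alreadyFinalized)
      {
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Finalize();
      }
      finalizeMPIFlag = false; // prevent the block from running a second time
    }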
Re: [Paraview] 3.98 MPI_Finalize out of order in pvbatch
it looks like pvserver is also impacted, hanging after the GUI disconnects.

On 11/28/2012 12:53 PM, Burlen Loring wrote:
> Hi All,
>
> some parallel tests have been failing for some time on Nautilus.
> http://open.cdash.org/viewTest.php?onlyfailed&buildid=2684614
>
> There are MPI calls made after finalize which cause deadlock issues on SGI
> MPT. It affects pvbatch for sure. The following snippet shows the bug; the
> bug report is here: http://paraview.org/Bug/view.php?id=13690
>
> //
> bool vtkProcessModule::Finalize()
> {
>   ...
>   vtkProcessModule::GlobalController->Finalize(1); <--- mpi_finalize called here
>   ...
>
> #ifdef PARAVIEW_USE_MPI
>   if (vtkProcessModule::FinalizeMPI)
>     {
>     MPI_Barrier(MPI_COMM_WORLD); <-- barrier after mpi_finalize
>     MPI_Finalize();              <-- second mpi_finalize
>     }
> #endif
>   ...
> }
>
> Burlen
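For context, a standalone sketch of the call ordering the annotations above describe; a hypothetical test program, not ParaView code. The MPI standard makes almost any MPI call after MPI_Finalize erroneous, so the barrier here may abort, hang, or appear to work depending on the implementation; it illustrates the pattern to look for, not behavior to rely on.

    // out_of_order_finalize.cxx -- hypothetical reproducer of the suspected
    // ordering (intentionally erroneous per the MPI standard).
    #include <mpi.h>

    int main(int argc, char* argv[])
    {
      MPI_Init(&argc, &argv);
      MPI_Finalize();              // plays the role of GlobalController->Finalize(1)
      MPI_Barrier(MPI_COMM_WORLD); // "barrier after mpi_finalize"
      MPI_Finalize();              // "second mpi_finalize"
      return 0;
    }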