Ok great, thanks - it sounds perfectly safe to disable it the way we're
using it here then.
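
For anyone who finds this thread later, here is a minimal sketch of how the variable might be set for a single launch from a Python wrapper. Only DMTCP_DL_PLUGIN, dmtcp_launch and the --rm flag come from this thread; the helper names and the "./my_job" command are hypothetical placeholders, and nothing here has been checked against other DMTCP versions.

```python
import os
import subprocess


def build_launch(job_argv, base_env=None):
    """Return (argv, env) for starting a job under dmtcp_launch with the dl plugin disabled.

    build_launch is a hypothetical helper name, not part of DMTCP.
    """
    # Copy the environment so the caller's mapping is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    # The warning text asks for the variable to be "0" before dmtcp_launch starts.
    env["DMTCP_DL_PLUGIN"] = "0"
    # --rm is the flag already used for the jobs described in this thread.
    argv = ["dmtcp_launch", "--rm"] + list(job_argv)
    return argv, env


def run_job(job_argv):
    """Start the job; assumes dmtcp_launch is on PATH (an assumption about the site setup)."""
    argv, env = build_launch(job_argv)
    return subprocess.run(argv, env=env, check=True)
```

Calling run_job(["./my_job"]) would start the placeholder job with the variable set for that process only, so other jobs on the node keep the plugin enabled.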
On 01/07/16 19:34, Jiajun Cao wrote:
> Hi Jonathan,
>
> Setting the environment variable disables the dl plugin of DMTCP.
> The plugin's main job is to prevent a checkpoint from being taken in the
> middle of dlopen()/dlclose(), since checkpointing there may cause
> undefined behavior on restart. Disabling the dl plugin should be okay for
> most applications, since most programs load their shared libraries during
> initialization. As long as you do not checkpoint the app while it is
> loading or unloading a library, it's safe.
>
> Having said that, I'm still not sure why it fails to open the shared
> library, since dmtcp does nothing special about the dl calls, except
> what I described above.
>
> Let me know if you have any questions,
> Jiajun
>
> On Fri, Jul 1, 2016 at 12:01 PM, Jonathan Patterson <[email protected]> wrote:
>
>
> This thread appears to have died, so in the hope of getting an answer
> from one of the developers, here's the basic question again:
> When told to "consider setting the environment variable DMTCP_DL_PLUGIN
> to 0", what are the implications of doing this? I've seen the error with
> every matlab job that people try to run so far. Does this error mean the
> program will not run ok? Does disabling the DL_PLUGIN (whatever that is)
> mean that checkpointing will not work?
> Some guidance please....
>
> On 29-Jun-16 8:08 AM, Jonathan Patterson wrote:
> >
> > Great, thank you.
> > We're just using TCP over gigabit ethernet for the network.
> > Slurm is 15.08.6, but it's not doing the checkpointing. I'm doing
> > that manually. As far as slurm is concerned, there is no checkpointing.
> > I'm not starting MPI jobs with dmtcp_launch - I'm not aiming to
> > checkpoint the MPI jobs, it's the 1-core, low-memory simple jobs
> > that I want to checkpoint, so these can be moved around to make way
> > for the more complex jobs. So I'm thinking we can leave MPI out of this.
> > Most jobs are running with no problems, it's just the dmtcp ones
> > that *occasionally* have a problem.
> > Some of the failing jobs (specific ones) complain about libdl.so
> > (see below), but not all of them, if that helps. Maybe we should
> > deal with that issue first?
> > The other failing jobs fail simply with the message I posted before.
> >
> > [43000] WARNING at dlwrappers.cpp:75 in dlopen; REASON='JWARNING(ret) failed'
> > filename = libirml.so.1
> > flag = 1
> > Message: dlopen failed. You may also see a message 'ERROR: ld.so:'
> > from libdl.so. If this happens only under DMTCP, then consider setting
> > the environment variable DMTCP_DL_PLUGIN to "0" before 'dmtcp_launch'.
> > If the problem persists, please write to the DMTCP developers.
> >
> > [43000] NOTE at processinfo.cpp:199 in growStack; REASON='bottom-most page of stack (page with highest address) was
> > invisible in /proc/self/maps. It is made visible again now.'
> > [43000] WARNING at dlwrappers.cpp:75 in dlopen; REASON='JWARNING(ret) failed'
> > filename = libcilkrts.so
> > flag = 1
> > Message: dlopen failed. You may also see a message 'ERROR: ld.so:'
> > from libdl.so. If this happens only under DMTCP, then consider setting
> > the environment variable DMTCP_DL_PLUGIN to "0" before 'dmtcp_launch'.
> > If the problem persists, please write to the DMTCP developers.
> >
> > [43000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid; REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
> > _magicBits =
> > Message: read invalid message, _magicBits mismatch. Did DMTCP coordinator die uncleanly?
> >
> >
> >
> > On 28/06/16 22:59, Jiajun Cao wrote:
> >> Hi Jonathan,
> >>
> >> Thanks for writing to us. We're definitely glad to help you with the
> >> problem. Can you provide us the following info:
> >>
> >> What's the interconnect of the cluster, InfiniBand, TCP?
> >>
> >> What versions of Slurm and MPI do you use?
> >>
> >> Aside from the failed jobs, are the remaining jobs successful? Can they
> >> checkpoint/restart successfully?
> >>
> >> The log you sent is very general: it tells us only that the client
> >> somehow cannot connect to the coordinator. There can be various reasons
> >> for that. We'll need to dig further.
> >>
> >> Best,
> >> Jiajun
> >>
> >> On Tue, Jun 28, 2016 at 11:45 AM, Jonathan Patterson <[email protected]> wrote:
> >>
> >>
> >> Hello!
> >> I'm running v. 2.4.4 on CentOS 6.8, kernel 2.6.32-431.20.3.el6.x86_64
> >> This is a cluster, with ~ 100 compute nodes, running slurm.
> >> Jobs are started with dmtcp_launch --rm. The idea is that jobs can be
> >> checkpointed as needed, to move them around between machines to fit
> >> jobs together to make room for high memory/specific MPI geometry jobs.
> >> This has worked well, but...
> >> Out of ~ 45,000 jobs that have run so far, ~ 100 have errors as below.
> >> I cannot find a common compute node, time, job type, user, memory
> >> usage, or any other factor - it seems that dmtcp is just randomly
> >> generating this error. This stops the job, which is a bit of a
> >> problem. No checkpointing was attempted on these jobs.
> >> Any ideas where I should look for the problem, anybody? Anything I can
> >> do to get some more debugging info? Is it the coordinator, or the
> >> dmtcp library wrapped around the running program that's generating
> >> this error?
> >> Thanks in advance...
> >>
> >> [47000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid; REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
> >> _magicBits =
> >> Message: read invalid message, _magicBits mismatch. Did DMTCP coordinator die uncleanly?
> >> main-PYTHIA8-lhef (47000): Terminating...
> >> [40000] ERROR at coordinatorapi.cpp:601 in createNewConnectionBeforeFork; REASON='JASSERT(_coordinatorSocket.isValid()) failed'
> >> bash (40000): Terminating...
> >>
> >>
> >>
>
> ------------------------------------------------------------------------------
> >> Attend Shape: An AT&T Tech Expo July 15-16. Meet us at AT&T
> Park in San
> >> Francisco, CA to explore cutting-edge tech and listen to
> tech luminaries
> >> present their vision of the future. This family event has
> something for
> >> everyone, including kids. Get more information and register
> today.
> >> http://sdm.link/attshape
> >> _______________________________________________
> >> Dmtcp-forum mailing list
> >> [email protected]
> <mailto:[email protected]>
> >> <mailto:[email protected]
> <mailto:[email protected]>>
> >> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
> >>
> >>
> >
> >
>