Hi Eliot,
    At least when I reproduced the bug, I migrated from an AMD machine
to an Intel one.  The two filename prefixes that I saw were:
  /usr/lib/jvm/java-7-openjdk-common  [ on the AMD machine ]
  /usr/lib/jvm/java-6-openjdk-common  [ on the Intel machine ]
    So, this first fix is for an existing bug in DMTCP.  When we
re-create a shared memory section, we don't check whether the backing
file is a valid filename on the restart host.  Our fix will be to map
it as private in that case.
The checkpoint image will then carry with it all of the memory from
java-7.  Hopefully, DMTCP will then restart correctly on the
destination machine, even if there is _no_ java at all on the destination.
    In your own case, you might be seeing the same issue due to
differing NFS mount points, just as you describe.  But the same
solution should apply.
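
    To make the intended fallback concrete, here is a minimal sketch in C.
It is not the actual code in mtcp/mtcp_restart.c: the helper name
map_region_with_fallback and its parameters are hypothetical, and a real
restart would presumably map at the region's original address (MAP_FIXED)
rather than letting the kernel pick one.  The sketch only shows the idea of
trying the backing file first and falling back to a private anonymous
mapping when the file cannot be opened.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE            /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (an assumption for illustration, not DMTCP's API).
 *
 * Tries to re-create a shared, file-backed mapping of `len` bytes from
 * `path` at `offset`.  If the backing file cannot be opened or mapped
 * (for example, it does not exist on the restart host), falls back to a
 * private anonymous mapping, so that the saved contents can be copied in
 * from the checkpoint image afterwards.
 *
 * Sets *used_fallback to 1 when the anonymous mapping was used.
 * Returns the mapped address, or MAP_FAILED if even the fallback fails.
 */
void *map_region_with_fallback(const char *path, size_t len, int prot,
                               off_t offset, int *used_fallback)
{
    assert(len > 0 && used_fallback != NULL);
    *used_fallback = 0;

    /* A writable MAP_SHARED mapping needs the file opened read-write. */
    int flags = (prot & PROT_WRITE) ? O_RDWR : O_RDONLY;
    int fd = open(path, flags);
    if (fd >= 0) {
        void *addr = mmap(NULL, len, prot, MAP_SHARED, fd, offset);
        close(fd);                 /* the mapping stays valid after close */
        if (addr != MAP_FAILED)
            return addr;
    }

    /*
     * Backing file is unavailable: map private anonymous memory instead.
     * PROT_WRITE is added so the caller can restore the saved contents;
     * the caller may tighten the protection later with mprotect().
     */
    *used_fallback = 1;
    return mmap(NULL, len, prot | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```

    A caller would check the result against MAP_FAILED, copy the saved
bytes into the region when used_fallback is set, and then restore the
original protection bits if they did not include write access.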

    I agree with you that we'll then need to test this solution, and
see if there is a further problem.  If we find a further problem,
we might need to then add a specialized plugin to help Java adapt to the
new environment.  For example, if the memory segments refer to a
filename on the source Java machine, and _if_ the jvm tries to open
them _after_ restart on the destination machine, that would be a problem.
DMTCP plugins are designed for cases just like this: adapting to a new
environment on restart.

    So, we'll begin with the DMTCP bug fix, and then see if we also
need to add a new plugin specialized for Java.

Best,
- Gene


On Thu, Mar 05, 2015 at 04:41:38PM -0500, Eliot Moss wrote:
> On 3/5/2015 2:45 PM, Gene Cooperman wrote:
> >Hi Eliot,
> >     Good to hear from you again.  Sorry there was a delay before
> >we answered your bug report.
> >
> >Hi Rohan and Jiajun,
> >     I see what the bug is.  Could one of you implement the bug fix
> >(see below)?
> 
> So here's a wondering.  I am not sure the file will have a different
> *name* on different hosts.  The naming scheme through the file system
> should be the same.  However, on different hosts the file might be mapped
> to different locations when linked, and that could be problematic, no?
> I am not even sure how Java could be made to adjust to that.  I think
> you'd have to request mapping to the same address.
> 
> The files that I think are in question are on NFS mounts, and the mount
> information indicated the remote system ip address AND the local client
> ip address.  Maybe that somehow is viewed as part of the name of the
> files?
> 
> Thanks very much for investigating!  When you have a fix I think I can
> probably test it fairly easily.
> 
> Regards -- Eliot Moss
> 
> >     I was able to reproduce the bug by checkpointing java1 from the
> >test suite on dekaksi:
> >   env CLASSPATH=./test ./bin/dmtcp_launch --checkpoint-open-files  -i7 java -Xmx5M java1
> >
> >I then recursively copy ('scp -r') ckpt_java_* to CCIS Linux (since there
> >are some open files).
> >
> >I then restart on CCIS Linux:
> >   bin/dmtcp_restart ckpt_java_1d4a852a5f139a6-40000-54f8acae.dmtcp
> >   [27628] mtcp_restart.c:1321 open_shared_file:
> >   unable to create file /usr/lib/jvm/java-7-openjdk-common/jre/lib/ext/pulse-java.jar: 2
> >Segmentation fault (core dumped)
> >
> >I then look at the checkpoint image:
> >   gzip -dc ckpt_java1*.dmtcp | util/readdmtcp.sh tmp.dmtcp 2>&1 | grep -- '-s'
> >
> >Sure enough, Java is opening files at /usr/lib/jvm/... as shared files.
> >We try to re-create the shared image in mtcp/mtcp_restart.c
> >with the _same_ underlying file.  But on the new host, the full pathname
> >of the underlying shared file has changed.
> >
> >Presumably, Java creates the shared image so that the Java jvm can
> >share the memory mapped file among multiple running jvm's.
> >
> >I assume that the solution is that if the underlying filename of a shared
> >memory image doesn't exist on the new target machine, then we should
> >simply open the file as shared, but with no underlying file,
> >using MAP_ANONYMOUS in mmap.
> >
> >The necessary logic should be self-contained inside mtcp/mtcp_restart.c.
> >
> >Jiajun or Rohan,
> >     Could one of you implement this fix (and also add this new issue
> >to github)?
> >
> >Thanks,
> >- Gene
> >
> >
> >On Wed, Mar 04, 2015 at 03:53:30PM -0500, Kapil Arya wrote:
> >>Rohan,Jiajun,
> >>
> >>Could one of you take a quick look at it?
> >>
> >>Kapil
> >>
> >>On Sat, Feb 28, 2015 at 12:04 PM, Eliot Moss <[email protected]> wrote:
> >>
> >>>On 2/26/2015 7:19 PM, Eliot Moss wrote:
> >>>
> >>>>gunzip -c foo.gz | java blah blah 2> blah.err | gzip > bar.gz
> >>>>
> >>>>1) Typically fails in restart if restarted on a host different from that
> >>>>      used for first part of the run.  The complaint is about Unix
> >>>>      shared-memory stuff in the Java process.
> >>>>
> >>>>      Workaround: Restart only on the original host.
> >>>
> >>>Here's what happens when restarted on a different host:
> >>>
> >>>[42000] ERROR at sysvipc.cpp:775 in postRestart; REASON='JASSERT(_realId != -1) failed'
> >>>       (strerror((*__errno_location ()))) = No such file or directory
> >>>java (42000): Terminating...
> >>>
> >>>As for the other problems (relative versus absolute path for stderr of a
> >>>Java process), either I had confounded it with the above or it does not
> >>>happen every time, so I may have been wrong about it, and in any case do
> >>>not currently have failure output for it.
> >>>
> >>>Regards -- EM
> >>>
> >>>
> >>>------------------------------------------------------------------------------
> >>>Dive into the World of Parallel Programming The Go Parallel Website,
> >>>sponsored
> >>>by Intel and developed in partnership with Slashdot Media, is your hub for
> >>>all
> >>>things parallel software development, from weekly thought leadership blogs
> >>>to
> >>>news, videos, case studies, tutorials and more. Take a look and join the
> >>>conversation now. http://goparallel.sourceforge.net/
> >>>_______________________________________________
> >>>Dmtcp-forum mailing list
> >>>[email protected]
> >>>https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
> >>>
