Hi Eliot,
We're slowly coming up for air.
1) We still owe you on the restart issue of 2/26/15. It's surprisingly subtle.
2) "Am I right in guessing that a file on a remote drive is identified
in part by the uuid of the local mount point, and in part by information
about the remote drive?"
I believe that DMTCP knows nothing about local or remote drives or
mount points. It should view files as part of a single unified filesystem.
Is there a simple way that we could locally test what you're seeing,
without having to crash a Grid Engine compute node? :-)
I think I remember once seeing a strange bug in which if I renamed a DMTCP
checkpoint image to a new name, then the restart would fail. But I haven't
checked recently to see if that bug is still there. I was hoping to
get the list of outstanding major DMTCP bugs (and enhancements) fixed
for the 2.4.0 release, and then get back to these corner cases.
3) "For java jar files, it appears that every checkpoint makes another copy of
an open jar file -- even when (as far as I know) such files are read only."
The default policy of DMTCP should be that if the file is read-only,
and even if it is writeable but the offset is at the end of the file,
then DMTCP should _not_ be making a copy of the file. The flag
--checkpoint-open-files for dmtcp_launch is intended to force DMTCP
to make copies of open files in order to overcome that default behavior.
If you're seeing something different, could you confirm that? In that
case, I'll check again locally, to verify this bug. Thanks.
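To make the intended default concrete, here is a rough sketch in Python.
It is not DMTCP's actual code: the function and parameter names are made up
for illustration, and it only encodes the two "no copy" cases described
above plus the override flag. It says nothing about what DMTCP does with
a writeable file whose offset is not at the end.

```python
import os

# Mask covering the access-mode bits of the open() flags.
ACCESS_MODE_MASK = os.O_RDONLY | os.O_WRONLY | os.O_RDWR


def policy_skips_copy(open_flags, offset, file_size, checkpoint_open_files=False):
    """Return True if the default policy described above says "no copy".

    Hypothetical helper for illustration only; not part of DMTCP.
    checkpoint_open_files models the --checkpoint-open-files flag,
    which is meant to override the default and force a copy.
    """
    if checkpoint_open_files:
        return False  # override: the default "no copy" cases no longer apply
    read_only = (open_flags & ACCESS_MODE_MASK) == os.O_RDONLY
    at_end_of_file = offset == file_size
    return read_only or at_end_of_file


# A read-only jar file falls in the "no copy" case by default.
print(policy_skips_copy(os.O_RDONLY, offset=0, file_size=4096))
```

If the copies you're seeing appear without --checkpoint-open-files, that
would correspond to the first case above not being honored.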
Sorry that we're late in getting back to you on item #1 above.
As an example of some things that we've been facing, leading up to our
2.4.0 full release, we were recently testing on the Lustre filesystem,
and hit an issue with parallel calls to 'mkdir()' that seems to be
a known bug under Lustre. Because it affects our support for MPI,
we've been putting a fair amount of time into first diagnosing it (the
diagnosis that I quote above), and then producing a fix.
By the way, if you're curious about seeing things on the developer
side, take a look at: https://github.com/dmtcp/dmtcp
where you'll find a bunch of "issues" (mostly design issues, some real
bugs, and a fair amount of gray area in between), and "pull requests".
Here's the issue that you brought to us about Java process migration:
https://github.com/dmtcp/dmtcp/issues/40
Thanks for your patience, Eliot.
Best,
- Gene
On Mon, May 11, 2015 at 08:43:23PM -0400, Eliot Moss wrote:
> Hi, guys -- A few questions:
>
> 1) I originally posted about my restart issues on 2/26/15. I am wondering
>    how things are coming w.r.t an updated version :-) ...
>
> 2) Am I right in guessing that a file on a remote drive is identified in
>    part by the uuid of the local mount point, and in part by information
>    about the remote drive? I ask because it would appear that if a Grid
>    Engine compute node crashes and gets rebuilt, restart of a job
>    previously running there seems always to fail. Note that a rebuild
>    ends up giving the local disk of the compute node a new uuid, since
>    the disk image is wiped and built from scratch. However, the *remote*
>    files remain the same. It would seem that in such a case the remote
>    identity is what should matter, not the local path name ...
>
>    In fact, maybe the situation is that the file is identified by the
>    uuid of the disk for '/' ... or something like that? I claim that's
>    broken for other mounts, such as the NFS mounts typical for my files
>    ... Anyway, I am wondering how this works, and how it is intended to
>    work.
>
> 3) For java jar files, it appears that every checkpoint makes another
>    copy of an open jar file -- even when (as far as I know) such files
>    are read only. Now the files are not big and I have plenty of storage,
>    but it makes me wonder about the logic of the code in DMTCP ...
>
> Regards -- Eliot Moss
>
> ------------------------------------------------------------------------------
> One dashboard for servers and applications across Physical-Virtual-Cloud
> Widest out-of-the-box monitoring support with 50+ applications
> Performance metrics, stats and reports that give you Actionable Insights
> Deep dive visibility with transaction tracing using APM Insight.
> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
> _______________________________________________
> Dmtcp-forum mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum