Dear forum members,

I have observed an annoying occurrence many times now. I'm running
parallel HDF5 (1.8.14) on top of OpenMPI (1.7.2) with gcc (4.8.1) on
openSUSE Linux 13.1. The storage is located on an NFS server.

Running on typically 4 cores, I'm writing relatively large files (at
least several hundred MB, sometimes many GB) in parallel with HDF5.
Sometimes I have to interrupt the code with CTRL+C (SIGINT) during such
a write operation (often because of user error). Occasionally this
causes a catastrophic hangup, and I get the error message:
kernel BUG: soft lockup - CPU stuck for 23s!

This invariably causes a hard system crash shortly afterward. I have
observed it on at least 5 different machines (same software stack), so
I don't believe it is a hardware problem. Since these lockups only
happen during interrupted write operations, I suspect the HDF5 library
is causing them in some way, possibly by not freeing some resources.

Of course, it could also be caused by OpenMPI. Due to the highly
disruptive nature of the problem, I am not keen to reproduce it often,
and I cannot easily try a different (or newer) MPI implementation. It
might also matter that I'm not writing to a local disk but to an NFS
mount.

Hence a general question, without appending example code: has anyone
observed this behavior before, and if so, is there a fix? Am I blaming
HDF5 unfairly, and is another cause more likely? If this error is
unheard of, it's most likely caused by my setup...

Thanks,
Wolf

-- 




_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5