Hi,

no, I do not modify the file that gets corrupted in do_work(). I only access other HDF5 files.

I figured out that this problem only occurs when creating large files (> 3 TB) in parallel (> 1500 MPI tasks). For much smaller files, I did not run into this problem.

I will try to figure out the critical file size and create a replicator, but this might take some time since I have to wait for the compute resources.
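Roughly, I imagine the replicator looking something like the sketch below. The file name, dataset size, and abort point are placeholders (nowhere near the real 3 TB / 1500-task setup); only the HDF5 and MPI calls themselves are real API:

#include <hdf5.h>
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Create the file collectively with the MPI-IO file driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("replicator.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);

    /* One large 1-D dataset; each rank writes one contiguous slab.
       per_rank would have to be scaled up to reach the multi-TB range. */
    const hsize_t per_rank = 1 << 20;
    hsize_t dims[1] = { per_rank * (hsize_t)nprocs };
    hid_t fspace = H5Screate_simple(1, dims, NULL);
    hid_t dset = H5Dcreate2(file, "state", H5T_NATIVE_DOUBLE, fspace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t start[1] = { per_rank * (hsize_t)rank };
    hsize_t count[1] = { per_rank };
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t mspace = H5Screate_simple(1, count, NULL);

    double *buf = malloc(per_rank * sizeof(double));
    for (hsize_t i = 0; i < per_rank; i++) buf[i] = (double)rank;

    /* Write, flush, then abort before the file is ever closed,
       mimicking a crash inside do_work() right after a checkpoint. */
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, buf);
    H5Fflush(file, H5F_SCOPE_GLOBAL);
    MPI_Abort(MPI_COMM_WORLD, 1);

    return 0;
}

To actually trigger the corruption I would of course have to run this with more than 1500 ranks and grow the dataset past 3 TB, which is why I need to wait for the machine.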
I am not sure if this is helpful, but here is the error I get when I try to access the corrupt file with h5debug:
HDF5-DIAG: Error detected in HDF5 (1.8.11) thread 0:
#000: H5F.c line 1582 in H5Fopen(): unable to open file
major: File accessibilty
minor: Unable to open file
#001: H5F.c line 1373 in H5F_open(): unable to read superblock
major: File accessibilty
minor: Read failed
#002: H5Fsuper.c line 351 in H5F_super_read(): unable to load superblock
major: Object cache
minor: Unable to protect metadata
#003: H5AC.c line 1329 in H5AC_protect(): H5C_protect() failed.
major: Object cache
minor: Unable to protect metadata
#004: H5C.c line 3570 in H5C_protect(): can't load entry
major: Object cache
minor: Unable to load metadata into cache
#005: H5C.c line 7950 in H5C_load_entry(): unable to load entry
major: Object cache
minor: Unable to load metadata into cache
#006: H5Fsuper_cache.c line 471 in H5F_sblock_load(): truncated file: eof =
3968572377152, sblock->base_addr = 0, stored_eoa = 3968574947328
major: File accessibilty
minor: File has been truncated
cannot open file
Best regards,
Sebastian

On 06/08/2015 05:08 PM, Mohamad Chaarawi wrote:

Hi Sebastian,

What happens in do_work()? Are you modifying the file in question? If yes, then corruption can be expected. If not, then the file should not be corrupted, and if it is, then we have a bug in the library. If you can send a replicator for this problem, we can investigate further.

Thanks,
Mohamad

-----Original Message-----
From: Hdf-forum [mailto:[email protected]] On Behalf Of Sebastian Rettenberger
Sent: Tuesday, June 02, 2015 4:27 AM
To: HDF Users Discussion List
Subject: [Hdf-forum] File state after flush and crash

Hi,

I want to save the program state of a parallel program in an HDF5 file. For performance reasons I do not want to open/close the file each time I write the program state, but flush it instead. Thus my main loop basically looks like this:

while more_work() {
    do_work()
    update_hdf5_attributes()
    update_hdf5_dataset()
    flush_hdf5()
}

However, even if the program crashes during do_work(), I get a corrupt HDF5 file.

I found a short conversation regarding this, but it is already 5 years old:
http://lists.hdfgroup.org/pipermail/hdf-forum_lists.hdfgroup.org/2010-February/002543.html

They mentioned that this might be a problem with the metadata cache. Is this still true? Is there a way around it?

I also have some other open HDF5 files. Might this be a problem?

Best regards,
Sebastian

--
Sebastian Rettenberger, M.Sc.
Technische Universität München
Department of Informatics
Chair of Scientific Computing
Boltzmannstrasse 3, 85748 Garching, Germany
http://www5.in.tum.de/
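(For reference, the checkpoint loop quoted above maps onto the plain HDF5 API roughly as follows; the application routines are stubbed placeholders and only the H5F calls are real API:)

#include <hdf5.h>

/* Stand-ins for the application's own routines. */
static int  more_work(void)                 { static int steps = 3; return steps-- > 0; }
static void do_work(void)                   { /* compute; a crash here leaves the file open */ }
static void update_hdf5_attributes(hid_t f) { (void)f; /* rewrite attributes */ }
static void update_hdf5_dataset(hid_t f)    { (void)f; /* rewrite dataset */ }

int main(void)
{
    /* Open the checkpoint file once and keep the handle for the whole run. */
    hid_t file = H5Fcreate("state.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    while (more_work()) {
        do_work();
        update_hdf5_attributes(file);
        update_hdf5_dataset(file);
        /* Flush buffered raw data and cached metadata without closing the file. */
        H5Fflush(file, H5F_SCOPE_GLOBAL);
    }

    H5Fclose(file);
    return 0;
}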
--
Sebastian Rettenberger, M.Sc.
Technische Universität München
Department of Informatics
Chair of Scientific Computing
Boltzmannstrasse 3, 85748 Garching, Germany
http://www5.in.tum.de/
