Hi Sebastian,

Thank you for the use case to replicate the problem. I managed to replicate it, and it is indeed a bug in the library, caused by a change made a while ago that moved the truncation of the file to its allocated EOA from the flush call to the close call. I have entered Jira Bug HDFFV-9418 for this.

Thanks,
Mohamad
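A rough way to observe the mismatch described above is to compare the size the library reports for the open file, which as far as I understand reflects at least the allocated EOA, with the file's size on disk after a flush. This is only an illustrative sketch; report_size_gap() and its path argument are made up for the example and are not part of HDF5:

/* Sketch: after a flush, the on-disk EOF can stay smaller than the
 * allocated EOA until H5Fclose(), which is the mismatch h5debug later
 * reports as "truncated file". */
#include <hdf5.h>
#include <sys/stat.h>
#include <stdio.h>

static void report_size_gap(hid_t file_id, const char *path)
{
    hsize_t hdf5_size = 0;   /* size as seen by the library */
    struct stat st;

    H5Fflush(file_id, H5F_SCOPE_GLOBAL);
    H5Fget_filesize(file_id, &hdf5_size);

    if (stat(path, &st) == 0 && (hsize_t)st.st_size < hdf5_size)
        printf("on-disk size %lld < HDF5 file size %llu\n",
               (long long)st.st_size, (unsigned long long)hdf5_size);
}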
-----Original Message-----
From: Hdf-forum [mailto:[email protected]] On Behalf Of Sebastian Rettenberger
Sent: Monday, June 15, 2015 9:07 AM
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] File state after flush and crash

Hi,

I think I figured out the problem: the real issue was not the file size itself but the slightly different implementation for large files. To get good performance on the parallel file system, I add some gaps in the dataset so that every task starts writing data at a multiple of the file system block size. This introduces "gaps" in the dataset with uninitialized values. That itself is not a problem; however, the last task also added a gap at the end of the dataset which is never written. Thus, the file is smaller than expected. An H5F_close() seems to fix either the header or the file size, while a simple H5F_flush() does not. Removing the gap from the last task solves the problem for me.

To reproduce this:
- Create a new file with a single dataset.
- Write parts of the dataset (make sure that some values at the end of the dataset are not initialized).
- Flush the file.
- Crash the program.
- Try to open the h5 file with h5dump or h5debug.

I am using the MPIO backend and I have these two properties set for the dataset:

H5Pset_layout(h5plist, H5D_CONTIGUOUS);
H5Pset_alloc_time(h5plist, H5D_ALLOC_TIME_EARLY);

Not sure if this is important. Let me know if you still need a reproducer code.

Best regards,
Sebastian
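A minimal reproducer along the lines of the steps above could look roughly like the following sketch. The file name, the dataset size, and writing only the first half are assumptions made for illustration; only the trailing unwritten gap is modelled, not the per-task alignment gaps:

/* Create a contiguous, early-allocated dataset through the MPIO driver,
 * write only part of it, flush, then "crash" without ever calling
 * H5Fclose(). Reopening the file afterwards should show the truncated-file
 * error from the log below. */
#include <hdf5.h>
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("repro.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    hsize_t dims[1] = { 1024 * 1024 };            /* full dataset, incl. trailing gap */
    hid_t fspace = H5Screate_simple(1, dims, NULL);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_layout(dcpl, H5D_CONTIGUOUS);
    H5Pset_alloc_time(dcpl, H5D_ALLOC_TIME_EARLY);
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, fspace,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* write only the first half; the end of the dataset stays unwritten */
    hsize_t start[1] = { 0 }, count[1] = { dims[0] / 2 };
    hid_t mspace = H5Screate_simple(1, count, NULL);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    double *buf = calloc(count[0], sizeof(double));
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, buf);

    H5Fflush(file, H5F_SCOPE_GLOBAL);             /* flush, but never close */
    abort();                                      /* simulate the crash */
}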
On 06/10/2015 02:28 PM, Sebastian Rettenberger wrote:
> Hi,
>
> no, I do not modify the file that gets corrupted in do_work(). I only
> access other HDF5 files.
>
> I figured out that this problem only exists when creating large files
> (> 3 TB) in parallel (> 1500 MPI tasks). For much smaller files, I did
> not run into this problem.
>
> I will try to figure out the critical file size and create a replicator,
> but this might need some time since I have to wait for the compute
> resources.
>
> I am not sure if this is helpful, but here is the error I get when I
> try to access the corrupt file with h5debug:
>
>> HDF5-DIAG: Error detected in HDF5 (1.8.11) thread 0:
>>   #000: H5F.c line 1582 in H5Fopen(): unable to open file
>>     major: File accessibilty
>>     minor: Unable to open file
>>   #001: H5F.c line 1373 in H5F_open(): unable to read superblock
>>     major: File accessibilty
>>     minor: Read failed
>>   #002: H5Fsuper.c line 351 in H5F_super_read(): unable to load superblock
>>     major: Object cache
>>     minor: Unable to protect metadata
>>   #003: H5AC.c line 1329 in H5AC_protect(): H5C_protect() failed.
>>     major: Object cache
>>     minor: Unable to protect metadata
>>   #004: H5C.c line 3570 in H5C_protect(): can't load entry
>>     major: Object cache
>>     minor: Unable to load metadata into cache
>>   #005: H5C.c line 7950 in H5C_load_entry(): unable to load entry
>>     major: Object cache
>>     minor: Unable to load metadata into cache
>>   #006: H5Fsuper_cache.c line 471 in H5F_sblock_load(): truncated file: eof = 3968572377152, sblock->base_addr = 0, stored_eoa = 3968574947328
>>     major: File accessibilty
>>     minor: File has been truncated
>> cannot open file
>
> Best regards,
> Sebastian
>
> On 06/08/2015 05:08 PM, Mohamad Chaarawi wrote:
>> Hi Sebastian,
>>
>> What happens in do_work()? Are you modifying the file in question?
>> If yes, then corruption can be expected.
>> If not, then the file should not be corrupted, and if it is then we
>> have a bug in the library.
>>
>> If you can send a replicator for this problem we can investigate further.
>>
>> Thanks,
>> Mohamad
>>
>> -----Original Message-----
>> From: Hdf-forum [mailto:[email protected]] On
>> Behalf Of Sebastian Rettenberger
>> Sent: Tuesday, June 02, 2015 4:27 AM
>> To: HDF Users Discussion List
>> Subject: [Hdf-forum] File state after flush and crash
>>
>> Hi,
>>
>> I want to save the program state of a parallel program in an HDF5 file.
>> For performance reasons I do not want to open/close the file each time
>> I write the program state, but flush it instead.
>>
>> Thus my main loop looks basically like this:
>> while more_work() {
>>     do_work()
>>     update_hdf5_attributes()
>>     update_hdf5_dataset()
>>     flush_hdf5()
>> }
>>
>> However, even if the program crashes during do_work(), I get a corrupt
>> HDF5 file.
>>
>> I found a short conversation regarding this but it is already 5 years old:
>> http://lists.hdfgroup.org/pipermail/hdf-forum_lists.hdfgroup.org/2010-February/002543.html
>>
>> They mentioned that this might be a problem with the meta data cache.
>> Is this still true? Is there a way around it?
>> I also have some other open HDF5 files. Might this be a problem?
>>
>> Best regards,
>> Sebastian

--
Sebastian Rettenberger, M.Sc.
Technische Universität München
Department of Informatics
Chair of Scientific Computing
Boltzmannstrasse 3, 85748 Garching, Germany
http://www5.in.tum.de/

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5
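For reference, the flush-per-iteration loop from the original question, written out against the C API as a sketch. more_work(), do_work() and the two update helpers are placeholders carried over from that pseudocode, not real functions:

#include <hdf5.h>

extern int  more_work(void);                /* placeholders from the pseudocode */
extern void do_work(void);
extern void update_hdf5_attributes(hid_t file);
extern void update_hdf5_dataset(hid_t file);

void checkpoint_loop(hid_t file)
{
    while (more_work()) {
        do_work();
        update_hdf5_attributes(file);
        update_hdf5_dataset(file);
        /* push metadata and raw data out after every checkpoint;
         * H5F_SCOPE_GLOBAL also covers any mounted files */
        H5Fflush(file, H5F_SCOPE_GLOBAL);
    }
}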
