Hi David, some additional investigations show the following behavior:
1. Opening/closing the dataspace immediately before/after each incremental write (as suggested in the HDF performance recommendations, cf. https://www.hdfgroup.org/HDF5/faq/perfissues.html) does not increase the speed.

2. Buffering my data items and writing them all at once does increase performance significantly. I can then also observe the influence of the chunks: by varying the chunk size (1000/10000), I get a speed-up from 15 seconds (iterative writing) to approx. 5 / 3 seconds respectively (buffered writing). The file size remains unaffected. This is pretty cool and more or less solves the problem! (See the sketch in the PS below.)

Anyway, the bottom line, or trade-off, is that buffering makes my overall HDF driver a little more complicated. It seems that accessing the file 30000 times simply costs too much time. In this regard I'd still prefer incremental writing, i.e. writing the items one by one, but I can certainly live with the programming overhead induced by buffering.

In conclusion, I'd like to rephrase my original question: can I tell the HDF library to postpone the file writing to the very end, i.e. let the library buffer the items for me automatically?

Thanks and best regards,

Daniel
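PS: For reference, the buffered variant boils down to roughly the following. This is only a simplified sketch with error handling omitted; it reuses memspace, charData and dims from the snippet in my original mail further down, and the only other change is the larger chunk size (10000 instead of 100):

// Collect the items in memory instead of writing each one immediately
// (needs #include <vector>).
std::vector<characteristic_t> buffer;

// ... per item: fill one struct exactly as s1[0] below and stash it away.
characteristic_t item;
item.name = name;
// ... remaining fields as before ...
buffer.push_back(item);

// ... once at the very end (or whenever the buffer gets large enough):
hsize_t count[1] = {buffer.size()};
hsize_t start[1] = {dims[0]};                  // index of the first new element
dims[0] += count[0];
charData.extend(dims);                         // grow the dataset once

DataSpace bufSpace(1, count);                  // memory space: one contiguous block
DataSpace fileSpace = charData.getSpace();
fileSpace.selectHyperslab(H5S_SELECT_SET, count, start);

charData.write(buffer.data(), memspace, bufSpace, fileSpace);  // single write call
buffer.clear();

So instead of ~30000 extend/select/write cycles there is just one, and that is apparently where the 15 seconds go.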
From: [email protected]
To: [email protected]
Date: Mon, 20 Jul 2015 20:27:48 +0200
Subject: Re: [Hdf-forum] Incremental writing of compound dataset slow

Hi David,

playing around with the chunk size reveals that anything above 1000 does not really impact the time. In addition this reveals that my sample data requires around 15 MB net space in the file. In this regard, increasing the chunk size to 1 million only increases the space allocated by the file; there are still only the 30000 compounds to write. I am not 100% sure, but these results seem somewhat reasonable.

Anyway, I am still baffled by the 15 seconds, which remain roughly constant for any chunk size above 1000. Does anyone have experience with this? Is this fast or slow (it actually seems pretty slow to me)?

Thanks,
Daniel

> Date: Mon, 20 Jul 2015 10:06:09 -0700
> From: [email protected]
> To: [email protected]
> Subject: Re: [Hdf-forum] Incremental writing of compound dataset slow
>
> Hi Daniel,
>
> I'm not sure what's going on. You set the chunk size to 1,000,000
> elements, that is, one chunk is 1 million of the structs; the size of
> each chunk in bytes depends on the compound size, which can vary due to
> the variable-length strings. Still, though, since you are writing 30,000
> compounds, I'd think you are only writing one chunk, and you say it is
> just as slow to write, even though it takes up 10 times as much disk
> space? Hopefully you'll get some other ideas from the forum.
>
> best,
>
> David
>
> On 07/20/15 09:57, Daniel Rimmelspacher wrote:
> > Hi David,
> >
> > thanks for the quick reply!
> >
> > Tweaking the chunk size does not do the trick: setting
> >
> > hsize_t chunkDims[1] = {1000000};
> >
> > will increase the file size, but not the speed, i.e. it remains almost
> > the same with a 150 MB instead of a 15 MB file.
> >
> > Regarding the data types in my struct: each element is a variable-length string.
> >
> > Best regards,
> >
> > Daniel
> >
> > > Date: Mon, 20 Jul 2015 09:13:41 -0700
> > > From: [email protected]
> > > To: [email protected]
> > > Subject: Re: [Hdf-forum] Incremental writing of compound dataset slow
> > >
> > > Hi Daniel,
> > >
> > > It looks like you are writing chunks of size 100, where each struct is
> > > maybe 40 bytes? I'm not sure what all the types in the struct are, but
> > > if that is the case, each chunk is about 4k. It is my understanding that
> > > each chunk equates to one system write to disk, and these are expensive.
> > > A good rule of thumb is to target 1 MB chunks.
> > >
> > > best,
> > >
> > > David
> > > software engineer at SLAC
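[Side note on the 1 MB rule of thumb, back-of-envelope only and based on the numbers from the mails above rather than anything measured in code: ~15 MB of net data for ~30000 compounds is roughly 500 bytes per compound, so a chunk length of around 2000 items already lands at about 1 MB per chunk. That would fit my observation that chunk sizes above 1000 hardly change the timing any more. In terms of the snippet below this would be something like

hsize_t chunkDims[1] = {1024 * 1024 / 500};   // ~2000 elements, i.e. roughly 1 MB per chunk
prop.setChunk(rank, chunkDims);

instead of the {100} I started with.]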
> > > On 07/19/15 06:26, Daniel Rimmelspacher wrote:
> > > > Dear hdf-forum,
> > > >
> > > > I am trying to write compound data to an extendible hdf-dataset. For
> > > > the code snippet below, I am writing ~30000 compound items one-by-one,
> > > > resulting in an approximately 15 MB h5-file.
> > > >
> > > > For dumping this amount of data the hdf library requires roughly 15
> > > > seconds. This seems a little bit long to me. My guess is that
> > > > requesting the proper hyperslab for each new item wastes most of the
> > > > time. Here, however, I am struggling a little bit, since I don't
> > > > manage to find out more about this.
> > > >
> > > > I'd appreciate it if someone would have a quick look at the code below
> > > > in order to give me a hint.
> > > >
> > > > Thanks and regards,
> > > >
> > > > Daniel
> > > >
> > > > /////////////////////////////////////////////////////////////////////
> > > > // Header: definition of struct type characteristic_t
> > > > /////////////////////////////////////////////////////////////////////
> > > > ...
> > > >
> > > > /////////////////////////////////////////////////////////////////////
> > > > // This section initializes the dataset (once) for incremental writing
> > > > /////////////////////////////////////////////////////////////////////
> > > > // Initialize the variable-length string type
> > > > const StrType vlst(PredType::C_S1, H5T_VARIABLE);
> > > >
> > > > // Create the memory type for the compound datatype
> > > > memspace = CompType(sizeof(characteristic_t));
> > > > H5Tinsert(memspace.getId(), "Name", HOFFSET(characteristic_t, name), vlst.getId());
> > > > H5Tinsert(memspace.getId(), "LongIdentifier", HOFFSET(characteristic_t, longId), vlst.getId());
> > > > H5Tinsert(memspace.getId(), "Type", HOFFSET(characteristic_t, type), vlst.getId());
> > > > H5Tinsert(memspace.getId(), "Address", HOFFSET(characteristic_t, address), vlst.getId());
> > > > H5Tinsert(memspace.getId(), "Deposit", HOFFSET(characteristic_t, deposit), vlst.getId());
> > > > H5Tinsert(memspace.getId(), "MaxDiff", HOFFSET(characteristic_t, maxDiff), vlst.getId());
> > > > H5Tinsert(memspace.getId(), "Conversion", HOFFSET(characteristic_t, conversion), vlst.getId());
> > > > H5Tinsert(memspace.getId(), "LowerLimit", HOFFSET(characteristic_t, lowerLimit), vlst.getId());
> > > > H5Tinsert(memspace.getId(), "UpperLimit", HOFFSET(characteristic_t, upperLimit), vlst.getId());
> > > >
> > > > // Prepare the dataset
> > > > dims[0] = 0;                            // initial size
> > > > hsize_t rank = 1;                       // data will be aligned in array style
> > > > hsize_t maxDims[1] = {H5S_UNLIMITED};   // dataset will be extendible
> > > > hsize_t chunkDims[1] = {100};           // some random chunk size
> > > > DataSpace *dataspace = new DataSpace(rank, dims, maxDims); // set dataspace for dataset
> > > >
> > > > // Modify the dataset creation property list to enable chunking
> > > > DSetCreatPropList prop;
> > > > prop.setChunk(rank, chunkDims);
> > > >
> > > > // Create the chunked dataset. Note the use of the pointer.
> > > > charData = file.createDataSet("Characteristic", memspace, *dataspace, prop);
> > > >
> > > > // Init helpers
> > > > hsize_t chunk[1] = {1};
> > > > chunkSpace = DataSpace(1, chunk, NULL);
> > > > filespace = DataSpace(charData.getSpace());
> > > >
> > > > /////////////////////////////////////////////////////////////////////
> > > > // This section will be called repeatedly in order to write the
> > > > // compound items iteratively
> > > > /////////////////////////////////////////////////////////////////////
> > > > // Create the new item.
> > > > characteristic_t s1[1];
> > > > s1[0].name = name;
> > > > s1[0].longId = Id;
> > > > s1[0].type = type;
> > > > s1[0].address = address;
> > > > s1[0].deposit = deposit;
> > > > s1[0].maxDiff = maxDiff;
> > > > s1[0].conversion = conversion;
> > > > s1[0].lowerLimit = lowerLimit;
> > > > s1[0].upperLimit = upperLimit;
> > > >
> > > > // Extend the dataset
> > > > dims[0]++;
> > > > charData.extend(dims);
> > > >
> > > > // Compute the offset of the new element
> > > > hsize_t chunk[1] = {1};
> > > > hsize_t start[1] = {0};
> > > > start[0] = dims[0] - 1;
> > > >
> > > > // Select a hyperslab in the extended portion of the dataset.
> > > > filespace = charData.getSpace();
> > > > filespace.selectHyperslab(H5S_SELECT_SET, chunk, start);
> > > >
> > > > // Write the data to the extended portion of the dataset.
> > > > charData.write(s1, memspace, chunkSpace, filespace);
_______________________________________________ Hdf-forum is for HDF software users discussion. [email protected] http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org Twitter: https://twitter.com/hdf5
