Re: [Hdf-forum] RFC: libHDF5 to support row and column major storage?

Jason Newton Tue, 09 Jun 2015 18:40:26 -0700

Werner,

Could you point me to the thread you mentioned? I figured this came up
before and I'd like to take a read of it.


Re: "small as possible" - I see the reasoning, but I think the added
complexity just has to be swallowed here. How much complexity would truly
be introduced?  Surely not as much as the type and conversion system, but I
know what you mean just the same.  What drives this strategy is how often
people violate the very soft C-order-only guideline mentioned in the
documentation, and what kind of people they are.  We are always going to
have violators who are striving for performance and for fewer in-memory
copies of huge datasets.  One common thorn in industry is MATLAB.  The
typical MATLAB user doesn't know any of the APIs involved and expects
things to just work with a simple one-liner; correcting behavior on their
side is an intractable problem, and something I've seen introduce bumps in
spreading HDF to others I work with.  MATLAB itself saves its data in
Fortran order when using the MAT serializers.  I believe it was noted in
the past that they did this for performance, although I cannot find
references to this now, and that probably extends to not enforcing C order
in the low-level I/O functions.  I'll also note that I had a difficult time
supporting generalized corrections in MATLAB when dealing with multiple
common cases, such as nested 3x3 or 4x4 matrices in compound types.  I
always had to write preprocessing/postprocessing scripts that were very
slow, since MATLAB was doing them.
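
To make the "looks transposed" effect concrete, here is a minimal sketch in
plain C++ with no HDF5 calls at all. All function names are made up for
illustration; the only assumption is that the writer lays the buffer out in
Fortran order and the reader interprets the same bytes in C order with the
dimensions reversed, as in the 4x6 vs 6x4 case from the HDF5 Fortran docs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Value the writer intends to store at logical position (r, c).
double intended(std::size_t r, std::size_t c) {
    return 10.0 * static_cast<double>(r) + static_cast<double>(c);
}

// Writer side: lay out a rows x cols matrix in column-major (Fortran) order.
std::vector<double> write_col_major(std::size_t rows, std::size_t cols) {
    assert(rows > 0 && cols > 0);
    std::vector<double> buf(rows * cols);
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            buf[c * rows + r] = intended(r, c);
    return buf;
}

// Reader side: element [i][j] of the same buffer, read in C order, where
// d1 is the size of the fastest-varying (last) dimension.
double read_c_order(const std::vector<double>& buf,
                    std::size_t i, std::size_t j, std::size_t d1) {
    return buf[i * d1 + j];
}

// A C reader sees dimensions cols x rows, and [i][j] holds M(j, i).
bool c_reader_sees_transpose(std::size_t rows, std::size_t cols) {
    const std::vector<double> buf = write_col_major(rows, cols);
    for (std::size_t i = 0; i < cols; ++i)
        for (std::size_t j = 0; j < rows; ++j)
            if (read_c_order(buf, i, j, rows) != intended(j, i)) return false;
    return true;
}
```

Nothing is lost or corrupted in the buffer; only the index order differs,
which is why a reader cannot tell which case it is in without some marker
stored alongside the data.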

I accept that first-class support of C/Fortran order is likely not
happening, and that saddens me, because in my eyes doing it transparently
in the core libraries, with something like properties, is an investment
that is going to pay off in support and user satisfaction.  On the one
hand it'll probably be thankless work, but on the other it'll remove a very
ugly wart when sharing data between teams/members.  I'd say this wart has
been my biggest barrier in getting scientific (MATLAB) folks at work to use
HDF5 directly.

I guess in my library I will default to converting column-major matrices
to/from row-major on the fly when simply outputting matrix datasets... but
this still doesn't work for column-major nested types inside
compounds/structs.  The only solution I can figure there needs to use the
type information of the array types wrapping the matrix fields.  Putting
the burden on the struct designer to make HDF save views of compounds
before IO is not a good one in my experience (it leads to a lot of code),
so the only thing I'm left with saying is: don't store Fortran-order
matrices in structs.
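
As a rough sketch of what that on-the-fly conversion could look like, and
why the compound case is more awkward, consider the following. It is plain
C++ with no HDF5 calls; the Pose struct and all function names are
hypothetical, invented only for illustration and not taken from my library
or from HDF5.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Plain matrix dataset: copy a rows x cols column-major buffer into a new
// row-major buffer right before handing it to the writer.
std::vector<double> to_row_major(const std::vector<double>& src,
                                 std::size_t rows, std::size_t cols) {
    assert(src.size() == rows * cols);
    std::vector<double> dst(src.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            dst[r * cols + c] = src[c * rows + r];
    return dst;
}

// Hypothetical compound record with a nested 3x3 matrix field.
struct Pose {
    double timestamp;
    std::array<double, 9> rotation;  // held column-major in memory
};

// Swap the off-diagonal pairs, turning column-major into row-major.
void transpose3x3_in_place(std::array<double, 9>& m) {
    for (std::size_t r = 0; r < 3; ++r)
        for (std::size_t c = r + 1; c < 3; ++c)
            std::swap(m[r * 3 + c], m[c * 3 + r]);
}

// Compound case: every record has to be visited and each matrix field
// fixed up by hand, which requires knowing which fields are matrices.
void fields_to_row_major(std::vector<Pose>& records) {
    for (Pose& p : records) transpose3x3_in_place(p.rotation);
}
```

The first function is generic over any matrix dataset, while the second
has to be rewritten for every struct layout unless the array type itself
carries the ordering information.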

-Jason

On Tue, Jun 9, 2015 at 1:24 PM, Werner Benger <[email protected]> wrote:

>  Jason,
>
>  the reason would be to keep the complexity of HDF5 as small as possible.
> Introducing indexing-reordering into HDF5 increases complexity and
> introduces possible sources of errors, especially as there is no need for
> HDF5 to do it. HDF5 can just concentrate on storing all datasets in C
> order, and the handling of Fortran indexing can be separated out into an
> add-on library, similar to the h5lite library that is shipped with HDF5.
>
> Both the HDF5 tools, such as hdfview and h5ls, and the HDF5 Fortran API
> would of course have to make use of that add-on library to set and
> interpret such a "fortran-order" flag attribute. Using the "bare-bones"
> HDF5 would be limited to mere C-order I/O.
>
> Actually I had pretty much the same discussion ten years ago with other
> users of HDF5 as well. The arguments were the same: the desire to change HDF5
> to support different index schemes, versus considering HDF5 as C-only and
> doing anything else on top of it. Ultimately it's the decision of the HDF
> team whether HDF5 should support different indexing schemes in its core
> API. But the fact that it has never been done demonstrates that it's
> unlikely to happen, and since it can be done via an add-on library (which
> needs to be used by both the HDF5 tools and the HDF5 fortran api, but it
> would not affect the HDF5 core), this seems to be the easier and thus more
> realistic solution.
>
>       Werner
>
>
>
> On 09.06.2015 19:30, Jason Newton wrote:
>
>   Werner,
>
>  What is the argument for leaving this to yet another add-on library on
> top of HDF5?  This strategy would still require the user to check after
> reading, for instance, and call another API. I believe this is going to make
> it a less than first-class citizen/feature at the least. Ideally we want
> most users reading to not even know this is happening, like when content is
> chunked or compressed, although the metadata should be there so the user
> can infer it will happen in their program.
>
>  Also, we want tools like hdfview, h5dump/h5ls to output the content
> correctly too.
>
>  -Jason
>
> On Tue, Jun 9, 2015 at 3:58 AM, Werner Benger <[email protected]> wrote:
>
>>  Basically what it needs is a convention such as an attribute to allow
>> identifying in which permutation order a dataset is stored...
>>
>> As they say in
>>
>> https://www.hdfgroup.org/HDF5/doc/fortran/index.html
>>
>> "When a C application reads data stored from a Fortran program, the data
>> will appear to be transposed due to the difference in the C and Fortran
>> storage orders. For example, if Fortran writes a 4x6 two-dimensional
>> dataset to the file, a C program will read it as a 6x4 two-dimensional
>> dataset into memory. The HDF5 C utilities h5dump and h5ls will also display
>> transposed data, if data is written from a Fortran program. "
>>
>> But there is no way to find out whether data had been stored by a C or
>> Fortran program. A simple agreement on an attribute would do, even better
>> shared dataspaces that can hold such an attribute.
>>
>> All the index-permutation or data transposing (if really required) can be
>> in some add-on library on top of HDF5 (similar to what F5 does, though F5
>> does more than just that).
>>
>>      Werner
>>
>>
>>
>> On 09.06.2015 11:00, Jason Newton wrote:
>>
>>   Was hoping more commentary would have happened but I also had some
>> timing issues getting back to this, my apologies.
>>
>>  Werner, thank you for your reply, but your case is exactly the proof that
>> this is an issue that should be dealt with at the specification & library
>> level, as I am arguing.  Permuting indices whenever accessing data
>> is a large burden to put on user code, especially considering how many
>> different bindings one might use to access the data. It leads to repetitive
>> and intrusive handling, which is not what the user should be dealing with.
>> It's tricky, automatable, isolatable (to the library), difficult outside
>> of C (at least in Python), and not the task users should be spending time
>> on when using advanced software like HDF5.
>>
>>  If we look at the example of Eigen and Numpy we can see they have flags
>> set for dealing with column/row [
>> http://eigen.tuxfamily.org/dox-devel/group__TopicStorageOrders.html ]
>> and c/fortran [ see order argument:
>> http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html &
>> http://docs.scipy.org/doc/numpy/reference/c-api.array.html ].  This
>> shows at least some numerical processing code deemed it important enough to
>> not only deal with the issue, but usually provide seamless usage or
>> conversion to the user's desired type.
>>
>>  I think defaults can be set to not change current behaviour, but
>> datasets & arrays could now be marked with a flag such as Python's.  When
>> reading/writing, an optional flag is provided for the memory space's
>> requested interpretation (default to C or Fortran by language context).  We
>> could potentially put this in the dataset properties and type properties so
>> we wouldn't have to change the API.  And ideally, with the permutation
>> handled in C and performing well, the library permutes the storage for you
>> as it's doing the IO, for a hopefully negligible performance hit since IO
>> is likely the limiting factor.
>>
>>  I brought this up because I'm writing a generalized HDF C++ library, and
>> when trying to support something like Eigen (and more!), which allows both
>> C and F orders in the same runtime, it gets confusing how to IO to/from
>> HDF files, as the current approach relies on language-level wrappers to
>> decide what the right thing to do is, and weakly at that.   But the user
>> may genuinely want to IO a Fortran- or C-ordered dataset/array
>> to/from a C/Fortran dataset/array in any combination that makes sense
>> to them, and this doesn't really work.  I can be left with baffling
>> scenarios like this one failing unless all data written to HDF files is in
>> C order:
>>
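>>> // hdf: file handle from the library under discussion; i: any valid row
>>> Eigen::Matrix<double, 4, 5, Eigen::RowMajor> A_c;
>>> A_c.setZero();
>>> A_c.row(i).setConstant(5);
>>> Eigen::Matrix<double, 4, 5, Eigen::ColMajor> A_f;
>>> hdf.write("A", A_c);
>>> hdf.read("A", A_f);
>>> assert(A_c == A_f);
>>>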
>>
>>   If in this scenario A was already written by a Fortran program, then
>> code making the above test case work would apply a conversion where none is
>> needed for a read like this, making this test case's assertion fail:
>>
>>> Eigen::Matrix<double, 4, 5, Eigen::RowMajor> A_c;
>>> A_c.setZero();
>>> A_c.row(i).setConstant(5);
>>> Eigen::Matrix<double, 4, 5, Eigen::ColMajor> A_f;
>>> hdf.read("A", A_f);
>>> assert(A_c == A_f);
>>>
>>
>> And that's why flags need to be saved in the document... the content
>> needs to specify its storage layout - guessing based on language cannot
>> cover all cases, and user-made attributes are not the way because that
>> would be a standard nobody knows about or will use.
>>
>>  -Jason
>>
>> On Tue, May 12, 2015 at 12:16 AM, Werner Benger <[email protected]>
>> wrote:
>>
>>>  Hi Jason,
>>>
>>>  I was facing the same issues, as pretty much all use cases I know of and
>>> have in my visualization software and context use and require "fortran"
>>> order of indexing, including OpenGL graphics. It's not really an issue with
>>> HDF5, as the only thing required is to permute the indices when accessing
>>> the HDF5 API. And the HDF5 tools of course will display data transposed
>>> then. This index permutation is supported in the F5 library via a generic
>>> permutation vector that is stored with a group of datasets sharing the same
>>> properties (the F5 library is a C library on top of HDF5 guiding towards a
>>> specific data model for various classes of data types occurring
>>> particularly in scientific visualization):
>>>
>>> http://www.fiberbundle.net/doc/structChartDomain__IDs.html
>>>
>>> So via the F5 API one would see the fortran-like indexing convention,
>>> whereas whenever accessing data with the lower-level HDF5 API, it's C-like
>>> convention (whereby the permutation vector gives the option of arbitrary
>>> permutations).
>>>
>>> I remember there had been plans by the HDF5 group to introduce "named
>>> dataspaces", similarly to "named datatypes", that could then be stored in
>>> the file as its own entity. Such would be a good place to store properties
>>> of a dataspace as attributes on a dataspace, and to have such shared among
>>> datasets. It would be a natural place to store a permutation vector, which
>>> could be reduced to a simple flag as well to just distinguish between C and
>>> fortran indexing conventions. Of course, all the related tools would also
>>> need to honor such an attribute then. Until then, one could use an
>>> attribute on each dataset and implement index permutation similar to how
>>> the F5 library does it. It may be safer to use new API functions anyway to
>>> not break old code that always expects C order indexing.
>>>
>>>           Werner
>>>
>>>
>>> On 12.05.2015 06:48, Jason Newton wrote:
>>>
>>>  Hi -
>>>
>>> I've been an evangelist for HDF5 for a few years now. It is a noble
>>> and amazing library that solves data storage issues occurring in
>>> scientific applications and beyond - e.g. it can save many developers from
>>> wasting time and money so they can spend that on solving more original
>>> problems.  But you guys knew that already.  I think there's been a mistake
>>> though - that is the lack of first-class column-vs-row-major storage.  In
>>> a world where we are split down the middle on which format we use,
>>> depending on the application, library and language we work in, it is an
>>> ongoing reality that there will never be one true standard to follow.
>>> But HDF5 sought to only support row-major - and I can get behind that -
>>> standardizing is a good thing.  But then as time has shown, that really
>>> didn't work for a lot of folks - such as those in MATLAB and Fortran - when
>>> they read our data - it looks transposed to them!  When HDF5 utils/our code
>>> sees their data - it looks transposed to us!  These are arguably the users
>>> you do not want to face these difficulties, as it is downright
>>> embarrassing at times and hard to work around within that language
>>> (ahem, MATLAB again is painful to work with).  Not only that, but it
>>> doesn't really scale - it will always take some manual fixing, and there's
>>> no standardized mark for whether a dataset is one of these column-major
>>> masquerading datasets.  So let me assure you this is quite ugly to deal
>>> with in MATLAB/etc. and doesn't seem to be the path many people take - and
>>> it can require skills many people don't have or understanding that they
>>> can't give.
>>>
>>> But then, why did we allow saving column-major data in a row-based
>>> standard in the first place?  Well, the answer seems to be performance.
>>> Surely it can't take that long to convert the datasets - most of the time
>>> at least - although there would for sure be some memory-based limitations
>>> on transposing just as HDF does the IO. But alas - the current state of the
>>> library indicates otherwise, and thus it is the user's job to correctly
>>> transform the data back and forth between application and party.  But
>>> wait - wasn't this kind of activity what HDF5 was built to alleviate in the
>>> first place?
>>>
>>>  So then how do we rectify the situation?  Well, speaking as a developer
>>> using HDF5 extensively and writing libraries for it - it looks to me like
>>> it should be in the core library, as it is exceedingly messy to handle on
>>> the user side each time.  I think the interpretation of the dataset and its
>>> dimensions should be based on dataset creation properties.  This would
>>> allow an official marking of how the raw storage of the data (and
>>> dimensions?) is to be interpreted.  However, this is only half of the
>>> battle.  We'd need something like the type conversion system to permute
>>> order in all the right places if the user needs to IO an opposing storage
>>> layout.  And it should be fast and light on memory.  Perhaps it would
>>> merely operate in place as a new utility subroutine taking in the mem_type
>>> and user memory.  However, I can still think of one problem this does not
>>> address: compound types using a mixture of philosophies, with fields being
>>> the opposite of the dataset layout - and this case has me completely
>>> stumped, as this indicates it should be type-level as well.  The compound
>>> part of this is a sticky situation, but I'd still motion that the dataset
>>> creation property works for most things that occur in practice.
>>>
>>>  So... has the HDF5 group tried to deal with this wart yet?  Let me
>>> know if anything is on the drawing board.
>>>
>>>
>>>  -Jason
>>>
>>>
>>>  _______________________________________________
>>> Hdf-forum is for HDF software users discussion.
>>> [email protected]
>>> http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
>>> Twitter: https://twitter.com/hdf5
>>>
>>>
>>> --
>>> ___________________________________________________________________________
>>> Dr. Werner Benger                Visualization Research
>>> Center for Computation & Technology at Louisiana State University (CCT/LSU)
>>> 2019  Digital Media Center, Baton Rouge, Louisiana 70803
>>> Tel.: +1 225 578 4809                        Fax.: +1 225 578-5362

