Re: [Hdf-forum] RFC: libHDF5 to support row and column major storage?

Werner Benger Tue, 09 Jun 2015 13:25:44 -0700

Jason,

the reason would be to keep the complexity of HDF5 as small aspossible. Introducing indexing-reordering into HDF5 increases complexityand introduces possible sources of errors, especially as there is noneed for HDF5 to do it. HDF5 can just concentrate on storing alldatasets in C order and handling of fortran indexing to be separated outin an add-on library similar to h5lite library that is shipped with HDF5.

Both the HDF5 tools such as hdfview, h5ls and the HDF5 fortran api ofcourse would have to make use of that addon-library to set and interpretsuch an "fortran-order" flag attribute. Using the "bare-bone" HDF5 wouldbe limited to mere C-order I/O .

Actually I had pretty much the same discussion ten years ago with otherusers of HDF5 as well. It was the same arguments, the desire to changeHDF5 to support different index schemes, versus considering HDF5 asC-only and doing anything else on top of it. Ultimately it's thedecision of the HDF team whether HDF5 should support different indexingschemes in its core API. But the fact that it has never been donedemonstrates that it's unlikely to happen, and since it can be done viaan add-on library (which needs to be used by both the HDF5 tools and theHDF5 fortran api, but it would not affect the HDF5 core), this seems tobe the easier and thus more realistic solution.


      Werner


On 09.06.2015 19:30, Jason Newton wrote:

Werner,

What is the argument for leaving this to yet another add-on library ontop of HDF5? This strategy would still require the user checks afterreading for instance and calls another api. I believe this is going tomake it a less than first-class citizen/feature at the least. Ideallywe want most users reading to not even know this is happening, likewhen content is chunked or compressed, although the metadata should bethere so the user can infer it will happen in their program..

Also, we want tools like hdfview, h5dump/h5ls to output the contentcorrectly too.


-Jason

On Tue, Jun 9, 2015 at 3:58 AM, Werner Benger <[email protected]<mailto:[email protected]>> wrote:


    Basically what it needs is a convention such as an attribute to
    allow identifying in which permutation order a dataset is stored...

    As they say in

    https://www.hdfgroup.org/HDF5/doc/fortran/index.html

    "When a C application reads data stored from a Fortran program,
    the data will appear to be transposed due to the difference in the
    C and Fortran storage orders. For example, if Fortran writes a 4x6
    two-dimensional dataset to the file, a C program will read it as a
    6x4 two-dimensional dataset into memory. The HDF5 C utilities
    h5dump and h5ls will also display transposed data, if data is
    written from a Fortran program. "

    But there is no way to find out whether data had been stored by a
    C or Fortran program. A simple agreement on an attribute would do,
    even better shared dataspaces that can hold such an attribute.

    All the index-permutation or data transposing (if really required)
    can be in some add-on library on top of HDF5 (similar to what F5
    does, though F5 does more than just that).

         Werner



    On 09.06.2015 11:00, Jason Newton wrote:

    Was hoping more commentary would have happened but I also had
    some timing issues getting back to this, my apologies.

    Werner, thank you for you reply but your case is exactly the
    proof of this as an issue that should be dealt with at the
    specification & library level that I am talking about. Permuting
    indices whenever accessing data is a large burden to put on user
    code, especially considering how many different bindings one
    might use to access the data. It leads to repeating and intrusive
    handling which is not what the user should be dealing with.  It's
    tricky, automatable, isolatable (to the library), difficult out
    of C (at least in python), and not what the tasks they should be
    spending time on using an advanced software like HDF5.

    If we look at the example of Eigen and Numpy we can see they have
    flags set for dealing with column/row [
    http://eigen.tuxfamily.org/dox-devel/group__TopicStorageOrders.html
    ]  and c/fortran [ see order argument:
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html

& http://docs.scipy.org/doc/numpy/reference/c-api.array.html ].This shows at least some numerical processing code deemed it

    important enough to not only deal with the issue, but usually
    provide seamless usage or conversion to the user's desired type.

    I think defaults can be set to not change current behaviour but
    that datasets & arrays could now be marked with a flag such as
    python's.  When reading/writing, an optional flag is provided for
    the memory space's requested interpretation (default to C or
    Fortran by language context).  We could potentially put this in
    the dataset properties and type properties so we wouldn't have to
    change API.  And ideally, hopefully performance being pretty
    great and handled in C, the library permutes the storage for you
    as it's IOing it in for hopefully negligible performance bump
    since IO is likely the limiting factor.

    I brought this up because I'm writing a generalized HDF C++
    library and when trying to support something like Eigen (and
    more!), which allows both C and F orders in the same runtime, it
    gets confusing on how to IO to/from HDF files as the current
    approach relies on language level wrappers to decide what the
    right thing to do is, and weakly at that.   But the user may
    genuinely want to IO in/out a fortran or C ordered dataset/array
    to/from a C/fortran dataset/array in any combination for what
    makes sense to them and this doesn't really work.  I can be left
    with baffling scenarios like this failing unless all data written
    to HDF files is in C order.:

        Eigen::Matrix<double, 4, 5, RowMajor> A_c; A_c.setZero();
        A_c.row(i) = 5;
        Eigen::Matrix<double, 4, 5, ColMajor> A_f;
        hdf.write("A", A_c);
        hdf.read("A", A_f);
        assert(A_c == A_f);


      If in this scenario A was already written by a Fortran program,
    then code making the above test case work would apply a
    conversion where none is needed for a read like this, making this
    test cases' assertion fail:

        Eigen::Matrix<double, 4, 5, RowMajor> A_c; A_c.setZero();
        A_c.row(i) = 5;
        Eigen::Matrix<double, 4, 5, ColMajor> A_f;
        hdf.read("A", A_f);
        assert(A_c == A_f);


    And that's why flags need to be saved in the document... the
    content needs to specify it's storage layout - guessing based on
    language cannot cover all cases and user made attributes are not
    the way because that would a be a standard nobody knows about or
    will use.

    -Jason

    On Tue, May 12, 2015 at 12:16 AM, Werner Benger
    <[email protected] <mailto:[email protected]>> wrote:

        Hi Jason,

         I was facing the same issues as pretty much all use case I
        know and have in my visualization software and context use
        and require "fortran" order of indexing, including OpenGL
        graphics. It's not really an issue with HDF5 as the only
        thing required is to permute the indices when accessing the
        HDF5 API. And the HDF5 tools of course will display data
        transposed then. This index permutation is supported in the
        F5 library via a generic permutation vector that is stored
        with a group of dataset sharing the same properties (the F5
        library is a C library on top of HDF5 guiding towards a
        specific data model for various classes of data types
        occurring particularly in scientific visualization):

        http://www.fiberbundle.net/doc/structChartDomain__IDs.html

        So via the F5 API one would see the fortran-like indexing
        convention, whereas whenever accessing data with the
        lower-level HDF5 API, it's C-like convention (whereby the
        permutation vector gives the option of arbitrary permutations).

        I remember there had been plans by the HDF5 group to
        introduce "named dataspaces", similarly to "named datatypes",
        that could then be stored in the file as its own entity. Such
        would be a good place to store properties of a dataspace as
        attributes on a dataspace, and to have such shared among
        datasets. It would be a natural place to store a permutation
        vector, which could be reduced to a simple flag as well to
        just distinguish between C and fortran indexing conventions.
        Of course, all the related tools would also need to honor
        such an attribute then. Until then, one could use an
        attribute on each dataset and implement index permutation
        similar to how the F5 library does it. It may be safer to use
        new API functions anyway to not break old code that always
        expects C order indexing.

                  Werner


        On 12.05.2015 06:48, Jason Newton wrote:

        Hi -

        I've been a evangelist for HDF5 for a few of years now, it
        is a noble and amazing library that solves data storage
        issues occurring with scientific and beyond applications -
        e.g. it can save many developers from wasting time and money

so they can spend that on solving more original problems.But you guys knew that already. I think there's been a

        mistake though - that is the lack of first class
        column-vs-row major storage.  In a world where we are split
        down the middle on what format we used based on what
        application, library and language we use we work in one or
        the other it is an ongoing reality that there will never be
        one true standard to follow.  But HDF5 sought to only
        support row-major - and I can back that up - standardizing
        is a good thing. But then as time has shown, that really
        didn't work for alot of folks - such as those in Matlab and
        fortran - when they read our data - it looks transposed to
        them!  When HDF5 utils/our code sees their data - it looks
        transposed to us!  These are arguably the users you do not
        want to face these difficulties  as it makes it down right
        embarrassing at times and hard to work around in within that
        language (ahem, Matlab again is painful to work with).  Not
        only that but it doesn't really scale - it will always take
        some manual fixing and there's no standardized mark for
        whether a dataset is one of these column major masquerading
        datasets.  So let me assure you this is quite ugly to deal
        with in Matlab/etc and doesn't seem to be the path many
        people take - and it can require skills many people don't
        have or understanding that they can't give.

        But then, why did we allow saving column major data in a row
        based standard in the first place? Well, the answer seems to
        be performance.  Surely it can't take that long to convert
        the datasets - most of the time at least - although there
        would for sure be some memory based limitations to allow
        transposing just as HDF IOs. But alas - the current state of
        the library indicates otherwise and thus is the users job to
        handle correctly transforming the data back and forth
        between application and party.  But wait - wasn't this kind
        of activity what HDF5 was built to alleviate in the first place?

        So then how do we rectify the situation?  Well speaking as a
        developer using HDF5 extensively and writing libraries for
        it - it looks to me it should be in the core library as it

is exceedingly messy to handle on the user side each time.I think the interpretation of the dataset and it'sdimensions should be based on dataset creation properties.This would allow an official marking of what kind of

        interpretation the raw storage of the data (and dimensions?)
        are.  However, this is only half of the battle.  We'd need
        something like the type conversion system to permute order
        in all the right places if the user needs to IO an opposing

storage layout. And it should be fast and light on memory.Perhaps it would merely operate inplace as a new utility

        subroutine taking in the mem_type and user memory. However I
        can still think of one problem this does not address:
        compound types using  a mixture of philosophies with fields
        being the opposite to the dataset layout - and this case has
        me completely stumped as this indicates it should be type
        level as well.  The compound part of this is a sticky
        situation but I'd still motion that the dataset creation
        property works for most things that occur in practice.

So... has the HDF5 group tried to deal with this wart yet?Let me know if anything is on the drawing board.



        -Jason


        _______________________________________________
        Hdf-forum is for HDF software users discussion.
        [email protected]  <mailto:[email protected]>
        http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
        Twitter:https://twitter.com/hdf5

--___________________________________________________________________________

        Dr. Werner Benger                Visualization Research
        Center for Computation & Technology at Louisiana State University 
(CCT/LSU)
        2019  Digital Media Center, Baton Rouge, Louisiana 70803

Tel.:+1 225 578 4809 <tel:%2B1%20225%20578%204809> Fax.:+1 225 578-5362 <tel:%2B1%20225%20578-5362>


        _______________________________________________
        Hdf-forum is for HDF software users discussion.
        [email protected]
        <mailto:[email protected]>
        http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
        Twitter: https://twitter.com/hdf5




    _______________________________________________
    Hdf-forum is for HDF software users discussion.
    [email protected]  <mailto:[email protected]>
    http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
    Twitter:https://twitter.com/hdf5

--___________________________________________________________________________

    Dr. Werner Benger                Visualization Research
    Center for Computation & Technology at Louisiana State University (CCT/LSU)
    2019  Digital Media Center, Baton Rouge, Louisiana 70803

Tel.:+1 225 578 4809 <tel:%2B1%20225%20578%204809> Fax.:+1 225 578-5362 <tel:%2B1%20225%20578-5362>


    _______________________________________________
    Hdf-forum is for HDF software users discussion.
    [email protected] <mailto:[email protected]>
    http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
    Twitter: https://twitter.com/hdf5




_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5


--
___________________________________________________________________________
Dr. Werner Benger                Visualization Research
Center for Computation & Technology at Louisiana State University (CCT/LSU)
2019  Digital Media Center, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809                        Fax.: +1 225 578-5362

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5

Re: [Hdf-forum] RFC: libHDF5 to support row and column major storage?

Reply via email to