Re: [Hdf-forum] RFC: libHDF5 to support row and column major storage?

Werner Benger Wed, 10 Jun 2015 01:23:49 -0700

Jason,

the discussion I just recalled was not in this forum but elsewhere, and it was in German, so it is probably not too helpful here. Basically it was the same argument: "wait for HDF5 to become better and implement such a feature in its core" versus "do it ourselves on top of HDF5 via some add-on library layer". Technically it's as simple as, e.g., introducing a convention that all datasets and data structures whose names end with "_f" are considered Fortran-order (in practice I'm using attributes on named data types to store an index permutation vector).
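
A minimal sketch of the naming convention mentioned above, assuming (hypothetically) that the "_f" suffix is the only marker and that Fortran order simply means a reversed index permutation. The function names are made up for illustration and are not part of HDF5 or F5.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical convention: a dataset whose name ends in "_f" is taken to
// hold Fortran-ordered (column-major) data.
bool is_fortran_named(const std::string& name) {
    const std::string suffix = "_f";
    return name.size() >= suffix.size() &&
           name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Index permutation vector for a dataset of the given rank:
// identity for C order, reversed for Fortran order.
std::vector<std::size_t> index_permutation(const std::string& name, std::size_t rank) {
    assert(rank > 0);
    std::vector<std::size_t> perm(rank);
    std::iota(perm.begin(), perm.end(), std::size_t{0});  // 0, 1, ..., rank-1
    if (is_fortran_named(name)) {
        std::reverse(perm.begin(), perm.end());           // rank-1, ..., 1, 0
    }
    return perm;
}
```

Storing the permutation vector explicitly, as described above, is more general than a name suffix, since it can also express orderings that are neither pure C nor pure Fortran.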

The main problem is that whatever is introduced here, be it a new HDF5 core feature or a new HDF5 add-on library implementing such a convention, such functionality also needs to be used. It would need to be used by the HDF5 tools, by the HDF5 Fortran API, by MATLAB, by any software that has data in Fortran order in memory and writes it to C-order HDF5. It's just this one piece of information that needs to be stored to be able to interpret the data correctly; it's not even a performance problem.

If you're going to use type information to store such information, same as I do in F5, then you will probably face the same burden: transient types cannot hold attributes, only named types, which are bound to a file. That is another aspect that would be nice to see improved in HDF5. But "for the time being" it can be handled by "lots of code" on top of HDF5, and applications are still spared from doing it themselves if it's done via a reusable add-on library. However, such an add-on library cannot be fully generic, as it does introduce certain conventions on how to use HDF5, even if minimalistic. I'd see it like the HDF5 dimension scales or the image specification, which are HDF5-approved conventions on top of HDF5 and supported by the HDF5 tools.


       Werner


On 10.06.2015 03:38, Jason Newton wrote:

Werner,

Could you point me to the thread you mentioned? I figured this had come up before and I'd like to read it.

Re "small as possible": I see the reasoning, but I think it just has to be swallowed here. What is the true amount of complexity introduced? Surely not as bad as the types and conversion system, but I know what you mean just the same. Driving this strategy is the consideration of how often people violate the very soft C-order-only guideline mentioned in the documentation, and what type of people they are. And we're going to have violators of it, striving for performance and fewer copies of huge datasets in memory... One common thorn in industry is MATLAB. The common MATLAB user doesn't know any of the APIs involved and expects things to just work with a simple one-liner; correcting behavior on their side is an intractable problem and something I've seen introduce bumps in spreading HDF to others I work with. MATLAB itself saves its data in Fortran order when using the mat serializers. I believe it was noted in the past that they did this for performance, although I cannot find references to this now, and that probably extends to not enforcing C order in the low-level IO functions. I'll also note I had a difficult time supporting generalized corrections in MATLAB when dealing with multiple common cases, such as nested 3x3 or 4x4 matrices in compound types. I'd always have to write preprocessing/postprocessing scripts that were very slow since MATLAB was doing them.

I accept that first-class support of C/Fortran order is likely not happening, and that saddens me, because in my eyes doing it transparently in the core libraries, with something like properties, is an investment that is going to pay off in support and user satisfaction. On the one side, it'll probably be thankless work, but on the other it'll remove a very ugly wart when sharing data between teams/members. I'd say this wart has been my biggest barrier in getting scientific (MATLAB) folks at work to use HDF5 directly.

I guess in my library I will default to converting column-major matrices to/from row-major on the fly when simply outputting matrix datasets... but this still doesn't work for column-major nested types inside compounds/structs. The only solution I can figure there needs to use type information of the array types wrapping the matrix fields. Putting the burden on the struct designer to make HDF save views of compounds before IO is not a good one in my experience (it leads to a lot of code), so the only thing I'm left with saying is: don't store Fortran-order matrices in structs.
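
To make the nested case concrete, here is a rough sketch of the kind of per-record "save view" conversion that ends up being written by hand for a struct with a column-major 3x3 field. The struct names and helpers are hypothetical; this is not the API of any HDF5 wrapper.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical in-memory record with a column-major 3x3 field,
// e.g. a pose whose rotation comes from a column-major math library.
struct PoseColMajor {
    double time;
    std::array<double, 9> rotation;  // element (r, c) at index c * 3 + r
};

// Matching "file view" record with the same field stored row-major.
struct PoseRowMajor {
    double time;
    std::array<double, 9> rotation;  // element (r, c) at index r * 3 + c
};

// Row-major accessor for the file view, with a bounds check.
double& at(PoseRowMajor& p, std::size_t r, std::size_t c) {
    assert(r < 3 && c < 3);
    return p.rotation[r * 3 + c];
}

// Copy one record, transposing the nested matrix field on the way.
PoseRowMajor to_file_view(const PoseColMajor& in) {
    PoseRowMajor out{};
    out.time = in.time;
    for (std::size_t r = 0; r < 3; ++r)
        for (std::size_t c = 0; c < 3; ++c)
            at(out, r, c) = in.rotation[c * 3 + r];
    return out;
}
```

Every compound type with a matrix field needs its own pair of structs and its own conversion like this, which is where the "a lot of code" comes from.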


-Jason

On Tue, Jun 9, 2015 at 1:24 PM, Werner Benger wrote:


    Jason,

    the reason would be to keep the complexity of HDF5 as small as
    possible. Introducing index reordering into HDF5 increases
    complexity and introduces possible sources of errors, especially
    as there is no need for HDF5 to do it. HDF5 can just concentrate
    on storing all datasets in C order, with the handling of Fortran
    indexing separated out into an add-on library similar to the
    HDF5 Lite (H5LT) library that is shipped with HDF5.

    Both the HDF5 tools such as hdfview and h5ls and the HDF5 Fortran API
    would of course have to make use of that add-on library to set and
    interpret such a "fortran-order" flag attribute. Using the
    "bare-bone" HDF5 would be limited to mere C-order I/O.

    Actually I had pretty much the same discussion ten years ago with
    other users of HDF5 as well. The arguments were the same: the desire
    to change HDF5 to support different index schemes, versus
    considering HDF5 as C-only and doing anything else on top of it.
    Ultimately it's the decision of the HDF team whether HDF5 should
    support different indexing schemes in its core API. But the fact
    that it has never been done demonstrates that it's unlikely to
    happen, and since it can be done via an add-on library (which
    needs to be used by both the HDF5 tools and the HDF5 fortran api,
    but it would not affect the HDF5 core), this seems to be the
    easier and thus more realistic solution.

          Werner



    On 09.06.2015 19:30, Jason Newton wrote:

    Werner,

    What is the argument for leaving this to yet another add-on
    library on top of HDF5?  This strategy would still require that
    the user check after reading, for instance, and call another API. I
    believe this is going to make it a less than first-class
    citizen/feature at the least. Ideally we want most users reading
    to not even know this is happening, like when content is chunked
    or compressed, although the metadata should be there so the user
    can infer it will happen in their program.

    Also, we want tools like hdfview, h5dump/h5ls to output the
    content correctly too.

    -Jason

    On Tue, Jun 9, 2015 at 3:58 AM, Werner Benger wrote:

        Basically what it needs is a convention such as an attribute
        to allow identifying in which permutation order a dataset is
        stored...

        As they say in

        https://www.hdfgroup.org/HDF5/doc/fortran/index.html

        "When a C application reads data stored from a Fortran
        program, the data will appear to be transposed due to the
        difference in the C and Fortran storage orders. For example,
        if Fortran writes a 4x6 two-dimensional dataset to the file,
        a C program will read it as a 6x4 two-dimensional dataset
        into memory. The HDF5 C utilities h5dump and h5ls will also
        display transposed data, if data is written from a Fortran
        program."

        But there is no way to find out whether data had been stored
        by a C or Fortran program. A simple agreement on an attribute
        would do, even better shared dataspaces that can hold such an
        attribute.
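
The transposition described in the quoted documentation can be checked without HDF5 at all. The sketch below (helper names are illustrative only) fills a 4x6 matrix the way a Fortran program would lay it out and then reads the same buffer as a 6x4 C array; nothing in the buffer itself says which interpretation was intended.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fill a rows x cols matrix in column-major (Fortran) order with
// value(r, c) = 10 * r + c and return the flat buffer as it would be stored.
std::vector<int> fill_column_major(std::size_t rows, std::size_t cols) {
    std::vector<int> buf(rows * cols);
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            buf[c * rows + r] = static_cast<int>(10 * r + c);
    return buf;
}

// Read element (i, j) from a flat buffer interpreted as row-major (C order)
// with the given number of columns.
int read_row_major(const std::vector<int>& buf, std::size_t cols,
                   std::size_t i, std::size_t j) {
    assert(i * cols + j < buf.size());
    return buf[i * cols + j];
}
```

Reading the 4x6 Fortran buffer as a 6x4 C array, element (i, j) comes back as the Fortran element (j, i), i.e. the value 10 * j + i.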

        All the index-permutation or data transposing (if really
        required) can be in some add-on library on top of HDF5
        (similar to what F5 does, though F5 does more than just that).

             Werner



        On 09.06.2015 11:00, Jason Newton wrote:

        Was hoping more commentary would have happened, but I also
        had some timing issues getting back to this, my apologies.

        Werner, thank you for your reply, but your case is exactly the
        proof that this is an issue that should be dealt with at the
        specification & library level, which is what I am talking about.
        Permuting indices whenever accessing data is a large burden
        to put on user code, especially considering how many
        different bindings one might use to access the data. It
        leads to repetitive and intrusive handling, which is not what
        the user should be dealing with.  It's tricky, automatable,
        isolatable (to the library), difficult outside of C (at least in
        Python), and not the kind of task users should be spending time
        on when using advanced software like HDF5.

        If we look at the example of Eigen and Numpy we can see they
        have flags set for dealing with column/row [
        http://eigen.tuxfamily.org/dox-devel/group__TopicStorageOrders.html
        ]  and c/fortran [ see order argument:
        http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html
        & http://docs.scipy.org/doc/numpy/reference/c-api.array.html
        ].  This shows at least some numerical processing code
        deemed it important enough to not only deal with the issue,
        but usually provide seamless usage or conversion to the
        user's desired type.

        I think defaults can be set to not change current behaviour,
        but datasets & arrays could now be marked with a flag
        such as Python's.  When reading/writing, an optional flag is
        provided for the memory space's requested interpretation
        (defaulting to C or Fortran by language context).  We could
        potentially put this in the dataset properties and type
        properties so we wouldn't have to change the API.  And ideally,
        with performance hopefully being pretty great since it is
        handled in C, the library permutes the storage for you as it
        IOs it in, for a hopefully negligible performance hit since IO
        is likely the limiting factor.

        I brought this up because I'm writing a generalized HDF C++
        library, and when trying to support something like Eigen (and
        more!), which allows both C and F orders in the same
        runtime, it gets confusing how to IO to/from HDF files, as
        the current approach relies on language-level wrappers to
        decide what the right thing to do is, and weakly at that.
        But the user may genuinely want to IO a Fortran- or C-ordered
        dataset/array to/from a C/Fortran dataset/array in
        any combination that makes sense to them, and this
        doesn't really work.  I can be left with baffling scenarios
        like this one, which fails unless all data written to HDF files
        is in C order:

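            // hdf is an instance of the C++ wrapper library described above.
            Eigen::Matrix<double, 4, 5, Eigen::RowMajor> A_c;
            A_c.setZero(); A_c.row(i).setConstant(5);  // i: some valid row index
            Eigen::Matrix<double, 4, 5, Eigen::ColMajor> A_f;
            hdf.write("A", A_c);  // written from a row-major matrix
            hdf.read("A", A_f);   // read into a column-major matrix
            assert(A_c == A_f);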


        If in this scenario A had already been written by a Fortran
        program, then code making the above test case work would
        apply a conversion where none is needed for a read like
        this, making this test case's assertion fail:

            Eigen::Matrix<double, 4, 5, Eigen::RowMajor> A_c;
            A_c.setZero(); A_c.row(i).setConstant(5);  // i: some valid row index
            Eigen::Matrix<double, 4, 5, Eigen::ColMajor> A_f;
            hdf.read("A", A_f);  // "A" was written by a Fortran program
            assert(A_c == A_f);


        And that's why flags need to be saved in the document... the
        content needs to specify its storage layout. Guessing
        based on language cannot cover all cases, and user-made
        attributes are not the way, because that would be a
        standard nobody knows about or will use.
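
As a rough illustration of why a stored flag resolves both scenarios, the following toy sketch keeps a layout flag next to the raw values and permutes on read only when the requested layout differs. `ToyFile`, `write_matrix` and `read_matrix` are hypothetical stand-ins for an in-memory file, not HDF5 or Eigen calls.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

enum class Order { RowMajor, ColMajor };

// Hypothetical stand-in for a dataset: raw values plus a stored layout flag.
struct StoredMatrix {
    std::size_t rows, cols;
    Order order;
    std::vector<double> data;
};

// Toy in-memory "file"; a real implementation would sit on top of HDF5.
using ToyFile = std::map<std::string, StoredMatrix>;

std::size_t flat_index(Order o, std::size_t rows, std::size_t cols,
                       std::size_t r, std::size_t c) {
    return o == Order::RowMajor ? r * cols + c : c * rows + r;
}

// Store the buffer untouched and record the layout it was written in.
void write_matrix(ToyFile& f, const std::string& name, std::size_t rows,
                  std::size_t cols, Order order, const std::vector<double>& data) {
    assert(data.size() == rows * cols);
    f[name] = StoredMatrix{rows, cols, order, data};
}

// Return the values in the layout the caller asks for; elements are
// permuted only if that layout differs from the stored flag.
std::vector<double> read_matrix(const ToyFile& f, const std::string& name, Order wanted) {
    const StoredMatrix& m = f.at(name);
    if (m.order == wanted) return m.data;
    std::vector<double> out(m.data.size());
    for (std::size_t r = 0; r < m.rows; ++r)
        for (std::size_t c = 0; c < m.cols; ++c)
            out[flat_index(wanted, m.rows, m.cols, r, c)] =
                m.data[flat_index(m.order, m.rows, m.cols, r, c)];
    return out;
}
```

With the flag stored, a column-major reader gets the same logical matrix whether the dataset was written row-major or column-major, so neither scenario has to guess from the writing language.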

        -Jason

        On Tue, May 12, 2015 at 12:16 AM, Werner Benger wrote:

            Hi Jason,

            I was facing the same issues, as pretty much all use
            cases I know and have in my visualization software and
            context use and require "fortran" order of indexing,
            including OpenGL graphics. It's not really an issue with
            HDF5, as the only thing required is to permute the
            indices when accessing the HDF5 API. And the HDF5 tools
            of course will display data transposed then. This index
            permutation is supported in the F5 library via a generic
            permutation vector that is stored with a group of
            datasets sharing the same properties (the F5 library is a
            C library on top of HDF5 guiding towards a specific data
            model for various classes of data types occurring
            particularly in scientific visualization):

            http://www.fiberbundle.net/doc/structChartDomain__IDs.html

            So via the F5 API one would see the fortran-like
            indexing convention, whereas whenever accessing data
            with the lower-level HDF5 API, it's C-like convention
            (whereby the permutation vector gives the option of
            arbitrary permutations).
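
The idea of a generic permutation vector can be sketched as follows. This is only an assumption about how such a vector might be applied (storage axis k takes the logical index perm[k]); it is not the F5 API, and `storage_offset` is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Map a logical (user-facing) multi-index to a flat offset in a buffer that
// is stored in C order with extents storage_dims.
// perm[k] names the logical axis that storage axis k corresponds to:
//   identity perm -> C-like indexing, reversed perm -> Fortran-like indexing.
std::size_t storage_offset(const std::vector<std::size_t>& logical_index,
                           const std::vector<std::size_t>& storage_dims,
                           const std::vector<std::size_t>& perm) {
    assert(logical_index.size() == storage_dims.size());
    assert(perm.size() == storage_dims.size());
    std::size_t offset = 0;
    for (std::size_t k = 0; k < storage_dims.size(); ++k) {
        assert(perm[k] < logical_index.size());
        const std::size_t i = logical_index[perm[k]];
        assert(i < storage_dims[k]);
        offset = offset * storage_dims[k] + i;  // row-major accumulation
    }
    return offset;
}
```

For a Fortran 4x6 array seen by HDF5 as 6x4, the call `storage_offset({r, c}, {6, 4}, {1, 0})` yields `c * 4 + r`, which is the usual column-major offset.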

            I remember there had been plans by the HDF5 group to
            introduce "named dataspaces", similarly to "named
            datatypes", that could then be stored in the file as its
            own entity. Such would be a good place to store
            properties of a dataspace as attributes on a dataspace,
            and to have such shared among datasets. It would be a
            natural place to store a permutation vector, which could
            be reduced to a simple flag as well to just distinguish
            between C and fortran indexing conventions. Of course,
            all the related tools would also need to honor such an
            attribute then. Until then, one could use an attribute
            on each dataset and implement index permutation similar
            to how the F5 library does it. It may be safer to use
            new API functions anyway to not break old code that
            always expects C order indexing.

                      Werner


            On 12.05.2015 06:48, Jason Newton wrote:

            Hi -

            I've been an evangelist for HDF5 for a few years now.
            It is a noble and amazing library that solves data
            storage issues occurring in scientific applications and
            beyond, e.g. it can save many developers from
            wasting time and money so they can spend that on
            solving more original problems. But you guys knew that
            already.  I think there's been a mistake though: that
            is the lack of first-class column-vs-row-major
            storage.  In a world where we are split down the middle
            on which format we use, depending on the application,
            library and language we work in,
            it is an ongoing reality that there will never be one
            true standard to follow.  But HDF5 sought to only
            support row-major, and I can back that up:
            standardizing is a good thing. But then, as time has
            shown, that really didn't work for a lot of folks, such
            as those in MATLAB and Fortran: when they read our
            data, it looks transposed to them!  When HDF5
            utils/our code sees their data, it looks transposed to
            us!  These are arguably the users you do not want to
            face these difficulties, as it makes it downright
            embarrassing at times and hard to work around within
            that language (ahem, MATLAB again is painful to work
            with).  Not only that, but it doesn't really scale: it
            will always take some manual fixing, and there's no
            standardized mark for whether a dataset is one of these
            column-major masquerading datasets.  So let me assure
            you this is quite ugly to deal with in MATLAB etc. and
            doesn't seem to be the path many people take, and it
            can require skills many people don't have or
            understanding that they can't give.

            But then, why did we allow saving column-major data in
            a row-based standard in the first place? Well, the
            answer seems to be performance.  Surely it can't take
            that long to convert the datasets, most of the time at
            least, although there would for sure be some memory-based
            limitations on transposing just as HDF does its IO.
            But alas, the current state of the library indicates
            otherwise, and thus it is the user's job to correctly
            transform the data back and forth between
            application and party.  But wait, wasn't this kind of
            activity what HDF5 was built to alleviate in the first
            place?

            So then how do we rectify the situation?  Well, speaking
            as a developer using HDF5 extensively and writing
            libraries for it, it looks to me like it should be in the
            core library, as it is exceedingly messy to handle on
            the user side each time.  I think the interpretation of
            the dataset and its dimensions should be based on
            dataset creation properties.  This would allow an
            official marking of what kind of interpretation the raw
            storage of the data (and dimensions?) has. However,
            this is only half of the battle.  We'd need something
            like the type conversion system to permute order in all
            the right places if the user needs to IO an opposing
            storage layout.  And it should be fast and light on
            memory. Perhaps it would merely operate in place as a
            new utility subroutine taking in the mem_type and user
            memory. However, I can still think of one problem this
            does not address: compound types using a mixture of
            philosophies, with fields being the opposite of the
            dataset layout. This case has me completely
            stumped, as it indicates the marking should be at the type
            level as well.  The compound part of this is a sticky
            situation, but I'd still motion that the dataset creation
            property works for most things that occur in practice.
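
One possible shape for such an in-place utility, sketched for the plain 2-D case only. It follows permutation cycles and uses one bit per element to remember what has already been moved, which is an assumption made here for simplicity; `transpose_in_place` is a hypothetical name and not an HDF5 call.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Transpose a rows x cols row-major matrix in place. Afterwards the buffer
// holds the cols x rows transpose in row-major order (equivalently, the
// original matrix in column-major order).
void transpose_in_place(std::vector<double>& a, std::size_t rows, std::size_t cols) {
    assert(a.size() == rows * cols);
    const std::size_t n = a.size();
    if (n < 2) return;
    std::vector<bool> moved(n, false);  // one bit per element
    // The first and last elements never move, so they are skipped.
    for (std::size_t start = 1; start + 1 < n; ++start) {
        if (moved[start]) continue;
        // Follow the permutation cycle containing `start`: the element at
        // flat index i belongs at (i * rows) mod (n - 1) after transposing.
        std::size_t i = start;
        double carried = a[start];
        do {
            const std::size_t dest = (i * rows) % (n - 1);
            std::swap(carried, a[dest]);
            moved[dest] = true;
            i = dest;
        } while (i != start);
    }
}
```

The extra memory is n bits rather than a second copy of the data; the index arithmetic would need care against overflow for very large datasets.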

            So... has the HDF5 group tried to deal with this wart
            yet?  Let me know if anything is on the drawing board.


            -Jason


            _______________________________________________
            Hdf-forum is for HDF software users discussion.
            http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
            Twitter: https://twitter.com/hdf5

            --
            ___________________________________________________________________________
            Dr. Werner Benger                Visualization Research
            Center for Computation & Technology at Louisiana State University (CCT/LSU)
            2019  Digital Media Center, Baton Rouge, Louisiana 70803
            Tel.: +1 225 578 4809                        Fax.: +1 225 578-5362


