Contents

        0.  Introduction
        1.  A trivial example
        2.  Typical solutions
        3.  PDL vs. NetCDF Operators
        4.  Conclusion


    0.  Introduction

        The second problem I'd like to discuss is different from the
        first in that there /are/ known solutions for some particular
        cases, yet there isn't, to the best of my knowledge, a single
        approach generic enough to become part of a generic array
        processor such as PDL.

        This problem is of a greater perceived significance, as it may
        easily lead to increased development (debugging) time.  Also,
        since this problem is, at least in part, solved by other
        software, it readily makes PDL seem inferior to such software.

        The problem is that the dimensions of a PDL variable lack any
        information whatsoever on the /meaning/ of the indices.


    1.  A trivial example

        Consider, e. g., that there are two PDL variables, $t1 and
        $t2, which contain series of regularly-sampled temperature at
        two distinct locations, loaded from some data file or files.
        Consider also that we, for some reason, need to compute the
        difference between the temperatures sampled at the
        corresponding moments of time.  Can it be as simple as, say,
        the following?

    my $tdiff = $t1 - $t2;

        Unfortunately, it can't, as we have yet to make sure that the
        corresponding elements of $t1 and $t2 were sampled at the same
        time.  IOW, we have yet to make sure that the /mapping/ of
        indices to the values of a /physical quantity/ (time) is the
        same for both of the variables.


    2.  Typical solutions

        How is this problem typically solved?  First of all, we need a
        way to encode the mapping.  In the most common case, the
        mapping is assumed to be linear, and thus can be defined by a
        pair of scalars: the step and the offset.  Only after ensuring
        that both are the same for the two variables can we proceed
        with the computation.  Otherwise, we may choose to use
        subsampling or interpolation in order to get the mappings to
        match.
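        For instance, assuming that the mappings are tracked
        separately as step/offset pairs (the $t?_step and $t?_offset
        variables below are hypothetical; PDL itself carries no such
        metadata), the check may look as follows:

    die "the time mappings of \$t1 and \$t2 differ"
        unless $t1_step == $t2_step and $t1_offset == $t2_offset;
    my $tdiff = $t1 - $t2;    # now known to be meaningful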

        For multi-dimensional data, the order of indices also becomes
        significant.  Different software addresses this problem in
        different ways.  In particular, the raster engine of the GRASS
        GIS assumes that, roughly speaking, the minor dimension
        corresponds to the west-to-east direction, while the major one
        corresponds to the south-to-north direction.  No number of
        dimensions other than two is allowed.  (Such a solution is
        clearly /not/ for a generic array processor like PDL.)

        Many major image formats employ more or less the same solution
        by requiring, e. g., that the minor dimension correspond to
        the primary color (red, green or blue), the middle dimension
        to the left-to-right direction, and the major one to the
        top-to-bottom direction.  (TIFF is among the notable
        exceptions, as it allows different layouts.)

        NetCDF, a prominent multi-dimensional data format, allows the
        individual dimensions to be explicitly named.  The software
        processing NetCDF files may then choose to /orient/ (i. e.,
        permute the dimensions of) the variables involved in a
        computation, so that dimensions sharing the same name occupy
        matching positions in the list of indices.

        Some NetCDF-related materials mention the concept of a
        /coordinate variable/: a one-dimensional variable associated
        with a named dimension, which holds the values of some
        physical quantity corresponding to the whole range of index
        values.  This feature allows for completely arbitrary
        mappings.
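        In the CDL notation, a dataset employing both of these
        features may look, e. g., like the following made-up example:

    dimensions:
            time = 1024 ;
    variables:
            double time(time) ;        // the coordinate variable
            float  temperature(time) ;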


    3.  PDL vs. NetCDF Operators

        The NetCDF Operators (NCO) implement support for the named
        dimensions feature of NetCDF.  (And also for the NetCDF
        Climate and Forecast (CF) Metadata Conventions, which I'm not
        yet familiar with.)

        Thus, e. g., the user invoking the following command may be sure
        that the right thing is done, irrespective of the internal
        layout of the multi-dimensional data that inhabitates the source
        datasets:

$ ncbo --op_typ=sub data1.nc data2.nc difference.nc 

        A mere convenience?  It's more than that, for both developers
        and data providers.

        For the former, this behavior means that software built from
        semantically-aware building blocks like the one above will not
        require modification should a data provider suddenly change
        the internal layout of a dataset.

        For the latter, it conversely gives more freedom to change the
        internal layout as it becomes necessary, without any of:
        giving early warnings to the users of the data, providing the
        data in both flavors, or risking a loss of compatibility.

        Unfortunately, reading the contents of a NetCDF variable into
        a PDL variable results in the loss of this semantic
        information.  Although the information may be read and tracked
        separately, doing so places an extra burden on the developer
        and reduces the readability of the code, perhaps to the point
        where it becomes impractical to pursue layout independence.

        Previously, I noted a problem with software that relies on
        some particular ordering of dimensions in datasets created by
        other software: it is limited as to the datasets it can be
        applied to without modification, and it also constrains the
        data provider to the once-chosen data layout.  In fact, the
        same reasoning applies to the building blocks the software is
        made from: its functions.

        Thus, as of the current version of PDL, the order of the
        dimensions effectively becomes part of a function's signature,
        with all the negative consequences thereof.


    4.  Conclusion

        My question, then, is: does it seem feasible to add semantic
        information to PDL dimensions?

        The mere association of coordinate variables of some kind with
        the dimensions of regular PDL variables shouldn't be hard to
        implement.  However, the necessity to maintain this
        information throughout a computation may place some extra
        burden on the implementations of the PDL functions.

        Also, there's the question of how the behavior should be
        altered in the presence of semantically-tagged variables.
        E. g., if the only dimension of $a is time, and the only
        dimension of $b is power, should $a + $b result in a variable
        having both of these dimensions?  (IOW, should an implicit
        cross product be computed?)
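        With the current PDL, such a cross product has to be requested
        explicitly, e. g. by means of a dummy dimension:

    ## $a has dims (time), $b has dims (power)
    my $cross = $a->dummy (0) + $b;    # dims: (power, time)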

        TIA.

-- 
FSF associate member #7257


_______________________________________________
Perldl mailing list
[email protected]
http://mailman.jach.hawaii.edu/mailman/listinfo/perldl
