#78: CF aggregation rules
-----------------------------+------------------------------
  Reporter:  davidhassell    |      Owner:  cf-conventions@…
      Type:  enhancement     |     Status:  new
  Priority:  medium          |  Milestone:
 Component:  cf-conventions  |    Version:
Resolution:                  |   Keywords:
-----------------------------+------------------------------

Comment (by stevehankin):

 Jonathan and David,

 At your suggestion I'm picking up a thread from trac #145 here.

 The rules contained in this ticket (#78) provide guidance on when it is
 '''ok''' for a client application to merge fields.  This is fine as far as
 it goes, but it does not provide the right level of information to coach
 a client on which files should be considered as candidates for
 aggregation.

 Let's consider modelers' needs as our example.  Say the model outputs 1
 year of time steps per file, with each variable (temp, u, v, ...) in
 separate files.  For a single model run it creates a matrix of
 NtimePeriods x Mvariables files.  Let's say that at the client end we are
 confronted with a directory containing the outputs of 5 such model runs.
 CF does not currently provide a simple mechanism that would permit an
 application to scan this directory and infer what the file creator knew --
 that these files represent 5 model outputs, when suitably aggregated.

 CMIP has formalized a set of CF attribute conventions that address this
 problem in the full glory of CMIP models: multiple institutions, model
 names, scenarios, time periods, etc.  CF needs analogous machinery --
 simpler and more general.  Here's a straw man:

 1. define a new CF global attribute
 {{{
     aggregation_key = "some string";
 }}}


 2. Offer guidance on the creation of the aggregation key string such as:
 "It is important that the string that is generated be unique to this
 dataset.  We suggest generating an MD5 hash from a list of metadata items
 -- to be decided on by the file creator -- that guarantees uniqueness. For
 example, the list may contain the following pieces of information"

   - institution
   - project/scenario
   - name of code generating files
   - creation date

 The end result will be a global attribute such as
 {{{
     aggregation_key = "d131dd02c5e6eec4";
 }}}

 that will be found in common among all of the files needed in this
 aggregation.
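
 As a rough illustration of step 2, here is a minimal Python sketch that
 hashes a list of metadata items with MD5.  The function name
 make_aggregation_key, the separator character, and the metadata values are
 all hypothetical, and the choice and ordering of items remains up to the
 file creator.  Note that a full MD5 hex digest is 32 characters, longer
 than the illustrative value shown above.

```python
import hashlib


def make_aggregation_key(items):
    """Return an MD5 hex digest built from an ordered list of metadata strings."""
    # Join with a separator that is unlikely to occur in the items, so that
    # ["ab", "c"] and ["a", "bc"] do not collapse to the same hash input.
    joined = "\x1f".join(items)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Hypothetical metadata values, for illustration only.
key = make_aggregation_key([
    "Example Institute",      # institution
    "example-scenario",       # project/scenario
    "example_model v1.0",     # name of code generating files
    "2012-01-01T00:00:00Z",   # creation date
])
print(key)  # a string of 32 hexadecimal characters
```

 Every file written by the same run would carry the same string, since the
 same items in the same order always hash to the same value.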

 ===

 What needs to be added to this straw man is a strategy to handle files
 that may be shared in common by multiple model runs.  For example, the
 cell_measures field that was the topic of #145, might be shared among all
 model runs that use a particular gridded coordinate system.  A solution to
 this problem might be simply to allow multiple aggregation keys.

 In all model outputs from a single run we find
 {{{
     aggregation_key = "d131dd02c5e6eec4 c69821bcb6a88393";
 }}}

 where the first key identifies the model run and the second identifies the
 grid geometry.  The grid-geometry files would contain only the single
 identifying key
 {{{
     aggregation_key = "c69821bcb6a88393";
 }}}

 From this information a client application can quickly scan a directory
 and infer which files the data creator intended to be aggregated.
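
 The client-side grouping could be sketched roughly as follows, assuming
 multiple keys are separated by whitespace as in the example above.  The
 function name group_by_aggregation_key and the file names are
 hypothetical, and reading the global attribute out of each file (with
 whatever netCDF library the client uses) is left out; the sketch starts
 from a mapping of file name to attribute value.

```python
from collections import defaultdict


def group_by_aggregation_key(attrs_by_file):
    """Map each aggregation key to the sorted list of files that carry it.

    attrs_by_file maps a file name to the value of its aggregation_key
    global attribute, which may hold several whitespace-separated keys.
    """
    groups = defaultdict(list)
    for filename, attr in attrs_by_file.items():
        for key in attr.split():
            groups[key].append(filename)
    return {key: sorted(names) for key, names in groups.items()}


# Hypothetical directory contents: two outputs of one run plus a grid file.
scanned = {
    "temp_year1.nc": "d131dd02c5e6eec4 c69821bcb6a88393",
    "u_year1.nc": "d131dd02c5e6eec4 c69821bcb6a88393",
    "grid.nc": "c69821bcb6a88393",
}
groups = group_by_aggregation_key(scanned)
for key, names in groups.items():
    print(key, names)
```

 In this sketch the run key collects only the model outputs, while the
 grid key collects those outputs together with the shared grid-geometry
 file.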

--
Ticket URL: <http://cf-trac.llnl.gov/trac/ticket/78#comment:4>
CF Metadata <http://cf-convention.github.io/>