#78: CF aggregation rules
-----------------------------+------------------------------
  Reporter:  davidhassell    |      Owner:  cf-conventions@…
      Type:  enhancement     |     Status:  new
  Priority:  medium          |  Milestone:
 Component:  cf-conventions  |    Version:
Resolution:                  |   Keywords:
-----------------------------+------------------------------
Comment (by stevehankin):

 Jonathan and David,

 At your suggestion I'm picking up a thread from trac #145 here.

 The rules contained in this ticket (#78) provide guidance on when it is '''ok''' for a client application to merge fields. This is fine as far as it goes, but it does not provide the right level of information to coach a client on which files should be considered as candidates for aggregation.

 Let's consider modelers' needs as our example. Say the model outputs 1 year of time steps per file, with each variable (temp, u, v, ...) in separate files. For a single model run it creates a matrix of NtimePeriods x Mvariables files. Let's say that at the client end we are confronted with a directory containing the outputs of 5 such model runs. CF does not currently provide a simple mechanism that would permit an application to scan this directory and infer what the file creator knew -- that these files represent 5 model outputs, when suitably aggregated.

 CMIP has formalized a set of CF attribute conventions that address this problem in the full glory of CMIP models: multiple institutions, model names, scenarios, time periods, etc. CF needs analogous machinery -- simpler and more general.

 Here's a straw man:

 1. Define a new CF global attribute
 {{{
 aggregation_key = "some string";
 }}}
 2. Offer guidance on the creation of the aggregation key string, such as:

 "It is important that the generated string be unique to this dataset. We suggest generating an MD5 hash from a list of metadata items -- to be decided on by the file creator -- that guarantees uniqueness. For example, the list may contain the following pieces of information:"
  - institution
  - project/scenario
  - name of code generating files
  - creation date

 The end result will be a global attribute such as
 {{{
 aggregation_key = "d131dd02c5e6eec4";
 }}}
 that is shared by all of the files needed in this aggregation.
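 To make the straw man concrete, here is a minimal Python sketch of how a file creator might derive such a key and how a client might group files by it. Everything named here is an assumption made for illustration and not part of CF: the helper names `make_aggregation_key` and `group_by_key`, the metadata values, the file names, and the truncation of the MD5 hex digest to 16 characters (chosen only to match the length of the example key above). The grouping function takes attribute values that have already been read from each file, and it splits each value on whitespace so that a value listing more than one key is filed under each.

```python
import hashlib


def make_aggregation_key(items, length=16):
    """Hash a list of metadata strings into a short hex key (hypothetical helper)."""
    # Join with a control character so that ["ab", "c"] and ["a", "bc"]
    # produce different inputs to the hash.
    joined = "\x1f".join(items)
    digest = hashlib.md5(joined.encode("utf-8")).hexdigest()
    # Assumption: keep only the first `length` hex characters of the digest.
    return digest[:length]


def group_by_key(attrs):
    """Group file names by aggregation key.

    `attrs` maps file name -> value of the aggregation_key global attribute,
    already read from each file by whatever netCDF reader the client uses.
    """
    groups = {}
    for name, value in sorted(attrs.items()):
        for key in value.split():
            groups.setdefault(key, []).append(name)
    return groups


# Illustrative metadata items chosen by a hypothetical file creator.
run_key = make_aggregation_key(
    ["Example Institute", "scenario-A", "example_model v1.0", "2012-06-01"]
)

# Illustrative directory contents: two files from one run plus an unrelated file.
files = {
    "temp_2001.nc": run_key,
    "u_2001.nc": run_key,
    "unrelated.nc": "0000000000000000",
}
groups = group_by_key(files)
print(groups[run_key])
```

 Hashing is only a convenient way to obtain a compact string. Uniqueness still depends on the file creator choosing metadata items that differ between datasets.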
 ===

 What needs to be added to this straw man is a strategy to handle files that may be shared by multiple model runs. For example, the cell_measures field that was the topic of #145 might be shared among all model runs that use a particular gridded coordinate system.

 A solution to this problem might be simply to allow multiple aggregation keys. In all model outputs from a single run we would find
 {{{
 aggregation_key = "d131dd02c5e6eec4 c69821bcb6a88393";
 }}}
 where the first key identifies the model run and the second identifies the grid geometry. The grid-geometry files would contain only the single identifying key
 {{{
 aggregation_key = "c69821bcb6a88393";
 }}}

 From this information a client application can quickly scan a directory and infer which files the data creator intended to be aggregated.

-- 
Ticket URL: <http://cf-trac.llnl.gov/trac/ticket/78#comment:4>
CF Metadata <http://cf-convention.github.io/>