Re: [CF-metadata] Towards recognizing and exploiting hierarchical groups (Charlie Zender - Steve Hankin - Richard Signell)

Cameron-smith, Philip Fri, 20 Sep 2013 12:53:23 -0700

Hi All,

I like Steve Hankin's point (below) about 'powerful' versus 'interoperable'.
I hadn't thought about it quite that way before :-).

From my point of view, I do see value in including hierarchical information.
The most useful case I have seen mentioned so far involves putting datasets
from different sources (e.g. different models and observations) into a single
file.  And I can see that there will be times when the choice of how to
organize the hierarchy is sufficiently clear that it would be helpful.  Hence,
I think this is a valid discussion to be having :-).

What I have not seen mentioned so far is the impact on file sizes.   Our
simulations generate large output datasets, and it is impracticable to put all the data
into a single file.  Even if the operating system can handle Terabyte, or even 
Petabyte files, one will have problems transferring them and reading them into 
memory. Hence, for the datasets I deal with, we normally work with files each 
containing one variable from one source (and the hierarchy within a file of 
only one variable isn't very interesting ;-).

Hence, the best use case for using hierarchical structures _inside_ a file that 
I have seen so far is limited to situations where all the following are true:

a)  there are several different datasets which people would like to 
intercompare, and
b) there is a clear and obvious way to organize the hierarchy, and
c) the datasets are fairly small.

I think the case for putting the hierarchical information _outside_ the files
is stronger.  There is clearly no file size problem, and in fact it might help by
reducing the need to access large files.  It would also be easier to update.
There would still be a challenge to make sure that externally stored 
information stays synchronized with the actual datafiles, but I don't see that 
this should be a show stopper.

In summary, I am not yet convinced that the value of allowing hierarchy inside 
files is worth it.  I do see greater value in storing hierarchy information 
externally or allowing it to be generated from something like the 
'dot-appending' system suggested by Steve Hankin (in his Sept 16 email).

Best wishes,

     Philip

-----------------------------------------------------------------------
Dr Philip Cameron-Smith, p...@llnl.gov, Lawrence Livermore National Lab.
-----------------------------------------------------------------------

From: CF-metadata [mailto:cf-metadata-boun...@cgd.ucar.edu] On Behalf Of Steve 
Hankin
Sent: Thursday, September 19, 2013 10:03 AM
To: Corey Bettenhausen
Cc: CF Metadata List
Subject: Re: [CF-metadata] Towards recognizing and exploiting hierarchical 
groups (Charlie Zender - Steve Hankin - Richard Signell)

On 9/19/2013 9:05 AM, Corey Bettenhausen wrote:

On Sep 19, 2013, at 11:29 AM, Karl Taylor wrote:

Hi all,

Again, I may be unaware of all the possible uses of hierarchies, but here's our 
experience with CMIP.

It seems to me if hierarchies are for the purpose of "organizing" datasets (or 
organizing a bunch of files), this should fall outside CF's purview because a 
single hierarchy is rarely ideal for all purposes.

I wasn't under the impression that CF would dictate how these datasets are 
organized into hierarchies. Rather, the organization of datasets within the 
file would be left to the producers or users. However, CF-aware software should 
be able to traverse the hierarchy and perform the same functions as if the file 
were flat (assuming the datasets are described appropriately with CF metadata).
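
A minimal sketch of what "traversing the hierarchy as if the file were flat" could look like. It assumes a plain nested dictionary as a stand-in for a file with groups (a dict is a group, anything else is a variable); the names walk_variables and example_file are hypothetical and are not part of CF or of any netCDF/HDF5 library.

```python
from typing import Any, Dict, Iterator, Tuple

# Hypothetical stand-in for a file with groups: a dict value is a group,
# any other value is a variable. All names here are illustrative only.
Group = Dict[str, Any]


def walk_variables(group: Group, prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, variable) pairs for every variable in a nested group tree."""
    for name, item in group.items():
        path = f"{prefix}/{name}"
        if isinstance(item, dict):
            # A sub-group: recurse, carrying the accumulated path down.
            yield from walk_variables(item, path)
        else:
            # A variable: report it under its full path.
            yield path, item


example_file: Group = {
    "aerosol_optical_depth": [0.1, 0.2],
    "diagnostics": {
        "iterations": [3, 4],
        "retrieval": {"residual": [0.01, 0.02]},
    },
}

# A flat, path-keyed view that software could process like a flat file.
flat_view = dict(walk_variables(example_file))
```

With a view like this, software written for flat files only needs the list of paths; how deep the producer nested the groups no longer matters to it.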

Did I misunderstand the original proposal?

Cheers,

-Corey

Hi Corey,

Your question hits on the underlying dilemma.  CF is more powerful when it 
offers the greatest possible flexibility for creators of files;  like a 
programming language it enables you to go wherever your imagination can lead 
you.   But CF is more interoperable when it restricts the ways you may organize
your file enough to ensure that both the people and the machines receiving
it will know (without exploration) how to pull semantically meaningful data
from it.  I think most everyone would agree that the reason we create
conventions is in order to restrict behavior.  The battle lines get drawn over
how severely we restrict it.  In these email dialogs I have several times used
the quotation 'To create quality software [standards], the ability to say "no" 
is usually far more important than the ability to say "yes."'  (The Rise and 
Fall of CORBA (*)<http://queue.acm.org/detail.cfm?id=1142044>).   It's a bummer 
to be a wet blanket, but it's a bummer to watch a standard go south, too.  And 
plenty of them do ....

We have not yet touched on the impacts that embedding groups and hierarchies 
into files may have on the need to aggregate files along their time axes;  or 
on how to make sure that the way groups and hierarchies are used doesn't stand 
in the way of generating quality metadata that describes the contents of a CF 
file.   NASA and other HDF5 projects no doubt have tons of experience with these
issues that would be very interesting to hear about.  What have been the
downsides to the use of groups and hierarchies?  How could those downsides have
been minimized through more restrictive conventions?

    - Steve

(*)  thanks to Russ Rew for contributing this citation into the CF discussions 
long ago

For CMIP we place files in a hierarchical directory structure based on the 
global attributes stored.  We also bundle collections of files into datasets,
but that's for practical reasons imposed by the ESGF search engine, which can't
efficiently handle millions of files but is able to handle tens of thousands
of datasets.  The collections imply a single-level hierarchy.  Note that
outside of ESGF users would normally choose not to define "datasets" in the 
same way that we do in ESGF.

In general I think hierarchies can be useful in organizing data, but rarely 
will everyone agree on what hierarchy is most convenient, so I don't see why 
such hierarchies need to be included in CF.  The global attributes, on the 
other hand, are fundamental and can be used in flexible ways to produce 
whatever hierarchy might be best for a given situation.  In CMIP some of the 
global attributes normally used to construct directory structures are:  
institution name, model name, experiment name, sampling frequency (e.g., 
monthly, daily, 3-hourly), realm (e.g., atmosphere, ocean, land), "realization" 
(for ensembles of runs differing only slightly), variable name.   The hierarchy 
suited to the CMIP archive places the model name at a fairly high level 
(because the data are stored at nodes hosted by individual modeling centers; 
the distributed dataset can be accessed through a single ESGF portal).  Once 
the user downloads the data, however, a more appropriate structure might be to
place the variable name at a high level and then near the bottom of the
hierarchy you would find out which models had output that variable.
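
The point about deriving different hierarchies from the same global attributes can be sketched as follows. This is only an illustration under stated assumptions: the records, file names, model names, and the helper build_hierarchy are hypothetical and do not reflect the actual CMIP or ESGF tooling.

```python
from typing import Any, Dict, List, Sequence

# Hypothetical records standing in for the global attributes of three files.
files: List[Dict[str, str]] = [
    {"path": "a.nc", "model": "ModelA", "frequency": "monthly", "variable": "tas"},
    {"path": "b.nc", "model": "ModelB", "frequency": "monthly", "variable": "tas"},
    {"path": "c.nc", "model": "ModelA", "frequency": "daily", "variable": "pr"},
]


def build_hierarchy(
    records: Sequence[Dict[str, str]], facet_order: Sequence[str]
) -> Dict[str, Any]:
    """Nest file paths by the given facets, outermost facet first."""
    tree: Dict[str, Any] = {}
    for record in records:
        node = tree
        # Descend through one dictionary level per facet except the last.
        for facet in facet_order[:-1]:
            node = node.setdefault(record[facet], {})
        # The last facet maps to the list of matching file paths.
        node.setdefault(record[facet_order[-1]], []).append(record["path"])
    return tree


# Archive-style view: model near the top of the hierarchy.
archive_view = build_hierarchy(files, ["model", "frequency", "variable"])

# User-style view: variable at the top, models near the bottom.
user_view = build_hierarchy(files, ["variable", "frequency", "model"])
```

Both views are generated from the same flat attribute records, so neither ordering has to be stored in the files themselves.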

I agree hierarchies of directories can be quite useful when trying to find what 
you need, but the need for flexibility suggests to me that those hierarchies 
should appear outside CF.  Hierarchies don't seem to me to be intrinsically 
needed to make data files self-describing.  [In CMIP the data gets associated 
with "groups" simply by defining the global attributes I listed above.]

best regards,

Karl

On 9/19/13 6:55 AM, Corey Bettenhausen wrote:

On Sep 18, 2013, at 12:32 PM, Steve Hankin wrote:

On 9/18/2013 7:56 AM, Roy Mendelssohn - NOAA Federal wrote:

Hi All:

NASA has used hierarchies for years, and appears committed to them.  So, either 
it is done in an ad hoc way, or through a standard.  That doesn't mean CF is 
the place for the standard, just that it would be nice to have one.

Roy,

Let's explore the avenue you have opened here:  "that doesn't mean CF is the 
place for the standard".  The need for hierarchies as tools for programming is 
indisputable.  But will hierarchical groups advance the interoperability 
objectives of CF?

Steve,

Speaking for myself, I use groups in data files to organize the various 
datasets so that a person looking at the file via the commandline (h5dump, 
ncdump) or application (HDFView, Panoply) can find the dataset they're 
interested in easily. For instance, in our swath-level (L2) data, we have a 
number of datasets that aren't really that relevant to our end users, but could 
come in handy when diagnosing a problem with the algorithm or to monitor 
algorithm performance. So that these diagnostic datasets don't clutter up the
output, we've put them into a separate group from the main datasets.

So, in this case, do the groups make the files more interoperable? Not really, 
if we're talking about a completely software-driven system. But this *does* 
make them more user-friendly, and we'd definitely like to maximize our 
compatibility as well with those software-driven processes. Why not have the 
best of both worlds?  Hence, I fully support CF incorporating groups into
the conventions. I think Charlie's proposal is an excellent starting point.

Cheers,

-Corey

At the start of this discussion I had assumed that there would be compelling 
examples that supported the introduction of hierarchies to CF.  Thus far all 
that have been put on display seem to be counter-examples(*):

    * For CMIP5 any given hierarchy is an arbitrary, brittle representation.  
The CMIP5 collection is better modeled by facets (metadata tags) than by 
hierarchies.

    * The suitcase analogy serves best to illustrate the problems that 
hierarchies can bring -- to locate the black socks in a suitcase usually 
involves rummaging the entire suitcase.

            * ==>  Which speaks to Rich's valid concern that the 
data-discovery-to-data-access transition may be very negatively impacted if 
hierarchies are not used carefully.

    * NASA hierarchies that are 10 levels deep strike me as by definition an 
"insider" view of a data collection.  These hierarchies may add clarity for the 
specific satellite program communicating with its designated science groups, 
but they are likely a barrier to an outsider wanting to utilize the data.

To move forward we need to see some compelling use cases that will help us
to understand the costs and benefits.

    - Steve

(*) with the exception of Feature Collections types already contained in CF

=================================================

I would point out that every major modern  programming language has structures, 
which are essentially hierarchies.  Matlab was criticized for years about not 
having structures, and finally added them a few years back.  R has them, C has 
them, Python has them, even modern Fortran has them.  So clearly there must be 
situations where hierarchies make sense, and are more efficient than having 
everything flat.  There are clearly situations where flattening everything 
makes sense.

My $0.02.

-Roy

On Sep 18, 2013, at 4:52 AM, "Signell, Richard" <rsign...@usgs.gov> wrote:

All,

I'm glad we are discussing this topic, but the fact that large data
providers are already distributing data using groups and hierarchies
is not a compelling reason to endorse this practice through CF.  After
all, a lot of data providers are currently distributing scientific
data in any number of forms, and the point of CF (along with OGC
standards) is to help clean up the mess!

I agree that groups make sense for metadata and for certain types of
datasets.  For example, the discrete sampling geometry featureTypes
like profile collection would be easier to understand and deal with as
a netcdf4 group of profiles rather than as a netcdf3 ragged array.
But the choice was made for CF 1.6 that backward compatibility was
more important.

I don't think it's cowardly to believe that the more folks use groups
to organize their data in an ad hoc way (the suitcase analogy), the
more it will hinder the remarkable progress that has been made
recently on finding and utilizing distributed CF data via the catalog
services (e.g. the geonetwork, gi-cat, geoportal, CKAN instances) that
many governments are setting up.   When we open the data service
endpoints that our query returns, we need to have known data
structures, and that's what the CF featureTypes provide.

To return to the suitcase/clothing analogy again, we are rapidly
gaining the capability via good metadata and catalog services to find
all the black socks owned by Jim and Martin that have been washed in
the last week.  But if our catalog query returns fourteen of Jim's
suitcases and twelve of Martin's, then we have more work to do.
Unlike socks, luckily we don't need actual suitcases to organize data,
we can construct collections on the fly using whatever attributes we
desire.

I would hope that our job as the CF community would be to identify
compelling additional specific featureTypes that we should support.
And if these identified featureTypes demand groups for efficiency or
some other reason, well, let's have that discussion.

-Rich

On Wed, Sep 18, 2013 at 12:08 AM, Roy Mendelssohn - NOAA Federal <roy.mendelss...@noaa.gov> wrote:

Hi All:

I am old and slow, and I must be missing something, because at this point most 
of the discussion has been about the desirability of files with groups and 
hierarchies.  Again, unless I am missing something, there already are data 
providers who are distributing data using groups and hierarchies, including at 
least one very large data provider,  and they obviously feel that there is a 
benefit to such structures.  I am not arguing whether they are right or wrong,
just that this is the reality.

If we start from that premise, then the real questions for discussion are:
first, should there be conventions on how groups and hierarchies are used in
netcdf4 and hdf5 files, so that a user or software provider will know what to
expect; and second, if such conventions are deemed desirable, is CF the proper
place for them to be developed?

My sense is that this is what the original proposers are after.

-Roy

**********************
"The contents of this message do not reflect any position of the U.S.
Government or NOAA."
**********************
Roy Mendelssohn
Supervisory Operations Research Analyst
NOAA/NMFS
Environmental Research Division
Southwest Fisheries Science Center
1352 Lighthouse Avenue
Pacific Grove, CA 93950-2097

e-mail: roy.mendelss...@noaa.gov (Note new e-mail address)
voice: (831)-648-9029
fax: (831)-648-8440
www: http://www.pfeg.noaa.gov/

"Old age and treachery will overcome youth and skill."
"From those who have been given much, much will be expected"
"the arc of the moral universe is long, but it bends toward justice" -MLK Jr.

_______________________________________________

CF-metadata mailing list

CF-metadata@cgd.ucar.edu<mailto:CF-metadata@cgd.ucar.edu>

http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata

--

Dr. Richard P. Signell   (508) 457-2229

USGS, 384 Woods Hole Rd.

Woods Hole, MA 02543-1598


