Re: [CF-metadata] CF-metadata Digest, Vol 164, Issue 12

Little, Chris Wed, 21 Dec 2016 10:22:00 -0800

Dear CF Community

Thanks for sight of this interesting debate and approach to something practical 
in a complicated domain.


Over the years, I have seen various patterns (and anti-patterns) develop, 
whether in computer graphics or GIS systems.

Can I recommend the mathematical concept of 'completion' or 'closure' to help 
your decisions. Apologies if anyone is being taught to suck eggs.

Multi-lines or multi-polylines are, in some sense, more fundamental entities 
than lines or polylines, because the result of various operations such as cut 
or intersect may converts lines into multi-lines. The same operation on 
multi-lines always ends up with a multiline.

Similarly for polygons. Multi-polygons are more fundamental, as operating on a 
single  polygon may result in several polygons, possibly with holes e.g. Lake 
with falling water levels may convert from a single polygon into several 
disjoint puddles or lakes with islands.

Similar patterns occur in 3D graphics, where various splines are not 
fundamental, but Non-Uniform Rational Beta Splines (NURBS) are. Special cases 
of anti-patterns are adding circles but not ellipses, or rectangles and squares 
but not quadrilaterals to graphics libraries.

If I understand the threads below correctly you seem to have converged to a 
sensible set of closures.

HTH, Chris  

-----Original Message-----
From: CF-metadata [mailto:cf-metadata-boun...@cgd.ucar.edu] On Behalf Of 
cf-metadata-requ...@cgd.ucar.edu
Sent: Wednesday, December 21, 2016 5:42 PM
To: cf-metadata@cgd.ucar.edu
Subject: CF-metadata Digest, Vol 164, Issue 12

Send CF-metadata mailing list submissions to
        cf-metadata@cgd.ucar.edu

To subscribe or unsubscribe via the World Wide Web, visit
        http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata
or, via email, send a message with subject or body 'help' to
        cf-metadata-requ...@cgd.ucar.edu

You can reach the person managing the list at
        cf-metadata-ow...@cgd.ucar.edu

When replying, please edit your Subject line so it is more specific than "Re: 
Contents of CF-metadata digest..."


Today's Topics:

   1. Re: Feedback requested on proposed CF Simple      Geometries
      (Ben Koziol - NOAA Affiliate)


----------------------------------------------------------------------

Message: 1
Date: Wed, 21 Dec 2016 10:41:04 -0700
From: Ben Koziol - NOAA Affiliate <ben.koz...@noaa.gov>
To: Chris Barker <chris.bar...@noaa.gov>
Cc: Jonathan Gregory <j.m.greg...@reading.ac.uk>,
        "cf-metadata@cgd.ucar.edu" <cf-metadata@cgd.ucar.edu>
Subject: Re: [CF-metadata] Feedback requested on proposed CF Simple
        Geometries
Message-ID:
        <CADckZ8dwjwOt+em_z46y_kx1jvA5MnsMs=qdj3p4nx+rth4...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Jonathan, Chris, and CF-Metadata,

Thank you kindly for the replies and apologies for the delay on the response. 
Please see responses below. The comments were very useful in introducing some 
conceptual hangups and will help us moving forward. As usual, looking forward 
to continued discussion.

A quick note. We?ve revised our thinking a bit and recorded our current design 
in an AGU poster: http://goo.gl/0NI4Sd. Based on the feedback in this thread 
and what we?ve learned in the process of preparing the poster and software, we 
need to discuss a bit more and will prepare a proposal for the community to 
review soon.

I was asking whether this means that for each *collection* (of points, lines or 
polygons) there is a *single* timeseries. For instance, in your example of a 
single geometry composed of several polygons, there is a single number for each 
time. But that is not the case for weather stations; for each weather station 
there is a timeseries, and at each time there is a different number (value of 
temperature, precipitation or whatever) for each weather station. You also 
write, ?The US National Weather Service?s National Water Model (NWM) ? 
forecasts streamflow rates in about 2.7 million stream segments averaging 2km.? 
The stream network is a MultiLineString geometry, but I don?t think there is 
just one value of streamflow applying to the entire network at any given time; 
I guess there is a different timeseries for each stream segment. But in my 
example above, the Atlantic Ocean is a single polygon with a single timeseries 
for its average temperature, not a different timeseries for each node
 . Thus I am unclear about the dimensions of the data. In terms of your 
original example, does the data have dimensions (time,geometry, where 
geometry=1) or (time,node)?

Before diving in, it?s critical to define some terminology. A geometry is meant 
to refer to a potentially multipart geometric entity that might otherwise be 
called a feature. A geometry is made up of one or more points, lines, or 
polygons.

That said, we are thinking the dimensions of time-varying data would be (time,
geometry) where time and geometry may have arbitrary lengths. Hence, multiple 
time-varying variables could be associated with each geometry.
Chris addressed this in his response.

How geometry data is ?exploded? is up to the client-software. The 2.7 million 
stream segments would likely not be a single MultiLineString geometry. The 
geometry counts could be 2.7 million Linestrings and 2.7 million Polygons. One 
could collapse all this geometry data into single multi-geometries, but this 
would prove unwieldy. Some of the LineStrings could be discontinuous 
multilinestring geometries (only requiring one index on the geometry dimension 
but consisting of two physical LineStrings).

This seems to me to be a crucial difference. In the former case the simple 
geometry can be regarded as a more complex alternative to cells bounds - the 
cell has a complicated geometry of nodes and lines, but it?s still a single 
cell. In the latter case you?re providing many timeseries in an unstructured 
geometry, which is what ugrid describes. Which do you have in mind?

We intend for this proposal to fit in the Discrete Sampling Geometry timeSeries 
featureType. So this proposal does not contain any new mechanism to link a 
time-varying data variable with a network composed of polygons, points, and 
lines (a whole hydrologic system for example). UGRID provides some mechanisms 
for this similar to other CF conventions (data is associated with a grid center 
point and its bounds for example - or a ?face? has a center point in UGRID).

It?s possible your question is still not being addressed. ?Nodes? are used in 
all geometries. We would never associate time-varying data with nodes or the 
edges between them. Data would always be associated with the geometry (a 
feature comprised of nodes).

You propose the index variable in order for the convention to be like ugrid. 
However this still seems to me to be an unnecessary complexity and use of space 
if you aren?t going to have many shared nodes. I think the case for having 
another convention, distinct from ugrid, is stronger if it is *unlike* ugrid in 
this respect, and therefore simpler as well.

Sharing nodes should be possible within the spec in our opinion. There may be 
overhead for some dataset encodings, but if one is willing to sacrifice 
computational time when writing complex geometric datasets with 
shared-node-topology, considerable disk space and memory may be saved.

Reusing coordinate indexing indirection does not seem like a duplication of 
UGRID. In fact, it makes sense to align with UGRID as much as possible to 
facilitate data exchange.

I agree that repeating the inside/outside flag many times is wasteful.
That, coupled with your clarification that you may have several geometries, 
each consisting of several elements (points, lines, polygons), means that you 
need, in effect, a ragged array of ragged arrays (geometry,element,node). This 
is more complicated than DSGs, but it seems to me it would be reasonably easy 
to understand if your multi-geometry example 
https://github.com/bekozi/netCDF-CF-simple-geometry/wiki/VLEN-Arrays-in-NetCDF-3#multipolygon-example
was stored something like this:

geom=3;
part=11;
node=36;
int number_of_parts(geom);
  number_of_parts:parts="number_of_nodes";
int number_of_nodes(part);
  number_of_nodes:inout="inout";
char inout(part);
float x(node);
float y(node);
number_of_parts=6, 3, 2;
number_of_nodes=4, 3, 3, 3, 3, 3, 3, 5, 3, 3, 3; inout="OIIIOOOOIOO"; x=0, 20, 
20, 0, 1, 10, 19, 5, 7, 9, 11, 13, 15, 5, 9, 7, 11, 15, 13, -40, -20, -45, -20, 
-10, -10, -30, -45, -30, -20, -20, 30, 45, 10, 25, 50, 30; y = 0, 0, 20, 20, 1, 
5, 1, 15, 19, 15, 15, 19, 15, 25, 25, 29, 25, 25, 29, -40, -45, -30, -35, -30, 
-10, -5, -20, -20, -15, -25, 20, 40, 40, 5, 10, 15;

where I assume that all polygons are closed. What do you think?

This is not a bad approach and, with some modifications, it could be used.
A few thoughts:

   - In regards to ?ragged array of ragged arrays?, this is solved in our
   proposal by using multiple coordinate index / geometry variables indicating
   a ?geom/instance? dimension to identify associated data variables. These
   variables may use their own coordinate vectors or index into shared
   coordinate vectors. Some more discussion on this when responding to Chris?s
   comments.
   - This approach is more complex, in our opinion, than the indexing
   approach. In the case of multiple geometries in a single file, it is not
   clear how these variables are all linked together. The coordinate index
   variable hosts the attributes necessary for self-description.
   - We had considered adopting something along these lines to avoid the
   use of ?break values?. We wanted to avoid variables where possible. The
   contiguous ragged approach requires only stop indexing (one additional
   variable). For NetCDF4 ragged arrays, no ragged indexing is required
   keeping the schema clean. Geometries lend themselves naturally to NetCDF4
   ragged arrays. We agree break values are confusing. Break values present a
   reasonable solution without adding considerable instrumentation for
   multi-part geometries which will generally be an exception.
   - The nice thing about the coordinate index approach is that one
   approach basically works for all geometry types. In your example,
   number_of_parts and inout are only present for multi-geometries
   (multipolygons).
   - It is more difficult to extract a single geometry using this approach.
   Not that big of deal necessarily, but the contiguous ragged and NetCDF4
   variable length make the process relatively simple. With geometries, we
   view single element / random access as an important schema characteristic.

For comparison, your CDL would look something like the following when 
transformed into our proposal. Note, the ragged index and nodes have been 
broken onto one line per geometry:

geom=3;
index=44;
node=36;
int start_of_geom(geom);
  start_of_geom:contiguous_ragged_dimension = "index"
int coordinate_index(index);
  coordinate_index:geom_type="multipolygon";
  coordinate_index:geom_coordinates="x y";
  coordinate_index:geom_dimension="geom";
  coordinate_index:stop_encoding="cra";
  coordinate_index:multipart_break_value=-1;
  coordinate_index:hole_break_value=-2;
  coordinate_index:outer_ring_order="anticlockwise";
  coordinate_index:closure_convention="last_node_equals_first";
float x(node);
float y(node);

start_of_geom=0, 24, 37 ;
coordinate_index=0,1,2,3,-2,4,5,6,-2,7,8,9,-2,10,11,12,-1,13,14,15,-1,16,17,18,
19,20,21,-1,22,23,24,25,26,-2,27,28,29,
30,31,32,-1,33,34,35 ;
x=0, 20, 20, 0, 1, 10, 19, 5, 7, 9, 11, 13, 15, 5, 9, 7, 11, 15, 13, -40, -20, 
-45, -20, -10, -10, -30, -45, -30, -20, -20, 30, 45, 10, 25, 50, 30; y = 0, 0, 
20, 20, 1, 5, 1, 15, 19, 15, 15, 19, 15, 25, 25, 29, 25, 25, 29, -40, -45, -30, 
-35, -30, -10, -5, -20, -20, -15, -25, 20, 40, 40, 5, 10, 15;

On to Chris?s comments.

I think it may be helpful to borrow terminology (and the data model) from the 
GIS world here. In this case, I am referencing the geoJSON spec, as I happen to 
be working with that at the moment, but the basic data model is pretty 
consistent.

Agreed. We want to, at the very least, maintain consistency with geometry types 
(not case sensitive). The ?break values? are a poor man?s parentheses 
corresponding to their uses in arrays used by GeoJSON and WKT.

Note that they have ?geometries? which can be things like points, polygons, 
polyllines. IIUC (and I?m no osgeo mavin) geometries represent a ?single?
entity. Then there are ?Features?: a Feature is essentially data associated 
with a particular geometry. But note: there are ?Collections? ? both Geometry 
and Feature Collections ? that is what you use to ?bundle? various data 
together.

I think we may be well served by thinking in terms of mapping the GIS data 
model to CF/netcdf ? for instance it would be great to be able to write a 
netcdf<->geoJSON converter that was lossless, AND would be fairly ?native?
in both cases.

Agree in principle. In practice, this proves difficult of course. :-) A 
FeatureCollection may contain features with different geometry types. We would 
need to add an additional dimension variable describing the geometry type. A 
GeometryCollection itself may nest inside a FeatureCollection. We think of 
simple geometry variables as FeatureCollections with a static geometry type. 
NetCDF groups may help provide a crosswalk, but I am not sure we are ready to 
go there.

GeoJSON also tends to break down with time series - repeating geometries 
coordinates ad nauseum for each time coordinate. TopoJSON offers some solutions 
in this regard. TopoJSON also incorporates node sharing. Chris did link to this 
JSON spec already.

Also, how do NetCDF attributes, dimensions, and data types fit into GeoJSON? No 
good answer for this one.

What?s important is that we establish a way to encode simple/basic geometry 
types. Collections can be created using indexing or hierarchies. The basic 
static geometry-typed FeatureCollection should be sufficient for most 
applications.

(though I?m still confused, maybe you can have an ?array? of data associated 
with a GeometryCollection?)

as for MultiLineString ? you could associate an array of data with the 
Multilinestring ? so one value per segment. But I think that violates the 
intent of the data model ? you should have a GeometryCollection of linestrings 
instead. and then each segment has its own geometry and you can associate an 
array of data with that. (or it should be a FeatureCollection?
I?m getting confused now!

Yup, makes your head hurt. No good answer for this. This leads back to the 
?ragged array of ragged arrays? comment by Jonathan earlier. It is likely 
beyond the scope of our proposal. The solution may lie in turning geometry 
variable attributes and names into arrays themselves adding another layer of 
indexing in the process. We hope that the simple geometry encoding methods 
could be reused in an effort like this.

Of course, CF doesn?t need to follow this data model, but it?s a good idea to 
be informed by it.

Yes, absolutely. Your descriptions of geometry types and collections is 
correct. GeometryCollections can also be used for the *same* types of 
geometries - a minor point.

In the GIS data model, nodes are not shared between geometries, and you are 
quite right that keeping nodes separate with geometries indexing into it is an 
added complication and would not be space-efficient.

However, there is another reason to do it ? it makes it definitive that two (or 
more) geometries share the exact same node, rather than them being distinct 
points that happened to be at the same location (Or worse, with FP error and 
all, two points that are very close)

Yes, this is why we are for the coordinate indexing approach used in UGRID.

This is actually a major limitation in the standard GIS model.

Also yes!
?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
<http://mailman.cgd.ucar.edu/pipermail/cf-metadata/attachments/20161221/28c0b6e9/attachment.html>

------------------------------

Subject: Digest Footer

_______________________________________________
CF-metadata mailing list
CF-metadata@cgd.ucar.edu
http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata


------------------------------

End of CF-metadata Digest, Vol 164, Issue 12
********************************************
_______________________________________________
CF-metadata mailing list
CF-metadata@cgd.ucar.edu
http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata

Re: [CF-metadata] CF-metadata Digest, Vol 164, Issue 12

Reply via email to