On 11/4/2010 5:50 AM, Ben Hetland wrote:
Hello List,

Since I'm also dealing a bit with data sets holding "particle
collections" through time, I'd like to contribute some thoughts
regarding this. Our primary use here at SINTEF is for inputs to and
results from oil drift simulations, so we're dealing mostly at the sea
surface and below, although it appears to me that similar data sets are
just as applicable in the atmosphere. (Particle collections
and bounding polygons for the ash cloud from the Eyjafjallajökull
eruption spring to mind as a fairly recent example.)


On 04.11.2010 03:19, John Caron wrote:
1) It seems clear that at each time step, you need to write out the
data for whatever particles currently exist.

This is a very fair assessment in our case. One could generalize a bit
more: we have data organized as a series of time steps (the primary
dimension). At each time step we have a number of data items to store,
of various sizes and types. Particles are but one of these kinds of
objects. Most of them can probably be treated similarly to particles,
though, in that a fixed set of properties describing each object can
simply be represented by a separate netCDF variable per property.
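
To make that concrete, here is a minimal sketch of such a layout using the
netCDF4 Python module. The dimension and variable names, and the fixed
particle count, are placeholders for illustration only, not from any agreed
convention:

    # One record per time step; each particle property is its own variable.
    # Names ("particle", "x_pos", ...) and the fixed size 20000 are made up.
    import numpy as np
    from netCDF4 import Dataset

    nc = Dataset("particles.nc", "w", format="NETCDF3_CLASSIC")
    nc.createDimension("time", None)        # unlimited record dimension
    nc.createDimension("particle", 20000)   # assumes a known Nmax

    time  = nc.createVariable("time",  "f8", ("time",))
    x_pos = nc.createVariable("x_pos", "f4", ("time", "particle"))
    y_pos = nc.createVariable("y_pos", "f4", ("time", "particle"))
    mass  = nc.createVariable("mass",  "f4", ("time", "particle"))

    for step in range(2):                   # each step appends one record
        time[step] = step * 3600.0
        x_pos[step, :] = np.zeros(20000, dtype="f4")
        y_pos[step, :] = np.zeros(20000, dtype="f4")
        mass[step, :]  = np.ones(20000, dtype="f4")
    nc.close()

Unused particle slots at a given step would simply hold fill values.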

A nastier example would be representing an oil slick's shape and
position with a polygon: the number of vertices of that polygon would be
highly variable through time. (This is a typical GIS-like representation.)


I assume that if you wanted to break up the data for a very
long run, you would partition by time, i.e. time steps 1-1000,
1001-2000, etc. would be in separate files.

How one decides to partition can, I think, depend a lot on the
application. Sometimes splitting on data type can be more
appropriate. In a recent case I had, the data were to be transferred to
a client computer over the Internet for viewing locally. In that case
reducing the content of the file to the absolute minimum set of
properties (those the client needed in order to visualize) became
paramount. Even a fast Internet connection does have bandwidth
limitations... :-)

I'm thinking more of the file as it's written by the model.

But it seems like an interesting use case is to then be able to transform it
automatically to some other, optimized layout.



2) Apparently the common read pattern is to retrieve the set of
particles at a given time step. If so, that makes things easier.

Yes, often sequentially by time as well.

Sequential access is very fast.
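
With time as the record dimension, "all particles at step i" is one
contiguous slice, and a sweep over the steps just walks the file front to
back. A rough read sketch, assuming a per-property layout like the one above:

    # Read-pattern sketch; file and variable names are illustrative only.
    from netCDF4 import Dataset

    nc = Dataset("particles.nc", "r")

    i = 0
    x_at_step = nc.variables["x_pos"][i, :]         # one contiguous record

    for step in range(len(nc.dimensions["time"])):  # sequential sweep
        x = nc.variables["x_pos"][step, :]
        # ... hand this step to the visualization / analysis code ...
    nc.close()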



3) I assume that you want to be able to figure out an individual
particle's trajectory, even if that doesn't need to be optimized
for speed.

Not my primary need, but if an object is "tracked" like that, it is
quite likely that its trajectory would need to be accessed
"interactively", e.g. while a user is viewing a visualization of the data
directly on screen. Does that count as "optimized for speed"?

Well, it's impossible to optimize for both "get me all data for this time step" and
"get me all data for this trajectory" unless you write the data twice.

So I'm hearing we should do the former, and make the trajectory visualization as
fast as possible. If you really needed to make that really fast, I really would
write the data twice (really). That could be done in a post-processing step, so
we don't have to let it complicate things too much right now.
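
A rough sketch of that post-processing step, assuming the illustrative
time-major layout sketched earlier: copy each variable into a second file
with the dimensions reversed, so that per-trajectory reads become contiguous.

    # Naive "write it twice" rewrite; file and variable names are made up.
    from netCDF4 import Dataset

    src = Dataset("particles.nc", "r")
    dst = Dataset("particles_by_traj.nc", "w", format="NETCDF3_CLASSIC")

    ntime = len(src.dimensions["time"])
    npart = len(src.dimensions["particle"])
    dst.createDimension("particle", npart)
    dst.createDimension("time", ntime)
    out = dst.createVariable("x_pos", "f4", ("particle", "time"))

    # Transpose one time step at a time; a blocked copy would be kinder
    # to memory and to the disk for a multi-GB file.
    for step in range(ntime):
        out[:, step] = src.variables["x_pos"][step, :]

    src.close()
    dst.close()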




1) Is the average number (Navg) of particles that exist much smaller
than, or approximately the same as, the max number (Nmax) that exist at
one time step?

This varies a lot. Sometimes it is as you suggest, but sometimes only a
few particles exist at any one time. Sometimes there isn't any defined
Nmax either (dynamic implementations), or such a limit can be difficult
to know beforehand.

Even where an Nmax is set, would it be unreasonable to require the
_same_ value to be used every time if the netCDF dataset were accumulated
through _multiple_ simulation runs?

Without knowing Nmax, you couldn't use netCDF-3 multidimensional arrays, but you could use
the new "ragged array" conventions. Because the grouping is reversed (first
time, then station dimension) from the current trajectory feature type (first trajectory,
then time), I need to think about how to indicate that so that the software knows how to
optimize it.
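
For what it's worth, here is one way a contiguous ragged array could look,
written with the netCDF4 Python module. The names ("obs", "row_size") only
illustrate the idea; they are not a settled convention for this time-major
grouping:

    # A per-time-step count says how many particles belong to that step;
    # the particle properties are packed back to back along one long
    # "obs" dimension.
    import numpy as np
    from netCDF4 import Dataset

    nsteps = 5                                  # known up front in this sketch
    nc = Dataset("ragged.nc", "w", format="NETCDF3_CLASSIC")
    nc.createDimension("time", nsteps)
    nc.createDimension("obs", None)             # the one unlimited dimension

    time     = nc.createVariable("time",     "f8", ("time",))
    row_size = nc.createVariable("row_size", "i4", ("time",))  # particles/step
    x_pos    = nc.createVariable("x_pos",    "f4", ("obs",))

    start = 0
    for step in range(nsteps):
        n = 20 + step                           # made-up, varying count
        x_pos[start:start + n] = np.zeros(n, dtype="f4")
        row_size[step] = n
        time[step] = step * 3600.0
        start += n
    nc.close()

Getting time step i back then means summing row_size[0:i] to find its
offset, which is exactly the sort of thing the reading software needs a hint
about in order to optimize access.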

Also, we could explore netCDF-4, which has variable-length arrays, although
these have to be read atomically, so it's not obvious what the performance would be
for any given read scenario. For sure worth trying.
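
A minimal netCDF-4 VLEN sketch (again with made-up names), just to show the
shape of it: each time step holds one variable-length row of particle
values, and that row is read or written in one piece.

    # netCDF-4 variable-length (VLEN) type, via the netCDF4 Python module.
    import numpy as np
    from netCDF4 import Dataset

    nc = Dataset("vlen.nc", "w", format="NETCDF4")
    nc.createDimension("time", None)
    vl_f4 = nc.createVLType(np.float32, "particle_values")
    x_pos = nc.createVariable("x_pos", vl_f4, ("time",))

    x_pos[0] = np.zeros(23, dtype=np.float32)   # 23 particles at step 0
    x_pos[1] = np.zeros(17, dtype=np.float32)   # 17 particles at step 1

    row = x_pos[1]          # the whole step comes back in one piece
    nc.close()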



2)  How much data is associated with each particle at a given
time step (just an estimate needed here - 10 bytes? 1000 bytes?)

In our case this varies a lot with the type of particle and how the
simulation was set up. A quick assessment indicates that some need only
16 bytes per particle, while others may currently require up to 824
bytes. (This does not account for shared info like the time itself,
which we don't store per particle.) It also wouldn't be atypical for this
amount to be multiplied by, say, 20000 particles per time step.

20K particles x 100 bytes/particle = 2 MB per time step. So every access to a time
step would cost you a disk seek, roughly 10 ms. So 1000 time steps = 10 seconds to
retrieve a trajectory. Fast enough?

Otherwise, putting it on an SSD (solid state disk) is an interesting thing to try.

Not sure how long it would take to rewrite a 2 GB file to reverse dimensions.



Hope that provides some useful ideas of the real-life needs!
:-)

Thanks!
