Wow, a lot of great activity on this thread over the weekend.  On my end, a
family issue came up so I'm playing catch-up now - first I'll give a little
background and then start from the top and work down...

As far as background goes, one thought that's a bit bigger picture than my
specific request: if there's interest, how can we make the public APIs of
GeoTools and GeoServer more amenable to distributed processing?  And yes, I
realize that introducing broader concepts like this into a baseline that
primarily requires solid and consistent behavior will take patience.
But you can count me in for the ride, if there is enough interest.

I think there are a variety of choices for my specific use case.  I chose
one that used Spring bean exclusion to introduce a map output format
extending the default RenderedImageMapOutputFormat.  It understands a flag
that can be set per layer (only if that layer is backed by GeoWave's
datastore), with additional parameters that give you basic control over the
subsampling, such as an alpha threshold per pixel, the number of features
rendered to a pixel, and whether you only want to subsample on the topmost
featuretypestyle (since multiple featuretypestyles define the layers of the
resulting image, you can really only truly subsample using the topmost
featuretypestyle, but you could say that if other featuretypestyles render
to a pixel, that is good enough).  We can call that flag "distributed
rendering mode" for lack of a better term.  As Chris points out,
"distributed rendering" is an oversimplification, as there is some
interesting subsampling we can do on the back end.  I can dig into the
concept of a GetMapCallback building a DirectLayer - from that standpoint I
think we will fit into the spirit of the GeoServer/GeoTools public APIs
much better, and still have full control, similar to the full control I
jumped on with the custom map output format approach.  However, my concerns
are that we want to re-use as much code as possible, and we'd prefer not to
be a unique snowflake in the GeoTools ecosystem if we can make this concept
of broader appeal.  As a DirectLayer, I feel we may essentially end up in
the same boat of wanting to re-use components of the StreamingRenderer in
ways that are not available through the public API, which in the end is not
really a maintainable proposition for us.  But I also feel we will likely
be a unique snowflake if we only talk specifically about "WMS rendering"
when we talk about exposing/expressing distributed computation intuitively
in GeoServer/GeoTools.  It may end up being hard to justify any further
hooks than you have already suggested without the broader appeal of
thinking about the problem generally: rendering is the special case of
"my data store wants to expose distributed processing."
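To make the subsampling knobs above a bit more concrete, here is a small illustrative sketch (plain Python, not GeoWave or GeoTools code - every name here is made up): a canvas that stops accepting paint into a pixel once an alpha threshold or a per-pixel feature count is reached, which is what lets a whole feature be skipped when all of its pixels are already "done".

```python
class SubsamplingCanvas:
    """Toy canvas tracking per-pixel alpha and feature counts, so features
    whose pixels are all saturated can be subsampled away entirely."""

    def __init__(self, width, height, alpha_threshold=0.95, max_features=8):
        self.alpha = [[0.0] * width for _ in range(height)]
        self.count = [[0] * width for _ in range(height)]
        self.alpha_threshold = alpha_threshold
        self.max_features = max_features

    def pixel_done(self, x, y):
        # a pixel needs no more rendering once it is sufficiently opaque
        # or enough features have already contributed to it
        return (self.alpha[y][x] >= self.alpha_threshold
                or self.count[y][x] >= self.max_features)

    def draw_feature(self, pixels, feature_alpha):
        """pixels: (x, y) cells this feature touches.  Returns True if
        anything was drawn, False if the feature was subsampled away."""
        todo = [(x, y) for x, y in pixels if not self.pixel_done(x, y)]
        if not todo:
            return False  # every touched pixel is done: skip this feature
        for x, y in todo:
            # standard "over" alpha compositing
            self.alpha[y][x] += feature_alpha * (1 - self.alpha[y][x])
            self.count[y][x] += 1
        return True
```

The point of the sketch is only the decision rule; in practice this check runs on the distributed nodes, local to the data.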

Here's some earlier discussion for the group that I don't think made it onto this thread:

Myself: I know I tried to use a render transform first, much like the use
> case where I subsample/decimate the results.  One key difference that made
> a render transform awkward to use for distributed rendering was that the
> style rules are outside of the render transform, and you can have
> multiple featuretypestyles with different render transforms for each
> layer.  I want to be able to send one query that pushes the
> processing/rendering of the data local to the data, and not do it multiple
> times.  The render transform technique was outside of this scope.  I say
> "processing" because I wonder if it makes sense to expose, as part of the
> public API, something at the DataStore level or the FeatureCollection level
> that somehow marks it as distributed, so the same intuitive hooks for
> distributing WPS processes can be used for distributing rendering too?
>
>> Andrea: Datastore and collection are just data providers; if you have a
>> good distributed implementation of them, more power to you, but I don't
>> see how marking them as parallel will benefit the user of the collection.
>> As data providers, processing is not their job (again, if you are hiding
>> processing below the surface, that's fine; the normal code should not be
>> bothered by that though), but you can pass them a FeatureVisitor, and
>> there we might have a more natural integration with map/reduce style
>> processes, provided we figure out some way to mark a visitor as
>> serializable (so that it can be sent to other nodes), maybe with an
>> indication of the preferred distribution strategy (slice over space,
>> time, attribute ranges?), and "reduceable" (given the results of the N
>> distributed visits, how do I put it back together?).
>> And the same could of course be used locally to leverage multiple
>> CPUs/cores.
>
>
To clarify, yes, we have a good distributed implementation of the data
provider hooks. As a point of reference on the back-end approaches we're
leveraging, essentially we are looking at initially supporting the stores
that most closely resemble Google's BigTable (Accumulo right now, then
HBase because it's very similar, then perhaps Cassandra, with the
understanding that there could be a bit more to take on there).  Who knows,
we may look into more in the future.  But the idea that our feature
collection can stream back features, and lots of them, to a client (in this
case GeoServer) is not quite enough in our situation.  We need to be able
to distribute the processing too.  I really think Andrea is getting at
something with the visitor pattern, particularly when you look at the
ContentFeatureSource accepts() method. The real win is if we can decouple
the processing from the data source, because as a source I would want to
run any process defined by a third party that can work on my feature data
(or grid coverages :-) ...or am I getting too greedy there). I think the
distribution strategy can be encapsulated by the data provider, but the
"reduceable" part of how to put the results back together really has to
be encapsulated by the process.
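As a sketch of the visitor contract Andrea describes - hedged heavily: none of these classes exist in GeoTools, and `copy.deepcopy` just stands in for the Java serialization that would ship a visitor to a remote node - the idea is a visitor that can be copied out to N partitions and whose partial results can be reduced back together by the process, not the data provider:

```python
import copy

class MaxVisitor:
    """Visits features, tracking the max of one attribute; 'reduce' is the
    'put the N distributed results back together' half of the contract."""

    def __init__(self, attribute):
        self.attribute = attribute
        self.result = None

    def visit(self, feature):
        value = feature[self.attribute]
        if self.result is None or value > self.result:
            self.result = value

    def reduce(self, visitors):
        # merge the partial results of the N distributed visits
        for v in visitors:
            if v.result is not None:
                self.visit({self.attribute: v.result})
        return self.result

def run_distributed(features, partitions, attribute):
    """The data provider picks the distribution strategy (the partitions);
    the visitor encapsulates both the visit and the reduce."""
    prototype = MaxVisitor(attribute)
    finished = []
    for part in partitions:
        # deepcopy stands in for serializing the visitor out to a node
        v = copy.deepcopy(prototype)
        for i in part:
            v.visit(features[i])
        finished.append(v)
    return MaxVisitor(attribute).reduce(finished)
```

Run locally over two fake "nodes", this gives the same answer a single sequential visit would, which is the whole point of the reduceable contract.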

Re:

> PS: in any case we might have a problem with labelling; if the style has
> any label, the distributed rendering thing cannot be applied... I guess
> your GetMapCallback could split the style in two parts and create two
> layers, a traditional one for labels, and a distributed one for everything
> else.
>

Did I miss something in assuming that, as long as I composite the
distributed layers in the same order that StreamingRenderer ordinarily does
with the drawOptimized() method and MergeLayersRequest, I can handle any
style, including labels?  Of course I have to keep the images rendered by
my distributed nodes separate for each featuretypestyle and for the labels,
but then in my "reduceable" part I composite the results for each
featuretypestyle and the labels in the usual order, trying to re-use the
existing StreamingRenderer pattern.
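The merge order I'm describing could be sketched like this (illustrative Python only, with images reduced to per-pixel alpha grids for brevity - this is not StreamingRenderer code, just the compositing order): merge the partial images per featuretypestyle, bottom style first, then composite the label image last.

```python
def composite_over(dst, src):
    """Painter's-algorithm 'over' for two same-sized alpha grids."""
    return [[d + s * (1 - d) for d, s in zip(drow, srow)]
            for drow, srow in zip(dst, src)]

def merge_distributed_results(fts_images, label_image):
    """fts_images: list, in featuretypestyle order (bottom first), of lists
    of partial images (one per distributed node); labels composite last."""
    height = len(fts_images[0][0])
    width = len(fts_images[0][0][0])
    result = [[0.0] * width for _ in range(height)]
    for partials in fts_images:        # bottom featuretypestyle first
        for img in partials:           # merge every node's contribution
            result = composite_over(result, img)
    return composite_over(result, label_image)
```

If the per-node images for one featuretypestyle barely overlap (geographic partitioning), the inner loop is nearly a pure mosaic; overlap just falls through to ordinary "over" blending.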

Re:

> In terms of how the initial request is handled, I am a little unclear on
> the choices here.  I believe we can and should restrict ourselves to SLDs
> which are 'easy to distribute'.  I think trading some style options for
> fast WMS responses with data sets involving billions of features is
> reasonable.  As we get a better handle on system performance, I'd be
> interested in supporting more styling as we zoom into a range which we know
> we can render quickly enough.


It seems to me that choosing your style based on zoom level adds an
unnecessary level of complication when, as Chris eloquently pointed out,
we're really going for "data size independence." We're not solving the
problem of being restricted by our style; we're solving the problem of
being restricted by the amount of data.  Really, what we are trying to
leverage here is twofold: distribution of processing, and short-circuiting
traversal (taking advantage of the fact that an image represents a finite
amount of information - we're talking pixels here, or more generally
samples organized by bands in multiple dimensions).  You can only render a
pixel once for a map request.
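That "finite amount of information" argument in miniature (an illustrative sketch, not real renderer code): a map request bounds the work by pixels, not features, so traversal can short-circuit as soon as every pixel is covered, no matter how many features remain.

```python
def scan_until_saturated(features, total_pixels):
    """features: iterable of the pixel sets each feature touches, in draw
    order.  Returns how many features were actually scanned before every
    pixel of the requested image was covered."""
    covered = set()
    scanned = 0
    for pixels in features:
        scanned += 1
        covered |= set(pixels)
        if len(covered) == total_pixels:
            break  # image saturated: remaining features cannot change it
    return scanned
```

With billions of features behind a small tile, the gap between `scanned` and the total feature count is exactly where the data-size independence comes from.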

> From the GeoTools side, is there anything we can/should be doing to make
> the process of passing hints and getting back synthetic SimpleFeatures more
> clear?  GeoMesa does this in a few places to help generate heatmaps and
> time series.  (I mention that since inserting ourselves in the GetMap step
> will likely let us call out to the database directly and return
> RenderedImages more cleanly.)


To clarify, I did just use this synthetic SimpleFeature approach as a way
to fit in for now, trying to re-use as much of the streaming renderer as I
could. We try to make use of the render transform for this summary type of
information, and you may be able to get by with it.  I just don't want this
kind of more specific discussion to lead you in the wrong direction; for
example, check out a vector-to-raster render transform if you want to
calculate heatmaps dynamically.  We like to be able to have a globally
characterized heatmap (support for quantiles), which we calculate through
map-reduce (essentially KDE, with the final reducer "ranking" each grid
cell, which ends up being a "Percentile" GridSampleDimension in our
datastore, giving us the ability to use CSS styling to define a color
ramp by "Percentile", so the distribution of colors is pretty
straightforward).
For the time series, have you guys looked at WCS?  I don't think there are
many clients for it, but it sounds like you guys are doing something fairly
specific anyhow.
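The final "ranking" reduce step might look something like this sketch (illustrative only - the function name is made up and our real implementation is a distributed map-reduce KDE): after the density sums are merged, each grid cell's value is replaced by its percentile over all cells, so a color ramp keyed on percentile is evenly distributed by construction.

```python
def rank_percentiles(cells):
    """cells: flat list of merged KDE density values.
    Returns each cell's percentile (0..1) across all cells."""
    order = sorted(range(len(cells)), key=lambda i: cells[i])
    n = len(cells)
    out = [0.0] * n
    for rank, i in enumerate(order):
        # lowest density -> 0.0, highest -> 1.0
        out[i] = rank / (n - 1) if n > 1 else 1.0
    return out
```

This is what lets the style stay a simple fixed ramp over "Percentile" while the actual density range varies wildly between datasets.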

Andrea, I'll support Jody's multi-threaded/parallel rendering ideas.  Chris
> and Rich have shared results from a 5 node cluster.  If I am understanding
> correctly, each of those servers will produce a RenderedImage with part of
> the final image.  In particular, I don't expect the rendered images to
> overlap all that much.  One server would have data for the Americas and
> another would paint the ranges for Europe.  GeoMesa uses a random sharding
> approach, so when we do this, our tiles may look very similar.
> Anyhow, on the GeoServer side, GeoWave and GeoMesa would both be providing
> at least one RenderedImage per cloud node.  If there are exactly 5 images,
> whatever approach would be fine.  If there are hundreds of images to merge,
> would having a concurrent merging process help?  I'd conjecture that a
> local process which would help uDig use multiple cores may also help our
> distributed rendering pipeline.


I just wanted to clarify my perspective on partitioning data.  First let me
say that we deploy to clusters orders of magnitude beyond 5 nodes and,
while I know the GeoMesa team understands this, I did want to clarify for
the group that the "well beyond 5 node case" is the horizontal scalability
that we're designing for here. From our perspective, 5 nodes is roughly the
minimum system where it makes sense to use GeoWave (i.e., it's likely much
simpler to use PostGIS if you're not looking to scale beyond a few nodes).
As far as partitioning data, random sharding should definitely be an
option.  I don't think it will scale as well for the distributed rendering
discussion here, but benchmarks and flexible configuration will shed better
light on that.  By keeping the partitioning geographic rather than random,
we will only be hitting a certain subset of our nodes for each tile
request, so the number of images to composite won't be the number of nodes;
depending on the size of the request envelope, it should be a significant
subset of the number of nodes.  To clarify for the group, random sharding
is generally a technique to avoid what I'll call hot spotting on any
particular nodes, whereas concurrent use is part of the story of not having
to randomly shard to avoid this hot spotting.  The other part for us is
something we call tiering, which is to say we support multiple
n-dimensional space filling curves.  Where this comes into play is that
lines and polygons aren't going to fit exactly one space filling curve
value.  We support a configurable number of space filling curves, each
prefixed by a "tier" identifier, right now just utilizing different
resolutions (variable cardinality per dimension). Chris actually did a lot
of research into space filling curves, so I'm just clarifying the results
of his awesome research, but we basically settled on the compact Hilbert
curve as our default space filling curve implementation, and the space
filling curve implementation is completely pluggable (
https://web.cs.dal.ca/~arc/publications/2-43/paper.pdf). Anyway, back to
the point: polygons and lines can overlap many values on the space filling
curve.  Here's a fun example of a z-order curve:
http://bl.ocks.org/jaredwinick/5073432, but really range decomposition can
cut that down a bit.  We choose, from one of our "tiers", the space filling
curve that can relatively concisely represent the polygon/line, which
introduces a new variable, the tier ID, that causes some distribution
beyond purely geographic, also alleviating some hot spotting concerns.
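For the group, here's a toy sketch of the tiered indexing idea - using a 2-D Morton/z-order curve for brevity rather than the compact Hilbert curve we actually default to, and with `interleave`/`row_key` as made-up names, not GeoWave API: each tier is a curve at its own resolution, and prefixing the curve value with the tier identifier keeps the tiers in disjoint key ranges.

```python
def interleave(x, y, bits):
    """2-D Morton (z-order) code: interleave the bits of x and y."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return z

def row_key(tier, x, y):
    """Row key = tier identifier prefix + curve value at that tier's
    resolution, so a geometry indexed at a coarse tier sorts into a
    different key range than point data at the finest tier."""
    return (tier, interleave(x, y, bits=tier))
```

A polygon that would decompose into many ranges at a fine tier gets stored once at a coarser tier whose single curve value covers it, and the tier prefix itself spreads keys beyond a purely geographic ordering.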

Per Jody's use cases in uDig, any client-side rendering improvements are
always great to build on, particularly when we talk about merging the
images together in the end - because as was mentioned, there could be a lot
of images.  I'd at least say that if the data for the request envelope is
distributed across so many nodes that merging the results together is a
major bottleneck, the dataset is likely so tremendously large that typical
client-side rendering approaches would be off the table anyway.

Anyway, in re-reading this, it's quite a lot for anyone to digest!  Sorry!

Rich


On Mon, Feb 16, 2015 at 2:50 AM, Andrea Aime <andrea.a...@geo-solutions.it>
wrote:

> On Mon, Feb 16, 2015 at 1:04 AM, Jim Hughes <jn...@ccri.com> wrote:
>
>>  Hi all,
>>
>> I'd like to try and summarize the discussion so far to make sure that I
>> understand.
>>
>> As a request comes into GeoServer, it would either be handled by a custom
>> GetMapCallback or by a distributed-render-aware Renderer by GetMap.  From
>> there, the distributed GeoTools data/feature store would accept sufficient
>> rendering hints so that it could return artificial SimpleFeatures
>> containing a RenderedImage.  On GeoServer, those multiple RenderedImages
>> would be combined into the final image to satisfy the WMS request.
>>
>
> The first (GetMapCallback building a DirectLayer) being the preferred way,
> since it does not add into the StreamingRenderer code
> paths that we don't have a real implementation for in GeoTools (or
> anything in the build server that would inform us of issues).
> Also, as Jody often suggests, one needs three different implementations to
> set up an interface; we don't get to one here.
>
>
>>
>> In terms of how the initial request is handled, I am a little unclear on
>> the choices here.  I believe we can and should restrict ourselves to SLDs
>> which are 'easy to distribute'.  I think trading some style options for
>> fast WMS responses with data sets involving billions of features is
>> reasonable.  As we get a better handle on system performance, I'd be
>> interested in supporting more styling as we zoom into a range which we know
>> we can render quickly enough.
>>
>> At the moment, can we ask the Style classes how they can be simplified to
>> be 'distributable'?
>>
>
> This is more of a GeoServer question, but regardless, I guess it depends
> on how you want to play this, I see two ways, logically distinct.
> * "I want only the styles we provided to be usable" -> Someone should
> implement in GeoServer a new flag, at the WMS service and layer level, that
> makes GeoServer restrict the styles usable against a layer to the ones
> configured (right now there is no such validation)
> * "Any style, even user provided, will be accepted, provided it's
> distributable" -> you do your own checks in your GetMapCallback
> implementation and throw a ServiceException if the style is not
> distributable
>
>
>> Andrea, I'll support Jody's multi-threaded/parallel rendering ideas.
>> Chris and Rich have shared results from a 5 node cluster.  If I am
>> understanding correctly, each of those servers will produce a RenderedImage
>> with part of the final image.  In particular, I don't expect the rendered
>> images to overlap all that much.  One server would have data for the
>> Americas and another would paint the ranges for Europe.  GeoMesa uses a
>> random sharding approach, so when we do this, our tiles may look very
>> similar.
>>
>
> Mind, "support" in open source means doing work, we work as a do-ocracy (
> http://wiki.osgeo.org/wiki/Do-ocracy), just saying you like something has
> some "warm and fuzzy" effect on others, but little practical consequence ;-)
>
> Cheers
> Andrea
>
>
> --
> ==
> GeoServer Professional Services from the experts! Visit
> http://goo.gl/NWWaa2 for more information.
> ==
>
> Ing. Andrea Aime
> @geowolf
> Technical Lead
>
> GeoSolutions S.A.S.
> Via Poggio alle Viti 1187
> 55054  Massarosa (LU)
> Italy
> phone: +39 0584 962313
> fax: +39 0584 1660272
> mob: +39  339 8844549
>
> http://www.geo-solutions.it
> http://twitter.com/geosolutions_it
>
> _______________________________________________
> GeoTools-Devel mailing list
> GeoTools-Devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/geotools-devel
>
>