On Fri, Apr 27, 2012 at 1:29 PM, Andrea Aime
<[email protected]> wrote:
I could make things more explicit on the main proposal page, but each
section contains links to more detailed information, including the
actual "patch" to be seen inline. Namely:
<http://geoserver.org/display/GEOS/GSIP+69+-+API+Proposal>
As it's all about new stuff, when it comes strictly to the API proposal
I judged it convenient to have the whole proposal be seen inline. Yet
there's also a link to the github branch where the whole work lives:
the API proposal, the code migration for exemplary use cases, and the
alternative jdbc backend:
<https://github.com/groldan/geoserver/tree/GSIP69>
All right, it took half a day to go through the proposal and the code.
The proposal direction is good; I agree with all the needs it expresses
and with the general idea that we should be able to filter, page, and
load stuff interactively.
I strongly disagree with the idea that rolling a new filter API is better
than using the OGC one; this is a first show stopper for me.
The Predicate API is very limited and has no spatial filter support, and
GeoServer core already depends heavily on GeoTools, so the whole reasoning
about Predicate advantages is pretty empty. I actually see a lot more
weaknesses in rolling a new API:
Yeah, I agree there are definitely some upsides to using the existing
GeoTools filter API. But I have some reservations, inline.
- it violates the KISS principle, adding more stuff does not make anything
simpler
Not sure I would classify the GeoTools filter API as simple... the
Predicate API looks simpler to me. But if by simple you mean minimal
additions, then yes, I agree.
- it does not make writing encoders any easier; on the contrary, it demands
more code to be written, while we already have a pretty simple way to
split a complex OGC filter into supported and unsupported parts, and
lots of encoders that we can mimic and/or copy code from
- it does not avoid external dependencies, as GeoTools is already there
- it misses a lot of expressiveness: instead of writing clumsy Predicates
that can only run in memory (since they are not well known) we can
actually use an API that can get translated to SQL (thinking about the
name matching filters in the GUI here)
Well, I think the GeoTools filter is a bit too "expressive"... in that
writing a simple filter requires too much code in my opinion. It really
lacks a solid builder like we have for feature stuff. If I remember right,
were you working on one a while back? Part of a new style builder or
something? I guess we also have CQL, which solves that one too.
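E.g., an untested sketch just to show the difference in verbosity between
the two routes, using the GUI "name contains" case:

    import org.geotools.factory.CommonFactoryFinder;
    import org.geotools.filter.text.cql2.CQL;
    import org.geotools.filter.text.cql2.CQLException;
    import org.opengis.filter.Filter;
    import org.opengis.filter.FilterFactory;

    public class FilterVerbosity {
        public static void main(String[] args) throws CQLException {
            // the CQL route: one line
            Filter viaCql = CQL.toFilter("name LIKE '%states%'");

            // the factory route: noisier, but the same well known filter
            FilterFactory ff = CommonFactoryFinder.getFilterFactory(null);
            Filter viaFactory = ff.like(ff.property("name"), "%states%", "%", "_", "\\");
            System.out.println(viaCql + " vs " + viaFactory);
        }
    }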
- the idea that the domain is different is quite surprising: most of the
elements that grow to large numbers have a bbox attached to them, so they
are indeed spatial. One of the things you normally want to do in a
security subsystem is restrict access by geographic area, and we could
not express that with Predicate (see the sketch below)
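To make that concrete, with OGC filters such a restriction is a single well
known filter. A quick sketch, with the property name invented (it would be
whatever the catalog exposes the bounds under):

    import org.geotools.factory.CommonFactoryFinder;
    import org.opengis.filter.Filter;
    import org.opengis.filter.FilterFactory2;

    public class AreaRestrictionSketch {
        public static Filter italyOnly() {
            FilterFactory2 ff = CommonFactoryFinder.getFilterFactory2(null);
            // only match objects whose bounds interact with the given area
            return ff.bbox("boundingBox", 6.6, 35.4, 18.6, 47.1, "EPSG:4326");
        }
    }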
While I agree that making use of the spatial aspect of the catalog makes a
lot of sense, it's not surprising to me that it gets overlooked. The catalog
has always been considered just a configuration store providing simple CRUD
operations, so I don't think people readily jump to seeing it as a spatial
store of information. And I can't think of a use case in GeoServer today
where we look up a layer based on its bounding box. But I think the
idea is actually really cool and really powerful. And if GeoServer ever
wants to provide a CSW view or implementation, that will be crucial.
Moreover, with OGC filters it would get really easy to create a datastore
based catalog implementation if we want to, and it would be a much better
proof of concept than the current ones (more on this later).
The only drawback of Filter is that it is supposed to be a "closed" API,
with no way to implement a new filter, but that is actually less of a
limitation, since the model is rich and easily worked around by implementing
whatever filter function is missing.
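For example, if the catalog needed a check that no existing filter covers,
a custom function is a page of code. A rough sketch following the usual
GeoTools function pattern (name and logic invented here):

    import static org.geotools.filter.capability.FunctionNameImpl.parameter;

    import org.geotools.filter.FunctionExpressionImpl;
    import org.geotools.filter.capability.FunctionNameImpl;
    import org.opengis.filter.capability.FunctionName;

    public class FilterFunction_isAdvertised extends FunctionExpressionImpl {

        public static FunctionName NAME = new FunctionNameImpl("isAdvertised",
                parameter("result", Boolean.class),
                parameter("object", Object.class));

        public FilterFunction_isAdvertised() {
            super(NAME);
        }

        @Override
        public Object evaluate(Object object) {
            Object target = getParameters().get(0).evaluate(object);
            // invented logic, just to show where the real check would go
            return Boolean.valueOf(target != null);
        }
    }

plus the usual META-INF/services/org.opengis.filter.expression.Function
registration so the factories can pick it up.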
I think the reliance on datastores is one of the downfalls as well... yes,
there is lots of good infrastructure for splitting filters up based on
capabilities and the like... but it's pretty tied to the feature model, no?
Like, for instance, unless you have feature types around you can't really
get any information about the types of attributes specified in predicates.
Also, the in-memory implementations of the filters are pretty heavily based
on feature objects. I believe this has been extracted so it's now possible
to execute filters directly on java-bean-like objects, but personally I have
never really used that so I have no idea how well it works.
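For reference, this is roughly what I'd imagine it looks like (completely
untested on my part, and it only works if a bean capable property accessor
is actually registered to resolve "name"):

    import org.geotools.factory.CommonFactoryFinder;
    import org.opengis.filter.Filter;
    import org.opengis.filter.FilterFactory;

    public class BeanFilterCheck {

        // a plain java bean, nothing feature-ish about it
        static class LayerBean {
            private final String name;
            LayerBean(String name) { this.name = name; }
            public String getName() { return name; }
        }

        public static void main(String[] args) {
            FilterFactory ff = CommonFactoryFinder.getFilterFactory(null);
            Filter filter = ff.equals(ff.property("name"), ff.literal("states"));
            // should print true if the bean evaluation path works
            System.out.println(filter.evaluate(new LayerBean("states")));
        }
    }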
In the end I can see us writing a lot of code to turn catalog objects into
Feature and FeatureType representations to pull this off. Doable, but it
could also be rather clumsy.
So while going with the filter API is very tempting, I am not totally sold
on it. Although I am interested to hear more about your idea of a datastore
backed catalog implementation.
Moving forward, I would advise against having to check whether the store
can sort data or not; it just makes the caller code more complicated
and forces it into workarounds if sorting is really needed.
In GeoTools we have code that does a merge-sort using disk space
if necessary, which can sort any amount of data with a small memory
footprint (of course, at the price of performance).
It would be better to have a method that checks if sorting can be done
fast instead: code that needs sorting as an optimization can leverage it
or use an alternate path, while code that really needs sorting just asks
for it and has it done by the catalog implementation, without having to
repeat the fallback logic in every place that needs sorting for good.
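Something shaped like this is what I have in mind (names invented, just to
show the idea):

    import java.util.Iterator;

    import org.geoserver.catalog.CatalogInfo;
    import org.opengis.filter.Filter;
    import org.opengis.filter.sort.SortBy;

    public interface CatalogQueryFacade {

        /** true only if the backend can sort natively (read: fast) on the property */
        boolean canSortQuickly(Class<? extends CatalogInfo> type, String property);

        /** always honors sortOrder: natively when possible, merge-sort otherwise */
        <T extends CatalogInfo> Iterator<T> list(
                Class<T> type, Filter filter, SortBy... sortOrder);
    }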
I agree here totally. We made this mistake with datastores and it led to
chaos. We shouldn't add any filtering or querying capability to the API
without a default implementation to go in places where native capabilities
are not available. Even if that default implementation is
horribly inefficient, I think it is better than throwing an exception
back when a user tries to do something, or, as we see here, having to check
a flag before usage. It defeats the purpose of having an API to abstract
data access.
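Even a default as dumb as this sketch would do:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.Iterator;
    import java.util.List;

    public class InMemorySortFallback {

        /** copies everything into memory and sorts it; slow but never says no */
        public static <T> Iterator<T> sort(Iterator<T> items, Comparator<T> order) {
            List<T> all = new ArrayList<T>();
            while (items.hasNext()) {
                all.add(items.next());
            }
            Collections.sort(all, order);
            return all.iterator();
        }
    }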
One other small thing is that these methods are supposed to access the file
system or the network, but they don't throw any exception... I can live with
that; most likely the calling code does not have anything meaningful to do
in case of exception anyway, but I thought I'd point it out.
Right, I guess this stems from the fact that the original catalog API throws
no exceptions. I don't have a strong opinion either way, but I know much of
the time having to deal with checked exceptions just means rethrowing them
wrapped in a runtime exception anyway. This is the everlasting checked vs.
unchecked argument.
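I.e. the usual dance, which is about all most callers could do with the
checked variant anyway (method names made up):

    import java.io.IOException;

    public class SaveSketch {

        public void save(Object info) {
            try {
                writeOut(info); // stand-in for the actual file system / network access
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        private void writeOut(Object info) throws IOException {
            // ...
        }
    }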
A thing that I find surprising instead is seeing no trace of a transaction
concept. If the intent is to move to a model where multiple GeoServer
instances share the same database and write to it in parallel, being able
to use transactions seems quite important; there is a need for coordination
that is not addressed by this proposal.
This is an interesting one. Indeed some notion of transaction is needed,
that is for sure. But I'm not sure it has to be a first class citizen in the
API. Look at how Spring approaches transactions: it encourages declarative
transaction management, keeping transaction handling isolated in an aspect
and out of the main data access API, as a separate concern.
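Something in this spirit, where the data access API itself never sees a
transaction object (service and method names made up):

    import org.springframework.transaction.annotation.Transactional;

    public class CatalogAdminService {

        // the facade calls inside get committed or rolled back as one
        // unit by the surrounding aspect; the catalog api stays unaware
        @Transactional
        public void renameWorkspace(String oldName, String newName) {
            // ... lookup, rename, save ...
        }
    }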
Anyways, IMO making the catalog API support transactions will warrant its
own proposal, and in the interest of making incremental progress it is
practically something that could be put off to a future iteration.
The modifications done below and above the API changes are simple proofs
of concept, meaning the validation of the API is low and the checks on its
quality are low as well; not something we'd want to fast track onto a code
base that we want to make more stable.
Let's start with what's above the API. All we have is a handful of examples,
but the code base is largely unchanged. On one side this means the new
code is not going to be touched very much; on the other side it means we
get no benefit from the new API and we're not validating it and its
implementation at all. It looks like a trojan horse to get the higher level
modifications in later, which will actually destabilize the code base when
we are already in RC or bugfix release mode.
Moreover, several of the predicates created have no chance of being encoded
natively, since they are not "well known".
In fact, the authorization subsystem should be changed too in order to
leverage the new scalability API, so that it returns a filter instead of
checking single layers point by point (see the sketch below).
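Something shaped like this, say (interface invented for the sake of the
example):

    import org.geoserver.catalog.CatalogInfo;
    import org.opengis.filter.Filter;
    import org.springframework.security.core.Authentication;

    public interface AccessFilterProvider {

        /**
         * A filter the catalog backend can encode natively, replacing the
         * current per object yes/no check done after loading everything.
         */
        Filter getAccessFilter(Authentication user, Class<? extends CatalogInfo> clazz);
    }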
Same goes for the GUI filtering code, which:
- loads the full data set in memory in case the store is not able to sort on
the desired attribute
- builds a predicate that is not encodable (with an OGC filter we could
actually encode it instead; see the sketch after this list)
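The encodable version is right there in the standard factory, and the JDBC
datastore dialects already know how to turn it into a SQL LIKE. A quick
sketch:

    import org.geotools.factory.CommonFactoryFinder;
    import org.opengis.filter.Filter;
    import org.opengis.filter.FilterFactory;

    public class NameMatchingSketch {

        public static Filter nameContains(String snippet) {
            FilterFactory ff = CommonFactoryFinder.getFilterFactory(null);
            // a well known PropertyIsLike, translatable down to the database
            return ff.like(ff.property("name"), "%" + snippet + "%", "%", "_", "\\");
        }
    }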
Fair enough, but another way of looking at this is that it is low risk. It
seems historically common that a developer has developed some new
functionality and wants to get it into the codebase to start giving it
wider exposure. As long as users are not forced to use it, or can easily
turn it off, I think we have always considered that acceptable.
The other approach, which I think is what you are saying here, is to ensure
the new API is used everywhere, in order to ensure it meets the requirements
of the system. This is the approach more in line with the new catalog for
2.0. We ripped out the core and replaced it, but had all the client code
help validate the new stuff. It had the benefit of a large variety of unit
tests ready and waiting, etc... But it was still painful in the early going,
and something users had no way of avoiding. So I'm not sure which approach
is better. Both have upsides and downsides.
The bits below the API are baffling too. Both the JE and JDBC
implementations are based on a key/value store where the value is the XML
dump of each CatalogInfo.
This makes the whole point about filter encoding moot, as there is almost
no filter being encoded down at the DB level.
Say I want to do a GetMap request with a single layer: we know the name,
yet we end up scanning the whole database, loading all the blobs, parsing
them back in memory, and applying the filter in memory. Sure, it scales to
100M items, but nobody will want to wait for the time it takes this
implementation to do a simple GetMap.
I know they are community modules, but even a community module should have
a fighting chance of being used. This implementation seems so weak that I
don't believe anyone will actually want to use it, and in order to validate
the API we should have an implementation that actually makes use of some of
its concepts (some actual native filtering, for example).
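Even just breaking out the searchable properties as indexed columns next to
the blob would do. A sketch, with the schema invented for the example:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class IndexedLookupSketch {

        /** the GetMap case: one indexed lookup instead of a full scan */
        public byte[] findLayerBlob(Connection cx, String layerName) throws SQLException {
            PreparedStatement ps = cx.prepareStatement(
                    "SELECT blob FROM catalog_object WHERE type = ? AND name = ?");
            try {
                ps.setString(1, "LayerInfo");
                ps.setString(2, layerName);
                ResultSet rs = ps.executeQuery();
                return rs.next() ? rs.getBytes(1) : null;
            } finally {
                ps.close();
            }
        }
    }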
I agree that the approach of serializing as a single blob and maintaining
it by key doesn't map well to a relational database, so the JDBC
implementation seems weird. But it does map more naturally to non-relational
stores, document databases, etc., which could also be a nice fit. I'm
thinking of something like Couch, which allows for easily working with JSON
documents; a natural match, since we already easily emit JSON
representations of all the catalog objects.
(Little aside: nice to see BoneCP in the mix, I actually wanted to try out
that connection pool too.)
Long story short, the proposal seems weak in some API points, and the
implementation is a proof of concept which I don't think should be allowed
to land on trunk right now.
But I want to offer a reasonable alternative: shall we move to a time boxed
6 month release cycle? 4 months to add new stuff, 2 months to stabilize,
rinse and repeat, pushing out 3-4 bugfix releases in the stable series.
I would love to see this, but am skeptical about it. Strict time boxed
iterations sound great on paper but have practical issues. We did try to
maintain this a few years back, and did for a while, but the process was
pretty frustrating. It's hard to timebox on a project like GeoServer, in
which so much large scale feature development happens, the mandates and
deadlines for which are driven solely by customer requirements and
schedules.
There is also the question of resourcing. Managing a process like this
takes organizations stepping up with resources to actually do releases,
help review and manage proposals, etc... all in a timely fashion. It's
significant. In the past it has generally been the same people tasked
with doing releases. People have talked about stepping up to share the
burden, but I have yet to see it really happen.
Anyways, looking forward to trying to make this work. Hopefully we can draw
on past experience to come up with something that will work long term.
This way we don't have to have these long waits for new features to show
up, and this much needed scalability improvement can land on a stable
series in the next 8-9 months (assuming 2 months to release 2.2.0 and a
6 month cycle), and be vetted and improved so that it's an API and an
implementation we can all be proud of.
I really want this GSIP in, just not now and not in its current state.
And I'm willing to put forward resources to help make it become a reality.
I really do hope that the rest of the PSC chimes in as well; this is an
important GSIP and it deserves other people's opinions (besides my personal
rants).
Ah, next week I'll also try to prepare a GSIP for the 6 month release
cycle, unless of course there are very negative reactions to the idea.
Nope, I would love the idea; it will remove ambiguity as to what development
is appropriate and when. GeoServer has been drifting away from an actual
process over the last couple of years and it is starting to show. We could
definitely do with a bit more structure. Thanks for stepping up and taking
this on.
Cheers
Andrea
--
-------------------------------------------------------
Ing. Andrea Aime
GeoSolutions S.A.S.
Tech lead
Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 962313
mob: +39 339 8844549
http://www.geo-solutions.it
http://geo-solutions.blogspot.com/
<http://www.youtube.com/user/GeoSolutionsIT>
--
Justin Deoliveira
OpenGeo - http://opengeo.org
Enterprise support for open source geospatial.