Hey Andrea, interesting issue.
Just as a heads up, so that some of the comments below make more sense: I'm starting to investigate what it would take to build a scalable catalog/configuration backend, possibly with HA and in a clustered environment. The only thing I know for sure right now is that I want to leverage the infrastructure already in place for catalog object marshaling/unmarshaling. That is, the backend, whether an RDBMS or a key/value db, would just need to be capable of storing the serialized catalog objects the same way we store them now, in some sort of CLOB, and of providing indexes for the common queries (mostly id and name, given the queries the catalog currently performs). That is because the Hibernate catalog strikes me as overkill complexity for a rather simple problem, and there's just too much knowledge and effort invested in the XStream persistence not to leverage it.

On Wed, Nov 9, 2011 at 7:12 AM, Andrea Aime <[email protected]> wrote:
> Hi,
> me and Alessio are looking for a way to perform a live backup and restore
> of the GeoServer data directory without having to stop the instance.
>
> Now, there is a number of issues to be faced but let me start with the
> major one: concurrent catalog modifications.
> During a backup nothing should be able to write, during a restore nothing
> should be able to even read.

So I think the easiest way of preventing that is to serialize write access to the catalog, which is basically what you're proposing. An alternative approach could be based on commands, so that client code is not responsible for explicitly acquiring and releasing the locks. There's the usual case, for example, of adding a new layer from scratch, where perhaps a StoreInfo needs to be added, as well as a ResourceInfo, a LayerInfo, and a StyleInfo. Or the cascading delete of a StoreInfo, for instance. So my idea would be to encapsulate all the locking logic inside the catalog, so that client code doesn't need to worry about acquiring a read lock, while any operation that takes the write lock would inherently block read operations until it is done. Internally the catalog could keep a queue of commands, and client code that needs to perform multiple operations, like in the examples above, would issue a single command against the catalog (sketched below). The call would be synchronous, but the catalog would put the command on the queue and wait for it to finish before returning. I think this model would also make it easier to implement any message passing needed in a clustered environment, like acquiring a cluster-wide write lock, or notifying other nodes of a configuration change so that they release stuff from their own resource pools, etc.

>
> Actually the same problem occurs in GeoServer today on other occasions,
> and I have the impression it's also the cause of those null pointer exceptions
> that are getting reported on the user list:
> - concurrent admin operations, either via GUI or REST (hello Justin and his
> multi-admin proposal)
> - concurrent reload/reset operations (eventually with other stuff going on)
>
> Now, if every single configuration bit was in a db and we had a
> transaction api for config changes this would not be a problem, but we're
> not there yet, and that would require a very significant investment that
> we don't have the funding for, so we are going to propose a simpler
> solution instead.
>
> What about a global configuration read/write lock? GS 1.7 used to have it.
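To make the command idea above concrete, here's a minimal sketch of how it could combine with such a read/write lock. LockingCatalog and CatalogCommand are made-up names, not existing GeoServer API, and a fair write lock stands in for the internal command queue, since it serves queued writers in arrival order:

import java.util.concurrent.locks.ReentrantReadWriteLock;

import org.geoserver.catalog.Catalog;
import org.geoserver.catalog.LayerInfo;

// Made-up interface: a unit of work run with exclusive access to the catalog
interface CatalogCommand<T> {
    T execute(Catalog catalog) throws Exception;
}

// Made-up wrapper around the real catalog: plain reads take the shared lock
// transparently, multi-object modifications go through submit()
class LockingCatalog {
    private final Catalog delegate;
    // fair mode: queued commands are served roughly in arrival order
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

    LockingCatalog(Catalog delegate) {
        this.delegate = delegate;
    }

    public LayerInfo getLayerByName(String name) {
        lock.readLock().lock();
        try {
            return delegate.getLayerByName(name);
        } finally {
            lock.readLock().unlock();
        }
    }

    // Synchronous for the caller, but exclusive while it runs: readers never
    // observe a half-applied change, e.g. a store without its layers
    public <T> T submit(CatalogCommand<T> command) throws Exception {
        lock.writeLock().lock();
        try {
            return command.execute(delegate);
        } finally {
            lock.writeLock().unlock();
        }
    }
}

Adding a layer from scratch then becomes a single command that adds the StoreInfo, the ResourceInfo, the LayerInfo, and the StyleInfo in one shot, and a clustered implementation could override submit() to first acquire a cluster-wide lock, or to broadcast the command to the other nodes.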
>
> At the very least GUI and REST config would take a read lock when accessing
> configuration data, and a write lock every time they are about to write on
> the catalog, to protect the whole sequence of operations and the associated
> side effects happening via catalog/configuration listeners.
> Reload would get a write lock, reset a read one.
> The backup operation (probably implemented via REST) a read one, the restore
> one a write one.
>
> OGC operations might also get a read lock; that would prevent at least part
> of the reasons why the stores die on their backs while they are running
> (the other would be the LRU cache pushing out the stores, but that's another
> story).
> However I'm concerned that a long WFS download or a long WPS request
> might make the configuration untouchable for a very long time.
> So we might want to let these fail in the face of configuration changes.

Agreed. The LRU expiration being a separate issue, I think it's generally fair to let an ongoing operation fail in the face of a configuration change that affects the resource(s) it's using (like a WFS request serving a feature type that's removed in the meantime). Otherwise we would be putting too much effort into a use case that we don't even know is the _correct_ one: perhaps the admin shuts down that resource precisely because of client abuse and wants to stop any usage of it immediately, and we would be preventing that. Anyway, just thinking out loud in this regard; a rough sketch of what the fail-fast check could look like is at the end of this message.

>
> Btw, backup wise we'd like to backup the entire data directory (data, fonts,
> styling icons and whatnot included), and restore would wipe it out, replace
> it with the new contents, and force a reload.
>
> GWC is actually putting a wrench in this plan since it keeps a database open,
> wondering if there is any way to have it give up on the db connections
> for some time?

There could be; let's just sync up on what's needed and what the workflow would be, and let's make it possible. Would you be backing up the tile caches too?

>
> As an alternative we could have the backup and restore tools work on
> selected directories and files, configurable in the request (which could be
> a POST) and with some defaults that only include the GWC xml config file.
> I guess GWC would just reconfigure itself after the reload and wipe out
> the caches, right?

Oh I see, that makes sense. Yeah, it should reconfigure itself.

>
> Anyways... opinions? Other things that might go pear shaped that we
> did not account for?

I think it's an issue worth taking the time to get right, and fortunately we now have the funding to do so. I'll be on vacation for the next two weeks, so the work would firmly start once I'm back, and I'd be glad to run the whole process, exploratory prototyping included, by the community. Are you in a rush to get something working asap, or would you be OK with a longer but steadier development cycle?

Best regards,

Gabriel
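PS: here's the fail-fast sketch mentioned above. ResourceGuard and its method names are made up for illustration; the point is that an OGC request would resolve its resources under a short-lived read lock and then re-check them periodically, instead of holding the read lock for the whole download:

import java.util.concurrent.locks.ReentrantReadWriteLock;

import org.geoserver.catalog.Catalog;
import org.geoserver.catalog.FeatureTypeInfo;

// Made-up helper: re-validates a resource between output batches so a
// long-running request fails fast when a configuration change removes it
class ResourceGuard {
    private final Catalog catalog;
    private final ReentrantReadWriteLock configLock;
    private final String featureTypeId;

    ResourceGuard(Catalog catalog, ReentrantReadWriteLock configLock,
            FeatureTypeInfo type) {
        this.catalog = catalog;
        this.configLock = configLock;
        this.featureTypeId = type.getId();
    }

    // Called e.g. every N encoded features: cheap, and the read lock is
    // held only for the lookup, so a pending config write is never starved
    void checkStillValid() {
        configLock.readLock().lock();
        try {
            if (catalog.getFeatureType(featureTypeId) == null) {
                // the dispatcher would translate this into a service
                // exception report for the client
                throw new IllegalStateException(
                        "feature type removed by a configuration change");
            }
        } finally {
            configLock.readLock().unlock();
        }
    }
}

That preserves the admin's ability to cut off a resource immediately (the client-abuse scenario above), while a well-behaved long WFS or WPS request only pays a brief lock acquisition per batch.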
