>> I currently use AMQP (RabbitMQ) for message processing and it works
>> very well. It is very flexible and it'd be easy to extend it with a
>> PubSubHubBub or XMPP output.
>>
>> Mitja (of OpenStreetBugs) proposed just yesterday a filter that
>> filters changes by the tags/changes involved so it would be very easy
>> to subscribe to only the events you are interested in. This can be
>> implemented in just a few lines of code.
>>
>> While we only thought about doing this on the German dev servers, I've
>> since gotten multiple requests/questions and suggestions that this
>> should be integrated into the main OSM site. All that'd be needed
>> would be a call in the Ruby API that sends a message (asynchronously,
>> very fast) once a change has been made. This would make the generation
>> of the diff files a lot easier and everything more flexible. I haven't
>> yet asked anyone whether this would be possible. I know that this isn't
>> the right topic (although the TRAPI could also use this system), but I
>> wanted to take the opportunity to tell you about our ideas.
>
> Thanks for the info.  I'll add a few comments (because I can't help myself
> ;-).

That's exactly what I was looking/hoping for :)

> Most OSM systems tend to have a large number of disorganised and
> uncontrolled clients.  Does this work well with the AMQP paradigm?  In other
> words, does it take administrative overhead to register new subscriptions to
> a queue?  What happens if large numbers of subscriptions are created then
> the clients disappear?  Is AMQP targetted at a world where the clients are
> relatively controlled and small in number?  It's important to minimise
> administration overhead where possible.

I'll try to answer those in order.
Subscriptions in AMQP are defined client side[1], so there is _no_
administrative overhead at all. Clients only need to know which
exchange/queue to bind to, and that's just a string.

Queues can be declared in different ways; one option is to declare a
queue as autoDelete[2]: "true if we are declaring an autodelete queue
(server will delete it when no longer in use)". Again, no
administrative overhead: the queue simply disappears when it is no
longer in use, so no messages are routed there anymore. Depending on
the use case there are multiple options.
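
To illustrate both points, a minimal sketch with a current RabbitMQ
Java client (the 1.7.2 javadoc linked below has slightly different
signatures). The exchange name and routing key are made up, and the
exchange would have to be a topic exchange for the pattern to match:

  import com.rabbitmq.client.*;

  public class SubscribeExample {
      public static void main(String[] args) throws Exception {
          ConnectionFactory factory = new ConnectionFactory();
          factory.setHost("localhost"); // assumption: broker on localhost

          Connection conn = factory.newConnection();
          Channel channel = conn.createChannel();

          // Server-named queue: non-durable, exclusive, autoDelete.
          // The broker deletes it once the consumer goes away.
          String queue = channel.queueDeclare().getQueue();

          // "Subscribing" is just binding that queue to an exchange with
          // a routing key -- a string the client has to know, nothing more.
          channel.queueBind(queue, "osm.changes", "osm.change.node.#");

          channel.basicConsume(queue, true, new DefaultConsumer(channel) {
              @Override
              public void handleDelivery(String tag, Envelope env,
                      AMQP.BasicProperties props, byte[] body) {
                  System.out.println(env.getRoutingKey() + ": "
                          + new String(body));
              }
          });
      }
  }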

I don't have huge real-world experience with AMQP, but I believe it is
targeted at both worlds: small and controlled as well as large and
uncontrolled. The model the designers settled on seems to work really
well in either case.

Since installing the RabbitMQ server I have not had to touch it again.
The only thing one _could_ do is add some kind of user management
(currently everyone is allowed to do everything, for example send fake
messages[4]) so that only certain users may publish to certain
exchanges, but that is two lines in the admin console.
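
Those two lines could look roughly like this with rabbitmqctl (the
user name, password and "osm.*" permission patterns are made up):

  rabbitmqctl add_user osm-publisher s3cret
  rabbitmqctl set_permissions -p / osm-publisher "" "^osm\..*" "^osm\..*"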

> Clients will experience outages whether that be due to a network problem,
> server reboot, or just not running a system 24x7.  Presumably they need a
> way to catch up on missed events.  There are a few options: 1. The server
> holds events for clients until they become available again, 2. The client
> catches up using an out of band mechanism (eg. downloads diffs directly), or
> 3. The client can request that the server begin sending data from a specific
> point.  I think that only options 1 and 2 are possible using AMQP.  1 is not
> scalable, and 2 adds additional client complexity.  3 is what I'd like to
> see, but I don't think it can be done using a typical reliable messaging
> system such as AMQP.  I hope I'm wrong though.

Options 1) and 3) are somewhat interchangeable, and which one to use
depends on the circumstances. Both are possible.

I'm glad to inform you that you are indeed kind of wrong. AMQP handles
this kind of thing through private "unnamed" temporary reply queues.
An example: program A needs all diffs since time T, so it creates a
private reply queue (the server generates a temporary name for it) and
then sends a message to the "query queue" with the routing key
"osm.query.osc"; the payload is just the timestamp (or any extra
options one might think of), with the "replyTo"[3] field set to the
previously created private queue. Some process (which might be a
different one depending on the kind of query; it could, for example,
decide dynamically to use an API or an XAPI reply handler) reads the
message and starts sending the reply on the private queue, which is
automatically destroyed at the end.
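
In (current) Java client terms the requester side could look something
like this; the exchange, routing key and timestamp format are all my
assumptions, `conn` is an already open connection, and 1.7.2 would use
property setters ([3]) instead of the builder:

  Channel ch = conn.createChannel();

  // Private, server-named reply queue (exclusive, autoDelete).
  String replyQueue = ch.queueDeclare().getQueue();

  AMQP.BasicProperties props = new AMQP.BasicProperties.Builder()
          .replyTo(replyQueue)
          .build();

  // Payload is just the timestamp; the reply handler streams the
  // matching diffs back and the broker deletes the queue afterwards.
  ch.basicPublish("osm.query", "osm.query.osc", props,
          "2010-02-14T12:00:00Z".getBytes());

  ch.basicConsume(replyQueue, true, new DefaultConsumer(ch) {
      @Override
      public void handleDelivery(String tag, Envelope env,
              AMQP.BasicProperties p, byte[] body) {
          // process one chunk of the reply
      }
  });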

> Something to note about the current replication mechanism is that it doesn't
> use any transactional capabilities other than creating files in a particular
> order.  All replication state tracking is client side where transactions are
> actually occurring (eg. writing to a database, updating a planet file, etc)
> which keeps the server highly scalable and agnostic of client reliability.

I didn't mean to say that what you're doing is wrong or bad in any
way! I'm sorry if it came across as if I wanted to disregard
everything you have done with Osmosis. The replication diffs are a
huge step forward.

> I don't know how you'd hook into the Ruby API effectively and reliably.  You
> can't just wait for changeset closure events because changesets can remain
> open for large periods of time.  You really want to be replicating data as
> soon as possible after it becomes available to API queries.  This may mean
> receiving notification about every single entity as it is created, modified
> or deleted from the db, but this will result in huge numbers of events which
> will be difficult to process in an efficient manner.

That's exactly what I wanted to do: send one message for each
successful API call (create, update, delete). Yes, that will be a lot
of messages, but (judging by the minutely diffs) nothing even remotely
hard to handle. Those messages should of course not be processed on
the API server, but if they are routed to the dev server it shouldn't
have a huge impact. It would obviously put _some_ burden on the API
server(s), but we'd gain a lot of flexibility and opportunities. We'd
certainly be the first big open source project that has this kind of
(for lack of a better term) fire-hose stream of its data. All it takes
to make it even more accessible is for someone to write the
PubSubHubBub/XMPP implementation.
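
The hook itself could be as small as this (a sketch; the exchange name
and the "osm.change.<type>.<action>" routing key scheme are nothing
more than my assumptions):

  import com.rabbitmq.client.Channel;
  import java.io.IOException;

  // Called once after each successful create/update/delete.
  void publishChange(Channel channel, String type, String action,
          byte[] payload) throws IOException {
      // e.g. "osm.change.node.create"
      String routingKey = "osm.change." + type + "." + action;
      channel.basicPublish("osm.changes", routingKey, null, payload);
  }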

As to the closing of changesets, I thought of another small nicety
(which I've not yet implemented, so this might still fall apart): I
planned a tool that subscribes to all the API messages and just holds
a list of timers (even with a million open changesets at a time this
shouldn't be a problem) which are adjusted accordingly. When it gets a
"changeset open" message it creates a timer that expires at the
"closed at" time of the changeset and that is reset on every change to
the changeset (of course factoring in limitations like the 24h and
50,000-element rules). At some point the timer will fire, and the tool
can query the API to fetch all the associated metadata and send a
"changeset close" message. This would have the benefit that there is
only one call to the API for every opened changeset (which I hope
would be okay, load-wise) and all consumers can profit from this
information.
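
A sketch of that (still unimplemented) tool, with one resettable timer
per open changeset; all names are placeholders and races are ignored:

  import java.util.concurrent.*;

  class ChangesetWatcher {
      private final ScheduledExecutorService scheduler =
              Executors.newSingleThreadScheduledExecutor();
      private final ConcurrentMap<Long, ScheduledFuture<?>> timers =
              new ConcurrentHashMap<Long, ScheduledFuture<?>>();

      // Called on "changeset open" and on every change to a changeset.
      void touch(final long changesetId, long closesAtMillis) {
          ScheduledFuture<?> old = timers.get(changesetId);
          if (old != null) old.cancel(false); // reset the timer
          long delay = closesAtMillis - System.currentTimeMillis();
          timers.put(changesetId, scheduler.schedule(new Runnable() {
              public void run() { onExpired(changesetId); }
          }, delay, TimeUnit.MILLISECONDS));
      }

      private void onExpired(long changesetId) {
          timers.remove(changesetId);
          // One API call per changeset: fetch the metadata here,
          // then publish the "changeset close" message.
      }
  }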

> I also think you'll run into a fair bit of
> resistance trying to incorporate changes into the Ruby API, it's simpler at
> least to remain independent where possible.  Unless you want to achieve
> sub-second replication, the current approach could be run with a very short
> replication interval.  The main restriction on replication interval now is
> downloading large numbers of files from the planet server, not the
> extraction of data from the database.

Yes, I think that's going to be the problem (incorporating it into the
API). That's why I'm currently implementing it by running off the diff
files you produce. That generates the same messages and allows for a
very realistic testing scenario, while at the same time not requiring
any changes to the API for now.
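
Replaying a diff file as messages could look roughly like this (a
sketch; the file name, routing keys and the surrounding `channel` are
assumed, error handling is left out, and the payload is shortened to
the element id):

  import javax.xml.stream.*;
  import java.io.FileInputStream;

  // Walk an osmChange file and publish one message per element, so the
  // stream looks the same as it would coming straight from the API.
  XMLStreamReader r = XMLInputFactory.newInstance()
          .createXMLStreamReader(new FileInputStream("123.osc"));
  String action = null;
  while (r.hasNext()) {
      if (r.next() == XMLStreamConstants.START_ELEMENT) {
          String name = r.getLocalName();
          if (name.equals("create") || name.equals("modify")
                  || name.equals("delete")) {
              action = name; // current <create>/<modify>/<delete> block
          } else if (name.equals("node") || name.equals("way")
                  || name.equals("relation")) {
              channel.basicPublish("osm.changes",
                      "osm.change." + name + "." + action, null,
                      r.getAttributeValue(null, "id").getBytes());
          }
      }
  }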

As I've said before: we thought about this solely for the dev servers
(and the Wikimedia toolserver(s)), but I've since come to believe that
something(!) like this would be very nice and an innovative step for
OpenStreetMap's future.

> I guess something to consider is who are the clients of the mechanism.
> Somebody wanting to see activity in a geographical area may not care about
> reliability and perhaps something like XMPP is appropriate here.  But
> anybody wanting reliable replication (ie. TRAPI) will need something robust
> that guarantees delivery and data ordering.

I agree. AMQP is the solution I chose, but there are of course other
ways. AMQP allows for robust, reliable and ordered delivery of
messages (it even has transactions if needed), and XMPP and others can
be integrated on top of it (easily, from what I've read; I haven't
done it myself).
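
For the transactional case the Java client exposes the protocol's tx
methods directly; a publisher that must not lose messages could do
(exchange, key and payload again made up):

  channel.txSelect();   // put the channel into transaction mode
  channel.basicPublish("osm.changes", "osm.change.node.create",
          null, payload);
  channel.txCommit();   // the message only becomes visible now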

> Anyway, it's good to hear that some fresh minds are interested in the
> problem of changeset distribution.  I'm very interested to hear what comes
> out of it.

Me too :)
Thanks for your comments. It's always good to get a set of fresh eyes
on an idea and I'd be glad to listen to any further input you (and
others) might have.

Cheers,
Lars

[1] <http://www.rabbitmq.com/releases/rabbitmq-java-client/v1.7.2/rabbitmq-java-client-javadoc-1.7.2/com/rabbitmq/client/Channel.html#queueBind(java.lang.String, java.lang.String, java.lang.String, java.util.Map)>
[2] <http://www.rabbitmq.com/releases/rabbitmq-java-client/v1.7.2/rabbitmq-java-client-javadoc-1.7.2/com/rabbitmq/client/Channel.html#queueDeclare(java.lang.String, boolean, boolean, boolean, boolean, java.util.Map)>
[3] <http://www.rabbitmq.com/releases/rabbitmq-java-client/v1.7.2/rabbitmq-java-client-javadoc-1.7.2/com/rabbitmq/client/AMQP.BasicProperties.html#setReplyTo(java.lang.String)>
[4] Should this ever become a problem I see two easy solutions: user
management with "official" exchanges that only authenticated programs
can write to, or digital signatures on the important messages.
