Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

Zane Bitter Fri, 19 Sep 2014 13:14:39 -0700

On 18/09/14 10:55, Flavio Percoco wrote:

On 09/18/2014 04:24 PM, Clint Byrum wrote:

Great job highlighting what our friends over at Amazon are doing.


It's clear from these snippets, and a few other pieces of documentation
for SQS I've read, that the Amazon team approached SQS from a _massive_
scaling perspective. I think what may be forcing a lot of this frustration
with Zaqar is that it was designed with a much smaller scale in mind.

I think as long as that is the case, the design will remain in question.
I'd be comfortable saying that the use cases I've been thinking about
are entirely fine with the limitations SQS has.


I think these are pretty strong comments with not enough arguments to
defend them.

I actually more or less agree with Clint here. As Joe noted (and *thank
you* Joe for starting this thread - the first one to compare Zaqar to
something relevant!), SQS offers very, very limited guarantees, and it's
clear that the reason for that is to make it massively, massively
scalable in the way that e.g. S3 is scalable while also remaining
comparably durable (S3 is supposedly designed for 11 nines, BTW).

Zaqar, meanwhile, seems to be promising the world in terms of
guarantees. (And then taking it away in the fine print, where it says
that the operator can disregard many of them, potentially without the
user's knowledge.)

On the other hand, IIUC Zaqar does in fact have a sharding feature
("Pools") which is its answer to the massive scaling question. I don't
know enough details to comment further except to say that it evidently
has been carefully thought about at some level, and it's really
frustrating for the Zaqar folks when people just assume that it hasn't
without doing any research. On the face of it sharding is a decent
solution for this problem. Maybe we need to dig into the details and
make sure folks are satisfied that there are no hidden dragons.

Saying that Zaqar was designed with a smaller scale in mind without
actually saying why you think so is not fair besides not being true. So
please, do share why you think Zaqar was not designed for big scales and
provide comments that will help the project to grow and improve.

- Is it because of the storage technologies that have been chosen?
- Is it because of the API?
- Is it because of the programming language/framework?

I didn't read Clint and Devananda's comments as an attack on any of
these things (although I agree that there have been far too many such
attacks over the last 12 months from people who didn't bother to do
their homework first). They're looking at Zaqar from first principles
and finding that it promises too much, raising the concern the team may
in future reach a point where they are unable to meet the needs of
future users (perhaps for scaling reasons) without breaking existing
users who have come to rely on those promises.

So far, we've just discussed the API semantics and not Zaqar's
scalability, which makes your comments even more surprising.

What guarantees you offer can determine pretty much all of the design
tradeoffs (e.g. latency vs. durability) that you have to make. Some of
those (e.g. random access to messages) are baked in to the API, but
others are not. It's actually a real concern to me to see elsewhere in
this thread that you are punting to operators on many of the latter.

IMO the Zaqar team needs to articulate an opinionated vision of just
what Zaqar actually is, and why it offers value. And by 'value' here I
mean there should be $ signs attached.

For example, it makes no sense to me that Zaqar should ever be able to
run in a mode that doesn't guarantee delivery of messages. There are a
million and one easy, cheap ways to set up a system that _might_ deliver
your message. One server running a message broker is sufficient. But if
you want reliable delivery, then you'll probably need at least 3 (for
still pretty low values of "reliable"). I did some back-of-the-envelope
math with the AWS pricing and _conservatively_ for any application
receiving <100k messages per hour (~30 per second) it's cheaper to use
SQS than to spin up those servers yourself.

In other words, a service that *guarantees* delivery of messages *has*
to be run by the cloud operator because for the overwhelming majority of
applications, the user cannot do so economically.

(I'm assuming here that AWS's _relative_ pricing accurately reflects
their _relative_ cost basis, which is almost certainly not strictly
true, but I expect a close enough approximation for these purposes.)
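
For anyone who wants to redo that back-of-the-envelope comparison, here
is a rough Python sketch of the shape of the calculation. Every price in
it is a hypothetical placeholder, not an actual AWS figure, and the
function names and the "3 requests per message" assumption (send,
receive, delete) are made up for illustration. Plug in real pricing
before drawing any conclusions from it.

```python
# Rough break-even sketch: pay-per-request queue vs. running your own brokers.
# All prices used below are HYPOTHETICAL placeholders, not real AWS pricing.

HOURS_PER_MONTH = 730  # average number of hours in a month


def self_hosted_monthly_cost(servers, price_per_server_hour):
    """Cost of keeping `servers` broker instances running all month."""
    return servers * price_per_server_hour * HOURS_PER_MONTH


def hosted_monthly_cost(messages_per_hour, requests_per_message,
                        price_per_million_requests):
    """Cost of a pay-per-request queue service at a steady message rate."""
    requests = messages_per_hour * requests_per_message * HOURS_PER_MONTH
    return requests / 1_000_000 * price_per_million_requests


def break_even_messages_per_hour(servers, price_per_server_hour,
                                 requests_per_message,
                                 price_per_million_requests):
    """Message rate below which the hosted service is the cheaper option."""
    hourly_server_cost = servers * price_per_server_hour
    cost_per_message = (requests_per_message
                        * price_per_million_requests / 1_000_000)
    return hourly_server_cost / cost_per_message


if __name__ == "__main__":
    # Placeholder inputs: 3 servers, 3 requests per message.
    rate = break_even_messages_per_hour(
        servers=3, price_per_server_hour=0.10,
        requests_per_message=3, price_per_million_requests=0.50)
    print(f"break-even: {rate:,.0f} messages/hour (~{rate / 3600:.0f}/s)")
```

The sketch ignores storage, bandwidth and the operational cost of
actually running a 3-node cluster, all of which would push the break-even
point further in favour of the hosted service.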


What I would like to hear in this thread is:

"Zaqar is We-Never-Ever-Ever-Ever-Lose-Your-Message as a Service
(WNEEELYMaaS), and it has to be in OpenStack because only the cloud
operator can provide that cost-effectively."


What I'm hearing instead is:

- We'll probably deliver your message.

- We can guarantee that we'll deliver your message, but only on clouds
where the operator has chosen to configure Mongo with some non-default
setting, and it's really all up to them.

- We have "Flavors" so your app can be broken in a different way on
every cloud you port it to.

- When your client goes down for any reason, all bets are off (in some
modes).

- At least the messages will be in order!

I encourage the Zaqar core team to share their vision for what Zaqar is
(I know they have one, and I know what I said above isn't it), and IMHO
that vision should be along the lines of "it's orders of magnitude more
likely that your entire region will go down than that you'll ever lose a
message, on _any_ cloud (at any scale) where Zaqar runs, even when your
client dies while processing it". Everything, and I mean *everything*,
else - FIFO, pub-sub, low latency, browsability - is gravy. Tasty, tasty
gravy in some cases (FIFO in particular seems like a potential
game-changer for many applications), but if it impacts durability,
scalability or deployability it should be gone.

Right now the vision that is coming through is "Zaqar is whatever sort
of messaging the operator wants it to be". Trying to please everyone is
a common trap for Open Source projects to fall into (and I think the TC
has to take a large measure of blame, for promoting multiple competing
visions for what Zaqar should be in lieu of trying to understand the
original one), but I don't think it can be allowed to fly here. Whether
your messaging system is reliable or unreliable in large part determines
the design of your application. If the answer is different on each
cloud, then the answer is 'unreliable' (on multiple levels), and
unreliable messaging is not something that needs to be in OpenStack
because users can implement it themselves with little difficulty, at
relatively low cost and with much more flexibility.

However, unlike (I suspect) some people in this thread, I don't believe
that major technical changes are required to Zaqar to achieve what we
want. Zaqar's design is already very flexible and can probably handle
just about any back-end architecture we choose for it. I would
personally love to see a Swift back-end, since that would solve the
deployment-of-another-architecture problem, the
semantics-depend-on-operator-configuration problem and the
massive-scalability problem at a stroke, but I don't think this is a
critical priority. The Zaqar team can be trusted to do that sort of
implementation work if and when it makes sense. What *is* critical is
that we lock down the guarantees that Zaqar provides to the user to the
smallest useful set, chisel them in stone, and explicitly disclaim all
other guarantees.

Only by doing that can we be sure that Zaqar will have the freedom to
move to other back-ends, should they be deemed more appropriate, without
breaking users. There will be some minor tweaks to the API for v2 to
achieve that, but a lot of it is about just setting expectations.

This thread has been really helpful because we're discussing the
semantics presented to the user from first principles, and I think the
way forward is to collect those together in one place and debate which
are essential and which can - and, it almost follows, should - be
dropped.

As an aside, in thinking about this I have for the first time identified
a criterion by which I would be prepared to see all new projects judged.
Some members of the TC have suggested that projects must be able to
point to existing, large-scale deployments as a condition of incubation,
and that would be a mistake because it presents a chicken-and-egg
problem that would keep a lot of really useful projects out of
OpenStack. But I don't think it's unreasonable to say that projects that
haven't yet had the benefit of that feedback loop with operators and
users must maximise their flexibility to take advantage of it when it
appears without breaking the upgrade and API compatibility requirements
that come with being a part of OpenStack. i.e. if your project is
unproven, it had better be the simplest thing that could possibly work
while still addressing the same problem; conversely if your project is
breathtaking in scope and ambition (Zaqar is not in this boat, but...
ahem... I could name you some examples) it had better come with some
pretty strong evidence that it is really needed by users and accepted by
operators.


cheers,
Zane.

Flavio


Excerpts from Joe Gordon's message of 2014-09-17 13:36:18 -0700:

Hi All,

My understanding of Zaqar is that it's like SQS. SQS uses distributed
queues, which have a few unusual properties [0]:
Message Order

Amazon SQS makes a best effort to preserve order in messages, but due to
the distributed nature of the queue, we cannot guarantee you will receive
messages in the exact order you sent them. If your system requires that
order be preserved, we recommend you place sequencing information in each
message so you can reorder the messages upon receipt.
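
As an illustration of the "sequencing information" advice above, here is
a minimal Python sketch of a receiver-side resequencer. It assumes the
sender stamps each message with a consecutive integer sequence number;
the `Resequencer` class and its method names are hypothetical and not
part of SQS or Zaqar.

```python
import heapq
import itertools


class Resequencer:
    """Releases messages in sequence order, buffering any that arrive early."""

    def __init__(self, first_seq=0):
        self._next_seq = first_seq      # next sequence number to release
        self._pending = []              # min-heap of (seq, tie, body)
        self._tie = itertools.count()   # tie-breaker so bodies are never compared

    def receive(self, seq, body):
        """Accept one message; return the bodies now deliverable in order."""
        if seq < self._next_seq:
            return []                   # already released: a duplicate, drop it
        heapq.heappush(self._pending, (seq, next(self._tie), body))
        ready = []
        while self._pending and self._pending[0][0] <= self._next_seq:
            pending_seq, _, pending_body = heapq.heappop(self._pending)
            if pending_seq == self._next_seq:   # skip buffered duplicates
                ready.append(pending_body)
                self._next_seq += 1
        return ready
```

Note that a sketch like this stalls forever if a message is lost, so a
real consumer would also need a timeout or gap-handling policy.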
At-Least-Once Delivery

Amazon SQS stores copies of your messages on multiple servers for
redundancy and high availability. On rare occasions, one of the servers
storing a copy of a message might be unavailable when you receive or delete
the message. If that occurs, the copy of the message will not be deleted on
that unavailable server, and you might get that message copy again when you
receive messages. Because of this, you must design your application to be
idempotent (i.e., it must not be adversely affected if it processes the
same message more than once).
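
One common way to meet the idempotency requirement described above is to
remember which message IDs have already been handled and skip repeats.
The Python sketch below assumes each message carries a stable unique ID;
the `IdempotentConsumer` class is hypothetical, and the in-memory set
stands in for what would need to be durable, bounded storage in a real
application.

```python
class IdempotentConsumer:
    """Runs a handler at most once per message ID, ignoring redeliveries."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()   # IDs already processed (in-memory only)

    def process(self, message_id, body):
        """Handle the message unless its ID was seen; return True if handled."""
        if message_id in self._seen:
            return False     # duplicate delivery: safe to ignore
        self._handler(body)
        # Record the ID only after the handler succeeds, so a handler that
        # raises will be retried when the message is delivered again.
        self._seen.add(message_id)
        return True
```

For example, a consumer built with `IdempotentConsumer(results.append)`
appends a body once even if the same message ID is delivered twice.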
Message Sample

The behavior of retrieving messages from the queue depends whether you are
using short (standard) polling, the default behavior, or long polling. For
more information about long polling, see Amazon SQS Long Polling
<http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-long-polling.html>
.

With short polling, when you retrieve messages from the queue, Amazon SQS
samples a subset of the servers (based on a weighted random distribution)
and returns messages from just those servers. This means that a particular
receive request might not return all your messages. Or, if you have a small
number of messages in your queue (less than 1000), it means a particular
request might not return any of your messages, whereas a subsequent request
will. If you keep retrieving from your queues, Amazon SQS will sample all
of the servers, and you will receive all of your messages.

The following figure shows short polling behavior of messages being
returned after one of your system components makes a receive request.
Amazon SQS samples several of the servers (in gray) and returns the
messages from those servers (Message A, C, D, and B). Message E is not
returned to this particular request, but it would be returned to a
subsequent request.
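
To make the short-polling behaviour concrete, here is a toy Python
simulation. It is only a sketch under simplifying assumptions: each
message lives on exactly one server, servers are sampled uniformly rather
than by a weighted distribution, and messages are never deleted. The
function names are made up for illustration.

```python
import random


def short_poll(servers, sample_size, rng):
    """Return the messages held by a random subset of the servers.

    Uniform sampling is a simplifying assumption of this sketch.
    """
    sampled = rng.sample(servers, sample_size)
    return [msg for server in sampled for msg in server]


def poll_until_complete(servers, sample_size, rng, max_polls=1000):
    """Keep short-polling until every message has been seen at least once."""
    expected = {msg for server in servers for msg in server}
    seen = set()
    polls = 0
    while seen != expected and polls < max_polls:
        seen.update(short_poll(servers, sample_size, rng))
        polls += 1
    return seen, polls


if __name__ == "__main__":
    # Five servers holding messages A-E; one server happens to be empty.
    servers = [["A"], ["B", "C"], [], ["D"], ["E"]]
    seen, polls = poll_until_complete(servers, 2, random.Random())
    print(f"saw {sorted(seen)} after {polls} polls")
```

Any single call to `short_poll` can miss messages or come back empty,
but repeated polling eventually covers every server, which is the
property the SQS documentation describes.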

Presumably SQS has these properties because they make the system scalable.
If so, does Zaqar have the same properties (not just making these same
guarantees in the API, but actually having these properties in the
backends)? And if not, why? I looked on the wiki [1] for information on
this, but couldn't find anything.

[0]
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/DistributedQueues.html
[1] https://wiki.openstack.org/wiki/Zaqar


_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



