I've taken a bit of time out of this thread, and I'd like to jump back in now and attempt to summarize what I've learned and hopefully frame it in such a way that it helps us to answer the question Thierry asked:

On Fri, Sep 19, 2014 at 2:00 AM, Thierry Carrez <thie...@openstack.org> wrote:
>
> The underlying question being... can Zaqar evolve to ultimately reach
> the massive scale use case Joe, Clint and Devananda want it to reach, or
> are those design choices so deeply rooted in the code and architecture
> that Zaqar won't naturally mutate to support that use case.

I also want to sincerely thank everyone who has been involved in this discussion, and helped to clarify the different viewpoints and uncertainties which have surrounded Zaqar lately. I hope that all of this helps provide the Zaqar team guidance on a path forward, as I do believe that a scalable cloud-based messaging service would greatly benefit the OpenStack ecosystem.

Use cases
--------------

So, I'd like to start from the perspective of a hypothetical user evaluating messaging services for the new application that I'm developing. What does my application need from a messaging service so that it can grow and become hugely popular with all the hipsters of the world? In other words, what might my architectural requirements be?

(This is certainly not a complete list of features, and it's not meant to be -- it is a list of things that I *might* need from a messaging service. But feel free to point out any glaring omissions I may have made anyway :) )

1. Durability: I can't risk losing any messages
   Example: Using a queue to process votes. Every vote should count.

2. Single Delivery - each message must be processed *exactly* once
   Example: Using a queue to process votes. Every vote must be counted only once.

3. Low latency to interact with service
   Example: Single threaded application that can't wait on external calls

4. Low latency of a message through the system
   Example: Video streaming. Messages are very time-sensitive.

5. Aggregate throughput
   Example: Ad banner processing. Remember when sites could get slash-dotted? I need a queue resilient to truly massive spikes in traffic.

6. FIFO - When ordering matters
   Example: I can't "stop" a job that hasn't "start"ed yet.

So, as a developer, I actually probably never need all of these in a single application -- but depending on what I'm doing, I'm going to need some of them. Hopefully, the examples above give some idea of what I have in mind for different sorts of applications I might develop which would require these guarantees from a messaging service.

Why is this at all interesting or relevant? Because I think Zaqar and SQS are actually, in their current forms, trying to meet different sets of requirements. And, because I have not actually seen an application using a cloud which requires the things that Zaqar is guaranteeing - which doesn't mean they don't exist - it frames my past judgements about Zaqar in a much better way than simply "I have doubts". It explains _why_ I have those doubts.

I'd now like to offer the following as a summary of this email thread and the available documentation on SQS and Zaqar, as far as which of the above requirements are satisfied by each service and why I believe that. If there are fallacies in here, please correct me.

SQS
------

Requirements it meets: 1, 5

The SQS documentation states that it guarantees durability of messages (1) and handles "unlimited" throughput (5).

It does not guarantee once-and-only-once delivery (2) and requires applications that care about this to de-duplicate on the receiving side.

It also does not guarantee message order (6), making it unsuitable for certain uses.

SQS is not an application-local service nor does it use a wire-level protocol, so from this I infer that (3) and (4) were not design goals.

Zaqar
--------

Requirements it meets: 1*, 2, 6

Zaqar states that it aims to guarantee message durability (1) but does so by pushing the guarantee of durability into the storage layer. Thus, Zaqar will not be able to guarantee durability of messages when a storage pool fails, is misconfigured, or what have you.
Therefore, I do not feel that message durability is a strong guarantee of Zaqar itself; in some configurations, this guarantee may be possible based on the underlying storage, but this capability will need to be exposed in such a way that users can make informed decisions about which Zaqar storage back-end (or "flavor") to use for their application based on whether or not they need durability.

Single delivery of messages (2) is provided for by the claim semantics in Zaqar's API. FIFO (6) ordering was an architectural choice made based on feedback from users.

Aggregate throughput of a single queue (5) is not scalable beyond the reach of a single storage pool. This makes it possible for an application to outgrow Zaqar when its total throughput needs exceed the capacity of a single pool. This would also make it possible for one user to DoS other users who share the same storage pool (unless rate-limits are implemented, which would further indicate that (5) was not a design goal). Also, as with durability, pushing this problem down to the storage layer is not the same as _solving_ it.

Zaqar relies on a store-and-forward architecture, which is not amenable to low-latency message processing (4). Again, as with SQS, it is not a wire-level protocol, so I don't believe low-latency connectivity (3) was a design goal.

Summary
-------------

It looks like Zaqar should be very well suited to "small to mid sized" clouds. At this scale, I believe its architecture will meet all the use cases that SQS does, plus some that SQS does not. That's great for private clouds. And, as far as I can tell, the developer team is dedicated to making the project easy to use and administer at this scale. That's also great.

However, I continue to believe that the current architecture of Zaqar is not going to handle the needs of public cloud providers who want to offer an alternative messaging service to SQS within their OpenStack clouds, primarily because the project is not itself directly addressing durability and throughput. It does not appear to be any more durable or more performant than the storage implementation underneath it, which, to a user who requires durability and throughput, makes it no better than those technologies.

While we have several other projects today with well-known scaling limitations (*cough* Nova *cough*), the difference is that the current scaling limitations of Zaqar are a result of design principles of the project.

As for advice to the project moving forward, I would offer this: any design decision you make that limits either aggregate throughput or durability, when operating this service in a public cloud at "unlimited" scale, is going to draw concern from potential operators and users, even if they're not (yet) at that scale, because design decisions are much harder to fix later on than bugs.

-Devananda

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev