subject:"\[openstack\-dev\] \[Zaqar\] Zaqar and SQS Properties of Distributed Queues"

On 09/17/2014 10:36 PM, Joe Gordon wrote:
> Hi All,
> 
> My understanding of Zaqar is that it's like SQS. SQS uses distributed
> queues, which have a few unusual properties [0]:
> 
> 
> Message Order
> 
> Amazon SQS makes a best effort to preserve order in messages, but due to
> the distributed nature of the queue, we cannot guarantee you will
> receive messages in the exact order you sent them. If your system
> requires that order be preserved, we recommend you place sequencing
> information in each message so you can reorder the messages upon receipt.
> 

Zaqar guarantees FIFO. To be more precise, it does that relying on the
storage backend ability to do so as well. Depending on the storage used,
guaranteeing FIFO may have some performance penalties.

We have discussed on adding a way to enable/disable some features - like
FIFO - in a per-queue bases and now that we have flavors, we may work on
allowing things like enabling/disabling FIFO on a per-flavor basis.

> At-Least-Once Delivery
> 
> Amazon SQS stores copies of your messages on multiple servers for
> redundancy and high availability. On rare occasions, one of the servers
> storing a copy of a message might be unavailable when you receive or
> delete the message. If that occurs, the copy of the message will not be
> deleted on that unavailable server, and you might get that message copy
> again when you receive messages. Because of this, you must design your
> application to be idempotent (i.e., it must not be adversely affected if
> it processes the same message more than once).

As of now, Zaqar fully relies on the storage replication/clustering
capabilities to provide data consistency, availability and fault
tolerance. However, as far as consuming messages is concerned, it can
guarantee once-and-only-once and/or at-least-once delivery depending on
the message pattern used to consume messages. Using pop or claims
guarantees the former whereas streaming messages out of Zaqar guarantees
the later.

> Message Sample
> 
> The behavior of retrieving messages from the queue depends whether you
> are using short (standard) polling, the default behavior, or long
> polling. For more information about long polling, see Amazon SQS Long
> Polling
> .
> 
> With short polling, when you retrieve messages from the queue, Amazon
> SQS samples a subset of the servers (based on a weighted random
> distribution) and returns messages from just those servers. This means
> that a particular receive request might not return all your messages.
> Or, if you have a small number of messages in your queue (less than
> 1000), it means a particular request might not return any of your
> messages, whereas a subsequent request will. If you keep retrieving from
> your queues, Amazon SQS will sample all of the servers, and you will
> receive all of your messages.
> 
> The following figure shows short polling behavior of messages being
> returned after one of your system components makes a receive request.
> Amazon SQS samples several of the servers (in gray) and returns the
> messages from those servers (Message A, C, D, and B). Message E is not
> returned to this particular request, but it would be returned to a
> subsequent request.

Image here:
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/images/ArchOverview_Receive.png

Currently the best and only way to pull messages from Zaqar is by using
short polling. We'll work on long-polling and websocket support.
Requests are guaranteed to return available messages if there are any.

> Presumably SQS has these properties because it makes the
> system scalable, if so does Zaqar have the same properties (not just
> making these same guarantees in the API, but actually having these
> properties in the backends)? And if not, why?

Based on our short conversation on IRC last night, I understand you're
concerned that FIFO may result in performance issues. That's a valid
concern and I think the right answer is that it depends on the storage.
If the storage has a built-in FIFO guarantee then there's nothing Zaqar
needs to do there. In the other hand, if the storage does not have a
built-in support for FIFO, Zaqar will cover it in the driver
implementation. In the mongodb driver, each message has a marker that is
used to guarantee FIFO.

> I looked on the wiki [1]
> for information on this, but couldn't find anything.
> 

We have this[0] section in the FAQ but it definitely doesn't answer your
questions. I'm sure we had this documented way better but I can't find
the link. :(

[0]
https://wiki.openstack.org/wiki/Zaqar/Frequently_asked_questions#How_does_Zaqar_compare_to_AWS_.28SQS.2FSNS.29.3F

Hope the above helps,
Flavio

-- 
@flaper87
Flavio Percoco

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/l

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-18 Thread Gordon Sim


On 09/18/2014 12:31 PM, Flavio Percoco wrote:

On 09/17/2014 10:36 PM, Joe Gordon wrote:

My understanding of Zaqar is that it's like SQS. SQS uses distributed
queues, which have a few unusual properties [0]:


 Message Order

Amazon SQS makes a best effort to preserve order in messages, but due to
the distributed nature of the queue, we cannot guarantee you will
receive messages in the exact order you sent them. If your system
requires that order be preserved, we recommend you place sequencing
information in each message so you can reorder the messages upon receipt.



Zaqar guarantees FIFO. To be more precise, it does that relying on the
storage backend ability to do so as well. Depending on the storage used,
guaranteeing FIFO may have some performance penalties.


Would it be accurate to say that at present Zaqar does not use 
distributed queues, but holds all queue data in a storage mechanism of 
some form which may internally distribute that data among servers but 
provides Zaqar with a consistent data model of some form?


[...]

As of now, Zaqar fully relies on the storage replication/clustering
capabilities to provide data consistency, availability and fault
tolerance.


Is the replication synchronous or asynchronous with respect to client 
calls? E.g. will the response to a post of messages be returned only 
once the replication of those messages is confirmed? Likewise when 
deleting a message, is the response only returned when replicas of the 
message are deleted?



However, as far as consuming messages is concerned, it can
guarantee once-and-only-once and/or at-least-once delivery depending on
the message pattern used to consume messages. Using pop or claims
guarantees the former whereas streaming messages out of Zaqar guarantees
the later.


From what I can see, pop provides unreliable delivery (i.e. its similar 
to no-ack). If the delete call using pop fails while sending back the 
response, the messages are removed but didn't get to the client.


What do you mean by 'streaming messages'?

[...]

Based on our short conversation on IRC last night, I understand you're
concerned that FIFO may result in performance issues. That's a valid
concern and I think the right answer is that it depends on the storage.
If the storage has a built-in FIFO guarantee then there's nothing Zaqar
needs to do there. In the other hand, if the storage does not have a
built-in support for FIFO, Zaqar will cover it in the driver
implementation. In the mongodb driver, each message has a marker that is
used to guarantee FIFO.


That marker is a sequence number of some kind that is used to provide 
ordering to queries? Is it generated by the database itself?



___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-18 Thread Clint Byrum

Great job highlighting what our friends over at Amazon are doing.

It's clear from these snippets, and a few other pieces of documentation
for SQS I've read, that the Amazon team approached SQS from a _massive_
scaling perspective. I think what may be forcing a lot of this frustration
with Zaqar is that it was designed with a much smaller scale in mind.

I think as long as that is the case, the design will remain in question.
I'd be comfortable saying that the use cases I've been thinking about
are entirely fine with the limitations SQS has.

Excerpts from Joe Gordon's message of 2014-09-17 13:36:18 -0700:
> Hi All,
> 
> My understanding of Zaqar is that it's like SQS. SQS uses distributed
> queues, which have a few unusual properties [0]:
> Message Order
> 
> Amazon SQS makes a best effort to preserve order in messages, but due to
> the distributed nature of the queue, we cannot guarantee you will receive
> messages in the exact order you sent them. If your system requires that
> order be preserved, we recommend you place sequencing information in each
> message so you can reorder the messages upon receipt.
> At-Least-Once Delivery
> 
> Amazon SQS stores copies of your messages on multiple servers for
> redundancy and high availability. On rare occasions, one of the servers
> storing a copy of a message might be unavailable when you receive or delete
> the message. If that occurs, the copy of the message will not be deleted on
> that unavailable server, and you might get that message copy again when you
> receive messages. Because of this, you must design your application to be
> idempotent (i.e., it must not be adversely affected if it processes the
> same message more than once).
> Message Sample
> 
> The behavior of retrieving messages from the queue depends whether you are
> using short (standard) polling, the default behavior, or long polling. For
> more information about long polling, see Amazon SQS Long Polling
> 
> .
> 
> With short polling, when you retrieve messages from the queue, Amazon SQS
> samples a subset of the servers (based on a weighted random distribution)
> and returns messages from just those servers. This means that a particular
> receive request might not return all your messages. Or, if you have a small
> number of messages in your queue (less than 1000), it means a particular
> request might not return any of your messages, whereas a subsequent request
> will. If you keep retrieving from your queues, Amazon SQS will sample all
> of the servers, and you will receive all of your messages.
> 
> The following figure shows short polling behavior of messages being
> returned after one of your system components makes a receive request.
> Amazon SQS samples several of the servers (in gray) and returns the
> messages from those servers (Message A, C, D, and B). Message E is not
> returned to this particular request, but it would be returned to a
> subsequent request.
> 
> 
> 
> Presumably SQS has these properties because it makes the system scalable,
> if so does Zaqar have the same properties (not just making these same
> guarantees in the API, but actually having these properties in the
> backends)? And if not, why? I looked on the wiki [1] for information on
> this, but couldn't find anything.
> 
> 
> 
> 
> 
> [0]
> http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/DistributedQueues.html
> [1] https://wiki.openstack.org/wiki/Zaqar

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

On 09/18/2014 04:09 PM, Gordon Sim wrote:
> On 09/18/2014 12:31 PM, Flavio Percoco wrote:
>> On 09/17/2014 10:36 PM, Joe Gordon wrote:
>>> My understanding of Zaqar is that it's like SQS. SQS uses distributed
>>> queues, which have a few unusual properties [0]:
>>>
>>>
>>>  Message Order
>>>
>>> Amazon SQS makes a best effort to preserve order in messages, but due to
>>> the distributed nature of the queue, we cannot guarantee you will
>>> receive messages in the exact order you sent them. If your system
>>> requires that order be preserved, we recommend you place sequencing
>>> information in each message so you can reorder the messages upon
>>> receipt.
>>>
>>
>> Zaqar guarantees FIFO. To be more precise, it does that relying on the
>> storage backend ability to do so as well. Depending on the storage used,
>> guaranteeing FIFO may have some performance penalties.
> 
> Would it be accurate to say that at present Zaqar does not use
> distributed queues, but holds all queue data in a storage mechanism of
> some form which may internally distribute that data among servers but
> provides Zaqar with a consistent data model of some form?

I think this is accurate. The queue's distribution depends on the
storage ability to do so and deployers will be able to choose what
storage works best for them based on this as well. I'm not sure how
useful this separation is from a user perspective but I do see the
relevance when it comes to implementation details and deployments.

> [...]
>> As of now, Zaqar fully relies on the storage replication/clustering
>> capabilities to provide data consistency, availability and fault
>> tolerance.
> 
> Is the replication synchronous or asynchronous with respect to client
> calls? E.g. will the response to a post of messages be returned only
> once the replication of those messages is confirmed? Likewise when
> deleting a message, is the response only returned when replicas of the
> message are deleted?

It depends on the driver implementation and/or storage configuration.
For example, in the mongodb driver, we use the default write concern
called "acknowledged". This means that as soon as the message gets to
the master node (note it's not written on disk yet nor replicated) zaqar
will receive a confirmation and then send the response back to the
client. This is also configurable by the deployer by changing the
default write concern in the mongodb uri using `?w=SOME_WRITE_CONCERN`[0].

[0] http://docs.mongodb.org/manual/reference/connection-string/#uri.w

> 
>> However, as far as consuming messages is concerned, it can
>> guarantee once-and-only-once and/or at-least-once delivery depending on
>> the message pattern used to consume messages. Using pop or claims
>> guarantees the former whereas streaming messages out of Zaqar guarantees
>> the later.
> 
> From what I can see, pop provides unreliable delivery (i.e. its similar
> to no-ack). If the delete call using pop fails while sending back the
> response, the messages are removed but didn't get to the client.

Correct, pop works like no-ack. If you want to have pop+ack, it is
possible to claim just 1 message and then delete it.

> 
> What do you mean by 'streaming messages'?

I'm sorry, that went out wrong. I had the "browsability" term in my head
and went with something even worse. By streaming messages I meant
polling messages without claiming them. In other words, at-least-once is
guaranteed by default, whereas once-and-only-once is guaranteed just if
claims are used.

> 
> [...]
>> Based on our short conversation on IRC last night, I understand you're
>> concerned that FIFO may result in performance issues. That's a valid
>> concern and I think the right answer is that it depends on the storage.
>> If the storage has a built-in FIFO guarantee then there's nothing Zaqar
>> needs to do there. In the other hand, if the storage does not have a
>> built-in support for FIFO, Zaqar will cover it in the driver
>> implementation. In the mongodb driver, each message has a marker that is
>> used to guarantee FIFO.
> 
> That marker is a sequence number of some kind that is used to provide
> ordering to queries? Is it generated by the database itself?

It's a sequence number to provide ordering to queries, correct.
Depending on the driver, it may be generated by Zaqar or the database.
In mongodb's case it's generated by Zaqar[0].

[0]
https://github.com/openstack/zaqar/blob/master/zaqar/queues/storage/mongodb/queues.py#L103-L185

-- 
@flaper87
Flavio Percoco

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

On 09/18/2014 04:24 PM, Clint Byrum wrote:
> Great job highlighting what our friends over at Amazon are doing.
> 
> It's clear from these snippets, and a few other pieces of documentation
> for SQS I've read, that the Amazon team approached SQS from a _massive_
> scaling perspective. I think what may be forcing a lot of this frustration
> with Zaqar is that it was designed with a much smaller scale in mind.
> 
> I think as long as that is the case, the design will remain in question.
> I'd be comfortable saying that the use cases I've been thinking about
> are entirely fine with the limitations SQS has.

I think these are pretty strong comments with not enough arguments to
defend them.

Saying that Zaqar was designed with a smaller scale in mid without
actually saying why you think so is not fair besides not being true. So
please, do share why you think Zaqar was not designed for big scales and
provide comments that will help the project to grow and improve.

- Is it because the storage technologies that have been chosen?
- Is it because of the API?
- Is it because of the programing language/framework ?

So far, we've just discussed the API semantics and not zaqar's
scalability, which makes your comments even more surprising.

Flavio

> 
> Excerpts from Joe Gordon's message of 2014-09-17 13:36:18 -0700:
>> Hi All,
>>
>> My understanding of Zaqar is that it's like SQS. SQS uses distributed
>> queues, which have a few unusual properties [0]:
>> Message Order
>>
>> Amazon SQS makes a best effort to preserve order in messages, but due to
>> the distributed nature of the queue, we cannot guarantee you will receive
>> messages in the exact order you sent them. If your system requires that
>> order be preserved, we recommend you place sequencing information in each
>> message so you can reorder the messages upon receipt.
>> At-Least-Once Delivery
>>
>> Amazon SQS stores copies of your messages on multiple servers for
>> redundancy and high availability. On rare occasions, one of the servers
>> storing a copy of a message might be unavailable when you receive or delete
>> the message. If that occurs, the copy of the message will not be deleted on
>> that unavailable server, and you might get that message copy again when you
>> receive messages. Because of this, you must design your application to be
>> idempotent (i.e., it must not be adversely affected if it processes the
>> same message more than once).
>> Message Sample
>>
>> The behavior of retrieving messages from the queue depends whether you are
>> using short (standard) polling, the default behavior, or long polling. For
>> more information about long polling, see Amazon SQS Long Polling
>> 
>> .
>>
>> With short polling, when you retrieve messages from the queue, Amazon SQS
>> samples a subset of the servers (based on a weighted random distribution)
>> and returns messages from just those servers. This means that a particular
>> receive request might not return all your messages. Or, if you have a small
>> number of messages in your queue (less than 1000), it means a particular
>> request might not return any of your messages, whereas a subsequent request
>> will. If you keep retrieving from your queues, Amazon SQS will sample all
>> of the servers, and you will receive all of your messages.
>>
>> The following figure shows short polling behavior of messages being
>> returned after one of your system components makes a receive request.
>> Amazon SQS samples several of the servers (in gray) and returns the
>> messages from those servers (Message A, C, D, and B). Message E is not
>> returned to this particular request, but it would be returned to a
>> subsequent request.
>>
>>
>>
>> Presumably SQS has these properties because it makes the system scalable,
>> if so does Zaqar have the same properties (not just making these same
>> guarantees in the API, but actually having these properties in the
>> backends)? And if not, why? I looked on the wiki [1] for information on
>> this, but couldn't find anything.
>>
>>
>>
>>
>>
>> [0]
>> http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/DistributedQueues.html
>> [1] https://wiki.openstack.org/wiki/Zaqar
> 
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> 


-- 
@flaper87
Flavio Percoco

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-18 Thread Steve Lewis

On September 18, 2014 7:45 AM, Flavio Percoco wrote:

> On 09/18/2014 04:09 PM, Gordon Sim wrote:
>
>> However, as far as consuming messages is concerned, it can
>> guarantee once-and-only-once and/or at-least-once delivery depending on
>> the message pattern used to consume messages. Using pop or claims
>> guarantees the former whereas streaming messages out of Zaqar guarantees
>> the later.
>>
>> From what I can see, pop provides unreliable delivery (i.e. its similar
>> to no-ack). If the delete call using pop fails while sending back the
>> response, the messages are removed but didn't get to the client.
> 
> Correct, pop works like no-ack. If you want to have pop+ack, it is
> possible to claim just 1 message and then delete it.

Having some experience using SQS I would expect that there would be a mode
something like what SQS provides where a message gets picked up by one 
consumer (let's say by short-polling) and is hidden for a timeout duration from 
other consumers, so that the consumer has time to ack. Is that how you are 
using the term 'claim' in this case?

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

On 09/18/2014 05:42 PM, Steve Lewis wrote:
> On September 18, 2014 7:45 AM, Flavio Percoco wrote:
> 
>> On 09/18/2014 04:09 PM, Gordon Sim wrote:
>>
>>> However, as far as consuming messages is concerned, it can
>>> guarantee once-and-only-once and/or at-least-once delivery depending on
>>> the message pattern used to consume messages. Using pop or claims
>>> guarantees the former whereas streaming messages out of Zaqar guarantees
>>> the later.
>>>
>>> From what I can see, pop provides unreliable delivery (i.e. its similar
>>> to no-ack). If the delete call using pop fails while sending back the
>>> response, the messages are removed but didn't get to the client.
>>
>> Correct, pop works like no-ack. If you want to have pop+ack, it is
>> possible to claim just 1 message and then delete it.
> 
> Having some experience using SQS I would expect that there would be a mode
> something like what SQS provides where a message gets picked up by one 
> consumer (let's say by short-polling) and is hidden for a timeout duration 
> from 
> other consumers, so that the consumer has time to ack. Is that how you are 
> using the term 'claim' in this case?

Correct, that's the name of the feature in Zaqar. You can find a bit of
extra information here[0]. Hope this helps.

[0] https://wiki.openstack.org/wiki/Zaqar/specs/api/v1.1#Claims

Cheers,
Flavio

-- 
@flaper87
Flavio Percoco

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-18 Thread Devananda van der Veen

On Thu, Sep 18, 2014 at 7:45 AM, Flavio Percoco  wrote:
> On 09/18/2014 04:09 PM, Gordon Sim wrote:
>> On 09/18/2014 12:31 PM, Flavio Percoco wrote:
>>> Zaqar guarantees FIFO. To be more precise, it does that relying on the
>>> storage backend ability to do so as well. Depending on the storage used,
>>> guaranteeing FIFO may have some performance penalties.
>>
>> Would it be accurate to say that at present Zaqar does not use
>> distributed queues, but holds all queue data in a storage mechanism of
>> some form which may internally distribute that data among servers but
>> provides Zaqar with a consistent data model of some form?
>
> I think this is accurate. The queue's distribution depends on the
> storage ability to do so and deployers will be able to choose what
> storage works best for them based on this as well. I'm not sure how
> useful this separation is from a user perspective but I do see the
> relevance when it comes to implementation details and deployments.

Guaranteeing FIFO and not using a distributed queue architecture
*above* the storage backend are both scale-limiting design choices.
That Zaqar's scalability depends on the storage back end is not a
desirable thing in a cloud-scale messaging system in my opinion,
because this will prevent use at scales which can not be accommodated
by a single storage back end.

And based on my experience consulting for companies whose needs grew
beyond the capabilities of a single storage backend, moving to
application-aware sharding required a significant amount of
rearchitecture in most cases.

-Devananda

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-18 Thread Devananda van der Veen

On Thu, Sep 18, 2014 at 7:55 AM, Flavio Percoco  wrote:
> On 09/18/2014 04:24 PM, Clint Byrum wrote:
>> Great job highlighting what our friends over at Amazon are doing.
>>
>> It's clear from these snippets, and a few other pieces of documentation
>> for SQS I've read, that the Amazon team approached SQS from a _massive_
>> scaling perspective. I think what may be forcing a lot of this frustration
>> with Zaqar is that it was designed with a much smaller scale in mind.
>>
>> I think as long as that is the case, the design will remain in question.
>> I'd be comfortable saying that the use cases I've been thinking about
>> are entirely fine with the limitations SQS has.
>
> I think these are pretty strong comments with not enough arguments to
> defend them.
>

Please see my prior email. I agree with Clint's assertions here.

> Saying that Zaqar was designed with a smaller scale in mid without
> actually saying why you think so is not fair besides not being true. So
> please, do share why you think Zaqar was not designed for big scales and
> provide comments that will help the project to grow and improve.
>
> - Is it because the storage technologies that have been chosen?
> - Is it because of the API?
> - Is it because of the programing language/framework ?

It is not because of the storage technology or because of the
programming language.

> So far, we've just discussed the API semantics and not zaqar's
> scalability, which makes your comments even more surprising.

- guaranteed message order
- not distributing work across a configurable number of back ends

These are scale-limiting design choices which are reflected in the
API's characteristics.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-18 Thread Joe Gordon

On Thu, Sep 18, 2014 at 9:02 AM, Devananda van der Veen <
devananda@gmail.com> wrote:

> On Thu, Sep 18, 2014 at 7:55 AM, Flavio Percoco  wrote:
> > On 09/18/2014 04:24 PM, Clint Byrum wrote:
> >> Great job highlighting what our friends over at Amazon are doing.
> >>
> >> It's clear from these snippets, and a few other pieces of documentation
> >> for SQS I've read, that the Amazon team approached SQS from a _massive_
> >> scaling perspective. I think what may be forcing a lot of this
> frustration
> >> with Zaqar is that it was designed with a much smaller scale in mind.
> >>
> >> I think as long as that is the case, the design will remain in question.
> >> I'd be comfortable saying that the use cases I've been thinking about
> >> are entirely fine with the limitations SQS has.
> >
> > I think these are pretty strong comments with not enough arguments to
> > defend them.
> >
>
> Please see my prior email. I agree with Clint's assertions here.
>
> > Saying that Zaqar was designed with a smaller scale in mid without
> > actually saying why you think so is not fair besides not being true. So
> > please, do share why you think Zaqar was not designed for big scales and
> > provide comments that will help the project to grow and improve.
> >
> > - Is it because the storage technologies that have been chosen?
> > - Is it because of the API?
> > - Is it because of the programing language/framework ?
>
> It is not because of the storage technology or because of the
> programming language.
>
> > So far, we've just discussed the API semantics and not zaqar's
> > scalability, which makes your comments even more surprising.
>
> - guaranteed message order
> - not distributing work across a configurable number of back ends
>
> These are scale-limiting design choices which are reflected in the
> API's characteristics.
>

I agree with Clint and Devananda


>
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-18 Thread Devananda van der Veen

On Thu, Sep 18, 2014 at 8:54 AM, Devananda van der Veen
 wrote:
> On Thu, Sep 18, 2014 at 7:45 AM, Flavio Percoco  wrote:
>> On 09/18/2014 04:09 PM, Gordon Sim wrote:
>>> On 09/18/2014 12:31 PM, Flavio Percoco wrote:
 Zaqar guarantees FIFO. To be more precise, it does that relying on the
 storage backend ability to do so as well. Depending on the storage used,
 guaranteeing FIFO may have some performance penalties.
>>>
>>> Would it be accurate to say that at present Zaqar does not use
>>> distributed queues, but holds all queue data in a storage mechanism of
>>> some form which may internally distribute that data among servers but
>>> provides Zaqar with a consistent data model of some form?
>>
>> I think this is accurate. The queue's distribution depends on the
>> storage ability to do so and deployers will be able to choose what
>> storage works best for them based on this as well. I'm not sure how
>> useful this separation is from a user perspective but I do see the
>> relevance when it comes to implementation details and deployments.
>
> Guaranteeing FIFO and not using a distributed queue architecture
> *above* the storage backend are both scale-limiting design choices.
> That Zaqar's scalability depends on the storage back end is not a
> desirable thing in a cloud-scale messaging system in my opinion,
> because this will prevent use at scales which can not be accommodated
> by a single storage back end.
>

It may be worth qualifying this a bit more.

While no single instance of any storage back-end is infinitely
scalable, some of them are really darn fast. That may be enough for
the majority of use cases. It's not outside the realm of possibility
that the inflection point [0] where these design choices result in
performance limitations is at the very high end of scale-out, eg.
public cloud providers who have the resources to invest further in
improving zaqar.

As an example of what I mean, let me refer to the 99th percentile
response time graphs in Kurt's benchmarks [1]... increasing the number
of clients with write-heavy workloads was enough to drive latency from
<10ms to >200 ms with a single service. That latency significantly
improved as storage and application instances were added, which is
good, and what I would expect. These benchmarks do not (and were not
intended to) show the maximal performance of a public-cloud-scale
deployment -- but they do show that performance under different
workloads improves as additional services are started.

While I have no basis for comparing the configuration of the
deployment he used in those tests to what a public cloud operator
might choose to deploy, and presumably such an operator would put
significant work into tuning storage and running more instances of
each service and thus shift that inflection point "to the right", my
point is that, by depending on a single storage instance, Zaqar has
pushed the *ability* to scale out down into the storage
implementation. Given my experience scaling SQL and NoSQL data stores
(in my past life, before working on OpenStack) I have a knee-jerk
reaction to believing that this approach will result in a
public-cloud-scale messaging system.

-Devananda

[0] http://en.wikipedia.org/wiki/Inflection_point -- in this context,
I mean the point on the graph of throughput vs latency where the
derivative goes from near-zero (linear growth) to non-zero
(exponential growth)

[1] https://wiki.openstack.org/wiki/Zaqar/Performance/PubSub/Redis

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-18 Thread Joe Gordon

On Thu, Sep 18, 2014 at 7:45 AM, Flavio Percoco  wrote:

> On 09/18/2014 04:09 PM, Gordon Sim wrote:
> > On 09/18/2014 12:31 PM, Flavio Percoco wrote:
> >> On 09/17/2014 10:36 PM, Joe Gordon wrote:
> >>> My understanding of Zaqar is that it's like SQS. SQS uses distributed
> >>> queues, which have a few unusual properties [0]:
> >>>
> >>>
> >>>  Message Order
> >>>
> >>> Amazon SQS makes a best effort to preserve order in messages, but due
> to
> >>> the distributed nature of the queue, we cannot guarantee you will
> >>> receive messages in the exact order you sent them. If your system
> >>> requires that order be preserved, we recommend you place sequencing
> >>> information in each message so you can reorder the messages upon
> >>> receipt.
> >>>
> >>
> >> Zaqar guarantees FIFO. To be more precise, it does that relying on the
> >> storage backend ability to do so as well. Depending on the storage used,
> >> guaranteeing FIFO may have some performance penalties.
> >
> > Would it be accurate to say that at present Zaqar does not use
> > distributed queues, but holds all queue data in a storage mechanism of
> > some form which may internally distribute that data among servers but
> > provides Zaqar with a consistent data model of some form?
>
> I think this is accurate. The queue's distribution depends on the
> storage ability to do so and deployers will be able to choose what
> storage works best for them based on this as well. I'm not sure how
> useful this separation is from a user perspective but I do see the
> relevance when it comes to implementation details and deployments.
>
>
> > [...]
> >> As of now, Zaqar fully relies on the storage replication/clustering
> >> capabilities to provide data consistency, availability and fault
> >> tolerance.
> >
> > Is the replication synchronous or asynchronous with respect to client
> > calls? E.g. will the response to a post of messages be returned only
> > once the replication of those messages is confirmed? Likewise when
> > deleting a message, is the response only returned when replicas of the
> > message are deleted?
>
> It depends on the driver implementation and/or storage configuration.
> For example, in the mongodb driver, we use the default write concern
> called "acknowledged". This means that as soon as the message gets to
> the master node (note it's not written on disk yet nor replicated) zaqar
> will receive a confirmation and then send the response back to the
> client. This is also configurable by the deployer by changing the
> default write concern in the mongodb uri using `?w=SOME_WRITE_CONCERN`[0].
>

This means that by default Zaqar cannot guarantee a message will be
delivered at all. A message can be acknowledged and then the 'master node'
crashes and the message is lost. Zaqar's ability to guarantee delivery is
limited by the reliability of a single node, while something like swift can
only loose a piece of data if 3 machines crash at the same time.


>
> [0] http://docs.mongodb.org/manual/reference/connection-string/#uri.w
>
> >
> >> However, as far as consuming messages is concerned, it can
> >> guarantee once-and-only-once and/or at-least-once delivery depending on
> >> the message pattern used to consume messages. Using pop or claims
> >> guarantees the former whereas streaming messages out of Zaqar guarantees
> >> the later.
> >
> > From what I can see, pop provides unreliable delivery (i.e. its similar
> > to no-ack). If the delete call using pop fails while sending back the
> > response, the messages are removed but didn't get to the client.
>
> Correct, pop works like no-ack. If you want to have pop+ack, it is
> possible to claim just 1 message and then delete it.
>
> >
> > What do you mean by 'streaming messages'?
>
> I'm sorry, that went out wrong. I had the "browsability" term in my head
> and went with something even worse. By streaming messages I meant
> polling messages without claiming them. In other words, at-least-once is
> guaranteed by default, whereas once-and-only-once is guaranteed just if
> claims are used.
>
> >
> > [...]
> >> Based on our short conversation on IRC last night, I understand you're
> >> concerned that FIFO may result in performance issues. That's a valid
> >> concern and I think the right answer is that it depends on the storage.
> >> If the storage has a built-in FIFO guarantee then there's nothing Zaqar
> >> needs to do there. In the other hand, if the storage does not have a
> >> built-in support for FIFO, Zaqar will cover it in the driver
> >> implementation. In the mongodb driver, each message has a marker that is
> >> used to guarantee FIFO.
> >
> > That marker is a sequence number of some kind that is used to provide
> > ordering to queries? Is it generated by the database itself?
>
> It's a sequence number to provide ordering to queries, correct.
> Depending on the driver, it may be generated by Zaqar or the database.
> In mongodb's case it's generated by Zaqar[0].
>
> [0]
>
> https://githu

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-18 Thread Gordon Sim

On 09/18/2014 03:45 PM, Flavio Percoco wrote:

On 09/18/2014 04:09 PM, Gordon Sim wrote:

Is the replication synchronous or asynchronous with respect to client
calls? E.g. will the response to a post of messages be returned only
once the replication of those messages is confirmed? Likewise when
deleting a message, is the response only returned when replicas of the
message are deleted?

It depends on the driver implementation and/or storage configuration.
For example, in the mongodb driver, we use the default write concern
called "acknowledged". This means that as soon as the message gets to
the master node (note it's not written on disk yet nor replicated) zaqar
will receive a confirmation and then send the response back to the
client.

So in that mode it's unreliable. If there is failure right after the
response is sent the message may be lost, but the client believes it has
been confirmed so will not resend.

This is also configurable by the deployer by changing the
default write concern in the mongodb uri using `?w=SOME_WRITE_CONCERN`[0].

[0] http://docs.mongodb.org/manual/reference/connection-string/#uri.w

So you could change that to majority to get reliable publication
(at-least-once).

[...]

From what I can see, pop provides unreliable delivery (i.e. its similar
to no-ack). If the delete call using pop fails while sending back the
response, the messages are removed but didn't get to the client.

Correct, pop works like no-ack. If you want to have pop+ack, it is
possible to claim just 1 message and then delete it.

Right, claim+delete is ack (and if the claim is replicated and
recoverable etc you can verify whether deletion occurred to ensure
message is processed only once). Using delete-with-pop is no-ak, i.e.
at-most-once.

What do you mean by 'streaming messages'?

I'm sorry, that went out wrong. I had the "browsability" term in my head
and went with something even worse. By streaming messages I meant
polling messages without claiming them. In other words, at-least-once is
guaranteed by default, whereas once-and-only-once is guaranteed just if
claims are used.

I don't see that the claim mechanism brings any stronger guarantee, it
just offers a competing consumer behaviour where browsing is
non-competing (non-destructive). In both cases you require the client to
be able to remember which messages it had processed in order to ensure
exactly once. The claim reduces the scope of any doubt, but the client
still needs to be able to determine whether it has already processed any
message in the claim already.

[...]

That marker is a sequence number of some kind that is used to provide
ordering to queries? Is it generated by the database itself?

It's a sequence number to provide ordering to queries, correct.
Depending on the driver, it may be generated by Zaqar or the database.
In mongodb's case it's generated by Zaqar[0].

Zaqar increments a counter held within the database, am I reading that
correctly? So mongodb is responsible for the ordering and atomicity of
multiple concurrent requests for a marker?

[0]
https://github.com/openstack/zaqar/blob/master/zaqar/queues/storage/mongodb/queues.py#L103-L185

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

On 09/18/2014 09:25 PM, Gordon Sim wrote:
> On 09/18/2014 03:45 PM, Flavio Percoco wrote:
>> On 09/18/2014 04:09 PM, Gordon Sim wrote:
>>> Is the replication synchronous or asynchronous with respect to client
>>> calls? E.g. will the response to a post of messages be returned only
>>> once the replication of those messages is confirmed? Likewise when
>>> deleting a message, is the response only returned when replicas of the
>>> message are deleted?
>>
>> It depends on the driver implementation and/or storage configuration.
>> For example, in the mongodb driver, we use the default write concern
>> called "acknowledged". This means that as soon as the message gets to
>> the master node (note it's not written on disk yet nor replicated) zaqar
>> will receive a confirmation and then send the response back to the
>> client.
> 
> So in that mode it's unreliable. If there is failure right after the
> response is sent the message may be lost, but the client believes it has
> been confirmed so will not resend.
> 
>> This is also configurable by the deployer by changing the
>> default write concern in the mongodb uri using
>> `?w=SOME_WRITE_CONCERN`[0].
>>
>> [0] http://docs.mongodb.org/manual/reference/connection-string/#uri.w
> 
> So you could change that to majority to get reliable publication
> (at-least-once).

Right, to help with the fight for a world with saner defaults, I think
it'd be better to use majority as the default write concern in the
mongodb driver.


>>> What do you mean by 'streaming messages'?
>>
>> I'm sorry, that went out wrong. I had the "browsability" term in my head
>> and went with something even worse. By streaming messages I meant
>> polling messages without claiming them. In other words, at-least-once is
>> guaranteed by default, whereas once-and-only-once is guaranteed just if
>> claims are used.
> 
> I don't see that the claim mechanism brings any stronger guarantee, it
> just offers a competing consumer behaviour where browsing is
> non-competing (non-destructive). In both cases you require the client to
> be able to remember which messages it had processed in order to ensure
> exactly once. The claim reduces the scope of any doubt, but the client
> still needs to be able to determine whether it has already processed any
> message in the claim already.

The client needs to remember which messages it had processed if it
doesn't delete them (ack) after it has processed them. It's true the
client could also fail after having processed the message which means it
won't be able to ack it.

That said, being able to prevent other consumers to consume a specific
message can bring a stronger guarantee depending on how messages are
processed. I mean, claiming a message guarantees that throughout the
duration of that claim, no other client will be able to consume the
claimed messages, which means it allows messages to be consumed only once.

> 
> [...]
>>> That marker is a sequence number of some kind that is used to provide
>>> ordering to queries? Is it generated by the database itself?
>>
>> It's a sequence number to provide ordering to queries, correct.
>> Depending on the driver, it may be generated by Zaqar or the database.
>> In mongodb's case it's generated by Zaqar[0].
> 
> Zaqar increments a counter held within the database, am I reading that
> correctly? So mongodb is responsible for the ordering and atomicity of
> multiple concurrent requests for a marker?

Yes.

The message posting code is here[0] in case you'd like to take a look at
the logic used for mongodb (any feedback is obviously very welcome):

[0]
https://github.com/openstack/zaqar/blob/master/zaqar/queues/storage/mongodb/messages.py#L494

Cheers,
Flavio

-- 
@flaper87
Flavio Percoco

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

On 09/18/2014 07:19 PM, Joe Gordon wrote:
> 
> 
> On Thu, Sep 18, 2014 at 7:45 AM, Flavio Percoco  > wrote:
> 
> On 09/18/2014 04:09 PM, Gordon Sim wrote:
> > On 09/18/2014 12:31 PM, Flavio Percoco wrote:
> >> On 09/17/2014 10:36 PM, Joe Gordon wrote:
> >>> My understanding of Zaqar is that it's like SQS. SQS uses distributed
> >>> queues, which have a few unusual properties [0]:
> >>>
> >>>
> >>>  Message Order
> >>>
> >>> Amazon SQS makes a best effort to preserve order in messages, but due 
> to
> >>> the distributed nature of the queue, we cannot guarantee you will
> >>> receive messages in the exact order you sent them. If your system
> >>> requires that order be preserved, we recommend you place sequencing
> >>> information in each message so you can reorder the messages upon
> >>> receipt.
> >>>
> >>
> >> Zaqar guarantees FIFO. To be more precise, it does that relying on the
> >> storage backend ability to do so as well. Depending on the storage 
> used,
> >> guaranteeing FIFO may have some performance penalties.
> >
> > Would it be accurate to say that at present Zaqar does not use
> > distributed queues, but holds all queue data in a storage mechanism of
> > some form which may internally distribute that data among servers but
> > provides Zaqar with a consistent data model of some form?
> 
> I think this is accurate. The queue's distribution depends on the
> storage ability to do so and deployers will be able to choose what
> storage works best for them based on this as well. I'm not sure how
> useful this separation is from a user perspective but I do see the
> relevance when it comes to implementation details and deployments.
> 
> 
> > [...]
> >> As of now, Zaqar fully relies on the storage replication/clustering
> >> capabilities to provide data consistency, availability and fault
> >> tolerance.
> >
> > Is the replication synchronous or asynchronous with respect to client
> > calls? E.g. will the response to a post of messages be returned only
> > once the replication of those messages is confirmed? Likewise when
> > deleting a message, is the response only returned when replicas of the
> > message are deleted?
> 
> It depends on the driver implementation and/or storage configuration.
> For example, in the mongodb driver, we use the default write concern
> called "acknowledged". This means that as soon as the message gets to
> the master node (note it's not written on disk yet nor replicated) zaqar
> will receive a confirmation and then send the response back to the
> client. This is also configurable by the deployer by changing the
> default write concern in the mongodb uri using
> `?w=SOME_WRITE_CONCERN`[0].
> 
> 
> This means that by default Zaqar cannot guarantee a message will be
> delivered at all. A message can be acknowledged and then the 'master
> node' crashes and the message is lost. Zaqar's ability to guarantee
> delivery is limited by the reliability of a single node, while something
> like swift can only loose a piece of data if 3 machines crash at the
> same time.

Correct, as mentioned in my reply to Gordon, I also think `majority` is
a saner default for the write concern in this case.

I'm glad you mentioned Swift. We discussed a while back about having a
storage driver for it. I thought we had a blueprint for that but we
don't. Last time we discussed it, swift seemed to cover everything we
needed, IIRC. Anyway, just a thought.

Flavio

> [0] http://docs.mongodb.org/manual/reference/connection-string/#uri.w
> 
> >
> >> However, as far as consuming messages is concerned, it can
> >> guarantee once-and-only-once and/or at-least-once delivery depending on
> >> the message pattern used to consume messages. Using pop or claims
> >> guarantees the former whereas streaming messages out of Zaqar 
> guarantees
> >> the later.
> >
> > From what I can see, pop provides unreliable delivery (i.e. its similar
> > to no-ack). If the delete call using pop fails while sending back the
> > response, the messages are removed but didn't get to the client.
> 
> Correct, pop works like no-ack. If you want to have pop+ack, it is
> possible to claim just 1 message and then delete it.
> 
> >
> > What do you mean by 'streaming messages'?
> 
> I'm sorry, that went out wrong. I had the "browsability" term in my head
> and went with something even worse. By streaming messages I meant
> polling messages without claiming them. In other words, at-least-once is
> guaranteed by default, whereas once-and-only-once is guaranteed just if
> claims are used.
> 
> >
> > [...]
> >> Based on our short conversation on IRC last night, I understand you're
> >> concerned that FIFO may result in performance is

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

On 09/18/2014 07:16 PM, Devananda van der Veen wrote:
> On Thu, Sep 18, 2014 at 8:54 AM, Devananda van der Veen
>  wrote:
>> On Thu, Sep 18, 2014 at 7:45 AM, Flavio Percoco  wrote:
>>> On 09/18/2014 04:09 PM, Gordon Sim wrote:
 On 09/18/2014 12:31 PM, Flavio Percoco wrote:
> Zaqar guarantees FIFO. To be more precise, it does that relying on the
> storage backend ability to do so as well. Depending on the storage used,
> guaranteeing FIFO may have some performance penalties.

 Would it be accurate to say that at present Zaqar does not use
 distributed queues, but holds all queue data in a storage mechanism of
 some form which may internally distribute that data among servers but
 provides Zaqar with a consistent data model of some form?
>>>
>>> I think this is accurate. The queue's distribution depends on the
>>> storage ability to do so and deployers will be able to choose what
>>> storage works best for them based on this as well. I'm not sure how
>>> useful this separation is from a user perspective but I do see the
>>> relevance when it comes to implementation details and deployments.
>>
>> Guaranteeing FIFO and not using a distributed queue architecture
>> *above* the storage backend are both scale-limiting design choices.
>> That Zaqar's scalability depends on the storage back end is not a
>> desirable thing in a cloud-scale messaging system in my opinion,
>> because this will prevent use at scales which can not be accommodated
>> by a single storage back end.
>>
> 
> It may be worth qualifying this a bit more.
> 
> While no single instance of any storage back-end is infinitely
> scalable, some of them are really darn fast. That may be enough for
> the majority of use cases. It's not outside the realm of possibility
> that the inflection point [0] where these design choices result in
> performance limitations is at the very high end of scale-out, eg.
> public cloud providers who have the resources to invest further in
> improving zaqar.
> 
> As an example of what I mean, let me refer to the 99th percentile
> response time graphs in Kurt's benchmarks [1]... increasing the number
> of clients with write-heavy workloads was enough to drive latency from
> <10ms to >200 ms with a single service. That latency significantly
> improved as storage and application instances were added, which is
> good, and what I would expect. These benchmarks do not (and were not
> intended to) show the maximal performance of a public-cloud-scale
> deployment -- but they do show that performance under different
> workloads improves as additional services are started.
> 
> While I have no basis for comparing the configuration of the
> deployment he used in those tests to what a public cloud operator
> might choose to deploy, and presumably such an operator would put
> significant work into tuning storage and running more instances of
> each service and thus shift that inflection point "to the right", my
> point is that, by depending on a single storage instance, Zaqar has
> pushed the *ability* to scale out down into the storage
> implementation. Given my experience scaling SQL and NoSQL data stores
> (in my past life, before working on OpenStack) I have a knee-jerk
> reaction to believing that this approach will result in a
> public-cloud-scale messaging system.

Thanks for the more detailed explanation of your concern, I appreciate it.

Let me start by saying that I agree with the fact that pushing messages
distribution down to the storage may end up in some scaling limitations
for some scenarios.

That said, Zaqar already has the knowledge of pools. Pools allow
operators to add more storage clusters to Zaqar and with that balance
the load between them. It is possible to distribute the data across
these pools in a per-queue basis. While the messages of a queue are not
distributed across multiple *pools* - all the messages for queue X will
live in a single pool - I do believe this per-queue distribution helps
to address the above concern and pushes that limitation farther away.

Let me explain how pools currently work a bit better. As of now, each
pool has a URI pointing to a storage cluster and a weight. This weight
is used to balance load between pools every time a queue is created.
Once it's created, Zaqar keeps the information of the queue<->pool
association in a catalogue that is used to know where the queue lives.
We'll likely add new algorithms to have a better and more even
distribution of queues across the registered pools.

I'm sure message distribution could be implemented in Zaqar but I'm not
convinced we should do so right now. The reason being it would bring in
a whole lot of new issues to the project that I think we can and should
avoid for now.

Thanks for the feedback, Devananda.
Flavio


> 
> -Devananda
> 
> [0] http://en.wikipedia.org/wiki/Inflection_point -- in this context,
> I mean the point on the graph of throughput vs latency where the
> derivative goes from near-zer

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-19 Thread Thierry Carrez

Joe Gordon wrote:
> On Thu, Sep 18, 2014 at 9:02 AM, Devananda van der Veen
> mailto:devananda@gmail.com>> wrote:
>> - guaranteed message order
>> - not distributing work across a configurable number of back ends
>> 
>> These are scale-limiting design choices which are reflected in the
>> API's characteristics.
> 
> I agree with Clint and Devananda

The underlying question being... can Zaqar evolve to ultimately reach
the massive scale use case Joe, Clint and Devananda want it to reach, or
are those design choices so deeply rooted in the code and architecture
that Zaqar won't naturally mutate to support that use case.

The Zaqar team has shown great willingness to adapt in order to support
more use cases, but I guess there may be architectural design choices
that would just mean starting over ?

-- 
Thierry Carrez (ttx)

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

On 09/19/2014 11:00 AM, Thierry Carrez wrote:
> Joe Gordon wrote:
>> On Thu, Sep 18, 2014 at 9:02 AM, Devananda van der Veen
>> mailto:devananda@gmail.com>> wrote:
>>> - guaranteed message order
>>> - not distributing work across a configurable number of back ends
>>>
>>> These are scale-limiting design choices which are reflected in the
>>> API's characteristics.
>>
>> I agree with Clint and Devananda
> 
> The underlying question being... can Zaqar evolve to ultimately reach
> the massive scale use case Joe, Clint and Devananda want it to reach, or
> are those design choices so deeply rooted in the code and architecture
> that Zaqar won't naturally mutate to support that use case.
> 
> The Zaqar team has shown great willingness to adapt in order to support
> more use cases, but I guess there may be architectural design choices
> that would just mean starting over ?


Zaqar has scaling capabilities that go beyond depending on a single
storage cluster. As I mentioned in my previous email, the support for
storage pools allows the operator to scale out the storage layer and
balance the load across them.

There's always space for improvement and I definitely wouldn't go that
far to say there's a need to start over.

-- 
@flaper87
Flavio Percoco

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-19 Thread Eoghan Glynn



> Hi All,
> 
> My understanding of Zaqar is that it's like SQS. SQS uses distributed queues,
> which have a few unusual properties [0]:
> Message Order
> 
> 
> Amazon SQS makes a best effort to preserve order in messages, but due to the
> distributed nature of the queue, we cannot guarantee you will receive
> messages in the exact order you sent them. If your system requires that
> order be preserved, we recommend you place sequencing information in each
> message so you can reorder the messages upon receipt.
> At-Least-Once Delivery
> 
> 
> Amazon SQS stores copies of your messages on multiple servers for redundancy
> and high availability. On rare occasions, one of the servers storing a copy
> of a message might be unavailable when you receive or delete the message. If
> that occurs, the copy of the message will not be deleted on that unavailable
> server, and you might get that message copy again when you receive messages.
> Because of this, you must design your application to be idempotent (i.e., it
> must not be adversely affected if it processes the same message more than
> once).
> Message Sample
> 
> 
> The behavior of retrieving messages from the queue depends whether you are
> using short (standard) polling, the default behavior, or long polling. For
> more information about long polling, see Amazon SQS Long Polling .
> 
> With short polling, when you retrieve messages from the queue, Amazon SQS
> samples a subset of the servers (based on a weighted random distribution)
> and returns messages from just those servers. This means that a particular
> receive request might not return all your messages. Or, if you have a small
> number of messages in your queue (less than 1000), it means a particular
> request might not return any of your messages, whereas a subsequent request
> will. If you keep retrieving from your queues, Amazon SQS will sample all of
> the servers, and you will receive all of your messages.
> 
> The following figure shows short polling behavior of messages being returned
> after one of your system components makes a receive request. Amazon SQS
> samples several of the servers (in gray) and returns the messages from those
> servers (Message A, C, D, and B). Message E is not returned to this
> particular request, but it would be returned to a subsequent request.
> 
> 
> 
> 
> 
> 
> 
> Presumably SQS has these properties because it makes the system scalable, if
> so does Zaqar have the same properties (not just making these same
> guarantees in the API, but actually having these properties in the
> backends)? And if not, why? I looked on the wiki [1] for information on
> this, but couldn't find anything.

The premise of this thread is flawed I think.

It seems to be predicated on a direct quote from the public
documentation of a closed-source system justifying some
assumptions about the internal architecture and design goals
of that closed-source system.

It then proceeds to hold zaqar to account for not making
the same choices as that closed-source system.

This puts the zaqar folks in a no-win situation, as it's hard
to refute such arguments when they have no visibility over
the innards of that closed-source system.

Sure, the assumption may well be correct that the designers
of SQS made the choice to expose applications to out-of-order
messages as this was the only practical way of acheiving their
scalability goals.

But since the code isn't on github and the design discussions
aren't publicly archived, we have no way of validating that.

Would it be more reasonable to compare against a cloud-scale
messaging system that folks may have more direct knowledge
of?

For example, is HP Cloud Messaging[1] rolled out in full
production by now?

Is it still cloning the original Marconi API, or has it kept
up with the evolution of the API? Has the nature of this API
been seen as the root cause of any scalability issues?

Cheers,
Eoghan

[1] 
http://www.openstack.org/blog/2013/05/an-introductory-tour-of-openstack-cloud-messaging-as-a-service

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-19 Thread Clint Byrum

Excerpts from Eoghan Glynn's message of 2014-09-19 04:23:55 -0700:
> 
> > Hi All,
> > 
> > My understanding of Zaqar is that it's like SQS. SQS uses distributed 
> > queues,
> > which have a few unusual properties [0]:
> > Message Order
> > 
> > 
> > Amazon SQS makes a best effort to preserve order in messages, but due to the
> > distributed nature of the queue, we cannot guarantee you will receive
> > messages in the exact order you sent them. If your system requires that
> > order be preserved, we recommend you place sequencing information in each
> > message so you can reorder the messages upon receipt.
> > At-Least-Once Delivery
> > 
> > 
> > Amazon SQS stores copies of your messages on multiple servers for redundancy
> > and high availability. On rare occasions, one of the servers storing a copy
> > of a message might be unavailable when you receive or delete the message. If
> > that occurs, the copy of the message will not be deleted on that unavailable
> > server, and you might get that message copy again when you receive messages.
> > Because of this, you must design your application to be idempotent (i.e., it
> > must not be adversely affected if it processes the same message more than
> > once).
> > Message Sample
> > 
> > 
> > The behavior of retrieving messages from the queue depends whether you are
> > using short (standard) polling, the default behavior, or long polling. For
> > more information about long polling, see Amazon SQS Long Polling .
> > 
> > With short polling, when you retrieve messages from the queue, Amazon SQS
> > samples a subset of the servers (based on a weighted random distribution)
> > and returns messages from just those servers. This means that a particular
> > receive request might not return all your messages. Or, if you have a small
> > number of messages in your queue (less than 1000), it means a particular
> > request might not return any of your messages, whereas a subsequent request
> > will. If you keep retrieving from your queues, Amazon SQS will sample all of
> > the servers, and you will receive all of your messages.
> > 
> > The following figure shows short polling behavior of messages being returned
> > after one of your system components makes a receive request. Amazon SQS
> > samples several of the servers (in gray) and returns the messages from those
> > servers (Message A, C, D, and B). Message E is not returned to this
> > particular request, but it would be returned to a subsequent request.
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > Presumably SQS has these properties because it makes the system scalable, if
> > so does Zaqar have the same properties (not just making these same
> > guarantees in the API, but actually having these properties in the
> > backends)? And if not, why? I looked on the wiki [1] for information on
> > this, but couldn't find anything.
> 
> The premise of this thread is flawed I think.
> 
> It seems to be predicated on a direct quote from the public
> documentation of a closed-source system justifying some
> assumptions about the internal architecture and design goals
> of that closed-source system.
> 
> It then proceeds to hold zaqar to account for not making
> the same choices as that closed-source system.
> 

I don't think we want Zaqar to make the same choices. OpenStack's
constraints are different from AWS's.

I want to highlight that our expectations are for the API to support
deploying at scale. SQS _clearly_ started with a point of extreme scale
for the deployer, and thus is a good example of an API that is limited
enough to scale like that.

What has always been the concern is that Zaqar would make it extremely
complicated and/or costly to get to that level.

> This puts the zaqar folks in a no-win situation, as it's hard
> to refute such arguments when they have no visibility over
> the innards of that closed-source system.
> 

Nobody expects to know the insides. But the outsides, the parts that
are public, are brilliant because they are _limited_, and yet they still
support many many use cases.

> Sure, the assumption may well be correct that the designers
> of SQS made the choice to expose applications to out-of-order
> messages as this was the only practical way of acheiving their
> scalability goals.
> 
> But since the code isn't on github and the design discussions
> aren't publicly archived, we have no way of validating that.
> 

We don't need to see the code. Not requiring ordering makes the whole
problem easier to reason about. You don't need explicit pools anymore.
Just throw messages wherever, and make sure that everywhere gets
polled on a reasonable enough frequency. This is the kind of thing
operations loves. No global state means no split brain to avoid, no
synchronization. Does it solve all problems? no. But it solves a single
one, REALLY well.

Frankly I don't understand why there would be this argument to hold on
to so many use cases and so much API surface area. Zaqar's life gets
easier without ordering g

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-19 Thread Gordon Sim


On 09/19/2014 08:53 AM, Flavio Percoco wrote:

On 09/18/2014 09:25 PM, Gordon Sim wrote:

I don't see that the claim mechanism brings any stronger guarantee, it
just offers a competing consumer behaviour where browsing is
non-competing (non-destructive). In both cases you require the client to
be able to remember which messages it had processed in order to ensure
exactly once. The claim reduces the scope of any doubt, but the client
still needs to be able to determine whether it has already processed any
message in the claim already.


The client needs to remember which messages it had processed if it
doesn't delete them (ack) after it has processed them. It's true the
client could also fail after having processed the message which means it
won't be able to ack it.

That said, being able to prevent other consumers to consume a specific
message can bring a stronger guarantee depending on how messages are
processed. I mean, claiming a message guarantees that throughout the
duration of that claim, no other client will be able to consume the
claimed messages, which means it allows messages to be consumed only once.


I think 'exactly once' means different things when used for competing 
consumers and non-competing consumers. For the former it means the 
message is processed by only one consumer, and only once. For the latter 
it means every consumer processes the message exactly once.


Using a claim provides the competing consumer behaviour. To me this is a 
'different' guarantee from non-competing consumer rather than a 
'stronger' one, and it is orthogonal to the reliability of the delivery.


However we only differ on terminology used; I believe we are on the same 
page as far as the semantics go :-)



___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-19 Thread Vipul Sabhaya

On Fri, Sep 19, 2014 at 4:23 AM, Eoghan Glynn  wrote:

>
>
> > Hi All,
> >
> > My understanding of Zaqar is that it's like SQS. SQS uses distributed
> queues,
> > which have a few unusual properties [0]:
> > Message Order
> >
> >
> > Amazon SQS makes a best effort to preserve order in messages, but due to
> the
> > distributed nature of the queue, we cannot guarantee you will receive
> > messages in the exact order you sent them. If your system requires that
> > order be preserved, we recommend you place sequencing information in each
> > message so you can reorder the messages upon receipt.
> > At-Least-Once Delivery
> >
> >
> > Amazon SQS stores copies of your messages on multiple servers for
> redundancy
> > and high availability. On rare occasions, one of the servers storing a
> copy
> > of a message might be unavailable when you receive or delete the
> message. If
> > that occurs, the copy of the message will not be deleted on that
> unavailable
> > server, and you might get that message copy again when you receive
> messages.
> > Because of this, you must design your application to be idempotent
> (i.e., it
> > must not be adversely affected if it processes the same message more than
> > once).
> > Message Sample
> >
> >
> > The behavior of retrieving messages from the queue depends whether you
> are
> > using short (standard) polling, the default behavior, or long polling.
> For
> > more information about long polling, see Amazon SQS Long Polling .
> >
> > With short polling, when you retrieve messages from the queue, Amazon SQS
> > samples a subset of the servers (based on a weighted random distribution)
> > and returns messages from just those servers. This means that a
> particular
> > receive request might not return all your messages. Or, if you have a
> small
> > number of messages in your queue (less than 1000), it means a particular
> > request might not return any of your messages, whereas a subsequent
> request
> > will. If you keep retrieving from your queues, Amazon SQS will sample
> all of
> > the servers, and you will receive all of your messages.
> >
> > The following figure shows short polling behavior of messages being
> returned
> > after one of your system components makes a receive request. Amazon SQS
> > samples several of the servers (in gray) and returns the messages from
> those
> > servers (Message A, C, D, and B). Message E is not returned to this
> > particular request, but it would be returned to a subsequent request.
> >
> >
> >
> >
> >
> >
> >
> > Presumably SQS has these properties because it makes the system
> scalable, if
> > so does Zaqar have the same properties (not just making these same
> > guarantees in the API, but actually having these properties in the
> > backends)? And if not, why? I looked on the wiki [1] for information on
> > this, but couldn't find anything.
>
> The premise of this thread is flawed I think.
>
> It seems to be predicated on a direct quote from the public
> documentation of a closed-source system justifying some
> assumptions about the internal architecture and design goals
> of that closed-source system.
>
> It then proceeds to hold zaqar to account for not making
> the same choices as that closed-source system.
>
> This puts the zaqar folks in a no-win situation, as it's hard
> to refute such arguments when they have no visibility over
> the innards of that closed-source system.
>
> Sure, the assumption may well be correct that the designers
> of SQS made the choice to expose applications to out-of-order
> messages as this was the only practical way of acheiving their
> scalability goals.
>
> But since the code isn't on github and the design discussions
> aren't publicly archived, we have no way of validating that.
>
> Would it be more reasonable to compare against a cloud-scale
> messaging system that folks may have more direct knowledge
> of?
>
> For example, is HP Cloud Messaging[1] rolled out in full
> production by now?
>
>
Unfortunately the HP Cloud Messaging service was decommissioned.


> Is it still cloning the original Marconi API, or has it kept
> up with the evolution of the API? Has the nature of this API
> been seen as the root cause of any scalability issues?
>
>
We created a RabbitMQ backed implementation that aimed to be API compatible
with Marconi.  This proved difficult given some of the API issues that have
been discussed on this very thread.  Our implementation could never be full
API compatible with Marconi (there really isn’t an easy way to map AMQP to
HTTP, without losing serious functionality).

We also worked closely with the Marconi team, trying to get upstream to
support AMQP — the Marconi team also came to the same conclusion that their
API was not a good fit for such a backend.

Now — we are looking at options.  One that intrigues us has also been
suggested on these threads, specifically building a ‘managed messaging
service’ that could provision various messaging technologies (rabbit,
kafka, etc), and at

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-19 Thread Zane Bitter


On 18/09/14 10:55, Flavio Percoco wrote:

On 09/18/2014 04:24 PM, Clint Byrum wrote:

Great job highlighting what our friends over at Amazon are doing.

It's clear from these snippets, and a few other pieces of documentation
for SQS I've read, that the Amazon team approached SQS from a _massive_
scaling perspective. I think what may be forcing a lot of this frustration
with Zaqar is that it was designed with a much smaller scale in mind.

I think as long as that is the case, the design will remain in question.
I'd be comfortable saying that the use cases I've been thinking about
are entirely fine with the limitations SQS has.


I think these are pretty strong comments with not enough arguments to
defend them.


I actually more or less agree with Clint here. As Joe noted (and *thank 
you* Joe for starting this thread - the first one to compare Zaqar to 
something relevant!), SQS offers very, very limited guarantees, and it's 
clear that the reason for that is to make it massively, massively 
scalable in the way that e.g. S3 is scalable while also remaining 
comparably durable (S3 is supposedly designed for 11 nines, BTW).


Zaqar, meanwhile, seems to be promising the world in terms of 
guarantees. (And then taking it away in the fine print, where it says 
that the operator can disregard many of them, potentially without the 
user's knowledge.)


On the other hand, IIUC Zaqar does in fact have a sharding feature 
("Pools") which is its answer to the massive scaling question. I don't 
know enough details to comment further except to say that it evidently 
has been carefully thought about at some level, and it's really 
frustrating for the Zaqar folks when people just assume that it hasn't 
without doing any research. On the face of it sharding is a decent 
solution for this problem. Maybe we need to dig into the details and 
make sure folks are satisfied that there are no hidden dragons.



Saying that Zaqar was designed with a smaller scale in mid without
actually saying why you think so is not fair besides not being true. So
please, do share why you think Zaqar was not designed for big scales and
provide comments that will help the project to grow and improve.

- Is it because the storage technologies that have been chosen?
- Is it because of the API?
- Is it because of the programing language/framework ?


I didn't read Clint and Devananda's comments as an attack on any of 
these things (although I agree that there have been far too many such 
attacks over the last 12 months from people who didn't bother to do 
their homework first). They're looking at Zaqar from first principles 
and finding that it promises too much, raising the concern the team may 
in future reach a point where they are unable to meet the needs of 
future users (perhaps for scaling reasons) without breaking existing 
users who have come to rely on those promises.



So far, we've just discussed the API semantics and not zaqar's
scalability, which makes your comments even more surprising.


What guarantees you offer can determine pretty much all of the design 
tradeoffs (e.g. latency vs. durability) that you have to make. Some of 
those (e.g. random access to messages) are baked in to the API, but 
others are not. It's actually a real concern to me to see elsewhere in 
this thread that you are punting to operators on many of the latter.


IMO the Zaqar team needs to articulate an opinionated vision of just 
what Zaqar actually is, and why it offers value. And by 'value' here I 
mean there should be $ signs attached.


For example, it makes no sense to me that Zaqar should ever be able to 
run in a mode that doesn't guarantee delivery of messages. There are a 
million and one easy, cheap ways to set up a system that _might_ deliver 
your message. One server running a message broker is sufficient. But if 
you want reliable delivery, then you'll probably need at least 3 (for 
still pretty low values of "reliable"). I did some back-of-the-envelope 
math with the AWS pricing and _conservatively_ for any application 
receiving <100k messages per hour (~30 per second) it's cheaper to use 
SQS than to spin up those servers yourself.


In other words, a service that *guarantees* delivery of messages *has* 
to be run by the cloud operator because for the overwhelming majority of 
applications, the user cannot do so economically.


(I'm assuming here that AWS's _relative_ pricing accurately reflects 
their _relative_ cost basis, which is almost certainly not strictly 
true, but I expect a close enough approximation for these purposes.)


What I would like to hear in this thread is:

"Zaqar is We-Never-Ever-Ever-Ever-Lose-Your-Message as a Service 
(WNEEELYMaaS), and it has to be in OpenStack because only the cloud 
operator can provide that cost-effectively."


What I'm hearing instead is:

- We'll probably deliver your message.
- We can guarantee that we'll deliver your message, but only on clouds 
where the operator has chosen to configure Mong

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-22 Thread Gordon Sim


On 09/19/2014 09:13 PM, Zane Bitter wrote:

SQS offers very, very limited guarantees, and it's clear that the reason
for that is to make it massively, massively scalable in the way that
e.g. S3 is scalable while also remaining comparably durable (S3 is
supposedly designed for 11 nines, BTW).

Zaqar, meanwhile, seems to be promising the world in terms of
guarantees. (And then taking it away in the fine print, where it says
that the operator can disregard many of them, potentially without the
user's knowledge.)

On the other hand, IIUC Zaqar does in fact have a sharding feature
("Pools") which is its answer to the massive scaling question.


There are different dimensions to the scaling problem.

As I understand it, pools don't help scaling a given queue since all the 
messages for that queue must be in the same pool. At present traffic 
through different Zaqar queues are essentially entirely orthogonal 
streams. Pooling can help scale the number of such orthogonal streams, 
but to be honest, that's the easier part of the problem.


There is also the possibility of using the sharding capabilities of the 
underlying storage. But the pattern of use will determine how effective 
that can be.


So for example, on the ordering question, if order is defined by a 
single sequence number held in the database and atomically incremented 
for every message published, that is not likely to be something where 
the databases sharding is going to help in scaling the number of 
concurrent publications.


Though sharding would allow scaling the total number messages on the 
queue (by distributing them over multiple shards), the total ordering of 
those messages reduces it's effectiveness in scaling the number of 
concurrent getters (e.g. the concurrent subscribers in pub-sub) since 
they will all be getting the messages in exactly the same order.


Strict ordering impacts the competing consumers case also (and is in my 
opinion of limited value as a guarantee anyway). At any given time, the 
head of the queue is in one shard, and all concurrent claim requests 
will contend for messages in that same shard. Though the unsuccessful 
claimants may then move to another shard as the head moves, they will 
all again try to access the messages in the same order.


So if Zaqar's goal is to scale the number of orthogonal queues, and the 
number of messages held at any time within these, the pooling facility 
and any sharding capability in the underlying store for a pool would 
likely be effective even with the strict ordering guarantee.


If scaling the number of communicants on a given communication channel 
is a goal however, then strict ordering may hamper that. If it does, it 
seems to me that this is not just a policy tweak on the underlying 
datastore to choose the desired balance between ordering and scale, but 
a more fundamental question on the internal structure of the queue 
implementation built on top of the datastore.


I also get the impression, perhaps wrongly, that providing the strict 
ordering guarantee wasn't necessarily an explicit requirement, but was 
simply a property of the underlying implementation(?).


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-22 Thread Zane Bitter


On 22/09/14 10:11, Gordon Sim wrote:

On 09/19/2014 09:13 PM, Zane Bitter wrote:

SQS offers very, very limited guarantees, and it's clear that the reason
for that is to make it massively, massively scalable in the way that
e.g. S3 is scalable while also remaining comparably durable (S3 is
supposedly designed for 11 nines, BTW).

Zaqar, meanwhile, seems to be promising the world in terms of
guarantees. (And then taking it away in the fine print, where it says
that the operator can disregard many of them, potentially without the
user's knowledge.)

On the other hand, IIUC Zaqar does in fact have a sharding feature
("Pools") which is its answer to the massive scaling question.


There are different dimensions to the scaling problem.


Many thanks for this analysis, Gordon. This is really helpful stuff.


As I understand it, pools don't help scaling a given queue since all the
messages for that queue must be in the same pool. At present traffic
through different Zaqar queues are essentially entirely orthogonal
streams. Pooling can help scale the number of such orthogonal streams,
but to be honest, that's the easier part of the problem.


But I think it's also the important part of the problem. When I talk 
about scaling, I mean 1 million clients sending 10 messages per second 
each, not 10 clients sending 1 million messages per second each.


When a user gets to the point that individual queues have massive 
throughput, it's unlikely that a one-size-fits-all cloud offering like 
Zaqar or SQS is _ever_ going to meet their needs. Those users will want 
to spin up and configure their own messaging systems on Nova servers, 
and at that kind of size they'll be able to afford to. (In fact, they 
may not be able to afford _not_ to, assuming per-message-based pricing.)



There is also the possibility of using the sharding capabilities of the
underlying storage. But the pattern of use will determine how effective
that can be.

So for example, on the ordering question, if order is defined by a
single sequence number held in the database and atomically incremented
for every message published, that is not likely to be something where
the databases sharding is going to help in scaling the number of
concurrent publications.

Though sharding would allow scaling the total number messages on the
queue (by distributing them over multiple shards), the total ordering of
those messages reduces it's effectiveness in scaling the number of
concurrent getters (e.g. the concurrent subscribers in pub-sub) since
they will all be getting the messages in exactly the same order.

Strict ordering impacts the competing consumers case also (and is in my
opinion of limited value as a guarantee anyway). At any given time, the
head of the queue is in one shard, and all concurrent claim requests
will contend for messages in that same shard. Though the unsuccessful
claimants may then move to another shard as the head moves, they will
all again try to access the messages in the same order.

So if Zaqar's goal is to scale the number of orthogonal queues, and the
number of messages held at any time within these, the pooling facility
and any sharding capability in the underlying store for a pool would
likely be effective even with the strict ordering guarantee.


IMHO this is (or should be) the goal - support enormous numbers of 
small-to-moderate sized queues.



If scaling the number of communicants on a given communication channel
is a goal however, then strict ordering may hamper that. If it does, it
seems to me that this is not just a policy tweak on the underlying
datastore to choose the desired balance between ordering and scale, but
a more fundamental question on the internal structure of the queue
implementation built on top of the datastore.


I agree with your analysis, but I don't think this should be a goal.

Note that the user can still implement this themselves using 
application-level sharding - if you know that in-order delivery is not 
important to you, then randomly assign clients to a queue and then poll 
all of the queues in the round-robin. This yields _exactly_ the same 
semantics as SQS.


The reverse is true of SQS - if you want FIFO then you have to implement 
re-ordering by sequence number in your application. (I'm not certain, 
but it also sounds very much like this situation is ripe for losing 
messages when your client dies.)


So the question is: in which use case do we want to push additional 
complexity into the application? The case where there are truly massive 
volumes of messages flowing to a single point? Or the case where the 
application wants the messages in order?


I'd suggest both that the former applications are better able to handle 
that extra complexity and that the latter applications are probably more 
common. So it seems that the Zaqar team made a good decision.


(Aside: it follows that Zaqar probably should have a maximum throughput 
quota for each queue; or that it should report usage information

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

On Mon, Sep 22, 2014 at 9:58 AM, Zane Bitter  wrote:

> On 22/09/14 10:11, Gordon Sim wrote:
>
>> On 09/19/2014 09:13 PM, Zane Bitter wrote:
>>
>>> SQS offers very, very limited guarantees, and it's clear that the reason
>>> for that is to make it massively, massively scalable in the way that
>>> e.g. S3 is scalable while also remaining comparably durable (S3 is
>>> supposedly designed for 11 nines, BTW).
>>>
>>> Zaqar, meanwhile, seems to be promising the world in terms of
>>> guarantees. (And then taking it away in the fine print, where it says
>>> that the operator can disregard many of them, potentially without the
>>> user's knowledge.)
>>>
>>> On the other hand, IIUC Zaqar does in fact have a sharding feature
>>> ("Pools") which is its answer to the massive scaling question.
>>>
>>
>> There are different dimensions to the scaling problem.
>>
>
> Many thanks for this analysis, Gordon. This is really helpful stuff.
>
>  As I understand it, pools don't help scaling a given queue since all the
>> messages for that queue must be in the same pool. At present traffic
>> through different Zaqar queues are essentially entirely orthogonal
>> streams. Pooling can help scale the number of such orthogonal streams,
>> but to be honest, that's the easier part of the problem.
>>
>
> But I think it's also the important part of the problem. When I talk about
> scaling, I mean 1 million clients sending 10 messages per second each, not
> 10 clients sending 1 million messages per second each.
>
> When a user gets to the point that individual queues have massive
> throughput, it's unlikely that a one-size-fits-all cloud offering like
> Zaqar or SQS is _ever_ going to meet their needs. Those users will want to
> spin up and configure their own messaging systems on Nova servers, and at
> that kind of size they'll be able to afford to. (In fact, they may not be
> able to afford _not_ to, assuming per-message-based pricing.)
>

Running a message queue that has a high guarantee of not loosing a message
is hard and SQS promises exactly that, it *will* deliver your message. If a
use case can handle occasionally dropping messages then running your own MQ
makes more sense.

SQS is designed to handle massive queues as well, while I haven't found any
examples of queues that have 1 million messages/second being sent or
received  30k to 100k messages/second is not unheard of [0][1][2].

[0] https://www.youtube.com/watch?v=zwLC5xmCZUs#t=22m53s
[1] http://java.dzone.com/articles/benchmarking-sqs
[2]
http://www.slideshare.net/AmazonWebServices/massive-message-processing-with-amazon-sqs-and-amazon-dynamodb-arc301-aws-reinvent-2013-28431182


>  There is also the possibility of using the sharding capabilities of the
>> underlying storage. But the pattern of use will determine how effective
>> that can be.
>>
>> So for example, on the ordering question, if order is defined by a
>> single sequence number held in the database and atomically incremented
>> for every message published, that is not likely to be something where
>> the databases sharding is going to help in scaling the number of
>> concurrent publications.
>>
>> Though sharding would allow scaling the total number messages on the
>> queue (by distributing them over multiple shards), the total ordering of
>> those messages reduces it's effectiveness in scaling the number of
>> concurrent getters (e.g. the concurrent subscribers in pub-sub) since
>> they will all be getting the messages in exactly the same order.
>>
>> Strict ordering impacts the competing consumers case also (and is in my
>> opinion of limited value as a guarantee anyway). At any given time, the
>> head of the queue is in one shard, and all concurrent claim requests
>> will contend for messages in that same shard. Though the unsuccessful
>> claimants may then move to another shard as the head moves, they will
>> all again try to access the messages in the same order.
>>
>> So if Zaqar's goal is to scale the number of orthogonal queues, and the
>> number of messages held at any time within these, the pooling facility
>> and any sharding capability in the underlying store for a pool would
>> likely be effective even with the strict ordering guarantee.
>>
>
> IMHO this is (or should be) the goal - support enormous numbers of
> small-to-moderate sized queues.


If 50,000 messages per second doesn't count as small-to-moderate then Zaqar
does not fulfill a major SQS use case.


>
>
>  If scaling the number of communicants on a given communication channel
>> is a goal however, then strict ordering may hamper that. If it does, it
>> seems to me that this is not just a policy tweak on the underlying
>> datastore to choose the desired balance between ordering and scale, but
>> a more fundamental question on the internal structure of the queue
>> implementation built on top of the datastore.
>>
>
> I agree with your analysis, but I don't think this should be a goal.
>
> Note that the user can still implement this themselves u

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-22 Thread Zane Bitter

On 22/09/14 17:06, Joe Gordon wrote:

On Mon, Sep 22, 2014 at 9:58 AM, Zane Bitter wrote:

On 22/09/14 10:11, Gordon Sim wrote:

On 09/19/2014 09:13 PM, Zane Bitter wrote:

SQS offers very, very limited guarantees, and it's clear that the reason
for that is to make it massively, massively scalable in the way that
e.g. S3 is scalable while also remaining comparably durable (S3 is
supposedly designed for 11 nines, BTW).

Zaqar, meanwhile, seems to be promising the world in terms of
guarantees. (And then taking it away in the fine print, where it says
that the operator can disregard many of them, potentially without the
user's knowledge.)

On the other hand, IIUC Zaqar does in fact have a sharding feature
("Pools") which is its answer to the massive scaling question.

There are different dimensions to the scaling problem.

Many thanks for this analysis, Gordon. This is really helpful stuff.

As I understand it, pools don't help scaling a given queue since all the

messages for that queue must be in the same pool. At present traffic
through different Zaqar queues are essentially entirely orthogonal
streams. Pooling can help scale the number of such orthogonal streams,
but to be honest, that's the easier part of the problem.

But I think it's also the important part of the problem. When I talk about
scaling, I mean 1 million clients sending 10 messages per second each, not
10 clients sending 1 million messages per second each.

When a user gets to the point that individual queues have massive
throughput, it's unlikely that a one-size-fits-all cloud offering like
Zaqar or SQS is _ever_ going to meet their needs. Those users will want to
spin up and configure their own messaging systems on Nova servers, and at
that kind of size they'll be able to afford to. (In fact, they may not be
able to afford _not_ to, assuming per-message-based pricing.)

Running a message queue that has a high guarantee of not loosing a message
is hard and SQS promises exactly that, it *will* deliver your message. If a
use case can handle occasionally dropping messages then running your own MQ
makes more sense.

SQS is designed to handle massive queues as well, while I haven't found any
examples of queues that have 1 million messages/second being sent or
received 30k to 100k messages/second is not unheard of [0][1][2].

[0] https://www.youtube.com/watch?v=zwLC5xmCZUs#t=22m53s
[1] http://java.dzone.com/articles/benchmarking-sqs
[2]
http://www.slideshare.net/AmazonWebServices/massive-message-processing-with-amazon-sqs-and-amazon-dynamodb-arc301-aws-reinvent-2013-28431182

Thanks for digging those up, that's really helpful input. I think number
[1] kind of summed up part of what I'm arguing here though:

"But once your requirements get above 35k messages per second, chances
are you need custom solutions anyway; not to mention that while SQS is
cheap, it may become expensive with such loads."

There is also the possibility of using the sharding capabilities of the

underlying storage. But the pattern of use will determine how effective
that can be.

So for example, on the ordering question, if order is defined by a
single sequence number held in the database and atomically incremented
for every message published, that is not likely to be something where
the databases sharding is going to help in scaling the number of
concurrent publications.

Though sharding would allow scaling the total number messages on the
queue (by distributing them over multiple shards), the total ordering of
those messages reduces it's effectiveness in scaling the number of
concurrent getters (e.g. the concurrent subscribers in pub-sub) since
they will all be getting the messages in exactly the same order.

Strict ordering impacts the competing consumers case also (and is in my
opinion of limited value as a guarantee anyway). At any given time, the
head of the queue is in one shard, and all concurrent claim requests
will contend for messages in that same shard. Though the unsuccessful
claimants may then move to another shard as the head moves, they will
all again try to access the messages in the same order.

So if Zaqar's goal is to scale the number of orthogonal queues, and the
number of messages held at any time within these, the pooling facility
and any sharding capability in the underlying store for a pool would
likely be effective even with the strict ordering guarantee.

IMHO this is (or should be) the goal - support enormous numbers of
small-to-moderate sized queues.

If 50,000 messages per second doesn't count as small-to-moderate then Zaqar
does not fulfill a major SQS use case.

It's not a drop-in replacement, but as I mentioned you can recreate the
SQS semantics exactly *and* get the scalability benefits of that
approach by sharding at the application level and then round-robin polling.

As I also mentioned, this is pretty easy to implement, and is only
required for really big applications that are more likely to be written

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

On Mon, Sep 22, 2014 at 5:47 PM, Zane Bitter  wrote:

> On 22/09/14 17:06, Joe Gordon wrote:
>
>> On Mon, Sep 22, 2014 at 9:58 AM, Zane Bitter  wrote:
>>
>>  On 22/09/14 10:11, Gordon Sim wrote:
>>>
>>>  On 09/19/2014 09:13 PM, Zane Bitter wrote:

  SQS offers very, very limited guarantees, and it's clear that the
> reason
> for that is to make it massively, massively scalable in the way that
> e.g. S3 is scalable while also remaining comparably durable (S3 is
> supposedly designed for 11 nines, BTW).
>
> Zaqar, meanwhile, seems to be promising the world in terms of
> guarantees. (And then taking it away in the fine print, where it says
> that the operator can disregard many of them, potentially without the
> user's knowledge.)
>
> On the other hand, IIUC Zaqar does in fact have a sharding feature
> ("Pools") which is its answer to the massive scaling question.
>
>
 There are different dimensions to the scaling problem.

>>> Many thanks for this analysis, Gordon. This is really helpful stuff.
>>>
>>>   As I understand it, pools don't help scaling a given queue since all
>>> the
>>>
 messages for that queue must be in the same pool. At present traffic
 through different Zaqar queues are essentially entirely orthogonal
 streams. Pooling can help scale the number of such orthogonal streams,
 but to be honest, that's the easier part of the problem.

>>> But I think it's also the important part of the problem. When I talk
>>> about
>>> scaling, I mean 1 million clients sending 10 messages per second each,
>>> not
>>> 10 clients sending 1 million messages per second each.
>>>
>>> When a user gets to the point that individual queues have massive
>>> throughput, it's unlikely that a one-size-fits-all cloud offering like
>>> Zaqar or SQS is _ever_ going to meet their needs. Those users will want
>>> to
>>> spin up and configure their own messaging systems on Nova servers, and at
>>> that kind of size they'll be able to afford to. (In fact, they may not be
>>> able to afford _not_ to, assuming per-message-based pricing.)
>>>
>>>
>> Running a message queue that has a high guarantee of not loosing a message
>> is hard and SQS promises exactly that, it *will* deliver your message. If
>> a
>> use case can handle occasionally dropping messages then running your own
>> MQ
>> makes more sense.
>>
>> SQS is designed to handle massive queues as well, while I haven't found
>> any
>> examples of queues that have 1 million messages/second being sent or
>> received  30k to 100k messages/second is not unheard of [0][1][2].
>>
>> [0] https://www.youtube.com/watch?v=zwLC5xmCZUs#t=22m53s
>> [1] http://java.dzone.com/articles/benchmarking-sqs
>> [2]
>> http://www.slideshare.net/AmazonWebServices/massive-
>> message-processing-with-amazon-sqs-and-amazon-
>> dynamodb-arc301-aws-reinvent-2013-28431182
>>
>
> Thanks for digging those up, that's really helpful input. I think number
> [1] kind of summed up part of what I'm arguing here though:
>
> "But once your requirements get above 35k messages per second, chances are
> you need custom solutions anyway; not to mention that while SQS is cheap,
> it may become expensive with such loads."

If you don't require the reliability guarantees that SQS provides then
perhaps. But I would be surprised to hear that a user can set up something
with this level of uptime for less:

"Amazon SQS runs within Amazon’s high-availability data centers, so queues
will be available whenever applications need them. To prevent messages from
being lost or becoming unavailable, all messages are stored redundantly
across multiple servers and data centers." [1]

>
>
>There is also the possibility of using the sharding capabilities of the
>>>
 underlying storage. But the pattern of use will determine how effective
 that can be.

 So for example, on the ordering question, if order is defined by a
 single sequence number held in the database and atomically incremented
 for every message published, that is not likely to be something where
 the databases sharding is going to help in scaling the number of
 concurrent publications.

 Though sharding would allow scaling the total number messages on the
 queue (by distributing them over multiple shards), the total ordering of
 those messages reduces it's effectiveness in scaling the number of
 concurrent getters (e.g. the concurrent subscribers in pub-sub) since
 they will all be getting the messages in exactly the same order.

 Strict ordering impacts the competing consumers case also (and is in my
 opinion of limited value as a guarantee anyway). At any given time, the
 head of the queue is in one shard, and all concurrent claim requests
 will contend for messages in that same shard. Though the unsuccessful
 claimants may then move to another shard as the head moves, they will
 all aga

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

On Mon, Sep 22, 2014 at 7:04 PM, Joe Gordon  wrote:

>
>
> On Mon, Sep 22, 2014 at 5:47 PM, Zane Bitter  wrote:
>
>> On 22/09/14 17:06, Joe Gordon wrote:
>>
>>> On Mon, Sep 22, 2014 at 9:58 AM, Zane Bitter  wrote:
>>>
>>>  On 22/09/14 10:11, Gordon Sim wrote:

  On 09/19/2014 09:13 PM, Zane Bitter wrote:
>
>  SQS offers very, very limited guarantees, and it's clear that the
>> reason
>> for that is to make it massively, massively scalable in the way that
>> e.g. S3 is scalable while also remaining comparably durable (S3 is
>> supposedly designed for 11 nines, BTW).
>>
>> Zaqar, meanwhile, seems to be promising the world in terms of
>> guarantees. (And then taking it away in the fine print, where it says
>> that the operator can disregard many of them, potentially without the
>> user's knowledge.)
>>
>> On the other hand, IIUC Zaqar does in fact have a sharding feature
>> ("Pools") which is its answer to the massive scaling question.
>>
>>
> There are different dimensions to the scaling problem.
>
>
 Many thanks for this analysis, Gordon. This is really helpful stuff.

   As I understand it, pools don't help scaling a given queue since all
 the

> messages for that queue must be in the same pool. At present traffic
> through different Zaqar queues are essentially entirely orthogonal
> streams. Pooling can help scale the number of such orthogonal streams,
> but to be honest, that's the easier part of the problem.
>
>
 But I think it's also the important part of the problem. When I talk
 about
 scaling, I mean 1 million clients sending 10 messages per second each,
 not
 10 clients sending 1 million messages per second each.

 When a user gets to the point that individual queues have massive
 throughput, it's unlikely that a one-size-fits-all cloud offering like
 Zaqar or SQS is _ever_ going to meet their needs. Those users will want
 to
 spin up and configure their own messaging systems on Nova servers, and
 at
 that kind of size they'll be able to afford to. (In fact, they may not
 be
 able to afford _not_ to, assuming per-message-based pricing.)


>>> Running a message queue that has a high guarantee of not loosing a
>>> message
>>> is hard and SQS promises exactly that, it *will* deliver your message.
>>> If a
>>> use case can handle occasionally dropping messages then running your own
>>> MQ
>>> makes more sense.
>>>
>>> SQS is designed to handle massive queues as well, while I haven't found
>>> any
>>> examples of queues that have 1 million messages/second being sent or
>>> received  30k to 100k messages/second is not unheard of [0][1][2].
>>>
>>> [0] https://www.youtube.com/watch?v=zwLC5xmCZUs#t=22m53s
>>> [1] http://java.dzone.com/articles/benchmarking-sqs
>>> [2]
>>> http://www.slideshare.net/AmazonWebServices/massive-
>>> message-processing-with-amazon-sqs-and-amazon-
>>> dynamodb-arc301-aws-reinvent-2013-28431182
>>>
>>
>> Thanks for digging those up, that's really helpful input. I think number
>> [1] kind of summed up part of what I'm arguing here though:
>>
>> "But once your requirements get above 35k messages per second, chances
>> are you need custom solutions anyway; not to mention that while SQS is
>> cheap, it may become expensive with such loads."
>
>
> If you don't require the reliability guarantees that SQS provides then
> perhaps. But I would be surprised to hear that a user can set up something
> with this level of uptime for less:
>
> "Amazon SQS runs within Amazon’s high-availability data centers, so queues
> will be available whenever applications need them. To prevent messages from
> being lost or becoming unavailable, all messages are stored redundantly
> across multiple servers and data centers." [1]
>
>
>>
>>
>>There is also the possibility of using the sharding capabilities of the

> underlying storage. But the pattern of use will determine how effective
> that can be.
>
> So for example, on the ordering question, if order is defined by a
> single sequence number held in the database and atomically incremented
> for every message published, that is not likely to be something where
> the databases sharding is going to help in scaling the number of
> concurrent publications.
>
> Though sharding would allow scaling the total number messages on the
> queue (by distributing them over multiple shards), the total ordering
> of
> those messages reduces it's effectiveness in scaling the number of
> concurrent getters (e.g. the concurrent subscribers in pub-sub) since
> they will all be getting the messages in exactly the same order.
>
> Strict ordering impacts the competing consumers case also (and is in my
> opinion of limited value as a guarantee anyway). At any given time, the
> head of the queue is in one shar

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-22 Thread Clint Byrum

Excerpts from Joe Gordon's message of 2014-09-22 19:04:03 -0700:
> On Mon, Sep 22, 2014 at 5:47 PM, Zane Bitter  wrote:
> 
> > On 22/09/14 17:06, Joe Gordon wrote:
> >
> >> On Mon, Sep 22, 2014 at 9:58 AM, Zane Bitter  wrote:
> >>
> >>  On 22/09/14 10:11, Gordon Sim wrote:
> >>>
> >>>  On 09/19/2014 09:13 PM, Zane Bitter wrote:
> 
>   SQS offers very, very limited guarantees, and it's clear that the
> > reason
> > for that is to make it massively, massively scalable in the way that
> > e.g. S3 is scalable while also remaining comparably durable (S3 is
> > supposedly designed for 11 nines, BTW).
> >
> > Zaqar, meanwhile, seems to be promising the world in terms of
> > guarantees. (And then taking it away in the fine print, where it says
> > that the operator can disregard many of them, potentially without the
> > user's knowledge.)
> >
> > On the other hand, IIUC Zaqar does in fact have a sharding feature
> > ("Pools") which is its answer to the massive scaling question.
> >
> >
>  There are different dimensions to the scaling problem.
> 
> 
> >>> Many thanks for this analysis, Gordon. This is really helpful stuff.
> >>>
> >>>   As I understand it, pools don't help scaling a given queue since all
> >>> the
> >>>
>  messages for that queue must be in the same pool. At present traffic
>  through different Zaqar queues are essentially entirely orthogonal
>  streams. Pooling can help scale the number of such orthogonal streams,
>  but to be honest, that's the easier part of the problem.
> 
> 
> >>> But I think it's also the important part of the problem. When I talk
> >>> about
> >>> scaling, I mean 1 million clients sending 10 messages per second each,
> >>> not
> >>> 10 clients sending 1 million messages per second each.
> >>>
> >>> When a user gets to the point that individual queues have massive
> >>> throughput, it's unlikely that a one-size-fits-all cloud offering like
> >>> Zaqar or SQS is _ever_ going to meet their needs. Those users will want
> >>> to
> >>> spin up and configure their own messaging systems on Nova servers, and at
> >>> that kind of size they'll be able to afford to. (In fact, they may not be
> >>> able to afford _not_ to, assuming per-message-based pricing.)
> >>>
> >>>
> >> Running a message queue that has a high guarantee of not loosing a message
> >> is hard and SQS promises exactly that, it *will* deliver your message. If
> >> a
> >> use case can handle occasionally dropping messages then running your own
> >> MQ
> >> makes more sense.
> >>
> >> SQS is designed to handle massive queues as well, while I haven't found
> >> any
> >> examples of queues that have 1 million messages/second being sent or
> >> received  30k to 100k messages/second is not unheard of [0][1][2].
> >>
> >> [0] https://www.youtube.com/watch?v=zwLC5xmCZUs#t=22m53s
> >> [1] http://java.dzone.com/articles/benchmarking-sqs
> >> [2]
> >> http://www.slideshare.net/AmazonWebServices/massive-
> >> message-processing-with-amazon-sqs-and-amazon-
> >> dynamodb-arc301-aws-reinvent-2013-28431182
> >>
> >
> > Thanks for digging those up, that's really helpful input. I think number
> > [1] kind of summed up part of what I'm arguing here though:
> >
> > "But once your requirements get above 35k messages per second, chances are
> > you need custom solutions anyway; not to mention that while SQS is cheap,
> > it may become expensive with such loads."
> 
> 
> If you don't require the reliability guarantees that SQS provides then
> perhaps. But I would be surprised to hear that a user can set up something
> with this level of uptime for less:
> 
> "Amazon SQS runs within Amazon’s high-availability data centers, so queues
> will be available whenever applications need them. To prevent messages from
> being lost or becoming unavailable, all messages are stored redundantly
> across multiple servers and data centers." [1]
> 

This is pretty easily doable with gearman or even just using Redis
directly. But it is still ops for end users. The AWS users I've talked to
who use SQS do so because they like that they can use RDS, SQS, and ELB,
and have only one type of thing to operate: their app.

> >
> >
> >There is also the possibility of using the sharding capabilities of the
> >>>
>  underlying storage. But the pattern of use will determine how effective
>  that can be.
> 
>  So for example, on the ordering question, if order is defined by a
>  single sequence number held in the database and atomically incremented
>  for every message published, that is not likely to be something where
>  the databases sharding is going to help in scaling the number of
>  concurrent publications.
> 
>  Though sharding would allow scaling the total number messages on the
>  queue (by distributing them over multiple shards), the total ordering of
>  those messages reduces it's effectiveness in scaling the n

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

On Mon, Sep 22, 2014 at 8:13 PM, Clint Byrum  wrote:

> Excerpts from Joe Gordon's message of 2014-09-22 19:04:03 -0700:
> > On Mon, Sep 22, 2014 at 5:47 PM, Zane Bitter  wrote:
> >
> > > On 22/09/14 17:06, Joe Gordon wrote:
> > >
> > >> On Mon, Sep 22, 2014 at 9:58 AM, Zane Bitter 
> wrote:
> > >>
> > >>  On 22/09/14 10:11, Gordon Sim wrote:
> > >>>
> > >>>  On 09/19/2014 09:13 PM, Zane Bitter wrote:
> > 
> >   SQS offers very, very limited guarantees, and it's clear that the
> > > reason
> > > for that is to make it massively, massively scalable in the way
> that
> > > e.g. S3 is scalable while also remaining comparably durable (S3 is
> > > supposedly designed for 11 nines, BTW).
> > >
> > > Zaqar, meanwhile, seems to be promising the world in terms of
> > > guarantees. (And then taking it away in the fine print, where it
> says
> > > that the operator can disregard many of them, potentially without
> the
> > > user's knowledge.)
> > >
> > > On the other hand, IIUC Zaqar does in fact have a sharding feature
> > > ("Pools") which is its answer to the massive scaling question.
> > >
> > >
> >  There are different dimensions to the scaling problem.
> > 
> > 
> > >>> Many thanks for this analysis, Gordon. This is really helpful stuff.
> > >>>
> > >>>   As I understand it, pools don't help scaling a given queue since
> all
> > >>> the
> > >>>
> >  messages for that queue must be in the same pool. At present traffic
> >  through different Zaqar queues are essentially entirely orthogonal
> >  streams. Pooling can help scale the number of such orthogonal
> streams,
> >  but to be honest, that's the easier part of the problem.
> > 
> > 
> > >>> But I think it's also the important part of the problem. When I talk
> > >>> about
> > >>> scaling, I mean 1 million clients sending 10 messages per second
> each,
> > >>> not
> > >>> 10 clients sending 1 million messages per second each.
> > >>>
> > >>> When a user gets to the point that individual queues have massive
> > >>> throughput, it's unlikely that a one-size-fits-all cloud offering
> like
> > >>> Zaqar or SQS is _ever_ going to meet their needs. Those users will
> want
> > >>> to
> > >>> spin up and configure their own messaging systems on Nova servers,
> and at
> > >>> that kind of size they'll be able to afford to. (In fact, they may
> not be
> > >>> able to afford _not_ to, assuming per-message-based pricing.)
> > >>>
> > >>>
> > >> Running a message queue that has a high guarantee of not loosing a
> message
> > >> is hard and SQS promises exactly that, it *will* deliver your
> message. If
> > >> a
> > >> use case can handle occasionally dropping messages then running your
> own
> > >> MQ
> > >> makes more sense.
> > >>
> > >> SQS is designed to handle massive queues as well, while I haven't
> found
> > >> any
> > >> examples of queues that have 1 million messages/second being sent or
> > >> received  30k to 100k messages/second is not unheard of [0][1][2].
> > >>
> > >> [0] https://www.youtube.com/watch?v=zwLC5xmCZUs#t=22m53s
> > >> [1] http://java.dzone.com/articles/benchmarking-sqs
> > >> [2]
> > >> http://www.slideshare.net/AmazonWebServices/massive-
> > >> message-processing-with-amazon-sqs-and-amazon-
> > >> dynamodb-arc301-aws-reinvent-2013-28431182
> > >>
> > >
> > > Thanks for digging those up, that's really helpful input. I think
> number
> > > [1] kind of summed up part of what I'm arguing here though:
> > >
> > > "But once your requirements get above 35k messages per second, chances
> are
> > > you need custom solutions anyway; not to mention that while SQS is
> cheap,
> > > it may become expensive with such loads."
> >
> >
> > If you don't require the reliability guarantees that SQS provides then
> > perhaps. But I would be surprised to hear that a user can set up
> something
> > with this level of uptime for less:
> >
> > "Amazon SQS runs within Amazon’s high-availability data centers, so
> queues
> > will be available whenever applications need them. To prevent messages
> from
> > being lost or becoming unavailable, all messages are stored redundantly
> > across multiple servers and data centers." [1]
> >
>
> This is pretty easily doable with gearman or even just using Redis
> directly. But it is still ops for end users. The AWS users I've talked to
> who use SQS do so because they like that they can use RDS, SQS, and ELB,
> and have only one type of thing to operate: their app.
>
> > >
> > >
> > >There is also the possibility of using the sharding capabilities of
> the
> > >>>
> >  underlying storage. But the pattern of use will determine how
> effective
> >  that can be.
> > 
> >  So for example, on the ordering question, if order is defined by a
> >  single sequence number held in the database and atomically
> incremented
> >  for every message published, that is not likely to be something
> where
> >  the databa

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-23 Thread Gordon Sim


On 09/22/2014 05:58 PM, Zane Bitter wrote:

On 22/09/14 10:11, Gordon Sim wrote:

As I understand it, pools don't help scaling a given queue since all the
messages for that queue must be in the same pool. At present traffic
through different Zaqar queues are essentially entirely orthogonal
streams. Pooling can help scale the number of such orthogonal streams,
but to be honest, that's the easier part of the problem.


But I think it's also the important part of the problem. When I talk
about scaling, I mean 1 million clients sending 10 messages per second
each, not 10 clients sending 1 million messages per second each.


I wasn't really talking about high throughput per producer (which I 
agree is not going to be a good fit), but about e.g. a large number of 
subscribers for the same set of messages, e.g. publishing one message 
per second to 10,000 subscribers.


Even at much smaller scale, expanding from 10 subscribers to say 100 
seems relatively modest but the subscriber related load would increase 
by a factor of 10. I think handling these sorts of changes is also an 
important part of the problem (though perhaps not a part that Zaqar is 
focused on).



When a user gets to the point that individual queues have massive
throughput, it's unlikely that a one-size-fits-all cloud offering like
Zaqar or SQS is _ever_ going to meet their needs. Those users will want
to spin up and configure their own messaging systems on Nova servers,
and at that kind of size they'll be able to afford to. (In fact, they
may not be able to afford _not_ to, assuming per-message-based pricing.)


[...]

If scaling the number of communicants on a given communication channel
is a goal however, then strict ordering may hamper that. If it does, it
seems to me that this is not just a policy tweak on the underlying
datastore to choose the desired balance between ordering and scale, but
a more fundamental question on the internal structure of the queue
implementation built on top of the datastore.


I agree with your analysis, but I don't think this should be a goal.


I think it's worth clarifying that alongside the goals since scaling can 
mean different things to different people. The implication then is that 
there is some limit in the number of producers and/or consumers on a 
queue beyond which the service won't scale and applications need to 
design around that.



Note that the user can still implement this themselves using
application-level sharding - if you know that in-order delivery is not
important to you, then randomly assign clients to a queue and then poll
all of the queues in the round-robin. This yields _exactly_ the same
semantics as SQS.


You can certainly leave the problem of scaling in this dimension to the 
application itself by having them split the traffic into orthogonal 
streams or hooking up orthogonal streams to provide an aggregated stream.


A true distributed queue isn't entirely trivial, but it may well be that 
most applications can get by with a much simpler approximation.


Distributed (pub-sub) topic semantics are easier to implement, but if 
the application is responsible for keeping the partitions connected, 
then it also takes on part of the burden for availability and redundancy.



The reverse is true of SQS - if you want FIFO then you have to implement
re-ordering by sequence number in your application. (I'm not certain,
but it also sounds very much like this situation is ripe for losing
messages when your client dies.)

So the question is: in which use case do we want to push additional
complexity into the application? The case where there are truly massive
volumes of messages flowing to a single point?  Or the case where the
application wants the messages in order?


I think the first case is more generally about increasing the number of 
communicating parties (publishers or subscribers or both).


For competing consumers ordering isn't usually a concern since you are 
processing in parallel anyway (if it is important you need some notion 
of message grouping within which order is preserved and some stickiness 
between group and consumer).


For multiple non-competing consumers the choice needn't be as simple as 
total ordering or no ordering at all. Many systems quite naturally only 
define partial ordering which can be guaranteed more scalably.


That's not to deny that there are indeed cases where total ordering may 
be required however.



I'd suggest both that the former applications are better able to handle
that extra complexity and that the latter applications are probably more
common. So it seems that the Zaqar team made a good decision.


If that was a deliberate decision it would be worth clarifying in the 
goals. It seems to be a different conclusion from that reached by SQS 
and as such is part of the answer to the question that began the thread.



(Aside: it follows that Zaqar probably should have a maximum throughput
quota for each queue; or that it should report usage info

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

On 09/23/2014 05:13 AM, Clint Byrum wrote:
> Excerpts from Joe Gordon's message of 2014-09-22 19:04:03 -0700:

[snip]

>>
>> To me this is less about valid or invalid choices. The Zaqar team is
>> comparing Zaqar to SQS, but after digging into the two of them, zaqar
>> barely looks like SQS. Zaqar doesn't guarantee what IMHO is the most
>> important parts of SQS: the message will be delivered and will never be
>> lost by SQS. Zaqar doesn't have the same scaling properties as SQS. Zaqar
>> is aiming for low latency per message, SQS doesn't appear to be. So if
>> Zaqar isn't SQS what is Zaqar and why should I use it?
>>
> 
> I have to agree. I'd like to see a simple, non-ordered, high latency,
> high scale messaging service that can be used cheaply by cloud operators
> and users. What I see instead is a very powerful, ordered, low latency,
> medium scale messaging service that will likely cost a lot to scale out
> to the thousands of users level.

I don't fully agree :D

Let me break the above down into several points:

* Zaqar team is comparing Zaqar to SQS: True, we're comparing to the
*type* of service SQS is but not *all* the guarantees it gives. We're
not working on an exact copy of the service but on a service capable of
addressing the same use cases.

* Zaqar is not guaranteeing reliability: This is not true. Yes, the
current default write concern for the mongodb driver is `acknowledge`
but that's a bug, not a feature [0] ;)

* Zaqar doesn't have the same scaling properties as SQS: What are SQS
scaling properties? We know they have a big user base, we know they have
lots of connections, queues and what not but we don't have numbers to
compare ourselves with.

* Zaqar is aiming for low latency per message: This is not true and I'd
be curious to know where did this come from. A couple of things to consider:

- First and foremost, low latency is a very relative measure  and it
depends on each use-case.
- The benchmarks Kurt did were purely informative. I believe it's good
to do them every once in a while but this doesn't mean the team is
mainly focused on that.
- Not being focused on 'low-latency' does not mean the team will
overlook performance.

* Zaqar has FIFO and SQS doesn't: FIFO won't hurt *your use-case* if
ordering is not a requirement but the lack of it does when ordering is a
must.

* Scaling out Zaqar will cost a lot: In terms of what? I'm pretty sure
it's not for free but I'd like to understand better this point and
figure out a way to improve it, if possible.

* If Zaqar isn't SQS then what is it? Why should I use it?: I don't
believe Zaqar is SQS as I don't believe nova is EC2. Do they share
similar features and provide similar services? Yes, does that mean you
can address similar use cases, hence a similar users? Yes.

In addition to the above, I believe Zaqar is a simple service, easy to
install and to interact with. From a user perspective the semantics are
few and the concepts are neither new nor difficult to grasp. From an
operators perspective, I don't believe it adds tons of complexity. It
does require the operator to deploy a replicated storage environment but
I believe all services require that.

Cheers,
Flavio

P.S: Sorry for my late answer or lack of it. I lost *all* my emails
yesterday and I'm working on recovering them.

[0] https://bugs.launchpad.net/zaqar/+bug/1372335

-- 
@flaper87
Flavio Percoco

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

On 09/23/2014 10:58 AM, Gordon Sim wrote:
> On 09/22/2014 05:58 PM, Zane Bitter wrote:
>> On 22/09/14 10:11, Gordon Sim wrote:
>>> As I understand it, pools don't help scaling a given queue since all the
>>> messages for that queue must be in the same pool. At present traffic
>>> through different Zaqar queues are essentially entirely orthogonal
>>> streams. Pooling can help scale the number of such orthogonal streams,
>>> but to be honest, that's the easier part of the problem.
>>
>> But I think it's also the important part of the problem. When I talk
>> about scaling, I mean 1 million clients sending 10 messages per second
>> each, not 10 clients sending 1 million messages per second each.
> 
> I wasn't really talking about high throughput per producer (which I
> agree is not going to be a good fit), but about e.g. a large number of
> subscribers for the same set of messages, e.g. publishing one message
> per second to 10,000 subscribers.
> 
> Even at much smaller scale, expanding from 10 subscribers to say 100
> seems relatively modest but the subscriber related load would increase
> by a factor of 10. I think handling these sorts of changes is also an
> important part of the problem (though perhaps not a part that Zaqar is
> focused on).
> 
>> When a user gets to the point that individual queues have massive
>> throughput, it's unlikely that a one-size-fits-all cloud offering like
>> Zaqar or SQS is _ever_ going to meet their needs. Those users will want
>> to spin up and configure their own messaging systems on Nova servers,
>> and at that kind of size they'll be able to afford to. (In fact, they
>> may not be able to afford _not_ to, assuming per-message-based pricing.)
> 
> [...]
>>> If scaling the number of communicants on a given communication channel
>>> is a goal however, then strict ordering may hamper that. If it does, it
>>> seems to me that this is not just a policy tweak on the underlying
>>> datastore to choose the desired balance between ordering and scale, but
>>> a more fundamental question on the internal structure of the queue
>>> implementation built on top of the datastore.
>>
>> I agree with your analysis, but I don't think this should be a goal.
> 
> I think it's worth clarifying that alongside the goals since scaling can
> mean different things to different people. The implication then is that
> there is some limit in the number of producers and/or consumers on a
> queue beyond which the service won't scale and applications need to
> design around that.

Agreed. The above is not part of Zaqar's goals. That is to say that each
store knows best how to distribute reads and writes itself. Nonetheless,
drivers can be very smart about this and be implemented in ways they'd
take the most out of the backend.


>> Note that the user can still implement this themselves using
>> application-level sharding - if you know that in-order delivery is not
>> important to you, then randomly assign clients to a queue and then poll
>> all of the queues in the round-robin. This yields _exactly_ the same
>> semantics as SQS.
> 
> You can certainly leave the problem of scaling in this dimension to the
> application itself by having them split the traffic into orthogonal
> streams or hooking up orthogonal streams to provide an aggregated stream.
> 
> A true distributed queue isn't entirely trivial, but it may well be that
> most applications can get by with a much simpler approximation.
> 
> Distributed (pub-sub) topic semantics are easier to implement, but if
> the application is responsible for keeping the partitions connected,
> then it also takes on part of the burden for availability and redundancy.
> 
>> The reverse is true of SQS - if you want FIFO then you have to implement
>> re-ordering by sequence number in your application. (I'm not certain,
>> but it also sounds very much like this situation is ripe for losing
>> messages when your client dies.)
>>
>> So the question is: in which use case do we want to push additional
>> complexity into the application? The case where there are truly massive
>> volumes of messages flowing to a single point?  Or the case where the
>> application wants the messages in order?
> 
> I think the first case is more generally about increasing the number of
> communicating parties (publishers or subscribers or both).
> 
> For competing consumers ordering isn't usually a concern since you are
> processing in parallel anyway (if it is important you need some notion
> of message grouping within which order is preserved and some stickiness
> between group and consumer).
> 
> For multiple non-competing consumers the choice needn't be as simple as
> total ordering or no ordering at all. Many systems quite naturally only
> define partial ordering which can be guaranteed more scalably.
> 
> That's not to deny that there are indeed cases where total ordering may
> be required however.
> 
>> I'd suggest both that the former applications are better able to handle
>> that extra

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-23 Thread Zane Bitter


On 23/09/14 08:58, Flavio Percoco wrote:

I believe the guarantee is still useful and it currently does not
represent an issue for the service nor the user. 2 things could happen
to FIFO in the future:

1. It's made optional and we allow users to opt-in in a per flavor
basis. (I personally don't like this one because it makes
interoperability even harder).


Hmm, I'm not so sure this is such a bad option. I criticised flavours 
earlier in this thread on the assumption that it meant every storage 
back-end would have its own semantics, and those would be surfaced to 
the user in the form of flavours - that does indeed make 
interoperability very hard.


The same issue does not arise for options implemented in Zaqar itself. 
If every back-end supports FIFO semantics but Zaqar has a layer that 
distributes the queue among multiple backends or not, depending on the 
flavour selected by the user, then there would be no impact on 
interoperability as the same semantics would be available regardless of 
the back-end chosen by the operator.



2. It's removed completely (Again, I personally don't like this one
because I don't think we have strong enough cases to require this to
happen).

That said, there's just 1 thing I think will happen for now, it'll be
kept as-is unless there are strong cases that'd require (1) or (2). All
this should be considered in the discussion of the API v2, whenever that
happens.


I think this is it ;)

cheers,
Zane.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-23 Thread Zane Bitter


On 22/09/14 22:04, Joe Gordon wrote:

To me this is less about valid or invalid choices. The Zaqar team is
comparing Zaqar to SQS, but after digging into the two of them, zaqar
barely looks like SQS. Zaqar doesn't guarantee what IMHO is the most
important parts of SQS: the message will be delivered and will never be
lost by SQS.


I agree that this is the most important feature. Happily, Flavio has 
clarified this in his other thread[1]:


 *Zaqar's vision is to provide a cross-cloud interoperable,
  fully-reliable messaging service at scale that is both, easy and not
  invasive, for deployers and users.*

  ...

  Zaqar aims to be a fully-reliable service, therefore messages should
  never be lost under any circumstances except for when the message's
  expiration time (ttl) is reached

So Zaqar _will_ guarantee reliable delivery.


Zaqar doesn't have the same scaling properties as SQS.


This is true. (That's not to say it won't scale, but it doesn't scale in 
exactly the same way that SQS does because it has a different architecture.)


It appears that the main reason for this is the ordering guarantee, 
which was introduced in response to feedback from users. So this is 
clearly a different design choice: SQS chose reliability plus 
effectively infinite scalability, while Zaqar chose reliability plus 
FIFO. It's not feasible to satisfy all three simultaneously, so the 
options are:


1) Implement two separate modes and allow the user to decide
2) Continue to choose FIFO over infinite scalability
3) Drop FIFO and choose infinite scalability instead

This is one of the key points on which we need to get buy-in from the 
community on selecting one of these as the long-term strategy.



Zaqar is aiming for low latency per message, SQS doesn't appear to be.


I've seen no evidence that Zaqar is actually aiming for that. There are 
waaay lower-latency ways to implement messaging if you don't care about 
durability (you wouldn't do store-and-forward, for a start). If you see 
a lot of talk about low latency, it's probably because for a long time 
people insisted on comparing Zaqar to RabbitMQ instead of SQS.


(Let's also be careful not to talk about high latency as if it were a 
virtue in itself; it's simply something we would happily trade off for 
other properties. Zaqar _is_ making that trade-off.)



So if Zaqar isn't SQS what is Zaqar and why should I use it?


If you are a small-to-medium user of an SQS-like service, Zaqar is like 
SQS but better because not only does it never lose your messages but 
they always arrive in order, and you have the option to fan them out to 
multiple subscribers. If you are a very large user along one particular 
dimension (I believe it's number of messages delivered from a single 
queue, but probably Gordon will correct me :D) then Zaqar may not _yet_ 
have a good story for you.


cheers,
Zane.

[1] 
http://lists.openstack.org/pipermail/openstack-dev/2014-September/046809.html


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-23 Thread Fox, Kevin M

Flavio wrote> "The reasoning, as explained in an other
email, is that from a use-case perspective, strict ordering won't hurt
you if you don't need it whereas having to implement it in the client
side because the service doesn't provide it can be a PITA."

The reasoning is flawed though. If performance is a concern, having strict 
ordering costs you when you may not care!

For example, is it better to implement a video streaming service on tcp or udp 
if firewalls aren't a concern? The latter. Why? Because ordering is a problem 
for these systems! If you have frames, 1 2 and 3..., and frame 2 gets lost on 
the first transmit and needs resending, but 3 gets there, the system has to 
wait to display frame 3 waiting for frame 2. But by the time frame 2 gets 
there, frame 3 doesn't matter because the system needs to move on to frame 5 
now. The human eye doens't care to wait for retransmits of frames. it only 
cares about the now. So because of the ordering, the eye sees 3 dropped frames 
instead of just one. making the system worse, not better.

Yeah, I know its a bit of a silly example. No one would implement video 
streaming on top of messaging like that. But it does present the point that 
something that seemingly only provides good things (order is always better then 
disorder, right?), sometimes has unintended and negative side affects. In 
lossless systems, it can show up as unnecessary latency or higher cpu loads.

I think your option 1 will make Zaqar much more palatable to those that don't 
need the strict ordering requirement.

I'm glad you want to make hard things like guaranteed ordering available so 
that users don't have to deal with it themselves if they don't want to. Its a 
great feature. But it also is an anti-feature in some cases. The ramifications 
of its requirements are higher then you think, and a feature to just disable it 
shouldn't be very costly to implement.

Part of the controversy right now, I think, has been not understanding the use 
case here, and by insisting that FIFO only ever is positive, it makes others 
that know its negatives question what other assumptions were made in Zaqar and 
makes them a little gun shy.

Please do reconsider this stance.

Thanks,
Kevin

From: Flavio Percoco [fla...@redhat.com]
Sent: Tuesday, September 23, 2014 5:58 AM
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed 
Queues

On 09/23/2014 10:58 AM, Gordon Sim wrote:
> On 09/22/2014 05:58 PM, Zane Bitter wrote:
>> On 22/09/14 10:11, Gordon Sim wrote:
>>> As I understand it, pools don't help scaling a given queue since all the
>>> messages for that queue must be in the same pool. At present traffic
>>> through different Zaqar queues are essentially entirely orthogonal
>>> streams. Pooling can help scale the number of such orthogonal streams,
>>> but to be honest, that's the easier part of the problem.
>>
>> But I think it's also the important part of the problem. When I talk
>> about scaling, I mean 1 million clients sending 10 messages per second
>> each, not 10 clients sending 1 million messages per second each.
>
> I wasn't really talking about high throughput per producer (which I
> agree is not going to be a good fit), but about e.g. a large number of
> subscribers for the same set of messages, e.g. publishing one message
> per second to 10,000 subscribers.
>
> Even at much smaller scale, expanding from 10 subscribers to say 100
> seems relatively modest but the subscriber related load would increase
> by a factor of 10. I think handling these sorts of changes is also an
> important part of the problem (though perhaps not a part that Zaqar is
> focused on).
>
>> When a user gets to the point that individual queues have massive
>> throughput, it's unlikely that a one-size-fits-all cloud offering like
>> Zaqar or SQS is _ever_ going to meet their needs. Those users will want
>> to spin up and configure their own messaging systems on Nova servers,
>> and at that kind of size they'll be able to afford to. (In fact, they
>> may not be able to afford _not_ to, assuming per-message-based pricing.)
>
> [...]
>>> If scaling the number of communicants on a given communication channel
>>> is a goal however, then strict ordering may hamper that. If it does, it
>>> seems to me that this is not just a policy tweak on the underlying
>>> datastore to choose the desired balance between ordering and scale, but
>>> a more fundamental question on the internal structure of the queue
>>> implementation built on top of the datastore.
>>
>> I agree with your analysis, but I don't think this

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

On 09/23/2014 09:29 PM, Fox, Kevin M wrote:
> Flavio wrote> "The reasoning, as explained in an other
> email, is that from a use-case perspective, strict ordering won't hurt
> you if you don't need it whereas having to implement it in the client
> side because the service doesn't provide it can be a PITA."
> 
> The reasoning is flawed though. If performance is a concern, having strict 
> ordering costs you when you may not care!
> 
> For example, is it better to implement a video streaming service on tcp or 
> udp if firewalls aren't a concern? The latter. Why? Because ordering is a 
> problem for these systems! If you have frames, 1 2 and 3..., and frame 2 gets 
> lost on the first transmit and needs resending, but 3 gets there, the system 
> has to wait to display frame 3 waiting for frame 2. But by the time frame 2 
> gets there, frame 3 doesn't matter because the system needs to move on to 
> frame 5 now. The human eye doens't care to wait for retransmits of frames. it 
> only cares about the now. So because of the ordering, the eye sees 3 dropped 
> frames instead of just one. making the system worse, not better.
> 
> Yeah, I know its a bit of a silly example. No one would implement video 
> streaming on top of messaging like that. But it does present the point that 
> something that seemingly only provides good things (order is always better 
> then disorder, right?), sometimes has unintended and negative side affects. 
> In lossless systems, it can show up as unnecessary latency or higher cpu 
> loads.
> 
> I think your option 1 will make Zaqar much more palatable to those that don't 
> need the strict ordering requirement.
> 
> I'm glad you want to make hard things like guaranteed ordering available so 
> that users don't have to deal with it themselves if they don't want to. Its a 
> great feature. But it also is an anti-feature in some cases. The 
> ramifications of its requirements are higher then you think, and a feature to 
> just disable it shouldn't be very costly to implement.
> 
> Part of the controversy right now, I think, has been not understanding the 
> use case here, and by insisting that FIFO only ever is positive, it makes 
> others that know its negatives question what other assumptions were made in 
> Zaqar and makes them a little gun shy.
> 
> Please do reconsider this stance.


Hey Kevin,

FWIW, I explicitly said "from a use-case perspective" which in the
context of the emails I was replying to referred to the need (or not)
for FIFO and not to the impact it has in other areas like performance.

In any way I tried to insist that FIFO is only ever positive and I've
also explicitly said in several other emails that it *does* have an
impact on performance.

That said, I agree that if FIFO's reality in Zaqar changes, it'll likely
be towards the option (1).

Thanks for your feedback,
Flavio

> 
> Thanks,
> Kevin
> 
> 
> ____________
> From: Flavio Percoco [fla...@redhat.com]
> Sent: Tuesday, September 23, 2014 5:58 AM
> To: openstack-dev@lists.openstack.org
> Subject: Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed 
> Queues
> 
> On 09/23/2014 10:58 AM, Gordon Sim wrote:
>> On 09/22/2014 05:58 PM, Zane Bitter wrote:
>>> On 22/09/14 10:11, Gordon Sim wrote:
>>>> As I understand it, pools don't help scaling a given queue since all the
>>>> messages for that queue must be in the same pool. At present traffic
>>>> through different Zaqar queues are essentially entirely orthogonal
>>>> streams. Pooling can help scale the number of such orthogonal streams,
>>>> but to be honest, that's the easier part of the problem.
>>>
>>> But I think it's also the important part of the problem. When I talk
>>> about scaling, I mean 1 million clients sending 10 messages per second
>>> each, not 10 clients sending 1 million messages per second each.
>>
>> I wasn't really talking about high throughput per producer (which I
>> agree is not going to be a good fit), but about e.g. a large number of
>> subscribers for the same set of messages, e.g. publishing one message
>> per second to 10,000 subscribers.
>>
>> Even at much smaller scale, expanding from 10 subscribers to say 100
>> seems relatively modest but the subscriber related load would increase
>> by a factor of 10. I think handling these sorts of changes is also an
>> important part of the problem (though perhaps not a part that Zaqar is
>> focused on).
>>
>>> When a user gets to the point that individual queues have massive
>>> throughpu

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-23 Thread Joe Gordon

On Tue, Sep 23, 2014 at 9:13 AM, Zane Bitter  wrote:

> On 22/09/14 22:04, Joe Gordon wrote:
>
>> To me this is less about valid or invalid choices. The Zaqar team is
>> comparing Zaqar to SQS, but after digging into the two of them, zaqar
>> barely looks like SQS. Zaqar doesn't guarantee what IMHO is the most
>> important parts of SQS: the message will be delivered and will never be
>> lost by SQS.
>>
>
> I agree that this is the most important feature. Happily, Flavio has
> clarified this in his other thread[1]:
>
>  *Zaqar's vision is to provide a cross-cloud interoperable,
>   fully-reliable messaging service at scale that is both, easy and not
>   invasive, for deployers and users.*
>
>   ...
>
>   Zaqar aims to be a fully-reliable service, therefore messages should
>   never be lost under any circumstances except for when the message's
>   expiration time (ttl) is reached
>
> So Zaqar _will_ guarantee reliable delivery.
>
>  Zaqar doesn't have the same scaling properties as SQS.
>>
>
> This is true. (That's not to say it won't scale, but it doesn't scale in
> exactly the same way that SQS does because it has a different architecture.)
>
> It appears that the main reason for this is the ordering guarantee, which
> was introduced in response to feedback from users. So this is clearly a
> different design choice: SQS chose reliability plus effectively infinite
> scalability, while Zaqar chose reliability plus FIFO. It's not feasible to
> satisfy all three simultaneously, so the options are:
>
> 1) Implement two separate modes and allow the user to decide
> 2) Continue to choose FIFO over infinite scalability
> 3) Drop FIFO and choose infinite scalability instead
>
> This is one of the key points on which we need to get buy-in from the
> community on selecting one of these as the long-term strategy.
>
>  Zaqar is aiming for low latency per message, SQS doesn't appear to be.
>>
>
> I've seen no evidence that Zaqar is actually aiming for that. There are
> waaay lower-latency ways to implement messaging if you don't care about
> durability (you wouldn't do store-and-forward, for a start). If you see a
> lot of talk about low latency, it's probably because for a long time people
> insisted on comparing Zaqar to RabbitMQ instead of SQS.
>

I thought this was why Zaqar uses Falcon and not Pecan/WSME?

"For an application like Marconi where throughput and latency is of
paramount importance, I recommend Falcon over Pecan."
https://wiki.openstack.org/wiki/Zaqar/pecan-evaluation#Recommendation

Yes that statement mentions throughput as well, but it does mention latency
as well.


>
> (Let's also be careful not to talk about high latency as if it were a
> virtue in itself; it's simply something we would happily trade off for
> other properties. Zaqar _is_ making that trade-off.)
>
>  So if Zaqar isn't SQS what is Zaqar and why should I use it?
>>
>
> If you are a small-to-medium user of an SQS-like service, Zaqar is like
> SQS but better because not only does it never lose your messages but they
> always arrive in order, and you have the option to fan them out to multiple
> subscribers. If you are a very large user along one particular dimension (I
> believe it's number of messages delivered from a single queue, but probably
> Gordon will correct me :D) then Zaqar may not _yet_ have a good story for
> you.
>
> cheers,
> Zane.
>
> [1] http://lists.openstack.org/pipermail/openstack-dev/2014-
> September/046809.html
>
>
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-23 Thread Joe Gordon

On Tue, Sep 23, 2014 at 2:40 AM, Flavio Percoco  wrote:

> On 09/23/2014 05:13 AM, Clint Byrum wrote:
> > Excerpts from Joe Gordon's message of 2014-09-22 19:04:03 -0700:
>
> [snip]
>
> >>
> >> To me this is less about valid or invalid choices. The Zaqar team is
> >> comparing Zaqar to SQS, but after digging into the two of them, zaqar
> >> barely looks like SQS. Zaqar doesn't guarantee what IMHO is the most
> >> important parts of SQS: the message will be delivered and will never be
> >> lost by SQS. Zaqar doesn't have the same scaling properties as SQS.
> Zaqar
> >> is aiming for low latency per message, SQS doesn't appear to be. So if
> >> Zaqar isn't SQS what is Zaqar and why should I use it?
> >>
> >
> > I have to agree. I'd like to see a simple, non-ordered, high latency,
> > high scale messaging service that can be used cheaply by cloud operators
> > and users. What I see instead is a very powerful, ordered, low latency,
> > medium scale messaging service that will likely cost a lot to scale out
> > to the thousands of users level.
>
> I don't fully agree :D
>
> Let me break the above down into several points:
>
> * Zaqar team is comparing Zaqar to SQS: True, we're comparing to the
> *type* of service SQS is but not *all* the guarantees it gives. We're
> not working on an exact copy of the service but on a service capable of
> addressing the same use cases.
>
> * Zaqar is not guaranteeing reliability: This is not true. Yes, the
> current default write concern for the mongodb driver is `acknowledge`
> but that's a bug, not a feature [0] ;)
>
> * Zaqar doesn't have the same scaling properties as SQS: What are SQS
> scaling properties? We know they have a big user base, we know they have
> lots of connections, queues and what not but we don't have numbers to
> compare ourselves with.
>

Here is *a* number
30k messages per second on a single queue:
http://java.dzone.com/articles/benchmarking-sqs


>
> * Zaqar is aiming for low latency per message: This is not true and I'd
> be curious to know where did this come from. A couple of things to
> consider:
>
> - First and foremost, low latency is a very relative measure  and
> it
> depends on each use-case.
> - The benchmarks Kurt did were purely informative. I believe it's
> good
> to do them every once in a while but this doesn't mean the team is
> mainly focused on that.
> - Not being focused on 'low-latency' does not mean the team will
> overlook performance.
>
> * Zaqar has FIFO and SQS doesn't: FIFO won't hurt *your use-case* if
> ordering is not a requirement but the lack of it does when ordering is a
> must.
>
> * Scaling out Zaqar will cost a lot: In terms of what? I'm pretty sure
> it's not for free but I'd like to understand better this point and
> figure out a way to improve it, if possible.
>
> * If Zaqar isn't SQS then what is it? Why should I use it?: I don't
> believe Zaqar is SQS as I don't believe nova is EC2. Do they share
> similar features and provide similar services? Yes, does that mean you
> can address similar use cases, hence a similar users? Yes.
>
> In addition to the above, I believe Zaqar is a simple service, easy to
> install and to interact with. From a user perspective the semantics are
> few and the concepts are neither new nor difficult to grasp. From an
> operators perspective, I don't believe it adds tons of complexity. It
> does require the operator to deploy a replicated storage environment but
> I believe all services require that.
>
> Cheers,
> Flavio
>
> P.S: Sorry for my late answer or lack of it. I lost *all* my emails
> yesterday and I'm working on recovering them.
>
> [0] https://bugs.launchpad.net/zaqar/+bug/1372335
>
> --
> @flaper87
> Flavio Percoco
>
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-23 Thread Clint Byrum

Excerpts from Joe Gordon's message of 2014-09-23 14:59:33 -0700:
> On Tue, Sep 23, 2014 at 9:13 AM, Zane Bitter  wrote:
> 
> > On 22/09/14 22:04, Joe Gordon wrote:
> >
> >> To me this is less about valid or invalid choices. The Zaqar team is
> >> comparing Zaqar to SQS, but after digging into the two of them, zaqar
> >> barely looks like SQS. Zaqar doesn't guarantee what IMHO is the most
> >> important parts of SQS: the message will be delivered and will never be
> >> lost by SQS.
> >>
> >
> > I agree that this is the most important feature. Happily, Flavio has
> > clarified this in his other thread[1]:
> >
> >  *Zaqar's vision is to provide a cross-cloud interoperable,
> >   fully-reliable messaging service at scale that is both, easy and not
> >   invasive, for deployers and users.*
> >
> >   ...
> >
> >   Zaqar aims to be a fully-reliable service, therefore messages should
> >   never be lost under any circumstances except for when the message's
> >   expiration time (ttl) is reached
> >
> > So Zaqar _will_ guarantee reliable delivery.
> >
> >  Zaqar doesn't have the same scaling properties as SQS.
> >>
> >
> > This is true. (That's not to say it won't scale, but it doesn't scale in
> > exactly the same way that SQS does because it has a different architecture.)
> >
> > It appears that the main reason for this is the ordering guarantee, which
> > was introduced in response to feedback from users. So this is clearly a
> > different design choice: SQS chose reliability plus effectively infinite
> > scalability, while Zaqar chose reliability plus FIFO. It's not feasible to
> > satisfy all three simultaneously, so the options are:
> >
> > 1) Implement two separate modes and allow the user to decide
> > 2) Continue to choose FIFO over infinite scalability
> > 3) Drop FIFO and choose infinite scalability instead
> >
> > This is one of the key points on which we need to get buy-in from the
> > community on selecting one of these as the long-term strategy.
> >
> >  Zaqar is aiming for low latency per message, SQS doesn't appear to be.
> >>
> >
> > I've seen no evidence that Zaqar is actually aiming for that. There are
> > waaay lower-latency ways to implement messaging if you don't care about
> > durability (you wouldn't do store-and-forward, for a start). If you see a
> > lot of talk about low latency, it's probably because for a long time people
> > insisted on comparing Zaqar to RabbitMQ instead of SQS.
> >
> 
> I thought this was why Zaqar uses Falcon and not Pecan/WSME?
> 
> "For an application like Marconi where throughput and latency is of
> paramount importance, I recommend Falcon over Pecan."
> https://wiki.openstack.org/wiki/Zaqar/pecan-evaluation#Recommendation
> 
> Yes that statement mentions throughput as well, but it does mention latency
> as well.
> 

I definitely see where that may have subtly suggested the wrong
thing, if indeed latency isn't a top concern.

I think what it probably should say is something like this:

"For an application like Marconi where there will be many repetitive,
small requests, a lighter weight solution such as Falcon is preferred
over Pecan."

As in, we care about the cost of all those requests, not so much about
the latency.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-23 Thread Devananda van der Veen

On Mon, Sep 22, 2014 at 5:47 PM, Zane Bitter  wrote:
> On 22/09/14 17:06, Joe Gordon wrote:
>>
>> If 50,000 messages per second doesn't count as small-to-moderate then
>> Zaqar
>> does not fulfill a major SQS use case.
>
>
> It's not a drop-in replacement, but as I mentioned you can recreate the SQS
> semantics exactly *and* get the scalability benefits of that approach by
> sharding at the application level and then round-robin polling.
>

This response seems dismissive to application developers deciding what
cloud-based messaging system to use for their application.

If I'm evaluating two messaging services, and one of them requires my
application to implement autoscaling and pool management, and the
other does not, I'm clearly going to pick the one which makes my
application development *simpler*. Also, choices made early in a
product's lifecycle (like, say, a facebook game) about which
technology they use (like, say, for messaging) are often informed by
hopeful expectations of explosive growth and fame.

So, based on what you've said, if I were a game developer comparing
SQS and Zaqar today, it seems clear that, if I picked Zaqar, and my
game gets really popular, it's also going to have to be engineered to
handle autoscaling of queues in Zaqar. Based on that, I'm going to
pick SQS. Because then I don't have to worry about what I'm going to
do when my game has 100 million users and there's still just one
queue.

Granted, it has become apparent that Zaqar is targeting a different
audience than SQS. I'm going to follow up shortly with more on that
topic.

-Devananda

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-23 Thread Devananda van der Veen

I've taken a bit of time out of this thread, and I'd like to jump back
in now and attempt to summarize what I've learned and hopefully frame
it in such a way that it helps us to answer the question Thierry
asked:

On Fri, Sep 19, 2014 at 2:00 AM, Thierry Carrez  wrote:
>
> The underlying question being... can Zaqar evolve to ultimately reach
> the massive scale use case Joe, Clint and Devananda want it to reach, or
> are those design choices so deeply rooted in the code and architecture
> that Zaqar won't naturally mutate to support that use case.

I also want to sincerely thank everyone who has been involved in this
discussion, and helped to clarify the different viewpoints and
uncertainties which have surrounded Zaqar lately. I hope that all of
this helps provide the Zaqar team guidance on a path forward, as I do
believe that a scalable cloud-based messaging service would greatly
benefit the OpenStack ecosystem.

Use cases
--

So, I'd like to start from the perspective of a hypothetical user
evaluating messaging services for the new application that I'm
developing. What does my application need from a messaging service so
that it can grow and become hugely popular with all the hipsters of
the world? In other words, what might my architectural requirements
be?

(This is certainly not a complete list of features, and it's not meant
to be -- it is a list of things that I *might* need from a messaging
service. But feel free to point out any glaring omissions I may have
made anyway :) )

1. Durability: I can't risk losing any messages
  Example: Using a queue to process votes. Every vote should count.

2. Single Delivery - each message must be processed *exactly* once
  Example: Using a queue to process votes. Every vote must be counted only once.

3. Low latency to interact with service
  Example: Single threaded application that can't wait on external calls

4. Low latency of a message through the system
  Example: Video streaming. Messages are very time-sensitive.

5. Aggregate throughput
  Example: Ad banner processing. Remember when sites could get
slash-dotted? I need a queue resilient to truly massive spikes in
traffic.

6. FIFO - When ordering matters
  Example: I can't "stop" a job that hasn't "start"ed yet.

So, as a developer, I actually probably never need all of these in a
single application -- but depending on what I'm doing, I'm going to
need some of them. Hopefully, the examples above give some idea of
what I have in mind for different sorts of applications I might
develop which would require these guarantees from a messaging service.

Why is this at all interesting or relevant? Because I think Zaqar and
SQS are actually, in their current forms, trying to meet different
sets of requirements. And, because I have not actually seen an
application using a cloud which requires the things that Zaqar is
guaranteeing - which doesn't mean they don't exist - it frames my past
judgements about Zaqar in a much better way than simply "I have
doubts". It explains _why_ I have those doubts.

I'd now like to offer the following as a summary of this email thread
and the available documentation on SQS and Zaqar, as far as which of
the above requirements are satisfied by each service and why I believe
that. If there are fallacies in here, please correct me.

SQS
--

Requirements it meets: 1, 5

The SQS documentation states that it guarantees durability of messages
(1) and handles "unlimited" throughput (5).

It does not guarantee once-and-only-once delivery (2) and requires
applications that care about this to de-duplicate on the receiving
side.

It also does not guarantee message order (6), making it unsuitable for
certain uses.

SQS is not an application-local service nor does it use a wire-level
protocol, so from this I infer that (3) and (4) were not design goals.

Zaqar

Requirements it meets: 1*, 2, 6

Zaqar states that it aims to guarantee message durability (1) but does
so by pushing the guarantee of durability into the storage layer.
Thus, Zaqar will not be able to guarantee durability of messages when
a storage pool fails, is misconfigured, or what have you. Therefor I
do not feel that message durability is a strong guarantee of Zaqar
itself; in some configurations, this guarantee may be possible based
on the underlying storage, but this capability will need to be exposed
in such a way that users can make informed decisions about which Zaqar
storage back-end (or "flavor") to use for their application based on
whether or not they need durability.

Single delivery of messages (2) is provided for by the claim semantics
in Zaqar's API. FIFO (6) ordering was an architectural choice made
based on feedback from users.

Aggregate throughput of a single queue (5) is not scalable beyond the
reach of a single storage pool. This makes it possible for an
application to outgrow Zaqar when its total throughput needs exceed
the capacity of a single pool. This would also make it possible for

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues