On Sep 25, 2011, at 9:02 AM, OZAWA Tsuyoshi wrote:

(2011/09/25 6:43), Flavio Junqueira wrote:
Thanks for sending this reference to the list, it sounds very
interesting. I have a few questions and comments, if you don't mind:

1- I was wondering if you can give more detail on the setup you used to
generate the numbers you show in the graphs on your Accord page. The
ZooKeeper values are way too low, and I suspect that you're using a
single hard drive. It could be because you expect to use a single hard drive with an Accord server, and you wanted to make the comparison fair.
Is this correct?

No, it isn't.
Both ZooKeeper and Accord use a dedicated hard drive for logging.
The settings file I used is here:
https://gist.github.com/1240291

Please tell me if I have made a mistake.


I took a cursory look, and I can't see any obvious problem. It is intriguing that the numbers are so low. Have you tried with different numbers of servers? I'm not sure if I just missed this information, but which version of ZooKeeper are you using? Also, if it is not too much trouble, could you please report on your read performance?

2- The previous observation leads me to the next question: could you say
more about your use of disk with persistence on?

ZooKeeper returns an ACK only after more than half of the machines have written to disk. Accord returns an ACK after writing to the disk of just one machine, the one that accepted the request. At the same time, however, the ACK guarantees that all servers receive the messages in the same order.
This difference in semantics means that the measurement is not fair. I would like to measure under fair conditions, but I haven't yet. If there are requests from users, I will implement it and measure it. Note that the in-memory benchmark is fair.
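
Roughly, the two ACK paths look like this. The sketch below is only my own illustration in Java; the names, interface, and structure are assumptions, not actual code from either project:

    import java.util.List;

    class AckPathSketch {
        interface Server {
            boolean fsyncToLog(byte[] update);   // append to the local transaction log and fsync
            void deliver(byte[] update);         // hand the update to the server in total order
        }

        // ZooKeeper-style: the client is ACKed only after a quorum of servers
        // has written the proposal to its transaction log.
        static boolean quorumDiskAck(List<Server> ensemble, byte[] update) {
            long acks = ensemble.stream().filter(s -> s.fsyncToLog(update)).count();
            return acks > ensemble.size() / 2;   // more than half of the disks before ACK
        }

        // Accord-style, as described above: the accepting server writes its own
        // log and ACKs, while the total-order broadcast (Corosync in Accord's case)
        // fixes the order in which every server will see the update.
        static boolean singleDiskAck(Server accepting, List<Server> ensemble, byte[] update) {
            ensemble.forEach(s -> s.deliver(update));   // same delivery order on every server
            return accepting.fsyncToLog(update);        // only one disk write before ACK
        }
    }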


I'm not sure I understand this part. You say that an operation is ACKed after being written to one disk, but also that it is guaranteed to be delivered in the same order in all servers. Does it mean that Accord still replicates on other servers before ACKing but the other servers do not write to disk? Otherwise, the first server may crash and never come back, and the message cannot possibly be delivered by other servers.

One question related to this point: with Accord, do you replicate the original request message or the result of the operation? Do you guarantee that each server executes a request or applies the result of a request exactly once? If not, what kind of semantics does Accord provide?

3- The limitation on the message size in ZooKeeper is not a fundamental limitation. We have chosen to limit it for the reasons we explain in the wiki page that is linked from the Accord page. Do you have any particular use case in mind for which you think it would be useful to have very large messages?

Some developers use ZooKeeper as storage. For example, the developers of Onix, an OpenFlow control platform, say:
"for most the object size limitations of
Zookeeper and convenience of accessing the configuration
state directly through the NIB are a reason to favor the
transactional database."
http://www.usenix.org/event/osdi10/tech/full_papers/Koponen.pdf


The comment in the paper is exactly right; we instruct our users to store metadata in ZooKeeper and data elsewhere. There are systems designed to store bulk data, and ZooKeeper shouldn't try to compete with such storage systems; that is not our goal.

4- If I understand the group communication substrate Accord uses, it enables Accord to process client requests on any server. ZooKeeper has a leader for a few reasons, one being the ability to manage client sessions. Ephemeral nodes, for example, are bound to sessions. Are there similar abstractions in Accord? If the answer is positive, could you explain them a bit? If not, is it doable with the substrate you're using?

Yes, Accord has abstractions like ephemeral nodes.
We use the Corosync Cluster Engine, which provides virtual synchrony semantics. It guarantees consensus on message ordering and on the ordering of server failures among all servers (conductor daemons).

Good to know that you also support ephemerals. Could you say a little more about how you decide to eliminate an ephemeral node? I suppose that an ephemeral is bound to the client that created it somehow, and it is deleted if the client crashes or disconnects. What's the exact mechanism?


5- I'm not sure where we say that 8 bytes is a typical value in the
documentation. I actually remember writing in one of our papers that a
typical value is around 1k bytes.

The benchmark assumes a lock use case. I'm going to measure various message sizes and report the results.

Sounds good, thanks.

-Flavio



-Flavio

On Sep 23, 2011, at 4:22 PM, OZAWA Tsuyoshi wrote:

Hi,

I'm sending this to the zookeeper-users and hbase-users mailing lists since there may be some cluster developers there interested in participating in this project.

I am pleased to announce the initial release of Accord, yet another coordination service like Apache ZooKeeper.
As you know, ZooKeeper is the de facto standard coordination kernel at present.
Accord provides ZK-like features as a coordination service. Concretely speaking, it features:
- Accord is a distributed, transactional, and fully-replicated (no SPoF) key-value store with strong consistency.
- Accord can scale out to tens of nodes.
- Accord servers can handle tens of thousands of clients.
- Changes made by one client's write request can be notified to the other clients.
- Accord detects client join/leave events and notifies the other clients of the joined/left client.

There are, however, some problems with ZK:
- ZK cannot handle write-intensive workloads well. ZK forwards all write requests to a master server, which can become a bottleneck under write-intensive workloads.
- ZK is optimized for disk-persistence mode, not for in-memory mode. ZOOKEEPER-866 shows that ZK has another bottleneck besides disk persistence, even though there is a need for fully-replicated storage with both strong consistency and low latency.
https://issues.apache.org/jira/browse/ZOOKEEPER-866
- Limited transaction APIs. ZK can only issue write operations (write, del) in a transaction (multi-update).

These restrictions limit the capability of the coordination kernel. Accord solves these problems as follows:
1. Accord uses the Corosync Cluster Engine as a total-order messaging infrastructure instead of Zab, the atomic broadcast protocol ZK uses. The engine enables any server to accept and process requests.
2. Accord supports an in-memory mode.
3. More flexible transaction support: not only write and del operations, but also cmp, copy, and read operations are supported within a transaction (see the sketch after this list).
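
To make point 3 concrete, here is a small sketch in Java of the kind of transaction I mean. The Txn interface below is only an illustration of the idea, not Accord's actual client API; the operation names simply mirror the ops listed above:

    class FlexibleTxnSketch {
        interface Txn {
            Txn cmp(String key, byte[] expected);   // abort the transaction unless the stored value matches
            Txn read(String key);                   // read a value as part of the transaction
            Txn copy(String src, String dst);       // copy one key's value to another key
            Txn write(String key, byte[] value);    // set a value
            Txn del(String key);                    // delete a key
            boolean commit();                       // all operations apply atomically, or none do
        }

        // Example: snapshot and update a config entry in one atomic step.
        static boolean rotateConfig(Txn txn, byte[] expectedOld, byte[] updated) {
            return txn.cmp("/config/current", expectedOld)          // only if nobody changed it meanwhile
                      .copy("/config/current", "/config/previous")  // keep the old value
                      .write("/config/current", updated)            // install the new value
                      .del("/config/staging")                       // clean up
                      .commit();
        }
    }

The point is that reads, compares, and copies take part in the same atomic unit as writes and deletes, which, as noted above, ZK's multi-update does not allow.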

These differences in the core engine (1, 2) allow us to avoid the master bottleneck. Our benchmark demonstrates that the write throughput of Accord is much higher than that of ZooKeeper (up to 20 times higher in persistent mode, and up to 18 times higher in in-memory mode).

The higher-performance kernel can extend the range of applications. Assumed applications include, for instance:
- A distributed lock manager whose lock operations are issued at high frequency by thousands of clients.
I have the lock manager for HBase in mind in particular. The coordination service enables HBase to update multiple rows with ACID properties. HBase can act as a distributed DB with ACID properties until the coordination service becomes the bottleneck. The new coordination kernel, Accord, can handle 18 times the throughput of ZK, so Accord can dramatically improve the scalability of HBase with ACID properties (see the sketch after this list).
- A metadata management service for large-scale distributed storage such as HDFS, Ceph, and Sheepdog.
A replicated master can be implemented easily.
- A replicated message queue or logger (for instance, a replicated RabbitMQ).
and so on.
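
As a concrete example of the lock-manager item above, here is a minimal sketch in Java, reusing the same hypothetical transaction interface as before (again, not Accord's real API): the lock key is claimed only if it is currently empty, in one atomic transaction.

    class LockSketch {
        interface Txn {
            Txn cmp(String key, byte[] expected);
            Txn write(String key, byte[] value);
            boolean commit();
        }

        static final byte[] UNLOCKED = new byte[0];   // assumes the lock key starts out empty

        // Returns true if we claimed the lock, false if another client won the race.
        static boolean tryLock(Txn txn, String lockKey, String owner) {
            return txn.cmp(lockKey, UNLOCKED)             // fails if someone already holds it
                      .write(lockKey, owner.getBytes())   // record the new owner
                      .commit();
        }
    }

Releasing the lock, retries, and the ephemeral-style cleanup when a client fails are left out of this sketch.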

Other distributed systems can use Accord's features easily because Accord provides general-purpose APIs (read/write/del and the more flexible transactions).

More information, including a getting-started guide, benchmarks, and API docs, is available from our project page:
http://www.osrg.net/accord

and all code is available from:
http://github.com/collie/accord

Please try it out, and let me know about any opinions or problems.

Best regards,
OZAWA Tsuyoshi
<[email protected]>


flavio
junqueira

research scientist

[email protected]
direct +34 93-183-8828

avinguda diagonal 177, 8th floor, barcelona, 08018, es
phone (408) 349 3300 fax (408) 349 3301


