On Mon, Sep 26, 2011 at 6:49 AM, Flavio Junqueira <[email protected]> wrote:
> On Sep 25, 2011, at 9:02 AM, OZAWA Tsuyoshi wrote:
>
>> (2011/09/25 6:43), Flavio Junqueira wrote:
>>>
>>> Thanks for sending this reference to the list, it sounds very
>>> interesting. I have a few questions and comments, if you don't mind:
>>>
>>> 1- I was wondering if you can give more detail on the setup you used to
>>> generate the numbers you show in the graphs on your Accord page. The
>>> ZooKeeper values are way too low, and I suspect that you're using a
>>> single hard drive. It could be because you expect to use a single hard
>>> drive with an Accord server, and you wanted to make the comparison fair.
>>> Is this correct?
>>
>> No, it isn't.
>> Both ZooKeeper and Accord use a dedicated hard drive for logging.
>> The setting file I used is here:
>> https://gist.github.com/1240291
>>
>> Please tell me if I have made a mistake.
>
> I gave a cursory look, and I can't see any obvious problem. It is
> intriguing that the numbers are so low. Have you tried with different
> numbers of servers? I'm not sure if I just missed this information, but
> what version of ZooKeeper are you looking at?

I use ZooKeeper 3.3.3, the latest release version, with a 7200 rpm HDD.
The benchmarks were measured with 3, 5, and 7 servers.

> Also, if it is not too much trouble, could you please report on your read
> performance?

I'll measure it and report the numbers later.

>>> 2- The previous observation leads me to the next question: could you say
>>> more about your use of disk with persistence on?
>>
>> ZooKeeper returns an ACK after more than half of the machines have
>> written to disk. Accord returns an ACK after just one machine, the one
>> that accepted the request, has written to disk. However, at the same
>> time, the ACK assures that all servers receive the messages in the same
>> order. This difference in semantics means that the measurement is not
>> fair. I would like to measure under a fair setup, but haven't yet. If
>> there are requests from users, I'm going to implement it and measure it.
>> Note that the in-memory benchmark is fair.
>
> I'm not sure I understand this part. You say that an operation is ACKed
> after being written to one disk, but also that it is guaranteed to be
> delivered in the same order in all servers. Does it mean that Accord still
> replicates on other servers before ACKing but the other servers do not
> write to disk? Otherwise, the first server may crash and never come back,
> and the message cannot possibly be delivered by other servers.

Yes, you're right.

> One question related to this point: with Accord, do you replicate the
> original request message or the result of the operation?

Accord replicates the request message.

> Do you guarantee that each server executes a request or applies the
> result of a request exactly once? If not, what kind of semantics does
> Accord provide?

Yes, it does.
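To make the ACK difference above concrete, here is a toy, single-process
model of the two rules. It only illustrates the semantics discussed in this
thread; it is not either system's real code:

# Toy model of the two ACK rules discussed above (times in ms).

def zk_ack_time(fsync_times_ms):
    """ZooKeeper ACKs once a majority of servers have fsynced the write."""
    majority = len(fsync_times_ms) // 2 + 1
    return sorted(fsync_times_ms)[majority - 1]

def accord_ack_time(fsync_times_ms, order_agreed_ms, receiving_server=0):
    """Accord (as described above) ACKs once total order is agreed and the
    single receiving server has fsynced; the others fsync afterwards."""
    return max(order_agreed_ms, fsync_times_ms[receiving_server])

# Five servers with uneven disks:
fsyncs = [4.0, 25.0, 9.0, 30.0, 7.0]
print(zk_ack_time(fsyncs))           # 9.0 -- the 3rd-fastest fsync gates the ACK
print(accord_ack_time(fsyncs, 2.0))  # 4.0 -- only server 0's fsync matters

In this model, ZooKeeper's ACK is gated by the majority's slowest disk,
while Accord's is gated by one disk plus the ordering agreement. This is
the unfairness in the persistent-mode comparison that I mentioned above.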
>>> 3- The limitation on the message size in ZooKeeper is not a fundamental
>>> limitation. We have chosen to limit for the reasons we explain in the
>>> wiki page that is linked in the Accord page. Do you have any particular
>>> use case in mind for which you think it would be useful to have very
>>> large messages?
>>
>> Some developers use ZooKeeper as storage. For example, a developer of
>> Onix, a distributed control platform for OpenFlow networks, says that:
>> "for most the object size limitations of
>> Zookeeper and convenience of accessing the configuration
>> state directly through the NIB are a reason to favor the
>> transactional database."
>> http://www.usenix.org/event/osdi10/tech/full_papers/Koponen.pdf
>
> The comment in the paper is exactly right, we instruct our users to store
> metadata in ZooKeeper and data elsewhere. There are systems designed to
> store bulk data, and ZooKeeper shouldn't try to compete with such storage
> systems, it is not our goal.

You're right. We decided to support unlimited data sizes because some
write-intensive applications, such as a replicated queue or a replicated
database, need to send big messages, even though a big message can block
the messages behind it. In that sense, I think Accord need not compete with
ZooKeeper, because the application ranges are different: ZooKeeper is
suitable for read-mostly applications such as sharing configuration, leader
election, and failure detection, while Accord is more suitable for
write-mostly applications.
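To illustrate the write-mostly pattern I have in mind, here is a sketch of
a replicated queue on top of a totally ordered write log. The cluster
object below is a local stand-in and its method names are invented for the
example; they are not Accord's actual API:

# Sketch of the write-mostly, large-message pattern discussed above. The
# "cluster" stands in for a coordination service that applies writes in
# one agreed total order.

class StubCluster:
    def __init__(self):
        self.log = []                      # the agreed total order of writes

    def write(self, key, value):
        self.log.append((key, value))      # every replica applies this order

    def scan(self, prefix):
        return [(k, v) for k, v in self.log if k.startswith(prefix)]

cluster = StubCluster()

# Producers enqueue by writing sequenced keys. The payloads here are 10 MB;
# lifting the message-size cap is what makes this usable as a queue.
for seq in range(3):
    cluster.write("/queue/task-%010d" % seq, b"x" * 10_000_000)

# A consumer drains the queue in the agreed order.
for key, value in cluster.scan("/queue/"):
    print(key, len(value))

The point is that once large values are allowed, the totally ordered log by
itself is enough to give a queue with a well-defined consumer order.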
>>> 4- If I understand the group communication substrate Accord uses, it
>>> enables Accord to process client requests in any server. ZooKeeper has a
>>> leader for a few reasons, one being the ability of managing client
>>> sessions. Ephemeral nodes, for example, are bound to sessions. Are there
>>> similar abstractions in Accord? If the answer is positive, could you
>>> explain it a bit? If not, is it doable with the substrate you're using?
>>
>> Yes, Accord has abstractions like ephemeral nodes.
>> We use the Corosync cluster engine, which provides Virtual Synchrony
>> semantics. It assures consensus on the ordering of messages and of
>> server failures among all servers (conductor daemons).
>
> Good to know that you also support ephemerals. Could you say a little more
> about how you decide to eliminate an ephemeral node? I suppose that an
> ephemeral is bound to the client that created it somehow, and it is deleted
> if the client crashes or disconnects. What's the exact mechanism?

Accord currently provides more primitive abstractions. Accord servers
assign a client ID when a client joins the Accord cluster, and when a
client leaves, they notify the remaining clients of the departed client's
ID. When a client writes a node that should be ephemeral, it marks the node
with its own client ID. When the other clients receive the notification
that a client has left, they delete the nodes that the departed client
created.
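In other words, the bookkeeping looks roughly like this (a local sketch
with invented names, not the Accord client library):

# Local sketch of the ephemeral-node mechanism described above; all names
# are invented for illustration.

nodes = {}   # path -> (value, owner_client_id); owner is None if persistent

def write(path, value, my_client_id, ephemeral=False):
    # The writing client marks the node with its own ID if it is ephemeral.
    nodes[path] = (value, my_client_id if ephemeral else None)

def on_client_left(left_client_id):
    # Every surviving client receives the departed client's ID and deletes
    # the nodes that the departed client created as ephemeral.
    for path in [p for p, (_, owner) in nodes.items() if owner == left_client_id]:
        del nodes[path]

write("/locks/table-1", b"held", my_client_id=42, ephemeral=True)
write("/config/size", b"128", my_client_id=42)    # persistent node
on_client_left(42)
print(sorted(nodes))   # ['/config/size'] -- only the ephemeral node is gone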
>>> 5- I'm not sure where we say that 8 bytes is a typical value in the
>>> documentation. I actually remember writing in one of our papers that a
>>> typical value is around 1k bytes.
>>
>> The benchmark assumes the lock use case. I'm going to measure various
>> message sizes. I'll report the results.
>
> Sounds good, thanks.
>
> -Flavio
>
>>> On Sep 23, 2011, at 4:22 PM, OZAWA Tsuyoshi wrote:
>>>
>>>> Hi,
>>>>
>>>> Sending this to the zookeeper-users and hbase-users lists, since
>>>> there may be some cluster developers interested in participating in
>>>> this project there.
>>>>
>>>> I am pleased to announce the initial release of Accord, yet another
>>>> coordination service like Apache ZooKeeper.
>>>> ZooKeeper is, as you know, the de facto standard coordination kernel
>>>> at present. Accord provides ZK-like features as a coordination
>>>> service. Concretely speaking:
>>>> - Accord is a distributed, transactional, and fully replicated (no
>>>> SPoF) key-value store with strong consistency.
>>>> - Accord can scale out to tens of nodes.
>>>> - Accord servers can handle tens of thousands of clients.
>>>> - Changes made by one client's write request can be notified to the
>>>> other clients.
>>>> - Accord detects clients joining and leaving, and notifies the other
>>>> clients of the joined/left client.
>>>>
>>>> There are, however, some problems in ZK:
>>>> - ZK cannot handle write-intensive workloads well. ZK forwards all
>>>> write requests to a master server, which may become a bottleneck
>>>> under write-intensive workloads.
>>>> - ZK is optimized for its disk-persistence mode, not for in-memory
>>>> mode. ZOOKEEPER-866 shows that ZK has another bottleneck besides
>>>> disk persistence, even though there is demand for fully replicated
>>>> storage with both strong consistency and low latency.
>>>> https://issues.apache.org/jira/browse/ZOOKEEPER-866
>>>> - Limited transaction APIs. ZK can only issue write operations
>>>> (write, del) in a transaction (multi-update).
>>>>
>>>> These restrictions limit the capability of the coordination kernel.
>>>> Accord solves these problems:
>>>> 1. Accord uses the Corosync Cluster Engine as a total-order messaging
>>>> infrastructure instead of Zab, the atomic broadcast protocol ZK uses.
>>>> The engine enables any server to accept and process requests.
>>>> 2. Accord supports an in-memory mode.
>>>> 3. More flexible transaction support: not only write and del
>>>> operations, but also cmp, copy, and read operations are supported
>>>> within a transaction.
>>>>
>>>> These differences in the core engine (1, 2) enable us to avoid the
>>>> master bottleneck. Benchmarks demonstrate that the write-operation
>>>> throughput of Accord is much higher than that of ZooKeeper (up to 20
>>>> times higher in persistent mode, and up to 18 times higher in
>>>> in-memory mode).
>>>>
>>>> The high-performance kernel can extend the application range. Assumed
>>>> applications include, for instance:
>>>> - A distributed lock manager whose lock operations occur at high
>>>> frequency from thousands of clients. I have the lock manager for
>>>> HBase in mind in particular. The coordination service enables HBase
>>>> to update multiple rows with ACID properties; HBase can act as a
>>>> distributed DB with ACID properties until the coordination service
>>>> becomes the bottleneck. Since Accord can handle 18 times the
>>>> throughput of ZK, it can dramatically improve the scalability of
>>>> HBase with ACID properties.
>>>> - A metadata management service for large-scale distributed storage,
>>>> including HDFS, Ceph, Sheepdog, etc. A replicated master can be
>>>> implemented easily.
>>>> - A replicated message queue or logger (for instance, a replicated
>>>> RabbitMQ).
>>>>
>>>> and so on.
>>>>
>>>> Other distributed systems can use Accord's features easily because
>>>> Accord provides general-purpose APIs (read/write/del plus the more
>>>> flexible transactions).
>>>>
>>>> More information, including getting started, benchmarks, and API
>>>> docs, is available from our project page:
>>>> http://www.osrg.net/accord
>>>>
>>>> and all code is available from:
>>>> http://github.com/collie/accord
>>>>
>>>> Please try it out, and let me know about any opinions or problems.
>>>>
>>>> Best regards,
>>>> OZAWA Tsuyoshi
>>>> <[email protected]>
>
> flavio
> junqueira
>
> research scientist
>
> [email protected]
> direct +34 93-183-8828
>
> avinguda diagonal 177, 8th floor, barcelona, 08018, es
> phone (408) 349 3300 fax (408) 349 3301
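P.S. A note on item 3 of the quoted announcement above: a transaction is a
list of operations applied atomically, and the cmp operations let it abort
cleanly, which gives compare-and-swap-style updates. A toy model of the
idea (the shapes below are invented for illustration; see the API docs on
the project page for the real interface):

# Toy model of the flexible transactions in item 3 above (cmp, copy, read,
# write, del applied atomically against a local dict).

store = {"/leader": b"node-a", "/epoch": b"7"}

def execute(txn):
    """Apply all operations atomically; abort if any cmp fails."""
    for op in txn:
        if op[0] == "cmp" and store.get(op[1]) != op[2]:
            return False, None                 # whole transaction aborts
    reads = []
    for op in txn:
        kind, path = op[0], op[1]
        if kind == "write":
            store[path] = op[2]
        elif kind == "del":
            store.pop(path, None)
        elif kind == "copy":
            store[op[2]] = store[path]         # copy path -> op[2]
        elif kind == "read":
            reads.append(store.get(path))
    return True, reads

# Change the leader only if the epoch is still the one we observed, and
# read the result back -- all in one atomic step:
ok, reads = execute([("cmp", "/epoch", b"7"),
                     ("write", "/leader", b"node-b"),
                     ("write", "/epoch", b"8"),
                     ("read", "/leader")])
print(ok, reads)   # True [b'node-b']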
-- 
OZAWA Tsuyoshi
<[email protected]>
