Hi Alexey,

> If we want to support etcd as a metastorage - let's do this as a concrete
> configuration option, a first-class citizen of the system rather than an
> SPI implementation with a rigid interface.

On the one hand, this is quite reasonable. But on the other hand, if someone wants to adopt, for example, Apache ZooKeeper or some other proprietary external lock service, we could provide basic interfaces to do the job.
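To make the idea concrete, here is a rough sketch of what such a basic abstraction might look like. All names here are hypothetical and do not correspond to any existing Ignite API; the point is only that the same interface could be backed by the built-in Raft group, etcd, or ZooKeeper.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;

// Hypothetical minimal metastorage abstraction (illustration only).
interface MetaStorage {
    Optional<byte[]> get(String key);

    void put(String key, byte[] value);

    // Invoked on every update of the key. A real implementation would
    // expose revisions and watch cursors; omitted here for brevity.
    void watch(String key, BiConsumer<String, byte[]> listener);
}

// Trivial single-node, in-memory implementation, just to show the shape
// a concrete backend (Raft-based, etcd-based, ZK-based) would fill in.
class InMemoryMetaStorage implements MetaStorage {
    private final ConcurrentHashMap<String, byte[]> data = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, BiConsumer<String, byte[]>> watchers =
        new ConcurrentHashMap<>();

    public Optional<byte[]> get(String key) {
        return Optional.ofNullable(data.get(key));
    }

    public void put(String key, byte[] value) {
        data.put(key, value);
        BiConsumer<String, byte[]> w = watchers.get(key);
        if (w != null)
            w.accept(key, value);
    }

    public void watch(String key, BiConsumer<String, byte[]> listener) {
        watchers.put(key, listener);
    }
}
```

Whether such an interface should be public SPI or internal-only is exactly the question under discussion, of course.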
> Thus, by default, they will be mixed which will significantly simplify
> cluster setup and usability.

According to the Raft spec, the leader processes all requests from clients. The leader's response latency is crucial for the stability of the whole cluster. Cluster setup simplicity is a matter of documentation, scripts and so on; starting Kafka, for example, is quite easy.

Also, if we use the mixed approach, a service discovery protocol will have to be implemented. This is necessary because we first need to discover nodes in order to choose a finite subset for the Raft ensemble. For example, Consul by HashiCorp uses a gossip protocol to do the job (nodes participating in Raft are called servers) [1].

If we use the separated approach, we could use the service discovery pattern that is common for ZooKeeper or etcd: a data node creates a record with a TTL and keeps renewing it (the EPHEMERAL node approach in ZK), while other data nodes watch for new records.

Some words about PacificA: the article [2] is just a brief description of the ideas. Alexey, is there any formal specification of this protocol, preferably in TLA+?

[1] -- https://www.consul.io/docs/architecture/gossip
[2] -- https://www.microsoft.com/en-us/research/wp-content/uploads/2008/02/tr-2008-25.pdf

On Fri, Oct 23, 2020 at 13:05, Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:

> Hello Ivan,
>
> Thanks for the feedback, see my comments inline:
>
> On Thu, Oct 22, 2020 at 17:59, Ivan Daschinsky <ivanda...@gmail.com> wrote:
>
> > Hi!
> > Alexey, your proposal looks great. Can I ask you some questions?
> > 1. Are the nodes that take part in the metastorage replication group
> > (Raft candidates and leader) also expected to bear cache data and
> > participate in cache transactions?
> > It seems quite dangerous to me to mix roles. For example, heavy load
> > from users can cause long GC pauses on the leader of the replication
> > group and, therefore, leader failure, a new leader election, etc.
>
> I think both ways should be possible.
> The set of nodes that hold metastorage should be defined declaratively in
> runtime, as well as the set of nodes holding table data. Thus, by default,
> they will be mixed which will significantly simplify cluster setup and
> usability, but when needed, this should be easily adjusted in runtime by
> the cluster administrator.
>
> > 2. If the previous statement is true, another question arises. If one of
> > the candidates or the leader fails, how will a replacement node be
> > chosen from the regular nodes to form the full ensemble? A random one?
>
> Similarly - by default, a 'best' node will be chosen from the available
> ones, but the administrator can override this.
>
> > 3. Do you think that this metastorage implementation can be pluggable?
> > It can be implemented on top of etcd, for example.
>
> I think the metastorage abstraction must be clearly separated so it is
> possible to change the implementation. Moreover, I was thinking that we may
> use etcd to speed up the development of other system components while we
> are working on our own protocol implementation. However, I do not think we
> should expose it as a pluggable public API. If we want to support etcd as a
> metastorage - let's do this as a concrete configuration option, a
> first-class citizen of the system rather than an SPI implementation with a
> rigid interface.
>
> WDYT?

--
Sincerely yours, Ivan Daschinskiy
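P.S. The TTL/ephemeral-record discovery pattern mentioned above (a data node registers a record with a TTL and keeps renewing it; records that are not renewed in time drop out of the live set) can be sketched like this. This is a self-contained, in-memory toy with an injected clock, not how etcd leases or ZK EPHEMERAL nodes are actually wired; all names are made up for illustration.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;
import java.util.stream.Collectors;

// Toy sketch of TTL-based service discovery: nodes heartbeat via renew(),
// and the live set is whatever has not expired at read time.
class TtlRegistry {
    private final Map<String, Long> expiry = new ConcurrentHashMap<>();
    private final LongSupplier clock; // injected so callers/tests control time

    TtlRegistry(LongSupplier clock) {
        this.clock = clock;
    }

    /** Register (or renew) a node's record so it lives for ttlMillis from now. */
    void renew(String nodeId, long ttlMillis) {
        expiry.put(nodeId, clock.getAsLong() + ttlMillis);
    }

    /** Nodes whose records have not yet expired. */
    Set<String> liveNodes() {
        long now = clock.getAsLong();
        return expiry.entrySet().stream()
            .filter(e -> e.getValue() > now)
            .map(Map.Entry::getKey)
            .collect(Collectors.toSet());
    }
}
```

In the real systems the "watch for new records" half is pushed by the store itself (etcd watch streams, ZK watchers) rather than polled, and a Raft-ensemble bootstrapper would pick its finite server subset from this live set.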