Hi!
Alexey, your proposal looks great. Can I ask you some questions?
1. Is nodes, that take part of metastorage replication group (raft
candidates and leader) are expected to also bear cache data and participate
in cache transactions?
   As for me, it seems quite dangerous to mix roles. For example, heavy
load from users can cause long GC pauses on leader of replication group and
therefore failure, new leader election, etc.

2. If previous statement is true, other question arises. If one of
candidates or leader fails, how will a insufficient node will be chosen
from regular nodes to form full ensemble? Random one?
3. Do you think, that this metastorage implementation can be pluggable? it
can be implemented on top of etcd, for example.


чт, 22 окт. 2020 г. в 13:04, Alexey Goncharuk <alexey.goncha...@gmail.com>:

> Hello Yakov,
>
> Glad to see you back!
>
> Hi!
> > I am back!
> >
> > Here are several ideas on top of my mind for Ignite 3.0
> > 1. Client nodes should take the config from servers. Basically it should
> be
> > enough to provide some cluster identifier or any known IP address to
> start
> > a client.
> >
> This totally makes sense and should be covered by the distributed
> metastorage approach described in [1]. A client can read and watch updates
> for the cluster configuration and run solely based on that config.
>
>
> > 2. Thread per partition. Again. I strongly recommend taking a look at how
> > Scylla DB operates. I think this is the best distributed database
> threading
> > model and can be a perfect fit for Ignite. Main idea is in "share
> nothing"
> > approach - they keep thread switches to the necessary minimum - messages
> > reading or updating data are processed within the thread that reads them
> > (of course, the sender should properly route the message, i.e. send it to
> > the correct socket). Blocking operations such as fsync happen outside of
> > worker (shard) threads.
> > This will require to split indexes per partition which should be quite ok
> > for most of the use cases in my view. Edge cases with high-selectivity
> > indexes selecting 1-2 rows for a value can be sped up with hash indexes
> or
> > even by secondary index backed by another cache.
> >
> Generally agree, and again this is what we will naturally have when we
> implement any log-based replication protocol [1]. However, I think it makes
> sense to separate the operation replication, which can be done in one
> thread, and actual command execution. The command execution can be done
> asynchronously thus reducing the latency of any operation to a single log
> append + fsync.
>
>
> > 3. Replicate physical updates instead of logical. This will simplify
> logic
> > running on backups to zero. Read a page sent by the primary node and
> apply
> > it locally. Most probably this change will require pure thread per
> > partition described above.
> >
> Not sure about this, I think this approach has the following disadvantages:
>  * We will not be able to replicate a single page update as such an update
> usually leaves storage in a corrupted state. Therefore, we will still have
> to group pages into batches which must be applied atomically, thus
> complicating the protocol
>  * Physical replication will result in significant network traffic
> amplification, especially for cases with large inline indexes. The same
> goes for EntryProcessors - we will have to replicate huge values while a
> small operation modifying a large object could have been replicated
>  * Physical replication complicates a road for Ignite to support a rolling
> upgrade in the future. If we choose to change the local storage format, we
> will have to somehow convert a new binary format to the old at replication
> time when sending from new to old nodes, and additionally support forward
> conversion on new nodes if an old node is a replication group leader
>  * Finally, this approach closes the road to having different storage
> formats on different nodes (for example, have one of the nodes in the
> replication group keep data in columnar format, as it is done in TiDB).
> This allows us to route analytical queries to separate dedicated nodes
> without affecting the performance properties of the whole replication
> group.
>
>
> > 4. Merge (all?) network components (communication, disco, REST, etc?) and
> > listen to one port.
> >
> Makes sense. There is no clear design for this though, we did not even
> discuss this in detail. I am not even sure we need separate discovery and
> communication services, as they are very dependent; on the other hand, a
> lot of discovery functions are moved to the distributed metastore.
>
>
>
> > 5. Revisit transaction concurrency and isolation settings. Currently some
> > of the combinations do not work as users may expect them to and some look
> > just weird.
> >
> Agree, looks like we can cut more than half of the transaction modes
> without sacrificing functionality at all. However, similarly to the single
> port point, we did not discuss how exactly the transactional protocol will
> look like yet, so this is an open question. Once I put my thoughts
> together, I will create an IEP for this (unless somebody does it earlier,
> of course).
>
> --AG
>
> [1]
>
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-61%3A+Common+Replication+Infrastructure
> <
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-61%3A+Common+Replication+Infrastructure
> >
>


-- 
Sincerely yours, Ivan Daschinskiy

Reply via email to