Hello Yakov,

Glad to see you back!

> Hi!
> I am back!
>
> Here are several ideas off the top of my head for Ignite 3.0
> 1. Client nodes should take the config from servers. Basically it should be
> enough to provide some cluster identifier or any known IP address to start
> a client.
>
This totally makes sense and should be covered by the distributed
metastorage approach described in [1]. A client can read and watch updates
for the cluster configuration and run solely based on that config.
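
For illustration, here is a minimal Java sketch of what such a client
bootstrap could look like. The MetaStorageClient interface and the
"cluster.config" key below are purely hypothetical and are not the actual
IEP-61 API:

    import java.util.function.Consumer;

    // Hypothetical metastorage client; the real interface will come out of IEP-61.
    interface MetaStorageClient {
        byte[] get(String key);
        void watch(String keyPrefix, Consumer<byte[]> onUpdate);
    }

    class ClientBootstrap {
        void start(MetaStorageClient metaStorage) {
            // Read the current cluster configuration once at startup...
            applyConfig(metaStorage.get("cluster.config"));

            // ...then keep it up to date via a watch instead of shipping a
            // local configuration file to every client.
            metaStorage.watch("cluster.config", this::applyConfig);
        }

        private void applyConfig(byte[] raw) {
            // Parse and apply the configuration (omitted).
        }
    }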


> 2. Thread per partition. Again. I strongly recommend taking a look at how
> Scylla DB operates. I think this is the best distributed database threading
> model and can be a perfect fit for Ignite. The main idea is the "share
> nothing" approach - they keep thread switches to the necessary minimum:
> messages that read or update data are processed within the thread that
> reads them (of course, the sender should properly route the message, i.e.
> send it to the correct socket). Blocking operations such as fsync happen
> outside of worker (shard) threads.
> This will require splitting indexes per partition, which should be quite
> OK for most of the use cases in my view. Edge cases with high-selectivity
> indexes selecting 1-2 rows for a value can be sped up with hash indexes or
> even by a secondary index backed by another cache.
>
Generally agree, and again this is what we will naturally get when we
implement any log-based replication protocol [1]. However, I think it makes
sense to separate operation replication, which can be done in one thread,
from the actual command execution. The command execution can be done
asynchronously, thus reducing the latency of any operation to a single log
append + fsync.
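
To make this split concrete, here is a rough sketch in Java. The Log and
Command interfaces are invented for illustration and do not correspond to
any existing Ignite API:

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.Executor;

    // Hypothetical state machine command and replicated log.
    interface Command { void execute(); }
    interface Log { CompletableFuture<Void> appendAndFsync(Command cmd); }

    class PartitionReplicator {
        private final Log log;
        private final Executor partitionThread; // single thread owning the partition state

        PartitionReplicator(Log log, Executor partitionThread) {
            this.log = log;
            this.partitionThread = partitionThread;
        }

        CompletableFuture<Void> apply(Command cmd) {
            // Latency-critical path: a single log append + fsync; the caller
            // is acknowledged as soon as the entry is durable.
            CompletableFuture<Void> replicated = log.appendAndFsync(cmd);

            // The actual command execution runs asynchronously on the
            // partition thread, off the acknowledgement path.
            replicated.thenRunAsync(cmd::execute, partitionThread);

            return replicated;
        }
    }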


> 3. Replicate physical updates instead of logical. This will simplify logic
> running on backups to zero. Read a page sent by the primary node and apply
> it locally. Most probably this change will require the pure
> thread-per-partition model described above.
>
Not sure about this; I think this approach has the following disadvantages:
 * We will not be able to replicate a single page update, as such an update
usually leaves the storage in a corrupted state. Therefore, we will still
have to group pages into batches that must be applied atomically, thus
complicating the protocol
 * Physical replication will result in significant network traffic
amplification, especially for cases with large inline indexes. The same
goes for EntryProcessors - we would have to replicate huge values where a
small logical operation modifying a large object could have been replicated
instead (see the sketch after this list)
 * Physical replication complicates the road for Ignite to support rolling
upgrades in the future. If we choose to change the local storage format, we
will have to somehow convert the new binary format to the old one at
replication time when sending from new to old nodes, and additionally
support the forward conversion on new nodes if an old node is the
replication group leader
 * Finally, this approach closes the road to having different storage
formats on different nodes (for example, having one of the nodes in the
replication group keep data in a columnar format, as is done in TiDB).
Such a setup would allow us to route analytical queries to separate
dedicated nodes without affecting the performance properties of the whole
replication group.
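
To make the EntryProcessor point concrete, here is a small JCache-style
sketch (the Account and Deposit classes are invented for illustration).
Logical replication only needs to ship the tiny Deposit command, while
physical replication would have to ship the modified pages holding the
whole multi-hundred-kilobyte value:

    import java.io.Serializable;
    import javax.cache.processor.EntryProcessor;
    import javax.cache.processor.MutableEntry;

    class Account implements Serializable {
        long balance;
        byte[] statementHistory = new byte[512 * 1024]; // large, rarely changed payload
    }

    class Deposit implements EntryProcessor<Long, Account, Long>, Serializable {
        private final long amount;

        Deposit(long amount) {
            this.amount = amount;
        }

        @Override public Long process(MutableEntry<Long, Account> entry, Object... args) {
            Account acc = entry.getValue();
            acc.balance += amount; // the only logical change
            entry.setValue(acc);   // marks the entry as updated
            return acc.balance;
        }
    }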


> 4. Merge (all?) network components (communication, disco, REST, etc?) and
> listen to one port.
>
Makes sense. There is no clear design for this, though; we have not even
discussed it in detail. I am not even sure we need separate discovery and
communication services, as they are tightly interdependent; on the other
hand, a lot of discovery functions are being moved to the distributed
metastore.



> 5. Revisit transaction concurrency and isolation settings. Currently some
> of the combinations do not work as users may expect them to and some look
> just weird.
>
Agree, it looks like we can cut more than half of the transaction modes
without sacrificing any functionality. However, similar to the single-port
point, we have not yet discussed how exactly the transactional protocol
will look, so this is an open question. Once I put my thoughts together, I
will create an IEP for this (unless somebody does it earlier, of course).
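
For reference, the modes in question are the 2 x 3 concurrency/isolation
combinations exposed by the Ignite 2.x API; a minimal usage sketch:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.transactions.Transaction;
    import org.apache.ignite.transactions.TransactionConcurrency;
    import org.apache.ignite.transactions.TransactionIsolation;

    class TxModes {
        // Any of OPTIMISTIC/PESSIMISTIC can be combined with any of
        // READ_COMMITTED/REPEATABLE_READ/SERIALIZABLE, and, as noted above,
        // not every combination behaves the way users expect.
        void runTx(Ignite ignite) {
            try (Transaction tx = ignite.transactions().txStart(
                    TransactionConcurrency.PESSIMISTIC,
                    TransactionIsolation.REPEATABLE_READ)) {
                // ... cache operations ...
                tx.commit();
            }
        }
    }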

--AG

[1]
https://cwiki.apache.org/confluence/display/IGNITE/IEP-61%3A+Common+Replication+Infrastructure
