Hi all,

After running the cluster version on an online stream for about two weeks, we
experienced two failures in which the cluster stopped responding and could not
be recovered by restarting. We also did not find an effective way to recover
data from the cluster. So we'd like to build a testable cluster version for
enterprise use, which should have the following properties:


  1.  Write operations won’t be blocked frequently.
  2.  Query bugs are tolerable, as they can be fixed and iterated on quickly.
  3.  Most issues can be resolved by restarting nodes or the cluster.
  4.  There is a way to resolve otherwise unrecoverable issues, even if a
small part of the data is lost.
  5.  A cluster restart completes in a reasonable time.
  6.  The system has a monitoring mechanism.

We’re planning improvements in the following aspects:


  1.  Metadata uses too much memory

In our scenario, the number of measurements is large, around 1 billion, but
the ingestion rate is low (100K data points per second). We found that the
cluster nodes can’t hold the metadata within their memory limit (each node has
256 GB of memory). Because the request rate is low, the CPU load is only about
1% to 2%. For this scenario, we intend to introduce a third-party storage
component such as RocksDB to help manage schema metadata. Of course, this
would be optional and configurable.
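
To make the "optional and configurable" part concrete, here is a minimal
sketch of what a pluggable schema metadata store could look like. All names
(SchemaStore, open_schema_store, the config keys, the example path) are
hypothetical and not taken from the existing code base. SQLite is used only as
a standard-library stand-in for a disk-backed store; it is not RocksDB and says
nothing about RocksDB's API or performance.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Dict, Optional


class SchemaStore(ABC):
    """Hypothetical interface: maps a measurement path to its schema bytes."""

    @abstractmethod
    def put(self, path: str, schema: bytes) -> None: ...

    @abstractmethod
    def get(self, path: str) -> Optional[bytes]: ...


class MemorySchemaStore(SchemaStore):
    """Keeps every entry on the heap, like the current in-memory approach."""

    def __init__(self) -> None:
        self._entries: Dict[str, bytes] = {}

    def put(self, path: str, schema: bytes) -> None:
        self._entries[path] = schema

    def get(self, path: str) -> Optional[bytes]:
        return self._entries.get(path)


class DiskSchemaStore(SchemaStore):
    """Stand-in for a RocksDB-backed store (assumption: SQLite for the demo)."""

    def __init__(self, db_file: str) -> None:
        self._db = sqlite3.connect(db_file)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS schema (path TEXT PRIMARY KEY, body BLOB)"
        )

    def put(self, path: str, schema: bytes) -> None:
        with self._db:  # commits on success, rolls back on error
            self._db.execute(
                "INSERT OR REPLACE INTO schema VALUES (?, ?)", (path, schema)
            )

    def get(self, path: str) -> Optional[bytes]:
        row = self._db.execute(
            "SELECT body FROM schema WHERE path = ?", (path,)
        ).fetchone()
        return None if row is None else bytes(row[0])


def open_schema_store(config: dict) -> SchemaStore:
    """Pick the backend from configuration; memory stays the default."""
    if config.get("schema_store", "memory") == "disk":
        return DiskSchemaStore(config.get("schema_db_file", ":memory:"))
    return MemorySchemaStore()
```

With this shape, the rest of the code only talks to SchemaStore, so the
disk-backed option can be switched on per deployment without touching callers.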



  2.  Raft implementation

For this one, we plan to proceed in two steps. First, we’d like to abstract
the Raft interfaces and make Raft an independent component. This should also
be a work item when implementing the new architecture. Second, we’d like to
introduce a third-party Raft library such as Ratis, ideally making the
implementation configurable.
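
As a rough illustration of the first step, the sketch below shows one possible
boundary between the cluster code and a consensus implementation. The names
(ConsensusLog, SingleNodeLog, create_consensus) are hypothetical, and
SingleNodeLog is a trivial single-node stand-in with no replication at all; it
is not Raft and not Ratis. It only shows how an implementation could be
selected by name from configuration.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class ConsensusLog(ABC):
    """Hypothetical boundary the cluster code would depend on."""

    @abstractmethod
    def propose(self, entry: bytes) -> int:
        """Submit an entry and return its 1-based log index once applied."""

    @abstractmethod
    def is_leader(self) -> bool: ...


class SingleNodeLog(ConsensusLog):
    """Stand-in for tests: appends locally and applies immediately."""

    def __init__(self, apply: Callable[[bytes], None]) -> None:
        self._apply = apply
        self._entries: List[bytes] = []

    def propose(self, entry: bytes) -> int:
        self._entries.append(entry)
        self._apply(entry)  # hand the entry to the state machine
        return len(self._entries)

    def is_leader(self) -> bool:
        return True  # a single node is always its own leader


# Registry of available implementations. A wrapper around the existing Raft
# code or around a third-party library would be registered here as well.
_FACTORIES: Dict[str, Callable[[Callable[[bytes], None]], ConsensusLog]] = {
    "single": SingleNodeLog,
}


def create_consensus(name: str, apply: Callable[[bytes], None]) -> ConsensusLog:
    """Create the implementation named in the configuration."""
    try:
        factory = _FACTORIES[name]
    except KeyError:
        raise ValueError(f"unknown consensus implementation: {name}") from None
    return factory(apply)
```

For example, create_consensus("single", applied.append) returns a log whose
propose() call appends to the applied list and returns the entry's index.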



  3.  Engineering components

The cluster is missing some components: a monitoring system (this should
already be a work in progress in the community, and we’d like to help if
needed), a tool to migrate single-node data into the cluster, and tools to
help with failure recovery. We need to build these tools to make the system
observable and recoverable.



  4.  Test

As the new test architecture is being introduced into the community, we will
try to add test cases under it.


Most of the solutions above have not been investigated deeply, so any ideas
are welcome.

What’s the benefit of the work?
We intend to run this version in production so that we can collect feedback
and bugs from real users and iterate on them. Eventually it should become the
baseline of a stable cluster version.

Why not do this in the new architecture?
We are not doing this under the new architecture because its planning has
only just started and we can’t wait any longer. Nearly all of this work does
not conflict with the new architecture and should be reusable there.

Please feel free to reply to this email if you have any concerns or ideas.

----------------------------------------------------------
Thanks!
Jianyun Cheng
