Hi,
I think it is fine to keep maintaining the current cluster module,
whether on the rel/0.12 branch or on master.
Looking forward to the progress.

Best,
-----------------------------------
Xiangdong Huang
School of Software, Tsinghua University


Jianyun Cheng <chengjian...@360.cn> wrote on Wednesday, November 24, 2021 at 5:38 PM:
>
> Hi all,
>
> After running the cluster version with an online stream for about two weeks,
> we experienced two failures in which the cluster became unresponsive and
> could not be recovered by restarting. We also did not find an effective way
> to recover data from the cluster. So we'd like to build a testable cluster
> version for enterprise use, which should have the following properties:
>
>
>   1.  Write operations are not blocked frequently.
>   2.  Query bugs are tolerable, as they can be fixed and iterated on quickly.
>   3.  Most issues can be resolved by restarting nodes or the whole cluster.
>   4.  There is a solution to recover from otherwise unrecoverable issues
> after losing a small part of the data.
>   5.  A cluster restart completes in a reasonable amount of time.
>   6.  The system has a monitoring mechanism.
> We're planning to improve the following aspects:
>
>
>   1.  Metadata uses too much memory
>
> In our scenario, the measurement scale is large (around 1 billion
> measurements), but the data point ingestion rate is small (100K points per
> second). We found that the cluster nodes cannot hold the metadata within
> their memory limit (each node has 256 GB of memory). Because of the small
> request rate, the CPU load is only about 1% ~ 2%. For this scenario, we
> intend to introduce a third-party storage component such as RocksDB to help
> manage the schema metadata. Of course, this would be optional and
> configurable.
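>
> To make the idea more concrete, below is a minimal sketch of what an
> optional RocksDB-backed schema store could look like. This is only an
> illustration under our assumptions; the class and method names
> (RocksDBSchemaStore, putSchema, getSchema) are made up for the sketch and
> are not an existing IoTDB API.
>
> import java.nio.charset.StandardCharsets;
>
> import org.rocksdb.Options;
> import org.rocksdb.RocksDB;
> import org.rocksdb.RocksDBException;
>
> // Illustrative sketch only: keeps serialized timeseries schemas on disk in
> // RocksDB instead of pinning all of them in memory.
> public class RocksDBSchemaStore implements AutoCloseable {
>
>   static {
>     RocksDB.loadLibrary();
>   }
>
>   private final RocksDB db;
>
>   public RocksDBSchemaStore(String dataDir) throws RocksDBException {
>     Options options = new Options().setCreateIfMissing(true);
>     this.db = RocksDB.open(options, dataDir);
>   }
>
>   // Persist the serialized schema of a timeseries keyed by its full path.
>   public void putSchema(String timeseriesPath, byte[] serializedSchema)
>       throws RocksDBException {
>     db.put(timeseriesPath.getBytes(StandardCharsets.UTF_8), serializedSchema);
>   }
>
>   // Load a schema on demand; returns null if the path is unknown.
>   public byte[] getSchema(String timeseriesPath) throws RocksDBException {
>     return db.get(timeseriesPath.getBytes(StandardCharsets.UTF_8));
>   }
>
>   @Override
>   public void close() {
>     db.close();
>   }
> }
>
> Whether such a store is enabled would then be controlled by a configuration
> item, keeping the current in-memory behavior as the default.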
>
>
>
>   2.  Raft implementation
>
> For this one, we plan to take two steps. First, we'd like to abstract the
> Raft interfaces and make Raft an independent component. This should also be
> a work item when implementing the new architecture. Second, we'd like to
> introduce a third-party Raft library such as Ratis and, ideally, make it
> configurable.
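>
> As a rough illustration of the first step (the names below are made up for
> the sketch and are not a committed design), the abstraction could expose a
> small consensus-layer interface that either the current implementation or a
> Ratis-backed one could plug into:
>
> import java.util.concurrent.CompletableFuture;
>
> // Illustrative sketch only: a minimal consensus abstraction that the
> // data/metadata engines program against, independent of the concrete Raft
> // implementation (built-in or a 3rd-party library such as Ratis).
> public interface ConsensusLayer {
>
>   // Replicate a request through Raft; the future completes once the entry
>   // has been committed and applied to the local state machine.
>   CompletableFuture<ConsensusResponse> write(byte[] request);
>
>   // Serve a read according to the implementation's consistency strategy
>   // (e.g. leader read or read index).
>   CompletableFuture<ConsensusResponse> read(byte[] request);
>
>   // The engine registers its state machine so the layer can apply committed
>   // log entries to it.
>   void registerStateMachine(StateMachine stateMachine);
>
>   interface StateMachine {
>     ConsensusResponse apply(byte[] committedEntry);
>   }
>
>   final class ConsensusResponse {
>     public final boolean success;
>     public final byte[] payload;
>
>     public ConsensusResponse(boolean success, byte[] payload) {
>       this.success = success;
>       this.payload = payload;
>     }
>   }
> }
>
> With such a boundary in place, switching between the built-in implementation
> and a library like Ratis would mostly become a configuration choice.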
>
>
>
>   3.  Engineering components
>
> The cluster is missing some components, such as a monitoring system (this
> one seems to be in progress in the community; we'd like to help if needed),
> a tool for migrating single-node data into the cluster, and tools to help
> with failure recovery. We need these tools to make the system observable and
> recoverable.
>
>
>
>   4.  Test
>
> As the new test architecture is being introduced into the community, we will
> try to add test cases under it.
>
>
> Most of the solutions above have not been investigated deeply; any ideas are
> welcome.
>
> What's the benefit of this work?
> We intend to run this version in production so that we can collect feedback
> and bugs from real users and iterate on that, and finally make it a baseline
> for a stable cluster version.
>
> Why not do this in the new architecture?
> We are not doing this under the new architecture because planning for it has
> only just started and we can't wait any longer. Also, nearly all of this work
> does not conflict with the new architecture and should be reusable there.
> Please feel free to reply to this email if you have any concerns or ideas.
>
> We welcome any discussion.
>
> ----------------------------------------------------------
> Thanks!
> Jianyun Cheng
>
