Hi, I think it is fine to keep maintaining the current cluster module, either on rel/0.12 and master branch. Look forward to progress.
Best, ----------------------------------- Xiangdong Huang School of Software, Tsinghua University 黄向东 清华大学 软件学院 程建云 <chengjian...@360.cn> 于2021年11月24日周三 下午5:38写道: > > Hi, all > > After running cluster version with online stream about two weeks, we > experienced two times of failures that cluster is no response and can't > recover by restarting. And we didn't find an effective way to recover data > from cluster. So we'd like to make testable cluster version in enterprise > which should have the properties: > > > 1. Write operation won’t be blocked frequently. > 2. Query bugs are tolerant as it could be fixed and iterate quickly. > 3. Most of issues could be resolve by restart nodes or cluster. > 4. Exist a solution to solve the unrecoverable issue after lose small part > of data. > 5. Cluster restart could complete in a proper time. > 6. System has monitor mechanism. > We’re planning improve from below aspects: > > > 1. Meta data use too much memory > > In our scenario, the measurement scale is large which would be around 1 > billion but we have small data point ingestion (100K per second). We found > the cluster node can’t afford the metadata storage as memory limitation(each > nodes has 256G memory). As the small data point request rate, the CPU load > is only about 1% ~ 2%. For the scenario, we intended to import some 3rd party > storage component like RocksDB to help manage schema meta data. Of course, > this would be optional and can be configured. > > > > 1. Raft implementation > > For this one, we planned to make it two steps. First, we’d like to abstract > the interfaces of Raft, try to make Raft as a independent component. This > should also be one work item when implement new architecture. Second, we’d > like to import some 3rd party Raft library like Ratis and make it > configurable ideally. > > > > 1. Engineering components > > Cluster missed some components like monitor system(this one should be working > in progress by community, we’d like to help if needed), migration single node > data into cluster which would help migrate single node to cluster and tools > to help do failure recovery. We need to make these tools to make the system > observable and recoverable. > > > > 1. Test > > As new test architecture is importing into community, we would try to > complement test cases under new architecture. > > > Most of the solutions above are not investigate deeply, any idea is welcomed. > > What’s the benefit of the work? > We intend to make the version run on production so that we can collect > feedback/bugs from real user and iterate by that. And finally become a > baseline of stable cluster version. > > Why won’t make it in new architecture? > We don’t do this under new architecture because the new architecture just > started planning and we can’t wait anymore. And nearly all of the work > doesn’t conflict with new architecture and could be usable in new > architecture. > Please feel free to reply the email to discussion if you have any concern or > idea. > > Welcome to discuss if you have any concern. > > ---------------------------------------------------------- > Thanks! > Jianyun Cheng >