Hi all,

After running the cluster version with an online stream for about two weeks, we experienced two failures in which the cluster stopped responding and could not be recovered by restarting. We also did not find an effective way to recover data from the cluster. So we would like to build a testable cluster version for enterprise use, which should have the following properties:
1. Write operations are not blocked frequently.
2. Query bugs are tolerable, as they can be fixed and iterated on quickly.
3. Most issues can be resolved by restarting nodes or the cluster.
4. There is a way to resolve an unrecoverable issue after losing a small part of the data.
5. A cluster restart completes in a reasonable time.
6. The system has a monitoring mechanism.

We plan to improve the following aspects:

1. Metadata uses too much memory
In our scenario, the number of measurements is large, around 1 billion, but the data point ingestion rate is small (100K per second). We found that the cluster nodes cannot hold the metadata because of memory limits (each node has 256 GB of memory). Because the request rate is low, the CPU load is only about 1% to 2%. For this scenario, we intend to introduce a third-party storage component such as RocksDB to help manage schema metadata. Of course, this would be optional and configurable.

2. Raft implementation
We plan to do this in two steps. First, we would like to abstract the Raft interfaces and make Raft an independent component. This should also be a work item when the new architecture is implemented. Second, we would like to introduce a third-party Raft library such as Ratis, ideally making it configurable.

3. Engineering components
The cluster is missing some components, such as a monitoring system (this should already be a work in progress in the community; we would like to help if needed), a tool for migrating single-node data into the cluster, and tools for failure recovery. We need to build these tools to make the system observable and recoverable.

4. Test
As the new test architecture is being introduced in the community, we will try to add test cases under the new architecture.

Most of the solutions above have not been investigated deeply, so any ideas are welcome.

What’s the benefit of the work?
We intend to run this version in production so that we can collect feedback and bug reports from real users and iterate on that basis.
Eventually, it should become the baseline for a stable cluster version.

Why not do this in the new architecture?
We are not doing this under the new architecture because its planning has only just started and we cannot wait any longer. Also, nearly all of this work is compatible with the new architecture and could be reused in it.

Please feel free to reply to this email if you have any concerns or ideas.

----------------------------------------------------------
Thanks!
Jianyun Cheng