Reliability of Ignite is very important to me, so please consider the following idea:
- Important threads such as the WAL writer (as an example of any critical thread) must never perform a blocking action. This can be done in the following way:
  - The WAL thread becomes a management thread for all WAL operations.
  - A child worker thread of the WAL writer performs the separate operations that implement the concrete WAL writes.
  - Operations are separate units of work. They can be tracked (by a heartbeat, for example), and each has its own characteristics and id.
  - Operations are placed in a queue and have a state.
  - If a hang occurs in a concrete operation, that operation can be cancelled (together with all of its child operations in the cluster) while all other operations continue to work. The failed operation either goes to a recovery state or the failure is reported to the user.
  - If the WAL child thread is stuck in an infinitely blocking operation, that worker thread has to be killed and a new one started with the same queue of WAL operations.

So we become able to:
- always know which concrete operation is hung (instead of only knowing that the whole main WAL thread is hung), so we can better decide what to do;
- never have an unresponsive WAL thread: at minimum it reports that some operation is taking a long time, and it can still put the next operation into the queue or propose a failure;
- report the size of the queue and other detailed information about what is happening, which allows a precise decision;
- fail concrete user operations, clean up resources, spawn a new worker thread or similar, and continue to work without a painful node or cluster restart;
- keep the cleanup as small as possible (just some operations);
- balance operations with queues, and also implement backpressure, so that an optimal load is kept and the cluster does not degrade because of local oversaturation;
- never see a node hang; the node only degrades and stays in a fully controlled state.

The functions that check and manage WAL thread operations can be encapsulated in a special class and called from the other main threads, as they are now.

Sorry for any inconvenience, I'm new to writing here.

--
Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/
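
To make the idea above more concrete, here is a minimal sketch in plain Java. It is not Ignite code, and every name in it (`SupervisedQueue`, `Op`, `State`, `submit`, `checkForHang`) is made up for illustration. It assumes a single worker thread and a fixed hang timeout. Since Java has no safe way to kill a thread, "killing" the worker here only means interrupting it and abandoning it; an operation that ignores interrupts would leave a leaked daemon thread behind.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch, not Ignite code: a queue of operations run by a replaceable worker. */
class SupervisedQueue {
    enum State { QUEUED, RUNNING, DONE, FAILED }

    /** One unit of work with an id and a state. */
    static final class Op {
        final long id;
        final Runnable body;
        volatile State state = State.QUEUED;
        volatile long startedAtNanos;
        Op(long id, Runnable body) { this.id = id; this.body = body; }
    }

    private final BlockingQueue<Op> queue = new LinkedBlockingQueue<>();
    private final Map<Long, Op> ops = new ConcurrentHashMap<>();
    private final long hangTimeoutNanos;
    private Op current;     // operation being executed now; guarded by this
    private Thread worker;  // the only thread allowed to take new work; guarded by this
    private long nextId;    // guarded by this

    SupervisedQueue(long hangTimeoutMillis) {
        this.hangTimeoutNanos = hangTimeoutMillis * 1_000_000L;
        startWorker();
    }

    /** Enqueues an operation and returns its id; the caller never waits for the work. */
    synchronized long submit(Runnable body) {
        Op op = new Op(++nextId, body);
        ops.put(op.id, op);
        queue.add(op);
        return op.id;
    }

    State stateOf(long id) { return ops.get(id).state; }

    int queueSize() { return queue.size(); }

    /** Meant to be polled by the management thread. If the running operation exceeded
     *  the timeout, marks it FAILED, replaces the worker and returns the id; else -1. */
    synchronized long checkForHang() {
        Op op = current;
        if (op == null || op.state != State.RUNNING
                || System.nanoTime() - op.startedAtNanos < hangTimeoutNanos)
            return -1;
        op.state = State.FAILED;
        worker.interrupt(); // best effort: helps only if the operation reacts to interrupts
        startWorker();      // the new worker continues with the same queue
        return op.id;
    }

    private void startWorker() {
        Thread t = new Thread(this::workerLoop, "op-worker");
        t.setDaemon(true);
        worker = t;
        t.start();
    }

    private void workerLoop() {
        Thread self = Thread.currentThread();
        while (true) {
            Op op;
            try {
                op = queue.take();
            } catch (InterruptedException e) {
                return;
            }
            synchronized (this) {
                op.startedAtNanos = System.nanoTime();
                op.state = State.RUNNING;
                current = op;
            }
            boolean ok = true;
            try {
                op.body.run();
            } catch (Throwable t) {
                ok = false;
            }
            synchronized (this) {
                if (op.state == State.RUNNING) op.state = ok ? State.DONE : State.FAILED;
                if (current == op) current = null;
                if (worker != self) return; // replaced after a hang: stop taking work
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SupervisedQueue q = new SupervisedQueue(100);
        long hung = q.submit(() -> {
            try { Thread.sleep(60_000); } catch (InterruptedException e) { /* cancelled */ }
        });
        long ok = q.submit(() -> { });
        long failed;
        while ((failed = q.checkForHang()) == -1) Thread.sleep(10);
        Thread.sleep(100); // give the replacement worker time to drain the queue
        System.out.println("failed=" + failed + " hung=" + q.stateOf(hung) + " ok=" + q.stateOf(ok));
    }
}
```

In this sketch the management side only calls `submit`, `checkForHang`, `stateOf` and `queueSize`, none of which wait for the work itself, so it can always report which operation is stuck and how long the queue is. Backpressure, recovery of a failed operation and cancelling child operations across the cluster are left out.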