Will this address your problem? We don't have distinct actions based on
ERROR codes that the controller would understand and act on differently.
Were you looking for something like that?
I will need to think more about this. I think the retry mechanism might
be good enough for now.
Good point on not differentiating whether a partition existed before
versus being newly created. We actually plan to modify the
drop-notification behavior; Jason/Terence are discussing this in another
thread. Please add your suggestion to that thread. We should probably
have create and drop methods (not transitions) on the participants.
Currently, how do other systems that use Helix handle the bootstrapping
process? When a resource is created for the first time, a participant's
actions differ from those taken later, when a resource partition is
expanded onto another instance. Specifically, there are three cases that
need to be handled with respect to bootstrapping:
1. A cluster is up and running, and a new resource is created and
rebalanced.
2. A cluster that had resources is being started after being shut down.
3. A cluster is running and a resource is already laid out on the
cluster. Then some partitions are moved to instances that previously did
not have any partitions of that resource.
I looked through the examples and found the ClusterMessagingService
interface that can be used to send messages to instances in the cluster.
I can see that case 3 can be handled by using the messaging infrastructure.
However, in both cases 1 and 2 the resource partitions start in the
OFFLINE state. The messaging API cannot help here because, for a given
resource, all instances in the cluster are in the same boat in cases 1
and 2. So what is the preferred way to know whether you are in case 1 or
case 2? One way I see: if you have local artifacts matching the
partitions that are transitioning from OFFLINE to SLAVE, you could
infer it is case 2. Is that how other systems solve this issue?
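To make the idea concrete, here is a minimal sketch of the check I have in mind. The data-directory layout (one subdirectory per partition) and the class/method names are my own assumptions for illustration, not part of the Helix API; the real check would run inside the OFFLINE-to-SLAVE transition callback.

```java
import java.io.File;

public class BootstrapDetector {
    // Hypothetical local storage root: one subdirectory per partition.
    private final File dataDir;

    public BootstrapDetector(File dataDir) {
        this.dataDir = dataDir;
    }

    // Intended to be consulted during the OFFLINE -> SLAVE transition:
    // if local artifacts for this partition already exist, assume a
    // restart (case 2); otherwise assume a fresh bootstrap (case 1).
    public boolean isRestart(String partitionName) {
        File partitionDir = new File(dataDir, partitionName);
        String[] contents = partitionDir.list();
        return partitionDir.isDirectory() && contents != null && contents.length > 0;
    }
}
```

The obvious caveat is that local artifacts can be stale or partially written, so a production version would probably also validate the artifacts against some metadata rather than trusting directory existence alone.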
On a separate note, is the messaging infrastructure general purpose?
That is, can applications use it to perform RPC within the cluster,
obviating the need for a separate RPC mechanism like Avro? I can see
that the handler would need more code than one would write to get RPC
working with Avro, but my question is about the design intent of the
messaging infrastructure.
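To clarify what I mean by "more code than Avro": with a generic messaging layer the application routes messages to handlers and encodes/decodes payloads itself, which is roughly the boilerplate an RPC stack generates for you. A toy version of that hand-written dispatch (the class and method names are my own illustration, not the Helix API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy request dispatcher: the application registers a handler per message
// type and routes incoming payloads to it, returning a reply payload.
public class MessageDispatcher {
    private final Map<String, Function<String, String>> handlers = new HashMap<>();

    public void register(String messageType, Function<String, String> handler) {
        handlers.put(messageType, handler);
    }

    public String dispatch(String messageType, String payload) {
        Function<String, String> handler = handlers.get(messageType);
        if (handler == null) {
            throw new IllegalArgumentException("no handler for " + messageType);
        }
        return handler.apply(payload);
    }
}
```

An RPC layer would additionally generate typed stubs and handle serialization, which is exactly the part left to the application here.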
Thanks,
Vinayak