Thanks Kishore, here is the link to the bug: https://issues.apache.org/jira/browse/HELIX-131
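Further down the thread, Lance asks how to keep ZKHelixAdmin from reconnecting forever. Independent of whatever hook Helix exposes, the fail-fast behavior he wants can be had by wrapping the connect call in a bounded-retry helper. This is a generic sketch; `withRetries` is an illustrative helper, not a Helix API:

```java
import java.util.concurrent.Callable;

public class BoundedRetry {
    /**
     * Run the given action up to maxAttempts times, sleeping backoffMillis
     * between attempts. Rethrows the last failure instead of retrying forever.
     */
    public static <T> T withRetries(Callable<T> action, int maxAttempts, long backoffMillis)
            throws Exception {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) Thread.sleep(backoffMillis);
            }
        }
        throw last; // fail fast after maxAttempts instead of looping infinitely
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // A stand-in for "new ZKHelixAdmin(zkAddress)" that fails twice, then succeeds.
        String result = withRetries(() -> {
            calls[0]++;
            if (calls[0] < 3) throw new RuntimeException("connection refused");
            return "connected";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The real connect attempt (constructing ZKHelixAdmin, or HelixManager.connect()) would go inside the lambda.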
On Tue, Jun 18, 2013 at 9:13 AM, kishore g <[email protected]> wrote:

My bad, I didn't realize that you needed HelixAdmin to actually create the cluster. Please file a bug; the fix is quite simple.

thanks,
Kishore G

On Tue, Jun 18, 2013 at 9:00 AM, Lance Co Ting Keh <[email protected]> wrote:

Thanks Kishore. Would you like me to file a bug fix for the first solution?

Also, with the use of the factory, I get the following error message:

    [error] org.apache.helix.HelixException: Initial cluster structure is not set up for cluster: dev-box-cluster

It seems it did not create the appropriate zNodes for me. Was there something I was supposed to initialize before calling the factory?

Thank you
Lance

On Mon, Jun 17, 2013 at 8:09 PM, kishore g <[email protected]> wrote:

Hi Lance,

Looks like we are not setting the connection timeout while connecting to ZooKeeper in ZKHelixAdmin.

The fix is to change line 99 in ZKHelixAdmin.java from

    _zkClient = new ZkClient(zkAddress);

to

    _zkClient = new ZkClient(zkAddress, timeout * 1000);

Another workaround is to use HelixManager to get HelixAdmin:

    manager = HelixManagerFactory.getZKHelixManager(cluster, "Admin",
        InstanceType.ADMINISTRATOR, zkAddress);
    manager.connect();
    admin = manager.getClusterManagmentTool();

This will wait for 60 seconds before failing.

Thanks,
Kishore G

On Mon, Jun 17, 2013 at 6:15 PM, Lance Co Ting Keh <[email protected]> wrote:

Thank you Kishore. I'll definitely try the memory consumption of one JVM per node.js server first. If it's too much we'll likely do your proposed design but execute kills via the OS. This is to ensure no rogue servers.

I have a small implementation question: when calling new ZKHelixAdmin, if it fails it retries again and again infinitely.
(val admin = new ZKHelixAdmin("")) Is there a method I can override to limit the number of reconnects and just have it fail?

Lance

On Sun, Jun 16, 2013 at 11:56 PM, kishore g <[email protected]> wrote:

Hi Lance,

Looks good to me. Having a JVM per node.js server might add additional overhead; you should definitely run this with production configuration and ensure that it does not impact performance. If you find it consuming too many resources, you can probably try this approach:

1. Have one agent per node.
2. Instead of creating a separate Helix agent per node.js, you can create multiple participants within the same agent. Each participant will represent a node.js process.
3. The monitoring of participant LIVEINSTANCES and the killing of node.js processes can be done by one of the Helix agents. You create another resource using the leader-standby model. Only one Helix agent will be the leader; it will monitor the LIVEINSTANCES, and if any Helix agent dies it can ask the corresponding node.js server to kill itself (you can use HTTP or any other mechanism of your choice). The idea here is to designate one leader in the system to ensure that each Helix agent and its node.js act like a pair.

You can try this only if you find that the overhead of a JVM per server is significant with the approach you have listed.

Thanks,
Kishore G

On Fri, Jun 14, 2013 at 8:37 PM, Lance Co Ting Keh <[email protected]> wrote:

Thank you for your advice Santiago. That is certainly part of the design as well.

Best,
Lance

On Fri, Jun 14, 2013 at 5:32 PM, Santiago Perez <[email protected]> wrote:

Helix user here (not developer), so take my words with a grain of salt.
Regarding point 7 (the /LIVEINSTANCES watch): you might want to consider the behavior of the node.js instance if it loses its connection to ZK; you'll probably want to kill it too. Otherwise you could end up ignoring the fact that the JVM lost the connection.

Regards,
Santiago

On Fri, Jun 14, 2013 at 6:30 PM, Lance Co Ting Keh <[email protected]> wrote:

We have a working prototype of basically something like #2 you proposed above. We're using the standard Helix participant, and in the @Transition methods of the state model we send commands to node.js via HTTP.

I want to run you through our general architecture to make sure we are not violating anything on the Helix side. As a reminder, what we need to guarantee is that at any given time one and only one node.js process is in charge of a task.

1. A machine with N cores will have N (pending testing) node.js processes running.
2. Associated with each of the N node processes are N Helix participants (separate JVM instances -- the reason for this comes later).
3. A separate Helix controller will be running on the machine and will just leader-elect between machines.
4. The spectator router will likely be HAProxy, and thus a Linux box will run a JVM to serve as the Helix spectator.
5. The state machine for each will simply be the ONLINE-OFFLINE model. (I do get error messages saying that I haven't defined an OFFLINE-to-DROPPED transition; I was going to ask you about this, but it is a minor detail compared to the rest of the architecture.)
6. A simple Bash script will serve as a watchdog on each node.js/Helix-participant pair. If either of the two is "dead", the other process must immediately be SIGKILLed; hence the need for one JVM serving as a Helix participant for every node.js process.
7. Each node.js instance sets a watch on /LIVEINSTANCES directly in ZooKeeper as an extra safety blanket. If it finds that it is NOT among the live instances, it likely means that its JVM participant lost its connection to ZooKeeper but the process is still running, so the Bash script has not terminated the node server. In this case the node server must end its own process.

Thank you for all your help.

Sincerely,
Lance

On Wed, Jun 12, 2013 at 9:07 PM, kishore g <[email protected]> wrote:

Hi Lance,

Thanks for your interest in Helix. There are two possible approaches.

1. Similar to what you suggested: write a Helix participant in a non-JVM language, which in your case is node.js. There seem to be quite a few node.js implementations that can interact with ZooKeeper. A Helix participant does the following (you got it right, but here is the exact sequence):

   1. Create an ephemeral node under LIVEINSTANCES.
   2. Watch the /INSTANCES/<PARTICIPANT_NAME>/MESSAGES node for transitions.
   3. After a transition is completed, update /INSTANCES/<PARTICIPANT_NAME>/CURRENTSTATE.

The controller does most of the heavy lifting of ensuring that these transitions lead to the desired configuration. It is quite easy to re-implement this in any other language; the most difficult part would be the ZooKeeper binding. We have used the Java bindings and they are solid.
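The three numbered steps above can be seen end-to-end in a toy, in-memory model of the exchange. Plain Java maps stand in for the znodes here; a real participant would use a ZooKeeper client and Helix's actual record formats, which this sketch does not attempt to reproduce:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Toy model of the participant-side protocol; not real Helix/ZooKeeper code. */
public class ParticipantSketch {
    // Stand-ins for /LIVEINSTANCES, /INSTANCES/<name>/MESSAGES, and .../CURRENTSTATE
    final Set<String> liveInstances = new HashSet<>();
    final Map<String, Queue<String>> messages = new HashMap<>();
    final Map<String, String> currentState = new HashMap<>();

    // Step 1: create an "ephemeral" node under LIVEINSTANCES.
    void register(String participant) {
        liveInstances.add(participant);
        messages.put(participant, new ArrayDeque<>());
        currentState.put(participant, "OFFLINE");
    }

    // Controller side: drop a transition message into the participant's MESSAGES.
    void sendTransition(String participant, String toState) {
        messages.get(participant).add(toState);
    }

    // Steps 2-3: the participant "watches" MESSAGES, applies each transition,
    // then updates CURRENTSTATE so the controller can observe the result.
    void handleMessages(String participant) {
        Queue<String> q = messages.get(participant);
        while (!q.isEmpty()) {
            String toState = q.poll();
            // ...run the real transition logic here (e.g., tell node.js over HTTP)...
            currentState.put(participant, toState);
        }
    }
}
```

In the real protocol, the watch fires asynchronously and session expiry removes the ephemeral node, which is exactly the connection-loss detail Kishore flags next.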
This is at a very high level; there are some more details I have left out, like handling connection loss, session expiry, etc., that will require some thinking.

2. The other option is to use the Helix agent as a proxy: we added the Helix agent as part of 0.6.1; we haven't documented it yet. Here is the gist of what it does. Think of it as a generic state transition handler. You can configure Helix to run a specific system command as part of each transition. The Helix agent is a separate process that runs alongside your actual process. Instead of the actual process getting the transition, the Helix agent gets the transition. As part of this transition the Helix agent can invoke APIs on the actual process via RPC, HTTP, etc. The Helix agent simply acts as a proxy to the actual process.

I have another approach and will try to write it up tonight, but before that I have a few questions:

1. How many node.js servers run on each node: one, or more than one?
2. Is the spectator/router Java based or non-Java based?
3. Can you provide more details about your state machine?

thanks,
Kishore G

On Wed, Jun 12, 2013 at 11:07 AM, Lance Co Ting Keh <[email protected]> wrote:

Hi, my name is Lance Co Ting Keh and I work at Box. You did a tremendous job with Helix. We are looking to use it to manage a cluster primarily running Node.js.
Our model for using Helix would be to have node.js or some other non-JVM library be the *Participants*, a router as a *Spectator*, and another set of machines to serve as the *Controllers* (pending testing, we may just run master-slave controllers on the same instances as the Participants). The participants will interact with ZooKeeper in two ways: one is to receive Helix state transition messages through the HelixManager<Participant> instance, and the other is to interact with ZooKeeper directly just to maintain ephemeral nodes within /INSTANCES. Maintaining ephemeral nodes directly in ZooKeeper would be done instead of using InstanceConfig and calling addInstance on HelixAdmin, because of the basic health checking baked into maintaining ephemeral nodes. Otherwise we would have to write a health checker between Node.js and the JVM running the Participant. Are there better alternatives for non-JVM Helix participants? I corresponded with Kishore briefly and he mentioned Helix agents, specifically the ProcessMonitorThread that came out in the last release.

Thank you very much!

Lance Co Ting Keh
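Kishore's leader-standby suggestion upthread (one leader watches /LIVEINSTANCES and asks the node.js twin of any dead agent to exit) reduces at its core to a set difference. A self-contained sketch of just that decision logic follows; the class and method names are illustrative, not Helix APIs, and a real leader would read the live set from ZooKeeper:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Sketch of the leader's check: which node.js processes have lost their Helix agent? */
public class LeaderWatch {
    /**
     * expectedAgents: every Helix agent that should be running (one per node.js).
     * liveAgents: the current children of /LIVEINSTANCES.
     * Returns the dead agents, whose paired node.js servers should be told to exit.
     */
    public static Set<String> agentsToReap(Set<String> expectedAgents, Set<String> liveAgents) {
        Set<String> dead = new TreeSet<>(expectedAgents);
        dead.removeAll(liveAgents); // expected but not live => its node.js pair is orphaned
        return dead;
    }

    public static void main(String[] args) {
        Set<String> expected = new HashSet<>();
        expected.add("agent_1");
        expected.add("agent_2");
        expected.add("agent_3");
        Set<String> live = new HashSet<>();
        live.add("agent_1");
        live.add("agent_3");
        // agent_2 died: the leader should ask its node.js twin to kill itself (e.g., via HTTP).
        System.out.println(agentsToReap(expected, live)); // prints [agent_2]
    }
}
```

The leader would run this on every /LIVEINSTANCES child-change event; combined with the node.js-side watch in point 7 of Lance's list, both halves of a dead pair get cleaned up.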
