Re: General Architecture built around Helix

Lance Co Ting Keh Mon, 24 Jun 2013 16:13:44 -0700

Thanks Kishore!


On Sun, Jun 23, 2013 at 10:42 PM, kishore g <[email protected]> wrote:

> Hi Lance,
>
> That's a fairly simple fix. I will provide the fix tomorrow.
>
> thanks,
> Kishore G
>
>
> On Sun, Jun 23, 2013 at 2:18 PM, Lance Co Ting Keh <[email protected]> wrote:
>
>> Hi Kishore,
>>
>> Hope you are having a restful weekend. I was just wondering when I should
>> normally expect the bug fix to go through?
>>
>>
>> Thank you very much,
>> Lance
>>
>>
>> On Tue, Jun 18, 2013 at 1:36 PM, Lance Co Ting Keh <[email protected]> wrote:
>>
>>> Thanks Kishore, here is the link to the bug:
>>> https://issues.apache.org/jira/browse/HELIX-131
>>>
>>>
>>> On Tue, Jun 18, 2013 at 9:13 AM, kishore g <[email protected]> wrote:
>>>
>>>> My bad, I didn't realize that you needed HelixAdmin to actually create
>>>> the cluster. Please file a bug; the fix is quite simple.
>>>>
>>>> thanks,
>>>> Kishore G
>>>>
>>>>
>>>> On Tue, Jun 18, 2013 at 9:00 AM, Lance Co Ting Keh <[email protected]> wrote:
>>>>
>>>>> Thanks Kishore. Would you like me to file a bug for the first
>>>>> solution?
>>>>>
>>>>> Also, when I use the factory, I get the following error message:
>>>>> [error] org.apache.helix.HelixException: Initial cluster structure is not set up for cluster: dev-box-cluster
>>>>>
>>>>> It seems it did not create the appropriate znodes for me. Was there
>>>>> something I was supposed to initialize before calling the factory?
>>>>>
>>>>> Thank you
>>>>> Lance
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Jun 17, 2013 at 8:09 PM, kishore g <[email protected]> wrote:
>>>>>
>>>>>> Hi Lance,
>>>>>>
>>>>>> It looks like we are not setting the connection timeout when connecting
>>>>>> to ZooKeeper in ZKHelixAdmin.
>>>>>>
>>>>>> The fix is to change line 99 in ZKHelixAdmin.java from
>>>>>> _zkClient = new ZkClient(zkAddress);
>>>>>> to
>>>>>> _zkClient = new ZkClient(zkAddress, timeout * 1000);
>>>>>>
>>>>>> Another workaround is to use HelixManager to get the HelixAdmin:
>>>>>>
>>>>>> manager = HelixManagerFactory.getZKHelixManager(cluster, "Admin",
>>>>>>     InstanceType.ADMINISTRATOR, zkAddress);
>>>>>> manager.connect();
>>>>>> admin = manager.getClusterManagmentTool();
>>>>>>
>>>>>> This will wait for 60 seconds before failing.
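The difference between retrying forever and failing after a timeout can be illustrated without Helix or ZooKeeper at all. The following is only a minimal sketch of the general pattern, not the ZkClient implementation; the class name BoundedConnect, the method connectWithin, and the retry delay parameter are all hypothetical names chosen for illustration.

```java
import java.util.function.BooleanSupplier;

/** Illustrative only: retry a connection attempt until a deadline, then give up. */
final class BoundedConnect {

    /**
     * Calls attempt repeatedly until it returns true or timeoutMillis has elapsed.
     * Returns true on success and false if the deadline passes (or the thread is
     * interrupted) before any attempt succeeds.
     */
    static boolean connectWithin(BooleanSupplier attempt, long timeoutMillis, long retryDelayMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (attempt.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // give up instead of looping forever
            }
            try {
                Thread.sleep(retryDelayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for an address that never answers: every attempt fails.
        boolean ok = connectWithin(() -> false, 200, 50);
        System.out.println("connected = " + ok); // prints "connected = false"
    }
}
```

A caller that gets false back can then raise its own error, which is the behavior asked for in the question quoted below.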
>>>>>> Thanks,
>>>>>> Kishore G
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 17, 2013 at 6:15 PM, Lance Co Ting Keh <[email protected]> wrote:
>>>>>>
>>>>>>> Thank you, Kishore. I'll definitely test the memory consumption of one
>>>>>>> JVM per node.js server first. If it's too much, we'll likely go with your
>>>>>>> proposed design but execute kills via the OS. This is to ensure there are
>>>>>>> no rogue servers.
>>>>>>>
>>>>>>> I have a small implementation question. When I call new ZKHelixAdmin
>>>>>>> (val admin = new ZKHelixAdmin("")) and the connection fails, it retries
>>>>>>> again and again, indefinitely. Is there a method I can override to limit
>>>>>>> the number of reconnects and just have it fail?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Lance
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Jun 16, 2013 at 11:56 PM, kishore g <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Lance,
>>>>>>>>
>>>>>>>> Looks good to me. Having a JVM per node.js server might add
>>>>>>>> additional overhead; you should definitely run this with a production
>>>>>>>> configuration and ensure that it does not impact performance. If you
>>>>>>>> find it consuming too many resources, you can try this approach:
>>>>>>>>
>>>>>>>>    1. Have one agent per node.
>>>>>>>>    2. Instead of creating a separate Helix agent per node.js server,
>>>>>>>>    you can create multiple participants within the same agent. Each
>>>>>>>>    participant will represent one node.js process.
>>>>>>>>    3. The monitoring of the participants' LIVEINSTANCES entries and the
>>>>>>>>    killing of node.js processes can be done by one of the Helix agents.
>>>>>>>>    You create another resource using the leader-standby model. Only one
>>>>>>>>    Helix agent will be the leader; it will monitor LIVEINSTANCES, and
>>>>>>>>    if any Helix agent dies it can ask the node.js servers to kill
>>>>>>>>    themselves (you can use HTTP or any other mechanism of your
>>>>>>>>    choice). The idea here is to designate one leader in the system to
>>>>>>>>    ensure that the Helix agent and node.js act as a pair.
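The leader's monitoring step in the list above boils down to comparing two snapshots of LIVEINSTANCES and acting on whatever disappeared. Here is a rough sketch of just that comparison, using plain Java collections rather than any Helix API. The class LiveInstanceMonitor, its method names, the participant names, and the localhost URLs are all assumptions made for illustration; how the leader actually reads LIVEINSTANCES and contacts node.js is left out.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: what the leader would compute after a LIVEINSTANCES change. */
final class LiveInstanceMonitor {

    /** Participants in the previous snapshot that are missing from the current one. */
    static Set<String> departed(Set<String> previous, Set<String> current) {
        Set<String> gone = new TreeSet<>(previous);
        gone.removeAll(current);
        return gone;
    }

    /** The node.js servers paired with departed participants; the leader would ask each to stop. */
    static Set<String> serversToStop(Set<String> departed, Map<String, String> participantToServer) {
        Set<String> servers = new TreeSet<>();
        for (String participant : departed) {
            String server = participantToServer.get(participant);
            if (server != null) {
                servers.add(server);
            }
        }
        return servers;
    }

    public static void main(String[] args) {
        // Hypothetical participant names and node.js endpoints.
        Map<String, String> pairs = Map.of(
                "participant-1", "http://localhost:3001",
                "participant-2", "http://localhost:3002",
                "participant-3", "http://localhost:3003");
        Set<String> before = Set.of("participant-1", "participant-2", "participant-3");
        Set<String> after = Set.of("participant-1", "participant-3");

        Set<String> gone = departed(before, after);
        System.out.println(serversToStop(gone, pairs)); // prints [http://localhost:3002]
    }
}
```

Only the current leader would run this comparison, so a server is asked to stop by exactly one agent.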
>>>>>>>>
>>>>>>>> You can try this only if you find that the overhead of the JVM is
>>>>>>>> significant with the approach you have listed.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Kishore G
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jun 14, 2013 at 8:37 PM, Lance Co Ting Keh <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thank you for your advice, Santiago. That is certainly part of the
>>>>>>>>> design as well.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Lance
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jun 14, 2013 at 5:32 PM, Santiago Perez <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Helix user here (not a developer), so take my words with a grain of
>>>>>>>>>> salt.
>>>>>>>>>>
>>>>>>>>>> Regarding 6, you might want to consider the behavior of the
>>>>>>>>>> node.js instance if that instance itself loses its connection to zk.
>>>>>>>>>> You'll probably want to kill it too; otherwise you could miss the fact
>>>>>>>>>> that the JVM lost its connection as well.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Santiago
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Jun 14, 2013 at 6:30 PM, Lance Co Ting Keh <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> We have a working prototype of basically something like the #2 you
>>>>>>>>>>> proposed above. We're using the standard Helix participant, and in the
>>>>>>>>>>> @Transition methods of the state model we send commands to node.js via
>>>>>>>>>>> HTTP.
>>>>>>>>>>>
>>>>>>>>>>> I want to run you through our general architecture to make sure
>>>>>>>>>>> we are not violating anything on the Helix side. As a reminder, what
>>>>>>>>>>> we need to guarantee is that at any given time one and only one node.js
>>>>>>>>>>> process is in charge of a task.
>>>>>>>>>>>
>>>>>>>>>>> 1. A machine with N cores will have N (pending testing) node.js
>>>>>>>>>>> processes running.
>>>>>>>>>>> 2. Associated with the N node.js processes are N Helix participants
>>>>>>>>>>> (separate JVM instances; the reason for this comes later).
>>>>>>>>>>> 3. A separate Helix controller will be running on the machine, and
>>>>>>>>>>> the controllers will just leader-elect between machines.
>>>>>>>>>>> 4. The spectator router will likely be HAProxy, so a JVM will run on
>>>>>>>>>>> that Linux machine to serve as the Helix spectator.
>>>>>>>>>>> 5. The state machine for each will simply be the ONLINEOFFLINE model.
>>>>>>>>>>> (However, I do get error messages saying that I haven't defined an
>>>>>>>>>>> OFFLINE to DROPPED transition. I was going to ask you about this, but
>>>>>>>>>>> it is a minor detail compared to the rest of the architecture.)
>>>>>>>>>>> 5. A simple Bash script will serve as a watchdog on each node.js and
>>>>>>>>>>> Helix participant pair. If either of the two is "dead", the other
>>>>>>>>>>> process must immediately be sent SIGKILL, hence the need for one JVM
>>>>>>>>>>> serving as Helix participant for every node.js process.
>>>>>>>>>>> 6. Each node.js instance sets a watch on /LIVEINSTANCES directly in
>>>>>>>>>>> ZooKeeper as an extra safety blanket. If it finds that it is NOT in
>>>>>>>>>>> the live instances, it likely means that its JVM participant lost its
>>>>>>>>>>> connection to ZooKeeper but the JVM process is still running, so the
>>>>>>>>>>> Bash script has not terminated the node server. In this case the node
>>>>>>>>>>> server must end its own process.
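The two safety rules at the end of the list above (the watchdog on each node.js and participant pair, and the node.js self-check against /LIVEINSTANCES) are small enough to write down as decision functions. The message describes a Bash script and a node.js watch; the Java below is not that implementation, only a model of the decisions, and the names PairWatchdog, Action, decide, and shouldSelfTerminate are hypothetical.

```java
import java.util.Set;

/** Illustrative only: the two kill rules from the list above, as pure functions. */
final class PairWatchdog {

    enum Action { NONE, KILL_NODE, KILL_PARTICIPANT }

    /**
     * Watchdog rule: if exactly one process of the pair is dead, the survivor
     * must be killed. If both are alive or both are dead there is nothing to do.
     */
    static Action decide(boolean nodeAlive, boolean participantAlive) {
        if (nodeAlive && !participantAlive) {
            return Action.KILL_NODE;
        }
        if (!nodeAlive && participantAlive) {
            return Action.KILL_PARTICIPANT;
        }
        return Action.NONE;
    }

    /**
     * Self-check rule: the node server should end its own process when its
     * participant is no longer listed under /LIVEINSTANCES.
     */
    static boolean shouldSelfTerminate(Set<String> liveInstances, String participantName) {
        return !liveInstances.contains(participantName);
    }

    public static void main(String[] args) {
        System.out.println(decide(true, false)); // prints KILL_NODE
        // Hypothetical participant names.
        System.out.println(shouldSelfTerminate(Set.of("participant-1"), "participant-2")); // prints true
    }
}
```

The self-check covers the case the watchdog cannot see: the participant JVM is still running but has lost its ZooKeeper session.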
>>>>>>>>>>>
>>>>>>>>>>> Thank you for all your help.
>>>>>>>>>>>
>>>>>>>>>>> Sincerely,
>>>>>>>>>>> Lance
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jun 12, 2013 at 9:07 PM, kishore g <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Lance,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for your interest in Helix. There are two possible
>>>>>>>>>>>> approaches
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Similar to what you suggested: write a Helix participant in a
>>>>>>>>>>>> non-JVM language, which in your case is node.js. There seem to be
>>>>>>>>>>>> quite a few implementations in node.js that can interact with
>>>>>>>>>>>> ZooKeeper. A Helix participant does the following (you got it right,
>>>>>>>>>>>> but I am providing the right sequence):
>>>>>>>>>>>>
>>>>>>>>>>>>    1. Creates an ephemeral node under LIVEINSTANCES
>>>>>>>>>>>>    2. Watches the /INSTANCES/<PARTICIPANT_NAME>/MESSAGES node for
>>>>>>>>>>>>    transitions
>>>>>>>>>>>>    3. After a transition is completed, updates
>>>>>>>>>>>>    /INSTANCES/<PARTICIPANT_NAME>/CURRENTSTATE
>>>>>>>>>>>>
>>>>>>>>>>>> The controller does most of the heavy lifting of ensuring that
>>>>>>>>>>>> these transitions lead to the desired configuration. It's quite easy
>>>>>>>>>>>> to re-implement this in any other language; the most difficult part
>>>>>>>>>>>> would be the ZooKeeper binding. We have used the Java bindings and
>>>>>>>>>>>> they are solid.
>>>>>>>>>>>> This is at a very high level; there are some more details I have
>>>>>>>>>>>> left out, like handling connection loss, session expiry, etc., that
>>>>>>>>>>>> will require some thinking.
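The three-step participant sequence described above can be walked through with an in-memory stand-in for the ZooKeeper tree. This is only a sketch of the ordering, under several assumptions: a TreeMap replaces ZooKeeper, a queue replaces the watch on the MESSAGES node, the znode paths are abbreviated as in the message, and the class ParticipantSketch with its methods is hypothetical. It does not handle connection loss or session expiry, and it is not how Helix itself is implemented.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: the participant sequence against an in-memory "tree". */
final class ParticipantSketch {

    /** Stand-in for the ZooKeeper tree: path to data. */
    final Map<String, String> tree = new TreeMap<>();

    /** Stand-in for the MESSAGES node: pending target states, oldest first. */
    private final Deque<String> messages = new ArrayDeque<>();

    private final String name;

    ParticipantSketch(String name) {
        this.name = name;
    }

    /** Step 1: announce liveness. In ZooKeeper this would be an ephemeral node. */
    void join() {
        tree.put("/LIVEINSTANCES/" + name, "alive");
    }

    /** Controller side: post a transition message for this participant. */
    void postMessage(String toState) {
        messages.add(toState);
    }

    /** Steps 2 and 3: handle each pending message, then record the resulting state. */
    void processMessages() {
        String toState;
        while ((toState = messages.poll()) != null) {
            // A real participant would run its transition logic here before reporting.
            tree.put("/INSTANCES/" + name + "/CURRENTSTATE", toState);
        }
    }

    public static void main(String[] args) {
        ParticipantSketch participant = new ParticipantSketch("node-1"); // hypothetical name
        participant.join();
        participant.postMessage("ONLINE");
        participant.processMessages();
        participant.tree.forEach((path, data) -> System.out.println(path + " = " + data));
    }
}
```

In a real port the ephemeral node is what gives the controller its liveness signal, so step 1 has to be redone whenever the ZooKeeper session is re-established.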
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2. The other option is to use the Helix agent as a proxy. We added
>>>>>>>>>>>> the Helix agent as part of 0.6.1; we haven't documented it yet. Here
>>>>>>>>>>>> is the gist of what it does. Think of it as a generic state transition
>>>>>>>>>>>> handler. You can configure Helix to run a specific system command as
>>>>>>>>>>>> part of each transition. The Helix agent is a separate process that
>>>>>>>>>>>> runs alongside your actual process. Instead of the actual process
>>>>>>>>>>>> getting the transition, the Helix agent gets the transition. As part
>>>>>>>>>>>> of this transition, the Helix agent can invoke APIs on the actual
>>>>>>>>>>>> process via RPC, HTTP, etc. The Helix agent simply acts as a proxy to
>>>>>>>>>>>> the actual process.
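The "one system command per transition" idea can be pictured as a lookup table from a state transition to a command line. The sketch below is not the helix-agent configuration format or API, which this thread does not show; the class TransitionCommands, the key format, the curl commands, and the localhost URLs are all assumptions made purely for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Illustrative only: a table mapping state transitions to system commands. */
final class TransitionCommands {

    private final Map<String, List<String>> commands;

    TransitionCommands(Map<String, List<String>> commands) {
        this.commands = new HashMap<>(commands); // defensive copy
    }

    /** Lookup key for a transition, e.g. "OFFLINE->ONLINE". */
    static String key(String fromState, String toState) {
        return fromState + "->" + toState;
    }

    /** The command configured for a transition, if any. */
    Optional<List<String>> commandFor(String fromState, String toState) {
        return Optional.ofNullable(commands.get(key(fromState, toState)));
    }

    public static void main(String[] args) {
        // Hypothetical commands that poke a node.js server over HTTP.
        TransitionCommands proxy = new TransitionCommands(Map.of(
                key("OFFLINE", "ONLINE"), List.of("curl", "-X", "POST", "http://localhost:3000/start"),
                key("ONLINE", "OFFLINE"), List.of("curl", "-X", "POST", "http://localhost:3000/stop")));

        // An agent receiving OFFLINE->ONLINE could hand this list to
        // new ProcessBuilder(cmd).start(); here it is only printed.
        proxy.commandFor("OFFLINE", "ONLINE")
             .ifPresent(cmd -> System.out.println(String.join(" ", cmd)));
    }
}
```

Keeping the table separate from the process being managed is what lets the agent front a non-JVM server such as node.js.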
>>>>>>>>>>>>
>>>>>>>>>>>> I have another approach and will try to write it up tonight,
>>>>>>>>>>>> but before that I have a few questions:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>    1. How many node.js servers run on each node: one or more than one?
>>>>>>>>>>>>    2. Is the spectator/router Java-based or non-Java-based?
>>>>>>>>>>>>    3. Can you provide more details about your state machine?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> thanks,
>>>>>>>>>>>> Kishore G
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jun 12, 2013 at 11:07 AM, Lance Co Ting Keh <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi, my name is Lance Co Ting Keh and I work at Box. You guys
>>>>>>>>>>>>> did a tremendous job with Helix. We are looking to use it to manage
>>>>>>>>>>>>> a cluster primarily running Node.js. Our model for using Helix
>>>>>>>>>>>>> would be to have node.js or some other non-JVM library be
>>>>>>>>>>>>> *Participants*, a router as a *Spectator*, and another set of
>>>>>>>>>>>>> machines to serve as the *Controllers* (pending testing, we
>>>>>>>>>>>>> may just run master-slave controllers on the same instances as the
>>>>>>>>>>>>> Participants). The participants will be interacting with ZooKeeper
>>>>>>>>>>>>> in two ways: one is to receive Helix state transition messages
>>>>>>>>>>>>> through the instance of the HelixManager <Participant>, and another
>>>>>>>>>>>>> is to interact directly with ZooKeeper just to maintain ephemeral
>>>>>>>>>>>>> nodes within /INSTANCES. Maintaining ephemeral nodes directly in
>>>>>>>>>>>>> ZooKeeper would be done instead of using InstanceConfig and calling
>>>>>>>>>>>>> addInstance on HelixAdmin because of the basic health checking baked
>>>>>>>>>>>>> into maintaining ephemeral nodes. If not, we would then have to write
>>>>>>>>>>>>> a health checker from Node.js and the JVM running the Participant.
>>>>>>>>>>>>> Are there better alternatives for non-JVM Helix participants? I
>>>>>>>>>>>>> corresponded with Kishore briefly and he mentioned HelixAgents,
>>>>>>>>>>>>> specifically ProcessMonitorThread, which came out in the last
>>>>>>>>>>>>> release.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you very much!
>>>>>>>>>>>>>
>>>>>>>>>>>>>  Lance Co Ting Keh
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
