Hi Kishore,

Hope you are having a restful weekend. I was wondering when I should expect the bug fix to go through?
Thank you very much,
Lance

On Tue, Jun 18, 2013 at 1:36 PM, Lance Co Ting Keh <[email protected]> wrote:
> Thanks Kishore, here is the link to the bug:
> https://issues.apache.org/jira/browse/HELIX-131
>
> On Tue, Jun 18, 2013 at 9:13 AM, kishore g <[email protected]> wrote:
>> My bad, I didn't realize that you needed HelixAdmin to actually create
>> the cluster. Please file a bug; the fix is quite simple.
>>
>> Thanks,
>> Kishore G
>>
>> On Tue, Jun 18, 2013 at 9:00 AM, Lance Co Ting Keh <[email protected]> wrote:
>>> Thanks Kishore. Would you like me to file a bug for the first
>>> solution?
>>>
>>> Also, with the use of the factory I get the following error message:
>>>
>>>     [error] org.apache.helix.HelixException: Initial cluster structure
>>>     is not set up for cluster: dev-box-cluster
>>>
>>> It seems it did not create the appropriate zNodes for me. Was there
>>> something I was supposed to initialize before calling the factory?
>>>
>>> Thank you,
>>> Lance
>>>
>>> On Mon, Jun 17, 2013 at 8:09 PM, kishore g <[email protected]> wrote:
>>>> Hi Lance,
>>>>
>>>> Looks like we are not setting the connection timeout while connecting
>>>> to ZooKeeper in ZKHelixAdmin.
>>>>
>>>> The fix is to change line 99 in ZKHelixAdmin.java from
>>>>
>>>>     _zkClient = new ZkClient(zkAddress);
>>>>
>>>> to
>>>>
>>>>     _zkClient = new ZkClient(zkAddress, timeout * 1000);
>>>>
>>>> Another workaround is to use HelixManager to get a HelixAdmin:
>>>>
>>>>     manager = HelixManagerFactory.getZKHelixManager(cluster, "Admin",
>>>>         InstanceType.ADMINISTRATOR, zkAddress);
>>>>     manager.connect();
>>>>     admin = manager.getClusterManagmentTool();
>>>>
>>>> This will wait for 60 seconds before failing.
>>>>
>>>> Thanks,
>>>> Kishore G
>>>>
>>>> On Mon, Jun 17, 2013 at 6:15 PM, Lance Co Ting Keh <[email protected]> wrote:
>>>>> Thank you, Kishore. I'll definitely test the memory consumption of
>>>>> one JVM per node.js server first.
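A side note on the fix quoted above: both the missing ZkClient timeout and Lance's later complaint about ZKHelixAdmin retrying forever come down to bounding the connect attempt. Helix's ZkClient does not expose a retry cap, so the plain-JDK sketch below is a hypothetical wrapper, not Helix API; the class name, method name, and parameters are invented for illustration. It shows how a caller can bail out after a fixed number of attempts instead of looping indefinitely:

```java
import java.util.concurrent.Callable;

// Hypothetical helper (not part of Helix): retry a connect call a bounded
// number of times, then fail, instead of retrying forever.
public class BoundedRetry {
    public static <T> T connectWithRetries(Callable<T> connect, int maxAttempts) {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return connect.call();   // succeeded: hand back the client
            } catch (Exception e) {
                last = e;                // remember the failure; a real version
                                         // would also back off before retrying
            }
        }
        throw new IllegalStateException(
            "gave up after " + maxAttempts + " attempts", last);
    }
}
```

Something like `connectWithRetries(() -> new ZKHelixAdmin(zkAddress), 3)` would then throw after three failed constructions instead of retrying infinitely; a production version would add a backoff between attempts.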
>>>>> If it's too much, we'll likely go with your proposed design but
>>>>> execute kills via the OS. This is to ensure there are no rogue
>>>>> servers.
>>>>>
>>>>> I have a small implementation question: when a call to new
>>>>> ZKHelixAdmin fails (val admin = new ZKHelixAdmin("")), it retries
>>>>> again and again infinitely. Is there a method I can override to
>>>>> limit the number of reconnects and just have it fail?
>>>>>
>>>>> Lance
>>>>>
>>>>> On Sun, Jun 16, 2013 at 11:56 PM, kishore g <[email protected]> wrote:
>>>>>> Hi Lance,
>>>>>>
>>>>>> Looks good to me. Having a JVM per node.js server might add
>>>>>> additional overhead; you should definitely run this with production
>>>>>> configuration and ensure that it does not impact performance. If
>>>>>> you find it consuming too many resources, you can try this
>>>>>> approach:
>>>>>>
>>>>>>   1. Have one Helix agent per node.
>>>>>>   2. Instead of creating a separate Helix agent per node.js
>>>>>>      process, create multiple participants within the same agent.
>>>>>>      Each participant represents one node.js process.
>>>>>>   3. Let one of the Helix agents handle the monitoring of
>>>>>>      participant LIVEINSTANCES and the killing of node.js
>>>>>>      processes. You create another resource using the
>>>>>>      leader-standby model; only one Helix agent will be the leader,
>>>>>>      and it will monitor the LIVEINSTANCES. If any Helix agent
>>>>>>      dies, the leader can ask that agent's node.js server to kill
>>>>>>      itself (you can use HTTP or any other mechanism of your
>>>>>>      choice). The idea is to designate one leader in the system to
>>>>>>      ensure that each Helix agent and its node.js process act as a
>>>>>>      pair.
>>>>>>
>>>>>> Try this only if you find that the JVM overhead is significant with
>>>>>> the approach you have listed.
>>>>>>
>>>>>> Thanks,
>>>>>> Kishore G
>>>>>>
>>>>>> On Fri, Jun 14, 2013 at 8:37 PM, Lance Co Ting Keh <[email protected]> wrote:
>>>>>>> Thank you for your advice, Santiago.
>>>>>>> That is certainly part of the design as well.
>>>>>>>
>>>>>>> Best,
>>>>>>> Lance
>>>>>>>
>>>>>>> On Fri, Jun 14, 2013 at 5:32 PM, Santiago Perez <[email protected]> wrote:
>>>>>>>> Helix user here (not a developer), so take my words with a grain
>>>>>>>> of salt.
>>>>>>>>
>>>>>>>> Regarding point 7: you might want to consider the behavior of the
>>>>>>>> node.js instance if that instance itself loses its connection to
>>>>>>>> ZK; you'll probably want to kill it then too. Otherwise you could
>>>>>>>> just as well ignore the fact that the JVM lost its connection.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Santiago
>>>>>>>>
>>>>>>>> On Fri, Jun 14, 2013 at 6:30 PM, Lance Co Ting Keh <[email protected]> wrote:
>>>>>>>>> We have a working prototype of basically something like the #2
>>>>>>>>> you proposed above. We're using the standard Helix participant,
>>>>>>>>> and in the @Transition methods of the state model we send
>>>>>>>>> commands to node.js via HTTP.
>>>>>>>>>
>>>>>>>>> I want to run you through our general architecture to make sure
>>>>>>>>> we are not violating anything on the Helix side. As a reminder,
>>>>>>>>> what we need to guarantee is that at any given time one and only
>>>>>>>>> one node.js process is in charge of a task.
>>>>>>>>>
>>>>>>>>>   1. A machine with N cores will have N (pending testing)
>>>>>>>>>      node.js processes running.
>>>>>>>>>   2. Associated with each of the N node processes are also N
>>>>>>>>>      Helix participants (separate JVM instances; the reason for
>>>>>>>>>      this comes later).
>>>>>>>>>   3. A separate Helix controller will be running on each
>>>>>>>>>      machine, and the controllers will simply leader-elect
>>>>>>>>>      between machines.
>>>>>>>>>   4. The spectator/router will likely be HAProxy, so the Linux
>>>>>>>>>      box running it will also run a JVM to serve as the Helix
>>>>>>>>>      spectator.
>>>>>>>>>   5. The state machine for each will simply be the OnlineOffline
>>>>>>>>>      model. (However, I do get error messages saying that I
>>>>>>>>>      haven't defined an OFFLINE-to-DROPPED transition; I was
>>>>>>>>>      going to ask you about this, but it is a minor detail
>>>>>>>>>      compared to the rest of the architecture.)
>>>>>>>>>   6. A simple bash script will serve as a watchdog on each
>>>>>>>>>      node.js and Helix participant pair. If either of the two is
>>>>>>>>>      "dead", the other process must immediately be SIGKILLed;
>>>>>>>>>      hence the need for one JVM serving as a Helix participant
>>>>>>>>>      for every node.js process.
>>>>>>>>>   7. Each node.js instance sets a watch on /LIVEINSTANCES
>>>>>>>>>      directly in ZooKeeper as an extra safety blanket. If it
>>>>>>>>>      finds that it is NOT in the live instances, it likely means
>>>>>>>>>      that its JVM participant lost its connection to ZooKeeper
>>>>>>>>>      while the process is still running, so the bash script has
>>>>>>>>>      not terminated the node server. In this case the node
>>>>>>>>>      server must end its own process.
>>>>>>>>>
>>>>>>>>> Thank you for all your help.
>>>>>>>>>
>>>>>>>>> Sincerely,
>>>>>>>>> Lance
>>>>>>>>>
>>>>>>>>> On Wed, Jun 12, 2013 at 9:07 PM, kishore g <[email protected]> wrote:
>>>>>>>>>> Hi Lance,
>>>>>>>>>>
>>>>>>>>>> Thanks for your interest in Helix. There are two possible
>>>>>>>>>> approaches.
>>>>>>>>>>
>>>>>>>>>> 1. Similar to what you suggested: write a Helix participant in
>>>>>>>>>> a non-JVM language, which in your case is node.js. There seem
>>>>>>>>>> to be quite a few node.js implementations that can interact
>>>>>>>>>> with ZooKeeper. A Helix participant does the following (you got
>>>>>>>>>> it right, but I am providing the right sequence):
>>>>>>>>>>
>>>>>>>>>>   1. Create an ephemeral node under LIVEINSTANCES.
>>>>>>>>>>   2. Watch the /INSTANCES/<PARTICIPANT_NAME>/MESSAGES node for
>>>>>>>>>>      transitions.
>>>>>>>>>>   3. After a transition is completed, update
>>>>>>>>>>      /INSTANCES/<PARTICIPANT_NAME>/CURRENTSTATE.
>>>>>>>>>>
>>>>>>>>>> The controller does most of the heavy lifting of ensuring that
>>>>>>>>>> these transitions lead to the desired configuration. It's quite
>>>>>>>>>> easy to re-implement this in any other language; the most
>>>>>>>>>> difficult part would be the ZooKeeper binding. We have used the
>>>>>>>>>> Java bindings and they are solid. This is at a very high level;
>>>>>>>>>> there are some details I have left out, like handling
>>>>>>>>>> connection loss and session expiry, that will require some
>>>>>>>>>> thinking.
>>>>>>>>>>
>>>>>>>>>> 2. The other option is to use the Helix agent as a proxy. We
>>>>>>>>>> added the Helix agent as part of 0.6.1; we haven't documented
>>>>>>>>>> it yet. Here is the gist of what it does: think of it as a
>>>>>>>>>> generic state transition handler. You can configure Helix to
>>>>>>>>>> run a specific system command as part of each transition. The
>>>>>>>>>> Helix agent is a separate process that runs alongside your
>>>>>>>>>> actual process; instead of the actual process receiving the
>>>>>>>>>> transition, the Helix agent receives it. As part of the
>>>>>>>>>> transition, the Helix agent can invoke APIs on the actual
>>>>>>>>>> process via RPC, HTTP, etc. The Helix agent simply acts as a
>>>>>>>>>> proxy for the actual process.
>>>>>>>>>>
>>>>>>>>>> I have another approach and will try to write it up tonight,
>>>>>>>>>> but before that I have a few questions:
>>>>>>>>>>
>>>>>>>>>>   1. How many node.js servers run on each node: one, or more
>>>>>>>>>>      than one?
>>>>>>>>>>   2. Is the spectator/router Java based or non-Java based?
>>>>>>>>>>   3. Can you provide more details about your state machine?
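To make the three-step participant sequence described above concrete without pulling in Helix or ZooKeeper dependencies, here is a stdlib-only Java sketch in which a plain map stands in for the ZooKeeper tree. In real code, step 1 would create an ephemeral znode and step 2 would set a watch via a ZooKeeper client; the ToyParticipant class and its method names are invented for illustration, not Helix API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative participant following the three-step sequence.
// The "store" map stands in for the ZooKeeper tree; a real participant
// would create an ephemeral LIVEINSTANCES znode and watch MESSAGES.
public class ToyParticipant {
    private final String name;
    private final Map<String, String> store;

    public ToyParticipant(String name, Map<String, String> store) {
        this.name = name;
        this.store = store;
    }

    // Step 1: announce liveness under LIVEINSTANCES.
    public void register() {
        store.put("/LIVEINSTANCES/" + name, "alive");
    }

    // Steps 2 and 3: consume a pending transition message for a partition
    // and record the resulting state under CURRENTSTATE.
    public void handleMessage(String partition) {
        String key = "/INSTANCES/" + name + "/MESSAGES/" + partition;
        String toState = store.remove(key);   // take the pending message
        if (toState != null) {
            store.put("/INSTANCES/" + name + "/CURRENTSTATE/" + partition, toState);
        }
    }
}
```

A controller would write, say, /INSTANCES/node1/MESSAGES/p0 = ONLINE and later read the CURRENTSTATE path to verify the transition took effect.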
>>>>>>>>>> Thanks,
>>>>>>>>>> Kishore G
>>>>>>>>>>
>>>>>>>>>> On Wed, Jun 12, 2013 at 11:07 AM, Lance Co Ting Keh <[email protected]> wrote:
>>>>>>>>>>> Hi, my name is Lance Co Ting Keh and I work at Box. You guys
>>>>>>>>>>> did a tremendous job with Helix. We are looking to use it to
>>>>>>>>>>> manage a cluster primarily running node.js. Our model for
>>>>>>>>>>> using Helix would be to have node.js or some other non-JVM
>>>>>>>>>>> library be the *Participants*, a router be the *Spectator*,
>>>>>>>>>>> and another set of machines serve as the *Controllers*
>>>>>>>>>>> (pending testing, we may just run master-slave controllers on
>>>>>>>>>>> the same instances as the participants). The participants
>>>>>>>>>>> will interact with ZooKeeper in two ways: one is to receive
>>>>>>>>>>> Helix state transition messages through the
>>>>>>>>>>> HelixManager <Participant> instance, and the other is to
>>>>>>>>>>> interact with ZooKeeper directly, just to maintain ephemeral
>>>>>>>>>>> nodes within /INSTANCES. We would maintain ephemeral nodes
>>>>>>>>>>> directly in ZooKeeper, instead of using InstanceConfig and
>>>>>>>>>>> calling addInstance on HelixAdmin, because of the basic
>>>>>>>>>>> health checking baked into maintaining ephemeral nodes;
>>>>>>>>>>> otherwise we would have to write a health checker between
>>>>>>>>>>> node.js and the JVM running the participant. Are there better
>>>>>>>>>>> alternatives for non-JVM Helix participants? I corresponded
>>>>>>>>>>> with Kishore briefly and he mentioned Helix agents,
>>>>>>>>>>> specifically ProcessMonitorThread, which came out in the last
>>>>>>>>>>> release.
>>>>>>>>>>>
>>>>>>>>>>> Thank you very much!
>>>>>>>>>>>
>>>>>>>>>>> Lance Co Ting Keh
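A closing note on the OFFLINE-to-DROPPED error Lance mentions: Helix's OnlineOffline model has three states (OFFLINE, ONLINE, DROPPED), and the controller expects a handler for the OFFLINE to DROPPED transition so it can drop a partition entirely. The sketch below shows only the callback shape in plain Java; in real Helix the class would extend org.apache.helix.participant.statemachine.StateModel with @Transition-annotated methods, and notifyNodeJs here is a hypothetical stand-in for the HTTP call to the paired node.js server.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of an OnlineOffline-style transition handler. The OFFLINE ->
// DROPPED callback is the one Lance's error message says was missing.
public class OnlineOfflineHandler {
    private String state = "OFFLINE";
    private final List<String> sent = new ArrayList<>();  // commands "sent" to node.js

    // Hypothetical stand-in for an HTTP call to the paired node.js server.
    private void notifyNodeJs(String command) {
        sent.add(command);
    }

    public void onBecomeOnlineFromOffline() {
        notifyNodeJs("start");     // tell node.js to take charge of the task
        state = "ONLINE";
    }

    public void onBecomeOfflineFromOnline() {
        notifyNodeJs("stop");      // tell node.js to release the task
        state = "OFFLINE";
    }

    // Required so the controller can drop the partition entirely.
    public void onBecomeDroppedFromOffline() {
        notifyNodeJs("cleanup");
        state = "DROPPED";
    }

    public String getState() { return state; }
    public List<String> getSent() { return sent; }
}
```

Defining the DROPPED callback, even as a no-op, should silence the error Lance describes, since the controller's state model requires every reachable transition to have a handler.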
