Hi Lance,

That's a fairly simple fix. I will provide the fix tomorrow.
Thanks,
Kishore G

On Sun, Jun 23, 2013 at 2:18 PM, Lance Co Ting Keh <[email protected]> wrote:

> Hi Kishore,
>
> Hope you are having a restful weekend. I was just wondering when I should
> normally expect the bug fix to go through?
>
> Thank you very much,
> Lance
>
> On Tue, Jun 18, 2013 at 1:36 PM, Lance Co Ting Keh <[email protected]> wrote:
>
>> Thanks Kishore, here is the link to the bug:
>> https://issues.apache.org/jira/browse/HELIX-131
>>
>> On Tue, Jun 18, 2013 at 9:13 AM, kishore g <[email protected]> wrote:
>>
>>> My bad, I didn't realize that you needed HelixAdmin to actually create
>>> the cluster. Please file a bug; the fix is quite simple.
>>>
>>> Thanks,
>>> Kishore G
>>>
>>> On Tue, Jun 18, 2013 at 9:00 AM, Lance Co Ting Keh <[email protected]> wrote:
>>>
>>>> Thanks Kishore. Would you like me to file a bug report for the first
>>>> solution?
>>>>
>>>> Also, with the use of the factory, I get the following error message:
>>>>
>>>>   [error] org.apache.helix.HelixException: Initial cluster structure is
>>>>   not set up for cluster: dev-box-cluster
>>>>
>>>> It seems it did not create the appropriate zNodes for me. Was there
>>>> something I was supposed to initialize before calling the factory?
>>>>
>>>> Thank you,
>>>> Lance
>>>>
>>>> On Mon, Jun 17, 2013 at 8:09 PM, kishore g <[email protected]> wrote:
>>>>
>>>>> Hi Lance,
>>>>>
>>>>> It looks like we are not setting the connection timeout while
>>>>> connecting to ZooKeeper in ZKHelixAdmin.
>>>>>
>>>>> The fix is to change line 99 in ZkHelixAdmin.java from
>>>>>
>>>>>   _zkClient = new ZkClient(zkAddress);
>>>>>
>>>>> to
>>>>>
>>>>>   _zkClient = new ZkClient(zkAddress, timeout * 1000);
>>>>>
>>>>> Another workaround is to use HelixManager to get HelixAdmin:
>>>>>
>>>>>   manager = HelixManagerFactory.getZKHelixManager(cluster, "Admin",
>>>>>       InstanceType.ADMINISTRATOR, zkAddress);
>>>>>   manager.connect();
>>>>>   admin = manager.getClusterManagmentTool();
>>>>>
>>>>> This will wait for 60 seconds before failing.
>>>>> Thanks,
>>>>> Kishore G
>>>>>
>>>>> On Mon, Jun 17, 2013 at 6:15 PM, Lance Co Ting Keh <[email protected]> wrote:
>>>>>
>>>>>> Thank you, Kishore. I'll definitely test the memory consumption of one
>>>>>> JVM per node.js server first. If it's too much, we'll likely go with
>>>>>> your proposed design but execute kills via the OS. This is to ensure
>>>>>> there are no rogue servers.
>>>>>>
>>>>>> I have a small implementation question. When calling new ZKHelixAdmin
>>>>>> (val admin = new ZKHelixAdmin("")), if it fails it retries again and
>>>>>> again indefinitely. Is there a method I can override to limit the
>>>>>> number of reconnects and just have it fail?
>>>>>>
>>>>>> Lance
>>>>>>
>>>>>> On Sun, Jun 16, 2013 at 11:56 PM, kishore g <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Lance,
>>>>>>>
>>>>>>> Looks good to me. Having a JVM per node.js server might add
>>>>>>> additional overhead; you should definitely run this with the
>>>>>>> production configuration and ensure that it does not impact
>>>>>>> performance. If you find it consuming too many resources, you can
>>>>>>> try this approach instead:
>>>>>>>
>>>>>>> 1. Have one agent per node.
>>>>>>> 2. Instead of creating a separate Helix agent per node.js process,
>>>>>>> create multiple participants within the same agent. Each participant
>>>>>>> represents one node.js process.
>>>>>>> 3. The monitoring of participant LIVEINSTANCES and the killing of
>>>>>>> node.js processes can be done by one of the Helix agents. You create
>>>>>>> another resource using the leader-standby model. Only one Helix agent
>>>>>>> will be the leader; it will monitor the LIVEINSTANCES, and if any
>>>>>>> Helix agent dies it can ask the node.js servers to kill themselves
>>>>>>> (you can use HTTP or any other mechanism of your choice). The idea
>>>>>>> here is to designate one leader in the system to ensure that the
>>>>>>> Helix agent and node.js act like a pair.
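On Lance's earlier question about limiting the infinite reconnects of new ZKHelixAdmin: until Helix exposes such a setting, one workaround is to bound the attempts from the caller's side. A minimal sketch in plain Java — the `tryConnect` helper and its parameters are illustrative assumptions, not Helix API; you would wrap the `new ZKHelixAdmin(zkAddress)` call inside the `Callable`:

```java
import java.util.concurrent.Callable;

class BoundedRetry {
    /**
     * Runs the given connection attempt at most maxAttempts times, sleeping
     * between attempts. Returns true on the first success, and false once
     * the attempt budget is exhausted instead of retrying forever.
     */
    static boolean tryConnect(Callable<Boolean> attempt,
                              int maxAttempts,
                              long sleepMillis) throws InterruptedException {
        for (int i = 0; i < maxAttempts; i++) {
            try {
                if (attempt.call()) {
                    return true; // connected
                }
            } catch (Exception e) {
                // Treat a thrown exception as a failed attempt; fall through.
            }
            if (i < maxAttempts - 1) {
                Thread.sleep(sleepMillis);
            }
        }
        return false; // give up
    }
}
```

With this wrapper a bad ZooKeeper address fails after a fixed number of attempts rather than looping, and the caller decides what "fail" means (throw, log, or exit).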
>>>>>>> You can try this only if you find that the overhead of the JVM is
>>>>>>> significant with the approach you have listed.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Kishore G
>>>>>>>
>>>>>>> On Fri, Jun 14, 2013 at 8:37 PM, Lance Co Ting Keh <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thank you for your advice, Santiago. That is certainly part of the
>>>>>>>> design as well.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Lance
>>>>>>>>
>>>>>>>> On Fri, Jun 14, 2013 at 5:32 PM, Santiago Perez <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Helix user here (not developer), so take my words with a grain of
>>>>>>>>> salt.
>>>>>>>>>
>>>>>>>>> Regarding your last point (the node.js instance watching
>>>>>>>>> LIVEINSTANCES): you might want to consider the behavior of the
>>>>>>>>> node.js instance if that instance loses its own connection to ZK;
>>>>>>>>> you'll probably want to kill it too, otherwise you could be
>>>>>>>>> ignoring the fact that the JVM lost the connection as well.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Santiago
>>>>>>>>>
>>>>>>>>> On Fri, Jun 14, 2013 at 6:30 PM, Lance Co Ting Keh <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> We have a working prototype of basically something like #2 you
>>>>>>>>>> proposed above. We're using the standard Helix participant, and in
>>>>>>>>>> the @Transition handlers of the state model we send commands to
>>>>>>>>>> node.js via HTTP.
>>>>>>>>>>
>>>>>>>>>> I want to run you through our general architecture to make sure we
>>>>>>>>>> are not violating anything on the Helix side. As a reminder, what
>>>>>>>>>> we need to guarantee is that at any given time one and only one
>>>>>>>>>> node.js process is in charge of a task.
>>>>>>>>>>
>>>>>>>>>> 1. A machine with N cores will have N (pending testing) node.js
>>>>>>>>>> processes running.
>>>>>>>>>> 2. Associated with each of the N node processes are also N Helix
>>>>>>>>>> participants (separate JVM instances -- the reason for this comes
>>>>>>>>>> later).
>>>>>>>>>> 3. A separate Helix controller will be running on the machine and
>>>>>>>>>> will just leader-elect between machines.
>>>>>>>>>> 4. The spectator router will likely be HAProxy, and thus the Linux
>>>>>>>>>> machine running it will also run a JVM to serve as the Helix
>>>>>>>>>> spectator.
>>>>>>>>>> 5. The state machine for each will simply be the ONLINEOFFLINE
>>>>>>>>>> model. (However, I do get error messages saying that I haven't
>>>>>>>>>> defined an OFFLINE-to-DROPPED transition; I was going to ask you
>>>>>>>>>> about this, but it is a minor detail compared to the rest of the
>>>>>>>>>> architecture.)
>>>>>>>>>> 6. A simple bash script will serve as a watchdog on each node.js
>>>>>>>>>> and Helix participant pair. If either of the two is "dead", the
>>>>>>>>>> other process must immediately be SIGKILLed, hence the need for
>>>>>>>>>> one JVM serving as a Helix participant for every node.js process.
>>>>>>>>>> 7. Each node.js instance sets a watch on /LIVEINSTANCES directly
>>>>>>>>>> in ZooKeeper as an extra safety blanket. If it finds that it is
>>>>>>>>>> NOT in the live instances, it likely means that its JVM
>>>>>>>>>> participant lost its connection to ZooKeeper but the process is
>>>>>>>>>> still running, so the bash script has not terminated the node
>>>>>>>>>> server. In this case the node server must end its own process.
>>>>>>>>>>
>>>>>>>>>> Thank you for all your help.
>>>>>>>>>>
>>>>>>>>>> Sincerely,
>>>>>>>>>> Lance
>>>>>>>>>>
>>>>>>>>>> On Wed, Jun 12, 2013 at 9:07 PM, kishore g <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Lance,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for your interest in Helix. There are two possible
>>>>>>>>>>> approaches.
>>>>>>>>>>>
>>>>>>>>>>> 1. Similar to what you suggested: write a Helix participant in a
>>>>>>>>>>> non-JVM language, which in your case is node.js. There seem to be
>>>>>>>>>>> quite a few implementations in node.js that can interact with
>>>>>>>>>>> ZooKeeper.
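The self-termination check Lance describes (a node.js instance watching /LIVEINSTANCES and killing itself if absent), combined with Santiago's caveat about the watcher's own connection, reduces to a small pure decision function. A sketch in Java for illustration — names are assumptions, and a node.js implementation would apply the same logic to the children of /LIVEINSTANCES:

```java
import java.util.Set;

class LivenessCheck {
    /**
     * Given whether this process's own ZooKeeper connection is alive, the
     * current children of /LIVEINSTANCES, and the name of the paired Helix
     * participant, decide whether the process should kill itself.
     */
    static boolean shouldSelfTerminate(boolean ownZkConnected,
                                       Set<String> liveInstances,
                                       String participantName) {
        // If we cannot observe LIVEINSTANCES at all, err on the side of
        // dying (Santiago's point about losing our own ZK connection).
        if (!ownZkConnected) {
            return true;
        }
        // Absence from the live set means the paired JVM participant has
        // lost its ZooKeeper session, so this process must stop too.
        return !liveInstances.contains(participantName);
    }
}
```

The watchdog pair then stays symmetric: the bash script kills whichever half dies first, and this check covers the case where the JVM's session expired while both processes are still running.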
>>>>>>>>>>> A Helix participant does the following (you got it right, but I
>>>>>>>>>>> am providing the right sequence):
>>>>>>>>>>>
>>>>>>>>>>> 1. Create an ephemeral node under LIVEINSTANCES.
>>>>>>>>>>> 2. Watch the /INSTANCES/<PARTICIPANT_NAME>/MESSAGES node for
>>>>>>>>>>> transitions.
>>>>>>>>>>> 3. After a transition is completed, update
>>>>>>>>>>> /INSTANCES/<PARTICIPANT_NAME>/CURRENTSTATE.
>>>>>>>>>>>
>>>>>>>>>>> The controller does most of the heavy lifting of ensuring that
>>>>>>>>>>> these transitions lead to the desired configuration. It's quite
>>>>>>>>>>> easy to re-implement this in any other language; the most
>>>>>>>>>>> difficult part would be the ZooKeeper binding. We have used the
>>>>>>>>>>> Java bindings and they are solid. This is at a very high level;
>>>>>>>>>>> there are some more details I have left out, like handling
>>>>>>>>>>> connection loss/session expiry etc., that will require some
>>>>>>>>>>> thinking.
>>>>>>>>>>>
>>>>>>>>>>> 2. The other option is to use the Helix agent as a proxy: we
>>>>>>>>>>> added the Helix agent as part of 0.6.1; we haven't documented it
>>>>>>>>>>> yet. Here is the gist of what it does. Think of it as a generic
>>>>>>>>>>> state transition handler. You can configure Helix to run a
>>>>>>>>>>> specific system command as part of each transition. The Helix
>>>>>>>>>>> agent is a separate process that runs alongside your actual
>>>>>>>>>>> process. Instead of the actual process getting the transition,
>>>>>>>>>>> the Helix agent gets the transition. As part of this transition
>>>>>>>>>>> the Helix agent can invoke APIs on the actual process via RPC,
>>>>>>>>>>> HTTP, etc. The Helix agent simply acts as a proxy to the actual
>>>>>>>>>>> process.
>>>>>>>>>>>
>>>>>>>>>>> I have another approach and will try to write it up tonight, but
>>>>>>>>>>> before that I have a few questions:
>>>>>>>>>>>
>>>>>>>>>>> 1. How many node.js servers run on each node: one, or more than
>>>>>>>>>>> one?
>>>>>>>>>>> 2. Is the spectator/router Java or non-Java based?
>>>>>>>>>>> 3. Can you provide more details about your state machine?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Kishore G
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jun 12, 2013 at 11:07 AM, Lance Co Ting Keh <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi, my name is Lance Co Ting Keh and I work at Box. You guys did
>>>>>>>>>>>> a tremendous job with Helix. We are looking to use it to manage
>>>>>>>>>>>> a cluster primarily running Node.js. Our model for using Helix
>>>>>>>>>>>> would be to have node.js or some other non-JVM library be the
>>>>>>>>>>>> *Participants*, a router as a *Spectator*, and another set of
>>>>>>>>>>>> machines to serve as the *Controllers* (pending testing, we may
>>>>>>>>>>>> just run master-slave controllers on the same instances as the
>>>>>>>>>>>> Participants). The participants will interact with ZooKeeper in
>>>>>>>>>>>> two ways: one is to receive Helix state transition messages
>>>>>>>>>>>> through the instance of the HelixManager <Participant>, and the
>>>>>>>>>>>> other is to interact with ZooKeeper directly just to maintain
>>>>>>>>>>>> ephemeral nodes within /INSTANCES. Maintaining ephemeral nodes
>>>>>>>>>>>> directly in ZooKeeper would be done instead of using
>>>>>>>>>>>> InstanceConfig and calling addInstance on HelixAdmin because of
>>>>>>>>>>>> the basic health checking baked into maintaining ephemeral
>>>>>>>>>>>> nodes. Otherwise we would have to write a health checker between
>>>>>>>>>>>> Node.js and the JVM running the Participant. Are there better
>>>>>>>>>>>> alternatives for non-JVM Helix participants?
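For reference, the znodes a non-JVM participant would touch, following the three-step sequence Kishore gives above, can be written down as simple path builders. This is a sketch based only on the paths named in this thread; the exact layout in a given Helix release (for example, current states being nested under a session id) should be verified before relying on it:

```java
class HelixPaths {
    // Ephemeral node whose presence marks the participant as live (step 1).
    static String liveInstance(String cluster, String participant) {
        return "/" + cluster + "/LIVEINSTANCES/" + participant;
    }

    // Node to watch for state-transition messages from the controller (step 2).
    static String messages(String cluster, String participant) {
        return "/" + cluster + "/INSTANCES/" + participant + "/MESSAGES";
    }

    // Node the participant updates after completing a transition (step 3).
    static String currentState(String cluster, String participant) {
        return "/" + cluster + "/INSTANCES/" + participant + "/CURRENTSTATE";
    }
}
```

A node.js participant would create `liveInstance(...)` as an ephemeral znode on connect, set a watch on `messages(...)`, and write to `currentState(...)` after each handled transition.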
>>>>>>>>>>>> I corresponded with Kishore briefly, and he mentioned Helix
>>>>>>>>>>>> agents, specifically ProcessMonitorThread, which came out in the
>>>>>>>>>>>> last release.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you very much!
>>>>>>>>>>>>
>>>>>>>>>>>> Lance Co Ting Keh
