Thanks Kishore!
On Sun, Jun 23, 2013 at 10:42 PM, kishore g <[email protected]> wrote:

> Hi Lance,
>
> That's a fairly simple fix. Will provide the fix tomorrow.
>
> thanks,
> Kishore G
>
> On Sun, Jun 23, 2013 at 2:18 PM, Lance Co Ting Keh <[email protected]> wrote:
>
>> Hi Kishore,
>>
>> Hope you are having a restful weekend. I was just wondering when I
>> should normally expect the bug fix to go through?
>>
>> Thank you very much,
>> Lance
>>
>> On Tue, Jun 18, 2013 at 1:36 PM, Lance Co Ting Keh <[email protected]> wrote:
>>
>>> Thanks Kishore, here is the link to the bug:
>>> https://issues.apache.org/jira/browse/HELIX-131
>>>
>>> On Tue, Jun 18, 2013 at 9:13 AM, kishore g <[email protected]> wrote:
>>>
>>>> My bad, I didn't realize that you needed HelixAdmin to actually
>>>> create the cluster. Please file a bug; the fix is quite simple.
>>>>
>>>> thanks,
>>>> Kishore G
>>>>
>>>> On Tue, Jun 18, 2013 at 9:00 AM, Lance Co Ting Keh <[email protected]> wrote:
>>>>
>>>>> Thanks Kishore. Would you like me to file a bug for the first
>>>>> solution?
>>>>>
>>>>> Also, with the use of the factory, I get the following error
>>>>> message:
>>>>>
>>>>>   [error] org.apache.helix.HelixException: Initial cluster structure
>>>>>   is not set up for cluster: dev-box-cluster
>>>>>
>>>>> It seems it did not create the appropriate zNodes for me. Was there
>>>>> something I was supposed to initialize before calling the factory?
>>>>>
>>>>> Thank you
>>>>> Lance
>>>>>
>>>>> On Mon, Jun 17, 2013 at 8:09 PM, kishore g <[email protected]> wrote:
>>>>>
>>>>>> Hi Lance,
>>>>>>
>>>>>> Looks like we are not setting the connection timeout while
>>>>>> connecting to zookeeper in ZKHelixAdmin.
>>>>>>
>>>>>> Fix is to change line 99 in ZKHelixAdmin.java from
>>>>>>
>>>>>>   _zkClient = new ZkClient(zkAddress);
>>>>>>
>>>>>> to
>>>>>>
>>>>>>   _zkClient = new ZkClient(zkAddress, timeout * 1000);
>>>>>>
>>>>>> Another workaround is to use HelixManager to get HelixAdmin:
>>>>>>
>>>>>>   manager = HelixManagerFactory.getZKHelixManager(cluster, "Admin",
>>>>>>       InstanceType.ADMINISTRATOR, zkAddress);
>>>>>>   manager.connect();
>>>>>>   admin = manager.getClusterManagmentTool();
>>>>>>
>>>>>> This will wait for 60 seconds before failing.
>>>>>>
>>>>>> Thanks,
>>>>>> Kishore G
>>>>>>
>>>>>> On Mon, Jun 17, 2013 at 6:15 PM, Lance Co Ting Keh <[email protected]> wrote:
>>>>>>
>>>>>>> Thank you Kishore. I'll definitely test the memory consumption of
>>>>>>> one JVM per node.js server first. If it's too much we'll likely
>>>>>>> do your proposed design but execute kills via the OS. This is to
>>>>>>> ensure no rogue servers.
>>>>>>>
>>>>>>> I have a small implementation question. When calling new
>>>>>>> ZKHelixAdmin and it fails, it retries again and again
>>>>>>> indefinitely (val admin = new ZKHelixAdmin("")). Is there a
>>>>>>> method I can override to limit the number of reconnects and just
>>>>>>> have it fail?
>>>>>>>
>>>>>>> Lance
>>>>>>>
>>>>>>> On Sun, Jun 16, 2013 at 11:56 PM, kishore g <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Lance,
>>>>>>>>
>>>>>>>> Looks good to me. Having a JVM per node.js server might add
>>>>>>>> additional overhead; you should definitely run this with
>>>>>>>> production configuration and ensure that it does not impact
>>>>>>>> performance. If you find it consuming too many resources, you
>>>>>>>> can probably try this approach.
>>>>>>>>
>>>>>>>> 1. Have one agent per node.
>>>>>>>> 2. Instead of creating a separate helix agent per node.js, you
>>>>>>>>    can create multiple participants within the same agent. Each
>>>>>>>>    participant will represent a node.js process.
>>>>>>>> 3. The monitoring of participant LIVEINSTANCES and killing of
>>>>>>>>    node.js processes can be done by one of the helix agents. You
>>>>>>>>    create another resource using the leader-standby model. Only
>>>>>>>>    one helix agent will be the leader, and it will monitor the
>>>>>>>>    LIVEINSTANCES; if any Helix agent dies it can ask the node.js
>>>>>>>>    servers to kill themselves (you can use http or any other
>>>>>>>>    mechanism of your choice). The idea here is to designate one
>>>>>>>>    leader in the system to ensure that helix-agent and node.js
>>>>>>>>    act like a pair.
>>>>>>>>
>>>>>>>> You can try this only if you find that the overhead of the JVM is
>>>>>>>> significant with the approach you have listed.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Kishore G
>>>>>>>>
>>>>>>>> On Fri, Jun 14, 2013 at 8:37 PM, Lance Co Ting Keh <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thank you for your advice Santiago. That is certainly part of
>>>>>>>>> the design as well.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Lance
>>>>>>>>>
>>>>>>>>> On Fri, Jun 14, 2013 at 5:32 PM, Santiago Perez <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Helix user here (not developer) so take my words with a grain
>>>>>>>>>> of salt.
>>>>>>>>>>
>>>>>>>>>> Regarding 6 you might want to consider the behavior of the
>>>>>>>>>> node.js instance if that instance loses connection to zk,
>>>>>>>>>> you'll probably want to kill it too, otherwise you could
>>>>>>>>>> ignore the fact that the JVM lost the connection too.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Santiago
>>>>>>>>>>
>>>>>>>>>> On Fri, Jun 14, 2013 at 6:30 PM, Lance Co Ting Keh <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> We have a working prototype of basically something like #2
>>>>>>>>>>> you proposed above. We're using the standard helix
>>>>>>>>>>> participant, and on the @Transitions of the state model we
>>>>>>>>>>> send commands to node.js via HTTP.
>>>>>>>>>>>
>>>>>>>>>>> I want to run you through our general architecture to make
>>>>>>>>>>> sure we are not violating anything on the Helix side. As a
>>>>>>>>>>> reminder, what we need to guarantee is that at any given time
>>>>>>>>>>> one and only one node.js process is in charge of a task.
>>>>>>>>>>>
>>>>>>>>>>> 1. A machine with N cores will have N (pending testing)
>>>>>>>>>>>    node.js processes running.
>>>>>>>>>>> 2. Associated with each of the N node processes are also N
>>>>>>>>>>>    Helix participants (separate JVM instances -- reason for
>>>>>>>>>>>    this to come later).
>>>>>>>>>>> 3. A separate helix controller will be running on the machine
>>>>>>>>>>>    and will just leader elect between machines.
>>>>>>>>>>> 4. The spectator router will likely be HAProxy and thus a
>>>>>>>>>>>    Linux kernel will run a JVM to serve as the Helix
>>>>>>>>>>>    spectator.
>>>>>>>>>>> 5. The state machine for each will simply be ONLINEOFFLINE
>>>>>>>>>>>    mode. (However, I do get error messages that say that I
>>>>>>>>>>>    haven't defined an OFFLINE to DROPPED mode. I was going to
>>>>>>>>>>>    ask you this but this is a minor detail compared to the
>>>>>>>>>>>    rest of the architecture.)
>>>>>>>>>>> 5. A simple Bash script will serve as a watchdog on each
>>>>>>>>>>>    node.js and helix participant pair. If either of the two
>>>>>>>>>>>    is "dead" the other process must immediately be SIGKILLed,
>>>>>>>>>>>    hence the need for one JVM serving as Helix participant
>>>>>>>>>>>    for every node.js.
>>>>>>>>>>> 6. Each node.js instance sets a watch on /LIVEINSTANCES
>>>>>>>>>>>    straight to zookeeper as an extra safety blanket. If it
>>>>>>>>>>>    finds that it is NOT in the live instances, it likely
>>>>>>>>>>>    means that its JVM participant lost its connection to
>>>>>>>>>>>    Zookeeper, but the process is still running so the bash
>>>>>>>>>>>    script has not terminated the node server. In this case
>>>>>>>>>>>    the node server must end its own process.
>>>>>>>>>>>
>>>>>>>>>>> Thank you for all your help.
>>>>>>>>>>>
>>>>>>>>>>> Sincerely,
>>>>>>>>>>> Lance
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jun 12, 2013 at 9:07 PM, kishore g <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Lance,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for your interest in Helix. There are two possible
>>>>>>>>>>>> approaches.
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Similar to what you suggested: write a Helix participant
>>>>>>>>>>>> in a non-JVM language, which in your case is node.js. There
>>>>>>>>>>>> seem to be quite a few implementations in node.js that can
>>>>>>>>>>>> interact with zookeeper. A Helix participant does the
>>>>>>>>>>>> following (you got it right but I am providing the right
>>>>>>>>>>>> sequence):
>>>>>>>>>>>>
>>>>>>>>>>>>    1. Creates an ephemeral node under LIVEINSTANCES.
>>>>>>>>>>>>    2. Watches the /INSTANCES/<PARTICIPANT_NAME>/MESSAGES
>>>>>>>>>>>>       node for transitions.
>>>>>>>>>>>>    3. After a transition is completed, it updates
>>>>>>>>>>>>       /INSTANCES/<PARTICIPANT_NAME>/CURRENTSTATE.
>>>>>>>>>>>>
>>>>>>>>>>>> The controller is doing most of the heavy lifting of
>>>>>>>>>>>> ensuring that these transitions lead to the desired
>>>>>>>>>>>> configuration. It's quite easy to re-implement this in any
>>>>>>>>>>>> other language; the most difficult thing would be the
>>>>>>>>>>>> zookeeper binding. We have used the java bindings and they
>>>>>>>>>>>> are solid. This is at a very high level; there are some more
>>>>>>>>>>>> details I have left out, like handling connection
>>>>>>>>>>>> loss/session expiry etc., that will require some thinking.
>>>>>>>>>>>>
>>>>>>>>>>>> 2. The other option is to use the Helix agent as a proxy: we
>>>>>>>>>>>> added the Helix agent as part of 0.6.1, but we haven't
>>>>>>>>>>>> documented it yet. Here is the gist of what it does. Think
>>>>>>>>>>>> of it as a generic state transition handler. You can
>>>>>>>>>>>> configure Helix to run a specific system command as part of
>>>>>>>>>>>> each transition. The Helix agent is a separate process that
>>>>>>>>>>>> runs alongside your actual process. Instead of the actual
>>>>>>>>>>>> process getting the transition, the Helix agent gets the
>>>>>>>>>>>> transition. As part of this transition the Helix agent can
>>>>>>>>>>>> invoke APIs on the actual process via RPC, HTTP etc. The
>>>>>>>>>>>> Helix agent simply acts as a proxy to the actual process.
>>>>>>>>>>>>
>>>>>>>>>>>> I have another approach and will try to write it up tonight,
>>>>>>>>>>>> but before that I have a few questions:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. How many node.js servers run on each node, one or >1?
>>>>>>>>>>>> 2. Is the spectator/router java or non-java based?
>>>>>>>>>>>> 3. Can you provide more details about your state machine?
>>>>>>>>>>>>
>>>>>>>>>>>> thanks,
>>>>>>>>>>>> Kishore G
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jun 12, 2013 at 11:07 AM, Lance Co Ting Keh <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi, my name is Lance Co Ting Keh and I work at Box. You
>>>>>>>>>>>>> guys did a tremendous job with Helix. We are looking to use
>>>>>>>>>>>>> it to manage a cluster primarily running Node.js. Our model
>>>>>>>>>>>>> for using Helix would be to have node.js or some other
>>>>>>>>>>>>> non-JVM library be *Participants*, a router as a
>>>>>>>>>>>>> *Spectator* and another set of machines to serve as the
>>>>>>>>>>>>> *Controllers* (pending testing we may just run master-slave
>>>>>>>>>>>>> controllers on the same instances as the Participants). The
>>>>>>>>>>>>> participants will be interacting with Zookeeper in two
>>>>>>>>>>>>> ways: one is to receive helix state transition messages
>>>>>>>>>>>>> through the instance of the HelixManager <Participant>, and
>>>>>>>>>>>>> another is to directly interact with Zookeeper just to
>>>>>>>>>>>>> maintain ephemeral nodes within /INSTANCES. Maintaining
>>>>>>>>>>>>> ephemeral nodes directly in Zookeeper would be done instead
>>>>>>>>>>>>> of using InstanceConfig and calling addInstance on
>>>>>>>>>>>>> HelixAdmin because of the basic health checking baked into
>>>>>>>>>>>>> maintaining ephemeral nodes. If not, we would then have to
>>>>>>>>>>>>> write a health checker from Node.js and the JVM running the
>>>>>>>>>>>>> Participant. Are there better alternatives for non-JVM
>>>>>>>>>>>>> Helix participants? I corresponded with Kishore briefly and
>>>>>>>>>>>>> he mentioned HelixAgents, specifically ProcessMonitorThread,
>>>>>>>>>>>>> that came out in the last release.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you very much!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Lance Co Ting Keh
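The reconnect question in the thread (how to stop `new ZKHelixAdmin(...)` from retrying forever) can also be approached from the caller's side. The following is a minimal sketch, assuming only the Java standard library: `BoundedConnect` and `callWithTimeout` are hypothetical names, not Helix APIs, and the sketch is not the ZkClient timeout fix discussed above. It runs a blocking call on a daemon thread and gives up once a deadline passes.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration; not part of Helix.
final class BoundedConnect {

    /**
     * Runs the task on a daemon thread and returns its result, or throws
     * IllegalStateException if it fails or does not finish within timeoutMillis.
     */
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-connect");
            t.setDaemon(true); // a stuck task must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; the task may ignore it
            throw new IllegalStateException("gave up after " + timeoutMillis + " ms", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a call that never returns, such as a constructor that keeps retrying.
        try {
            callWithTimeout(() -> { Thread.sleep(Long.MAX_VALUE); return "never"; }, 200);
        } catch (IllegalStateException e) {
            System.out.println("failed fast: " + e.getMessage());
        }
    }
}
```

A caller could wrap the constructor from the thread as `callWithTimeout(() -> new ZKHelixAdmin(zkAddress), 10_000)`. Whether the abandoned thread actually stops retrying depends on how the underlying client reacts to interruption, which this sketch does not guarantee.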
