Hi Lance,

It looks like we are not setting the connection timeout while connecting to ZooKeeper in ZKHelixAdmin.
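For reference, here is a minimal sketch of the difference an explicit connection timeout makes. The ZooKeeper address and timeout value are placeholders, and the two-argument ZkClient constructor is the one from the I0Itec zkclient library that Helix wraps:

    import org.I0Itec.zkclient.ZkClient;
    import org.I0Itec.zkclient.exception.ZkTimeoutException;

    public class ZkConnectTimeoutSketch {
      public static void main(String[] args) {
        String zkAddress = "localhost:2181"; // placeholder ZooKeeper address
        int connectTimeoutMs = 30 * 1000;    // give up after 30 seconds

        // new ZkClient(zkAddress) sets no connection timeout, so an unreachable
        // ZooKeeper leaves the constructor retrying indefinitely.
        try {
          ZkClient zkClient = new ZkClient(zkAddress, connectTimeoutMs);
          System.out.println("connected within " + connectTimeoutMs + " ms");
          zkClient.close();
        } catch (ZkTimeoutException e) {
          System.err.println("could not reach ZooKeeper at " + zkAddress + ": " + e.getMessage());
        }
      }
    }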
The fix is to change line 99 in ZKHelixAdmin.java from

    _zkClient = new ZkClient(zkAddress);

to

    _zkClient = new ZkClient(zkAddress, timeout * 1000);

Another workaround is to use a HelixManager to get a HelixAdmin:

    manager = HelixManagerFactory.getZKHelixManager(cluster, "Admin", InstanceType.ADMINISTRATOR, zkAddress);
    manager.connect();
    admin = manager.getClusterManagmentTool();

This will wait for 60 seconds before failing.

Thanks,
Kishore G

On Mon, Jun 17, 2013 at 6:15 PM, Lance Co Ting Keh <[email protected]> wrote:

Thank you, Kishore. I'll definitely try the memory consumption of one JVM per node.js server first. If it's too much, we'll likely go with your proposed design but execute the kills via the OS. This is to ensure there are no rogue servers.

I have a small implementation question: when calling new ZKHelixAdmin and it fails, it retries again and again infinitely (val admin = new ZKHelixAdmin("")). Is there a method I can override to limit the number of reconnects and just have it fail?

Lance

On Sun, Jun 16, 2013 at 11:56 PM, kishore g <[email protected]> wrote:

Hi Lance,

Looks good to me. Having a JVM per node.js server might add additional overhead; you should definitely run this with the production configuration and ensure that it does not impact performance. If you find it consuming too many resources, you can probably try this approach:

1. Have one agent per node.
2. Instead of creating a separate Helix agent per node.js process, create multiple participants within the same agent. Each participant represents a node.js process.
3. The monitoring of participant LIVEINSTANCES and the killing of node.js processes can be done by one of the Helix agents. You create another resource using the leader-standby model. Only one Helix agent will be the leader; it will monitor the LIVEINSTANCES, and if any Helix agent dies it can ask its node.js server to kill itself (you can use HTTP or any other mechanism of your choice). The idea here is to designate one leader in the system to ensure that the Helix agent and node.js act like a pair.

You can try this only if you find that the overhead of one JVM per server is significant with the approach you have listed.

Thanks,
Kishore G

On Fri, Jun 14, 2013 at 8:37 PM, Lance Co Ting Keh <[email protected]> wrote:

Thank you for your advice, Santiago. That is certainly part of the design as well.

Best,
Lance

On Fri, Jun 14, 2013 at 5:32 PM, Santiago Perez <[email protected]> wrote:

Helix user here (not a developer), so take my words with a grain of salt.

Regarding point 7 (the /LIVEINSTANCES watch): you might want to consider the behavior of the node.js instance if that instance loses its connection to zk. You'll probably want to kill it too; otherwise you could ignore the fact that the JVM lost the connection too.

Regards,
Santiago

On Fri, Jun 14, 2013 at 6:30 PM, Lance Co Ting Keh <[email protected]> wrote:

We have a working prototype of basically something like #2 you proposed above. We're using the standard Helix participant, and on the @Transitions of the state model we send commands to node.js via HTTP.

I want to run you through our general architecture to make sure we are not violating anything on the Helix side. As a reminder, what we need to guarantee is that at any given time one and only one node.js process is in charge of a task.
1. A machine with N cores will have N (pending testing) node.js processes running.

2. Associated with each of the N node processes are also N Helix participants (separate JVM instances; the reason for this comes later).

3. A separate Helix controller will be running on the machine and will just leader-elect between machines.

4. The spectator/router will likely be HAProxy, and thus that Linux machine will also run a JVM to serve as the Helix spectator.

5. The state machine for each will simply be the OnlineOffline model. (However, I do get error messages saying that I haven't defined an OFFLINE to DROPPED transition; I was going to ask you about this, but it is a minor detail compared to the rest of the architecture.)

6. A simple bash script will serve as a watchdog on each node.js and Helix participant pair. If either of the two is "dead", the other process must immediately be SIGKILLed, hence the need for one JVM serving as a Helix participant for every node.js process.

7. Each node.js instance sets a watch on /LIVEINSTANCES directly in ZooKeeper as an extra safety blanket. If it finds that it is NOT in the live instances, it likely means that its JVM participant lost its connection to ZooKeeper but the process is still running, so the bash script has not terminated the node server. In this case the node server must end its own process.

Thank you for all your help.

Sincerely,
Lance

On Wed, Jun 12, 2013 at 9:07 PM, kishore g <[email protected]> wrote:

Hi Lance,

Thanks for your interest in Helix. There are two possible approaches.

1. Similar to what you suggested: write a Helix participant in a non-JVM language, which in your case is node.js. There seem to be quite a few node.js implementations that can interact with ZooKeeper. The Helix participant does the following (you got it right, but I am providing the correct sequence):

1. Create an ephemeral node under LIVEINSTANCES.
2. Watch the /INSTANCES/<PARTICIPANT_NAME>/MESSAGES node for transitions.
3. After a transition is completed, update /INSTANCES/<PARTICIPANT_NAME>/CURRENTSTATE.

The controller does most of the heavy lifting of ensuring that these transitions lead to the desired configuration. It is quite easy to re-implement this in another language; the most difficult part would be the ZooKeeper binding. We have used the Java bindings and they are solid. This is at a very high level; there are more details I have left out, like handling connection loss/session expiry, that will require some thinking.

2. The other option is to use the Helix agent as a proxy. We added the Helix agent as part of 0.6.1; we haven't documented it yet, but here is the gist of what it does. Think of it as a generic state transition handler. You can configure Helix to run a specific system command as part of each transition. The Helix agent is a separate process that runs alongside your actual process. Instead of the actual process receiving the transition, the Helix agent gets it, and as part of the transition the agent can invoke APIs on the actual process via RPC, HTTP, etc. The Helix agent simply acts as a proxy to the actual process.
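As a rough illustration of this proxy idea (and of the prototype Lance describes above, where @Transitions forward commands to node.js over HTTP), a participant-side state model might look like the sketch below. This is an assumption-laden example, not the Helix agent itself: the OnlineOfflineProxyModel class, the notifyNodeJs helper, and the localhost URL are all placeholders.

    import java.net.HttpURLConnection;
    import java.net.URL;

    import org.apache.helix.NotificationContext;
    import org.apache.helix.model.Message;
    import org.apache.helix.participant.statemachine.StateModel;
    import org.apache.helix.participant.statemachine.StateModelInfo;
    import org.apache.helix.participant.statemachine.Transition;

    // OnlineOffline state model whose transition callbacks are forwarded to the
    // co-located node.js process over HTTP.
    @StateModelInfo(initialState = "OFFLINE", states = {"ONLINE", "OFFLINE"})
    public class OnlineOfflineProxyModel extends StateModel {

      @Transition(from = "OFFLINE", to = "ONLINE")
      public void onBecomeOnlineFromOffline(Message message, NotificationContext context) throws Exception {
        notifyNodeJs("online", message.getPartitionName());
      }

      @Transition(from = "ONLINE", to = "OFFLINE")
      public void onBecomeOfflineFromOnline(Message message, NotificationContext context) throws Exception {
        notifyNodeJs("offline", message.getPartitionName());
      }

      // Hypothetical helper: POST the transition to the local node.js server and
      // fail the transition (letting Helix mark the partition ERROR) if rejected.
      private void notifyNodeJs(String state, String partition) throws Exception {
        URL url = new URL("http://localhost:8080/transition?state=" + state + "&partition=" + partition);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        if (conn.getResponseCode() != 200) {
          throw new RuntimeException("node.js rejected transition to " + state);
        }
        conn.disconnect();
      }
    }

The model would be registered with the participant's HelixManager through a StateModelFactory in the usual way; that wiring is omitted here.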
I have another approach and will try to write it up tonight, but before that I have a few questions:

1. How many node.js servers run on each node: one, or more than one?
2. Is the spectator/router Java-based or non-Java-based?
3. Can you provide more details about your state machine?

Thanks,
Kishore G

On Wed, Jun 12, 2013 at 11:07 AM, Lance Co Ting Keh <[email protected]> wrote:

Hi, my name is Lance Co Ting Keh and I work at Box. You guys did a tremendous job with Helix. We are looking to use it to manage a cluster primarily running node.js. Our model for using Helix would be to have node.js (or some other non-JVM library) be the Participants, a router as the Spectator, and another set of machines serve as the Controllers (pending testing, we may just run master-slave controllers on the same instances as the Participants). The participants will interact with ZooKeeper in two ways: one is to receive Helix state transition messages through the instance of the HelixManager (Participant), and the other is to interact with ZooKeeper directly just to maintain ephemeral nodes within /INSTANCES. Maintaining ephemeral nodes directly in ZooKeeper would be done instead of using InstanceConfig and calling addInstance on HelixAdmin, because of the basic health checking baked into maintaining ephemeral nodes; otherwise we would have to write a health checker between node.js and the JVM running the Participant. Are there better alternatives for non-JVM Helix participants? I corresponded with Kishore briefly and he mentioned HelixAgents, specifically the ProcessMonitorThread that came out in the last release.

Thank you very much!

Lance Co Ting Keh
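For completeness, the InstanceConfig / addInstance registration path that Lance weighs against direct ephemeral nodes looks roughly like the sketch below. The ZooKeeper address, cluster name, instance name, host, and port are placeholders; note that this static registration only writes the instance's configuration, while the ephemeral LIVEINSTANCES entry (the liveness signal discussed in this thread) appears only when a participant HelixManager actually connects.

    import org.apache.helix.HelixAdmin;
    import org.apache.helix.manager.zk.ZKHelixAdmin;
    import org.apache.helix.model.InstanceConfig;

    public class RegisterInstanceSketch {
      public static void main(String[] args) {
        String zkAddress = "localhost:2181";   // placeholder ZooKeeper address
        String clusterName = "nodejs-cluster"; // placeholder cluster name

        HelixAdmin admin = new ZKHelixAdmin(zkAddress);

        // Static registration: writes the instance config under the cluster's
        // CONFIGS/PARTICIPANT path. It does not create any ephemeral node.
        InstanceConfig config = new InstanceConfig("nodejs_localhost_12000");
        config.setHostName("localhost");
        config.setPort("12000");
        config.setInstanceEnabled(true);
        admin.addInstance(clusterName, config);

        // Liveness still comes from the participant side: the ephemeral node under
        // LIVEINSTANCES is created when a HelixManager of type PARTICIPANT connects.
      }
    }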
