Thanks Kishore. Would you like me to file a bug fix for the first solution?
Also, when I use the factory, I get the following error message:

[error] org.apache.helix.HelixException: Initial cluster structure is not set up for cluster: dev-box-cluster

It seems it did not create the appropriate znodes for me. Was there something I was supposed to initialize before calling the factory?

Thank you,
Lance

On Mon, Jun 17, 2013 at 8:09 PM, kishore g <[email protected]> wrote:

> Hi Lance,
>
> Looks like we are not setting the connection timeout while connecting to
> ZooKeeper in ZKHelixAdmin.
>
> The fix is to change line 99 in ZKHelixAdmin.java from
>
> _zkClient = new ZkClient(zkAddress);
>
> to
>
> _zkClient = new ZkClient(zkAddress, timeout * 1000);
>
> Another workaround is to use HelixManager to get HelixAdmin:
>
> manager = HelixManagerFactory.getZKHelixManager(cluster, "Admin",
>     InstanceType.ADMINISTRATOR, zkAddress);
> manager.connect();
> admin = manager.getClusterManagmentTool();
>
> This will wait for 60 seconds before failing.
>
> Thanks,
> Kishore G
>
>
> On Mon, Jun 17, 2013 at 6:15 PM, Lance Co Ting Keh <[email protected]> wrote:
>
>> Thank you, Kishore. I'll definitely try the memory consumption of one JVM
>> per node.js server first. If it's too much, we'll likely do your proposed
>> design but execute kills via the OS. This is to ensure there are no rogue
>> servers.
>>
>> I have a small implementation question. When calling new ZKHelixAdmin
>> (val admin = new ZKHelixAdmin("")), it retries again and again, infinitely,
>> when it fails. Is there a method I can override to limit the number of
>> reconnects and just have it fail?
>>
>> Lance
>>
>>
>> On Sun, Jun 16, 2013 at 11:56 PM, kishore g <[email protected]> wrote:
>>
>>> Hi Lance,
>>>
>>> Looks good to me. Having a JVM per node.js server might add additional
>>> overhead; you should definitely run this with a production configuration
>>> and ensure that it does not impact performance. If you find it consuming
>>> too many resources, you can probably try this approach:
>>>
>>> 1. Have one agent per node.
>>> 2. Instead of creating a separate Helix agent per node.js, you can
>>> create multiple participants within the same agent. Each participant
>>> will represent one node.js process.
>>> 3. The monitoring of participant LIVEINSTANCES and the killing of node.js
>>> processes can be done by one of the Helix agents. You create another
>>> resource using the leader-standby model. Only one Helix agent will be the
>>> leader, and it will monitor the LIVEINSTANCES; if any Helix agent dies, it
>>> can ask the corresponding node.js servers to kill themselves (you can use
>>> HTTP or any other mechanism of your choice). The idea here is to
>>> designate one leader in the system to ensure that the Helix agent and
>>> node.js act like a pair.
>>>
>>> You should try this only if you find that the JVM overhead is significant
>>> with the approach you have listed.
>>>
>>> Thanks,
>>> Kishore G
>>>
>>>
>>> On Fri, Jun 14, 2013 at 8:37 PM, Lance Co Ting Keh <[email protected]> wrote:
>>>
>>>> Thank you for your advice, Santiago. That is certainly part of the
>>>> design as well.
>>>>
>>>> Best,
>>>> Lance
>>>>
>>>>
>>>> On Fri, Jun 14, 2013 at 5:32 PM, Santiago Perez <[email protected]> wrote:
>>>>
>>>>> Helix user here (not developer), so take my words with a grain of salt.
>>>>>
>>>>> Regarding 6, you might want to consider the behavior of the node.js
>>>>> instance if that instance loses its connection to ZK. You'll probably
>>>>> want to kill it too; otherwise it could miss the fact that the JVM lost
>>>>> its connection as well.
>>>>>
>>>>> Regards,
>>>>> Santiago
>>>>>
>>>>>
>>>>> On Fri, Jun 14, 2013 at 6:30 PM, Lance Co Ting Keh <[email protected]> wrote:
>>>>>
>>>>>> We have a working prototype of basically something like the #2 you
>>>>>> proposed above. We're using the standard Helix participant, and in the
>>>>>> @Transition methods of the state model we send commands to node.js via
>>>>>> HTTP.
>>>>>>
>>>>>> I want to run you through our general architecture to make sure we
>>>>>> are not violating anything on the Helix side.
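Lance's question above, how to make the admin connection give up instead of retrying forever, is answered inside Helix by Kishore's two suggestions (passing a connection timeout to ZkClient, or going through HelixManager, which fails after 60 seconds). For connect calls you wrap yourself, the same bounded-wait idea can be sketched in plain Java. `BoundedConnect` and `connectWithDeadline` are hypothetical names, not Helix APIs, and the connect attempt is modeled as a `Callable` so the logic can be exercised without a ZooKeeper server.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Helix: retries a connect attempt until it
// succeeds or the overall deadline passes, then fails loudly instead of
// retrying forever.
final class BoundedConnect {
    private BoundedConnect() {}

    static <T> T connectWithDeadline(Callable<T> attempt, Duration timeout, Duration backoff) {
        long deadline = System.nanoTime() + timeout.toNanos();
        Exception lastFailure = null;
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                return attempt.call();                 // connected: hand back the client
            } catch (Exception e) {
                lastFailure = e;                       // remember the most recent cause
            }
            // Stop if waiting another backoff period would overrun the deadline.
            if (System.nanoTime() + backoff.toNanos() >= deadline) {
                break;
            }
            try {
                Thread.sleep(backoff.toMillis());
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();    // preserve the interrupt flag
                break;
            }
        }
        throw new IllegalStateException(
            "gave up after " + attempts + " attempt(s) within " + timeout, lastFailure);
    }
}
```

The `Callable` would wrap whatever call builds your admin client. Note the limit of this design: it bounds a series of attempts that fail, but it cannot interrupt a single call that blocks indefinitely, which is why the ZkClient connection timeout Kishore describes is still the real fix for a constructor that never returns.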
>>>>>> As a reminder, what we need to guarantee is that at any given time one
>>>>>> and only one node.js process is in charge of a task.
>>>>>>
>>>>>> 1. A machine with N cores will have N (pending testing) node.js
>>>>>> processes running.
>>>>>> 2. Associated with the N node processes are also N Helix
>>>>>> participants (separate JVM instances; the reason for this comes later).
>>>>>> 3. A separate Helix controller will run on each machine and will
>>>>>> just leader-elect between machines.
>>>>>> 4. The spectator router will likely be HAProxy, and thus the Linux
>>>>>> machine running it will also run a JVM to serve as the Helix spectator.
>>>>>> 5. The state machine for each will simply be the ONLINEOFFLINE model.
>>>>>> (However, I do get error messages saying that I haven't defined an
>>>>>> OFFLINE-to-DROPPED transition. I was going to ask you about this, but it
>>>>>> is a minor detail compared to the rest of the architecture.)
>>>>>> 5. A simple Bash script will serve as a watchdog on each node.js and
>>>>>> Helix participant pair. If either of the two is "dead", the other
>>>>>> process must immediately be SIGKILLed, hence the need for one JVM
>>>>>> serving as a Helix participant for every node.js process.
>>>>>> 6. Each node.js instance sets a watch on /LIVEINSTANCES directly in
>>>>>> ZooKeeper as an extra safety blanket. If it finds that it is NOT in
>>>>>> LIVEINSTANCES, it likely means that its JVM participant lost its
>>>>>> connection to ZooKeeper while the JVM process is still running, so the
>>>>>> Bash script has not terminated the node server. In this case the node
>>>>>> server must end its own process.
>>>>>>
>>>>>> Thank you for all your help.
>>>>>>
>>>>>> Sincerely,
>>>>>> Lance
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 12, 2013 at 9:07 PM, kishore g <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Lance,
>>>>>>>
>>>>>>> Thanks for your interest in Helix. There are two possible approaches:
>>>>>>>
>>>>>>> 1. Similar to what you suggested: write a Helix participant in a
>>>>>>> non-JVM language, which in your case is node.js. There seem to be quite
>>>>>>> a few node.js implementations that can interact with ZooKeeper. A Helix
>>>>>>> participant does the following (you got it right, but I am providing
>>>>>>> the right sequence):
>>>>>>>
>>>>>>>    1. Create an ephemeral node under LIVEINSTANCES.
>>>>>>>    2. Watch the /INSTANCES/<PARTICIPANT_NAME>/MESSAGES node for
>>>>>>>    transitions.
>>>>>>>    3. After a transition is completed, update
>>>>>>>    /INSTANCES/<PARTICIPANT_NAME>/CURRENTSTATE.
>>>>>>>
>>>>>>> The controller does most of the heavy lifting of ensuring that these
>>>>>>> transitions lead to the desired configuration. It's quite easy to
>>>>>>> re-implement this in any other language; the most difficult part would
>>>>>>> be the ZooKeeper binding. We have used the Java bindings and they are
>>>>>>> solid. This is at a very high level; there are some more details I have
>>>>>>> left out, like handling connection loss/session expiry, that will
>>>>>>> require some thinking.
>>>>>>>
>>>>>>> 2. The other option is to use the Helix agent as a proxy. We added the
>>>>>>> Helix agent as part of 0.6.1, but we haven't documented it yet. Here is
>>>>>>> the gist of what it does. Think of it as a generic state transition
>>>>>>> handler: you can configure Helix to run a specific system command as
>>>>>>> part of each transition. The Helix agent is a separate process that
>>>>>>> runs alongside your actual process. Instead of the actual process
>>>>>>> getting the transition, the Helix agent gets it. As part of the
>>>>>>> transition, the Helix agent can invoke APIs on the actual process via
>>>>>>> RPC, HTTP, etc. The Helix agent simply acts as a proxy to the actual
>>>>>>> process.
>>>>>>>
>>>>>>> I have another approach and will try to write it up tonight, but
>>>>>>> before that I have a few questions:
>>>>>>>
>>>>>>> 1. How many node.js servers run on each node: one or more than one?
>>>>>>> 2. Is the spectator/router Java or non-Java based?
>>>>>>> 3. Can you provide more details about your state machine?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Kishore G
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jun 12, 2013 at 11:07 AM, Lance Co Ting Keh <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi, my name is Lance Co Ting Keh and I work at Box. You guys did a
>>>>>>>> tremendous job with Helix. We are looking to use it to manage a
>>>>>>>> cluster primarily running Node.js. Our model for using Helix would be
>>>>>>>> to have node.js or some other non-JVM library be *Participants*, a
>>>>>>>> router as a *Spectator*, and another set of machines to serve as the
>>>>>>>> *Controllers* (pending testing, we may just run master-slave
>>>>>>>> controllers on the same instances as the Participants). The
>>>>>>>> participants will interact with ZooKeeper in two ways: one is to
>>>>>>>> receive Helix state transition messages through the instance of the
>>>>>>>> HelixManager <Participant>, and the other is to interact directly with
>>>>>>>> ZooKeeper just to maintain ephemeral nodes within /INSTANCES.
>>>>>>>> Maintaining ephemeral nodes directly in ZooKeeper would be done
>>>>>>>> instead of using InstanceConfig and calling addInstance on HelixAdmin,
>>>>>>>> because of the basic health checking baked into maintaining ephemeral
>>>>>>>> nodes. Otherwise we would have to write a health checker between
>>>>>>>> Node.js and the JVM running the Participant. Are there better
>>>>>>>> alternatives for non-JVM Helix participants? I corresponded with
>>>>>>>> Kishore briefly and he mentioned Helix agents, specifically the
>>>>>>>> ProcessMonitorThread that came out in the last release.
>>>>>>>>
>>>>>>>> Thank you very much!
>>>>>>>>
>>>>>>>> Lance Co Ting Keh
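Kishore's first approach (the three participant duties) and the ephemeral-node health check Lance describes in his original message can be illustrated with a small in-memory model. This is a sketch under loud assumptions: `FakeZk` and `SketchParticipant` are hypothetical classes, not Helix or ZooKeeper APIs; `FakeZk` skips parent-node checks, keeps watches registered (real ZooKeeper watches fire once) and has no connection handling; messages are plain `FROM->TO` strings rather than Helix's real message records; and the paths are the simplified ones from the email. A node.js participant would perform the same three steps through a ZooKeeper binding.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

// Hypothetical in-memory stand-in for ZooKeeper, for illustration only.
final class FakeZk {
    private final Map<String, String> data = new TreeMap<>();
    private final Map<String, Long> ephemeralOwner = new HashMap<>();
    private final Map<String, List<Consumer<String>>> childWatchers = new HashMap<>();

    // Create a node; a non-null session makes it ephemeral (owned by that session).
    void create(String path, String value, Long ephemeralSession) {
        data.put(path, value);
        if (ephemeralSession != null) {
            ephemeralOwner.put(path, ephemeralSession);
        }
        String parent = path.substring(0, path.lastIndexOf('/'));
        for (Consumer<String> watcher : childWatchers.getOrDefault(parent, List.of())) {
            watcher.accept(path);   // notify watchers of the parent about the new child
        }
    }

    void set(String path, String value) { data.put(path, value); }
    void delete(String path) { data.remove(path); ephemeralOwner.remove(path); }
    String get(String path) { return data.get(path); }
    boolean exists(String path) { return data.containsKey(path); }

    void watchChildren(String parent, Consumer<String> watcher) {
        childWatchers.computeIfAbsent(parent, k -> new ArrayList<>()).add(watcher);
    }

    // Session expiry deletes every ephemeral node the session owned.
    void expireSession(long session) {
        ephemeralOwner.entrySet().removeIf(entry -> {
            boolean owned = entry.getValue() == session;
            if (owned) {
                data.remove(entry.getKey());
            }
            return owned;
        });
    }
}

// Sketch of the three participant duties from Kishore's email, using the
// simplified paths quoted there (a real cluster nests them under /<cluster>).
final class SketchParticipant {
    private final FakeZk zk;
    private final String name;
    private final long session;

    SketchParticipant(FakeZk zk, String name, long session) {
        this.zk = zk;
        this.name = name;
        this.session = session;
    }

    void start() {
        // 1. Announce liveness with an ephemeral node under LIVEINSTANCES.
        zk.create("/LIVEINSTANCES/" + name, "alive", session);
        // 2. Watch this instance's MESSAGES node for transition requests.
        zk.watchChildren("/INSTANCES/" + name + "/MESSAGES", this::onMessage);
    }

    private void onMessage(String messagePath) {
        String transition = zk.get(messagePath);   // e.g. "OFFLINE->ONLINE" (simplified encoding)
        String toState = transition.substring(transition.indexOf("->") + 2);
        // A real participant would do the work here, e.g. tell node.js to start serving.
        // 3. Record the completed transition, then consume the message.
        zk.set("/INSTANCES/" + name + "/CURRENTSTATE", toState);
        zk.delete(messagePath);
    }
}
```

Calling `expireSession` models what Lance relies on in his original message: when a participant's session ends, its LIVEINSTANCES entry disappears without any separate health checker.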

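Kishore's leader-standby suggestion (item 3 of his June 16 reply) and Lance's item 6 both reduce to the same check: any node.js server whose partner participant is missing from LIVEINSTANCES must stop serving. A minimal sketch of that check, assuming a hypothetical map from participant instance name to the node.js server it fronts; Helix does not provide this map, so you would maintain it yourself.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not a Helix API. participantToServer is bookkeeping
// you would maintain yourself (participant instance name -> node.js server).
final class OrphanCheck {
    private OrphanCheck() {}

    // Returns the node.js servers whose partner participant is not live.
    static Set<String> serversToKill(Set<String> liveInstances,
                                     Map<String, String> participantToServer) {
        Set<String> orphans = new TreeSet<>();   // sorted for stable output
        for (Map.Entry<String, String> pair : participantToServer.entrySet()) {
            if (!liveInstances.contains(pair.getKey())) {
                orphans.add(pair.getValue());    // partner JVM is gone: must stop serving
            }
        }
        return orphans;
    }
}
```

The leader would re-run this each time its LIVEINSTANCES watch fires and then ask each listed server to exit over HTTP or a similar channel; a node.js process applying item 6 would run the same test for itself alone. Per Santiago's caveat, the result is only trustworthy while the checker's own ZooKeeper session is healthy, since a disconnected watcher sees a stale LIVEINSTANCES list.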