Thanks Kishore, here is the link to the bug: https://issues.apache.org/jira/browse/HELIX-131
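Further down the thread, Lance asks how to keep ZKHelixAdmin from reconnecting forever. Independent of whatever hook Helix exposes, the fail-fast behavior he wants can be had by wrapping the connect call in a bounded-retry helper. This is a generic sketch; `withRetries` is an illustrative helper, not a Helix API:

```java
import java.util.concurrent.Callable;

public class BoundedRetry {
    /**
     * Run the given action up to maxAttempts times, sleeping backoffMillis
     * between attempts. Rethrows the last failure instead of retrying forever.
     */
    public static <T> T withRetries(Callable<T> action, int maxAttempts, long backoffMillis)
            throws Exception {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) Thread.sleep(backoffMillis);
            }
        }
        throw last; // fail fast after maxAttempts instead of looping infinitely
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // A stand-in for "new ZKHelixAdmin(zkAddress)" that fails twice, then succeeds.
        String result = withRetries(() -> {
            calls[0]++;
            if (calls[0] < 3) throw new RuntimeException("connection refused");
            return "connected";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The real connect attempt (constructing ZKHelixAdmin, or HelixManager.connect()) would go inside the lambda.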
On Tue, Jun 18, 2013 at 9:13 AM, kishore g <[email protected]> wrote:

My bad, I didn't realize that you needed HelixAdmin to actually create the cluster. Please file a bug; the fix is quite simple.

thanks,
Kishore G

On Tue, Jun 18, 2013 at 9:00 AM, Lance Co Ting Keh <[email protected]> wrote:

Thanks Kishore. Would you like me to file a bug fix for the first solution?

Also, with the use of the factory, I get the following error message:

    [error] org.apache.helix.HelixException: Initial cluster structure is not set up for cluster: dev-box-cluster

It seems it did not create the appropriate zNodes for me. Was there something I was supposed to initialize before calling the factory?

Thank you
Lance

On Mon, Jun 17, 2013 at 8:09 PM, kishore g <[email protected]> wrote:

Hi Lance,

Looks like we are not setting the connection timeout while connecting to ZooKeeper in ZKHelixAdmin.

The fix is to change line 99 in ZKHelixAdmin.java from

    _zkClient = new ZkClient(zkAddress);

to

    _zkClient = new ZkClient(zkAddress, timeout * 1000);

Another workaround is to use HelixManager to get HelixAdmin:

    manager = HelixManagerFactory.getZKHelixManager(cluster, "Admin",
        InstanceType.ADMINISTRATOR, zkAddress);
    manager.connect();
    admin = manager.getClusterManagmentTool();

This will wait for 60 seconds before failing.

Thanks,
Kishore G

On Mon, Jun 17, 2013 at 6:15 PM, Lance Co Ting Keh <[email protected]> wrote:

Thank you Kishore. I'll definitely try the memory consumption of one JVM per node.js server first. If it's too much we'll likely do your proposed design but execute kills via the OS. This is to ensure no rogue servers.

I have a small implementation question: when calling new ZKHelixAdmin, if it fails it retries again and again infinitely.
(val admin = new ZKHelixAdmin("")) Is there a method I can override to limit the number of reconnects and just have it fail?

Lance

On Sun, Jun 16, 2013 at 11:56 PM, kishore g <[email protected]> wrote:

Hi Lance,

Looks good to me. Having a JVM per node.js server might add additional overhead; you should definitely run this with production configuration and ensure that it does not impact performance. If you find it consuming too many resources, you can probably try this approach:

1. Have one agent per node.
2. Instead of creating a separate Helix agent per node.js, you can create multiple participants within the same agent. Each participant will represent a node.js process.
3. The monitoring of participant LIVEINSTANCES and the killing of node.js processes can be done by one of the Helix agents. You create another resource using the leader-standby model. Only one Helix agent will be the leader; it will monitor the LIVEINSTANCES, and if any Helix agent dies it can ask the corresponding node.js server to kill itself (you can use HTTP or any other mechanism of your choice). The idea here is to designate one leader in the system to ensure that each Helix agent and its node.js act like a pair.

You can try this only if you find that the overhead of a JVM per server is significant with the approach you have listed.

Thanks,
Kishore G

On Fri, Jun 14, 2013 at 8:37 PM, Lance Co Ting Keh <[email protected]> wrote:

Thank you for your advice Santiago. That is certainly part of the design as well.

Best,
Lance

On Fri, Jun 14, 2013 at 5:32 PM, Santiago Perez <[email protected]> wrote:

Helix user here (not developer), so take my words with a grain of salt.
Regarding point 7 (the /LIVEINSTANCES watch): you might want to consider the behavior of the node.js instance if it loses its connection to ZK; you'll probably want to kill it too. Otherwise you could end up ignoring the fact that the JVM lost the connection.

Regards,
Santiago

On Fri, Jun 14, 2013 at 6:30 PM, Lance Co Ting Keh <[email protected]> wrote:

We have a working prototype of basically something like #2 you proposed above. We're using the standard Helix participant, and in the @Transition methods of the state model we send commands to node.js via HTTP.

I want to run you through our general architecture to make sure we are not violating anything on the Helix side. As a reminder, what we need to guarantee is that at any given time one and only one node.js process is in charge of a task.

1. A machine with N cores will have N (pending testing) node.js processes running.
2. Associated with each of the N node processes are N Helix participants (separate JVM instances -- the reason for this comes later).
3. A separate Helix controller will be running on the machine and will just leader-elect between machines.
4. The spectator router will likely be HAProxy, and thus a Linux box will run a JVM to serve as the Helix spectator.
5. The state machine for each will simply be the ONLINE-OFFLINE model. (I do get error messages saying that I haven't defined an OFFLINE-to-DROPPED transition; I was going to ask you about this, but it is a minor detail compared to the rest of the architecture.)
6. A simple Bash script will serve as a watchdog on each node.js/Helix-participant pair. If either of the two is "dead", the other process must immediately be SIGKILLed; hence the need for one JVM serving as a Helix participant for every node.js process.
7. Each node.js instance sets a watch on /LIVEINSTANCES directly in ZooKeeper as an extra safety blanket. If it finds that it is NOT among the live instances, it likely means that its JVM participant lost its connection to ZooKeeper but the process is still running, so the Bash script has not terminated the node server. In this case the node server must end its own process.

Thank you for all your help.

Sincerely,
Lance

On Wed, Jun 12, 2013 at 9:07 PM, kishore g <[email protected]> wrote:

Hi Lance,

Thanks for your interest in Helix. There are two possible approaches.

1. Similar to what you suggested: write a Helix participant in a non-JVM language, which in your case is node.js. There seem to be quite a few node.js implementations that can interact with ZooKeeper. A Helix participant does the following (you got it right, but here is the exact sequence):

   1. Create an ephemeral node under LIVEINSTANCES.
   2. Watch the /INSTANCES/<PARTICIPANT_NAME>/MESSAGES node for transitions.
   3. After a transition is completed, update /INSTANCES/<PARTICIPANT_NAME>/CURRENTSTATE.

The controller does most of the heavy lifting of ensuring that these transitions lead to the desired configuration. It is quite easy to re-implement this in any other language; the most difficult part would be the ZooKeeper binding. We have used the Java bindings and they are solid.
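The three numbered steps above can be seen end-to-end in a toy, in-memory model of the exchange. Plain Java maps stand in for the znodes here; a real participant would use a ZooKeeper client and Helix's actual record formats, which this sketch does not attempt to reproduce:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Toy model of the participant-side protocol; not real Helix/ZooKeeper code. */
public class ParticipantSketch {
    // Stand-ins for /LIVEINSTANCES, /INSTANCES/<name>/MESSAGES, and .../CURRENTSTATE
    final Set<String> liveInstances = new HashSet<>();
    final Map<String, Queue<String>> messages = new HashMap<>();
    final Map<String, String> currentState = new HashMap<>();

    // Step 1: create an "ephemeral" node under LIVEINSTANCES.
    void register(String participant) {
        liveInstances.add(participant);
        messages.put(participant, new ArrayDeque<>());
        currentState.put(participant, "OFFLINE");
    }

    // Controller side: drop a transition message into the participant's MESSAGES.
    void sendTransition(String participant, String toState) {
        messages.get(participant).add(toState);
    }

    // Steps 2-3: the participant "watches" MESSAGES, applies each transition,
    // then updates CURRENTSTATE so the controller can observe the result.
    void handleMessages(String participant) {
        Queue<String> q = messages.get(participant);
        while (!q.isEmpty()) {
            String toState = q.poll();
            // ...run the real transition logic here (e.g., tell node.js over HTTP)...
            currentState.put(participant, toState);
        }
    }
}
```

In the real protocol, the watch fires asynchronously and session expiry removes the ephemeral node, which is exactly the connection-loss detail Kishore flags next.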
This is at a very high level; there are some more details I have left out, like handling connection loss, session expiry, etc., that will require some thinking.

2. The other option is to use the Helix agent as a proxy: we added the Helix agent as part of 0.6.1; we haven't documented it yet. Here is the gist of what it does. Think of it as a generic state transition handler. You can configure Helix to run a specific system command as part of each transition. The Helix agent is a separate process that runs alongside your actual process. Instead of the actual process getting the transition, the Helix agent gets the transition. As part of this transition the Helix agent can invoke APIs on the actual process via RPC, HTTP, etc. The Helix agent simply acts as a proxy to the actual process.

I have another approach and will try to write it up tonight, but before that I have a few questions:

1. How many node.js servers run on each node: one, or more than one?
2. Is the spectator/router Java based or non-Java based?
3. Can you provide more details about your state machine?

thanks,
Kishore G

On Wed, Jun 12, 2013 at 11:07 AM, Lance Co Ting Keh <[email protected]> wrote:

Hi, my name is Lance Co Ting Keh and I work at Box. You did a tremendous job with Helix. We are looking to use it to manage a cluster primarily running Node.js.
Our model for using Helix would be to have node.js or some other non-JVM library be the *Participants*, a router as a *Spectator*, and another set of machines to serve as the *Controllers* (pending testing, we may just run master-slave controllers on the same instances as the Participants). The participants will interact with ZooKeeper in two ways: one is to receive Helix state transition messages through the HelixManager<Participant> instance, and the other is to interact with ZooKeeper directly just to maintain ephemeral nodes within /INSTANCES. Maintaining ephemeral nodes directly in ZooKeeper would be done instead of using InstanceConfig and calling addInstance on HelixAdmin, because of the basic health checking baked into maintaining ephemeral nodes. Otherwise we would have to write a health checker between Node.js and the JVM running the Participant. Are there better alternatives for non-JVM Helix participants? I corresponded with Kishore briefly and he mentioned Helix agents, specifically the ProcessMonitorThread that came out in the last release.

Thank you very much!

Lance Co Ting Keh
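Kishore's leader-standby suggestion upthread (one leader watches /LIVEINSTANCES and asks the node.js twin of any dead agent to exit) reduces at its core to a set difference. A self-contained sketch of just that decision logic follows; the class and method names are illustrative, not Helix APIs, and a real leader would read the live set from ZooKeeper:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Sketch of the leader's check: which node.js processes have lost their Helix agent? */
public class LeaderWatch {
    /**
     * expectedAgents: every Helix agent that should be running (one per node.js).
     * liveAgents: the current children of /LIVEINSTANCES.
     * Returns the dead agents, whose paired node.js servers should be told to exit.
     */
    public static Set<String> agentsToReap(Set<String> expectedAgents, Set<String> liveAgents) {
        Set<String> dead = new TreeSet<>(expectedAgents);
        dead.removeAll(liveAgents); // expected but not live => its node.js pair is orphaned
        return dead;
    }

    public static void main(String[] args) {
        Set<String> expected = new HashSet<>();
        expected.add("agent_1");
        expected.add("agent_2");
        expected.add("agent_3");
        Set<String> live = new HashSet<>();
        live.add("agent_1");
        live.add("agent_3");
        // agent_2 died: the leader should ask its node.js twin to kill itself (e.g., via HTTP).
        System.out.println(agentsToReap(expected, live)); // prints [agent_2]
    }
}
```

The leader would run this on every /LIVEINSTANCES child-change event; combined with the node.js-side watch in point 7 of Lance's list, both halves of a dead pair get cleaned up.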
