> And Managix uses Zookeeper to manage its information, but YARN doesn't.
To give some background: I chose to eschew use of ZK only because it isn't
a requirement in a YARN 2.2.0 cluster, and I could do what I needed via
HDFS and some polling on the CC. I'm not opposed to integrating it further
though (and making the YARN client make use of it).

- Ian

On Thu, Aug 27, 2015 at 7:58 PM, Till Westmann <[email protected]> wrote:
> I'm not really deep into this topic, but I'd like to understand a little
> better.
>
> As I understand it, we currently have 2 ways to deploy/manage AsterixDB:
> a) using Managix and b) using YARN.
> And Managix uses Zookeeper to manage its information, but YARN doesn't.
> Also, neither the Asterix CC nor the NC depends on the existence of
> Zookeeper.
>
> Is this correct so far?
>
> Now we are trying to find a way to ensure that an AsterixDB client can
> reliably know if the cluster is up or down.
>
> My first assumption for the properties that the solution to this problem
> would have is:
> 1) The knowledge of whether the cluster is up or down is available in
> the CC (as it controls the cluster).
> 2) The mechanism used to expose that information works for both ways to
> deploy/manage a cluster.
>
> A simple way to do that seems to be to send a request "waitUntilStarted"
> to the CC that returns to the client once the CC has determined that
> everything has started. The response to that request would either be
> "yes" (cluster is up), "no" (an error occurred and it won't be up
> without intervention), or "not sure" (timeout - please ask again later).
> This would imply that the client is polling, but it wouldn't be very
> busy if the timeout is reasonable.
>
> Now this doesn't seem to be where the discussion is going, and I'd like
> to find out where it is going and why.
>
> Could you help me?
>
> Thanks,
> Till
>
>
>> On Aug 25, 2015, at 7:23 AM, Raman Grover <[email protected]> wrote:
>>
>> As I mentioned before...
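[Editor's note: Till's proposed "waitUntilStarted" request with its three possible answers could be sketched as below. All names here (ClusterStatus, ClusterController, pollUntilDecided) are invented for illustration; no such API exists in the codebase.]

```java
public class WaitUntilStarted {
    // The three answers proposed: up, down for good, or ask again later.
    enum ClusterStatus { UP, DOWN, UNKNOWN }

    interface ClusterController {
        ClusterStatus waitUntilStarted();  // may block up to its own timeout
    }

    // Client side: keep asking while the answer is UNKNOWN (timeout).
    static ClusterStatus pollUntilDecided(ClusterController cc, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            ClusterStatus s = cc.waitUntilStarted();
            if (s != ClusterStatus.UNKNOWN) {
                return s;
            }
        }
        return ClusterStatus.UNKNOWN;
    }

    public static void main(String[] args) {
        // Fake CC that answers UNKNOWN twice before reporting UP.
        final int[] calls = {0};
        ClusterController cc = () ->
                ++calls[0] < 3 ? ClusterStatus.UNKNOWN : ClusterStatus.UP;
        System.out.println(pollUntilDecided(cc, 10));  // prints UP
    }
}
```

As Till notes, the client stays cheap as long as each call blocks for a reasonable timeout on the CC side rather than returning immediately.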
>> "The information for an AsterixDB instance is 'lazily' refreshed when
>> a management operation is invoked (using the Managix set of commands)
>> or an explicit describe command is invoked."
>>
>> Above, the commands are the Managix set of commands (create, start,
>> describe, etc.) that trigger a refresh, so it's "lazy". Currently the
>> CC does not notify Managix; what we are discussing is an elegant way
>> to have the CC relay information to Managix.
>>
>> On Tue, Aug 25, 2015 at 4:10 AM, abdullah alamoudi <[email protected]>
>> wrote:
>>
>>> I don't think that is there yet, but the intention is to have it at
>>> some point in the future.
>>>
>>> Cheers,
>>> Abdullah.
>>>
>>> On Tue, Aug 25, 2015 at 12:38 PM, Chris Hillery <[email protected]>
>>> wrote:
>>>
>>>> Very interesting, thank you. Can you point out a couple of places in
>>>> the code where some of this logic is kept? Specifically, "CC can
>>>> update this information and notify Managix" sounds interesting...
>>>>
>>>> Ceej
>>>> aka Chris Hillery
>>>>
>>>> On Tue, Aug 25, 2015 at 12:49 AM, Raman Grover <[email protected]>
>>>> wrote:
>>>>
>>>>>> , and what code is
>>>>>> responsible for keeping it up-to-date?
>>>>>>
>>>>> Apparently, no one is :-)
>>>>>
>>>>> The information for an AsterixDB instance is "lazily" refreshed
>>>>> when a management operation is invoked (using the Managix set of
>>>>> commands) or an explicit describe command is invoked.
>>>>> Between the time t1 (when the state of an AsterixDB instance
>>>>> changes, say due to an NC failure) and t2 (when a management
>>>>> operation is invoked), the information about the AsterixDB instance
>>>>> inside Zookeeper remains stale. The CC can update this information
>>>>> and notify Managix; this way Managix realizes the changed state as
>>>>> soon as it has occurred. This can be particularly useful for
>>>>> showing the up-to-date state of an instance in real time on a
>>>>> management console, or for having Managix respond to an event.
>>>>>
>>>>> Regards,
>>>>> Raman
>>>>>
>>>>> ---------- Forwarded message ----------
>>>>> From: abdullah alamoudi <[email protected]>
>>>>> Date: Tue, Aug 25, 2015 at 12:27 AM
>>>>> Subject: Re: The solution to the sporadic connection refused
>>>>> exceptions
>>>>> To: [email protected]
>>>>>
>>>>>
>>>>> On Tue, Aug 25, 2015 at 3:40 AM, Chris Hillery <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Perhaps an aside, but: exactly what is kept in Zookeeper
>>>>>
>>>>> A serialized instance of
>>>>> edu.uci.ics.asterix.event.model.AsterixInstance
>>>>>
>>>>>> , and what code is
>>>>>> responsible for keeping it up-to-date?
>>>>>>
>>>>> Apparently, no one is :-)
>>>>>
>>>>>>
>>>>>> Ceej
>>>>>>
>>>>>> On Mon, Aug 24, 2015 at 5:28 PM, Raman Grover
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Well, the state of an instance (and metadata, including
>>>>>>> configuration) is kept in a Zookeeper instance that is accessible
>>>>>>> to Managix and the CC. The CC should be able to set the state of
>>>>>>> the cluster in Zookeeper under the right znode, which can be
>>>>>>> viewed by Managix.
>>>>>>>
>>>>>>> There already exists a communication channel for the CC and
>>>>>>> Managix to share information on state etc. I am not sure if we
>>>>>>> need another channel, such as RMI, between Managix and the CC.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Raman
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Aug 24, 2015 at 12:58 PM, abdullah alamoudi
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Well, it depends on your definition of the boundaries of
>>>>>>>> Managix. What I did is add an RMI object in the InstallerDriver
>>>>>>>> which basically listens for state changes from the cluster
>>>>>>>> controller.
>>>>>>>> This means some additional logic in CCApplicationEntryPoint:
>>>>>>>> after the CC is ready, it contacts the InstallerDriver using
>>>>>>>> RMI, and only at that point can the InstallerDriver return to
>>>>>>>> Managix and tell it that the startup is complete.
>>>>>>>>
>>>>>>>> Not sure if this is the right way to do it, but it definitely is
>>>>>>>> better than what we currently have.
>>>>>>>> Abdullah.
>>>>>>>>
>>>>>>>> On Mon, Aug 24, 2015 at 10:00 PM, Chris Hillery
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hopefully the solution won't involve additional important logic
>>>>>>>>> inside Managix itself?
>>>>>>>>>
>>>>>>>>> Ceej
>>>>>>>>> aka Chris Hillery
>>>>>>>>>
>>>>>>>>> On Mon, Aug 24, 2015 at 7:26 AM, abdullah alamoudi
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> That works, but it doesn't feel right doing it this way. I am
>>>>>>>>>> going to fix this one for good.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Abdullah.
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 24, 2015 at 5:11 PM, Ian Maxon <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> The way I assured liveness for the YARN installer was to try
>>>>>>>>>>> running "for $x in dataset Metadata.Dataset return $x" via
>>>>>>>>>>> the API. I just polled for a reasonable amount of time
>>>>>>>>>>> (though honestly, thinking about it now, the correct
>>>>>>>>>>> parameter to use for the polling interval is the startup wait
>>>>>>>>>>> time in the parameters file :) ). It's not perfect, but it
>>>>>>>>>>> gives fewer false positives than just checking ps for
>>>>>>>>>>> processes that look like CCs/NCs.
>>>>>>>>>>>
>>>>>>>>>>> - Ian.
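[Editor's note: Ian's approach, polling the HTTP API with a trivial query until it answers, can be sketched as below. The endpoint and the embedded mock server are local stand-ins for illustration, not the real AsterixDB API; a real check would run a query such as the Metadata.Dataset scan Ian mentions.]

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;

public class PollForLiveness {
    // Poll the given URL until it answers 200 OK or the deadline passes.
    static boolean waitUntilStarted(String url, long timeoutMs, long intervalMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            try {
                HttpURLConnection c = (HttpURLConnection) new URL(url).openConnection();
                c.setConnectTimeout(1000);
                if (c.getResponseCode() == 200) {
                    return true;  // the API answered: cluster is up
                }
            } catch (IOException ignored) {
                // connection refused: the CC isn't serving yet, keep polling
            }
            Thread.sleep(intervalMs);
        }
        return false;  // timed out: not up yet, ask again later
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the CC's HTTP API on an ephemeral port.
        HttpServer mockCC = HttpServer.create(new InetSocketAddress(0), 0);
        mockCC.createContext("/query", ex -> {
            ex.sendResponseHeaders(200, -1);  // 200 OK, empty body
            ex.close();
        });
        mockCC.start();
        int port = mockCC.getAddress().getPort();
        boolean up = waitUntilStarted("http://127.0.0.1:" + port + "/query", 5000, 100);
        System.out.println("cluster up: " + up);
        mockCC.stop(0);
    }
}
```

As Ian says, the polling interval and total wait belong in the deployment's parameters file rather than being hard-coded.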
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 24, 2015 at 5:03 AM, abdullah alamoudi
>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Now that I think about it, maybe we should provide multiple
>>>>>>>>>>>> ways to do this: a polling mechanism to be used at an
>>>>>>>>>>>> arbitrary time, and a pushing mechanism on startup.
>>>>>>>>>>>> I am going to start implementing this and will probably use
>>>>>>>>>>>> RMI for this task both ways (CC to InstallerDriver and
>>>>>>>>>>>> InstallerDriver to CC).
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Abdullah.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Aug 24, 2015 at 2:19 PM, abdullah alamoudi
>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> So after further investigation, it turned out our startup
>>>>>>>>>>>>> process just starts the CC and NC processes and then checks
>>>>>>>>>>>>> that the processes are running; if they are, it reports the
>>>>>>>>>>>>> state of the cluster as active, and the subsequent test
>>>>>>>>>>>>> commands can start immediately.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This means that the CC could have started but not yet be
>>>>>>>>>>>>> ready when we try to process the next command. To address
>>>>>>>>>>>>> this, we need a better way to tell when the startup
>>>>>>>>>>>>> procedure has completed. We can do this by pushing (the CC
>>>>>>>>>>>>> informs the installer driver when the startup is complete)
>>>>>>>>>>>>> or by polling (the installer driver queries the CC for the
>>>>>>>>>>>>> state of the cluster).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I can do it either way, so let's vote.
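[Editor's note: the pushing option discussed above can be sketched with a blocking startup wait as below. All names here (ClusterStateListener, PushStartupDemo) are invented for illustration; this is not the actual InstallerDriver code. In the real design the listener would be exported over RMI with UnicastRemoteObject.exportObject and invoked by the CC from CCApplicationEntryPoint once startup completes; the demo simulates that callback with a thread.]

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Shape of a hypothetical remote callback the CC could invoke over RMI.
interface ClusterStateListener extends Remote {
    void notifyStateChange(String newState) throws RemoteException;
}

public class PushStartupDemo implements ClusterStateListener {
    private final CountDownLatch active = new CountDownLatch(1);
    volatile String lastState = "UNKNOWN";

    @Override
    public void notifyStateChange(String newState) {
        lastState = newState;
        if ("ACTIVE".equals(newState)) {
            active.countDown();  // unblock whoever is waiting on startup
        }
    }

    // What the installer would do: block until the CC pushes ACTIVE.
    boolean awaitStartup(long timeoutMs) throws InterruptedException {
        return active.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        PushStartupDemo driver = new PushStartupDemo();
        // A thread stands in for the CC finishing its startup and
        // calling back; in reality this would arrive via an RMI stub.
        new Thread(() -> driver.notifyStateChange("ACTIVE")).start();
        boolean up = driver.awaitStartup(5000);
        System.out.println("startup complete: " + up + " (" + driver.lastState + ")");
    }
}
```

The latch gives the push mechanism the same blocking interface as a poll, so the installer's caller does not care which mechanism is underneath.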
>>>>>>>>>>>>> My vote goes to the pushing mechanism.
>>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Aug 24, 2015 at 10:15 AM, abdullah alamoudi
>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> This solution turned out to be incorrect. Actually, the
>>>>>>>>>>>>>> test cases never fail when I build after using the join
>>>>>>>>>>>>>> method, but running an actual Asterix instance never
>>>>>>>>>>>>>> succeeds, which is quite confusing.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I also think that the startup script has a major bug where
>>>>>>>>>>>>>> it might return before the startup is complete. More on
>>>>>>>>>>>>>> this later......
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Aug 24, 2015 at 7:48 AM, abdullah alamoudi
>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It is highly unlikely that it is related.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Abdullah.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Aug 24, 2015 at 5:45 AM, Chen Li <[email protected]>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @Abdullah: Is this issue related to
>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/ASTERIXDB-1074?
>>>>>>>>>>>>>>>> Ian and I plan to look into the details on Monday.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sun, Aug 23, 2015 at 10:08 AM, abdullah alamoudi
>>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> About 3-4 days ago, I was working on the addition of
>>>>>>>>>>>>>>>>> the filesystem-based feed adapter, and it didn't take
>>>>>>>>>>>>>>>>> any time to complete.
>>>>>>>>>>>>>>>>> However, when I wanted to build and make sure all
>>>>>>>>>>>>>>>>> tests pass, I kept getting ConnectionRefused errors
>>>>>>>>>>>>>>>>> which caused the installer tests to fail every now and
>>>>>>>>>>>>>>>>> then.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I knew the new change had nothing to do with this
>>>>>>>>>>>>>>>>> failure; yet I couldn't direct my attention away from
>>>>>>>>>>>>>>>>> this bug (it just bothered me so much, and I knew it
>>>>>>>>>>>>>>>>> needed to be resolved ASAP). After wasting countless
>>>>>>>>>>>>>>>>> hours, I was finally able to figure out what was
>>>>>>>>>>>>>>>>> happening :-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In the startup routine, we start three Jetty web
>>>>>>>>>>>>>>>>> servers (the web interface server, the JSON API
>>>>>>>>>>>>>>>>> server, and the feed server). Some time ago, we used
>>>>>>>>>>>>>>>>> to end the startup call before making sure the
>>>>>>>>>>>>>>>>> server.isStarted() method returned true on all
>>>>>>>>>>>>>>>>> servers. At that time, I introduced the
>>>>>>>>>>>>>>>>> waitUntilServerStarts method to make sure we don't
>>>>>>>>>>>>>>>>> return before the servers are ready. Turned out, that
>>>>>>>>>>>>>>>>> was an incorrect way to handle this (we can blame
>>>>>>>>>>>>>>>>> stackoverflow for this one!), and it is not enough
>>>>>>>>>>>>>>>>> that the server's isStarted() returns true. The
>>>>>>>>>>>>>>>>> correct way to do this is to call the server.join()
>>>>>>>>>>>>>>>>> method after server.start().
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> See:
>>>>>>>>>>>>>>>>> http://stackoverflow.com/questions/15924874/embedded-jetty-why-to-use-join
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This was as satisfying as it was frustrating, and you
>>>>>>>>>>>>>>>>> are welcome for the future time I saved each of you :)
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Amoudi, Abdullah.
>>
>> --
>> Raman
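[Editor's note: the start()/join() pattern referenced at the root of this thread looks roughly like the sketch below in embedded Jetty. The port and the absence of a handler are placeholders; this is an illustrative fragment, not the actual AsterixDB startup code.]

```java
import org.eclipse.jetty.server.Server;

public class WebServerStartup {
    public static void main(String[] args) throws Exception {
        Server server = new Server(19002);  // placeholder port
        server.start();
        // Per the thread, observing isStarted() == true was not a
        // sufficient readiness signal for the installer tests.
        // Server.join() delegates to the server's thread pool join(),
        // blocking the caller until the pool exits; it keeps the process
        // alive while the connectors are serving requests.
        server.join();
    }
}
```

Note that join() never returns while the server is healthy, so in a multi-server startup routine it belongs on a dedicated thread (or only on the last server started), not inline before further initialization.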
