Continuing this discussion on https://asterix-gerrit.ics.uci.edu/#/c/365 (which gets mirrored on this list anyway).
Cheers,
Till

> On Aug 27, 2015, at 11:52 PM, Ian Maxon <[email protected]> wrote:
>
>> And Managix uses Zookeeper to manage its information, but YARN doesn’t.
>
> To put some background into this, I only chose to eschew use of ZK because it isn't a requirement in a YARN 2.2.0 cluster, and I could do what I needed via HDFS and some polling on the CC. I'm not opposed to integrating it further though (and making the YARN client make use of that).
>
> - Ian
>
> On Thu, Aug 27, 2015 at 7:58 PM, Till Westmann <[email protected]> wrote:
>> I’m not really deep into this topic, but I’d like to understand it a little better.
>>
>> As I understand it, we currently have 2 ways to deploy/manage AsterixDB: a) using Managix and b) using YARN.
>> And Managix uses Zookeeper to manage its information, but YARN doesn’t.
>> Also, neither the Asterix CC nor the NC depends on the existence of Zookeeper.
>>
>> Is this correct so far?
>>
>> Now we are trying to find a way to ensure that an AsterixDB client can reliably know if the cluster is up or down.
>>
>> My first assumption for the properties that the solution to this problem would have is:
>> 1) The knowledge if the cluster is up or down is available in the CC (as it controls the cluster).
>> 2) The mechanism used to expose that information works for both ways to deploy/manage a cluster.
>>
>> A simple way to do that seems to be to send a request “waitUntilStarted” to the CC that returns to the client once the CC has determined that everything has started. The response to that request would be “yes” (cluster is up), “no” (an error occurred and it won’t be up without intervention), or “not sure” (timeout - please ask again later). This would imply that the client is polling, but it wouldn’t be very busy if the timeout is reasonable.
>>
>> Now this doesn’t seem to be where the discussion is going and I’d like to find out where it is going and why.
>>
>> Could you help me?
>>
>> Thanks,
>> Till
>>
>>> On Aug 25, 2015, at 7:23 AM, Raman Grover <[email protected]> wrote:
>>>
>>> As I mentioned before...
>>> "The information for an AsterixDB instance is "lazily" refreshed when a management operation is invoked (using managix set of commands) or an explicit describe command is invoked."
>>>
>>> Above, the commands are the Managix set of commands (create, start, describe etc.) that trigger a refresh, and so it's "lazy". Currently the CC does not notify Managix. What we are discussing is an elegant way to have the CC relay information to Managix.
>>>
>>> On Tue, Aug 25, 2015 at 4:10 AM, abdullah alamoudi <[email protected]> wrote:
>>>
>>>> I don't think that is there yet but the intention is to have it at some point in the future.
>>>>
>>>> Cheers,
>>>> Abdullah.
>>>>
>>>> On Tue, Aug 25, 2015 at 12:38 PM, Chris Hillery <[email protected]> wrote:
>>>>
>>>>> Very interesting, thank you. Can you point out a couple places in the code where some of this logic is kept? Specifically where "CC can update this information and notify Managix" sounds interesting...
>>>>>
>>>>> Ceej
>>>>> aka Chris Hillery
>>>>>
>>>>> On Tue, Aug 25, 2015 at 12:49 AM, Raman Grover <[email protected]> wrote:
>>>>>
>>>>>>> , and what code is responsible for keeping it up-to-date?
>>>>>>>
>>>>>> Apparently, no one is :-)
>>>>>>
>>>>>> The information for an AsterixDB instance is "lazily" refreshed when a management operation is invoked (using managix set of commands) or an explicit describe command is invoked.
>>>>>> Between the time t1 (when the state of an AsterixDB instance changes, say due to NC failure) and t2 (when a management operation is invoked), the information about the AsterixDB instance inside Zookeeper remains stale. CC can update this information and notify Managix; this way Managix realizes the changed state as soon as it has occurred.
>>>>>> This can be particularly useful when showing on a management console the up-to-date state of an instance in real time or having Managix respond to an event.
>>>>>>
>>>>>> Regards,
>>>>>> Raman
>>>>>>
>>>>>> ---------- Forwarded message ----------
>>>>>> From: abdullah alamoudi <[email protected]>
>>>>>> Date: Tue, Aug 25, 2015 at 12:27 AM
>>>>>> Subject: Re: The solution to the sporadic connection refused exceptions
>>>>>> To: [email protected]
>>>>>>
>>>>>> On Tue, Aug 25, 2015 at 3:40 AM, Chris Hillery <[email protected]> wrote:
>>>>>>
>>>>>>> Perhaps an aside, but: exactly what is kept in Zookeeper
>>>>>>
>>>>>> A serialized instance of edu.uci.ics.asterix.event.model.AsterixInstance
>>>>>>
>>>>>>> , and what code is responsible for keeping it up-to-date?
>>>>>>>
>>>>>> Apparently, no one is :-)
>>>>>>
>>>>>>> Ceej
>>>>>>>
>>>>>>> On Mon, Aug 24, 2015 at 5:28 PM, Raman Grover <[email protected]> wrote:
>>>>>>>
>>>>>>>> Well, the state of an instance (and metadata including configuration) is kept in a Zookeeper instance that is accessible to Managix and CC. CC should be able to set the state of the cluster in Zookeeper under the right znode, which can be viewed by Managix.
>>>>>>>>
>>>>>>>> There exists a communication channel for CC and Managix to share information on state etc. I am not sure if we need another channel such as RMI between Managix and CC.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Raman
>>>>>>>>
>>>>>>>> On Mon, Aug 24, 2015 at 12:58 PM, abdullah alamoudi <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Well, it depends on your definition of the boundaries of managix. What I did is that I added an RMI object in the InstallerDriver which basically listens for state changes from the cluster controller.
>>>>>>>>> This means some additional logic in the CCApplicationEntryPoint where, after the CC is ready, it contacts the InstallerDriver using RMI, and only at that point can the InstallerDriver return to managix and tell it that the startup is complete.
>>>>>>>>>
>>>>>>>>> Not sure if this is the right way to do it but it definitely is better than what we currently have.
>>>>>>>>> Abdullah.
>>>>>>>>>
>>>>>>>>> On Mon, Aug 24, 2015 at 10:00 PM, Chris Hillery <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hopefully the solution won't involve additional important logic inside Managix itself?
>>>>>>>>>>
>>>>>>>>>> Ceej
>>>>>>>>>> aka Chris Hillery
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 24, 2015 at 7:26 AM, abdullah alamoudi <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> That works but it doesn't feel right doing it this way. I am going to fix this one for good.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Abdullah.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 24, 2015 at 5:11 PM, Ian Maxon <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The way I assured liveness for the YARN installer was to try running "for $x in dataset Metadata.Dataset return $x" via the API. I just polled for a reasonable amount of time (though honestly, thinking about it now, the correct parameter to use for the polling interval is the startup wait time in the parameters file :) ). It's not perfect, but it gives fewer false positives than just checking ps for processes that look like CCs/NCs.
>>>>>>>>>>>>
>>>>>>>>>>>> - Ian.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Aug 24, 2015 at 5:03 AM, abdullah alamoudi <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Now that I think about it, maybe we should provide multiple ways to do this: a polling mechanism to be used at arbitrary times and a pushing mechanism on startup.
>>>>>>>>>>>>> I am going to start implementation of this and will probably use RMI for this task both ways (CC to InstallerDriver and InstallerDriver to CC).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Abdullah.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Aug 24, 2015 at 2:19 PM, abdullah alamoudi <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> So after further investigation, it turned out our startup process just starts the CC and NC processes and then makes sure the processes are running. If the processes are found to be running, it returns the state of the cluster as active and the subsequent test commands can start immediately.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This means that the CC could've started but is not yet ready when we try to process the next command. To address this, we need a better way to tell when the startup procedure has completed. We can do this by pushing (the CC informs the installer driver when the startup is complete) or polling (the installer driver needs to actually query the CC for the state of the cluster).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I can do it either way, so let's vote. My vote goes to the pushing mechanism.
>>>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Aug 24, 2015 at 10:15 AM, abdullah alamoudi <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This solution turned out to be incorrect. Actually, the test cases when I build after using the join method never fail, but running an actual asterix instance never succeeds, which is quite confusing.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I also think that the startup script has a major bug where it might return before the startup is complete. More on this later......
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Aug 24, 2015 at 7:48 AM, abdullah alamoudi <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It is highly unlikely that it is related.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Abdullah.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Aug 24, 2015 at 5:45 AM, Chen Li <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @Abdullah: Is this issue related to https://issues.apache.org/jira/browse/ASTERIXDB-1074? Ian and I plan to look into the details on Monday.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sun, Aug 23, 2015 at 10:08 AM, abdullah alamoudi <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> About 3-4 days ago, I was working on the addition of the filesystem based feed adapter and it didn't take any time to complete.
>>>>>>>>>>>>>>>>>> However, when I wanted to build and make sure all tests pass, I kept getting ConnectionRefused errors which caused the installer tests to fail every now and then.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I knew the new change had nothing to do with this failure, yet I couldn't direct my attention away from this bug (it just bothered me so much and I knew it needed to be resolved ASAP). After wasting countless hours, I was finally able to figure out what was happening :-)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In the startup routine, we start three Jetty web servers (Web interface server, JSON API server, and Feed server). Some time ago, we used to end the startup call before making sure the server.isStarted() method returns true on all servers. At that time, I introduced the waitUntilServerStarts method to make sure we don't return before the servers are ready. It turned out that was an incorrect way to handle this (we can blame stackoverflow for this one!) and it is not enough that the server's isStarted() returns true. The correct way to do this is to call the server.join() method after the server.start().
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> See:
>>>>>>>>>>>>>>>>>> http://stackoverflow.com/questions/15924874/embedded-jetty-why-to-use-join
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This was equally satisfying as it was frustrating and you are welcome for the future time I saved each of you :)
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Amoudi, Abdullah.
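
For reference, the “waitUntilStarted” idea quoted above (answer “yes”, “no”, or “not sure” after a timeout, with the client asking again) could be sketched roughly as follows. This is only an illustration under assumptions: StartupProbe, Answer, markStarted, markFailed and pollUntilDecided are hypothetical names, nothing here is existing AsterixDB, Managix or YARN client code, and the transport between client and CC (RMI, HTTP, ...) is left out entirely.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch of a three-answer startup check; not AsterixDB code. */
public class StartupProbe {
    /** The three answers from the proposal: yes, no, not sure. */
    public enum Answer { YES, NO, NOT_SURE }

    // Assumed CC-side state: completed once startup has finished,
    // completed exceptionally if startup failed.
    private final CompletableFuture<Boolean> startup = new CompletableFuture<>();

    /** Called by the (hypothetical) CC startup code when everything is up. */
    public void markStarted() {
        startup.complete(true);
    }

    /** Called by the (hypothetical) CC startup code when startup failed. */
    public void markFailed(Throwable cause) {
        startup.completeExceptionally(cause);
    }

    /** CC side: wait at most timeoutMillis, then give one of the three answers. */
    public Answer waitUntilStarted(long timeoutMillis) throws InterruptedException {
        try {
            return startup.get(timeoutMillis, TimeUnit.MILLISECONDS) ? Answer.YES : Answer.NO;
        } catch (TimeoutException e) {
            return Answer.NOT_SURE; // timeout: please ask again later
        } catch (ExecutionException e) {
            return Answer.NO; // startup failed; needs intervention
        }
    }

    /** Client side: ask again while the answer is "not sure", up to maxAttempts. */
    public static Answer pollUntilDecided(StartupProbe cc, long timeoutMillis, int maxAttempts)
            throws InterruptedException {
        Answer answer = Answer.NOT_SURE;
        for (int i = 0; i < maxAttempts && answer == Answer.NOT_SURE; i++) {
            answer = cc.waitUntilStarted(timeoutMillis);
        }
        return answer;
    }
}
```

Because each request blocks on the CC side until the timeout, the client does poll but is not busy, and the same call would work whether the cluster was started through Managix or YARN, since only the CC is involved.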
