Re: The solution to the sporadic connection refused exceptions

Till Westmann Fri, 28 Aug 2015 23:01:42 -0700

Continuing this discussion on https://asterix-gerrit.ics.uci.edu/#/c/365
(which gets mirrored on this list anyway).


Cheers,
Till

> On Aug 27, 2015, at 11:52 PM, Ian Maxon <[email protected]> wrote:
> 
>> And Managix uses Zookeeper to manage its information, but YARN doesn’t.
> 
> To put some background into this, I only chose to eschew use of ZK
> because it isn't a requirement in a YARN 2.2.0 cluster, and I could do
> what I needed via HDFS and some polling on the CC. I'm not opposed to
> integrating it further though (and making the YARN client make use of
> that).
> 
> - Ian
> 
> On Thu, Aug 27, 2015 at 7:58 PM, Till Westmann <[email protected]> wrote:
>> I’m not really deep into this topic, but I’d like to understand a little 
>> better.
>> 
>> As I understand it, we currently have 2 ways to deploy/manage AsterixDB: a) 
>> using Managix and b) using YARN.
>> And Managix uses Zookeeper to manage its information, but YARN doesn’t.
>> Also, neither the Asterix CC nor the NC depends on the existence of Zookeeper.
>> 
>> Is this correct so far?
>> 
>> Now we are trying to find a way to ensure that an AsterixDB client can 
>> reliably know if the cluster is up or down.
>> 
>> My first assumption for the properties that the solution to this problem 
>> would have is:
>> 1) The knowledge of whether the cluster is up or down is available in the CC
>> (as it controls the cluster).
>> 2) The mechanism used to expose that information works for both ways to 
>> deploy/manage a cluster.
>> 
>> A simple way to do that seems to be to send a request “waitUntilStarted” to
>> the CC that returns to the client once the CC has determined that everything
>> has started. The response to that request would either be “yes” (cluster is
>> up), “no” (an error occurred and it won’t be up without intervention), or
>> “not sure” (timeout - please ask again later). This would imply that the
>> client is polling, but it wouldn’t be very busy if the timeout is reasonable.
>> 
>> Now this doesn’t seem to be where the discussion is going and I’d like to
>> find out where it is going and why.
>> 
>> Could you help me?
>> 
>> Thanks,
>> Till
>> 
>> 
>>> On Aug 25, 2015, at 7:23 AM, Raman Grover <[email protected]> wrote:
>>> 
>>> As I mentioned before...
>>> "The information for an AsterixDB instance is "lazily" refreshed when a
>>> management operation is invoked (using managix set of commands) or an
>>> explicit describe command is invoked. "
>>> 
>>> Above, the commands are the Managix set of commands (create, start,
>>> describe etc.) that trigger a refresh, and so it's "lazy". Currently the CC
>>> does not notify Managix. What we are discussing is an elegant way to have
>>> the CC relay information to Managix.
>>> 
>>> On Tue, Aug 25, 2015 at 4:10 AM, abdullah alamoudi <[email protected]>
>>> wrote:
>>> 
>>>> I don't think that is there yet but the intention is to have it at some
>>>> point in the future.
>>>> 
>>>> Cheers,
>>>> Abdullah.
>>>> 
>>>> On Tue, Aug 25, 2015 at 12:38 PM, Chris Hillery <[email protected]>
>>>> wrote:
>>>> 
>>>>> Very interesting, thank you. Can you point out a couple places in the code
>>>>> where some of this logic is kept? Specifically where "CC can update this
>>>>> information and notify Managix" sounds interesting...
>>>>> 
>>>>> Ceej
>>>>> aka Chris Hillery
>>>>> 
>>>>> On Tue, Aug 25, 2015 at 12:49 AM, Raman Grover <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>>> , and what code is
>>>>>>> responsible for keeping it up-to-date?
>>>>>>> 
>>>>>> Apparently, no one is :-)
>>>>>> 
>>>>>> The information for an AsterixDB instance is "lazily" refreshed when a
>>>>>> management operation is invoked (using managix set of commands) or an
>>>>>> explicit describe command is invoked.
>>>>>> Between the time t1 (when state of an AsterixDB instance changes, say due
>>>>>> to NC failure) and t2 (when a management operation is invoked), the
>>>>>> information about the AsterixDB instance inside Zookeeper remains stale. CC
>>>>>> can update this information and notify Managix; this way Managix realizes
>>>>>> the changed state as soon as it has occurred. This can be particularly
>>>>>> useful when showing on a management console the up-to-date state of an
>>>>>> instance in real time or having Managix respond to an event.
>>>>>> 
>>>>>> Regards,
>>>>>> Raman
>>>>>> 
>>>>>> ---------- Forwarded message ----------
>>>>>> From: abdullah alamoudi <[email protected]>
>>>>>> Date: Tue, Aug 25, 2015 at 12:27 AM
>>>>>> Subject: Re: The solution to the sporadic connection refused exceptions
>>>>>> To: [email protected]
>>>>>> 
>>>>>> 
>>>>>> On Tue, Aug 25, 2015 at 3:40 AM, Chris Hillery <[email protected]>
>>>>>> wrote:
>>>>>> 
>>>>>>> Perhaps an aside, but: exactly what is kept in Zookeeper
>>>>>> 
>>>>>> 
>>>>>> A serialized instance of edu.uci.ics.asterix.event.model.AsterixInstance
>>>>>> 
>>>>>> 
>>>>>>> , and what code is
>>>>>>> responsible for keeping it up-to-date?
>>>>>>> 
>>>>>> Apparently, no one is :-)
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> Ceej
>>>>>>> 
>>>>>>> On Mon, Aug 24, 2015 at 5:28 PM, Raman Grover <[email protected]> wrote:
>>>>>>> 
>>>>>>>> Well, the state of an instance (and metadata including configuration) is
>>>>>>>> kept in a Zookeeper instance that is accessible to Managix and CC. CC should
>>>>>>>> be able to set the state of the cluster in Zookeeper under the right znode,
>>>>>>>> which can be viewed by Managix.
>>>>>>>> 
>>>>>>>> There exists a communication channel for CC and Managix to share
>>>>>>>> information on state etc. I am not sure if we need another channel such as
>>>>>>>> RMI between Managix and CC.
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Raman
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Mon, Aug 24, 2015 at 12:58 PM, abdullah alamoudi <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> Well, it depends on your definition of the boundaries of managix. What I
>>>>>>>>> did is that I added an RMI object in the InstallerDriver which basically
>>>>>>>>> listens for state changes from the cluster controller. This means some
>>>>>>>>> additional logic in the CCApplicationEntryPoint where, after the CC is
>>>>>>>>> ready, it contacts the InstallerDriver using RMI, and only at that point
>>>>>>>>> can the InstallerDriver return to managix and tell it that the startup is
>>>>>>>>> complete.
>>>>>>>>> 
>>>>>>>>> Not sure if this is the right way to do it but it definitely is better than
>>>>>>>>> what we currently have.
>>>>>>>>> Abdullah.
>>>>>>>>> 
>>>>>>>>> On Mon, Aug 24, 2015 at 10:00 PM, Chris Hillery <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>>> Hopefully the solution won't involve additional important logic inside
>>>>>>>>>> Managix itself?
>>>>>>>>>> 
>>>>>>>>>> Ceej
>>>>>>>>>> aka Chris Hillery
>>>>>>>>>> 
>>>>>>>>>> On Mon, Aug 24, 2015 at 7:26 AM, abdullah alamoudi <[email protected]> wrote:
>>>>>>>>>> 
>>>>>>>>>>> That works but it doesn't feel right doing it this way. I am going to fix
>>>>>>>>>>> this one for good.
>>>>>>>>>>> 
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Abdullah.
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, Aug 24, 2015 at 5:11 PM, Ian Maxon <[email protected]> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> The way I assured liveness for the YARN installer was to try running "for
>>>>>>>>>>>> $x in dataset Metadata.Dataset return $x" via the API. I just polled for a
>>>>>>>>>>>> reasonable amount of time (though honestly, thinking about it now, the
>>>>>>>>>>>> correct parameter to use for the polling interval is the startup wait time
>>>>>>>>>>>> in the parameters file :) ). It's not perfect, but it gives fewer false
>>>>>>>>>>>> positives than just checking ps for processes that look like CCs/NCs.
>>>>>>>>>>>> 
>>>>>>>>>>>> - Ian.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, Aug 24, 2015 at 5:03 AM, abdullah alamoudi <[email protected]> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Now that I think about it, maybe we should provide multiple ways to do
>>>>>>>>>>>>> this: a polling mechanism that can be used at any time and a pushing
>>>>>>>>>>>>> mechanism on startup.
>>>>>>>>>>>>> I am going to start implementation of this and will probably use RMI for
>>>>>>>>>>>>> this task both ways (CC to InstallerDriver and InstallerDriver to CC).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Abdullah.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Mon, Aug 24, 2015 at 2:19 PM, abdullah alamoudi <[email protected]> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> So after further investigation, it turned out that our startup process
>>>>>>>>>>>>>> just starts the CC and NC processes and then makes sure the processes are
>>>>>>>>>>>>>> running; if the processes are found to be running, it returns the state of
>>>>>>>>>>>>>> the cluster as active and the subsequent test commands can start
>>>>>>>>>>>>>> immediately.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This means that the CC could've started but is not yet ready when we try
>>>>>>>>>>>>>> to process the next command. To address this, we need a better way to tell
>>>>>>>>>>>>>> when the startup procedure has completed. We can do this by pushing (the CC
>>>>>>>>>>>>>> informs the installer driver when the startup is complete) or polling (the
>>>>>>>>>>>>>> installer driver needs to actually query the CC for the state of the
>>>>>>>>>>>>>> cluster).
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I can do either way so let's vote. My vote goes to the pushing mechanism.
>>>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Mon, Aug 24, 2015 at 10:15 AM, abdullah alamoudi <[email protected]> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> This solution turned out to be incorrect. Actually, the test cases, when I
>>>>>>>>>>>>>>> build after using the join method, never fail, but running an actual
>>>>>>>>>>>>>>> asterix instance never succeeds, which is quite confusing.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I also think that the startup script has a major bug where it might
>>>>>>>>>>>>>>> return before the startup is complete. More on this later......
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Mon, Aug 24, 2015 at 7:48 AM, abdullah alamoudi <[email protected]> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> It is highly unlikely that it is related.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Abdullah.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Mon, Aug 24, 2015 at 5:45 AM, Chen Li <[email protected]> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> @Abdullah: Is this issue related to
>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/ASTERIXDB-1074? Ian and I plan to
>>>>>>>>>>>>>>>>> look into the details on Monday.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Sun, Aug 23, 2015 at 10:08 AM, abdullah alamoudi <[email protected]> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> About 3-4 days ago, I was working on the addition of the filesystem based
>>>>>>>>>>>>>>>>>> feed adapter and it didn't take any time to complete. However, when I wanted
>>>>>>>>>>>>>>>>>> to build and make sure all tests pass, I kept getting ConnectionRefused
>>>>>>>>>>>>>>>>>> errors which caused the installer tests to fail every now and then.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I knew the new change had nothing to do with this failure, yet I couldn't
>>>>>>>>>>>>>>>>>> direct my attention away from this bug (it just bothered me so much and I
>>>>>>>>>>>>>>>>>> knew it needed to be resolved ASAP). After wasting countless hours, I was
>>>>>>>>>>>>>>>>>> finally able to figure out what was happening :-)
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> In the startup routine, we start three Jetty web servers (Web interface
>>>>>>>>>>>>>>>>>> server, JSON API server, and Feed server). Some time ago, we used to end the
>>>>>>>>>>>>>>>>>> startup call before making sure the server.isStarted() method returns true
>>>>>>>>>>>>>>>>>> on all servers. At that time, I introduced the waitUntilServerStarts method
>>>>>>>>>>>>>>>>>> to make sure we don't return before the servers are ready. It turned out
>>>>>>>>>>>>>>>>>> that was an incorrect way to handle this (we can blame stackoverflow for
>>>>>>>>>>>>>>>>>> this one!) and it is not enough that the server isStarted() returns true.
>>>>>>>>>>>>>>>>>> The correct way to do this is to call the server.join() method after the
>>>>>>>>>>>>>>>>>> server.start().
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> See:
>>>>>>>>>>>>>>>>>> http://stackoverflow.com/questions/15924874/embedded-jetty-why-to-use-join
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> This was equally satisfying as it was frustrating and you are welcome for
>>>>>>>>>>>>>>>>>> the future time I saved each of you :)
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Amoudi, Abdullah.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Amoudi, Abdullah.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Amoudi, Abdullah.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Amoudi, Abdullah.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Amoudi, Abdullah.
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Amoudi, Abdullah.
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Amoudi, Abdullah.
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Raman
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Amoudi, Abdullah.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Raman
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Amoudi, Abdullah.
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Raman
>>
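
The "waitUntilStarted" request proposed in the thread, with its "yes" / "no" /
"not sure" replies, could be driven from the client side roughly as follows.
This is only a sketch under assumptions: StartupStatus, StartupPoller and
awaitCluster are hypothetical names, and the Supplier stands in for whatever
call actually reaches the CC. None of this is existing AsterixDB API.

```java
import java.util.function.Supplier;

/** Hypothetical tri-state reply, mirroring "yes" / "no" / "not sure" from the thread. */
enum StartupStatus { STARTED, FAILED, TIMED_OUT }

final class StartupPoller {
    private StartupPoller() {}

    /**
     * Repeats the (hypothetical) waitUntilStarted request until it gives a
     * definite answer or maxAttempts requests have timed out.
     * The supplier is assumed to block on the CC side for up to its own timeout,
     * so this loop does not spin even though the client is technically polling.
     */
    static StartupStatus awaitCluster(Supplier<StartupStatus> waitUntilStarted, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            StartupStatus status = waitUntilStarted.get();
            if (status != StartupStatus.TIMED_OUT) {
                // Definite answer: the cluster is up, or it failed and needs intervention.
                return status;
            }
            // TIMED_OUT means "not sure": ask again.
        }
        return StartupStatus.TIMED_OUT;
    }
}
```

The same loop would work for a Managix-managed or a YARN-managed cluster, since
it only assumes that the client can reach the CC.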

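The push alternative discussed in the thread (the CC informs the installer
driver when startup is complete) can be illustrated with an in-process
stand-in. This is a sketch under assumptions: StartupListener and its methods
are hypothetical names, and a CountDownLatch replaces the RMI call that the
thread describes between the CC and the InstallerDriver.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the object the installer would expose to the CC. */
final class StartupListener {
    private final CountDownLatch started = new CountDownLatch(1);

    /** Called from the CC side once it considers startup complete (the "push"). */
    void startupComplete() {
        started.countDown();
    }

    /**
     * Called from the installer side. Returns true if the notification arrived
     * within the timeout, false if the installer should give up or ask again.
     */
    boolean awaitStartup(long timeout, TimeUnit unit) throws InterruptedException {
        return started.await(timeout, unit);
    }
}
```

The timeout matters: without it, an installer waiting for a CC that never
finishes starting would block forever.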