> And Managix uses Zookeeper to manage its information, but YARN doesn't.
To give some background: I chose to eschew use of ZK only because it isn't
a requirement in a YARN 2.2.0 cluster, and I could do what I needed via
HDFS and some polling on the CC. I'm not opposed to integrating it further
though (and making the YARN client make use of it).

- Ian

On Thu, Aug 27, 2015 at 7:58 PM, Till Westmann <[email protected]> wrote:
> I'm not really deep into this topic, but I'd like to understand a little
> better.
>
> As I understand it, we currently have 2 ways to deploy/manage AsterixDB:
> a) using Managix and b) using YARN.
> And Managix uses Zookeeper to manage its information, but YARN doesn't.
> Also, neither the Asterix CC nor the NC depends on the existence of
> Zookeeper.
>
> Is this correct so far?
>
> Now we are trying to find a way to ensure that an AsterixDB client can
> reliably know if the cluster is up or down.
>
> My first assumption for the properties that the solution to this problem
> would have is:
> 1) The knowledge of whether the cluster is up or down is available in
> the CC (as it controls the cluster).
> 2) The mechanism used to expose that information works for both ways to
> deploy/manage a cluster.
>
> A simple way to do that seems to be to send a request "waitUntilStarted"
> to the CC that returns to the client once the CC has determined that
> everything has started. The response to that request would either be
> "yes" (cluster is up), "no" (an error occurred and it won't be up
> without intervention), or "not sure" (timeout - please ask again later).
> This would imply that the client is polling, but it wouldn't be very
> busy if the timeout is reasonable.
>
> Now this doesn't seem to be where the discussion is going, and I'd like
> to find out where it is going and why.
>
> Could you help me?
>
> Thanks,
> Till
>
>
>> On Aug 25, 2015, at 7:23 AM, Raman Grover <[email protected]> wrote:
>>
>> As I mentioned before...
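[Editor's note: Till's proposed "waitUntilStarted" request with its three possible answers could be sketched as below. All names here (ClusterStatus, ClusterController, pollUntilDecided) are invented for illustration; no such API exists in the codebase.]

```java
public class WaitUntilStarted {
    // The three answers proposed: up, down for good, or ask again later.
    enum ClusterStatus { UP, DOWN, UNKNOWN }

    interface ClusterController {
        ClusterStatus waitUntilStarted();  // may block up to its own timeout
    }

    // Client side: keep asking while the answer is UNKNOWN (timeout).
    static ClusterStatus pollUntilDecided(ClusterController cc, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            ClusterStatus s = cc.waitUntilStarted();
            if (s != ClusterStatus.UNKNOWN) {
                return s;
            }
        }
        return ClusterStatus.UNKNOWN;
    }

    public static void main(String[] args) {
        // Fake CC that answers UNKNOWN twice before reporting UP.
        final int[] calls = {0};
        ClusterController cc = () ->
                ++calls[0] < 3 ? ClusterStatus.UNKNOWN : ClusterStatus.UP;
        System.out.println(pollUntilDecided(cc, 10));  // prints UP
    }
}
```

As Till notes, the client stays cheap as long as each call blocks for a reasonable timeout on the CC side rather than returning immediately.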
>> "The information for an AsterixDB instance is 'lazily' refreshed when
>> a management operation is invoked (using the Managix set of commands)
>> or an explicit describe command is invoked."
>>
>> Above, the commands are the Managix set of commands (create, start,
>> describe, etc.) that trigger a refresh, so it's "lazy". Currently the
>> CC does not notify Managix; what we are discussing is an elegant way
>> to have the CC relay information to Managix.
>>
>> On Tue, Aug 25, 2015 at 4:10 AM, abdullah alamoudi <[email protected]>
>> wrote:
>>
>>> I don't think that is there yet, but the intention is to have it at
>>> some point in the future.
>>>
>>> Cheers,
>>> Abdullah.
>>>
>>> On Tue, Aug 25, 2015 at 12:38 PM, Chris Hillery <[email protected]>
>>> wrote:
>>>
>>>> Very interesting, thank you. Can you point out a couple of places in
>>>> the code where some of this logic is kept? Specifically, "CC can
>>>> update this information and notify Managix" sounds interesting...
>>>>
>>>> Ceej
>>>> aka Chris Hillery
>>>>
>>>> On Tue, Aug 25, 2015 at 12:49 AM, Raman Grover <[email protected]>
>>>> wrote:
>>>>
>>>>>> , and what code is
>>>>>> responsible for keeping it up-to-date?
>>>>>>
>>>>> Apparently, no one is :-)
>>>>>
>>>>> The information for an AsterixDB instance is "lazily" refreshed
>>>>> when a management operation is invoked (using the Managix set of
>>>>> commands) or an explicit describe command is invoked.
>>>>> Between the time t1 (when the state of an AsterixDB instance
>>>>> changes, say due to an NC failure) and t2 (when a management
>>>>> operation is invoked), the information about the AsterixDB instance
>>>>> inside Zookeeper remains stale. The CC can update this information
>>>>> and notify Managix; this way Managix realizes the changed state as
>>>>> soon as it has occurred. This can be particularly useful for
>>>>> showing the up-to-date state of an instance in real time on a
>>>>> management console, or for having Managix respond to an event.
>>>>>
>>>>> Regards,
>>>>> Raman
>>>>>
>>>>> ---------- Forwarded message ----------
>>>>> From: abdullah alamoudi <[email protected]>
>>>>> Date: Tue, Aug 25, 2015 at 12:27 AM
>>>>> Subject: Re: The solution to the sporadic connection refused
>>>>> exceptions
>>>>> To: [email protected]
>>>>>
>>>>>
>>>>> On Tue, Aug 25, 2015 at 3:40 AM, Chris Hillery <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Perhaps an aside, but: exactly what is kept in Zookeeper
>>>>>
>>>>> A serialized instance of
>>>>> edu.uci.ics.asterix.event.model.AsterixInstance
>>>>>
>>>>>> , and what code is
>>>>>> responsible for keeping it up-to-date?
>>>>>>
>>>>> Apparently, no one is :-)
>>>>>
>>>>>>
>>>>>> Ceej
>>>>>>
>>>>>> On Mon, Aug 24, 2015 at 5:28 PM, Raman Grover
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Well, the state of an instance (and metadata, including
>>>>>>> configuration) is kept in a Zookeeper instance that is accessible
>>>>>>> to Managix and the CC. The CC should be able to set the state of
>>>>>>> the cluster in Zookeeper under the right znode, which can be
>>>>>>> viewed by Managix.
>>>>>>>
>>>>>>> There already exists a communication channel for the CC and
>>>>>>> Managix to share information on state etc. I am not sure if we
>>>>>>> need another channel, such as RMI, between Managix and the CC.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Raman
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Aug 24, 2015 at 12:58 PM, abdullah alamoudi
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Well, it depends on your definition of the boundaries of
>>>>>>>> Managix. What I did is add an RMI object in the InstallerDriver
>>>>>>>> which basically listens for state changes from the cluster
>>>>>>>> controller.
>>>>>>>> This means some additional logic in CCApplicationEntryPoint:
>>>>>>>> after the CC is ready, it contacts the InstallerDriver using
>>>>>>>> RMI, and only at that point can the InstallerDriver return to
>>>>>>>> Managix and tell it that the startup is complete.
>>>>>>>>
>>>>>>>> Not sure if this is the right way to do it, but it definitely is
>>>>>>>> better than what we currently have.
>>>>>>>> Abdullah.
>>>>>>>>
>>>>>>>> On Mon, Aug 24, 2015 at 10:00 PM, Chris Hillery
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hopefully the solution won't involve additional important logic
>>>>>>>>> inside Managix itself?
>>>>>>>>>
>>>>>>>>> Ceej
>>>>>>>>> aka Chris Hillery
>>>>>>>>>
>>>>>>>>> On Mon, Aug 24, 2015 at 7:26 AM, abdullah alamoudi
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> That works, but it doesn't feel right doing it this way. I am
>>>>>>>>>> going to fix this one for good.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Abdullah.
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 24, 2015 at 5:11 PM, Ian Maxon <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> The way I assured liveness for the YARN installer was to try
>>>>>>>>>>> running "for $x in dataset Metadata.Dataset return $x" via
>>>>>>>>>>> the API. I just polled for a reasonable amount of time
>>>>>>>>>>> (though honestly, thinking about it now, the correct
>>>>>>>>>>> parameter to use for the polling interval is the startup wait
>>>>>>>>>>> time in the parameters file :) ). It's not perfect, but it
>>>>>>>>>>> gives fewer false positives than just checking ps for
>>>>>>>>>>> processes that look like CCs/NCs.
>>>>>>>>>>>
>>>>>>>>>>> - Ian.
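[Editor's note: Ian's approach, polling the HTTP API with a trivial query until it answers, can be sketched as below. The endpoint and the embedded mock server are local stand-ins for illustration, not the real AsterixDB API; a real check would run a query such as the Metadata.Dataset scan Ian mentions.]

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;

public class PollForLiveness {
    // Poll the given URL until it answers 200 OK or the deadline passes.
    static boolean waitUntilStarted(String url, long timeoutMs, long intervalMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            try {
                HttpURLConnection c = (HttpURLConnection) new URL(url).openConnection();
                c.setConnectTimeout(1000);
                if (c.getResponseCode() == 200) {
                    return true;  // the API answered: cluster is up
                }
            } catch (IOException ignored) {
                // connection refused: the CC isn't serving yet, keep polling
            }
            Thread.sleep(intervalMs);
        }
        return false;  // timed out: not up yet, ask again later
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the CC's HTTP API on an ephemeral port.
        HttpServer mockCC = HttpServer.create(new InetSocketAddress(0), 0);
        mockCC.createContext("/query", ex -> {
            ex.sendResponseHeaders(200, -1);  // 200 OK, empty body
            ex.close();
        });
        mockCC.start();
        int port = mockCC.getAddress().getPort();
        boolean up = waitUntilStarted("http://127.0.0.1:" + port + "/query", 5000, 100);
        System.out.println("cluster up: " + up);
        mockCC.stop(0);
    }
}
```

As Ian says, the polling interval and total wait belong in the deployment's parameters file rather than being hard-coded.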
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 24, 2015 at 5:03 AM, abdullah alamoudi
>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Now that I think about it, maybe we should provide multiple
>>>>>>>>>>>> ways to do this: a polling mechanism to be used at an
>>>>>>>>>>>> arbitrary time, and a pushing mechanism on startup.
>>>>>>>>>>>> I am going to start implementing this and will probably use
>>>>>>>>>>>> RMI for this task both ways (CC to InstallerDriver and
>>>>>>>>>>>> InstallerDriver to CC).
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Abdullah.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Aug 24, 2015 at 2:19 PM, abdullah alamoudi
>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> So after further investigation, it turned out our startup
>>>>>>>>>>>>> process just starts the CC and NC processes and then checks
>>>>>>>>>>>>> that the processes are running; if they are, it reports the
>>>>>>>>>>>>> state of the cluster as active, and the subsequent test
>>>>>>>>>>>>> commands can start immediately.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This means that the CC could have started but not yet be
>>>>>>>>>>>>> ready when we try to process the next command. To address
>>>>>>>>>>>>> this, we need a better way to tell when the startup
>>>>>>>>>>>>> procedure has completed. We can do this by pushing (the CC
>>>>>>>>>>>>> informs the installer driver when the startup is complete)
>>>>>>>>>>>>> or by polling (the installer driver queries the CC for the
>>>>>>>>>>>>> state of the cluster).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I can do it either way, so let's vote.
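[Editor's note: the pushing option discussed above can be sketched with a blocking startup wait as below. All names here (ClusterStateListener, PushStartupDemo) are invented for illustration; this is not the actual InstallerDriver code. In the real design the listener would be exported over RMI with UnicastRemoteObject.exportObject and invoked by the CC from CCApplicationEntryPoint once startup completes; the demo simulates that callback with a thread.]

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Shape of a hypothetical remote callback the CC could invoke over RMI.
interface ClusterStateListener extends Remote {
    void notifyStateChange(String newState) throws RemoteException;
}

public class PushStartupDemo implements ClusterStateListener {
    private final CountDownLatch active = new CountDownLatch(1);
    volatile String lastState = "UNKNOWN";

    @Override
    public void notifyStateChange(String newState) {
        lastState = newState;
        if ("ACTIVE".equals(newState)) {
            active.countDown();  // unblock whoever is waiting on startup
        }
    }

    // What the installer would do: block until the CC pushes ACTIVE.
    boolean awaitStartup(long timeoutMs) throws InterruptedException {
        return active.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        PushStartupDemo driver = new PushStartupDemo();
        // A thread stands in for the CC finishing its startup and
        // calling back; in reality this would arrive via an RMI stub.
        new Thread(() -> driver.notifyStateChange("ACTIVE")).start();
        boolean up = driver.awaitStartup(5000);
        System.out.println("startup complete: " + up + " (" + driver.lastState + ")");
    }
}
```

The latch gives the push mechanism the same blocking interface as a poll, so the installer's caller does not care which mechanism is underneath.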
>>>>>>>>>>>>> My vote goes to the pushing mechanism.
>>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Aug 24, 2015 at 10:15 AM, abdullah alamoudi
>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> This solution turned out to be incorrect. Actually, the
>>>>>>>>>>>>>> test cases never fail when I build after using the join
>>>>>>>>>>>>>> method, but running an actual Asterix instance never
>>>>>>>>>>>>>> succeeds, which is quite confusing.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I also think that the startup script has a major bug where
>>>>>>>>>>>>>> it might return before the startup is complete. More on
>>>>>>>>>>>>>> this later......
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Aug 24, 2015 at 7:48 AM, abdullah alamoudi
>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It is highly unlikely that it is related.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Abdullah.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Aug 24, 2015 at 5:45 AM, Chen Li <[email protected]>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @Abdullah: Is this issue related to
>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/ASTERIXDB-1074?
>>>>>>>>>>>>>>>> Ian and I plan to look into the details on Monday.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sun, Aug 23, 2015 at 10:08 AM, abdullah alamoudi
>>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> About 3-4 days ago, I was working on the addition of
>>>>>>>>>>>>>>>>> the filesystem-based feed adapter, and it didn't take
>>>>>>>>>>>>>>>>> any time to complete.
>>>>>>>>>>>>>>>>> However, when I wanted to build and make sure all
>>>>>>>>>>>>>>>>> tests pass, I kept getting ConnectionRefused errors
>>>>>>>>>>>>>>>>> which caused the installer tests to fail every now and
>>>>>>>>>>>>>>>>> then.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I knew the new change had nothing to do with this
>>>>>>>>>>>>>>>>> failure; yet I couldn't direct my attention away from
>>>>>>>>>>>>>>>>> this bug (it just bothered me so much, and I knew it
>>>>>>>>>>>>>>>>> needed to be resolved ASAP). After wasting countless
>>>>>>>>>>>>>>>>> hours, I was finally able to figure out what was
>>>>>>>>>>>>>>>>> happening :-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In the startup routine, we start three Jetty web
>>>>>>>>>>>>>>>>> servers (the web interface server, the JSON API
>>>>>>>>>>>>>>>>> server, and the feed server). Some time ago, we used
>>>>>>>>>>>>>>>>> to end the startup call before making sure the
>>>>>>>>>>>>>>>>> server.isStarted() method returned true on all
>>>>>>>>>>>>>>>>> servers. At that time, I introduced the
>>>>>>>>>>>>>>>>> waitUntilServerStarts method to make sure we don't
>>>>>>>>>>>>>>>>> return before the servers are ready. Turned out, that
>>>>>>>>>>>>>>>>> was an incorrect way to handle this (we can blame
>>>>>>>>>>>>>>>>> stackoverflow for this one!), and it is not enough
>>>>>>>>>>>>>>>>> that the server's isStarted() returns true. The
>>>>>>>>>>>>>>>>> correct way to do this is to call the server.join()
>>>>>>>>>>>>>>>>> method after server.start().
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> See:
>>>>>>>>>>>>>>>>> http://stackoverflow.com/questions/15924874/embedded-jetty-why-to-use-join
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This was as satisfying as it was frustrating, and you
>>>>>>>>>>>>>>>>> are welcome for the future time I saved each of you :)
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Amoudi, Abdullah.
>>
>> --
>> Raman
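[Editor's note: the start()/join() pattern referenced at the root of this thread looks roughly like the sketch below in embedded Jetty. The port and the absence of a handler are placeholders; this is an illustrative fragment, not the actual AsterixDB startup code.]

```java
import org.eclipse.jetty.server.Server;

public class WebServerStartup {
    public static void main(String[] args) throws Exception {
        Server server = new Server(19002);  // placeholder port
        server.start();
        // Per the thread, observing isStarted() == true was not a
        // sufficient readiness signal for the installer tests.
        // Server.join() delegates to the server's thread pool join(),
        // blocking the caller until the pool exits; it keeps the process
        // alive while the connectors are serving requests.
        server.join();
    }
}
```

Note that join() never returns while the server is healthy, so in a multi-server startup routine it belongs on a dedicated thread (or only on the last server started), not inline before further initialization.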
