Re: Blueprints - RCO - Related question.

Bhuvnesh Chaudhary Mon, 14 Mar 2016 17:57:06 -0700

I have created a placeholder JIRA documenting the feature and if we all
agree let's do it.
https://issues.apache.org/jira/browse/AMBARI-15417


Thanks,
Bhuvnesh Chaudhary
Email: bchaudh...@pivotal.io
Desk: +1-650-846-1696 | Mobile: +1-973-906-6976

On Mon, Mar 14, 2016 at 11:17 AM, Alejandro Fernandez <
afernan...@hortonworks.com> wrote:

> I agree configuring this with a flag is ideal.
>
> Thanks,
> Alejandro
>
> From: Bhuvnesh Chaudhary <bchaudh...@pivotal.io>
> Date: Monday, March 14, 2016 at 11:06 AM
> To: Ambari <dev@ambari.apache.org>
> Cc: Sumit Mohanty <smoha...@hortonworks.com>, Alejandro Fernandez <
> afernan...@hortonworks.com>
> Subject: Re: Blueprints - RCO - Related question.
>
> Thank you very much Robert for the detailed explanation. It helps
> to understand the background.
>
> Regarding HAWQ capitalizing on retry: we can potentially make some
> tweaks to verify whether HAWQ has already been initialized under the
> current behavior, and change the way init is done so that it can make
> use of retry.
> Currently a retry is attempted, but certain prerequisites fail after the
> first failed install attempt, so the retry is not successful either.
> We will have to investigate this.
>
> Regarding alternatives:
> Was the option of a flag in blueprints to enable / disable RCO
> considered? Say use_rco is true by default, and anyone who wants
> to override the behavior can do so in the blueprint.
>
> As Eric pointed out in his email, in some cases retry can also increase
> the amount of time required, because of:
> 1) the number of retries needed before the operation either completes
> successfully or fails completely
> 2) the cleanup steps that a service may require before a retry (as HAWQ
> currently does); services must incorporate that logic themselves.
>
> Also, with RCO the startup sequence is predictable and all
> dependencies are met.
>
> So making the use of RCO configurable in blueprints would probably
> satisfy both camps: those who want to use RCO and those who do not.
> Your thoughts?
>
>
>
>
> Thanks,
> Bhuvnesh Chaudhary
> Email: bchaudh...@pivotal.io
> Desk: +1-650-846-1696 | Mobile: +1-973-906-6976
>
> On Mon, Mar 14, 2016 at 9:18 AM, Eric Yang <eric...@gmail.com> wrote:
>
>> We have a use case where a service depends on Sqoop, Hive Metastore, HBase
>> Client, and Hadoop Client on a worker node.  We found that Hadoop Client is
>> sometimes not yet installed when our service installation has already
>> started.  This looks like a big problem for our use case.  Is there a way
>> to keep RCO by using a flag?  Parallel install with retries is the Chef and
>> Puppet approach to configuring distributed, loosely coupled services that
>> have no tight relationship between nodes.  It doesn't solve the problem
>> of virtual services where a component depends on the availability of other
>> services.  We have been scratching our heads over this since August last
>> year.  It is good to know the problem so we can work out the kinks.
>>
>> Suppose the component is also monster-sized, so that it takes 60 minutes
>> to download and install.  We can bump up the retries for Hadoop client to
>> a very large number, but does this mean that while the monster-sized
>> component is retrying, Hadoop clients may be installed in parallel, so
>> that the second attempt of the monster component could succeed?  In this
>> use case the new optimization doesn't seem to improve installation time,
>> because Ambari frequently needs 120 minutes to complete the second
>> installation attempt.
>>
>> regards,
>> Eric
>>
>> On Mon, Mar 14, 2016 at 6:38 AM, Robert Nettleton <
>> rnettle...@hortonworks.com> wrote:
>>
>> > Hi Bhuvnesh,
>> >
>> > You are correct.  The Blueprints deployment mechanism in Ambari no longer
>> > relies on Role-command ordering to install or start components across the
>> > cluster.
>> >
>> > This change to Blueprints was actually implemented in Ambari 2.1.0, so it
>> > has been around for several releases now.  The new approach was implemented
>> > to improve the performance times of cluster deployments, and provide better
>> > support for dynamic scaling of clusters.
>> >
>> > That being said, the new deployment mechanism does indeed remove the
>> > guarantee of ordering, which can potentially cause some problems for
>> > certain types of clusters.  There were also changes implemented on the
>> > Ambari Agent side to mitigate this ordering problem.  The ambari-agent
>> > will now retry INSTALL and START operations if those operations happen to
>> > fail.  The START operation is probably the most relevant in your case, and
>> > is also the operation that does show the ordering issues you’ve mentioned
>> > in some deployments.
>> >
>> > The idea is that the ambari-agent retries should help to resolve any
>> > issues with services starting in an unexpected order.
>> >
>> > This ambari-agent feature is on by default, but can be configured in a
>> > more fine-grained fashion by setting some properties in “cluster-env” in
>> > your Blueprint or Cluster Creation Template.
>> >
>> > Unfortunately, this is not documented very well, but the three properties
>> > in question are set by default in the BlueprintConfigurationProcessor in
>> > the following method:
>> >
>> > org.apache.ambari.server.controller.internal.BlueprintConfigurationProcessor#setRetryConfiguration
>> >
>> > The properties set in this method allow control over the types of
>> > operations that are retried, the max number of retries attempted, and the
>> > maximum amount of time that the agent should attempt a retry.
>> >
>> > We’ve seen many clusters using this new approach, and have not run into
>> > that many problems with respect to ordering.
>> >
>> > One possible problem we’ve seen is in a small number of components that
>> > launch services as a background command.  In that case, the ambari-agent
>> > cannot detect that a retry is required, and so cannot attempt a restart of
>> > a failed service.  This problem can usually be resolved with
>> > component-specific retries.
>> >
>> > I don’t know much about the HAWQ component, but I would expect that
>> > customizing the retry settings may help this problem.  Do the HAWQ
>> > components implement retry attempts when booting up?
>> >
>> > Hope this helps.
>> >
>> > Thanks,
>> > Bob
>> >
>> >
>> >
>> >
>> > On Mar 11, 2016, at 7:18 PM, Alejandro Fernandez <
>> > afernan...@hortonworks.com> wrote:
>> >
>> > > +others who have more insight into BluePrints
>> > >
>> > > On 3/11/16, 3:24 PM, "Bhuvnesh Chaudhary" <bchaudh...@pivotal.io> wrote:
>> > >
>> > >> Hello Sebastian, Alejandro, Andrew,
>> > >>
>> > >> Referring to the discussion on RB: https://reviews.apache.org/r/43948
>> > >> <https://reviews.apache.org/r/43948/#review120537>, it appears that
>> > >> while deploying clusters using Blueprints, RCO is not honored. Please
>> > >> confirm if this understanding is correct.
>> > >>
>> > >> While running internal test suites for HAWQ, we deploy the clusters
>> > >> using BP, and we need a specific order in which the HAWQ components
>> > >> must be initialized / started.
>> > >>
>> > >> The "HAWQ Standby" component should be initialized after the "HAWQ
>> > >> Master" component, as it has to copy the contents from HAWQ Master.
>> > >> However, since RCO is not honored, we often run into issues because
>> > >> HAWQ Standby is started / initialized before HAWQ Master.
>> > >>
>> > >> Could you please let us know if there is any work already going on to
>> > >> bring the RCO dependency into Blueprints? If not, is there any other
>> > >> alternative which can be used to enforce the dependency locally, or
>> > >> something else which you would suggest?
>> > >>
>> > >> Thanks,
>> > >> Bhuvnesh Chaudhary
>> > >> Email: bchaudh...@pivotal.io
>> > >> Desk: +1-650-846-1696 | Mobile: +1-973-906-6976
>> > >
>> >
>> >
>>
>
>
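
Editor's note: the "cluster-env" retry settings that Robert describes in the thread above could be supplied in a Blueprint roughly as sketched below. The thread does not name the three properties, so the property names here (command_retry_enabled, commands_to_retry, command_retry_max_time_in_sec) are an assumption recalled from the Ambari source and should be checked against setRetryConfiguration in your Ambari version. The helper function name and the 1200-second value are purely illustrative.

```python
import json


def cluster_env_retry_config(enabled=True, commands=("INSTALL", "START"),
                             max_time_sec=600):
    """Build a hypothetical "cluster-env" configuration entry for a Blueprint.

    The property names are assumed, not confirmed by the thread; verify them
    against BlueprintConfigurationProcessor#setRetryConfiguration.
    """
    return {
        "cluster-env": {
            "properties": {
                # Whether the agent retries failed commands at all (assumed name).
                "command_retry_enabled": "true" if enabled else "false",
                # Comma-separated operations to retry (assumed name).
                "commands_to_retry": ",".join(commands),
                # Upper bound on time spent retrying, in seconds (assumed name).
                "command_retry_max_time_in_sec": str(max_time_sec),
            }
        }
    }


# Fragment that would sit alongside host_groups in a Blueprint document.
blueprint_fragment = {
    "configurations": [cluster_env_retry_config(max_time_sec=1200)]
}

print(json.dumps(blueprint_fragment, indent=2))
```

Setting enabled=False in this sketch would only switch the agent retries off; it would not restore RCO ordering, which is what the use_rco flag proposed in AMBARI-15417 is meant to address.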
