We have a use case where a service depends on Sqoop, Hive Metastore, HBase
Client, and Hadoop Client on a worker node.  We found that Hadoop Client is
sometimes not yet installed when our service installation has already
started.  This looks like a big problem for our use case.  Is there a way
to keep RCO by using a flag?  Parallel install with retries is the Chef and
Puppet approach to configuring distributed, loosely coupled services that
have no tight relationship between nodes.  It doesn't solve the problem
of virtual services where a component depends on the availability of other
services.  We have been scratching our heads over this since August of last
year.  It is good to understand the problem so we can work out the kinks.

Suppose a component is also so large that it takes 60 minutes to download
and install.  We can bump up the retries for Hadoop Client to a very large
number, but does this mean that while the monster-size component is
retrying, Hadoop clients may be installed in parallel, and hence the second
attempt of the monster component could succeed?  It seems that in this use
case, the new optimization doesn't improve installation time, because Ambari
frequently needs 120 minutes to complete the second installation attempt.
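
To make the arithmetic concrete, here is a rough Python sketch of the two
cases.  It is not Ambari code; the function names and the failure model are
my own assumptions: an attempt always runs its full duration, and it fails
if the dependency was not yet installed when that attempt began.

```python
from typing import Optional


def retry_install_minutes(attempt_minutes: int,
                          dependency_ready_at: int,
                          max_attempts: int) -> Optional[int]:
    """Minutes until the first successful attempt, or None if all fail.

    Hypothetical model: each attempt costs attempt_minutes, and an attempt
    succeeds only if the dependency was ready when the attempt started.
    """
    elapsed = 0
    for _ in range(max_attempts):
        started_at = elapsed
        elapsed += attempt_minutes  # the attempt runs to completion either way
        if dependency_ready_at <= started_at:
            return elapsed
    return None


def ordered_install_minutes(attempt_minutes: int,
                            dependency_ready_at: int) -> int:
    """Minutes when the install simply waits for the dependency first."""
    return dependency_ready_at + attempt_minutes


# Hadoop Client becomes available 5 minutes in; the big component needs 60.
print(retry_install_minutes(60, 5, 3))   # parallel install with retries
print(ordered_install_minutes(60, 5))    # strict ordering
```

Under these assumptions the retry path costs two full attempts, while the
ordered path costs one attempt plus the wait for the dependency.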

regards,
Eric

On Mon, Mar 14, 2016 at 6:38 AM, Robert Nettleton <
rnettle...@hortonworks.com> wrote:

> Hi Bhuvnesh,
>
> You are correct.  The Blueprints deployment mechanism in Ambari no longer
> relies on Role-command ordering to install or start components across the
> cluster.
>
> This change to Blueprints was actually implemented in Ambari 2.1.0, so it
> has been around for several releases now.  The new approach was implemented
> to improve the performance times of cluster deployments, and provide better
> support for dynamic scaling of clusters.
>
> That being said, the new deployment mechanism does indeed remove the
> guarantee of ordering, which can potentially cause some problems for
> certain types of clusters.  There were also changes implemented on the
> Ambari Agent side to mitigate this problem of ordering.  The ambari-agent
> will now retry INSTALL and START operations if those operations happen to
> fail.  The START operation is probably the most relevant in your case, and
> is also the operation that does show the ordering issues you’ve mentioned
> in some deployments.
>
> The idea is that the ambari-agent retries should help to resolve any
> issues with services starting in an unexpected order.
>
> This ambari-agent feature is on by default, but can be configured in a
> more fine-grained fashion by setting some properties in “cluster-env” in
> your Blueprint or Cluster Creation Template.
>
> Unfortunately, this is not documented very well, but the three properties
> in question are set by default in the BlueprintConfigurationProcessor in
> the following method:
>
>
> org.apache.ambari.server.controller.internal.BlueprintConfigurationProcessor#setRetryConfiguration
>
> The properties set in this method allow control over the types of
> operations that are retried, the max number of retries attempted, and the
> maximum amount of time that the agent should attempt a retry.
>
> We’ve seen many clusters using this new approach, and have not run into
> that many problems with respect to ordering.
>
> One possible problem we’ve seen is in a small number of components that
> launch services as a background command.  In that case, the ambari-agent
> cannot detect that a retry is required, and so cannot attempt a restart of
> a failed service.  This problem can usually be resolved with
> component-specific retries.
>
> I don’t know much about the HAWQ component, but I would expect that
> customizing the retry settings may help this problem.  Do the HAWQ
> components implement retry attempts when booting up?
>
> Hope this helps.
>
> Thanks,
> Bob
>
>
>
>
> On Mar 11, 2016, at 7:18 PM, Alejandro Fernandez <
> afernan...@hortonworks.com> wrote:
>
> > +others who have more insight into BluePrints
> >
> > On 3/11/16, 3:24 PM, "Bhuvnesh Chaudhary" <bchaudh...@pivotal.io> wrote:
> >
> >> Hello Sebastian, Alejandro, Andrew,
> >>
> >> Referring to the discussion on RB: https://reviews.apache.org/r/43948
> >> <https://reviews.apache.org/r/43948/#review120537>, it appears that while
> >> deploying clusters using Blueprints, RCO is not honored. Please confirm if
> >> this understanding is correct.
> >>
> >> While running internal test suites for HAWQ, we deploy the clusters using
> >> BP, and we need a specific order in which the HAWQ components must be
> >> initialized / started.
> >>
> >> The "HAWQ Standby" component should be initialized after the "HAWQ Master"
> >> component, as it has to copy the contents from HAWQ Master. However, since
> >> RCO is not honored, we often come across issues where HAWQ Standby starts /
> >> initializes before HAWQ Master.
> >>
> >> Could you please let us know if there is any work already going on to
> >> bring RCO dependencies to Blueprints? If not, is there any other
> >> alternative that can be used to enforce the dependency locally, or
> >> something else that you would suggest?
> >>
> >> Thanks,
> >> Bhuvnesh Chaudhary
> >> Email: bchaudh...@pivotal.io
> >> Desk: +1-650-846-1696 | Mobile: +1-973-906-6976
> >
>
>
