Thank you very much Robert for the detailed explanation. It helps to understand the background.
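For anyone following along, here is a minimal sketch of how the "cluster-env" retry settings Robert describes might be overridden in a Blueprint. The three property names and their default values are assumptions based on my recollection of the Ambari source, and the helper function is purely illustrative; please verify the names against BlueprintConfigurationProcessor#setRetryConfiguration in your Ambari version before relying on them.

```python
import json


def cluster_env_retry_config(enabled=True, commands=("INSTALL", "START"),
                             max_time_sec=600):
    """Build a Blueprint 'configurations' entry for agent retry settings.

    The property names below are assumptions; check them against
    BlueprintConfigurationProcessor#setRetryConfiguration.
    """
    return {
        "cluster-env": {
            "properties": {
                # Assumed name: turns agent-side command retry on or off.
                "command_retry_enabled": "true" if enabled else "false",
                # Assumed name: comma-separated operations the agent retries.
                "commands_to_retry": ",".join(commands),
                # Assumed name: upper bound on total time spent retrying.
                "command_retry_max_time_in_sec": str(max_time_sec),
            }
        }
    }


# Example: allow up to 20 minutes of retries for INSTALL and START.
blueprint_fragment = {
    "configurations": [cluster_env_retry_config(max_time_sec=1200)]
}
print(json.dumps(blueprint_fragment, indent=2))
```

The printed fragment would be merged into the "configurations" section of the Blueprint or Cluster Creation Template. A longer retry window does not replace ordering, so it only helps if the component's install/start logic is safe to run more than once.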
Regarding HAWQ capitalizing on retry: we can potentially tweak the init logic to first check whether HAWQ has already been initialized, and change how init is done so that it can make use of retry. Currently a retry is attempted, but init has certain prerequisites that fail after the first failed install attempt, so the retry is not successful either. We will have to investigate this.

Regarding alternatives: was the option of a flag in Blueprints to enable / disable RCO considered? Say, use_rco is true by default, and anyone who wants different behavior can override it in the blueprint.

As Eric noted in his email, in some cases retry can also increase the total time required, because:
1) several retries may be needed before the operation either completes successfully or fails completely;
2) some cleanup steps may be required before a retry (this is currently the case for HAWQ), and services must incorporate that logic.

Also, with RCO the startup sequence is predictable and all dependencies are met. So making the use of RCO configurable in Blueprints would probably satisfy both those who want to use RCO and those who do not.

Your thoughts?

Thanks,
Bhuvnesh Chaudhary
Email: bchaudh...@pivotal.io
Desk: +1-650-846-1696 | Mobile: +1-973-906-6976

On Mon, Mar 14, 2016 at 9:18 AM, Eric Yang <eric...@gmail.com> wrote:

> We have a use case where a service depends on Sqoop, Hive Metastore, HBase
> Client, Hadoop Client on a worker node. We found that Hadoop Client is
> sometimes not yet installed when our service installation has already
> started. This looks like a big problem for our use case. Is there a way
> to keep RCO by using a flag? Parallel install with retries is the Chef and
> Puppet approach to configuring distributed, loosely coupled services that
> have no tight relationship between nodes.
> It doesn't solve the problem of virtual services where a component depends
> on the availability of other services. We had been scratching our heads on
> this since August last year. It is good to know the problem so we can work
> out the kinks.
>
> If a component is also monster-sized, such that it takes 60 minutes to
> download and install, we can bump up the retries for Hadoop client to a
> very large number, but does this mean that while the monster-sized
> component is retrying, Hadoop clients may be installed in parallel, and
> hence the second attempt of the monster component could succeed? It seems
> that in this use case the new optimization doesn't improve installation
> time, because Ambari frequently needs 120 minutes to complete the second
> retry of the installation.
>
> regards,
> Eric
>
> On Mon, Mar 14, 2016 at 6:38 AM, Robert Nettleton <rnettle...@hortonworks.com> wrote:
>
> > Hi Bhuvnesh,
> >
> > You are correct. The Blueprints deployment mechanism in Ambari no longer
> > relies on Role-command ordering to install or start components across the
> > cluster.
> >
> > This change to Blueprints was actually implemented in Ambari 2.1.0, so it
> > has been around for several releases now. The new approach was implemented
> > to improve the performance of cluster deployments, and provide better
> > support for dynamic scaling of clusters.
> >
> > That being said, the new deployment mechanism does indeed remove the
> > guarantee of ordering, which can potentially cause some problems for
> > certain types of clusters. There were also changes implemented on the
> > Ambari Agent side to mitigate this problem of ordering. The ambari-agent
> > will now retry INSTALL and START operations if those operations happen to
> > fail. The START operation is probably the most relevant in your case, and
> > is also the operation that does show the ordering issues you’ve mentioned
> > in some deployments.
> >
> > The idea is that the ambari-agent retries should help to resolve any
> > issues with services starting in an unexpected order.
> >
> > This ambari-agent feature is on by default, but can be configured in a
> > more fine-grained fashion by setting some properties in “cluster-env” in
> > your Blueprint or Cluster Creation Template.
> >
> > Unfortunately, this is not documented very well, but the three properties
> > in question are set by default in the BlueprintConfigurationProcessor in
> > the following method:
> >
> > org.apache.ambari.server.controller.internal.BlueprintConfigurationProcessor#setRetryConfiguration
> >
> > The properties set in this method allow control over the types of
> > operations that are retried, the max number of retries attempted, and the
> > maximum amount of time that the agent should attempt a retry.
> >
> > We’ve seen many clusters using this new approach, and have not run into
> > that many problems with respect to ordering.
> >
> > One possible problem we’ve seen is in a small number of components that
> > launch services as a background command. In that case, the ambari-agent
> > cannot detect that a retry is required, and so cannot attempt a restart of
> > a failed service. This problem can usually be resolved with
> > component-specific retries.
> >
> > I don’t know much about the HAWQ component, but I would expect that
> > customizing the retry settings may help this problem. Do the HAWQ
> > components implement retry attempts when booting up?
> >
> > Hope this helps.
> >
> > Thanks,
> > Bob
> >
> > On Mar 11, 2016, at 7:18 PM, Alejandro Fernandez <afernan...@hortonworks.com> wrote:
> >
> > > +others who have more insight into Blueprints
> > >
> > > On 3/11/16, 3:24 PM, "Bhuvnesh Chaudhary" <bchaudh...@pivotal.io> wrote:
> > >
> > >> Hello Sebastian, Alejandro, Andrew,
> > >>
> > >> Referring to the discussion on RB: https://reviews.apache.org/r/43948
> > >> <https://reviews.apache.org/r/43948/#review120537>, it appears that
> > >> while deploying clusters using Blueprints, RCO is not honored. Please
> > >> confirm if this understanding is correct.
> > >>
> > >> While running internal test suites for HAWQ, we deploy the clusters
> > >> using BP, and we need a specific order in which the HAWQ components
> > >> must be initialized / started.
> > >>
> > >> The "HAWQ Standby" component should be initialized after the "HAWQ
> > >> Master" component, as it has to copy the contents from HAWQ Master.
> > >> However, since RCO is not honored, we often come across issues where
> > >> HAWQ Standby starts / initializes before HAWQ Master.
> > >>
> > >> Could you please let us know if there is any work already going on for
> > >> bringing in the RCO dependency for Blueprints? If not, is there any
> > >> other alternative which can be used to enforce the dependency locally,
> > >> or something else which you suggest?
> > >>
> > >> Thanks,
> > >> Bhuvnesh Chaudhary
> > >> Email: bchaudh...@pivotal.io
> > >> Desk: +1-650-846-1696 | Mobile: +1-973-906-6976