I have created a placeholder JIRA documenting the feature. If we all agree, let's do it.
https://issues.apache.org/jira/browse/AMBARI-15417
Thanks,
Bhuvnesh Chaudhary
Email: bchaudh...@pivotal.io
Desk: +1-650-846-1696 | Mobile: +1-973-906-6976

On Mon, Mar 14, 2016 at 11:17 AM, Alejandro Fernandez <afernan...@hortonworks.com> wrote:

> I agree configuring this with a flag is ideal.
>
> Thanks,
> Alejandro
>
> From: Bhuvnesh Chaudhary <bchaudh...@pivotal.io>
> Date: Monday, March 14, 2016 at 11:06 AM
> To: Ambari <dev@ambari.apache.org>
> Cc: Sumit Mohanty <smoha...@hortonworks.com>, Alejandro Fernandez <afernan...@hortonworks.com>
> Subject: Re: Blueprints - RCO - Related question.
>
> Thank you very much, Robert, for the detailed explanation. It helps
> to understand the background.
>
> Regarding HAWQ capitalizing on retry: we can potentially make some
> tweaks to verify whether HAWQ has been initialized under the current
> behavior, and change how init is done so that it can make use of retry.
> Currently it does go for a retry, but it has certain prerequisites that
> fail after the first failed install attempt, so the retry is also
> unsuccessful. We will have to investigate this.
>
> Regarding alternatives:
> Was the option of a flag in Blueprints to enable or disable RCO
> considered? Say use_rco is true by default, and anyone who wants to
> override that behavior can do so in the blueprint.
>
> As Eric noted in his email, in some cases retries can also increase
> the time required, due to:
> 1) the number of retries before the operation either succeeds or
>    fails completely
> 2) the cleanup steps that may be required for a service before a
>    retry (currently the case for HAWQ); services must incorporate
>    that logic.
>
> Also, with RCO the startup sequence is predictable and all the
> dependencies will be met.
>
> So making the use of RCO configurable in Blueprints probably satisfies
> both camps: those who want to use RCO and those who do not.
> Your thoughts?
>
> Thanks,
> Bhuvnesh Chaudhary
> Email: bchaudh...@pivotal.io
> Desk: +1-650-846-1696 | Mobile: +1-973-906-6976
>
> On Mon, Mar 14, 2016 at 9:18 AM, Eric Yang <eric...@gmail.com> wrote:
>
>> We have a use case where a service depends on Sqoop, Hive Metastore,
>> HBase Client, and Hadoop Client on a worker node. We found that Hadoop
>> Client is sometimes not yet installed when our service installation has
>> already started. This looks like a big problem for our use case. Is
>> there a way to keep RCO by using a flag? Parallel install with retries
>> is the Chef and Puppet approach to configuring distributed, loosely
>> coupled services that have no tight relationship between nodes. It does
>> not solve the problem of virtual services where a component depends on
>> the availability of other services. We have been scratching our heads
>> over this since August of last year. It is good to know the problem so
>> we can work out the kinks.
>>
>> Suppose a component is also monster-sized, taking 60 minutes to
>> download and install. We can bump up the retries for Hadoop Client to a
>> very large number, but does this mean that while the monster-sized
>> component is retrying, the Hadoop clients may be installed in parallel,
>> so that the second attempt of the monster component could succeed? In
>> this use case the new optimization does not seem to improve
>> installation time, because Ambari frequently needs 120 minutes to
>> complete the second installation attempt.
>>
>> regards,
>> Eric
>>
>> On Mon, Mar 14, 2016 at 6:38 AM, Robert Nettleton <rnettle...@hortonworks.com> wrote:
>>
>> > Hi Bhuvnesh,
>> >
>> > You are correct. The Blueprints deployment mechanism in Ambari no
>> > longer relies on Role-command ordering to install or start components
>> > across the cluster.
>> >
>> > This change to Blueprints was actually implemented in Ambari 2.1.0, so
>> > it has been around for several releases now.
>> > The new approach was implemented to improve the performance of cluster
>> > deployments and to provide better support for dynamic scaling of
>> > clusters.
>> >
>> > That being said, the new deployment mechanism does indeed remove the
>> > guarantee of ordering, which can potentially cause problems for
>> > certain types of clusters. There were also changes implemented on the
>> > Ambari Agent side to mitigate this ordering problem. The ambari-agent
>> > will now retry INSTALL and START operations if those operations happen
>> > to fail. The START operation is probably the most relevant in your
>> > case, and is also the operation that does show the ordering issues
>> > you’ve mentioned in some deployments.
>> >
>> > The idea is that the ambari-agent retries should help to resolve any
>> > issues with services starting in an unexpected order.
>> >
>> > This ambari-agent feature is on by default, but can be configured in a
>> > more fine-grained fashion by setting some properties in “cluster-env”
>> > in your Blueprint or Cluster Creation Template.
>> >
>> > Unfortunately, this is not documented very well, but the three
>> > properties in question are set by default in the
>> > BlueprintConfigurationProcessor in the following method:
>> >
>> > org.apache.ambari.server.controller.internal.BlueprintConfigurationProcessor#setRetryConfiguration
>> >
>> > The properties set in this method allow control over the types of
>> > operations that are retried, the max number of retries attempted, and
>> > the maximum amount of time that the agent should attempt a retry.
>> >
>> > We’ve seen many clusters using this new approach, and have not run
>> > into that many problems with respect to ordering.
>> >
>> > One possible problem we’ve seen is in a small number of components that
>> > launch services as a background command.
>> > In that case, the ambari-agent cannot detect that a retry is
>> > required, and so cannot attempt a restart of a failed service. This
>> > problem can usually be resolved with component-specific retries.
>> >
>> > I don’t know much about the HAWQ component, but I would expect that
>> > customizing the retry settings may help with this problem. Do the HAWQ
>> > components implement retry attempts when booting up?
>> >
>> > Hope this helps.
>> >
>> > Thanks,
>> > Bob
>> >
>> > On Mar 11, 2016, at 7:18 PM, Alejandro Fernandez <afernan...@hortonworks.com> wrote:
>> >
>> > > +others who have more insight into Blueprints
>> > >
>> > > On 3/11/16, 3:24 PM, "Bhuvnesh Chaudhary" <bchaudh...@pivotal.io> wrote:
>> > >
>> > >> Hello Sebastian, Alejandro, Andrew,
>> > >>
>> > >> Referring to the discussion on RB:
>> > >> https://reviews.apache.org/r/43948
>> > >> <https://reviews.apache.org/r/43948/#review120537>, it appears that
>> > >> while deploying clusters using Blueprints, RCO is not honored. Please
>> > >> confirm if this understanding is correct.
>> > >>
>> > >> While running internal test suites for HAWQ, we deploy the clusters
>> > >> using Blueprints, and we need a specific order in which the HAWQ
>> > >> components must be initialized / started.
>> > >>
>> > >> The "HAWQ Standby" component should be initialized after the "HAWQ
>> > >> Master" component, as it has to copy the contents from HAWQ Master.
>> > >> However, since RCO is not honored, we often run into issues where
>> > >> HAWQ Standby starts / initializes before HAWQ Master.
>> > >>
>> > >> Could you please let us know if there is any work already under way
>> > >> to bring RCO dependencies to Blueprints? If not, is there any other
>> > >> alternative that can be used to enforce the dependency locally, or
>> > >> something else you would suggest?
>> > >>
>> > >> Thanks,
>> > >> Bhuvnesh Chaudhary
>> > >> Email: bchaudh...@pivotal.io
>> > >> Desk: +1-650-846-1696 | Mobile: +1-973-906-6976
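
For readers following the retry-versus-ordering argument in this thread, here is a minimal toy model of it. It is only a sketch under stated assumptions: `start_with_retries` and its parameters are hypothetical names made up for illustration, not Ambari or ambari-agent APIs, and a "round" stands in for one attempt by every pending component in parallel. A component's start is assumed to fail whenever one of its dependencies was not yet running when the round began.

```python
from typing import Dict, List, Set, Tuple


def start_with_retries(
    components: List[str],
    depends_on: Dict[str, Set[str]],
    max_rounds: int,
) -> Tuple[List[str], List[str], int]:
    """Toy model of unordered parallel starts with retries (not an Ambari API).

    Every pending component attempts to start in each round. An attempt
    succeeds only if all of its dependencies were already running when the
    round began; otherwise it stays pending and is retried next round.
    Returns (started components, still-failed components, rounds used).
    """
    started: List[str] = []
    pending = list(components)
    rounds = 0
    while pending and rounds < max_rounds:
        rounds += 1
        up = set(started)  # snapshot of what was running when this round began
        succeeded = [c for c in pending if depends_on.get(c, set()) <= up]
        started.extend(succeeded)
        pending = [c for c in pending if c not in succeeded]
    return started, pending, rounds


# The dependency described in the thread: Standby needs Master first.
deps = {"HAWQ Standby": {"HAWQ Master"}}

# With retries allowed, the second round lets HAWQ Standby succeed.
print(start_with_retries(["HAWQ Standby", "HAWQ Master"], deps, max_rounds=3))

# With a single attempt and no ordering, HAWQ Standby is left failed.
print(start_with_retries(["HAWQ Standby", "HAWQ Master"], deps, max_rounds=1))
```

The model deliberately leaves out the two costs raised in the thread: a failed attempt may need cleanup before it can be retried, and each extra round can add the full install or start time of a large component.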