Let me see if I can give a better summary of what we are trying to do. Our use
case is such that we have a set of MySQL instances and we would like to control
the number of connections we establish to them for Sqoop extractions. Within
each instance there are several tables we target for that daily extraction. Our
ETL process involves the mentioned Sqoop table extractions into a Hive
warehouse, followed by a transformation from the Hive staging area into a set
of date-partitioned Hive tables (with a few column name transformations as
well). We would like to establish one Oozie workflow per MySQL instance and use
the DAG to queue the Sqoop table extractions so that no more than one Sqoop
action is running at any time. The issue I am running into is that I need to
find a way to have the Hive transformations run asynchronously from the serial
Sqoop queue. In other words, I would like to avoid 1) making the next Sqoop
table extraction wait on the previous Hive transformation, and 2) moving all of
the Hive transformations to the bottom of the DAG (I would like to run each one
as soon as its Sqoop table has been extracted).
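[Editor's note: for concreteness, here is a rough sketch of the nested fork/join shape that would give this behavior for three tables. All names are hypothetical and the `<sqoop>`/`<hive>` action bodies are elided; each fork starts the Hive transform for the table just loaded in parallel with the next Sqoop extraction, so the Sqoop chain stays serial while the Hive work overlaps it.]

```xml
<!-- Hypothetical per-instance workflow; Oozie still requires a join
     per fork, so the joins nest at the tail of the DAG rather than
     blocking the next Sqoop step. -->
<workflow-app name="mysql-instance-etl" xmlns="uri:oozie:workflow:0.2">
  <start to="sqoop-table-a"/>

  <action name="sqoop-table-a">
    <!-- <sqoop> ... </sqoop> -->
    <ok to="fork-a"/><error to="fail"/>
  </action>
  <fork name="fork-a">
    <path start="hive-table-a"/>
    <path start="sqoop-table-b"/>
  </fork>

  <action name="hive-table-a">
    <!-- <hive> ... </hive> -->
    <ok to="join-a"/><error to="fail"/>
  </action>

  <action name="sqoop-table-b">
    <ok to="fork-b"/><error to="fail"/>
  </action>
  <fork name="fork-b">
    <path start="hive-table-b"/>
    <path start="sqoop-table-c"/>
  </fork>

  <action name="hive-table-b"><ok to="join-b"/><error to="fail"/></action>
  <action name="sqoop-table-c"><ok to="hive-table-c"/><error to="fail"/></action>
  <action name="hive-table-c"><ok to="join-b"/><error to="fail"/></action>

  <!-- the inner join feeds the outer join, which gates workflow end -->
  <join name="join-b" to="join-a"/>
  <join name="join-a" to="end"/>

  <kill name="fail"><message>ETL step failed</message></kill>
  <end name="end"/>
</workflow-app>
```

One caveat: Oozie's fork/join validation is strict about workflow shapes, so a deeply nested structure like this may need to be checked against the version in use; shapes the validator rejects can sometimes be expressed through sub-workflow actions instead.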
I have tinkered with the thought of staging a coordinator job for every Hive
transformation, with a data availability clause that allows it to run, but this
gets more difficult when you are trying to watch data folders that have been
imported directly into Hive. The other route I have looked into is a series of
nested forks, in which a completed Sqoop action forks into the Hive
transformation and the next Sqoop action in parallel. Let me know if there are
any documented best practices around these kinds of flows or if I need to
balance this across more than just Oozie.

--
Matt Goeke

On Tue, Jul 17, 2012 at 3:07 PM, Virag Kothari <[email protected]> wrote:

> Matt,
> It's always better to have a join for the corresponding fork. I think it
> would be better if you clarify in the question more about your workflow
> design and the requirement for asynchronous spikes.
>
> Thanks,
> Virag
>
> On 7/17/12 2:30 PM, "Matt Goeke" <[email protected]> wrote:
>
>> Virag,
>>
>> Thanks for the response. I have read the workflow spec, and while I
>> realize there is the ability to fork within a workflow, my issue is that
>> all forks must be paired with joins. What I was looking for was some way
>> to fork but not require all of the forked nodes to rejoin the primary
>> workflow (hence some of the nodes becoming asynchronous spikes). I feel
>> like this capability might already exist and this might just be an issue
>> of workflow/subworkflow composition.
>>
>> --
>> Matt Goeke
>>
>> On Tue, Jul 17, 2012 at 2:00 PM, Virag Kothari <[email protected]> wrote:
>>
>>> Hi Matt,
>>> I think you can fork the Hive actions using the fork/join control nodes
>>> in Oozie:
>>>
>>> http://incubator.apache.org/oozie/docs/3.2.0-incubating/docs/WorkflowFunctionalSpec.html#a3.1.5_Fork_and_Join_Control_Nodes
>>>
>>> I have no idea why the attachment doesn't work.
>>>
>>> Thanks,
>>> Virag
>>>
>>> On 7/17/12 12:13 PM, "Matt Goeke" <[email protected]> wrote:
>>>
>>>> Apparently when I put an imgur link in the reply the spam score gets
>>>> high enough that the delivery is denied... is there any way to link an
>>>> image? Also, if not, is there anything I can clarify in the question
>>>> that would make it more straightforward?
>>>>
>>>> --
>>>> Matt Goeke
>>>>
>>>> On Tue, Jul 17, 2012 at 11:22 AM, Mona Chitnis <[email protected]> wrote:
>>>>
>>>>> The attachment hasn't come through. This had happened with an earlier
>>>>> email with the Oozie Meetup slides attachments too. Any solutions?
>>>>>
>>>>> --
>>>>> Mona Chitnis
>>>>>
>>>>> From: Matt Goeke <[email protected]>
>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>> To: "[email protected]" <[email protected]>
>>>>> Subject: Oozie: asynchronous forking
>>>>>
>>>>> All,
>>>>>
>>>>> Does anyone know if it is possible to do asynchronous forking in
>>>>> Oozie? Currently we are running a set of ETL extractions that are
>>>>> pairs of actions (a Sqoop action, then a Hive transformation), but we
>>>>> would like the Sqoop actions to be serial and the Hive actions to be
>>>>> called asynchronously when the paired Sqoop job finishes. The reason
>>>>> the Sqoop actions are serial is that we would like to limit the number
>>>>> of concurrent mappers hitting the data source; we could do this
>>>>> through the fair scheduler, but that would require a pool per data
>>>>> source. Attached is a picture of the suggested ETL flow.
>>>>>
>>>>> If anyone has any suggestions on best practices around this, I would
>>>>> love to hear them.
>>>>>
>>>>> Thanks,
>>>>> Matt
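[Editor's note: on the coordinator idea floated at the top of the thread, a hedged sketch of a per-table coordinator with a data-availability trigger might look like the following. All names, paths, and dates are hypothetical; it assumes the Sqoop import writes to a plain HDFS staging directory and drops a done-flag there, since polling Hive-managed warehouse directories directly is exactly the fragile part noted above. It also assumes a coordinator schema version that supports `<done-flag>`.]

```xml
<coordinator-app name="hive-transform-table-a" frequency="${coord:days(1)}"
                 start="2012-07-18T00:00Z" end="2013-07-18T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
  <datasets>
    <dataset name="table-a-staged" frequency="${coord:days(1)}"
             initial-instance="2012-07-18T00:00Z" timezone="UTC">
      <!-- Hypothetical staging path written by the Sqoop import -->
      <uri-template>hdfs:///etl/staging/table_a/${YEAR}${MONTH}${DAY}</uri-template>
      <!-- The coordinator action fires only once this flag appears -->
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="staged" dataset="table-a-staged">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <!-- Hypothetical path to a workflow holding only the Hive transform -->
      <app-path>hdfs:///etl/apps/hive-transform-table-a</app-path>
    </workflow>
  </action>
</coordinator-app>
```

The trade-off versus nested forks is one coordinator per table, but each Hive transform then starts as soon as its own data lands, fully decoupled from the Sqoop workflow.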
