Let me see if I can give a better summary of what we are trying to do. Our use
case is such that we have a set of MySQL instances and we would like to control
the number of connections we establish to them for Sqoop extractions. Within
each instance there are several tables we target for that daily extraction. Our
ETL process involves the mentioned Sqoop table extractions into a Hive
warehouse, followed by a transformation from the Hive staging area into a set
of date-partitioned Hive tables (with a few column name transformations as
well). We would like to establish one Oozie workflow per MySQL instance and use
the DAG to queue the Sqoop table extractions so that no more than one Sqoop
action is running at any time. The issue I am running into is that I need to
find a way to have the Hive transformations run asynchronously from the serial
Sqoop queue. In other words, I would like to avoid 1) making the next Sqoop
table extraction wait on the previous Hive transformation, and 2) moving all of
the Hive transformations to the bottom of the DAG (I would like to run each one
as soon as its Sqoop table has been extracted).
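[Editor's note: for concreteness, here is a rough sketch of the nested fork/join shape that would give this behavior for three tables. All names are hypothetical and the `<sqoop>`/`<hive>` action bodies are elided; each fork starts the Hive transform for the table just loaded in parallel with the next Sqoop extraction, so the Sqoop chain stays serial while the Hive work overlaps it.]

```xml
<!-- Hypothetical per-instance workflow; Oozie still requires a join
     per fork, so the joins nest at the tail of the DAG rather than
     blocking the next Sqoop step. -->
<workflow-app name="mysql-instance-etl" xmlns="uri:oozie:workflow:0.2">
  <start to="sqoop-table-a"/>

  <action name="sqoop-table-a">
    <!-- <sqoop> ... </sqoop> -->
    <ok to="fork-a"/><error to="fail"/>
  </action>
  <fork name="fork-a">
    <path start="hive-table-a"/>
    <path start="sqoop-table-b"/>
  </fork>

  <action name="hive-table-a">
    <!-- <hive> ... </hive> -->
    <ok to="join-a"/><error to="fail"/>
  </action>

  <action name="sqoop-table-b">
    <ok to="fork-b"/><error to="fail"/>
  </action>
  <fork name="fork-b">
    <path start="hive-table-b"/>
    <path start="sqoop-table-c"/>
  </fork>

  <action name="hive-table-b"><ok to="join-b"/><error to="fail"/></action>
  <action name="sqoop-table-c"><ok to="hive-table-c"/><error to="fail"/></action>
  <action name="hive-table-c"><ok to="join-b"/><error to="fail"/></action>

  <!-- the inner join feeds the outer join, which gates workflow end -->
  <join name="join-b" to="join-a"/>
  <join name="join-a" to="end"/>

  <kill name="fail"><message>ETL step failed</message></kill>
  <end name="end"/>
</workflow-app>
```

One caveat: Oozie's fork/join validation is strict about workflow shapes, so a deeply nested structure like this may need to be checked against the version in use; shapes the validator rejects can sometimes be expressed through sub-workflow actions instead.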
I have tinkered with the thought of staging a coordinator job for every Hive
transformation, with a data availability clause that allows it to run, but this
gets more difficult when you are trying to watch data folders that have been
imported directly into Hive. The other route I have looked into is a series of
nested forks, in which a completed Sqoop action forks into the Hive
transformation and the next Sqoop action in parallel. Let me know if there are
any documented best practices around these kinds of flows or if I need to
balance this across more than just Oozie.

--
Matt Goeke

On Tue, Jul 17, 2012 at 3:07 PM, Virag Kothari <[email protected]> wrote:

> Matt,
> It's always better to have a join for the corresponding fork. I think it
> would be better if you clarify in the question more about your workflow
> design and the requirement for asynchronous spikes.
>
> Thanks,
> Virag
>
> On 7/17/12 2:30 PM, "Matt Goeke" <[email protected]> wrote:
>
>> Virag,
>>
>> Thanks for the response. I have read the workflow spec, and while I
>> realize there is the ability to fork within a workflow, my issue is that
>> all forks must be paired with joins. What I was looking for was some way
>> to fork but not require all of the forked nodes to rejoin the primary
>> workflow (hence some of the nodes becoming asynchronous spikes). I feel
>> like this capability might already exist and this might just be an issue
>> of workflow/subworkflow composition.
>>
>> --
>> Matt Goeke
>>
>> On Tue, Jul 17, 2012 at 2:00 PM, Virag Kothari <[email protected]> wrote:
>>
>>> Hi Matt,
>>> I think you can fork the Hive actions using the fork/join control nodes
>>> in Oozie:
>>>
>>> http://incubator.apache.org/oozie/docs/3.2.0-incubating/docs/WorkflowFunctionalSpec.html#a3.1.5_Fork_and_Join_Control_Nodes
>>>
>>> I have no idea why the attachment doesn't work.
>>>
>>> Thanks,
>>> Virag
>>>
>>> On 7/17/12 12:13 PM, "Matt Goeke" <[email protected]> wrote:
>>>
>>>> Apparently when I put an imgur link in the reply the spam score gets
>>>> high enough that the delivery is denied... is there any way to link an
>>>> image? Also, if not, is there anything I can clarify in the question
>>>> that would make it more straightforward?
>>>>
>>>> --
>>>> Matt Goeke
>>>>
>>>> On Tue, Jul 17, 2012 at 11:22 AM, Mona Chitnis <[email protected]> wrote:
>>>>
>>>>> The attachment hasn't come through. This had happened with an earlier
>>>>> email with the Oozie Meetup slides attachments too. Any solutions?
>>>>>
>>>>> --
>>>>> Mona Chitnis
>>>>>
>>>>> From: Matt Goeke <[email protected]>
>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>> To: "[email protected]" <[email protected]>
>>>>> Subject: Oozie: asynchronous forking
>>>>>
>>>>> All,
>>>>>
>>>>> Does anyone know if it is possible to do asynchronous forking in
>>>>> Oozie? Currently we are running a set of ETL extractions that are
>>>>> pairs of actions (a Sqoop action, then a Hive transformation), but we
>>>>> would like the Sqoop actions to be serial and the Hive actions to be
>>>>> called asynchronously when the paired Sqoop job finishes. The reason
>>>>> the Sqoop actions are serial is that we would like to limit the number
>>>>> of concurrent mappers hitting the data source; we could do this
>>>>> through the fair scheduler, but that would require a pool per data
>>>>> source. Attached is a picture of the suggested ETL flow.
>>>>>
>>>>> If anyone has any suggestions on best practices around this, I would
>>>>> love to hear them.
>>>>>
>>>>> Thanks,
>>>>> Matt
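[Editor's note: on the coordinator idea floated at the top of the thread, a hedged sketch of a per-table coordinator with a data-availability trigger might look like the following. All names, paths, and dates are hypothetical; it assumes the Sqoop import writes to a plain HDFS staging directory and drops a done-flag there, since polling Hive-managed warehouse directories directly is exactly the fragile part noted above. It also assumes a coordinator schema version that supports `<done-flag>`.]

```xml
<coordinator-app name="hive-transform-table-a" frequency="${coord:days(1)}"
                 start="2012-07-18T00:00Z" end="2013-07-18T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
  <datasets>
    <dataset name="table-a-staged" frequency="${coord:days(1)}"
             initial-instance="2012-07-18T00:00Z" timezone="UTC">
      <!-- Hypothetical staging path written by the Sqoop import -->
      <uri-template>hdfs:///etl/staging/table_a/${YEAR}${MONTH}${DAY}</uri-template>
      <!-- The coordinator action fires only once this flag appears -->
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="staged" dataset="table-a-staged">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <!-- Hypothetical path to a workflow holding only the Hive transform -->
      <app-path>hdfs:///etl/apps/hive-transform-table-a</app-path>
    </workflow>
  </action>
</coordinator-app>
```

The trade-off versus nested forks is one coordinator per table, but each Hive transform then starts as soon as its own data lands, fully decoupled from the Sqoop workflow.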
