Re: Quick question Hadoop DR solutions

Seetharam Venkatesh Thu, 19 Jun 2014 22:28:24 -0700

+ Dev ML

You can model this as a process in Falcon with no inputs and outputs and
run the Azkaban jobs as a Java action. This will not add any data dependency
to your pipeline.


You could also write a simple workflow engine implementation for Azkaban
and enhance Falcon to drive Azkaban flows.


On Sun, Jun 15, 2014 at 8:27 PM, Venkat R <[email protected]> wrote:

> Hey Venkatesh,
>
> Good to know. Have a great time in Bangalore.
>
> I want to leverage Falcon, but one hurdle that we are facing is the
> migration. Users have a large number of Azkaban flows that are free-form --
> meaning no need to declare the input and output feeds etc. An Azkaban
> package contains some properties files and the pig script or MR jar files
> and it just runs them. No data dependency or data availability trigger etc.
> It's up to your java code to check if the input is ready and move to the
> next step in the Azkaban flow.
>
> Now, converting all of them into Falcon input/output/process definitions
> is daunting, and I'm hoping it can be mitigated by some tools -- though I
> have not yet worked out how to modify a Pig script or MR job so that its
> LOAD statement uses the Falcon INPUT/OUTPUT variables. Let me know if you
> have any thoughts on this.
>
> Enjoy your vacation and hope to talk to you soon.
>
> Thanks
> Venkat
>
>   On Saturday, June 14, 2014 10:47 PM, Seetharam Venkatesh <
> [email protected]> wrote:
>
>
> Hi Venkat,
>
> Vacation in India for a few weeks.
>
>
>
>
> On Wed, Jun 11, 2014 at 3:54 PM, Venkat R <[email protected]> wrote:
>
> Hey Venkatesh,
>
> There is an idea for a Hadoop DR implementation based on parsing the HDFS
> audit log to see which folders are accessed by a set of users and
> periodically replicating them to the stand-by cluster.
>
> This is exactly what Oozie does, except that it polls the directory. The
> audit log is not a public API, and depending on a log is odd since formats
> could change.
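
A minimal sketch of the audit-log idea described above, in Python. It assumes
tab-separated key=value audit lines following an "FSNamesystem.audit:" prefix;
that layout is not a public API and may differ between Hadoop versions. The
sample lines, user names, paths, and the helper names parse_audit_line and
folders_accessed are all hypothetical.

```python
import posixpath
from collections import defaultdict

AUDIT_MARKER = "FSNamesystem.audit: "  # assumed logger prefix, may change


def _sample(user, cmd, src, allowed="true"):
    """Build one made-up audit line in the assumed layout."""
    return (
        "2014-06-11 15:54:01,120 INFO " + AUDIT_MARKER
        + f"allowed={allowed}\tugi={user} (auth:SIMPLE)\tip=/10.0.0.5"
        + f"\tcmd={cmd}\tsrc={src}\tdst=null\tperm=null"
    )


SAMPLE_LOG = [
    _sample("alice", "open", "/data/clicks/2014-06-10/part-00000"),
    _sample("alice", "open", "/data/clicks/2014-06-10/part-00001"),
    _sample("bob", "create", "/data/reports/daily/out.csv"),
    _sample("carol", "open", "/tmp/scratch/x"),                 # not a watched user
    _sample("alice", "open", "/secure/keys/k1", allowed="false"),  # denied access
    "2014-06-11 15:54:09,000 INFO namenode.NameNode: unrelated message",
]


def parse_audit_line(line):
    """Return the key=value fields of one audit line, or None if it is not one."""
    _, found, payload = line.partition(AUDIT_MARKER)
    if not found:
        return None
    fields = {}
    for pair in payload.strip().split("\t"):
        key, sep, value = pair.partition("=")
        if sep:
            fields[key] = value
    return fields


def folders_accessed(lines, users, commands=("open", "create")):
    """Map each watched user to the set of parent folders they touched."""
    accessed = defaultdict(set)
    for line in lines:
        fields = parse_audit_line(line)
        if not fields or fields.get("allowed") != "true":
            continue
        user = fields.get("ugi", "").split(" ")[0]  # drop the "(auth:...)" suffix
        src = fields.get("src")
        if src and user in users and fields.get("cmd") in commands:
            accessed[user].add(posixpath.dirname(src))
    return dict(accessed)


# Folders that a replication job would then copy to the stand-by cluster.
watched = folders_accessed(SAMPLE_LOG, users={"alice", "bob"})
```

The resulting folder sets would be handed to whatever copies data to the
stand-by cluster; that step, and handling of Kerberos-style principals in the
ugi field, are left out of this sketch.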
>
>
>
> Since this can take care of the input datasets needed to launch the jobs
> on the stand-by clusters, the flows can be restarted on the stand-by
> clusters. This sort of looks attractive because users don't need to define
> input and output datasets etc.
>
> This is already done by Falcon without you writing custom code. This is an
> active-passive config.
>
>
> I'm sure you would have thought about this implementation -- any idea
> where this will break? Is DR implemented at Yahoo like this?
>
>
> What are you gaining by going out and doing this by hand when this is
> already solved by existing tools?
>
>
> Appreciate your insights
>  Venkat
>
>
>
>
> --
> Regards,
>
> Venkatesh
>
> “Perfection (in design) is achieved not when there is nothing more to add,
> but rather when there is nothing more to take away.”
> - Antoine de Saint-Exupéry
>
>
>


-- 
Regards,
Venkatesh

“Perfection (in design) is achieved not when there is nothing more to add,
but rather when there is nothing more to take away.”
- Antoine de Saint-Exupéry
