The first recommendation (gluing all my command line apps together) is what I
am currently using.
The other options you mentioned are out of my league right now, since I am
quite new to the Java world, not to mention JRuby, Groovy, Jython, etc.
But once I get comfortable with the environment and start looking for more
options, I'll come back to your message. Thanks for the advanced info :-)

2010/12/15 Chris K Wensel <ch...@wensel.net>

>
> I see it this way.
>
> You can glue a bunch of discrete command line apps together, apps that may
> or may not have dependencies on one another, using a new syntax. Which is
> darn nice if you already have a bunch of ready-to-run command line apps
> sitting around that need to be strung together and that can't be used as
> libraries or instantiated through their APIs.
>
> Or, you can string all your work together through the APIs with a
> Turing-complete language and run it all from a single command line interface
> (and hand that to cron, or some other tool).
>
> In this case you can use Java, or easier languages like JRuby, Groovy,
> Jython, Clojure, etc., which were designed for this purpose. (They don't run
> on the cluster; they only run on the Hadoop client side.)
>
> Think Ant vs. Gradle (or any other build tool that uses a scripting
> language rather than a configuration file) if you want a concrete example.
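>
> Or, in code, here is a rough sketch of the 'one driver, many steps' idea
> (the paths, job names, and the choice of two identity MR jobs are all
> invented for illustration):
>
>   // Illustrative only: chain two MapReduce jobs from a single main(),
>   // feeding the first job's output into the second. With no mapper or
>   // reducer set, each job runs as an identity pass-through.
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.mapreduce.Job;
>   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>   import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>
>   public class ChainDriver {
>     public static void main(String[] args) throws Exception {
>       Configuration conf = new Configuration();
>
>       Job first = new Job(conf, "step-1");
>       FileInputFormat.addInputPath(first, new Path(args[0]));
>       FileOutputFormat.setOutputPath(first, new Path("/tmp/step-1-out"));
>       if (!first.waitForCompletion(true)) System.exit(1);
>
>       Job second = new Job(conf, "step-2");
>       FileInputFormat.addInputPath(second, new Path("/tmp/step-1-out"));
>       FileOutputFormat.setOutputPath(second, new Path(args[1]));
>       System.exit(second.waitForCompletion(true) ? 0 : 1);
>     }
>   }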
>
> Cascading itself is a query API (and query planner), but it also gives you
> the ability to run discrete 'processes' in dependency order: either
> Cascading (Hadoop) Flows or Riffle-annotated process objects. They can all
> be intermingled and managed by the same dependency scheduler. Cascading has
> one, and Riffle has one.
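>
> For example (just a sketch; assume the two Flow instances were built
> elsewhere with a FlowConnector), wiring flows into a single Cascade looks
> roughly like this:
>
>   // Rough sketch: hand a set of flows to a Cascade and let it work out
>   // the run order from each flow's source and sink taps.
>   import cascading.cascade.Cascade;
>   import cascading.cascade.CascadeConnector;
>   import cascading.flow.Flow;
>
>   public class RunInOrder {
>     public static void run(Flow first, Flow second) {
>       Cascade cascade = new CascadeConnector().connect(first, second);
>       cascade.complete(); // blocks until every flow has finished
>     }
>   }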
>
> So you can run Flow -> Mahout -> Pig -> Mahout -> Flow -> shell ->
> what-the-heck-ever from the same application.
>
> Cascading also has the ability to run only 'stale' processes, much like a
> 'make' file. When re-running a job where only one of many input files has
> changed, this is a big win.
>
> I personally like parameterizing my applications via the command line and
> letting my CLI options drive the workflows. For example, my testing,
> integration, and production environments are quite different, so it's very
> easy to drive specific runs of the jobs by changing a CLI arg. (args4j makes
> this darn simple.)
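>
> A minimal sketch of what that looks like with args4j (the option names
> and environment values are just made up for illustration):
>
>   // Illustrative only: bind a couple of CLI options to fields, then let
>   // the chosen environment drive which workflow configuration gets built.
>   import org.kohsuke.args4j.CmdLineParser;
>   import org.kohsuke.args4j.Option;
>
>   public class JobOptions {
>     @Option(name = "-env", usage = "test, integration, or production")
>     public String env = "test";
>
>     @Option(name = "-input", usage = "input path or JDBC URL")
>     public String input;
>
>     public static void main(String[] args) throws Exception {
>       JobOptions opts = new JobOptions();
>       new CmdLineParser(opts).parseArgument(args);
>       System.out.println("running the " + opts.env + " workflow on " + opts.input);
>       // ...hand opts to whatever builds and runs the flows...
>     }
>   }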
>
> If I am chaining multiple CLI apps into a bigger production app, I suspect
> parameterizing that will be error prone, especially if the input/output data
> points (JDBC vs. file) differ between contexts.
>
> You can find Riffle here: https://github.com/cwensel/riffle (it's Apache
> licensed; contributions welcome).
>
> ckw
>
> On Dec 14, 2010, at 1:30 AM, Alejandro Abdelnur wrote:
>
> > Ed,
> >
> > Actually Oozie is quite different from Cascading.
> >
> > * Cascading allows you to write 'queries' using a Java API, and they get
> > translated into MR jobs.
> > * Oozie allows you to compose sequences of MR/Pig/Hive/Java/SSH jobs in a
> > DAG (workflow jobs) and has timer and data-dependency triggers
> > (coordinator jobs).
> >
> > Regards.
> >
> > Alejandro
> >
> > On Tue, Dec 14, 2010 at 1:26 PM, edward choi <mp2...@gmail.com> wrote:
> >
> >> Thanks for the tip. I took a look at it.
> >> Looks similar to Cascading I guess...?
> >> Anyway thanks for the info!!
> >>
> >> Ed
> >>
> >> 2010/12/8 Alejandro Abdelnur <t...@cloudera.com>
> >>
> >>> Or, if you want to do it in a reliable way, you could use an Oozie
> >>> coordinator job.
> >>>
> >>> On Wed, Dec 8, 2010 at 1:53 PM, edward choi <mp2...@gmail.com> wrote:
> >>>> My mistake. Come to think of it, you are right; I can just make an
> >>>> infinite loop inside the Hadoop application.
> >>>> Thanks for the reply.
> >>>>
> >>>> 2010/12/7 Harsh J <qwertyman...@gmail.com>
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> On Tue, Dec 7, 2010 at 2:25 PM, edward choi <mp2...@gmail.com>
> wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> I'm planning to crawl a certain web site every 30 minutes.
> >>>>>> How would I get it done in Hadoop?
> >>>>>>
> >>>>>> In pure Java, I used the Thread.sleep() method, but I guess this
> >>>>>> won't work in Hadoop.
> >>>>>
> >>>>> Why wouldn't it? You mostly need to manage your post-job logic, but
> >>>>> sleep and resubmission should work just fine.
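> >>>>>
> >>>>> Something along these lines should do (just a sketch; the class name
> >>>>> and paths are made up, and no mapper/reducer is set, so you would
> >>>>> plug in your own crawl job):
> >>>>>
> >>>>>   // Illustrative driver: submit the job, wait for it, sleep 30
> >>>>>   // minutes, resubmit. Output goes to a fresh timestamped directory
> >>>>>   // so successive runs don't collide.
> >>>>>   import org.apache.hadoop.conf.Configuration;
> >>>>>   import org.apache.hadoop.fs.Path;
> >>>>>   import org.apache.hadoop.mapreduce.Job;
> >>>>>   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> >>>>>   import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
> >>>>>
> >>>>>   public class PeriodicCrawlDriver {
> >>>>>     public static void main(String[] args) throws Exception {
> >>>>>       while (true) {
> >>>>>         Job job = new Job(new Configuration(), "crawl");
> >>>>>         // job.setMapperClass(...) and friends would go here
> >>>>>         FileInputFormat.addInputPath(job, new Path(args[0]));
> >>>>>         FileOutputFormat.setOutputPath(
> >>>>>             job, new Path(args[1] + "/" + System.currentTimeMillis()));
> >>>>>         job.waitForCompletion(true);     // block until this run is done
> >>>>>         Thread.sleep(30L * 60L * 1000L); // then wait 30 minutes
> >>>>>       }
> >>>>>     }
> >>>>>   }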
> >>>>>
> >>>>>> Or if it could work, could anyone show me an example?
> >>>>>>
> >>>>>> Ed.
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Harsh J
> >>>>> www.harshj.com
> >>>>>
> >>>>
> >>>
> >>
>
> --
> Chris K Wensel
> ch...@concurrentinc.com
> http://www.concurrentinc.com
>
> -- Concurrent, Inc. offers mentoring, support, and licensing for Cascading
>
>
