I think it really depends on your script and your environment.

A good approach may be to split the script into logical code blocks
(jobs), then execute those jobs in series via a bash script. I have also
found it helpful to persist each job's output (not the intermediate data
within a job) to a persistent data store; that way, if something goes
wrong, you don't have to rerun prior computations, only the jobs from
the last failure onward (at the cost of additional loads). This modular
approach has been helpful in development; you still get Pig's
optimization benefits per module, and it leaves room for future
expansion, such as concurrent job execution on your cluster and
optimizing cluster capacity.
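For illustration, here's a minimal sketch of the kind of driver I have
in mind (the script names, HDFS paths, and relation names are
placeholders, not anything specific to your setup):

    #!/usr/bin/env bash
    # Run each logical job in series; exit on the first failure so the
    # outputs already persisted by earlier jobs can be reused next run.
    set -e

    pig -f cleanse.pig     # ends with:   STORE cleaned INTO '/data/cleaned';
    pig -f aggregate.pig   # begins with: events = LOAD '/data/cleaned' ...;
    pig -f report.pig

Because each job STOREs its result to HDFS and the next job LOADs it
back, a failure in, say, aggregate.pig means you fix that one script and
rerun from that point, rather than repeating the cleanse step.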

Hope this helps,

-Dan



On Wed, Mar 5, 2014 at 10:33 AM, Christopher Petrino <c...@yesware.com> wrote:

> Hi all, what is everyone's approach for managing a Pig script that has
> become very long? What is your best way to break it up into smaller pieces?
>
