Hi Sam,
I would absolutely be interested in reading a blog write-up of how you are
doing this. We have pieced together a reasonably decent pipeline ourselves
in Jenkins, but still have many kinks to work out. We also have new
requirements to run side-by-side comparisons of different versions of the
Spark job, which introduces additional complexity for us and opens up an
opportunity to redesign this approach.

To answer the question of "what if Jenkins is down," we simply added our
Jenkins to our application monitoring (AppDynamics currently, New Relic in
future) and use CloudWatch to monitor the instance. We manage our own
Jenkins box, so there are no hoops to jump through in this respect.
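
As a rough illustration of the CloudWatch side (the instance ID and SNS
topic below are made up), an alarm on the Jenkins instance's status checks
is about all it takes, assuming boto3:

    # jenkins_alarm.py - hypothetical sketch: alert when the Jenkins EC2
    # instance fails its status checks and notify an SNS topic.
    import boto3

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(
        AlarmName="jenkins-instance-status-check",
        Namespace="AWS/EC2",
        MetricName="StatusCheckFailed",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # made-up id
        Statistic="Maximum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # made-up topic
    )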

Thanks!
Sumona

On Tue, Apr 11, 2017 at 7:20 AM Sam Elamin <hussam.ela...@gmail.com> wrote:

> Hi Steve
>
>
> Thanks for the detailed response. I think this problem doesn't have an
> industry-standard solution as of yet, and I am sure a lot of people would
> benefit from the discussion.
>
> I realise now what you are saying, so thanks for clarifying. That said, let
> me try and explain how we approached the problem.
>
> There are two problems you highlighted: the first is moving the code from
> SCM to prod, and the other is ensuring the data your code uses is correct
> (i.e. using the latest data from prod).
>
>
> *"how do you get your code from SCM into production?"*
>
> Our pipeline currently runs via Airflow, with our DAGs stored in S3. With
> regards to how we get our code from SCM to production:
>
> 1) A Jenkins build that builds our Spark applications and runs tests
> 2) Once the first build is successful, we trigger another build to copy
> the DAGs to an S3 folder
>
> We then routinely sync this folder to the local Airflow dags folder every
> X minutes.
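>
> As a rough sketch of that sync step (the bucket, prefix, and local path
> below are made up), something like this run from cron every few minutes
> is enough, assuming boto3 is available on the Airflow box:
>
>     # sync_dags.py - hypothetical sketch of the "pull DAGs from S3 into
>     # the local Airflow dags folder" step; run from cron.
>     import os
>     import boto3
>
>     BUCKET = "my-pipeline-artifacts"      # assumed bucket name
>     PREFIX = "airflow/dags/"              # assumed S3 folder the build copies to
>     DAGS_DIR = "/usr/local/airflow/dags"  # assumed local dags folder
>
>     s3 = boto3.client("s3")
>     paginator = s3.get_paginator("list_objects_v2")
>     for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
>         for obj in page.get("Contents", []):
>             key = obj["Key"]
>             if not key.endswith(".py"):
>                 continue
>             # overwrite the local copy so Airflow picks up new versions
>             local_path = os.path.join(DAGS_DIR, os.path.basename(key))
>             s3.download_file(BUCKET, key, local_path)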
>
> Re test data
> *" but what's your strategy for test data: that's always the troublespot."*
>
> Our application versions the data, so we expect the source data to be at a
> certain version and the output data to also be at a certain version.
>
> We have a test resources folder that follows the same versioning
> convention - this is the data that our application tests use - to ensure
> that the data is in the correct format.
>
> So, for example, if we have Table X at version 1 that depends on data from
> Tables A and B, also at version 1, we run our Spark application and then
> ensure the transformed Table X has the correct columns and row values.
>
> Then, when there is a new version 2 of the source data, or we add a new
> column to Table X (version 2), we generate a new version of the test data
> and ensure the tests are updated.
>
> That way we ensure any new version of the data has tests against it.
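>
> To make that concrete, a test along these lines is roughly what we mean
> (the table names, schema, and the transform_table_x function are invented
> placeholders for illustration, not our real code):
>
>     # test_table_x_v1.py - hypothetical pytest + PySpark sketch of checking
>     # a versioned output against versioned source fixtures.
>     from pyspark.sql import SparkSession
>
>     def transform_table_x(table_a, table_b):
>         # placeholder standing in for the real Spark transformation
>         return table_a.join(table_b, "id").select("id", "amount", "updated_at")
>
>     def test_table_x_v1():
>         spark = SparkSession.builder.master("local[2]").getOrCreate()
>         # versioned fixtures checked into the test resources folder
>         table_a = spark.read.json("src/test/resources/v1/table_a.json")
>         table_b = spark.read.json("src/test/resources/v1/table_b.json")
>
>         result = transform_table_x(table_a, table_b)
>
>         # schema check: the columns the v1 contract expects
>         assert sorted(result.columns) == ["amount", "id", "updated_at"]
>
>         # row-value check against the expected v1 output fixture
>         expected = spark.read.json("src/test/resources/v1/table_x.json")
>         assert sorted(result.collect()) == sorted(expected.collect())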
>
> *"I've never seen any good strategy there short of "throw it at a copy of
> the production dataset"."*
>
> I agree, which is why we keep a sample of the production data and version
> the schemas we expect the source and target data to conform to.
>
> If people are interested, I am happy to write a blog post about it in the
> hope that it helps people build more reliable pipelines.
>
> Kind Regards
> Sam
>
> On Tue, Apr 11, 2017 at 11:31 AM, Steve Loughran <ste...@hortonworks.com>
> wrote:
>
>
> On 7 Apr 2017, at 18:40, Sam Elamin <hussam.ela...@gmail.com> wrote:
>
> Definitely agree with Gourav there. I wouldn't want Jenkins to run my
> workflow. Seems to me that you would only be using Jenkins for its
> scheduling capabilities.
>
>
> Maybe I was just looking at this differently.
>
> Yes, you can run tests, but you wouldn't want it to run your orchestration
> of jobs.
>
> What happens if Jenkins goes down for any particular reason? How do you
> have the conversation with your stakeholders that your pipeline is not
> working and they don't have data because the build server is going through
> an upgrade?
>
>
>
> Well, I wouldn't use it as a replacement for Oozie, but I'd certainly
> consider it as the pipeline for getting your code out to the cluster, so
> you don't have to explain why you just pushed out something broken.
>
> As an example, here's Renault's pipeline as discussed last week in Munich:
> https://flic.kr/p/Tw3Emu
>
> However, to be fair, I understand what you are saying, Steve. If someone is
> in a place where you only have access to Jenkins and have to go through
> hoops to set up or get access to new instances, then engineers will do what
> they always do: find ways to game the system to get their work done.
>
>
>
>
> This isn't about trying to "game the system"; this is about what makes a
> replicable workflow for getting code into production, either at the press
> of a button or as part of a scheduled "we push out an update every night,
> rerun the deployment tests and then switch over to the new installation"
> mechanism.
>
> Put differently: how do you get your code from SCM into production? Not
> just for CI, but what's your strategy for test data: that's always the
> troublespot. Random selection of rows may work, although it will skip the
> odd outlier (high-unicode char in what should be a LATIN-1 field, time set
> to 0, etc), and for work joining > 1 table, you need rows which join well.
> I've never seen any good strategy there short of "throw it at a copy of the
> production dataset".
>
>
> -Steve
>
> On Fri, 7 Apr 2017 at 16:17, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
> Hi Steve,
>
> Why would you ever do that? You are suggesting the use of a CI tool as a
> workflow and orchestration engine.
>
> Regards,
> Gourav Sengupta
>
> On Fri, Apr 7, 2017 at 4:07 PM, Steve Loughran <ste...@hortonworks.com>
> wrote:
>
> If you have Jenkins set up for some CI workflow, it can do scheduled
> builds and tests. It works well if you can do some build tests before even
> submitting to a remote cluster.
>
> On 7 Apr 2017, at 10:15, Sam Elamin <hussam.ela...@gmail.com> wrote:
>
> Hi Shyla
>
> You have multiple options really, some of which have already been listed,
> but let me try and clarify.
>
> Assuming you have a Spark application in a jar, you have a variety of
> options.
>
> You have to have an existing Spark cluster that is running either on EMR
> or somewhere else.
>
> *Super simple / hacky*
> A cron job on EC2 that calls a simple shell script that does a spark-submit
> to a Spark cluster, OR creates or adds a step to an EMR cluster.
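>
> As a rough sketch of that option (shown as a small Python wrapper rather
> than a shell script, just to keep the examples in one language; the paths,
> master URL, class name, and jar location are all made up), the script cron
> invokes can be as thin as:
>
>     # run_job.py - hypothetical wrapper invoked from cron, e.g.
>     #   0 * * * * python3 /opt/jobs/run_job.py >> /var/log/spark-job.log 2>&1
>     import subprocess
>
>     SPARK_SUBMIT = "/usr/local/spark/bin/spark-submit"  # assumed install path
>
>     subprocess.run(
>         [
>             SPARK_SUBMIT,
>             "--master", "yarn",                # or spark://host:7077, etc.
>             "--deploy-mode", "cluster",
>             "--class", "com.example.MyJob",    # made-up main class
>             "s3://my-bucket/jars/my-job.jar",  # made-up jar location
>         ],
>         check=True,  # fail loudly so a non-zero exit shows up in cron mail
>     )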
>
> *More Elegant*
> Airflow/Luigi/AWS Data Pipeline (which is just cron with a UI) that will
> do the above step but adds scheduling, potential backfilling and error
> handling (retries, alerts, etc.).
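>
> For the Airflow route, a DAG along these lines is roughly all it takes
> (Airflow 2.x import paths; the schedule, alert address, and spark-submit
> command are made up for illustration):
>
>     # spark_job_dag.py - hypothetical Airflow DAG wrapping the same
>     # spark-submit, with retries and failure alerts.
>     from datetime import datetime, timedelta
>
>     from airflow import DAG
>     from airflow.operators.bash import BashOperator
>
>     with DAG(
>         dag_id="hourly_spark_job",
>         start_date=datetime(2017, 4, 1),
>         schedule_interval="@hourly",
>         catchup=False,  # flip to True if you want backfills
>         default_args={
>             "retries": 3,
>             "retry_delay": timedelta(minutes=5),
>             "email_on_failure": True,
>             "email": ["data-team@example.com"],  # made-up alert address
>         },
>     ) as dag:
>         submit = BashOperator(
>             task_id="spark_submit",
>             bash_command=(
>                 "spark-submit --master yarn --deploy-mode cluster "
>                 "--class com.example.MyJob s3://my-bucket/jars/my-job.jar"
>             ),
>         )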
>
> AWS are coming out with Glue <https://aws.amazon.com/glue/> soon, which
> can run Spark jobs, but I do not think it's available worldwide just yet.
>
> Hope I cleared things up
>
> Regards
> Sam
>
>
> On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta <gourav.sengu...@gmail.com
> > wrote:
>
> Hi Shyla,
>
> why would you want to schedule a spark job in EC2 instead of EMR?
>
> Regards,
> Gourav
>
> On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande <deshpandesh...@gmail.com>
> wrote:
>
> I want to run a Spark batch job, maybe hourly, on AWS EC2. What is the
> easiest way to do this? Thanks.
>
