"Building data products is a very different discipline from that of
building software."

That is a fundamentally incorrect assumption.

There will always be a need for figuring out how to apply said principles,
but saying 'we're different' has always turned out to be incorrect and I
have seen no reason to think otherwise for data products.

At some point it always comes down to 'how do I get this to my customer, in
a reliable and repeatable fashion'.  The CI/CD patterns that we've come to
rely on are designed to optimize that process.

I have seen no evidence that 'data products' don't benefit from those
practices and I have definitely seen evidence that not following those
patterns has had substantial costs.

Of course there's always a balancing act in the early phases of discovery,
but at some point the needle swings from "Do I have a valuable product?"
to "How do I get this to customers?"

Gary Lucas

On 12 April 2017 at 10:46, Steve Loughran <ste...@hortonworks.com> wrote:

> On 12 Apr 2017, at 17:25, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
> Hi,
> Your answer is like saying, I know how to code in assembly level language
> and I am going to build the next GUI in assembly level code and I think
> that there is a genuine functional requirement to see a color of a button
> in green on the screen.
> well, I reserve the right to have incomplete knowledge, and look forward
> to improving it.
> Perhaps it may be pertinent to read the first preface of a CI/CD book and
> realize what kind of software development disciplines it is applicable
> to.
> the original introduction on CI was probably Fowler's Cruise Control
> article,
> https://martinfowler.com/articles/originalContinuousIntegration.html
> "The key is to automate absolutely everything and run the process so often
> that integration errors are found quickly"
> Java Development with Ant, 2003, looks at Cruise Control, Anthill and
> Gump, again, with that focus on team coding and automated regression
> testing, both of unit tests, and, with things like HttpUnit, web UIs.
> There's no discussion of "Data" per-se, though databases are implicit.
> Apache Gump [Sam Ruby, 2001] was designed to address a single problem "get
> the entire ASF project portfolio to build and test against the latest build
> of everything else". Lots of finger pointing there, especially when
> something foundational like Ant or Xerces did bad.
> AFAIK, the earliest known in-print reference to Continuous Deployment is
> the HP Labs 2002 paper, *Making Web Services that Work*. That introduced
> the concept with a focus on automating deployment, staging, testing and
> treating ops problems as use cases for which engineers could often write
> tests, and, perhaps, even design their applications to support. "We are
> exploring extending this model to one we term Continuous Deployment —after
> passing the local test suite, a service can be automatically deployed to a
> public staging server for stress and acceptance testing by physically
> remote calling parties"
> At this time, the applications weren't modern "big data" apps as they
> didn't have affordable storage or the tools to schedule work over it. It
> wasn't that the people writing the books and papers looked at big data and
> said "not for us", it just wasn't on their horizons. 1TB was a lot of
> storage in those days, not a high-end SSD.
> Otherwise your approach is just another line of defense in saving your job
> by applying an impertinent, incorrect, and outdated skill and tool to a
> problem.
> Please be a bit more constructive here; the ASF code of conduct encourages
> empathy and collaboration.
> https://www.apache.org/foundation/policies/conduct . Thanks.
> Which is why we need to consider how to take what are core methodologies
> for software and apply them, and, where appropriate, supersede them with
> new workflows, ideas, technologies. But doing so with an understanding of
> the reasoning behind today's tools and workflows. I'm really interested in
> how we get from experimental notebook code to something usable in
> production, pushing it out, finding the dirty-data problems before it goes
> live, etc, etc. I do think today's tools have been outgrown by the
> applications we now build, and am thinking not so much "which tools to
> use", but one step further, "what are the new tools and techniques to
> use?".
> I look forward to whatever insight people have here.
> My genuine advice to everyone in all spheres of activities will be to
> first understand the problem to solve before solving it and definitely
> before selecting the tools to solve it, otherwise you will land up with a
> bowl of soup and fork in hand and argue that CI/CD is still applicable to
> building data products and data warehousing.
> I concur
> Regards,
> Gourav
> -Steve
> On Wed, Apr 12, 2017 at 12:42 PM, Steve Loughran <ste...@hortonworks.com>
> wrote:
>> On 11 Apr 2017, at 20:46, Gourav Sengupta <gourav.sengu...@gmail.com>
>> wrote:
>> And once again JAVA programmers are trying to solve a data analytics and
>> data warehousing problem using programming paradigms. It is genuinely a pain
>> to see this happen.
>> While I'm happy to be faulted for treating things as software processes,
>> having a fully automated mechanism for testing the latest code before
>> production is something I'd consider foundational today. This is what
>> "Continuous Deployment" was about when it was first conceived. Does it mean
>> you should blindly deploy that way? Well, not if you worry about security,
>> but having that review process and then a final manual "deploy" button can
>> address that.
>> Cloud infras let you integrate cluster instantiation into the process,
>> which helps you automate things like "stage the deployment in some new VMs,
>> run acceptance tests (*), then switch the load balancer over to the new
>> cluster, being ready to switch back if you need to". I've not tried that with
>> streaming apps though; I don't know how to do it there. Booting the new
>> cluster off checkpointed state requires deserialization to work, which
>> can't be guaranteed if you are changing the objects which get serialized.
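
A minimal sketch of the staged switch described above, with a checkpoint-restore check as one of the acceptance tests. Everything here is an assumption for illustration: the cluster names, the function names `can_restore` and `choose_live_cluster`, and the JSON checkpoint, which only stands in for whatever serialized state a real streaming job would write.

```python
import json
from typing import Callable, Iterable, Set


def can_restore(checkpoint_json: str, required_fields: Set[str]) -> bool:
    """True if the checkpoint parses and carries every field the new code reads."""
    try:
        state = json.loads(checkpoint_json)
    except json.JSONDecodeError:
        return False
    return isinstance(state, dict) and required_fields <= set(state)


def choose_live_cluster(
    live: str,
    candidate: str,
    acceptance_tests: Iterable[Callable[[str], bool]],
) -> str:
    """Return the cluster that should receive traffic after staging."""
    if all(test(candidate) for test in acceptance_tests):
        return candidate
    return live  # any failure: traffic stays on the old cluster


# Hypothetical state written by the currently deployed version.
old_checkpoint = '{"offset": 42, "window_start": 1000}'

# New code that only reads existing fields can restore and take over.
compatible = [lambda cluster: can_restore(old_checkpoint, {"offset"})]
assert choose_live_cluster("blue", "green", compatible) == "green"

# New code that expects a field the old checkpoint lacks fails the check.
incompatible = [lambda cluster: can_restore(old_checkpoint, {"offset", "session_id"})]
assert choose_live_cluster("blue", "green", incompatible) == "blue"
```

The point of the sketch is only that "can the new build read the old state?" can be tested in staging before the load balancer moves; it does not solve the underlying compatibility problem.
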
>> I'd argue then, it's not a problem which has already been solved by data
>> analytics/warehousing, though if you've got pointers there, I'd be
>> grateful. Always good to see work by others. Indeed, the telecoms industry
>> have led the way in testing and HA deployment: if you look at Erlang you
>> can see a system designed with hot upgrades in mind, the way java code "add
>> a JAR to a web server" never was.
>> -Steve
>> (*) do always make sure this is the test cluster with a snapshot of test
>> data, not production machines/data. There are always horror stories there.
>> Regards,
>> Gourav
>> On Tue, Apr 11, 2017 at 2:20 PM, Sam Elamin <hussam.ela...@gmail.com>
>> wrote:
>>> Hi Steve
>>> Thanks for the detailed response, I think this problem doesn't have an
>>> industry standard solution as of yet and I am sure a lot of people would
>>> benefit from the discussion
>>> I realise now what you are saying so thanks for clarifying, that said
>>> let me try and explain how we approached the problem
>>> There are 2 problems you highlighted: the first is moving the code from
>>> SCM to prod, and the other is ensuring the data your code uses is correct
>>> (using the latest data from prod).
>>> *"how do you get your code from SCM into production?"*
>>> Our pipeline currently runs via Airflow, and we keep our DAGs in S3.
>>> This is how we get our code from SCM to production:
>>> 1) Jenkins build that builds our spark applications and runs tests
>>> 2) Once the first build is successful we trigger another build to copy
>>> the dags to an s3 folder
>>> We then routinely sync this folder to the local airflow dags folder
>>> every X amount of mins
>>> Re test data
>>> *"but what's your strategy for test data: that's always the
>>> troublespot."*
>>> Our application is using versioning against the data, so we expect the
>>> source data to be in a certain version and the output data to also be in a
>>> certain version
>>> We have a test resources folder that follows the same versioning
>>> convention. This is the data that our application tests use to ensure
>>> that the data is in the correct format
>>> so for example if we have Table X with version 1 that depends on data
>>> from Table A and B also version 1, we run our spark application then ensure
>>> the transformed table X has the correct columns and row values
>>> Then when we have a new version 2 of the source data, or add a new
>>> column in Table X (version 2), we generate a new version of the data and
>>> ensure the tests are updated
>>> That way we ensure any new version of the data has tests against it
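
A minimal sketch of the versioned-test idea above, in plain Python rather than Spark so it runs anywhere. The table names, columns, and the functions `build_table_x` and `schema_problems` are hypothetical stand-ins, not the actual pipeline; the join-and-sum transform is only a placeholder for the real job.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]

# Hypothetical expected columns per (table, version); names are illustrative only.
EXPECTED_COLUMNS: Dict[Tuple[str, int], List[str]] = {
    ("table_x", 1): ["id", "total"],
    ("table_x", 2): ["id", "total", "currency"],
}


def build_table_x(table_a: List[Row], table_b: List[Row]) -> List[Row]:
    """Stand-in for the Spark job: join A and B on id and sum their amounts."""
    b_by_id = {row["id"]: row for row in table_b}
    return [
        {"id": a["id"], "total": a["amount"] + b_by_id[a["id"]]["amount"]}
        for a in table_a
        if a["id"] in b_by_id
    ]


def schema_problems(table: str, version: int, rows: List[Row]) -> List[str]:
    """Return readable mismatches; an empty list means every row conforms."""
    expected = set(EXPECTED_COLUMNS[(table, version)])
    problems = []
    for index, row in enumerate(rows):
        missing = sorted(expected - set(row))
        extra = sorted(set(row) - expected)
        if missing or extra:
            problems.append(f"row {index}: missing={missing} extra={extra}")
    return problems


# Versioned fixtures, as they might sit in a test resources folder.
table_a_v1 = [{"id": 1, "amount": 10}, {"id": 2, "amount": 5}]
table_b_v1 = [{"id": 1, "amount": 3}, {"id": 2, "amount": 7}]

table_x_v1 = build_table_x(table_a_v1, table_b_v1)

# Check row values and columns against the version 1 contract.
assert table_x_v1 == [{"id": 1, "total": 13}, {"id": 2, "total": 12}]
assert schema_problems("table_x", 1, table_x_v1) == []

# The same output fails the version 2 contract until the job adds the column.
assert len(schema_problems("table_x", 2, table_x_v1)) == 2
```

Keeping the expected columns keyed by version means a schema change shows up as a failing test against the new version while the old version's fixtures keep passing.
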
>>> *"I've never seen any good strategy there short of "throw it at a copy
>>> of the production dataset"."*
>>> I agree, which is why we have a sample of the production data and version
>>> the schemas we expect the source and target data to follow.
>>> If people are interested I am happy to write a blog post about it in the
>>> hope that it helps people build more reliable pipelines
>> Love to see that.
>> Kind Regards
>>> Sam

