"Building data products is a very different discipline from that of building software."
That is a fundamentally incorrect assumption. There will always be a need for figuring out how to apply said principles, but saying 'we're different' has always turned out to be incorrect, and I have seen no reason to think otherwise for data products. At some point it always comes down to 'how do I get this to my customer, in a reliable and repeatable fashion?'. The CI/CD patterns that we've come to rely on are designed to optimize that process. I have seen no evidence that 'data products' don't benefit from those practices, and I have definitely seen evidence that not following those patterns has had substantial costs.

Of course there's always a balancing act in the early phases of discovery, but at some point the needle swings from "Do I have a valuable product?" to "How do I get this to customers?"

Gary Lucas

On 12 April 2017 at 10:46, Steve Loughran <ste...@hortonworks.com> wrote:

>
> On 12 Apr 2017, at 17:25, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
> Hi,
>
> Your answer is like saying, I know how to code in assembly level language
> and I am going to build the next GUI in assembly level code and I think
> that there is a genuine functional requirement to see a color of a button
> in green on the screen.
>
>
> well, I reserve the right to have incomplete knowledge, and look forward
> to improving it.
>
> Perhaps it may be pertinent to read the first preface of a CI/CD book and
> realize what kind of software development disciplines it is applicable to.
>
>
> the original introduction to CI was probably Fowler's Cruise Control
> article,
> https://martinfowler.com/articles/originalContinuousIntegration.html
>
> "The key is to automate absolutely everything and run the process so often
> that integration errors are found quickly"
>
> Java Development with Ant, 2003, looks at Cruise Control, Anthill and
> Gump, again with that focus on team coding and automated regression
> testing, both of unit tests and, with things like HttpUnit, web UIs.
> There's no discussion of "data" per se, though databases are implicit.
>
> Apache Gump [Sam Ruby, 2001] was designed to address a single problem: "get
> the entire ASF project portfolio to build and test against the latest build
> of everything else". Lots of finger pointing there, especially when
> something foundational like Ant or Xerces did bad.
>
> AFAIK, the earliest known in-print reference to Continuous Deployment is
> the HP Labs 2002 paper, *Making Web Services that Work*. That introduced
> the concept with a focus on automating deployment, staging testing and
> treating ops problems as use cases for which engineers could often write
> tests, and, perhaps, even design their applications to support. "We are
> exploring extending this model to one we term Continuous Deployment —after
> passing the local test suite, a service can be automatically deployed to a
> public staging server for stress and acceptance testing by physically
> remote calling parties"
>
> At this time, the applications weren't modern "big data" apps, as they
> didn't have affordable storage or the tools to schedule work over it. It
> wasn't that the people writing the books and papers looked at big data and
> said "not for us"; it just wasn't on their horizons. 1TB was a lot of
> storage in those days, not a high-end SSD.
>
> Otherwise your approach is just another line of defense in saving your job
> by applying an impertinent, incorrect, and outdated skill and tool to a
> problem.
>
>
> please be a bit more constructive here, the ASF code of conduct encourages
> empathy and collaboration.
> https://www.apache.org/foundation/policies/conduct . Thanks.
>
>
> Building data products is a very different discipline from that of
> building software.
>
>
> Which is why we need to consider how to take what are core methodologies
> for software and apply them, and, where appropriate, supersede them with
> new workflows, ideas, technologies. But doing so with an understanding of
> the reasoning behind today's tools and workflows. I'm really interested in
> how we get from experimental notebook code to something usable in
> production, pushing it out, finding the dirty-data problems before it goes
> live, etc., etc. I do think today's tools have been outgrown by the
> applications we now build, and am thinking not so much "which tools to
> use?", but one step further, "what are the new tools and techniques to
> use?".
>
> I look forward to whatever insight people have here.
>
>
> My genuine advice to everyone in all spheres of activity is to
> first understand the problem to solve before solving it, and definitely
> before selecting the tools to solve it; otherwise you will end up with a
> bowl of soup and a fork in hand and argue that CI/CD is still applicable to
> building data products and data warehousing.
>
>
> I concur
>
> Regards,
> Gourav
>
>
> -Steve
>
> On Wed, Apr 12, 2017 at 12:42 PM, Steve Loughran <ste...@hortonworks.com>
> wrote:
>
>>
>> On 11 Apr 2017, at 20:46, Gourav Sengupta <gourav.sengu...@gmail.com>
>> wrote:
>>
>> And once again JAVA programmers are trying to solve a data analytics and
>> data warehousing problem using programming paradigms. It is genuinely a
>> pain to see this happen.
>>
>>
>>
>> While I'm happy to be faulted for treating things as software processes,
>> having a fully automated mechanism for testing the latest code before
>> production is something I'd consider foundational today. This is what
>> "Continuous Deployment" was about when it was first conceived. Does it mean
>> you should blindly deploy that way? Well, not if you worry about security,
>> but having that review process and then a final manual "deploy" button can
>> address that.
>>
>> Cloud infras let you integrate cluster instantiation into the process,
>> which helps you automate things like "stage the deployment in some new VMs,
>> run acceptance tests (*), then switch the load balancer over to the new
>> cluster, being ready to switch back if you need to". I've not tried that
>> with streaming apps though; I don't know how to do it there. Booting the
>> new cluster off checkpointed state requires deserialization to work, which
>> can't be guaranteed if you are changing the objects which get serialized.
>>
>> I'd argue then, it's not a problem which has already been solved by data
>> analytics/warehousing —though if you've got pointers there, I'd be
>> grateful. Always good to see work by others. Indeed, the telecoms industry
>> have led the way in testing and HA deployment: if you look at Erlang you
>> can see a system designed with hot upgrades in mind, the way Java code "add
>> a JAR to a web server" never was.
>>
>> -Steve
>>
>>
>> (*) do always make sure this is the test cluster with a snapshot of test
>> data, not production machines/data. There are always horror stories there.
>>
>>
>> Regards,
>> Gourav
>>
>> On Tue, Apr 11, 2017 at 2:20 PM, Sam Elamin <hussam.ela...@gmail.com>
>> wrote:
>>
>>> Hi Steve
>>>
>>>
>>> Thanks for the detailed response. I think this problem doesn't have an
>>> industry-standard solution as of yet, and I am sure a lot of people would
>>> benefit from the discussion.
>>>
>>> I realise now what you are saying, so thanks for clarifying. That said,
>>> let me try and explain how we approached the problem.
>>>
>>> There are 2 problems you highlighted: the first is moving the code from
>>> SCM to prod, and the other is ensuring the data your code uses is correct
>>> (using the latest data from prod).
>>>
>>>
>>> *"how do you get your code from SCM into production?"*
>>>
>>> We currently have our pipeline being run via Airflow, and we have our
>>> dags in S3. With regards to how we get our code from SCM to production:
>>>
>>> 1) Jenkins build that builds our Spark applications and runs tests
>>> 2) Once the first build is successful we trigger another build to copy
>>> the dags to an S3 folder
>>>
>>> We then routinely sync this folder to the local Airflow dags folder
>>> every X amount of mins.
>>>
>>> Re test data
>>> *"but what's your strategy for test data: that's always the
>>> trouble spot."*
>>>
>>> Our application is using versioning against the data, so we expect the
>>> source data to be in a certain version and the output data to also be in
>>> a certain version.
>>>
>>> We have a test resources folder that follows the same versioning
>>> convention - this is the data that our application tests use - to ensure
>>> that the data is in the correct format.
>>>
>>> So for example, if we have Table X with version 1 that depends on data
>>> from Tables A and B, also version 1, we run our Spark application and then
>>> ensure the transformed Table X has the correct columns and row values.
>>>
>>> Then when we have a new version 2 of the source data, or are adding a
>>> new column to Table X (version 2), we generate a new version of the data
>>> and ensure the tests are updated.
>>>
>>> That way we ensure any new version of the data has tests against it.
>>>
>>> *"I've never seen any good strategy there short of 'throw it at a copy
>>> of the production dataset'."*
>>>
>>> I agree, which is why we have a sample of the production data and
>>> version the schemas we expect the source and target data to look like.
>>>
>>> If people are interested I am happy writing a blog about it in the hopes
>>> this helps people build more reliable pipelines.
>>>
>>>
>> Love to see that.
>>
>> Kind Regards
>>> Sam
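
As a rough illustration of the versioned test-data approach Sam describes above, here is a minimal PySpark sketch. The paths, table names, join key, expected columns, and the body of transform() are hypothetical placeholders, not details taken from the thread.

# Minimal PySpark sketch of the versioned test-data approach described above.
# Paths, table names, join key, expected columns, and transform() are
# hypothetical placeholders, not details from the thread.
from pyspark.sql import SparkSession

# Assumed expected schema for "Table X", version 1
EXPECTED_COLUMNS_V1 = {"customer_id", "order_total", "order_date"}


def transform(table_a, table_b):
    # Stand-in for the real Spark job that builds Table X from Tables A and B
    return table_a.join(table_b, "customer_id")


def test_table_x_v1():
    spark = (SparkSession.builder
             .master("local[2]")
             .appName("table-x-v1-test")
             .getOrCreate())
    try:
        # Versioned test resources, mirroring the versioning convention
        # used for the source data
        table_a = spark.read.parquet("src/test/resources/v1/table_a")
        table_b = spark.read.parquet("src/test/resources/v1/table_b")

        table_x = transform(table_a, table_b)

        # Check the transformed table matches the schema expected for v1;
        # row-value checks would sit alongside this
        assert set(table_x.columns) == EXPECTED_COLUMNS_V1
        assert table_x.count() > 0
    finally:
        spark.stop()

When the source data moves to version 2, a parallel set of v2 test resources and an updated expected schema would be added, so each data version keeps its own tests, as Sam outlines.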