Hello Massy, I just answered on Reddit; I'm copy/pasting the answer here in case someone else is interested too.
Dataflow supports Python 3.5 <https://beam.apache.org/roadmap/python-sdk/#python-3-support>. At my company we use Apache Beam/Dataflow in production with a setup.py to install dependencies, even non-Python ones <https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#nonpython> such as polyglot <https://polyglot.readthedocs.io/en/latest/Installation.html>. The juliaset example is a helpful starting point.

We have the same constraint as you regarding data scientists, though on our side it is mainly TensorFlow. Don't hesitate to take a look at this article <https://medium.com/dailymotion/collaboration-between-data-engineers-data-analysts-and-data-scientists-97c00ab1211f>, which gives an overview of how we work with data scientists. You should be able to wrap your Apache Beam/Dataflow code so it exposes the same syntax as scikit-learn. Data scientists can then scale their work autonomously, without having to know the internals of the cluster-computing framework.

Hope this helps.
Germain.

From: Massy Bourennani <massybourenn...@gmail.com>
Reply-To: "user@beam.apache.org" <user@beam.apache.org>
Date: Tuesday 16 July 2019 at 10:49
To: "user@beam.apache.org" <user@beam.apache.org>
Subject: Industrializing batch ML algorithm using Apache Beam/Dataflow (on Google Cloud Platform)

Hi all,

Here is the link to the Reddit post [1].

Many thanks for your help.
Massy

[1] https://www.reddit.com/r/dataengineering/comments/cdp5i3/industrializing_batch_ml_algorithm_using_apache/
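P.S. For reference, the setup.py pattern for non-Python dependencies follows the Beam juliaset example: a custom setuptools command runs shell commands on each Dataflow worker at startup. This is only a sketch; the package name and the libicu-dev dependency (which polyglot's PyICU requirement needs) are illustrative, not prescriptive.

```python
# setup.py -- sketch following the Beam juliaset example's pattern
# for installing non-Python dependencies on Dataflow workers.
import subprocess

import setuptools
from distutils.command.build import build as _build

# Shell commands executed on each worker before the job starts.
# libicu-dev is an illustrative non-Python dependency (needed by polyglot).
CUSTOM_COMMANDS = [
    ['apt-get', 'update'],
    ['apt-get', '--assume-yes', 'install', 'libicu-dev'],
]


class build(_build):
    """Chain our custom command into the standard build step."""
    sub_commands = _build.sub_commands + [('CustomCommands', None)]


class CustomCommands(setuptools.Command):
    """Runs each command in CUSTOM_COMMANDS, failing loudly on error."""
    user_options = []

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        for command in CUSTOM_COMMANDS:
            subprocess.check_call(command)


setuptools.setup(
    name='my-beam-pipeline',       # illustrative package name
    version='0.0.1',
    install_requires=['polyglot'],  # Python-level dependencies
    packages=setuptools.find_packages(),
    cmdclass={
        'build': build,
        'CustomCommands': CustomCommands,
    },
)
```

You then point the pipeline at it with `--setup_file=./setup.py`, and Dataflow rebuilds the package (running the custom commands) on every worker.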
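P.P.S. The "same syntax as sklearn" idea can be as simple as a small facade class. Everything below is a hypothetical sketch (the class name, `transform` signature, and `preprocess_fn` parameter are ours, not Beam's): the data scientist supplies a plain Python function and calls `transform()`, while the facade hides the pipeline options and runner choice.

```python
class BeamBatchTransformer:
    """Hypothetical scikit-learn-style facade over an Apache Beam batch job.

    A data scientist calls transform() like an sklearn transformer and never
    touches PipelineOptions, runners, or Dataflow flags directly.
    """

    def __init__(self, preprocess_fn, runner='DirectRunner', **pipeline_args):
        # preprocess_fn: a plain element-wise Python function, e.g. str.lower
        self.preprocess_fn = preprocess_fn
        self.runner = runner
        self.pipeline_args = pipeline_args  # e.g. project=..., region=...

    def transform(self, input_path, output_path):
        # Import lazily so the facade can be defined without Beam installed.
        import apache_beam as beam
        from apache_beam.options.pipeline_options import PipelineOptions

        options = PipelineOptions(runner=self.runner, **self.pipeline_args)
        with beam.Pipeline(options=options) as p:
            (p
             | 'Read' >> beam.io.ReadFromText(input_path)
             | 'Apply' >> beam.Map(self.preprocess_fn)
             | 'Write' >> beam.io.WriteToText(output_path))
        return output_path
```

Swapping `runner='DirectRunner'` for `runner='DataflowRunner'` (plus project/region/staging options) is the only change needed to go from a laptop to the cluster, which is exactly the autonomy mentioned above.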