Hi Rodric and Carlos,
Apache Hadoop has three major components: HDFS (a distributed filesystem), MapReduce (a distributed batch processing engine), and YARN (Yet Another Resource Negotiator, a container engine). While MapReduce has been largely replaced by Apache Tez, Apache Spark, and Apache Flink, HDFS and YARN are still widely used for data analytics use cases.

YARN is unique as a container engine because, unlike Mesos and Kubernetes, it was designed for ephemeral, short-lived containers rather than for long-running microservices. The jobs and queries that run on YARN are split into small tasks that run to completion and generally last only seconds or maybe minutes. Over the last couple of years, YARN has been expanding its support for long-running use cases, but it is still focused on data-driven use cases over more generic microservice use cases (like web apps). The primary long-running technologies on YARN are currently Spark Streaming and TensorFlow.

Here is an article from LinkedIn about why they created a project for TensorFlow on YARN. A similar case could be made for OpenWhisk:
https://engineering.linkedin.com/blog/2018/09/open-sourcing-tony--native-support-of-tensorflow-on-hadoop

Bringing OpenWhisk onto YARN makes FaaS more accessible to the thousands of organizations with existing Hadoop clusters. Between Cloudera’s 2,000+ customers; Azure, AWS, and GCP cloud customers; and self-supporting organizations like Netflix, the install base of YARN is very large and still growing.

This PR is a first level of integration, but YARN’s focus on ephemeral containers could be more fully leveraged by OpenWhisk to improve scalability and performance. Here is an interesting article from Microsoft on the scalability of YARN:
https://azure.microsoft.com/en-us/blog/how-microsoft-drives-exabyte-analytics-on-the-world-s-largest-yarn-cluster/

Thanks,
Sam Hjelmfelt
