Re: PR to enable actions on YARN

Samuel Hjelmfelt Fri, 22 Feb 2019 17:22:57 -0800

Hi Rodric and Carlos,

Apache Hadoop has three major components: HDFS (distributed filesystem),
MapReduce (distributed batch processing engine), and YARN (Yet Another
Resource Negotiator, a container engine). While MapReduce has been largely
replaced by Apache Tez, Apache Spark, and Apache Flink, HDFS and YARN are still
widely used for data analytics use cases.

YARN is unique as a container engine because, unlike Mesos and Kubernetes, it
was designed for ephemeral, short-lived containers rather than for long-running
micro-services. The jobs and queries that run on YARN are split into small
tasks that run to completion and generally last only seconds or maybe minutes.
Over the last couple of years, YARN has been expanding its support for
long-running use cases, but it is still focused on data-driven use cases over
more generic micro-service use cases (like web apps). The primary long-running
technologies on YARN are currently Spark Streaming and TensorFlow. Here is an
article from LinkedIn about why they created a project for TensorFlow on YARN.
A similar case could be made for OpenWhisk:
https://engineering.linkedin.com/blog/2018/09/open-sourcing-tony--native-support-of-tensorflow-on-hadoop

Bringing OpenWhisk onto YARN makes FaaS more accessible to the thousands of
organizations with existing Hadoop clusters. Between Cloudera’s 2,000+
customers; Azure, AWS, and GCP cloud customers; and self-supporting
organizations like Netflix, the install base of YARN is very large and still
growing.

This PR is a first level of integration, but YARN’s focus on ephemeral
containers could be more fully leveraged by OpenWhisk to improve scalability
and performance. Here is an interesting article on the scalability of YARN
from Microsoft:
https://azure.microsoft.com/en-us/blog/how-microsoft-drives-exabyte-analytics-on-the-world-s-largest-yarn-cluster/

Thanks,
Sam Hjelmfelt
