Hi Rodric and Carlos,

Apache Hadoop has three major components: HDFS (distributed filesystem),
MapReduce (distributed batch processing engine), and YARN (Yet Another Resource
Negotiator, a container engine). While MapReduce has been largely replaced by
Apache Tez, Apache Spark, and Apache Flink, HDFS and YARN are still widely used
for data analytics use cases.



YARN is unique as a container engine because, unlike Mesos and Kubernetes, it
was designed for ephemeral, short-lived containers rather than for long-running
micro-services. The jobs and queries that run on YARN are split into small tasks
that run to completion and generally last only seconds or maybe minutes.
Over the last couple of years, YARN has been expanding its support for
long-running use cases, but it is still focused on data-driven use cases over
more generic micro-service use cases (like web apps). The primary long-running
technologies on YARN are currently Spark Streaming and TensorFlow. Here is an
article from LinkedIn about why they created a project for TensorFlow on YARN;
a similar case could be made for OpenWhisk:
https://engineering.linkedin.com/blog/2018/09/open-sourcing-tony--native-support-of-tensorflow-on-hadoop
 



Bringing OpenWhisk onto YARN makes FaaS more accessible to the thousands of
organizations with existing Hadoop clusters. Between Cloudera’s 2,000+
customers; Azure, AWS, and GCP cloud customers; and self-supporting
organizations like Netflix, the install base of YARN is very large and still
growing.

 

This PR is a first level of integration, but YARN’s focus on ephemeral
containers could be more fully leveraged by OpenWhisk to improve scalability
and performance. Here is an interesting article on the scalability of YARN
from Microsoft:
https://azure.microsoft.com/en-us/blog/how-microsoft-drives-exabyte-analytics-on-the-world-s-largest-yarn-cluster/

Thanks,
Sam Hjelmfelt
