Hi,

We're currently using an EMR cluster (which runs YARN) and submitting Spark
jobs to it with spark-submit from various machines outside the cluster. We
haven't had time to investigate something like Livy
<https://github.com/cloudera/livy> yet.

We also have a need to use a mix of cluster and client modes in this
configuration.

Three things we've struggled with here are

   1. Configuring spark-submit with the necessary master node host & ports
   2. Setting up the cluster to support file staging
   3. S3 implementation choices

I'm curious -- how do others handle these?

Here's what we're doing in case it helps anybody --

*Configuring spark-submit*

As far as I can tell, you can't give spark-submit the YARN ResourceManager
host and ports on the command line with --conf properties. You have to set a
SPARK_CONF_DIR or HADOOP_CONF_DIR environment variable pointing to a local
directory containing core-site.xml, yarn-site.xml, and optionally
hive-site.xml.
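
For example, a minimal sketch of the kind of submission we mean (the
directory, class name, and jar are just placeholders):

    # HADOOP_CONF_DIR holds core-site.xml, yarn-site.xml (and optionally hive-site.xml)
    export HADOOP_CONF_DIR=/path/to/emr-conf
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --class com.example.MyApp \
      my-app.jar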

However, these config files override what's on the cluster, so you have to
be careful to assemble only what you actually need (especially since you
might submit to differently configured clusters).

One starting point is to launch a cluster, grab the files out of
/etc/hadoop/conf on the master, and then whittle them down.
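
Something along these lines will get you the raw files (the host here is a
placeholder for your EMR master):

    # copy the client-side Hadoop config off a running EMR master node
    scp -r hadoop@your-emr-master:/etc/hadoop/conf ./emr-conf
    # then prune core-site.xml and yarn-site.xml down to what you actually need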

*Setting up the cluster to support file staging*

Out of the box, spark-submit fails when it tries to stage files, because it
puts them under /user/(the local user name on the machine the job was
submitted from) on the cluster, and that directory and user won't exist
there.

I think spark.yarn.stagingDir can change the directory, but you seem to
need to set up your cluster with a bootstrap action that creates the
directory and gives it fully open write permissions.
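
Roughly the kind of setup we mean -- the directory name is just an example,
and since HDFS has to be up before this can run, an EMR step on the master
may be a safer place for it than a plain bootstrap action:

    # create a world-writable staging directory on the cluster's HDFS
    hdfs dfs -mkdir -p /spark-staging
    hdfs dfs -chmod 1777 /spark-staging

and then point submissions at it:

    spark-submit --conf spark.yarn.stagingDir=hdfs:///spark-staging ...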

*S3 implementation choices*

Back in the "Role-based S3 access outside of EMR" thread, we talked about
using S3A when running with the local master on an EC2 instance, which
works in Hadoop 2.7+ with the right libraries.
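
For reference, one way to pull those libraries in at submit time (the
version numbers are illustrative and have to match the Hadoop you're
running against):

    spark-submit \
      --packages org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4 \
      ...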

AWS provides its own Hadoop FileSystem implementation for S3, called EMRFS,
and the default EMR cluster setup uses it for "s3://" scheme URIs. As far
as I know, they haven't released this library for use elsewhere. It
supports "consistent view", which uses a DynamoDB table to overcome S3
list-key inconsistency/lag for I/O from the cluster. Presumably they also
maintain it and its configuration, and keep them up to date and performing
well.

If you use cluster mode and "s3://" scheme URIs, things work fine.

However, if you use client mode, it seems like Spark will try to use the
Hadoop "s3://" scheme FileSystem on the submitting host for something, and
that fails because the default implementation doesn't have the credentials.
One workaround is to set environment variables or Hadoop conf properties
containing your secret keys (!).
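
Concretely, something like this -- keys elided, and the fs.s3.* property
names assume it's the stock Hadoop "s3://" implementation resolving the
URIs on the submitting host:

    # either via environment variables (whether these get picked up depends
    # on the FileSystem implementation)...
    export AWS_ACCESS_KEY_ID=...
    export AWS_SECRET_ACCESS_KEY=...
    # ...or as Hadoop conf properties passed through spark-submit
    spark-submit \
      --conf spark.hadoop.fs.s3.awsAccessKeyId=... \
      --conf spark.hadoop.fs.s3.awsSecretAccessKey=... \
      ...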

Another solution is to use the S3A implementation in Hadoop 2.7.x or later.
However, if you use "s3a://" scheme URIs, they'll also be used on the
cluster -- you'll use the S3A implementation for cluster operations instead
of the EMRFS implementation.

Similarly, if you change core-site.xml locally to use the S3A
implementation for "s3://" scheme URIs, that will cause the cluster to also
use the S3A implementation, when it could have used EMRFS.
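
For the record, the local override we're talking about looks roughly like
this in the core-site.xml that your conf directory points at (the value is
the stock S3A class from hadoop-aws):

    <property>
      <name>fs.s3.impl</name>
      <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
    </property>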

We haven't figured out how to work around this yet, or whether it's
actually important.
