Turns out I was using the s3:// prefix (in a standalone Spark cluster). It
was writing a LOT of block_* files to my S3 bucket, which was the cause of
the slowness. I was coming from Amazon EMR, where Amazon's underlying FS
implementation has re-mapped s3:// to s3n://, which doesn't use the
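
For anyone hitting the same thing on a standalone cluster: in stock Hadoop, s3:// is the block-based filesystem (hence the block_* objects in the bucket), while s3n:// reads and writes ordinary S3 objects. A minimal sketch of normalizing paths before handing them to Spark follows. The helper name to_native_s3_uri and the bucket name are made up for illustration, and it assumes s3n:// is the scheme you want on your cluster.

```python
def to_native_s3_uri(uri, native_scheme="s3n"):
    """Rewrite an s3:// URI to the native scheme; leave other URIs untouched.

    Hypothetical helper, not part of Spark or Hadoop.
    """
    prefix = "s3://"
    if uri.startswith(prefix):
        # Keep bucket and key exactly as given, swap only the scheme.
        return f"{native_scheme}://{uri[len(prefix):]}"
    return uri


# Example with a made-up bucket name:
path = to_native_s3_uri("s3://my-bucket/input/part-0001.gz")
print(path)

# With an existing SparkContext `sc` (assumed, not created here):
# rdd = sc.textFile(path)
```

This only changes the URI scheme. The S3 credentials still have to be configured for whichever filesystem scheme ends up being used.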
Is the operation slow every time or does it run normally if you repeat the
operation within the same app?
Nick
On Thu, Dec 18, 2014 at 8:56 AM, Jon Chase jon.ch...@gmail.com wrote:
I'm running a very simple Spark application that downloads files from S3,
does a bit of mapping, then uploads
I would suggest checking out disk IO on the nodes in your cluster and then
reading up on the limiting behaviors that accompany different kinds of EC2
storage. Depending on how things are configured for your nodes, you may
have a local storage configuration that provides bursty IOPS where you get
I'm running a very simple Spark application that downloads files from S3,
does a bit of mapping, then uploads new files. Each file is roughly 2MB
and is gzip'd. I was running the same code on Amazon's EMR w/Spark and not
having any download speed issues (Amazon's EMR provides a custom