Hi,

How large is the dataset you're saving to S3? Saving to S3 actually happens in two steps: 1) writing temporary files, and 2) committing them to the proper directory. Step 2 can be slow because S3 does not have a quick atomic "move" operation: you have to copy each object (server side, but it still takes time) and then delete the original.
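To make the cost concrete, here is a toy sketch in plain Scala (not the real S3 API; the bucket is just a Map and the key names are made up) of why "rename" on S3 is really copy-then-delete:

```scala
import scala.collection.mutable

object S3RenameSketch {
  // Toy model of an S3 bucket: object key -> contents.
  val bucket = mutable.Map[String, String]()

  // S3 has no atomic move: a "rename" is a full server-side copy of
  // the object followed by a delete of the original, so its cost grows
  // with object size (unlike a POSIX rename, which only touches metadata).
  def s3Rename(src: String, dst: String): Boolean =
    bucket.get(src) match {
      case Some(data) =>
        bucket(dst) = data // step 1: copy the whole object
        bucket -= src      // step 2: delete the original
        true
      case None => false
    }

  def main(args: Array[String]): Unit = {
    // Step 1 of a save: a task writes to a temporary key.
    bucket("_temporary/part-00000") = "some output"
    // Step 2: the commit "moves" it into the final directory,
    // paying the copy-and-delete cost for every part file.
    s3Rename("_temporary/part-00000", "output/part-00000")
  }
}
```

With many part files, step 2 pays that copy-and-delete cost once per file, which is where the time goes.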
I've overcome this by using a JobConf with a NullOutputCommitter:

    jobConf.setOutputCommitter(classOf[NullOutputCommitter])

where NullOutputCommitter is a class that doesn't do anything:

    import org.apache.hadoop.mapred.{JobContext, OutputCommitter, TaskAttemptContext}

    class NullOutputCommitter extends OutputCommitter {
      def abortTask(taskContext: TaskAttemptContext) = { }
      override def cleanupJob(jobContext: JobContext) = { }
      def commitTask(taskContext: TaskAttemptContext) = { }
      def needsTaskCommit(taskContext: TaskAttemptContext) = false
      def setupJob(jobContext: JobContext) = { }
      def setupTask(taskContext: TaskAttemptContext) = { }
    }

This works, but maybe someone has a better solution.

/Raf

anny9699 wrote:
> Hi,
>
> I found writing output back to s3 using rdd.saveAsTextFile() is extremely
> slow, much slower than reading from s3. Is there a way to make it faster?
> The rdd has 150 partitions so parallelism is enough I assume.
>
> Thanks a lot!
> Anny
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-output-to-s3-extremely-slow-tp16447.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> ---------------------------------------------------------------------