Hi,
How large is the dataset you're saving into S3?

Saving to S3 actually happens in two steps:
1) writing temporary files
2) committing them to the proper directory
Step 2) can be slow because S3 does not have a quick atomic "move"
operation: you have to copy the object (server side, but it still takes
time) and then delete the original.
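The copy-then-delete behaviour can be sketched like this. It's a minimal
illustration on a local filesystem, not Spark or Hadoop code, and
moveViaCopyDelete is a hypothetical helper name:

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

// S3 has no rename operation, so a "move" must be emulated as a full
// copy of every byte followed by a delete of the original object.
// On large outputs this is what makes the commit step slow.
def moveViaCopyDelete(src: Path, dst: Path): Unit = {
  Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING) // copy all bytes
  Files.delete(src)                                         // then remove source
}
```

On HDFS or a local filesystem the same move is a single atomic metadata
operation, which is why the commit step is cheap there.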

I've overcome this by using a JobConf with a NullOutputCommitter:
      jobConf.setOutputCommitter(classOf[NullOutputCommitter])

where NullOutputCommitter is a class that doesn't do anything:

  import org.apache.hadoop.mapred.{JobContext, OutputCommitter, TaskAttemptContext}

  // Skips the commit phase entirely, so task output stays where it was
  // written and the slow S3 copy-and-delete never happens.
  class NullOutputCommitter extends OutputCommitter {
    def abortTask(taskContext: TaskAttemptContext): Unit = { }
    override def cleanupJob(jobContext: JobContext): Unit = { }
    def commitTask(taskContext: TaskAttemptContext): Unit = { }
    def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
    def setupJob(jobContext: JobContext): Unit = { }
    def setupTask(taskContext: TaskAttemptContext): Unit = { }
  }

This works, but maybe someone has a better solution.

/Raf

anny9699 wrote:
> Hi,
>
> I found writing output back to s3 using rdd.saveAsTextFile() is extremely
> slow, much slower than reading from s3. Is there a way to make it faster?
> The rdd has 150 partitions so parallelism is enough I assume.
>
> Thanks a lot!
> Anny
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-output-to-s3-extremely-slow-tp16447.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
