Hi Rafal,

Thanks for the explanation and solution! I need to write maybe 100 GB to S3. I will try your way and see whether it works for me.
Thanks again!

On Wed, Oct 15, 2014 at 1:44 AM, Rafal Kwasny <[email protected]> wrote:

> Hi,
> How large is the dataset you're saving into S3?
>
> Actually saving to S3 is done in two steps:
> 1) writing temporary files
> 2) committing them to the proper directory
> Step 2) could be slow because S3 does not have a quick atomic "move"
> operation; you have to copy (server side, but it still takes time) and then
> delete the original.
>
> I've overcome this by using a jobconf with NullOutputCommitter:
> jobConf.setOutputCommitter(classOf[NullOutputCommitter])
>
> Where NullOutputCommitter is a class that doesn't do anything:
>
> import org.apache.hadoop.mapred.{JobContext, OutputCommitter, TaskAttemptContext}
>
> class NullOutputCommitter extends OutputCommitter {
>   def abortTask(taskContext: TaskAttemptContext): Unit = { }
>   override def cleanupJob(jobContext: JobContext): Unit = { }
>   def commitTask(taskContext: TaskAttemptContext): Unit = { }
>   // Returning false tells the framework to skip the per-task commit step.
>   def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
>   def setupJob(jobContext: JobContext): Unit = { }
>   def setupTask(taskContext: TaskAttemptContext): Unit = { }
> }
>
> This works but maybe someone has a better solution.
>
> /Raf
>
> anny9699 wrote:
> > Hi,
> >
> > I found writing output back to s3 using rdd.saveAsTextFile() is extremely
> > slow, much slower than reading from s3. Is there a way to make it faster?
> > The rdd has 150 partitions so parallelism is enough I assume.
> >
> > Thanks a lot!
> > Anny
> >
> > --
> > View this message in context:
> > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-output-to-s3-extremely-slow-tp16447.html
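
For reference, here is a minimal sketch of how the committer quoted above might be wired into an actual save. rdd.saveAsTextFile() does not take a JobConf, so this assumes the old-API route through saveAsHadoopDataset. The object name S3DirectSave, the helper saveWithoutCommit, and the output path argument are hypothetical names used only for illustration. NullOutputCommitter is repeated so the snippet stands alone.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapred.{FileOutputFormat, JobConf, JobContext,
  OutputCommitter, TaskAttemptContext, TextOutputFormat}
import org.apache.spark.SparkContext._ // pair-RDD implicits on Spark 1.x
import org.apache.spark.rdd.RDD

// Same no-op committer as in the quoted message.
class NullOutputCommitter extends OutputCommitter {
  def abortTask(taskContext: TaskAttemptContext): Unit = { }
  override def cleanupJob(jobContext: JobContext): Unit = { }
  def commitTask(taskContext: TaskAttemptContext): Unit = { }
  def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
  def setupJob(jobContext: JobContext): Unit = { }
  def setupTask(taskContext: TaskAttemptContext): Unit = { }
}

object S3DirectSave {
  // Hypothetical helper: writes one line of text per record to outputPath
  // without the temporary-directory commit step.
  def saveWithoutCommit(rdd: RDD[String], outputPath: String): Unit = {
    val jobConf = new JobConf(rdd.context.hadoopConfiguration)
    jobConf.setOutputCommitter(classOf[NullOutputCommitter])
    jobConf.setOutputFormat(classOf[TextOutputFormat[NullWritable, Text]])
    jobConf.setOutputKeyClass(classOf[NullWritable])
    jobConf.setOutputValueClass(classOf[Text])
    FileOutputFormat.setOutputPath(jobConf, new Path(outputPath))

    // A NullWritable key makes TextOutputFormat emit only the value.
    rdd.map(line => (NullWritable.get(), new Text(line)))
      .saveAsHadoopDataset(jobConf)
  }
}
```

One caveat, as far as I understand it: since nothing is committed or cleaned up, files left behind by failed or speculative task attempts would stay in the output directory, so this is probably safest with speculation turned off.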
