Re: Spark output to s3 extremely slow

Anny Chen Thu, 16 Oct 2014 09:52:40 -0700

Hi Rafal,

Thanks for the explanation and solution! I need to write roughly 100 GB to
S3. I will try your approach and see whether it works for me.


Thanks again!

On Wed, Oct 15, 2014 at 1:44 AM, Rafal Kwasny wrote:

> Hi,
> How large is the dataset you're saving into S3?
>
> Saving to S3 is actually done in two steps:
> 1) writing temporary files
> 2) committing them to the proper directory
> Step 2) can be slow because S3 does not have a quick atomic "move"
> operation: you have to copy (server-side, but it still takes time) and then
> delete the original.
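
To make step 2) concrete, here is a rough sketch of what a commit amounts to at the filesystem level, written against the Hadoop FileSystem API (Hadoop 2+ for `isFile`). It is a simplified stand-in, not the actual FileOutputCommitter code; `CommitSketch`, `commit`, `tempDir` and `finalDir` are names made up for illustration.

```scala
import java.io.IOException

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper, for illustration only.
object CommitSketch {
  // Move every file from a temporary directory into the final output directory.
  def commit(tempDir: Path, finalDir: Path, conf: Configuration): Unit = {
    val fs = FileSystem.get(tempDir.toUri, conf)
    fs.mkdirs(finalDir)
    for (status <- fs.listStatus(tempDir) if status.isFile) {
      val target = new Path(finalDir, status.getPath.getName)
      // On HDFS a rename is a cheap metadata update. On S3 the connector has
      // to copy the object and then delete the original, so the cost grows
      // with the amount of data written.
      if (!fs.rename(status.getPath, target))
        throw new IOException(s"could not move ${status.getPath} to $target")
    }
    // Remove the now-empty temporary directory.
    fs.delete(tempDir, true)
  }
}
```

With around 100 GB of output, every byte written in step 1) would be copied once more inside S3 during this phase, which is consistent with the slowdown described above.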
>
> I've overcome this by using a JobConf with a NullOutputCommitter:
>       jobConf.setOutputCommitter(classOf[NullOutputCommitter])
>
> where NullOutputCommitter is a class that doesn't do anything:
>
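>   import org.apache.hadoop.mapred.{JobContext, OutputCommitter, TaskAttemptContext}
>
>   // Old-API (mapred) committer, the kind JobConf.setOutputCommitter accepts.
>   // Every hook is a no-op.
>   class NullOutputCommitter extends OutputCommitter {
>     override def setupJob(jobContext: JobContext): Unit = { }
>     override def cleanupJob(jobContext: JobContext): Unit = { }
>     override def setupTask(taskContext: TaskAttemptContext): Unit = { }
>     // false: there is never any task output to commit
>     override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
>     override def commitTask(taskContext: TaskAttemptContext): Unit = { }
>     override def abortTask(taskContext: TaskAttemptContext): Unit = { }
>   }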
>
> This works but maybe someone has a better solution.
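
For reference, a minimal sketch of how such a committer might be wired into a save call. rdd.saveAsTextFile() does not take a JobConf, so this goes through saveAsHadoopFile instead. The bucket path, app name and sample data are placeholders, and the committer is repeated only to keep the snippet self-contained.

```scala
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapred.{JobConf, JobContext, OutputCommitter}
import org.apache.hadoop.mapred.{TaskAttemptContext, TextOutputFormat}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits on Spark 1.x

// Same no-op committer as in the message above.
class NullOutputCommitter extends OutputCommitter {
  override def setupJob(jobContext: JobContext): Unit = { }
  override def cleanupJob(jobContext: JobContext): Unit = { }
  override def setupTask(taskContext: TaskAttemptContext): Unit = { }
  override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
  override def commitTask(taskContext: TaskAttemptContext): Unit = { }
  override def abortTask(taskContext: TaskAttemptContext): Unit = { }
}

object SaveWithNullCommitter {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("null-committer-sketch"))
    val rdd = sc.parallelize(Seq("a", "b", "c"), 3) // stand-in for the real RDD

    // Start from Spark's Hadoop settings (S3 credentials etc.) and swap the committer.
    val jobConf = new JobConf(sc.hadoopConfiguration)
    jobConf.setOutputCommitter(classOf[NullOutputCommitter])

    // Same key/value shape that saveAsTextFile uses: (NullWritable, Text).
    rdd.map(line => (NullWritable.get(), new Text(line)))
      .saveAsHadoopFile(
        "s3n://my-bucket/output", // placeholder path, not from the thread
        classOf[NullWritable],
        classOf[Text],
        classOf[TextOutputFormat[NullWritable, Text]],
        jobConf)

    sc.stop()
  }
}
```

One trade-off to keep in mind: with no commit step, a failed or speculative task attempt may leave partial or duplicate files in the output directory, so it is probably safest to leave spark.speculation off when writing this way.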
>
> /Raf
>
> anny9699 wrote:
> > Hi,
> >
> > I found writing output back to s3 using rdd.saveAsTextFile() is extremely
> > slow, much slower than reading from s3. Is there a way to make it faster?
> > The rdd has 150 partitions so parallelism is enough I assume.
> >
> > Thanks a lot!
> > Anny
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-output-to-s3-extremely-slow-tp16447.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
