I'm interested in this topic as well. Why is the DirectFileOutputCommitter not
included?

We added it in our fork, under
core/src/main/scala/org/apache/spark/mapred/DirectOutputCommitter.scala
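For reference, a committer along these lines can be a very small class. This is only a minimal sketch, with the package and class name assumed from the path above; the actual implementation in our fork may differ:

```scala
package org.apache.spark.mapred

import org.apache.hadoop.mapred.{JobContext, OutputCommitter, TaskAttemptContext}

// Sketch of a "direct" committer for the old mapred API: tasks write
// straight to the final destination path, so there is no _temporary
// directory and nothing to rename or copy at commit time.
// All commit hooks are therefore no-ops.
class DirectOutputCommitter extends OutputCommitter {
  override def setupJob(jobContext: JobContext): Unit = ()
  override def setupTask(taskContext: TaskAttemptContext): Unit = ()
  // No task-level commit is needed: output already sits at its final path.
  override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
  override def commitTask(taskContext: TaskAttemptContext): Unit = ()
  override def abortTask(taskContext: TaskAttemptContext): Unit = ()
}
```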

Moreover, this DirectFileOutputCommitter does not work for insert
operations in HiveContext, since the committer is invoked by Hive (i.e.,
through the dependencies in the hive package).

We made a small hack to fix this; you can take a look:
https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando

It may give other Spark contributors some ideas for a better way to
use S3.
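For anyone who wants to experiment with the RDD save APIs, the old-API committer class can be set on the Hadoop configuration. A small usage sketch, assuming a running SparkContext `sc`, a hypothetical bucket name, and the committer class name from the fork path above:

```scala
// Sketch: point the old mapred API at the direct committer before saving.
// "mapred.output.committer.class" is the old-API Hadoop property; the
// committer class name below is assumed from the fork path.
sc.hadoopConfiguration.set(
  "mapred.output.committer.class",
  "org.apache.spark.mapred.DirectOutputCommitter")

// Output is now written directly to the destination, with no _temporary
// directory and no copy step at commit time. Bucket name is hypothetical.
sc.parallelize(1 to 10).saveAsTextFile("s3n://my-bucket/output")
```

Note the usual caveat with direct committers: without a _temporary staging step, a partially failed job can leave incomplete output at the destination, which is exactly the failure-semantics question raised below.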


2016-02-22 23:18 GMT+01:00 igor.berman <igor.ber...@gmail.com>:

> Hi,
> Wanted to understand if anybody uses DirectFileOutputCommitter or the like,
> especially when working with S3?
> I know that there is one implementation in the Spark distro for the Parquet
> format, but not for plain files. Why?
>
> Imho, it can bring huge performance boost.
> Using the default FileOutputCommitter with S3 has big overhead at the commit
> stage, when all parts are copied one by one from _temporary to the
> destination dir, which is a bottleneck when the number of partitions is high.
>
> Also, wanted to know if there are some problems when using
> DirectFileOutputCommitter?
> If a task writing one partition directly fails in the middle, will Spark
> notice this and fail the job (say, after all retries)?
>
> thanks in advance
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/DirectFileOutputCommiter-tp26296.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
