Hi,

Great work! What is the concrete performance gain of the committer on S3? I'd like to know.
I think there is no direct committer for files because this kind of committer risks losing data (see SPARK-10063). Until that issue is resolved, it seems to me that files cannot support direct commits.

Thanks,

On Fri, Feb 26, 2016 at 8:39 AM, Teng Qiu <teng...@gmail.com> wrote:

> Yes, it should be this one:
> https://gist.github.com/aarondav/c513916e72101bbe14ec
>
> Then you need to set it in spark-defaults.conf:
> https://github.com/zalando/spark/commit/3473f3f1ef27830813c1e0b3686e96a55f49269c#diff-f7a46208be9e80252614369be6617d65R13
>
> On Friday, 26 February 2016, Yin Yang wrote:
> > The header of DirectOutputCommitter.scala says Databricks.
> > Did you get it from Databricks?
> >
> > On Thu, Feb 25, 2016 at 3:01 PM, Teng Qiu <teng...@gmail.com> wrote:
> >>
> >> I am interested in this topic as well. Why is the DirectFileOutputCommitter not included?
> >> We added it in our fork, under core/src/main/scala/org/apache/spark/mapred/DirectOutputCommitter.scala
> >> Moreover, this DirectFileOutputCommitter does not work for insert operations in HiveContext, since the committer is called by Hive (that is, it uses dependencies in the hive package).
> >> We made some hacks to fix this; you can take a look:
> >> https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando
> >>
> >> It may give other Spark contributors some ideas for finding a better way to use S3.
> >>
> >> 2016-02-22 23:18 GMT+01:00 igor.berman <igor.ber...@gmail.com>:
> >>>
> >>> Hi,
> >>> I wanted to understand whether anybody uses DirectFileOutputCommitter or the like,
> >>> especially when working with S3.
> >>> I know that there is one implementation in the Spark distribution for the Parquet format, but not
> >>> for files. Why?
> >>>
> >>> IMHO, it can bring a huge performance boost.
> >>> Using the default FileOutputCommitter with S3 has a big overhead at the commit stage,
> >>> when all parts are copied one by one from _temporary to the destination directory,
> >>> which is a bottleneck when the number of partitions is high.
> >>>
> >>> Also, I wanted to know whether there are any problems with using
> >>> DirectFileOutputCommitter.
> >>> If writing one partition directly fails in the middle, will Spark
> >>> notice this and fail the job (say, after all retries)?
> >>>
> >>> Thanks in advance
> >>>
> >>> --
> >>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/DirectFileOutputCommiter-tp26296.html
> >>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >>
> >

--
---
Takeshi Yamamuro
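
To make the trade-off discussed in this thread concrete, here is a toy Python sketch. It is not Spark's or Hadoop's code. It models on a local filesystem what happens when a task fails partway through writing a part file, once with a direct write into the destination and once with a write under _temporary followed by a move. All function names are hypothetical. I am also assuming that partial output left behind by a failed attempt is one of the risks meant by the SPARK-10063 reference; the ticket itself is the authority on that.

```python
import os
import shutil
import tempfile


def write_part(path, records, fail_after=None):
    """Write records to path; raise partway if fail_after is set (simulated task failure)."""
    with open(path, "w") as f:
        for i, rec in enumerate(records):
            if fail_after is not None and i == fail_after:
                raise RuntimeError("simulated task failure")
            f.write(rec + "\n")


def run_task_direct(out_dir, name, records, fail_after=None):
    """Direct commit (toy model): the task writes straight into the destination directory."""
    write_part(os.path.join(out_dir, name), records, fail_after)


def run_task_with_rename(out_dir, name, records, fail_after=None):
    """Rename-based commit (toy model): write under _temporary, move into place on success."""
    tmp_dir = os.path.join(out_dir, "_temporary")
    os.makedirs(tmp_dir, exist_ok=True)
    tmp_path = os.path.join(tmp_dir, name)
    try:
        write_part(tmp_path, records, fail_after)
    except RuntimeError:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # abort: discard the partial attempt
        raise
    # Commit. On a local disk this is a cheap rename; the thread describes
    # the equivalent step on S3 as a one-by-one copy of every part.
    os.rename(tmp_path, os.path.join(out_dir, name))


def visible_parts(out_dir):
    """Part files that a downstream reader of out_dir would see."""
    return sorted(n for n in os.listdir(out_dir) if n.startswith("part-"))


def demo():
    """Fail one task in each mode and report which part files remain visible."""
    records = ["a", "b", "c"]
    results = {}
    for label, task in (("direct", run_task_direct), ("rename", run_task_with_rename)):
        out_dir = tempfile.mkdtemp()
        try:
            try:
                task(out_dir, "part-00000", records, fail_after=2)
            except RuntimeError:
                pass  # the job would normally retry or fail here
            results[label] = visible_parts(out_dir)
        finally:
            shutil.rmtree(out_dir)
    return results
```

Calling demo() returns {'direct': ['part-00000'], 'rename': []}: in this model the failed direct write leaves a truncated part file in the destination, while the rename-based path leaves nothing visible. That safety is what the extra commit step buys, and on S3 it is paid for with the copy overhead described in the original question.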