Re: DirectFileOutputCommiter

Takeshi Yamamuro Thu, 25 Feb 2016 22:33:40 -0800

Hi,

Great work!
What is the concrete performance gain of the committer on s3?
I'd like to know.


I think there is no direct committer for files because these kinds of
committer has risks
to loss data (See: SPARK-10063).
Until this resolved, ISTM files cannot support direct commits.

thanks,



On Fri, Feb 26, 2016 at 8:39 AM, Teng Qiu <[email protected]> wrote:

> yes, should be this one
> https://gist.github.com/aarondav/c513916e72101bbe14ec
>
> then need to set it in spark-defaults.conf :
> https://github.com/zalando/spark/commit/3473f3f1ef27830813c1e0b3686e96a55f49269c#diff-f7a46208be9e80252614369be6617d65R13
>
> Am Freitag, 26. Februar 2016 schrieb Yin Yang :
> > The header of DirectOutputCommitter.scala says Databricks.
> > Did you get it from Databricks ?
> > On Thu, Feb 25, 2016 at 3:01 PM, Teng Qiu <[email protected]> wrote:
> >>
> >> interesting in this topic as well, why the DirectFileOutputCommitter
> not included?
> >> we added it in our fork,
> under core/src/main/scala/org/apache/spark/mapred/DirectOutputCommitter.scala
> >> moreover, this DirectFileOutputCommitter is not working for the insert
> operations in HiveContext, since the Committer is called by hive (means
> uses dependencies in hive package)
> >> we made some hack to fix this, you can take a look:
> >>
> https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando
> >>
> >> may bring some ideas to other spark contributors to find a better way
> to use s3.
> >>
> >> 2016-02-22 23:18 GMT+01:00 igor.berman <[email protected]>:
> >>>
> >>> Hi,
> >>> Wanted to understand if anybody uses DirectFileOutputCommitter or
> alikes
> >>> especially when working with s3?
> >>> I know that there is one impl in spark distro for parquet format, but
> not
> >>> for files -  why?
> >>>
> >>> Imho, it can bring huge performance boost.
> >>> Using default FileOutputCommiter with s3 has big overhead at commit
> stage
> >>> when all parts are copied one-by-one to destination dir from
> _temporary,
> >>> which is bottleneck when number of partitions is high.
> >>>
> >>> Also, wanted to know if there are some problems when using
> >>> DirectFileOutputCommitter?
> >>> If writing one partition directly will fail in the middle is spark will
> >>> notice this and will fail job(say after all retries)?
> >>>
> >>> thanks in advance
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/DirectFileOutputCommiter-tp26296.html
> >>> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: [email protected]
> >>> For additional commands, e-mail: [email protected]
> >>>
> >>
> >
> >
>



-- 
---
Takeshi Yamamuro

Re: DirectFileOutputCommiter

Reply via email to