Re: DirectFileOutputCommiter

Teng Qiu Fri, 26 Feb 2016 02:44:55 -0800

Hi, thanks :) The performance gain is huge. We have an INSERT INTO query
that writes about 30 GB of JSON to S3 at the end. Without
DirectOutputCommitter and our hack in Hive and InsertIntoHiveTable.scala
it took more than 40 min; with our changes it takes only 15 min.
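
A minimal sketch of where the extra time goes, assuming a toy in-memory
object store (ToyObjectStore, rename_based_commit and direct_commit are
hypothetical names, not Spark or Hadoop APIs). S3 has no real rename, so a
rename is a copy followed by a delete; the rename-based commit therefore
moves every byte a second time, while a direct commit writes each part once.
The real FileOutputCommitter uses more nested temporary paths than modeled
here.

```python
from dataclasses import dataclass, field


@dataclass
class ToyObjectStore:
    """Hypothetical stand-in for S3: rename is copy + delete, never atomic."""
    objects: dict = field(default_factory=dict)  # key -> size in bytes
    bytes_written: int = 0  # bytes uploaded by tasks
    bytes_copied: int = 0   # bytes moved again by renames

    def put(self, key, size):
        self.objects[key] = size
        self.bytes_written += size

    def rename(self, src, dst):
        # Copy the object to the new key, then delete the old key.
        size = self.objects.pop(src)
        self.objects[dst] = size
        self.bytes_copied += size


def rename_based_commit(store, table, part_sizes):
    """Tasks write under _temporary; the commit renames parts one by one."""
    for i, size in enumerate(part_sizes):
        store.put(f"{table}/_temporary/part-{i:05d}", size)
    for i in range(len(part_sizes)):
        store.rename(f"{table}/_temporary/part-{i:05d}",
                     f"{table}/part-{i:05d}")


def direct_commit(store, table, part_sizes):
    """Tasks write straight to the final location; nothing is renamed."""
    for i, size in enumerate(part_sizes):
        store.put(f"{table}/part-{i:05d}", size)


if __name__ == "__main__":
    sizes = [100] * 300  # 300 parts of 100 bytes each (made-up numbers)
    slow, fast = ToyObjectStore(), ToyObjectStore()
    rename_based_commit(slow, "s3://bucket/table", sizes)
    direct_commit(fast, "s3://bucket/table", sizes)
    print("copied by rename-based commit:", slow.bytes_copied)
    print("copied by direct commit:", fast.bytes_copied)
```

Both variants end with the same objects in the table directory; the model
does not cover the failure cases (partial output left behind by a failed
task) that make direct commits risky.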


DirectOutputCommitter works for SparkContext and SQLContext, but for
HiveContext it only solves the problem of the "staging folder" in the
target table. The remaining problem for HiveContext is here:
https://github.com/apache/spark/blob/v1.6.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L132

Besides the staging folder created by the committer, Hive uses a temp
location as well, so we hacked around this:
https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando#diff-d579db9a8f27e0bbef37720ab14ec3f6R134

The main idea: we added an internal var runID and use the HiveConf setting
spark.hive.insert.skip.temp to stop Hive from using the temp location. With
this hack we also need to change Hive's implementation, so we put our own
Hive.java file under
sql/hive/src/main/java/org/apache/hadoop/hive/ql/metadata/Hive.java
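
As a rough illustration of the flag only (not the actual Scala/Java change
in the fork): the sketch below picks the write location from the
spark.hive.insert.skip.temp setting. The function name, the plain dict used
as configuration, and the default of "false" are assumptions made for this
example; the role of runID and the Hive.java changes are not modeled.

```python
# Flag name taken from this thread; everything else here is hypothetical.
SKIP_TEMP_KEY = "spark.hive.insert.skip.temp"


def choose_output_location(table_location, tmp_location, conf):
    """Return the path the insert job should write its files to.

    conf is a plain dict standing in for HiveConf (an assumption).
    When the flag is "true", files go straight to the table location,
    so no later move out of the temp location is needed.
    """
    skip_temp = conf.get(SKIP_TEMP_KEY, "false").lower() == "true"
    return table_location if skip_temp else tmp_location


if __name__ == "__main__":
    table = "s3://bucket/warehouse/my_table"      # made-up paths
    tmp = "s3://bucket/tmp/hive-staging/my_table"
    print(choose_output_location(table, tmp, {SKIP_TEMP_KEY: "true"}))
    print(choose_output_location(table, tmp, {}))
```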

You can find the full change at this link:
https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando#diff-d579db9a8f27e0bbef37720ab14ec3f6R134


I would like to forward this discussion to spark-dev. I hope the Spark team
can think about it and that there will be a better solution for this, like
a more official hack :D


2016-02-26 7:24 GMT+01:00 Takeshi Yamamuro <linguin....@gmail.com>:

> Hi,
>
> Great work!
> What is the concrete performance gain of the committer on s3?
> I'd like to know.
>
> I think there is no direct committer for files because these kinds of
> committers risk losing data (see SPARK-10063).
> Until this is resolved, ISTM files cannot support direct commits.
>
> thanks,
>
>
>
> On Fri, Feb 26, 2016 at 8:39 AM, Teng Qiu <teng...@gmail.com> wrote:
>
>> yes, should be this one
>> https://gist.github.com/aarondav/c513916e72101bbe14ec
>>
>> then you need to set it in spark-defaults.conf:
>> https://github.com/zalando/spark/commit/3473f3f1ef27830813c1e0b3686e96a55f49269c#diff-f7a46208be9e80252614369be6617d65R13
>>
>> On Friday, 26 February 2016, Yin Yang wrote:
>> > The header of DirectOutputCommitter.scala says Databricks.
>> > Did you get it from Databricks ?
>> > On Thu, Feb 25, 2016 at 3:01 PM, Teng Qiu <teng...@gmail.com> wrote:
>> >>
>> >> interested in this topic as well; why is the DirectFileOutputCommitter
>> >> not included?
>> >> we added it in our fork, under
>> >> core/src/main/scala/org/apache/spark/mapred/DirectOutputCommitter.scala
>> >> moreover, this DirectFileOutputCommitter does not work for the insert
>> >> operations in HiveContext, since the committer is called by Hive (which
>> >> means it uses dependencies in the hive package)
>> >> we made some hacks to fix this, you can take a look:
>> >>
>> >> https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando
>> >>
>> >> this may give other Spark contributors some ideas for finding a better
>> >> way to use s3.
>> >>
>> >> 2016-02-22 23:18 GMT+01:00 igor.berman <igor.ber...@gmail.com>:
>> >>>
>> >>> Hi,
>> >>> Wanted to understand if anybody uses DirectFileOutputCommitter or
>> >>> the like, especially when working with s3?
>> >>> I know that there is one impl in the spark distro for the parquet
>> >>> format, but not for files - why?
>> >>>
>> >>> Imho, it can bring a huge performance boost.
>> >>> Using the default FileOutputCommitter with s3 has a big overhead at
>> >>> the commit stage, when all parts are copied one by one from _temporary
>> >>> to the destination dir, which is a bottleneck when the number of
>> >>> partitions is high.
>> >>>
>> >>> Also, wanted to know if there are any problems when using
>> >>> DirectFileOutputCommitter?
>> >>> If writing one partition directly fails in the middle, will spark
>> >>> notice this and fail the job (say, after all retries)?
>> >>>
>> >>> thanks in advance
>> >>>
>> >>>
>> >>>
>> >>>
>> >>
>> >
>> >
>>
>
>
>
> --
> ---
> Takeshi Yamamuro
>
