Hi Michael,

Thanks for the hint! So if I turn off speculation, consecutive appends like
the ones above will not produce temporary files, right?
Which class is responsible for disabling the use of DirectOutputCommitter?
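
For reference, the scenario in question can be sketched roughly as below. This
is only an illustration under assumptions: the bucket path and object name are
hypothetical, the committer class name is copied verbatim from the original
message, and setting the key on the Hadoop configuration is an assumption about
how Spark 1.5.x picks it up. It is meant to be launched with spark-submit,
which supplies the master.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

object AppendTwice {
  def main(args: Array[String]): Unit = {
    // Hypothetical output location; replace with a real bucket and prefix.
    val outputPath = "s3n://example-bucket/tmp/append-test"

    val conf = new SparkConf()
      .setAppName("append-twice")
      .set("spark.speculation", "false") // speculation explicitly turned off
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Assumption: the committer key is read from the Hadoop configuration.
    // The class name is the one quoted in the original message.
    sc.hadoopConfiguration.set(
      "spark.sql.parquet.output.committer.class",
      "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

    // Small placeholder DataFrame with a single "id" column.
    val df = sqlContext.range(0L, 1000L)

    // First append: the output directory does not exist yet.
    df.write.mode(SaveMode.Append).parquet(outputPath)
    // Second append: the output directory now already contains data.
    df.write.mode(SaveMode.Append).parquet(outputPath)

    sc.stop()
  }
}
```

Listing the output prefix after each write should show whether a _temporary
directory appears for the second append.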

Thank you,

Jerry


On Tue, Jan 12, 2016 at 4:12 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> There can be data loss when you are using the DirectOutputCommitter and
> speculation is turned on, so we disable the committer automatically.
>
> On Tue, Jan 12, 2016 at 1:11 PM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi spark users and developers,
>>
>> I wonder whether the following observed behaviour is expected. I'm writing a
>> DataFrame to Parquet on S3, using append mode. Since I'm using
>> org.apache.spark.sql.parquet.DirectParquetOutputCommitter as the
>> spark.sql.parquet.output.committer.class, I expected that no _temporary
>> files would be generated.
>>
>> I appended the same DataFrame twice to the same directory. The first
>> "append" works as expected: no _temporary files are generated, because of
>> the DirectParquetOutputCommitter. The second "append", however, does
>> generate _temporary files and then moves the files under _temporary to the
>> output directory.
>>
>> Is this behavior expected? Or is it a bug?
>>
>> I'm using Spark 1.5.2.
>>
>> Best Regards,
>>
>> Jerry
>>
>
>
