Data loss can occur when the DirectOutputCommitter is used with speculation turned on, so we automatically disable the direct committer in that case.
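
If you want to check whether speculation is the factor in your job, a minimal sketch along these lines might help. It assumes Spark 1.5.x, reuses the committer class name from your message, and uses a hypothetical output path (s3n://some-bucket/some/path) and a toy dataframe that you would replace with your own. It only pins spark.speculation to false and prints the effective settings before appending twice; it is not a guarantee that the second append will skip _temporary.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

object DirectCommitterCheck {
  def main(args: Array[String]): Unit = {
    // Explicitly turn speculation off so it cannot be the reason
    // the direct committer is skipped.
    val conf = new SparkConf()
      .setAppName("direct-committer-check")
      .set("spark.speculation", "false")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Committer class name taken from the original message.
    sqlContext.setConf(
      "spark.sql.parquet.output.committer.class",
      "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

    // Print the effective settings so they can be compared with the job logs.
    println("spark.speculation = " +
      sc.getConf.getBoolean("spark.speculation", false))
    println("committer class   = " +
      sqlContext.getConf("spark.sql.parquet.output.committer.class"))

    // Hypothetical path and toy data; replace with your own.
    val outputPath = "s3n://some-bucket/some/path"
    val df = sqlContext.range(0, 10)

    // Append the same dataframe twice, as in the original report.
    df.write.mode(SaveMode.Append).parquet(outputPath)
    df.write.mode(SaveMode.Append).parquet(outputPath)

    sc.stop()
  }
}
```

Submit it with spark-submit as usual (the master is left to the submit command) and compare the output directory listing after each append.
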
On Tue, Jan 12, 2016 at 1:11 PM, Jerry Lam <chiling...@gmail.com> wrote:

> Hi spark users and developers,
>
> I wonder if the following observed behaviour is expected. I'm writing
> dataframe to parquet into s3. I'm using append mode when I'm writing to it.
> Since I'm using org.apache.spark.sql.parquet.DirectParquetOutputCommitter
> as the spark.sql.parquet.output.committer.class, I expected that no
> _temporary files will be generated.
>
> I appended the same dataframe twice to the same directory. The first
> "append" works as expected; no _temporary files are generated because of
> the DirectParquetOutputCommitter but the second "append" does generate
> _temporary files and then it moved the files under the _temporary to the
> output directory.
>
> Is this behavior expected? Or is it a bug?
>
> I'm using Spark 1.5.2.
>
> Best Regards,
>
> Jerry