Hi, I have read the Spark 2.2 documentation on Structured Streaming, which says:

> Append mode (default) - This is the default mode, where only the new rows added to the Result Table since the last trigger will be outputted to the sink. This is supported for only those queries where rows added to the Result Table is never going to change. Hence, this mode guarantees that each row will be output only once (assuming fault-tolerant sink). For example, **queries with only select, where, map, flatMap, filter, join, etc. will support Append mode.**
So I tried to write a streaming DataFrame to HDFS with the sample code below, but I get many small files in the target output path:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.{OutputMode, StreamingQuery}

val df = spark.readStream
  .option("sep", ",")
  .option("header", true)
  .option("quote", "\"")
  .csv(inputpath)

// I originally wrote `val flow: DataFrame => DataFrame = df.select("name")`,
// which does not compile; expressed as a function here.
// I also tried df.withColumn(xxxx) instead of select.
val flow: DataFrame => DataFrame = _.select("name")
val Data: DataFrame = flow(df)

val query: StreamingQuery = Data.writeStream
  .format("csv")
  .option("header", "true")
  .option("format", "append")
  .option("path", output)
  .option("checkpointLocation", "/tmp/checkout")
  .outputMode(OutputMode.Append())
  .start()

query.processAllAvailable()
```

I noticed there were 4 executors in the Mesos web UI for the duration of the job.

My questions are generic:

1. Is this a bug in Append mode? I mean, why doesn't append mode write all records to one file?
2. Is there any way to write all records to one file, other than running `hadoop getmerge` afterwards or using `Data.coalesce(1).writeStream.xx`, which performs poorly because it repartitions everything down to 1 partition just to generate 1 output file?
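For clarity, this is roughly the `coalesce(1)` variant I was referring to in question 2 (a sketch only; the separate checkpoint directory is just an illustrative placeholder, and as I understand it this still produces one file per trigger while forcing the whole write through a single task):

```scala
// Sketch of the coalesce(1) workaround: collapse to a single partition so each
// micro-batch writes a single part file. The write then runs in one task, so
// it does not scale with the 4 executors.
val singleFileQuery: StreamingQuery = Data
  .coalesce(1)
  .writeStream
  .format("csv")
  .option("header", "true")
  .option("path", output)
  .option("checkpointLocation", "/tmp/checkout-coalesce") // placeholder path
  .outputMode(OutputMode.Append())
  .start()
```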