GitHub user uncleGen opened a pull request: https://github.com/apache/spark/pull/17395
[SPARK-20065][SS] Avoid to output empty parquet files

## Problem Description

Reported by Silvio Fiorito

I've got a Kafka topic which I'm querying, running a windowed aggregation with a 30-second watermark and a 10-second trigger, writing out to Parquet in append output mode.

Every 10-second trigger generates a file, regardless of whether there was any data for that trigger or whether any records were actually finalized by the watermark.

Is this expected behavior, or should it not write out these empty files?

```
val df = spark.readStream.format("kafka")....

val query = df
  .withWatermark("timestamp", "30 seconds")
  .groupBy(window($"timestamp", "10 seconds"))
  .count()
  .select(date_format($"window.start", "HH:mm:ss").as("time"), $"count")

query
  .writeStream
  .format("parquet")
  .option("checkpointLocation", aggChk)
  .trigger(ProcessingTime("10 seconds"))
  .outputMode("append")
  .start(aggPath)
```

As the query executes, do a file listing on "aggPath" and you'll see 339-byte files at a minimum until we arrive at the first watermark and the initial batch is finalized. Even after that, though, empty batches keep generating an empty file on every trigger.

## What changes were proposed in this pull request?

Check whether each partition is empty, and skip empty partitions to avoid writing empty files.

## How was this patch tested?

Jenkins

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/uncleGen/spark SPARK-20065

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17395.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17395

----
commit 86a7d2fa96e3134c1e64864eba81a3bebdedceea
Author: uncleGen <husty...@gmail.com>
Date:   2017-03-23T08:10:31Z

    avoid to output empty parquet files

----
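
For reviewers, the idea described under "What changes were proposed in this pull request?" can be sketched roughly as follows. This is a minimal illustration in plain Scala and is not the code in this patch: `SkipEmptyPartitions` and `writePartition` are hypothetical names, rows are modelled as plain strings, and a text file stands in for a Parquet file. The only point it makes is that the output file is opened after the partition's iterator is known to be non-empty.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Hypothetical helper, not Spark's writer API: writes one partition's rows to
// `target`, but only creates the file if the partition has at least one row.
object SkipEmptyPartitions {
  def writePartition(rows: Iterator[String], target: Path): Option[Path] = {
    if (!rows.hasNext) {
      // Empty partition: return before any file is created.
      None
    } else {
      val writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)
      try {
        rows.foreach { row =>
          writer.write(row)
          writer.newLine()
        }
      } finally {
        writer.close()
      }
      Some(target)
    }
  }
}
```

Under these assumptions, an empty trigger leaves no file behind for that partition, whereas opening the writer first and checking afterwards would still leave a header-only file on disk.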