[GitHub] spark pull request #17395: [SPARK-20065][SS] Avoid to output empty parquet f...

uncleGen Thu, 23 Mar 2017 01:22:39 -0700

GitHub user uncleGen opened a pull request:

    https://github.com/apache/spark/pull/17395


    [SPARK-20065][SS] Avoid to output empty parquet files

    ## Problem Description
    
    Reported by Silvio Fiorito
    
    I have a Kafka topic that I'm querying with a windowed aggregation, using a
    30-second watermark and a 10-second trigger, and writing the results to
    Parquet in append output mode.
    
    Every 10-second trigger generates a file, regardless of whether there was
    any data for that trigger or whether any records were actually finalized by
    the watermark.
    
    Is this expected behavior or should it not write out these empty files?
    
    ```
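    // Imports needed by the snippet; `spark` is an existing SparkSession.
    import org.apache.spark.sql.functions.{date_format, window}
    import org.apache.spark.sql.streaming.ProcessingTime
    import spark.implicits._

    val df = spark.readStream.format("kafka")....
    
    // Count records per 10-second event-time window, with a 30-second watermark.
    val query = df
      .withWatermark("timestamp", "30 seconds")
      .groupBy(window($"timestamp", "10 seconds"))
      .count()
      .select(date_format($"window.start", "HH:mm:ss").as("time"), $"count")
    
    // Write finalized windows to Parquet on a 10-second trigger.
    // aggChk and aggPath are the checkpoint and output paths.
    query
      .writeStream
      .format("parquet")
      .option("checkpointLocation", aggChk)
      .trigger(ProcessingTime("10 seconds"))
      .outputMode("append")
      .start(aggPath)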
    ```
    
    As the query executes, list the files under "aggPath" and you'll see files
    of 339 bytes at a minimum until the first watermark is reached and the
    initial batch is finalized. Even after that, empty batches keep generating
    an empty file on every trigger.
    
    ## What changes were proposed in this pull request?
    
    Check whether each partition is empty, and skip empty partitions to avoid
    writing empty files.
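    
    The idea can be sketched in plain Scala, independent of Spark. This is only
    an illustration of the check, not the code in this patch: `RecordingWriter`,
    `SkipEmptyPartitions` and `writePartition` are hypothetical names, and the
    writer is an in-memory stand-in for a real Parquet file writer.
    
    ```scala
    import scala.collection.mutable.ArrayBuffer

    // Hypothetical stand-in for a per-partition file writer; not a Spark API.
    final class RecordingWriter(val path: String) {
      val rows = ArrayBuffer.empty[String]
      def write(row: String): Unit = rows += row
      def close(): Unit = ()
    }

    object SkipEmptyPartitions {
      // Writers that were actually created, kept for inspection.
      val created = ArrayBuffer.empty[RecordingWriter]

      // Writes one partition. Returns None, and creates no writer,
      // when the partition has no rows.
      def writePartition(partitionId: Int, rows: Iterator[String]): Option[RecordingWriter] = {
        if (!rows.hasNext) {
          // Skip: opening a writer here would produce a file with no records.
          None
        } else {
          val writer = new RecordingWriter(s"part-$partitionId")
          created += writer
          try rows.foreach(writer.write) finally writer.close()
          Some(writer)
        }
      }
    }
    ```
    
    In this sketch, `SkipEmptyPartitions.writePartition(0, Iterator.empty)`
    returns `None` without creating a writer, while a non-empty iterator yields
    a writer holding its rows. The check happens before the writer is opened,
    so an empty partition leaves nothing behind.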
    
    ## How was this patch tested?
    
    Jenkins


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/uncleGen/spark SPARK-20065

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17395.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17395
    
----
commit 86a7d2fa96e3134c1e64864eba81a3bebdedceea
Author: uncleGen <husty...@gmail.com>
Date:   2017-03-23T08:10:31Z

    avoid to output empty parquet files

----


