[jira] [Commented] (FLINK-11401) Allow compression on ParquetBulkWriter

Fokko Driesprong (JIRA) Wed, 23 Jan 2019 02:04:25 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-11401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16749744#comment-16749744
 ]


Fokko Driesprong commented on FLINK-11401:
------------------------------------------

Thanks for the comment [~StephanEwen]

The RollOnCheckpoint behavior works very well for our use case, which is just 
ETL'ing the data from Kafka to a bucket. Since we're using an object store as 
the FS backend (GCS), the constant renaming of the files from `.in-progress` to 
`.pending` to `.avro` is far from optimal, since renaming is very expensive. On 
HDFS a rename is a constant-time and atomic logical operation, whereas on an 
object store it implies copying the whole file.

In the near future, we'll open a PR for the Avro writer, implementing the 
BulkWriter. Since Avro is still a container format (we want to include the 
schema in the header of the file), we still need to write a header before 
writing the actual rows. Writing this header first would require changing some 
interfaces.
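
To make the header-first requirement concrete, here is a minimal sketch in 
plain Java. It is only an illustration under stated assumptions: 
`SketchBulkWriter` is a hypothetical stand-in that mirrors the shape of the 
BulkWriter (addElement / flush / finish), `HeaderFirstWriter` is a made-up 
name, and the byte layout is a toy length-prefixed format, not the real Avro 
container format. It does not settle which interfaces would need to change.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in mirroring the shape of a bulk writer interface.
interface SketchBulkWriter<T> {
    void addElement(T element) throws IOException;
    void flush() throws IOException;
    void finish() throws IOException;
}

// Toy container writer: the schema header is written exactly once,
// before any row. This is NOT the Avro object container file layout.
final class HeaderFirstWriter implements SketchBulkWriter<String> {
    private final DataOutputStream out;

    HeaderFirstWriter(String schemaJson, OutputStream stream) throws IOException {
        this.out = new DataOutputStream(stream);
        byte[] schema = schemaJson.getBytes(StandardCharsets.UTF_8);
        // Header: 4-byte length followed by the schema bytes.
        out.writeInt(schema.length);
        out.write(schema);
    }

    @Override
    public void addElement(String row) throws IOException {
        byte[] bytes = row.getBytes(StandardCharsets.UTF_8);
        // Row: 4-byte length followed by the row bytes.
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    @Override
    public void flush() throws IOException {
        out.flush();
    }

    @Override
    public void finish() throws IOException {
        // Flush only; in this sketch the caller owns and closes the stream.
        out.flush();
    }
}
```

In this sketch the header is emitted when the writer is constructed, so every 
newly rolled part file starts with its own schema header before the first row.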


> Allow compression on ParquetBulkWriter
> --------------------------------------
>
>                 Key: FLINK-11401
>                 URL: https://issues.apache.org/jira/browse/FLINK-11401
>             Project: Flink
>          Issue Type: Improvement
>          Components: Batch Connectors and Input/Output Formats
>    Affects Versions: 1.7.1
>            Reporter: Fokko Driesprong
>            Assignee: Fokko Driesprong
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.8.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
