[jira] [Commented] (FLINK-11401) Allow compression on ParquetBulkWriter
[ https://issues.apache.org/jira/browse/FLINK-11401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334031#comment-17334031 ]

Flink Jira Bot commented on FLINK-11401:
----------------------------------------

This issue was marked "stale-assigned" and has not received an update in 7 days. It is now automatically unassigned. If you are still working on it, you can assign it to yourself again. Please also give an update about the status of the work.

> Allow compression on ParquetBulkWriter
> --------------------------------------
>
>                 Key: FLINK-11401
>                 URL: https://issues.apache.org/jira/browse/FLINK-11401
>             Project: Flink
>          Issue Type: Improvement
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>    Affects Versions: 1.7.1
>            Reporter: Fokko Driesprong
>            Assignee: Fokko Driesprong
>            Priority: Major
>              Labels: pull-request-available, stale-assigned
>          Time Spent: 10m
>  Remaining Estimate: 0h
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[ https://issues.apache.org/jira/browse/FLINK-11401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323388#comment-17323388 ]

Flink Jira Bot commented on FLINK-11401:
----------------------------------------

This issue is assigned but has not received an update in 7 days, so it has been labeled "stale-assigned". If you are still working on the issue, please give an update and remove the label. If you are no longer working on the issue, please unassign yourself so someone else may work on it. In 7 days the issue will be automatically unassigned.
[ https://issues.apache.org/jira/browse/FLINK-11401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142678#comment-17142678 ]

Ryo Okubo commented on FLINK-11401:
-----------------------------------

Any progress on this issue? The pull request seems to be mostly done. Do you still have any remaining concerns about it? I also found a duplicate issue, https://issues.apache.org/jira/browse/FLINK-16491, which appears to take the same approach.
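For context, the improvement tracked here amounts to letting the caller choose a compression codec when the bulk writer is built, instead of hard-coding uncompressed output. The stdlib-Python sketch below illustrates only that pattern: a writer factory parameterized by a codec name. All names here are hypothetical, not Flink's actual API; in the real change the codec choice would be threaded through to parquet-mr's Parquet writer builder.

```python
import gzip
import io

# Hypothetical codec table; the real change would pass a Parquet
# CompressionCodecName (SNAPPY, GZIP, ...) to the Parquet writer builder.
CODECS = {
    "uncompressed": lambda payload: payload,
    "gzip": gzip.compress,
}

def make_bulk_writer(codec_name):
    """Return a write function bound to the chosen codec (illustrative)."""
    compress = CODECS[codec_name]

    def write(rows, stream):
        # A real bulk writer encodes rows into the format's pages/blocks;
        # here we just concatenate and compress in one shot.
        stream.write(compress(b"".join(rows)))

    return write

buf = io.BytesIO()
make_bulk_writer("gzip")([b"row1,", b"row2,"], buf)
# The payload round-trips through the chosen codec:
assert gzip.decompress(buf.getvalue()) == b"row1,row2,"
```

The point of the sketch is only where the codec decision lives: it is bound once, at factory-construction time, so every part file produced by the sink uses the same compression.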
[ https://issues.apache.org/jira/browse/FLINK-11401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749744#comment-16749744 ]

Fokko Driesprong commented on FLINK-11401:
------------------------------------------

Thanks for the comment [~StephanEwen]. The RollOnCheckpoint behavior works very well for our use case, which is simply ETL'ing data from Kafka to a bucket. However, since we're using an object-store FS backend (GCS), the constant renaming of files from `.in-progress` to `.pending` to the final `.avro` name is far from optimal, because renaming is very expensive there. On HDFS a rename is a constant-time, atomic operation; on an object store it implies copying the whole file.

In the near future we'll open a PR for the Avro writer, implementing the BulkWriter interface. Since Avro data still lives in a container file (we want to include the schema in the header of the file), we need to write a header before writing the actual rows. Writing this header first would require changing some interfaces.
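The header-before-rows interface shape discussed in these two comments can be sketched as follows. This is a purely illustrative stdlib-Python sketch; the class and method names are hypothetical and are not Flink's actual BulkWriter/Encoder API.

```python
import io

class HeaderedRowWriter:
    """Writes a header once when the part file is opened, then appends rows.
    Illustrative only; not Flink's actual BulkWriter/Encoder interface."""

    def __init__(self, stream, header: bytes):
        self.stream = stream
        # The header (e.g. an Avro container header carrying the schema)
        # is emitted exactly once, on open.
        stream.write(header)

    def add_element(self, row: bytes):
        # Rows are plain appends, which is what lets a row format
        # keep one part file open across checkpoints.
        self.stream.write(row)

    def flush(self):
        # Nothing is buffered in this sketch; a real writer would sync here.
        pass

out = io.BytesIO()
writer = HeaderedRowWriter(out, b"HDR|")
writer.add_element(b"row1|")
writer.add_element(b"row2|")
assert out.getvalue() == b"HDR|row1|row2|"
```

Because only the constructor needs the header, everything after open remains a row-by-row append, which is the property Stephan notes Avro has and Parquet lacks.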
[ https://issues.apache.org/jira/browse/FLINK-11401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748795#comment-16748795 ]

Stephan Ewen commented on FLINK-11401:
--------------------------------------

I can see that being useful. Please bear in mind that bulk writers currently come with the implication that they need to roll on checkpoint, because many formats (like Parquet) don't make it easy to persist intermediate state and resume writes. Avro's row-by-row append nature makes it possible to write part files across checkpoints.

One could think of letting the row formats add a header when opening a part file. That would allow the Avro writes to keep the property of writing part files across checkpoints.