[ https://issues.apache.org/jira/browse/FLINK-11395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016037#comment-17016037 ]

Grzegorz Kołakowski edited comment on FLINK-11395 at 1/15/20 2:48 PM:
----------------------------------------------------------------------

In fact, my previous comment does not make much sense.

In addition to the metadata kept in the header, an Avro file consists of one or 
more data blocks, separated by 16-byte sync markers. The blocks can be 
compressed as well 
([docs|https://avro.apache.org/docs/1.9.1/spec.html#Object+Container+Files]). 
This is why a row-based writer is not enough. However, it is possible to use 
the existing {{BulkWriter}} to write Avro files, although with {{BulkWriter}} 
we do not control the size of the files we write - they are closed on every 
checkpoint.
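
For illustration, a minimal sketch of such a {{BulkWriter}} wrapper around 
{{DataFileWriter}} - the class name {{AvroBulkWriter}} is made up here, it is 
not existing Flink API:

{code:java}
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.serialization.BulkWriter;
import org.apache.flink.core.fs.FSDataOutputStream;

// Sketch only - AvroBulkWriter is a made-up name, not existing Flink API.
class AvroBulkWriter implements BulkWriter<GenericRecord> {

    private final DataFileWriter<GenericRecord> writer;

    AvroBulkWriter(Schema schema, FSDataOutputStream out) throws IOException {
        // create() writes the Avro file header: schema, codec and sync marker.
        this(new DataFileWriter<GenericRecord>(new GenericDatumWriter<>(schema))
                .create(schema, out));
    }

    AvroBulkWriter(DataFileWriter<GenericRecord> writer) {
        this.writer = writer;
    }

    @Override
    public void addElement(GenericRecord element) throws IOException {
        writer.append(element);
    }

    @Override
    public void flush() throws IOException {
        // Writes out the current data block, terminated by the sync marker.
        writer.flush();
    }

    @Override
    public void finish() throws IOException {
        // Flush but do not close - the sink owns the underlying stream.
        writer.flush();
    }
}
{code}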

So what about modifying the existing {{BulkWriter.Factory}} to have two create 
methods, e.g. {{createForNewFile}} and {{createForResumedFile}}? Alternatively, 
let's introduce a new interface designed for appendable block formats like 
Avro. This way, we could easily wrap an {{org.apache.avro.file.DataFileWriter}} 
with the interface. The difficult part might be resuming a file: to do so, the 
{{DataFileWriter}} has to read the file header (via the 
{{org.apache.avro.file.SeekableInput}} interface) in order to load the schema, 
sync marker, codec, etc. I guess this may require further changes in 
interfaces such as {{ResumeRecoverable}}.
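
To make the proposal concrete, a rough sketch - neither the factory interface 
nor the Avro implementation below exists in Flink; only 
{{DataFileWriter#appendTo}} and {{SeekableInput}} are real Avro API:

{code:java}
import java.io.IOException;
import java.io.Serializable;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.file.SeekableInput;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.serialization.BulkWriter;
import org.apache.flink.core.fs.FSDataOutputStream;

// Proposed factory interface - does not exist in Flink, names are a sketch.
interface AppendableBulkWriterFactory<T> extends Serializable {

    /** Called when the sink opens a brand-new part file. */
    BulkWriter<T> createForNewFile(FSDataOutputStream out) throws IOException;

    /** Called when the sink resumes a part file after recovery. */
    BulkWriter<T> createForResumedFile(SeekableInput in, FSDataOutputStream out)
            throws IOException;
}

// Hypothetical Avro implementation, reusing the AvroBulkWriter sketch above.
class AvroAppendableWriterFactory
        implements AppendableBulkWriterFactory<GenericRecord> {

    private final String schemaString;  // Schema is not Serializable in 1.9.x

    AvroAppendableWriterFactory(Schema schema) {
        this.schemaString = schema.toString();
    }

    @Override
    public BulkWriter<GenericRecord> createForNewFile(FSDataOutputStream out)
            throws IOException {
        Schema schema = new Schema.Parser().parse(schemaString);
        return new AvroBulkWriter(schema, out);
    }

    @Override
    public BulkWriter<GenericRecord> createForResumedFile(
            SeekableInput in, FSDataOutputStream out) throws IOException {
        // appendTo() reads the existing header (schema, sync marker, codec)
        // from 'in' and then continues appending data blocks to 'out'.
        DataFileWriter<GenericRecord> writer =
                new DataFileWriter<GenericRecord>(new GenericDatumWriter<>())
                        .appendTo(in, out);
        return new AvroBulkWriter(writer);
    }
}
{code}

Either way, something has to hand the factory a {{SeekableInput}} over the 
part file being resumed, which is where the changes around 
{{ResumeRecoverable}} would come in.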

I've noticed the fix version has been updated to 1.11.0. May I ask which 
approach will be implemented? Will it be possible to control when a file is 
closed, as for the row-based writer, or will the file be closed on checkpoint, 
as for the bulk writer?


> Support for Avro StreamingFileSink
> ----------------------------------
>
>                 Key: FLINK-11395
>                 URL: https://issues.apache.org/jira/browse/FLINK-11395
>             Project: Flink
>          Issue Type: New Feature
>          Components: Connectors / FileSystem
>            Reporter: Elango Ganesan
>            Priority: Major
>              Labels: usability
>             Fix For: 1.11.0
>
>
> The current implementation of StreamingFileSink supports a row format for 
> text and JSON files, and BulkWriter has support for Parquet files. Out of 
> the box, the row format does not seem to be the right fit for writing Avro 
> files, as Avro files need to persist metadata when opening a new file. We 
> are thinking of implementing Avro writers similar to how the Parquet writers 
> are implemented. I am happy to submit a PR if this approach sounds good.


