Hi Vinay,

You are correct that the bulk formats only support
OnCheckpointRollingPolicy.
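
To make the difference concrete, here is a rough sketch against the
current (1.9) StreamingFileSink API; the output path, the encoder,
and the Avro schema are just placeholders:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.flink.api.common.serialization.SimpleStringEncoder;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
    import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

    class SinkSketches {

        // Row format: the rolling policy is configurable,
        // e.g. roll every 128 MB regardless of checkpoints.
        static StreamingFileSink<String> rowSink(Path out) {
            return StreamingFileSink
                .forRowFormat(out, new SimpleStringEncoder<String>("UTF-8"))
                .withRollingPolicy(
                    DefaultRollingPolicy.create()
                        .withMaxPartSize(128 * 1024 * 1024)
                        .build())
                .build();
        }

        // Bulk format: the builder exposes no withRollingPolicy();
        // part files roll only when a checkpoint happens.
        static StreamingFileSink<GenericRecord> parquetSink(Path out, Schema schema) {
            return StreamingFileSink
                .forBulkFormat(out, ParquetAvroWriters.forGenericRecord(schema))
                .build();
        }
    }

Note that, because part files are only completed on checkpoints, the
bulk sink requires checkpointing to be enabled to ever finalize files.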

The reason is that Flink currently relies on the Hadoop writer for
Parquet.

Bulk formats keep important details about how they write the actual
data (such as compression schemes, offsets, etc.) in metadata, and
they write this metadata with the file (e.g. Parquet writes it as a
footer). The Hadoop writer gives no access to this metadata. Given
this, there is no way for Flink to safely checkpoint a part file
without closing it.
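
You can see this constraint in the BulkWriter interface that the bulk
builder is built on (signatures from flink-core; the comments are my
paraphrase of the contract):

    import java.io.IOException;

    // From org.apache.flink.api.common.serialization.BulkWriter
    public interface BulkWriter<T> {

        // Encodes/buffers one element into the part file.
        void addElement(T element) throws IOException;

        // Flushes buffered data, but the file is NOT yet a valid
        // file of the format -- e.g. the Parquet footer has not
        // been written.
        void flush() throws IOException;

        // Writes the format metadata (e.g. the Parquet footer) and
        // finalizes the file; nothing may be written afterwards.
        void finish() throws IOException;
    }

Only after finish() is the part file readable, which is why rolling
has to coincide with closing the file on a checkpoint.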

The solution would be to write our own writer and not go through the
Hadoop one, but as far as I know there are no concrete plans for
this.

Cheers,
Kostas


On Tue, Oct 29, 2019 at 12:57 PM Vinay Patil <vinay18.pa...@gmail.com> wrote:
>
> Hi,
>
> I am not able to roll the files based on file size, as the bulk format
> only has onCheckpointRollingPolicy.
>
> One way is to write a CustomStreamingFileSink and provide a RollingPolicy
> like the RowFormatBuilder does. Is this the correct way to go ahead?
>
> Another way is to write a ParquetEncoder and use the RowFormatBuilder.
>
> P.S. Curious to know why the RollingPolicy was not exposed in the case
> of BulkFormat?
>
> Regards,
> Vinay Patil
