Hi Fabian, Kostas,

From the description of the ticket
https://issues.apache.org/jira/browse/FLINK-9753, I understand that my output
Parquet files written with StreamingFileSink should now be able to span
multiple checkpoints. However, when I tried it (using the buildSink helper
shown at the end of this mail; the job wiring is sketched below), I still see
one "part-X-X" file created after each checkpoint. Is there any other
configuration that I'm missing?
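
For reference, a minimal sketch of the job wiring. The 1-minute checkpoint
interval, the output path, and the source below are placeholders, and
buildSink is the helper at the end of this mail:

        import org.apache.flink.streaming.api.scala._

        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.enableCheckpointing(60 * 1000L) // one checkpoint per minute (illustrative)

        val records: DataStream[MyBaseRecord] = ??? // source elided
        records.addSink(buildSink[MyBaseRecord]("hdfs:///tmp/parquet-out"))
        env.execute("parquet-sink-test")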

BTW, I have another question regarding
StreamingFileSink.BulkFormatBuilder.withBucketCheckInterval(). As per the
notes at the end of the StreamingFileSink documentation page
(https://ci.apache.org/projects/flink/flink-docs-master/dev/connectors/streamfile_sink.html),
bulk encoding can only be combined with OnCheckpointRollingPolicy, which
rolls on every checkpoint, so setting that check interval makes no
difference. Why, then, is the withBucketCheckInterval method exposed at all?
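
For contrast, a minimal sketch of a row-format sink, where the check interval
does take effect because DefaultRollingPolicy rolls on time and size rather
than on checkpoints (all thresholds below are illustrative):

        import org.apache.flink.api.common.serialization.SimpleStringEncoder
        import org.apache.flink.core.fs.Path
        import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
        import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy

        val rowSink: StreamingFileSink[String] = StreamingFileSink
          .forRowFormat(new Path("/tmp/row-out"), new SimpleStringEncoder[String]("UTF-8"))
          .withRollingPolicy(
            DefaultRollingPolicy.create()
              .withRolloverInterval(15L * 60L * 1000L)  // roll at the latest every 15 minutes
              .withInactivityInterval(5L * 60L * 1000L) // ...or after 5 minutes without new data
              .withMaxPartSize(128L * 1024L * 1024L)    // ...or once a part reaches 128 MB
              .build())
          .withBucketCheckInterval(60L * 1000L) // how often the policy above is evaluated
          .build()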

Thanks and best regards,
Averell

        import scala.reflect.ClassTag
        import org.apache.flink.core.fs.Path
        import org.apache.flink.formats.parquet.avro.ParquetAvroWriters
        import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
        import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner

        def buildSink[T <: MyBaseRecord](outputPath: String)(implicit ct: ClassTag[T]): StreamingFileSink[T] = {
          StreamingFileSink
            .forBulkFormat(new Path(outputPath), ParquetAvroWriters.forReflectRecord(ct.runtimeClass))
            // ct.runtimeClass is a Class[_], so the builder comes back untyped; cast it to the expected shape
            .asInstanceOf[StreamingFileSink.BulkFormatBuilder[T, String]]
            .withBucketCheckInterval(5L * 60L * 1000L) // 5 minutes
            .withBucketAssigner(new DateTimeBucketAssigner[T]("yyyy-MM-dd--HH"))
            .build()
        }



