Re: File Naming Pattern from HadoopOutputFormat

Yitzchak Lieberman Tue, 02 Jul 2019 01:35:07 -0700

regarding option 2 for parquet:
implementing bucket assigner won't set the file name as getBucketId()
defined the directory for the files in case of partitioning the data,
for example:
<root dir>/day=20190101/part-1-1
there is an open issue for that:
https://issues.apache.org/jira/browse/FLINK-12573


On Tue, Jul 2, 2019 at 6:18 AM Haibo Sun <sunhaib...@163.com> wrote:

> Hi, Andreas
>
> I think the following things may be what you want.
>
> 1. For writing Avro, I think you can extend AvroOutputFormat and override
> the  getDirectoryFileName() method to customize a file name, as shown below.
> The javadoc of AvroOutputFormat:
> https://ci.apache.org/projects/flink/flink-docs-release-1.8/api/java/org/apache/flink/formats/avro/AvroOutputFormat.html
>
>       public static class CustomAvroOutputFormat extends AvroOutputFormat {
>               public CustomAvroOutputFormat(Path filePath, Class type) {
>                       super(filePath, type);
>               }
>
>               public CustomAvroOutputFormat(Class type) {
>                       super(type);
>               }
>
>               @Override
>               public void open(int taskNumber, int numTasks) throws 
> IOException {
>                       this.setOutputDirectoryMode(OutputDirectoryMode.ALWAYS);
>                       super.open(taskNumber, numTasks);
>               }
>
>               @Override
>               protected String getDirectoryFileName(int taskNumber) {
>                       // returns a custom filename
>                       return null;
>               }
>       }
>
>
> 2. For writing Parquet, you can refer to ParquetStreamingFileSinkITCase,
> StreamingFileSink#forBulkFormat and DateTimeBucketAssigner. You can create
> a class that implements the BucketAssigner interface and return a custom
> file name in the getBucketId() method (the value returned by getBucketId()
> will be treated as the file name).
>
> ParquetStreamingFileSinkITCase:
> https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/avro/ParquetStreamingFileSinkITCase.java
>
> StreamingFileSink#forBulkFormat:
> https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/StreamingFileSink.java
>
> DateTimeBucketAssigner:
> https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/bucketassigners/DateTimeBucketAssigner.java
>
>
> Best,
> Haibo
>
> At 2019-07-02 04:15:07, "Hailu, Andreas" <andreas.ha...@gs.com> wrote:
>
> Hello Flink team,
>
>
>
> I’m writing Avro and Parquet files to HDFS, and I’ve would like to include
> a UUID as a part of the file name.
>
>
>
> Our files in HDFS currently follow this pattern:
>
>
>
> *tmp-r-00001.snappy.parquet*
>
> *tmp-r-00002.snappy.parquet*
>
> *...*
>
>
>
> I’m using a custom output format which extends a RichOutputFormat - is
> this something which is natively supported? If so, could you please
> recommend how this could be done, or share the relevant document?
>
>
>
> Best,
>
> Andreas
>
> ------------------------------
>
> Your Personal Data: We may collect and process information about you that
> may be subject to data protection laws. For more information about how we
> use and disclose your personal data, how we protect your information, our
> legal basis to use your information, your rights and who you can contact,
> please refer to: www.gs.com/privacy-notices
>
>

Re: File Naming Pattern from HadoopOutputFormat

Reply via email to