Regarding option 2 for Parquet: implementing a BucketAssigner won't set the file name, because getBucketId() defines the directory the files are written to when the data is partitioned, for example:

<root dir>/day=20190101/part-1-1

There is an open issue for that: https://issues.apache.org/jira/browse/FLINK-12573
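
To make the layout concrete, here is a minimal, self-contained sketch in plain Java. It is not Flink code: BucketPathSketch and its methods are hypothetical names that only mimic how the sink appears to combine a bucket id with a generated part file name, assuming the part-<subtask>-<counter> pattern seen in the example path above.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Simplified model of the output layout; hypothetical names, not the Flink API.
final class BucketPathSketch {

    // Formats an instant as a UTC day, e.g. 20190101.
    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("yyyyMMdd").withZone(ZoneOffset.UTC);

    // Plays the role of BucketAssigner#getBucketId(): the result names a directory.
    static String getBucketId(Instant eventTime) {
        return "day=" + DAY.format(eventTime);
    }

    // Assumption: the sink generates this name itself, so user code cannot change it.
    static String partFileName(int subtaskIndex, long partCounter) {
        return "part-" + subtaskIndex + "-" + partCounter;
    }

    // Full path = root directory + bucket directory + generated part file name.
    static String fullPath(String rootDir, Instant eventTime, int subtaskIndex, long partCounter) {
        return rootDir + "/" + getBucketId(eventTime) + "/" + partFileName(subtaskIndex, partCounter);
    }

    public static void main(String[] args) {
        Instant eventTime = Instant.parse("2019-01-01T12:00:00Z");
        // prints /data/day=20190101/part-1-1
        System.out.println(fullPath("/data", eventTime, 1, 1));
    }
}
```

Whatever getBucketId() returns only changes the middle path segment; the last segment stays outside the assigner's control, which is why a UUID cannot be injected into the file name this way.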

On Tue, Jul 2, 2019 at 6:18 AM Haibo Sun <sunhaib...@163.com> wrote:

> Hi, Andreas
>
> I think the following things may be what you want.
>
> 1. For writing Avro, I think you can extend AvroOutputFormat and override
> the getDirectoryFileName() method to customize the file name, as shown below.
> The javadoc of AvroOutputFormat:
> https://ci.apache.org/projects/flink/flink-docs-release-1.8/api/java/org/apache/flink/formats/avro/AvroOutputFormat.html
>
>     public static class CustomAvroOutputFormat<E> extends AvroOutputFormat<E> {
>         public CustomAvroOutputFormat(Path filePath, Class<E> type) {
>             super(filePath, type);
>         }
>
>         public CustomAvroOutputFormat(Class<E> type) {
>             super(type);
>         }
>
>         @Override
>         public void open(int taskNumber, int numTasks) throws IOException {
>             // Force directory mode so each task writes
>             // <path>/<getDirectoryFileName(taskNumber)>, even with parallelism 1.
>             this.setOutputDirectoryMode(OutputDirectoryMode.ALWAYS);
>             super.open(taskNumber, numTasks);
>         }
>
>         @Override
>         protected String getDirectoryFileName(int taskNumber) {
>             // Placeholder: return the custom file name here.
>             return null;
>         }
>     }
>
> 2. For writing Parquet, you can refer to ParquetStreamingFileSinkITCase,
> StreamingFileSink#forBulkFormat and DateTimeBucketAssigner. You can create
> a class that implements the BucketAssigner interface and return a custom
> file name in the getBucketId() method (the value returned by getBucketId()
> will be treated as the file name).
>
> ParquetStreamingFileSinkITCase:
> https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/avro/ParquetStreamingFileSinkITCase.java
>
> StreamingFileSink#forBulkFormat:
> https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/StreamingFileSink.java
>
> DateTimeBucketAssigner:
> https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/bucketassigners/DateTimeBucketAssigner.java
>
> Best,
> Haibo
>
> At 2019-07-02 04:15:07, "Hailu, Andreas" <andreas.ha...@gs.com> wrote:
>
> > Hello Flink team,
> >
> > I'm writing Avro and Parquet files to HDFS, and I would like to include
> > a UUID as part of the file name.
> >
> > Our files in HDFS currently follow this pattern:
> >
> > *tmp-r-00001.snappy.parquet*
> > *tmp-r-00002.snappy.parquet*
> > *...*
> >
> > I'm using a custom output format which extends RichOutputFormat. Is
> > this something that is natively supported? If so, could you please
> > recommend how this could be done, or share the relevant documentation?
> >
> > Best,
> > Andreas