Hi Averell,

One comment regarding what you said:

> As my files are small, I think there would not be much benefit in
> checkpointing file offset state.

Checkpointing is not about efficiency but about consistency.
If the position in a split is not checkpointed, your application won't
operate with exactly-once state consistency unless each split produces
exactly one record.

Best, Fabian

2018-08-10 9:10 GMT+02:00 Jörn Franke <jornfra...@gmail.com>:

> Or you write a custom file system for Flink... (for the tar part).
> Unfortunately, gz files can only be processed single-threaded (there are
> some multi-threaded implementations, but they don't bring a big gain).
>
> On 10. Aug 2018, at 07:07, vino yang <yanghua1...@gmail.com> wrote:
>
> Hi Averell,
>
> In this case, I think you may need to extend Flink's existing source.
> First, read your large tar.gz file; once it has been decompressed, use
> multiple threads to read the records in the source, and then parse the
> data format (map / flatmap might be suitable operators; you can chain
> them with the source because these two operators don't require a data
> shuffle).
>
> Note that Flink doesn't encourage creating extra threads in UDFs, but I
> don't know if there is a better way for this scenario.
>
> Thanks, vino.
>
> Averell <lvhu...@gmail.com> wrote on Fri, Aug 10, 2018 at 12:05 PM:
>
>> Hi Fabian, Vino,
>>
>> I have one more question, which I initially planned to post in a new
>> thread, but now I think it is better to ask here:
>> I need to process one big tar.gz file which contains multiple small gz
>> files. What is the best way to do this? I am thinking of having a
>> single-threaded process that reads the TarArchiveStream (which has been
>> decompressed from that tar.gz by Flink automatically), and then
>> distributes the TarArchiveEntry entries to a multi-threaded operator
>> which would process the small files in parallel. If this is feasible,
>> which elements from Flink can I reuse?
>>
>> Thanks a lot.
>> Regards,
>> Averell
>>
>> --
>> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
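
To make Fabian's point about checkpointing the position in a split concrete, here is a minimal sketch in plain Java. It is not Flink API: SplitReplayDemo, SplitReader and run are hypothetical names, and the "checkpoint" is just a copy of an in-memory list. It only models what happens to the state after a failure when the read offset is, or is not, saved together with the state.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a source reading one split. Hypothetical names, not Flink API. */
public class SplitReplayDemo {

    /** Reads records from a split and remembers the index of the next record. */
    static final class SplitReader {
        private final List<String> split;
        private int offset;

        SplitReader(List<String> split, int startOffset) {
            this.split = split;
            this.offset = startOffset;
        }

        boolean hasNext() { return offset < split.size(); }
        String next() { return split.get(offset++); }
        int snapshotOffset() { return offset; }
    }

    /**
     * Reads 3 records, takes a checkpoint, reads 1 more, then fails and recovers.
     * Returns the state (all records applied) once the split is finished.
     */
    static List<String> run(boolean checkpointOffset) {
        List<String> split = List.of("r0", "r1", "r2", "r3", "r4");
        SplitReader reader = new SplitReader(split, 0);
        List<String> state = new ArrayList<>();

        for (int i = 0; i < 3; i++) {
            state.add(reader.next());
        }

        // Checkpoint: the state is always saved, the offset only if requested.
        List<String> checkpointedState = new ArrayList<>(state);
        int checkpointedOffset = checkpointOffset ? reader.snapshotOffset() : 0;

        // Work done after the checkpoint is lost when the failure happens.
        state.add(reader.next());

        // Failure and recovery: restore the state and restart the reader.
        state = new ArrayList<>(checkpointedState);
        reader = new SplitReader(split, checkpointedOffset);
        while (reader.hasNext()) {
            state.add(reader.next());
        }
        return state;
    }

    public static void main(String[] args) {
        System.out.println("offset checkpointed:     " + run(true));
        System.out.println("offset not checkpointed: " + run(false));
    }
}
```

With the offset saved, the recovered state holds each of the five records once. Without it, the reader restarts the split from the beginning and r0, r1, r2 are applied a second time on top of the restored state, which is the loss of exactly-once consistency described above. With a single record per split there is no intermediate position to lose, which matches the "exactly one record" exception.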
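
The "one reader, many workers" idea from Averell's question can be sketched the same way. This is only an illustration under stated assumptions: it uses a plain JDK thread pool rather than Flink operators, GzFanOutDemo and processInParallel are hypothetical names, and the tar reading is stubbed out with an in-memory map of entry name to gz bytes (reading a real archive would need something like Apache Commons Compress, which is not shown here). Each gz payload is still decompressed by a single thread, in line with Jörn's remark; the parallelism comes from handling several small entries at once.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Fan-out of small gz entries to a thread pool. Hypothetical names, not Flink API. */
public class GzFanOutDemo {

    /** Helper that builds a gz payload, standing in for one entry of the archive. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Decompresses one gz payload; a single gz stream is read sequentially. */
    static String gunzip(byte[] data) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** One thread iterates over the entries; the pool decompresses them in parallel. */
    static Map<String, String> processInParallel(Map<String, byte[]> entries, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<String>> pending = new LinkedHashMap<>();
            for (Map.Entry<String, byte[]> entry : entries.entrySet()) {
                byte[] payload = entry.getValue();
                Callable<String> task = () -> gunzip(payload);
                pending.put(entry.getKey(), pool.submit(task));
            }
            // Collect results in the order the entries were read.
            Map<String, String> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<String>> p : pending.entrySet()) {
                results.put(p.getKey(), p.getValue().get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for entries", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("an entry failed to decompress", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        entries.put("a.gz", gzip("alpha"));
        entries.put("b.gz", gzip("beta"));
        entries.put("c.gz", gzip("gamma"));
        System.out.println(processInParallel(entries, 2));
    }
}
```

In a Flink job the pool would more naturally be replaced by a downstream operator running with parallelism greater than one, since, as vino notes, creating extra threads inside UDFs is discouraged.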