Re: Small-files source - partitioning based on prefix of file

Jörn Franke Fri, 10 Aug 2018 00:10:27 -0700

Or you could write a custom file system for Flink (for the tar part).
Unfortunately, gz files can only be processed single-threaded (there are some
multi-threaded implementations, but they don't bring a big gain).
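
To illustrate the split being discussed, here is a minimal sketch in plain
Python (standard library only). It is not Flink code and not a custom file
system. It only shows the pattern: the outer tar.gz is read sequentially by
one thread, and the inner gz members are handed to a thread pool. The names
process_tar_gz, count_lines and build_sample are hypothetical, and counting
lines is just a stand-in for real parsing.

```python
import gzip
import io
import tarfile
from concurrent.futures import ThreadPoolExecutor


def count_lines(name, compressed):
    """Decompress one inner .gz member and count its lines (stand-in for parsing)."""
    text = gzip.decompress(compressed).decode("utf-8")
    return name, len(text.splitlines())


def process_tar_gz(fileobj, workers=4):
    """Read the outer tar.gz sequentially and fan the inner members out to a pool."""
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Mode "r|gz" treats the archive as a forward-only stream, which matches
        # the constraint that the outer gzip layer is read by a single thread.
        with tarfile.open(fileobj=fileobj, mode="r|gz") as archive:
            for member in archive:
                if not member.isfile():
                    continue
                # In stream mode a member must be read before advancing.
                payload = archive.extractfile(member).read()
                futures.append(pool.submit(count_lines, member.name, payload))
        return dict(f.result() for f in futures)


def build_sample():
    """Build a tiny in-memory tar.gz holding two gzip-compressed text files."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, body in [("a.gz", b"1\n2\n3\n"), ("b.gz", b"x\ny\n")]:
            data = gzip.compress(body)
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer
```

Calling process_tar_gz(build_sample()) returns a dict that maps each inner
file name to its line count. In Flink the same division of labour would
presumably be a parallelism-1 source that emits the entries, followed by a
parallel operator that decompresses and parses them.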


> On 10. Aug 2018, at 07:07, vino yang <yanghua1...@gmail.com> wrote:
> 
> Hi Averell,
> 
> In this case, I think you may need to extend Flink's existing source. 
> First, read your large tar.gz file. Once it has been decompressed, use 
> multiple threads in the source to read the records, and then parse the 
> data format (map / flatmap might be suitable operators; you can chain them 
> with the source because these two operators don't require a data shuffle).
> 
> Note that Flink doesn't encourage creating extra threads in UDFs, but I don't 
> know if there is a better way for this scenario.
> 
> Thanks, vino.
> 
> On Fri, 10 Aug 2018 at 12:05 PM, Averell <lvhu...@gmail.com> wrote:
>> Hi Fabian, Vino,
>> 
>> I have one more question, which I initially planned to post in a new thread,
>> but now I think it is better to ask here:
>> I need to process one big tar.gz file which contains multiple small gz
>> files. What is the best way to do this? I am thinking of having one
>> single-threaded process that reads the TarArchiveStream (which has been
>> decompressed from that tar.gz by Flink automatically), and then distributes
>> the TarArchiveEntry entries to a multi-threaded operator which would process
>> the small files in parallel. If this is feasible, which elements of Flink
>> can I reuse?
>> 
>> Thanks a lot.
>> Regards,
>> Averell
>> 
>> 
>> 
>> --
>> Sent from: 
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
