Hi, Eric. Thanks for reaching out. I'm wondering how you use the Table API to ingest the data. Since "OOM" is too general, do you have any more clues about the OOM? Maybe you can use jmap to see what occupies most of the memory. If you find it, you can try to figure out the reason, whether it is caused by a lack of memory or something else.
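For example, something along these lines might help narrow it down (assuming <pid> is a placeholder for the process id of the TaskManager JVM that hits the OOM):

    jmap -histo:live <pid> | head -n 30               # class histogram of live heap objects
    jmap -dump:live,format=b,file=heap.hprof <pid>    # heap dump to analyze offline, e.g. with Eclipse MAT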
Btw, have you ever tried using Flink SQL to ingest the data? Does the OOM still happen?

Best regards,
Yuxia

From: "Yang Liu" <eric.liu....@gmail.com>
To: "User" <user@flink.apache.org>
Sent: Friday, February 10, 2023 5:10:49 AM
Subject: Seeking suggestions for ingesting large amount of data from S3

Hi all,

We are trying to ingest a large amount of data (20 TB) from S3 using the Flink filesystem connector to bootstrap a Hudi table. The data is well partitioned in S3 by date/time, but we have been facing OOM issues in the Flink jobs, so we want to update the Flink job to ingest the data chunk by chunk (partition by partition) with some kind of loop instead of all at once. Curious what the recommended way to do this in Flink is. I believe this should be a common use case, so I hope to get some ideas here. We have been using the Table API, but are open to other APIs.

Thanks & Regards,
Eric
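Regarding the partition-by-partition idea above: a minimal sketch of what that loop could look like with the Table API in batch mode, assuming a Hive-style dt=... layout in S3. All table names, schemas, paths, Hudi options, and the date range below are placeholders, not details from this thread; a real Hudi sink typically needs more options than shown.

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    import java.time.LocalDate;

    public class PartitionedBootstrapSketch {
        public static void main(String[] args) throws Exception {
            // Batch mode: each INSERT below runs as a finite job over one partition.
            TableEnvironment tEnv = TableEnvironment.create(
                    EnvironmentSettings.newInstance().inBatchMode().build());

            // Hypothetical filesystem source over the partitioned S3 data.
            tEnv.executeSql(
                "CREATE TABLE s3_source (" +
                "  id STRING, payload STRING, dt STRING" +
                ") PARTITIONED BY (dt) WITH (" +
                "  'connector' = 'filesystem'," +
                "  'path' = 's3://my-bucket/raw-data/'," +
                "  'format' = 'parquet'" +
                ")");

            // Hypothetical Hudi sink table (options simplified for the sketch).
            tEnv.executeSql(
                "CREATE TABLE hudi_sink (" +
                "  id STRING, payload STRING, dt STRING" +
                ") PARTITIONED BY (dt) WITH (" +
                "  'connector' = 'hudi'," +
                "  'path' = 's3://my-bucket/hudi-table/'" +
                ")");

            // Loop over date partitions; await() blocks until each batch job finishes,
            // so only one chunk of data is in flight at a time.
            for (LocalDate d = LocalDate.of(2023, 1, 1);
                 !d.isAfter(LocalDate.of(2023, 1, 31));
                 d = d.plusDays(1)) {
                tEnv.executeSql(
                    "INSERT INTO hudi_sink SELECT id, payload, dt " +
                    "FROM s3_source WHERE dt = '" + d + "'").await();
            }
        }
    }

The partition filter in the WHERE clause keeps each batch job reading only one S3 partition, which bounds the state and memory each job needs compared to scanning all 20 TB at once.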