Hi Dominik,

Would you check the JM GC status? One possible cause is that the large number of file metas in HadoopInputFormat is burdening the JM memory.
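For checking the JM GC status, a minimal sketch (assuming you can log in to the YARN node hosting the JobManager and that a JDK with `jps`/`jstat` is on the PATH; `<JM_PID>` is a placeholder you must fill in from the `jps` output):

```shell
# List running JVMs with their main class to locate the JobManager process
# (the exact entrypoint class name depends on your deployment mode).
jps -l

# Sample GC statistics for that JVM every 5 seconds. Rapidly growing FGC/FGCT
# (full-GC count/time) or an old generation (O) stuck near 100% would support
# the theory that the file metas are burdening the JM memory.
jstat -gcutil <JM_PID> 5000
```

This is only a diagnostic fragment; it prints statistics for a running JVM and cannot be run as-is without substituting the real JobManager PID.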
`akka.ask.timeout` is the default RPC timeout, while some RPCs may override this timeout for their own purposes; e.g. the RPCs from the web frontend usually use `web.timeout` instead. Providing the detailed call stack of the AskTimeoutException may help to identify where this timeout happened.

Thanks,
Zhu Zhu

Dominik Wosiński <[email protected]> wrote on Mon, Nov 11, 2019, 3:35 AM:
> Hey,
> I have a very specific use case. I have a history of records stored as
> Parquet in S3. I would like to read and process them with Flink. The issue
> is that the number of files is quite large (>100k). If I provide the full
> list of files to the HadoopInputFormat that I am using, it will fail with an
> AskTimeoutException, which is weird since I am using YARN and setting
> -yD akka.ask.timeout=600s. Even though, according to the logs, the setting
> is processed properly, the job execution still fails with an
> AskTimeoutException after 10s, which seems weird to me. I have managed to
> work around this by grouping the files and reading them in a loop, so that
> finally I have a Seq[DataSet[Record]]. But if I try to union those datasets,
> I will receive the AskTimeoutException again. So my question is: what can be
> the reason behind this exception being thrown, and why is the setting
> ignored even though it is parsed properly?
>
> I will be glad for any help.
>
> Best Regards,
> Dom.
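To illustrate the distinction, a configuration sketch raising both timeouts (the keys are from Flink's configuration; treat the exact values as assumptions to tune, not recommendations — note that a web-frontend RPC failing after exactly 10s is consistent with `web.timeout` still sitting at its default rather than `akka.ask.timeout` being ignored):

```yaml
# flink-conf.yaml (or pass via -yD on the CLI)
akka.ask.timeout: 600 s   # default RPC timeout, as a duration string
web.timeout: 600000       # web-frontend RPC timeout, in milliseconds
```

Equivalently on the CLI: `-yD akka.ask.timeout="600 s" -yD web.timeout=600000`.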
