Let me provide more details.

We are running TFX, and we specified Beam's FnApiRunner as the underlying
runner.

Our dataset is a large number of HDFS files, each around 200 MB, for a
total of around 200 GB.

When running our TFX code, we hit an OOM error. I assume this is because
the Beam FnApiRunner loads all the data into memory while executing the
stages one by one.
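
For reference, here is a rough sketch of the kind of Beam options involved
when the Python direct runner is selected. The helper name
direct_runner_args and the default values are illustrative assumptions, not
our exact configuration. The flags themselves (--runner,
--direct_running_mode, --direct_num_workers) are standard Beam Python
direct runner options, and a list like this is what TFX accepts as
beam_pipeline_args.

```python
# Hypothetical helper (name and defaults are assumptions for illustration):
# builds the Beam pipeline arguments for the Python direct runner.
def direct_runner_args(num_workers=0, running_mode="multi_processing"):
    """Return Beam flags selecting the Python DirectRunner.

    running_mode is one of "in_memory", "multi_threading",
    or "multi_processing"; num_workers=0 lets Beam use one
    worker per available core.
    """
    allowed = {"in_memory", "multi_threading", "multi_processing"}
    if running_mode not in allowed:
        raise ValueError(f"unknown running mode: {running_mode}")
    return [
        "--runner=DirectRunner",
        f"--direct_running_mode={running_mode}",
        f"--direct_num_workers={num_workers}",
    ]


# The resulting list would be passed to the TFX pipeline definition as
# beam_pipeline_args=direct_runner_args().
beam_pipeline_args = direct_runner_args()
```

Note that these settings only change how the direct runner parallelizes
work; they do not by themselves put a bound on memory use.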

Regards

-------------------------------------------------------------

Wilson(Xiaoshuang) Wang
Sr. Software Engineer


On Mon, Mar 13, 2023 at 11:32 AM wilsonny...@gmail.com <
wilsonny...@gmail.com> wrote:

> Python Beam direct runner.
>
>
> Regards
>
> -------------------------------------------------------------
>
> Wilson(Xiaoshuang) Wang
> Sr. Software Engineer
>
>
> On Mon, Mar 13, 2023 at 11:29 AM Robert Burke <rob...@frantil.com> wrote:
>
>> Which direct runner? They are language specific.
>>
>> On Mon, Mar 13, 2023, 11:27 AM wilsonny...@gmail.com <
>> wilsonny...@gmail.com> wrote:
>>
>>> Hi guys,
>>>
>>> We are trying to run our pipeline using the direct runner, and the
>>> input dataset is a large number of HDFS files (a few hundred GB of data).
>>>
>>> We experienced an OOM crash. Then, reading the direct runner
>>> documentation, I realized the direct runner loads the whole dataset
>>> into memory.
>>>
>>> Is there any way we can avoid this OOM issue?
>>>
>>> Regards
>>>
>>> -------------------------------------------------------------
>>>
>>> Wilson(Xiaoshuang) Wang
>>> Sr. Software Engineer
>>>
>>
