The FnApiRunner is primarily for tiny jobs (development and testing)
and holds all the data in memory. You'll likely have to run with a
"real" runner to operate over datasets of this size. If you want to
run locally, you can pass --runner=FlinkRunner and (assuming you have
Java installed) it will run your job on a local Flink cluster which
might be a good option if you want to keep things on a single machine.

On Mon, Mar 13, 2023 at 11:51 AM [email protected]
<[email protected]> wrote:
>
> Let me provide more details.
>
> We are running TFX and we specified beam FnApiRunner as the underlying runner 
> type.
>
> Our dataset is a large amount of HDFS files, each around 200MB and the total 
> are around 200GB.
>
> When running our TFX code, we saw OOM issue.  I assume this is due to Beam 
> FnApiRunner loading all the data while executing each stage one by one.
>
> Regards
>
> -------------------------------------------------------------
>
> Wilson(Xiaoshuang) Wang
> Sr. Software Engineer
>
>
> On Mon, Mar 13, 2023 at 11:32 AM [email protected] 
> <[email protected]> wrote:
>>
>> Python Beam direct runner.
>>
>>
>> Regards
>>
>> -------------------------------------------------------------
>>
>> Wilson(Xiaoshuang) Wang
>> Sr. Software Engineer
>>
>>
>> On Mon, Mar 13, 2023 at 11:29 AM Robert Burke <[email protected]> wrote:
>>>
>>> Which direct runner? They are language specific.
>>>
>>> On Mon, Mar 13, 2023, 11:27 AM [email protected] 
>>> <[email protected]> wrote:
>>>>
>>>> Hi guys,
>>>>
>>>> We are trying to run our pipeline using direct runner and the input 
>>>> dataset is a large amount of HDFS files (few hundred of GB data)
>>>>
>>>> We experienced OOM issue crash. Then inside the direct runner document, I 
>>>> realized direct runner loads the whole dataset into the memory.
>>>>
>>>> Is there any way we can avoid this OOM issue?
>>>>
>>>> Regards
>>>>
>>>> -------------------------------------------------------------
>>>>
>>>> Wilson(Xiaoshuang) Wang
>>>> Sr. Software Engineer

Reply via email to