Let me provide more details. We are running TFX and have specified the Beam FnApiRunner as the underlying runner.
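To make the setup concrete, here is a minimal sketch of how Beam options of this kind are commonly handed to a TFX pipeline through `beam_pipeline_args`. The helper name `build_beam_pipeline_args` and the flag values are illustrative assumptions, not the exact configuration in use. The sketch also assumes the Python DirectRunner, which delegates to the FnApiRunner for batch pipelines.

```python
from typing import List


def build_beam_pipeline_args(num_workers: int = 1) -> List[str]:
    """Build Beam flags for the Python direct runner (hypothetical helper).

    The flags are standard Beam Python direct-runner options; the values
    chosen here are placeholders for illustration only.
    """
    return [
        "--runner=DirectRunner",
        # in_memory keeps all work inside a single Python process.
        "--direct_running_mode=in_memory",
        "--direct_num_workers={}".format(num_workers),
    ]


# This list would typically be passed as `beam_pipeline_args` when
# constructing the TFX pipeline object.
beam_pipeline_args = build_beam_pipeline_args()
print(beam_pipeline_args)
```

With roughly 200 GB of input, a single in-memory process like this is the configuration where memory pressure would be expected to show up first.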
Our dataset is a large number of HDFS files, each around 200 MB, totaling around 200 GB. When running our TFX code, we hit an OOM error. I assume this is because the Beam FnApiRunner loads all the data into memory while executing the stages one by one.

Regards
-------------------------------------------------------------
Wilson(Xiaoshuang) Wang
Sr. Software Engineer

On Mon, Mar 13, 2023 at 11:32 AM wilsonny...@gmail.com <wilsonny...@gmail.com> wrote:

> Python Beam direct runner.
>
> Regards
> -------------------------------------------------------------
> Wilson(Xiaoshuang) Wang
> Sr. Software Engineer
>
> On Mon, Mar 13, 2023 at 11:29 AM Robert Burke <rob...@frantil.com> wrote:
>
>> Which direct runner? They are language specific.
>>
>> On Mon, Mar 13, 2023, 11:27 AM wilsonny...@gmail.com <wilsonny...@gmail.com> wrote:
>>
>>> Hi guys,
>>>
>>> We are trying to run our pipeline using direct runner and the input
>>> dataset is a large amount of HDFS files (few hundred of GB data)
>>>
>>> We experienced OOM issue crash. Then inside the direct runner document,
>>> I realized direct runner loads the whole dataset into the memory.
>>>
>>> Is there any way we can avoid this OOM issue?
>>>
>>> Regards
>>> -------------------------------------------------------------
>>> Wilson(Xiaoshuang) Wang
>>> Sr. Software Engineer