Hi all,

To anybody reading this in the future: as Hans said, don't use Beam Direct for heavy lifting. On Dataflow, the original pipeline works without any issues, no workarounds required.
I would still like to have a scheduled Dataflow job run my entire workflow, avoiding an additional VM/container with Hop as a glorified scheduler – but that will have to wait.

cheers

Fabian

> Am 11.10.2022 um 09:05 schrieb Fabian Peters <[email protected]>:
>
> Hi Hans,
>
> For now I'm stuck on Beam-Direct, as HOP-4193
> <https://issues.apache.org/jira/browse/HOP-4193> is keeping me from using
> Dataflow. The amount of data involved is reasonably small, so this works OK
> for now.
>
> As reported, the pipeline executor does not work on Beam. But I've found a
> workaround for now: I execute the "embedded" pipeline via the local runner
> and write the results to GCS, then pick them up in a later pipeline to be
> inserted into BigQuery.
>
> cheers
>
> Fabian
>
>> Am 10.10.2022 um 18:35 schrieb Hans Van Akelyen <[email protected]>:
>>
>> Hi Fabian,
>>
>> Could you provide a bit more information? In the past couple of weeks, some
>> major changes have been made to improve performance.
>> Are you using a Hop local engine configuration when executing the pipeline
>> executor, or are you trying Beam-Direct? If it is the latter, I fear that's
>> not really supported currently, or at least untested.
>>
>> That being said, Beam Direct is an engine type mainly for testing an
>> implementation, not for actual heavy lifting. I would test the
>> implementation with a couple of files and do the actual heavy processing
>> using Dataflow, Spark, or Flink.
>>
>> In one of our next releases, we are planning to add an "Advisor" which will
>> warn about transforms we have not yet tested, or that we know will not
>> always give the expected results.
>>
>> Cheers,
>> Hans
>>
>> On Mon, 10 Oct 2022 at 10:28, Fabian Peters <[email protected]> wrote:
>>
>> Hi all,
>>
>> I'm trying to process a few hundred Avro files on GCS. They are being
>> decoded and two simple filters are being applied. When running this on
>> Beam-Direct, all heap space is filled within a minute or two. I threw
>> 58 GB at it before giving up.
>>
>> To limit the number of files processed at once, I have moved the actual
>> processing into a pipeline executor. Alas, when running on Beam-Direct,
>> it looks like the transforms are only initialised but never executed.
>> This concerns Write to Log, JavaScript, HTTP Client and BigQuery Output.
>> Everything behaves as expected when I configure the pipeline executor to
>> use the Local runner.
>>
>> So, two questions: is the pipeline executor transform incompatible with
>> Beam? And are there other approaches for limiting memory use in such a
>> case?
>>
>> cheers
>>
>> Fabian
>
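[Editor's note] Fabian's workaround – feeding a bounded number of files into a pipeline executor per run so peak memory stays proportional to one batch rather than the whole input – is a general batching pattern. A minimal Python sketch of the idea (the file names and batch size are illustrative assumptions, not Hop or Beam configuration):

```python
from itertools import islice


def batched(paths, size):
    """Yield successive batches of at most `size` items from `paths`."""
    it = iter(paths)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Illustrative file list; in the thread these would be Avro files on GCS.
files = [f"part-{i:04d}.avro" for i in range(10)]

for batch in batched(files, 4):
    # Each batch is decoded and filtered before the next one is loaded,
    # so peak memory is bounded by one batch of records instead of all
    # ten files at once.
    print(f"processing {len(batch)} files: {batch[0]} .. {batch[-1]}")
```

In Hop terms, the outer loop corresponds to the parent pipeline handing a subset of filenames to the pipeline executor on each iteration.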
