Hi all,

To anybody reading this in the future: as Hans said, don't use Beam Direct for 
heavy lifting. On Dataflow, the original pipeline works without any issues, no 
workarounds required.
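
For anyone curious what that looks like outside of Hop, here is a rough sketch 
in plain Beam Python (not the code Hop generates; project, bucket and table 
names are placeholders) of pointing such a pipeline at Dataflow instead of the 
direct runner:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Requires apache-beam[gcp] for the GCS and BigQuery connectors.
# All identifiers below are placeholders, not values from my project.
options = PipelineOptions(
    runner='DataflowRunner',            # Beam Direct would be 'DirectRunner'
    project='my-gcp-project',
    region='europe-west1',
    temp_location='gs://my-bucket/tmp',
)

with beam.Pipeline(options=options) as p:
    (p
     | 'Read Avro' >> beam.io.ReadFromAvro('gs://my-bucket/input/*.avro')
     | 'Filter'    >> beam.Filter(lambda rec: rec.get('status') == 'ok')
     | 'Write BQ'  >> beam.io.WriteToBigQuery(
           'my-gcp-project:my_dataset.my_table',
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))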

I would still like a scheduled Dataflow job to run my entire workflow, so that 
I can avoid an additional VM/container running Hop as a glorified scheduler, 
but that will have to wait.

cheers

Fabian

> On 11.10.2022, at 09:05, Fabian Peters <[email protected]> wrote:
> 
> Hi Hans,
> 
> For now I'm stuck on Beam Direct, as HOP-4193 
> <https://issues.apache.org/jira/browse/HOP-4193> is keeping me from using 
> Dataflow. The amount of data involved is reasonably small, so this works 
> well enough.
> 
> As reported, the pipeline executor does not work on Beam. I've found a 
> workaround for now: executing the "embedded" pipeline via the local runner 
> and writing the results to GCS, then picking them up in a later pipeline to 
> be inserted into BigQuery.
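> 
> The shape of that workaround, sketched in plain Beam Python purely for 
> illustration (in Hop the first stage actually runs on the native local 
> engine, and all bucket, table and field names here are placeholders):
> 
> import json
> import apache_beam as beam
> 
> # Stage 1: run the heavy part locally and stage the results on GCS
> # (assumes the records are JSON-serialisable).
> with beam.Pipeline() as p:
>     (p
>      | 'Read Avro' >> beam.io.ReadFromAvro('gs://my-bucket/input/*.avro')
>      | 'Filter'    >> beam.Filter(lambda rec: rec.get('status') == 'ok')
>      | 'To JSON'   >> beam.Map(json.dumps)
>      | 'Stage'     >> beam.io.WriteToText('gs://my-bucket/staging/results',
>                                           file_name_suffix='.json'))
> 
> # Stage 2: a later pipeline picks the staged files up and loads BigQuery.
> with beam.Pipeline() as p:
>     (p
>      | 'Read staged' >> beam.io.ReadFromText(
>            'gs://my-bucket/staging/results*.json')
>      | 'Parse'       >> beam.Map(json.loads)
>      | 'Insert'      >> beam.io.WriteToBigQuery(
>            'my-gcp-project:my_dataset.my_table',
>            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
>            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))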
> 
> cheers
> 
> Fabian
> 
>> On 10.10.2022, at 18:35, Hans Van Akelyen <[email protected]> wrote:
>> 
>> Hi Fabian,
>> 
>> Could you provide a bit more information? In the past couple of weeks, some 
>> major changes have been made to improve performance.
>> Are you using a Hop local engine configuration when executing the pipeline 
>> executor, or are you trying Beam Direct? If it is the latter, I fear that's 
>> not really supported at the moment, and is certainly untested.
>> 
>> That being said, Beam Direct is an engine type meant mainly for testing an 
>> implementation, not for actual heavy lifting. I would test the 
>> implementation with a couple of files and do the actual heavy processing 
>> using Dataflow, Spark, or Flink.
>> 
>> In one of our next releases, we are planning to add an "Advisor" which will 
>> warn about transforms we have not yet tested, or that we know will not 
>> always give the expected results.
>> 
>> Cheers,
>> Hans
>> 
>> On Mon, 10 Oct 2022 at 10:28, Fabian Peters <[email protected]> wrote:
>> Hi all,
>> 
>> I'm trying to process a few hundred Avro files on GCS. They are decoded and 
>> two simple filters are applied. When running this on Beam Direct, all heap 
>> space fills up within a minute or two; I threw 58 GB at it before giving up.
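>> 
>> (In plain Beam Python terms, and only as an illustration rather than the 
>> code Hop generates, the run configuration amounts to the snippet below; as 
>> far as I understand, the direct runner executes the whole graph in the 
>> local process and holds intermediate data in memory, which matches the heap 
>> pressure I'm seeing. Paths and filter predicates are placeholders.)
>> 
>> import apache_beam as beam
>> from apache_beam.options.pipeline_options import PipelineOptions
>> 
>> # 'DirectRunner' is Beam's local, in-memory runner.
>> opts = PipelineOptions(runner='DirectRunner')
>> with beam.Pipeline(options=opts) as p:
>>     (p
>>      | 'Read Avro' >> beam.io.ReadFromAvro('gs://my-bucket/input/*.avro')
>>      | 'Filter 1'  >> beam.Filter(lambda rec: rec.get('type') == 'event')
>>      | 'Filter 2'  >> beam.Filter(lambda rec: rec.get('status') == 'ok'))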
>> 
>> To limit the number of files processed at once, I have moved the actual 
>> processing into a pipeline executor. Alas, when running on Beam Direct, it 
>> looks like the transforms are only initialised but never executed. This 
>> concerns Write to Log, JavaScript, HTTP Client and BigQuery Output. 
>> Everything behaves as expected when I configure the pipeline executor to 
>> use the Local runner.
>> 
>> So, two questions: Is the pipeline executor transform incompatible with 
>> Beam? And are there other approaches to limiting memory use in such a case?
>> 
>> cheers
>> 
>> Fabian
> 
