Thanks for the follow-up, Fabian!
We will see how/if we can get HOP-4193 sorted out to support your full use
case.

Cheers,
Hans

On 18 October 2022 at 09:54:15, Fabian Peters ([email protected]) wrote:

Hi all,

To anybody reading this in the future: As Hans said, don't use Beam Direct
for heavy lifting. On Dataflow, the original pipeline works w/o any issues,
no workarounds required.

I would still like to have a scheduled Dataflow job run my entire workflow,
avoiding an additional VM/container with Hop as a glorified scheduler – but
that will have to wait.

cheers

Fabian

On 11.10.2022 at 09:05, Fabian Peters <[email protected]> wrote:

Hi Hans,

For now I'm stuck on Beam-Direct, as HOP-4193
<https://issues.apache.org/jira/browse/HOP-4193> is keeping me from using
Dataflow. The amount of data involved is reasonably small, so this works ok
for now.

As reported, the pipeline executor does not work on Beam, but I've found a
workaround: I execute the "embedded" pipeline via the local runner and write
the results to GCS, then pick them up in a later pipeline that inserts them
into BigQuery.
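
In plain Beam Java terms, that hand-off looks roughly like the sketch below
(bucket, table and field names are placeholders, and the real thing is of
course a Hop pipeline rather than hand-written Beam code; this is just the
idea):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class StagedResultsToBigQuery {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("Read staged results from GCS",
            TextIO.read().from("gs://my-bucket/staging/results-*.json"))
     // Wrap each staged line in a TableRow (field name is made up).
     .apply("To TableRow",
            MapElements.into(TypeDescriptor.of(TableRow.class))
                .via((String line) -> new TableRow().set("payload", line)))
     .setCoder(TableRowJsonCoder.of())
     .apply("Insert into BigQuery",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run().waitUntilFinish();
  }
}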

cheers

Fabian

On 10.10.2022 at 18:35, Hans Van Akelyen <[email protected]> wrote:

Hi Fabian,

Could you provide a bit more information? In the past couple of weeks, some
major changes have been made to improve performance.
Are you using a Hop local engine configuration when executing the pipeline
executor, or are you trying Beam Direct? If it is the latter, I fear that's
not really supported at the moment, and it is certainly untested.

That being said, Beam Direct is an engine type meant mainly for testing an
implementation, not for actual heavy lifting. I would test the implementation
with a couple of files and do the actual heavy processing using Dataflow,
Spark, or Flink.
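
To make that concrete in plain Beam Java terms (not Hop-specific, just a
rough sketch with placeholder options): the pipeline itself stays the same,
only the runner you hand to the pipeline options changes:

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerChoice {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

    // Quick functional test on a couple of files: Direct runner.
    options.setRunner(DirectRunner.class);

    // Actual heavy lifting: switch the same pipeline to Dataflow (or
    // Spark/Flink) instead of trying to scale the Direct runner. Dataflow
    // additionally needs project, region and a temp location on the options.
    // options.setRunner(DataflowRunner.class);

    Pipeline pipeline = Pipeline.create(options);
    // ... identical transforms in both cases ...
    pipeline.run().waitUntilFinish();
  }
}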

In one of our next releases, we are planning to add an "Advisor" which will
warn about transforms we have not yet tested, or that we know will not always
give the expected results.

Cheers,
Hans

On Mon, 10 Oct 2022 at 10:28, Fabian Peters <[email protected]> wrote:

> Hi all,
>
> I'm trying to process a few hundred Avro files on GCS. They are getting
> decoded and two simple filters are being applied. When running this on
> Beam-Direct, all heap space is getting filled within a minute or two. I
> threw 58 GB at it before giving up.
>
> To limit the number of files getting processed at once, I have moved the
> actual processing into a pipeline executor. Alas, when running on
> Beam-Direct, it looks like the transforms are only initialised but do not
> get executed. This concerns Write to Log, JavaScript, HTTP Client and
> BigQuery Output. Everything behaves as expected when I configure the
> pipeline executor to use the Local runner.
>
> So, two questions: Is the pipeline executor transform incompatible with
> Beam? And, are there other approaches for limiting memory use in such a
> case?
>
> cheers
>
> Fabian
