Re: Scaling Pipelines & Workflows

Hans Van Akelyen Tue, 18 Oct 2022 03:21:25 -0700

Hi Jochen,

Hop Server runs in a JVM, so you can edit the hop-server.sh script to pass
extra options to the JVM, for example to allocate more memory (the default
is 2048 MB).


On parallelisation and scaling: each Hop transform runs in its own thread.
It consumes records on the input side and places them on the output side
after processing. You can increase the number of instances/threads of a
transform by clicking on it and changing the “number of copies”. One thing
to keep in mind: if you add more copies to a table input, for example, it
will execute the query “x” times and return the same rows x times, unless
you add logic to your query to distribute the data over the instances
(for example, use ${Internal.Transform.CopyNr} with a mod function on an
ID column).
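
To make the mod idea concrete, here is a minimal Python sketch of what each
copy would end up reading. It is a simulation only, not Hop code: the rows,
the helper rows_for_copy and the choice of 3 copies are made up for
illustration, and it assumes copy numbers start at 0. The SQL in the comment
is likewise only a possible shape; the table name is hypothetical and the
modulo syntax depends on your database.

```python
# Possible query shape in a table input with 3 copies (assumption, adapt
# the modulo syntax and names to your database):
#   SELECT * FROM source_table
#   WHERE MOD(id, 3) = ${Internal.Transform.CopyNr}

# Hypothetical stand-in for the source table: rows with a numeric ID column.
rows = [{"id": i, "name": f"doc-{i}"} for i in range(1, 11)]


def rows_for_copy(rows, copy_nr, copies):
    """Return the rows a single copy reads when filtering on id modulo copies."""
    return [row for row in rows if row["id"] % copies == copy_nr]


COPIES = 3  # the "number of copies" set on the transform (assumed value)
partitions = [rows_for_copy(rows, nr, COPIES) for nr in range(COPIES)]

for nr, part in enumerate(partitions):
    print(f"copy {nr} reads ids", [row["id"] for row in part])
```

Every row is read by exactly one copy, so the copies together return the
table once instead of x times.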

What we usually see in the field is that CPU is not the bottleneck of
pipelines; usually IO is the limiting factor.
When you look at the status of a pipeline in the Hop Server UI, there are
indications of where the bottleneck is: a field shows the input/output
buffers of each transform. The transform that has the maximum number of
rows (default: 10000) on input and 0 on output is your bottleneck. If you
see no data piling up in the buffers, the pipeline is processing the data
just as fast as it receives it (your database can’t feed rows any faster).

It might be that the pipeline can’t go faster because the DB does not
deliver records any faster, or that the XML writer can’t write faster to
disk than it does.
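
The buffer rule above can be written down as a small check. This is only a
sketch of the reasoning, not a Hop API: the TransformStatus class, the
transform names and the numbers are hypothetical values you would read off
the status page by hand.

```python
from dataclasses import dataclass

MAX_BUFFER_ROWS = 10000  # default buffer size mentioned above


@dataclass
class TransformStatus:
    """Hypothetical record of one transform's buffers as shown in the UI."""
    name: str
    input_buffer: int
    output_buffer: int


def find_bottlenecks(statuses, max_rows=MAX_BUFFER_ROWS):
    """Return transforms whose input buffer is full while their output is empty."""
    return [
        s.name
        for s in statuses
        if s.input_buffer >= max_rows and s.output_buffer == 0
    ]


# Made-up snapshot: the XML transform cannot keep up with the table input.
statuses = [
    TransformStatus("Table input", 0, 10000),
    TransformStatus("Build XML", 10000, 0),
    TransformStatus("Write file", 0, 0),
]
print(find_bottlenecks(statuses))
```

An empty result means no transform is holding rows back, which points at
the source or the target rather than at the pipeline itself.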

When dealing with performance issues:
- The transform metrics will show you which transform is the culprit
- Look at memory/CPU usage (as you are already doing)
- Increase the number of copies/threads, but be mindful of the
implications, as the rows will be split over multiple instances (this
affects input, output, sorting and grouping)
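
To illustrate the last point for grouping, here is a small Python sketch.
It is a simulation under assumptions, not Hop code: it assumes rows are
handed to the copies round-robin, and the helpers round_robin, group_sum
and merge as well as the sample rows are made up. Each copy only sees part
of the rows, so its group totals are partial and need a final merge.

```python
from collections import defaultdict


def round_robin(rows, copies):
    """Split rows over the copies one by one (assumed distribution)."""
    buckets = [[] for _ in range(copies)]
    for index, row in enumerate(rows):
        buckets[index % copies].append(row)
    return buckets


def group_sum(rows):
    """Sum the values per key, as a single grouping instance would."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


def merge(partials):
    """Combine the partial totals of all copies into the final totals."""
    totals = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            totals[key] += value
    return dict(totals)


rows = [("a", 1), ("a", 2), ("b", 3), ("a", 4), ("b", 5), ("b", 6)]

per_copy = [group_sum(bucket) for bucket in round_robin(rows, 2)]
print("per copy:", per_copy)         # partial totals, one dict per copy
print("merged:  ", merge(per_copy))  # matches group_sum(rows)
```

The same applies to sorting: each copy sorts only its own share of the
rows, so the combined output is not globally sorted without a merge step.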

Hope this helps,
Hans

On 18 October 2022 at 11:39:21, Jochen Gatternig ([email protected])
wrote:

Dear all



Are there options/parameters in the Hop server that allow parallelization
and scaling of the processing?

We tested it with a pipeline configuration which reads data from a source
table, creates XMLs and writes them to a filesystem. Additionally, it
copies a document to the very same directory.

Our server has 8 cores (VM).



When running it with a single job, the system caps at 400-450% CPU load.

We then tried modifying the where-clause and running 2-4 jobs separately.
However, each job seems to be capped at 100-150% CPU load.



Any idea how to increase performance?



Regards

Jochen



Best regards



*Jochen Gatternig*

Head of Advisory

Telefon +41 76 431 00 94

[email protected] <[email protected]>
