Hi Ken,

Some ideas that come to mind:
- make sure you use data types that are efficient to serialize and cheap
to compare (ideally primitive types in TupleN, or POJOs)
- Maybe try the Table API batch support (if you have time to experiment).
- configure the TaskManager for a large amount of managed memory, so that
more memory is available for efficient sorting (leading to less spilling):
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/memory/mem_tuning.html#configure-memory-for-batch-jobs
- make sure to configure a separate tmp directory on each SSD, so that we
can spread the spill load across all SSDs.
- If the CPU load on a TM is only 40%, we have to assume we are
IO bound: is it the network or the disk(s)?
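For the first point, here is a minimal sketch (assuming Flink's DataSet API
and hypothetical record layouts / field names) of keying the CoGroup on a
primitive field inside a Tuple2, which keeps serialization compact and key
comparisons cheap during the sort:

```java
import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class CoGroupSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical inputs: (userId, payload) and (userId, score).
        DataSet<Tuple2<Long, String>> left = env.fromElements(Tuple2.of(1L, "a"));
        DataSet<Tuple2<Long, Integer>> right = env.fromElements(Tuple2.of(1L, 7));

        left.coGroup(right)
            .where(0)   // position-based key on the primitive Long field
            .equalTo(0)
            .with(new CoGroupFunction<Tuple2<Long, String>,
                                      Tuple2<Long, Integer>,
                                      Tuple2<Long, String>>() {
                @Override
                public void coGroup(Iterable<Tuple2<Long, String>> l,
                                    Iterable<Tuple2<Long, Integer>> r,
                                    Collector<Tuple2<Long, String>> out) {
                    // join logic goes here
                }
            })
            .print();
    }
}
```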
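For the managed-memory and tmp-directory points, the relevant
flink-conf.yaml settings would look roughly like this (the fraction and the
SSD mount points below are placeholders, adjust them for your TMs):

```
# flink-conf.yaml (Flink 1.11) -- example values only.
# Give the batch operators (sorting, hashing) a larger slice of TM memory:
taskmanager.memory.managed.fraction: 0.7
# One tmp directory per SSD so spill files are spread across both disks
# (mount points are assumptions):
io.tmp.dirs: /mnt/ssd1/flink-tmp,/mnt/ssd2/flink-tmp
```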
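And for the disk-vs-network question, a quick sketch that samples the
kernel counters on a TM host (assumes Linux /proc; `iostat -dx` and
`sar -n DEV` from sysstat give the same numbers with more detail):

```shell
# Sample disk and network counters twice, 5 s apart, and print the rates.
read_disk() { awk '{s += $6 + $10} END {print s}' /proc/diskstats; }      # sectors read+written
read_net()  { awk 'NR > 2 {sub(/^[^:]*:/, ""); s += $1 + $9} END {print s}' /proc/net/dev; }  # rx+tx bytes

d1=$(read_disk); n1=$(read_net)
sleep 5
d2=$(read_disk); n2=$(read_net)
echo "disk sectors/s: $(( (d2 - d1) / 5 ))"
echo "net bytes/s:    $(( (n2 - n1) / 5 ))"
```

Whichever of the two is near the hardware's limit while CPU sits at 40%
is the bottleneck to attack first.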

I hope this gives you some useful starting points for improving the
performance.


On Fri, Sep 4, 2020 at 9:43 PM Ken Krugler <kkrugler_li...@transpac.com>
wrote:

> Hi all,
>
> I added a CoGroup to my batch job, and it’s now running much slower,
> primarily due to back pressure from the CoGroup operator.
>
> I assume it’s because this operator is having to sort/buffer-to-disk all
> incoming data. Looks like about 1TB from one side of the join, currently
> very little from the other but will be up to 2TB in the future.
>
> I don’t see lots of GC, I’m using about 60% of available network buffers,
> per TM server load (for all 8 servers) is about 40% average, and both SSDs
> on each TM are being used for …/flink-io-xxx/yyy.channel files.
>
> What are techniques for improving the performance of a CoGroup?
>
> Thanks!
>
> — Ken
>
> --------------------------
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
>
