Re: Speeding up CoGroup in batch job

2020-09-17 Thread Ken Krugler
Hi Robert, Thanks for the input. I did increase the amount of managed memory, and confirmed that both SSDs (on each slave) are being used for temp data. I haven’t been able to figure out why the server CPU usage is low, but I did notice that it fluctuated from very low (10%) on up to 95+%,

Re: Speeding up CoGroup in batch job

2020-09-11 Thread Robert Metzger
Hi Ken, Some random ideas that pop up in my head:
- make sure you use data types that are efficient to serialize, and cheap to compare (ideally use primitive types in TupleN or POJOs)
- Maybe try the TableAPI batch support (if you have time to experiment).
- optimize memory usage on the

Speeding up CoGroup in batch job

2020-09-04 Thread Ken Krugler
Hi all, I added a CoGroup to my batch job, and it’s now running much slower, primarily due to back pressure from the CoGroup operator. I assume it’s because this operator is having to sort/buffer-to-disk all incoming data. Looks like about 1TB from one side of the join, currently very little