nbali commented on issue #26395:
URL: https://github.com/apache/beam/issues/26395#issuecomment-1532358684

   "assuming there was some reason for that"
   I assume so too, but there is literally no comment or discussion about it anywhere that I could find. Meanwhile, when I just think it through, I can't find any justification for it.
   
   "Have you tested that this is caused by the pointed code path? Would be nice 
if there is some benchmark data to share with"
   Kinda. I didn't launch any pipeline with modified code, so I can't pinpoint this with 100% certainty.
   
   Also, I don't think my pipeline results are consistent or controlled enough to be considered a proper benchmark.
   
   ... but basically what I did was run the pipelines concurrently on the same Kafka streams with different configs. So theoretically the "shuffled" data amount should have been equal. It wasn't. Meanwhile everything else (data read from Kafka, data written into BQ, etc.) was essentially identical. My original goal wasn't to test this, but to actually optimize the pipelines by using non-default config values, as the defaults seemed rather unoptimized based on the GCS storage usage pattern and the average batch size, and sometimes the pipeline couldn't even scale down due to the estimated backlog size (even when CPU was clearly available).
   
   In order to do that I was playing with these configs:
   ```
   options.setMaxBufferingDurationMilliSec(...);
   options.setGcsUploadBufferSizeBytes(...);
   ```
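   For reference, this is roughly how I wire them up (a minimal sketch; the values are just the ones from the runs below, not recommendations, and `BufferingOptions` is a stand-in for whatever options interface actually declares `maxBufferingDurationMilliSec` in my setup, while `gcsUploadBufferSizeBytes` comes from `GcsOptions`):
   ```
   import org.apache.beam.sdk.Pipeline;
   import org.apache.beam.sdk.extensions.gcp.options.GcsOptions;
   import org.apache.beam.sdk.options.Default;
   import org.apache.beam.sdk.options.PipelineOptionsFactory;

   public class BufferingConfigSketch {

     // Stand-in interface for illustration: gcsUploadBufferSizeBytes really lives on GcsOptions,
     // while maxBufferingDurationMilliSec is declared on the options interface my pipeline uses.
     public interface BufferingOptions extends GcsOptions {
       @Default.Integer(30_000)
       Integer getMaxBufferingDurationMilliSec();

       void setMaxBufferingDurationMilliSec(Integer value);
     }

     public static void main(String[] args) {
       BufferingOptions options = PipelineOptionsFactory.as(BufferingOptions.class);
       options.setMaxBufferingDurationMilliSec(30_000);        // 30 s in both runs below
       options.setGcsUploadBufferSizeBytes(16 * 1024 * 1024);  // 2 MB vs 16 MB is the only difference between the runs
       Pipeline pipeline = Pipeline.create(options);
       // ... the actual Kafka -> BQ pipeline is built on top of these options ...
       pipeline.run();
     }
   }
   ```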
   When I increased them, the behaviour, CPU utilization, storage usage pattern, etc. changed in a way that corresponds to having bigger batches, as expected, but I also noticed increased costs due to the "processed data amount". So it was obvious that for some reason the data is handled differently, so I started checking the code for what could cause it, and this seems like the only thing that could.
   
   For example, one of the runs I mentioned:
   Pipeline1:
    * gcsUploadBufferSizeBytes: **2MB**
    * maxBufferingDurationMilliSec: 30s
    * every other config identical
    * Total vCPU time: 2 139,443 vCPU hr
    * Kafka read:
      * Elements added (Approximate): 8 617 148 267
      * Estimated size: 5,89 TB
    * BQ Write:
      * Elements added (Approximate): 8 617 148 273 + 8 617 148 224
      * Estimated size: 5,97 TB + 5,99 TB
    * Total streaming data processed: **26,66 TB**
      * depending on whether both reading and writing count or only one of them, this should be around ~12 TB or ~24 TB, so 26,66 seems okay
   
   Pipeline2:
    * gcsUploadBufferSizeBytes: **16MB**
    * maxBufferingDurationMilliSec: 30s
    * every other config identical
    * Total vCPU time: 2 232,822 vCPU hr
    * Kafka read:
      * Elements added (Approximate): 8 637 509 081
      * Estimated size: 6,16 TB
    * BQ Write:
      * Elements added (Approximate): 8 637 509 008 + 8 637 509 087
      * Estimated size: 6,06 TB + 6,09 TB
    * Total streaming data processed: **46,06 TB**
      * the same ~12 TB / ~24 TB expectation applies here, yet it is ~46 TB
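   Just to spell out where the ~12 TB / ~24 TB expectation comes from (my rough back-of-the-envelope, rounding the BQ write sizes above):
   
   $$
   5.97\ \mathrm{TB} + 5.99\ \mathrm{TB} \approx 12\ \mathrm{TB}\ \text{(shuffled payload counted once)}, \qquad 2 \times 12\ \mathrm{TB} \approx 24\ \mathrm{TB}\ \text{(counted on both the write and the read side)}
   $$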

