Hello,

I am trying to apply a transformation to each row of a reasonably large
(~1 billion row) gzip-compressed CSV.

The first operation is to assign a sequence number, in this case 1, 2, 3, ...

The second operation is the actual transformation.

I would like to assign the sequence number *as* each row is read from the
compressed source, and then hand off the 'real' transformation work in
parallel, letting Dataflow autoscale the workers for the transformation.
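For context, here is a rough sketch of the shape of the pipeline in the Python
SDK (the bucket path, the naive in-DoFn counter, and the placeholder transform
are just illustrations of the two steps I described, not my actual job):

    import apache_beam as beam
    from apache_beam.io import ReadFromText
    from apache_beam.io.filesystem import CompressionTypes


    class AssignSequence(beam.DoFn):
        """Illustrative only: a simple per-instance counter. This is not a
        correct global sequence across bundles/workers in general; it just
        shows where the numbering is meant to happen (at read time)."""
        def __init__(self):
            self._n = 0

        def process(self, row):
            self._n += 1
            yield (self._n, row)


    def transform(numbered_row):
        # Placeholder for the 'real', expensive transformation.
        seq, row = numbered_row
        return (seq, row.upper())


    with beam.Pipeline() as p:
        (p
         | 'Read' >> ReadFromText('gs://my-bucket/input.csv.gz',
                                  compression_type=CompressionTypes.GZIP)
         | 'Number' >> beam.ParDo(AssignSequence())
         | 'Transform' >> beam.Map(transform))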

However, the pipeline does not seem to scale out *until* all rows have been
read: decompression of the entire file appears to block the downstream
stages. Once decompression completes, Dataflow autoscaling works as
expected, the job scales up, and throughput is high. The issue is that the
decompression step appears to block everything else.

My question: in Beam, is it possible to stream records from a compressed
source without blocking the rest of the pipeline?

thank you

.s
