Hello, I am trying to apply a transformation to each row of a reasonably large (~1 billion rows) gzip-compressed CSV.
The first operation assigns a sequence number to each row (1, 2, 3, ...); the second is the actual transformation. I would like to assign the sequence number *as* each row is read from the compressed source, and then hand the "real" transformation work off in parallel, using Dataflow to autoscale the workers for the transformation.

The problem is that the pipeline does not scale until *all* rows have been read: decompression of the entire file appears to block the pipeline. Once decompression completes, Dataflow autoscaling works as expected; it scales up and throughput is high.

My question: in Beam, is it possible to stream records from a compressed source without blocking the pipeline?

Thank you.
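For reference, here is a minimal sketch of the kind of pipeline I mean (Python SDK). The file path and the placeholder transform are illustrative, and the stateful-counter numbering and the Reshuffle fusion break are assumptions on my part about how this might be structured, not a verified implementation:

```python
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.io import ReadFromText
from apache_beam.io.filesystem import CompressionTypes
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec


class AssignSequenceNumber(beam.DoFn):
    """Stateful DoFn keeping a per-key counter. With a single dummy key,
    all rows pass through one state cell, giving a global 1, 2, 3, ...
    sequence (acceptable here, since the gzip read is single-threaded
    anyway because gzip is not a splittable format)."""
    COUNTER = ReadModifyWriteStateSpec('counter', VarIntCoder())

    def process(self, element, counter=beam.DoFn.StateParam(COUNTER)):
        _, row = element
        n = (counter.read() or 0) + 1
        counter.write(n)
        yield (n, row)


def transform_row(numbered_row):
    # Placeholder for the real per-row transformation.
    seq, row = numbered_row
    return '%d,%s' % (seq, row.upper())


with beam.Pipeline() as p:
    (p
     | ReadFromText('gs://my-bucket/input.csv.gz',  # illustrative path
                    compression_type=CompressionTypes.GZIP)
     | beam.Map(lambda row: (0, row))   # single dummy key for the stateful counter
     | beam.ParDo(AssignSequenceNumber())
     | beam.Reshuffle()                 # attempt to break fusion so the transform can scale
     | beam.Map(transform_row))
```

The Reshuffle is my attempt to decouple the single-threaded read/numbering stage from the parallel transformation stage; what I am unsure of is whether records can flow through it while the compressed file is still being read, or whether the non-splittable gzip source inherently forces the whole file to be consumed first.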