Hmm, in my somewhat limited experience I was not able to combine state and
Splittable DoFn. It could definitely be user error on my part, though.

RE sequence numbers, could it work to embed those numbers in the CSV itself?
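
For example, the file could be pre-numbered once, outside the pipeline, while
it is decompressed, so each row already carries its sequence number as a first
column and the Beam job no longer needs a stateful numbering step. A rough
sketch in plain Java (the file names are placeholders):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class PreNumberCsv {
  public static void main(String[] args) throws IOException {
    // Placeholder paths; adjust to the real input/output locations.
    try (BufferedReader in = new BufferedReader(new InputStreamReader(
             new GZIPInputStream(new FileInputStream("input.csv.gz")), StandardCharsets.UTF_8));
         BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
             new GZIPOutputStream(new FileOutputStream("numbered.csv.gz")), StandardCharsets.UTF_8))) {
      long seq = 0;
      String line;
      while ((line = in.readLine()) != null) {
        // Prepend the 1-based sequence number as a new first column.
        out.write(++seq + "," + line);
        out.newLine();
      }
    }
  }
}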

Thanks,
Evan

On Fri, Apr 23, 2021 at 07:55 Simon Gauld <simon.ga...@gmail.com> wrote:

> Thank you, and I will have a look; however, I have some concerns:
>
> - the gzip file itself is not splittable as such
> - I need to apply a sequence number 1..n, so I believe the read *must* be
> sequential
>
> However, what I am looking to achieve is handing off each newly decorated
> row as soon as the sequence number is applied to it. The issue is that the
> entire step of applying the sequence number appears to be blocking. Also of
> note, I am using a @DoFn.StateId.
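>
> To illustrate, the numbering step looks roughly like this (simplified; the
> single constant key and the types are just for illustration):
>
> import org.apache.beam.sdk.coders.VarLongCoder;
> import org.apache.beam.sdk.state.StateSpec;
> import org.apache.beam.sdk.state.StateSpecs;
> import org.apache.beam.sdk.state.ValueState;
> import org.apache.beam.sdk.transforms.DoFn;
> import org.apache.beam.sdk.values.KV;
>
> // State is per key (and per window), so keying every row to one constant key
> // gives a global counter, but it also forces this step to run serially.
> class AssignSequence extends DoFn<KV<Integer, String>, KV<Long, String>> {
>
>   @StateId("seq")
>   private final StateSpec<ValueState<Long>> seqSpec = StateSpecs.value(VarLongCoder.of());
>
>   @ProcessElement
>   public void process(ProcessContext c, @StateId("seq") ValueState<Long> seq) {
>     Long current = seq.read();
>     long next = (current == null ? 0L : current) + 1;  // 1-based sequence number
>     seq.write(next);
>     c.output(KV.of(next, c.element().getValue()));
>   }
> }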
>
> I'll look at SplittableDoFns, thanks.
>
>
> On Fri, Apr 23, 2021 at 12:50 PM Evan Galpin <evan.gal...@gmail.com>
> wrote:
>
>> I could be wrong but I believe that if your large file is being read by a
>> DoFn, it’s likely that the file is being processed atomically inside that
>> DoFn, which cannot be parallelized further by the runner.
>>
>> One purpose-built way around that constraint is Splittable DoFn [1][2],
>> which allows each split to read a portion of the file. I don’t know,
>> however, how this might (or might not) work with compression.
>>
>> [1]
>> https://beam.apache.org/blog/splittable-do-fn-is-available/
>> [2]
>> https://beam.apache.org/documentation/programming-guide/#splittable-dofns
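>>
>> For illustration only, a rough sketch of the file-reading pattern from [2]
>> (not tested; it assumes an uncompressed file that supports random access,
>> so per the caveat above it would not help directly with a gzip source):
>>
>> import java.io.IOException;
>> import java.io.RandomAccessFile;
>> import org.apache.beam.sdk.io.range.OffsetRange;
>> import org.apache.beam.sdk.transforms.DoFn;
>> import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;
>>
>> // Element = file path, restriction = byte range of that file. The runner can
>> // split the remaining byte range between tryClaim() calls, so different
>> // workers end up reading different portions of the same file.
>> @DoFn.BoundedPerElement
>> class ReadLinesFn extends DoFn<String, String> {
>>
>>   @GetInitialRestriction
>>   public OffsetRange initialRestriction(@Element String path) throws IOException {
>>     try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
>>       return new OffsetRange(0, f.length());
>>     }
>>   }
>>
>>   @ProcessElement
>>   public void process(
>>       @Element String path,
>>       RestrictionTracker<OffsetRange, Long> tracker,
>>       OutputReceiver<String> out) throws IOException {
>>     try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
>>       long start = tracker.currentRestriction().getFrom();
>>       f.seek(start);
>>       if (start > 0) {
>>         f.readLine(); // skip the partial line; it belongs to the previous range
>>       }
>>       long lineStart = f.getFilePointer();
>>       String line;
>>       // Claim the start offset of each line before emitting it.
>>       while (tracker.tryClaim(lineStart) && (line = f.readLine()) != null) {
>>         out.output(line);
>>         lineStart = f.getFilePointer();
>>       }
>>     }
>>   }
>> }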
>>
>> Thanks,
>> Evan
>>
>> On Fri, Apr 23, 2021 at 07:34 Simon Gauld <simon.ga...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I am trying to apply a transformation to each row of a reasonably large
>>> (1-billion-row) gzip-compressed CSV.
>>>
>>> The first operation is to assign a sequence number, in this case 1, 2, 3, ...
>>>
>>> The second operation is the actual transformation.
>>>
>>> I would like to apply the sequence number *as* each row is read from the
>>> compressed source and then hand off the 'real' transformation work in
>>> parallel, using Dataflow to autoscale the workers for the transformation.
>>>
>>> I don't seem to be able to scale *until* all rows have been read; the
>>> pipeline appears to be blocked until decompression of the entire file has
>>> completed. At that point Dataflow autoscaling works as expected: it scales
>>> up and throughput is high. The issue is that the decompression appears to
>>> block.
>>>
>>> My question: in Beam, is it possible to stream records from a compressed
>>> source without blocking the pipeline?
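>>>
>>> For reference, the pipeline is structured roughly as follows (simplified;
>>> AssignSequence and TransformRow are placeholder names for the stateful
>>> numbering step and the per-row transformation):
>>>
>>> import org.apache.beam.sdk.Pipeline;
>>> import org.apache.beam.sdk.io.Compression;
>>> import org.apache.beam.sdk.io.TextIO;
>>> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>>> import org.apache.beam.sdk.transforms.ParDo;
>>> import org.apache.beam.sdk.transforms.WithKeys;
>>>
>>> public class SequencePipeline {
>>>   public static void main(String[] args) {
>>>     Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
>>>
>>>     p.apply("ReadGzipCsv",
>>>             TextIO.read().from("gs://my-bucket/input.csv.gz")
>>>                 .withCompression(Compression.GZIP))
>>>      // state needs keyed input; one constant key keeps the numbering global
>>>      .apply("KeyAll", WithKeys.of(0))
>>>      .apply("AssignSequence", ParDo.of(new AssignSequence())) // stateful numbering DoFn
>>>      .apply("Transform", ParDo.of(new TransformRow()));       // the 'real' per-row work
>>>
>>>     p.run().waitUntilFinish();
>>>   }
>>> }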
>>>
>>> thank you
>>>
>>> .s
>>>
>>>
