Hi,

I have a pipeline that reads file paths from Pub/Sub, passes them to AvroIO to 
parse the actual files, and writes the records back out as Parquet.

PubSubIO(filepaths) ----> AvroIO.parseFilesGenericRecords() ----> Window ----> FileWriter

Each file can contain millions of records, which creates a large fan-out in AvroIO.
If I start the pipeline while there is already a backlog of messages in Pub/Sub, 
all the messages are quickly consumed and acknowledged, but the pipeline then 
struggles to parse all the files and memory blows up.
I have tried different types of windows, as well as adding a Reshuffle step 
right after parseFilesGenericRecords(), but without any success.
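
For reference, here is roughly what the pipeline looks like in code. This is a 
simplified sketch rather than our actual job: the subscription, schema, window 
size, shard count and output path are placeholders, and the parse function is 
just an identity mapping on GenericRecord.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.AvroCoder;
    import org.apache.beam.sdk.io.AvroIO;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.io.parquet.ParquetIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Reshuffle;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.joda.time.Duration;

    public class AvroToParquetSketch {
      public static void main(String[] args) {
        // Placeholder Avro schema; the real files have a much larger schema.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Example\",\"fields\":"
                + "[{\"name\":\"id\",\"type\":\"string\"}]}");

        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("ReadFilePaths",
                PubsubIO.readStrings()
                    .fromSubscription("projects/my-project/subscriptions/file-paths"))
            // Expand each file path into matched, readable files.
            .apply("MatchFiles", FileIO.matchAll())
            .apply("ReadMatches", FileIO.readMatches())
            // This is where the fan-out happens: one file -> millions of records.
            .apply("ParseAvro",
                AvroIO.parseFilesGenericRecords((GenericRecord r) -> r)
                    .withCoder(AvroCoder.of(schema)))
            // One of the things I tried to break fusion after the fan-out.
            .apply("Reshuffle", Reshuffle.viaRandomKey())
            // Example of one of the windowing variants I tried.
            .apply("Window", Window.into(FixedWindows.of(Duration.standardMinutes(5))))
            .apply("WriteParquet",
                FileIO.<GenericRecord>write()
                    .via(ParquetIO.sink(schema))
                    .to("gs://my-bucket/output/")
                    .withNumShards(10));

        p.run();
      }
    }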

Any ideas on how to resolve this?

Thanks,
Jean Wisser.


