I'm having trouble getting my Dataflow job to scale.
My dataflow is pretty simple: I get a list of files from Pub/Sub, so the
number of files I'm going to use to feed the flow is known at the
beginning. Let's say I have 200 files containing about 20,000,000 records
in total. Here are my steps:
- *First step:* Read the file contents from storage. The files are .tar.gz
archives, each containing 4 CSV files. I return the content of each
archive as a whole in a single record.
*OUT:* 200 records (one per archive, containing the data of all 4
files). Basically it's a dict: {file1: content_of_file1, file2:
content_of_file2, etc...}
- *Second step:* Join the data of the 4 files into one record (the
main file contains foreign keys used to get information from the other
files).
*OUT:* 20,000,000 records, one for every line in the files. Each record
is a list of strings.
- *Third step:* Clean the data (convert values to prepare them for loading
into BigQuery) and put each record into a dict whose keys are the BigQuery
column names.
*OUT:* 20,000,000 records, each one a dict.
- *Fourth step:* Insert into BigQuery.
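
To make the first three steps concrete, here is a minimal standard-library sketch of the per-archive logic, outside of any Dataflow/Beam code. It is only an illustration under assumptions: the function names, the file names, the column layout (each lookup file keyed by its first column) and the cleaning rule (stripping whitespace) are all hypothetical and not taken from the actual job.

```python
import csv
import io
import tarfile


def read_archive(archive_bytes):
    """Step 1: return one dict {member_name: text_content} per .tar.gz archive."""
    contents = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                raw = tar.extractfile(member).read()
                contents[member.name] = raw.decode("utf-8")
    return contents


def join_records(contents, main_name, lookups):
    """Step 2: yield one list of strings per line of the main file.

    `lookups` maps a lookup file name to the index of the foreign-key
    column in the main file. Assumption (hypothetical layout): the first
    column of each lookup file is its key.
    """
    tables = {}
    for name in lookups:
        rows = csv.reader(io.StringIO(contents[name]))
        tables[name] = {row[0]: row[1:] for row in rows if row}
    for row in csv.reader(io.StringIO(contents[main_name])):
        if not row:
            continue
        joined = list(row)
        for name, fk_index in lookups.items():
            # Rows with no match in the lookup file are left un-extended.
            joined.extend(tables[name].get(row[fk_index], []))
        yield joined


def clean_record(record, columns):
    """Step 3: build a dict keyed by (hypothetical) BigQuery column names."""
    return {column: value.strip() for column, value in zip(columns, record)}
```

With these helpers, one archive (one record out of step 1) fans out into as many dicts as the main file has lines, which is the 200 to 20,000,000 expansion described in the steps above.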
So the first step returns 200 records, but I have 20,000,000 records to
insert.
This takes about an hour and a half and always uses 1 worker...
If I manually set the number of workers, it's not really faster. So for an
unknown reason it doesn't scale. Any ideas how to make it scale?
Thanks for any help.
*Sébastien MORAND*
Team Lead Solution Architect
Technology & Operations / Digital Factory
Veolia - Group Information Systems & Technology (IS&T)