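Do you have a Dataflow job ID we can look at? This might be due to fusion: the first step emits only 200 elements, and if the later steps are fused with it, the fan-out to 20,000,000 records may stay on whichever workers hold those 200 elements. See https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion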
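
If fusion does turn out to be the cause, one common way to break it in the Python SDK is to put a `beam.Reshuffle()` right after the step that fans out. The following is only a rough sketch under assumptions: `join_records`, `to_row`, and `COLUMNS` are hypothetical stand-ins for your second and third steps (your real join logic isn't shown in the thread), and the BigQuery write is replaced by `print` so the example stays self-contained.

```python
# Hypothetical stand-ins for the steps described in the thread; the real
# parsing and join logic is not shown there.
COLUMNS = ["id", "name"]  # assumed BigQuery column names


def join_records(archive):
    """Step 2 (sketch): fan one archive dict out into one list of strings per line."""
    for line in archive["main"].splitlines():
        yield line.split(",")


def to_row(fields):
    """Step 3 (sketch): map a list of strings to a dict keyed by column name."""
    return dict(zip(COLUMNS, fields))


def run(archives, argv=None):
    # Imported here so the helpers above can be tried without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions(argv)) as p:
        (
            p
            | "Archives" >> beam.Create(archives)
            | "Join" >> beam.FlatMap(join_records)
            # Reshuffle ends the fused stage here, so the fanned-out records
            # can be redistributed across workers before the later steps run.
            | "BreakFusion" >> beam.Reshuffle()
            | "Clean" >> beam.Map(to_row)
            # Stand-in for the BigQuery sink described in the thread.
            | "Print" >> beam.Map(print)
        )
```

Calling `run([{"main": "1,a\n2,b"}])` runs the sketch locally on the default runner. Note that the join itself still sees only 200 input elements, so its parallelism is limited to 200 either way; the reshuffle helps the steps that come after the fan-out.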

On Mon, Jun 5, 2017 at 12:13 PM Prabeesh K. <[email protected]> wrote:

> Please try using *--worker_machine_type* n1-standard-4 or n1-standard-8
>
> On 5 June 2017 at 23:08, Morand, Sebastien <[email protected]> wrote:
>
>> I have a problem with my attempts to test scaling on Dataflow.
>>
>> My pipeline is pretty simple: I get a list of files from Pub/Sub, so the
>> number of files I'm going to use to feed the flow is known at the
>> beginning. Let's say I have 200 files containing about 20,000,000
>> records. Here are my steps:
>>
>> - *First step:* Read the file contents from storage. The files are
>>   .tar.gz archives, each containing 4 CSV files. I return the content of
>>   each archive as a whole in one record.
>>   *OUT:* 200 records (one per archive, containing the data of all 4
>>   files). Basically it's a dict: {file1: content_of_file1, file2:
>>   content_of_file2, etc...}
>>
>> - *Second step:* Join the data of the 4 files into one record (the
>>   main file contains foreign keys used to get information from the other
>>   files).
>>   *OUT:* 20,000,000 records, one for every line in the files. Each
>>   record is a list of strings.
>>
>> - *Third step:* Clean the data (convert it to prepare for integration
>>   into BigQuery) and turn each record into a dict whose keys are the
>>   BigQuery column names.
>>   *OUT:* 20,000,000 records, one dict per record.
>>
>> - *Fourth step:* Insert into BigQuery.
>>
>> So the first step returns 200 records, but I have 20,000,000 records to
>> insert.
>> This takes about an hour and a half and always uses 1 worker ...
>>
>> If I manually set the number of workers, it's not really faster. So for
>> an unknown reason, it doesn't scale. Any ideas how to make it scale?
>>
>> Thanks for any help.
>>
>> *Sébastien MORAND*
>> Team Lead Solution Architect
