Windowing doesn't work with batch jobs. You could dump your BQ data to Pub/Sub and then use a streaming job to window. *~Vincent*
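A minimal sketch of the suggestion above, assuming a streaming job that reads the exported records from a Pub/Sub topic (the topic name and the element type are placeholders; the 15-minute window mirrors the file partitioning described below):

```java
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Streaming source: records previously dumped to a Pub/Sub topic (hypothetical name).
PCollection<String> records =
    pipeline.apply(PubsubIO.readStrings().fromTopic("projects/my-project/topics/exported-rows"));

// Fixed 15-minute windows, so downstream GroupByKeys (including the one
// inside FileIO's GatherTempFileResults) operate per-window rather than
// over the whole dataset at once.
PCollection<String> windowed =
    records.apply(Window.into(FixedWindows.of(Duration.standardMinutes(15))));
```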
On Wed, Jul 21, 2021 at 10:13 AM Andrew Kettmann <akettm...@evolve24.com> wrote:

> Worker machines are n1-standard-2s (2 CPUs and 7.5 GB of RAM).
>
> The pipeline is simple but produces a large number of end files: ~125K temp
> files written in at least one case.
>
> 1. Scan Bigtable (NoSQL DB)
> 2. Transform with business logic
> 3. Convert to GenericRecord
> 4. WriteDynamic to a Google bucket as Parquet files partitioned by 15-minute
>    intervals:
>
> gs://bucket/root_dir/CATEGORY/YEAR/MONTH/DAY/HOUR/MINUTE_FLOOR_15/FILENAME.parquet
>
> Everything does fine until it gets to the writeDynamic. When it does the
> GroupByKey
> (FileIO.Write/WriteFiles/GatherTempFileResults/Reshuffle.ViaRandomKey/Reshuffle/GroupByKey),
> the Stackdriver logs show a ton of allocation-failure-triggered GC that then
> frees essentially zero space and never progresses; the job ends up with a
> "The worker lost contact with the service." error four times and then fails.
> Also worth noting that Dataflow scales down to a single worker during this
> time, so it is trying to do it all at once.
>
> Likely I am not hitting GC alerts because I am using a snippet I pulled from
> a GCP Dataflow template that queries Bigtable; it looks to disable the
> GCThrashing monitoring, since Bigtable creates at least 5 objects per row
> scanned:
>
>     DataflowPipelineDebugOptions debugOptions =
>         options.as(DataflowPipelineDebugOptions.class);
>     debugOptions.setGCThrashingPercentagePerPeriod(100.00);
>
> What are my options for splitting this up so that it can process the data in
> smaller chunks? I tried adding windowing, but it didn't seem to help (or I
> needed to do something else besides the windowing), and I don't really have a
> key to group it by here.
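For reference, the writeDynamic step described in the list above might look roughly like the sketch below. This is an assumed reconstruction, not the poster's actual code: `toPartitionPath` and `schema` are hypothetical stand-ins for the business logic that derives the `CATEGORY/YEAR/.../MINUTE_FLOOR_15` path from a record.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;

// `records` is the PCollection<GenericRecord> from step 3; `schema` is its
// Avro schema. `toPartitionPath` (hypothetical) maps a record to its
// CATEGORY/YEAR/MONTH/DAY/HOUR/MINUTE_FLOOR_15 destination string.
records.apply(
    FileIO.<String, GenericRecord>writeDynamic()
        .by(record -> toPartitionPath(record))
        .withDestinationCoder(StringUtf8Coder.of())
        .via(ParquetIO.sink(schema))
        .to("gs://bucket/root_dir/")
        .withNaming(partition -> FileIO.Write.defaultNaming(partition + "/FILENAME", ".parquet")));
```

Every distinct destination string becomes its own group in the GroupByKey inside WriteFiles, which is why ~125K destinations puts so much pressure on that single shuffle step.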
>
> --
> *Andrew Kettmann*
> DevOps Engineer
> P: 1.314.596.2836
> https://www.evolve24.com