Windowing doesn't work with batch jobs. You could dump your Bigtable data to
Pub/Sub and then use a streaming job to window.
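
Roughly, the streaming side could look something like this (a sketch only; the
subscription name is a placeholder and the downstream steps would be your
existing transform/write logic):

    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // p is the streaming Pipeline. Read the exported rows from Pub/Sub and
    // window them into 15-minute panes; the windowed collection then feeds
    // the rest of the pipeline.
    PCollection<String> messages =
        p.apply(PubsubIO.readStrings()
            .fromSubscription("projects/PROJECT/subscriptions/SUBSCRIPTION"));

    PCollection<String> windowed =
        messages.apply(
            Window.<String>into(FixedWindows.of(Duration.standardMinutes(15))));
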
*~Vincent*


On Wed, Jul 21, 2021 at 10:13 AM Andrew Kettmann <akettm...@evolve24.com>
wrote:

> Worker machines are n1-standard-2s (2 vCPUs and 7.5 GB of RAM).
>
> The pipeline is simple, but it produces a large number of output files:
> ~125K temp files were written in at least one case.
>
>    1. Scan Bigtable (NoSQL DB)
>    2. Transform with business logic
>    3. Convert to GenericRecord
>    4. WriteDynamic to a GCS bucket as Parquet files partitioned by
>    15-minute intervals (a rough sketch of this step follows the list).
>    
> (gs://bucket/root_dir/CATEGORY/YEAR/MONTH/DAY/HOUR/MINUTE_FLOOR_15/FILENAME.parquet)
>
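> The write step is roughly along these lines (a simplified sketch, not the
> exact code; partitionPath() is a hypothetical helper standing in for the
> logic that builds the CATEGORY/YEAR/MONTH/DAY/HOUR/MINUTE_FLOOR_15 key, and
> the file naming is approximate):
>
>     import org.apache.avro.Schema;
>     import org.apache.avro.generic.GenericRecord;
>     import org.apache.beam.sdk.coders.StringUtf8Coder;
>     import org.apache.beam.sdk.io.FileIO;
>     import org.apache.beam.sdk.io.parquet.ParquetIO;
>     import org.apache.beam.sdk.values.PCollection;
>
>     PCollection<GenericRecord> records = ...; // output of step 3
>     Schema schema = ...;                      // Avro schema of the GenericRecords
>
>     records.apply(
>         FileIO.<String, GenericRecord>writeDynamic()
>             // Destination key = relative partition path under root_dir.
>             .by(record -> partitionPath(record))
>             .withDestinationCoder(StringUtf8Coder.of())
>             .via(ParquetIO.sink(schema))
>             .to("gs://bucket/root_dir")
>             .withNaming(dest ->
>                 FileIO.Write.defaultNaming(dest + "/part", ".parquet")));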
>
> Everything runs fine until it gets to the writeDynamic. When it hits the
> groupByKey (
> FileIO.Write/WriteFiles/GatherTempFileResults/Reshuffle.ViaRandomKey/Reshuffle/GroupByKey),
> the Stackdriver logs show a ton of allocation-failure-triggered GCs that then
> free up essentially zero space, and the job never progresses; it hits
> "The worker lost contact with the service." four times and then fails. Also
> worth noting that Dataflow scales down to a single worker during this time,
> so it is trying to do it all at once.
>
> I am likely not hitting GC alerts because I am using a snippet I pulled
> from a GCP Dataflow template that queries Bigtable; it appears to disable
> the GC thrashing monitoring, since Bigtable creates at least 5 objects per
> row scanned:
>
>     // Effectively disables Dataflow's GC-thrashing monitor by raising the
>     // allowed GC-time percentage per sampling period to 100%.
>     DataflowPipelineDebugOptions debugOptions =
>         options.as(DataflowPipelineDebugOptions.class);
>     debugOptions.setGCThrashingPercentagePerPeriod(100.00);
>
> What are my options for splitting this up so that it can be processed in
> smaller chunks? I tried adding windowing, but it didn't seem to help (or I
> needed to do something else besides just the windowing), and I don't really
> have a natural key to group by here.
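>
> For concreteness, by "adding windowing" I mean something along these lines
> ahead of the write (a rough sketch only; it assumes each record already
> carries a usable event timestamp):
>
>     import org.apache.avro.generic.GenericRecord;
>     import org.apache.beam.sdk.transforms.windowing.FixedWindows;
>     import org.apache.beam.sdk.transforms.windowing.Window;
>     import org.apache.beam.sdk.values.PCollection;
>     import org.joda.time.Duration;
>
>     // Window the records into 15-minute panes before the writeDynamic step.
>     PCollection<GenericRecord> windowedRecords =
>         records.apply(Window.<GenericRecord>into(
>             FixedWindows.of(Duration.standardMinutes(15))));
>     // ...then apply the writeDynamic step from the sketch above to
>     // windowedRecords instead of records.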
>
> *Andrew Kettmann*
> DevOps Engineer, evolve24 <https://www.evolve24.com>
>
