Without the full logs it's hard to say, but I've definitely seen that error in the past when the worker disks are full. ApplianceShuffleWriter needs to extract a native library to a temp location, and if the disk is full that'll fail, resulting in the NoClassDefFoundError.
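For what it's worth, here is a minimal, self-contained sketch of why a failed native-library extraction would surface as this particular error. `NativeBackedWriter` and `loadNative` are hypothetical stand-ins, not Beam code. I'm assuming the real class does its extraction during static initialization, which is what the "Could not initialize class" wording suggests. The JVM reports the first failure as an ExceptionInInitializerError and every later use of the same class as a NoClassDefFoundError, so the original cause is usually only visible in an earlier log entry on the same worker.

```java
import java.io.File;

public class InitFailureDemo {

    // Hypothetical stand-in for a class that extracts a native library
    // while it is being initialized.
    static class NativeBackedWriter {
        static final int HANDLE = loadNative();

        private static int loadNative() {
            // Simulates the extraction failing, e.g. because the temp dir is full.
            throw new IllegalStateException("simulated: no space left on device");
        }

        static void touch() {
            // Calling any static method triggers class initialization.
        }
    }

    // Tries to use the class and reports which Throwable (if any) came back.
    public static String attempt() {
        try {
            NativeBackedWriter.touch();
            return "ok";
        } catch (Throwable t) {
            return t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // Free space in the JVM temp dir, which is worth checking on a worker.
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        System.out.println("usable bytes in " + tmp + ": " + tmp.getUsableSpace());

        // First use: the initializer runs and fails (ExceptionInInitializerError).
        System.out.println("first attempt:  " + attempt());
        // Later uses: the class is marked unusable (NoClassDefFoundError).
        System.out.println("second attempt: " + attempt());
    }
}
```

If the worker logs show an ExceptionInInitializerError (or an I/O error about the temp directory) shortly before the first NoClassDefFoundError, that would support the full-disk theory.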
On Wed, Apr 6, 2022 at 12:46 PM Chamikara Jayalath <chamik...@google.com> wrote:

> I'm not aware of a breaking change along these lines off the top of my
> head. Sounds like the classes required for Dataflow shuffle are missing
> somehow. Unless someone here can point to something, you might have to
> contact Google Cloud support so they can look at your job.
>
> Thanks,
> Cham
>
> On Wed, Apr 6, 2022 at 9:39 AM Ahmet Altay <al...@google.com> wrote:
>
>> /cc @Chamikara Jayalath <chamik...@google.com>
>>
>> On Tue, Apr 5, 2022 at 11:29 PM Siyu Lin <siyu...@unity3d.com> wrote:
>>
>>> Hi Beam community,
>>>
>>> We have a batch pipeline which does not run regularly. Recently we
>>> have upgraded to Beam 2.36 and this broke the FileIO WriteDynamic
>>> process.
>>>
>>> We are using Dataflow Runner, and the errors are like this when there
>>> are multiple workers:
>>>
>>> Error message from worker: java.lang.NoClassDefFoundError: Could not
>>> initialize class
>>> org.apache.beam.runners.dataflow.worker.ApplianceShuffleWriter
>>> org.apache.beam.runners.dataflow.worker.ShuffleSink.writer(ShuffleSink.java:348)
>>> org.apache.beam.runners.dataflow.worker.SizeReportingSinkWrapper.writer(SizeReportingSinkWrapper.java:46)
>>> org.apache.beam.runners.dataflow.worker.util.common.worker.WriteOperation.initializeWriter(WriteOperation.java:71)
>>> org.apache.beam.runners.dataflow.worker.util.common.worker.WriteOperation.start(WriteOperation.java:78)
>>> org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:92)
>>> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:420)
>>> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:389)
>>> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:314)
>>> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
>>> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
>>> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
>>> java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>> java.lang.Thread.run(Thread.java:748)
>>>
>>> However, when there is only a single worker, the error is like this:
>>>
>>> The job failed because a work item has failed 4 times. Look in
>>> previous log entries for the cause of each one of the 4 failures. For
>>> more information, see
>>> https://cloud.google.com/dataflow/docs/guides/common-errors. The work
>>> item was attempted on these workers: xxx Root cause: The worker lost
>>> contact with the service.,
>>>
>>> The error guide suggested upgrading the machine type.
>>>
>>> Those errors happen when using SDK 2.34+. When I switched to SDK 2.33,
>>> everything worked well without any issues. I tried SDK 2.34, 2.35 and
>>> 2.36, and all of them had the same issue.
>>>
>>> Context: The code simply reads from BigQuery with a fixed table
>>> of 4,034 records, does some transforms, and outputs to GCS with
>>> FileIO.WriteDynamic. All tests were performed using the same machine
>>> type with the same number of workers.
>>>
>>> Does anyone know if there are any breaking changes in this SDK /
>>> Dataflow runner?
>>>
>>> Thanks so much!
>>> Siyu