Without the full logs it's hard to say, but I've definitely seen that error
in the past when the worker disks are full. ApplianceShuffleWriter needs to
extract a native library to a temp location, and if the disk is full that
extraction fails, which surfaces as the NoClassDefFoundError.
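
For what it's worth, "Could not initialize class X" is the JVM's signature
for a static initializer that already failed once: the first use of the
class throws ExceptionInInitializerError carrying the real cause (here,
presumably the IOException from the failed extraction), and every later use
gets only the bare NoClassDefFoundError. A contrived, self-contained sketch
of that pattern (not Beam's actual code; the disk-full failure is simulated
with a flag):

import java.io.IOException;
import java.io.UncheckedIOException;

public class InitFailureDemo {
  static class NativeLibStub {
    // Runs once, on first use of the class. In the real worker this is
    // roughly where a native library gets extracted to a temp directory;
    // here the extraction failure is simulated.
    static {
      boolean diskFull = true; // stand-in for Files.createTempFile(...) throwing
      if (diskFull) {
        throw new UncheckedIOException(new IOException("No space left on device"));
      }
    }

    static void use() {}
  }

  public static void main(String[] args) {
    try {
      NativeLibStub.use(); // first use: ExceptionInInitializerError with the real cause
    } catch (Throwable t) {
      System.out.println("first use:  " + t);
    }
    try {
      NativeLibStub.use(); // later uses: bare NoClassDefFoundError, cause gone
    } catch (Throwable t) {
      System.out.println("second use: " + t);
    }
  }
}

So if it is a full disk, the underlying IOException has likely already
scrolled past in the worker logs, on the first work item that failed.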

On Wed, Apr 6, 2022 at 12:46 PM Chamikara Jayalath <chamik...@google.com>
wrote:

> I'm not aware of a breaking change along these lines off the top of my
> head. Sounds like the classes required for Dataflow shuffle are missing
> somehow. Unless someone here can point to something, you might have to
> contact Google Cloud support so they can look at your job.
>
> Thanks,
> Cham
>
> On Wed, Apr 6, 2022 at 9:39 AM Ahmet Altay <al...@google.com> wrote:
>
>> /cc @Chamikara Jayalath <chamik...@google.com>
>>
>> On Tue, Apr 5, 2022 at 11:29 PM Siyu Lin <siyu...@unity3d.com> wrote:
>>
>>> Hi Beam community,
>>>
>>> We have a batch pipeline that does not run regularly. Recently we
>>> upgraded to Beam 2.36, and this broke the FileIO.writeDynamic()
>>> step.
>>>
>>> We are using the Dataflow runner. When there are multiple workers,
>>> the errors look like this:
>>>
>>> Error message from worker: java.lang.NoClassDefFoundError: Could not
>>> initialize class org.apache.beam.runners.dataflow.worker.ApplianceShuffleWriter
>>> org.apache.beam.runners.dataflow.worker.ShuffleSink.writer(ShuffleSink.java:348)
>>> org.apache.beam.runners.dataflow.worker.SizeReportingSinkWrapper.writer(SizeReportingSinkWrapper.java:46)
>>> org.apache.beam.runners.dataflow.worker.util.common.worker.WriteOperation.initializeWriter(WriteOperation.java:71)
>>> org.apache.beam.runners.dataflow.worker.util.common.worker.WriteOperation.start(WriteOperation.java:78)
>>> org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:92)
>>> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:420)
>>> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:389)
>>> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:314)
>>> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
>>> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
>>> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
>>> java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>> java.lang.Thread.run(Thread.java:748)
>>>
>>> However, when there is only a single worker, the error looks like this:
>>>
>>> The job failed because a work item has failed 4 times. Look in
>>> previous log entries for the cause of each one of the 4 failures. For
>>> more information, see
>>> https://cloud.google.com/dataflow/docs/guides/common-errors. The work
>>> item was attempted on these workers: xxx Root cause: The worker lost
>>> contact with the service.,
>>>
>>> The linked error guide suggested upgrading the machine type.
>>>
>>> These errors happen with SDK 2.34+. When I switched back to SDK 2.33,
>>> everything worked without any issues. I tried SDK 2.34, 2.35, and
>>> 2.36, and all of them hit the same issue.
>>>
>>> Context: the code simply reads a fixed table of 4,034 records from
>>> BigQuery, does some transforms, and writes to GCS with
>>> FileIO.writeDynamic(). All tests were performed with the same machine
>>> type and the same number of workers.
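>>>
>>> In shape, the pipeline is roughly the following (the table, key field,
>>> and bucket names here are placeholders, not our actual code):
>>>
>>> import com.google.api.services.bigquery.model.TableRow;
>>> import org.apache.beam.sdk.Pipeline;
>>> import org.apache.beam.sdk.coders.StringUtf8Coder;
>>> import org.apache.beam.sdk.io.FileIO;
>>> import org.apache.beam.sdk.io.TextIO;
>>> import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
>>> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>>> import org.apache.beam.sdk.transforms.Contextful;
>>> import org.apache.beam.sdk.transforms.MapElements;
>>> import org.apache.beam.sdk.values.KV;
>>> import org.apache.beam.sdk.values.TypeDescriptors;
>>>
>>> public class WriteDynamicJob {
>>>   public static void main(String[] args) {
>>>     Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
>>>
>>>     p.apply("ReadBQ", BigQueryIO.readTableRows().from("my-project:my_dataset.my_table"))
>>>         // Key each record by the field that decides which file it lands in.
>>>         .apply("ToKV", MapElements
>>>             .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
>>>             .via((TableRow row) -> KV.of((String) row.get("group"), row.toString())))
>>>         .apply("WriteDynamic", FileIO.<String, KV<String, String>>writeDynamic()
>>>             .by(KV::getKey)                                   // one destination per key
>>>             .via(Contextful.fn(KV::getValue), TextIO.sink())  // write values as text lines
>>>             .to("gs://my-bucket/output")
>>>             .withDestinationCoder(StringUtf8Coder.of())
>>>             .withNaming(key -> FileIO.Write.defaultNaming("part-" + key, ".txt")));
>>>
>>>     p.run();
>>>   }
>>> }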
>>>
>>> Does anyone know of any breaking changes in these SDK versions or the
>>> Dataflow runner?
>>>
>>> Thanks so much!
>>> Siyu
>>>
>>
