Re: Breaking change for FileIO WriteDynamic in Beam 2.34?

2022-04-06 Thread Steve Niemitz
Without the full logs it's hard to say, but I've definitely seen that error
in the past when the worker disks are full.  ApplianceShuffleWriter needs
to extract a native library to a temp location, and if the disk is full
that'll fail, resulting in the NoClassDefFoundError.
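
If full disks do turn out to be the cause, one thing worth trying is giving the
workers more disk (or a bigger machine). A rough sketch, not a confirmed fix;
the 100 GB value and the machine type below are placeholders, not recommendations:

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class BiggerWorkerDisks {
  public static void main(String[] args) {
    // Same effect as passing --diskSizeGb=100 on the command line.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation()
            .as(DataflowPipelineOptions.class);
    options.setDiskSizeGb(100);  // larger persistent disk per worker (placeholder value)
    // options.setWorkerMachineType("n1-standard-4");  // or a bigger machine type

    Pipeline p = Pipeline.create(options);
    // ... build the rest of the pipeline as usual, then p.run()
  }
}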

On Wed, Apr 6, 2022 at 12:46 PM Chamikara Jayalath wrote:

> I'm not aware of a breaking change along these lines off the top of my
> head. Sounds like the classes required for Dataflow shuffle are missing
> somehow. Unless someone here can point to something, you might have to
> contact Google Cloud support so they can look at your job.
>
> Thanks,
> Cham
>
> On Wed, Apr 6, 2022 at 9:39 AM Ahmet Altay  wrote:
>
>> /cc @Chamikara Jayalath 
>>
>> On Tue, Apr 5, 2022 at 11:29 PM Siyu Lin  wrote:
>>
>>> Hi Beam community,
>>>
>>> We have a batch pipeline that does not run regularly. We recently
>>> upgraded to Beam 2.36, and this broke the FileIO.writeDynamic
>>> step.
>>>
>>> We are using Dataflow Runner, and the errors are like this when there
>>> are multiple workers:
>>>
>>> Error message from worker: java.lang.NoClassDefFoundError: Could not
>>> initialize class
>>> org.apache.beam.runners.dataflow.worker.ApplianceShuffleWriter
>>>
>>> org.apache.beam.runners.dataflow.worker.ShuffleSink.writer(ShuffleSink.java:348)
>>>
>>> org.apache.beam.runners.dataflow.worker.SizeReportingSinkWrapper.writer(SizeReportingSinkWrapper.java:46)
>>>
>>> org.apache.beam.runners.dataflow.worker.util.common.worker.WriteOperation.initializeWriter(WriteOperation.java:71)
>>>
>>> org.apache.beam.runners.dataflow.worker.util.common.worker.WriteOperation.start(WriteOperation.java:78)
>>>
>>> org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:92)
>>>
>>> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:420)
>>>
>>> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:389)
>>>
>>> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:314)
>>>
>>> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
>>>
>>> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
>>>
>>> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
>>> java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>> java.lang.Thread.run(Thread.java:748)
>>>
>>> However, when there is only a single worker, the error is like this:
>>>
>>> The job failed because a work item has failed 4 times. Look in
>>> previous log entries for the cause of each one of the 4 failures. For
>>> more information, see
>>> https://cloud.google.com/dataflow/docs/guides/common-errors. The work
>>> item was attempted on these workers: xxx Root cause: The worker lost
>>> contact with the service.,
>>>
>>> The error guidance suggested upgrading the machine type.
>>>
>>> These errors happen with SDK 2.34 and later. When I switched back to
>>> SDK 2.33, everything worked without any issues. I tried SDKs 2.34,
>>> 2.35, and 2.36, and all of them hit the same issue.
>>>
>>> Context: The code simply reads a fixed table of 4,034 records from
>>> BigQuery, applies some transforms, and writes the output to GCS with
>>> FileIO.writeDynamic. All tests were performed with the same machine
>>> type and the same number of workers.
>>>
>>> Does anyone know if there are any breaking changes in this SDK /
>>> Dataflow runner?
>>>
>>> Thanks so much!
>>> Siyu
>>>
>>


Breaking change for FileIO WriteDynamic in Beam 2.34?

2022-04-06 Thread Siyu Lin
Hi Beam community,

We have a batch pipeline that does not run regularly. We recently
upgraded to Beam 2.36, and this broke the FileIO.writeDynamic
step.

We are using Dataflow Runner, and the errors are like this when there
are multiple workers:

Error message from worker: java.lang.NoClassDefFoundError: Could not
initialize class
org.apache.beam.runners.dataflow.worker.ApplianceShuffleWriter
org.apache.beam.runners.dataflow.worker.ShuffleSink.writer(ShuffleSink.java:348)
org.apache.beam.runners.dataflow.worker.SizeReportingSinkWrapper.writer(SizeReportingSinkWrapper.java:46)
org.apache.beam.runners.dataflow.worker.util.common.worker.WriteOperation.initializeWriter(WriteOperation.java:71)
org.apache.beam.runners.dataflow.worker.util.common.worker.WriteOperation.start(WriteOperation.java:78)
org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:92)
org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:420)
org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:389)
org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:314)
org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)

However, when there is only a single worker, the error is like this:

The job failed because a work item has failed 4 times. Look in
previous log entries for the cause of each one of the 4 failures. For
more information, see
https://cloud.google.com/dataflow/docs/guides/common-errors. The work
item was attempted on these workers: xxx Root cause: The worker lost
contact with the service.,

The error guidance suggested upgrading the machine type.

These errors happen with SDK 2.34 and later. When I switched back to
SDK 2.33, everything worked without any issues. I tried SDKs 2.34,
2.35, and 2.36, and all of them hit the same issue.

Context: The code simply reads a fixed table of 4,034 records from
BigQuery, applies some transforms, and writes the output to GCS with
FileIO.writeDynamic. All tests were performed with the same machine
type and the same number of workers.
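
In case it helps to reproduce this, here is a minimal sketch of the pipeline
shape described above. It is not our actual code: the project, dataset, bucket,
and the "category" field used as the dynamic destination key are made up for
illustration.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Contextful;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class WriteDynamicRepro {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadFromBigQuery",
            BigQueryIO.readTableRows().from("my-project:my_dataset.small_table"))
     // Key each record by the field that decides its output destination.
     .apply("KeyByCategory", MapElements
         .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
         .via((TableRow row) -> KV.of((String) row.get("category"), row.toString())))
     // One group of output files per destination key under the GCS prefix.
     .apply("WriteDynamic", FileIO.<String, KV<String, String>>writeDynamic()
         .by(KV::getKey)
         .via(Contextful.fn(KV::getValue), TextIO.sink())
         .withDestinationCoder(StringUtf8Coder.of())
         .withNaming(key -> FileIO.Write.defaultNaming("out-" + key, ".txt"))
         .to("gs://my-bucket/output/"));

    p.run().waitUntilFinish();
  }
}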

Does anyone know if there are any breaking changes in this SDK /
Dataflow runner?

Thanks so much!
Siyu