Re: Possible 2.36.0 regression with CoGbkResult

Steve Niemitz Thu, 24 Mar 2022 13:05:03 -0700

We were able to reproduce the issue on our end as well in some of our jobs
as well.  We're going to revert that commit and see if that fixes it for us
too.


On Wed, Mar 23, 2022 at 3:47 PM Robert Bradshaw <rober...@google.com> wrote:

> If it hadn't been specifically requested by some customers, and been
> out for multiple releases, rolling back would be more palatable. I'll
> try to see what specifically is going on here.
>
> On Wed, Mar 23, 2022 at 10:26 AM Lara Schmidt <laraschm...@google.com>
> wrote:
> >
> > I rolled back https://github.com/apache/beam/pull/16354/files and it
> succeeded 4 times in a row (after failing 1/2 attempts before the
> rollback). Likely it's https://github.com/apache/beam/pull/16354/files.
> We probably should roll back this change?
> >
> > On Wed, Mar 23, 2022 at 1:11 AM Niel Markwick <ni...@google.com> wrote:
> >>
> >> Some more data points for my reproducer with 2.37.0, which confirms
> what Claire has already found:
> >>
> >> * Fails most of the time on Dataflow v1 with  default options
> >> * Fails most of the time on Dataflow v1 with shuffle service disabled
> (`--experiments=shuffle_mode=appliance`)
> >> * Succeeds on Dataflow v2
> >> * Succeeds on Dataflow Prime
> >> * Succeeds on DirectRunner
> >>
> >> So it is definitely runner-dependent, which may help narrow down the
> issue.
> >>
> >>
> >>
> >> --
> >>  •  Niel Markwick
> >>  •  Cloud Solutions Architect
> >>  •  Google Belgium
> >>  •  ni...@google.com
> >>  •  +32 2 894 6771
> >>
> >> Google Belgium NV/SA, Steenweg op Etterbeek 180, 1040 Brussel, Belgie.
> RPR: 0878.065.378
> >>
> >> If you have received this communication by mistake, please don't
> forward it to anyone else (it may contain confidential or privileged
> information), please erase all copies of it, including all attachments, and
> please let the sender know it went to the wrong person. Thanks
> >>
> >>
> >> On Wed, 23 Mar 2022 at 04:31, Claire McGinty <
> claire.d.mcgi...@gmail.com> wrote:
> >>>
> >>> (Forgot to add, this was with the transform coder wrapped with
> NullableCoder so it didn’t fail outright.)
> >>>
> >>> - Claire
> >>>
> >>> On Tue, Mar 22, 2022 at 11:26 PM Claire McGinty <
> claire.d.mcgi...@gmail.com> wrote:
> >>>>
> >>>> I was wondering if there’s any difference in how Dataflow v1 vs v2
> loads CoGbkResult iterables - in one of our internal pipielines that
> started failing, I added some logging to CoGbkResult and found that the
> TagIterable’s iterator would return hasNext() == true, but next() == null.
> >>>>
> >>>> - Claire
> >>>>
> >>>> On Tue, Mar 22, 2022 at 6:13 PM Robert Bradshaw <rober...@google.com>
> wrote:
> >>>>>
> >>>>> BEAM-13541 is as far as I'm aware the only major change that has
> >>>>> happened in this code recently, but this should be totally agnostic
> of
> >>>>> Dataflow v1 vs. v2 vs. DirectRunner. Have there been any changes to
> >>>>> the shuffle/reiterable code?
> >>>>>
> >>>>> On Tue, Mar 22, 2022 at 2:37 PM Steve Niemitz <sniem...@apache.org>
> wrote:
> >>>>> >
> >>>>> > oh I just realized I responded to a different thread, feel free to
> ignore me.
> >>>>> >
> >>>>> > On Tue, Mar 22, 2022 at 5:34 PM Niel Markwick <ni...@google.com>
> wrote:
> >>>>> >>
> >>>>> >> yeah there does seem to be some heisenbug attributes...
> >>>>> >> It failed on 4 out of 6 runs with this  reproducer, and always
> succeeded with DirectRunner or 2.35.0...
> >>>>> >> It does seem to depend on the number of elements in the group.
> >>>>> >>
> >>>>> >>
> >>>>> >> On Tue, 22 Mar 2022 at 22:27, Steve Niemitz <sniem...@apache.org>
> wrote:
> >>>>> >>>
> >>>>> >>> Your email was actually what made me notice this! :D
> >>>>> >>>
> >>>>> >>> I haven't been able to reproduce the NPE you found (also on
> 2.37) but that certainly doesn't mean it's not a bug.
> >>>>> >>>
> >>>>> >>>
> >>>>> >>>
> >>>>> >>> On Tue, Mar 22, 2022 at 5:23 PM Niel Markwick <ni...@google.com>
> wrote:
> >>>>> >>>>
> >>>>> >>>> I have also seen this with Java beam 2.36.0 and 2.37.0, again
> with large groups...
> >>>>> >>>>
> >>>>> >>>> The behaviour is that some elements become null, which then
> causes either coder problems for the primitive types (as seen in Claire's
> stack trace), or downstream issues for custom types which are not expecting
> null elements...
> >>>>> >>>>
> >>>>> >>>> I have a relatively simple Java reproducer creating 200,000 KVs
> and doing a CoGroupByKey with a single KV. This runs fine on DirectRunner,
> and on Dataflow with 2.35.0, but fails on Dataflow on 2.36.0 and 2.47.0
> with the same error: "org.apache.beam.sdk.coders.CoderException: cannot
> encode a null String" in the CoGroupByKey step
> >>>>> >>>>
> >>>>> >>>>     Pipeline p =
> Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
> >>>>> >>>>     String[] bigarray = new String[200000];
> >>>>> >>>>     Arrays.fill(bigarray, "Hello World");
> >>>>> >>>>     PCollection<KV<Integer, String>> right =
> >>>>> >>>>         p.apply(Create.of(Arrays.asList(bigarray)))
> >>>>> >>>>             .apply(
> >>>>> >>>>                 MapElements.into(new TypeDescriptor<KV<Integer,
> String>>() {})
> >>>>> >>>>                     .via(input -> KV.of(100, input)));
> >>>>> >>>>     PCollection<KV<Integer, String>> left =
> p.apply(Create.of(KV.of(100, "goodbye world")));
> >>>>> >>>>
> >>>>> >>>>     TupleTag<String> leftTag = new TupleTag<String>();
> >>>>> >>>>     TupleTag<String> rightTag = new TupleTag<String>();
> >>>>> >>>>
> >>>>> >>>>     PCollection<KV<Integer, CoGbkResult>> joinedResult =
> >>>>> >>>>         KeyedPCollectionTuple.of(leftTag, left)
> >>>>> >>>>             .and(rightTag, right)
> >>>>> >>>>             .apply("Join By Key",
> CoGroupByKey.<Integer>create());
> >>>>> >>>>
> >>>>> >>>>     joinedResult.apply(
> >>>>> >>>>         "Report",
> >>>>> >>>>         MapElements.into(TypeDescriptor.of(Void.class))
> >>>>> >>>>             .via(
> >>>>> >>>>                 (KV<Integer, CoGbkResult> element) -> {
> >>>>> >>>>                   Iterable<String> leftValues =
> element.getValue().getAll(leftTag);
> >>>>> >>>>                   Iterable<String> rightValues =
> element.getValue().getAll(rightTag);
> >>>>> >>>>
> >>>>> >>>>                   System.out.println("Key = " +
> element.getKey());
> >>>>> >>>>                   System.out.println("left size= " +
> Iterables.size(leftValues));
> >>>>> >>>>                   System.out.println("right size= " +
> Iterables.size(rightValues));
> >>>>> >>>>                   return (Void) null;
> >>>>> >>>>                 }));
> >>>>> >>>>
> >>>>> >>>>     p.run().waitUntilFinish();
> >>>>> >>>>
> >>>>> >>>>
> >>>>> >>>> On Fri, 18 Mar 2022, 16:01 Reuven Lax, <re...@google.com>
> wrote:
> >>>>> >>>>>
> >>>>> >>>>>
> >>>>> >>>>> On Fri, Mar 18, 2022, 8:28 AM Claire McGinty <
> claire.d.mcgi...@gmail.com> wrote:
> >>>>> >>>>>>
> >>>>> >>>>>> To add more investigative context, we discovered that these
> jobs succeed when run with Dataflow Prime; it's just Beam 2.36.0 + Dataflow
> V1 that produces this regression. Unfortunately it's not feasible to
> upgrade our whole fleet to Dataflow Prime right now, so we've created an
> internal Google support ticket for this, too.
> >>>>> >>>>>>
> >>>>> >>>>>> - Claire
> >>>>> >>>>>>
> >>>>> >>>>>> On Tue, Mar 15, 2022 at 12:22 PM Claire McGinty <
> claire.d.mcgi...@gmail.com> wrote:
> >>>>> >>>>>>>
> >>>>> >>>>>>> Hi! No, there aren't null strings in the data--it would have
> thrown a NPE much earlier if that were the case, and likely failed when run
> with Beam 2.35.0 also, unless null-handling changed too?
> >>>>> >>>>>>>
> >>>>> >>>>>>> - Claire
> >>>>> >>>>>>>
> >>>>> >>>>>>> On Tue, Mar 15, 2022 at 12:19 PM Reuven Lax <
> re...@google.com> wrote:
> >>>>> >>>>>>>>
> >>>>> >>>>>>>> Is it expected to have null strings in your data?
> >>>>> >>>>>>>>
> >>>>> >>>>>>>> On Tue, Mar 15, 2022 at 9:06 AM Claire McGinty <
> claire.d.mcgi...@gmail.com> wrote:
> >>>>> >>>>>>>>>
> >>>>> >>>>>>>>> Hi Beam devs!
> >>>>> >>>>>>>>>
> >>>>> >>>>>>>>> I wanted to surface a possible regression with
> CoGroupByKeys in this transform. Our organization (which primarily uses
> Scio) has had several pipelines start failing after upgrading from Beam
> 2.35.0 to Beam 2.36.0, specifically during cogroup operations with large
> (>10,000) key groups.
> >>>>> >>>>>>>>>
> >>>>> >>>>>>>>> We initially saw this problem using Scio join operations,
> but I was able to reproduce using public datasets & Beam PTransforms
> directly (job code link; the repo README includes instructions to run it if
> anyone is interested). This job throws the following error when run in
> Dataflow with Beam 2.36.0:
> >>>>> >>>>>>>>>
> >>>>> >>>>>>>>> Caused by: java.lang.IllegalArgumentException: Unable to
> encode element
> '[org.apache.beam.sdk.transforms.join.CoGbkResult$TagIterable@33dd2cbb,
> org.apache.beam.sdk.transforms.join.CoGbkResult$TagIterable@26edfb82]'
> with coder
> 'org.apache.beam.sdk.transforms.join.CoGbkResult$CoGbkResultCoder@346927'.
> org.apache.beam.sdk.coders.Coder.getEncodedElementByteSize(Coder.java:300)
> >>>>> >>>>>>>>>
> org.apache.beam.sdk.coders.Coder.registerByteSizeObserver(Coder.java:291)
> >>>>> >>>>>>>>>
> org.apache.beam.sdk.coders.KvCoder.registerByteSizeObserver(KvCoder.java:130)
> >>>>> >>>>>>>>>
> org.apache.beam.sdk.coders.KvCoder.registerByteSizeObserver(KvCoder.java:37)
> >>>>> >>>>>>>>>
> org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.registerByteSizeObserver(WindowedValue.java:642)
> >>>>> >>>>>>>>>
> org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.registerByteSizeObserver(WindowedValue.java:558)
> >>>>> >>>>>>>>>
> org.apache.beam.runners.dataflow.worker.IntrinsicMapTaskExecutorFactory$ElementByteSizeObservableCoder.registerByteSizeObserver(IntrinsicMapTaskExecutorFactory.java:403)
> >>>>> >>>>>>>>>
> org.apache.beam.runners.dataflow.worker.util.common.worker.OutputObjectAndByteCounter.update(OutputObjectAndByteCounter.java:128)
> >>>>> >>>>>>>>>
> org.apache.beam.runners.dataflow.worker.DataflowOutputCounter.update(DataflowOutputCounter.java:67)
> >>>>> >>>>>>>>>
> org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:43)
> >>>>> >>>>>>>>>
> org.apache.beam.runners.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:285)
> >>>>> >>>>>>>>>
> org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:268)
> >>>>> >>>>>>>>>
> org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.access$900(SimpleDoFnRunner.java:84)
> >>>>> >>>>>>>>>
> org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:416)
> >>>>> >>>>>>>>>
> org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:404)
> >>>>> >>>>>>>>>
> org.apache.beam.sdk.transforms.join.CoGroupByKey$ConstructCoGbkResultFn.processElement(CoGroupByKey.java:192)
> >>>>> >>>>>>>>> Caused by: org.apache.beam.sdk.coders.CoderException:
> cannot encode a null String
> org.apache.beam.sdk.coders.StringUtf8Coder.encode(StringUtf8Coder.java:74)
> >>>>> >>>>>>>>>
> >>>>> >>>>>>>>> To troubleshoot, I first tried downgrading to Beam 2.35.0;
> the job then ran successfully. Next, I tried running the job using Beam
> 2.36.0, but with 2.35.0's version of CoGbkResult on the classpath; it ran
> successfully then, too. I'm wondering if it's related to the lazy-iterator
> improvements introduced in BEAM-13541? I wasn't able to reproduce it with a
> unit test yet, unfortunately; maybe it's unique to the way Dataflow
> lazily-loads large CoGbkResults.
> >>>>> >>>>>>>>>
> >>>>> >>>>>>>>> I'm continuing to try to reproduce this on a smaller
> scale, but wanted to flag it here for now; it's preventing our pipeline
> fleet from upgrading to Beam 2.36.0. Any insight would be appreciated!
> >>>>> >>>>>>>>>
> >>>>> >>>>>>>>> Thanks,
> >>>>> >>>>>>>>> Claire
> >>>>> >>>>>>>>>
> >>>>> >>>>>>>>>
> >>>>> >>>>>>>>>
> >>>>> >>>>>>>>>
>

Re: Possible 2.36.0 regression with CoGbkResult

Reply via email to