I have not had an easy time following all the discussion points here. It
seems that the main issue is really something that has been true since
before Beam started: " In Beam, the user is not supposed to modify the input
collection and if they do, it's undefined behavior." and after Beam started
we added features to help: "This is the reason the DirectRunner checks for
this, to make sure the users are not relying on it."

This is all true. A runner can do "chaining" or "fusion" with zero copy and
this is allowed. An SDK harness can do the same. If a user mutates an input
to a DoFn, or mutates a value after it is output, that is user error. A
runner that eagerly cloning elements is wasting time and we should remove
that.

Kenn

On Thu, Oct 29, 2020 at 8:03 AM Teodor Spæren <teodor_spae...@riseup.net>
wrote:

> Thanks Jan, this cleared some things up!
>
> Best regards,
> Teodor Spæren
>
> On Thu, Oct 29, 2020 at 02:13:50PM +0100, Jan Lukavský wrote:
> >Hi Teodor,
> >
> >the confusion here maybe comes from the fact, that there are two
> >(logical) representations of an element in PCollection. One
> >representation is the never mutable (most probably serialized in a
> >binary form) form of a PCollection element, where no modifications are
> >possible. Once a PCollection is created (e.g. read from source, or
> >created by a PTransform) it cannot be modified further. The second
> >form is an SDK-dependent representation of each PCollection element in
> >user code. This representation is what UDFs work with. The same source
> >(binary) form of element can have (and will have) different
> >representation in Java SDK and in Python SDK. The Beam model says
> >nothing about mutability of this SDK-dependent form. Nevertheless,
> >even if you modify this element, it has no impact on the source
> >representation. But, it can lead to SDK-dependent errors, when the
> >element is mutated in a way that a runner might not expect.
> >
> >Hope this helps.
> >
> > Jan
> >
> >On 10/29/20 1:58 PM, Teodor Spæren wrote:
> >>Hey!
> >>
> >>Just so I understand this correctly then, what does the following
> >>quote from [1], section 3.2.3 mean:
> >>
> >>A PCollection is immutable. Once created, you cannot add, remove, or
> >>change individual elements. A Beam Transform might process each
> >>element of a PCollection and generate new pipeline data (as a new
> >>PCollection), *but it does not consume or modify the original input
> >>collection.*
> >>
> >>(Don't know what the normal way of highlighting is on mailing lists,
> >>so I just put it between *)
> >>
> >>I read this as meaning that it is the users responsibilty to make
> >>sure that their transformations do not modify the input, but should
> >>I rather read it as meaning the beam runner itself should make sure
> >>the user cannot make such a mistake? I find this reading at odds
> >>with the documentation about the direct runner and it's express
> >>purpose being to make sure users doesn't rely on semantics the beam
> >>model doesn't ensure. And modifying of input arguments being one of
> >>the constraints listed. [2].
> >>
> >>It doesn't change the outcome here, adding an opt out switch, but if
> >>I've missunderstood the quote above, I think this might benefit by
> >>being reworded, to make sure it is communicated that shooting
> >>yourself in the foot is impossible and the direct runner testing of
> >>modifying input should be removed, as there is no point in users
> >>making sure to not modifying the input if all runners guarantee it.
> >>
> >>
> >>Also, I ran the whole Flink test suite with a simple return instead
> >>of the deep copy and all tests passed, so there is no such test in
> >>there. Depending on the reading above, we should add such tests to
> >>all runners.
> >>
> >>Best regards,
> >>Teodor Spæren
> >>
> >>On Thu, Oct 29, 2020 at 10:16:30AM +0100, Maximilian Michels wrote:
> >>>>Ok then we are on the same page, but I disagree with your
> >>>>conclusion. The reason Flink has to do the deep copy is that it
> >>>>doesn't state that the inputs are immutable and should not be
> >>>>changed, and so have to do the deep copy. In Beam, the user is
> >>>>not supposed to modify the input collection and if they do, it's
> >>>>undefined behavior. This is the reason the DirectRunner checks
> >>>>for this, to make sure the users are not relying on it.
> >>>
> >>>It's not written anywhere that the input cannot be mutated. A
> >>>DirectRunner test is not a proof. Any runner could add a test
> >>>which proves the opposite. In fact we may have one that checks
> >>>copying for Flink.
> >>>
> >>>I prefer safety and correctness over performance because I've seen
> >>>too many cases where users shoot themselves in the foot. We should
> >>>make sure that, by default, the user cannot modify the input
> >>>element. An option to disable that is fine.
> >>>
> >>>-Max
>

Reply via email to