Hi all, Our team recently did a similar experiment and came to a similar observation as what Teodor did. The Beam slack channel points me to email thread discussion.
It seems like there is a jira issue created and Teodor had a PR. May I ask what is the decision on that and if the PR is approved? May I also have the link to the jira issue? Much appreciated. Antonio. On 2020/10/30 20:08:32, Kenneth Knowles <[email protected]> wrote: > I have not had an easy time following all the discussion points here. It > seems that the main issue is really something that has been true since > before Beam started: " In Beam, the user is not supposed to modify the input > collection and if they do, it's undefined behavior." and after Beam started > we added features to help: "This is the reason the DirectRunner checks for > this, to make sure the users are not relying on it." > > This is all true. A runner can do "chaining" or "fusion" with zero copy and > this is allowed. An SDK harness can do the same. If a user mutates an input > to a DoFn, or mutates a value after it is output, that is user error. A > runner that eagerly cloning elements is wasting time and we should remove > that. > > Kenn > > On Thu, Oct 29, 2020 at 8:03 AM Teodor Spæren <[email protected]> > wrote: > > > Thanks Jan, this cleared some things up! > > > > Best regards, > > Teodor Spæren > > > > On Thu, Oct 29, 2020 at 02:13:50PM +0100, Jan Lukavský wrote: > > >Hi Teodor, > > > > > >the confusion here maybe comes from the fact, that there are two > > >(logical) representations of an element in PCollection. One > > >representation is the never mutable (most probably serialized in a > > >binary form) form of a PCollection element, where no modifications are > > >possible. Once a PCollection is created (e.g. read from source, or > > >created by a PTransform) it cannot be modified further. The second > > >form is an SDK-dependent representation of each PCollection element in > > >user code. This representation is what UDFs work with. The same source > > >(binary) form of element can have (and will have) different > > >representation in Java SDK and in Python SDK. The Beam model says > > >nothing about mutability of this SDK-dependent form. Nevertheless, > > >even if you modify this element, it has no impact on the source > > >representation. But, it can lead to SDK-dependent errors, when the > > >element is mutated in a way that a runner might not expect. > > > > > >Hope this helps. > > > > > > Jan > > > > > >On 10/29/20 1:58 PM, Teodor Spæren wrote: > > >>Hey! > > >> > > >>Just so I understand this correctly then, what does the following > > >>quote from [1], section 3.2.3 mean: > > >> > > >>A PCollection is immutable. Once created, you cannot add, remove, or > > >>change individual elements. A Beam Transform might process each > > >>element of a PCollection and generate new pipeline data (as a new > > >>PCollection), *but it does not consume or modify the original input > > >>collection.* > > >> > > >>(Don't know what the normal way of highlighting is on mailing lists, > > >>so I just put it between *) > > >> > > >>I read this as meaning that it is the users responsibilty to make > > >>sure that their transformations do not modify the input, but should > > >>I rather read it as meaning the beam runner itself should make sure > > >>the user cannot make such a mistake? I find this reading at odds > > >>with the documentation about the direct runner and it's express > > >>purpose being to make sure users doesn't rely on semantics the beam > > >>model doesn't ensure. And modifying of input arguments being one of > > >>the constraints listed. [2]. > > >> > > >>It doesn't change the outcome here, adding an opt out switch, but if > > >>I've missunderstood the quote above, I think this might benefit by > > >>being reworded, to make sure it is communicated that shooting > > >>yourself in the foot is impossible and the direct runner testing of > > >>modifying input should be removed, as there is no point in users > > >>making sure to not modifying the input if all runners guarantee it. > > >> > > >> > > >>Also, I ran the whole Flink test suite with a simple return instead > > >>of the deep copy and all tests passed, so there is no such test in > > >>there. Depending on the reading above, we should add such tests to > > >>all runners. > > >> > > >>Best regards, > > >>Teodor Spæren > > >> > > >>On Thu, Oct 29, 2020 at 10:16:30AM +0100, Maximilian Michels wrote: > > >>>>Ok then we are on the same page, but I disagree with your > > >>>>conclusion. The reason Flink has to do the deep copy is that it > > >>>>doesn't state that the inputs are immutable and should not be > > >>>>changed, and so have to do the deep copy. In Beam, the user is > > >>>>not supposed to modify the input collection and if they do, it's > > >>>>undefined behavior. This is the reason the DirectRunner checks > > >>>>for this, to make sure the users are not relying on it. > > >>> > > >>>It's not written anywhere that the input cannot be mutated. A > > >>>DirectRunner test is not a proof. Any runner could add a test > > >>>which proves the opposite. In fact we may have one that checks > > >>>copying for Flink. > > >>> > > >>>I prefer safety and correctness over performance because I've seen > > >>>too many cases where users shoot themselves in the foot. We should > > >>>make sure that, by default, the user cannot modify the input > > >>>element. An option to disable that is fine. > > >>> > > >>>-Max > > >
