Re: Possible 80% reduction in overhead for flink runner, input needed

Jan Lukavský Thu, 29 Oct 2020 06:15:02 -0700

Hi Teodor,

the confusion here may come from the fact that there are two (logical) representations of an element in a PCollection. One representation is the immutable form (most probably serialized in a binary form) of a PCollection element, where no modifications are possible. Once a PCollection is created (e.g. read from a source, or created by a PTransform), it cannot be modified further. The second form is an SDK-dependent representation of each PCollection element in user code. This representation is what UDFs work with. The same source (binary) form of an element can have (and will have) a different representation in the Java SDK and in the Python SDK. The Beam model says nothing about the mutability of this SDK-dependent form. Nevertheless, even if you modify this element, it has no impact on the source representation. It can, however, lead to SDK-dependent errors when the element is mutated in a way that a runner might not expect.
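
The distinction can be sketched in plain Python, without any Beam dependency. Here `encode` and `decode` are hypothetical stand-ins for a coder (JSON is assumed purely for illustration), the `bytes` value plays the role of the immutable source form, and the decoded `dict` plays the role of the SDK-dependent form that user code sees:

```python
import json


def encode(element):
    """Hypothetical coder: turn an element into its immutable binary form."""
    return json.dumps(element, sort_keys=True).encode("utf-8")


def decode(data):
    """Hypothetical coder: build a fresh SDK-side object from the binary form."""
    return json.loads(data.decode("utf-8"))


def user_fn(element):
    """A UDF that (perhaps unwisely) mutates the object it was handed."""
    element["clicks"] += 1
    return element


# The source form is a bytes object, so it cannot be changed in place.
source_form = encode({"user": "alice", "clicks": 1})

# User code only ever sees a decoded, SDK-dependent object.
sdk_form = decode(source_form)
output = user_fn(sdk_form)

# Decoding the source form again yields the original, unmodified element.
redecoded = decode(source_form)
print(output["clicks"], redecoded["clicks"])
```

Mutating `sdk_form` changes only the in-memory object; whether that is harmless depends on what else holds a reference to the same object, which is where runner-specific surprises can arise.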


Hope this helps.

 Jan

On 10/29/20 1:58 PM, Teodor Spæren wrote:

Hey!
Just so I understand this correctly then, what does the following quote from [1], section 3.2.3 mean:
A PCollection is immutable. Once created, you cannot add, remove, or change individual elements. A Beam Transform might process each element of a PCollection and generate new pipeline data (as a new PCollection), *but it does not consume or modify the original input collection.*
(Don't know what the normal way of highlighting is on mailing lists, so I just put it between *)
I read this as meaning that it is the user's responsibility to make sure that their transformations do not modify the input, but should I rather read it as meaning the Beam runner itself should make sure the user cannot make such a mistake? I find this reading at odds with the documentation about the direct runner and its express purpose being to make sure users don't rely on semantics the Beam model doesn't ensure, with modification of input arguments being one of the constraints listed [2].
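
To make the idea of such a check concrete, here is a minimal sketch of how a runner could detect a user function that mutates its input. It is not the direct runner's actual implementation; `call_with_mutation_check` and `MutationDetected` are hypothetical names, and comparing against a deep-copied snapshot is an assumption chosen for simplicity:

```python
import copy


class MutationDetected(Exception):
    """Raised when a user function changed the element it was given."""


def call_with_mutation_check(fn, element):
    """Call fn(element) and fail if fn modified element in place."""
    snapshot = copy.deepcopy(element)  # state of the input before the call
    result = fn(element)
    if element != snapshot:
        raise MutationDetected(f"{fn.__name__} modified its input")
    return result


def double_all(values):
    """Well-behaved: builds a new list and leaves the input alone."""
    return [v * 2 for v in values]


def append_in_place(values):
    """Badly behaved: mutates the input list."""
    values.append(0)
    return values


ok_result = call_with_mutation_check(double_all, [1, 2, 3])

try:
    call_with_mutation_check(append_in_place, [1, 2, 3])
    caught = False
except MutationDetected:
    caught = True
```

A check like this costs a copy and a comparison per element, so it suits a testing runner rather than a production one.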
It doesn't change the outcome here (adding an opt-out switch), but if I've misunderstood the quote above, I think it would benefit from being reworded to make clear that shooting yourself in the foot is impossible. The direct runner's check for modified input should then be removed, as there is no point in users making sure not to modify the input if all runners guarantee it.
Also, I ran the whole Flink test suite with a simple return instead of the deep copy and all tests passed, so there is no such test in there. Depending on the reading above, we should add such tests to all runners.
Best regards,
Teodor Spæren

On Thu, Oct 29, 2020 at 10:16:30AM +0100, Maximilian Michels wrote:
Ok then we are on the same page, but I disagree with your conclusion. The reason Flink has to do the deep copy is that it doesn't state that the inputs are immutable and should not be changed, and so it has to do the deep copy. In Beam, the user is not supposed to modify the input collection and if they do, it's undefined behavior. This is the reason the DirectRunner checks for this, to make sure the users are not relying on it.
It's not written anywhere that the input cannot be mutated. A DirectRunner test is not a proof. Any runner could add a test which proves the opposite. In fact we may have one that checks copying for Flink.
I prefer safety and correctness over performance because I've seen too many cases where users shoot themselves in the foot. We should make sure that, by default, the user cannot modify the input element. An option to disable that is fine.
-Max
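
The trade-off discussed in this message can be illustrated with a small sketch in plain Python. It assumes a hypothetical `fan_out` helper that hands one element to several consumers, with a `copy_input` flag standing in for the proposed opt-out switch; none of these names come from Beam or Flink:

```python
import copy


def fan_out(element, consumers, copy_input=True):
    """Hand one element to several consumers, deep-copying it by default."""
    results = []
    for consumer in consumers:
        arg = copy.deepcopy(element) if copy_input else element
        results.append(consumer(arg))
    return results


def add_tag(record):
    """Mutates its input in place, which user code is not supposed to do."""
    record["tags"].append("seen")
    return len(record["tags"])


def count_tags(record):
    """Only reads its input."""
    return len(record["tags"])


# With the copy, each consumer sees the original element.
safe = fan_out({"tags": []}, [add_tag, count_tags], copy_input=True)

# Without the copy, the mutation leaks into the second consumer.
unsafe = fan_out({"tags": []}, [add_tag, count_tags], copy_input=False)

print(safe, unsafe)  # [1, 0] [1, 1]
```

Skipping the copy saves work per element, but the result of `count_tags` then depends on whether another consumer happened to mutate the shared object first.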
