[ https://issues.apache.org/jira/browse/BEAM-529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856777#comment-16856777 ]
Yifan Mai commented on BEAM-529:
--------------------------------

Sorry, I haven't captured the proposal on JIRA yet.

The general idea is to have DoFnRunner hash each input element (or some sample of input elements) before and after the DoFn is run. If the hashes differ, then the input element was mutated and the pipeline should return an error.

The problem is that Python's hash() does not actually have the semantics we want. See https://docs.python.org/3/reference/datamodel.html#object.__hash__
# Not all objects are hashable. For instance, mutable containers like lists are unhashable.
# User-defined classes are hashable by default, but the default hash is based on the id of the object rather than its contents.

I've tried some workarounds such as:
# Converting unhashable containers to immutable hashable containers before hashing them
# Traversing into the __attr__ of user-defined classes and hashing the elements

Even so, there are user-defined classes that still break under this scheme. For instance, pandas DataFrame has properties that, when read, modify a cache that is stored as an attribute. This scheme treats the cache modification as a mutation and raises a false positive.

As such, I haven't come up with a way to do this that is robust enough to cover all conceivable user code.

> Check immutability violations in DirectPipelineRunner
> -----------------------------------------------------
>
>                 Key: BEAM-529
>                 URL: https://issues.apache.org/jira/browse/BEAM-529
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-py-core
>            Reporter: Ahmet Altay
>            Priority: Minor
>              Labels: newbie, starter
>
> Users are going to mutate inputs and outputs of DoFn inappropriately. We
> should help their tests fail to catch such mistakes. (Similar to the
> DirectPipelineRunner in Java SDK)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
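
The hashing pitfalls described in the comment above can be reproduced in plain Python. The following is only a minimal sketch under stated assumptions, not the code from the actual proposal: `fingerprint`, `was_mutated`, `Point`, and `Cached` are hypothetical names invented for illustration, nothing here touches Beam's DoFnRunner, and `Cached` is a simplified stand-in for a class with a lazily filled cache rather than a model of how pandas works internally.

```python
def fingerprint(obj):
    """Hypothetical helper: build a hashable, content-based snapshot of obj."""
    if isinstance(obj, (list, tuple)):
        return (type(obj).__name__, tuple(fingerprint(x) for x in obj))
    if isinstance(obj, (set, frozenset)):
        return (type(obj).__name__, frozenset(fingerprint(x) for x in obj))
    if isinstance(obj, dict):
        items = ((fingerprint(k), fingerprint(v)) for k, v in obj.items())
        return ("dict", frozenset(items))
    if hasattr(obj, "__dict__") and not isinstance(obj, type):
        # Traverse the instance attributes of user-defined classes.
        return (type(obj).__name__, fingerprint(vars(obj)))
    return obj  # assumed to be hashable by value (int, str, ...)


def was_mutated(element, fn):
    """Hash the element's snapshot before and after calling fn on it."""
    before = hash(fingerprint(element))
    fn(element)
    return hash(fingerprint(element)) != before


class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


class Cached:
    """Illustrative stand-in: reading `total` fills an internal cache."""

    def __init__(self, values):
        self.values = values
        self._cache = {}

    @property
    def total(self):
        if "total" not in self._cache:
            self._cache["total"] = sum(self.values)
        return self._cache["total"]


# 1. Mutable containers such as lists are unhashable.
try:
    hash([1, 2, 3])
    list_hashable = True
except TypeError:
    list_hashable = False

# 2. The default hash of a user-defined class ignores its contents.
p = Point(1, 2)
h_before = hash(p)
p.x = 99
default_hash_unchanged = hash(p) == h_before

# 3. The content-based snapshot detects a real mutation and ignores a pure read.
detects_mutation = was_mutated(Point(1, 2), lambda e: setattr(e, "x", 99))
read_flagged = was_mutated([1, [2, 3]], lambda e: sum(e[1]))

# 4. A read that only fills a cache is still reported as a mutation.
false_positive = was_mutated(Cached([1, 2, 3]), lambda e: e.total)
```

In this sketch the last case is the problematic one: the read-only access to `total` changes the instance's attribute dictionary, so an attribute-traversing check cannot tell it apart from a genuine mutation of the element.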