Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19246#discussion_r139861001

--- Diff: python/pyspark/sql/types.py ---

```diff
@@ -410,6 +410,24 @@ def __init__(self, name, dataType, nullable=True, metadata=None):
         self.dataType = dataType
         self.nullable = nullable
         self.metadata = metadata or {}
+        self.needConversion = dataType.needConversion
+        self.toInternal = dataType.toInternal
+        self.fromInternal = dataType.fromInternal
+
+    def __getstate__(self):
+        """Return state values to be pickled."""
+        return (self.name, self.dataType, self.nullable, self.metadata)
+
+    def __setstate__(self, state):
+        """Restore state from the unpickled state values."""
+        name, dataType, nullable, metadata = state
+        self.name = name
+        self.dataType = dataType
+        self.nullable = nullable
+        self.metadata = metadata
+        self.needConversion = dataType.needConversion
```

--- End diff --

My main concern is that this replaces the reference to the bound method on `StructType` with a method bound to another instance. I don't particularly like monkey patching in Python because, IMHO, it confuses other developers, which might slow down the community's iteration on improvements.

I just ran the Python profiler on top of the current master with this patch:

**Before**

```
============================================================
Profile of RDD<id=13>
============================================================
      220158736 function calls (210148475 primitive calls) in 373.886 seconds
```

**After**

```
============================================================
Profile of RDD<id=13>
============================================================
      210149857 function calls (200139596 primitive calls) in 377.577 seconds
```

Looks like the improvement is not quite significant.
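For context, the pattern the diff introduces can be sketched roughly as follows. This is a simplified stand-in, not the actual pyspark code: `DataType` and `Field` here are minimal hypothetical classes that only illustrate why caching another object's bound method as an instance attribute forces custom `__getstate__`/`__setstate__` handling for pickling.

```python
import pickle


class DataType:
    """Hypothetical stand-in for pyspark's DataType."""

    def needConversion(self):
        return False


class Field:
    """Hypothetical stand-in for StructField.

    Caches a bound method of its dataType as an instance attribute,
    so each call skips one level of attribute dispatch.
    """

    def __init__(self, name, dataType):
        self.name = name
        self.dataType = dataType
        # Cached reference to a method bound to the dataType instance;
        # this is the attribute that needs special pickling treatment.
        self.needConversion = dataType.needConversion

    def __getstate__(self):
        # Exclude the cached bound method from the pickled state.
        return (self.name, self.dataType)

    def __setstate__(self, state):
        self.name, self.dataType = state
        # Re-create the cached reference after unpickling.
        self.needConversion = self.dataType.needConversion


f = Field("a", DataType())
g = pickle.loads(pickle.dumps(f))
print(g.name, g.needConversion())  # -> a False
```

The round trip works only because `__getstate__` drops the cached attribute and `__setstate__` rebuilds it; without that, the instance would carry a method bound to a different object through serialization.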