JoshRosen commented on issue #26076: [SPARK-29419][SQL] Fix Encoder thread-safety bug in createDataset(Seq) URL: https://github.com/apache/spark/pull/26076#issuecomment-540211354 > But if we would like to fix all these problems, all public APIs accepting `Encoder` will need the copy. I think that most existing uses of Encoders are thread-safe because either (a) the use occurs inside of a Spark task and task gets its own fresh copy when the `Task` is deserialized or (b) the use occurs on the driver but the code calls call `resolveAndBind` (which internally performs a `copy`) prior to using the Encoder. Given this, I suspect that this might be the only non-thread-safe Encoder usage in Spark (excluding code which is only used in Spark's unit tests). I don't think that we need to introduce similar copying in other public APIs. > I did some research about this and found some noticeable performance regression in our internal benchmark. What do you think about improving the performance / reducing the cost of `.copy()` by refactoring the `ExpressionEncoder` class such that (a) all of the immutable `vals` become fields of the case class, (b) the current constructor becomes a `.apply()` on the companion object and the case class constructor becomes `private`, and (c) `resolveAndBind` calls the companion object constructor instead of `copy()`? Given this, I think `copy()` could be _really_ cheap, effectively giving us a fresh copy of the internal mutable state but copying all other immutable attributes without performing any re-resolution, analysis, attribute binding, etc. If we do that, we'd be able to defensively copy at _very_ low cost (e.g. one object allocation) and then could copy-by-default and free users from having to worry about thread-safety. I think that's a potentially huge win from a developer productivity point-of-view: the cost / toil of having to worry about thread-unsafe code is a tax placed on end users and creates a developer education / training burden, so I think it's worth attempting to eliminate this entire class of pitfall.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org