Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/3600#issuecomment-66426107 Hmm, might be tricky to explain if you do not have sufficient context, let me give it a shot. a) Streams in java are not usually multiplexed - unless explicitly stated otherwise. With this PR, the same underlying stream (fileStream) is being reused across deserializeStream (and its users). One way it manifests is (b) b) Most streams in java override finalize to close their underlying stream in case they are going out of scope (to prevent resource leak, etc) : ofcourse this is an implementation detail, but is the general expectation. In this case, deserializeStream gets re-assigned somewhere in the method below - causing the previous 'deserializeStream' to go out of scope. When gc kicks in, and then when finalizers are run, deserializeStream's finalize can call its close, resulting in fileStream to get closed - which might now be used by some other deserializeStream : since it was re-used. This will cause hard to debug crashes/bugs. I am sure I am missing other spectacular ways in which this can fail :-) - in general, these things happen when the basic api expectation (probably implicit here maybe) is broken. Now, we can go down this path in case the operation we are saving is very expensive -which is not the case here (it is a cheap file open/close which is saved).
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org