Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2838#discussion_r19056687 --- Diff: python/pyspark/worker.py --- @@ -131,6 +130,14 @@ def process(): for (aid, accum) in _accumulatorRegistry.items(): pickleSer._write_with_length((aid, accum._value), outfile) + # check end of stream + if read_int(infile) == SpecialLengths.END_OF_STREAM: --- End diff -- This seems to assume that `infile` will point to the beginning of a (length, [optional data]) pair rather than at some random data in the middle of a data block. I believe that this is actually correct given our current code, but I'm worried that the reason _why_ it's correct is subtle / brittle. Our current serializers seem to maintain this invariant that once control flow has exited the serializer then `infile` will point to a valid length. For example, if I call `deserializer.load_stream(infile)` then call `next()` on this iterator, then once my `next()` call returns `infile` will point to a valid length. This invariant holds for our batched pickle serializer because we first deserialize an entire block (e.g. read a whole (length, data) pair) then yield its items one-by-one. I think it also holds for all of the other serializers, but it could potentially be broken by a streaming deserializer which doesn't need to read the entire data block in order to yield the first item. Basically, my concern is that there's a lot of implicit / subtle assumptions here that we should document if we're going to take this approach.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org