Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2838#discussion_r19056687
  
    --- Diff: python/pyspark/worker.py ---
    @@ -131,6 +130,14 @@ def process():
         for (aid, accum) in _accumulatorRegistry.items():
             pickleSer._write_with_length((aid, accum._value), outfile)
     
    +    # check end of stream
    +    if read_int(infile) == SpecialLengths.END_OF_STREAM:
    --- End diff ---
    
    This seems to assume that `infile` will point to the beginning of a 
(length, [optional data]) pair rather than somewhere in the middle of a data 
block.  I believe this is actually correct given our current code, but I'm 
worried that the reason it's correct is subtle, and the guarantee is brittle.
    
    Our current serializers seem to maintain the invariant that once control 
flow has exited the serializer, `infile` points to a valid length.  For 
example, if I call `deserializer.load_stream(infile)` and then call `next()` 
on the resulting iterator, then once `next()` returns, `infile` points to a 
valid length.  This invariant holds for our batched pickle serializer because 
we first deserialize an entire block (i.e. read a whole (length, data) pair) 
and then yield its items one-by-one.  I think it also holds for all of the 
other serializers, but it could potentially be broken by a streaming 
deserializer that doesn't need to read the entire data block in order to yield 
the first item.
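    The batching invariant can be sketched like this (the helper names, framing format, and marker values below are illustrative stand-ins for `pyspark.serializers`, not the real implementation):

    ```python
    import io
    import struct

    # Illustrative sentinel values, loosely modeled on SpecialLengths
    END_OF_DATA_SECTION = -1
    END_OF_STREAM = -4

    def write_int(f, v):
        f.write(struct.pack("!i", v))

    def read_int(f):
        return struct.unpack("!i", f.read(4))[0]

    def batched_load_stream(infile):
        """Batched-style deserializer: reads a whole (length, data) block
        before yielding any of its items, so between next() calls `infile`
        always points at the next length field."""
        while True:
            length = read_int(infile)
            if length == END_OF_DATA_SECTION:
                return
            block = infile.read(length)       # consume the entire block up front
            for item in block.split(b","):    # stand-in for unpickling a batch
                yield item

    # Build a stream: two framed blocks, end-of-data marker, end-of-stream marker
    buf = io.BytesIO()
    for block in (b"a,b,c", b"d"):
        write_int(buf, len(block))
        buf.write(block)
    write_int(buf, END_OF_DATA_SECTION)
    write_int(buf, END_OF_STREAM)
    buf.seek(0)

    items = list(batched_load_stream(buf))
    # Because each block was consumed whole before yielding, the exhausted
    # iterator leaves the stream positioned exactly at the next length-like
    # field -- here, the end-of-stream marker that worker.py would check.
    marker = read_int(buf)
    ```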
    
    Basically, my concern is that there are a lot of implicit, subtle 
assumptions here that we should document if we're going to take this approach.
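    For contrast, here is a hypothetical streaming deserializer (not anything in pyspark today) that would break the invariant, since it yields items before the enclosing block has been fully consumed:

    ```python
    import io
    import struct

    def write_int(f, v):
        f.write(struct.pack("!i", v))

    def read_int(f):
        return struct.unpack("!i", f.read(4))[0]

    def streaming_load_stream(infile):
        """Hypothetical streaming deserializer: yields each item as soon as
        it is read, WITHOUT pulling the whole (length, data) block first."""
        length = read_int(infile)
        end = infile.tell() + length
        while infile.tell() < end:
            yield infile.read(1)  # one "item" per byte, for illustration

    buf = io.BytesIO()
    write_int(buf, 8)
    buf.write(b"WXYZQRST")
    buf.seek(0)

    it = streaming_load_stream(buf)
    first = next(it)  # consume only the first item, then stop early
    # The stream is now positioned in the middle of the data block, so a
    # naive end-of-stream check would misinterpret payload bytes as a length:
    bogus = read_int(buf)
    ```

    With the batched serializers this can't happen, which is exactly the subtle property worth documenting.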
