Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19226#discussion_r138790747

    --- Diff: python/pyspark/serializers.py ---
    @@ -343,9 +346,6 @@ def _load_stream_without_unbatching(self, stream):
             key_batch_stream = self.key_ser._load_stream_without_unbatching(stream)
             val_batch_stream = self.val_ser._load_stream_without_unbatching(stream)
             for (key_batch, val_batch) in zip(key_batch_stream, val_batch_stream):
    -            if len(key_batch) != len(val_batch):
    -                raise ValueError("Can not deserialize PairRDD with different number of items"
    -                                 " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
                 # for correctness with repeated cartesian/zip this must be returned as one batch
                 yield zip(key_batch, val_batch)
    --- End diff --

    How about returning this batch as a list (and as described in the doc)?
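
For illustration only, here is a minimal sketch of the difference the suggestion points at, assuming Python 3, where `zip` returns a lazy one-shot iterator rather than a list. The helpers `load_pairs_lazy` and `load_pairs_as_lists` are hypothetical stand-ins written for this note, not PySpark functions; they take plain lists of batches instead of a serialized stream.

```python
def load_pairs_lazy(key_batches, val_batches):
    # Mirrors the shape of the code in the diff: each batch is a bare zip object.
    for key_batch, val_batch in zip(key_batches, val_batches):
        yield zip(key_batch, val_batch)


def load_pairs_as_lists(key_batches, val_batches):
    # Hypothetical variant following the review suggestion: materialize each batch.
    for key_batch, val_batch in zip(key_batches, val_batches):
        yield list(zip(key_batch, val_batch))


keys = [[1, 2], [3]]
vals = [["a", "b"], ["c"]]

lazy_batches = list(load_pairs_lazy(keys, vals))
list_batches = list(load_pairs_as_lists(keys, vals))

# A list batch supports len() and can be iterated any number of times.
first_len = len(list_batches[0])

# A zip batch has no len() in Python 3 and is exhausted after one pass.
first_pass = list(lazy_batches[0])
second_pass = list(lazy_batches[0])
```

With a list, downstream code that calls `len()` on a batch or iterates over it more than once keeps working; with a bare `zip` object the second pass comes back empty.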