[ https://issues.apache.org/jira/browse/SPARK-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180737#comment-14180737 ]
Josh Rosen commented on SPARK-4019:
-----------------------------------

This also explains another occurrence of the Snappy PARSING_ERROR(2) error. If the average block size is non-zero but there is at least one zero-sized block, then HighlyCompressedMapStatus causes us to fetch empty blocks, and Snappy raises PARSING_ERROR(2) when it tries to decompress such an empty block. Thanks to [~ilikerps] for helping to figure this out.

> Repartitioning with more than 2000 partitions may drop all data when
> partitions are mostly empty or cause deserialization errors if at least one
> partition is empty
> --------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-4019
>                 URL: https://issues.apache.org/jira/browse/SPARK-4019
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.2.0
>            Reporter: Xiangrui Meng
>            Assignee: Josh Rosen
>            Priority: Blocker
>
> {code}
> sc.makeRDD(0 until 10, 1000).repartition(2001).collect()
> {code}
> returns `Array()`.
> 1.1.0 doesn't have this issue. Tried both the HASH and SORT shuffle managers.
> This problem can also manifest itself as Snappy deserialization errors if the
> average map output status size is non-zero but there is at least one empty
> partition, e.g.
> {code}
> sc.makeRDD(0 until 100000, 1000).repartition(2001).collect()
> {code}
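
As a side note, below is a minimal, self-contained Scala sketch of the two failure modes described above. It assumes the simplified "average-only" bookkeeping this comment attributes to HighlyCompressedMapStatus; the names AvgOnlyStatus, compress, and getSizeForBlock are hypothetical stand-ins for illustration and are not Spark's actual API.

{code}
// Sketch only: NOT Spark's HighlyCompressedMapStatus, just an illustration of
// why reporting a single average size for every shuffle block misbehaves.
object MapStatusSketch {

  // Hypothetical stand-in: remember only the average block size and report
  // that same average for every reduce partition.
  final case class AvgOnlyStatus(avgSize: Long) {
    def getSizeForBlock(reduceId: Int): Long = avgSize
  }

  def compress(blockSizes: Array[Long]): AvgOnlyStatus =
    AvgOnlyStatus(blockSizes.sum / blockSizes.length) // integer division

  def main(args: Array[String]): Unit = {
    // Failure mode 1 (data loss): only 10 of 2001 blocks are non-empty, so the
    // average rounds down to 0 and every block looks empty to the reducer,
    // which then skips fetching all of them.
    val mostlyEmpty = Array.fill(2001)(0L)
    (0 until 10).foreach(i => mostlyEmpty(i) = 100L)
    println(compress(mostlyEmpty).getSizeForBlock(0)) // prints 0

    // Failure mode 2 (Snappy PARSING_ERROR(2)): the average is non-zero, so a
    // genuinely empty block is reported as non-empty, gets fetched, and fails
    // to decompress.
    val someEmpty = Array.fill(2001)(100L)
    someEmpty(42) = 0L
    println(compress(someEmpty).getSizeForBlock(42)) // prints 99, i.e. non-zero
  }
}
{code}

Tracking which blocks are actually empty (e.g. with a bitmap) and averaging only over the non-empty ones would avoid both failure modes in this simplified model.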