[ 
https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545005#comment-14545005
 ] 

Josh Rosen commented on SPARK-7660:
-----------------------------------

I pushed 
https://github.com/apache/spark/commit/7da33ce5057ff965eec19ce662465b64a3564019 
as a hotfix. It masks the bug well enough to stop the JavaAPISuite Jenkins 
failures.  We still need to fix the underlying bug before 1.4, but in the 
meantime this makes new Jenkins failures easier to recognize.

> Snappy-java buffer-sharing bug leads to data corruption / test failures
> -----------------------------------------------------------------------
>
>                 Key: SPARK-7660
>                 URL: https://issues.apache.org/jira/browse/SPARK-7660
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 1.4.0
>            Reporter: Josh Rosen
>            Priority: Blocker
>
> snappy-java contains a bug that can cause separate SnappyOutputStream 
> instances to share the same input and output buffers, which can corrupt 
> data.  See https://github.com/xerial/snappy-java/issues/107 for my upstream 
> bug report and https://github.com/xerial/snappy-java/pull/108 for my patch 
> to fix this issue.
>
> I discovered this issue because the buffer sharing caused a test failure in 
> JavaAPISuite. One of the repartition-and-sort tests returned the wrong 
> answer: both tasks wrote their output using the same compression buffers, 
> and one task won the race, so its output was written to both shuffle output 
> files. As a result, the test returned the result of collecting one partition 
> twice (see https://github.com/apache/spark/pull/5868#issuecomment-101954962 
> for more details).
>
> The buffer sharing can only occur if {{close()}} is called twice on the same 
> SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for 
> a more precise description of when this issue may occur, see my upstream 
> tickets).  I think that this double close happens somewhere in test code 
> that was added as part of my Tungsten shuffle patch, which exposed this bug. 
> To reproduce it, download a recent build of master and run 
> https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally to force the 
> test execution order that triggers the bug.
>
> I think that this bug will rarely lead to silent failures like this one. In 
> more realistic workloads, where each task writes more than a handful of 
> bytes, I would expect it to cause stream corruption issues like SPARK-4105.
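
For readers unfamiliar with this failure mode, the following is a minimal 
sketch of how calling close() twice on a stream that borrows its buffer from 
a pool can produce the sharing described above. It is an illustration under 
stated assumptions, not snappy-java's code: BufferPool, PooledStream, and 
run() are hypothetical names, the two "tasks" write one after the other 
instead of racing, and the GC / memory-pressure condition is not modeled.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;

public class DoubleCloseSketch {

    /** Hypothetical buffer pool; a stand-in, not snappy-java's real allocator. */
    static final class BufferPool {
        private final Deque<byte[]> free = new ArrayDeque<>();

        byte[] allocate(int size) {
            byte[] cached = free.poll();
            return (cached != null && cached.length >= size) ? cached : new byte[size];
        }

        void release(byte[] buffer) {
            free.add(buffer); // does not check whether the buffer is already pooled
        }
    }

    /** Hypothetical stream that borrows its single buffer from the pool. */
    static final class PooledStream {
        private final BufferPool pool;
        private final boolean idempotentClose;
        private final byte[] buffer;
        private int length;
        private boolean closed;

        PooledStream(BufferPool pool, int size, boolean idempotentClose) {
            this.pool = pool;
            this.idempotentClose = idempotentClose;
            this.buffer = pool.allocate(size);
        }

        /** Overwrites the buffer from offset 0; data must fit in the buffer. */
        void write(String data) {
            byte[] bytes = data.getBytes(StandardCharsets.UTF_8);
            System.arraycopy(bytes, 0, buffer, 0, bytes.length);
            length = bytes.length;
        }

        /** Returns whatever the buffer currently holds. */
        String flush() {
            return new String(buffer, 0, length, StandardCharsets.UTF_8);
        }

        void close() {
            if (idempotentClose && closed) {
                return; // guarded variant: a second close() is a no-op
            }
            closed = true;
            pool.release(buffer); // unguarded variant releases the buffer again
        }
    }

    /** Double-closes one stream, then lets two new streams write and flush. */
    public static String[] run(boolean idempotentClose) {
        BufferPool pool = new BufferPool();
        PooledStream first = new PooledStream(pool, 16, idempotentClose);
        first.close();
        first.close(); // the double close

        PooledStream taskA = new PooledStream(pool, 16, idempotentClose);
        PooledStream taskB = new PooledStream(pool, 16, idempotentClose);
        taskA.write("task-A");
        taskB.write("task-B");
        return new String[] { taskA.flush(), taskB.flush() };
    }

    public static void main(String[] args) {
        System.out.println(String.join(", ", run(false))); // task-B, task-B
        System.out.println(String.join(", ", run(true)));  // task-A, task-B
    }
}
```

In the unguarded run the pool holds the same array twice, so both new streams 
receive it and the later write overwrites the earlier one, which mirrors the 
"one partition collected twice" symptom. The idempotent close() is shown only 
for contrast as one possible guard; see the upstream pull request for the 
actual patch.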


