[jira] [Commented] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures

2015-05-15 Thread Josh Rosen (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544987#comment-14544987 ]

Josh Rosen commented on SPARK-7660:
---

Note that this affects more than just Spark 1.4.0; I'll trace back and figure 
out the complete list of affected versions tomorrow, but I think that any 
version that depends on a snappy-java release published after mid-June or July 
2014 may be affected.

 Snappy-java buffer-sharing bug leads to data corruption / test failures
 ---

 Key: SPARK-7660
 URL: https://issues.apache.org/jira/browse/SPARK-7660
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.4.0
Reporter: Josh Rosen
Priority: Blocker

 snappy-java contains a bug that can cause separate SnappyOutputStream 
 instances to end up sharing the same input and output buffers, which can 
 corrupt data.  See https://github.com/xerial/snappy-java/issues/107 for my 
 upstream bug report and https://github.com/xerial/snappy-java/pull/108 for my 
 patch to fix this issue.
 I discovered this issue because the buffer-sharing was causing a test failure 
 in JavaAPISuite: one of the repartition-and-sort tests returned the wrong 
 answer because both tasks wrote their output using the same compression 
 buffers and one task won the race, so its output was written to both shuffle 
 output files. As a result, the test returned the result of collecting one 
 partition twice (see 
 https://github.com/apache/spark/pull/5868#issuecomment-101954962 for more 
 details).
 The buffer-sharing can only occur if {{close()}} is called twice on the same 
 SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for 
 a more precise description of when this issue may occur, see my upstream 
 tickets).  I think this double-close happens in test code that was added as 
 part of my Tungsten shuffle patch, which exposes the bug (to reproduce, 
 download a recent build of master and run 
 https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally in order to 
 force the test execution order that triggers the bug).
 I think it's rare for this bug to produce silent wrong answers like this. 
 In more realistic workloads that write more than a handful of bytes per task, 
 I would expect it to surface as stream corruption errors like SPARK-4105.
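
To make the failure mode concrete, here is a minimal, hypothetical sketch of 
the pattern (the recycler, class names, and buffer size are invented for 
illustration; see the upstream issue for snappy-java's actual buffer-recycling 
code): a non-idempotent {{close()}} returns the same buffer to a shared pool 
twice, so two later streams are handed the same array.

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only -- NOT snappy-java's real code.
public class DoubleCloseSketch {
    // Simplified recycler: a shared pool of reusable byte[] buffers.
    static final Deque<byte[]> POOL = new ArrayDeque<>();

    static class PooledStream {
        // Reuse a pooled buffer if one is available, otherwise allocate.
        byte[] buffer = POOL.isEmpty() ? new byte[32 * 1024] : POOL.pop();

        void close() {
            // BUG: not idempotent -- a second close() pushes the same array
            // again, so the pool now contains this buffer twice.
            POOL.push(buffer);
        }
    }

    public static void main(String[] args) {
        PooledStream s = new PooledStream();
        s.close();
        s.close(); // double-close: the buffer is now in the pool twice

        PooledStream a = new PooledStream(); // pops the buffer
        PooledStream b = new PooledStream(); // pops the *same* buffer
        System.out.println("a and b share a buffer: " + (a.buffer == b.buffer));
    }
}
{code}

In the real bug, the two "streams" are the SnappyOutputStreams of two shuffle 
tasks, which is why one task's compressed output could end up in both shuffle 
output files.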






[jira] [Commented] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures

2015-05-15 Thread Josh Rosen (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545005#comment-14545005 ]

Josh Rosen commented on SPARK-7660:
---

I pushed 
https://github.com/apache/spark/commit/7da33ce5057ff965eec19ce662465b64a3564019 
as a hotfix that masks the bug and fixes the JavaAPISuite Jenkins failures.  
We'll still properly fix this bug before 1.4, but in the meantime the hotfix 
makes it easier to spot genuinely new Jenkins failures.


[jira] [Commented] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures

2015-05-15 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545031#comment-14545031 ]

Apache Spark commented on SPARK-7660:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/6176




[jira] [Commented] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures

2015-05-15 Thread Josh Rosen (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545014#comment-14545014 ]

Josh Rosen commented on SPARK-7660:
---

If we're wary of upgrading to a new snappy-java version and don't want to wait 
for a new release / backport, one option is to wrap SnappyOutputStream with our 
own code to make {{close()}} idempotent (a rough sketch follows below).  I don't 
think this will add any significant overhead if done right, since the JIT should 
be able to inline the SnappyOutputStream calls.
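
Roughly, such a wrapper could look like the sketch below. The class name and 
structure are mine and purely illustrative; this is not the actual Spark patch, 
and the real fix may end up living in the compression codec code or upstream in 
snappy-java instead.

{code:java}
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Illustrative sketch: a thin wrapper that ignores every close() after the
// first, so a downstream double-close can never trigger the underlying
// stream's buffer recycling twice.
public class CloseOnceOutputStream extends FilterOutputStream {
    private boolean closed = false;

    public CloseOnceOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        // Delegate directly; FilterOutputStream's default writes byte-by-byte.
        out.write(b, off, len);
    }

    @Override
    public void close() throws IOException {
        if (!closed) {
            closed = true;
            super.close(); // flushes and closes the underlying stream exactly once
        }
        // Subsequent close() calls are no-ops.
    }
}
{code}

Call sites would then construct something like 
{{new CloseOnceOutputStream(new SnappyOutputStream(out))}}, assuming we wrap at 
the point where Spark builds its compressed shuffle output streams.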




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org