[jira] [Commented] (SPARK-22982) Remove unsafe asynchronous close() call from FileDownloadChannel

2018-01-18 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330943#comment-16330943
 ] 

Andrew Ash commented on SPARK-22982:
------------------------------------

[~joshrosen] do you have some example stacktraces of what this bug can cause? 
Several of our clusters hit what I think is this problem earlier this month; 
see below for details.

 

For a few days in January (the 4th through the 12th) on our AWS infra, we 
observed massively degraded disk read throughput (down to 33% of previous 
peaks). During this time, we also began observing intermittent exceptions from 
Spark when reading Parquet files that a previous Spark job had written. When 
the read throughput recovered on the 12th, we stopped observing the exceptions 
and haven't seen them since.

At first we observed this stacktrace when reading .snappy.parquet files:
{noformat}
java.lang.RuntimeException: java.io.IOException: could not read page Page [bytes.size=1048641, valueCount=29945, uncompressedSize=1048641] in col [my_column] BINARY
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:493)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:486)
    at org.apache.parquet.column.page.DataPageV1.accept(DataPageV1.java:96)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPage(VectorizedColumnReader.java:486)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:157)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:229)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:398)
    at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:191)
    at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.<init>(ObjectAggregationIterator.scala:80)
    at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109)
    at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: could not read page Page [bytes.size=1048641, valueCount=29945, uncompressedSize=1048641] in col [my_column] BINARY
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV1(VectorizedColumnReader.java:562)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.access$000(VectorizedColumnReader.java:47)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:490)
    ... 31 more
Caused by: java.io

[jira] [Commented] (SPARK-22982) Remove unsafe asynchronous close() call from FileDownloadChannel

2018-01-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320667#comment-16320667
 ] 

Sean Owen commented on SPARK-22982:
-----------------------------------

Agreed. java.nio should be OK as that was introduced in Java 7. Spark 2.2 
requires Java 8, so that much is safe.

> Remove unsafe asynchronous close() call from FileDownloadChannel
> ----------------------------------------------------------------
>
>                  Key: SPARK-22982
>                  URL: https://issues.apache.org/jira/browse/SPARK-22982
>              Project: Spark
>           Issue Type: Bug
>           Components: Spark Core
>     Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0
>             Reporter: Josh Rosen
>             Assignee: Josh Rosen
>             Priority: Blocker
>               Labels: correctness
>              Fix For: 2.3.0
>
>
> Spark's Netty-based file transfer code contains an asynchronous IO bug which may lead to incorrect query results.
>
> At a high-level, the problem is that an unsafe asynchronous `close()` of a pipe's source channel creates a race condition where file transfer code closes a file descriptor then attempts to read from it. If the closed file descriptor's number has been reused by an `open()` call then this invalid read may cause unrelated file operations to return incorrect results due to reading different data than intended.
>
> I have a small, surgical fix for this bug and will submit a PR with more description on the specific race condition / underlying bug.
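
To make the race easier to picture, here is a minimal, hypothetical Java sketch of the pattern the description warns about; it is not Spark's actual FileDownloadChannel code, and the class name below is invented. One thread blocks reading from a java.nio pipe while another thread closes the pipe's source channel asynchronously. In this standalone toy the JDK surfaces the conflict as an AsynchronousCloseException; the concern in the description is that, in Spark's transfer path, such an asynchronous close can race with file-descriptor reuse and make a read silently return data from an unrelated file.

{code:java}
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;

// Hypothetical sketch of the unsafe pattern -- NOT Spark's FileDownloadChannel.
public class UnsafeAsyncCloseSketch {
  public static void main(String[] args) throws Exception {
    final Pipe pipe = Pipe.open();
    final Pipe.SourceChannel source = pipe.source();

    // Reader thread: blocks in read() waiting for bytes from the pipe.
    Thread reader = new Thread(new Runnable() {
      public void run() {
        ByteBuffer buf = ByteBuffer.allocate(8192);
        try {
          while (source.read(buf) >= 0) {
            buf.clear();
          }
        } catch (Exception e) {
          // In this toy the JDK reports the concurrent close as an exception;
          // the issue description explains how a lower-level close/read race on
          // a recycled file descriptor can instead return the wrong data.
          System.err.println("Reader failed: " + e);
        }
      }
    });
    reader.start();

    // "Failure handler" thread: closes the *source* channel asynchronously,
    // while the reader may still be inside read(). This is the unsafe part.
    new Thread(new Runnable() {
      public void run() {
        try {
          source.close();
        } catch (Exception ignored) {
        }
      }
    }).start();

    reader.join();
  }
}
{code}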






[jira] [Commented] (SPARK-22982) Remove unsafe asynchronous close() call from FileDownloadChannel

2018-01-10 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320664#comment-16320664
 ] 

Josh Rosen commented on SPARK-22982:
------------------------------------

In theory this affects all 1.6.0+ versions.

It's going to be much harder to trigger in 1.6.0, though, because we shouldn't 
have as many Janino-induced remote ClassNotFoundExceptions exercising the error 
path.

We should probably port the fix to the other 2.x branches, with one caveat: we 
need to make sure it doesn't rely on JRE 8.x functionality, because I think at 
least some of those older branches still support Java 7.
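
For illustration only, here is one hypothetical shape such a fix could take using java.nio APIs that already exist in Java 7; this is a sketch, not the actual patch attached to this issue, and the class and method names (SafeDownloadChannelSketch, onFailure) are invented. The idea: the failure path records the error and closes the pipe's sink, and only the reading thread ever closes the source channel, so no file descriptor is closed out from under a concurrent read. The reader then sees a normal end-of-stream and re-throws the recorded error itself.

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.ReadableByteChannel;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch only -- not the SPARK-22982 patch. The failure callback
// records the error and closes the sink; the reading thread then observes
// end-of-stream, checks for a recorded error, and is the only thread that
// ever closes the source channel.
class SafeDownloadChannelSketch implements ReadableByteChannel {
  private final Pipe pipe;
  private final AtomicReference<Throwable> error = new AtomicReference<Throwable>();

  SafeDownloadChannelSketch(Pipe pipe) {
    this.pipe = pipe;
  }

  // Called from the network thread when the transfer fails.
  void onFailure(Throwable cause) {
    error.compareAndSet(null, cause);
    try {
      pipe.sink().close();  // reader will see EOF rather than a vanished fd
    } catch (IOException ignored) {
    }
  }

  public int read(ByteBuffer dst) throws IOException {
    int n = pipe.source().read(dst);
    Throwable cause = error.get();
    if (n == -1 && cause != null) {
      throw new IOException("Stream failed before completion", cause);
    }
    return n;
  }

  public boolean isOpen() {
    return pipe.source().isOpen();
  }

  public void close() throws IOException {
    pipe.source().close();  // only the reading thread calls this
  }
}
{code}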







[jira] [Commented] (SPARK-22982) Remove unsafe asynchronous close() call from FileDownloadChannel

2018-01-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320203#comment-16320203
 ] 

Sean Owen commented on SPARK-22982:
-----------------------------------

Does this affect earlier branches in the same way? It seems important to 
backport if so.







[jira] [Commented] (SPARK-22982) Remove unsafe asynchronous close() call from FileDownloadChannel

2018-01-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315514#comment-16315514
 ] 

Apache Spark commented on SPARK-22982:
--------------------------------------

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/20179



