[jira] [Commented] (SPARK-37088) Python UDF after off-heap vectorized reader can cause crash due to use-after-free in writer thread

2021-10-21 Thread Apache Spark (Jira)


[ https://issues.apache.org/jira/browse/SPARK-37088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432562#comment-17432562 ]

Apache Spark commented on SPARK-37088:
--

User 'ankurdave' has created a pull request for this issue:
https://github.com/apache/spark/pull/34245

> Python UDF after off-heap vectorized reader can cause crash due to 
> use-after-free in writer thread
> --
>
> Key: SPARK-37088
> URL: https://issues.apache.org/jira/browse/SPARK-37088
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>Priority: Major
> Fix For: 3.2.1, 3.3.0
>
>
> Python UDFs in Spark SQL are run in a separate Python process. The Python 
> process is fed input by a dedicated thread (`BasePythonRunner.WriterThread`). 
> This writer thread drives the child plan by pulling rows from its output 
> iterator and serializing them across a socket.
> When the child exec node is the off-heap vectorized Parquet reader, these 
> rows are backed by off-heap memory. The child node uses a task completion 
> listener to free the off-heap memory at the end of the task, which 
> invalidates the output iterator and any rows it has produced. Since task 
> completion listeners are registered bottom-up and executed in reverse order 
> of registration, this is safe as long as an exec node never accesses its 
> input after its task completion listener has executed.
> The BasePythonRunner task completion listener violates this assumption. It 
> interrupts the writer thread, but does not wait for it to exit. This causes a 
> race condition that can lead to an executor crash:
> 1. The Python writer thread is processing a row backed by off-heap memory.
> 2. The task finishes, for example because it has reached a row limit.
> 3. The BasePythonRunner task completion listener sets the interrupt status of 
> the writer thread, but the writer thread does not check it immediately.
> 4. The child plan's task completion listener frees its off-heap memory, 
> invalidating the row that the Python writer thread is processing.
> 5. The Python writer thread attempts to access the invalidated row. The 
> use-after-free triggers a segfault that crashes the executor.
> https://issues.apache.org/jira/browse/SPARK-33277 describes the same issue, 
> but the fix was incomplete. It did not address the situation where the Python 
> writer thread accesses a freed row.
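
The race described above can be modeled outside Spark. The sketch below is a minimal Python illustration, not Spark's actual Scala code: a `threading.Event` stands in for the writer thread's interrupt status, a dict entry stands in for the off-heap memory, and the names (`run_task`, `writer`, `mem`) are all hypothetical. The point it demonstrates is the fix implied by the description: the listener that stops the writer must also wait for it to exit (join) before the downstream listener frees the memory.

```python
import threading

def run_task(join_before_free: bool) -> bool:
    """Return True if the writer never touched freed memory.

    Models the two task completion listeners from the bug report:
    one interrupts the writer thread, the next frees the memory
    backing the rows the writer is serializing.
    """
    mem = {"rows": [1, 2, 3]}          # stands in for off-heap memory
    stop = threading.Event()           # stands in for the thread's interrupt status
    touched_freed = threading.Event()

    def writer():
        # Like BasePythonRunner's writer thread, this checks the "interrupt"
        # flag only between rows, not continuously.
        while not stop.is_set():
            rows = mem.get("rows")
            if rows is None:           # freed memory: a segfault in the real off-heap case
                touched_freed.set()
                return
            _ = rows[0]                # "serialize" a row

    t = threading.Thread(target=writer)
    t.start()

    # First listener (registered last, so it runs first): interrupt the writer.
    stop.set()
    if join_before_free:
        t.join()                       # the step the buggy listener omits

    # Second listener (the child plan's): free the memory.
    mem["rows"] = None
    t.join()
    return not touched_freed.is_set()

# With the join, the writer is guaranteed to have exited before the free:
print(run_task(join_before_free=True))   # True

# Without the join, the outcome is a race and may be either True or False,
# which is exactly the nondeterministic crash the report describes.
run_task(join_before_free=False)
```

With `join_before_free=False`, whether the writer observes the freed memory depends on thread scheduling, mirroring why the crash in the report is intermittent rather than reliable.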



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37088) Python UDF after off-heap vectorized reader can cause crash due to use-after-free in writer thread

2021-10-21 Thread Ankur Dave (Jira)


[ https://issues.apache.org/jira/browse/SPARK-37088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432540#comment-17432540 ]

Ankur Dave commented on SPARK-37088:


https://github.com/apache/spark/pull/34245



