[ 
https://issues.apache.org/jira/browse/SPARK-41379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643104#comment-17643104
 ] 

Apache Spark commented on SPARK-41379:
--------------------------------------

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/38906

> Inconsistency of spark session in DataFrame in user function for foreachBatch 
> sink in PySpark
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-41379
>                 URL: https://issues.apache.org/jira/browse/SPARK-41379
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Structured Streaming
>    Affects Versions: 3.3.2, 3.4.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>
> [https://docs.databricks.com/_static/notebooks/merge-in-streaming.html]
> Manual testing against the code example above in PySpark suggests that the 
> sparkSession property of the DataFrame passed to the user function is not 
> the cloned session used by the streaming query. In other words, 
> {{df.sparkSession}} does not appear to be the same as the cloned Spark 
> session that you can access via {{df._jdf.sparkSession()}}.
> Which session gets picked therefore depends on how each method of the 
> PySpark DataFrame is implemented, which users have no way of knowing. If a 
> method picks a different session than expected, it opens a backdoor around 
> restrictions (e.g. AQE), makes session-scoped resources (e.g. temp views) 
> invisible, and so on.
> So it is quite critical to sync the two sessions so that they refer to the 
> same session.
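
The mismatch described above could be checked from inside a foreachBatch user function roughly as follows. This is a minimal sketch, not the fix in the pull request: the helper names sessions_in_sync and check_batch are hypothetical, _jsparkSession and _jdf are private PySpark attributes that may change between versions, and the comparison assumes that == on the two Py4J handles reflects whether they are the same JVM session.

```python
def sessions_in_sync(batch_df):
    """Return True if batch_df.sparkSession wraps the same JVM session
    as the one attached to the underlying Java DataFrame."""
    # Python-side session of the DataFrame; _jsparkSession is the private
    # handle to its JVM counterpart (an assumption about PySpark internals).
    python_side = batch_df.sparkSession._jsparkSession
    # Session of the JVM DataFrame, i.e. the cloned session the issue refers to.
    jvm_side = batch_df._jdf.sparkSession()
    return python_side == jvm_side


def check_batch(batch_df, batch_id):
    """Hypothetical foreachBatch callback that only reports a mismatch."""
    if not sessions_in_sync(batch_df):
        print(f"batch {batch_id}: df.sparkSession differs from "
              "df._jdf.sparkSession()")
```

Assuming a streaming DataFrame named stream_df (also hypothetical), the callback would be attached with stream_df.writeStream.foreachBatch(check_batch).start().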



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

