[jira] [Resolved] (SPARK-48330) Fix the python streaming data source timeout issue for large trigger interval

Jungtaek Lim (Jira) Mon, 20 May 2024 15:58:04 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-48330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jungtaek Lim resolved SPARK-48330.
----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 46651
[https://github.com/apache/spark/pull/46651]

> Fix the python streaming data source timeout issue for large trigger interval
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-48330
>                 URL: https://issues.apache.org/jira/browse/SPARK-48330
>             Project: Spark
>          Issue Type: Task
>          Components: PySpark, SS
>    Affects Versions: 4.0.0
>            Reporter: Chaoqin Li
>            Assignee: Chaoqin Li
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>
> Currently we run a long-running Python worker process for the Python streaming 
> source and sink to perform planning, commit, and abort on the driver side. Testing 
> indicates that the current implementation causes a connection timeout error when 
> a streaming query has a large trigger interval.
> For the Python streaming source, keep the long-running worker architecture but 
> set the socket timeout to infinity to avoid the timeout error.
> For the Python streaming sink, since StreamingWrite is also created per 
> microbatch on the Scala side, a long-running worker cannot be attached to a 
> StreamingWrite instance. Therefore we abandon the long-running worker 
> architecture, simply call commit() or abort(), exit the worker, and allow 
> Spark to reuse the worker for us.
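
The timeout behaviour described above can be illustrated with a minimal sketch using only the Python standard library. This is not the code from the pull request: the function name read_after_idle, the message payload, and the durations are hypothetical, and a local socket pair stands in for the driver-to-worker connection. It only shows why a finite socket timeout fails once the peer stays idle longer than the timeout, and why disabling the timeout (None, i.e. blocking forever) avoids the error.

```python
import socket
import threading
import time


def read_after_idle(timeout, idle_seconds):
    """Return what a reader sees when its peer stays silent for idle_seconds.

    timeout=None disables the socket timeout, so recv() blocks until data
    arrives. All names here are hypothetical and not part of Spark.
    """
    reader, writer = socket.socketpair()
    reader.settimeout(timeout)

    def delayed_send():
        # The sleep stands in for a large trigger interval between batches.
        time.sleep(idle_seconds)
        writer.sendall(b"plan-next-batch")

    sender = threading.Thread(target=delayed_send)
    sender.start()
    try:
        return reader.recv(1024).decode()
    except socket.timeout:
        # Raised when no data arrives within the finite timeout.
        return "timeout"
    finally:
        sender.join()
        reader.close()
        writer.close()


if __name__ == "__main__":
    # Finite timeout shorter than the idle period: the read fails.
    print(read_after_idle(timeout=0.1, idle_seconds=0.3))
    # No timeout: the read waits for the idle period and then succeeds.
    print(read_after_idle(timeout=None, idle_seconds=0.3))
```

The trade-off of an infinite timeout is that a reader can no longer detect a hung peer through the socket alone, so liveness has to be tracked some other way.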



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
