Chaoqin Li created SPARK-48330:
----------------------------------

             Summary: Fix the python data source timeout issue for large 
trigger interval
                 Key: SPARK-48330
                 URL: https://issues.apache.org/jira/browse/SPARK-48330
             Project: Spark
          Issue Type: Task
          Components: PySpark, SS
    Affects Versions: 4.0.0
            Reporter: Chaoqin Li


Currently we run a long-running Python worker process for the Python streaming 
source and sink to perform planning, commit and abort on the driver side. 
Testing indicates that the current implementation causes a connection timeout 
error when the streaming query has a large trigger interval.

For the Python streaming source, keep the long-running worker architecture but 
set the socket timeout to infinity to avoid the timeout error.
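A minimal sketch of the intended source-side behavior (function and request names here are illustrative, not the actual worker code): in Python, a socket timeout of None means blocking mode with no timeout, so the planner worker can sit idle for the whole trigger interval between micro-batches without the connection timing out.

import socket


def handle_request(payload: bytes) -> None:
    """Hypothetical dispatch: decode the driver's request (e.g. latestOffset
    or commit) and invoke the corresponding data source method."""
    pass


def serve_planning_requests(sock: socket.socket) -> None:
    # Blocking mode with no timeout: the worker can stay idle for an entire
    # (possibly very large) trigger interval without the socket raising a
    # timeout error between micro-batches.
    sock.settimeout(None)
    while True:
        request = sock.recv(4096)  # blocks until the driver sends the next request
        if not request:
            break                  # driver closed the connection; worker shuts down
        handle_request(request)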

For the Python streaming sink, since a StreamingWrite is created per microbatch 
on the Scala side, a long-running worker cannot be attached to a StreamingWrite 
instance. Therefore we abandon the long-running worker architecture: simply 
call commit() or abort(), exit the worker, and allow Spark to reuse workers 
for us.
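A rough sketch of the per-batch sink flow under this change (the writer object and argument order are illustrative; the real entry point lives in the streaming Python runner): the worker performs exactly one commit() or abort() for the micro-batch and then returns, instead of looping over a socket waiting for the next batch.

def run_sink_commit(writer, batch_id: int, messages, aborted: bool) -> None:
    # One StreamingWrite per micro-batch on the Scala side, so the Python
    # worker handles exactly one commit/abort and exits; Spark's normal
    # Python worker reuse takes over from there.
    if aborted:
        writer.abort(messages, batch_id)   # roll back the failed micro-batch
    else:
        writer.commit(messages, batch_id)  # finalize the successful micro-batch
    # No long-running loop here: returning lets the worker exit (or be reused).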


