[jira] [Created] (SPARK-48567) Pyspark StreamingQuery lastProgress and friend should return actual StreamingQueryProgress
Wei Liu created SPARK-48567: --- Summary: Pyspark StreamingQuery lastProgress and friend should return actual StreamingQueryProgress Key: SPARK-48567 URL: https://issues.apache.org/jira/browse/SPARK-48567 Project: Spark Issue Type: New Feature Components: PySpark, SS Affects Versions: 4.0.0 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
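For context, a minimal PySpark sketch (the rate source and noop sink are assumptions made purely for illustration) of the behavior this ticket targets: today `lastProgress` returns a plain dict in PySpark, while the proposal is for it to return an actual `StreamingQueryProgress` object, matching `recentProgress` and the Scala API.

```
# Hedged sketch; rate source and noop sink are chosen only for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
q = spark.readStream.format("rate").load().writeStream.format("noop").start()

progress = q.lastProgress
if progress is not None:
    # Current behavior: a plain dict, so fields are accessed by key.
    print(progress["id"], progress["batchId"])
    # Proposed behavior: a StreamingQueryProgress object, so attribute access
    # such as progress.batchId (and progress.json) would work, as in Scala.
q.stop()
```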
[jira] [Created] (SPARK-48482) dropDuplicates and dropDuplicatesWithinWatermark should accept varargs
Wei Liu created SPARK-48482: --- Summary: dropDuplicates and dropDuplicatesWithinWatermark should accept varargs Key: SPARK-48482 URL: https://issues.apache.org/jira/browse/SPARK-48482 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 4.0.0 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
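A small sketch of the API change being requested (the varargs call is the proposed form, not something current releases necessarily accept): today the PySpark methods take a list of column names, and the ask is to also accept them as positional varargs.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "name"])

# Current API: the subset is passed as a list.
df.dropDuplicates(["id", "name"]).show()

# Proposed API (illustrative only): the same call with varargs,
# df.dropDuplicates("id", "name"), and likewise for
# dropDuplicatesWithinWatermark on a watermarked streaming DataFrame.
```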
[jira] [Created] (SPARK-48480) StreamingQueryListener thread should not be interruptable
Wei Liu created SPARK-48480: --- Summary: StreamingQueryListener thread should not be interruptable Key: SPARK-48480 URL: https://issues.apache.org/jira/browse/SPARK-48480 Project: Spark Issue Type: New Feature Components: Connect, SS Affects Versions: 4.0.0 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-48411) Add E2E test for DropDuplicateWithinWatermark
[ https://issues.apache.org/jira/browse/SPARK-48411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849348#comment-17849348 ] Wei Liu commented on SPARK-48411: - Sorry I tagged the wrong Yuchen > Add E2E test for DropDuplicateWithinWatermark > - > > Key: SPARK-48411 > URL: https://issues.apache.org/jira/browse/SPARK-48411 > Project: Spark > Issue Type: New Feature > Components: Connect, SS >Affects Versions: 4.0.0 >Reporter: Wei Liu >Priority: Major > > Currently we do not have a e2e test for DropDuplicateWithinWatermark, we > should add one. We can simply use one of the test written in Scala here (with > the testStream API) and replicate it to python: > [https://github.com/apache/spark/commit/0e9e34c1bd9bd16ad5efca77ce2763eb950f3103] > > The change should happen in > [https://github.com/apache/spark/blob/eee179135ed21dbdd8b342d053c9eda849e2de77/python/pyspark/sql/tests/streaming/test_streaming.py#L29] > > so we can test it in both connect and non-connect. > > Test with: > ``` > python/run-tests --testnames pyspark.sql.tests.streaming.test_streaming > python/run-tests --testnames > pyspark.sql.tests.connect.streaming.test_parity_streaming > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48411) Add E2E test for DropDuplicateWithinWatermark
Wei Liu created SPARK-48411: --- Summary: Add E2E test for DropDuplicateWithinWatermark Key: SPARK-48411 URL: https://issues.apache.org/jira/browse/SPARK-48411 Project: Spark Issue Type: New Feature Components: Connect, SS Affects Versions: 4.0.0 Reporter: Wei Liu Currently we do not have an e2e test for DropDuplicateWithinWatermark; we should add one. We can simply take one of the tests written in Scala here (with the testStream API) and replicate it in Python: [https://github.com/apache/spark/commit/0e9e34c1bd9bd16ad5efca77ce2763eb950f3103] The change should happen in [https://github.com/apache/spark/blob/eee179135ed21dbdd8b342d053c9eda849e2de77/python/pyspark/sql/tests/streaming/test_streaming.py#L29] so we can test it in both connect and non-connect. Test with: ``` python/run-tests --testnames pyspark.sql.tests.streaming.test_streaming python/run-tests --testnames pyspark.sql.tests.connect.streaming.test_parity_streaming ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
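A rough sketch, not the actual test added for this ticket, of what such an e2e check in pyspark.sql.tests.streaming.test_streaming could look like; the rate source, memory sink, and test name are assumptions made for illustration, and the method would live inside the existing test class.

```
# Rough sketch only; names and data source are illustrative.
def test_drop_duplicates_within_watermark(self):
    df = (
        self.spark.readStream.format("rate")
        .option("rowsPerSecond", "10")
        .load()
        .withWatermark("timestamp", "10 seconds")
        .dropDuplicatesWithinWatermark(["value"])
    )
    q = (
        df.writeStream.format("memory")
        .queryName("dedup_within_watermark")
        .outputMode("append")
        .start()
    )
    try:
        q.processAllAvailable()
        values = [
            r.value
            for r in self.spark.sql("SELECT value FROM dedup_within_watermark").collect()
        ]
        # The rate source emits unique values, so deduplication must not drop
        # rows, and no value may appear twice in the output.
        self.assertEqual(len(values), len(set(values)))
    finally:
        q.stop()
```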
[jira] [Commented] (SPARK-48411) Add E2E test for DropDuplicateWithinWatermark
[ https://issues.apache.org/jira/browse/SPARK-48411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849202#comment-17849202 ] Wei Liu commented on SPARK-48411: - [~liuyuchen777] is going to work on this > Add E2E test for DropDuplicateWithinWatermark > - > > Key: SPARK-48411 > URL: https://issues.apache.org/jira/browse/SPARK-48411 > Project: Spark > Issue Type: New Feature > Components: Connect, SS >Affects Versions: 4.0.0 >Reporter: Wei Liu >Priority: Major > > Currently we do not have a e2e test for DropDuplicateWithinWatermark, we > should add one. We can simply use one of the test written in Scala here (with > the testStream API) and replicate it to python: > [https://github.com/apache/spark/commit/0e9e34c1bd9bd16ad5efca77ce2763eb950f3103] > > The change should happen in > [https://github.com/apache/spark/blob/eee179135ed21dbdd8b342d053c9eda849e2de77/python/pyspark/sql/tests/streaming/test_streaming.py#L29] > > so we can test it in both connect and non-connect. > > Test with: > ``` > python/run-tests --testnames pyspark.sql.tests.streaming.test_streaming > python/run-tests --testnames > pyspark.sql.tests.connect.streaming.test_parity_streaming > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48181) Unify StreamingPythonRunner and PythonPlannerRunner
Wei Liu created SPARK-48181: --- Summary: Unify StreamingPythonRunner and PythonPlannerRunner Key: SPARK-48181 URL: https://issues.apache.org/jira/browse/SPARK-48181 Project: Spark Issue Type: New Feature Components: Connect, SS Affects Versions: 4.0.0 Reporter: Wei Liu We should unify the two driver-side Python runners for PySpark. To do this we should move away from StreamingPythonRunner and enhance PythonPlannerRunner with streaming support (a repeated read-write loop) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
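A purely conceptual sketch of the difference between the two runners' protocols; the function names and length-prefixed framing below are assumptions for illustration and are not Spark's internal API. The planner runner does a single request/response exchange with its Python worker, while the streaming runner needs to keep exchanging messages (the "repeated read-write loop") until the stream ends.

```
# Conceptual sketch; not Spark internals. Length-prefixed framing is an
# assumption used only to make the loop concrete.
import socket
import struct

def read_message(sock: socket.socket):
    header = sock.recv(4)
    if len(header) < 4:          # the JVM side closed the stream
        return None
    (length,) = struct.unpack(">i", header)
    return sock.recv(length)

def write_message(sock: socket.socket, data: bytes) -> None:
    sock.sendall(struct.pack(">i", len(data)) + data)

def one_shot_runner(sock, handle):
    # PythonPlannerRunner style: one request, one response.
    write_message(sock, handle(read_message(sock)))

def streaming_runner(sock, handle):
    # StreamingPythonRunner style: keep serving messages (e.g. one per
    # microbatch or listener event) until the other side closes the stream.
    while True:
        payload = read_message(sock)
        if payload is None:
            break
        write_message(sock, handle(payload))
```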
[jira] [Created] (SPARK-48147) Remove all client listeners when local spark session is deleted
Wei Liu created SPARK-48147: --- Summary: Remove all client listeners when local spark session is deleted Key: SPARK-48147 URL: https://issues.apache.org/jira/browse/SPARK-48147 Project: Spark Issue Type: New Feature Components: Connect, PySpark, SS Affects Versions: 4.0.0 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48093) Add config to switch between client side listener and server side listener
Wei Liu created SPARK-48093: --- Summary: Add config to switch between client side listener and server side listener Key: SPARK-48093 URL: https://issues.apache.org/jira/browse/SPARK-48093 Project: Spark Issue Type: New Feature Components: Connect, SS Affects Versions: 3.5.1, 3.5.0, 3.5.2 Reporter: Wei Liu We are moving the implementation of Streaming Query Listener from server to client. For clients already running the client side listener, to prevent regression, we should add a config to let them decide what type of listener the user wants to use. This is only added to the 3.5.x published versions. For 4.0 and upwards we only use the client side listener. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48002) Add Observed metrics test in PySpark StreamingQueryListeners
Wei Liu created SPARK-48002: --- Summary: Add Observed metrics test in PySpark StreamingQueryListeners Key: SPARK-48002 URL: https://issues.apache.org/jira/browse/SPARK-48002 Project: Spark Issue Type: New Feature Components: SS Affects Versions: 4.0.0 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
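For reference, a minimal sketch (the rate source, noop sink, and metric name are assumptions) of what such a test would exercise: attach a named observation with DataFrame.observe and read it back from the progress in onQueryProgress.

```
# Minimal sketch; the source, sink, and metric name are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, lit
from pyspark.sql.streaming import StreamingQueryListener

spark = SparkSession.builder.getOrCreate()

class ObservedMetricsListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        # Observed metrics show up per batch, keyed by the observation name.
        row = event.progress.observedMetrics.get("my_metric")
        if row is not None:
            print("rows in batch:", row.cnt)

    def onQueryIdle(self, event):
        pass

    def onQueryTerminated(self, event):
        pass

spark.streams.addListener(ObservedMetricsListener())
df = (
    spark.readStream.format("rate").load()
    .observe("my_metric", count(lit(1)).alias("cnt"))
)
q = df.writeStream.format("noop").start()
```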
[jira] [Created] (SPARK-47877) Speed up test_parity_listener
Wei Liu created SPARK-47877: --- Summary: Speed up test_parity_listener Key: SPARK-47877 URL: https://issues.apache.org/jira/browse/SPARK-47877 Project: Spark Issue Type: New Feature Components: Connect, SS Affects Versions: 4.0.0 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47718) .sql() does not recognize watermark defined upstream
[ https://issues.apache.org/jira/browse/SPARK-47718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-47718: Labels: (was: pull-request-available) > .sql() does not recognize watermark defined upstream > > > Key: SPARK-47718 > URL: https://issues.apache.org/jira/browse/SPARK-47718 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.1 >Reporter: Chloe He >Priority: Major > > I have a data pipeline set up in such a way that it reads data from a Kafka > source, does some transformation on the data using pyspark, then writes the > output into a sink (Kafka, Redis, etc). > > My entire pipeline in written in SQL, so I wish to use the .sql() method to > execute SQL on my streaming source directly. > > However, I'm running into the issue where my watermark is not being > recognized by the downstream query via the .sql() method. > > ``` > Python 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:49:36) > [Clang 16.0.6 ] on darwin > Type "help", "copyright", "credits" or "license" for more information. > >>> import pyspark > >>> print(pyspark.__version__) > 3.5.1 > >>> from pyspark.sql import SparkSession > >>> > >>> session = SparkSession.builder \ > ... .config("spark.jars.packages", > "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1")\ > ... .getOrCreate() > >>> from pyspark.sql.functions import col, from_json > >>> from pyspark.sql.types import StructField, StructType, TimestampType, > >>> LongType, DoubleType, IntegerType > >>> schema = StructType( > ... [ > ... StructField('createTime', TimestampType(), True), > ... StructField('orderId', LongType(), True), > ... StructField('payAmount', DoubleType(), True), > ... StructField('payPlatform', IntegerType(), True), > ... StructField('provinceId', IntegerType(), True), > ... ]) > >>> > >>> streaming_df = session.readStream\ > ... .format("kafka")\ > ... .option("kafka.bootstrap.servers", "localhost:9092")\ > ... .option("subscribe", "payment_msg")\ > ... .option("startingOffsets","earliest")\ > ... .load()\ > ... .select(from_json(col("value").cast("string"), > schema).alias("parsed_value"))\ > ... .select("parsed_value.*")\ > ... .withWatermark("createTime", "10 seconds") > >>> > >>> streaming_df.createOrReplaceTempView("streaming_df") > >>> session.sql(""" > ... SELECT > ... window.start, window.end, provinceId, sum(payAmount) as totalPayAmount > ... FROM streaming_df > ... GROUP BY provinceId, window('createTime', '1 hour', '30 minutes') > ... ORDER BY window.start > ... """)\ > ... .writeStream\ > ... .format("kafka") \ > ... .option("checkpointLocation", "checkpoint") \ > ... .option("kafka.bootstrap.servers", "localhost:9092") \ > ... .option("topic", "sink") \ > ... .start() > ``` > > This throws exception > ``` > pyspark.errors.exceptions.captured.AnalysisException: Append output mode not > supported when there are streaming aggregations on streaming > DataFrames/DataSets without watermark; line 6 pos 4; > ``` > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47722) Wait until RocksDB background work finish before closing
Wei Liu created SPARK-47722: --- Summary: Wait until RocksDB background work finish before closing Key: SPARK-47722 URL: https://issues.apache.org/jira/browse/SPARK-47722 Project: Spark Issue Type: New Feature Components: SS Affects Versions: 4.0.0 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47332) StreamingPythonRunner don't need redundant logic for starting python process
Wei Liu created SPARK-47332: --- Summary: StreamingPythonRunner don't need redundant logic for starting python process Key: SPARK-47332 URL: https://issues.apache.org/jira/browse/SPARK-47332 Project: Spark Issue Type: New Feature Components: Connect, SS, Structured Streaming Affects Versions: 4.0.0 Reporter: Wei Liu https://github.com/apache/spark/pull/45023#discussion_r1516609093 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47292) safeMapToJValue should consider when map is null
Wei Liu created SPARK-47292: --- Summary: safeMapToJValue should consider when map is null Key: SPARK-47292 URL: https://issues.apache.org/jira/browse/SPARK-47292 Project: Spark Issue Type: New Feature Components: Connect, SS Affects Versions: 3.5.1, 4.0.0 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47277) PySpark util function assertDataFrameEqual should not support streaming DF
Wei Liu created SPARK-47277: --- Summary: PySpark util function assertDataFrameEqual should not support streaming DF Key: SPARK-47277 URL: https://issues.apache.org/jira/browse/SPARK-47277 Project: Spark Issue Type: New Feature Components: Connect, PySpark, SQL, Structured Streaming Affects Versions: 3.5.1, 3.5.0, 4.0.0 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
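A short sketch of the situation (the raised error shown is the intended behavior described by the ticket, not necessarily the exact current message): comparing DataFrames requires collecting their rows, which a streaming DataFrame cannot do, so the utility should reject streaming input up front.

```
# Sketch; the rate source is an assumption used for illustration.
from pyspark.sql import SparkSession
from pyspark.testing import assertDataFrameEqual

spark = SparkSession.builder.getOrCreate()

batch_df = spark.range(10)
stream_df = spark.readStream.format("rate").load()

assertDataFrameEqual(batch_df, batch_df)        # fine: both sides are batch DataFrames

try:
    assertDataFrameEqual(stream_df, stream_df)  # should fail fast with a clear error
except Exception as e:
    print(type(e).__name__, e)
```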
[jira] [Created] (SPARK-47174) Client Side Listener - Server side implementation
Wei Liu created SPARK-47174: --- Summary: Client Side Listener - Server side implementation Key: SPARK-47174 URL: https://issues.apache.org/jira/browse/SPARK-47174 Project: Spark Issue Type: Improvement Components: Connect, SS Affects Versions: 4.0.0 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47174) Client Side Listener - Server side implementation
[ https://issues.apache.org/jira/browse/SPARK-47174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820828#comment-17820828 ] Wei Liu commented on SPARK-47174: - I'm working on this > Client Side Listener - Server side implementation > - > > Key: SPARK-47174 > URL: https://issues.apache.org/jira/browse/SPARK-47174 > Project: Spark > Issue Type: Improvement > Components: Connect, SS >Affects Versions: 4.0.0 >Reporter: Wei Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47173) fix typo in new streaming query listener explanation
Wei Liu created SPARK-47173: --- Summary: fix typo in new streaming query listener explanation Key: SPARK-47173 URL: https://issues.apache.org/jira/browse/SPARK-47173 Project: Spark Issue Type: Improvement Components: SS, UI Affects Versions: 4.0.0 Reporter: Wei Liu flatMapGroupsWithState is misspelled as flatMapGroupWithState (missing an "s" after "Group") -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46873) PySpark spark.streams should not recreate new StreamingQueryManager
Wei Liu created SPARK-46873: --- Summary: PySpark spark.streams should not recreate new StreamingQueryManager Key: SPARK-46873 URL: https://issues.apache.org/jira/browse/SPARK-46873 Project: Spark Issue Type: Task Components: Connect, PySpark, SS Affects Versions: 4.0.0 Reporter: Wei Liu In Scala, there is only one streaming query manager for one spark session: ``` scala> spark.streams val *res0*: *org.apache.spark.sql.streaming.StreamingQueryManager* = org.apache.spark.sql.streaming.StreamingQueryManager@46bb8cba scala> spark.streams val *res1*: *org.apache.spark.sql.streaming.StreamingQueryManager* = org.apache.spark.sql.streaming.StreamingQueryManager@46bb8cba scala> spark.streams val *res2*: *org.apache.spark.sql.streaming.StreamingQueryManager* = org.apache.spark.sql.streaming.StreamingQueryManager@46bb8cba scala> spark.streams val *res3*: *org.apache.spark.sql.streaming.StreamingQueryManager* = org.apache.spark.sql.streaming.StreamingQueryManager@46bb8cba ``` In Python, this is currently not the case (each access creates a new StreamingQueryManager): ``` >>> spark.streams >>> spark.streams >>> spark.streams >>> spark.streams ``` Python should align with the Scala behavior. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
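A minimal sketch, not the actual patch, of the shape of the fix on the Python side: memoize the manager on the session so repeated `spark.streams` accesses return the same object. All names below are illustrative.

```
# Illustrative only; these are not PySpark's real classes.
class _QueryManager:
    pass

class _SessionLike:
    def __init__(self):
        self._streams = None

    @property
    def streams(self) -> "_QueryManager":
        if self._streams is None:      # create once, then reuse
            self._streams = _QueryManager()
        return self._streams

s = _SessionLike()
assert s.streams is s.streams          # same instance, matching the Scala behavior
```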
[jira] [Closed] (SPARK-44460) Pass user auth credential to Python workers for foreachBatch and listener
[ https://issues.apache.org/jira/browse/SPARK-44460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu closed SPARK-44460. --- > Pass user auth credential to Python workers for foreachBatch and listener > - > > Key: SPARK-44460 > URL: https://issues.apache.org/jira/browse/SPARK-44460 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.4.1 >Reporter: Raghu Angadi >Priority: Major > > No user specific credentials are sent to Python worker that runs user > functions like foreachBatch() and streaming listener. > We might need to pass in these. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46627) Streaming UI hover-over shows incorrect value
[ https://issues.apache.org/jira/browse/SPARK-46627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-46627: Attachment: Screenshot 2024-01-08 at 15.06.24.png > Streaming UI hover-over shows incorrect value > - > > Key: SPARK-46627 > URL: https://issues.apache.org/jira/browse/SPARK-46627 > Project: Spark > Issue Type: Task > Components: Structured Streaming, UI, Web UI >Affects Versions: 4.0.0 >Reporter: Wei Liu >Priority: Major > Attachments: Screenshot 2024-01-08 at 1.55.57 PM.png, Screenshot > 2024-01-08 at 15.06.24.png > > > Running a simple streaming query: > val df = spark.readStream.format("rate").option("rowsPerSecond", > "5000").load() > val q = df.writeStream.format("noop").start() > > The hover-over value is incorrect in the streaming ui (shows 321.00 at > undefined) > > !Screenshot 2024-01-08 at 1.55.57 PM.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46627) Streaming UI hover-over shows incorrect value
[ https://issues.apache.org/jira/browse/SPARK-46627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17804513#comment-17804513 ] Wei Liu commented on SPARK-46627: - Also batch percent doesn't add to 100% now: !Screenshot 2024-01-08 at 15.06.24.png! > Streaming UI hover-over shows incorrect value > - > > Key: SPARK-46627 > URL: https://issues.apache.org/jira/browse/SPARK-46627 > Project: Spark > Issue Type: Task > Components: Structured Streaming, UI, Web UI >Affects Versions: 4.0.0 >Reporter: Wei Liu >Priority: Major > Attachments: Screenshot 2024-01-08 at 1.55.57 PM.png, Screenshot > 2024-01-08 at 15.06.24.png > > > Running a simple streaming query: > val df = spark.readStream.format("rate").option("rowsPerSecond", > "5000").load() > val q = df.writeStream.format("noop").start() > > The hover-over value is incorrect in the streaming ui (shows 321.00 at > undefined) > > !Screenshot 2024-01-08 at 1.55.57 PM.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46627) Streaming UI hover-over shows incorrect value
[ https://issues.apache.org/jira/browse/SPARK-46627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-46627: Attachment: Screenshot 2024-01-08 at 1.55.57 PM.png > Streaming UI hover-over shows incorrect value > - > > Key: SPARK-46627 > URL: https://issues.apache.org/jira/browse/SPARK-46627 > Project: Spark > Issue Type: Task > Components: Structured Streaming, UI, Web UI >Affects Versions: 4.0.0 >Reporter: Wei Liu >Priority: Major > Attachments: Screenshot 2024-01-08 at 1.55.57 PM.png > > > Running a simple streaming query: > val df = spark.readStream.format("rate").option("rowsPerSecond", > "5000").load() > val q = df.writeStream.format("noop").start() > > The hover-over value is incorrect in the streaming ui: > > !https://files.slack.com/files-tmb/T02727P8HV4-F06CJ83D3JT-b44210f391/image_720.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46627) Streaming UI hover-over shows incorrect value
[ https://issues.apache.org/jira/browse/SPARK-46627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17804481#comment-17804481 ] Wei Liu commented on SPARK-46627: - Hi Kent : ) [~yao] I was wondering if you have context in this issue? Thank you so much! > Streaming UI hover-over shows incorrect value > - > > Key: SPARK-46627 > URL: https://issues.apache.org/jira/browse/SPARK-46627 > Project: Spark > Issue Type: Task > Components: Structured Streaming, UI, Web UI >Affects Versions: 4.0.0 >Reporter: Wei Liu >Priority: Major > Attachments: Screenshot 2024-01-08 at 1.55.57 PM.png > > > Running a simple streaming query: > val df = spark.readStream.format("rate").option("rowsPerSecond", > "5000").load() > val q = df.writeStream.format("noop").start() > > The hover-over value is incorrect in the streaming ui (shows 321.00 at > undefined) > > !Screenshot 2024-01-08 at 1.55.57 PM.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46627) Streaming UI hover-over shows incorrect value
[ https://issues.apache.org/jira/browse/SPARK-46627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-46627: Description: Running a simple streaming query: val df = spark.readStream.format("rate").option("rowsPerSecond", "5000").load() val q = df.writeStream.format("noop").start() The hover-over value is incorrect in the streaming ui (shows 321.00 at undefined) !Screenshot 2024-01-08 at 1.55.57 PM.png! was: Running a simple streaming query: val df = spark.readStream.format("rate").option("rowsPerSecond", "5000").load() val q = df.writeStream.format("noop").start() The hover-over value is incorrect in the streaming ui: !https://files.slack.com/files-tmb/T02727P8HV4-F06CJ83D3JT-b44210f391/image_720.png! > Streaming UI hover-over shows incorrect value > - > > Key: SPARK-46627 > URL: https://issues.apache.org/jira/browse/SPARK-46627 > Project: Spark > Issue Type: Task > Components: Structured Streaming, UI, Web UI >Affects Versions: 4.0.0 >Reporter: Wei Liu >Priority: Major > Attachments: Screenshot 2024-01-08 at 1.55.57 PM.png > > > Running a simple streaming query: > val df = spark.readStream.format("rate").option("rowsPerSecond", > "5000").load() > val q = df.writeStream.format("noop").start() > > The hover-over value is incorrect in the streaming ui (shows 321.00 at > undefined) > > !Screenshot 2024-01-08 at 1.55.57 PM.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46627) Streaming UI hover-over shows incorrect value
Wei Liu created SPARK-46627: --- Summary: Streaming UI hover-over shows incorrect value Key: SPARK-46627 URL: https://issues.apache.org/jira/browse/SPARK-46627 Project: Spark Issue Type: Task Components: Structured Streaming, UI, Web UI Affects Versions: 4.0.0 Reporter: Wei Liu Running a simple streaming query: val df = spark.readStream.format("rate").option("rowsPerSecond", "5000").load() val q = df.writeStream.format("noop").start() The hover-over value is incorrect in the streaming ui: !https://files.slack.com/files-tmb/T02727P8HV4-F06CJ83D3JT-b44210f391/image_720.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46384) Streaming UI doesn't display graph correctly
[ https://issues.apache.org/jira/browse/SPARK-46384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-46384: Summary: Streaming UI doesn't display graph correctly (was: Streaming UI doesn't show graph) > Streaming UI doesn't display graph correctly > > > Key: SPARK-46384 > URL: https://issues.apache.org/jira/browse/SPARK-46384 > Project: Spark > Issue Type: Task > Components: Structured Streaming, Web UI >Affects Versions: 4.0.0 >Reporter: Wei Liu >Priority: Major > > The Streaming UI is broken currently at spark master. Running a simple query: > ``` > q = > spark.readStream.format("rate").load().writeStream.format("memory").queryName("test_wei").start() > ``` > Would make the spark UI shows empty graph for "operation duration": > !https://private-user-images.githubusercontent.com/10248890/289990561-fdb78c92-2d6f-41a9-ba23-3068d128caa8.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTEiLCJleHAiOjE3MDI0MjI2NjksIm5iZiI6MTcwMjQyMjM2OSwicGF0aCI6Ii8xMDI0ODg5MC8yODk5OTA1NjEtZmRiNzhjOTItMmQ2Zi00MWE5LWJhMjMtMzA2OGQxMjhjYWE4LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFJV05KWUFYNENTVkVINTNBJTJGMjAyMzEyMTIlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjMxMjEyVDIzMDYwOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWI4N2FkZDYyZGQwZGZmNWJhN2IzMTM3ZmI1MzNhOGExZGY2MThjZjMwZDU5MzZiOTI4ZGVkMjc3MjBhMTNhZjUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.KXhI0NnwIpfTRjVcsXuA82AnaURHgtkLOYVzifI-mp8! > Here is the error: > !https://private-user-images.githubusercontent.com/10248890/289990953-cd477d48-a45e-4ee9-b45e-06dc1dbeb9d9.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTEiLCJleHAiOjE3MDI0MjI2NjksIm5iZiI6MTcwMjQyMjM2OSwicGF0aCI6Ii8xMDI0ODg5MC8yODk5OTA5NTMtY2Q0NzdkNDgtYTQ1ZS00ZWU5LWI0NWUtMDZkYzFkYmViOWQ5LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFJV05KWUFYNENTVkVINTNBJTJGMjAyMzEyMTIlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjMxMjEyVDIzMDYwOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWI2MzIyYjQ5OWQ3YWZlMGYzYjFmYTljZjIwZjBmZDBiNzQyZmE3OTI2ZjkxNzVhNWU0ZDAwYTA4NDRkMTRjOTMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.JAqEG_4NCEiRvHO6yv59hZdkH_5_tSUuaOkpEbH-I20! > > I verified the same query runs fine on spark 3.5, as in the following graph. > !https://private-user-images.githubusercontent.com/10248890/289990563-642aa6c3-7728-43c7-8a11-cbf79c4362c5.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTEiLCJleHAiOjE3MDI0MjI2NjksIm5iZiI6MTcwMjQyMjM2OSwicGF0aCI6Ii8xMDI0ODg5MC8yODk5OTA1NjMtNjQyYWE2YzMtNzcyOC00M2M3LThhMTEtY2JmNzljNDM2MmM1LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFJV05KWUFYNENTVkVINTNBJTJGMjAyMzEyMTIlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjMxMjEyVDIzMDYwOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWM0YmZiYTkyNGZkNzBkOGMzMmUyYTYzZTMyYzQ1ZTZkNDU3MDk2M2ZlOGNlZmQxNGYzNTFjZWRiNTQ2ZmQzZWQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.MJ8-78Qv5KkoLNGBrLXS-8gcC7LZepFsOD4r7pcnzSI! 
> > This should be a problem from the library updates, this could be a potential > source of error: [https://github.com/apache/spark/pull/42879] > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46384) Structured Streaming UI doesn't display graph correctly
[ https://issues.apache.org/jira/browse/SPARK-46384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-46384: Summary: Structured Streaming UI doesn't display graph correctly (was: Streaming UI doesn't display graph correctly) > Structured Streaming UI doesn't display graph correctly > --- > > Key: SPARK-46384 > URL: https://issues.apache.org/jira/browse/SPARK-46384 > Project: Spark > Issue Type: Task > Components: Structured Streaming, Web UI >Affects Versions: 4.0.0 >Reporter: Wei Liu >Priority: Major > > The Streaming UI is broken currently at spark master. Running a simple query: > ``` > q = > spark.readStream.format("rate").load().writeStream.format("memory").queryName("test_wei").start() > ``` > Would make the spark UI shows empty graph for "operation duration": > !https://private-user-images.githubusercontent.com/10248890/289990561-fdb78c92-2d6f-41a9-ba23-3068d128caa8.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTEiLCJleHAiOjE3MDI0MjI2NjksIm5iZiI6MTcwMjQyMjM2OSwicGF0aCI6Ii8xMDI0ODg5MC8yODk5OTA1NjEtZmRiNzhjOTItMmQ2Zi00MWE5LWJhMjMtMzA2OGQxMjhjYWE4LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFJV05KWUFYNENTVkVINTNBJTJGMjAyMzEyMTIlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjMxMjEyVDIzMDYwOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWI4N2FkZDYyZGQwZGZmNWJhN2IzMTM3ZmI1MzNhOGExZGY2MThjZjMwZDU5MzZiOTI4ZGVkMjc3MjBhMTNhZjUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.KXhI0NnwIpfTRjVcsXuA82AnaURHgtkLOYVzifI-mp8! > Here is the error: > !https://private-user-images.githubusercontent.com/10248890/289990953-cd477d48-a45e-4ee9-b45e-06dc1dbeb9d9.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTEiLCJleHAiOjE3MDI0MjI2NjksIm5iZiI6MTcwMjQyMjM2OSwicGF0aCI6Ii8xMDI0ODg5MC8yODk5OTA5NTMtY2Q0NzdkNDgtYTQ1ZS00ZWU5LWI0NWUtMDZkYzFkYmViOWQ5LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFJV05KWUFYNENTVkVINTNBJTJGMjAyMzEyMTIlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjMxMjEyVDIzMDYwOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWI2MzIyYjQ5OWQ3YWZlMGYzYjFmYTljZjIwZjBmZDBiNzQyZmE3OTI2ZjkxNzVhNWU0ZDAwYTA4NDRkMTRjOTMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.JAqEG_4NCEiRvHO6yv59hZdkH_5_tSUuaOkpEbH-I20! > > I verified the same query runs fine on spark 3.5, as in the following graph. > !https://private-user-images.githubusercontent.com/10248890/289990563-642aa6c3-7728-43c7-8a11-cbf79c4362c5.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTEiLCJleHAiOjE3MDI0MjI2NjksIm5iZiI6MTcwMjQyMjM2OSwicGF0aCI6Ii8xMDI0ODg5MC8yODk5OTA1NjMtNjQyYWE2YzMtNzcyOC00M2M3LThhMTEtY2JmNzljNDM2MmM1LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFJV05KWUFYNENTVkVINTNBJTJGMjAyMzEyMTIlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjMxMjEyVDIzMDYwOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWM0YmZiYTkyNGZkNzBkOGMzMmUyYTYzZTMyYzQ1ZTZkNDU3MDk2M2ZlOGNlZmQxNGYzNTFjZWRiNTQ2ZmQzZWQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.MJ8-78Qv5KkoLNGBrLXS-8gcC7LZepFsOD4r7pcnzSI! 
> > This should be a problem from the library updates, this could be a potential > source of error: [https://github.com/apache/spark/pull/42879] > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46384) Streaming UI doesn't show graph
Wei Liu created SPARK-46384: --- Summary: Streaming UI doesn't show graph Key: SPARK-46384 URL: https://issues.apache.org/jira/browse/SPARK-46384 Project: Spark Issue Type: Task Components: Structured Streaming, Web UI Affects Versions: 4.0.0 Reporter: Wei Liu The Streaming UI is broken currently at spark master. Running a simple query: ``` q = spark.readStream.format("rate").load().writeStream.format("memory").queryName("test_wei").start() ``` Would make the spark UI shows empty graph for "operation duration": !https://private-user-images.githubusercontent.com/10248890/289990561-fdb78c92-2d6f-41a9-ba23-3068d128caa8.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTEiLCJleHAiOjE3MDI0MjI2NjksIm5iZiI6MTcwMjQyMjM2OSwicGF0aCI6Ii8xMDI0ODg5MC8yODk5OTA1NjEtZmRiNzhjOTItMmQ2Zi00MWE5LWJhMjMtMzA2OGQxMjhjYWE4LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFJV05KWUFYNENTVkVINTNBJTJGMjAyMzEyMTIlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjMxMjEyVDIzMDYwOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWI4N2FkZDYyZGQwZGZmNWJhN2IzMTM3ZmI1MzNhOGExZGY2MThjZjMwZDU5MzZiOTI4ZGVkMjc3MjBhMTNhZjUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.KXhI0NnwIpfTRjVcsXuA82AnaURHgtkLOYVzifI-mp8! Here is the error: !https://private-user-images.githubusercontent.com/10248890/289990953-cd477d48-a45e-4ee9-b45e-06dc1dbeb9d9.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTEiLCJleHAiOjE3MDI0MjI2NjksIm5iZiI6MTcwMjQyMjM2OSwicGF0aCI6Ii8xMDI0ODg5MC8yODk5OTA5NTMtY2Q0NzdkNDgtYTQ1ZS00ZWU5LWI0NWUtMDZkYzFkYmViOWQ5LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFJV05KWUFYNENTVkVINTNBJTJGMjAyMzEyMTIlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjMxMjEyVDIzMDYwOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWI2MzIyYjQ5OWQ3YWZlMGYzYjFmYTljZjIwZjBmZDBiNzQyZmE3OTI2ZjkxNzVhNWU0ZDAwYTA4NDRkMTRjOTMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.JAqEG_4NCEiRvHO6yv59hZdkH_5_tSUuaOkpEbH-I20! I verified the same query runs fine on spark 3.5, as in the following graph. !https://private-user-images.githubusercontent.com/10248890/289990563-642aa6c3-7728-43c7-8a11-cbf79c4362c5.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTEiLCJleHAiOjE3MDI0MjI2NjksIm5iZiI6MTcwMjQyMjM2OSwicGF0aCI6Ii8xMDI0ODg5MC8yODk5OTA1NjMtNjQyYWE2YzMtNzcyOC00M2M3LThhMTEtY2JmNzljNDM2MmM1LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFJV05KWUFYNENTVkVINTNBJTJGMjAyMzEyMTIlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjMxMjEyVDIzMDYwOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWM0YmZiYTkyNGZkNzBkOGMzMmUyYTYzZTMyYzQ1ZTZkNDU3MDk2M2ZlOGNlZmQxNGYzNTFjZWRiNTQ2ZmQzZWQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.MJ8-78Qv5KkoLNGBrLXS-8gcC7LZepFsOD4r7pcnzSI! This should be a problem from the library updates, this could be a potential source of error: [https://github.com/apache/spark/pull/42879] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46250) Deflake test_parity_listener
Wei Liu created SPARK-46250: --- Summary: Deflake test_parity_listener Key: SPARK-46250 URL: https://issues.apache.org/jira/browse/SPARK-46250 Project: Spark Issue Type: Task Components: Connect, SS Affects Versions: 4.0.0 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45845) Streaming UI add number of evicted state rows
Wei Liu created SPARK-45845: --- Summary: Streaming UI add number of evicted state rows Key: SPARK-45845 URL: https://issues.apache.org/jira/browse/SPARK-45845 Project: Spark Issue Type: Task Components: Structured Streaming Affects Versions: 4.0.0 Reporter: Wei Liu The UI is missing this chart, and people always confuse "aggregated number of rows dropped by watermark" with this newly added metric -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45637) Time window aggregation in separate streams followed by stream-stream join not returning results
[ https://issues.apache.org/jira/browse/SPARK-45637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-45637: Description: According to documentation update (SPARK-42591) resulting from SPARK-42376, Spark 3.5.0 should support time-window aggregations in two separate streams followed by stream-stream window join: [https://github.com/apache/spark/blob/261b281e6e57be32eb28bf4e50bea24ed22a9f21/docs/structured-streaming-programming-guide.md?plain=1#L1939-L1995] However, I failed to reproduce this example and the query I built doesn't return any results: {code:java} from pyspark.sql.functions import rand from pyspark.sql.functions import expr, window, window_time spark.conf.set("spark.sql.shuffle.partitions", "1") impressions = ( spark .readStream.format("rate").option("rowsPerSecond", "5").option("numPartitions", "1").load() .selectExpr("value AS adId", "timestamp AS impressionTime") ) impressionsWithWatermark = impressions \ .selectExpr("adId AS impressionAdId", "impressionTime") \ .withWatermark("impressionTime", "10 seconds") clicks = ( spark .readStream.format("rate").option("rowsPerSecond", "5").option("numPartitions", "1").load() .where((rand() * 100).cast("integer") < 10) # 10 out of every 100 impressions result in a click .selectExpr("(value - 10) AS adId ", "timestamp AS clickTime") # -10 so that a click with same id as impression is generated later (i.e. delayed data). .where("adId > 0") ) clicksWithWatermark = clicks \ .selectExpr("adId AS clickAdId", "clickTime") \ .withWatermark("clickTime", "10 seconds") clicksWindow = clicksWithWatermark.groupBy( window(clicksWithWatermark.clickTime, "1 minute") ).count() impressionsWindow = impressionsWithWatermark.groupBy( window(impressionsWithWatermark.impressionTime, "1 minute") ).count() clicksAndImpressions = clicksWindow.join(impressionsWindow, "window", "inner") clicksAndImpressions.writeStream \ .format("memory") \ .queryName("clicksAndImpressions") \ .outputMode("append") \ .start() {code} My intuition is that I'm getting no results because to output results of the first stateful operator (time window aggregation), a watermark needs to pass the end timestamp of the window. And once the watermark is after the end timestamp of the window, this window is ignored at the second stateful operator (stream-stream) join because it's behind the watermark. 
Indeed, a small hack done to event time column (adding one minute) between two stateful operators makes it possible to get results: {code:java} clicksWindow2 = clicksWithWatermark.groupBy( window(clicksWithWatermark.clickTime, "1 minute") ).count().withColumn("window_time", window_time("window") + expr('INTERVAL 1 MINUTE')).drop("window") impressionsWindow2 = impressionsWithWatermark.groupBy( window(impressionsWithWatermark.impressionTime, "1 minute") ).count().withColumn("window_time", window_time("window") + expr('INTERVAL 1 MINUTE')).drop("window") clicksAndImpressions2 = clicksWindow2.join(impressionsWindow2, "window_time", "inner") clicksAndImpressions2.writeStream \ .format("memory") \ .queryName("clicksAndImpressions2") \ .outputMode("append") \ .start() {code} was: According to documentation update (SPARK-42591) resulting from SPARK-42376, Spark 3.5.0 should support time-window aggregations in two separate streams followed by stream-stream window join: https://github.com/apache/spark/blob/261b281e6e57be32eb28bf4e50bea24ed22a9f21/docs/structured-streaming-programming-guide.md?plain=1#L1939-L1995 However, I failed to reproduce this example and the query I built doesn't return any results: {code:java} from pyspark.sql.functions import rand from pyspark.sql.functions import expr, window, window_time spark.conf.set("spark.sql.shuffle.partitions", "1") impressions = ( spark .readStream.format("rate").option("rowsPerSecond", "5").option("numPartitions", "1").load() .selectExpr("value AS adId", "timestamp AS impressionTime") ) impressionsWithWatermark = impressions \ .selectExpr("adId AS impressionAdId", "impressionTime") \ .withWatermark("impressionTime", "10 seconds") clicks = ( spark .readStream.format("rate").option("rowsPerSecond", "5").option("numPartitions", "1").load() .where((rand() * 100).cast("integer") < 10) # 10 out of every 100 impressions result in a click .selectExpr("(value - 10) AS adId ", "timestamp AS clickTime") # -10 so that a click with same id as impression is generated later (i.e. delayed data). .where("adId > 0") ) clicksWithWatermark = clicks \ .selectExpr("adId AS clickAdId", "clickTime") \ .withWatermark("clickTime", "10 seconds") clicksWindow = clicksWithWatermark.groupBy( window(clicksWithWatermark.clickTime, "1 minute") ).count() impressionsWindow
[jira] [Created] (SPARK-45677) Observe API error logging
Wei Liu created SPARK-45677: --- Summary: Observe API error logging Key: SPARK-45677 URL: https://issues.apache.org/jira/browse/SPARK-45677 Project: Spark Issue Type: Task Components: Structured Streaming Affects Versions: 4.0.0 Reporter: Wei Liu We should tell the user why it's not supported and what to do instead: [https://github.com/apache/spark/blob/536439244593d40bdab88e9d3657f2691d3d33f2/sql/core/src/main/scala/org/apache/spark/sql/Observation.scala#L76] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45053) Improve python version mismatch logging
[ https://issues.apache.org/jira/browse/SPARK-45053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-45053: Description: Currently the wording of the Python version mismatch message is a little confusing: it uses (3,9) to represent Python version 3.9. Just a minor update to make it more straightforward > Improve python version mismatch logging > --- > > Key: SPARK-45053 > URL: https://issues.apache.org/jira/browse/SPARK-45053 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.1 >Reporter: Wei Liu >Priority: Trivial > > Currently the wording of the Python version mismatch message is a little > confusing: it uses (3,9) to represent Python version 3.9. Just a minor > update to make it more straightforward -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
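To make the complaint concrete, a small sketch (the message text is an approximation, not Spark's exact wording) of rendering the version tuple as a dotted string instead of printing it as (3, 9):

```
# Approximate message text; only the formatting change is the point here.
import sys

worker_version = sys.version_info[:2]    # e.g. (3, 9)
driver_version = (3, 10)                 # hypothetical mismatching driver version

def dotted(v):
    return ".".join(str(part) for part in v)

confusing = ("Python in worker has different version %s than that in driver %s"
             % (worker_version, driver_version))
clearer = ("Python in worker has different version %s than that in driver %s"
           % (dotted(worker_version), dotted(driver_version)))

print(confusing)   # ... version (3, 9) ... driver (3, 10)
print(clearer)     # ... version 3.9 ... driver 3.10
```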
[jira] [Created] (SPARK-45056) Add process termination tests for Python foreachBatch and StreamingQueryListener
Wei Liu created SPARK-45056: --- Summary: Add process termination tests for Python foreachBatch and StreamingQueryListener Key: SPARK-45056 URL: https://issues.apache.org/jira/browse/SPARK-45056 Project: Spark Issue Type: Task Components: Connect, Structured Streaming Affects Versions: 4.0.0 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45053) Improve python version mismatch logging
Wei Liu created SPARK-45053: --- Summary: Improve python version mismatch logging Key: SPARK-45053 URL: https://issues.apache.org/jira/browse/SPARK-45053 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 4.0.0 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44971) [BUG Fix] PySpark StreamingQuerProgress fromJson
[ https://issues.apache.org/jira/browse/SPARK-44971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-44971: Issue Type: Bug (was: New Feature) > [BUG Fix] PySpark StreamingQuerProgress fromJson > - > > Key: SPARK-44971 > URL: https://issues.apache.org/jira/browse/SPARK-44971 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.5.0, 3.5.1 >Reporter: Wei Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44971) [BUG Fix] PySpark StreamingQuerProgress fromJson
Wei Liu created SPARK-44971: --- Summary: [BUG Fix] PySpark StreamingQuerProgress fromJson Key: SPARK-44971 URL: https://issues.apache.org/jira/browse/SPARK-44971 Project: Spark Issue Type: New Feature Components: Structured Streaming Affects Versions: 3.5.0, 3.5.1 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44917) PySpark Streaming DataStreamWriter table API
[ https://issues.apache.org/jira/browse/SPARK-44917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu resolved SPARK-44917. - Resolution: Not A Problem > PySpark Streaming DataStreamWriter table API > > > Key: SPARK-44917 > URL: https://issues.apache.org/jira/browse/SPARK-44917 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Wei Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44917) PySpark Streaming DataStreamWriter table API
Wei Liu created SPARK-44917: --- Summary: PySpark Streaming DataStreamWriter table API Key: SPARK-44917 URL: https://issues.apache.org/jira/browse/SPARK-44917 Project: Spark Issue Type: New Feature Components: Structured Streaming Affects Versions: 4.0.0 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44460) Pass user auth credential to Python workers for foreachBatch and listener
[ https://issues.apache.org/jira/browse/SPARK-44460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757087#comment-17757087 ] Wei Liu commented on SPARK-44460: - [~rangadi] This seems to be a Databricks internal issue. See the updates in SC-138245 > Pass user auth credential to Python workers for foreachBatch and listener > - > > Key: SPARK-44460 > URL: https://issues.apache.org/jira/browse/SPARK-44460 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.4.1 >Reporter: Raghu Angadi >Priority: Major > > No user specific credentials are sent to Python worker that runs user > functions like foreachBatch() and streaming listener. > We might need to pass in these. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44839) Better error logging when user accesses spark session in foreachBatch and Listener
Wei Liu created SPARK-44839: --- Summary: Better error logging when user accesses spark session in foreachBatch and Listener Key: SPARK-44839 URL: https://issues.apache.org/jira/browse/SPARK-44839 Project: Spark Issue Type: New Feature Components: Connect, Structured Streaming Affects Versions: 3.5.0, 4.0.0 Reporter: Wei Liu The user gets `TypeError: cannot pickle '_thread._local' object` when accessing `spark` inside the function; we need a better error message for this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
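A simplified sketch of the pattern that triggers this (the sc://localhost endpoint is a placeholder): under Spark Connect the function is pickled and shipped to the server, and the captured client-side `spark` session cannot be pickled, which surfaces as the opaque TypeError above.

```
# Simplified sketch; sc://localhost is a placeholder Connect endpoint.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

def process_batch(batch_df, batch_id):
    # Referencing the outer client session here makes the closure unpicklable,
    # which is what produces "cannot pickle '_thread._local' object".
    spark.sql("SELECT 1").show()

q = (
    spark.readStream.format("rate").load()
    .writeStream.foreachBatch(process_batch)
    .start()
)
```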
[jira] [Commented] (SPARK-44808) refreshListener() API on StreamingQueryManager for spark connect
[ https://issues.apache.org/jira/browse/SPARK-44808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754280#comment-17754280 ] Wei Liu commented on SPARK-44808: - This seems to be against the design principle of spark connect – we pass everything to the server side. The client only keeps IDs to get status. That's the reason why the foreachBatch and Listener are run on the server side. This can be closed > refreshListener() API on StreamingQueryManager for spark connect > > > Key: SPARK-44808 > URL: https://issues.apache.org/jira/browse/SPARK-44808 > Project: Spark > Issue Type: New Feature > Components: Connect, Structured Streaming >Affects Versions: 4.0.0 >Reporter: Wei Liu >Priority: Major > > I’m thinking of an improvement for connect python listener and foreachBatch. > Currently if you define a variable outside of the function, you can’t > actually see it on client side if it’s touched in the function because the > operation is on server side, e.g. > > > {code:java} > x = 0 > class MyListener(StreamingQueryListener): > def onQueryStarted(e): > x = 100 > self.y = 200 > spark.streams.addListener(MyListener()) > q = spark.readStream.start() > > # x is still 0, self.y is not defined > {code} > > > But for the self.y case, there could be an improvement. > > The server side is capable of pickle serialize the whole listener instance > again, and send back to the client. > > So we could define a new interface on the streaming query manager, maybe > called refreshListener(listener), e.g. use it as > `spark.streams.refreshListener()` > > > {code:java} > def refreshListener(listener: StreamingQueryListener) -> > StreamingQueryListener > # send the listener id with this refresh request, server receives this > request and serializes the listener again, and send back to client, so the > returned new listener contains the updated value self.y > > {code} > > For `foreachBatch`, we could wrap the function to a new class > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44808) refreshListener() API on StreamingQueryManager for spark connect
[ https://issues.apache.org/jira/browse/SPARK-44808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu resolved SPARK-44808. - Resolution: Won't Do > refreshListener() API on StreamingQueryManager for spark connect > > > Key: SPARK-44808 > URL: https://issues.apache.org/jira/browse/SPARK-44808 > Project: Spark > Issue Type: New Feature > Components: Connect, Structured Streaming >Affects Versions: 4.0.0 >Reporter: Wei Liu >Priority: Major > > I’m thinking of an improvement for connect python listener and foreachBatch. > Currently if you define a variable outside of the function, you can’t > actually see it on client side if it’s touched in the function because the > operation is on server side, e.g. > > > {code:java} > x = 0 > class MyListener(StreamingQueryListener): > def onQueryStarted(e): > x = 100 > self.y = 200 > spark.streams.addListener(MyListener()) > q = spark.readStream.start() > > # x is still 0, self.y is not defined > {code} > > > But for the self.y case, there could be an improvement. > > The server side is capable of pickle serialize the whole listener instance > again, and send back to the client. > > So we could define a new interface on the streaming query manager, maybe > called refreshListener(listener), e.g. use it as > `spark.streams.refreshListener()` > > > {code:java} > def refreshListener(listener: StreamingQueryListener) -> > StreamingQueryListener > # send the listener id with this refresh request, server receives this > request and serializes the listener again, and send back to client, so the > returned new listener contains the updated value self.y > > {code} > > For `foreachBatch`, we could wrap the function to a new class > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44808) refreshListener() API on StreamingQueryManager for spark connect
[ https://issues.apache.org/jira/browse/SPARK-44808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-44808: Description: I’m thinking of an improvement for the Connect Python listener and foreachBatch. Currently, if you define a variable outside of the function, you can’t actually see the change on the client side when it’s modified inside the function, because the function runs on the server side, e.g. {code:java} x = 0 class MyListener(StreamingQueryListener): def onQueryStarted(e): x = 100 self.y = 200 spark.streams.addListener(MyListener()) q = spark.readStream.start() # x is still 0, self.y is not defined {code} But for the self.y case, there could be an improvement. The server side is capable of pickle-serializing the whole listener instance again and sending it back to the client. So we could define a new interface on the streaming query manager, maybe called refreshListener(listener), used as `spark.streams.refreshListener()` {code:java} def refreshListener(listener: StreamingQueryListener) -> StreamingQueryListener # send the listener id with this refresh request; the server receives the request, serializes the listener again, and sends it back to the client, so the returned listener contains the updated value of self.y {code} For `foreachBatch`, we could wrap the function in a new class.
was: I’m thinking of an improvement for the Connect Python listener and foreachBatch. Currently, if you define a variable outside of the function, you can’t actually see the change on the client side when it’s modified inside the function, because the function runs on the server side, e.g. {code:java} x = 0 class MyListener(StreamingQueryListener): def onQueryStarted(e): x = 100 self.y = 200 spark.streams.addListener(MyListener()) q = spark.readStream.start() # x is still 0, self.y is not defined {code} But for the self.y case, there could be an improvement. The server side is capable of pickle-serializing the whole listener instance again and sending it back to the client. So if we define a new interface on the streaming query manager, maybe called refreshListener(listener), used as `spark.streams.refreshListener()` {code:java} def refreshListener(listener: StreamingQueryListener) -> StreamingQueryListener # send the listener id with this refresh request; the server receives the request, serializes the listener again, and sends it back to the client, so the returned listener contains the updated value of self.y {code} For `foreachBatch`, we could wrap the function in a new class.
> refreshListener() API on StreamingQueryManager for spark connect > > > Key: SPARK-44808 > URL: https://issues.apache.org/jira/browse/SPARK-44808 > Project: Spark > Issue Type: New Feature > Components: Connect, Structured Streaming > Affects Versions: 4.0.0 > Reporter: Wei Liu > Priority: Major > > I’m thinking of an improvement for the Connect Python listener and foreachBatch. > Currently, if you define a variable outside of the function, you can’t actually see the change on the client side when it’s modified inside the function, because the function runs on the server side, e.g. > > {code:java} > x = 0 > class MyListener(StreamingQueryListener): > def onQueryStarted(e): > x = 100 > self.y = 200 > spark.streams.addListener(MyListener()) > q = spark.readStream.start() > > # x is still 0, self.y is not defined > {code} > > But for the self.y case, there could be an improvement. > > The server side is capable of pickle-serializing the whole listener instance again and sending it back to the client. > > So we could define a new interface on the streaming query manager, maybe called refreshListener(listener), used as `spark.streams.refreshListener()` > > {code:java} > def refreshListener(listener: StreamingQueryListener) -> StreamingQueryListener > # send the listener id with this refresh request; the server receives the request, serializes the listener again, and sends it back to the client, so the returned listener contains the updated value of self.y {code} > > For `foreachBatch`, we could wrap the function in a new class. >
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
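To make the proposal above concrete, here is a minimal sketch of how the proposed API might be used from a Connect client. Note that refreshListener() does not exist yet, and the CountingListener class and rate/console query below are hypothetical, used only for illustration:
{code:python}
from pyspark.sql.streaming import StreamingQueryListener

class CountingListener(StreamingQueryListener):
    def __init__(self):
        self.progress_count = 0  # in Spark Connect this state is updated only on the server

    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        self.progress_count += 1  # runs server side; the client's copy never sees this

    def onQueryIdle(self, event):
        pass

    def onQueryTerminated(self, event):
        pass

listener = CountingListener()
spark.streams.addListener(listener)

df = spark.readStream.format("rate").load()
query = df.writeStream.format("console").start()

# ... after a few micro-batches ...
print(listener.progress_count)   # still 0 on the client

# Proposed API (does not exist yet): ask the server to re-serialize its copy
# of the listener and return it, so the client sees the updated state.
refreshed = spark.streams.refreshListener(listener)
print(refreshed.progress_count)  # would reflect the server-side increments
{code}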
[jira] [Updated] (SPARK-44808) refreshListener() API on StreamingQueryManager for spark connect
[ https://issues.apache.org/jira/browse/SPARK-44808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-44808: Description: I’m thinking of an improvement for connect python listener and foreachBatch. Currently if you define a variable outside of the function, you can’t actually see it on client side if it’s touched in the function because the operation is on server side, e.g. {code:java} x = 0 class MyListener(StreamingQueryListener): def onQueryStarted(e): x = 100 self.y = 200 spark.streams.addListener(MyListener()) q = spark.readStream.start() # x is still 0, self.y is not defined {code} But for the self.y case, there could be an improvement. The server side is capable of pickle serialize the whole listener instance again, and send back to the client. So if we define a new interface on the streaming query manager, maybe called refreshListener(listener), e.g. use it as `spark.streams.refreshListener()` {code:java} def refreshListener(listener: StreamingQueryListener) -> StreamingQueryListener # send the listener id with this refresh request, server receives this request and serializes the listener again, and send back to client, so the returned new listener contains the updated value self.y {code} For `foreachBatch`, we could wrap the function to a new class was: I’m thinking of an improvement for connect python listener and foreachBatch. Currently if you define a variable outside of the function, you can’t actually see it on client side if it’s touched in the function because the operation is on server side, e.g. {code:java} x = 0 class MyListener(StreamingQueryListener): def onQueryStarted(e): x = 100 self.y = 200 spark.streams.addListener(MyListener()) q = spark.readStream.start() # x is still 0, self.y is not defined {code} But for the self.y case, there could be an improvement. The server side is capable of pickle serialize the whole listener instance again, and send back to the client. So if we define a new interface on the streaming query manager, maybe called refreshListener(listener), e.g. use it as `spark.streams.refreshListener()` {code:java} def refreshListener(listener: StreamingQueryListener) -> StreamingQueryListener # send the listener id with this refresh request, server receives this request and serializes the listener again, and send back to client, so the returned new listener contains the updated value self.y {code} For `foreachBatch`, we might could wrap the function to a new class, like the listener case > refreshListener() API on StreamingQueryManager for spark connect > > > Key: SPARK-44808 > URL: https://issues.apache.org/jira/browse/SPARK-44808 > Project: Spark > Issue Type: New Feature > Components: Connect, Structured Streaming >Affects Versions: 4.0.0 >Reporter: Wei Liu >Priority: Major > > I’m thinking of an improvement for connect python listener and foreachBatch. > Currently if you define a variable outside of the function, you can’t > actually see it on client side if it’s touched in the function because the > operation is on server side, e.g. > > > {code:java} > x = 0 > class MyListener(StreamingQueryListener): > def onQueryStarted(e): > x = 100 > self.y = 200 > spark.streams.addListener(MyListener()) > q = spark.readStream.start() > > # x is still 0, self.y is not defined > {code} > > > But for the self.y case, there could be an improvement. > > The server side is capable of pickle serialize the whole listener instance > again, and send back to the client. 
> > So if we define a new interface on the streaming query manager, maybe called > refreshListener(listener), e.g. use it as `spark.streams.refreshListener()` > > > {code:java} > def refreshListener(listener: StreamingQueryListener) -> > StreamingQueryListener > # send the listener id with this refresh request, server receives this > request and serializes the listener again, and send back to client, so the > returned new listener contains the updated value self.y > > {code} > > For `foreachBatch`, we could wrap the function to a new class > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44808) refreshListener() API on StreamingQueryManager for spark connect
[ https://issues.apache.org/jira/browse/SPARK-44808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-44808: Description: I’m thinking of an improvement for connect python listener and foreachBatch. Currently if you define a variable outside of the function, you can’t actually see it on client side if it’s touched in the function because the operation is on server side, e.g. {code:java} x = 0 class MyListener(StreamingQueryListener): def onQueryStarted(e): x = 100 self.y = 200 spark.streams.addListener(MyListener()) q = spark.readStream.start() # x is still 0, self.y is not defined {code} But for the self.y case, there could be an improvement. The server side is capable of pickle serialize the whole listener instance again, and send back to the client. So if we define a new interface on the streaming query manager, maybe called refreshListener(listener), e.g. use it as `spark.streams.refreshListener()` {code:java} def refreshListener(listener: StreamingQueryListener) -> StreamingQueryListener # send the listener id with this refresh request, server receives this request and serializes the listener again, and send back to client, so the returned new listener contains the updated value self.y {code} For `foreachBatch`, we might could wrap the function to a new class, like the listener case was: I’m thinking of an improvement for connect python listener and foreachBatch. Currently if you define a variable outside of the function, you can’t actually see it on client side if it’s touched in the function because the operation is on server side, e.g. {code:java} x = 0 class MyListener(StreamingQueryListener): def onQueryStarted(e): x = 100 self.y = 200 spark.streams.addListener(MyListener()) q = spark.readStream.start() # x is still 0, self.y is not defined {code} But for the self.y case, there could be an improvement. The server side is capable of pickle serialize the whole listener instance again, and send back to the client. So if we define a new interface on the streaming query manager, maybe called refreshListener(listener), e.g. use it as `spark.streams.refreshListener()` {code:java} def refreshListener(listener: StreamingQueryListener) -> StreamingQueryListener # send the listener id with this refresh request, server receives this request and serializes the listener again, and send back to client, so the returned new listener contains the updated value self.y {code} > refreshListener() API on StreamingQueryManager for spark connect > > > Key: SPARK-44808 > URL: https://issues.apache.org/jira/browse/SPARK-44808 > Project: Spark > Issue Type: New Feature > Components: Connect, Structured Streaming >Affects Versions: 4.0.0 >Reporter: Wei Liu >Priority: Major > > I’m thinking of an improvement for connect python listener and foreachBatch. > Currently if you define a variable outside of the function, you can’t > actually see it on client side if it’s touched in the function because the > operation is on server side, e.g. > > > {code:java} > x = 0 > class MyListener(StreamingQueryListener): > def onQueryStarted(e): > x = 100 > self.y = 200 > spark.streams.addListener(MyListener()) > q = spark.readStream.start() > > # x is still 0, self.y is not defined > {code} > > > But for the self.y case, there could be an improvement. > > The server side is capable of pickle serialize the whole listener instance > again, and send back to the client. > > So if we define a new interface on the streaming query manager, maybe called > refreshListener(listener), e.g. 
use it as `spark.streams.refreshListener()` > > > {code:java} > def refreshListener(listener: StreamingQueryListener) -> > StreamingQueryListener > # send the listener id with this refresh request, server receives this > request and serializes the listener again, and send back to client, so the > returned new listener contains the updated value self.y > > {code} > > For `foreachBatch`, we might could wrap the function to a new class, like the > listener case > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44808) refreshListener() API on StreamingQueryManager for spark connect
[ https://issues.apache.org/jira/browse/SPARK-44808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-44808: Description: I’m thinking of an improvement for connect python listener and foreachBatch. Currently if you define a variable outside of the function, you can’t actually see it on client side if it’s touched in the function because the operation is on server side, e.g. {code:java} x = 0 class MyListener(StreamingQueryListener): def onQueryStarted(e): x = 100 self.y = 200 spark.streams.addListener(MyListener()) q = spark.readStream.start() # x is still 0, self.y is not defined {code} But for the self.y case, there could be an improvement. The server side is capable of pickle serialize the whole listener instance again, and send back to the client. So if we define a new interface on the streaming query manager, maybe called refreshListener(listener), e.g. use it as `spark.streams.refreshListener()` {code:java} def refreshListener(listener: StreamingQueryListener) -> StreamingQueryListener # send the listener id with this refresh request, server receives this request and serializes the listener again, and send back to client, so the returned new listener contains the updated value self.y {code} was: I’m thinking of an improvement for python listener and foreachBatch. Currently if you define a variable outside of the function, you can’t actually see it on client side if it’s touched in the function because the operation is on server side, e.g. {code:java} x = 0 class MyListener(StreamingQueryListener): def onQueryStarted(e): x = 100 self.y = 200 spark.streams.addListener(MyListener()) q = spark.readStream.start() # x is still 0, self.y is not defined {code} But for the self.y case, there could be an improvement. The server side is capable of pickle serialize the whole listener instance again, and send back to the client. So if we define a new interface on the streaming query manager, maybe called refreshListener(listener), e.g. use it as `spark.streams.refreshListener()` {code:java} def refreshListener(listener: StreamingQueryListener) -> StreamingQueryListener # send the listener id with this refresh request, server receives this request and serializes the listener again, and send back to client, so the returned new listener contains the updated value self.y {code} > refreshListener() API on StreamingQueryManager for spark connect > > > Key: SPARK-44808 > URL: https://issues.apache.org/jira/browse/SPARK-44808 > Project: Spark > Issue Type: New Feature > Components: Connect, Structured Streaming >Affects Versions: 4.0.0 >Reporter: Wei Liu >Priority: Major > > I’m thinking of an improvement for connect python listener and foreachBatch. > Currently if you define a variable outside of the function, you can’t > actually see it on client side if it’s touched in the function because the > operation is on server side, e.g. > > > {code:java} > x = 0 > class MyListener(StreamingQueryListener): > def onQueryStarted(e): > x = 100 > self.y = 200 > spark.streams.addListener(MyListener()) > q = spark.readStream.start() > > # x is still 0, self.y is not defined > {code} > > > But for the self.y case, there could be an improvement. > > The server side is capable of pickle serialize the whole listener instance > again, and send back to the client. > > So if we define a new interface on the streaming query manager, maybe called > refreshListener(listener), e.g. 
use it as `spark.streams.refreshListener()` > > > {code:java} > def refreshListener(listener: StreamingQueryListener) -> > StreamingQueryListener > # send the listener id with this refresh request, server receives this > request and serializes the listener again, and send back to client, so the > returned new listener contains the updated value self.y > > {code} > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44808) refreshListener() API on StreamingQueryManager for spark connect
Wei Liu created SPARK-44808: --- Summary: refreshListener() API on StreamingQueryManager for spark connect Key: SPARK-44808 URL: https://issues.apache.org/jira/browse/SPARK-44808 Project: Spark Issue Type: New Feature Components: Connect, Structured Streaming Affects Versions: 4.0.0 Reporter: Wei Liu I’m thinking of an improvement for python listener and foreachBatch. Currently if you define a variable outside of the function, you can’t actually see it on client side if it’s touched in the function because the operation is on server side, e.g. ``` x = 0 class MyListener(StreamingQueryListener): def onQueryStarted(e): x = 100 self.y = 200 spark.streams.addListener(MyListener()) q = spark.readStream.start() # x is still 0, self.y is not defined ``` But for the self.y case, there could be an improvement. The server side is capable of pickle serialize the whole listener instance again, and send back to the client. So if we define a new interface on the streaming query manager, maybe called refreshListener(listener), e.g. use it as `spark.streams.refreshListener()` ``` def refreshListener(listener: StreamingQueryListener) -> StreamingQueryListener # send the listener id with this refresh request, server receives this request and serializes the listener again, and send back to client, so the returned new listener contains the updated value self.y ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44808) refreshListener() API on StreamingQueryManager for spark connect
[ https://issues.apache.org/jira/browse/SPARK-44808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-44808: Description: I’m thinking of an improvement for python listener and foreachBatch. Currently if you define a variable outside of the function, you can’t actually see it on client side if it’s touched in the function because the operation is on server side, e.g. {code:java} x = 0 class MyListener(StreamingQueryListener): def onQueryStarted(e): x = 100 self.y = 200 spark.streams.addListener(MyListener()) q = spark.readStream.start() # x is still 0, self.y is not defined {code} But for the self.y case, there could be an improvement. The server side is capable of pickle serialize the whole listener instance again, and send back to the client. So if we define a new interface on the streaming query manager, maybe called refreshListener(listener), e.g. use it as `spark.streams.refreshListener()` {code:java} def refreshListener(listener: StreamingQueryListener) -> StreamingQueryListener # send the listener id with this refresh request, server receives this request and serializes the listener again, and send back to client, so the returned new listener contains the updated value self.y {code} was: I’m thinking of an improvement for python listener and foreachBatch. Currently if you define a variable outside of the function, you can’t actually see it on client side if it’s touched in the function because the operation is on server side, e.g. ``` x = 0 class MyListener(StreamingQueryListener): def onQueryStarted(e): x = 100 self.y = 200 spark.streams.addListener(MyListener()) q = spark.readStream.start() # x is still 0, self.y is not defined ``` But for the self.y case, there could be an improvement. The server side is capable of pickle serialize the whole listener instance again, and send back to the client. So if we define a new interface on the streaming query manager, maybe called refreshListener(listener), e.g. use it as `spark.streams.refreshListener()` ``` def refreshListener(listener: StreamingQueryListener) -> StreamingQueryListener # send the listener id with this refresh request, server receives this request and serializes the listener again, and send back to client, so the returned new listener contains the updated value self.y ``` > refreshListener() API on StreamingQueryManager for spark connect > > > Key: SPARK-44808 > URL: https://issues.apache.org/jira/browse/SPARK-44808 > Project: Spark > Issue Type: New Feature > Components: Connect, Structured Streaming >Affects Versions: 4.0.0 >Reporter: Wei Liu >Priority: Major > > I’m thinking of an improvement for python listener and foreachBatch. > Currently if you define a variable outside of the function, you can’t > actually see it on client side if it’s touched in the function because the > operation is on server side, e.g. > > > {code:java} > x = 0 > class MyListener(StreamingQueryListener): > def onQueryStarted(e): > x = 100 > self.y = 200 > spark.streams.addListener(MyListener()) > q = spark.readStream.start() > > # x is still 0, self.y is not defined > {code} > > > But for the self.y case, there could be an improvement. > > The server side is capable of pickle serialize the whole listener instance > again, and send back to the client. > > So if we define a new interface on the streaming query manager, maybe called > refreshListener(listener), e.g. 
use it as `spark.streams.refreshListener()` > > > {code:java} > def refreshListener(listener: StreamingQueryListener) -> > StreamingQueryListener > # send the listener id with this refresh request, server receives this > request and serializes the listener again, and send back to client, so the > returned new listener contains the updated value self.y > > {code} > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44764) Streaming process improvement
Wei Liu created SPARK-44764: --- Summary: Streaming process improvement Key: SPARK-44764 URL: https://issues.apache.org/jira/browse/SPARK-44764 Project: Spark Issue Type: New Feature Components: Connect, Structured Streaming Affects Versions: 4.0.0 Reporter: Wei Liu # Deduplicate or remove StreamingPythonRunner if possible; it is very similar to the existing PythonRunner # Use the DAEMON mode for starting a new process -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44460) Pass user auth credential to Python workers for foreachBatch and listener
[ https://issues.apache.org/jira/browse/SPARK-44460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-44460: Description: No user-specific credentials are sent to the Python worker that runs user functions like foreachBatch() and the streaming listener. We might need to pass these in. was: No user-specific credentials (like UC creds?) are sent to the Python worker that runs user functions like foreachBatch() and the streaming listener. We might need to pass these in. > Pass user auth credential to Python workers for foreachBatch and listener > - > > Key: SPARK-44460 > URL: https://issues.apache.org/jira/browse/SPARK-44460 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.4.1 >Reporter: Raghu Angadi >Priority: Major > > No user-specific credentials are sent to the Python worker that runs user > functions like foreachBatch() and the streaming listener. > We might need to pass these in. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44516) Spark Connect Python StreamingQueryListener removeListener method actually shut down the listener process
[ https://issues.apache.org/jira/browse/SPARK-44516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-44516: Summary: Spark Connect Python StreamingQueryListener removeListener method actually shut down the listener process (was: Spark Connect Python StreamingQueryListener removeListener method) > Spark Connect Python StreamingQueryListener removeListener method actually > shut down the listener process > - > > Key: SPARK-44516 > URL: https://issues.apache.org/jira/browse/SPARK-44516 > Project: Spark > Issue Type: New Feature > Components: Connect, Structured Streaming >Affects Versions: 3.5.0, 4.0.0 >Reporter: Wei Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44516) Spark Connect Python StreamingQueryListener removeListener method
Wei Liu created SPARK-44516: --- Summary: Spark Connect Python StreamingQueryListener removeListener method Key: SPARK-44516 URL: https://issues.apache.org/jira/browse/SPARK-44516 Project: Spark Issue Type: New Feature Components: Connect, Structured Streaming Affects Versions: 3.5.0, 4.0.0 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44515) Code Improvement: PySpark add util function to set python version
Wei Liu created SPARK-44515: --- Summary: Code Improvement: PySpark add util function to set python version Key: SPARK-44515 URL: https://issues.apache.org/jira/browse/SPARK-44515 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 3.5.0, 4.0.0 Reporter: Wei Liu Searching for `"%d.%d" % sys.version_info[:2]` shows multiple occurrences of this pattern; we should create a separate util method for it -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
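As a rough illustration of the cleanup this ticket suggests (the helper name and placement below are made up here, not an existing PySpark API):
{code:python}
import sys

def python_version_string() -> str:
    """Return the running interpreter's version as 'major.minor', e.g. '3.11'."""
    return "%d.%d" % sys.version_info[:2]

# Call sites that currently inline `"%d.%d" % sys.version_info[:2]`
# would import and call this single helper instead.
{code}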
[jira] [Created] (SPARK-44502) Add missing versionchanged field to docs
Wei Liu created SPARK-44502: --- Summary: Add missing versionchanged field to docs Key: SPARK-44502 URL: https://issues.apache.org/jira/browse/SPARK-44502 Project: Spark Issue Type: New Feature Components: Connect, Structured Streaming Affects Versions: 3.5.0 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44484) Add missing json field batchDuration to StreamingQueryProgress
Wei Liu created SPARK-44484: --- Summary: Add missing json field batchDuration to StreamingQueryProgress Key: SPARK-44484 URL: https://issues.apache.org/jira/browse/SPARK-44484 Project: Spark Issue Type: New Feature Components: Structured Streaming Affects Versions: 3.5.0 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42962) Allow access to stopped streaming queries
[ https://issues.apache.org/jira/browse/SPARK-42962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735867#comment-17735867 ] Wei Liu commented on SPARK-42962: - Closing this as it's a duplicate of SPARK-42940. Please kindly reopen the ticket if this is a mistake! > Allow access to stopped streaming queries > - > > Key: SPARK-42962 > URL: https://issues.apache.org/jira/browse/SPARK-42962 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Raghu Angadi >Priority: Major > > In Spark connect, when a query is stopped, the client can't access it > anymore. That implies they cannot fetch any information about the query > after that. > The server might have to cache the queries for some time (up to the time the > session is closed). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42962) Allow access to stopped streaming queries
[ https://issues.apache.org/jira/browse/SPARK-42962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu resolved SPARK-42962. - Resolution: Duplicate > Allow access to stopped streaming queries > - > > Key: SPARK-42962 > URL: https://issues.apache.org/jira/browse/SPARK-42962 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Raghu Angadi >Priority: Major > > In Spark connect, when a query is stopped, the client can't access it > anymore. That implies they cannot fetch any information about the query > after that. > The server might have to cache the queries for some time (up to the time the > session is closed). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42941) Add support for streaming listener in Python
[ https://issues.apache.org/jira/browse/SPARK-42941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735866#comment-17735866 ] Wei Liu commented on SPARK-42941: - Hmmm this is actually still in progress. I was thinking two PRs could share the same Jira ticket... sorry if I did it wrong > Add support for streaming listener in Python > > > Key: SPARK-42941 > URL: https://issues.apache.org/jira/browse/SPARK-42941 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Raghu Angadi >Assignee: Wei Liu >Priority: Major > Fix For: 3.5.0 > > > Add support of streaming listener in Python. > This likely requires a design doc to hash out the details. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-42941) Add support for streaming listener in Python
[ https://issues.apache.org/jira/browse/SPARK-42941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu reopened SPARK-42941: - > Add support for streaming listener in Python > > > Key: SPARK-42941 > URL: https://issues.apache.org/jira/browse/SPARK-42941 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Raghu Angadi >Assignee: Wei Liu >Priority: Major > Fix For: 3.5.0 > > > Add support of streaming listener in Python. > This likely requires a design doc to hash out the details. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44046) Pyspark StreamingQueryListener listListener
Wei Liu created SPARK-44046: --- Summary: Pyspark StreamingQueryListener listListener Key: SPARK-44046 URL: https://issues.apache.org/jira/browse/SPARK-44046 Project: Spark Issue Type: New Feature Components: Structured Streaming Affects Versions: 3.5.0 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44029) Observation.py supports streaming dataframes
[ https://issues.apache.org/jira/browse/SPARK-44029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu resolved SPARK-44029. - Resolution: Not A Problem > Observation.py supports streaming dataframes > > > Key: SPARK-44029 > URL: https://issues.apache.org/jira/browse/SPARK-44029 > Project: Spark > Issue Type: Documentation > Components: Documentation, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Wei Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44029) Observation.py supports streaming dataframes
Wei Liu created SPARK-44029: --- Summary: Observation.py supports streaming dataframes Key: SPARK-44029 URL: https://issues.apache.org/jira/browse/SPARK-44029 Project: Spark Issue Type: Documentation Components: Documentation, Structured Streaming Affects Versions: 3.5.0 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44010) Python StreamingQueryProgress rowsPerSecond type fix
Wei Liu created SPARK-44010: --- Summary: Python StreamingQueryProgress rowsPerSecond type fix Key: SPARK-44010 URL: https://issues.apache.org/jira/browse/SPARK-44010 Project: Spark Issue Type: Bug Components: PySpark, Structured Streaming Affects Versions: 3.5.0 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43796) Streaming ForeachWriter can't accept custom user defined class
Wei Liu created SPARK-43796: --- Summary: Streaming ForeachWriter can't accept custom user defined class Key: SPARK-43796 URL: https://issues.apache.org/jira/browse/SPARK-43796 Project: Spark Issue Type: Bug Components: Connect, Structured Streaming Affects Versions: 3.5.0 Reporter: Wei Liu [https://github.com/apache/spark/pull/41129] The last example in the PR description doesn't work with current REPL implementation. Code: {code:java} import org.apache.spark.sql.{ForeachWriter, Row} import java.io._ val filePath = "/home/wei.liu/test_foreach/output-custom" case class MyTestClass(value: Int) { override def toString: String = value.toString } val writer = new ForeachWriter[MyTestClass] { var fileWriter: FileWriter = _ def open(partitionId: Long, version: Long): Boolean = { fileWriter = new FileWriter(filePath, true) true } def process(row: MyTestClass): Unit = { fileWriter.write(row.toString) fileWriter.write("\n") } def close(errorOrNull: Throwable): Unit = { fileWriter.close() } } val df = spark.readStream .format("rate") .option("rowsPerSecond", "10") .load() val query = df .selectExpr("CAST(value AS INT)") .as[MyTestClass] .writeStream .foreach(writer) .outputMode("update") .start() {code} Error: {code:java} 23/05/24 19:17:31 ERROR Utils: Aborting task java.lang.NoClassDefFoundError: Could not initialize class ammonite.$sess.cmd4$ at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.spark.util.SparkClassUtils.classForName(SparkClassUtils.scala:35) at org.apache.spark.util.SparkClassUtils.classForName$(SparkClassUtils.scala:30) at org.apache.spark.util.Utils$.classForName(Utils.scala:94) at org.apache.spark.sql.catalyst.encoders.OuterScopes$.$anonfun$getOuterScope$1(OuterScopes.scala:59) at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.$anonfun$doGenCode$1(objects.scala:598) at scala.Option.map(Option.scala:230) at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:598) at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:201) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:196) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.$anonfun$create$1(GenerateSafeProjection.scala:156) at scala.collection.immutable.List.map(List.scala:293) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:153) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:39) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1369) at org.apache.spark.sql.catalyst.expressions.SafeProjection$.createCodeGeneratedObject(Projection.scala:171) at org.apache.spark.sql.catalyst.expressions.SafeProjection$.createCodeGeneratedObject(Projection.scala:168) at org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:51) at org.apache.spark.sql.catalyst.expressions.SafeProjection$.create(Projection.scala:194) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Deserializer.apply(ExpressionEncoder.scala:173) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Deserializer.apply(ExpressionEncoder.scala:166) at org.apache.spark.sql.execution.streaming.sources.ForeachDataWriter.write(ForeachWriterTable.scala:147) at 
org.apache.spark.sql.execution.streaming.sources.ForeachDataWriter.write(ForeachWriterTable.scala:132) at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.write(WriteToDataSourceV2Exec.scala:493) at org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.$anonfun$run$1(WriteToDataSourceV2Exec.scala:448) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1521) at org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.run(WriteToDataSourceV2Exec.scala:486) at org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.run$(WriteToDataSourceV2Exec.scala:425) at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:491) at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:388) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92) at
[jira] [Created] (SPARK-43761) Streaming ForeachWriter with UnboundRowEncoder
Wei Liu created SPARK-43761: --- Summary: Streaming ForeachWriter with UnboundRowEncoder Key: SPARK-43761 URL: https://issues.apache.org/jira/browse/SPARK-43761 Project: Spark Issue Type: Task Components: Connect, Structured Streaming Affects Versions: 3.5.0 Reporter: Wei Liu https://github.com/apache/spark/pull/41129#discussion_r1196873903 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42941) Add support for streaming listener in Python
[ https://issues.apache.org/jira/browse/SPARK-42941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-42941: Summary: Add support for streaming listener in Python (was: Add support for streaming listener.) > Add support for streaming listener in Python > > > Key: SPARK-42941 > URL: https://issues.apache.org/jira/browse/SPARK-42941 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Raghu Angadi >Priority: Major > > Add support of streaming listener in Python. > This likely requires a design doc to hash out the details. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43360) Scala Connect: Add StreamingQueryManager API
Wei Liu created SPARK-43360: --- Summary: Scala Connect: Add StreamingQueryManager API Key: SPARK-43360 URL: https://issues.apache.org/jira/browse/SPARK-43360 Project: Spark Issue Type: Task Components: Connect, Structured Streaming Affects Versions: 3.5.0 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43132) Add foreach streaming API in Python
[ https://issues.apache.org/jira/browse/SPARK-43132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17717347#comment-17717347 ] Wei Liu commented on SPARK-43132: - i'm working on this > Add foreach streaming API in Python > --- > > Key: SPARK-43132 > URL: https://issues.apache.org/jira/browse/SPARK-43132 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Raghu Angadi >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43054) Support foreach() in streaming spark connect
[ https://issues.apache.org/jira/browse/SPARK-43054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu resolved SPARK-43054. - Resolution: Duplicate Duplicate of SPARK-43132 and SPARK-43133 > Support foreach() in streaming spark connect > > > Key: SPARK-43054 > URL: https://issues.apache.org/jira/browse/SPARK-43054 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Wei Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43299) JVM Client throw StreamingQueryException when error handling is implemented
Wei Liu created SPARK-43299: --- Summary: JVM Client throw StreamingQueryException when error handling is implemented Key: SPARK-43299 URL: https://issues.apache.org/jira/browse/SPARK-43299 Project: Spark Issue Type: Task Components: Connect, Structured Streaming Affects Versions: 3.5.0 Reporter: Wei Liu Currently the awaitTermination() method of the Connect JVM client's StreamingQuery won't throw an error when there is an exception. In Python Connect this is handled directly by the Python client's error-handling framework, but no such framework exists in the JVM client right now. We should verify this works once the JVM client adds it -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
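For reference, a rough sketch (in Python, since the Python client already behaves this way) of the behavior the JVM client should eventually match; the failing UDF below is a made-up example used only to force a query failure:
{code:python}
from pyspark.errors import StreamingQueryException
from pyspark.sql.functions import udf

@udf("int")
def boom(v):
    raise RuntimeError("failure inside the streaming query")

df = spark.readStream.format("rate").load().select(boom("value"))
query = df.writeStream.format("console").start()

try:
    query.awaitTermination()
except StreamingQueryException as e:
    # The Python client's error-handling framework surfaces the server-side
    # failure as a StreamingQueryException on the client.
    print(e)
{code}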
[jira] [Updated] (SPARK-43287) Connect JVM client REPL not correctly shut down if killed
[ https://issues.apache.org/jira/browse/SPARK-43287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-43287: Description: How to reproduce: # Start a scala client `./connector/connect/bin/spark-connect-scala-client` # in another terminal, kill the process `kill ` # Back to the client terminal, you can't see anything you type, but the command still works {code:java} Spark session available as 'spark'. _ __ __ __ / ___/ __/ /__ / /___ ___ _/ /_ \__ \/ __ \/ __ `/ ___/ //_/ / / / __ \/ __ \/ __ \/ _ \/ ___/ __/ ___/ / /_/ / /_/ / / / ,< / /___/ /_/ / / / / / / / __/ /__/ /_ // .___/\__,_/_/ /_/|_| \/\/_/ /_/_/ /_/\___/\___/\__/ /_/ @ wei.liu:~/oss-spark$ CONTRIBUTING.md appveyor.yml conf examples logs resource-managers target LICENSE artifacts connector graphx mllib sbin tools LICENSE-binary assembly core hadoop-cloud mllib-local scalastyle-config.xml NOTICE bin data hs_err_pid9062.log pom.xml scalastyle-on-compile.generated.xml NOTICE-binary binder dependency-reduced-pom.xml launcher project spark-warehouse R build dev licenses python sql README.md common docs licenses-binary repl streaming wei.liu:~/oss-spark$ wei.liu:~/oss-spark$ wei.liu:~/oss-spark$ {code} I ran 'ls' above, and clicked return multiple times was: How to reproduce: # Start a scala client `./connector/connect/bin/spark-connect-scala-client` # in another terminal, kill the process `kill ` # Back to the client terminal, you can't see anything you type, but the command still works {code:java} Spark session available as 'spark'. _ __ __ __ / ___/ __/ /__ / /___ ___ _/ /_ \__ \/ __ \/ __ `/ ___/ //_/ / / / __ \/ __ \/ __ \/ _ \/ ___/ __/ ___/ / /_/ / /_/ / / / ,< / /___/ /_/ / / / / / / / __/ /__/ /_ // .___/\__,_/_/ /_/|_| \/\/_/ /_/_/ /_/\___/\___/\__/ /_/ @ wei.liu:~/oss-spark$ CONTRIBUTING.md appveyor.yml conf examples logs resource-managers target LICENSE artifacts connector graphx mllib sbin tools LICENSE-binary assembly core hadoop-cloud mllib-local scalastyle-config.xml NOTICE bin data hs_err_pid9062.log pom.xml scalastyle-on-compile.generated.xml NOTICE-binary binder dependency-reduced-pom.xml launcher project spark-warehouse R build dev licenses python sql README.md common docs licenses-binary repl streaming wei.liu:~/oss-spark$ wei.liu@ip-10-110-19-234:~/oss-spark$ wei.liu:~/oss-spark$ {code} I ran 'ls' above, and clicked return multiple times > Connect JVM client REPL not correctly shut down if killed > - > > Key: SPARK-43287 > URL: https://issues.apache.org/jira/browse/SPARK-43287 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Wei Liu >Priority: Major > > How to reproduce: > # Start a scala client `./connector/connect/bin/spark-connect-scala-client` > # in another terminal, kill the process `kill ` > # Back to the client terminal, you can't see anything you type, but the > command still works > > > {code:java} > Spark session available as 'spark'. > _ __ __ __ > / ___/ __/ /__ / /___ ___ _/ /_ > \__ \/ __ \/ __ `/ ___/ //_/ / / / __ \/ __ \/ __ \/ _ \/ ___/ __/ > ___/ / /_/ / /_/ / / / ,< / /___/ /_/ / / / / / / / __/ /__/ /_ > // .___/\__,_/_/ /_/|_| \/\/_/ /_/_/ /_/\___/\___/\__/ > /_/ > @ wei.liu:~/oss-spark$ CONTRIBUTING.md appveyor.yml conf > examples logs resource-managers > target > LICENSE artifacts connector graphx > mllib sbin tools > LICENSE-binary assembly core hadoop-cloud
[jira] [Updated] (SPARK-43287) Connect JVM client REPL not correctly shut down if killed
[ https://issues.apache.org/jira/browse/SPARK-43287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-43287: Description: How to reproduce: # Start a scala client `./connector/connect/bin/spark-connect-scala-client` # in another terminal, kill the process `kill ` # Back to the client terminal, you can't see anything you type, but the command still works {code:java} Spark session available as 'spark'. _ __ __ __ / ___/ __/ /__ / /___ ___ _/ /_ \__ \/ __ \/ __ `/ ___/ //_/ / / / __ \/ __ \/ __ \/ _ \/ ___/ __/ ___/ / /_/ / /_/ / / / ,< / /___/ /_/ / / / / / / / __/ /__/ /_ // .___/\__,_/_/ /_/|_| \/\/_/ /_/_/ /_/\___/\___/\__/ /_/ @ wei.liu:~/oss-spark$ CONTRIBUTING.md appveyor.yml conf examples logs resource-managers target LICENSE artifacts connector graphx mllib sbin tools LICENSE-binary assembly core hadoop-cloud mllib-local scalastyle-config.xml NOTICE bin data hs_err_pid9062.log pom.xml scalastyle-on-compile.generated.xml NOTICE-binary binder dependency-reduced-pom.xml launcher project spark-warehouse R build dev licenses python sql README.md common docs licenses-binary repl streaming wei.liu:~/oss-spark$ wei.liu@ip-10-110-19-234:~/oss-spark$ wei.liu:~/oss-spark$ {code} I ran 'ls' above, and clicked return multiple times was: How to reproduce: # Start a scala client `./connector/connect/bin/spark-connect-scala-client` # in another terminal, kill the process `kill ` # Back to the client terminal, you can't see anything you type, but the command still works {code:java} Spark session available as 'spark'. _ __ __ __ / ___/ __/ /__ / /___ ___ _/ /_ \__ \/ __ \/ __ `/ ___/ //_/ / / / __ \/ __ \/ __ \/ _ \/ ___/ __/ ___/ / /_/ / /_/ / / / ,< / /___/ /_/ / / / / / / / __/ /__/ /_ // .___/\__,_/_/ /_/|_| \/\/_/ /_/_/ /_/\___/\___/\__/ /_/ @ wei.liu@ip-10-110-19-234:~/oss-spark$ CONTRIBUTING.md appveyor.yml conf examples logs resource-managers target LICENSE artifacts connector graphx mllib sbin tools LICENSE-binary assembly core hadoop-cloud mllib-local scalastyle-config.xml NOTICE bin data hs_err_pid9062.log pom.xml scalastyle-on-compile.generated.xml NOTICE-binary binder dependency-reduced-pom.xml launcher project spark-warehouse R build dev licenses python sql README.md common docs licenses-binary repl streaming wei.liu@ip-10-110-19-234:~/oss-spark$ wei.liu@ip-10-110-19-234:~/oss-spark$ wei.liu@ip-10-110-19-234:~/oss-spark$ {code} I ran 'ls' above, and clicked return multiple times > Connect JVM client REPL not correctly shut down if killed > - > > Key: SPARK-43287 > URL: https://issues.apache.org/jira/browse/SPARK-43287 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Wei Liu >Priority: Major > > How to reproduce: > # Start a scala client `./connector/connect/bin/spark-connect-scala-client` > # in another terminal, kill the process `kill ` > # Back to the client terminal, you can't see anything you type, but the > command still works > > > {code:java} > Spark session available as 'spark'. > _ __ __ __ > / ___/ __/ /__ / /___ ___ _/ /_ > \__ \/ __ \/ __ `/ ___/ //_/ / / / __ \/ __ \/ __ \/ _ \/ ___/ __/ > ___/ / /_/ / /_/ / / / ,< / /___/ /_/ / / / / / / / __/ /__/ /_ > // .___/\__,_/_/ /_/|_| \/\/_/ /_/_/ /_/\___/\___/\__/ > /_/ > @ wei.liu:~/oss-spark$ CONTRIBUTING.md appveyor.yml conf > examples logs resource-managers > target > LICENSE artifacts connector graphx > mllib sbin tools > LICENSE-binary
[jira] [Updated] (SPARK-43287) Connect JVM client REPL not correctly shut down if killed
[ https://issues.apache.org/jira/browse/SPARK-43287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-43287: Description: How to reproduce: # Start a scala client `./connector/connect/bin/spark-connect-scala-client` # in another terminal, kill the process `kill ` # Back to the client terminal, you can't see anything you type, but the command still works {code:java} Spark session available as 'spark'. _ __ __ __ / ___/ __/ /__ / /___ ___ _/ /_ \__ \/ __ \/ __ `/ ___/ //_/ / / / __ \/ __ \/ __ \/ _ \/ ___/ __/ ___/ / /_/ / /_/ / / / ,< / /___/ /_/ / / / / / / / __/ /__/ /_ // .___/\__,_/_/ /_/|_| \/\/_/ /_/_/ /_/\___/\___/\__/ /_/ @ wei.liu@ip-10-110-19-234:~/oss-spark$ CONTRIBUTING.md appveyor.yml conf examples logs resource-managers target LICENSE artifacts connector graphx mllib sbin tools LICENSE-binary assembly core hadoop-cloud mllib-local scalastyle-config.xml NOTICE bin data hs_err_pid9062.log pom.xml scalastyle-on-compile.generated.xml NOTICE-binary binder dependency-reduced-pom.xml launcher project spark-warehouse R build dev licenses python sql README.md common docs licenses-binary repl streaming wei.liu@ip-10-110-19-234:~/oss-spark$ wei.liu@ip-10-110-19-234:~/oss-spark$ wei.liu@ip-10-110-19-234:~/oss-spark$ {code} I ran 'ls' above, and clicked return multiple times was: How to reproduce: # Start a scala client `./connector/connect/bin/spark-connect-scala-client` # in another terminal, kill the process `kill ` # Back to the client terminal, you can't see anything you type, but the command still works ``` Spark session available as 'spark'. _ __ __ __ / ___/ __/ /__ / /___ ___ _/ /_ \__ \/ __ \/ __ `/ ___/ //_/ / / / __ \/ __ \/ __ \/ _ \/ ___/ __/ ___/ / /_/ / /_/ / / / ,< / /___/ /_/ / / / / / / / __/ /__/ /_ // .___/\__,_/_/ /_/|_| \/\/_/ /_/_/ /_/\___/\___/\__/ /_/ @ *wei.liu*:*~/oss-spark*$ CONTRIBUTING.md appveyor.yml *conf* *examples* *logs* *resource-managers* *target* LICENSE *artifacts* *connector* *graphx* *mllib* *sbin* *tools* LICENSE-binary *assembly* *core* *hadoop-cloud* *mllib-local* scalastyle-config.xml NOTICE *bin* *data* hs_err_pid9062.log pom.xml scalastyle-on-compile.generated.xml NOTICE-binary *binder* dependency-reduced-pom.xml *launcher* *project* *spark-warehouse* *R* *build* *dev* *licenses* *python* *sql* README.md *common* *docs* *licenses-binary* *repl* *streaming* *wei.liu*:*~/oss-spark*$ *wei.liu*:*~/oss-spark*$ *wei.liu*:*~/oss-spark*$ ``` I ran 'ls' above, and clicked return multiple times > Connect JVM client REPL not correctly shut down if killed > - > > Key: SPARK-43287 > URL: https://issues.apache.org/jira/browse/SPARK-43287 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Wei Liu >Priority: Major > > How to reproduce: > # Start a scala client `./connector/connect/bin/spark-connect-scala-client` > # in another terminal, kill the process `kill ` > # Back to the client terminal, you can't see anything you type, but the > command still works > > > {code:java} > Spark session available as 'spark'. > _ __ __ __ > / ___/ __/ /__ / /___ ___ _/ /_ > \__ \/ __ \/ __ `/ ___/ //_/ / / / __ \/ __ \/ __ \/ _ \/ ___/ __/ > ___/ / /_/ / /_/ / / / ,< / /___/ /_/ / / / / / / / __/ /__/ /_ > // .___/\__,_/_/ /_/|_| \/\/_/ /_/_/ /_/\___/\___/\__/ > /_/ > @ wei.liu@ip-10-110-19-234:~/oss-spark$ CONTRIBUTING.md appveyor.yml conf > examples logs resource-managers > target > LICENSE artifacts connector graphx >
[jira] [Created] (SPARK-43287) Connect JVM client REPL not correctly shut down if killed
Wei Liu created SPARK-43287: --- Summary: Connect JVM client REPL not correctly shut down if killed Key: SPARK-43287 URL: https://issues.apache.org/jira/browse/SPARK-43287 Project: Spark Issue Type: Bug Components: Connect Affects Versions: 3.5.0 Reporter: Wei Liu How to reproduce: # Start a scala client `./connector/connect/bin/spark-connect-scala-client` # in another terminal, kill the process `kill ` # Back to the client terminal, you can't see anything you type, but the command still works ``` Spark session available as 'spark'. _ __ __ __ / ___/ __/ /__ / /___ ___ _/ /_ \__ \/ __ \/ __ `/ ___/ //_/ / / / __ \/ __ \/ __ \/ _ \/ ___/ __/ ___/ / /_/ / /_/ / / / ,< / /___/ /_/ / / / / / / / __/ /__/ /_ // .___/\__,_/_/ /_/|_| \/\/_/ /_/_/ /_/\___/\___/\__/ /_/ @ *wei.liu*:*~/oss-spark*$ CONTRIBUTING.md appveyor.yml *conf* *examples* *logs* *resource-managers* *target* LICENSE *artifacts* *connector* *graphx* *mllib* *sbin* *tools* LICENSE-binary *assembly* *core* *hadoop-cloud* *mllib-local* scalastyle-config.xml NOTICE *bin* *data* hs_err_pid9062.log pom.xml scalastyle-on-compile.generated.xml NOTICE-binary *binder* dependency-reduced-pom.xml *launcher* *project* *spark-warehouse* *R* *build* *dev* *licenses* *python* *sql* README.md *common* *docs* *licenses-binary* *repl* *streaming* *wei.liu*:*~/oss-spark*$ *wei.liu*:*~/oss-spark*$ *wei.liu*:*~/oss-spark*$ ``` I ran 'ls' above, and clicked return multiple times -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43206) Connect Better StreamingQueryException
[ https://issues.apache.org/jira/browse/SPARK-43206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-43206: Summary: Connect Better StreamingQueryException (was: Streaming query exception() also include stack trace) > Connect Better StreamingQueryException > -- > > Key: SPARK-43206 > URL: https://issues.apache.org/jira/browse/SPARK-43206 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Wei Liu >Priority: Major > > [https://github.com/apache/spark/pull/40785#issuecomment-1515522281] > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43206) Connect Better StreamingQueryException
[ https://issues.apache.org/jira/browse/SPARK-43206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17715178#comment-17715178 ] Wei Liu commented on SPARK-43206: - Also cause, offsets, stack trace... > Connect Better StreamingQueryException > -- > > Key: SPARK-43206 > URL: https://issues.apache.org/jira/browse/SPARK-43206 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Wei Liu >Priority: Major > > [https://github.com/apache/spark/pull/40785#issuecomment-1515522281] > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43206) Streaming query exception() also include stack trace
[ https://issues.apache.org/jira/browse/SPARK-43206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17715171#comment-17715171 ] Wei Liu commented on SPARK-43206: - I'll work on this. To myself: don't forget jvm exceptions > Streaming query exception() also include stack trace > > > Key: SPARK-43206 > URL: https://issues.apache.org/jira/browse/SPARK-43206 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Wei Liu >Priority: Major > > [https://github.com/apache/spark/pull/40785#issuecomment-1515522281] > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43032) Add StreamingQueryManager API
[ https://issues.apache.org/jira/browse/SPARK-43032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17715147#comment-17715147 ] Wei Liu commented on SPARK-43032: - [https://github.com/apache/spark/pull/40861] still draft > Add StreamingQueryManager API > - > > Key: SPARK-43032 > URL: https://issues.apache.org/jira/browse/SPARK-43032 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Raghu Angadi >Priority: Major > > Add StreamingQueryManager API. It would include API that can be directly > support. API like registering streaming listener will be handled separately. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43143) Scala: Add StreamingQuery awaitTermination() API
[ https://issues.apache.org/jira/browse/SPARK-43143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17715149#comment-17715149 ] Wei Liu commented on SPARK-43143: - I'm working on this > Scala: Add StreamingQuery awaitTermination() API > > > Key: SPARK-43143 > URL: https://issues.apache.org/jira/browse/SPARK-43143 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Raghu Angadi >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43134) Add streaming query exception API in Scala
[ https://issues.apache.org/jira/browse/SPARK-43134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17715148#comment-17715148 ] Wei Liu commented on SPARK-43134: - I'm working on this > Add streaming query exception API in Scala > -- > > Key: SPARK-43134 > URL: https://issues.apache.org/jira/browse/SPARK-43134 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Raghu Angadi >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43206) Streaming query exception() also include stack trace
[ https://issues.apache.org/jira/browse/SPARK-43206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-43206: Description: [https://github.com/apache/spark/pull/40785#issuecomment-1515522281] > Streaming query exception() also include stack trace > > > Key: SPARK-43206 > URL: https://issues.apache.org/jira/browse/SPARK-43206 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Wei Liu >Priority: Major > > [https://github.com/apache/spark/pull/40785#issuecomment-1515522281] > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43206) Streaming query exception() also include stack trace
[ https://issues.apache.org/jira/browse/SPARK-43206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-43206: Epic Link: SPARK-42938 > Streaming query exception() also include stack trace > > > Key: SPARK-43206 > URL: https://issues.apache.org/jira/browse/SPARK-43206 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Wei Liu >Priority: Major > > [https://github.com/apache/spark/pull/40785#issuecomment-1515522281] > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43167) Streaming Connect console output format support
[ https://issues.apache.org/jira/browse/SPARK-43167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu resolved SPARK-43167. - Resolution: Not A Problem. Automatically supported by the existing Connect implementation. > Streaming Connect console output format support > --- > > Key: SPARK-43167 > URL: https://issues.apache.org/jira/browse/SPARK-43167 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Wei Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43206) Streaming query exception() also include stack trace
[ https://issues.apache.org/jira/browse/SPARK-43206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-43206: Environment: (was: https://github.com/apache/spark/pull/40785#issuecomment-1515522281 ) > Streaming query exception() also include stack trace > > > Key: SPARK-43206 > URL: https://issues.apache.org/jira/browse/SPARK-43206 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Wei Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43206) Streaming query exception() also include stack trace
Wei Liu created SPARK-43206: --- Summary: Streaming query exception() also include stack trace Key: SPARK-43206 URL: https://issues.apache.org/jira/browse/SPARK-43206 Project: Spark Issue Type: Task Components: Connect, Structured Streaming Affects Versions: 3.5.0 Environment: https://github.com/apache/spark/pull/40785#issuecomment-1515522281 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43167) Streaming Connect console output format support
[ https://issues.apache.org/jira/browse/SPARK-43167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17713787#comment-17713787 ] Wei Liu commented on SPARK-43167: - Should be:
```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0-SNAPSHOT
      /_/

Using Python version 3.10.8 (main, Oct 13 2022 09:48:40)
Spark context Web UI available at http://10.10.105.160:4040
Spark context available as 'sc' (master = local[*], app id = local-1681856185012).
SparkSession available as 'spark'.
>>> spark
>>> q = spark.readStream.format("rate").load().writeStream.format("console").start()
23/04/18 15:17:12 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkwgp/T/temporary-64d68668-bc6f-46aa-8ea5-b66ddae09f91. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
23/04/18 15:17:12 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
-------------------------------------------
Batch: 0
-------------------------------------------
+---------+-----+
|timestamp|value|
+---------+-----+
+---------+-----+

-------------------------------------------
Batch: 1
-------------------------------------------
+--------------------+-----+
|           timestamp|value|
+--------------------+-----+
|2023-04-18 15:17:...|    0|
|2023-04-18 15:17:...|    1|
+--------------------+-----+

-------------------------------------------
Batch: 2
-------------------------------------------
+--------------------+-----+
|           timestamp|value|
+--------------------+-----+
|2023-04-18 15:17:...|    2|
|2023-04-18 15:17:...|    3|
+--------------------+-----+

-------------------------------------------
Batch: 3
-------------------------------------------
+--------------------+-----+
|           timestamp|value|
+--------------------+-----+
|2023-04-18 15:17:...|    4|
+--------------------+-----+
```
> Streaming Connect console output format support > --- > > Key: SPARK-43167 > URL: https://issues.apache.org/jira/browse/SPARK-43167 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Wei Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43167) Streaming Connect console output format support
Wei Liu created SPARK-43167: --- Summary: Streaming Connect console output format support Key: SPARK-43167 URL: https://issues.apache.org/jira/browse/SPARK-43167 Project: Spark Issue Type: Task Components: Connect, Structured Streaming Affects Versions: 3.5.0 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43147) Python lint local config
Wei Liu created SPARK-43147: --- Summary: Python lint local config Key: SPARK-43147 URL: https://issues.apache.org/jira/browse/SPARK-43147 Project: Spark Issue Type: Task Components: PySpark, python Affects Versions: 3.5.0 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42962) Allow access to stopped streaming queries
[ https://issues.apache.org/jira/browse/SPARK-42962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17711155#comment-17711155 ] Wei Liu commented on SPARK-42962: - Please be aware of the TODOs in connect query.py and readwriter.py when creating a PR for this > Allow access to stopped streaming queries > - > > Key: SPARK-42962 > URL: https://issues.apache.org/jira/browse/SPARK-42962 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Raghu Angadi >Priority: Major > > In Spark connect, when a query is stopped, the client can't access it > anymore. That implies they can not fetch any information about the query > after that. > The server might have to cache the queries for some time (upto the time the > session is closed). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
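A short sketch of the client-side pattern this server-side caching would keep working; in Connect today the last two calls have nothing to resolve against once stop() returns (rate/noop are placeholder source and sink):
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

q = (spark.readStream.format("rate").load()
     .writeStream.format("noop").start())
q.processAllAvailable()   # let at least one batch run
q.stop()

# Post-mortem inspection of a stopped query: with the server caching stopped
# queries until the session closes, these should keep working over Connect.
print(q.lastProgress)
print(q.exception())
```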
[jira] [Comment Edited] (SPARK-43015) Python: await_termination & exception() for Streaming Connect
[ https://issues.apache.org/jira/browse/SPARK-43015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17709070#comment-17709070 ] Wei Liu edited comment on SPARK-43015 at 4/6/23 10:05 PM: -- Duplicate of SPARK-42960 was (Author: JIRAUSER295948): SPARK-42960 > Python: await_termination & exception() for Streaming Connect > - > > Key: SPARK-43015 > URL: https://issues.apache.org/jira/browse/SPARK-43015 > Project: Spark > Issue Type: Story > Components: Connect, Structured Streaming >Affects Versions: 3.4.0 >Reporter: Wei Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43054) Support foreach() in streaming spark connect
Wei Liu created SPARK-43054: --- Summary: Support foreach() in streaming spark connect Key: SPARK-43054 URL: https://issues.apache.org/jira/browse/SPARK-43054 Project: Spark Issue Type: Task Components: Connect, Structured Streaming Affects Versions: 3.5.0 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
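For context, the API being brought to Connect; a minimal sketch of the two forms DataStreamWriter.foreach() accepts in PySpark (the row handlers are illustrative only):
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.readStream.format("rate").load()

# Simplest form: a plain function invoked once per row.
q1 = df.writeStream.foreach(lambda row: print(row.value)).start()

# Full form: an object with open/process/close for per-partition setup/teardown.
class RowPrinter:
    def open(self, partition_id, epoch_id):
        return True            # process this partition
    def process(self, row):
        print(row.value)
    def close(self, error):
        pass

q2 = df.writeStream.foreach(RowPrinter()).start()
```
The bulk of the Connect work is presumably shipping the serialized Python handler from the client to the server so it can run inside the server-side query, rather than any change to the user-facing API above.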
[jira] [Updated] (SPARK-42951) Spark Connect: Streaming DataStreamReader API except table()
[ https://issues.apache.org/jira/browse/SPARK-42951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-42951: Issue Type: Task (was: Story) > Spark Connect: Streaming DataStreamReader API except table() > > > Key: SPARK-42951 > URL: https://issues.apache.org/jira/browse/SPARK-42951 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Wei Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43042) Spark Connect: Streaming readerwriter table() API
[ https://issues.apache.org/jira/browse/SPARK-43042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-43042: Issue Type: Task (was: Story) > Spark Connect: Streaming readerwriter table() API > - > > Key: SPARK-43042 > URL: https://issues.apache.org/jira/browse/SPARK-43042 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Wei Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43042) Spark Connect: Streaming readerwriter table() API
[ https://issues.apache.org/jira/browse/SPARK-43042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-43042: Summary: Spark Connect: Streaming readerwriter table() API (was: Spark Connect: Streaming DataStreamReader table() API) > Spark Connect: Streaming readerwriter table() API > - > > Key: SPARK-43042 > URL: https://issues.apache.org/jira/browse/SPARK-43042 > Project: Spark > Issue Type: Story > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Wei Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42951) Spark Connect: Streaming DataStreamReader API except table()
[ https://issues.apache.org/jira/browse/SPARK-42951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-42951: Affects Version/s: (was: 3.4.0) > Spark Connect: Streaming DataStreamReader API except table() > > > Key: SPARK-42951 > URL: https://issues.apache.org/jira/browse/SPARK-42951 > Project: Spark > Issue Type: Story > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Wei Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43042) Spark Connect: Streaming DataStreamReader table() API
Wei Liu created SPARK-43042: --- Summary: Spark Connect: Streaming DataStreamReader table() API Key: SPARK-43042 URL: https://issues.apache.org/jira/browse/SPARK-43042 Project: Spark Issue Type: Story Components: Connect, Structured Streaming Affects Versions: 3.5.0 Reporter: Wei Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42951) Spark Connect: Streaming DataStreamReader API except table()
[ https://issues.apache.org/jira/browse/SPARK-42951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-42951: Epic Link: SPARK-42938 > Spark Connect: Streaming DataStreamReader API except table() > > > Key: SPARK-42951 > URL: https://issues.apache.org/jira/browse/SPARK-42951 > Project: Spark > Issue Type: Story > Components: Connect, Structured Streaming >Affects Versions: 3.4.0, 3.5.0 >Reporter: Wei Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org