[jira] [Created] (SPARK-47052) Separate state tracking variables from MicroBatchExecution/StreamExecution

2024-02-14 Thread Boyang Jerry Peng (Jira)
Boyang Jerry Peng created SPARK-47052:
-

 Summary: Separate state tracking variables from 
MicroBatchExecution/StreamExecution
 Key: SPARK-47052
 URL: https://issues.apache.org/jira/browse/SPARK-47052
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Boyang Jerry Peng


To improve code clarity and maintainability, I propose that we move all of the 
variables that track mutable state and metrics for a streaming query into a 
separate class.  With this refactor, it would be easier to track and find all 
the mutable state a micro-batch can have.
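The proposed separation can be sketched in a minimal way (plain Python here; the 
class and field names are hypothetical, not the actual Scala members of 
MicroBatchExecution/StreamExecution):

```python
from dataclasses import dataclass, field

@dataclass
class MicroBatchProgress:
    """Hypothetical container gathering the mutable per-batch state that is
    currently scattered across MicroBatchExecution/StreamExecution."""
    batch_id: int = 0
    is_current_batch_constructed: bool = False
    num_input_rows: int = 0
    metrics: dict = field(default_factory=dict)

    def advance(self):
        # Reset per-batch flags and counters when moving to the next micro-batch.
        self.batch_id += 1
        self.is_current_batch_constructed = False
        self.num_input_rows = 0

progress = MicroBatchProgress()
progress.num_input_rows = 42
progress.advance()  # batch_id becomes 1, per-batch counters reset
```

Keeping all such fields in one object makes it obvious what state a micro-batch 
can mutate, which is the clarity benefit the refactor is after.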



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39591) Offset Management Improvements in Structured Streaming

2022-11-22 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-39591:
--
Labels:   (was: SPIP)

> Offset Management Improvements in Structured Streaming
> --
>
> Key: SPARK-39591
> URL: https://issues.apache.org/jira/browse/SPARK-39591
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>
> Currently in Structured Streaming, at the beginning of every micro-batch the 
> offset to process up to for the current batch is persisted to durable 
> storage.  At the end of every micro-batch, a marker indicating the completion 
> of the current micro-batch is persisted to durable storage.  For pipelines 
> that, for example, read from Kafka and write to Kafka, where end-to-end 
> exactly-once is not supported and latency is a concern, we can allow users to 
> configure offset commits to be written asynchronously.  The commit operation 
> then no longer contributes to the batch duration, effectively lowering the 
> overall latency of the pipeline.
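The idea of taking the commit off the critical path can be sketched as a 
plain-Python producer/consumer (illustrative only; none of these names are 
Spark APIs):

```python
import threading
import queue

durable_log = []          # simulated durable store of committed offsets
commits = queue.Queue()   # hand-off between batch loop and committer

def committer():
    # Background thread that persists offsets without blocking the batch loop.
    while True:
        offset = commits.get()
        if offset is None:
            break
        durable_log.append(offset)
        commits.task_done()

t = threading.Thread(target=committer, daemon=True)
t.start()

for batch_offset in range(3):
    # The micro-batch loop enqueues the commit and continues immediately,
    # so commit latency no longer adds to the batch duration.
    commits.put(batch_offset)

commits.join()   # drain outstanding commits (e.g. at shutdown)
commits.put(None)
t.join()
```

The trade-off the proposal implies is that on failure the query may restart 
from an offset older than the last batch actually processed, which is why it 
targets pipelines where end-to-end exactly-once is not required.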






[jira] [Updated] (SPARK-39591) Offset Management Improvements in Structured Streaming

2022-11-22 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-39591:
--
Labels: SPIP  (was: )

> Offset Management Improvements in Structured Streaming
> --
>
> Key: SPARK-39591
> URL: https://issues.apache.org/jira/browse/SPARK-39591
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>  Labels: SPIP
>
> Currently in Structured Streaming, at the beginning of every micro-batch the 
> offset to process up to for the current batch is persisted to durable 
> storage.  At the end of every micro-batch, a marker indicating the completion 
> of the current micro-batch is persisted to durable storage.  For pipelines 
> that, for example, read from Kafka and write to Kafka, where end-to-end 
> exactly-once is not supported and latency is a concern, we can allow users to 
> configure offset commits to be written asynchronously.  The commit operation 
> then no longer contributes to the batch duration, effectively lowering the 
> overall latency of the pipeline.






[jira] [Updated] (SPARK-39591) SPIP: Offset Management Improvements in Structured Streaming

2022-11-22 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-39591:
--
Labels: SPIP  (was: )

> SPIP: Offset Management Improvements in Structured Streaming
> 
>
> Key: SPARK-39591
> URL: https://issues.apache.org/jira/browse/SPARK-39591
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>  Labels: SPIP
>
> Currently in Structured Streaming, at the beginning of every micro-batch the 
> offset to process up to for the current batch is persisted to durable 
> storage.  At the end of every micro-batch, a marker indicating the completion 
> of the current micro-batch is persisted to durable storage.  For pipelines 
> that, for example, read from Kafka and write to Kafka, where end-to-end 
> exactly-once is not supported and latency is a concern, we can allow users to 
> configure offset commits to be written asynchronously.  The commit operation 
> then no longer contributes to the batch duration, effectively lowering the 
> overall latency of the pipeline.






[jira] [Updated] (SPARK-39591) SPIP: Offset Management Improvements in Structured Streaming

2022-11-22 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-39591:
--
Summary: SPIP: Offset Management Improvements in Structured Streaming  
(was: Offset Management Improvements in Structured Streaming)

> SPIP: Offset Management Improvements in Structured Streaming
> 
>
> Key: SPARK-39591
> URL: https://issues.apache.org/jira/browse/SPARK-39591
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>
> Currently in Structured Streaming, at the beginning of every micro-batch the 
> offset to process up to for the current batch is persisted to durable 
> storage.  At the end of every micro-batch, a marker indicating the completion 
> of the current micro-batch is persisted to durable storage.  For pipelines 
> that, for example, read from Kafka and write to Kafka, where end-to-end 
> exactly-once is not supported and latency is a concern, we can allow users to 
> configure offset commits to be written asynchronously.  The commit operation 
> then no longer contributes to the batch duration, effectively lowering the 
> overall latency of the pipeline.






[jira] [Updated] (SPARK-39591) SPIP: Offset Management Improvements in Structured Streaming

2022-11-22 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-39591:
--
Description: 
Currently in Structured Streaming, at the beginning of every micro-batch the 
offset to process up to for the current batch is persisted to durable storage.  
At the end of every micro-batch, a marker indicating the completion of the 
current micro-batch is persisted to durable storage.  For pipelines that, for 
example, read from Kafka and write to Kafka, where end-to-end exactly-once is 
not supported and latency is a concern, we can allow users to configure offset 
commits to be written asynchronously.  The commit operation then no longer 
contributes to the batch duration, effectively lowering the overall latency of 
the pipeline.

 

SPIP Doc: 

 

https://docs.google.com/document/d/1iPiI4YoGCM0i61pBjkxcggU57gHKf2jVwD7HWMHgH-Y/edit?usp=sharing

  was:Currently in Structured Streaming, at the beginning of every micro-batch 
the offset to process up to for the current batch is persisted to durable 
storage.  At the end of every micro-batch, a marker indicating the completion 
of the current micro-batch is persisted to durable storage.  For pipelines 
that, for example, read from Kafka and write to Kafka, where end-to-end 
exactly-once is not supported and latency is a concern, we can allow users to 
configure offset commits to be written asynchronously.  The commit operation 
then no longer contributes to the batch duration, effectively lowering the 
overall latency of the pipeline.


> SPIP: Offset Management Improvements in Structured Streaming
> 
>
> Key: SPARK-39591
> URL: https://issues.apache.org/jira/browse/SPARK-39591
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>  Labels: SPIP
>
> Currently in Structured Streaming, at the beginning of every micro-batch the 
> offset to process up to for the current batch is persisted to durable 
> storage.  At the end of every micro-batch, a marker indicating the completion 
> of the current micro-batch is persisted to durable storage.  For pipelines 
> that, for example, read from Kafka and write to Kafka, where end-to-end 
> exactly-once is not supported and latency is a concern, we can allow users to 
> configure offset commits to be written asynchronously.  The commit operation 
> then no longer contributes to the batch duration, effectively lowering the 
> overall latency of the pipeline.
>  
> SPIP Doc: 
>  
> https://docs.google.com/document/d/1iPiI4YoGCM0i61pBjkxcggU57gHKf2jVwD7HWMHgH-Y/edit?usp=sharing






[jira] [Updated] (SPARK-39591) SPIP: Asynchronous Offset Management in Structured Streaming

2022-11-22 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-39591:
--
Summary: SPIP: Asynchronous Offset Management in Structured Streaming  
(was: SPIP: Offset Management Improvements in Structured Streaming)

> SPIP: Asynchronous Offset Management in Structured Streaming
> 
>
> Key: SPARK-39591
> URL: https://issues.apache.org/jira/browse/SPARK-39591
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>  Labels: SPIP
>
> Currently in Structured Streaming, at the beginning of every micro-batch the 
> offset to process up to for the current batch is persisted to durable 
> storage.  At the end of every micro-batch, a marker indicating the completion 
> of the current micro-batch is persisted to durable storage.  For pipelines 
> that, for example, read from Kafka and write to Kafka, where end-to-end 
> exactly-once is not supported and latency is a concern, we can allow users to 
> configure offset commits to be written asynchronously.  The commit operation 
> then no longer contributes to the batch duration, effectively lowering the 
> overall latency of the pipeline.
>  
> SPIP Doc: 
>  
> https://docs.google.com/document/d/1iPiI4YoGCM0i61pBjkxcggU57gHKf2jVwD7HWMHgH-Y/edit?usp=sharing






[jira] [Commented] (SPARK-42485) SPIP: Shutting down spark structured streaming when the streaming process completed current process

2023-02-21 Thread Boyang Jerry Peng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691762#comment-17691762
 ] 

Boyang Jerry Peng commented on SPARK-42485:
---

[~mich.talebza...@gmail.com] This seems like an interesting and useful feature. 
 I have a couple of questions:

 # Can you share any use cases you have that would benefit from this feature?
 # Is there a SPIP doc written for this yet? If so, please link it in the JIRA.
 # Regarding this statement: "I have devised a method that allows one to 
terminate the spark application internally after processing the last received 
message. Within say 2 seconds of the confirmation of shutdown, the process will 
invoke a graceful shutdown."

Do you mean the query will gracefully shut down after the most recent 
micro-batch is done processing?

 

> SPIP: Shutting down spark structured streaming when the streaming process 
> completed current process
> ---
>
> Key: SPARK-42485
> URL: https://issues.apache.org/jira/browse/SPARK-42485
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.2
>Reporter: Mich Talebzadeh
>Priority: Major
>  Labels: SPIP
>
> Spark Structured Streaming is a very useful tool in dealing with Event Driven 
> Architecture. In an Event Driven Architecture, there is generally a main loop 
> that listens for events and then triggers a call-back function when one of 
> those events is detected. In a streaming application the application waits to 
> receive the source messages in a set interval or whenever they happen and 
> reacts accordingly.
> There are occasions when you may want to stop the Spark program gracefully, 
> meaning that the application handles the last streaming message completely 
> and then terminates. This is different from invoking interrupts such as 
> CTRL-C.
> Of course one can terminate the process based on the following
>  # query.awaitTermination() # Waits for the termination of this query, with 
> stop() or with error
>  # query.awaitTermination(timeoutMs) # Returns true if this query is 
> terminated within the timeout in milliseconds.
> The first one waits until the query is stopped or fails. The second one 
> counts down the timeout and exits when the timeout in milliseconds is 
> reached.
> The issue is that one needs to predict how long the streaming job needs to 
> run. Any interrupt at the terminal or OS level (killing the process) may 
> terminate processing without proper completion of the streaming work.
> I have devised a method that allows one to terminate the spark application 
> internally after processing the last received message. Within say 2 seconds 
> of the confirmation of shutdown, the process will invoke a graceful shutdown.
> This new feature proposes a solution that gracefully handles the message 
> currently being processed, waits for it to complete, and shuts down the 
> streaming process for a given topic without loss of data or orphaned 
> transactions.
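The graceful-shutdown pattern described in the ticket can be sketched in plain 
Python (the shutdown signal here is a simple flag; how the real shutdown marker 
would be observed is left open by the proposal):

```python
import threading

stop_requested = threading.Event()
processed = []

def process_micro_batch(batch):
    # Stand-in for fully handling one micro-batch of messages.
    processed.append(batch)

# Main loop: finish the in-flight micro-batch, then stop cleanly, rather
# than being interrupted mid-batch by CTRL-C or an OS-level kill.
for batch in range(5):
    process_micro_batch(batch)
    if batch == 2:
        stop_requested.set()   # e.g. an external shutdown marker was seen
    if stop_requested.is_set():
        break                  # last received message fully handled
```

The key property is that the check happens only at batch boundaries, so no 
batch is ever abandoned half-processed.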






[jira] [Created] (SPARK-39585) Multiple Stateful Operators

2022-06-24 Thread Boyang Jerry Peng (Jira)
Boyang Jerry Peng created SPARK-39585:
-

 Summary: Multiple Stateful Operators
 Key: SPARK-39585
 URL: https://issues.apache.org/jira/browse/SPARK-39585
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.3.0
Reporter: Boyang Jerry Peng


Currently, Structured Streaming only supports one stateful operator per 
pipeline.  This constraint prevents Structured Streaming from being used for 
many use cases, such as those involving chained time window aggregations, 
chained stream-stream outer equality joins, or a stream-stream time interval 
join followed by a time window aggregation.  Enabling multiple stateful 
operators within a pipeline will open up many additional use cases for 
Structured Streaming and bring it on par with other stream processing engines.






[jira] [Created] (SPARK-39586) Advanced Windowing

2022-06-24 Thread Boyang Jerry Peng (Jira)
Boyang Jerry Peng created SPARK-39586:
-

 Summary: Advanced Windowing
 Key: SPARK-39586
 URL: https://issues.apache.org/jira/browse/SPARK-39586
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 3.3.0
Reporter: Boyang Jerry Peng


Currently in Structured Streaming there is no user-friendly way to define 
custom windowing strategies. Users can only choose from an existing set of 
built-in window strategies such as tumbling, sliding, and session windows.  To 
improve the flexibility of the engine and to support more advanced use cases, 
Structured Streaming needs to provide an intuitive API that allows users to 
define custom window boundaries, trigger windows when they are complete, and, 
if necessary, evict elements from windows.
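As an illustration of what a custom window-definition hook might look like, 
here is a plain-Python sketch of a gap-based (session-style) assigner; the API 
shape is purely hypothetical and not part of any Spark interface:

```python
def session_windows(gap):
    """Return an assigner that starts a new window whenever the gap between
    consecutive event times exceeds `gap` (a session-window policy)."""
    def assign(event_times):
        windows, current, last = [], [], None
        for t in sorted(event_times):
            if last is not None and t - last > gap:
                windows.append(current)   # gap exceeded: close the window
                current = []
            current.append(t)
            last = t
        if current:
            windows.append(current)       # close the final open window
        return windows
    return assign

assign = session_windows(gap=5)
windows = assign([1, 2, 9, 10, 20])
```

An intuitive API of this kind would let users plug in arbitrary boundary, 
trigger, and eviction policies instead of choosing from a fixed menu.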






[jira] [Created] (SPARK-39587) Schema Evolution for Stateful Pipelines in Structured Streaming

2022-06-24 Thread Boyang Jerry Peng (Jira)
Boyang Jerry Peng created SPARK-39587:
-

 Summary: Schema Evolution for Stateful Pipelines in Structured 
Streaming 
 Key: SPARK-39587
 URL: https://issues.apache.org/jira/browse/SPARK-39587
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.3.0
Reporter: Boyang Jerry Peng


Support for schema evolution of stateful operators is non-existent. There is no 
clear path to evolve the schema for the existing built-in stateful operators.  
For user-defined stateful operators such as map/flatMapGroupsWithState, there 
is likewise no path for evolving state.






[jira] [Created] (SPARK-39588) Inspect the State of Stateful Streaming Pipelines

2022-06-24 Thread Boyang Jerry Peng (Jira)
Boyang Jerry Peng created SPARK-39588:
-

 Summary: Inspect the State of Stateful Streaming Pipelines
 Key: SPARK-39588
 URL: https://issues.apache.org/jira/browse/SPARK-39588
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.3.0
Reporter: Boyang Jerry Peng


Currently there is no mechanism to externally query or inspect the state of a 
stateful streaming pipeline.  A tool or API that allows a user to do that 
would be extremely helpful for debugging.






[jira] [Updated] (SPARK-39588) Inspect the State of Stateful Streaming Pipelines

2022-06-24 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-39588:
--
Issue Type: New Feature  (was: Improvement)

> Inspect the State of Stateful Streaming Pipelines
> -
>
> Key: SPARK-39588
> URL: https://issues.apache.org/jira/browse/SPARK-39588
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>
> Currently there is no mechanism to externally query or inspect the state of 
> a stateful streaming pipeline.  A tool or API that allows a user to do that 
> would be extremely helpful for debugging.






[jira] [Updated] (SPARK-39586) Advanced Windowing in Structured Streaming

2022-06-24 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-39586:
--
Summary: Advanced Windowing in Structured Streaming  (was: Advanced 
Windowing)

> Advanced Windowing in Structured Streaming
> --
>
> Key: SPARK-39586
> URL: https://issues.apache.org/jira/browse/SPARK-39586
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>
> Currently in Structured Streaming there is no user-friendly way to define 
> custom windowing strategies. Users can only choose from an existing set of 
> built-in window strategies such as tumbling, sliding, and session windows.  
> To improve the flexibility of the engine and to support more advanced use 
> cases, Structured Streaming needs to provide an intuitive API that allows 
> users to define custom window boundaries, trigger windows when they are 
> complete, and, if necessary, evict elements from windows.






[jira] [Updated] (SPARK-39585) Multiple Stateful Operators in Structured Streaming

2022-06-24 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-39585:
--
Summary: Multiple Stateful Operators in Structured Streaming  (was: 
Multiple Stateful Operators)

> Multiple Stateful Operators in Structured Streaming
> ---
>
> Key: SPARK-39585
> URL: https://issues.apache.org/jira/browse/SPARK-39585
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>
> Currently, Structured Streaming only supports one stateful operator per 
> pipeline.  This constraint prevents Structured Streaming from being used for 
> many use cases, such as those involving chained time window aggregations, 
> chained stream-stream outer equality joins, or a stream-stream time interval 
> join followed by a time window aggregation.  Enabling multiple stateful 
> operators within a pipeline will open up many additional use cases for 
> Structured Streaming and bring it on par with other stream processing 
> engines.






[jira] [Created] (SPARK-39589) Asynchronous I/O API support in Structured Streaming

2022-06-24 Thread Boyang Jerry Peng (Jira)
Boyang Jerry Peng created SPARK-39589:
-

 Summary: Asynchronous I/O API support in Structured Streaming
 Key: SPARK-39589
 URL: https://issues.apache.org/jira/browse/SPARK-39589
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 3.3.0
Reporter: Boyang Jerry Peng


Streaming pipelines that support ETL use cases often make API calls to 
external systems.  To better support such use cases and improve their 
performance, an asynchronous processing API should be added to Structured 
Streaming so that external requests can be handled more efficiently.
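The benefit of overlapping external calls can be sketched with a plain-Python 
thread pool (illustrative only; Structured Streaming's eventual API may look 
quite different):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def call_external_service(record):
    # Stand-in for a blocking request to an external system (REST API, DB, ...).
    time.sleep(0.01)
    return record * 2

records = list(range(8))

# Issuing the calls concurrently overlaps their wait time instead of paying
# it once per record, which is the essence of asynchronous I/O support.
with ThreadPoolExecutor(max_workers=8) as pool:
    enriched = list(pool.map(call_external_service, records))
```

With 8 workers the eight 10 ms calls complete in roughly one call's latency 
rather than eight, while `pool.map` still preserves record order.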






[jira] [Created] (SPARK-39591) Offset Management Improvements in Structured Streaming

2022-06-24 Thread Boyang Jerry Peng (Jira)
Boyang Jerry Peng created SPARK-39591:
-

 Summary: Offset Management Improvements in Structured Streaming
 Key: SPARK-39591
 URL: https://issues.apache.org/jira/browse/SPARK-39591
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.3.0
Reporter: Boyang Jerry Peng


Currently in Structured Streaming, at the beginning of every micro-batch the 
offset to process up to for the current batch is persisted to durable storage.  
At the end of every micro-batch, a marker indicating the completion of the 
current micro-batch is persisted to durable storage.  For pipelines that, for 
example, read from Kafka and write to Kafka, where end-to-end exactly-once is 
not supported and latency is a concern, we can allow users to configure offset 
commits to be written asynchronously.  The commit operation then no longer 
contributes to the batch duration, effectively lowering the overall latency of 
the pipeline.






[jira] [Created] (SPARK-39590) Python API Parity in Structure Streaming

2022-06-24 Thread Boyang Jerry Peng (Jira)
Boyang Jerry Peng created SPARK-39590:
-

 Summary: Python API Parity in Structure Streaming
 Key: SPARK-39590
 URL: https://issues.apache.org/jira/browse/SPARK-39590
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.3.0
Reporter: Boyang Jerry Peng


New APIs in Structured Streaming tend to be added to Java/Scala first.  This 
creates a situation where the Python API has fallen behind.  For example, 
map/flatMapGroupsWithState is not supported in PySpark.  We need the PySpark 
API to catch up with the Java/Scala APIs and, where necessary, provide tighter 
integrations with native Python data processing frameworks such as Pandas.






[jira] [Created] (SPARK-39592) Asynchronous State Checkpointing in Structured Streaming

2022-06-24 Thread Boyang Jerry Peng (Jira)
Boyang Jerry Peng created SPARK-39592:
-

 Summary: Asynchronous State Checkpointing in Structured Streaming
 Key: SPARK-39592
 URL: https://issues.apache.org/jira/browse/SPARK-39592
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.3.0
Reporter: Boyang Jerry Peng


We can reduce the latency of stateful pipelines in Structured Streaming by 
making state checkpoints asynchronous.  One of the major contributors to 
latency for stateful pipelines is checkpointing the state changes of every 
micro-batch.  If we make state checkpointing asynchronous, we can potentially 
lower the latency of the pipeline significantly, since checkpointing will 
contribute less, or not at all, to the batch latency.
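A plain-Python sketch of taking the checkpoint write off the batch loop: the 
snapshot is captured synchronously (so it is consistent), then persisted on a 
background thread. All names here are illustrative, not Spark APIs:

```python
import threading

snapshots = []   # simulated durable checkpoint store

def checkpoint_async(batch_id, state):
    # Capture the snapshot synchronously, then hand the (slow) write to a
    # background thread so the micro-batch loop does not block on it.
    snap = dict(state)
    t = threading.Thread(target=snapshots.append, args=((batch_id, snap),))
    t.start()
    return t

state = {"count": 0}
writers = []
for batch_id in range(3):
    state["count"] += 1                       # the batch mutates state...
    writers.append(checkpoint_async(batch_id, state))  # ...write happens async

for t in writers:
    t.join()   # at shutdown, wait for in-flight checkpoint writes
```

Because each snapshot is copied before the next batch mutates the state, the 
persisted checkpoints remain consistent even though the writes are concurrent.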






[jira] [Created] (SPARK-39593) Configurable State Checkpointing Frequency in Structured Streaming

2022-06-24 Thread Boyang Jerry Peng (Jira)
Boyang Jerry Peng created SPARK-39593:
-

 Summary: Configurable State Checkpointing Frequency in Structured 
Streaming
 Key: SPARK-39593
 URL: https://issues.apache.org/jira/browse/SPARK-39593
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.3.0
Reporter: Boyang Jerry Peng


Currently, for stateful pipelines, state changes are checkpointed on every 
micro-batch.  State checkpoints can contribute significantly to the latency of 
a micro-batch.  If state is checkpointed less frequently, its effect on batch 
latency can be amortized.  This can be used in conjunction with asynchronous 
state checkpointing to further reduce the latency cost that state 
checkpointing may incur.
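A plain-Python sketch of the frequency knob (the config name is hypothetical; 
the implied trade-off is that recovery may need to replay up to N-1 batches 
from the last checkpoint):

```python
CHECKPOINT_EVERY_N_BATCHES = 3   # hypothetical knob, not an existing Spark conf

checkpoints = []   # simulated durable checkpoint store

def maybe_checkpoint(batch_id, state):
    # Persist state only every N batches, amortizing the checkpoint cost
    # across N micro-batches instead of paying it on every one.
    if batch_id % CHECKPOINT_EVERY_N_BATCHES == 0:
        checkpoints.append((batch_id, dict(state)))

state = {}
for batch_id in range(7):
    state["count"] = state.get("count", 0) + 1   # per-batch state update
    maybe_checkpoint(batch_id, state)
```

Here only batches 0, 3, and 6 pay the checkpoint cost; combined with 
asynchronous writes, the per-batch latency contribution shrinks further.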






[jira] [Created] (SPARK-40025) Project Lightspeed (Spark Streaming Improvements)

2022-08-09 Thread Boyang Jerry Peng (Jira)
Boyang Jerry Peng created SPARK-40025:
-

 Summary: Project Lightspeed (Spark Streaming Improvements)
 Key: SPARK-40025
 URL: https://issues.apache.org/jira/browse/SPARK-40025
 Project: Spark
  Issue Type: Umbrella
  Components: Structured Streaming
Affects Versions: 3.2.2
Reporter: Boyang Jerry Peng


{quote}SPARK-39585 - Multiple Stateful Operators in Structured Streaming
SPARK-39586 - Advanced Windowing in Structured Streaming
SPARK-39587 - Schema Evolution for Stateful Pipelines
SPARK-39589 - Asynchronous I/O support
SPARK-39590 - Python API for Arbitrary Stateful Processing
SPARK-39591 - Offset Management Improvements
SPARK-39592 - Asynchronous State Checkpointing
SPARK-39593 - Configurable State Checkpointing Frequency{quote}






[jira] [Updated] (SPARK-40025) Project Lightspeed (Spark Streaming Improvements)

2022-08-09 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-40025:
--
Description: 
Umbrella Jira to track all tickets under Project Lightspeed

SPARK-39585 - Multiple Stateful Operators in Structured Streaming
SPARK-39586 - Advanced Windowing in Structured Streaming
SPARK-39587 - Schema Evolution for Stateful Pipelines
SPARK-39589 - Asynchronous I/O support
SPARK-39590 - Python API for Arbitrary Stateful Processing
SPARK-39591 - Offset Management Improvements
SPARK-39592 - Asynchronous State Checkpointing
SPARK-39593 - Configurable State Checkpointing Frequency

  was:
{quote}SPARK-39585 - Multiple Stateful Operators in Structured Streaming
SPARK-39586 - Advanced Windowing in Structured Streaming
SPARK-39587 - Schema Evolution for Stateful Pipelines
SPARK-39589 - Asynchronous I/O support
SPARK-39590 - Python API for Arbitrary Stateful Processing
SPARK-39591 - Offset Management Improvements
SPARK-39592 - Asynchronous State Checkpointing
SPARK-39593 - Configurable State Checkpointing Frequency{quote}


> Project Lightspeed (Spark Streaming Improvements)
> -
>
> Key: SPARK-40025
> URL: https://issues.apache.org/jira/browse/SPARK-40025
> Project: Spark
>  Issue Type: Umbrella
>  Components: Structured Streaming
>Affects Versions: 3.2.2
>Reporter: Boyang Jerry Peng
>Priority: Major
>
> Umbrella Jira to track all tickets under Project Lightspeed
> SPARK-39585 - Multiple Stateful Operators in Structured Streaming
> SPARK-39586 - Advanced Windowing in Structured Streaming
> SPARK-39587 - Schema Evolution for Stateful Pipelines
> SPARK-39589 - Asynchronous I/O support
> SPARK-39590 - Python API for Arbitrary Stateful Processing
> SPARK-39591 - Offset Management Improvements
> SPARK-39592 - Asynchronous State Checkpointing
> SPARK-39593 - Configurable State Checkpointing Frequency






[jira] [Updated] (SPARK-40025) Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark

2022-08-10 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-40025:
--
Summary: Project Lightspeed: Faster and Simpler Stream Processing with 
Apache Spark  (was: Project Lightspeed (Spark Streaming Improvements))

> Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark
> --
>
> Key: SPARK-40025
> URL: https://issues.apache.org/jira/browse/SPARK-40025
> Project: Spark
>  Issue Type: Umbrella
>  Components: Structured Streaming
>Affects Versions: 3.2.2
>Reporter: Boyang Jerry Peng
>Priority: Major
>
> Umbrella Jira to track all tickets under Project Lightspeed
> SPARK-39585 - Multiple Stateful Operators in Structured Streaming
> SPARK-39586 - Advanced Windowing in Structured Streaming
> SPARK-39587 - Schema Evolution for Stateful Pipelines
> SPARK-39589 - Asynchronous I/O support
> SPARK-39590 - Python API for Arbitrary Stateful Processing
> SPARK-39591 - Offset Management Improvements
> SPARK-39592 - Asynchronous State Checkpointing
> SPARK-39593 - Configurable State Checkpointing Frequency






[jira] [Updated] (SPARK-40025) Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark

2022-08-10 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-40025:
--
Description: 
Project Lightspeed is an umbrella project aimed at improving a couple of key 
aspects of Spark Streaming:
 * Improving the latency and ensuring it is predictable
 * Enhancing functionality for processing data with new operators and APIs

 

Please see the full blog post for this project:
[https://www.databricks.com/blog/2022/06/28/project-lightspeed-faster-and-simpler-stream-processing-with-apache-spark.html]
 

 

Umbrella Jira to track all tickets under Project Lightspeed

SPARK-39585 - Multiple Stateful Operators in Structured Streaming
SPARK-39586 - Advanced Windowing in Structured Streaming
SPARK-39587 - Schema Evolution for Stateful Pipelines
SPARK-39589 - Asynchronous I/O support
SPARK-39590 - Python API for Arbitrary Stateful Processing
SPARK-39591 - Offset Management Improvements
SPARK-39592 - Asynchronous State Checkpointing
SPARK-39593 - Configurable State Checkpointing Frequency

  was:
Umbrella Jira to track all tickets under Project Lightspeed

SPARK-39585 - Multiple Stateful Operators in Structured Streaming
SPARK-39586 - Advanced Windowing in Structured Streaming
SPARK-39587 - Schema Evolution for Stateful Pipelines
SPARK-39589 - Asynchronous I/O support
SPARK-39590 - Python API for Arbitrary Stateful Processing
SPARK-39591 - Offset Management Improvements
SPARK-39592 - Asynchronous State Checkpointing
SPARK-39593 - Configurable State Checkpointing Frequency


> Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark
> --
>
> Key: SPARK-40025
> URL: https://issues.apache.org/jira/browse/SPARK-40025
> Project: Spark
>  Issue Type: Umbrella
>  Components: Structured Streaming
>Affects Versions: 3.2.2
>Reporter: Boyang Jerry Peng
>Priority: Major
>
> Project Lightspeed is an umbrella project aimed at improving a couple of key 
> aspects of Spark Streaming:
>  * Improving the latency and ensuring it is predictable
>  * Enhancing functionality for processing data with new operators and APIs
>  
> Please see the full blog post for this project:
> [https://www.databricks.com/blog/2022/06/28/project-lightspeed-faster-and-simpler-stream-processing-with-apache-spark.html]
>  
>  
> Umbrella Jira to track all tickets under Project Lightspeed
> SPARK-39585 - Multiple Stateful Operators in Structured Streaming
> SPARK-39586 - Advanced Windowing in Structured Streaming
> SPARK-39587 - Schema Evolution for Stateful Pipelines
> SPARK-39589 - Asynchronous I/O support
> SPARK-39590 - Python API for Arbitrary Stateful Processing
> SPARK-39591 - Offset Management Improvements
> SPARK-39592 - Asynchronous State Checkpointing
> SPARK-39593 - Configurable State Checkpointing Frequency






[jira] [Updated] (SPARK-40025) Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark

2022-08-10 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-40025:
--
Description: 
Project Lightspeed is an umbrella project aimed at improving a couple of key 
aspects of Spark Streaming:
 * Improving the latency and ensuring it is predictable
 * Enhancing functionality for processing data with new operators and APIs

 

Umbrella Jira to track all tickets under Project Lightspeed

SPARK-39585 - Multiple Stateful Operators in Structured Streaming
SPARK-39586 - Advanced Windowing in Structured Streaming
SPARK-39587 - Schema Evolution for Stateful Pipelines
SPARK-39589 - Asynchronous I/O support
SPARK-39590 - Python API for Arbitrary Stateful Processing
SPARK-39591 - Offset Management Improvements
SPARK-39592 - Asynchronous State Checkpointing
SPARK-39593 - Configurable State Checkpointing Frequency

  was:
Project Lightspeed is an umbrella project aimed at improving a couple of key 
aspects of Spark Streaming:
 * Improving the latency and ensuring it is predictable
 * Enhancing functionality for processing data with new operators and APIs

 

Please see the full blog post for this project:
[https://www.databricks.com/blog/2022/06/28/project-lightspeed-faster-and-simpler-stream-processing-with-apache-spark.html]
 

 

Umbrella Jira to track all tickets under Project Lightspeed

SPARK-39585 - Multiple Stateful Operators in Structured Streaming
SPARK-39586 - Advanced Windowing in Structured Streaming
SPARK-39587 - Schema Evolution for Stateful Pipelines
SPARK-39589 - Asynchronous I/O support
SPARK-39590 - Python API for Arbitrary Stateful Processing
SPARK-39591 - Offset Management Improvements
SPARK-39592 - Asynchronous State Checkpointing
SPARK-39593 - Configurable State Checkpointing Frequency


> Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark
> --
>
> Key: SPARK-40025
> URL: https://issues.apache.org/jira/browse/SPARK-40025
> Project: Spark
>  Issue Type: Umbrella
>  Components: Structured Streaming
>Affects Versions: 3.2.2
>Reporter: Boyang Jerry Peng
>Priority: Major
>
> Project Lightspeed is an umbrella project aimed at improving a couple of key 
> aspects of Spark Streaming:
>  * Improving the latency and ensuring it is predictable
>  * Enhancing functionality for processing data with new operators and APIs
>  
> Umbrella Jira to track all tickets under Project Lightspeed
> SPARK-39585 - Multiple Stateful Operators in Structured Streaming
> SPARK-39586 - Advanced Windowing in Structured Streaming
> SPARK-39587 - Schema Evolution for Stateful Pipelines
> SPARK-39589 - Asynchronous I/O support
> SPARK-39590 - Python API for Arbitrary Stateful Processing
> SPARK-39591 - Offset Management Improvements
> SPARK-39592 - Asynchronous State Checkpointing
> SPARK-39593 - Configurable State Checkpointing Frequency






[jira] [Commented] (SPARK-36649) Support Trigger.AvailableNow on Kafka data source

2022-01-13 Thread Boyang Jerry Peng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475761#comment-17475761
 ] 

Boyang Jerry Peng commented on SPARK-36649:
---

I'm working on it.

> Support Trigger.AvailableNow on Kafka data source
> -
>
> Key: SPARK-36649
> URL: https://issues.apache.org/jira/browse/SPARK-36649
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> SPARK-36533 introduces a new trigger Trigger.AvailableNow, but only 
> introduces the new functionality to the file stream source. Given that Kafka 
> data source is the one of major data sources being used in streaming query, 
> we should make Kafka data source support this.






[jira] [Created] (SPARK-37973) Directly call super.getDefaultReadLimit when scala issue 12523 is fixed

2022-01-20 Thread Boyang Jerry Peng (Jira)
Boyang Jerry Peng created SPARK-37973:
-

 Summary: Directly call super.getDefaultReadLimit when scala issue 
12523 is fixed
 Key: SPARK-37973
 URL: https://issues.apache.org/jira/browse/SPARK-37973
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: Boyang Jerry Peng


Regarding [https://github.com/apache/spark/pull/35238], and more 
specifically these lines:

 

[https://github.com/apache/spark/pull/35238/files#diff-c9def1b07e12775929ebc58e107291495a48461f516db75acc4af4b6ce8b4dc7R106]

 

[https://github.com/apache/spark/pull/35238/files#diff-1ab03f750a3cbd95b7a84bb0c6bb6c6600c79dcb20ea2b43dd825d9aedab9656R140]

 

```scala
maxOffsetsPerTrigger.map(ReadLimit.maxRows).getOrElse(super.getDefaultReadLimit)
```

needed to be changed to

```scala
maxOffsetsPerTrigger.map(ReadLimit.maxRows).getOrElse(ReadLimit.allAvailable())
```

 

Because of a bug in the Scala compiler, documented here:

 

[https://github.com/scala/bug/issues/12523]

 

Once this bug is fixed, we can revert this change, i.e. go back to using 
`super.getDefaultReadLimit`.
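
As a sketch of why the inlined value is behavior-preserving: `super.getDefaultReadLimit` resolves to the `SupportsAdmissionControl` default, which returns `ReadLimit.allAvailable()`. The method name below is illustrative, not the actual KafkaMicroBatchStream code:

```scala
import org.apache.spark.sql.connector.read.streaming.ReadLimit

// Illustrative sketch, not the actual connector code. Inlining
// ReadLimit.allAvailable() is behaviorally equivalent to the
// SupportsAdmissionControl default, while avoiding a super call inside
// the closure passed to `getOrElse` (the pattern that trips
// scala/bug#12523).
def effectiveReadLimit(maxOffsetsPerTrigger: Option[Long]): ReadLimit =
  maxOffsetsPerTrigger.map(ReadLimit.maxRows).getOrElse(ReadLimit.allAvailable())
```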






[jira] [Updated] (SPARK-37973) Directly call super.getDefaultReadLimit when scala issue 12523 is fixed

2022-01-20 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-37973:
--
Description: 
Regarding [https://github.com/apache/spark/pull/35238], and more 
specifically these lines:

 

[https://github.com/apache/spark/pull/35238/files#diff-c9def1b07e12775929ebc58e107291495a48461f516db75acc4af4b6ce8b4dc7R106]

 

[https://github.com/apache/spark/pull/35238/files#diff-1ab03f750a3cbd95b7a84bb0c6bb6c6600c79dcb20ea2b43dd825d9aedab9656R140]

 

 
{code:java}
maxOffsetsPerTrigger.map(ReadLimit.maxRows).getOrElse(super.getDefaultReadLimit)
 {code}
 

 

needed to be changed to

 
{code:java}
maxOffsetsPerTrigger.map(ReadLimit.maxRows).getOrElse(ReadLimit.allAvailable()) 
{code}
 

Because of a bug in the Scala compiler, documented here:

 

[https://github.com/scala/bug/issues/12523]

 

Once this bug is fixed, we can revert this change, i.e. go back to using 
`super.getDefaultReadLimit`.

  was:
In regards to [https://github.com/apache/spark/pull/35238] and more 
specifically these lines:

 

[https://github.com/apache/spark/pull/35238/files#diff-c9def1b07e12775929ebc58e107291495a48461f516db75acc4af4b6ce8b4dc7R106]

 

[https://github.com/apache/spark/pull/35238/files#diff-1ab03f750a3cbd95b7a84bb0c6bb6c6600c79dcb20ea2b43dd825d9aedab9656R140]

 

```scala
maxOffsetsPerTrigger.map(ReadLimit.maxRows).getOrElse(super.getDefaultReadLimit)
```

needed to be changed to

```scala
maxOffsetsPerTrigger.map(ReadLimit.maxRows).getOrElse(ReadLimit.allAvailable())
```

 

Because of a bug in the scala compiler documented here:

 

[https://github.com/scala/bug/issues/12523]

 

After this bug is fixed we can revert this change, i.e. back to using 
`super.getDefaultReadLimit`


> Directly call super.getDefaultReadLimit when scala issue 12523 is fixed
> ---
>
> Key: SPARK-37973
> URL: https://issues.apache.org/jira/browse/SPARK-37973
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>
> Regarding [https://github.com/apache/spark/pull/35238], and more 
> specifically these lines:
>  
> [https://github.com/apache/spark/pull/35238/files#diff-c9def1b07e12775929ebc58e107291495a48461f516db75acc4af4b6ce8b4dc7R106]
>  
> [https://github.com/apache/spark/pull/35238/files#diff-1ab03f750a3cbd95b7a84bb0c6bb6c6600c79dcb20ea2b43dd825d9aedab9656R140]
>  
>  
> {code:java}
> maxOffsetsPerTrigger.map(ReadLimit.maxRows).getOrElse(super.getDefaultReadLimit)
>  {code}
>  
>  
> needed to be changed to
>  
> {code:java}
> maxOffsetsPerTrigger.map(ReadLimit.maxRows).getOrElse(ReadLimit.allAvailable())
>  {code}
>  
> Because of a bug in the scala compiler documented here:
>  
> [https://github.com/scala/bug/issues/12523]
>  
> After this bug is fixed we can revert this change, i.e. back to using 
> `super.getDefaultReadLimit`






[jira] [Created] (SPARK-38046) Fix KafkaSource/KafkaMicroBatch flaky test due to non-deterministic timing

2022-01-27 Thread Boyang Jerry Peng (Jira)
Boyang Jerry Peng created SPARK-38046:
-

 Summary: Fix KafkaSource/KafkaMicroBatch flaky test due to 
non-deterministic timing
 Key: SPARK-38046
 URL: https://issues.apache.org/jira/browse/SPARK-38046
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 3.2.0
Reporter: Boyang Jerry Peng


There is a test called "compositeReadLimit"

 

[https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala#L460]

 

that is flaky.  The problem is that the Kafka connector always reads the actual 
system time instead of advancing it manually, which leaves room for 
non-deterministic behavior, especially since the source determines whether 
"maxTriggerDelayMs" is satisfied by comparing the last trigger time with the 
current system time.  One can simply "sleep" at points in the test to produce 
different outcomes.
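
The deterministic fix is to inject Spark's internal Clock abstraction instead of reading system time directly, so tests can drive time with a ManualClock. A minimal sketch of the idea; the class and field names below are hypothetical, not the actual connector code:

```scala
import org.apache.spark.util.{Clock, ManualClock, SystemClock}

// Hypothetical helper sketching the idea: the maxTriggerDelayMs check
// reads time from an injected Clock rather than System.currentTimeMillis(),
// so a test can pass a ManualClock and advance it explicitly.
class MaxTriggerDelayCheck(maxTriggerDelayMs: Long, clock: Clock = new SystemClock) {
  private var lastTriggerMs: Long = clock.getTimeMillis()

  // Returns true (and resets the marker) once the delay has elapsed.
  def delayExceeded(): Boolean = {
    val now = clock.getTimeMillis()
    if (now - lastTriggerMs >= maxTriggerDelayMs) {
      lastTriggerMs = now
      true
    } else {
      false
    }
  }
}
```

A test would then construct this with `new ManualClock(0)` and call `clock.advance(...)` to produce each outcome deterministically instead of sleeping.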






[jira] [Updated] (SPARK-38046) Fix KafkaSource/KafkaMicroBatch flaky test due to non-deterministic timing

2022-01-27 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-38046:
--
Description: 
There is a test called "compositeReadLimit"

 

[https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala#L460]

 

that is flaky.  The problem is that the Kafka connector always reads the actual 
system time instead of advancing it manually, which leaves room for 
non-deterministic behavior, especially since the source determines whether 
"maxTriggerDelayMs" is satisfied by comparing the last trigger time with the 
current system time.  One can simply "sleep" at points in the test to produce 
different outcomes.

 

Example output when test fails:

 
{code:java}
- compositeReadLimit *** FAILED *** (7 seconds, 862 milliseconds)
  == Results ==
  !== Correct Answer - 0 ==   == Spark Answer - 14 ==
  !struct<>                   struct
  !                           [112]
  !                           [113]
  !                           [114]
  !                           [115]
  !                           [116]
  !                           [117]
  !                           [118]
  !                           [119]
  !                           [120]
  !                           [16]
  !                           [17]
  !                           [18]
  !                           [19]
  !                           [20]
      
  
  == Progress ==
     
StartStream(ProcessingTimeTrigger(100),org.apache.spark.sql.streaming.util.StreamManualClock@30075210,Map(),null)
     AssertOnQuery(, )
     CheckAnswer: 
[1],[10],[100],[101],[102],[103],[104],[105],[106],[107],[11],[108],[109],[110],[111],[12],[13],[14],[15]
     AdvanceManualClock(100)
     AssertOnQuery(, )
  => CheckNewAnswer: 
     Assert(, )
     AdvanceManualClock(100)
     AssertOnQuery(, )
     CheckAnswer: 
[1],[10],[100],[101],[102],[103],[104],[105],[106],[107],[11],[108],[109],[110],[111],[112],[113],[114],[115],[116],[12],[117],[118],[119],[120],[121],[13],[14],[15],[16],[17],[18],[19],[2],[20],[21],[22],[23],[24]
     AdvanceManualClock(100)
     AssertOnQuery(, )
     CheckNewAnswer: 
     Assert(, )
     AdvanceManualClock(100)
     AssertOnQuery(, )
     CheckAnswer: 
[1],[10],[100],[101],[102],[103],[104],[105],[106],[107],[11],[108],[109],[110],[111],[112],[113],[114],[115],[116],[117],[118],[119],[12],[120],[121],[122],[123],[124],[125],[126],[127],[128],[13],[14],[15],[16],[17],[18],[19],[2],[20],[21],[22],[23],[24],[25],[26],[27],[28],[29],[30]
  
  == Stream ==
  Output Mode: Append
  Stream state: {KafkaSourceV1[Subscribe[topic-41]]: 
{"topic-41":{"2":1,"1":11,"0":21}}}
  Thread state: alive
  Thread stack trace: java.lang.Object.wait(Native Method)
  org.apache.spark.util.ManualClock.waitTillTime(ManualClock.scala:67)
  
org.apache.spark.sql.streaming.util.StreamManualClock.waitTillTime(StreamManualClock.scala:34)
  
org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:76)
  
org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:222)
  
org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:350)
  
org.apache.spark.sql.execution.streaming.StreamExecution$$Lambda$3081/1859014229.apply$mcV$sp(Unknown
 Source)
  scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:857)
  
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:325)
  
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:252)
  
  
  == Sink ==
  0: [100] [101] [102] [103] [104] [105] [106] [107] [108] [109] [110] [111] 
[1] [10] [11] [12] [13] [14] [15]
  1: [112] [113] [114] [115] [116] [117] [118] [119] [120] [16] [17] [18] [19] 
[20]
  
  
  == Plan ==
  == Parsed Logical Plan ==
  WriteToDataSourceV2 
org.apache.spark.sql.execution.streaming.sources.MicroBatchWrite@48d73c40
  +- SerializeFromObject [input[0, int, false] AS value#7455]
     +- MapElements 
org.apache.spark.sql.kafka010.KafkaMicroBatchSourceSuiteBase$$Lambda$6114/1322446561@4df69693,
 class scala.Tuple2, [StructField(_1,StringType,true), 
StructField(_2,StringType,true)], obj#7454: int
        +- DeserializeToObject newInstance(class scala.Tuple2), obj#7453: 
scala.Tuple2
           +- Project [cast(key#7429 as string) AS key#7443, cast(value#7430 as 
string) AS value#7444]
              +- Project [key#7501 AS key#7429, value#7502 AS value#7430, 
topic#7503 AS topic#7431, partition#7504 AS partition#7432, offset#7505L AS 
offset#7433L, timestamp#7506 AS timestamp#7434, timestampType#7507 AS 
timestampType#7435]
                 +- LogicalRDD [

[jira] [Updated] (SPARK-38046) Fix KafkaSource/KafkaMicroBatch flaky test due to non-deterministic timing

2022-01-27 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-38046:
--
Description: 
There is a test called "compositeReadLimit"

 

[https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala#L460]

 

that is flaky.  The problem is that the Kafka connector always reads the actual 
system time instead of advancing it manually, which leaves room for 
non-deterministic behavior, especially since the source determines whether 
"maxTriggerDelayMs" is satisfied by comparing the last trigger time with the 
current system time.  One can simply "sleep" at points in the test to produce 
different outcomes.

 

Example output when test fails:

 
{code:java}
- compositeReadLimit *** FAILED *** (7 seconds, 862 milliseconds)
  == Results ==
  !== Correct Answer - 0 ==   == Spark Answer - 14 ==
  !struct<>                   struct
  !                           [112]
  !                           [113]
  !                           [114]
  !                           [115]
  !                           [116]
  !                           [117]
  !                           [118]
  !                           [119]
  !                           [120]
  !                           [16]
  !                           [17]
  !                           [18]
  !                           [19]
  !                           [20]

  == Progress ==
     StartStream(ProcessingTimeTrigger(100),org.apache.spark.sql.streaming.util.StreamManualClock@30075210,Map(),null)
     AssertOnQuery(, )
     CheckAnswer: [1],[10],[100],[101],[102],[103],[104],[105],[106],[107],[11],[108],[109],[110],[111],[12],[13],[14],[15]
     AdvanceManualClock(100)
     AssertOnQuery(, )
  => CheckNewAnswer: 
     Assert(, )
     AdvanceManualClock(100)
     AssertOnQuery(, )
     CheckAnswer: [1],[10],[100],[101],[102],[103],[104],[105],[106],[107],[11],[108],[109],[110],[111],[112],[113],[114],[115],[116],[12],[117],[118],[119],[120],[121],[13],[14],[15],[16],[17],[18],[19],[2],[20],[21],[22],[23],[24]
     AdvanceManualClock(100)
     AssertOnQuery(, )
     CheckNewAnswer: 
     Assert(, )
     AdvanceManualClock(100)
     AssertOnQuery(, )
     CheckAnswer: [1],[10],[100],[101],[102],[103],[104],[105],[106],[107],[11],[108],[109],[110],[111],[112],[113],[114],[115],[116],[117],[118],[119],[12],[120],[121],[122],[123],[124],[125],[126],[127],[128],[13],[14],[15],[16],[17],[18],[19],[2],[20],[21],[22],[23],[24],[25],[26],[27],[28],[29],[30]

  == Stream ==
  Output Mode: Append
  Stream state: {KafkaSourceV1[Subscribe[topic-41]]: {"topic-41":{"2":1,"1":11,"0":21}}}
  Thread state: alive
  Thread stack trace: java.lang.Object.wait(Native Method)
  org.apache.spark.util.ManualClock.waitTillTime(ManualClock.scala:67)
  org.apache.spark.sql.streaming.util.StreamManualClock.waitTillTime(StreamManualClock.scala:34)
  org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:76)
  org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:222)
  org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:350)
  org.apache.spark.sql.execution.streaming.StreamExecution$$Lambda$3081/1859014229.apply$mcV$sp(Unknown Source)
  scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:857)
  org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:325)
  org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:252)

  == Sink ==
  0: [100] [101] [102] [103] [104] [105] [106] [107] [108] [109] [110] [111] [1] [10] [11] [12] [13] [14] [15]
  1: [112] [113] [114] [115] [116] [117] [118] [119] [120] [16] [17] [18] [19] [20]

  == Plan ==
  == Parsed Logical Plan ==
  WriteToDataSourceV2 org.apache.spark.sql.execution.streaming.sources.MicroBatchWrite@48d73c40
  +- SerializeFromObject [input[0, int, false] AS value#7455]
     +- MapElements org.apache.spark.sql.kafka010.KafkaMicroBatchSourceSuiteBase$$Lambda$6114/1322446561@4df69693, class scala.Tuple2, [StructField(_1,StringType,true), StructField(_2,StringType,true)], obj#7454: int
        +- DeserializeToObject newInstance(class scala.Tuple2), obj#7453: scala.Tuple2
           +- Project [cast(key#7429 as string) AS key#7443, cast(value#7430 as string) AS value#7444]
              +- Project [key#7501 AS key#7429, value#7502 AS value#7430, topic#7503 AS topic#7431, partition#7504 AS partition#7432, offset#7505L AS offset#7433L, timestamp#7506 AS timestamp#7434, timestampType#7507 AS timestampType#7435]
                 +- LogicalRDD [key#7501, value#7502, topic#7503, parti

[jira] [Created] (SPARK-38564) Support collecting metrics from streaming sinks

2022-03-15 Thread Boyang Jerry Peng (Jira)
Boyang Jerry Peng created SPARK-38564:
-

 Summary: Support collecting metrics from streaming sinks
 Key: SPARK-38564
 URL: https://issues.apache.org/jira/browse/SPARK-38564
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.2.1
Reporter: Boyang Jerry Peng


Currently, only streaming sources have the capability to return custom metrics; 
sinks do not.  Allowing streaming sinks to also return custom metrics would be 
very useful.
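
For illustration, a sink-side metric could reuse the CustomMetric interface that sources already expose. The exact sink-side hook this ticket would introduce is an open design question, so treat this as a sketch; the metric name is an assumption:

```scala
import org.apache.spark.sql.connector.metric.CustomMetric

// Sketch of a metric a sink might report. The metric name and the way a
// sink would register it are assumptions; only the CustomMetric interface
// itself is the existing source-side API.
class NumOutputRowsMetric extends CustomMetric {
  override def name(): String = "numOutputRows"
  override def description(): String = "Number of rows written by the sink"

  // Combine the per-task values into the single value shown in the UI.
  override def aggregateTaskMetrics(taskMetrics: Array[Long]): String =
    taskMetrics.sum.toString
}
```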






[jira] [Created] (SPARK-38670) Add offset commit time to streaming query listener

2022-03-27 Thread Boyang Jerry Peng (Jira)
Boyang Jerry Peng created SPARK-38670:
-

 Summary: Add offset commit time to streaming query listener
 Key: SPARK-38670
 URL: https://issues.apache.org/jira/browse/SPARK-38670
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.2.1
Reporter: Boyang Jerry Peng


A major portion of the batch duration is spent committing offsets at the end of 
the micro-batch.  The timing for this operation is missing from the durationMs 
metrics.  Let's add this metric to get a more complete picture of where the 
time goes during the processing of a micro-batch.
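
Once exposed, the timing would surface through durationMs in the streaming query progress, e.g. via a listener. The durationMs key name "commitOffsets" below is an assumption, not necessarily the final name from this ticket:

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Sketch of consuming the proposed timing from a progress event; the
// durationMs key "commitOffsets" is assumed, not confirmed by this ticket.
class CommitTimingListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // durationMs is a java.util.Map[String, java.lang.Long] of phase timings.
    Option(event.progress.durationMs.get("commitOffsets")).foreach { ms =>
      println(s"batch ${event.progress.batchId}: offset commit took ${ms} ms")
    }
  }
}
```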






[jira] [Created] (SPARK-43118) Remove unnecessary assert for UninterruptibleThread in KafkaMicroBatchStream

2023-04-12 Thread Boyang Jerry Peng (Jira)
Boyang Jerry Peng created SPARK-43118:
-

 Summary: Remove unnecessary assert for UninterruptibleThread in 
KafkaMicroBatchStream
 Key: SPARK-43118
 URL: https://issues.apache.org/jira/browse/SPARK-43118
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.3.2
Reporter: Boyang Jerry Peng









[jira] [Updated] (SPARK-43118) Remove unnecessary assert for UninterruptibleThread in KafkaMicroBatchStream

2023-04-12 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-43118:
--
Description: 
The assert 

 
{code:java}
 assert(Thread.currentThread().isInstanceOf[UninterruptibleThread]) {code}
 

found 
[https://github.com/apache/spark/blob/master/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchStream.scala#L239]
 

 

is not needed, for the following reasons:

 # This assert was put there due to some issues when the old, deprecated 
KafkaOffsetReaderConsumer is used.  The default offset reader implementation 
has been changed to KafkaOffsetReaderAdmin, which no longer requires running 
on an UninterruptibleThread.
 # Even if the deprecated KafkaOffsetReaderConsumer is used, that 
implementation already contains asserts checking that it runs on an 
UninterruptibleThread, e.g. 
[https://github.com/apache/spark/blob/master/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReaderConsumer.scala#L130]
 so the assert in KafkaMicroBatchStream is redundant.

> Remove unnecessary assert for UninterruptibleThread in KafkaMicroBatchStream
> 
>
> Key: SPARK-43118
> URL: https://issues.apache.org/jira/browse/SPARK-43118
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.2
>Reporter: Boyang Jerry Peng
>Priority: Minor
>
> The assert 
>  
> {code:java}
>  assert(Thread.currentThread().isInstanceOf[UninterruptibleThread]) {code}
>  
> found 
> [https://github.com/apache/spark/blob/master/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchStream.scala#L239]
>  
>  
> is not needed, for the following reasons:
>  
>  # This assert was put there due to some issues when the old, deprecated 
> KafkaOffsetReaderConsumer is used.  The default offset reader implementation 
> has been changed to KafkaOffsetReaderAdmin, which no longer requires running 
> on an UninterruptibleThread.
>  # Even if the deprecated KafkaOffsetReaderConsumer is used, that 
> implementation already contains asserts checking that it runs on an 
> UninterruptibleThread, e.g. 
> [https://github.com/apache/spark/blob/master/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReaderConsumer.scala#L130]
>  so the assert in KafkaMicroBatchStream is redundant.






[jira] [Updated] (SPARK-40025) Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark

2022-10-19 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-40025:
--
Description: 
Project Lightspeed is an umbrella project aimed at improving a couple of key 
aspects of Spark Streaming:
 * Improving the latency and ensuring it is predictable
 * Enhancing functionality for processing data with new operators and APIs

 

Umbrella Jira to track all tickets under Project Lightspeed

SPARK-39585 - Multiple Stateful Operators in Structured Streaming
SPARK-39586 - Advanced Windowing in Structured Streaming
SPARK-39587 - Schema Evolution for Stateful Pipelines
SPARK-39589 - Asynchronous I/O support
SPARK-39590 - Python API for Arbitrary Stateful Processing
SPARK-39591 - Offset Management Improvements
SPARK-39592 - Asynchronous State Checkpointing
SPARK-39593 - Configurable State Checkpointing Frequency

  was:
Project Lightspeed is an umbrella project aimed at improving a couple of key 
aspects of Spark Streaming:
 * Improving the latency and ensuring it is predictable
 * Enhancing functionality for processing data with new operators and APIs

 

Umbrella Jira to track all tickets under Project Lightspeed

SPARK-39585 - Multiple Stateful Operators in Structured Streaming
SPARK-39586 - Advanced Windowing in Structured Streaming
SPARK-39587 - Schema Evolution for Stateful Pipelines
SPARK-39589 - Asynchronous I/O support
SPARK-39590 - Python API for Arbitrary Stateful Processing
SPARK-39591 - Offset Management Improvements
SPARK-39592 - Asynchronous State Checkpointing
SPARK-39593 - Configurable State Checkpointing Frequency


> Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark
> --
>
> Key: SPARK-40025
> URL: https://issues.apache.org/jira/browse/SPARK-40025
> Project: Spark
>  Issue Type: Umbrella
>  Components: Structured Streaming
>Affects Versions: 3.2.2
>Reporter: Boyang Jerry Peng
>Priority: Major
>
> Project Lightspeed is an umbrella project aimed at improving a couple of key 
> aspects of Spark Streaming:
>  * Improving the latency and ensuring it is predictable
>  * Enhancing functionality for processing data with new operators and APIs
>  
> Umbrella Jira to track all tickets under Project Lightspeed
> SPARK-39585 - Multiple Stateful Operators in Structured Streaming
> SPARK-39586 - Advanced Windowing in Structured Streaming
> SPARK-39587 - Schema Evolution for Stateful Pipelines
> SPARK-39589 - Asynchronous I/O support
> SPARK-39590 - Python API for Arbitrary Stateful Processing
> SPARK-39591 - Offset Management Improvements
> SPARK-39592 - Asynchronous State Checkpointing
> SPARK-39593 - Configurable State Checkpointing Frequency



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40849) Async log purge

2022-10-19 Thread Boyang Jerry Peng (Jira)
Boyang Jerry Peng created SPARK-40849:
-

 Summary: Async log purge
 Key: SPARK-40849
 URL: https://issues.apache.org/jira/browse/SPARK-40849
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 3.4.0
Reporter: Boyang Jerry Peng









[jira] [Updated] (SPARK-40025) Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark

2022-10-19 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-40025:
--
Description: 
Project Lightspeed is an umbrella project aimed at improving a couple of key 
aspects of Spark Streaming:
 * Improving the latency and ensuring it is predictable
 * Enhancing functionality for processing data with new operators and APIs

 

Umbrella Jira to track all tickets under Project Lightspeed

SPARK-39585 - Multiple Stateful Operators in Structured Streaming
SPARK-39586 - Advanced Windowing in Structured Streaming
SPARK-39587 - Schema Evolution for Stateful Pipelines
SPARK-39589 - Asynchronous I/O support
SPARK-39590 - Python API for Arbitrary Stateful Processing
SPARK-39591 - Offset Management Improvements

SPARK-40849 - Async log purge

SPARK-39592 - Asynchronous State Checkpointing
SPARK-39593 - Configurable State Checkpointing Frequency

  was:
Project Lightspeed is an umbrella project aimed at improving a couple of key 
aspects of Spark Streaming:
 * Improving the latency and ensuring it is predictable
 * Enhancing functionality for processing data with new operators and APIs

 

Umbrella Jira to track all tickets under Project Lightspeed

SPARK-39585 - Multiple Stateful Operators in Structured Streaming
SPARK-39586 - Advanced Windowing in Structured Streaming
SPARK-39587 - Schema Evolution for Stateful Pipelines
SPARK-39589 - Asynchronous I/O support
SPARK-39590 - Python API for Arbitrary Stateful Processing
SPARK-39591 - Offset Management Improvements

SPARK-40849 

 

SPARK-39592 - Asynchronous State Checkpointing
SPARK-39593 - Configurable State Checkpointing Frequency


> Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark
> --
>
> Key: SPARK-40025
> URL: https://issues.apache.org/jira/browse/SPARK-40025
> Project: Spark
>  Issue Type: Umbrella
>  Components: Structured Streaming
>Affects Versions: 3.2.2
>Reporter: Boyang Jerry Peng
>Priority: Major
>
> Project Lightspeed is an umbrella project aimed at improving a couple of key 
> aspects of Spark Streaming:
>  * Improving the latency and ensuring it is predictable
>  * Enhancing functionality for processing data with new operators and APIs
>  
> Umbrella Jira to track all tickets under Project Lightspeed
> SPARK-39585 - Multiple Stateful Operators in Structured Streaming
> SPARK-39586 - Advanced Windowing in Structured Streaming
> SPARK-39587 - Schema Evolution for Stateful Pipelines
> SPARK-39589 - Asynchronous I/O support
> SPARK-39590 - Python API for Arbitrary Stateful Processing
> SPARK-39591 - Offset Management Improvements
> SPARK-40849 - Async log purge
> SPARK-39592 - Asynchronous State Checkpointing
> SPARK-39593 - Configurable State Checkpointing Frequency






[jira] [Updated] (SPARK-40025) Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark

2022-10-19 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-40025:
--
Description: 
Project Lightspeed is an umbrella project aimed at improving a couple of key 
aspects of Spark Streaming:
 * Improving the latency and ensuring it is predictable
 * Enhancing functionality for processing data with new operators and APIs

 

Umbrella Jira to track all tickets under Project Lightspeed

SPARK-39585 - Multiple Stateful Operators in Structured Streaming
SPARK-39586 - Advanced Windowing in Structured Streaming
SPARK-39587 - Schema Evolution for Stateful Pipelines
SPARK-39589 - Asynchronous I/O support
SPARK-39590 - Python API for Arbitrary Stateful Processing
SPARK-39591 - Offset Management Improvements

SPARK-40849 

 

SPARK-39592 - Asynchronous State Checkpointing
SPARK-39593 - Configurable State Checkpointing Frequency

  was:
Project Lightspeed is an umbrella project aimed at improving a couple of key 
aspects of Spark Streaming:
 * Improving the latency and ensuring it is predictable
 * Enhancing functionality for processing data with new operators and APIs

 

Umbrella Jira to track all tickets under Project Lightspeed

SPARK-39585 - Multiple Stateful Operators in Structured Streaming
SPARK-39586 - Advanced Windowing in Structured Streaming
SPARK-39587 - Schema Evolution for Stateful Pipelines
SPARK-39589 - Asynchronous I/O support
SPARK-39590 - Python API for Arbitrary Stateful Processing
SPARK-39591 - Offset Management Improvements


SPARK-39592 - Asynchronous State Checkpointing
SPARK-39593 - Configurable State Checkpointing Frequency


> Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark
> --
>
> Key: SPARK-40025
> URL: https://issues.apache.org/jira/browse/SPARK-40025
> Project: Spark
>  Issue Type: Umbrella
>  Components: Structured Streaming
>Affects Versions: 3.2.2
>Reporter: Boyang Jerry Peng
>Priority: Major
>
> Project Lightspeed is an umbrella project aimed at improving a couple of key 
> aspects of Spark Streaming:
>  * Improving the latency and ensuring it is predictable
>  * Enhancing functionality for processing data with new operators and APIs
>  
> Umbrella Jira to track all tickets under Project Lightspeed
> SPARK-39585 - Multiple Stateful Operators in Structured Streaming
> SPARK-39586 - Advanced Windowing in Structured Streaming
> SPARK-39587 - Schema Evolution for Stateful Pipelines
> SPARK-39589 - Asynchronous I/O support
> SPARK-39590 - Python API for Arbitrary Stateful Processing
> SPARK-39591 - Offset Management Improvements
> SPARK-40849 
>  
> SPARK-39592 - Asynchronous State Checkpointing
> SPARK-39593 - Configurable State Checkpointing Frequency






[jira] [Updated] (SPARK-40849) Async log purge

2022-10-19 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-40849:
--
Description: Purging old entries in both the offset log and commit log will 
be done asynchronously

> Async log purge
> ---
>
> Key: SPARK-40849
> URL: https://issues.apache.org/jira/browse/SPARK-40849
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>
> Purging old entries in both the offset log and commit log will be done 
> asynchronously






[jira] [Updated] (SPARK-40849) Async log purge

2022-10-19 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-40849:
--
Description: 
Purging old entries in both the offset log and commit log will be done 
asynchronously.

 

For every micro-batch, older entries in both the offset log and the commit log 
are deleted so that the logs do not grow indefinitely. The relevant logic is 
here:

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L539]

The time spent performing these log purges is grouped with the “walCommit” 
execution time in the StreamingProgressListener metrics. Around two thirds of 
the “walCommit” time is spent on these purge operations, so making them 
asynchronous will also reduce latency. Moreover, the purges do not need to run 
on every micro-batch: when executed asynchronously, they do not block 
micro-batch execution, and a new purge need not start until the current one 
finishes. The purges can run essentially in the background. We only need to 
synchronize them with the offset WAL commits and completion commits so that 
the offset log and commit log are not modified concurrently.

  was:Purging old entries in both the offset log and commit log will be done 
asynchronously


> Async log purge
> ---
>
> Key: SPARK-40849
> URL: https://issues.apache.org/jira/browse/SPARK-40849
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>
> Purging old entries in both the offset log and commit log will be done 
> asynchronously.
>  
> For every micro-batch, older entries in both the offset log and the commit 
> log are deleted so that the logs do not grow indefinitely. The relevant 
> logic is here:
>  
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L539]
>  
> The time spent performing these log purges is grouped with the “walCommit” 
> execution time in the StreamingProgressListener metrics. Around two thirds 
> of the “walCommit” time is spent on these purge operations, so making them 
> asynchronous will also reduce latency. Moreover, the purges do not need to 
> run on every micro-batch: when executed asynchronously, they do not block 
> micro-batch execution, and a new purge need not start until the current one 
> finishes. The purges can run essentially in the background. We only need to 
> synchronize them with the offset WAL commits and completion commits so that 
> the offset log and commit log are not modified concurrently.






[jira] [Created] (SPARK-40957) Add in memory cache in HDFSMetadataLog

2022-10-28 Thread Boyang Jerry Peng (Jira)
Boyang Jerry Peng created SPARK-40957:
-

 Summary: Add in memory cache in HDFSMetadataLog
 Key: SPARK-40957
 URL: https://issues.apache.org/jira/browse/SPARK-40957
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 3.4.0
Reporter: Boyang Jerry Peng


Every time an entry in the offset log or commit log needs to be accessed, it 
is read from disk, which is slow. A cache of recent entries could speed up 
reads.
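The idea can be sketched as a small LRU cache in front of the disk read path. This is a Python illustration with hypothetical names (the actual HDFSMetadataLog is a Scala class in Spark): `get` serves recent batches from memory and falls back to the slow disk read only on a miss.

```python
from collections import OrderedDict


class CachedMetadataLog:
    """Sketch of an HDFSMetadataLog-style log with an in-memory cache of
    recent entries (hypothetical interface, not the actual Spark class)."""

    def __init__(self, disk_reader, capacity=10):
        self._read_from_disk = disk_reader  # slow path: read a batch's file
        self._capacity = capacity
        self._cache = OrderedDict()         # batch_id -> entry, LRU order

    def add(self, batch_id, entry):
        # Writes go to durable storage first (omitted here), then the cache.
        self._put(batch_id, entry)

    def get(self, batch_id):
        if batch_id in self._cache:
            self._cache.move_to_end(batch_id)   # refresh LRU position
            return self._cache[batch_id]
        entry = self._read_from_disk(batch_id)  # cache miss: hit disk
        if entry is not None:
            self._put(batch_id, entry)
        return entry

    def _put(self, batch_id, entry):
        self._cache[batch_id] = entry
        self._cache.move_to_end(batch_id)
        while len(self._cache) > self._capacity:
            self._cache.popitem(last=False)     # evict least recently used
```

Since streaming queries mostly re-read the latest few batch entries (e.g. during commits and purges), even a small capacity should absorb most reads.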


