Tom Bar Yacov created SPARK-26008:
-------------------------------------

Summary: Structured Streaming Manual clock for simulation
Key: SPARK-26008
URL: https://issues.apache.org/jira/browse/SPARK-26008
Project: Spark
Issue Type: Question
Components: Structured Streaming
Affects Versions: 2.4.0, 2.3.2, 2.3.1, 2.3.0
Reporter: Tom Bar Yacov
The internal StreamTest class in Structured Streaming allows testing incremental logic and verifying outputs across multiple triggers. It supports changing the internal Spark clock to get a fully deterministic simulation of the incremental state and APIs. This is not possible outside tests, since DataStreamWriter hides the triggerClock parameter and is final.

This would be very useful not only in unit tests but also for a real running query. For example, suppose all the historical Kafka data is persisted to HDFS with its Kafka timestamps, and you want to "play" the data and simulate the streaming application's output as if it were running on this data live, including the incremental output between triggers.

Today I can simulate multiple triggers and incremental logic for some of the APIs, but for APIs that depend on the execution clock, like mapGroupsWithState with an execution-based timeout, I did not find a way to do this.

The question is: is it possible to support a solution similar to the one in StreamTest, that is, allow passing an external manual clock as a parameter to DataStreamWriter and give the user external control over this clock? What failures could occur when running with a manual clock in real cluster mode?

Thanks
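
To illustrate what I mean by external control over the clock, here is a minimal, Spark-independent sketch in Python. Everything in it is hypothetical: `ManualClock` and `TimeoutSimulator` are names made up for this example, not Spark classes, and the timeout rule is a simplified stand-in for an execution-based timeout rather than the actual mapGroupsWithState semantics. It only shows that once the clock is injected, the output of each trigger becomes deterministic.

```python
from typing import Dict, List, Tuple


class ManualClock:
    """Hypothetical clock whose time moves only when the caller advances it."""

    def __init__(self, start_ms: int = 0) -> None:
        self._now_ms = start_ms

    def now_ms(self) -> int:
        return self._now_ms

    def advance(self, delta_ms: int) -> int:
        if delta_ms < 0:
            raise ValueError("clock cannot move backwards")
        self._now_ms += delta_ms
        return self._now_ms


class TimeoutSimulator:
    """Toy per-key counter with a timeout measured against an injected clock.

    Assumption: a key times out in a trigger when it received no data in that
    trigger and the clock has reached the key's deadline.
    """

    def __init__(self, clock: ManualClock, timeout_ms: int) -> None:
        self._clock = clock
        self._timeout_ms = timeout_ms
        self._counts: Dict[str, int] = {}
        self._deadlines: Dict[str, int] = {}

    def run_trigger(self, batch: List[str]) -> List[Tuple[str, int]]:
        """Process one micro-batch and return (key, final_count) for expired keys."""
        now = self._clock.now_ms()
        # Keys that received data: update the count and push the deadline out.
        for key in batch:
            self._counts[key] = self._counts.get(key, 0) + 1
            self._deadlines[key] = now + self._timeout_ms
        # Keys without data whose deadline has passed: emit and drop their state.
        expired = []
        for key in sorted(self._deadlines):
            if key not in batch and self._deadlines[key] <= now:
                expired.append((key, self._counts.pop(key)))
                del self._deadlines[key]
        return expired


# Replay three "historical" batches while driving the clock by hand.
clock = ManualClock(start_ms=0)
sim = TimeoutSimulator(clock, timeout_ms=10_000)

first = sim.run_trigger(["a", "b"])   # both keys get a deadline of 10000
clock.advance(5_000)
second = sim.run_trigger(["a"])       # "a" is refreshed to 15000
clock.advance(5_000)
third = sim.run_trigger([])           # now = 10000, so only "b" expires
clock.advance(5_000)
fourth = sim.run_trigger([])          # now = 15000, so "a" expires
```

With a wall clock, which keys expire in which trigger would depend on how long each batch takes to run. With the manual clock, the caller decides, which is the behaviour I would like to get from a real query.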