[jira] [Commented] (SPARK-48964) Fix the discrepancy between implementation, comment and documentation of option recursive.fields.max.depth in ProtoBuf connector
[ https://issues.apache.org/jira/browse/SPARK-48964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867877#comment-17867877 ] Yuchen Liu commented on SPARK-48964: Thank you [~dongjoon]. > Fix the discrepancy between implementation, comment and documentation of > option recursive.fields.max.depth in ProtoBuf connector > > > Key: SPARK-48964 > URL: https://issues.apache.org/jira/browse/SPARK-48964 > Project: Spark > Issue Type: Documentation > Components: Connect > Affects Versions: 3.5.0, 4.0.0, 3.5.1, 3.5.2, 3.5.3 > Reporter: Yuchen Liu > Priority: Major > > After the three PRs ([https://github.com/apache/spark/pull/38922], > [https://github.com/apache/spark/pull/40011], > [https://github.com/apache/spark/pull/40141]) working on the same option, > there are some legacy comments and documentation that have not been updated to > the latest implementation. This task should consolidate them. Below is the > correct description of the behavior. > The `recursive.fields.max.depth` parameter can be specified in the > from_protobuf options to control the maximum allowed recursion depth for a > field. Setting `recursive.fields.max.depth` to 1 drops all recursive fields, > setting it to 2 allows it to be recursed once, and setting it to 3 allows it > to be recursed twice. Attempting to set `recursive.fields.max.depth` to a > value greater than 10 is not allowed. If `recursive.fields.max.depth` is > set to a value smaller than 1, recursive fields are not permitted. The > default value of the option is -1. If a protobuf record has more depth for > recursive fields than the allowed value, it will be truncated and some fields > may be discarded. This check is based on the fully qualified field type. The SQL > schema for the protobuf message > {code:java} > message Person { string name = 1; Person bff = 2 }{code} > will vary based on the value of `recursive.fields.max.depth`. > {code:java} > 1: struct<name: string> > 2: struct<name: string, bff: struct<name: string>> > 3: struct<name: string, bff: struct<name: string, bff: struct<name: string>>> ... 
> {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
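The depth-truncation rule described in this issue can be sketched in plain Python. This is illustrative only, not the connector's implementation: the helper name `person_schema` and the rendered schema strings are made up to mirror the rule for `message Person { string name = 1; Person bff = 2 }`.

```python
def person_schema(max_depth: int) -> str:
    """Render the SQL schema for the Person message under a given
    recursive.fields.max.depth, following the rule above: depth 1 drops
    the recursive field, depth 2 allows one level of recursion, and so on."""
    if max_depth > 10:
        # Values greater than 10 are rejected by the connector.
        raise ValueError("recursive.fields.max.depth greater than 10 is not allowed")
    if max_depth < 1:
        # Values smaller than 1 mean recursive fields are not permitted.
        raise ValueError("recursive fields are not permitted")
    if max_depth == 1:
        return "struct<name: string>"  # the recursive field `bff` is dropped
    # Each extra unit of depth allows one more level of nesting for `bff`.
    return f"struct<name: string, bff: {person_schema(max_depth - 1)}>"

for depth in (1, 2, 3):
    print(depth, person_schema(depth))
```

Running the loop reproduces the three schemas quoted in the issue description, one extra `bff: struct<…>` level per unit of depth.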
[jira] [Updated] (SPARK-48964) Fix the discrepancy between implementation, comment and documentation of option recursive.fields.max.depth in ProtoBuf connector
[ https://issues.apache.org/jira/browse/SPARK-48964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuchen Liu updated SPARK-48964: --- Description: After the three PRs ([https://github.com/apache/spark/pull/38922], [https://github.com/apache/spark/pull/40011], [https://github.com/apache/spark/pull/40141]) working on the same option, there are some legacy comments and documentation that have not been updated to the latest implementation. This task should consolidate them. Below is the correct description of the behavior. The `recursive.fields.max.depth` parameter can be specified in the from_protobuf options to control the maximum allowed recursion depth for a field. Setting `recursive.fields.max.depth` to 1 drops all recursive fields, setting it to 2 allows it to be recursed once, and setting it to 3 allows it to be recursed twice. Attempting to set `recursive.fields.max.depth` to a value greater than 10 is not allowed. If `recursive.fields.max.depth` is set to a value smaller than 1, recursive fields are not permitted. The default value of the option is -1. If a protobuf record has more depth for recursive fields than the allowed value, it will be truncated and some fields may be discarded. This check is based on the fully qualified field type. The SQL schema for the protobuf message {code:java} message Person { string name = 1; Person bff = 2 }{code} will vary based on the value of `recursive.fields.max.depth`. {code:java} 1: struct<name: string> 2: struct<name: string, bff: struct<name: string>> 3: struct<name: string, bff: struct<name: string, bff: struct<name: string>>> ... {code} was: After the three PRs ([https://github.com/apache/spark/pull/38922], [https://github.com/apache/spark/pull/40011], [https://github.com/apache/spark/pull/40141]) working on the same option, there are some legacy comments and documentation that have not been updated to the latest implementation. This task should consolidate them. Below is the correct description of the behavior. 
The `recursive.fields.max.depth` parameter can be specified in the from_protobuf options to control the maximum allowed recursion depth for a field. Setting `recursive.fields.max.depth` to 1 drops all recursive fields, setting it to 2 allows it to be recursed once, and setting it to 3 allows it to be recursed twice. Attempting to set `recursive.fields.max.depth` to a value greater than 10 is not allowed. If `recursive.fields.max.depth` is set to a value smaller than 1, recursive fields are not permitted. The default value of the option is -1. If a protobuf record has more depth for recursive fields than the allowed value, it will be truncated and some fields may be discarded. This check is based on the fully qualified field type. The SQL schema for the protobuf message {code:java} message Person { string name = 1; Person bff = 2 }{code} will vary based on the value of `recursive.fields.max.depth`. {code:java} 1: struct<name: string> 2: struct<name: string, bff: struct<name: string>> 3: struct<name: string, bff: struct<name: string, bff: struct<name: string>>> ... {code} > Fix the discrepancy between implementation, comment and documentation of > option recursive.fields.max.depth in ProtoBuf connector > > > Key: SPARK-48964 > URL: https://issues.apache.org/jira/browse/SPARK-48964 > Project: Spark > Issue Type: Documentation > Components: Connect > Affects Versions: 3.5.0, 4.0.0, 3.5.1, 3.5.2, 3.5.3 > Reporter: Yuchen Liu > Priority: Major > > After the three PRs ([https://github.com/apache/spark/pull/38922], > [https://github.com/apache/spark/pull/40011], > [https://github.com/apache/spark/pull/40141]) working on the same option, > there are some legacy comments and documentation that have not been updated to > the latest implementation. This task should consolidate them. Below is the > correct description of the behavior. > The `recursive.fields.max.depth` parameter can be specified in the > from_protobuf options to control the maximum allowed recursion depth for a > field. 
Setting `recursive.fields.max.depth` to 1 drops all recursive fields, > setting it to 2 allows it to be recursed once, and setting it to 3 allows it > to be recursed twice. Attempting to set `recursive.fields.max.depth` to a > value greater than 10 is not allowed. If `recursive.fields.max.depth` is > set to a value smaller than 1, recursive fields are not permitted. The > default value of the option is -1. If a protobuf record has more depth for > recursive fields than the allowed value, it will be truncated and some fields > may be discarded. This check is based on the fully qualified field type. The SQL > schema for the protobuf message > {code:java} > message Person { string name = 1; Person bff = 2 }{code} > will vary based on the value of `recursive.fields.max.depth`. > {code:java} > 1: struct<name: string> > 2: struct<name: string, bff: struct<name: string>> > 3: struct<name: string, bff: struct<name: string, bff: struct<name: string>>> ... > {code}
[jira] [Created] (SPARK-48964) Fix the discrepancy between implementation, comment and documentation of option recursive.fields.max.depth in ProtoBuf connector
Yuchen Liu created SPARK-48964: -- Summary: Fix the discrepancy between implementation, comment and documentation of option recursive.fields.max.depth in ProtoBuf connector Key: SPARK-48964 URL: https://issues.apache.org/jira/browse/SPARK-48964 Project: Spark Issue Type: Documentation Components: Connect Affects Versions: 3.5.1, 3.5.0, 4.0.0, 3.5.2, 3.5.3 Reporter: Yuchen Liu Fix For: 4.0.0, 3.5.2, 3.5.3, 3.5.1, 3.5.0 After the three PRs ([https://github.com/apache/spark/pull/38922], [https://github.com/apache/spark/pull/40011], [https://github.com/apache/spark/pull/40141]) working on the same option, there are some legacy comments and documentation that have not been updated to the latest implementation. This task should consolidate them. Below is the correct description of the behavior. The `recursive.fields.max.depth` parameter can be specified in the from_protobuf options to control the maximum allowed recursion depth for a field. Setting `recursive.fields.max.depth` to 1 drops all recursive fields, setting it to 2 allows it to be recursed once, and setting it to 3 allows it to be recursed twice. Attempting to set `recursive.fields.max.depth` to a value greater than 10 is not allowed. If `recursive.fields.max.depth` is set to a value smaller than 1, recursive fields are not permitted. The default value of the option is -1. If a protobuf record has more depth for recursive fields than the allowed value, it will be truncated and some fields may be discarded. This check is based on the fully qualified field type. The SQL schema for the protobuf message {code:java} message Person { string name = 1; Person bff = 2 }{code} will vary based on the value of `recursive.fields.max.depth`. {code:java} 1: struct<name: string> 2: struct<name: string, bff: struct<name: string>> 3: struct<name: string, bff: struct<name: string, bff: struct<name: string>>> ... {code}
[jira] [Updated] (SPARK-48939) Support recursive reference of Avro schema
[ https://issues.apache.org/jira/browse/SPARK-48939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuchen Liu updated SPARK-48939: --- Description: Recursive reference denotes the case where the type of a field is defined earlier in one of its parent nodes. A simple example is: {code:java} { "type": "record", "name": "LongList", "fields" : [ {"name": "value", "type": "long"}, {"name": "next", "type": ["null", "LongList"]} ] } {code} This is written in Avro Schema DSL and represents a linked list data structure. Spark currently throws an error on this schema. Many users rely on schemas like this, so we should support it. was: We should support reading Avro messages with recursive references in the schema. Recursive reference denotes the case where the type of a field is defined earlier in one of its parent nodes. A simple example is: {code:java} { "type": "record", "name": "LongList", "fields" : [ {"name": "value", "type": "long"}, {"name": "next", "type": ["null", "LongList"]} ] } {code} This is written in Avro Schema DSL and represents a linked list data structure. > Support recursive reference of Avro schema > -- > > Key: SPARK-48939 > URL: https://issues.apache.org/jira/browse/SPARK-48939 > Project: Spark > Issue Type: New Feature > Components: Connect > Affects Versions: 4.0.0 > Reporter: Yuchen Liu > Priority: Major > > Recursive reference denotes the case where the type of a field is defined > earlier in one of its parent nodes. A simple example is: > {code:java} > { > "type": "record", > "name": "LongList", > "fields" : [ > {"name": "value", "type": "long"}, > {"name": "next", "type": ["null", "LongList"]} > ] > } > {code} > This is written in Avro Schema DSL and represents a linked list data > structure. Spark currently throws an error on this schema. Many users rely on > schemas like this, so we should support it. 
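The recursive reference described in SPARK-48939 can be spotted with a short sketch that walks an Avro schema given as parsed JSON. This is a plain-Python illustration under stated assumptions, not Spark's Avro connector: the helper name `find_recursive_refs` is invented, and only records, unions, and named-type strings are handled.

```python
def find_recursive_refs(schema, seen=None):
    """Collect type names that refer back to an enclosing record --
    the recursion that Spark previously rejected for Avro schemas."""
    seen = seen or set()
    recursive = set()
    if isinstance(schema, dict) and schema.get("type") == "record":
        # Fields of a record may refer to the record itself (or an ancestor).
        name = schema["name"]
        for field in schema.get("fields", []):
            recursive |= find_recursive_refs(field["type"], seen | {name})
        return recursive
    if isinstance(schema, list):  # a union such as ["null", "LongList"]
        for branch in schema:
            recursive |= find_recursive_refs(branch, seen)
        return recursive
    if isinstance(schema, str) and schema in seen:
        recursive.add(schema)  # named reference to an enclosing record
    return recursive

# The LongList linked-list schema from the issue, as a Python dict.
long_list = {
    "type": "record",
    "name": "LongList",
    "fields": [
        {"name": "value", "type": "long"},
        {"name": "next", "type": ["null", "LongList"]},
    ],
}
print(find_recursive_refs(long_list))  # {'LongList'}
```

The `next` field's union branch `"LongList"` names the enclosing record, which is exactly the self-reference the issue asks Spark to support.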
[jira] [Created] (SPARK-48939) Support recursive reference of Avro schema
Yuchen Liu created SPARK-48939: -- Summary: Support recursive reference of Avro schema Key: SPARK-48939 URL: https://issues.apache.org/jira/browse/SPARK-48939 Project: Spark Issue Type: New Feature Components: Connect Affects Versions: 4.0.0 Reporter: Yuchen Liu We should support reading Avro messages with recursive references in the schema. Recursive reference denotes the case where the type of a field is defined earlier in one of its parent nodes. A simple example is: {code:java} { "type": "record", "name": "LongList", "fields" : [ {"name": "value", "type": "long"}, {"name": "next", "type": ["null", "LongList"]} ] } {code} This is written in Avro Schema DSL and represents a linked list data structure.
[jira] [Updated] (SPARK-48850) Add documentation for new options added to State Data Source
[ https://issues.apache.org/jira/browse/SPARK-48850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuchen Liu updated SPARK-48850: --- Epic Link: SPARK-48588 Description: In [https://github.com/apache/spark/pull/46944] and [https://github.com/apache/spark/pull/47188], we introduced some new options to the State Data Source. This task aims to explain these new features in the documentation. (was: In https://issues.apache.org/jira/browse/SPARK-48589, we introduce some new options in the State Data Source. This task aims to explain these new features in the documentation.) Priority: Major (was: Minor) Summary: Add documentation for new options added to State Data Source (was: Add documentation for snapshot related options in State Data Source) > Add documentation for new options added to State Data Source > > > Key: SPARK-48850 > URL: https://issues.apache.org/jira/browse/SPARK-48850 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming > Affects Versions: 4.0.0 > Reporter: Yuchen Liu > Priority: Major > Labels: pull-request-available > > In [https://github.com/apache/spark/pull/46944] and > [https://github.com/apache/spark/pull/47188], we introduced some new options > to the State Data Source. This task aims to explain these new features in the > documentation.
[jira] [Created] (SPARK-48859) Add documentation for change feed related options in State Data Source
Yuchen Liu created SPARK-48859: -- Summary: Add documentation for change feed related options in State Data Source Key: SPARK-48859 URL: https://issues.apache.org/jira/browse/SPARK-48859 Project: Spark Issue Type: Documentation Components: SQL, Structured Streaming Affects Versions: 4.0.0 Reporter: Yuchen Liu In this PR: [https://github.com/apache/spark/pull/47188], we added some options which are used to read the change feed of the state store. This task is to reflect the latest changes in the documentation.
[jira] [Updated] (SPARK-48850) Add documentation for snapshot related options in State Data Source
[ https://issues.apache.org/jira/browse/SPARK-48850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuchen Liu updated SPARK-48850: --- Description: In https://issues.apache.org/jira/browse/SPARK-48589, we introduce some new options in the State Data Source. This task aims to explain these new features in the documentation. (was: In https://issues.apache.org/jira/browse/SPARK-48589, we introduce some new options in the State Data Source. This task aims to introduce these new features in the documentation.) > Add documentation for snapshot related options in State Data Source > --- > > Key: SPARK-48850 > URL: https://issues.apache.org/jira/browse/SPARK-48850 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming > Affects Versions: 4.0.0 > Reporter: Yuchen Liu > Priority: Minor > > In https://issues.apache.org/jira/browse/SPARK-48589, we introduce some new > options in the State Data Source. This task aims to explain these new > features in the documentation.
[jira] [Created] (SPARK-48850) Add documentation for snapshot related options in State Data Source
Yuchen Liu created SPARK-48850: -- Summary: Add documentation for snapshot related options in State Data Source Key: SPARK-48850 URL: https://issues.apache.org/jira/browse/SPARK-48850 Project: Spark Issue Type: Documentation Components: Structured Streaming Affects Versions: 4.0.0 Reporter: Yuchen Liu In https://issues.apache.org/jira/browse/SPARK-48589, we introduce some new options in the State Data Source. This task aims to introduce these new features in the documentation.
[jira] [Updated] (SPARK-48772) State Data Source Read Change Feed
[ https://issues.apache.org/jira/browse/SPARK-48772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuchen Liu updated SPARK-48772: --- Description: The current state reader can only return the entire state at a specific version. If an error occurs related to state, knowing how state changed across versions, so as to find out at which version state starts to go wrong, is important for debugging purposes. This PR adds the ability to show the evolution of state in Change Data Capture (CDC) format to the state data source. An example usage: {code:java} .format("statestore") .option("readChangeFeed", true) .option("changeStartBatchId", 5) #required .option("changeEndBatchId", 10) #not required, default: latest batch Id available {code} was: The current state reader can only return the entire state at a specific version. If an error occurs related to state, knowing how state changed across versions, so as to find out at which version state starts to go wrong, is important for debugging purposes. This adds the ability to show the evolution of state in Change Data Capture (CDC) format to the state data source. An example usage: {code:java} .format("statestore") .option("readChangeFeed", true) .option("changeStartBatchId", 5) #required .option("changeEndBatchId", 10) #not required, default: latest batch Id available {code} > State Data Source Read Change Feed > -- > > Key: SPARK-48772 > URL: https://issues.apache.org/jira/browse/SPARK-48772 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming > Affects Versions: 4.0.0 > Reporter: Yuchen Liu > Priority: Major > > The current state reader can only return the entire state at a specific > version. If an error occurs related to state, knowing how state changed > across versions, so as to find out at which version state starts to go wrong, is > important for debugging purposes. This PR adds the ability to show the > evolution of state in Change Data Capture (CDC) format to the state data source. 
> An example usage: > {code:java} > .format("statestore") > .option("readChangeFeed", true) > .option("changeStartBatchId", 5) #required > .option("changeEndBatchId", 10) #not required, default: latest batch Id > available > {code}
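The option contract in the example above (required `changeStartBatchId`, optional `changeEndBatchId` defaulting to the latest available batch) can be sketched in plain Python. This is a hypothetical validator for illustration only; the real checks happen inside Spark's State Data Source, and the helper name `resolve_change_feed_options` is made up.

```python
def resolve_change_feed_options(options, latest_batch_id):
    """Validate readChangeFeed options and return the (start, end) batch range."""
    if not options.get("readChangeFeed"):
        raise ValueError("readChangeFeed must be set to true for change feed reads")
    if "changeStartBatchId" not in options:
        raise ValueError("changeStartBatchId is required")  # required option
    start = int(options["changeStartBatchId"])
    # changeEndBatchId is optional; default is the latest batch Id available.
    end = int(options.get("changeEndBatchId", latest_batch_id))
    if start > end:
        raise ValueError(f"changeStartBatchId {start} is after changeEndBatchId {end}")
    return start, end

print(resolve_change_feed_options(
    {"readChangeFeed": True, "changeStartBatchId": 5}, latest_batch_id=10))  # (5, 10)
```

With no `changeEndBatchId` given, the range runs from batch 5 to the latest batch, matching the defaults noted in the example usage.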
[jira] [Updated] (SPARK-48772) State Data Source Read Change Feed
[ https://issues.apache.org/jira/browse/SPARK-48772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuchen Liu updated SPARK-48772: --- Description: The current state reader can only return the entire state at a specific version. If an error occurs related to state, knowing how state changed across versions, so as to find out at which version state starts to go wrong, is important for debugging purposes. This adds the ability to show the evolution of state in Change Data Capture (CDC) format to the state data source. An example usage: {code:java} .format("statestore") .option("readChangeFeed", true) .option("changeStartBatchId", 5) #required .option("changeEndBatchId", 10) #not required, default: latest batch Id available {code} was: The current state reader can only return the entire state at a specific version. If an error occurs related to state, knowing how state changed across versions, so as to find out at which version state starts to go wrong, is important for debugging purposes. This adds the ability to show the evolution of state in Change Data Capture (CDC) format to the state data source. An example usage: {code:java} .format("statestore") .option("readChangeFeed", true) .option("changeStartBatchId", 5) #required .option("changeEndBatchId", 10) #not required, default: latest batch Id available {code} > State Data Source Read Change Feed > -- > > Key: SPARK-48772 > URL: https://issues.apache.org/jira/browse/SPARK-48772 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming > Affects Versions: 4.0.0 > Reporter: Yuchen Liu > Priority: Major > > The current state reader can only return the entire state at a specific > version. If an error occurs related to state, knowing how state changed > across versions, so as to find out at which version state starts to go wrong, is > important for debugging purposes. This adds the ability to show the evolution > of state in Change Data Capture (CDC) format to the state data source. 
> An example usage: > {code:java} > .format("statestore") > .option("readChangeFeed", true) > .option("changeStartBatchId", 5) #required > .option("changeEndBatchId", 10) #not required, default: latest batch Id > available > {code}
[jira] [Created] (SPARK-48772) State Data Source Read Change Feed
Yuchen Liu created SPARK-48772: -- Summary: State Data Source Read Change Feed Key: SPARK-48772 URL: https://issues.apache.org/jira/browse/SPARK-48772 Project: Spark Issue Type: New Feature Components: Structured Streaming Affects Versions: 4.0.0 Reporter: Yuchen Liu The current state reader can only return the entire state at a specific version. If an error occurs related to state, knowing how state changed across versions, so as to find out at which version state starts to go wrong, is important for debugging purposes. This adds the ability to show the evolution of state in Change Data Capture (CDC) format to the state data source. An example usage: {code:java} .format("statestore") .option("readChangeFeed", true) .option("changeStartBatchId", 5) #required .option("changeEndBatchId", 10) #not required, default: latest batch Id available {code}
[jira] [Created] (SPARK-48589) Add option snapshotStartBatchId and snapshotPartitionId to state data source
Yuchen Liu created SPARK-48589: -- Summary: Add option snapshotStartBatchId and snapshotPartitionId to state data source Key: SPARK-48589 URL: https://issues.apache.org/jira/browse/SPARK-48589 Project: Spark Issue Type: New Feature Components: Structured Streaming Affects Versions: 4.0.0 Reporter: Yuchen Liu Define two new options, _snapshotStartBatchId_ and _snapshotPartitionId_, for the existing state reader. Both of them should be provided at the same time. # When there is no snapshot file at that batch (note there is an off-by-one issue between version and batch Id), throw an exception. # Otherwise, the reader should continue to rebuild the state by reading delta files only, and ignore all snapshot files afterwards. # Note that if a batchId option is already specified, that batchId is treated as the ending batchId, and reconstruction should end at that batch.
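The three rules above can be sketched as a small planner in plain Python. This is illustrative only, not Spark's implementation: the helper name `plan_state_rebuild` is made up, and the version = batchId + 1 mapping reflects the off-by-one note in the issue.

```python
def plan_state_rebuild(snapshot_versions, snapshot_start_batch_id, end_batch_id):
    """Plan how to rebuild state from a chosen snapshot, per the rules above.
    snapshot_versions: set of store versions that have a snapshot file."""
    # Off-by-one between batch Id and store version: version = batchId + 1.
    start_version = snapshot_start_batch_id + 1
    if start_version not in snapshot_versions:
        # Rule 1: no snapshot file at that batch -> throw an exception.
        raise ValueError(f"no snapshot file for batch {snapshot_start_batch_id}")
    # Rule 3: an already-specified batchId is the ending batch.
    end_version = end_batch_id + 1
    # Rule 2: after the chosen snapshot, read delta files only and ignore
    # any later snapshot files.
    return [("snapshot", start_version)] + [
        ("delta", v) for v in range(start_version + 1, end_version + 1)
    ]

print(plan_state_rebuild({1, 3, 5}, snapshot_start_batch_id=2, end_batch_id=4))
```

For snapshots at versions {1, 3, 5}, starting from batch 2 (version 3) and ending at batch 4 yields the snapshot at version 3 followed by deltas 4 and 5; the later snapshot at version 5 is deliberately ignored.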
[jira] [Created] (SPARK-48588) Fine-grained State Data Source
Yuchen Liu created SPARK-48588: -- Summary: Fine-grained State Data Source Key: SPARK-48588 URL: https://issues.apache.org/jira/browse/SPARK-48588 Project: Spark Issue Type: Epic Components: Structured Streaming Affects Versions: 4.0.0 Reporter: Yuchen Liu The current state reader API replays the state store rows from the latest snapshot and newer delta files if any. The issue with this mechanism is that sometimes the snapshot files could be wrongly constructed, or users want to know how state changed across batches. We need to improve the State Reader so that it can handle a variety of fine-grained requirements, for example reconstructing state from an arbitrary snapshot, or supporting a CDC mode for state evolution.
[jira] [Created] (SPARK-48542) Give snapshotStartBatchId and snapshotPartitionId to the state data source
Yuchen Liu created SPARK-48542: -- Summary: Give snapshotStartBatchId and snapshotPartitionId to the state data source Key: SPARK-48542 URL: https://issues.apache.org/jira/browse/SPARK-48542 Project: Spark Issue Type: New Feature Components: SQL, Structured Streaming Affects Versions: 4.0.0 Environment: This should work for both HDFS state store and RocksDB state store. Reporter: Yuchen Liu Right now, to read a version of the state data, the state source will try to find the first snapshot file before the given version and construct it using the delta files. In some debugging scenarios, users need more granular control over how to reconstruct the given state; for example, they may want to start from a specific snapshot instead of the closest one. One use case is to find out whether a snapshot has been corrupted after committing. This task introduces two options, {{snapshotStartBatchId}} and {{snapshotPartitionId}}, to the state data source. By specifying them, users can control the starting batch id of the snapshot and the partition id of the state.
[jira] [Updated] (SPARK-48447) Check state store provider class before invoking the constructor
[ https://issues.apache.org/jira/browse/SPARK-48447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuchen Liu updated SPARK-48447: --- Priority: Major (was: Minor) > Check state store provider class before invoking the constructor > > > Key: SPARK-48447 > URL: https://issues.apache.org/jira/browse/SPARK-48447 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming > Affects Versions: 4.0.0 > Reporter: Yuchen Liu > Priority: Major > Original Estimate: 24h > Remaining Estimate: 24h > > We should restrict that only classes > [extending|https://github.com/databricks/runtime/blob/1440e77ab54c40981066c22ec759bdafc0683e76/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L73] > {{StateStoreProvider}} can be constructed, to prevent customers from > instantiating arbitrary classes of objects.
[jira] [Created] (SPARK-48447) Check state store provider class before invoking the constructor
Yuchen Liu created SPARK-48447: -- Summary: Check state store provider class before invoking the constructor Key: SPARK-48447 URL: https://issues.apache.org/jira/browse/SPARK-48447 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 4.0.0 Reporter: Yuchen Liu We should restrict that only classes [extending|https://github.com/databricks/runtime/blob/1440e77ab54c40981066c22ec759bdafc0683e76/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L73] {{StateStoreProvider}} can be constructed, to prevent customers from instantiating arbitrary classes of objects.
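The check proposed in SPARK-48447 amounts to verifying a class's type before calling its constructor. A minimal sketch in Python, assuming a stand-in `StateStoreProvider` base class (Spark's real check is done in Scala via reflection; `create_provider` and `RocksDBLikeProvider` are hypothetical names):

```python
class StateStoreProvider:
    """Stand-in for Spark's StateStoreProvider trait."""

def create_provider(cls):
    """Check the class before invoking its constructor, so only classes
    extending StateStoreProvider can be instantiated."""
    if not (isinstance(cls, type) and issubclass(cls, StateStoreProvider)):
        raise TypeError(f"{cls!r} does not extend StateStoreProvider")
    return cls()  # safe to construct only after the check passes

class RocksDBLikeProvider(StateStoreProvider):
    """Hypothetical provider subclass used for the demo."""

print(type(create_provider(RocksDBLikeProvider)).__name__)  # RocksDBLikeProvider
```

Passing an unrelated class (say `dict`) raises before any constructor runs, which is the point of checking the class first rather than instantiating and inspecting afterwards.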
[jira] [Created] (SPARK-48446) Update SS Doc of dropDuplicatesWithinWatermark to use the right syntax
Yuchen Liu created SPARK-48446: -- Summary: Update SS Doc of dropDuplicatesWithinWatermark to use the right syntax Key: SPARK-48446 URL: https://issues.apache.org/jira/browse/SPARK-48446 Project: Spark Issue Type: Documentation Components: Structured Streaming Affects Versions: 4.0.0 Reporter: Yuchen Liu For dropDuplicates, the example on [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#:~:text=)%20%5C%0A%20%20.-,dropDuplicates,-(%22guid%22] is out of date compared with [https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.dropDuplicates.html]. The argument should be a list. The discrepancy is also true for dropDuplicatesWithinWatermark.
[jira] [Commented] (SPARK-48411) Add E2E test for DropDuplicateWithinWatermark
[ https://issues.apache.org/jira/browse/SPARK-48411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849356#comment-17849356 ] Yuchen Liu commented on SPARK-48411: I will work on this. > Add E2E test for DropDuplicateWithinWatermark > - > > Key: SPARK-48411 > URL: https://issues.apache.org/jira/browse/SPARK-48411 > Project: Spark > Issue Type: New Feature > Components: Connect, SS > Affects Versions: 4.0.0 > Reporter: Wei Liu > Priority: Major > > Currently we do not have an e2e test for DropDuplicateWithinWatermark; we > should add one. We can simply take one of the tests written in Scala here (with > the testStream API) and replicate it in Python: > [https://github.com/apache/spark/commit/0e9e34c1bd9bd16ad5efca77ce2763eb950f3103] > > The change should happen in > [https://github.com/apache/spark/blob/eee179135ed21dbdd8b342d053c9eda849e2de77/python/pyspark/sql/tests/streaming/test_streaming.py#L29] > > so we can test it in both connect and non-connect. > > Test with: > ``` > python/run-tests --testnames pyspark.sql.tests.streaming.test_streaming > python/run-tests --testnames > pyspark.sql.tests.connect.streaming.test_parity_streaming > ```