[jira] [Commented] (SPARK-48964) Fix the discrepancy between implementation, comment and documentation of option recursive.fields.max.depth in ProtoBuf connector

2024-07-22 Thread Yuchen Liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867877#comment-17867877
 ] 

Yuchen Liu commented on SPARK-48964:


Thank you [~dongjoon].

> Fix the discrepancy between implementation, comment and documentation of 
> option recursive.fields.max.depth in ProtoBuf connector
> 
>
> Key: SPARK-48964
> URL: https://issues.apache.org/jira/browse/SPARK-48964
> Project: Spark
>  Issue Type: Documentation
>  Components: Connect
>Affects Versions: 3.5.0, 4.0.0, 3.5.1, 3.5.2, 3.5.3
>Reporter: Yuchen Liu
>Priority: Major
>
> After the three PRs ([https://github.com/apache/spark/pull/38922], 
> [https://github.com/apache/spark/pull/40011], 
> [https://github.com/apache/spark/pull/40141]) worked on the same option, some 
> legacy comments and documentation were left out of sync with the latest 
> implementation. This task should consolidate them. Below is the correct 
> description of the behavior.
> The `recursive.fields.max.depth` parameter can be specified in the 
> from_protobuf options to control the maximum allowed recursion depth for a 
> field. Setting `recursive.fields.max.depth` to 1 drops all recursive fields, 
> setting it to 2 allows a field to be recursed once, and setting it to 3 allows 
> it to be recursed twice. Setting `recursive.fields.max.depth` to a value 
> greater than 10 is not allowed. If `recursive.fields.max.depth` is set to a 
> value smaller than 1, recursive fields are not permitted. The default value of 
> the option is -1. If a Protobuf record has greater depth for recursive fields 
> than the allowed value, it is truncated and some fields may be discarded. This 
> check is based on the fully qualified field type. The SQL schema for the 
> Protobuf message
> {code:java}
> message Person { string name = 1; Person bff = 2 }{code}
> will vary based on the value of `recursive.fields.max.depth`:
> {code:java}
> 1: struct<name: string>
> 2: struct<name: string, bff: struct<name: string>>
> 3: struct<name: string, bff: struct<name: string, bff: struct<name: string>>> ...
> {code}
>  






[jira] [Updated] (SPARK-48964) Fix the discrepancy between implementation, comment and documentation of option recursive.fields.max.depth in ProtoBuf connector

2024-07-22 Thread Yuchen Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuchen Liu updated SPARK-48964:
---
Description: 
After the three PRs ([https://github.com/apache/spark/pull/38922], 
[https://github.com/apache/spark/pull/40011], 
[https://github.com/apache/spark/pull/40141]) worked on the same option, some 
legacy comments and documentation were left out of sync with the latest 
implementation. This task should consolidate them. Below is the correct 
description of the behavior.

The `recursive.fields.max.depth` parameter can be specified in the 
from_protobuf options to control the maximum allowed recursion depth for a 
field. Setting `recursive.fields.max.depth` to 1 drops all recursive fields, 
setting it to 2 allows a field to be recursed once, and setting it to 3 allows 
it to be recursed twice. Setting `recursive.fields.max.depth` to a value 
greater than 10 is not allowed. If `recursive.fields.max.depth` is set to a 
value smaller than 1, recursive fields are not permitted. The default value of 
the option is -1. If a Protobuf record has greater depth for recursive fields 
than the allowed value, it is truncated and some fields may be discarded. This 
check is based on the fully qualified field type. The SQL schema for the 
Protobuf message
{code:java}
message Person { string name = 1; Person bff = 2 }{code}
will vary based on the value of `recursive.fields.max.depth`:
{code:java}
1: struct<name: string>
2: struct<name: string, bff: struct<name: string>>
3: struct<name: string, bff: struct<name: string, bff: struct<name: string>>> ...
{code}
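For illustration, a minimal PySpark sketch of passing this option through 
from_protobuf; the descriptor file path and input column are hypothetical:
{code:java}
from pyspark.sql.protobuf.functions import from_protobuf

# Hypothetical descriptor file and input column, for illustration only.
parsed = df.select(
    from_protobuf(
        df.value,
        "Person",
        descFilePath="/tmp/person.desc",
        options={"recursive.fields.max.depth": "2"},
    ).alias("person")
)
{code}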

  was:
After the three PRs ([https://github.com/apache/spark/pull/38922], 
[https://github.com/apache/spark/pull/40011], 
[https://github.com/apache/spark/pull/40141]) worked on the same option, some 
legacy comments and documentation were left out of sync with the latest 
implementation. This task should consolidate them. Below is the correct 
description of the behavior.

The `recursive.fields.max.depth` parameter can be specified in the 
from_protobuf options to control the maximum allowed recursion depth for a 
field. Setting `recursive.fields.max.depth` to 1 drops all recursive fields, 
setting it to 2 allows a field to be recursed once, and setting it to 3 allows 
it to be recursed twice. Setting `recursive.fields.max.depth` to a value 
greater than 10 is not allowed. If `recursive.fields.max.depth` is set to a 
value smaller than 1, recursive fields are not permitted. The default value of 
the option is -1. If a Protobuf record has greater depth for recursive fields 
than the allowed value, it is truncated and some fields may be discarded. This 
check is based on the fully qualified field type. The SQL schema for the 
Protobuf message
{code:java}
message Person { string name = 1; Person bff = 2 }{code}
will vary based on the value of `recursive.fields.max.depth`:
{code:java}
1: struct<name: string>
2: struct<name: string, bff: struct<name: string>>
3: struct<name: string, bff: struct<name: string, bff: struct<name: string>>> ...
{code}


> Fix the discrepancy between implementation, comment and documentation of 
> option recursive.fields.max.depth in ProtoBuf connector
> 
>
> Key: SPARK-48964
> URL: https://issues.apache.org/jira/browse/SPARK-48964
> Project: Spark
>  Issue Type: Documentation
>  Components: Connect
>Affects Versions: 3.5.0, 4.0.0, 3.5.1, 3.5.2, 3.5.3
>Reporter: Yuchen Liu
>Priority: Major
>
> After the three PRs ([https://github.com/apache/spark/pull/38922], 
> [https://github.com/apache/spark/pull/40011], 
> [https://github.com/apache/spark/pull/40141]) worked on the same option, some 
> legacy comments and documentation were left out of sync with the latest 
> implementation. This task should consolidate them. Below is the correct 
> description of the behavior.
> The `recursive.fields.max.depth` parameter can be specified in the 
> from_protobuf options to control the maximum allowed recursion depth for a 
> field. Setting `recursive.fields.max.depth` to 1 drops all recursive fields, 
> setting it to 2 allows a field to be recursed once, and setting it to 3 allows 
> it to be recursed twice. Setting `recursive.fields.max.depth` to a value 
> greater than 10 is not allowed. If `recursive.fields.max.depth` is set to a 
> value smaller than 1, recursive fields are not permitted. The default value of 
> the option is -1. If a Protobuf record has greater depth for recursive fields 
> than the allowed value, it is truncated and some fields may be discarded. This 
> check is based on the fully qualified field type. The SQL schema for the 
> Protobuf message
> {code:java}
> message Person { string name = 1; Person bff = 2 }{code}
> will vary based on the value of `recursive.fields.max.depth`:
> {code:java}
> 1: struct<name: string>
> 2: struct<name: string, bff: struct<name: string>>
> 3: struct<name: string, bff: struct<name: string, bff: struct<name: string>>> ...
> {code}
>  




[jira] [Created] (SPARK-48964) Fix the discrepancy between implementation, comment and documentation of option recursive.fields.max.depth in ProtoBuf connector

2024-07-22 Thread Yuchen Liu (Jira)
Yuchen Liu created SPARK-48964:
--

 Summary: Fix the discrepancy between implementation, comment and 
documentation of option recursive.fields.max.depth in ProtoBuf connector
 Key: SPARK-48964
 URL: https://issues.apache.org/jira/browse/SPARK-48964
 Project: Spark
  Issue Type: Documentation
  Components: Connect
Affects Versions: 3.5.1, 3.5.0, 4.0.0, 3.5.2, 3.5.3
Reporter: Yuchen Liu
 Fix For: 4.0.0, 3.5.2, 3.5.3, 3.5.1, 3.5.0


After the three PRs ([https://github.com/apache/spark/pull/38922], 
[https://github.com/apache/spark/pull/40011], 
[https://github.com/apache/spark/pull/40141]) worked on the same option, some 
legacy comments and documentation were left out of sync with the latest 
implementation. This task should consolidate them. Below is the correct 
description of the behavior.

The `recursive.fields.max.depth` parameter can be specified in the 
from_protobuf options to control the maximum allowed recursion depth for a 
field. Setting `recursive.fields.max.depth` to 1 drops all recursive fields, 
setting it to 2 allows a field to be recursed once, and setting it to 3 allows 
it to be recursed twice. Setting `recursive.fields.max.depth` to a value 
greater than 10 is not allowed. If `recursive.fields.max.depth` is set to a 
value smaller than 1, recursive fields are not permitted. The default value of 
the option is -1. If a Protobuf record has greater depth for recursive fields 
than the allowed value, it is truncated and some fields may be discarded. This 
check is based on the fully qualified field type. The SQL schema for the 
Protobuf message
{code:java}
message Person { string name = 1; Person bff = 2 }{code}
will vary based on the value of `recursive.fields.max.depth`:
{code:java}
1: struct<name: string>
2: struct<name: string, bff: struct<name: string>>
3: struct<name: string, bff: struct<name: string, bff: struct<name: string>>> ...
{code}






[jira] [Updated] (SPARK-48939) Support recursive reference of Avro schema

2024-07-19 Thread Yuchen Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuchen Liu updated SPARK-48939:
---
Description: 
Recursive reference denotes the case where the type of a field is already 
defined in one of its parent nodes. A simple example is:
{code:java}
{
  "type": "record",
  "name": "LongList",
  "fields" : [
    {"name": "value", "type": "long"},
    {"name": "next", "type": ["null", "LongList"]}
  ]
}
{code}
This is written in Avro Schema DSL and represents a linked-list data structure. 
Spark currently throws an error on this schema. Many users rely on schemas like 
this, so we should support it.
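A minimal PySpark sketch of how reading such a schema could look once 
supported; the option name `recursiveFieldMaxDepth` and the input DataFrame 
`df` are assumptions for illustration, not a confirmed API:
{code:java}
from pyspark.sql.avro.functions import from_avro

long_list_schema = """
{
  "type": "record",
  "name": "LongList",
  "fields" : [
    {"name": "value", "type": "long"},
    {"name": "next", "type": ["null", "LongList"]}
  ]
}
"""
# `recursiveFieldMaxDepth` is assumed here to bound the recursion, mirroring
# the ProtoBuf connector's recursive.fields.max.depth; df.value is hypothetical.
parsed = df.select(
    from_avro(df.value, long_list_schema, {"recursiveFieldMaxDepth": "3"}).alias("list")
)
{code}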

  was:
We should support reading Avro messages with recursive references in the 
schema. Recursive reference denotes the case where the type of a field is 
already defined in one of its parent nodes. A simple example is:
{code:java}
{
  "type": "record",
  "name": "LongList",
  "fields" : [
    {"name": "value", "type": "long"},
    {"name": "next", "type": ["null", "LongList"]}
  ]
}
{code}
This is written in Avro Schema DSL and represents a linked-list data structure.


> Support recursive reference of Avro schema
> --
>
> Key: SPARK-48939
> URL: https://issues.apache.org/jira/browse/SPARK-48939
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Yuchen Liu
>Priority: Major
>
> Recursive reference denotes the case where the type of a field is already 
> defined in one of its parent nodes. A simple example is:
> {code:java}
> {
>   "type": "record",
>   "name": "LongList",
>   "fields" : [
>     {"name": "value", "type": "long"},
>     {"name": "next", "type": ["null", "LongList"]}
>   ]
> }
> {code}
> This is written in Avro Schema DSL and represents a linked-list data 
> structure. Spark currently throws an error on this schema. Many users rely on 
> schemas like this, so we should support it.






[jira] [Created] (SPARK-48939) Support recursive reference of Avro schema

2024-07-18 Thread Yuchen Liu (Jira)
Yuchen Liu created SPARK-48939:
--

 Summary: Support recursive reference of Avro schema
 Key: SPARK-48939
 URL: https://issues.apache.org/jira/browse/SPARK-48939
 Project: Spark
  Issue Type: New Feature
  Components: Connect
Affects Versions: 4.0.0
Reporter: Yuchen Liu


We should support reading Avro messages with recursive references in the 
schema. Recursive reference denotes the case where the type of a field is 
already defined in one of its parent nodes. A simple example is:
{code:java}
{
  "type": "record",
  "name": "LongList",
  "fields" : [
    {"name": "value", "type": "long"},
    {"name": "next", "type": ["null", "LongList"]}
  ]
}
{code}
This is written in Avro Schema DSL and represents a linked-list data structure.






[jira] [Updated] (SPARK-48850) Add documentation for new options added to State Data Source

2024-07-10 Thread Yuchen Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuchen Liu updated SPARK-48850:
---
  Epic Link: SPARK-48588
Description: In [https://github.com/apache/spark/pull/46944] and 
[https://github.com/apache/spark/pull/47188], we introduced some new options to 
the State Data Source. This task aims to explain these new features in the 
documentation.  (was: In https://issues.apache.org/jira/browse/SPARK-48589, we 
introduced some new options in the State Data Source. This task aims to explain 
these new features in the documentation.)
   Priority: Major  (was: Minor)
Summary: Add documentation for new options added to State Data Source  
(was: Add documentation for snapshot related options in State Data Source)

> Add documentation for new options added to State Data Source
> 
>
> Key: SPARK-48850
> URL: https://issues.apache.org/jira/browse/SPARK-48850
> Project: Spark
>  Issue Type: Documentation
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Yuchen Liu
>Priority: Major
>  Labels: pull-request-available
>
> In [https://github.com/apache/spark/pull/46944] and 
> [https://github.com/apache/spark/pull/47188], we introduced some new options 
> to the State Data Source. This task aims to explain these new features in the 
> documentation.






[jira] [Created] (SPARK-48859) Add documentation for change feed related options in State Data Source

2024-07-10 Thread Yuchen Liu (Jira)
Yuchen Liu created SPARK-48859:
--

 Summary: Add documentation for change feed related options in 
State Data Source
 Key: SPARK-48859
 URL: https://issues.apache.org/jira/browse/SPARK-48859
 Project: Spark
  Issue Type: Documentation
  Components: SQL, Structured Streaming
Affects Versions: 4.0.0
Reporter: Yuchen Liu


In [https://github.com/apache/spark/pull/47188], we added some options for 
reading the change feed of the state store. This task is to reflect the latest 
changes in the documentation.






[jira] [Updated] (SPARK-48850) Add documentation for snapshot related options in State Data Source

2024-07-09 Thread Yuchen Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuchen Liu updated SPARK-48850:
---
Description: In https://issues.apache.org/jira/browse/SPARK-48589, we 
introduced some new options in the State Data Source. This task aims to explain 
these new features in the documentation.  (was: In 
https://issues.apache.org/jira/browse/SPARK-48589, we introduced some new 
options in the State Data Source. This task aims to introduce these new 
features in the documentation.)

> Add documentation for snapshot related options in State Data Source
> ---
>
> Key: SPARK-48850
> URL: https://issues.apache.org/jira/browse/SPARK-48850
> Project: Spark
>  Issue Type: Documentation
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Yuchen Liu
>Priority: Minor
>
> In https://issues.apache.org/jira/browse/SPARK-48589, we introduced some new 
> options in the State Data Source. This task aims to explain these new 
> features in the documentation.






[jira] [Created] (SPARK-48850) Add documentation for snapshot related options in State Data Source

2024-07-09 Thread Yuchen Liu (Jira)
Yuchen Liu created SPARK-48850:
--

 Summary: Add documentation for snapshot related options in State 
Data Source
 Key: SPARK-48850
 URL: https://issues.apache.org/jira/browse/SPARK-48850
 Project: Spark
  Issue Type: Documentation
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Yuchen Liu


In https://issues.apache.org/jira/browse/SPARK-48589, we introduced some new 
options in the State Data Source. This task aims to explain these new features 
in the documentation.






[jira] [Updated] (SPARK-48772) State Data Source Read Change Feed

2024-07-02 Thread Yuchen Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuchen Liu updated SPARK-48772:
---
Description: 
The current state reader can only return the entire state at a specific 
version. If a state-related error occurs, knowing how the state changed across 
versions, to find out at which version the state started to go wrong, is 
important for debugging. This PR adds the ability to show the evolution of 
state in Change Data Capture (CDC) format to the state data source.

An example usage:
{code:java}
.format("statestore")
.option("readChangeFeed", true)
.option("changeStartBatchId", 5) # required
.option("changeEndBatchId", 10)  # not required, default: latest available batch Id
{code}
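Put together, a complete read might look like the following sketch; the 
checkpoint path is hypothetical:
{code:java}
changes = (spark.read
    .format("statestore")
    .option("readChangeFeed", True)
    .option("changeStartBatchId", 5)
    .option("changeEndBatchId", 10)
    .load("/tmp/streaming-query/checkpoint"))  # hypothetical checkpoint location
changes.show()
{code}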

  was:
The current state reader can only return the entire state at a specific 
version. If a state-related error occurs, knowing how the state changed across 
versions, to find out at which version the state started to go wrong, is 
important for debugging. This adds the ability to show the evolution of state 
in Change Data Capture (CDC) format to the state data source.

An example usage:
{code:java}
.format("statestore")
.option("readChangeFeed", true)
.option("changeStartBatchId", 5) # required
.option("changeEndBatchId", 10)  # not required, default: latest available batch Id
{code}


> State Data Source Read Change Feed
> --
>
> Key: SPARK-48772
> URL: https://issues.apache.org/jira/browse/SPARK-48772
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Yuchen Liu
>Priority: Major
>
> The current state reader can only return the entire state at a specific 
> version. If a state-related error occurs, knowing how the state changed 
> across versions, to find out at which version the state started to go wrong, 
> is important for debugging. This PR adds the ability to show the evolution of 
> state in Change Data Capture (CDC) format to the state data source.
> An example usage:
> {code:java}
> .format("statestore")
> .option("readChangeFeed", true)
> .option("changeStartBatchId", 5) # required
> .option("changeEndBatchId", 10)  # not required, default: latest available batch Id
> {code}






[jira] [Updated] (SPARK-48772) State Data Source Read Change Feed

2024-07-01 Thread Yuchen Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuchen Liu updated SPARK-48772:
---
Description: 
The current state reader can only return the entire state at a specific 
version. If a state-related error occurs, knowing how the state changed across 
versions, to find out at which version the state started to go wrong, is 
important for debugging. This adds the ability to show the evolution of state 
in Change Data Capture (CDC) format to the state data source.

An example usage:
{code:java}
.format("statestore")
.option("readChangeFeed", true)
.option("changeStartBatchId", 5) # required
.option("changeEndBatchId", 10)  # not required, default: latest available batch Id
{code}

  was:
The current state reader can only return the entire state at a specific 
version. If a state-related error occurs, knowing how the state changed across 
versions, to find out at which version the state started to go wrong, is 
important for debugging. This adds the ability to show the evolution of state 
in Change Data Capture (CDC) format to the state data source.

An example usage:
{code:java}
.format("statestore")
.option("readChangeFeed", true)
.option("changeStartBatchId", 5) # required
.option("changeEndBatchId", 10)  # not required, default: latest available batch Id
{code}
 


> State Data Source Read Change Feed
> --
>
> Key: SPARK-48772
> URL: https://issues.apache.org/jira/browse/SPARK-48772
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Yuchen Liu
>Priority: Major
>
> The current state reader can only return the entire state at a specific 
> version. If a state-related error occurs, knowing how the state changed 
> across versions, to find out at which version the state started to go wrong, 
> is important for debugging. This adds the ability to show the evolution of 
> state in Change Data Capture (CDC) format to the state data source.
> An example usage:
> {code:java}
> .format("statestore")
> .option("readChangeFeed", true)
> .option("changeStartBatchId", 5) # required
> .option("changeEndBatchId", 10)  # not required, default: latest available batch Id
> {code}






[jira] [Created] (SPARK-48772) State Data Source Read Change Feed

2024-07-01 Thread Yuchen Liu (Jira)
Yuchen Liu created SPARK-48772:
--

 Summary: State Data Source Read Change Feed
 Key: SPARK-48772
 URL: https://issues.apache.org/jira/browse/SPARK-48772
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Yuchen Liu


The current state reader can only return the entire state at a specific 
version. If a state-related error occurs, knowing how the state changed across 
versions, to find out at which version the state started to go wrong, is 
important for debugging. This adds the ability to show the evolution of state 
in Change Data Capture (CDC) format to the state data source.

An example usage:
{code:java}
.format("statestore")
.option("readChangeFeed", true)
.option("changeStartBatchId", 5) # required
.option("changeEndBatchId", 10)  # not required, default: latest available batch Id
{code}






[jira] [Created] (SPARK-48589) Add option snapshotStartBatchId and snapshotPartitionId to state data source

2024-06-11 Thread Yuchen Liu (Jira)
Yuchen Liu created SPARK-48589:
--

 Summary: Add option snapshotStartBatchId and snapshotPartitionId 
to state data source
 Key: SPARK-48589
 URL: https://issues.apache.org/jira/browse/SPARK-48589
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Yuchen Liu


Define two new options, _snapshotStartBatchId_ and _snapshotPartitionId_, for 
the existing state reader. Both of them should be provided at the same time. A 
usage sketch follows the list below.
 # When there is no snapshot file at that batch (note there is an off-by-one 
issue between version and batch Id), throw an exception.
 # Otherwise, the reader should continue to rebuild the state by reading only 
delta files, and ignore all snapshot files afterwards.
 # Note that if a batchId option is already specified, that batchId is the 
ending batch Id, so reconstruction should end at that batch.
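A minimal usage sketch; the option values and checkpoint path are illustrative:
{code:java}
df = (spark.read
    .format("statestore")
    .option("snapshotStartBatchId", 10)  # rebuild state starting from this snapshot
    .option("snapshotPartitionId", 0)    # read only this state partition
    .option("batchId", 20)               # optional: the ending batch Id
    .load("/tmp/streaming-query/checkpoint"))  # hypothetical checkpoint location
{code}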






[jira] [Created] (SPARK-48588) Fine-grained State Data Source

2024-06-11 Thread Yuchen Liu (Jira)
Yuchen Liu created SPARK-48588:
--

 Summary: Fine-grained State Data Source
 Key: SPARK-48588
 URL: https://issues.apache.org/jira/browse/SPARK-48588
 Project: Spark
  Issue Type: Epic
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Yuchen Liu


The current state reader API replays the state store rows from the latest 
snapshot and any newer delta files. The issue with this mechanism is that 
sometimes the snapshot files could be wrongly constructed, or users want to 
know how the state changes across batches. We need to improve the State Reader 
so that it can handle a variety of fine-grained requirements, for example, 
reconstructing state from an arbitrary snapshot, or supporting a CDC mode for 
state evolution.






[jira] [Created] (SPARK-48542) Give snapshotStartBatchId and snapshotPartitionId to the state data source

2024-06-05 Thread Yuchen Liu (Jira)
Yuchen Liu created SPARK-48542:
--

 Summary: Give snapshotStartBatchId and snapshotPartitionId to the 
state data source
 Key: SPARK-48542
 URL: https://issues.apache.org/jira/browse/SPARK-48542
 Project: Spark
  Issue Type: New Feature
  Components: SQL, Structured Streaming
Affects Versions: 4.0.0
 Environment: This should work for both HDFS state store and RocksDB 
state store.
Reporter: Yuchen Liu


Right now, to read a version of the state data, the state source will try to 
find the first snapshot file before the given version and construct the state 
using the delta files. In some debugging scenarios, users need more granular 
control over how to reconstruct the given state; for example, they may want to 
start from a specific snapshot instead of the closest one. One use case is to 
find out whether a snapshot has been corrupted after committing.

This task introduces two options {{snapshotStartBatchId}} and 
{{snapshotPartitionId}} to the state data source. By specifying them, users can 
control the starting batch Id of the snapshot and the partition Id of the state.






[jira] [Updated] (SPARK-48447) Check state store provider class before invoking the constructor

2024-05-28 Thread Yuchen Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuchen Liu updated SPARK-48447:
---
Priority: Major  (was: Minor)

> Check state store provider class before invoking the constructor
> 
>
> Key: SPARK-48447
> URL: https://issues.apache.org/jira/browse/SPARK-48447
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Yuchen Liu
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We should ensure that only classes 
> [extending|https://github.com/databricks/runtime/blob/1440e77ab54c40981066c22ec759bdafc0683e76/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L73]
>  {{StateStoreProvider}} can be constructed, to prevent customers from 
> instantiating arbitrary classes.






[jira] [Created] (SPARK-48447) Check state store provider class before invoking the constructor

2024-05-28 Thread Yuchen Liu (Jira)
Yuchen Liu created SPARK-48447:
--

 Summary: Check state store provider class before invoking the 
constructor
 Key: SPARK-48447
 URL: https://issues.apache.org/jira/browse/SPARK-48447
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Yuchen Liu


We should ensure that only classes 
[extending|https://github.com/databricks/runtime/blob/1440e77ab54c40981066c22ec759bdafc0683e76/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L73]
 {{StateStoreProvider}} can be constructed, to prevent customers from 
instantiating arbitrary classes.
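A minimal sketch of the intended guard, written in Python for illustration only 
(the actual implementation lives in Spark's Scala code; `StateStoreProvider` 
and `create_provider` below are stand-ins):
{code:java}
import importlib

class StateStoreProvider:
    """Stand-in for Spark's StateStoreProvider base class."""

def create_provider(class_name: str) -> "StateStoreProvider":
    module_name, _, cls_name = class_name.rpartition(".")
    candidate = getattr(importlib.import_module(module_name), cls_name)
    # Check the class hierarchy *before* invoking the constructor, so an
    # arbitrary user-supplied class name cannot be instantiated via this path.
    if not (isinstance(candidate, type) and issubclass(candidate, StateStoreProvider)):
        raise TypeError(f"{class_name} does not extend StateStoreProvider")
    return candidate()
{code}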






[jira] [Created] (SPARK-48446) Update SS Doc of dropDuplicatesWithinWatermark to use the right syntax

2024-05-28 Thread Yuchen Liu (Jira)
Yuchen Liu created SPARK-48446:
--

 Summary: Update SS Doc of dropDuplicatesWithinWatermark to use the 
right syntax
 Key: SPARK-48446
 URL: https://issues.apache.org/jira/browse/SPARK-48446
 Project: Spark
  Issue Type: Documentation
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Yuchen Liu


For dropDuplicates, the example on 
[https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#:~:text=)%20%5C%0A%20%20.-,dropDuplicates,-(%22guid%22]
 is out of date compared with 
[https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.dropDuplicates.html].
 The argument should be a list, as the sketch below shows.

The same discrepancy applies to dropDuplicatesWithinWatermark.
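For reference, a minimal sketch of the list-style syntax; the column name and 
watermark settings are illustrative:
{code:java}
# dropDuplicates / dropDuplicatesWithinWatermark take a list of column names.
deduped = df.dropDuplicates(["guid"])

deduped_within_watermark = (df
    .withWatermark("eventTime", "10 minutes")
    .dropDuplicatesWithinWatermark(["guid"]))
{code}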






[jira] [Commented] (SPARK-48411) Add E2E test for DropDuplicateWithinWatermark

2024-05-24 Thread Yuchen Liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849356#comment-17849356
 ] 

Yuchen Liu commented on SPARK-48411:


I will work on this.

> Add E2E test for DropDuplicateWithinWatermark
> -
>
> Key: SPARK-48411
> URL: https://issues.apache.org/jira/browse/SPARK-48411
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, SS
>Affects Versions: 4.0.0
>Reporter: Wei Liu
>Priority: Major
>
> Currently we do not have an E2E test for DropDuplicateWithinWatermark; we 
> should add one. We can simply take one of the tests written in Scala here 
> (with the testStream API) and replicate it in Python:
> [https://github.com/apache/spark/commit/0e9e34c1bd9bd16ad5efca77ce2763eb950f3103]
>  
> The change should happen in 
> [https://github.com/apache/spark/blob/eee179135ed21dbdd8b342d053c9eda849e2de77/python/pyspark/sql/tests/streaming/test_streaming.py#L29]
>  
> so we can test it in both Connect and non-Connect.
>  
> Test with:
> ```
> python/run-tests --testnames pyspark.sql.tests.streaming.test_streaming
> python/run-tests --testnames 
> pyspark.sql.tests.connect.streaming.test_parity_streaming
> ```


