[ 
https://issues.apache.org/jira/browse/FLINK-27934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frans King updated FLINK-27934:
-------------------------------
    Description: 
In the Python API state variables can be accessed via the UserFacingContext:

variable = context.storage.variable

This calls into the Cell instance for that state variable which has get() & 
set() methods.  The get() method always deserializes from the typed_value and 
the set() always re-serializes and marks the cell dirty.

 

This has two side effects

1:

var1 = context.storage.variable

var2 = context.storage.variable

id(var2) != id(var1) - they are different instances

 

2:

In a large batch (say 1000 calls to the same function type and id) this can 
result in deserializing and re-serializing the same state variable 1000 
times, when really it only needs to be deserialized in the first invocation in 
the batch, held in memory until the last invocation, and then re-serialized 
once prior to collecting the mutations.
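
A minimal sketch of the two effects above. It does not use the real SDK 
classes: NaiveCell, the JSON encoding and the call counters are hypothetical 
stand-ins that only mimic the get()/set() behavior described here.

```python
import json


class NaiveCell:
    """Hypothetical stand-in for the SDK's Cell; not the real class."""

    def __init__(self, raw: bytes):
        self.typed_value = raw      # serialized form, as delivered with the batch
        self.dirty = False
        self.deserialize_calls = 0  # counters exist only for this illustration
        self.serialize_calls = 0

    def get(self):
        # Deserializes on every read, so each call builds a new object.
        self.deserialize_calls += 1
        return json.loads(self.typed_value)

    def set(self, value):
        # Re-serializes on every write and marks the cell dirty.
        self.serialize_calls += 1
        self.typed_value = json.dumps(value).encode("utf-8")
        self.dirty = True


cell = NaiveCell(json.dumps({"count": 0}).encode("utf-8"))

# Side effect 1: two reads return equal but distinct objects.
var1 = cell.get()
var2 = cell.get()
same_instance = var1 is var2

# Side effect 2: a batch of 1000 invocations, each doing one read and one write.
for _ in range(1000):
    state = cell.get()
    state["count"] += 1
    cell.set(state)
```

After the loop the counters show one deserialization per read and one 
serialization per write, although only the final value needs to be sent back.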

 

I think this can be improved by having a lazily initialized backing field in 
the Cell class, but I don't know whether the behavior described in 1 was a 
conscious design decision. 
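
A rough sketch of what such a backing field could look like. CachingCell, 
flush() and the counters are hypothetical names used for illustration, and 
JSON stands in for the real type serializer; this is not the SDK's code.

```python
import json

_UNSET = object()  # sentinel: distinguishes "not loaded yet" from a stored None


class CachingCell:
    """Hypothetical Cell with a lazily initialized backing field."""

    def __init__(self, raw: bytes):
        self.typed_value = raw
        self.dirty = False
        self._value = _UNSET
        self.deserialize_calls = 0  # counters exist only for this illustration
        self.serialize_calls = 0

    def get(self):
        # Deserialize only on the first read; later reads reuse the cached object.
        if self._value is _UNSET:
            self.deserialize_calls += 1
            self._value = json.loads(self.typed_value)
        return self._value

    def set(self, value):
        # Keep the live object and defer serialization until flush().
        self._value = value
        self.dirty = True

    def flush(self) -> bytes:
        # Serialize once, intended to run just before mutations are collected.
        if self.dirty:
            self.serialize_calls += 1
            self.typed_value = json.dumps(self._value).encode("utf-8")
        return self.typed_value


cell = CachingCell(json.dumps({"count": 0}).encode("utf-8"))

for _ in range(1000):
    state = cell.get()
    state["count"] += 1
    cell.set(state)

payload = cell.flush()
same_instance = cell.get() is cell.get()
```

With this approach repeated reads return the same instance, so the behavior 
described in 1 changes. In-place changes to a mutable value also become 
visible to later reads even if set() is never called, but they would not be 
serialized unless the cell is marked dirty.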

 

Any feedback would be welcome. 

  was:
In the Python API state variables can be accessed via the UserFacingContext:

variable = context.storage.variable

This calls into the Cell instance for that state variable which has get() & 
set() methods.  The get() method always deserializes from the typed_value and 
the set() always re-serializes and marks the cell dirty.

 

This has two side effects

1:

var1 = context.storage.variable

var2 = context.storage.variable

var2 != var1 - they are different instances

 

2:

In a large batch (say 1000 calls to the same function type and id) this can 
result in deserializing and re-serializing the same state variable 1000 
times, when really it only needs to be deserialized in the first invocation in 
the batch, held in memory until the last invocation, and then re-serialized 
once prior to collecting the mutations.

 

I think this can be improved by having a lazily initialized backing field in 
the Cell class, but I don't know whether the behavior described in 1 was a 
conscious design decision. 

 

Any feedback would be welcome. 


> Python API- Inefficient deserialization/serialization of state variables 
> within a batch
> ---------------------------------------------------------------------------------------
>
>                 Key: FLINK-27934
>                 URL: https://issues.apache.org/jira/browse/FLINK-27934
>             Project: Flink
>          Issue Type: Improvement
>          Components: Stateful Functions
>    Affects Versions: statefun-3.2.0
>            Reporter: Frans King
>            Priority: Minor



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
