[ 
https://issues.apache.org/jira/browse/FLINK-33354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Fan updated FLINK-33354:
----------------------------
    Summary: Cache TaskInformation and JobInformation to avoid deserializing 
duplicate big objects  (was: Reuse the TaskInformation for multiple slots)

> Cache TaskInformation and JobInformation to avoid deserializing duplicate big 
> objects
> -------------------------------------------------------------------------------------
>
>                 Key: FLINK-33354
>                 URL: https://issues.apache.org/jira/browse/FLINK-33354
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Task
>    Affects Versions: 1.18.0, 1.17.1
>            Reporter: Rui Fan
>            Assignee: Rui Fan
>            Priority: Major
>
> The background is similar to FLINK-33315.
> A hive table with a lot of data, and the HiveSource#partitionBytes is 281MB. 
> When slotPerTM = 4, one TM will run 4 HiveSources at the same time.
>  
> How the TaskExecutor to submit a large task?
>  # TaskExecutor#loadBigData will read all bytes from file to 
> SerializedValue<TaskInformation> 
>  ** The SerializedValue<TaskInformation>  has a byte[]
>  ** It will cost the heap memory
>  ** It will be great than 281 MB, because it not only stores 
> HiveSource#partitionBytes, it also stores other information of 
> TaskInformation.
>  # Generate the TaskInformation from SerializedValue<TaskInformation> 
>  ** TaskExecutor#submitTask calls the 
> tdd.getSerializedTaskInformation()..deserializeValue()
>  ** tdd.getSerializedTaskInformation() is SerializedValue<TaskInformation> 
>  ** It will generate the TaskInformation
>  ** TaskInformation includes the Configuration 
> {color:#9876aa}taskConfiguration{color}
>  ** The {color:#9876aa}taskConfiguration{color} includes 
> StreamConfig#{color:#9876aa}SERIALIZEDUDF{color}
>  
> {color:#172b4d}Based on the above process, TM memory will have 2 big byte 
> array for each task:{color}
>  * {color:#172b4d}The SerializedValue<TaskInformation>{color}
>  * {color:#172b4d}The TaskInformation{color}
> When one TM runs 4 HiveSources at the same time, it will have 8 big byte 
> array.
> In our production environment, this is also a situation that often leads to 
> TM OOM.
> h2. Solution:
> These data is totally same due to the PermanentBlobKey is same. We can add a 
> cache for it to reduce the memory and cpu cost.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to