[ 
https://issues.apache.org/jira/browse/FLINK-20827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zoucao updated FLINK-20827:
---------------------------
    Description: 
When using a temporal table join, all of the Hive table's records are loaded into 
the cache. Sometimes the Hive temporal table is larger than expected, and users 
cannot know in advance how much memory it will occupy. In this situation, errors 
such as `GC overhead limit exceeded` or `the heartbeat of TaskManager timeout 
(caused by gc)` will occur.

Maybe we can reduce the number of records read from the Hive table. If the 
upstream records can be hash-partitioned by the join key alone, then each subtask 
only needs to cache the records whose hashed join key maps to that one subtask. 
If that can be done, the whole table is divided across the parallel instances. 
I don't know whether the upstream side can support this under the existing 
framework, but it is easy to support in `FileSystemLookupFunction`; see the 
sketch below.
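
A minimal Java sketch of the idea (the class and method names here are 
hypothetical, not the real `FileSystemLookupFunction` internals): while 
(re)loading the Hive table, each parallel instance keeps only the rows whose 
join-key hash maps to its own subtask index, so it caches roughly 
1/parallelism of the table. This only helps if the probe side is hashed by the 
same join key, so matching lookups land on the subtask that cached the row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Hypothetical illustration of a hash-partitioned lookup-cache load. */
public class PartitionedCacheLoader {

    /** A row is just its join-key value plus the remaining fields. */
    public static final class Row {
        final Object joinKey;
        final Object[] fields;
        Row(Object joinKey, Object... fields) {
            this.joinKey = joinKey;
            this.fields = fields;
        }
    }

    private final int subtaskIndex;   // index of this parallel instance
    private final int parallelism;    // total number of parallel instances

    public PartitionedCacheLoader(int subtaskIndex, int parallelism) {
        this.subtaskIndex = subtaskIndex;
        this.parallelism = parallelism;
    }

    /** Keep a row only if its join key hashes to this subtask. */
    boolean belongsToThisSubtask(Row row) {
        int hash = Objects.hashCode(row.joinKey);
        // The upstream would have to partition probe records by the same hash.
        return Math.floorMod(hash, parallelism) == subtaskIndex;
    }

    /** Scans the table source and caches only the slice owned by this subtask. */
    List<Row> loadCache(Iterable<Row> hiveTableScan) {
        List<Row> cache = new ArrayList<>();
        for (Row row : hiveTableScan) {
            if (belongsToThisSubtask(row)) {
                cache.add(row);
            }
        }
        return cache;
    }
}
```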

If not, we can at least add some logs reporting the size of the cache, to help 
users set the TaskManager memory size or other parameters.
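
For that fallback, something like the following (the hook name and the size 
estimate are made up for illustration; Flink's connectors log via SLF4J) would 
already help users tune the TaskManager memory:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class CacheSizeLogging {
    private static final Logger LOG = LoggerFactory.getLogger(CacheSizeLogging.class);

    /** Hypothetical hook called once the lookup cache has been (re)loaded. */
    static void logCacheStats(int numRows, long estimatedBytes) {
        LOG.info(
            "Lookup cache reloaded: {} rows, roughly {} MB on the heap; "
                + "consider increasing TaskManager memory if this is close to the limit.",
            numRows,
            estimatedBytes >> 20);
    }
}
```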

> Just read record correlating to join key in FilesystemLookUpFunc
> ----------------------------------------------------------------
>
>                 Key: FLINK-20827
>                 URL: https://issues.apache.org/jira/browse/FLINK-20827
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / FileSystem
>            Reporter: zoucao
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
