Loading a Hive lookup table data into TM memory takes so long.

2022-02-04 Thread Jason Yi
Hello, I created external tables on Hive with data in s3 and wanted to use those tables as a lookup table in Flink. When I used an external table containing a small size of data as a lookup table, Flink quickly loaded the data into TM memory and did a Temporal join to an event stream. But, when I

Re: Loading a Hive lookup table data into TM memory takes so long.

2022-02-06 Thread Caizhi Weng
Hi! Each parallelism of the lookup operation will load all data from the lookup table source, so you're loading 10GB of data to each parallelism and storing them in JVM memory. That is not only slow but also very memory-consuming. Have you tried joining your main stream with the hive table direct

Re: Loading a Hive lookup table data into TM memory takes so long.

2022-02-07 Thread Jason Yi
Hi Caizhi, Could you tell me more details about streaming joins that you suggested? Did you mean putting the Hive table data into a Kafka/Kinesis and joining the main stream with the hive table data streaming with a very long watermark? In my use case, the hive table is an account dimension table

Re: Loading a Hive lookup table data into TM memory takes so long.

2022-02-07 Thread Caizhi Weng
Hi! If Flink is not happy with a large Hive table data Currently it is. Hive lookup table (currently implemented just like a filesystem lookup table) cannot look up values with a specific key, so it has to load all data into memory. Did you mean putting the Hive table data into a Kafka/Kinesis a