dianfu commented on issue #11771:
URL: https://github.com/apache/flink/pull/11771#issuecomment-616387332


   There are two caches in RelDataTypeFactoryImpl: KEY2TYPE_CACHE and 
DATATYPE_CACHE. KEY2TYPE_CACHE caches the mapping from a Key (consisting of 
field names, field types, etc.) to a RelDataType and, per my understanding, is 
used for the canonicalization of row types. DATATYPE_CACHE caches the 
RelDataType instances.
   
   PythonCalcSplitRule splits a Calc RelNode which contains both 
non-vectorized and vectorized Python UDFs into two Calc RelNodes. 
   
   For the failing test case, the output type of the bottom Calc consists of 
two fields (f0: INTEGER, f1: INTEGER); let's call it row_type_0. This row type 
is already available in the cache (it was generated by other test cases and is 
held in KEY2TYPE_CACHE), so constructing it results in a cache hit. However, 
during debugging, I found that the INTEGER type referenced by row_type_0 had 
already been cleaned up from DATATYPE_CACHE. Then, when constructing the 
RexProgram for the top Calc, another INTEGER type instance is created, which 
differs from the one referenced by row_type_0, and the failure happens.
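
   To make the mechanism above concrete, here is a minimal toy sketch in plain 
Java. It is not Calcite's implementation: ToyTypeFactory, ToyType, ToyRowType, 
rowType, canonicalType and evict are all hypothetical names, the two maps only 
stand in for KEY2TYPE_CACHE and DATATYPE_CACHE, and the cleanup of the INTEGER 
entry is simulated by an explicit remove.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only; every name here is hypothetical, not Calcite API.
class ToyTypeFactory {

    /** Stand-in for a scalar type; instances are compared by identity. */
    static final class ToyType {
        final String name;
        ToyType(String name) { this.name = name; }
    }

    /** Stand-in for a row type that holds references to its field types. */
    static final class ToyRowType {
        final List<ToyType> fieldTypes;
        ToyRowType(List<ToyType> fieldTypes) { this.fieldTypes = fieldTypes; }
    }

    // Analogue of DATATYPE_CACHE: canonical scalar type per type name.
    private final Map<String, ToyType> dataTypeCache = new HashMap<>();
    // Analogue of KEY2TYPE_CACHE: row type per structural key.
    private final Map<String, ToyRowType> keyToTypeCache = new HashMap<>();

    /** Returns the cached instance for the name, creating one on a miss. */
    ToyType canonicalType(String name) {
        return dataTypeCache.computeIfAbsent(name, ToyType::new);
    }

    /** Simulates the scalar type entry being cleaned up from its cache. */
    void evict(String name) {
        dataTypeCache.remove(name);
    }

    /** Builds or looks up a row type; each field is written "name:TYPE". */
    ToyRowType rowType(String... fields) {
        String key = String.join(",", fields);
        ToyRowType cached = keyToTypeCache.get(key);
        if (cached != null) {
            return cached; // cache hit: field types are NOT re-resolved
        }
        List<ToyType> types = new ArrayList<>();
        for (String field : fields) {
            types.add(canonicalType(field.substring(field.indexOf(':') + 1)));
        }
        ToyRowType created = new ToyRowType(types);
        keyToTypeCache.put(key, created);
        return created;
    }

    public static void main(String[] args) {
        ToyTypeFactory factory = new ToyTypeFactory();
        // An "earlier test case" populates both caches.
        factory.rowType("f0:INTEGER", "f1:INTEGER");
        // The INTEGER entry disappears while the row type entry survives.
        factory.evict("INTEGER");
        // The bottom Calc hits the row type cache and gets the old INTEGER.
        ToyRowType rowType0 = factory.rowType("f0:INTEGER", "f1:INTEGER");
        // The top Calc asks for INTEGER again and gets a new instance.
        ToyType fresh = factory.canonicalType("INTEGER");
        System.out.println(rowType0.fieldTypes.get(0) == fresh); // false
        // A row type whose key is not cached resolves its fields anew.
        ToyRowType other =
            factory.rowType("f0:INTEGER", "f1:INTEGER", "f2:INTEGER");
        System.out.println(other.fieldTypes.get(0) == fresh); // true
    }
}
```

   In this toy model the mismatch only appears when the row type lookup hits 
while the scalar type lookup misses; a row type with a key that is not yet 
cached picks up the current INTEGER instance.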
   
   To work around this problem, we adjusted the test case a bit so that the 
output row type of the bottom Calc consists of three fields instead of two, 
which makes the cache lookup miss. It seems a little hacky; however, it does 
solve this problem. I'm glad to try a more elegant approach if there is one 
that avoids this problem entirely within Flink. Do you have any suggestions? 
Glad to hear them!



