KnightChess created HUDI-7267:
---------------------------------

             Summary: csi will cause data loss during sql query
                 Key: HUDI-7267
                 URL: https://issues.apache.org/jira/browse/HUDI-7267
             Project: Apache Hudi
          Issue Type: Bug
          Components: index
            Reporter: KnightChess
         Attachments: image-2023-12-28-13-29-15-943.png

As the picture shows, the column stats index (CSI) uses the Parquet column 
chunk metadata to calculate min/max values and saves them to the metadata 
table (MDT) column stats. For complex columns, such as **info 
array<struct<name: string, age: int>>**, the Parquet metadata contains only 
the leaf columns `info.array.name` and `info.array.age`, but Hudi only 
computes stats for the `info` column, so its stats in the MDT are null.

As a result, if a SQL expression contains `IsNotNull(info)`, all of these files are skipped.
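
The failure mode described above can be sketched as follows. This is a simplified model, not Hudi's actual code: `ColStats`, `footer_stats`, and `naive_keep_file_is_not_null` are hypothetical names, and the pruning rule (`null_count < value_count`, with an unknown result treated as false) is an assumption about how an `IsNotNull` filter is evaluated against column stats.

```python
from typing import Dict, NamedTuple, Optional


class ColStats(NamedTuple):
    """Hypothetical per-file column stats record."""
    min_value: Optional[object]
    max_value: Optional[object]
    null_count: Optional[int]
    value_count: Optional[int]


# Hypothetical stats as read from a Parquet footer: keyed by leaf column path.
footer_stats: Dict[str, ColStats] = {
    "id": ColStats(1, 100, 0, 100),
    "info.array.name": ColStats("alice", "zoe", 0, 250),
    "info.array.age": ColStats(18, 65, 0, 250),
}


def naive_keep_file_is_not_null(stats: Dict[str, ColStats], column: str) -> bool:
    """Evaluate IsNotNull(column) against the stats without guarding for missing stats.

    Models SQL three-valued logic: a comparison involving NULL is unknown,
    and a filter treats unknown as False, so the file is pruned.
    """
    entry = stats.get(column)  # top-level "info" has no entry, only its leaves do
    null_count = entry.null_count if entry else None
    value_count = entry.value_count if entry else None
    if null_count is None or value_count is None:
        return False  # NULL < NULL is unknown, which the filter treats as False
    return null_count < value_count


# "id" has stats and is kept; "info" has no stats and is wrongly pruned.
print(naive_keep_file_is_not_null(footer_stats, "id"))
print(naive_keep_file_is_not_null(footer_stats, "info"))
```

In this model the file is dropped even though it contains non-null `info` values, which is the data loss reported here.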

Also consider ordinary columns that are added in the future: old files will not 
contain them, which may cause similar problems. So, to keep the code logic clean, 
check for null before evaluating the values: min/max/nullValue.
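
A minimal sketch of the proposed null check, assuming a file must be kept whenever its stats are missing, because missing stats cannot prove that the file has no matching rows. All names here (`ColStats`, `keep_file_is_not_null`, `keep_file_greater_than`) are hypothetical and only illustrate the idea; they are not Hudi APIs.

```python
from typing import NamedTuple, Optional


class ColStats(NamedTuple):
    """Hypothetical per-file column stats record."""
    min_value: Optional[object]
    max_value: Optional[object]
    null_count: Optional[int]
    value_count: Optional[int]


def keep_file_is_not_null(entry: Optional[ColStats]) -> bool:
    """Null-safe pruning for IsNotNull(col)."""
    # No stats (complex column, or column added after the file was written):
    # nothing can be proven, so the file must be read.
    if entry is None or entry.null_count is None or entry.value_count is None:
        return True
    # Stats exist: keep the file only if at least one value is non-null.
    return entry.null_count < entry.value_count


def keep_file_greater_than(entry: Optional[ColStats], literal) -> bool:
    """Null-safe pruning for col > literal, using the max value."""
    if entry is None or entry.max_value is None:
        return True  # unknown max: cannot skip safely
    return entry.max_value > literal


# Hypothetical stats for one file; "info" and "new_col" have no entry.
stats = {
    "id": ColStats(1, 100, 0, 100),
    "all_null": ColStats(None, None, 50, 50),
}

print(keep_file_is_not_null(stats.get("info")))        # missing stats
print(keep_file_is_not_null(stats["all_null"]))        # every value is null
print(keep_file_greater_than(stats.get("new_col"), 10))  # missing stats
```

The design choice is conservative: a file is skipped only when the stats positively show that no row can match, so missing stats cost extra reads but never drop data.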

!image-2023-12-28-13-29-15-943.png|width=1458,height=798!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
