Hi:
I have a CSV file with 20,000,000 rows, and I create one Parquet file for every 1,000,000 rows, so I end up with 20 Parquet files in the folder "/usr/download/com/togeek/data/csv/sample". Now I use Drill in embedded mode to run:

SELECT * FROM dfs.`/usr/download/com/togeek/data/csv/sample` WHERE Column0='1'

The result should contain only one row, but Drill takes about 1 minute in my environment, which is too slow. The plan is:

00-00    Screen
00-01      Project(*=[$0])
00-02        UnionExchange
01-01          Project(T0¦¦*=[$0])
01-02            SelectionVectorRemover
01-03              Filter(condition=[=($1, '1')])
01-04                Project(T0¦¦*=[$0], Column0=[$1])
01-05                  Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/usr/download/com/togeek/data/csv/sample]], selectionRoot=/usr/download/com/togeek/data/csv/sample, numFiles=1, columns=[`*`]]])

I read this as:
first: read all fields from all files
second: filter
...
Is that right?
---------------------------------------------------------------------------------------
Why not do it this way instead:
1: read only the Column0 field from all files
2: filter, so we know which files are hit, and which rows within each file are hit
3: read all fields, but only from the files that were hit
...
Even when querying a single Parquet file we could apply the same rule, so that only a few row groups would be scanned. That is how columnar storage should get its best performance. (I sketch what I mean in code after my questions below.)

Does this solution sound correct? If yes, why don't we use it? And if I want to make such SQL run quickly, do you have any suggestions? The more detail, the better. Thanks.
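To make steps 1 and 2 of my idea concrete, here is a rough sketch in Java against the parquet-mr footer API. This is only my understanding of how it could work, not how Drill actually does it; the class and method names (RowGroupPruner, candidateRowGroups) are made up, and I assume Column0 is stored as a binary/UTF8 column with min/max statistics in the footer.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.io.api.Binary;

public class RowGroupPruner {

    // Keep only the row groups whose min/max statistics for Column0
    // could contain the target value; all other row groups can be
    // skipped without reading a single data page.
    static List<BlockMetaData> candidateRowGroups(Configuration conf,
                                                  Path file,
                                                  String value) throws IOException {
        ParquetMetadata footer = ParquetFileReader.readFooter(conf, file);
        Binary target = Binary.fromString(value);
        List<BlockMetaData> hits = new ArrayList<>();
        for (BlockMetaData rowGroup : footer.getBlocks()) {
            for (ColumnChunkMetaData column : rowGroup.getColumns()) {
                if (!column.getPath().toDotString().equals("Column0")) {
                    continue;
                }
                // No statistics written: be conservative and keep the row group.
                if (!column.getStatistics().hasNonNullValue()) {
                    hits.add(rowGroup);
                    continue;
                }
                Binary min = (Binary) column.getStatistics().genericGetMin();
                Binary max = (Binary) column.getStatistics().genericGetMax();
                // min <= value <= max means the row group *might* hold the row.
                if (min.compareTo(target) <= 0 && max.compareTo(target) >= 0) {
                    hits.add(rowGroup);
                }
            }
        }
        return hits; // step 3 would read all columns, but only from these
    }
}

If I understand the format correctly, this phase only reads the footer of each file (a few KB), so it should be far cheaper than scanning all 20,000,000 rows, especially if the data is sorted or partitioned on Column0 so that the min/max ranges of the row groups do not overlap.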

--------Davy Chen