Hi:

I have a CSV file with 20,000,000 rows, and I create one Parquet file for every 
1,000,000 rows, which means I end up with 20 Parquet files in the folder 
"/usr/download/com/togeek/data/csv/sample". Now I use Drill in embedded mode to 
run:
   SELECT * FROM dfs.`/usr/download/com/togeek/data/csv/sample` WHERE Column0='1'
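
(For reference, a CSV-to-Parquet conversion like this can also be done with 
Drill's own CTAS; a minimal sketch, in which the source path `big.csv` and the 
column list are placeholders, and the actual file boundaries depend on the 
Parquet writer settings rather than an exact 1,000,000-row split:

   -- placeholders: source path and column names; dfs.tmp is writable by default
   CREATE TABLE dfs.tmp.`sample` AS
   SELECT columns[0] AS Column0, columns[1] AS Column1
   FROM dfs.`/usr/download/com/togeek/data/csv/big.csv`;
)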


The result should contain only one row, but Drill takes about 1 minute in my 
environment, which is too slow.
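
For reference, a plan like the one below can be obtained with Drill's EXPLAIN 
statement:

   EXPLAIN PLAN FOR
   SELECT * FROM dfs.`/usr/download/com/togeek/data/csv/sample` WHERE Column0='1';
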
The plan I see is:
00-00    Screen
00-01      Project(*=[$0])
00-02        UnionExchange
01-01          Project(T0¦¦*=[$0])
01-02            SelectionVectorRemover
01-03              Filter(condition=[=($1, '1')])
01-04                Project(T0¦¦*=[$0], Column0=[$1])
01-05                  Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/usr/download/com/togeek/data/csv/sample]], selectionRoot=/usr/download/com/togeek/data/csv/sample, numFiles=1, columns=[`*`]]])



I think it means:
first: read all fields from all files
second: filter
.....


Is that right?
---------------------------------------------------------------------------------------
Why not:
1: read only the ID column from all files
2: filter, so we know which files are hit, and which rows within each file are 
hit
3: read all fields from only the hit files
....
Even when querying a single Parquet file, the same rule could be applied (as in 
the sketch below), so that only a few row groups in the file are scanned. Then 
columnar storage can deliver its best performance.
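
This two-phase approach can even be emulated by hand in Drill SQL, assuming a 
Drill version that exposes the implicit `filename` column; a sketch of the idea 
only, not the engine-internal pushdown I am asking about, and the file name in 
phase 2 is hypothetical:

   -- Phase 1: scan only Column0 to learn which files contain matches
   SELECT DISTINCT filename
   FROM dfs.`/usr/download/com/togeek/data/csv/sample`
   WHERE Column0='1';

   -- Phase 2: read all columns, but only from the file(s) phase 1 returned,
   -- e.g. if it returned `0_0_7.parquet` (hypothetical name):
   SELECT *
   FROM dfs.`/usr/download/com/togeek/data/csv/sample/0_0_7.parquet`
   WHERE Column0='1';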


Does this solution sound correct? If yes, why don't we use it?
If I want to make such SQL run quickly, do you have any suggestions? The more 
detail, the better.

Thanks.


--------Davy Chen


