How big are the files and what system are you running on? Can you provide the output of Drill's SHOW FILES for the directory listed?
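For example, something along these lines (using the path from your message; adjust the workspace name if yours differs) would show what Drill sees in that directory:

    SHOW FILES FROM dfs.`/usr/download/com/togeek/data/csv/sample`;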
—Andries

On Jun 26, 2015, at 12:47 AM, 陈礼剑 <[email protected]> wrote:

> Hi:
>
> I have a CSV file with 20,000,000 rows and create one Parquet file per
> 1,000,000 rows, which means I have 20 Parquet files in the folder
> "/usr/download/com/togeek/data/csv/sample". Now I use Drill in embedded mode
> to run:
>
>   SELECT * FROM dfs.`/usr/download/com/togeek/data/csv/sample` WHERE
>   Column0='1'
>
> The result should contain only one row, but Drill takes about 1 minute in my
> environment, which is too slow.
> The plan is:
>
>   00-00 Screen
>   00-01   Project(*=[$0])
>   00-02     UnionExchange
>   01-01       Project(T0¦¦*=[$0])
>   01-02         SelectionVectorRemover
>   01-03           Filter(condition=[=($1, '1')])
>   01-04             Project(T0¦¦*=[$0], Column0=[$1])
>   01-05               Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath
>                         [path=file:/usr/download/com/togeek/data/csv/sample]],
>                         selectionRoot=/usr/download/com/togeek/data/csv/sample,
>                         numFiles=1, columns=[`*`]]])
>
> I think this means:
>
>   first: iterate over all fields from all files
>   second: filter
>   .....
>
> Is that right?
> ---------------------------------------------------------------------------------------
> Why not:
>
>   1: iterate over only the ID field from all files
>   2: filter, so we know which files are hit, and which rows in those files
>      are hit
>   3: iterate over all fields from only the hit files
>   ....
>
> We could apply the same rule even when querying a single Parquet file, so
> that only a few row groups in the file are scanned. That way columnar
> storage gets its best performance.
>
> Does this solution sound correct? If yes, why don't we use it?
> If I want to make such a SQL query run quickly, do you have any suggestions?
> The more detail the better.
>
> Thanks.
>
> --------Davy Chen
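For context on steps 1-3 of the proposal above: Parquet stores per-row-group min/max statistics for each column in the file footer, which is what makes this kind of pruning possible. Below is a rough Java sketch of the idea; RowGroupStats and candidateGroups are hypothetical stand-ins for the real parquet-mr footer-metadata classes, not Drill's actual code.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical stand-in for one row group's footer statistics; real code
    // would read these via parquet-mr's file metadata instead.
    class RowGroupStats {
        String min;      // min value of the filter column in this row group
        String max;      // max value of the filter column in this row group
        long rowCount;   // rows in this row group
    }

    class RowGroupPruner {
        // Steps 1-2 of the proposal: read only the footer statistics and keep
        // the row groups whose [min, max] range can contain the target value.
        // Column0='1' compares strings, so lexicographic order is used here.
        static List<Integer> candidateGroups(List<RowGroupStats> stats, String target) {
            List<Integer> hits = new ArrayList<>();
            for (int i = 0; i < stats.size(); i++) {
                RowGroupStats s = stats.get(i);
                if (s.min.compareTo(target) <= 0 && s.max.compareTo(target) >= 0) {
                    // Step 3 then reads all columns, but only from these groups.
                    hits.add(i);
                }
            }
            return hits;
        }
    }

With 20 files of 1,000,000 rows each, an equality filter on a column whose values are clustered by file would usually leave only one or two candidate row groups to read in full; the plan posted above instead scans columns=[`*`] from every file and filters afterward.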
