We are adding support for partitioning in CTAS, which may help in your case:

CREATE TABLE Parquet_Table (Column0, Column1, ...)
PARTITION BY (Column0)
AS SELECT ... FROM your_csv_file;

Then, queries that filter on the partition column would use partition pruning and see improved performance:

SELECT * FROM Parquet_Table
WHERE Column0 = '1';

The partitioning support is still a work in progress, targeted at the
1.1 release. You can try it out on the current master branch.
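As a rough sketch of the whole workflow (paths, column names, and types here
are illustrative, not from your setup): with Drill's default text reader, CSV
fields arrive in a single `columns` array, so each field has to be aliased
and, where needed, cast before it can be used as a partition column:

  CREATE TABLE dfs.tmp.`parquet_table`
  PARTITION BY (Column0)
  AS SELECT
    CAST(columns[0] AS VARCHAR(20)) AS Column0,
    CAST(columns[1] AS VARCHAR(20)) AS Column1
  FROM dfs.`/usr/download/com/togeek/data/csv/sample.csv`;

  -- Filters on the partition column can then prune whole files:
  SELECT * FROM dfs.tmp.`parquet_table` WHERE Column0 = '1';

Note that PARTITION BY requires the partition column to appear in the
SELECT list of the CTAS statement.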



On Fri, Jun 26, 2015 at 7:43 AM, Andries Engelbrecht <
[email protected]> wrote:

> How big are the files and what system are you running on?
>
> Can you provide a Drill show files for the directory listed?
>
>
> —Andries
>
>
>
> On Jun 26, 2015, at 12:47 AM, 陈礼剑 <[email protected]> wrote:
>
> > Hi:
> >
> >
> > I have a CSV file with 20,000,000 rows, and I create a parquet file for
> > each 1,000,000 rows, which means I will have 20 parquet files in the
> > folder "/usr/download/com/togeek/data/csv/sample". Now I use Drill in
> > embedded mode to run:
> >   SELECT * FROM dfs.`/usr/download/com/togeek/data/csv/sample` WHERE
> >   Column0 = '1'
> >
> >
> > The result should only have one row, but Drill takes about 1 minute in
> > my environment, which is too slow.
> > I see the plan is:
> > 00-00    Screen
> > 00-01      Project(*=[$0])
> > 00-02        UnionExchange
> > 01-01          Project(T0¦¦*=[$0])
> > 01-02            SelectionVectorRemover
> > 01-03              Filter(condition=[=($1, '1')])
> > 01-04                Project(T0¦¦*=[$0], Column0=[$1])
> > 01-05                  Scan(groupscan=[ParquetGroupScan
> [entries=[ReadEntryWithPath
> [path=file:/usr/download/com/togeek/data/csv/sample]],
> selectionRoot=/usr/download/com/togeek/data/csv/sample, numFiles=1,
> columns=[`*`]]])
> >
> >
> >
> > I think it means:
> > first: iterate over all fields from all files
> > second: filter
> > .....
> >
> >
> > Is it right?
> >
> ---------------------------------------------------------------------------------------
> > Why not:
> > 1: iterate over the ID field from all files
> > 2: filter, so we know which files are hit, and which rows are hit in
> >    each file
> > 3: iterate over all fields from the hit files
> > ....
> > Even when querying a single parquet file, we could also apply this rule,
> > so only a few row groups in the parquet file would be scanned. Then
> > columnar storage can get the best performance.
> >
> >
> > Does this solution sound correct? If yes, why don't we use it?
> > If I want to make such SQL run quickly, do you have any suggestions? The
> > more detail the better.
> >
> > Thanks.
> >
> >
> > --------Davy Chen
> >
> >
> >
>
>
