Hi, 

 

Drill is taking 23 minutes for a simple select * query with limit 100 on 1GB 
of uncompressed parquet data. EXPLAIN PLAN for this query also takes about 
that long (~23 minutes).

Query: select * from <plugin>.root.`testdata` limit 100;

Query Plan:

00-00    Screen : rowType = RecordType(ANY *): rowcount = 100.0, cumulative cost = {32810.0 rows, 33110.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1429
00-01      Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1428
00-02        SelectionVectorRemover : rowType = (DrillRecordRow[*]): rowcount = 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1427
00-03          Limit(fetch=[100]) : rowType = (DrillRecordRow[*]): rowcount = 100.0, cumulative cost = {32700.0 rows, 33000.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1426
00-04            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/testdata/part-r-00000-097f7399-7bfb-4e93-b883-3348655fc658.parquet]], selectionRoot=/testdata, numFiles=1, usedMetadataFile=true, cacheFileRoot=/testdata, columns=[`*`]]]) : rowType = (DrillRecordRow[*]): rowcount = 32600.0, cumulative cost = {32600.0 rows, 32600.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1425

 

I am using Drill 1.8, set up on a 5-node 32GB cluster, and the data is in 
Oracle Storage Cloud Service. When I run the same query on a 1GB TSV file in 
this location, it takes only 38 seconds.

Also, testdata contains around 2144 .parquet files, each around 500KB.

 

Is there any additional configuration required for parquet?

Kindly suggest how to improve the response time here.

 

Regards
Jeena