It looks like the metadata cache is being used ("usedMetadataFile=true"). But to
be sure, did you perform a REFRESH TABLE METADATA <path to table> on the parquet
data?
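
For example, something along these lines should rebuild the cache (using the
same <plugin>.root path as in the query below; substitute your actual storage
plugin name):

REFRESH TABLE METADATA <plugin>.root.`testdata`;

With the cache in place the planner only has to read the cache file under
/testdata instead of opening the footers of all 2144 parquet files.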


However, it looks like it is still reading a full batch: "rowcount = 32600.0,
cumulative cost = {32600.0 rows, 32600.0 ...".


Didn't the limit operator get pushed down to the parquet reader in 1.9?

Perhaps try 1.9 and see if the number of rows in the ParquetGroupScan gets
reduced to 100.
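
To check, you could run the explain again on 1.9, for example:

EXPLAIN PLAN FOR select * from <plugin>.root.`testdata` limit 100;

and look at the ParquetGroupScan in the output. If the limit is pushed down, the
rowcount on the Scan should be much smaller than the 32600.0 shown above
(ideally close to 100).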


Can you look in the query profile to see where the time is spent, and also check
how long it takes before the query starts to run, in the Web UI profile?


Best Regards


Andries Engelbrecht


Senior Solutions Architect

MapR Alliances and Channels Engineering


aengelbre...@mapr.com



________________________________
From: Jinfeng Ni <j...@apache.org>
Sent: Thursday, February 23, 2017 4:53:34 PM
To: user
Subject: Re: Explain Plan for Parquet data is taking a lot of time

The reason the plan shows only a single parquet file is that
"LIMIT 100" is applied and filters out the rest of them.

Agreed that parquet metadata caching might help reduce planning time
when there is a large number of parquet files.

On Thu, Feb 23, 2017 at 4:44 PM, rahul challapalli
<challapallira...@gmail.com> wrote:
> You said there are 2144 parquet files, but the plan suggests that you only
> have a single parquet file. In any case it's a long time to plan the query.
> Did you try the metadata caching feature [1]?
>
> Also, how many row groups and columns are present in the parquet file?
>
> [1] https://drill.apache.org/docs/optimizing-parquet-metadata-reading/
>
> - Rahul
>
> On Thu, Feb 23, 2017 at 4:24 PM, Jeena Vinod <jeena.vi...@oracle.com> wrote:
>
>> Hi,
>>
>>
>>
>> Drill is taking 23 minutes for a simple select * query with limit 100 on
>> 1GB of uncompressed parquet data. EXPLAIN PLAN for this query is also taking
>> that long (~23 minutes).
>>
>> Query: select * from <plugin>.root.`testdata` limit 100;
>>
>> Query  Plan:
>>
>> 00-00    Screen : rowType = RecordType(ANY *): rowcount = 100.0,
>> cumulative cost = {32810.0 rows, 33110.0 cpu, 0.0 io, 0.0 network, 0.0
>> memory}, id = 1429
>>
>> 00-01      Project(*=[$0]) : rowType = RecordType(ANY *): rowcount =
>> 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 network,
>> 0.0 memory}, id = 1428
>>
>> 00-02        SelectionVectorRemover : rowType = (DrillRecordRow[*]):
>> rowcount = 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0
>> network, 0.0 memory}, id = 1427
>>
>> 00-03          Limit(fetch=[100]) : rowType = (DrillRecordRow[*]):
>> rowcount = 100.0, cumulative cost = {32700.0 rows, 33000.0 cpu, 0.0 io, 0.0
>> network, 0.0 memory}, id = 1426
>>
>> 00-04            Scan(groupscan=[ParquetGroupScan
>> [entries=[ReadEntryWithPath [path=/testdata/part-r-00000-
>> 097f7399-7bfb-4e93-b883-3348655fc658.parquet]], selectionRoot=/testdata,
>> numFiles=1, usedMetadataFile=true, cacheFileRoot=/testdata,
>> columns=[`*`]]]) : rowType = (DrillRecordRow[*]): rowcount = 32600.0,
>> cumulative cost = {32600.0 rows, 32600.0 cpu, 0.0 io, 0.0 network, 0.0
>> memory}, id = 1425
>>
>>
>>
>> I am using Drill 1.8, set up on a 5-node, 32GB cluster, and the data is in
>> Oracle Storage Cloud Service. When I run the same query on a 1GB TSV file
>> in this location, it takes only 38 seconds.
>>
>> Also, testdata contains around 2144 .parquet files, each around 500KB.
>>
>>
>>
>> Is there any additional configuration required for parquet?
>>
>> Kindly suggest how to improve the response time here.
>>
>>
>>
>> Regards
>> Jeena
>>
