Yes, the limit is pushed down to the parquet reader in 1.9, but that will not help with planning time. It is definitely worth trying 1.9 though.
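As a quick, purely illustrative check on 1.9 (reusing the <plugin>.root.`testdata` placeholder path from the thread below), you could run the plan again and see whether the rowcount on the ParquetGroupScan drops from 32600 toward 100:

    EXPLAIN PLAN FOR
    SELECT * FROM <plugin>.root.`testdata` LIMIT 100;

If the scan still reports the full row count, the limit pushdown is not taking effect and the long planning time is more likely the footer/metadata reading discussed below.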
Thanks,
Padma

> On Feb 24, 2017, at 7:26 AM, Andries Engelbrecht <[email protected]> wrote:
>
> Looks like the metadata cache is being used ("usedMetadataFile=true"). But
> to be sure, did you perform a REFRESH TABLE METADATA <path to table> on the
> parquet data?
>
> However, it looks like it is reading a full batch: "rowcount = 32600.0,
> cumulative cost = {32600.0 rows, 32600.0".
>
> Didn't the limit operator get pushed down to the parquet reader in 1.9?
>
> Perhaps try 1.9 and see if in the ParquetGroupScan the number of rows gets
> reduced to 100.
>
> Can you look in the query profile where the time is spent, and also how long
> it takes before the query starts to run in the WebUI profile?
>
> Best Regards
>
> Andries Engelbrecht
> Senior Solutions Architect
> MapR Alliances and Channels Engineering
> [email protected]
>
> ________________________________
> From: Jinfeng Ni <[email protected]>
> Sent: Thursday, February 23, 2017 4:53:34 PM
> To: user
> Subject: Re: Explain Plan for Parquet data is taking a lot of time
>
> The reason the plan shows only a single parquet file is because
> "LIMIT 100" is applied and filters out the rest of them.
>
> Agreed that parquet metadata caching might help reduce planning time
> when there is a large number of parquet files.
>
> On Thu, Feb 23, 2017 at 4:44 PM, rahul challapalli
> <[email protected]> wrote:
>> You said there are 2144 parquet files, but the plan suggests that you only
>> have a single parquet file. In any case it's a long time to plan the query.
>> Did you try the metadata caching feature [1]?
>>
>> Also, how many rowgroups and columns are present in the parquet file?
>>
>> [1] https://drill.apache.org/docs/optimizing-parquet-metadata-reading/
>>
>> - Rahul
>>
>> On Thu, Feb 23, 2017 at 4:24 PM, Jeena Vinod <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Drill is taking 23 minutes for a simple select * query with limit 100 on
>>> 1GB of uncompressed parquet data. EXPLAIN PLAN for this query is also
>>> taking that long (~23 minutes).
>>>
>>> Query: select * from <plugin>.root.`testdata` limit 100;
>>>
>>> Query Plan:
>>>
>>> 00-00  Screen : rowType = RecordType(ANY *): rowcount = 100.0,
>>> cumulative cost = {32810.0 rows, 33110.0 cpu, 0.0 io, 0.0 network, 0.0
>>> memory}, id = 1429
>>>
>>> 00-01    Project(*=[$0]) : rowType = RecordType(ANY *): rowcount =
>>> 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 network,
>>> 0.0 memory}, id = 1428
>>>
>>> 00-02      SelectionVectorRemover : rowType = (DrillRecordRow[*]):
>>> rowcount = 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io,
>>> 0.0 network, 0.0 memory}, id = 1427
>>>
>>> 00-03        Limit(fetch=[100]) : rowType = (DrillRecordRow[*]):
>>> rowcount = 100.0, cumulative cost = {32700.0 rows, 33000.0 cpu, 0.0 io,
>>> 0.0 network, 0.0 memory}, id = 1426
>>>
>>> 00-04          Scan(groupscan=[ParquetGroupScan
>>> [entries=[ReadEntryWithPath [path=/testdata/part-r-00000-
>>> 097f7399-7bfb-4e93-b883-3348655fc658.parquet]], selectionRoot=/testdata,
>>> numFiles=1, usedMetadataFile=true, cacheFileRoot=/testdata,
>>> columns=[`*`]]]) : rowType = (DrillRecordRow[*]): rowcount = 32600.0,
>>> cumulative cost = {32600.0 rows, 32600.0 cpu, 0.0 io, 0.0 network, 0.0
>>> memory}, id = 1425
>>>
>>> I am using Drill 1.8, set up on a 5-node 32GB cluster, and the data is in
>>> Oracle Storage Cloud Service. When I run the same query on a 1GB TSV file
>>> in the same location, it takes only 38 seconds.
>>>
>>> Also, testdata contains around 2144 .parquet files, each around 500KB.
>>>
>>> Is there any additional configuration required for parquet?
>>>
>>> Kindly suggest how to improve the response time here.
>>>
>>> Regards,
>>> Jeena
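For the REFRESH TABLE METADATA step Andries mentions above (and the metadata caching doc Rahul links to), the command takes the table path; with the same placeholder path used in the thread it would look like:

    REFRESH TABLE METADATA <plugin>.root.`testdata`;

With roughly 2144 small files, building the metadata cache once should keep the planner from re-reading every parquet footer on each query, which is a likely contributor to the ~23-minute planning time.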
