Yes, the limit is pushed down to the parquet reader in 1.9, but that will not help with planning time. It is definitely worth trying 1.9 though.
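As a quick, purely illustrative check on 1.9 (reusing the <plugin>.root.`testdata` placeholder path from the thread below), you could run the plan again and see whether the rowcount on the ParquetGroupScan drops from 32600 toward 100:

    EXPLAIN PLAN FOR
    SELECT * FROM <plugin>.root.`testdata` LIMIT 100;

If the scan still reports the full row count, the limit pushdown is not taking effect and the long planning time is more likely the footer/metadata reading discussed below.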
Thanks,
Padma

> On Feb 24, 2017, at 7:26 AM, Andries Engelbrecht <[email protected]> wrote:
>
> Looks like the metadata cache is being used ("usedMetadataFile=true"). But
> to be sure, did you perform a REFRESH TABLE METADATA <path to table> on the
> parquet data?
>
> However, it looks like it is reading a full batch: "rowcount = 32600.0,
> cumulative cost = {32600.0 rows, 32600.0".
>
> Didn't the limit operator get pushed down to the parquet reader in 1.9?
>
> Perhaps try 1.9 and see if in the ParquetGroupScan the number of rows gets
> reduced to 100.
>
> Can you look in the query profile where the time is spent, and also how long
> it takes before the query starts to run in the WebUI profile?
>
> Best Regards
>
> Andries Engelbrecht
> Senior Solutions Architect
> MapR Alliances and Channels Engineering
> [email protected]
>
> ________________________________
> From: Jinfeng Ni <[email protected]>
> Sent: Thursday, February 23, 2017 4:53:34 PM
> To: user
> Subject: Re: Explain Plan for Parquet data is taking a lot of time
>
> The reason the plan shows only a single parquet file is because
> "LIMIT 100" is applied and filters out the rest of them.
>
> Agreed that parquet metadata caching might help reduce planning time
> when there is a large number of parquet files.
>
> On Thu, Feb 23, 2017 at 4:44 PM, rahul challapalli
> <[email protected]> wrote:
>> You said there are 2144 parquet files, but the plan suggests that you only
>> have a single parquet file. In any case it's a long time to plan the query.
>> Did you try the metadata caching feature [1]?
>>
>> Also, how many rowgroups and columns are present in the parquet file?
>>
>> [1] https://drill.apache.org/docs/optimizing-parquet-metadata-reading/
>>
>> - Rahul
>>
>> On Thu, Feb 23, 2017 at 4:24 PM, Jeena Vinod <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Drill is taking 23 minutes for a simple select * query with limit 100 on
>>> 1GB of uncompressed parquet data. EXPLAIN PLAN for this query is also
>>> taking that long (~23 minutes).
>>>
>>> Query: select * from <plugin>.root.`testdata` limit 100;
>>>
>>> Query Plan:
>>>
>>> 00-00  Screen : rowType = RecordType(ANY *): rowcount = 100.0,
>>> cumulative cost = {32810.0 rows, 33110.0 cpu, 0.0 io, 0.0 network, 0.0
>>> memory}, id = 1429
>>>
>>> 00-01    Project(*=[$0]) : rowType = RecordType(ANY *): rowcount =
>>> 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 network,
>>> 0.0 memory}, id = 1428
>>>
>>> 00-02      SelectionVectorRemover : rowType = (DrillRecordRow[*]):
>>> rowcount = 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io,
>>> 0.0 network, 0.0 memory}, id = 1427
>>>
>>> 00-03        Limit(fetch=[100]) : rowType = (DrillRecordRow[*]):
>>> rowcount = 100.0, cumulative cost = {32700.0 rows, 33000.0 cpu, 0.0 io,
>>> 0.0 network, 0.0 memory}, id = 1426
>>>
>>> 00-04          Scan(groupscan=[ParquetGroupScan
>>> [entries=[ReadEntryWithPath [path=/testdata/part-r-00000-
>>> 097f7399-7bfb-4e93-b883-3348655fc658.parquet]], selectionRoot=/testdata,
>>> numFiles=1, usedMetadataFile=true, cacheFileRoot=/testdata,
>>> columns=[`*`]]]) : rowType = (DrillRecordRow[*]): rowcount = 32600.0,
>>> cumulative cost = {32600.0 rows, 32600.0 cpu, 0.0 io, 0.0 network, 0.0
>>> memory}, id = 1425
>>>
>>> I am using Drill 1.8, set up on a 5-node 32GB cluster, and the data is in
>>> Oracle Storage Cloud Service. When I run the same query on a 1GB TSV file
>>> in the same location, it takes only 38 seconds.
>>>
>>> Also, testdata contains around 2144 .parquet files, each around 500KB.
>>>
>>> Is there any additional configuration required for parquet?
>>>
>>> Kindly suggest how to improve the response time here.
>>>
>>> Regards,
>>> Jeena
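For the REFRESH TABLE METADATA step Andries mentions above (and the metadata caching doc Rahul links to), the command takes the table path; with the same placeholder path used in the thread it would look like:

    REFRESH TABLE METADATA <plugin>.root.`testdata`;

With roughly 2144 small files, building the metadata cache once should keep the planner from re-reading every parquet footer on each query, which is a likely contributor to the ~23-minute planning time.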
