Please find attached the full JSON profile.
Regards
Jeena
-----Original Message-----
From: Padma Penumarthy [mailto:[email protected]]
Sent: Saturday, February 25, 2017 3:31 AM
To: [email protected]
Subject: Re: Explain Plan for Parquet data is taking a lot of time
Yes, please do send the JSON profile.
Thanks,
Padma
> On Feb 24, 2017, at 1:56 PM, Jeena Vinod <[email protected]> wrote:
>
> Thanks for the suggestions.
>
> I did run the REFRESH TABLE METADATA command on this path before firing the
> select query.
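>
> For reference, a minimal sketch of the command I ran, assuming the same
> plugin and workspace names as in the query further down this thread:
>
>     REFRESH TABLE METADATA <plugin>.root.`testdata`;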
>
> In Drill 1.9, there is an improvement in performance. I have 1.9 set up on a
> 2-node 16GB cluster, and here select * with limit 100 takes less time than on
> 1.8, though the number of rows in ParquetGroupScan remains unchanged. The
> select query takes around 8 minutes and the explain plan takes around 7
> minutes. Also, in the Web console profile, the query stays in the STARTING
> status for almost 7 minutes.
>
> Query Plan for 1.9:
> 00-00 Screen : rowType = RecordType(ANY *): rowcount = 100.0, cumulative
> cost = {32810.0 rows, 33110.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1721
> 00-01 Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 100.0,
> cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 network, 0.0
> memory}, id = 1720
> 00-02 SelectionVectorRemover : rowType = (DrillRecordRow[*]): rowcount
> = 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 network,
> 0.0 memory}, id = 1719
> 00-03 Limit(fetch=[100]) : rowType = (DrillRecordRow[*]): rowcount =
> 100.0, cumulative cost = {32700.0 rows, 33000.0 cpu, 0.0 io, 0.0 network, 0.0
> memory}, id = 1718
> 00-04 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath
> [path=/testdata/part-r-00000-097f7399-7bfb-4e93-b883-3348655fc658.parquet]],
> selectionRoot=/testdata, numFiles=1, usedMetadataFile=true,
> cacheFileRoot=/testdata, columns=[`*`]]]) : rowType = (DrillRecordRow[*]):
> rowcount = 32600.0, cumulative cost = {32600.0 rows, 32600.0 cpu, 0.0 io, 0.0
> network, 0.0 memory}, id = 1717
>
> And from the query profile, it looks like most of the time is spent in
> PARQUET_ROW_GROUP_SCAN. I can attach the full JSON profile if it helps.
>
> Can there be further improvement in performance with 1.9?
>
> Regards
> Jeena
>
>
> -----Original Message-----
> From: Padma Penumarthy [mailto:[email protected]]
> Sent: Friday, February 24, 2017 11:22 PM
> To: [email protected]
> Subject: Re: Explain Plan for Parquet data is taking a lot of time
>
> Yes, the limit is pushed down to the parquet reader in 1.9. But that will not
> help with planning time.
> It is definitely worth trying with 1.9 though.
>
> Thanks,
> Padma
>
>
>> On Feb 24, 2017, at 7:26 AM, Andries Engelbrecht <[email protected]>
>> wrote:
>>
>> Looks like the metadata cache is being used ("usedMetadataFile=true"). But
>> to be sure, did you perform a REFRESH TABLE METADATA <path to table> on the
>> parquet data?
>>
>>
>> However, it looks like it is reading a full batch ("rowcount = 32600.0,
>> cumulative cost = {32600.0 rows, 32600.0").
>>
>>
>> Didn't the limit operator get pushed down to the parquet reader in 1.9?
>>
>> Perhaps try 1.9 and see if the number of rows in the ParquetGroupScan gets
>> reduced to 100.
>>
>>
>> Can you look in the query profile to see where the time is spent, and also
>> check how long it takes before the query starts to run in the WebUI profile?
>>
>>
>> Best Regards
>>
>>
>> Andries Engelbrecht
>>
>>
>> Senior Solutions Architect
>>
>> MapR Alliances and Channels Engineering
>>
>>
>> [email protected]
>>
>>
>> ________________________________
>> From: Jinfeng Ni <[email protected]>
>> Sent: Thursday, February 23, 2017 4:53:34 PM
>> To: user
>> Subject: Re: Explain Plan for Parquet data is taking a lot of time
>>
>> The reason the plan shows only a single parquet file is that
>> "LIMIT 100" is applied and filters out the rest of them.
>>
>> Agreed that parquet metadata caching might help reduce planning time
>> when there is a large number of parquet files.
>>
>> On Thu, Feb 23, 2017 at 4:44 PM, rahul challapalli
>> <[email protected]> wrote:
>>> You said there are 2144 parquet files, but the plan suggests that you
>>> only have a single parquet file. In any case, it's a long time to plan the
>>> query.
>>> Did you try the metadata caching feature [1]?
>>>
>>> Also how many rowgroups and columns are present in the parquet file?
>>>
>>> [1]
>>> https://drill.apache.org/docs/optimizing-parquet-metadata-reading/
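>>>
>>> As a quick sanity check that the cache from [1] is picked up (a sketch,
>>> reusing the table path from the query quoted below): run the refresh once,
>>> then look at the Scan row of the plan.
>>>
>>>     REFRESH TABLE METADATA <plugin>.root.`testdata`;
>>>     -- the ParquetGroupScan in the plan should then show usedMetadataFile=true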
>>>
>>> - Rahul
>>>
>>> On Thu, Feb 23, 2017 at 4:24 PM, Jeena Vinod <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> Drill is taking 23 minutes for a simple select * query with limit
>>>> 100 on 1GB of uncompressed parquet data. EXPLAIN PLAN for this query
>>>> is also taking that long (~23 minutes).
>>>>
>>>> Query: select * from <plugin>.root.`testdata` limit 100;
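>>>>
>>>> The plan below was produced with Drill's EXPLAIN syntax on the same query
>>>> (a sketch, keeping the plugin placeholder from above):
>>>>
>>>>     EXPLAIN PLAN FOR select * from <plugin>.root.`testdata` limit 100;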
>>>>
>>>> Query Plan:
>>>>
>>>> 00-00 Screen : rowType = RecordType(ANY *): rowcount = 100.0,
>>>> cumulative cost = {32810.0 rows, 33110.0 cpu, 0.0 io, 0.0 network,
>>>> 0.0 memory}, id = 1429
>>>>
>>>> 00-01 Project(*=[$0]) : rowType = RecordType(ANY *): rowcount =
>>>> 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0
>>>> network,
>>>> 0.0 memory}, id = 1428
>>>>
>>>> 00-02 SelectionVectorRemover : rowType = (DrillRecordRow[*]):
>>>> rowcount = 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0
>>>> io, 0.0 network, 0.0 memory}, id = 1427
>>>>
>>>> 00-03 Limit(fetch=[100]) : rowType = (DrillRecordRow[*]):
>>>> rowcount = 100.0, cumulative cost = {32700.0 rows, 33000.0 cpu, 0.0
>>>> io, 0.0 network, 0.0 memory}, id = 1426
>>>>
>>>> 00-04 Scan(groupscan=[ParquetGroupScan
>>>> [entries=[ReadEntryWithPath [path=/testdata/part-r-00000-
>>>> 097f7399-7bfb-4e93-b883-3348655fc658.parquet]],
>>>> selectionRoot=/testdata, numFiles=1, usedMetadataFile=true,
>>>> cacheFileRoot=/testdata,
>>>> columns=[`*`]]]) : rowType = (DrillRecordRow[*]): rowcount =
>>>> 32600.0, cumulative cost = {32600.0 rows, 32600.0 cpu, 0.0 io, 0.0
>>>> network, 0.0 memory}, id = 1425
>>>>
>>>>
>>>>
>>>> I am using Drill 1.8, set up on a 5-node 32GB cluster, and the data is
>>>> in Oracle Storage Cloud Service. When I run the same query on a 1GB TSV
>>>> file in this location, it takes only 38 seconds.
>>>>
>>>> Also, testdata contains around 2144 .parquet files, each around 500KB.
>>>>
>>>>
>>>>
>>>> Is there any additional configuration required for parquet?
>>>>
>>>> Kindly suggest how to improve the response time here.
>>>>
>>>>
>>>>
>>>> Regards
>>>> Jeena
>
{
"id": {
"part1": 2832582062316356600,
"part2": 4428624082760145400
},
"type": 1,
"start": 1487971742419,
"end": 1487972129557,
"query": "select * from <plugin>.`testdata` limit 100",
"plan": "00-00 Screen : rowType = RecordType(ANY *): rowcount = 100.0,
cumulative cost = {32810.0 rows, 33110.0 cpu, 0.0 io, 0.0 network, 0.0 memory},
id = 2844\n00-01 Project(*=[$0]) : rowType = RecordType(ANY *): rowcount =
100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 network, 0.0
memory}, id = 2843\n00-02 SelectionVectorRemover : rowType =
(DrillRecordRow[*]): rowcount = 100.0, cumulative cost = {32800.0 rows, 33100.0
cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 2842\n00-03
Limit(fetch=[100]) : rowType = (DrillRecordRow[*]): rowcount = 100.0,
cumulative cost = {32700.0 rows, 33000.0 cpu, 0.0 io, 0.0 network, 0.0 memory},
id = 2841\n00-04 Scan(groupscan=[ParquetGroupScan
[entries=[ReadEntryWithPath
[path=/testdata/part-r-00000-097f7399-7bfb-4e93-b883-3348655fc658.parquet]],
selectionRoot=/testdata, numFiles=1, usedMetadataFile=true,
cacheFileRoot=/testdata, columns=[`*`]]]) : rowType = (DrillRecordRow[*]):
rowcount = 32600.0, cumulative cost = {32600.0 rows, 32600.0 cpu, 0.0 io, 0.0
network, 0.0 memory}, id = 2840\n",
"foreman": {
"address": "xxx",
"userPort": 31010,
"controlPort": 31011,
"dataPort": 31012
},
"state": 2,
"totalFragments": 1,
"finishedFragments": 0,
"fragmentProfile": [
{
"majorFragmentId": 0,
"minorFragmentProfile": [
{
"state": 3,
"minorFragmentId": 0,
"operatorProfile": [
{
"inputProfile": [
{
"records": 100,
"batches": 1,
"schemas": 1
}
],
"operatorId": 4,
"operatorType": 21,
"setupNanos": 0,
"processNanos": 1425997968,
"peakLocalMemoryAllocated": 30009856,
"metric": [
{
"metricId": 0,
"longValue": 8
},
{
"metricId": 6,
"longValue": 294015
},
{
"metricId": 5,
"longValue": 120211
},
{
"metricId": 3,
"longValue": 0
},
{
"metricId": 4,
"longValue": 0
},
{
"metricId": 9,
"longValue": 9615620
},
{
"metricId": 15,
"longValue": 31376
},
{
"metricId": 10,
"longValue": 0
},
{
"metricId": 16,
"longValue": 9584244
},
{
"metricId": 11,
"longValue": 303668
},
{
"metricId": 12,
"longValue": 499130
},
{
"metricId": 13,
"longValue": 0
},
{
"metricId": 8,
"longValue": 0
},
{
"metricId": 14,
"longValue": 0
},
{
"metricId": 7,
"longValue": 0
},
{
"metricId": 1,
"longValue": 0
},
{
"metricId": 2,
"longValue": 8
}
],
"waitNanos": 0
},
{
"inputProfile": [
{
"records": 100,
"batches": 1,
"schemas": 1
}
],
"operatorId": 3,
"operatorType": 7,
"setupNanos": 338378,
"processNanos": 338991,
"peakLocalMemoryAllocated": 19157760,
"waitNanos": 0
},
{
"inputProfile": [
{
"records": 100,
"batches": 1,
"schemas": 1
}
],
"operatorId": 2,
"operatorType": 14,
"setupNanos": 8199887,
"processNanos": 5983803,
"peakLocalMemoryAllocated": 802816,
"waitNanos": 0
},
{
"inputProfile": [
{
"records": 100,
"batches": 1,
"schemas": 1
}
],
"operatorId": 1,
"operatorType": 10,
"setupNanos": 1575386,
"processNanos": 268240,
"peakLocalMemoryAllocated": 524288,
"waitNanos": 0
},
{
"inputProfile": [
{
"records": 100,
"batches": 1,
"schemas": 1
}
],
"operatorId": 0,
"operatorType": 13,
"setupNanos": 0,
"processNanos": 533734,
"peakLocalMemoryAllocated": 524288,
"metric": [
{
"metricId": 0,
"longValue": 106190
}
],
"waitNanos": 53964820
}
],
"startTime": 1487972127535,
"endTime": 1487972129551,
"memoryUsed": 0,
"maxMemoryUsed": 34009856,
"endpoint": {
"address": "xxx",
"userPort": 31010,
"controlPort": 31011,
"dataPort": 31012
},
"lastUpdate": 1487972129553,
"lastProgress": 1487972129553
}
]
}
],
"user": "anonymous",
"optionsJson": "[ ]"
}