Please find attached the full JSON profile.
Regards
Jeena
-----Original Message-----
From: Padma Penumarthy [mailto:[email protected]]
Sent: Saturday, February 25, 2017 3:31 AM
To: [email protected]
Subject: Re: Explain Plan for Parquet data is taking a lot of time
Yes, please do send the JSON profile.
Thanks,
Padma
> On Feb 24, 2017, at 1:56 PM, Jeena Vinod <[email protected]> wrote:
>
> Thanks for the suggestions.
>
> I did run the REFRESH TABLE METADATA command on this path before firing the
> select query.
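>
> For reference, a minimal sketch of the command I ran, assuming the same
> plugin and workspace names as in the query further down this thread:
>
>     REFRESH TABLE METADATA <plugin>.root.`testdata`;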
>
> In Drill 1.9, there is an improvement in performance. I have 1.9 set up on a
> 2-node 16GB cluster, and here select * with limit 100 takes less time than on
> 1.8, though the number of rows in ParquetGroupScan remains unchanged. The
> select query takes around 8 minutes and the explain plan takes around 7
> minutes. Also, in the Web console profile, the query stays in the STARTING
> status for almost 7 minutes.
>
> Query Plan for 1.9:
> 00-00 Screen : rowType = RecordType(ANY *): rowcount = 100.0, cumulative
> cost = {32810.0 rows, 33110.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1721
> 00-01 Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 100.0,
> cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 network, 0.0
> memory}, id = 1720
> 00-02 SelectionVectorRemover : rowType = (DrillRecordRow[*]): rowcount
> = 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 network,
> 0.0 memory}, id = 1719
> 00-03 Limit(fetch=[100]) : rowType = (DrillRecordRow[*]): rowcount =
> 100.0, cumulative cost = {32700.0 rows, 33000.0 cpu, 0.0 io, 0.0 network, 0.0
> memory}, id = 1718
> 00-04 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath
> [path=/testdata/part-r-00000-097f7399-7bfb-4e93-b883-3348655fc658.parquet]],
> selectionRoot=/testdata, numFiles=1, usedMetadataFile=true,
> cacheFileRoot=/testdata, columns=[`*`]]]) : rowType = (DrillRecordRow[*]):
> rowcount = 32600.0, cumulative cost = {32600.0 rows, 32600.0 cpu, 0.0 io, 0.0
> network, 0.0 memory}, id = 1717
>
> And from the query profile, it looks like most of the time is spent in
> PARQUET_ROW_GROUP_SCAN. I can attach the full JSON profile if it helps.
>
> Can there be further improvement in performance with 1.9?
>
> Regards
> Jeena
>
>
> -----Original Message-----
> From: Padma Penumarthy [mailto:[email protected]]
> Sent: Friday, February 24, 2017 11:22 PM
> To: [email protected]
> Subject: Re: Explain Plan for Parquet data is taking a lot of time
>
> Yes, the limit is pushed down to the parquet reader in 1.9. But that will not
> help with planning time.
> It is definitely worth trying with 1.9 though.
>
> Thanks,
> Padma
>
>
>> On Feb 24, 2017, at 7:26 AM, Andries Engelbrecht <[email protected]>
>> wrote:
>>
>> Looks like the metadata cache is being used ("usedMetadataFile=true"). But
>> to be sure, did you perform a REFRESH TABLE METADATA <path to table> on the
>> parquet data?
>>
>>
>> However, it looks like it is reading a full batch ("rowcount = 32600.0,
>> cumulative cost = {32600.0 rows, 32600.0").
>>
>>
>> Didn't the limit operator get pushed down to the parquet reader in 1.9?
>>
>> Perhaps try 1.9 and see if the number of rows in the ParquetGroupScan gets
>> reduced to 100.
>>
>>
>> Can you look in the query profile to see where the time is spent, and also
>> check how long it takes before the query starts to run in the WebUI profile?
>>
>>
>> Best Regards
>>
>>
>> Andries Engelbrecht
>>
>>
>> Senior Solutions Architect
>>
>> MapR Alliances and Channels Engineering
>>
>>
>> [email protected]
>>
>>
>> ________________________________
>> From: Jinfeng Ni <[email protected]>
>> Sent: Thursday, February 23, 2017 4:53:34 PM
>> To: user
>> Subject: Re: Explain Plan for Parquet data is taking a lot of time
>>
>> The reason the plan shows only a single parquet file is that
>> "LIMIT 100" is applied and filters out the rest of them.
>>
>> Agreed that parquet metadata caching might help reduce planning time
>> when there is a large number of parquet files.
>>
>> On Thu, Feb 23, 2017 at 4:44 PM, rahul challapalli
>> <[email protected]> wrote:
>>> You said there are 2144 parquet files, but the plan suggests that you
>>> only have a single parquet file. In any case, it's a long time to plan the
>>> query.
>>> Did you try the metadata caching feature [1]?
>>>
>>> Also how many rowgroups and columns are present in the parquet file?
>>>
>>> [1]
>>> https://drill.apache.org/docs/optimizing-parquet-metadata-reading/
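>>>
>>> As a quick sanity check that the cache from [1] is picked up (a sketch,
>>> reusing the table path from the query quoted below): run the refresh once,
>>> then look at the Scan row of the plan.
>>>
>>>     REFRESH TABLE METADATA <plugin>.root.`testdata`;
>>>     -- the ParquetGroupScan in the plan should then show usedMetadataFile=true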
>>>
>>> - Rahul
>>>
>>> On Thu, Feb 23, 2017 at 4:24 PM, Jeena Vinod <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> Drill is taking 23 minutes for a simple select * query with limit
>>>> 100 on 1GB of uncompressed parquet data. EXPLAIN PLAN for this query
>>>> is also taking that long (~23 minutes).
>>>>
>>>> Query: select * from <plugin>.root.`testdata` limit 100;
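>>>>
>>>> The plan below was produced with Drill's EXPLAIN syntax on the same query
>>>> (a sketch, keeping the plugin placeholder from above):
>>>>
>>>>     EXPLAIN PLAN FOR select * from <plugin>.root.`testdata` limit 100;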
>>>>
>>>> Query Plan:
>>>>
>>>> 00-00 Screen : rowType = RecordType(ANY *): rowcount = 100.0,
>>>> cumulative cost = {32810.0 rows, 33110.0 cpu, 0.0 io, 0.0 network,
>>>> 0.0 memory}, id = 1429
>>>>
>>>> 00-01 Project(*=[$0]) : rowType = RecordType(ANY *): rowcount =
>>>> 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0
>>>> network,
>>>> 0.0 memory}, id = 1428
>>>>
>>>> 00-02 SelectionVectorRemover : rowType = (DrillRecordRow[*]):
>>>> rowcount = 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0
>>>> io, 0.0 network, 0.0 memory}, id = 1427
>>>>
>>>> 00-03 Limit(fetch=[100]) : rowType = (DrillRecordRow[*]):
>>>> rowcount = 100.0, cumulative cost = {32700.0 rows, 33000.0 cpu, 0.0
>>>> io, 0.0 network, 0.0 memory}, id = 1426
>>>>
>>>> 00-04 Scan(groupscan=[ParquetGroupScan
>>>> [entries=[ReadEntryWithPath [path=/testdata/part-r-00000-
>>>> 097f7399-7bfb-4e93-b883-3348655fc658.parquet]],
>>>> selectionRoot=/testdata, numFiles=1, usedMetadataFile=true,
>>>> cacheFileRoot=/testdata,
>>>> columns=[`*`]]]) : rowType = (DrillRecordRow[*]): rowcount =
>>>> 32600.0, cumulative cost = {32600.0 rows, 32600.0 cpu, 0.0 io, 0.0
>>>> network, 0.0 memory}, id = 1425
>>>>
>>>>
>>>>
>>>> I am using Drill 1.8, set up on a 5-node 32GB cluster, and the data is
>>>> in Oracle Storage Cloud Service. When I run the same query on a 1GB TSV
>>>> file in this location, it takes only 38 seconds.
>>>>
>>>> Also, testdata contains around 2144 .parquet files, each around 500KB.
>>>>
>>>>
>>>>
>>>> Is there any additional configuration required for parquet?
>>>>
>>>> Kindly suggest how to improve the response time here.
>>>>
>>>>
>>>>
>>>> Regards
>>>> Jeena
>
{
"id": {
"part1": 2832582062316356600,
"part2": 4428624082760145400
},
"type": 1,
"start": 1487971742419,
"end": 1487972129557,
"query": "select * from <plugin>.`testdata` limit 100",
"plan": "00-00 Screen : rowType = RecordType(ANY *): rowcount = 100.0,
cumulative cost = {32810.0 rows, 33110.0 cpu, 0.0 io, 0.0 network, 0.0 memory},
id = 2844\n00-01 Project(*=[$0]) : rowType = RecordType(ANY *): rowcount =
100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 network, 0.0
memory}, id = 2843\n00-02 SelectionVectorRemover : rowType =
(DrillRecordRow[*]): rowcount = 100.0, cumulative cost = {32800.0 rows, 33100.0
cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 2842\n00-03
Limit(fetch=[100]) : rowType = (DrillRecordRow[*]): rowcount = 100.0,
cumulative cost = {32700.0 rows, 33000.0 cpu, 0.0 io, 0.0 network, 0.0 memory},
id = 2841\n00-04 Scan(groupscan=[ParquetGroupScan
[entries=[ReadEntryWithPath
[path=/testdata/part-r-00000-097f7399-7bfb-4e93-b883-3348655fc658.parquet]],
selectionRoot=/testdata, numFiles=1, usedMetadataFile=true,
cacheFileRoot=/testdata, columns=[`*`]]]) : rowType = (DrillRecordRow[*]):
rowcount = 32600.0, cumulative cost = {32600.0 rows, 32600.0 cpu, 0.0 io, 0.0
network, 0.0 memory}, id = 2840\n",
"foreman": {
"address": "xxx",
"userPort": 31010,
"controlPort": 31011,
"dataPort": 31012
},
"state": 2,
"totalFragments": 1,
"finishedFragments": 0,
"fragmentProfile": [
{
"majorFragmentId": 0,
"minorFragmentProfile": [
{
"state": 3,
"minorFragmentId": 0,
"operatorProfile": [
{
"inputProfile": [
{
"records": 100,
"batches": 1,
"schemas": 1
}
],
"operatorId": 4,
"operatorType": 21,
"setupNanos": 0,
"processNanos": 1425997968,
"peakLocalMemoryAllocated": 30009856,
"metric": [
{
"metricId": 0,
"longValue": 8
},
{
"metricId": 6,
"longValue": 294015
},
{
"metricId": 5,
"longValue": 120211
},
{
"metricId": 3,
"longValue": 0
},
{
"metricId": 4,
"longValue": 0
},
{
"metricId": 9,
"longValue": 9615620
},
{
"metricId": 15,
"longValue": 31376
},
{
"metricId": 10,
"longValue": 0
},
{
"metricId": 16,
"longValue": 9584244
},
{
"metricId": 11,
"longValue": 303668
},
{
"metricId": 12,
"longValue": 499130
},
{
"metricId": 13,
"longValue": 0
},
{
"metricId": 8,
"longValue": 0
},
{
"metricId": 14,
"longValue": 0
},
{
"metricId": 7,
"longValue": 0
},
{
"metricId": 1,
"longValue": 0
},
{
"metricId": 2,
"longValue": 8
}
],
"waitNanos": 0
},
{
"inputProfile": [
{
"records": 100,
"batches": 1,
"schemas": 1
}
],
"operatorId": 3,
"operatorType": 7,
"setupNanos": 338378,
"processNanos": 338991,
"peakLocalMemoryAllocated": 19157760,
"waitNanos": 0
},
{
"inputProfile": [
{
"records": 100,
"batches": 1,
"schemas": 1
}
],
"operatorId": 2,
"operatorType": 14,
"setupNanos": 8199887,
"processNanos": 5983803,
"peakLocalMemoryAllocated": 802816,
"waitNanos": 0
},
{
"inputProfile": [
{
"records": 100,
"batches": 1,
"schemas": 1
}
],
"operatorId": 1,
"operatorType": 10,
"setupNanos": 1575386,
"processNanos": 268240,
"peakLocalMemoryAllocated": 524288,
"waitNanos": 0
},
{
"inputProfile": [
{
"records": 100,
"batches": 1,
"schemas": 1
}
],
"operatorId": 0,
"operatorType": 13,
"setupNanos": 0,
"processNanos": 533734,
"peakLocalMemoryAllocated": 524288,
"metric": [
{
"metricId": 0,
"longValue": 106190
}
],
"waitNanos": 53964820
}
],
"startTime": 1487972127535,
"endTime": 1487972129551,
"memoryUsed": 0,
"maxMemoryUsed": 34009856,
"endpoint": {
"address": "xxx",
"userPort": 31010,
"controlPort": 31011,
"dataPort": 31012
},
"lastUpdate": 1487972129553,
"lastProgress": 1487972129553
}
]
}
],
"user": "anonymous",
"optionsJson": "[ ]"
}