Re: Performance with multiple FLATTENs
One solution seems to be to pre-flatten the data in a CTE, resulting in dramatically lower runtimes: ~~~ WITH flat AS (SELECT id, FLATTEN(data) AS data) SELECT id, data[0] AS dttm, data)[1] AS result FROM flat ~~~ This was tested on a single node, and each JSON array to be flattened has 1,440 elements. I can extract some query profiles, and will enter a Jira if the plan looks like it could be optimized. On 19 Jul 2016, at 16:40, rahul challapalli wrote: Matt, Having multiple flatten's in your query leads to cross-join between the output of each flatten. So a performance hit is expected with the addition of each flatten. And there could also be a genuine performance bug for this scenario. To be sure it is a bug we need more information as Abhishek pointed out. However if you want to do some computations after you flattened out your query, it might be helpful sometimes to rewrite the query such that multiple flatten's fall in multiple sub-queries. You may see some performance improvement. Let me know how it goes. - Rahul On Tue, Jul 19, 2016 at 1:22 PM, Abhishek Girish wrote: Hi Matt, Can you please share more information on your setup, specifically the size of your dataset, including an approximate average size of individual JSON files, the number of nodes, including Drillbit memory config. Also can you share the query profiles for the few scenarios you mention. Regards, Abhishek On Friday, July 15, 2016, Matt wrote: I have JSON data with with a nested list and am using FLATTEN to extract two of three list elements as: ~~~ SELECT id, FLATTEN(data)[0] AS dttm, FLATTEN(data)[1] AS result FROM ... ~~~ This works, but each FLATTEN seems to slow the query down dramatically, 3x slower with the second flatten. Is there a better approach to extracting list elements? ~~~ [ { "id": 16, "data": [ [ "2016-07-13 00:00", 509, "OK" ], [ "2016-07-13 00:01", 461, "OK" ], [ "2016-07-13 00:02", 508, "OK" ], ~~~
Re: Performance with multiple FLATTENs
Matt, Having multiple flatten's in your query leads to cross-join between the output of each flatten. So a performance hit is expected with the addition of each flatten. And there could also be a genuine performance bug for this scenario. To be sure it is a bug we need more information as Abhishek pointed out. However if you want to do some computations after you flattened out your query, it might be helpful sometimes to rewrite the query such that multiple flatten's fall in multiple sub-queries. You may see some performance improvement. Let me know how it goes. - Rahul On Tue, Jul 19, 2016 at 1:22 PM, Abhishek Girish wrote: > Hi Matt, > > Can you please share more information on your setup, specifically the size > of your dataset, including an approximate average size of individual JSON > files, the number of nodes, including Drillbit memory config. > > Also can you share the query profiles for the few scenarios you mention. > > Regards, > Abhishek > > On Friday, July 15, 2016, Matt wrote: > > > I have JSON data with with a nested list and am using FLATTEN to extract > > two of three list elements as: > > > > ~~~ > > SELECT id, FLATTEN(data)[0] AS dttm, FLATTEN(data)[1] AS result FROM ... > > ~~~ > > > > This works, but each FLATTEN seems to slow the query down dramatically, > 3x > > slower with the second flatten. > > > > Is there a better approach to extracting list elements? > > > > ~~~ > > [ > > { > > "id": 16, > > "data": [ > > [ > > "2016-07-13 00:00", > > 509, > > "OK" > > ], > > [ > > "2016-07-13 00:01", > > 461, > > "OK" > > ], > > [ > > "2016-07-13 00:02", > > 508, > > "OK" > > ], > > ~~~ > > >
Re: Performance with multiple FLATTENs
Hi Matt, Can you please share more information on your setup, specifically the size of your dataset, including an approximate average size of individual JSON files, the number of nodes, including Drillbit memory config. Also can you share the query profiles for the few scenarios you mention. Regards, Abhishek On Friday, July 15, 2016, Matt wrote: > I have JSON data with with a nested list and am using FLATTEN to extract > two of three list elements as: > > ~~~ > SELECT id, FLATTEN(data)[0] AS dttm, FLATTEN(data)[1] AS result FROM ... > ~~~ > > This works, but each FLATTEN seems to slow the query down dramatically, 3x > slower with the second flatten. > > Is there a better approach to extracting list elements? > > ~~~ > [ > { > "id": 16, > "data": [ > [ > "2016-07-13 00:00", > 509, > "OK" > ], > [ > "2016-07-13 00:01", > 461, > "OK" > ], > [ > "2016-07-13 00:02", > 508, > "OK" > ], > ~~~ >
Performance with multiple FLATTENs
I have JSON data with with a nested list and am using FLATTEN to extract two of three list elements as: ~~~ SELECT id, FLATTEN(data)[0] AS dttm, FLATTEN(data)[1] AS result FROM ... ~~~ This works, but each FLATTEN seems to slow the query down dramatically, 3x slower with the second flatten. Is there a better approach to extracting list elements? ~~~ [ { "id": 16, "data": [ [ "2016-07-13 00:00", 509, "OK" ], [ "2016-07-13 00:01", 461, "OK" ], [ "2016-07-13 00:02", 508, "OK" ], ~~~