Re: Performance with multiple FLATTENs

2016-07-19 Thread MattK
One solution seems to be to pre-flatten the data in a CTE, resulting in 
dramatically lower runtimes:


~~~
WITH flat AS (SELECT id, FLATTEN(data) AS data)
SELECT id, data[0] AS dttm, data)[1] AS result
FROM flat
~~~

This was tested on a single node, and each JSON array to be flattened 
has 1,440 elements. I can extract some query profiles, and will enter a 
Jira if the plan looks like it could be optimized.



On 19 Jul 2016, at 16:40, rahul challapalli wrote:


Matt,

Having multiple flatten's in your query leads to cross-join between 
the
output of each flatten. So a performance hit is expected with the 
addition
of each flatten. And there could also be a genuine performance bug for 
this

scenario. To be sure it is a bug we need more information as Abhishek
pointed out.

However if you want to do some computations after you flattened out 
your

query, it might be helpful sometimes to rewrite the query such that
multiple flatten's fall in multiple sub-queries. You may see some
performance improvement. Let me know how it goes.

- Rahul

On Tue, Jul 19, 2016 at 1:22 PM, Abhishek Girish 


wrote:


Hi Matt,

Can you please share more information on your setup, specifically the 
size
of your dataset, including an approximate average size of individual 
JSON

files, the number of nodes, including Drillbit memory config.

Also can you share the query profiles for the few scenarios you 
mention.


Regards,
Abhishek

On Friday, July 15, 2016, Matt  wrote:

I have JSON data with with a nested list and am using FLATTEN to 
extract

two of three list elements as:

~~~
SELECT id, FLATTEN(data)[0] AS dttm, FLATTEN(data)[1] AS result FROM 
...

~~~

This works, but each FLATTEN seems to slow the query down 
dramatically,

3x

slower with the second flatten.

Is there a better approach to extracting list elements?

~~~
[
  {
"id": 16,
"data": [
  [
"2016-07-13 00:00",
509,
"OK"
  ],
  [
"2016-07-13 00:01",
461,
"OK"
  ],
  [
"2016-07-13 00:02",
508,
"OK"
  ],
~~~





Re: Performance with multiple FLATTENs

2016-07-19 Thread rahul challapalli
Matt,

Having multiple flatten's in your query leads to cross-join between the
output of each flatten. So a performance hit is expected with the addition
of each flatten. And there could also be a genuine performance bug for this
scenario. To be sure it is a bug we need more information as Abhishek
pointed out.

However if you want to do some computations after you flattened out your
query, it might be helpful sometimes to rewrite the query such that
multiple flatten's fall in multiple sub-queries. You may see some
performance improvement. Let me know how it goes.

- Rahul

On Tue, Jul 19, 2016 at 1:22 PM, Abhishek Girish 
wrote:

> Hi Matt,
>
> Can you please share more information on your setup, specifically the size
> of your dataset, including an approximate average size of individual JSON
> files, the number of nodes, including Drillbit memory config.
>
> Also can you share the query profiles for the few scenarios you mention.
>
> Regards,
> Abhishek
>
> On Friday, July 15, 2016, Matt  wrote:
>
> > I have JSON data with with a nested list and am using FLATTEN to extract
> > two of three list elements as:
> >
> > ~~~
> > SELECT id, FLATTEN(data)[0] AS dttm, FLATTEN(data)[1] AS result FROM ...
> > ~~~
> >
> > This works, but each FLATTEN seems to slow the query down dramatically,
> 3x
> > slower with the second flatten.
> >
> > Is there a better approach to extracting list elements?
> >
> > ~~~
> > [
> >   {
> > "id": 16,
> > "data": [
> >   [
> > "2016-07-13 00:00",
> > 509,
> > "OK"
> >   ],
> >   [
> > "2016-07-13 00:01",
> > 461,
> > "OK"
> >   ],
> >   [
> > "2016-07-13 00:02",
> > 508,
> > "OK"
> >   ],
> > ~~~
> >
>


Re: Performance with multiple FLATTENs

2016-07-19 Thread Abhishek Girish
Hi Matt,

Can you please share more information on your setup, specifically the size
of your dataset, including an approximate average size of individual JSON
files, the number of nodes, including Drillbit memory config.

Also can you share the query profiles for the few scenarios you mention.

Regards,
Abhishek

On Friday, July 15, 2016, Matt  wrote:

> I have JSON data with with a nested list and am using FLATTEN to extract
> two of three list elements as:
>
> ~~~
> SELECT id, FLATTEN(data)[0] AS dttm, FLATTEN(data)[1] AS result FROM ...
> ~~~
>
> This works, but each FLATTEN seems to slow the query down dramatically, 3x
> slower with the second flatten.
>
> Is there a better approach to extracting list elements?
>
> ~~~
> [
>   {
> "id": 16,
> "data": [
>   [
> "2016-07-13 00:00",
> 509,
> "OK"
>   ],
>   [
> "2016-07-13 00:01",
> 461,
> "OK"
>   ],
>   [
> "2016-07-13 00:02",
> 508,
> "OK"
>   ],
> ~~~
>


Performance with multiple FLATTENs

2016-07-15 Thread Matt
I have JSON data with with a nested list and am using FLATTEN to extract 
two of three list elements as:


~~~
SELECT id, FLATTEN(data)[0] AS dttm, FLATTEN(data)[1] AS result FROM ...
~~~

This works, but each FLATTEN seems to slow the query down dramatically, 
3x slower with the second flatten.


Is there a better approach to extracting list elements?

~~~
[
  {
"id": 16,
"data": [
  [
"2016-07-13 00:00",
509,
"OK"
  ],
  [
"2016-07-13 00:01",
461,
"OK"
  ],
  [
"2016-07-13 00:02",
508,
"OK"
  ],
~~~