One solution seems to be to pre-flatten the data in a CTE, resulting in dramatically lower runtimes:

~~~
WITH flat AS (SELECT id, FLATTEN(data) AS data FROM ...)
SELECT id, data[0] AS dttm, data[1] AS result
FROM flat
~~~

This was tested on a single node, and each JSON array to be flattened has 1,440 elements. I can extract some query profiles, and will enter a Jira if the plan looks like it could be optimized.


On 19 Jul 2016, at 16:40, rahul challapalli wrote:

Matt,

Having multiple FLATTENs in your query leads to a cross-join between the
outputs of each flatten, so a performance hit is expected with each
additional flatten. There could also be a genuine performance bug in this
scenario; to be sure it is a bug we need more information, as Abhishek
pointed out.

However, if you want to do some computations after flattening, it can
sometimes help to rewrite the query so that the multiple FLATTENs fall in
separate sub-queries. You may see some performance improvement. Let me
know how it goes.

- Rahul

On Tue, Jul 19, 2016 at 1:22 PM, Abhishek Girish <abhishek.gir...@gmail.com>
wrote:

Hi Matt,

Can you please share more information on your setup: specifically the size
of your dataset (including an approximate average size of the individual
JSON files), the number of nodes, and the Drillbit memory config.

Also, can you share the query profiles for the few scenarios you mention?

Regards,
Abhishek

On Friday, July 15, 2016, Matt <bsg...@gmail.com> wrote:

I have JSON data with a nested list and am using FLATTEN to extract
two of the three list elements as:

~~~
SELECT id, FLATTEN(data)[0] AS dttm, FLATTEN(data)[1] AS result FROM ...
~~~

This works, but each FLATTEN seems to slow the query down dramatically:
3x slower with the second flatten.

Is there a better approach to extracting list elements?

~~~
[
  {
    "id": 16,
    "data": [
      [
        "2016-07-13 00:00",
        509,
        "OK"
      ],
      [
        "2016-07-13 00:01",
        461,
        "OK"
      ],
      [
        "2016-07-13 00:02",
        508,
        "OK"
      ],
~~~

