Re: Why are nested aggregations illegal? Best alternatives?

Julian Hyde Wed, 23 Feb 2022 09:36:16 -0800

It’s a parser error. That usually means that the user has made a mistake.


Try your SQL on another DB with JSON support before you declare this a Calcite 
bug. 

Julian

> On Feb 23, 2022, at 09:22, Gavin Ray <ray.gavi...@gmail.com> wrote:
> 
> No dice still unfortunately =/
> 
> If it's any easier, I put a self-contained single class reproduction on
> Github:
> https://github.com/GavinRay97/calcite-nested-json-subquery-bug/blob/main/src/test/java/com/example/AppTest.java
> 
> Is it worth filing a JIRA ticket over you think?
> 
> 
>> On Wed, Feb 23, 2022 at 3:45 AM Julian Hyde <jhyde.apa...@gmail.com> wrote:
>> 
>> Try ‘value ((‘ in place of ‘value (‘.
>> 
>> Julian
>> 
>>>> On Feb 21, 2022, at 9:33 AM, Gavin Ray <ray.gavi...@gmail.com> wrote:
>>> 
>>> I hadn't thought about the fact that ORM's probably have to solve this
>>> problem as well
>>> That is a great suggestion, I will try to investigate some of the popular
>>> ORM codebases and see if there are any tricks they are using.
>>> 
>>> I seem to maybe be getting a tiny bit closer by using subqueries like
>>> Julian suggested instead of operator calls
>>> But if I may ask what is probably a very stupid question:
>>> 
>>> What might the error message
>>> "parse failed: Query expression encountered in illegal context
>>> (state=,code=0)"
>>> 
>>> Mean in the below query?
>>> 
>>> The reason why I am confused is because the query runs if I remove the
>>> innermost subquery ("todos")
>>> But the innermost subquery is a direct copy-paste of the subquery above
>> it,
>>> so I know it MUST be valid
>>> 
>>> As usual, thank you so much for your help/guidance Stamatis.
>>> 
>>> select
>>> "g0"."id" "id",
>>> "g0"."address" "address",
>>> (
>>>   select json_arrayagg(json_object(
>>>     key 'id' value "g1"."id",
>>>     key 'todos' value (
>>>       select json_arrayagg(json_object(
>>>         key 'id' value "g2"."id",
>>>         key 'description' value "g2"."description",
>>>       ))
>>>       from (
>>>         select * from "todos"
>>>         where "g1"."id" = "user_id"
>>>         order by "id"
>>>       ) "g2"
>>>     )
>>>   ))
>>>   from (
>>>     select * from "users"
>>>     where "g0"."id" = "house_id"
>>>     order by "id"
>>>   ) "g1"
>>> ) "users"
>>> from "houses" "g0"
>>> order by "g0"."id"
>>> 
>>>> On Mon, Feb 21, 2022 at 8:07 AM Stamatis Zampetakis <zabe...@gmail.com>
>>>> wrote:
>>>> 
>>>> Hi Gavin,
>>>> 
>>>> A few more comments in case they help to get you a bit further on your
>>>> work.
>>>> 
>>>> The need to return the result as a single object is a common problem in
>>>> object relational mapping (ORM) frameworks/APIS (JPA, Datanucleus,
>>>> Hibernate, etc.). Apart from the suggestions so far maybe you could look
>>>> into these frameworks as well for more inspiration.
>>>> 
>>>> Moreover your approach of decomposing the query into individual parts is
>>>> commonly known as the N+1 problem [1].
>>>> 
>>>> Lastly, keep in mind that you can introduce custom UDF, UDAF functions
>> if
>>>> you need more flexibility on reconstructing the final result.
>>>> 
>>>> Best,
>>>> Stamatis
>>>> 
>>>> [1]
>>>> 
>>>> 
>> https://stackoverflow.com/questions/97197/what-is-the-n1-selects-problem-in-orm-object-relational-mapping
>>>> 
>>>>> On Sun, Feb 13, 2022 at 3:59 AM Gavin Ray <ray.gavi...@gmail.com>
>> wrote:
>>>>> 
>>>>> Ah wait nevermind, got excited and spoke too soon. Looking at it more
>>>>> closely, that data isn't correct.
>>>>> At least it's in somewhat the right shape, ha!
>>>>> 
>>>>>> On Sat, Feb 12, 2022 at 9:57 PM Gavin Ray <ray.gavi...@gmail.com>
>> wrote:
>>>>> 
>>>>>> After ~5 hours, I think I may have made some progress =)
>>>>>> 
>>>>>> I have this, which currently works. The problem is that the nested
>>>>> columns
>>>>>> don't have names on them.
>>>>>> Since I need to return a nested "Map<String, Object>", I have to
>> figure
>>>>>> out how to convert this query into a form that gives column names.
>>>>>> 
>>>>>> But this is still great progress I think!
>>>>>> 
>>>>>> SELECT
>>>>>>   "todos".*,
>>>>>>   ARRAY(
>>>>>>       SELECT
>>>>>>           "users".*,
>>>>>>           ARRAY(
>>>>>>               SELECT
>>>>>>                   "todos".*
>>>>>>               FROM
>>>>>>                   "todos"
>>>>>>           ) AS "todos"
>>>>>>       FROM
>>>>>>           "users"
>>>>>>   ) AS "users"
>>>>>> FROM
>>>>>>   "todos"
>>>>>> WHERE
>>>>>>   "user_id" IN (
>>>>>>       SELECT
>>>>>>           "user_id"
>>>>>>       FROM
>>>>>>           "users"
>>>>>>       WHERE
>>>>>>           "house_id" IN (
>>>>>>               SELECT
>>>>>>                   "id"
>>>>>>               FROM
>>>>>>                   "houses"
>>>>>>           )
>>>>>>   );
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> +----+---------+------------------------+------------------------------------------------------------------------------+
>>>>>> | id | user_id |      description       |
>>>>>>                                           |
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> +----+---------+------------------------+------------------------------------------------------------------------------+
>>>>>> | 1  | 1       | Take out the trash     | [{1, John, 1, [{1, 1, Take
>>>> out
>>>>>> the trash}, {2, 1, Watch my favorite show}, { |
>>>>>> | 2  | 1       | Watch my favorite show | [{1, John, 1, [{1, 1, Take
>>>> out
>>>>>> the trash}, {2, 1, Watch my favorite show}, { |
>>>>>> | 3  | 1       | Charge my phone        | [{1, John, 1, [{1, 1, Take
>>>> out
>>>>>> the trash}, {2, 1, Watch my favorite show}, { |
>>>>>> | 4  | 2       | Cook dinner            | [{1, John, 1, [{1, 1, Take
>>>> out
>>>>>> the trash}, {2, 1, Watch my favorite show}, { |
>>>>>> | 5  | 2       | Read a book            | [{1, John, 1, [{1, 1, Take
>>>> out
>>>>>> the trash}, {2, 1, Watch my favorite show}, { |
>>>>>> | 6  | 2       | Organize office        | [{1, John, 1, [{1, 1, Take
>>>> out
>>>>>> the trash}, {2, 1, Watch my favorite show}, { |
>>>>>> | 7  | 3       | Walk the dog           | [{1, John, 1, [{1, 1, Take
>>>> out
>>>>>> the trash}, {2, 1, Watch my favorite show}, { |
>>>>>> | 8  | 3       | Feed the cat           | [{1, John, 1, [{1, 1, Take
>>>> out
>>>>>> the trash}, {2, 1, Watch my favorite show}, { |
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> +----+---------+------------------------+------------------------------------------------------------------------------+
>>>>>> 
>>>>>> On Sat, Feb 12, 2022 at 4:13 PM Gavin Ray <ray.gavi...@gmail.com>
>>>> wrote:
>>>>>> 
>>>>>>> Nevermind, this is a standard term not something Calcite-specific it
>>>>>>> seems!
>>>>>>> 
>>>>>>> https://en.wikipedia.org/wiki/Correlated_subquery
>>>>>>> 
>>>>>>> On Sat, Feb 12, 2022 at 3:46 PM Gavin Ray <ray.gavi...@gmail.com>
>>>>> wrote:
>>>>>>> 
>>>>>>>> Forgive my ignorance/lack of experience
>>>>>>>> 
>>>>>>>> I am somewhat familiar with the ARRAY() function, but not sure I
>> know
>>>>>>>> the term "correlated"
>>>>>>>> Searching the Calcite codebase for uses of "correlated" + "query", I
>>>>>>>> found:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>> 
>>>> 
>> https://github.com/apache/calcite/blob/1d4f1b394bfdba03c5538017e12ab2431b137ca9/core/src/test/java/org/apache/calcite/test/SqlToRelConverterTest.java#L1603-L1612
>>>>>>>> 
>>>>>>>> @Test void testCorrelatedSubQueryInJoin() {
>>>>>>>>   final String sql = "select *\n"
>>>>>>>>       + "from emp as e\n"
>>>>>>>>       + "join dept as d using (deptno)\n"
>>>>>>>>       + "where d.name = (\n"
>>>>>>>>       + "  select max(name)\n"
>>>>>>>>       + "  from dept as d2\n"
>>>>>>>>       + "  where d2.deptno = d.deptno)";
>>>>>>>>   sql(sql).withExpand(false).ok();
>>>>>>>> }
>>>>>>>> 
>>>>>>>> But I also see this, which says it is "uncorrelated" but seems very
>>>>>>>> similar?
>>>>>>>> 
>>>>>>>> @Test void testInUncorrelatedSubQuery() {
>>>>>>>>   final String sql = "select empno from emp where deptno in"
>>>>>>>>       + " (select deptno from dept)";
>>>>>>>>   sql(sql).ok();
>>>>>>>> }
>>>>>>>> 
>>>>>>>> I wouldn't blame you for not answering such a basic question -- but
>>>>> what
>>>>>>>> exactly does "correlation" mean here?
>>>>>>>> 
>>>>>>>> Thanks, as usual Julian
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sat, Feb 12, 2022 at 3:08 PM Julian Hyde <jhyde.apa...@gmail.com
>>> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Correlated ARRAY sub-query?
>>>>>>>>> 
>>>>>>>>>> On Feb 12, 2022, at 10:40 AM, Gavin Ray <ray.gavi...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Apologies for the delay in replying
>>>>>>>>>> 
>>>>>>>>>> This makes things clear and seems obvious now that you point it
>>>> out.
>>>>>>>>>> Thank you, Justin and Julian =)
>>>>>>>>>> 
>>>>>>>>>> Let me ask another question (if I may) that I am struggling to
>>>>> phrase
>>>>>>>>>> easily.
>>>>>>>>>> 
>>>>>>>>>> So with GraphQL, you might have a query like:
>>>>>>>>>> - "Get houses"
>>>>>>>>>> - "For each house get the user that lives in the house
>>>>>>>>>> - "And for each user get their list of todos"
>>>>>>>>>> 
>>>>>>>>>> The result has to come back such that it's a single object for
>>>> each
>>>>>>>>> row
>>>>>>>>>> ==============
>>>>>>>>>> {
>>>>>>>>>> houses: [{
>>>>>>>>>>  address: "123 Main Street",
>>>>>>>>>>  users: [{
>>>>>>>>>>    name: "Joe",
>>>>>>>>>>    todos: [{
>>>>>>>>>>      description: "Take out trash"
>>>>>>>>>>    }]
>>>>>>>>>>  }]
>>>>>>>>>> }
>>>>>>>>>> 
>>>>>>>>>> From a SQL perspective, the logical equivalent would be something
>>>>>>>>> like:
>>>>>>>>>> ==============
>>>>>>>>>> SELECT
>>>>>>>>>>  house.address,
>>>>>>>>>>  (somehow nest users + double-nest todos under user)
>>>>>>>>>> FROM
>>>>>>>>>>  house
>>>>>>>>>> JOIN
>>>>>>>>>>  user ON user.house_id = house.id
>>>>>>>>>>  todos ON todos.user_id = user.id
>>>>>>>>>> WHERE
>>>>>>>>>>  house.id = 1
>>>>>>>>>> 
>>>>>>>>>> I'm not familiar enough with SQL to have figured out a way to make
>>>>>>>>> this
>>>>>>>>>> kind of
>>>>>>>>>> query using operators that are supported across most of the DB's
>>>>>>>>> Calcite has
>>>>>>>>>> adapters for.
>>>>>>>>>> 
>>>>>>>>>> Currently what I have done instead, on a tip from Gopalakrishna
>>>>> Holla
>>>>>>>>> from
>>>>>>>>>> LinkedIn Coral team who has built GraphQL-on-Calcite, was to break
>>>>> up
>>>>>>>>> the
>>>>>>>>>> query
>>>>>>>>>> into individual parts and then do the join in-memory:
>>>>>>>>>> 
>>>>>>>>>> SELECT ... FROM users;
>>>>>>>>>> SELECT ... FROM todos WHERE todos.user_id IN (ResultSet from prev
>>>>>>>>> response);
>>>>>>>>>> 
>>>>>>>>>> However, the way I am doing this seems like it's probably very
>>>>>>>>> inefficient.
>>>>>>>>>> Because I do a series of nested loops to add the key to each
>>>> object
>>>>>>>>> in the
>>>>>>>>>> parent ResultSet row:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>> 
>>>> 
>> https://github.com/GavinRay97/GraphQLCalcite/blob/648e0ac4f6810a3c360d13a03e6597c33406de4b/src/main/kotlin/TableDataFetcher.kt#L135-L153
>>>>>>>>>> 
>>>>>>>>>> Is there some better way of doing this?
>>>>>>>>>> I would be eternally grateful for any advice.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Thu, Feb 10, 2022 at 3:27 PM Julian Hyde <
>>>> jhyde.apa...@gmail.com
>>>>>> 
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Yes, if you want to do multiple layers of aggregation, use CTEs
>>>>>>>>> (WITH) or
>>>>>>>>>>> nested sub-queries. For example, the following is I believe valid
>>>>>>>>> standard
>>>>>>>>>>> SQL, and actually computes something useful:
>>>>>>>>>>> 
>>>>>>>>>>> WITH q1 AS
>>>>>>>>>>> (SELECT deptno, job, AVG(sal) AS avg_sal
>>>>>>>>>>>  FROM emp
>>>>>>>>>>>  GROUP BY deptno, job)
>>>>>>>>>>> WITH q2 AS
>>>>>>>>>>> (SELECT deptno, AVG(avg_sal) AS avg_avg_sal
>>>>>>>>>>>  FROM q1
>>>>>>>>>>>  GROUP BY deptno)
>>>>>>>>>>> SELECT AVG(avg_avg_sal)
>>>>>>>>>>> FROM q2
>>>>>>>>>>> GROUP BY ()
>>>>>>>>>>> 
>>>>>>>>>>> (You can omit the “GROUP BY ()” line, but I think it makes things
>>>>>>>>> clearer.)
>>>>>>>>>>> 
>>>>>>>>>>> Julian
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On Feb 10, 2022, at 12:17 PM, Justin Swanhart <
>>>>> greenl...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> I wish you could unsend emails :)  Answering my own question,
>>>> no,
>>>>>>>>> because
>>>>>>>>>>>> that would return three rows with the average :D
>>>>>>>>>>>> 
>>>>>>>>>>>> On Thu, Feb 10, 2022 at 3:16 PM Justin Swanhart <
>>>>>>>>> greenl...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Just out of curiosity, is the second level aggregation using
>>>> AVG
>>>>>>>>> in a
>>>>>>>>>>>>> window context?  It the frame is the whole table and it
>>>>> aggregates
>>>>>>>>> over
>>>>>>>>>>> it?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Thu, Feb 10, 2022 at 3:12 PM Justin Swanhart <
>>>>>>>>> greenl...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> That is really neat about Oracle.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The alternative in general is to use a subquery:
>>>>>>>>>>>>>> SELECT avg(avg(sal)) FROM emp GROUP BY deptno;
>>>>>>>>>>>>>> becomes
>>>>>>>>>>>>>> select avg(the_avg)
>>>>>>>>>>>>>> from (select avg(sal) from emp group b deptno) an_alias;
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> with the_cte as (select avg(sal) x from emp group by deptno)
>>>>>>>>>>>>>> select avg(x) from the_cte;
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Thu, Feb 10, 2022 at 3:03 PM Julian Hyde <
>>>>>>>>> jhyde.apa...@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Some databases, e.g. Oracle, allow TWO levels of nesting:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> SELECT avg(sal) FROM emp GROUP BY deptno;
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> AVG(SAL)
>>>>>>>>>>>>>>> ========
>>>>>>>>>>>>>>> 1,566.67
>>>>>>>>>>>>>>> 2,175.00
>>>>>>>>>>>>>>> 2,916.65
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> SELECT avg(avg(sal)) FROM emp GROUP BY deptno;
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> AVG(SUM(SAL))
>>>>>>>>>>>>>>> =============
>>>>>>>>>>>>>>>     9,675
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The first level aggregates by department (returning 3
>>>> records),
>>>>>>>>> and
>>>>>>>>>>> the
>>>>>>>>>>>>>>> second level computes the grand total (returning 1 record).
>>>> But
>>>>>>>>> that
>>>>>>>>>>> is an
>>>>>>>>>>>>>>> exceptional case.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Generally, any expression in the SELECT or HAVING clause of
>>>> an
>>>>>>>>>>> aggregate
>>>>>>>>>>>>>>> query is either ‘before’ or ‘after’ aggregation. Consider
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> SELECT t.x + 1 AS a, 2 + SUM(t.y + 3) AS b
>>>>>>>>>>>>>>> FROM t
>>>>>>>>>>>>>>> GROUP BY t.x
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The expressions “t.y” and “t.y + 3” occur before aggregation;
>>>>>>>>> “t.x”,
>>>>>>>>>>>>>>> “t.x + 1”, “SUM(t.y + 3)” and “2 + SUM(t.y + 3)” occur after
>>>>>>>>>>> aggregation.
>>>>>>>>>>>>>>> SQL semantics rely heavily on this stratification. Allowing
>>>> an
>>>>>>>>> extra
>>>>>>>>>>> level
>>>>>>>>>>>>>>> of aggregation would mess it all up.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Julian
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Feb 10, 2022, at 9:45 AM, Justin Swanhart <
>>>>>>>>> greenl...@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This is a SQL limitation.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> mysql> select sum(1);
>>>>>>>>>>>>>>>> +--------+
>>>>>>>>>>>>>>>> | sum(1) |
>>>>>>>>>>>>>>>> +--------+
>>>>>>>>>>>>>>>> |      1 |
>>>>>>>>>>>>>>>> +--------+
>>>>>>>>>>>>>>>> 1 row in set (0.00 sec)
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> mysql> select sum(sum(1));
>>>>>>>>>>>>>>>> ERROR 1111 (HY000): Invalid use of group function
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Thu, Feb 10, 2022 at 12:39 PM Gavin Ray <
>>>>>>>>> ray.gavi...@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Went to test this query out and found that it can't be
>>>>>>>>> performed:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> SELECT
>>>>>>>>>>>>>>>>> JSON_OBJECT(
>>>>>>>>>>>>>>>>>    KEY 'users'
>>>>>>>>>>>>>>>>>    VALUE JSON_ARRAYAGG(
>>>>>>>>>>>>>>>>>        JSON_OBJECT(
>>>>>>>>>>>>>>>>>            KEY 'name' VALUE "users"."name",
>>>>>>>>>>>>>>>>>            KEY 'todos' VALUE JSON_ARRAYAGG(
>>>>>>>>>>>>>>>>>                JSON_OBJECT(
>>>>>>>>>>>>>>>>>                    KEY 'description' VALUE
>>>>>>>>> "todos"."description"
>>>>>>>>>>>>>>>>>                )
>>>>>>>>>>>>>>>>>            )
>>>>>>>>>>>>>>>>>        )
>>>>>>>>>>>>>>>>>    )
>>>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>>>> FROM
>>>>>>>>>>>>>>>>> "users"
>>>>>>>>>>>>>>>>> LEFT OUTER JOIN
>>>>>>>>>>>>>>>>> "todos" ON "users"."id" = "todos"."user_id";
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Checking the source, seems this is a blanket policy, not a
>>>>>>>>>>>>>>>>> datasource-specific thing.
>>>>>>>>>>>>>>>>> From a functional perspective, it doesn't feel like it's
>>>> much
>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>> from JOINs
>>>>>>>>>>>>>>>>> But I don't understand relational theory or DB
>>>> functionality
>>>>>>>>> in the
>>>>>>>>>>>>>>> least,
>>>>>>>>>>>>>>>>> so I'm not fit to judge.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Just curious why Calcite doesn't allow this
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>> 
>>>> 
>>

Re: Why are nested aggregations illegal? Best alternatives?

Reply via email to