> In other words, it might be a correctness bug in other systems if they
run your query as a hash join.

Thanks for the reply. But for PostgreSQL, this query produces a nested-loop
join:

# explain select * from t11 join t22 on t11.id = floor(random() * 9) + t22.id;
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..195152.38 rows=32512 width=8)
   Join Filter: ((t11.id)::double precision = (floor((random() * '9'::double precision)) + (t22.id)::double precision))
   ->  Seq Scan on t11  (cost=0.00..35.50 rows=2550 width=4)
   ->  Materialize  (cost=0.00..48.25 rows=2550 width=4)
         ->  Seq Scan on t22  (cost=0.00..35.50 rows=2550 width=4)
(5 rows)
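
For comparison, here is a sketch of the Spark-side behavior discussed below in this thread (same t1/t2 tables; note Spark's function is rand() rather than random()):

-- Rejected by Spark's analyzer: a nondeterministic expression appears in
-- the join condition ("nondeterministic expressions are only allowed in
-- Project, Filter, Aggregate or Window").
select * from t1 join t2 on t1.id = floor(rand() * 9) + t2.id;

-- Accepted: rand() is evaluated in the subquery's select list (a Project),
-- so the join condition itself is deterministic.
select * from t1
join (select floor(rand() * 9) + id as x from t2) m
  on t1.id = m.x;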



> Would you mind briefly elaborating what you're trying to do to reduce
skew?

We use Apache Calcite as a query federation layer whose backends are
SparkSQL and Presto. Calcite accepts queries like the one above that use
rand() in the join condition.
For example, a join condition may contain "nvl(B.app_name, concat('-',
rand())) = A.app_name"; the user's purpose is to avoid data skew. I know
Spark has AQE to handle some cases of data skew, but Presto has no
equivalent, so users adopted this pattern to handle skew themselves. During
optimization, Calcite rewrites the join condition and passes a deterministic
expression to the join condition in the plan, so the underlying Spark works
well. But sometimes that optimization is skipped, or users run Spark
directly without the Calcite layer, and then the query cannot run.
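
For reference, a minimal sketch of that salting pattern written in the subquery form Spark accepts (assuming B has an app_name column as in the example; the exact rewrite Calcite performs may differ):

-- NULL app_name values in B would all hash to the same partition and skew
-- the shuffle; replacing each NULL with a distinct random string spreads
-- those rows across partitions. rand() runs in the subquery's select list,
-- so the join condition Spark sees is deterministic.
select A.*, b2.*
from A
join (
  select *, nvl(app_name, concat('-', rand())) as salted_app_name
  from B
) b2
  on A.app_name = b2.salted_app_name;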

So, should Spark support this kind of statement, as Calcite/PostgreSQL/Presto do?

Thanks,
Lantao



>> On Wed, Oct 20, 2021 at 2:20 PM, Yingyi Bu <buyin...@gmail.com> wrote:
>>
>>> > Do you mean something like this:
>>> > select * from t1 join (select floor(random()*9) + id as x from t2) m
>>> > on t1.id = m.x ?
>>> > Yes, that works.
>>>
>>> Yes.
>>>
>>> > But that raises another question: these two queries seem semantically
>>> > equivalent, yet we treat them differently: one raises an analysis
>>> > exception, the other works well.
>>> > Should we treat them equally?
>>>
>>> They're not semantically equivalent, according to the SQL spec. See page
>>> 241 in SQL-99 spec (http://web.cecs.pdx.edu/~len/sql1999.pdf) - the
>>> general rules for <joined table>.
>>>
>>> > Here the purpose to add a random in join key is to resolve the data
>>> skew problem.
>>>
>>> Would you mind briefly elaborating what you're trying to do to reduce
>>> skew?
>>>
>>> Best,
>>> Yingyi
>>>
>>>
>>> On Tue, Oct 19, 2021 at 9:07 PM Ye Xianjin <advance...@gmail.com> wrote:
>>>
>>>> > For that, you can add a table subquery and do it in the select list.
>>>>
>>>> Do you mean something like this:
>>>> select * from t1 join (select floor(random()*9) + id as x from t2) m on
>>>> t1.id = m.x ?
>>>>
>>>> Yes, that works. But that raises another question: these two queries
>>>> seem semantically equivalent, yet we treat them differently: one raises
>>>> an analysis exception, the other works well.
>>>> Should we treat them equally?
>>>>
>>>>
>>>>
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Oct 20, 2021, at 9:55 AM, Yingyi Bu <buyin...@gmail.com> wrote:
>>>>
>>>>
>>>> Per SQL spec, I think your join query can only be run as a
>>>> NestedLoopJoin or CartesianProduct.  See page 241 in SQL-99 (
>>>> http://web.cecs.pdx.edu/~len/sql1999.pdf).
>>>> In other words, it might be a correctness bug in other systems if they
>>>> run your query as a hash join.
>>>>
>>>> > Here the purpose of adding a random value to the join key is to
>>>> > resolve the data skew problem.
>>>>
>>>> For that, you can add a table subquery and do it in the select list.
>>>>
>>>> Best,
>>>> Yingyi
>>>>
>>>>
>>>> On Tue, Oct 19, 2021 at 12:46 AM Lantao Jin <jinlan...@gmail.com>
>>>> wrote:
>>>>
>>>>> In PostgreSQL and Presto, the query below works well:
>>>>> sql> create table t1 (id int);
>>>>> sql> create table t2 (id int);
>>>>> sql> select * from t1 join t2 on t1.id = floor(random() * 9) + t2.id;
>>>>>
>>>>> But it throws "Error in query: nondeterministic expressions are only
>>>>> allowed in Project, Filter, Aggregate or Window". Why doesn't Spark
>>>>> support random expressions in a join condition?
>>>>> Here the purpose of adding a random value to the join key is to
>>>>> resolve the data skew problem.
>>>>>
>>>>> Thanks,
>>>>> Lantao
>>>>>
>>>>
