Re: Random expr in join key not support

Yingyi Bu Tue, 19 Oct 2021 23:21:15 -0700

> Do you mean something like this:
> select * from t1 join (select floor(random()*9) + id as x from t2) m on
t1.id = m.x ?
> Yes, that works.


Yes.

> But that raise another question: theses two queries seem semantically
equivalent,
> yet we treat them differently: one raises an analysis exception, one can
work well.
> Should we treat them equally?

They're not semantically equivalent, according to the SQL spec. See page
241 in SQL-99 spec (http://web.cecs.pdx.edu/~len/sql1999.pdf) - the general
rules for <joined table>.

> Here the purpose to add a random in join key is to resolve the data skew
problem.

Would you mind briefly elaborating what you're trying to do to reduce skew?

Best,
Yingyi


On Tue, Oct 19, 2021 at 9:07 PM Ye Xianjin <advance...@gmail.com> wrote:

> > For that, you can add a table subquery and do it in the select list.
>
> Do you mean something like this:
> select * from t1 join (select floor(random()*9) + id as x from t2) m on
> t1.id = m.x ?
>
> Yes, that works. But that raise another question: theses two queries seem
> semantically equivalent, yet we treat them differently: one raises an
> analysis exception, one can work well.
> Should we treat them equally?
>
>
>
>
> Sent from my iPhone
>
> On Oct 20, 2021, at 9:55 AM, Yingyi Bu <buyin...@gmail.com> wrote:
>
> 
> Per SQL spec, I think your join query can only be run as a NestedLoopJoin
> or CartesianProduct.  See page 241 in SQL-99 (
> http://web.cecs.pdx.edu/~len/sql1999.pdf).
> In other words, it might be a correctness bug in other systems if they run
> your query as a hash join.
>
> > Here the purpose of adding a random in join key is to resolve the data
> skew problem.
>
> For that, you can add a table subquery and do it in the select list.
>
> Best,
> Yingyi
>
>
> On Tue, Oct 19, 2021 at 12:46 AM Lantao Jin <jinlan...@gmail.com> wrote:
>
>> In PostgreSQL and Presto, the below query works well
>> sql> create table t1 (id int);
>> sql> create table t2 (id int);
>> sql> select * from t1 join t2 on t1.id = floor(random() * 9) + t2.id;
>>
>> But it throws "Error in query: nondeterministic expressions are only
>> allowed in Project, Filter, Aggregate or Window". Why Spark doesn't support
>> random expressions in join condition?
>> Here the purpose to add a random in join key is to resolve the data skew
>> problem.
>>
>> Thanks,
>> Lantao
>>
>

Re: Random expr in join key not support

Reply via email to