> For that, you can add a table subquery and do it in the select list.

Do you mean something like this?

select * from t1 join (select floor(random()*9) + id as x from t2) m on t1.id = m.x
Yes, that works. But that raises another question: these two queries seem semantically equivalent, yet we treat them differently: one raises an analysis exception while the other works fine. Should we treat them equally?

> On Oct 20, 2021, at 9:55 AM, Yingyi Bu <buyin...@gmail.com> wrote:
>
> Per the SQL spec, I think your join query can only be run as a NestedLoopJoin or CartesianProduct. See page 241 in SQL-99 (http://web.cecs.pdx.edu/~len/sql1999.pdf).
> In other words, it might be a correctness bug in other systems if they run your query as a hash join.
>
> > Here the purpose of adding a random value to the join key is to resolve the data skew problem.
>
> For that, you can add a table subquery and do it in the select list.
>
> Best,
> Yingyi
>
>> On Tue, Oct 19, 2021 at 12:46 AM Lantao Jin <jinlan...@gmail.com> wrote:
>> In PostgreSQL and Presto, the query below works well:
>>
>> sql> create table t1 (id int);
>> sql> create table t2 (id int);
>> sql> select * from t1 join t2 on t1.id = floor(random() * 9) + t2.id;
>>
>> But it throws "Error in query: nondeterministic expressions are only allowed in Project, Filter, Aggregate or Window". Why doesn't Spark support random expressions in join conditions?
>> Here the purpose of adding a random value to the join key is to resolve the data skew problem.
>>
>> Thanks,
>> Lantao
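For anyone who wants to reproduce the difference end to end, here is a minimal sketch in Spark SQL, reusing the t1/t2 tables from the thread and substituting Spark's rand() for random():

sql> create table t1 (id int);
sql> create table t2 (id int);

-- rejected at analysis time: the nondeterministic expression sits in the join condition
sql> select * from t1 join t2 on t1.id = floor(rand() * 9) + t2.id;

-- accepted: the nondeterministic expression is evaluated in the subquery's select list
-- (a Project), so the join condition itself is a plain equality
sql> select * from t1 join (select floor(rand() * 9) + id as x from t2) m on t1.id = m.x;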
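As an aside, and not something discussed above: if the only goal is to mitigate skew, the usual salting rewrite keeps the inner-join result identical to a plain id = id join, because the salt is added as an extra equi-join key rather than as a perturbation of the key itself. A rough sketch, with the 9-way salt width and column names purely illustrative:

-- t1 (the skewed side) gets one random salt per row; t2 is replicated once per salt value
select s1.id, s2.id
from (select id, cast(floor(rand() * 9) as int) as salt from t1) s1
join (select id, salt
      from t2 lateral view explode(sequence(0, 8)) tmp as salt) s2
  on s1.id = s2.id and s1.salt = s2.salt;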