[ https://issues.apache.org/jira/browse/SPARK-30045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-30045:
---------------------------------
    Environment: (was: {code:java}  {code})

> Difference in subquery performance
> ----------------------------------
>
>                 Key: SPARK-30045
>                 URL: https://issues.apache.org/jira/browse/SPARK-30045
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Max Lapan
>            Priority: Major
>
> Hi!
> There is a weird performance difference between
> {code:java}
> WHERE field NOT IN (SELECT s FROM table)
> {code}
> and
> {code:java}
> WHERE NOT EXISTS (SELECT s FROM table WHERE s = field)
> {code}
> In my toy benchmark code (attached to the ticket), which has 1M rows in the
> outer select and 20k rows in the inner, the first query runs in 4.5 minutes
> and the second completes in 14 seconds.
> With larger datasets the difference is even more dramatic: the first query
> form simply never finishes and consumes all available memory.
> We've tested this on Spark 2.4.0 (AWS EMR).
> Reproduction code:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql import Row
>
> spark = SparkSession \
>     .builder \
>     .appName("SQL bug reproduce") \
>     .getOrCreate()
>
> def do_q1():
>     df = spark.sql("""
>         select * from df1
>         where df1.id not in (select id from df2)
>     """)
>     return df.count()
>
> def do_q2():
>     dff = spark.sql("""
>         select * from df1
>         where not exists (select id from df2 where df2.id = df1.id)
>     """)
>     return dff.count()
>
> def build_data(n1, n2, p1=2, p2=2):
>     d1 = (Row(id=v) for v in range(n1))
>     d2 = (Row(id=v) for v in range(n2))
>     df1 = spark.createDataFrame(d1).repartition(p1)
>     df2 = spark.createDataFrame(d2).repartition(p2)
>     df1.createOrReplaceTempView("df1")
>     df2.createOrReplaceTempView("df2")
>
> build_data(1000000, 20000, 10, 5)
> # 14 seconds
> do_q2()
> # 4.5 minutes
> do_q1()
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
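Editor's note: a likely explanation for the gap, not confirmed in the ticket itself, is SQL's three-valued logic. `x NOT IN (subquery)` can never be simply TRUE once the subquery side might contain a NULL, so the planner must use a null-aware anti join (which Spark 2.4 could execute as a broadcast nested loop join), whereas `NOT EXISTS` with an equality predicate never matches NULLs and plans as an ordinary hash-based left anti join. The pure-Python sketch below (helper names `sql_not_in` and `sql_not_exists` are the editor's own, not Spark APIs) emulates the two predicates' semantics to show the asymmetry:

```python
# Editor's sketch of SQL three-valued logic: why `NOT IN` needs null-aware
# handling while `NOT EXISTS` does not. Pure Python, no Spark required.

def sql_not_in(x, values):
    """Emulate SQL `x NOT IN (values)`: returns True, False, or None (UNKNOWN)."""
    if x is None:
        return None                      # NULL NOT IN (...) is UNKNOWN
    if x in [v for v in values if v is not None]:
        return False                     # definite match found
    if any(v is None for v in values):
        return None                      # x might equal the NULL -> UNKNOWN
    return True

def sql_not_exists(x, values):
    """Emulate `NOT EXISTS (SELECT v FROM t WHERE v = x)`: always True/False."""
    # NULL = anything is UNKNOWN in SQL, so NULL rows never satisfy v = x.
    return not any(x is not None and v is not None and v == x for v in values)

inner = [1, 2, None]
# With a NULL on the inner side, NOT IN is only ever False or UNKNOWN,
# so every outer row must be checked against the NULL (null-aware anti join);
# NOT EXISTS stays two-valued and joins on equality alone.
print(sql_not_in(3, inner))       # None (UNKNOWN): row filtered out by WHERE
print(sql_not_exists(3, inner))   # True: row survives
```

This is only an illustration of the predicate semantics; whether the fix is to rewrite queries as `NOT EXISTS` or to rely on a smarter plan is a question for the Spark optimizer, not this sketch.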