ulysses-you commented on pull request #26875: URL: https://github.com/apache/spark/pull/26875#issuecomment-646699595
Env: CentOS 7, 40 cores, 4 GB

---- test 1 ----
```
val df1 = spark.range(0, 20000, 1, 200).selectExpr("uuid() as c1")
val df2 = spark.range(0, 20000, 1, 200).selectExpr("uuid() as c2")
val start = System.currentTimeMillis
df1.join(df2).where("c2 like c1").count()
// 3 runs
// before: 159228, 157541, 157721
// after:  14378, 11545, 11498
println(System.currentTimeMillis - start)
```

---- test 2 ----
```
// strings of length 17+1
val df1 = spark.range(0, 20000, 1, 200).selectExpr("concat('aaaaaaaaaaaaaaaaa', id%2) as c1")
val df2 = spark.range(0, 20000, 1, 200).selectExpr("concat('bbbbbbbbbbbbbbbbb', id%2) as c2")
val start = System.currentTimeMillis
df1.join(df2).where("c2 like c1").count()
// 3 runs
// before: 90054, 90350, 90283
// after:  13077, 10097, 9770
println(System.currentTimeMillis - start)
```

About a 10x performance improvement. Equality comparison appears to be much faster than compiling a pattern, and the improvement grows with longer strings.

cc @HyukjinKwon
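To illustrate the idea being benchmarked, here is a minimal sketch (not the actual Spark implementation; `LikeFastPath`, `hasWildcard`, and `matchesViaRegex` are hypothetical names): when a `LIKE` pattern contains no wildcards, a plain string equality check can replace compiling and matching a regex for every row.

```scala
import java.util.regex.Pattern

object LikeFastPath {
  // '%' and '_' are the SQL LIKE wildcards; '\' escapes them.
  // If any of these appear, we conservatively take the regex path.
  private def hasWildcard(pattern: String): Boolean =
    pattern.exists(c => c == '%' || c == '_' || c == '\\')

  // Slow path: translate the LIKE pattern into a regex and compile it.
  private def matchesViaRegex(input: String, pattern: String): Boolean = {
    val regex = pattern.flatMap {
      case '%' => ".*"
      case '_' => "."
      case c   => Pattern.quote(c.toString)
    }
    Pattern.compile(regex, Pattern.DOTALL).matcher(input).matches()
  }

  def like(input: String, pattern: String): Boolean =
    if (hasWildcard(pattern)) matchesViaRegex(input, pattern)
    else input == pattern // fast path: no wildcards, so LIKE is just equality
}
```

With UUID-valued columns as in test 1, the pattern side never contains `%` or `_`, so every comparison takes the equality branch and the per-row regex compilation cost disappears.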