[GitHub] [spark] ulysses-you commented on pull request #26875: [SPARK-30245][SQL] Add cache for Like and RLike when pattern is not static
ulysses-you commented on pull request #26875: URL: https://github.com/apache/spark/pull/26875#issuecomment-650841831

Maybe I'm missing something: test3 is the worst case, so every other scenario performs better than it, and test1 and test2 are the best positive cases. The result is that adding the cache improves performance by between 0x and 10x. If you have any other scenarios in mind, I'm ready to run more benchmarks.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
ulysses-you commented on pull request #26875: URL: https://github.com/apache/spark/pull/26875#issuecomment-650702868

Yeah, but it also shows only a little performance regression in the normal case, as seen in test3. I think that's a reason to do this.
ulysses-you commented on pull request #26875: URL: https://github.com/apache/spark/pull/26875#issuecomment-647982945

It's different: test1's condition uses `c2 like c1` while test3 uses `c1 like c2`. As a result, test1 can always reuse the compiled pattern while test3 always needs to re-compile it. As I said above, placing the column that needs compiling on the join's `BufferedSide` avoids the compilation.
ulysses-you commented on pull request #26875: URL: https://github.com/apache/spark/pull/26875#issuecomment-647899052

The positive test cases (test1, test2) confirm that this PR gives much better performance in the dynamic-like scenario; test1 and test2 differ only in string length, and the `%2` has no special meaning. The negative test case (test3) confirms that this PR causes only a little performance regression.

> Why is test1 and test3 so different in time?

Since this is a join optimization, the best plan uses the column on the right of `like` as the `BufferedSide`; then, while looping over the `StreamSide`, the `BufferedSide` column is always the same as in the previous row, so the pattern compile can be avoided on each row.
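The per-row reuse described above can be sketched outside Spark as a last-pattern cache. This is a hypothetical illustration of the idea, not the PR's actual code; the class and field names are made up:

```java
import java.util.regex.Pattern;

// Sketch of the caching idea: keep the last compiled Pattern and
// recompile only when the pattern string changes. When the pattern
// column sits on the buffered side of a join, it repeats across
// consecutive rows, so the cheap String.equals check replaces a
// Pattern.compile on almost every row.
class LastPatternCache {
    private String lastRegex = null;
    private Pattern lastCompiled = null;
    int compileCount = 0; // tracked here only to make the reuse visible

    boolean matches(String input, String regex) {
        if (!regex.equals(lastRegex)) { // equals is far cheaper than compile
            lastCompiled = Pattern.compile(regex);
            lastRegex = regex;
            compileCount++;
        }
        return lastCompiled.matcher(input).matches();
    }
}
```

With a repeating pattern the cache compiles once; with a pattern that changes every row (the test3 worst case) it compiles every row, paying only the extra string comparison.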
ulysses-you commented on pull request #26875: URL: https://github.com/apache/spark/pull/26875#issuecomment-647245944

It seems a merged PR can't be reopened. Is there any way to do it? If not, I will send another PR for this.
ulysses-you commented on pull request #26875: URL: https://github.com/apache/spark/pull/26875#issuecomment-647244110

test3
```
val df1 = spark.range(0, 2, 1, 200).selectExpr("uuid() as c1")
val df2 = spark.range(0, 2, 1, 200).selectExpr("uuid() as c2")
val start = System.currentTimeMillis
df1.join(df2).where("c1 like c2").count()
// 3 runs
// before: 159226, 159147, 159587
// after:  159641, 160960, 160091
println(System.currentTimeMillis - start)
```
The worst case does both the comparison and the compile for every row, and even then it shows only a little regression.
ulysses-you commented on pull request #26875: URL: https://github.com/apache/spark/pull/26875#issuecomment-646699595

Env: CentOS 7, 40 cores, 4 GB

test1
```
val df1 = spark.range(0, 2, 1, 200).selectExpr("uuid() as c1")
val df2 = spark.range(0, 2, 1, 200).selectExpr("uuid() as c2")
val start = System.currentTimeMillis
df1.join(df2).where("c2 like c1").count()
// 3 runs
// before: 159228, 157541, 157721
// after:  14378, 11545, 11498
println(System.currentTimeMillis - start)
```
test2
```
// 17+1 length strings
val df1 = spark.range(0, 2, 1, 200).selectExpr("concat('a', id%2) as c1")
val df2 = spark.range(0, 2, 1, 200).selectExpr("concat('b', id%2) as c2")
val start = System.currentTimeMillis
df1.join(df2).where("c2 like c1").count()
// 3 runs
// before: 90054, 90350, 90283
// after:  13077, 10097, 9770
println(System.currentTimeMillis - start)
```
About a 10x performance improvement. It seems an equality check is much quicker than compiling the pattern, and longer strings make the improvement even larger. cc @HyukjinKwon
ulysses-you commented on pull request #26875: URL: https://github.com/apache/spark/pull/26875#issuecomment-646556878

@HyukjinKwon sorry, I missed many things. And thanks @beliefer for doing the benchmark, but something seems wrong there: the right side should not be a foldable value. This PR aims to improve the performance of dynamic `like`, e.g. `select count(*) from t where c1 like c2`; it does not affect static `like`. I will do a new benchmark with very long strings as @HyukjinKwon suggested.