[GitHub] [spark] ulysses-you commented on pull request #26875: [SPARK-30245][SQL] Add cache for Like and RLike when pattern is not static

2020-06-28 Thread GitBox


ulysses-you commented on pull request #26875:
URL: https://github.com/apache/spark/pull/26875#issuecomment-650841831


   Maybe I am missing something, but test3 is the worst case, so every other 
scenario will do better than it, and test1 and test2 are the best cases.
   
   The result is that adding the cache improves performance by anywhere from 
no gain up to about 10x. If you have a specific scenario in mind, I am happy to 
benchmark it.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ulysses-you commented on pull request #26875: [SPARK-30245][SQL] Add cache for Like and RLike when pattern is not static

2020-06-27 Thread GitBox


ulysses-you commented on pull request #26875:
URL: https://github.com/apache/spark/pull/26875#issuecomment-650702868


   Yes, but it also shows little performance regression in the normal case, as 
seen in test3. I think that is a reason to do this.






[GitHub] [spark] ulysses-you commented on pull request #26875: [SPARK-30245][SQL] Add cache for Like and RLike when pattern is not static

2020-06-23 Thread GitBox


ulysses-you commented on pull request #26875:
URL: https://github.com/apache/spark/pull/26875#issuecomment-647982945


   They are different. The test1 condition uses `c2 like c1`, while test3 uses 
`c1 like c2`. As a result, test1 can always reuse the pattern, whereas test3 
always needs to recompile it.
   
   As I said above, putting the column that needs to be compiled on the join's 
`BufferedSide` avoids the recompilation.






[GitHub] [spark] ulysses-you commented on pull request #26875: [SPARK-30245][SQL] Add cache for Like and RLike when pattern is not static

2020-06-22 Thread GitBox


ulysses-you commented on pull request #26875:
URL: https://github.com/apache/spark/pull/26875#issuecomment-647899052


   The positive test cases (test1, test2) confirm that this PR performs much 
better in the dynamic `like` scenario. test1 and test2 use different string 
lengths to check the effect of length. The `%2` has no meaning.
   
   The negative test case (test3) confirms that this PR causes little 
performance regression.
   
   > Why is test1 and test3 so different in time?
   
   Since this is a join optimization, the best approach is to use the column on 
the right-hand side of `like` as the `BufferedSide`. Then, while looping over 
the `StreamSide`, the `BufferedSide` value is the same as in the previous row, 
so the pattern does not have to be compiled for each row.
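
The reuse described above can be sketched in plain Scala, outside Spark. This is only an illustration of the idea, not the PR's actual code: `LastPatternCache`, `PatternCacheDemo` and `compileCount` are hypothetical names, and the two nested loops merely stand in for the two row orders (pattern fixed across consecutive rows versus pattern changing on every row).

```scala
import java.util.regex.Pattern

// Hypothetical helper (not Spark's implementation): remembers the most recently
// compiled pattern and recompiles only when the pattern string changes.
final class LastPatternCache {
  private var lastRegex: String = null
  private var lastPattern: Pattern = null
  var compileCount: Int = 0 // exposed only to make the effect observable

  def matches(input: String, regex: String): Boolean = {
    if (lastPattern == null || regex != lastRegex) {
      lastPattern = Pattern.compile(regex)
      lastRegex = regex
      compileCount += 1
    }
    lastPattern.matcher(input).matches()
  }
}

object PatternCacheDemo {
  private val patterns = Seq("a.*", "b.*")
  private val inputs = Seq("apple", "banana", "avocado")

  // Returns (compiles when the pattern stays fixed across consecutive rows,
  //          compiles when the pattern changes on every row).
  def compileCounts(): (Int, Int) = {
    val grouped = new LastPatternCache
    for (p <- patterns; s <- inputs) grouped.matches(s, p)

    val alternating = new LastPatternCache
    for (s <- inputs; p <- patterns) alternating.matches(s, p)

    (grouped.compileCount, alternating.compileCount)
  }
}
```

With the pattern in the outer loop, the cache compiles once per distinct pattern. With the pattern in the inner loop, every call sees a different pattern than the previous one, so the cache never hits and each call pays for a string comparison plus a compile.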






[GitHub] [spark] ulysses-you commented on pull request #26875: [SPARK-30245][SQL] Add cache for Like and RLike when pattern is not static

2020-06-21 Thread GitBox


ulysses-you commented on pull request #26875:
URL: https://github.com/apache/spark/pull/26875#issuecomment-647245944


   It seems a merged PR can't be reopened. Is there any way to do so? If not, I 
will send another PR for this.






[GitHub] [spark] ulysses-you commented on pull request #26875: [SPARK-30245][SQL] Add cache for Like and RLike when pattern is not static

2020-06-21 Thread GitBox


ulysses-you commented on pull request #26875:
URL: https://github.com/apache/spark/pull/26875#issuecomment-647244110


    test3 
   ```
   val df1 = spark.range(0, 2, 1, 200).selectExpr("uuid() as c1")
   val df2 = spark.range(0, 2, 1, 200).selectExpr("uuid() as c2")
   val start = System.currentTimeMillis
   df1.join(df2).where("c1 like c2").count()
   // elapsed time in ms, 3 runs
   // before  159226, 159147, 159587
   // after   159641, 160960, 160091
   println(System.currentTimeMillis - start)
   ```
   The worst case is that every row requires both a comparison and a compile. 
Even then, there seems to be only a small regression.






[GitHub] [spark] ulysses-you commented on pull request #26875: [SPARK-30245][SQL] Add cache for Like and RLike when pattern is not static

2020-06-19 Thread GitBox


ulysses-you commented on pull request #26875:
URL: https://github.com/apache/spark/pull/26875#issuecomment-646699595


   Env: centos 7, 40cores, 4GB
   
    test1 
   ```
   val df1 = spark.range(0, 2, 1, 200).selectExpr("uuid() as c1")
   val df2 = spark.range(0, 2, 1, 200).selectExpr("uuid() as c2")
   val start = System.currentTimeMillis
   df1.join(df2).where("c2 like c1").count()
   // elapsed time in ms, 3 runs
   // before  159228, 157541, 157721
   // after   14378,  11545,  11498
   println(System.currentTimeMillis - start)
   ```
    test2 
   ```
   // 17+1 length strings
   val df1 = spark.range(0, 2, 1, 200).selectExpr("concat('a', id%2) as c1")
   val df2 = spark.range(0, 2, 1, 200).selectExpr("concat('b', id%2) as c2")
   val start = System.currentTimeMillis
   df1.join(df2).where("c2 like c1").count()
   // elapsed time in ms, 3 runs
   // before  90054, 90350, 90283
   // after   13077, 10097, 9770
   println(System.currentTimeMillis - start)
   ```
   
   About a 10x performance improvement. It seems that an equality check is 
faster than compiling the pattern, and longer strings would make the 
improvement larger.
   cc @HyukjinKwon 
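
A rough way to compare the two costs in isolation is sketched below in plain Scala, without Spark. It is a naive timing loop, not a rigorous benchmark (no warm-up control, no JMH), so the numbers are only indicative. `CompareVsCompile`, `timeNanos` and `run` are hypothetical names, and a UUID string is used as the pattern only because the tests above use `uuid()`.

```scala
import java.util.regex.Pattern

object CompareVsCompile {
  // Returns the elapsed nanoseconds for running `body` `iterations` times.
  def timeNanos(iterations: Int)(body: => Any): Long = {
    val start = System.nanoTime()
    var i = 0
    while (i < iterations) { body; i += 1 }
    System.nanoTime() - start
  }

  // Returns (nanoseconds spent on equality checks, nanoseconds spent compiling).
  def run(iterations: Int = 100000): (Long, Long) = {
    val regex = java.util.UUID.randomUUID().toString
    // A distinct object with equal contents, so equality has to compare characters.
    val same = new String(regex)
    var sink = 0L // consume results so the work is not trivially dead code

    val equalsNanos = timeNanos(iterations) { if (regex == same) sink += 1 }
    val compileNanos = timeNanos(iterations) { sink += Pattern.compile(regex).flags() }
    (equalsNanos, compileNanos)
  }
}
```

Calling `CompareVsCompile.run()` returns the two elapsed times, which can be printed and compared. Results will vary by JVM and machine.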






[GitHub] [spark] ulysses-you commented on pull request #26875: [SPARK-30245][SQL] Add cache for Like and RLike when pattern is not static

2020-06-19 Thread GitBox


ulysses-you commented on pull request #26875:
URL: https://github.com/apache/spark/pull/26875#issuecomment-646556878


   @HyukjinKwon sorry, I missed many things. Thanks to @beliefer for running 
the benchmark, but something seems wrong with it: the right-hand side should 
not be a foldable value.
   
   The PR aims to improve the performance of the dynamic `like`, e.g. `select 
count(*) from t where c1 like c2`. It does not affect the static `like`.
   
   I will run a new benchmark with very long strings, as @HyukjinKwon 
suggested.


