[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-12-04 Thread aokolnychyi
Github user aokolnychyi commented on the issue: https://github.com/apache/spark/pull/23171 As @rxin said, if we introduce a separate expression for the switch-based approach, then we will need to modify other places. For example, `DataSourceStrategy$translateFilter`. So, integrating

[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-12-03 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23171 Basically logically there are only two expressions: In which handles arbitrary expressions, and InSet which handles expressions with literals. Both could work: (1) we provide two separate expressions

[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-12-03 Thread dbtsai
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/23171 @rxin `switch` in Java is still significantly faster than hash set even without boxing / unboxing problems when the number of elements are small. We were thinking about to have two implementations

[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-12-03 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/23171 How about, we create an `OptimizedIn`, and convert `In` to `OptimizedIn` if the list is all literals? `OptimizedIn` will pick `switch` or hash set based on the length of the list. ---

[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-12-03 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23171 I thought InSwitch logically is the same as InSet, in which all the child expressions are literals? On Mon, Dec 03, 2018 at 8:38 PM, Wenchen Fan < notificati...@github.com > wrote: >

[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-12-03 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/23171 I think `InSet` is not an optimized version of `In`, but just a way to separate the implementation for different conditions (the length of the list). Maybe we should do the same thing here,

[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-12-03 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23171 That probably means we should just optimize InSet to have the switch version though? Rather than do it in In? On Mon, Dec 03, 2018 at 8:20 PM, Wenchen Fan < notificati...@github.com > wrote:

[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-12-03 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/23171 @rxin I proposed the same thing before, but one problem is that, we only convert `In` to `InSet` when the length of list reaches the threshold. If the `switch` way is faster than hash set when

[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-12-03 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23171 I'm not a big fan of making the physical implementation of an expression very different depending on the situation. Why can't we just make InSet efficient and convert these cases to that? ---

[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-29 Thread mgaido91
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/23171 yes @aokolnychyi , I agree that the work can be done later (not in the scope of this PR). We can maybe just open a new JIRA about it so we won't forget. ---

[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-29 Thread aokolnychyi
Github user aokolnychyi commented on the issue: https://github.com/apache/spark/pull/23171 To sum up, I would set the goal of this PR is to make `In` expressions as efficient as possible for bytes/shorts/ints. Then we can do benchmarks for `In` vs `InSet` in

[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-29 Thread aokolnychyi
Github user aokolnychyi commented on the issue: https://github.com/apache/spark/pull/23171 @dbtsai @mgaido91 I think we can come back to this question once [SPARK-26203](https://issues.apache.org/jira/browse/SPARK-26203) is resolved. That JIRA will give us enough information about

[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-29 Thread mgaido91
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/23171 @dbtsai I see, it would be great, though, to check which is this threshold. My understanding is that the current solution has better performance even for several hundreds of items. If this number

[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-29 Thread dbtsai
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/23171 @cloud-fan as @aokolnychyi said, `switch` will still be faster than optimized `Set` without autoboxing when the number of elements are small. As a result, this PR is still very useful.

[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-29 Thread aokolnychyi
Github user aokolnychyi commented on the issue: https://github.com/apache/spark/pull/23171 @cloud-fan, yeah, let’s see if this PR is useful. The original idea wasn’t to avoid fixing autoboxing in `InSet`. `In` was tested on 250 numbers to prove O(1) time complexity on

[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-28 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/23171 I'm wondering if this is still useful after we fix the boxing issue in `InSet`. We can write a binary hash set for primitive types, like `LongToUnsafeRowMap`, which should have better

[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-28 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/23171 Also cc @ueshin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail:

[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23171 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23171 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99393/ Test PASSed. ---

[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-28 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23171 **[Test build #99393 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99393/testReport)** for PR 23171 at commit

[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-28 Thread dbtsai
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/23171 The approach looks great, and can significantly improve the performance. For Long, I agree that we should also implement binary search approach for `O(logn)` look up. Wondering which one

[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-28 Thread aokolnychyi
Github user aokolnychyi commented on the issue: https://github.com/apache/spark/pull/23171 @gatorsmile @cloud-fan @dongjoon-hyun @viirya --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For

[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-28 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23171 **[Test build #99393 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99393/testReport)** for PR 23171 at commit

[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23171 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23171 Test PASSed. Refer to this link for build results (access rights to CI server needed):