Github user aokolnychyi commented on the issue:
https://github.com/apache/spark/pull/23171
As @rxin said, if we introduce a separate expression for the switch-based
approach, then we will need to modify other places. For example,
`DataSourceStrategy$translateFilter`. So, integrating
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/23171
Basically, there are logically only two expressions: `In`, which handles
arbitrary expressions, and `InSet`, which handles expressions with literals. Both
could work: (1) we provide two separate expressions
Github user dbtsai commented on the issue:
https://github.com/apache/spark/pull/23171
@rxin `switch` in Java is still significantly faster than a hash set, even
without boxing/unboxing problems, when the number of elements is small. We
were thinking about having two implementations
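To illustrate the distinction being discussed, here is a minimal, hypothetical comparison (not Spark code): a `switch` over primitive ints needs no hashing and no boxing, while `HashSet<Integer>.contains(int)` autoboxes its argument on every probe. The literal values `1, 5, 9` are arbitrary examples.

```java
import java.util.HashSet;
import java.util.Set;

public class SwitchVsSet {
    // Membership test the compiler turns into a lookupswitch/tableswitch
    // over primitive ints: no hashing, no allocation.
    static boolean inSwitch(int v) {
        switch (v) {
            case 1:
            case 5:
            case 9:
                return true;
            default:
                return false;
        }
    }

    // Equivalent test through a boxed HashSet.
    static final Set<Integer> SET = new HashSet<>();
    static {
        SET.add(1);
        SET.add(5);
        SET.add(9);
    }

    static boolean inSet(int v) {
        return SET.contains(v); // autoboxing of v happens here
    }

    public static void main(String[] args) {
        assert inSwitch(5) && !inSwitch(6);
        assert inSet(5) && !inSet(6);
    }
}
```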
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/23171
How about, we create an `OptimizedIn`, and convert `In` to `OptimizedIn` if
the list is all literals? `OptimizedIn` will pick `switch` or hash set based on
the length of the list.
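The dispatch idea could be sketched roughly as follows. Only the name `OptimizedIn` comes from the comment above; the threshold value and everything else here are hypothetical, and a linear scan stands in for the generated `switch` (a runtime-built switch is not expressible in plain Java, but Spark's codegen could emit one).

```java
import java.util.HashSet;
import java.util.Set;

public class OptimizedInSketch {
    // Assumed cutoff; the real value would be chosen by benchmarks.
    static final int SWITCH_THRESHOLD = 10;

    interface Membership { boolean contains(int v); }

    static Membership build(int[] literals) {
        if (literals.length <= SWITCH_THRESHOLD) {
            // Short literal list: scan stands in for a generated switch.
            int[] copy = literals.clone();
            return v -> {
                for (int lit : copy) {
                    if (lit == v) return true;
                }
                return false;
            };
        }
        // Long literal list: fall back to a hash set.
        Set<Integer> set = new HashSet<>();
        for (int lit : literals) set.add(lit);
        return set::contains;
    }

    public static void main(String[] args) {
        Membership small = build(new int[] {1, 2, 3});
        assert small.contains(2) && !small.contains(4);
    }
}
```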
---
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/23171
I thought InSwitch logically is the same as InSet, in which all the child
expressions are literals?
On Mon, Dec 03, 2018 at 8:38 PM, Wenchen Fan < notificati...@github.com >
wrote:
>
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/23171
I think `InSet` is not an optimized version of `In`, but just a way to
separate the implementation for different conditions (the length of the list).
Maybe we should do the same thing here,
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/23171
That probably means we should just optimize InSet to have the switch
version though? Rather than do it in In?
On Mon, Dec 03, 2018 at 8:20 PM, Wenchen Fan < notificati...@github.com >
wrote:
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/23171
@rxin I proposed the same thing before, but one problem is that we only
convert `In` to `InSet` when the length of the list reaches the threshold. If the
`switch` way is faster than hash set when
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/23171
I'm not a big fan of making the physical implementation of an expression
very different depending on the situation. Why can't we just make InSet
efficient and convert these cases to that?
---
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/23171
yes @aokolnychyi , I agree that the work can be done later (not in the
scope of this PR). We can maybe just open a new JIRA about it so we won't
forget.
---
Github user aokolnychyi commented on the issue:
https://github.com/apache/spark/pull/23171
To sum up, I would say the goal of this PR is to make `In` expressions as
efficient as possible for bytes/shorts/ints. Then we can do benchmarks for `In`
vs `InSet` in
Github user aokolnychyi commented on the issue:
https://github.com/apache/spark/pull/23171
@dbtsai @mgaido91 I think we can come back to this question once
[SPARK-26203](https://issues.apache.org/jira/browse/SPARK-26203) is resolved.
That JIRA will give us enough information about
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/23171
@dbtsai I see; it would be great, though, to check what this threshold is.
My understanding is that the current solution has better performance even for
several hundreds of items. If this number
Github user dbtsai commented on the issue:
https://github.com/apache/spark/pull/23171
@cloud-fan as @aokolnychyi said, `switch` will still be faster than an
optimized `Set` without autoboxing when the number of elements is small. As a
result, this PR is still very useful.
Github user aokolnychyi commented on the issue:
https://github.com/apache/spark/pull/23171
@cloud-fan, yeah, let's see if this PR is useful.
The original idea wasn't to avoid fixing autoboxing in `InSet`. `In` was
tested on 250 numbers to prove O(1) time complexity on
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/23171
I'm wondering if this is still useful after we fix the boxing issue in
`InSet`. We can write a binary hash set for primitive types, like
`LongToUnsafeRowMap`, which should have better
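The kind of boxing-free structure alluded to here could look like the following: a minimal open-addressing hash set over primitive longs. This is purely illustrative, not `LongToUnsafeRowMap` itself; the capacity and hash-mixing choices are assumptions.

```java
public class LongHashSet {
    private final long[] slots;
    private final boolean[] used;
    private final int mask;

    LongHashSet(long[] values) {
        // Power-of-two capacity at least 2x the element count.
        int cap = Integer.highestOneBit(Math.max(4, values.length * 2) - 1) << 1;
        slots = new long[cap];
        used = new boolean[cap];
        mask = cap - 1;
        for (long v : values) insert(v);
    }

    private void insert(long v) {
        int i = hash(v) & mask;
        while (used[i]) {
            if (slots[i] == v) return; // already present
            i = (i + 1) & mask;
        }
        used[i] = true;
        slots[i] = v;
    }

    boolean contains(long v) {
        int i = hash(v) & mask;
        while (used[i]) {
            if (slots[i] == v) return true;
            i = (i + 1) & mask; // linear probing, no boxing anywhere
        }
        return false;
    }

    private static int hash(long v) {
        return (int) (v ^ (v >>> 32)) * 0x9E3779B9; // simple bit mix
    }

    public static void main(String[] args) {
        LongHashSet s = new LongHashSet(new long[] {1L, 42L, -7L});
        assert s.contains(42L) && !s.contains(2L);
    }
}
```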
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/23171
Also cc @ueshin
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail:
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23171
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23171
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99393/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23171
**[Test build #99393 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99393/testReport)**
for PR 23171 at commit
Github user dbtsai commented on the issue:
https://github.com/apache/spark/pull/23171
The approach looks great, and can significantly improve the performance.
For Long, I agree that we should also implement a binary search approach for
`O(log n)` lookup.
Wondering which one
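The `O(log n)` lookup mentioned above could be sketched like this (illustrative code, not the PR's implementation): sort the literal values once at build time, then answer each membership probe with `Arrays.binarySearch` on the primitive array, which never boxes.

```java
import java.util.Arrays;

public class SortedLongIn {
    private final long[] sorted;

    SortedLongIn(long[] literals) {
        sorted = literals.clone();
        Arrays.sort(sorted); // one-time O(n log n) build cost
    }

    boolean contains(long v) {
        // Arrays.binarySearch returns a non-negative index on a hit.
        return Arrays.binarySearch(sorted, v) >= 0; // O(log n) per probe
    }

    public static void main(String[] args) {
        SortedLongIn in = new SortedLongIn(new long[] {9L, 1L, 5L});
        assert in.contains(5L) && !in.contains(4L);
    }
}
```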
Github user aokolnychyi commented on the issue:
https://github.com/apache/spark/pull/23171
@gatorsmile @cloud-fan @dongjoon-hyun @viirya
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23171
**[Test build #99393 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99393/testReport)**
for PR 23171 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23171
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23171
Test PASSed.
Refer to this link for build results (access rights to CI server needed):