[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22313

**[Test build #95637 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95637/testReport)** for PR 22313 at commit [`4acbaf8`](https://github.com/apache/spark/commit/4acbaf8be9e572c5cdbc61c49b488e8aef9e646b).

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22313

Thank you for the review and advice, @cloud-fan. It turns out that my initial assessment was not sufficient. First of all, from the beginning, [SPARK-2883](https://github.com/apache/spark/commit/aa31e431fc09f0477f1c2351c6275769a31aca90#diff-6cac9bc2656e3782b0312dceb8c55d47R75) was designed as a recursive function like the following. Please see `tryLeft` and `tryRight`. It is purely a computation to check whether the build succeeds; there is no reuse here. So I tried to cache the first two `tryLeft` and `tryRight` operations, since they can be reused.

```scala
val tryLeft = buildSearchArgument(left, newBuilder)
val tryRight = buildSearchArgument(right, newBuilder)

val conjunction = for {
  _ <- tryLeft
  _ <- tryRight
  lhs <- buildSearchArgument(left, builder.startAnd())
  rhs <- buildSearchArgument(right, lhs)
} yield rhs.end()
```

However, even before that, `createFilter` generates the target tree with [reduceOption(And)](https://github.com/apache/spark/commit/aa31e431fc09f0477f1c2351c6275769a31aca90#diff-6cac9bc2656e3782b0312dceb8c55d47R35) as a deeply skewed tree. That was the root cause. I'll update this PR.
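The interaction of the two observations above (building each child twice per `And` node, on a left-deep tree from `reduceOption(And)`) can be sketched as follows. This is an illustrative reduction, not the actual ORC filter code: `Expr`, `build`, and `skewed` are made-up names standing in for the predicate tree and `buildSearchArgument`.

```scala
// A minimal binary predicate tree, standing in for the real Filter tree.
sealed trait Expr
case class Leaf(name: String) extends Expr
case class And(left: Expr, right: Expr) extends Expr

var calls = 0L

// Mirrors the tryLeft/tryRight pattern: every And node fully builds each
// child twice (once as a trial, once for the real startAnd() construction).
def build(e: Expr): Unit = {
  calls += 1
  e match {
    case Leaf(_) => ()
    case And(l, r) =>
      build(l); build(r) // tryLeft / tryRight
      build(l); build(r) // actual construction
  }
}

// reduceOption(And) over n filters yields a left-deep (skewed) tree:
// And(And(And(f1, f2), f3), f4) ...
def skewed(n: Int): Expr =
  (1 to n).map(i => Leaf(s"f$i"): Expr).reduce(And(_, _))

def countCalls(n: Int): Long = { calls = 0L; build(skewed(n)); calls }
```

On the skewed tree the work satisfies T(n) = 2·T(n−1) + 3, i.e. T(n) = 2^(n+1) − 3 calls for n filters, which is exactly the exponential blow-up the cache (or a balanced tree) avoids: caching a subtree's result makes the second `build(l)`/`build(r)` free.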
[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22313

Do you know why the `createFilter` function has exponential time complexity? Let's make sure the algorithm is good before adding the cache.
[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22313

Thank you for the review, @kiszk. First, I don't want to hold on to the memory after query completion; if we did, it would be a regression. So I wanted `time` first. Second, it is difficult to estimate a sufficient limit for the number of filters:

- As we know from the codegen JVM limit issue, there are several attempts to generate a single complex query over wide tables (thousands of columns).
- Spark optimizer rules like `InferFiltersFromConstraints` add more constraints such as `NotNull(col1)`. The number of filters usually doubles here.
- It is also not a good design if we need to raise the limit whenever we add a new optimizer rule like `InferFiltersFromConstraints`.
- If the limit is too high, we waste memory; if it is too small, eviction bites us again.

In short, `time` was sufficient and the simplest choice for this purpose.
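The time-vs-size trade-off described above can be sketched with a minimal age-bounded cache. This is a hypothetical illustration, not the PR's actual implementation (which would more likely use Guava's `CacheBuilder.expireAfterAccess`, already a Spark dependency); all names and the TTL are assumptions. The point is that memory is bounded by entry age rather than by a filter-count limit that optimizer rules keep outgrowing.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical time-bounded cache: entries older than ttlMillis are
// recomputed instead of reused, so memory is released by time, not capped
// by a fixed entry count. The `now` parameter is injectable for testing.
class TimeBoundedCache[K, V](ttlMillis: Long,
                             now: () => Long = () => System.currentTimeMillis()) {
  private case class Entry(value: V, insertedAt: Long)
  private val map = new ConcurrentHashMap[K, Entry]()

  def getOrCompute(key: K)(compute: => V): V = {
    val t = now()
    val cached = map.get(key)
    if (cached != null && t - cached.insertedAt < ttlMillis) {
      cached.value // fresh enough: reuse without recomputing
    } else {
      val v = compute // stale or missing: recompute and refresh timestamp
      map.put(key, Entry(v, t))
      v
    }
  }
}
```

A size-based cache would instead need `maximumSize(n)`, and as the comment above notes, picking `n` is awkward: too high wastes memory, too low causes evictions exactly on the deep filter trees the cache is meant to help.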
[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22313

General question: why do we use `time` instead of `entry size` to control the cache? I am neutral on this decision; I would just like to hear the reasoning behind it.
[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22313

Could you review this PR, @gatorsmile and @cloud-fan?
[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22313

Merged build finished. Test PASSed.
[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22313

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95583/
[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22313

**[Test build #95583 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95583/testReport)** for PR 22313 at commit [`ac06b0c`](https://github.com/apache/spark/commit/ac06b0ca28d1da81fadbe0742a199b5e7b0de1ec).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class FilterWithTypeMap(filter: Filter, typeMap: Map[String, DataType])`
[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter`
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22313

**[Test build #95583 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95583/testReport)** for PR 22313 at commit [`ac06b0c`](https://github.com/apache/spark/commit/ac06b0ca28d1da81fadbe0742a199b5e7b0de1ec).
[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter`
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22313

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2762/
[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter`
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22313

Merged build finished. Test PASSed.