[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...

2018-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22313
  
**[Test build #95637 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95637/testReport)**
 for PR 22313 at commit 
[`4acbaf8`](https://github.com/apache/spark/commit/4acbaf8be9e572c5cdbc61c49b488e8aef9e646b).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...

2018-09-03 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22313
  
Thank you for review and advice, @cloud-fan . It turns out that my initial 
assessment is not enough.

First of all, from the beginning, 
[SPARK-2883](https://github.com/apache/spark/commit/aa31e431fc09f0477f1c2351c6275769a31aca90#diff-6cac9bc2656e3782b0312dceb8c55d47R75)
 is designed as a recursive function like the following. Please see `tryLeft` 
and `tryRight`. It's a purely computation to check if it succeeds. There is no 
reuse here. So, I tried to cache the first two `tryLeft` and `tryRight` 
operation since they can be re-used.
```scala
val tryLeft = buildSearchArgument(left, newBuilder)
val tryRight = buildSearchArgument(right, newBuilder)
val conjunction = for {
  _ <- tryLeft
  _ <- tryRight
  lhs <- buildSearchArgument(left, builder.startAnd())
  rhs <- buildSearchArgument(right, lhs)
} yield rhs.end()
```

However, before that, `createFilter` generates the target tree with 
[reduceOption(And)](https://github.com/apache/spark/commit/aa31e431fc09f0477f1c2351c6275769a31aca90#diff-6cac9bc2656e3782b0312dceb8c55d47R35)
 as a deeply skewed tree. That was the root cause. I'll update this PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...

2018-09-02 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22313
  
Do you know why `createFilter` function has exponential time complexity? 
Let's make sure the algorithm is good before adding the cache.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...

2018-09-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22313
  
Thank you for review, @kiszk .

First, I don't want to hold the memory up after query completion. If we do, 
it will be a regression. So, I wanted `time` first.

Second, It's difficult to estimate the enough limit for the number of 
filters.
  - As we know codegen JVM limit issue. There are several attempts to 
generate a single complex query for wide tables (thousands of columns).
  - Spark's optimizer like `InferFiltersFromConstraints` adds more 
constraints like 'NotNull(col1)`. Usually, the number of filters becomes double 
here.
- Also, it's not a good design if we need to increase this limitation 
whenever we add a new optimizer like `InferFiltersFromConstraints`.
  - If the limit is too high, we waste the memory. If the limit is small, 
the eviction will bite us again.

In short, `time` was enough and the simplest for this purpose.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...

2018-09-01 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22313
  
General question: Why do we use `time` instead of `entry size` to control 
cache? I am neutral on this decision. I would like to hear the reason of this 
decision.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...

2018-09-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22313
  
Could you review this PR, @gatorsmile and @cloud-fan ?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...

2018-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22313
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...

2018-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22313
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95583/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...

2018-09-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22313
  
**[Test build #95583 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95583/testReport)**
 for PR 22313 at commit 
[`ac06b0c`](https://github.com/apache/spark/commit/ac06b0ca28d1da81fadbe0742a199b5e7b0de1ec).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  case class FilterWithTypeMap(filter: Filter, typeMap: Map[String, 
DataType])`
  * `  case class FilterWithTypeMap(filter: Filter, typeMap: Map[String, 
DataType])`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter`

2018-09-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22313
  
**[Test build #95583 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95583/testReport)**
 for PR 22313 at commit 
[`ac06b0c`](https://github.com/apache/spark/commit/ac06b0ca28d1da81fadbe0742a199b5e7b0de1ec).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter`

2018-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22313
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2762/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter`

2018-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22313
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org