[GitHub] spark issue #22222: [SPARK-25083][SQL] Remove the type erasure hack in data ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/2

**[Test build #97865 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97865/testReport)** for PR 2 at commit [`fdc1efc`](https://github.com/apache/spark/commit/fdc1efcdefe4b9bf002ce43ed1dfd7ab258218ca).
* This patch **fails to generate documentation**.
* This patch merges cleanly.
* This patch adds no public classes.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/2

**[Test build #97828 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97828/testReport)** for PR 2 at commit [`fdc1efc`](https://github.com/apache/spark/commit/fdc1efcdefe4b9bf002ce43ed1dfd7ab258218ca).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/2

**[Test build #97845 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97845/testReport)** for PR 2 at commit [`fdc1efc`](https://github.com/apache/spark/commit/fdc1efcdefe4b9bf002ce43ed1dfd7ab258218ca).
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/2

@cloud-fan @rdblue I'd like to leave some comments and thoughts from looking into this again; I hope they help us decide on the next step.

Currently every plan assumes its input is `RDD[InternalRow]`, and the whole framework treats columnar reads as a special case. Also, `inputRDDs` is called not only in `WholeStageCodegenExec` but also by every parent physical node, so it is very easy to get into a mess with nested plans while debugging this fix.

So we have these three choices. The first two remove the cast entirely but may require many changes to `CodegenSupport`; the last one limits the changes but still leaves the cast problem:

1. Erase the type of `inputRDDs`, because we should allow both `RDD[InternalRow]` and `RDD[ColumnarBatch]` to be passed, mainly when a parent physical plan calls its child. This is implemented in the last commit of this PR: https://github.com/apache/spark/pull/2/files
2. Refactor the framework so that all plans deal with columnar batches.
3. Limit the changes to `ColumnarBatchScan` and leave `CodegenSupport` unchanged, which still leaves the cast problem. This is implemented in the first two commits of this PR: https://github.com/apache/spark/pull/2/files/7e88599dfc2caf177d12e890d588be68bdd3bc8e

If none of these makes sense, I'll just close this. Thanks.
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/2

Got it, I'll revert the changes to the file source in this commit. Thanks for your reply.
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/2

Can we do it for data source v2 first? It seems hard to fix the file source, as its reader function may lie about the return type. Let's see what the simplest fix is to remove the hack for data source v2.
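To make "lie about the return type" concrete, here is a minimal toy sketch of how JVM type erasure lets such a mismatch compile and only surface at runtime. It does not use Spark: `Row`, `Batch`, `lyingScan`, and both consumers are hypothetical stand-ins for `InternalRow`, `ColumnarBatch`, and a reader function, and a plain `Iterator` stands in for an RDD.

```scala
// Toy stand-ins (assumptions for illustration, NOT Spark classes).
final case class Row(id: Int)
final case class Batch(rows: Vector[Row])

object ErasureHack {
  // Declared as Iterator[Row], but the elements are really Batch objects.
  // The cast compiles because the element type is erased on the JVM.
  def lyingScan(): Iterator[Row] =
    Iterator(Batch(Vector(Row(1), Row(2)))).asInstanceOf[Iterator[Row]]

  // A consumer that knows about the hack casts back and counts rows.
  def consumeAsBatches(it: Iterator[Row]): Int =
    it.asInstanceOf[Iterator[Batch]].map(_.rows.size).sum

  // A consumer that trusts the declared type fails once it touches an element.
  def consumeAsRows(it: Iterator[Row]): Either[String, Int] =
    try Right(it.map(_.id).sum)
    catch { case _: ClassCastException => Left("ClassCastException") }
}
```

In this sketch only the consumer that knows about the hidden cast works; any other caller that trusts the declared element type hits a `ClassCastException` when it reads the first element.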
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/2

Test PASSed.
Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95422/
Test PASSed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/2

Merged build finished. Test PASSed.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/2

**[Test build #95422 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95422/testReport)** for PR 2 at commit [`fdc1efc`](https://github.com/apache/spark/commit/fdc1efcdefe4b9bf002ce43ed1dfd7ab258218ca).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/2

**[Test build #95422 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95422/testReport)** for PR 2 at commit [`fdc1efc`](https://github.com/apache/spark/commit/fdc1efcdefe4b9bf002ce43ed1dfd7ab258218ca).
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/2

Test PASSed.
Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2674/
Test PASSed.
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/2

@cloud-fan Thanks for your reply, Wenchen. I'm trying to achieve this in this commit; please take a look. Thanks.
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/2

+1 on @rdblue's idea. One point: we should use `ColumnarBatchScan.supportsBatch` to indicate whether the scan is columnar, instead of asking the RDD to report it.
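As a rough illustration of keeping the columnar-or-not decision on the scan node rather than on the data, the sketch below uses toy types only. `ScanNode`, `ToyScan`, `Planner`, `rowInput`, and `batchInput` are hypothetical names invented for this example; only the idea of a `supportsBatch` flag comes from the comment above, and nothing here reflects Spark's actual `ColumnarBatchScan` API.

```scala
// Toy stand-ins (assumptions for illustration, NOT Spark classes).
final case class Row(id: Int)
final case class Batch(rows: Vector[Row])

// Hypothetical scan node: the node itself says which typed input to use.
trait ScanNode {
  def supportsBatch: Boolean
  def rowInput: Iterator[Row]
  def batchInput: Iterator[Batch]
}

final class ToyScan(data: Vector[Row], batchSize: Int, val supportsBatch: Boolean)
    extends ScanNode {
  require(batchSize > 0, "batchSize must be positive")
  def rowInput: Iterator[Row] = data.iterator
  // Group rows into fixed-size batches; the last batch may be smaller.
  def batchInput: Iterator[Batch] = data.grouped(batchSize).map(Batch(_))
}

object Planner {
  // The caller branches on the node's flag and never inspects or casts the data.
  def countRows(scan: ScanNode): Int =
    if (scan.supportsBatch) scan.batchInput.map(_.rows.size).sum
    else scan.rowInput.size
}
```

Both branches receive an iterator whose element type is known statically, so no `asInstanceOf` is needed in this toy version.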
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/2

@xuanyuanking, while this does remove the hack, it doesn't address the underlying problem. The problem is that there is a single RDD, which may contain `InternalRow` or may contain `ColumnarBatch`. Generated code knows how to differentiate between the two and use the RDD contents correctly.

While this is an improvement because it uses the actual type of records in the RDD, the work that needs to be done is to update the columnar case so that it does return an `RDD[InternalRow]` for anyone that accesses data using that RDD, and then update the generated code to detect a data source RDD and access the underlying `RDD[ColumnarBatch]`.

Here's some pseudo-code to demonstrate what I mean. The current code does something like this with a cast. Your change wouldn't fix the need to cast to `RDD[ColumnarBatch]`:

```scala
def doExecute(rdd: DataSourceRDD[InternalRow]): Unit = { // with your change, DataSourceRDD[_]
  if (rdd.isColumnar) {
    // unchecked cast: the declared element type does not match the contents
    doExecuteColumnarBatch(rdd.asInstanceOf[RDD[ColumnarBatch]])
  } else {
    doExecuteRows(rdd)
  }
}
```

I think that should be changed to something like this, which is type safe:

```scala
def doExecute(rdd: DataSourceRDD[InternalRow]): Unit = {
  if (rdd.isColumnar) {
    // typed accessor instead of a cast
    doExecuteColumnarBatch(rdd.getColumnBatchRDD)
  } else {
    doExecuteRows(rdd)
  }
}
```
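A self-contained toy version of that type-safe shape might look like the following. It is only a sketch under stated assumptions: `ScanData`, `RowScan`, `ColumnarScan`, and `TypeSafeDispatch` are hypothetical names, `Iterator` stands in for an RDD, and the optional `batches` accessor plays the role of the pseudo-code's `isColumnar` check plus `getColumnBatchRDD`.

```scala
// Toy stand-ins (assumptions for illustration, NOT Spark classes).
final case class Row(values: Vector[Any])
final case class Batch(rows: Vector[Row])

// Every scan can be read as rows; a columnar scan also exposes typed batches.
sealed trait ScanData {
  def rows: Iterator[Row]
  def batches: Option[Iterator[Batch]]
}

final class RowScan(data: Vector[Row]) extends ScanData {
  def rows: Iterator[Row] = data.iterator
  def batches: Option[Iterator[Batch]] = None
}

final class ColumnarScan(data: Vector[Batch]) extends ScanData {
  // Callers that only understand rows still get a correct row view.
  def rows: Iterator[Row] = data.iterator.flatMap(_.rows)
  def batches: Option[Iterator[Batch]] = Some(data.iterator)
}

object TypeSafeDispatch {
  // Dispatch on the typed accessor; no asInstanceOf anywhere.
  def execute(scan: ScanData): String = scan.batches match {
    case Some(bs) => s"columnar: ${bs.map(_.rows.size).sum} rows"
    case None     => s"rows: ${scan.rows.size} rows"
  }
}
```

The point of the design is that a consumer unaware of batches can always call `rows`, while a batch-aware consumer obtains an `Iterator[Batch]` through a typed method rather than a cast.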
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/2

Test PASSed.
Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95261/
Test PASSed.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/2

**[Test build #95261 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95261/testReport)** for PR 2 at commit [`7e88599`](https://github.com/apache/spark/commit/7e88599dfc2caf177d12e890d588be68bdd3bc8e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/2

**[Test build #95261 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95261/testReport)** for PR 2 at commit [`7e88599`](https://github.com/apache/spark/commit/7e88599dfc2caf177d12e890d588be68bdd3bc8e).
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/2

Test PASSed.
Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2556/
Test PASSed.
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/2

cc @cloud-fan and @rdblue: please have a look when you have time. If this PR doesn't match your expectations, I'll close it soon. Thanks!
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/2

Test PASSed.
Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95217/
Test PASSed.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/2

**[Test build #95217 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95217/testReport)** for PR 2 at commit [`992a08b`](https://github.com/apache/spark/commit/992a08b1d77d59daeac95c67d07e5b8efe20ce20).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `abstract class DataSourceRDD[T: ClassTag](`
  * `class DataSourceRowRDD(`
  * `class DataSourceColumnarBatchRDD(`
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/2

**[Test build #95217 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95217/testReport)** for PR 2 at commit [`992a08b`](https://github.com/apache/spark/commit/992a08b1d77d59daeac95c67d07e5b8efe20ce20).
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/2

Test PASSed.
Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2534/
Test PASSed.