[GitHub] [spark] zhengruifeng closed pull request #27427: [SPARK-30700][ML] NaiveBayesModel predict optimization
zhengruifeng closed pull request #27427: [SPARK-30700][ML] NaiveBayesModel predict optimization URL: https://github.com/apache/spark/pull/27427 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29.
AmplabJenkins removed a comment on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29. URL: https://github.com/apache/spark/pull/27426#issuecomment-581002857 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117710/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29.
AmplabJenkins removed a comment on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29. URL: https://github.com/apache/spark/pull/27426#issuecomment-581002848 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29.
AmplabJenkins commented on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29. URL: https://github.com/apache/spark/pull/27426#issuecomment-581002857 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117710/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29.
AmplabJenkins commented on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29. URL: https://github.com/apache/spark/pull/27426#issuecomment-581002848 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA removed a comment on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29.
SparkQA removed a comment on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29. URL: https://github.com/apache/spark/pull/27426#issuecomment-580991397 **[Test build #117710 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117710/testReport)** for PR 27426 at commit [`a54d262`](https://github.com/apache/spark/commit/a54d2629bccea6c8cc18006fcdd2142c820603b9).
[GitHub] [spark] AmplabJenkins removed a comment on issue #27425: [SPARK-29543][SS][FOLLOWUP] Move `spark.sql.streaming.ui.*` configs to StaticSQLConf
AmplabJenkins removed a comment on issue #27425: [SPARK-29543][SS][FOLLOWUP] Move `spark.sql.streaming.ui.*` configs to StaticSQLConf URL: https://github.com/apache/spark/pull/27425#issuecomment-581002668 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117709/ Test PASSed.
[GitHub] [spark] SparkQA commented on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29.
SparkQA commented on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29. URL: https://github.com/apache/spark/pull/27426#issuecomment-581002709 **[Test build #117710 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117710/testReport)** for PR 27426 at commit [`a54d262`](https://github.com/apache/spark/commit/a54d2629bccea6c8cc18006fcdd2142c820603b9).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27425: [SPARK-29543][SS][FOLLOWUP] Move `spark.sql.streaming.ui.*` configs to StaticSQLConf
AmplabJenkins removed a comment on issue #27425: [SPARK-29543][SS][FOLLOWUP] Move `spark.sql.streaming.ui.*` configs to StaticSQLConf URL: https://github.com/apache/spark/pull/27425#issuecomment-581002667 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27425: [SPARK-29543][SS][FOLLOWUP] Move `spark.sql.streaming.ui.*` configs to StaticSQLConf
AmplabJenkins commented on issue #27425: [SPARK-29543][SS][FOLLOWUP] Move `spark.sql.streaming.ui.*` configs to StaticSQLConf URL: https://github.com/apache/spark/pull/27425#issuecomment-581002668 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117709/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27425: [SPARK-29543][SS][FOLLOWUP] Move `spark.sql.streaming.ui.*` configs to StaticSQLConf
AmplabJenkins commented on issue #27425: [SPARK-29543][SS][FOLLOWUP] Move `spark.sql.streaming.ui.*` configs to StaticSQLConf URL: https://github.com/apache/spark/pull/27425#issuecomment-581002667 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA removed a comment on issue #27425: [SPARK-29543][SS][FOLLOWUP] Move `spark.sql.streaming.ui.*` configs to StaticSQLConf
SparkQA removed a comment on issue #27425: [SPARK-29543][SS][FOLLOWUP] Move `spark.sql.streaming.ui.*` configs to StaticSQLConf URL: https://github.com/apache/spark/pull/27425#issuecomment-580981804 **[Test build #117709 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117709/testReport)** for PR 27425 at commit [`e4c3b38`](https://github.com/apache/spark/commit/e4c3b388f8be051f3ef619f08a636975d1156d0f).
[GitHub] [spark] SparkQA commented on issue #27425: [SPARK-29543][SS][FOLLOWUP] Move `spark.sql.streaming.ui.*` configs to StaticSQLConf
SparkQA commented on issue #27425: [SPARK-29543][SS][FOLLOWUP] Move `spark.sql.streaming.ui.*` configs to StaticSQLConf URL: https://github.com/apache/spark/pull/27425#issuecomment-581002533 **[Test build #117709 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117709/testReport)** for PR 27425 at commit [`e4c3b38`](https://github.com/apache/spark/commit/e4c3b388f8be051f3ef619f08a636975d1156d0f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27427: [SPARK-30700][ML] NaiveBayesModel predict optimization
AmplabJenkins removed a comment on issue #27427: [SPARK-30700][ML] NaiveBayesModel predict optimization URL: https://github.com/apache/spark/pull/27427#issuecomment-581001696 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117711/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27427: [SPARK-30700][ML] NaiveBayesModel predict optimization
AmplabJenkins removed a comment on issue #27427: [SPARK-30700][ML] NaiveBayesModel predict optimization URL: https://github.com/apache/spark/pull/27427#issuecomment-581001695 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27427: [SPARK-30700][ML] NaiveBayesModel predict optimization
AmplabJenkins commented on issue #27427: [SPARK-30700][ML] NaiveBayesModel predict optimization URL: https://github.com/apache/spark/pull/27427#issuecomment-581001695 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27427: [SPARK-30700][ML] NaiveBayesModel predict optimization
AmplabJenkins commented on issue #27427: [SPARK-30700][ML] NaiveBayesModel predict optimization URL: https://github.com/apache/spark/pull/27427#issuecomment-581001696 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117711/ Test PASSed.
[GitHub] [spark] SparkQA removed a comment on issue #27427: [SPARK-30700][ML] NaiveBayesModel predict optimization
SparkQA removed a comment on issue #27427: [SPARK-30700][ML] NaiveBayesModel predict optimization URL: https://github.com/apache/spark/pull/27427#issuecomment-580997146 **[Test build #117711 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117711/testReport)** for PR 27427 at commit [`7bee0b0`](https://github.com/apache/spark/commit/7bee0b03f030f108f4db1b2b54daa7b4238e027e).
[GitHub] [spark] SparkQA commented on issue #27427: [SPARK-30700][ML] NaiveBayesModel predict optimization
SparkQA commented on issue #27427: [SPARK-30700][ML] NaiveBayesModel predict optimization URL: https://github.com/apache/spark/pull/27427#issuecomment-581001642 **[Test build #117711 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117711/testReport)** for PR 27427 at commit [`7bee0b0`](https://github.com/apache/spark/commit/7bee0b03f030f108f4db1b2b54daa7b4238e027e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] dongjoon-hyun commented on issue #27366: [SPARK-30648][SQL] Support filters pushdown in JSON datasource
dongjoon-hyun commented on issue #27366: [SPARK-30648][SQL] Support filters pushdown in JSON datasource URL: https://github.com/apache/spark/pull/27366#issuecomment-581001372 Okay. Then, let's talk later for this PR. Thanks.
[GitHub] [spark] dongjoon-hyun removed a comment on issue #27366: [SPARK-30648][SQL] Support filters pushdown in JSON datasource
dongjoon-hyun removed a comment on issue #27366: [SPARK-30648][SQL] Support filters pushdown in JSON datasource URL: https://github.com/apache/spark/pull/27366#issuecomment-580969052 Hi, @MaxGekk . Could you update once more when you have a chance?
[GitHub] [spark] zhengruifeng commented on issue #27322: [SPARK-26111][ML][WIP] Support F-value between label/feature for continuous distribution feature selection
zhengruifeng commented on issue #27322: [SPARK-26111][ML][WIP] Support F-value between label/feature for continuous distribution feature selection URL: https://github.com/apache/spark/pull/27322#issuecomment-581001211
> Currently, this WIP PR only has FValueRegressionSelector implemented. FValueClassificationSelector is very similar. The calculation for classification f value is a little more complicated.

I think `f_classif` is different enough for another PR.
[GitHub] [spark] dongjoon-hyun commented on issue #27424: [SPARK-29138][PYTHON][TEST] Increase timeout of StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy
dongjoon-hyun commented on issue #27424: [SPARK-29138][PYTHON][TEST] Increase timeout of StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy URL: https://github.com/apache/spark/pull/27424#issuecomment-581001148 Thank you, @HyukjinKwon !
[GitHub] [spark] zhengruifeng commented on a change in pull request #27322: [SPARK-26111][ML][WIP] Support F-value between label/feature for continuous distribution feature selection
zhengruifeng commented on a change in pull request #27322: [SPARK-26111][ML][WIP] Support F-value between label/feature for continuous distribution feature selection
URL: https://github.com/apache/spark/pull/27322#discussion_r373762412

## File path: mllib/src/main/scala/org/apache/spark/ml/feature/FRegressionSelector.scala ##
@@ -0,0 +1,357 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.collection.mutable.ArrayBuilder
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml._
+import org.apache.spark.ml.attribute.{AttributeGroup, _}
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.stat.FRegressionTest
+import org.apache.spark.ml.util._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql._
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
+
+
+/**
+ * Params for [[FRegressionSelector]] and [[FRegressionSelectorModel]].
+ * TODO: put all these params in shared.scala
+ * TODO: Not include fdr and fwe for now. Need to check if these two are applicable!!!
+ */
+private[feature] trait FRegressionSelectorParams extends Params
+  with HasFeaturesCol with HasOutputCol with HasLabelCol {
+
+  /**
+   * Number of features that selector will select, ordered by ascending p-value. If the
+   * number of features is less than numTopFeatures, then this will select all features.
+   * Only applicable when selectorType = "numTopFeatures".
+   * The default value of numTopFeatures is 50.
+   *
+   * @group param
+   */
+  @Since("3.1.0")
+  final val numTopFeatures = new IntParam(this, "numTopFeatures",
+    "Number of features that selector will select, ordered by ascending p-value. If the" +
+      " number of features is < numTopFeatures, then this will select all features.",
+    ParamValidators.gtEq(1))
+  setDefault(numTopFeatures -> 50)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getNumTopFeatures: Int = $(numTopFeatures)
+
+  /**
+   * Percentile of features that selector will select, ordered by statistics value descending.
+   * Only applicable when selectorType = "percentile".
+   * Default value is 0.1.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val percentile = new DoubleParam(this, "percentile",
+    "Percentile of features that selector will select, ordered by ascending p-value.",
+    ParamValidators.inRange(0, 1))
+  setDefault(percentile -> 0.1)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getPercentile: Double = $(percentile)
+
+  /**
+   * The highest p-value for features to be kept.
+   * Only applicable when selectorType = "fpr".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fpr = new DoubleParam(this, "fpr", "The highest p-value for features to be kept.",
+    ParamValidators.inRange(0, 1))
+  setDefault(fpr -> 0.05)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getFpr: Double = $(fpr)
+
+  /**
+   * The selector type of the FRegressionSelector.
+   * Supported options: "numTopFeatures" (default), "percentile", "fpr".
+   * @group param
+   */
+  @Since("3.1.0")
+  final val selectorType = new Param[String](this, "selectorType",
+    "The selector type of the FRegressionSelector. " +
+      "Supported options: numTopFeatures, percentile, fpr")
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getSelectorType: String = $(selectorType)
+}
+
+/**
+ * Regression F-value Selector
+ * This feature selector is for regressions where features are continuous and labels are continuous.
+ * ANOVA F-value Classification Selector is for when features are continuous and labels are
+ * categorical.
+ * Currently, Chi-Squared is for categorical features and categorical labels
+ * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`
+ *  - `numTopFeatures` chooses a fixed number of top features according to a fRegression test.
+ *  - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
+ *  - `fpr` c
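
As a reading aid for the three `selectorType` options documented in the quoted params, here is a minimal standalone sketch of how such a selection could work on an array of per-feature p-values. It is not the PR's implementation: the names `SelectorSketch` and `select` are hypothetical, the strict `<` comparison for `fpr` is an assumption (the quoted doc only says "the highest p-value for features to be kept"), and truncating `numFeatures * percentile` to an integer is likewise assumed.

```scala
object SelectorSketch {
  /**
   * Hypothetical helper: picks feature indices from per-feature p-values.
   * Returns the chosen indices in ascending index order.
   */
  def select(
      pValues: Array[Double],
      selectorType: String,
      numTopFeatures: Int = 50,
      percentile: Double = 0.1,
      fpr: Double = 0.05): Array[Int] = {
    // Feature indices ordered by ascending p-value (most significant first).
    val ranked = pValues.zipWithIndex.sortBy(_._1).map(_._2)
    val chosen = selectorType match {
      // Fixed number of top features; take() keeps all if fewer are available.
      case "numTopFeatures" => ranked.take(numTopFeatures)
      // Fraction of all features (assumption: count is truncated to an Int).
      case "percentile" => ranked.take((pValues.length * percentile).toInt)
      // Keep features whose p-value is below the threshold (assumption: strict <).
      case "fpr" => ranked.filter(i => pValues(i) < fpr)
      case other => throw new IllegalArgumentException(s"Unknown selectorType: $other")
    }
    chosen.sorted
  }
}
```

For example, with p-values `Array(0.20, 0.01, 0.04, 0.50)`, asking for `numTopFeatures = 2` keeps the features at indices 1 and 2, while `fpr = 0.3` keeps indices 0, 1 and 2.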
[GitHub] [spark] zhengruifeng commented on a change in pull request #27322: [SPARK-26111][ML][WIP] Support F-value between label/feature for continuous distribution feature selection
zhengruifeng commented on a change in pull request #27322: [SPARK-26111][ML][WIP] Support F-value between label/feature for continuous distribution feature selection
URL: https://github.com/apache/spark/pull/27322#discussion_r373762346

## File path: mllib/src/main/scala/org/apache/spark/ml/stat/FRegressionTest.scala ##
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import org.apache.commons.math3.distribution.FDistribution
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.feature.LabeledPoint
+import org.apache.spark.ml.linalg.{Vector, VectorUDT}
+import org.apache.spark.ml.util.SchemaUtils
+import org.apache.spark.mllib.stat.{Statistics => OldStatistics}
+import org.apache.spark.sql.Dataset
+import org.apache.spark.sql.functions.col
+
+
+/**
+ * F-Regression Test
+ */
+@Since("3.1.0")
+object FRegressionTest {
+
+  case class FRegressionTestResult(
+      pValue: Double,
+      degreesOfFreedom: Int,
+      fValue: Double)
+
+  /**
+   * @param dataset DataFrame of continuous labels and continuous features.
+   * @param featuresCol Name of features column in dataset, of type `Vector` (`VectorUDT`)
+   * @param labelCol Name of label column in dataset, of any numerical type
+   * @return Array containing the FRegressionTestResult for every feature against the label.
+   */
+  @Since("3.1.0")
+  def test_regression(dataset: Dataset[_], featuresCol: String, labelCol: String):

Review comment:
this method name `test_regression` should follow Camel-Case
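
For context on the quantities in `FRegressionTestResult`, here is a minimal standalone sketch of a univariate regression F-value. It is not the PR's code. It assumes the definition used by scikit-learn's `f_regression`: with Pearson correlation r between one feature and the label over n samples, F = r^2 / (1 - r^2) * (n - 2), with n - 2 degrees of freedom. The names `FValueSketch`, `pearson` and `fValue` are hypothetical, and the p-value step (for which the quoted file imports commons-math3 `FDistribution`) is left out so the example has no dependencies.

```scala
object FValueSketch {
  /** Pearson correlation between two equally sized samples. */
  def pearson(x: Array[Double], y: Array[Double]): Double = {
    require(x.length == y.length && x.length > 2, "need equal-length samples with n > 2")
    val n = x.length
    val meanX = x.sum / n
    val meanY = y.sum / n
    var sxy = 0.0
    var sxx = 0.0
    var syy = 0.0
    var i = 0
    while (i < n) {
      val dx = x(i) - meanX
      val dy = y(i) - meanY
      sxy += dx * dy
      sxx += dx * dx
      syy += dy * dy
      i += 1
    }
    sxy / math.sqrt(sxx * syy)
  }

  /**
   * Returns (fValue, degreesOfFreedom) for one feature against the label.
   * A perfectly correlated feature yields an infinite F-value.
   */
  def fValue(feature: Array[Double], label: Array[Double]): (Double, Int) = {
    val r = pearson(feature, label)
    val dof = feature.length - 2
    val r2 = r * r
    (r2 / (1.0 - r2) * dof, dof)
  }
}
```

For the feature `(1, 2, 3, 4, 5)` and label `(2, 4, 5, 4, 5)`, r^2 is 0.6, so the F-value is 0.6 / 0.4 * 3 = 4.5 with 3 degrees of freedom.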
[GitHub] [spark] zhengruifeng commented on a change in pull request #27322: [SPARK-26111][ML][WIP] Support F-value between label/feature for continuous distribution feature selection
zhengruifeng commented on a change in pull request #27322: [SPARK-26111][ML][WIP] Support F-value between label/feature for continuous distribution feature selection
URL: https://github.com/apache/spark/pull/27322#discussion_r373761706

## File path: mllib/src/main/scala/org/apache/spark/ml/feature/FRegressionSelector.scala ##
@@ -0,0 +1,357 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.collection.mutable.ArrayBuilder
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml._
+import org.apache.spark.ml.attribute.{AttributeGroup, _}
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.stat.FRegressionTest
+import org.apache.spark.ml.util._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql._
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
+
+
+/**
+ * Params for [[FRegressionSelector]] and [[FRegressionSelectorModel]].
+ * TODO: put all these params in shared.scala
+ * TODO: Not include fdr and fwe for now. Need to check if these two are applicable!!!
+ */
+private[feature] trait FRegressionSelectorParams extends Params
+  with HasFeaturesCol with HasOutputCol with HasLabelCol {
+
+  /**
+   * Number of features that selector will select, ordered by ascending p-value. If the
+   * number of features is less than numTopFeatures, then this will select all features.
+   * Only applicable when selectorType = "numTopFeatures".
+   * The default value of numTopFeatures is 50.
+   *
+   * @group param
+   */
+  @Since("3.1.0")
+  final val numTopFeatures = new IntParam(this, "numTopFeatures",
+    "Number of features that selector will select, ordered by ascending p-value. If the" +
+      " number of features is < numTopFeatures, then this will select all features.",
+    ParamValidators.gtEq(1))
+  setDefault(numTopFeatures -> 50)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getNumTopFeatures: Int = $(numTopFeatures)
+
+  /**
+   * Percentile of features that selector will select, ordered by statistics value descending.
+   * Only applicable when selectorType = "percentile".
+   * Default value is 0.1.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val percentile = new DoubleParam(this, "percentile",
+    "Percentile of features that selector will select, ordered by ascending p-value.",
+    ParamValidators.inRange(0, 1))
+  setDefault(percentile -> 0.1)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getPercentile: Double = $(percentile)
+
+  /**
+   * The highest p-value for features to be kept.
+   * Only applicable when selectorType = "fpr".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fpr = new DoubleParam(this, "fpr", "The highest p-value for features to be kept.",
+    ParamValidators.inRange(0, 1))
+  setDefault(fpr -> 0.05)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getFpr: Double = $(fpr)
+
+  /**
+   * The selector type of the FRegressionSelector.
+   * Supported options: "numTopFeatures" (default), "percentile", "fpr".
+   * @group param
+   */
+  @Since("3.1.0")
+  final val selectorType = new Param[String](this, "selectorType",
+    "The selector type of the FRegressionSelector. " +
+      "Supported options: numTopFeatures, percentile, fpr")
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getSelectorType: String = $(selectorType)
+}
+
+/**
+ * Regression F-value Selector
+ * This feature selector is for regressions where features are continuous and labels are continuous.
+ * ANOVA F-value Classification Selector is for when features are continuous and labels are
+ * categorical.
+ * Currently, Chi-Squared is for categorical features and categorical labels
+ * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`
+ *  - `numTopFeatures` chooses a fixed number of top features according to a fRegression test.
+ *  - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
+ *  - `fpr` c
[GitHub] [spark] zhengruifeng commented on a change in pull request #27322: [SPARK-26111][ML][WIP] Support F-value between label/feature for continuous distribution feature selection
zhengruifeng commented on a change in pull request #27322: [SPARK-26111][ML][WIP] Support F-value between label/feature for continuous distribution feature selection
URL: https://github.com/apache/spark/pull/27322#discussion_r373762504

## File path: mllib/src/main/scala/org/apache/spark/ml/feature/FRegressionSelector.scala ##
@@ -0,0 +1,357 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.collection.mutable.ArrayBuilder
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml._
+import org.apache.spark.ml.attribute.{AttributeGroup, _}
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.stat.FRegressionTest
+import org.apache.spark.ml.util._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql._
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
+
+
+/**
+ * Params for [[FRegressionSelector]] and [[FRegressionSelectorModel]].
+ * TODO: put all these params in shared.scala
+ * TODO: Not include fdr and fwe for now. Need to check if these two are applicable!!!
+ */
+private[feature] trait FRegressionSelectorParams extends Params
+  with HasFeaturesCol with HasOutputCol with HasLabelCol {
+
+  /**
+   * Number of features that selector will select, ordered by ascending p-value. If the
+   * number of features is less than numTopFeatures, then this will select all features.
+   * Only applicable when selectorType = "numTopFeatures".
+   * The default value of numTopFeatures is 50.
+   *
+   * @group param
+   */
+  @Since("3.1.0")
+  final val numTopFeatures = new IntParam(this, "numTopFeatures",
+    "Number of features that selector will select, ordered by ascending p-value. If the" +
+      " number of features is < numTopFeatures, then this will select all features.",
+    ParamValidators.gtEq(1))
+  setDefault(numTopFeatures -> 50)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getNumTopFeatures: Int = $(numTopFeatures)
+
+  /**
+   * Percentile of features that selector will select, ordered by statistics value descending.
+   * Only applicable when selectorType = "percentile".
+   * Default value is 0.1.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val percentile = new DoubleParam(this, "percentile",
+    "Percentile of features that selector will select, ordered by ascending p-value.",
+    ParamValidators.inRange(0, 1))
+  setDefault(percentile -> 0.1)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getPercentile: Double = $(percentile)
+
+  /**
+   * The highest p-value for features to be kept.
+   * Only applicable when selectorType = "fpr".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fpr = new DoubleParam(this, "fpr", "The highest p-value for features to be kept.",
+    ParamValidators.inRange(0, 1))
+  setDefault(fpr -> 0.05)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getFpr: Double = $(fpr)
+
+  /**
+   * The selector type of the FRegressionSelector.
+   * Supported options: "numTopFeatures" (default), "percentile", "fpr".
+   * @group param
+   */
+  @Since("3.1.0")
+  final val selectorType = new Param[String](this, "selectorType",
+    "The selector type of the FRegressionSelector. " +
+      "Supported options: numTopFeatures, percentile, fpr")
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getSelectorType: String = $(selectorType)
+}
+
+/**
+ * Regression F-value Selector
+ * This feature selector is for regressions where features are continuous and labels are continuous.
+ * ANOVA F-value Classification Selector is for when features are continuous and labels are
+ * categorical.
+ * Currently, Chi-Squared is for categorical features and categorical labels
+ * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`
+ * - `numTopFeatures` chooses a fixed number of top features according to a fRegression test.
+ * - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
+ * - `fpr` c
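The three selection modes documented in the quoted params can be illustrated in plain Scala. This is only a sketch, not code from the PR: `SelectByPValue` and `select` are hypothetical names, the input is assumed to be one p-value per feature, and two details are assumptions because the quoted diff does not show them: `fpr` is treated as a strict upper bound, and the `percentile` count is truncated toward zero.

```scala
// Hypothetical helper for illustration only; not part of the PR.
object SelectByPValue {

  /** Returns the indices of the kept features, in ascending index order. */
  def select(
      pValues: Array[Double],
      selectorType: String,
      numTopFeatures: Int = 50,
      percentile: Double = 0.1,
      fpr: Double = 0.05): Array[Int] = {
    // Rank features by ascending p-value, remembering each original index.
    val ranked = pValues.zipWithIndex.sortBy(_._1)
    val kept = selectorType match {
      // Fixed number of best features (all of them if there are fewer).
      case "numTopFeatures" => ranked.take(numTopFeatures)
      // Fixed fraction of all features (truncation is an assumption).
      case "percentile" => ranked.take((pValues.length * percentile).toInt)
      // Every feature whose p-value is below the threshold (strictness is an assumption).
      case "fpr" => ranked.filter(_._1 < fpr)
      case other => throw new IllegalArgumentException(s"Unknown selector type: $other")
    }
    kept.map(_._2).sorted
  }
}
```

For example, with p-values `Array(0.5, 0.01, 0.2, 0.03)`, selecting the top two features keeps indices 1 and 3.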
[GitHub] [spark] zhengruifeng commented on a change in pull request #27322: [SPARK-26111][ML][WIP] Support F-value between label/feature for continuous distribution feature selection
zhengruifeng commented on a change in pull request #27322: [SPARK-26111][ML][WIP] Support F-value between label/feature for continuous distribution feature selection
URL: https://github.com/apache/spark/pull/27322#discussion_r373762290

## File path: mllib/src/main/scala/org/apache/spark/ml/stat/FRegressionTest.scala ##
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import org.apache.commons.math3.distribution.FDistribution
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.feature.LabeledPoint
+import org.apache.spark.ml.linalg.{Vector, VectorUDT}
+import org.apache.spark.ml.util.SchemaUtils
+import org.apache.spark.mllib.stat.{Statistics => OldStatistics}
+import org.apache.spark.sql.Dataset
+import org.apache.spark.sql.functions.col
+
+
+/**
+ * F-Regression Test
+ */
+@Since("3.1.0")
+object FRegressionTest {
+
+  case class FRegressionTestResult(
+      pValue: Double,
+      degreesOfFreedom: Int,
+      fValue: Double)
+
+  /**
+   * @param dataset DataFrame of continuous labels and continuous features.
+   * @param featuresCol Name of features column in dataset, of type `Vector` (`VectorUDT`)
+   * @param labelCol Name of label column in dataset, of any numerical type
+   * @return Array containing the FRegressionTestResult for every feature against the label.
+   */
+  @Since("3.1.0")
+  def test_regression(dataset: Dataset[_], featuresCol: String, labelCol: String):
+    Array[FRegressionTestResult] = {
+
+    val spark = dataset.sparkSession
+    import spark.implicits._
+
+    SchemaUtils.checkColumnType(dataset.schema, featuresCol, new VectorUDT)
+    SchemaUtils.checkNumericType(dataset.schema, labelCol)
+    val rdd = dataset.select(col(labelCol).cast("double"), col(featuresCol)).as[(Double, Vector)]
+      .rdd.map { case (label, features) => LabeledPoint(label, features) }
+
+    val numOfFeatures = rdd.first().features.size
+    val numOfSamples = rdd.count()
+    val degreeOfFreedom = numOfSamples.toInt - 2
+
+    var fTestResultArray = new Array[FRegressionTestResult](numOfFeatures)
+    val labels = rdd.map(d => d.label)
+    for (i <- 0 until numOfFeatures) {

Review comment:
compute each col at once? This should be inefficient, I guess only one pass is needed.
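The one-pass idea raised in the review comment can be sketched in plain Scala, without Spark. This is an illustration under stated assumptions, not the PR's implementation: `OnePassFValues` and `Sums` are hypothetical names, rows are assumed to arrive as `(label, features)` pairs with dense features, and the F-value is taken to be the univariate-regression statistic F = r² / (1 − r²) · (n − 2), which matches the `numOfSamples - 2` degrees of freedom in the quoted code. The p-value step (for example via the imported `FDistribution`) is left out.

```scala
// Hypothetical sketch for illustration only; not part of the PR.
object OnePassFValues {

  /** Running sums for every feature, updated once per row. */
  final class Sums(val numFeatures: Int) extends Serializable {
    var n = 0L
    var sumY = 0.0
    var sumYY = 0.0
    val sumX = new Array[Double](numFeatures)
    val sumXX = new Array[Double](numFeatures)
    val sumXY = new Array[Double](numFeatures)

    /** Folds one (label, features) row into the sums. */
    def add(label: Double, features: Array[Double]): this.type = {
      require(features.length == numFeatures, "unexpected feature vector size")
      n += 1
      sumY += label
      sumYY += label * label
      var i = 0
      while (i < numFeatures) {
        val x = features(i)
        sumX(i) += x
        sumXX(i) += x * x
        sumXY(i) += x * label
        i += 1
      }
      this
    }

    /** Combines partial sums, e.g. from two partitions. */
    def merge(other: Sums): this.type = {
      n += other.n
      sumY += other.sumY
      sumYY += other.sumYY
      var i = 0
      while (i < numFeatures) {
        sumX(i) += other.sumX(i)
        sumXX(i) += other.sumXX(i)
        sumXY(i) += other.sumXY(i)
        i += 1
      }
      this
    }

    /** F = r^2 / (1 - r^2) * (n - 2) per feature, with r the Pearson correlation. */
    def fValues: Array[Double] = {
      val dof = n - 2
      val varY = n * sumYY - sumY * sumY
      Array.tabulate(numFeatures) { i =>
        val cov = n * sumXY(i) - sumX(i) * sumY
        val varX = n * sumXX(i) - sumX(i) * sumX(i)
        val r2 = cov * cov / (varX * varY)
        r2 / (1.0 - r2) * dof
      }
    }
  }

  /** Single traversal of the rows; no per-feature loop over the data. */
  def fValues(rows: Iterator[(Double, Array[Double])], numFeatures: Int): Array[Double] =
    rows.foldLeft(new Sums(numFeatures))((s, row) => s.add(row._1, row._2)).fValues
}
```

On an RDD, `add` and `merge` have the shape of the `seqOp` and `combOp` arguments of `treeAggregate`, so the same sums could be collected in a single job. The raw-sum formulas are numerically fragile for data with a large mean, and a constant feature or a perfectly correlated one yields NaN or infinity, so a production version would likely need centered updates and explicit handling of those cases.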
[GitHub] [spark] zhengruifeng commented on a change in pull request #27322: [SPARK-26111][ML][WIP] Support F-value between label/feature for continuous distribution feature selection
zhengruifeng commented on a change in pull request #27322: [SPARK-26111][ML][WIP] Support F-value between label/feature for continuous distribution feature selection URL: https://github.com/apache/spark/pull/27322#discussion_r373761805 ## File path: mllib/src/main/scala/org/apache/spark/ml/feature/FRegressionSelector.scala ## @@ -0,0 +1,357 @@
[GitHub] [spark] zhengruifeng commented on a change in pull request #27322: [SPARK-26111][ML][WIP] Support F-value between label/feature for continuous distribution feature selection
zhengruifeng commented on a change in pull request #27322: [SPARK-26111][ML][WIP] Support F-value between label/feature for continuous distribution feature selection URL: https://github.com/apache/spark/pull/27322#discussion_r373762537 ## File path: mllib/src/main/scala/org/apache/spark/ml/feature/FRegressionSelector.scala ## @@ -0,0 +1,357 @@
[GitHub] [spark] zhengruifeng commented on a change in pull request #27322: [SPARK-26111][ML][WIP] Support F-value between label/feature for continuous distribution feature selection
zhengruifeng commented on a change in pull request #27322: [SPARK-26111][ML][WIP] Support F-value between label/feature for continuous distribution feature selection URL: https://github.com/apache/spark/pull/27322#discussion_r373762117 ## File path: mllib/src/main/scala/org/apache/spark/ml/feature/FRegressionSelector.scala ## @@ -0,0 +1,357 @@
[GitHub] [spark] zhengruifeng commented on a change in pull request #27322: [SPARK-26111][ML][WIP] Support F-value between label/feature for continuous distribution feature selection
zhengruifeng commented on a change in pull request #27322: [SPARK-26111][ML][WIP] Support F-value between label/feature for continuous distribution feature selection URL: https://github.com/apache/spark/pull/27322#discussion_r373761981 ## File path: mllib/src/main/scala/org/apache/spark/ml/feature/FRegressionSelector.scala ## @@ -0,0 +1,357 @@
[GitHub] [spark] nikunjb removed a comment on issue #22423: [SPARK-25302][STREAMING] Checkpoint the reducedStream in ReducedWindo…
nikunjb removed a comment on issue #22423: [SPARK-25302][STREAMING] Checkpoint the reducedStream in ReducedWindo… URL: https://github.com/apache/spark/pull/22423#issuecomment-581000427 Please review this PR and the related one for SPARK-25303 too. I am reopening both.
[GitHub] [spark] HyukjinKwon closed pull request #27424: [SPARK-29138][PYTHON][TEST] Increase timeout of StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy
HyukjinKwon closed pull request #27424: [SPARK-29138][PYTHON][TEST] Increase timeout of StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy URL: https://github.com/apache/spark/pull/27424
[GitHub] [spark] HyukjinKwon commented on issue #27424: [SPARK-29138][PYTHON][TEST] Increase timeout of StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy
HyukjinKwon commented on issue #27424: [SPARK-29138][PYTHON][TEST] Increase timeout of StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy URL: https://github.com/apache/spark/pull/27424#issuecomment-581000488 Merged to master.
[GitHub] [spark] HyukjinKwon removed a comment on issue #27424: [SPARK-29138][PYTHON][TEST] Increase timeout of StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy
HyukjinKwon removed a comment on issue #27424: [SPARK-29138][PYTHON][TEST] Increase timeout of StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy URL: https://github.com/apache/spark/pull/27424#issuecomment-581000476 Yeah, I think it's fine to increase and see if it actually fixes. I think it fixes.
[GitHub] [spark] HyukjinKwon commented on issue #27424: [SPARK-29138][PYTHON][TEST] Increase timeout of StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy
HyukjinKwon commented on issue #27424: [SPARK-29138][PYTHON][TEST] Increase timeout of StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy URL: https://github.com/apache/spark/pull/27424#issuecomment-581000468 Yeah, I think it's fine to increase and see if it actually fixes. I think it fixes.
[GitHub] [spark] HyukjinKwon commented on issue #27424: [SPARK-29138][PYTHON][TEST] Increase timeout of StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy
HyukjinKwon commented on issue #27424: [SPARK-29138][PYTHON][TEST] Increase timeout of StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy URL: https://github.com/apache/spark/pull/27424#issuecomment-581000476 Yeah, I think it's fine to increase and see if it actually fixes. I think it fixes.
[GitHub] [spark] nikunjb commented on issue #22423: [SPARK-25302][STREAMING] Checkpoint the reducedStream in ReducedWindo…
nikunjb commented on issue #22423: [SPARK-25302][STREAMING] Checkpoint the reducedStream in ReducedWindo… URL: https://github.com/apache/spark/pull/22423#issuecomment-581000427 Please review this PR and the related one for SPARK-25303 too. I am reopening both.
[GitHub] [spark] MaxGekk commented on a change in pull request #27366: [SPARK-30648][SQL] Support filters pushdown in JSON datasource
MaxGekk commented on a change in pull request #27366: [SPARK-30648][SQL] Support filters pushdown in JSON datasource
URL: https://github.com/apache/spark/pull/27366#discussion_r373761854

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/StructFilters.scala ##
@@ -0,0 +1,161 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst
+
+import scala.util.Try
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.sources
+import org.apache.spark.sql.types.{BooleanType, StructType}
+
+/**
+ * The class provides API for applying pushed down filters to partially or
+ * fully set internal rows that have the struct schema.
+ *
+ * @param filters The pushed down source filters. The filters should refer to
+ *                the fields of the provided schema.
+ * @param schema The required schema of records from datasource files.
+ */
+abstract class StructFilters(filters: Seq[sources.Filter], schema: StructType) {
+
+  assert(filters.forall(StructFilters.checkFilterRefs(_, schema)),
+    "A pushed down filter refers to a non-existing schema field.")
+
+  /**
+   * Applies pushed down source filters to the given row assuming that
+   * value at `index` has been already set.
+   *
+   * @param row The row with fully or partially set values.
+   * @param index The index of already set value.
+   * @return true if currently processed row can be skipped otherwise false.
+   */
+  def skipRow(row: InternalRow, index: Int): Boolean
+
+  /**
+   * Resets states of pushed down filters. The method must be called before
+   * precessing any new row otherwise skipRow() may return wrong result.
+   */
+  def reset(): Unit
+
+  /**
+   * Compiles source filters to a predicate.
+   */
+  def toPredicate(filters: Seq[sources.Filter]): BasePredicate = {
+    val reducedExpr = filters
+      .sortBy(_.references.size)

Review comment:
Why is `length` better than `size`?
[GitHub] [spark] AmplabJenkins removed a comment on issue #27381: [MINOR][SQL] Improve readability for some code comments
AmplabJenkins removed a comment on issue #27381: [MINOR][SQL] Improve readability for some code comments URL: https://github.com/apache/spark/pull/27381#issuecomment-580999519 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27381: [MINOR][SQL] Improve readability for some code comments
AmplabJenkins removed a comment on issue #27381: [MINOR][SQL] Improve readability for some code comments URL: https://github.com/apache/spark/pull/27381#issuecomment-580999521 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22472/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27381: [MINOR][SQL] Improve readability for some code comments
AmplabJenkins commented on issue #27381: [MINOR][SQL] Improve readability for some code comments URL: https://github.com/apache/spark/pull/27381#issuecomment-580999519 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27381: [MINOR][SQL] Improve readability for some code comments
AmplabJenkins commented on issue #27381: [MINOR][SQL] Improve readability for some code comments URL: https://github.com/apache/spark/pull/27381#issuecomment-580999521 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22472/ Test PASSed.
[GitHub] [spark] SparkQA commented on issue #27381: [MINOR][SQL] Improve readability for some code comments
SparkQA commented on issue #27381: [MINOR][SQL] Improve readability for some code comments URL: https://github.com/apache/spark/pull/27381#issuecomment-580999428 **[Test build #117712 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117712/testReport)** for PR 27381 at commit [`2e403bf`](https://github.com/apache/spark/commit/2e403bfa8edff38e961ffb6f4c9576fbe38d541d).
[GitHub] [spark] MaxGekk commented on a change in pull request #27366: [SPARK-30648][SQL] Support filters pushdown in JSON datasource
MaxGekk commented on a change in pull request #27366: [SPARK-30648][SQL] Support filters pushdown in JSON datasource
URL: https://github.com/apache/spark/pull/27366#discussion_r373761100

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/StructFilters.scala ##
@@ -0,0 +1,161 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst
+
+import scala.util.Try
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.sources
+import org.apache.spark.sql.types.{BooleanType, StructType}
+
+/**
+ * The class provides API for applying pushed down filters to partially or
+ * fully set internal rows that have the struct schema.
+ *
+ * @param filters The pushed down source filters. The filters should refer to
+ *                the fields of the provided schema.
+ * @param schema The required schema of records from datasource files.
+ */
+abstract class StructFilters(filters: Seq[sources.Filter], schema: StructType) {
+
+  assert(filters.forall(StructFilters.checkFilterRefs(_, schema)),
+    "A pushed down filter refers to a non-existing schema field.")
+
+  /**
+   * Applies pushed down source filters to the given row assuming that
+   * value at `index` has been already set.
+   *
+   * @param row The row with fully or partially set values.
+   * @param index The index of already set value.
+   * @return true if currently processed row can be skipped otherwise false.
+   */
+  def skipRow(row: InternalRow, index: Int): Boolean
+
+  /**
+   * Resets states of pushed down filters. The method must be called before
+   * precessing any new row otherwise skipRow() may return wrong result.
+   */
+  def reset(): Unit
+
+  /**
+   * Compiles source filters to a predicate.
+   */
+  def toPredicate(filters: Seq[sources.Filter]): BasePredicate = {
+    val reducedExpr = filters
+      .sortBy(_.references.size)
+      .flatMap(StructFilters.filterToExpression(_, toRef))
+      .reduce(And)
+    Predicate.create(reducedExpr)
+  }
+
+  // Finds a filter attribute in the schema and converts it to a `BoundReference`
+  def toRef(attr: String): Option[BoundReference] = {
+    schema.getFieldIndex(attr).map { index =>
+      val field = schema(index)
+      BoundReference(schema.fieldIndex(attr), field.dataType, field.nullable)
+    }
+  }
+}
+
+object StructFilters {
+  private def checkFilterRefs(filter: sources.Filter, schema: StructType): Boolean = {
+    val fieldNames = schema.fields.map(_.name).toSet
+    filter.references.forall(fieldNames.contains(_))
+  }
+
+  /**
+   * Returns the filters currently supported by the datasource.
+   * @param filters The filters pushed down to the datasource.
+   * @param schema data schema of datasource files.
+   * @return a sub-set of `filters` that can be handled by the datasource.
+   */
+  def pushedFilters(filters: Array[sources.Filter], schema: StructType): Array[sources.Filter] = {
+    filters.filter(checkFilterRefs(_, schema))
+  }
+
+  private def zip[A, B](a: Option[A], b: Option[B]): Option[(A, B)] = {

Review comment:
Semantically this function does what `zip` should do. The problem is `zip` for Option returns `Iterable[(A, B)]` instead of `Option[(A, B)]`. I cannot agree that the name could mislead.
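To illustrate the point in the review comment above: in Scala 2.12, which Spark built against at the time, calling `zip` on an `Option` goes through the implicit `Iterable` conversion, so the static result type is `Iterable[(A, B)]`; Scala 2.13 later added an `Option.zip` that returns an `Option`. The following is a minimal sketch of a helper with the signature quoted in the diff. The body is an assumption (the diff is cut off before it), and the object name `OptionZipSketch` is hypothetical.

```scala
object OptionZipSketch {
  // Pairs two Options and yields Some only when both are defined.
  // The for-comprehension body is an assumed implementation; the
  // quoted diff shows only the signature.
  def zip[A, B](a: Option[A], b: Option[B]): Option[(A, B)] =
    for (x <- a; y <- b) yield (x, y)

  def main(args: Array[String]): Unit = {
    println(zip(Some("a"), Some(1)))           // prints Some((a,1))
    println(zip(Some("a"), Option.empty[Int])) // prints None
  }
}
```

Keeping the result as `Option[(A, B)]` lets callers continue to chain `map` and `flatMap` on it without converting back from an `Iterable`.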
[GitHub] [spark] AmplabJenkins removed a comment on issue #27427: [SPARK-30700][ML] NaiveBayesModel predict optimization
AmplabJenkins removed a comment on issue #27427: [SPARK-30700][ML] NaiveBayesModel predict optimization URL: https://github.com/apache/spark/pull/27427#issuecomment-580997249 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22471/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27427: [SPARK-30700][ML] NaiveBayesModel predict optimization
AmplabJenkins removed a comment on issue #27427: [SPARK-30700][ML] NaiveBayesModel predict optimization URL: https://github.com/apache/spark/pull/27427#issuecomment-580997246 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27427: [SPARK-30700][ML] NaiveBayesModel predict optimization
AmplabJenkins commented on issue #27427: [SPARK-30700][ML] NaiveBayesModel predict optimization URL: https://github.com/apache/spark/pull/27427#issuecomment-580997246 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27427: [SPARK-30700][ML] NaiveBayesModel predict optimization
AmplabJenkins commented on issue #27427: [SPARK-30700][ML] NaiveBayesModel predict optimization URL: https://github.com/apache/spark/pull/27427#issuecomment-580997249 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22471/ Test PASSed.
[GitHub] [spark] SparkQA commented on issue #27427: [SPARK-30700][ML] NaiveBayesModel predict optimization
SparkQA commented on issue #27427: [SPARK-30700][ML] NaiveBayesModel predict optimization URL: https://github.com/apache/spark/pull/27427#issuecomment-580997146 **[Test build #117711 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117711/testReport)** for PR 27427 at commit [`7bee0b0`](https://github.com/apache/spark/commit/7bee0b03f030f108f4db1b2b54daa7b4238e027e).
[GitHub] [spark] zhengruifeng opened a new pull request #27427: [SPARK-30700][ML] NaiveBayesModel predict optimization
zhengruifeng opened a new pull request #27427: [SPARK-30700][ML] NaiveBayesModel predict optimization
URL: https://github.com/apache/spark/pull/27427

### What changes were proposed in this pull request?
The var `negThetaSum` is always used together with `pi`, so the two can be added together once, up front.

### Why are the changes needed?
Only one var, `piMinusThetaSum`, is needed instead of both `pi` and `negThetaSum`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing test suites.
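The idea in the pull request description above can be sketched independently of Spark: when two per-class vectors are always added to the same score, they can be summed once and reused. The sketch below uses plain `Array[Double]` rather than Spark's vector types, and every name in it (`PrecomputedBiasSketch`, `scoreSeparate`, `combine`, `scoreCombined`, `dot`) is hypothetical; it is not the code from the PR.

```scala
object PrecomputedBiasSketch {
  // Before: both per-class terms are added on every prediction.
  def scoreSeparate(dot: Array[Double], pi: Array[Double],
                    negThetaSum: Array[Double]): Array[Double] =
    Array.tabulate(dot.length)(i => dot(i) + pi(i) + negThetaSum(i))

  // Done once, e.g. when the model is built: fold the two terms into one.
  def combine(pi: Array[Double], negThetaSum: Array[Double]): Array[Double] =
    Array.tabulate(pi.length)(i => pi(i) + negThetaSum(i))

  // After: a single addition per class at prediction time.
  def scoreCombined(dot: Array[Double], combined: Array[Double]): Array[Double] =
    Array.tabulate(dot.length)(i => dot(i) + combined(i))

  def main(args: Array[String]): Unit = {
    val dot = Array(1.0, 2.0)            // stand-in for the feature-dependent term
    val pi = Array(-0.5, -0.25)          // stand-in for the class priors
    val negThetaSum = Array(-1.5, -0.75) // stand-in for the per-class sums
    val combined = combine(pi, negThetaSum)
    println(scoreSeparate(dot, pi, negThetaSum).mkString(", "))
    println(scoreCombined(dot, combined).mkString(", "))
  }
}
```

The sample values are chosen so that every intermediate sum is exactly representable. With arbitrary doubles, changing the order of additions can change the last bits of the result, so the two variants may differ by rounding error.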
[GitHub] [spark] AmplabJenkins removed a comment on issue #27423: [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView
AmplabJenkins removed a comment on issue #27423: [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView URL: https://github.com/apache/spark/pull/27423#issuecomment-580996683 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27423: [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView
AmplabJenkins removed a comment on issue #27423: [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView URL: https://github.com/apache/spark/pull/27423#issuecomment-580996684 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117698/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27423: [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView
AmplabJenkins commented on issue #27423: [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView URL: https://github.com/apache/spark/pull/27423#issuecomment-580996683 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27213: [SPARK-30516][SQL] statistic estimation of FileScan should take partitionFilters into account
AmplabJenkins removed a comment on issue #27213: [SPARK-30516][SQL] statistic estimation of FileScan should take partitionFilters into account URL: https://github.com/apache/spark/pull/27213#issuecomment-580996544 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27213: [SPARK-30516][SQL] statistic estimation of FileScan should take partitionFilters into account
AmplabJenkins removed a comment on issue #27213: [SPARK-30516][SQL] statistic estimation of FileScan should take partitionFilters into account URL: https://github.com/apache/spark/pull/27213#issuecomment-580996547 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117702/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27423: [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView
AmplabJenkins commented on issue #27423: [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView URL: https://github.com/apache/spark/pull/27423#issuecomment-580996684 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117698/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27213: [SPARK-30516][SQL] statistic estimation of FileScan should take partitionFilters into account
AmplabJenkins commented on issue #27213: [SPARK-30516][SQL] statistic estimation of FileScan should take partitionFilters into account URL: https://github.com/apache/spark/pull/27213#issuecomment-580996547 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117702/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27213: [SPARK-30516][SQL] statistic estimation of FileScan should take partitionFilters into account
AmplabJenkins commented on issue #27213: [SPARK-30516][SQL] statistic estimation of FileScan should take partitionFilters into account URL: https://github.com/apache/spark/pull/27213#issuecomment-580996544 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA commented on issue #27423: [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView
SparkQA commented on issue #27423: [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView URL: https://github.com/apache/spark/pull/27423#issuecomment-580996481 **[Test build #117698 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117698/testReport)** for PR 27423 at commit [`b9005ea`](https://github.com/apache/spark/commit/b9005ea9b6ad7c26b03b59cbaa932a2d59e14529). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on issue #27423: [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView
SparkQA removed a comment on issue #27423: [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView URL: https://github.com/apache/spark/pull/27423#issuecomment-580968521 **[Test build #117698 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117698/testReport)** for PR 27423 at commit [`b9005ea`](https://github.com/apache/spark/commit/b9005ea9b6ad7c26b03b59cbaa932a2d59e14529).
[GitHub] [spark] SparkQA removed a comment on issue #27213: [SPARK-30516][SQL] statistic estimation of FileScan should take partitionFilters into account
SparkQA removed a comment on issue #27213: [SPARK-30516][SQL] statistic estimation of FileScan should take partitionFilters into account URL: https://github.com/apache/spark/pull/27213#issuecomment-580972034 **[Test build #117702 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117702/testReport)** for PR 27213 at commit [`e948650`](https://github.com/apache/spark/commit/e948650ba47fa5f24af2881b2b2278d3897a431f).
[GitHub] [spark] SparkQA commented on issue #27213: [SPARK-30516][SQL] statistic estimation of FileScan should take partitionFilters into account
SparkQA commented on issue #27213: [SPARK-30516][SQL] statistic estimation of FileScan should take partitionFilters into account URL: https://github.com/apache/spark/pull/27213#issuecomment-580996331 **[Test build #117702 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117702/testReport)** for PR 27213 at commit [`e948650`](https://github.com/apache/spark/commit/e948650ba47fa5f24af2881b2b2278d3897a431f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27423: [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView
AmplabJenkins removed a comment on issue #27423: [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView URL: https://github.com/apache/spark/pull/27423#issuecomment-580995842 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27423: [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView
AmplabJenkins removed a comment on issue #27423: [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView URL: https://github.com/apache/spark/pull/27423#issuecomment-580995845 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117701/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27423: [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView
AmplabJenkins commented on issue #27423: [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView URL: https://github.com/apache/spark/pull/27423#issuecomment-580995845 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117701/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27423: [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView
AmplabJenkins commented on issue #27423: [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView URL: https://github.com/apache/spark/pull/27423#issuecomment-580995842 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #27423: [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView
SparkQA removed a comment on issue #27423: [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView URL: https://github.com/apache/spark/pull/27423#issuecomment-580970883 **[Test build #117701 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117701/testReport)** for PR 27423 at commit [`b6379c0`](https://github.com/apache/spark/commit/b6379c08e9f1060f1c94791667a749469a1d60bb). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24938: [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table"
AmplabJenkins removed a comment on issue #24938: [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table" URL: https://github.com/apache/spark/pull/24938#issuecomment-580995626 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24938: [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table"
AmplabJenkins removed a comment on issue #24938: [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table" URL: https://github.com/apache/spark/pull/24938#issuecomment-580995630 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117697/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24938: [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table"
AmplabJenkins commented on issue #24938: [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table" URL: https://github.com/apache/spark/pull/24938#issuecomment-580995630 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117697/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24938: [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table"
AmplabJenkins commented on issue #24938: [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table" URL: https://github.com/apache/spark/pull/24938#issuecomment-580995626 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #27423: [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView
SparkQA commented on issue #27423: [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView URL: https://github.com/apache/spark/pull/27423#issuecomment-580995644 **[Test build #117701 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117701/testReport)** for PR 27423 at commit [`b6379c0`](https://github.com/apache/spark/commit/b6379c08e9f1060f1c94791667a749469a1d60bb). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #24938: [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table"
SparkQA removed a comment on issue #24938: [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table" URL: https://github.com/apache/spark/pull/24938#issuecomment-580965775 **[Test build #117697 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117697/testReport)** for PR 24938 at commit [`27c76b3`](https://github.com/apache/spark/commit/27c76b3b5e9106f0fe7de1ccb2c8576064e625da). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #24938: [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table"
SparkQA commented on issue #24938: [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table" URL: https://github.com/apache/spark/pull/24938#issuecomment-580995453 **[Test build #117697 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117697/testReport)** for PR 24938 at commit [`27c76b3`](https://github.com/apache/spark/commit/27c76b3b5e9106f0fe7de1ccb2c8576064e625da). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29.
beliefer commented on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29. URL: https://github.com/apache/spark/pull/27426#issuecomment-580994373 @dongjoon-hyun Thanks for your help! I got it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27232: [SPARK-30525][SQL]HiveTableScanExec do not need to prune partitions again after pushing down to SessionCatalog for partition pruning
AmplabJenkins removed a comment on issue #27232: [SPARK-30525][SQL]HiveTableScanExec do not need to prune partitions again after pushing down to SessionCatalog for partition pruning URL: https://github.com/apache/spark/pull/27232#issuecomment-580994289 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27232: [SPARK-30525][SQL]HiveTableScanExec do not need to prune partitions again after pushing down to SessionCatalog for partition pruning
AmplabJenkins commented on issue #27232: [SPARK-30525][SQL]HiveTableScanExec do not need to prune partitions again after pushing down to SessionCatalog for partition pruning URL: https://github.com/apache/spark/pull/27232#issuecomment-580994290 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117708/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27232: [SPARK-30525][SQL]HiveTableScanExec do not need to prune partitions again after pushing down to SessionCatalog for partition pruning
AmplabJenkins commented on issue #27232: [SPARK-30525][SQL]HiveTableScanExec do not need to prune partitions again after pushing down to SessionCatalog for partition pruning URL: https://github.com/apache/spark/pull/27232#issuecomment-580994289 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27232: [SPARK-30525][SQL]HiveTableScanExec do not need to prune partitions again after pushing down to SessionCatalog for partition pruning
AmplabJenkins removed a comment on issue #27232: [SPARK-30525][SQL]HiveTableScanExec do not need to prune partitions again after pushing down to SessionCatalog for partition pruning URL: https://github.com/apache/spark/pull/27232#issuecomment-580994290 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117708/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #27232: [SPARK-30525][SQL]HiveTableScanExec do not need to prune partitions again after pushing down to SessionCatalog for partition pruning
SparkQA removed a comment on issue #27232: [SPARK-30525][SQL]HiveTableScanExec do not need to prune partitions again after pushing down to SessionCatalog for partition pruning URL: https://github.com/apache/spark/pull/27232#issuecomment-580980219 **[Test build #117708 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117708/testReport)** for PR 27232 at commit [`057a594`](https://github.com/apache/spark/commit/057a59454df6404710cfad6723e9003a2dbfd82f). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #27232: [SPARK-30525][SQL]HiveTableScanExec do not need to prune partitions again after pushing down to SessionCatalog for partition pruning
SparkQA commented on issue #27232: [SPARK-30525][SQL]HiveTableScanExec do not need to prune partitions again after pushing down to SessionCatalog for partition pruning URL: https://github.com/apache/spark/pull/27232#issuecomment-580994098 **[Test build #117708 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117708/testReport)** for PR 27232 at commit [`057a594`](https://github.com/apache/spark/commit/057a59454df6404710cfad6723e9003a2dbfd82f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on issue #27417: backport [SPARK-27747][SPARK-27816][SPARK-28344]
dongjoon-hyun commented on issue #27417: backport [SPARK-27747][SPARK-27816][SPARK-28344] URL: https://github.com/apache/spark/pull/27417#issuecomment-580993973 Thank you all for your opinions. Especially, thank you for making this PR, @cloud-fan. I'll remove `Target Version: 2.4.5` for now. If we need this in `branch-2.4`, `Target Version` will be `2.4.6`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #27417: backport [SPARK-27747][SPARK-27816][SPARK-28344]
dongjoon-hyun closed pull request #27417: backport [SPARK-27747][SPARK-27816][SPARK-28344] URL: https://github.com/apache/spark/pull/27417 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29.
dongjoon-hyun closed pull request #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29. URL: https://github.com/apache/spark/pull/27426 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29.
dongjoon-hyun commented on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29. URL: https://github.com/apache/spark/pull/27426#issuecomment-580992670 It's nothing. Since you are one of the active contributors, I hope you can do more in the community. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24938: [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table"
AmplabJenkins removed a comment on issue #24938: [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table" URL: https://github.com/apache/spark/pull/24938#issuecomment-580992552 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24938: [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table"
AmplabJenkins removed a comment on issue #24938: [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table" URL: https://github.com/apache/spark/pull/24938#issuecomment-580992554 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117694/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29.
beliefer commented on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29. URL: https://github.com/apache/spark/pull/27426#issuecomment-580992527 @dongjoon-hyun Thanks for your help! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24938: [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table"
AmplabJenkins commented on issue #24938: [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table" URL: https://github.com/apache/spark/pull/24938#issuecomment-580992552 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24938: [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table"
AmplabJenkins commented on issue #24938: [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table" URL: https://github.com/apache/spark/pull/24938#issuecomment-580992554 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117694/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #24938: [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table"
SparkQA removed a comment on issue #24938: [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table" URL: https://github.com/apache/spark/pull/24938#issuecomment-580962096 **[Test build #117694 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117694/testReport)** for PR 24938 at commit [`9882932`](https://github.com/apache/spark/commit/9882932bb275d287dfc3cea3d52bb7903e25e73f). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #24938: [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table"
SparkQA commented on issue #24938: [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table" URL: https://github.com/apache/spark/pull/24938#issuecomment-580992435 **[Test build #117694 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117694/testReport)** for PR 24938 at commit [`9882932`](https://github.com/apache/spark/commit/9882932bb275d287dfc3cea3d52bb7903e25e73f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun edited a comment on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29.
dongjoon-hyun edited a comment on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29. URL: https://github.com/apache/spark/pull/27426#issuecomment-580992304 Usually, you can take a look at the release note and mention a few notable bug lists from the release note. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer edited a comment on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29.
beliefer edited a comment on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29. URL: https://github.com/apache/spark/pull/27426#issuecomment-580992189 @dongjoon-hyun I'm sorry! What should I write here? :) I wrote some description here. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29.
dongjoon-hyun commented on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29. URL: https://github.com/apache/spark/pull/27426#issuecomment-580992304 Usually, you can take a look at the release note and mention a few notable bug lists from the release note. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29.
beliefer commented on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29. URL: https://github.com/apache/spark/pull/27426#issuecomment-580992189 @dongjoon-hyun I'm sorry! What should I write here? :) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29.
dongjoon-hyun commented on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29. URL: https://github.com/apache/spark/pull/27426#issuecomment-580991804 Ur, @beliefer . If you say `No`, there is no way to merge this. :) ``` ### Why are the changes needed? No ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29.
AmplabJenkins removed a comment on issue #27426: [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29. URL: https://github.com/apache/spark/pull/27426#issuecomment-580991498 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22470/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org