[GitHub] spark issue #19630: wip: [SPARK-22409] Introduce function type argument in p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19630 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83862/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19630: wip: [SPARK-22409] Introduce function type argument in p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19630 Merged build finished. Test FAILed.
[GitHub] spark issue #19742: [SPARK-22511][BUILD] Update maven central repo address
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19742 Merged to master/2.2
[GitHub] spark pull request #19742: [SPARK-22511][BUILD] Update maven central repo ad...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19742
[GitHub] spark pull request #19727: [WIP][SPARK-22497][SQL] Project reuse
Github user wangyum closed the pull request at: https://github.com/apache/spark/pull/19727
[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19272
**[Test build #83863 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83863/testReport)** for PR 19272 at commit [`18d77ff`](https://github.com/apache/spark/commit/18d77ff5a3aa29c9e60538ae87b6d654c229bdfe).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19272 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83863/ Test PASSed.
[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19272 Merged build finished. Test PASSed.
[GitHub] spark issue #19708: [SPARK-22479][SQL] Exclude credentials from SaveintoData...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19708
**[Test build #83864 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83864/testReport)** for PR 19708 at commit [`db565f6`](https://github.com/apache/spark/commit/db565f6b0a57f436f85c32fe7d05b027908c7a9b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class SaveIntoDataSourceCommandSuite extends SharedSQLContext `
[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/19208 Awesome, thanks for the updates and for checking backwards compatibility! LGTM Merging with master
[GitHub] spark issue #19708: [SPARK-22479][SQL] Exclude credentials from SaveintoData...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19708 Merged build finished. Test PASSed.
[GitHub] spark issue #19708: [SPARK-22479][SQL] Exclude credentials from SaveintoData...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19708 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83864/ Test PASSed.
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19208
[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/19588 @WeichenXu123 when you create the JIRA for Python, can you please link it to this task's JIRA? Thanks! LGTM Merging with master
[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19588
[GitHub] spark issue #19594: [WIP] [SPARK-21984] Join estimation based on equi-height...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19594 **[Test build #83870 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83870/testReport)** for PR 19594 at commit [`67bd651`](https://github.com/apache/spark/commit/67bd65153bd0afc30c6ef4799caa02a05a19).
[GitHub] spark pull request #19708: [SPARK-22479][SQL] Exclude credentials from Savei...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/19708#discussion_r151010772
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SaveIntoDataSourceCommandSuite.scala ---
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.SaveMode
+import org.apache.spark.sql.test.SharedSQLContext
+
+class SaveIntoDataSourceCommandSuite extends SharedSQLContext {
+
+  override protected def sparkConf: SparkConf = super.sparkConf
+    .set("spark.redaction.string.regex", "(?i)password|url")
+
+  test("treeString is redacted") {
--- End diff --

old test name? we're not modifying the treeString anymore, it's just the `SaveIntoDataSourceCommand`
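The suite in this diff sets `spark.redaction.string.regex` to `(?i)password|url`. As a rough illustration of what such a pattern matches, here is a minimal standalone sketch. It assumes the pattern is applied by replacing every match inside a string; the object name, the `redact` helper, and the replacement text are hypothetical and are not taken from the PR.

```scala
import scala.util.matching.Regex

object RedactionSketch {
  // Pattern copied from the configuration value in the diff above.
  val redactionPattern: Regex = "(?i)password|url".r

  // Hypothetical placeholder text, chosen for this sketch only.
  val replacement: String = "*********(redacted)"

  // Assumption: redaction replaces every case-insensitive match in the text.
  def redact(text: String): String =
    redactionPattern.replaceAllIn(text, replacement)
}
```

Because of the `(?i)` flag, `Password` and `URL` are matched as well as their lower-case forms.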
[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19433#discussion_r151011913
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -627,221 +621,37 @@ private[spark] object RandomForest extends Logging {
   }

   /**
-   * Calculate the impurity statistics for a given (feature, split) based upon left/right
-   * aggregates.
-   *
-   * @param stats the recycle impurity statistics for this feature's all splits,
-   *              only 'impurity' and 'impurityCalculator' are valid between each iteration
-   * @param leftImpurityCalculator left node aggregates for this (feature, split)
-   * @param rightImpurityCalculator right node aggregate for this (feature, split)
-   * @param metadata learning and dataset metadata for DecisionTree
-   * @return Impurity statistics for this (feature, split)
+   * Return a list of pairs (featureIndexIdx, featureIndex) where featureIndex is the global
+   * (across all trees) index of a feature and featureIndexIdx is the index of a feature within the
+   * list of features for a given node. Filters out constant features (features with 0 splits)
    */
-  private def calculateImpurityStats(
-      stats: ImpurityStats,
-      leftImpurityCalculator: ImpurityCalculator,
-      rightImpurityCalculator: ImpurityCalculator,
-      metadata: DecisionTreeMetadata): ImpurityStats = {
-
-    val parentImpurityCalculator: ImpurityCalculator = if (stats == null) {
-      leftImpurityCalculator.copy.add(rightImpurityCalculator)
-    } else {
-      stats.impurityCalculator
-    }
-
-    val impurity: Double = if (stats == null) {
-      parentImpurityCalculator.calculate()
-    } else {
-      stats.impurity
-    }
-
-    val leftCount = leftImpurityCalculator.count
-    val rightCount = rightImpurityCalculator.count
-
-    val totalCount = leftCount + rightCount
-
-    // If left child or right child doesn't satisfy minimum instances per node,
-    // then this split is invalid, return invalid information gain stats.
-    if ((leftCount < metadata.minInstancesPerNode) ||
-      (rightCount < metadata.minInstancesPerNode)) {
-      return ImpurityStats.getInvalidImpurityStats(parentImpurityCalculator)
-    }
-
-    val leftImpurity = leftImpurityCalculator.calculate() // Note: This equals 0 if count = 0
-    val rightImpurity = rightImpurityCalculator.calculate()
-
-    val leftWeight = leftCount / totalCount.toDouble
-    val rightWeight = rightCount / totalCount.toDouble
-
-    val gain = impurity - leftWeight * leftImpurity - rightWeight * rightImpurity
-
-    // if information gain doesn't satisfy minimum information gain,
-    // then this split is invalid, return invalid information gain stats.
-    if (gain < metadata.minInfoGain) {
-      return ImpurityStats.getInvalidImpurityStats(parentImpurityCalculator)
+  private[impl] def getNonConstantFeatures(
+      metadata: DecisionTreeMetadata,
+      featuresForNode: Option[Array[Int]]): Seq[(Int, Int)] = {
+    Range(0, metadata.numFeaturesPerNode).map { featureIndexIdx =>
--- End diff --

At some point when refactoring I was hitting errors caused by a stateful operation within a `map` over the output of this method (IIRC the result of the `map` was accessed repeatedly, causing the stateful operation to inadvertently be run multiple times). However using `withFilter` and `view` now seems to work, I'll change it back :)
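The pitfall described in this comment (a stateful operation inside a `map` over a lazy view running more than once) can be reproduced in isolation. The sketch below is not code from the PR; `ViewSketch` and its methods are hypothetical names, and it only shows that a mapped view re-runs its function on every traversal while a strict `map` runs it once per element.

```scala
object ViewSketch {
  // Counts how often the mapping function runs when a lazy view is traversed twice.
  def viewCalls(): Int = {
    var calls = 0
    val mapped = Range(0, 3).view.map { i => calls += 1; i * 2 }
    mapped.foreach(_ => ()) // first traversal runs the function 3 times
    mapped.foreach(_ => ()) // second traversal runs it 3 more times
    calls
  }

  // Same experiment with a strict collection: the function runs only while mapping.
  def strictCalls(): Int = {
    var calls = 0
    val mapped = Range(0, 3).map { i => calls += 1; i * 2 }
    mapped.foreach(_ => ())
    mapped.foreach(_ => ())
    calls
  }
}
```

Here `viewCalls()` counts six invocations and `strictCalls()` counts three, which is why side effects inside a mapped view are easy to trigger more often than intended.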
[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19433#discussion_r151011879
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/InformationGainStats.scala ---
@@ -112,7 +113,7 @@ private[spark] object ImpurityStats {
    * minimum number of instances per node.
    */
   def getInvalidImpurityStats(impurityCalculator: ImpurityCalculator): ImpurityStats = {
-    new ImpurityStats(Double.MinValue, impurityCalculator.calculate(),
+    new ImpurityStats(Double.MinValue, impurity = -1,
--- End diff --

I changed this to be -1 here since node impurity would eventually get set to -1 anyways when `LearningNodes` with invalid `ImpurityStats` were converted into decision tree leaf nodes (see [`LearningNode.toNode`](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala#L279))
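For readers following the invalid-stats discussion, the gain computation in the removed `calculateImpurityStats` (shown in the RandomForest.scala diff earlier in this thread) can be sketched with plain numbers. This is a simplified stand-in under stated assumptions: it takes precomputed impurities and counts instead of `ImpurityCalculator` instances, and it returns `None` where the original returns invalid `ImpurityStats`. `GainSketch` and `infoGain` are hypothetical names.

```scala
object GainSketch {
  // gain = parent impurity minus the count-weighted impurities of the two children.
  // None stands in for "invalid split" (too few instances, or gain below the minimum).
  def infoGain(
      parentImpurity: Double,
      leftCount: Long,
      leftImpurity: Double,
      rightCount: Long,
      rightImpurity: Double,
      minInstancesPerNode: Int,
      minInfoGain: Double): Option[Double] = {
    if (leftCount < minInstancesPerNode || rightCount < minInstancesPerNode) {
      None
    } else {
      val total = (leftCount + rightCount).toDouble
      val gain = parentImpurity -
        (leftCount / total) * leftImpurity -
        (rightCount / total) * rightImpurity
      if (gain < minInfoGain) None else Some(gain)
    }
  }
}
```

For example, a parent with impurity 0.5 split into two pure children of four instances each gives a gain of 0.5, while a split that leaves one child empty is rejected when `minInstancesPerNode` is 1.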
[GitHub] spark issue #19752: [SPARK-22520][SQL] Support code generation for large Cas...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19752
**[Test build #83866 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83866/testReport)** for PR 19752 at commit [`98eaae9`](https://github.com/apache/spark/commit/98eaae9436adf63ec3023ee077f2fff8e23dfa35).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `case class CaseWhen(`
[GitHub] spark issue #19752: [SPARK-22520][SQL] Support code generation for large Cas...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19752 Merged build finished. Test PASSed.
[GitHub] spark issue #19752: [SPARK-22520][SQL] Support code generation for large Cas...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19752 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83866/ Test PASSed.
[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19750
**[Test build #83865 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83865/testReport)** for PR 19750 at commit [`406779b`](https://github.com/apache/spark/commit/406779bf05cbab7afd8c632ebb7035fb0f2cbd28).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #19594: [WIP] [SPARK-21984] Join estimation based on equi-height...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19594 **[Test build #83871 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83871/testReport)** for PR 19594 at commit [`8b2084a`](https://github.com/apache/spark/commit/8b2084a4bec8fdd58cca809b2d2b26bdc939436d).
[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19750 Merged build finished. Test FAILed.
[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19750 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83865/ Test FAILed.
[GitHub] spark issue #19751: [SPARK-20653][core] Add cleaning of old elements from th...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19751
**[Test build #83867 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83867/testReport)** for PR 19751 at commit [`8c346a1`](https://github.com/apache/spark/commit/8c346a148d7be78b0f53aadb9c8ca78098b0ea6c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #19751: [SPARK-20653][core] Add cleaning of old elements from th...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19751 Merged build finished. Test PASSed.
[GitHub] spark issue #19751: [SPARK-20653][core] Add cleaning of old elements from th...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19751 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83867/ Test PASSed.
[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19433#discussion_r151017375
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/SplitUtils.scala ---
@@ -0,0 +1,215 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tree.impl
+
+import org.apache.spark.ml.tree.{CategoricalSplit, Split}
+import org.apache.spark.mllib.tree.impurity.ImpurityCalculator
+import org.apache.spark.mllib.tree.model.ImpurityStats
+
+/** Utility methods for choosing splits during local & distributed tree training. */
+private[impl] object SplitUtils {
+
+  /** Sorts ordered feature categories by label centroid, returning an ordered list of categories */
+  private def sortByCentroid(
+      binAggregates: DTStatsAggregator,
+      featureIndex: Int,
+      featureIndexIdx: Int): List[Int] = {
+    /* Each bin is one category (feature value).
+     * The bins are ordered based on centroidForCategories, and this ordering determines which
+     * splits are considered. (With K categories, we consider K - 1 possible splits.)
+     *
+     * centroidForCategories is a list: (category, centroid)
+     */
+    val numCategories = binAggregates.metadata.numBins(featureIndex)
+    val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
+
+    val centroidForCategories = Range(0, numCategories).map { featureValue =>
+      val categoryStats =
+        binAggregates.getImpurityCalculator(nodeFeatureOffset, featureValue)
+      val centroid = ImpurityUtils.getCentroid(binAggregates.metadata, categoryStats)
+      (featureValue, centroid)
+    }
+    // TODO(smurching): How to handle logging statements like these?
+    // logDebug("Centroids for categorical variable: " + centroidForCategories.mkString(","))
+    // bins sorted by centroids
+    val categoriesSortedByCentroid = centroidForCategories.toList.sortBy(_._2).map(_._1)
+    // logDebug("Sorted centroids for categorical variable = " +
+    //   categoriesSortedByCentroid.mkString(","))
+    categoriesSortedByCentroid
+  }
+
+  /**
+   * Find the best split for an unordered categorical feature at a single node.
+   *
+   * Algorithm:
+   *  - Considers all possible subsets (exponentially many)
+   *
+   * @param featureIndex Global index of feature being split.
+   * @param featureIndexIdx Index of feature being split within subset of features for current node.
+   * @param featureSplits Array of splits for the current feature
+   * @param parentCalculator Optional: ImpurityCalculator containing impurity stats for current node
+   * @return (best split, statistics for split) If no valid split was found, the returned
+   *         ImpurityStats instance will be invalid (have member valid = false).
+   */
+  private[impl] def chooseUnorderedCategoricalSplit(
+      binAggregates: DTStatsAggregator,
+      featureIndex: Int,
+      featureIndexIdx: Int,
+      featureSplits: Array[Split],
+      parentCalculator: Option[ImpurityCalculator] = None): (Split, ImpurityStats) = {
+    // Unordered categorical feature
+    val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
+    val numSplits = binAggregates.metadata.numSplits(featureIndex)
+    var parentCalc = parentCalculator
+    val (bestFeatureSplitIndex, bestFeatureGainStats) =
+      Range(0, numSplits).map { splitIndex =>
+        val leftChildStats = binAggregates.getImpurityCalculator(nodeFeatureOffset, splitIndex)
+        val rightChildStats = binAggregates.getParentImpurityCalculator()
+          .subtract(leftChildStats)
+        val gainAndImpurityStats = ImpurityUtils.calculateImpurityStats(parentCalc,
+          leftChildStats, rightChildStats, binAggregates.metadata)
+        // Compute parent stats once, when considering first split for current feature
+        if (pare
[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19621 **[Test build #83872 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83872/testReport)** for PR 19621 at commit [`b0b14b0`](https://github.com/apache/spark/commit/b0b14b0971a7b941abbadf52d03dbb7d77e93adc).
[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19621 @viirya @MLnick Thanks!
[GitHub] spark pull request #19631: [SPARK-22372][core, yarn] Make cluster submission...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19631#discussion_r151015745
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -412,8 +412,6 @@ class SparkContext(config: SparkConf) extends Logging {
       }
     }

-    if (master == "yarn" && deployMode == "client") System.setProperty("SPARK_YARN_MODE", "true")
--- End diff --

Not sure why this is not required anymore?
[GitHub] spark pull request #19631: [SPARK-22372][core, yarn] Make cluster submission...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19631#discussion_r151017494
--- Diff: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala ---
@@ -216,7 +216,9 @@ private[spark] object CoarseGrainedExecutorBackend extends Logging {
       if (driverConf.contains("spark.yarn.credentials.file")) {
         logInfo("Will periodically update credentials from: " +
           driverConf.get("spark.yarn.credentials.file"))
-        SparkHadoopUtil.get.startCredentialUpdater(driverConf)
+        Utils.classForName("org.apache.spark.deploy.yarn.YarnSparkHadoopUtil")
--- End diff --

I see, thanks for the explanation. This kind of reflection does not seem very elegant.
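The quoted diff stops right after `Utils.classForName(...)`, so the thread does not show which method is then invoked. As a general illustration of the reflection pattern under discussion (resolving a class by name at runtime so the caller has no compile-time dependency on it), here is a minimal sketch using only JDK reflection. `ReflectionSketch` and `callStatic` are hypothetical names, and a JDK class stands in for the YARN-specific one.

```scala
import scala.util.Try

object ReflectionSketch {
  // Resolve a class and a static one-String-argument method by name, then invoke it.
  // Any failure (missing class, missing method, invocation error) is captured in the Try.
  def callStatic(className: String, methodName: String, arg: String): Try[AnyRef] =
    Try {
      val cls = Class.forName(className)
      val method = cls.getMethod(methodName, classOf[String])
      method.invoke(null, arg) // null receiver because the method is static
    }
}
```

The cost the reviewer points at is visible here: class and method names are plain strings, so a typo or a renamed method only fails at runtime.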
[GitHub] spark pull request #19631: [SPARK-22372][core, yarn] Make cluster submission...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19631#discussion_r151018454
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -745,15 +739,20 @@ private[spark] class Client(
       // Save the YARN configuration into a separate file that will be overlayed on top of the
       // cluster's Hadoop conf.
       confStream.putNextEntry(new ZipEntry(SPARK_HADOOP_CONF_FILE))
-      yarnConf.writeXml(confStream)
+      hadoopConf.writeXml(confStream)
       confStream.closeEntry()

       // Save Spark configuration to a file in the archive.
       val props = new Properties()
       sparkConf.getAll.foreach { case (k, v) => props.setProperty(k, v) }
       // Override spark.yarn.key to point to the location in distributed cache which will be used
       // by AM.
-      Option(amKeytabFileName).foreach { k => props.setProperty(KEYTAB.key, k) }
+      Option(amKeytabFileName).foreach { k =>
+        // Do not propagate the app's secret using the config file.
+        if (k != SecurityManager.SPARK_AUTH_SECRET_CONF) {
--- End diff --

Is it necessary to add a check here? I'm not sure how this could happen.
[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19621 @WeichenXu123 I will try to look into this today.
[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19588 Python API jira created here: https://issues.apache.org/jira/browse/SPARK-22521
[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19433#discussion_r151019591
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -852,6 +662,41 @@ private[spark] object RandomForest extends Logging {
   }

   /**
+   * Find the best split for a node.
+   *
+   * @param binAggregates Bin statistics.
+   * @return tuple for best split: (Split, information gain, prediction at node)
+   */
+  private[tree] def binsToBestSplit(
+      binAggregates: DTStatsAggregator,
+      splits: Array[Array[Split]],
+      featuresForNode: Option[Array[Int]],
+      node: LearningNode): (Split, ImpurityStats) = {
+    val validFeatureSplits = getNonConstantFeatures(binAggregates.metadata, featuresForNode)
+    // For each (feature, split), calculate the gain, and select the best (feature, split).
+    val parentImpurityCalc = if (node.stats == null) None else Some(node.stats.impurityCalculator)
--- End diff --

I believe so, the nodes at the top level are created ([RandomForest.scala:178](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L178)) with [`LearningNode.emptyNode`](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala#L341), which sets `node.stats = null`. I could change this to check node depth (via node index), but if we're planning on deprecating node indices in the future it might be best not to.
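The null check on `node.stats` in this diff converts a possibly-null field into an `Option`. A minimal sketch of the same conversion using `Option.apply`, which maps `null` to `None`, is given below. `Stats` and `Node` are hypothetical stand-ins for the Spark classes (with a `String` in place of `ImpurityCalculator`); this illustrates the idiom and is not a change proposed in the thread.

```scala
object NullToOptionSketch {
  // Hypothetical stand-ins for ImpurityStats and LearningNode.
  final case class Stats(impurityCalculator: String)
  final class Node(var stats: Stats)

  // Option(x) is None when x is null, so this matches
  // `if (node.stats == null) None else Some(node.stats.impurityCalculator)`.
  def parentImpurityCalc(node: Node): Option[String] =
    Option(node.stats).map(_.impurityCalculator)
}
```

A node created with `stats = null`, as `LearningNode.emptyNode` is described to do above, yields `None`.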
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19433 **[Test build #83873 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83873/testReport)** for PR 19433 at commit [`0b27c56`](https://github.com/apache/spark/commit/0b27c56d1ea4e1108a62b77e9eca8ae160740756).
[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19621 Merged build finished. Test FAILed.
[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19621 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83872/ Test FAILed.
[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19621
**[Test build #83872 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83872/testReport)** for PR 19621 at commit [`b0b14b0`](https://github.com/apache/spark/commit/b0b14b0971a7b941abbadf52d03dbb7d77e93adc).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #19631: [SPARK-22372][core, yarn] Make cluster submission...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/19631#discussion_r151020337
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -412,8 +412,6 @@ class SparkContext(config: SparkConf) extends Logging {
       }
     }

-    if (master == "yarn" && deployMode == "client") System.setProperty("SPARK_YARN_MODE", "true")
--- End diff --

This change is removing all references to `SPARK_YARN_MODE`.
[GitHub] spark pull request #19631: [SPARK-22372][core, yarn] Make cluster submission...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/19631#discussion_r151020385
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -745,15 +739,20 @@ private[spark] class Client(
       // Save the YARN configuration into a separate file that will be overlayed on top of the
       // cluster's Hadoop conf.
       confStream.putNextEntry(new ZipEntry(SPARK_HADOOP_CONF_FILE))
-      yarnConf.writeXml(confStream)
+      hadoopConf.writeXml(confStream)
       confStream.closeEntry()

       // Save Spark configuration to a file in the archive.
       val props = new Properties()
       sparkConf.getAll.foreach { case (k, v) => props.setProperty(k, v) }
       // Override spark.yarn.key to point to the location in distributed cache which will be used
       // by AM.
-      Option(amKeytabFileName).foreach { k => props.setProperty(KEYTAB.key, k) }
+      Option(amKeytabFileName).foreach { k =>
+        // Do not propagate the app's secret using the config file.
+        if (k != SecurityManager.SPARK_AUTH_SECRET_CONF) {
--- End diff --

Oh, I think this is the wrong place for the check.
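The thread does not say where the check should go instead. In the diff, `k` is the keytab file name rather than a configuration key, so one plausible reading is that the filter belongs in the loop that copies the configuration entries into `props`. The sketch below illustrates that reading only. It is an assumption, not the fix that was committed, and the literal key string stands in for `SecurityManager.SPARK_AUTH_SECRET_CONF`, whose value is assumed here.

```scala
import java.util.Properties

object ConfFilterSketch {
  // Assumed value of SecurityManager.SPARK_AUTH_SECRET_CONF, for illustration only.
  val SecretKey: String = "spark.authenticate.secret"

  // Copy configuration entries into Properties, skipping the secret so it is
  // never written to the config file.
  def toProperties(conf: Seq[(String, String)]): Properties = {
    val props = new Properties()
    conf.foreach { case (k, v) =>
      if (k != SecretKey) props.setProperty(k, v)
    }
    props
  }
}
```

With this placement, every other entry is propagated unchanged and only the secret is dropped.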
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19433 **[Test build #83874 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83874/testReport)** for PR 19433 at commit [`d86dd18`](https://github.com/apache/spark/commit/d86dd18e47451c2e4463c68db441f92a898ac765). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19381: [SPARK-10884][ML] Support prediction on single in...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19381#discussion_r151020666 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GBTRegressorSuite.scala --- @@ -19,15 +19,16 @@ package org.apache.spark.ml.regression import org.apache.spark.SparkFunSuite import org.apache.spark.ml.feature.LabeledPoint -import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.linalg.{Vector, Vectors} import org.apache.spark.ml.tree.impl.TreeTests import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils} +import org.apache.spark.ml.util.TestingUtils._ --- End diff -- Nit: unused import --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19381: [SPARK-10884][ML] Support prediction on single in...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19381#discussion_r151020618 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/RandomForestRegressorSuite.scala --- @@ -19,14 +19,16 @@ package org.apache.spark.ml.regression import org.apache.spark.SparkFunSuite import org.apache.spark.ml.feature.LabeledPoint +import org.apache.spark.ml.linalg.Vector import org.apache.spark.ml.tree.impl.TreeTests import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils} +import org.apache.spark.ml.util.TestingUtils._ --- End diff -- Nit: unused import --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19433 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83873/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19433 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19433 **[Test build #83873 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83873/testReport)** for PR 19433 at commit [`0b27c56`](https://github.com/apache/spark/commit/0b27c56d1ea4e1108a62b77e9eca8ae160740756). * This patch **fails to generate documentation**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19640: [SPARK-16986][WEB-UI] Converter Started, Completed and L...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19640 **[Test build #83868 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83868/testReport)** for PR 19640 at commit [`4a78965`](https://github.com/apache/spark/commit/4a78965c22f11fbda7c9ba843ee266048bf6d319). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19640: [SPARK-16986][WEB-UI] Converter Started, Completed and L...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19640 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19640: [SPARK-16986][WEB-UI] Converter Started, Completed and L...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19640 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83868/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19631 **[Test build #83875 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83875/testReport)** for PR 19631 at commit [`121bcf8`](https://github.com/apache/spark/commit/121bcf8a2758858cad8e88e7fb7d78566494765b). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19433 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19433 **[Test build #83874 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83874/testReport)** for PR 19433 at commit [`d86dd18`](https://github.com/apache/spark/commit/d86dd18e47451c2e4463c68db441f92a898ac765). * This patch **fails to generate documentation**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19433 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83874/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19746: [SPARK-22346][ML] VectorSizeHint Transformer for using V...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19746 **[Test build #83869 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83869/testReport)** for PR 19746 at commit [`73fe1d8`](https://github.com/apache/spark/commit/73fe1d8087cfc2d59ac5b9af48b4cf5f5b86f920). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` class InvalidEntryException(msg: String) extends Exception(msg) ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19746: [SPARK-22346][ML] VectorSizeHint Transformer for using V...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19746 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83869/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19746: [SPARK-22346][ML] VectorSizeHint Transformer for using V...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19746 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19631 **[Test build #83876 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83876/testReport)** for PR 19631 at commit [`08f47ca`](https://github.com/apache/spark/commit/08f47ca3fb54315c537b1134e31e0a1a912c285e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19381: [SPARK-10884][ML] Support prediction on single instance ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19381 **[Test build #83877 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83877/testReport)** for PR 19381 at commit [`de84ca5`](https://github.com/apache/spark/commit/de84ca501d17b44f9153577ad2118e1254d80d34). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19621 **[Test build #83878 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83878/testReport)** for PR 19621 at commit [`77bea32`](https://github.com/apache/spark/commit/77bea32984b167894be79736f56601a44b99). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r151026919 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -961,6 +961,15 @@ object SQLConf { .booleanConf .createWithDefault(true) + val NESTED_SCHEMA_PRUNING_ENABLED = +buildConf("spark.sql.nestedSchemaPruning.enabled") + .internal() + .doc("Prune nested fields from a logical relation's output which are unnecessary in " + +"satisfying a query. This optimization allows columnar file format readers to avoid " + +"reading unnecessary nested column data.") + .booleanConf + .createWithDefault(true) --- End diff -- Giving it more thought, I believe it's prudent to choose correctness over performance. I will change the default to `false`. "Power users" will set it to `true` and (hopefully) report a problem if they run into one. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19747 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19747 **[Test build #83879 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83879/testReport)** for PR 19747 at commit [`6267033`](https://github.com/apache/spark/commit/626703310aa269a9351a2cf7b6ce23f8e4ab095a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19381: [SPARK-10884][ML] Support prediction on single instance ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19381 **[Test build #83877 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83877/testReport)** for PR 19381 at commit [`de84ca5`](https://github.com/apache/spark/commit/de84ca501d17b44f9153577ad2118e1254d80d34). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19381: [SPARK-10884][ML] Support prediction on single instance ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19381 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19381: [SPARK-10884][ML] Support prediction on single instance ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19381 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83877/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/19588#discussion_r151029101 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala --- @@ -311,22 +346,39 @@ class VectorIndexerModel private[ml] ( // TODO: Check more carefully about whether this whole class will be included in a closure. /** Per-vector transform function */ - private val transformFunc: Vector => Vector = { + private lazy val transformFunc: Vector => Vector = { val sortedCatFeatureIndices = categoryMaps.keys.toArray.sorted val localVectorMap = categoryMaps val localNumFeatures = numFeatures +val localHandleInvalid = getHandleInvalid val f: Vector => Vector = { (v: Vector) => assert(v.size == localNumFeatures, "VectorIndexerModel expected vector of length" + s" $numFeatures but found length ${v.size}") v match { case dv: DenseVector => + var hasInvalid = false val tmpv = dv.copy localVectorMap.foreach { case (featureIndex: Int, categoryMap: Map[Double, Int]) => -tmpv.values(featureIndex) = categoryMap(tmpv(featureIndex)) +try { --- End diff -- The try part is fast, yet the catch part can be very slow by comparison. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19594: [SPARK-21984] [SQL] Join estimation based on equi-height...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19594 **[Test build #83871 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83871/testReport)** for PR 19594 at commit [`8b2084a`](https://github.com/apache/spark/commit/8b2084a4bec8fdd58cca809b2d2b26bdc939436d). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` case class OverlappedRange(` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19692: [SPARK-22469][SQL] Accuracy problem in comparison with s...
Github user liutang123 commented on the issue: https://github.com/apache/spark/pull/19692 Jenkins, retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19594: [SPARK-21984] [SQL] Join estimation based on equi-height...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19594 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83871/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19594: [SPARK-21984] [SQL] Join estimation based on equi-height...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19594 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19594: [SPARK-21984] [SQL] Join estimation based on equi-height...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19594 **[Test build #83870 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83870/testReport)** for PR 19594 at commit [`67bd651`](https://github.com/apache/spark/commit/67bd65153bd0afc30c6ef4799caa02a05a19). * This patch passes all tests. * This patch **does not merge cleanly**. * This patch adds the following public classes _(experimental)_: * ` case class OverlappedRange(` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19594: [SPARK-21984] [SQL] Join estimation based on equi-height...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19594 Build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19594: [SPARK-21984] [SQL] Join estimation based on equi-height...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19594 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83870/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19640: [SPARK-16986][WEB-UI] Converter Started, Complete...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/19640#discussion_r151033625 --- Diff: core/src/main/resources/org/apache/spark/ui/static/utils.js --- @@ -46,3 +46,25 @@ function formatBytes(bytes, type) { var i = Math.floor(Math.log(bytes) / Math.log(k)); return parseFloat((bytes / Math.pow(k, i)).toFixed(dm)) + ' ' + sizes[i]; } + +function padZeroes(num) { + return ("0" + num).slice(-2); --- End diff -- [TagNameQuery("a")](https://github.com/apache/spark/blob/4a78965c22f11fbda7c9ba843ee266048bf6d319/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala#L349) gets its result from [WebBrowser](https://github.com/apache/spark/blob/4a78965c22f11fbda7c9ba843ee266048bf6d319/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala#L43). This browser doesn't support this function. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19640: [SPARK-16986][WEB-UI] Converter Started, Complete...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/19640#discussion_r151034910 --- Diff: core/src/main/resources/org/apache/spark/ui/static/utils.js --- @@ -46,3 +46,25 @@ function formatBytes(bytes, type) { var i = Math.floor(Math.log(bytes) / Math.log(k)); return parseFloat((bytes / Math.pow(k, i)).toFixed(dm)) + ' ' + sizes[i]; } + +function padZeroes(num) { + return ("0" + num).slice(-2); +} + +function formatTimeMillis(timeMillis) { + if (timeMillis <= 0) { +return "-"; + } else { +var dt = new Date(timeMillis); +return dt.getFullYear() + "-" + + padZeroes(dt.getMonth() + 1) + "-" + + padZeroes(dt.getDate()) + " " + + padZeroes(dt.getHours()) + ":" + + padZeroes(dt.getMinutes()) + ":" + + padZeroes(dt.getSeconds()); + } +} + +function getTimeZone() { + return new Date().toString().match(/\(([A-Za-z\s].*)\)/)[1]; --- End diff -- This time zone seems incorrect. Safari gets `Asia/Shanghai`, but Chrome gets `America/Chicago` on the same computer. How about just including this change in the release notes? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
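For readers following the time zone discussion above: the reviewed `getTimeZone()` parses the parenthesised suffix of `Date.prototype.toString()`, whose text is implementation-defined and so can differ between browsers. A minimal sketch of one possible alternative follows. It is not part of the PR; the name `getTimeZoneName` is hypothetical, and it assumes the browser exposes the standard `Intl.DateTimeFormat` API, falling back to the string-parsing approach (with a null check) when it does not.

```javascript
// Hypothetical alternative to getTimeZone(); not taken from the pull request.
function getTimeZoneName() {
  // Prefer the IANA zone name reported by the ECMAScript Internationalization API.
  if (typeof Intl !== "undefined" && typeof Intl.DateTimeFormat === "function") {
    var zone = Intl.DateTimeFormat().resolvedOptions().timeZone;
    if (zone) {
      return zone;
    }
  }
  // Fall back to the parenthesised suffix of Date.prototype.toString(), if present.
  var match = new Date().toString().match(/\(([^)]+)\)/);
  // Return an empty string instead of throwing when there is no such suffix.
  return match ? match[1] : "";
}
```

Whether the headless browser used by `HistoryServerSuite` supports `Intl` is not established in this thread, so the fallback branch would still need to be exercised there.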
[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/19560 I also hit this issue: ```sql select * from A join B on a.key = b.key ``` Table A is small, but table B is big and table B's stats are incorrect, so it will broadcast table B. I tried to use a broadcast hint to solve this issue: ```sql select /*+ MAPJOIN(A) */ * from A join B on a.key = b.key ``` But it doesn't work. I created a PR to fix it: https://github.com/apache/spark/pull/19714 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18641: [SPARK-21413][SQL] Fix 64KB JVM bytecode limit problem i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18641 **[Test build #83880 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83880/testReport)** for PR 18641 at commit [`e69f126`](https://github.com/apache/spark/commit/e69f12636bee5f3496421d70f764976f4cb687b7). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/17436#discussion_r151037948 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/AggregateHashMap.java --- @@ -40,7 +42,7 @@ */ public class AggregateHashMap { - private OnHeapColumnVector[] columnVectors; + private WritableColumnVector[] columnVectors; --- End diff -- Thanks, I realized it this morning. I will revert changes in `AggregateHashMap`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18853: [SPARK-21646][SQL] Add new type coercion to compa...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/18853#discussion_r151037844 --- Diff: docs/sql-programming-guide.md --- @@ -1490,6 +1490,13 @@ that these options will be deprecated in future release as more optimizations ar Configures the number of partitions to use when shuffling data for joins or aggregations. + +spark.sql.typeCoercion.mode +default + +The default type coercion mode was used in spark prior to 2.3, and so it continues to be the default to avoid breaking behavior. However, it has logical inconsistencies. The hive mode is preferred for most new applications, though it may require additional manual casting. --- End diff -- 2.3 -> 2.3.0 to be clear? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19621 **[Test build #83878 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83878/testReport)** for PR 19621 at commit [`77bea32`](https://github.com/apache/spark/commit/77bea32984b167894be79736f56601a44b99). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19621 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83878/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19621 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19753: [SPARK-22521][ML] VectorIndexerModel support hand...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19753 [SPARK-22521][ML] VectorIndexerModel support handle unseen categories via handleInvalid: Python API ## What changes were proposed in this pull request? Add a Python API for VectorIndexerModel to support handling unseen categories via handleInvalid. ## How was this patch tested? Doctest added. You can merge this pull request into a Git repository by running: $ git pull https://github.com/WeichenXu123/spark vector_indexer_invalid_py Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19753.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19753 commit 108ce2b1daa6b9b908c8791654433e90c666 Author: WeichenXu Date: 2017-11-15T06:04:56Z init pr --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19753: [SPARK-22521][ML] VectorIndexerModel support handle unse...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19753 **[Test build #83881 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83881/testReport)** for PR 19753 at commit [`108ce2b`](https://github.com/apache/spark/commit/108ce2b1daa6b9b908c8791654433e90c666). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19631 **[Test build #83875 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83875/testReport)** for PR 19631 at commit [`121bcf8`](https://github.com/apache/spark/commit/121bcf8a2758858cad8e88e7fb7d78566494765b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19631 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19631 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83875/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19753: [SPARK-22521][ML] VectorIndexerModel support handle unse...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19753 **[Test build #83881 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83881/testReport)** for PR 19753 at commit [`108ce2b`](https://github.com/apache/spark/commit/108ce2b1daa6b9b908c8791654433e90c666). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid, JavaMLReadable,` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19753: [SPARK-22521][ML] VectorIndexerModel support handle unse...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19753 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19753: [SPARK-22521][ML] VectorIndexerModel support handle unse...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19753 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83881/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19631 **[Test build #83876 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83876/testReport)** for PR 19631 at commit [`08f47ca`](https://github.com/apache/spark/commit/08f47ca3fb54315c537b1134e31e0a1a912c285e). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class ClientSuite extends SparkFunSuite with Matchers ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19631 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19631 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83876/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19740: [SPARK-22514][SQL] move ColumnVector.Array and Co...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19740#discussion_r151041177 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/VectorBasedRow.java --- @@ -0,0 +1,328 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.vectorized; + +import java.math.BigDecimal; + +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; +import org.apache.spark.sql.catalyst.expressions.UnsafeRow; +import org.apache.spark.sql.catalyst.util.ArrayData; +import org.apache.spark.sql.catalyst.util.MapData; +import org.apache.spark.sql.types.*; +import org.apache.spark.unsafe.types.CalendarInterval; +import org.apache.spark.unsafe.types.UTF8String; + +/** + * Row abstraction in {@link ColumnVector}. The instance of this class is intended + * to be reused, callers should copy the data out if it needs to be stored. + */ +public final class VectorBasedRow extends InternalRow { --- End diff -- How about `ColumnarRow`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19740: [SPARK-22514][SQL] move ColumnVector.Array and Co...
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/19740#discussion_r151041303
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/VectorBasedArray.java ---
@@ -0,0 +1,209 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.vectorized;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.catalyst.util.MapData;
+import org.apache.spark.sql.types.*;
+import org.apache.spark.unsafe.types.CalendarInterval;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Array abstraction in {@link ColumnVector}. The instance of this class is intended
+ * to be reused, callers should copy the data out if it needs to be stored.
+ */
+public final class VectorBasedArray extends ArrayData {
--- End diff --
`ColumnarArray`?
[GitHub] spark issue #18641: [SPARK-21413][SQL] Fix 64KB JVM bytecode limit problem i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18641
**[Test build #83880 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83880/testReport)** for PR 18641 at commit [`e69f126`](https://github.com/apache/spark/commit/e69f12636bee5f3496421d70f764976f4cb687b7).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.