[GitHub] spark pull request: [SPARK-5122] Remove Shark from spark-ec2
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/3939#issuecomment-69270641 retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [WIP][SPARK-4912][SQL] Persistent tables for t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3960#issuecomment-69272469 [Test build #25277 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25277/consoleFull) for PR 3960 at commit [`49bf1ac`](https://github.com/apache/spark/commit/49bf1acc700d454f894edf55cd8fa88aee4d63da). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r22693139

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala ---
@@ -0,0 +1,77 @@ (new file)

```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.mllib.clustering.metrics

import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}

import org.apache.spark.mllib.base._
import org.apache.spark.mllib.linalg.{SparseVector, DenseVector, Vector}
import org.apache.spark.mllib.base.{Centroid, FPoint, PointOps, Infinity, Zero}

class FastEUPoint(raw: BV[Double], weight: Double) extends FPoint(raw, weight) {
  val norm = if (weight == Zero) Zero else raw.dot(raw) / (weight * weight)
}

/**
 * Euclidean distance measure, expedited by pre-computing vector norms
 */
class FastEuclideanOps extends PointOps[FastEUPoint, FastEUPoint] with Serializable {

  type C = FastEUPoint
  type P = FastEUPoint

  val epsilon = 1e-4

  /* compute a lower bound on the euclidean distance */

  def distance(p: P, c: C, upperBound: Double): Double = {
    val d = if (p.weight == Zero || c.weight == Zero) {
      p.norm + c.norm
    } else {
      val x = p.raw.dot(c.raw) / (p.weight * c.weight)
```

--- End diff --

same question about using `weight` in `distance`
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r22693112

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala ---
@@ -0,0 +1,77 @@ (new file)

```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.mllib.clustering.metrics

import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}

import org.apache.spark.mllib.base._
import org.apache.spark.mllib.linalg.{SparseVector, DenseVector, Vector}
import org.apache.spark.mllib.base.{Centroid, FPoint, PointOps, Infinity, Zero}

class FastEUPoint(raw: BV[Double], weight: Double) extends FPoint(raw, weight) {
  val norm = if (weight == Zero) Zero else raw.dot(raw) / (weight * weight)
```

--- End diff --

Should `weight` be only used in aggregation rather than distance computation?
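For readers skimming the thread: the optimization being reviewed caches each point's squared norm so that a squared Euclidean distance later costs only one dot product per pair. Below is a minimal, Spark-free sketch of that idea under simplifying assumptions (no weights, dense arrays; `Pt`, `dot`, and `fastSqDist` are illustrative names, not the PR's API):

```scala
// Cache each point's squared norm once, so a squared Euclidean distance
// needs only one dot product per pair:
//   ||p - c||^2 = ||p||^2 + ||c||^2 - 2 * (p . c)
case class Pt(raw: Array[Double]) {
  // precomputed squared norm: paid once per point, not once per distance call
  val norm: Double = raw.map(x => x * x).sum
}

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

// squared distance from cached norms; clamped at zero because floating-point
// cancellation can yield a tiny negative value for near-identical points
def fastSqDist(p: Pt, c: Pt): Double =
  math.max(0.0, p.norm + c.norm - 2.0 * dot(p.raw, c.raw))
```

The question raised in the comments above is whether the PR's extra division by `weight` belongs in this formula at all, or only in centroid aggregation.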
[GitHub] spark pull request: [SPARK-4574][SQL] Adding support for defining ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3431#issuecomment-69279963 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25279/ Test FAILed.
[GitHub] spark pull request: [SPARK-4574][SQL] Adding support for defining ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3431#issuecomment-69279959 [Test build #25279 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25279/consoleFull) for PR 3431 at commit [`f336a16`](https://github.com/apache/spark/commit/f336a16c4b1e6241d160d2c149cdb13dba4b9263). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class DefaultSource extends SchemaRelationProvider ` * `case class ParquetRelation2(` * `trait SchemaRelationProvider `
[GitHub] spark pull request: [SPARK-4989][CORE] avoid wrong eventlog conf c...
Github user liyezhang556520 commented on the pull request: https://github.com/apache/spark/pull/3824#issuecomment-69280380 @andrewor14, I received an email notification of your comment about creating other PRs to fix this issue in older branches, but I can't find it on this page. I think you may have removed that comment, so should I still open the new PRs, or just ignore that message?
[GitHub] spark pull request: [SPARK-5122] Remove Shark from spark-ec2
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3939
[GitHub] spark pull request: [SPARK-4697][YARN]System properties should ove...
Github user WangTaoTheTonic commented on the pull request: https://github.com/apache/spark/pull/3557#issuecomment-69282649 @vanzin Note what I noted :-) Note: in the test cases I didn't use SparkConf.setAppName in the application code.
[GitHub] spark pull request: [SPARK-4574][SQL] Adding support for defining ...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/3431#discussion_r22697822

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala ---
@@ -83,10 +118,73 @@

```scala
protected lazy val className: Parser[String] =
  repsep(ident, ".") ^^ { case s => s.mkString(".") }

protected lazy val pair: Parser[(String, String)] =
  ident ~ stringLit ^^ { case k ~ v => (k, v) }

protected lazy val column: Parser[StructField] =
  ident ~ dataType ^^ { case columnName ~ typ =>
    StructField(cleanIdentifier(columnName), typ)
  }

protected lazy val primitiveType: Parser[DataType] =
  STRING ^^^ StringType |
  BINARY ^^^ BinaryType |
  BOOLEAN ^^^ BooleanType |
  TINYINT ^^^ ByteType |
  SMALLINT ^^^ ShortType |
  INT ^^^ IntegerType |
  BIGINT ^^^ LongType |
  FLOAT ^^^ FloatType |
  DOUBLE ^^^ DoubleType |
  fixedDecimalType |                  // decimal with precision/scale
  DECIMAL ^^^ DecimalType.Unlimited | // decimal with no precision/scale
  DATE ^^^ DateType |
  TIMESTAMP ^^^ TimestampType |
  VARCHAR ~ "(" ~ numericLit ~ ")" ^^^ StringType

protected lazy val fixedDecimalType: Parser[DataType] =
  (DECIMAL ~ "(" ~> numericLit) ~ ("," ~> numericLit <~ ")") ^^ {
    case precision ~ scale => DecimalType(precision.toInt, scale.toInt)
  }

protected lazy val arrayType: Parser[DataType] =
  ARRAY ~> "<" ~> dataType <~ ">" ^^ {
    case tpe => ArrayType(tpe)
  }

protected lazy val mapType: Parser[DataType] =
  MAP ~> "<" ~> dataType ~ "," ~ dataType <~ ">" ^^ {
    case t1 ~ _ ~ t2 => MapType(t1, t2)
  }

protected lazy val structField: Parser[StructField] =
  ident ~ ":" ~ dataType ^^ {
    case fieldName ~ _ ~ tpe => StructField(cleanIdentifier(fieldName), tpe, nullable = true)
  }

protected lazy val structType: Parser[DataType] =
  (STRUCT ~> "<" ~> repsep(structField, ",") <~ ">" ^^ {
    case fields => new StructType(fields)
  }) |
  (STRUCT ~ "<>" ^^ {
    case fields => new StructType(Nil)
  })

private[sql] lazy val dataType: Parser[DataType] =
  arrayType |
  mapType |
  structType |
  primitiveType

protected val escapedIdentifier = "`([^`]+)`".r

/** Strips backticks from ident if present */
protected def cleanIdentifier(ident: String): String = ident match {
  case escapedIdentifier(i) => i
  case plainIdent => plainIdent
}
```

--- End diff --

Thank you:)
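The `cleanIdentifier` helper at the end of the quoted diff is small enough to demonstrate standalone. Here is a runnable copy, independent of Spark's parser (the surrounding `protected` modifiers are dropped so it compiles on its own):

```scala
// Strips surrounding backticks from an identifier, if present.
// Regex copied from the quoted diff: one or more non-backtick chars in backticks.
val escapedIdentifier = "`([^`]+)`".r

def cleanIdentifier(ident: String): String = ident match {
  case escapedIdentifier(i) => i  // backtick-quoted: return the inner text
  case plainIdent => plainIdent   // unquoted identifiers pass through unchanged
}
```

Note that Scala's `Regex` extractor anchors to the whole string here, so only fully backtick-wrapped identifiers are unwrapped.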
[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3638#issuecomment-69282680 [Test build #25281 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25281/consoleFull) for PR 3638 at commit [`5267929`](https://github.com/apache/spark/commit/5267929054cce06dd1c422a6a010e82b81b22a13). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3638#issuecomment-69282684 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25281/ Test PASSed.
[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/3637#discussion_r22693295

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -80,69 +50,157 @@ class LogisticRegression extends Estimator[LogisticRegressionModel] with Logisti

```diff
   def setRegParam(value: Double): this.type = set(regParam, value)
   def setMaxIter(value: Int): this.type = set(maxIter, value)
-  def setLabelCol(value: String): this.type = set(labelCol, value)
   def setThreshold(value: Double): this.type = set(threshold, value)
-  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
-  def setScoreCol(value: String): this.type = set(scoreCol, value)
-  def setPredictionCol(value: String): this.type = set(predictionCol, value)

   override def fit(dataset: SchemaRDD, paramMap: ParamMap): LogisticRegressionModel = {
+    // Check schema
     transformSchema(dataset.schema, paramMap, logging = true)
-    import dataset.sqlContext._
+
+    // Extract columns from data.  If dataset is persisted, do not persist oldDataset.
+    val oldDataset = extractLabeledPoints(dataset, paramMap)
     val map = this.paramMap ++ paramMap
-    val instances = dataset.select(map(labelCol).attr, map(featuresCol).attr)
-      .map { case Row(label: Double, features: Vector) =>
-        LabeledPoint(label, features)
-      }.persist(StorageLevel.MEMORY_AND_DISK)
+    val handlePersistence = dataset.getStorageLevel == StorageLevel.NONE
+    if (handlePersistence) {
+      oldDataset.persist(StorageLevel.MEMORY_AND_DISK)
+    }
+
+    // Train model
     val lr = new LogisticRegressionWithLBFGS
     lr.optimizer
       .setRegParam(map(regParam))
       .setNumIterations(map(maxIter))
-    val lrm = new LogisticRegressionModel(this, map, lr.run(instances).weights)
-    instances.unpersist()
+    val oldModel = lr.run(oldDataset)
+    val lrm = new LogisticRegressionModel(this, map, oldModel.weights, oldModel.intercept)
+
+    if (handlePersistence) {
+      oldDataset.unpersist()
+    }
+
     // copy model params
     Params.inheritValues(map, this, lrm)
     lrm
   }

-  private[ml] override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = {
-    validateAndTransformSchema(schema, paramMap, fitting = true)
-  }

+  override protected def featuresDataType: DataType = new VectorUDT
```

--- End diff --

Ehh... nevermind.. i think i got it. Feels very strange - if we must have this, can't we make VectorUDT the default?
[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3638#discussion_r22693876

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -456,10 +459,18 @@ private[spark] class TaskSetManager(

```diff
       }
       // Serialize and return the task
       val startTime = clock.getTime()
-      // We rely on the DAGScheduler to catch non-serializable closures and RDDs, so in here
-      // we assume the task can be serialized without exceptions.
-      val serializedTask = Task.serializeWithDependencies(
-        task, sched.sc.addedFiles, sched.sc.addedJars, ser)
+      val serializedTask: ByteBuffer = try {
+        Task.serializeWithDependencies(task, sched.sc.addedFiles,
+            sched.sc.addedJars, ser)
```

--- End diff --

bump this up 1 line
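The pattern in the diff above (attempting task serialization inside a `try` so that a non-serializable closure becomes a handled error rather than an uncaught crash) can be illustrated outside Spark with plain Java serialization. This is a Spark-free sketch; `trySerialize` is an illustrative name, not Spark's API:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Attempt to serialize an object; turn a NotSerializableException into a
// descriptive error value instead of letting it propagate and kill the caller.
def trySerialize(obj: AnyRef): Either[String, Array[Byte]] =
  try {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(obj)
    out.close()
    Right(buffer.toByteArray)
  } catch {
    case e: NotSerializableException =>
      Left(s"Failed to serialize task: ${e.getMessage}")
  }
```

In the actual PR, the failure branch aborts the task set; here it simply returns a `Left` so the caller decides.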
[GitHub] spark pull request: [SPARK-4574][SQL] Adding support for defining ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3431#issuecomment-69275198 [Test build #25279 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25279/consoleFull) for PR 3431 at commit [`f336a16`](https://github.com/apache/spark/commit/f336a16c4b1e6241d160d2c149cdb13dba4b9263). * This patch merges cleanly.
[GitHub] spark pull request: Spark 3299 add to SQLContext API to show table...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3872#issuecomment-69278134 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25276/ Test PASSed.
[GitHub] spark pull request: Spark 3299 add to SQLContext API to show table...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3872#issuecomment-69278128 [Test build #25276 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25276/consoleFull) for PR 3872 at commit [`c5609fa`](https://github.com/apache/spark/commit/c5609faec0647332243151ab7513ccdc04893f46). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [Spark-3490] Disable SparkUI for tests (backpo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3961#issuecomment-69279153 [Test build #25280 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25280/consoleFull) for PR 3961 at commit [`8644997`](https://github.com/apache/spark/commit/8644997624af1739890ec902f7e2e36278d158fa). * This patch **fails some tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/3638#issuecomment-69279237 Ah never mind, I found the abort [here](https://github.com/mccheah/spark/blob/5267929054cce06dd1c422a6a010e82b81b22a13/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L470).
[GitHub] spark pull request: [Spark-3490] Disable SparkUI for tests (backpo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3961#issuecomment-69279159 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25280/ Test FAILed.
[GitHub] spark pull request: [SPARK-5122] Remove Shark from spark-ec2
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/3939#issuecomment-69281768 The wiki location seems fine. Maybe others disagree.
[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3607#issuecomment-69286150 [Test build #25284 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25284/consoleFull) for PR 3607 at commit [`6c1b264`](https://github.com/apache/spark/commit/6c1b264efe76483ffa0c2c589c51b4c42de18c59). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3607#issuecomment-69286154 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25284/ Test PASSed.
[GitHub] spark pull request: [SPARK-4990][Deploy]to find default properties...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3823#issuecomment-69287254 [Test build #25286 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25286/consoleFull) for PR 3823 at commit [`4cc7f34`](https://github.com/apache/spark/commit/4cc7f3467ed78bb4b3a1a404c0b1daf1bd009c35). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4990][Deploy]to find default properties...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3823#issuecomment-69287260 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25286/ Test PASSed.
[GitHub] spark pull request: [SPARK-5122] Remove Shark from spark-ec2
Github user nchammas commented on the pull request: https://github.com/apache/spark/pull/3939#issuecomment-69287974 Okie doke, thank you @andrewor14.
[GitHub] spark pull request: [SPARK-3586][streaming]Support nested director...
Github user wangxiaojing commented on the pull request: https://github.com/apache/spark/pull/2765#issuecomment-69288062 @tdas Rebased onto the latest master and updated the PR.
[GitHub] spark pull request: [SPARK-4033][Examples]Input of the SparkPi too...
Github user SaintBacchus commented on the pull request: https://github.com/apache/spark/pull/2874#issuecomment-69289017 @andrewor14 I had explained why it cannot use `Long` instead of `Int`: both `Range` and `Partition` only work with `Int`, and an `Int` index cannot be widened to `Long` there. Can we restrict the input and log an error to exit the process?
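The overflow concern can be made concrete: `Range` and RDD partition counts are `Int`-based, so a product like `100000 * slices` silently wraps for large `slices`. A hedged sketch of the suggested input restriction (names are illustrative, not SparkPi's actual code): compute the count in `Long` space and reject anything that does not fit in an `Int`.

```scala
// Validate a sample count instead of letting Int arithmetic overflow silently.
def checkedSampleCount(samplesPerSlice: Int, slices: Int): Either[String, Int] = {
  // multiply in Long space so the product cannot wrap
  val n = samplesPerSlice.toLong * slices.toLong
  if (n > Int.MaxValue) Left(s"$n samples exceed Int.MaxValue; reduce the input")
  else Right(n.toInt)
}
```

A caller would log the `Left` message and exit, matching the comment's suggestion of restricting input rather than changing types.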
[GitHub] spark pull request: [SPARK-4955]With executor dynamic scaling enab...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3962#issuecomment-69289902 [Test build #25291 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25291/consoleFull) for PR 3962 at commit [`2164ea8`](https://github.com/apache/spark/commit/2164ea88edd33c833fbbd0c7baa86426ef3534c0). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` protected class YarnSchedulerActor(isDriver: Boolean) extends Actor `
[GitHub] spark pull request: [SPARK-4955]With executor dynamic scaling enab...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3962#issuecomment-69289908 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25291/ Test FAILed.
[GitHub] spark pull request: [SPARK-4048] Enhance and extend hadoop-provide...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/2982#discussion_r22691343 --- Diff: yarn/pom.xml --- @@ -131,13 +131,6 @@ <skip>true</skip> </configuration> </plugin> - <plugin> - <groupId>org.apache.maven.plugins</groupId> - <artifactId>maven-install-plugin</artifactId> - <configuration> - <skip>true</skip> --- End diff -- https://github.com/vanzin/spark/commit/1adf91c401890d6a93d3950d98f951db11304cb3
[GitHub] spark pull request: [SPARK-3910] Remove pyspark/mllib/ from sys.pa...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/3940#issuecomment-69270182 @mengxr I think so, it's better to backport that into 1.1
[GitHub] spark pull request: [SPARK-3541][MLLIB] New ALS implementation wit...
Github user coderxiang commented on a diff in the pull request: https://github.com/apache/spark/pull/3720#discussion_r22692999 --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala --- @@ -0,0 +1,964 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.recommendation + +import java.{util => javaUtil} + +import scala.collection.mutable + +import com.github.fommil.netlib.BLAS.{getInstance => blas} +import com.github.fommil.netlib.LAPACK.{getInstance => lapack} +import org.netlib.util.intW + +import org.apache.spark.{HashPartitioner, Logging, Partitioner} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{SchemaRDD, StructType} +import org.apache.spark.sql.catalyst.dsl._ +import org.apache.spark.sql.catalyst.expressions.Cast +import org.apache.spark.sql.catalyst.plans.LeftOuter +import org.apache.spark.sql.catalyst.types.{DoubleType, FloatType, IntegerType, StructField} +import org.apache.spark.util.Utils +import org.apache.spark.util.collection.{OpenHashMap, OpenHashSet, SortDataFormat, Sorter} +import org.apache.spark.util.random.XORShiftRandom + +/** + * Common params for ALS.
+ */ +private[recommendation] trait ALSParams extends Params with HasMaxIter with HasRegParam + with HasPredictionCol { + + /** Param for rank of the matrix factorization. */ + val rank = new IntParam(this, "rank", "rank of the factorization", Some(10)) + def getRank: Int = get(rank) + + /** Param for number of user blocks. */ + val numUserBlocks = new IntParam(this, "numUserBlocks", "number of user blocks", Some(10)) + def getNumUserBlocks: Int = get(numUserBlocks) + + /** Param for number of item blocks. */ + val numItemBlocks = +new IntParam(this, "numItemBlocks", "number of item blocks", Some(10)) + def getNumItemBlocks: Int = get(numItemBlocks) + + /** Param to decide whether to use implicit preference. */ + val implicitPrefs = +new BooleanParam(this, "implicitPrefs", "whether to use implicit preference", Some(false)) + def getImplicitPrefs: Boolean = get(implicitPrefs) + + /** Param for the alpha parameter in the implicit preference formulation. */ + val alpha = new DoubleParam(this, "alpha", "alpha for implicit preference", Some(1.0)) + def getAlpha: Double = get(alpha) + + /** Param for the column name for user ids. */ + val userCol = new Param[String](this, "userCol", "column name for user ids", Some("user")) + def getUserCol: String = get(userCol) + + /** Param for the column name for item ids. */ + val itemCol = +new Param[String](this, "itemCol", "column name for item ids", Some("item")) + def getItemCol: String = get(itemCol) + + /** Param for the column name for ratings. */ + val ratingCol = new Param[String](this, "ratingCol", "column name for ratings", Some("rating")) + def getRatingCol: String = get(ratingCol) + + /** + * Validates and transforms the input schema.
+ * @param schema input schema + * @param paramMap extra params + * @return output schema + */ + protected def validateAndTransformSchema(schema: StructType, paramMap: ParamMap): StructType = { +val map = this.paramMap ++ paramMap +assert(schema(map(userCol)).dataType == IntegerType) +assert(schema(map(itemCol)).dataType == IntegerType) +val ratingType = schema(map(ratingCol)).dataType +assert(ratingType == FloatType || ratingType == DoubleType) +val predictionColName = map(predictionCol) +assert(!schema.fieldNames.contains(predictionColName), + s"Prediction column $predictionColName already exists.") +val newFields = schema.fields :+ StructField(map(predictionCol), FloatType, nullable = false) +StructType(newFields) + } +} + +/** + * Model fitted by ALS. + */ +class ALSModel
[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/3637#discussion_r22692953 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala --- @@ -0,0 +1,198 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.classification + +import org.apache.spark.annotation.{DeveloperApi, AlphaComponent} +import org.apache.spark.ml.impl.estimator.{PredictionModel, Predictor, PredictorParams} +import org.apache.spark.ml.param.{Params, ParamMap, HasRawPredictionCol} +import org.apache.spark.mllib.linalg.{Vector, VectorUDT} +import org.apache.spark.sql._ +import org.apache.spark.sql.catalyst.analysis.Star + +/** + * :: DeveloperApi :: + * Params for classification.
+ */ +@DeveloperApi +trait ClassifierParams extends PredictorParams + with HasRawPredictionCol { + + override protected def validateAndTransformSchema( + schema: StructType, + paramMap: ParamMap, + fitting: Boolean, + featuresDataType: DataType): StructType = { +val parentSchema = super.validateAndTransformSchema(schema, paramMap, fitting, featuresDataType) +val map = this.paramMap ++ paramMap +addOutputColumn(parentSchema, map(rawPredictionCol), new VectorUDT) + } +} + +/** + * :: AlphaComponent :: + * Single-label binary or multiclass classification. + * Classes are indexed {0, 1, ..., numClasses - 1}. + * + * @tparam FeaturesType Type of input features. E.g., [[Vector]] + * @tparam Learner Concrete Estimator type + * @tparam M Concrete Model type + */ +@AlphaComponent +abstract class Classifier[ --- End diff -- I don't have a concrete suggestion here, but these abstract types are starting to get complicated/look redundant. Is Learner only there to make subclassing cleaner?
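For context on etrain's question: the `Learner` type parameter is an instance of the F-bounded (self-recursive) type pattern, which lets setters declared on the abstract class return the concrete subclass, so call chains keep their precise type. A minimal standalone sketch follows; the names here are illustrative, not the ml package's actual hierarchy.

```scala
// F-bounded type parameter: Self is the concrete subclass itself.
abstract class Learner[Self <: Learner[Self]] { self: Self =>
  var maxIter: Int = 10
  // Returns Self, not Learner[_], so chained calls keep the subtype.
  def setMaxIter(value: Int): Self = { maxIter = value; self }
}

class LogRegLearner extends Learner[LogRegLearner] {
  var regParam: Double = 0.0
  def setRegParam(value: Double): this.type = { regParam = value; this }
}

// Compiles only because setMaxIter returns LogRegLearner:
// new LogRegLearner().setMaxIter(5).setRegParam(0.1)
```

Without the extra type parameter, `setMaxIter` would return the base type and the chained `setRegParam` call would not compile; the cost is exactly the signature complexity the review comment is pointing at.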
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-69272196 @derrickburns I like the improvements implemented in this PR. But as @srowen mentioned, we have to resolve conflicts with the master branch before we can merge any PR. I compared the performance of this PR with master on mnist-digits (6x784, sparse, 10 clusters) locally and found that master runs 2-3x faster. I guess this is mainly caused by two changes:

1. We replaced breeze operations by our own implementation. The latter is about 2-3x faster.
1. Running k-means++ distributively has noticeable overhead with small k and feature dimension.

I think it is still feasible to include the features through separate PRs:

1. remember previously computed best distances in k-means++ initialization
1. allow fixing the random seed (addressed in #3610)
1. variable number of clusters. We should discuss whether we want to have fewer than k clusters or split the biggest one if there are more than k points.
1. parallelize k-means++. Whether we should replace local k-means++ or make it configurable requires some discussion and performance comparison.
1. support Bregman divergences

Putting all of them together would certainly delay the review process and require resolving conflicts. I may have some time to prepare PRs for some of the features here, if you don't mind. For Bregman divergences, I'm thinking we can alter the formulation to support sparse vectors:

~~~
d(x, y) = f(x) - f(y) - <x - y, g(y)>
        = f(x) - (f(y) - <y, g(y)>) - <x, g(y)>
~~~

where `f(x)`, `g(y)`, and `f(y) - <y, g(y)>` could be pre-computed and cached, and `<x, g(y)>` can take advantage of a sparse `x`. But I'm not sure whether this formulation is really useful for any Bregman divergence other than the squared distance and the Mahalanobis distance. For KL-divergence and generalized I-divergence, the domain is R^d_+ and hence the points cannot be sparse.
Besides those comments, I'm going to make some minor comments inline.
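mengxr's caching idea for Bregman divergences can be made concrete. Below is a hedged, self-contained sketch (names assumed, not from the PR) using the squared Euclidean distance, the Bregman divergence generated by f(x) = ||x||^2 with gradient g(x) = 2x; the center-only terms are precomputed once per cluster center.

```scala
// Cached per-center quantities: gc = g(c) and offset = f(c) - <c, g(c)>.
case class Center(gc: Array[Double], offset: Double)

object Bregman {
  // Generator f and its gradient g for the squared Euclidean distance.
  def f(x: Array[Double]): Double = x.map(v => v * v).sum
  def g(x: Array[Double]): Array[Double] = x.map(_ * 2)

  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  def precompute(c: Array[Double]): Center =
    Center(g(c), f(c) - dot(c, g(c)))

  // d(x, c) = f(x) - (f(c) - <c, g(c)>) - <x, g(c)>.
  // Only the <x, g(c)> term touches x's entries, so a sparse x
  // would cost O(nnz(x)) per center here.
  def divergence(x: Array[Double], c: Center): Double =
    f(x) - c.offset - dot(x, c.gc)
}
```

For this choice of f the formula reduces exactly to the squared distance: `Bregman.divergence(Array(3.0, 0.0), Bregman.precompute(Array(1.0, 0.0)))` gives `4.0`, i.e. (3 - 1)^2.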
[GitHub] spark pull request: [WIP][SPARK-4912][SQL] Persistent tables for t...
GitHub user yhuai opened a pull request: https://github.com/apache/spark/pull/3960 [WIP][SPARK-4912][SQL] Persistent tables for the Spark SQL data sources api This one subsumes #3752. It currently contains changes made in #3431. Will clean it up once #3431 is in. You can merge this pull request into a Git repository by running: $ git pull https://github.com/yhuai/spark persistantTablesWithSchema2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3960.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3960

commit d7da491713a83f25de5c07639de7985a96c801a6 Author: Michael Armbrust mich...@databricks.com Date: 2014-12-20T20:45:28Z First draft of persistent tables.
commit 6edc71026c4a10cce338adaa7b807fef0ee2857b Author: Michael Armbrust mich...@databricks.com Date: 2014-12-20T21:03:59Z Add tests.
commit 1ea6e7bbf04c04f7c51884ca0ec819cddfaac10b Author: Michael Armbrust mich...@databricks.com Date: 2014-12-21T22:23:34Z Don't fail when trying to uncache a table that doesn't exist
commit c00bb1bf25b8f9875fc3e8b58d007d67496f1b2f Author: Michael Armbrust mich...@databricks.com Date: 2014-12-22T19:05:46Z Don't use reflection to read options
commit 2b5972353a47ca1577a0ddcd3aab5c9dbd1d10d4 Author: Michael Armbrust mich...@databricks.com Date: 2014-12-22T19:08:13Z Set external when creating tables
commit 8f8f1a167360bfab3198b086d4608f5b3517f249 Author: Yin Huai yh...@databricks.com Date: 2015-01-08T00:53:02Z [SPARK-4574][SQL] Adding support for defining schema in foreign DDL commands. #3431
commit f47fda1f5e34dd73d7e5db9949eceb21cdd1ce89 Author: Yin Huai yh...@databricks.com Date: 2015-01-08T01:58:00Z Unit tests.
commit 49bf1acc700d454f894edf55cd8fa88aee4d63da Author: Yin Huai yh...@databricks.com Date: 2015-01-08T01:58:00Z Unit tests.
[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/3637#discussion_r22693428 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala --- @@ -0,0 +1,143 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.classification + +import org.apache.spark.annotation.{AlphaComponent, DeveloperApi} +import org.apache.spark.ml.param.{HasProbabilityCol, ParamMap, Params} +import org.apache.spark.mllib.linalg.{Vector, VectorUDT} +import org.apache.spark.sql._ +import org.apache.spark.sql.catalyst.analysis.Star + +/** + * Params for probabilistic classification.
+ */ +private[classification] trait ProbabilisticClassifierParams + extends ClassifierParams with HasProbabilityCol { + + override protected def validateAndTransformSchema( + schema: StructType, + paramMap: ParamMap, + fitting: Boolean, + featuresDataType: DataType): StructType = { +val parentSchema = super.validateAndTransformSchema(schema, paramMap, fitting, featuresDataType) +val map = this.paramMap ++ paramMap +addOutputColumn(parentSchema, map(probabilityCol), new VectorUDT) + } +} + + +/** + * :: AlphaComponent :: + * + * Single-label binary or multiclass classifier which can output class conditional probabilities. + * + * @tparam FeaturesType Type of input features. E.g., [[Vector]] + * @tparam Learner Concrete Estimator type + * @tparam M Concrete Model type + */ +@AlphaComponent +abstract class ProbabilisticClassifier[ +FeaturesType, +Learner <: ProbabilisticClassifier[FeaturesType, Learner, M], +M <: ProbabilisticClassificationModel[FeaturesType, M]] + extends Classifier[FeaturesType, Learner, M] with ProbabilisticClassifierParams { + + def setProbabilityCol(value: String): Learner = set(probabilityCol, value).asInstanceOf[Learner] +} + + +/** + * :: AlphaComponent :: + * + * Model produced by a [[ProbabilisticClassifier]]. + * Classes are indexed {0, 1, ..., numClasses - 1}. + * + * @tparam FeaturesType Type of input features.
E.g., [[Vector]] + * @tparam M Concrete Model type + */ +@AlphaComponent +abstract class ProbabilisticClassificationModel[ +FeaturesType, +M <: ProbabilisticClassificationModel[FeaturesType, M]] + extends ClassificationModel[FeaturesType, M] with ProbabilisticClassifierParams { + + def setProbabilityCol(value: String): M = set(probabilityCol, value).asInstanceOf[M] + + /** + * Transforms dataset by reading from [[featuresCol]], and appending new columns as specified by + * parameters: + * - predicted labels as [[predictionCol]] of type [[Double]] + * - raw predictions (confidences) as [[rawPredictionCol]] of type [[Vector]] + * - probability of each class as [[probabilityCol]] of type [[Vector]]. + * + * @param dataset input dataset + * @param paramMap additional parameters, overwrite embedded params + * @return transformed dataset + */ + override def transform(dataset: SchemaRDD, paramMap: ParamMap): SchemaRDD = { +// This default implementation should be overridden as needed. +import dataset.sqlContext._ +import org.apache.spark.sql.catalyst.dsl._ + +// Check schema +transformSchema(dataset.schema, paramMap, logging = true) +val map = this.paramMap ++ paramMap + +// Prepare model +val tmpModel = if (paramMap.size != 0) { + val tmpModel = this.copy() + Params.inheritValues(paramMap, parent, tmpModel) + tmpModel +} else { + this +} + +val (numColsOutput, outputData) = + ClassificationModel.transformColumnsImpl[FeaturesType](dataset, tmpModel, map) + +// Output selected columns only. +if (map(probabilityCol) != "") { + //
[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/3637#discussion_r22693403 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -80,69 +50,157 @@ class LogisticRegression extends Estimator[LogisticRegressionModel] with Logisti def setRegParam(value: Double): this.type = set(regParam, value) def setMaxIter(value: Int): this.type = set(maxIter, value) - def setLabelCol(value: String): this.type = set(labelCol, value) def setThreshold(value: Double): this.type = set(threshold, value) - def setFeaturesCol(value: String): this.type = set(featuresCol, value) - def setScoreCol(value: String): this.type = set(scoreCol, value) - def setPredictionCol(value: String): this.type = set(predictionCol, value) override def fit(dataset: SchemaRDD, paramMap: ParamMap): LogisticRegressionModel = { +// Check schema transformSchema(dataset.schema, paramMap, logging = true) -import dataset.sqlContext._ + +// Extract columns from data. If dataset is persisted, do not persist oldDataset. 
+val oldDataset = extractLabeledPoints(dataset, paramMap) val map = this.paramMap ++ paramMap -val instances = dataset.select(map(labelCol).attr, map(featuresCol).attr) - .map { case Row(label: Double, features: Vector) = -LabeledPoint(label, features) - }.persist(StorageLevel.MEMORY_AND_DISK) +val handlePersistence = dataset.getStorageLevel == StorageLevel.NONE +if (handlePersistence) { + oldDataset.persist(StorageLevel.MEMORY_AND_DISK) +} + +// Train model val lr = new LogisticRegressionWithLBFGS lr.optimizer .setRegParam(map(regParam)) .setNumIterations(map(maxIter)) -val lrm = new LogisticRegressionModel(this, map, lr.run(instances).weights) -instances.unpersist() +val oldModel = lr.run(oldDataset) +val lrm = new LogisticRegressionModel(this, map, oldModel.weights, oldModel.intercept) + +if (handlePersistence) { + oldDataset.unpersist() +} + // copy model params Params.inheritValues(map, this, lrm) lrm } - private[ml] override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = { -validateAndTransformSchema(schema, paramMap, fitting = true) - } + override protected def featuresDataType: DataType = new VectorUDT } + /** * :: AlphaComponent :: + * * Model produced by [[LogisticRegression]]. 
*/ @AlphaComponent class LogisticRegressionModel private[ml] ( override val parent: LogisticRegression, override val fittingParamMap: ParamMap, -weights: Vector) - extends Model[LogisticRegressionModel] with LogisticRegressionParams { +val weights: Vector, +val intercept: Double) + extends ProbabilisticClassificationModel[Vector, LogisticRegressionModel] + with LogisticRegressionParams { + + setThreshold(0.5) def setThreshold(value: Double): this.type = set(threshold, value) - def setFeaturesCol(value: String): this.type = set(featuresCol, value) - def setScoreCol(value: String): this.type = set(scoreCol, value) - def setPredictionCol(value: String): this.type = set(predictionCol, value) - private[ml] override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = { -validateAndTransformSchema(schema, paramMap, fitting = false) + private val margin: Vector => Double = (features) => { +BLAS.dot(features, weights) + intercept + } + + private val score: Vector => Double = (features) => { +val m = margin(features) +1.0 / (1.0 + math.exp(-m)) } override def transform(dataset: SchemaRDD, paramMap: ParamMap): SchemaRDD = { +// Check schema transformSchema(dataset.schema, paramMap, logging = true) + import dataset.sqlContext._ val map = this.paramMap ++ paramMap -val score: Vector => Double = (v) => { - val margin = BLAS.dot(v, weights) - 1.0 / (1.0 + math.exp(-margin)) + +// Output selected columns only. +// This is a bit complicated since it tries to avoid repeated computation. +// rawPrediction (-margin, margin) +// probability (1.0-score, score) +// prediction (max margin) +var tmpData = dataset +var numColsOutput = 0 +if (map(rawPredictionCol) != "") { + val features2raw: Vector => Vector = predictRaw + tmpData = tmpData.select(Star(None), +features2raw.call(map(featuresCol).attr) as map(rawPredictionCol)) + numColsOutput += 1 +} +if
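The `margin`/`score` closures in the diff above implement the standard logistic link. As a hedged standalone sketch, using plain arrays in place of mllib's `Vector`/`BLAS` (names illustrative):

```scala
object LogRegSketch {
  private def dot(w: Array[Double], x: Array[Double]): Double =
    w.zip(x).map { case (a, b) => a * b }.sum

  // margin = w . x + intercept
  def margin(w: Array[Double], intercept: Double)(x: Array[Double]): Double =
    dot(w, x) + intercept

  // score = sigmoid(margin), the class-1 probability
  def score(w: Array[Double], intercept: Double)(x: Array[Double]): Double =
    1.0 / (1.0 + math.exp(-margin(w, intercept)(x)))
}
```

A zero margin maps to a score of exactly 0.5, which is why the model above pairs naturally with a default `threshold` of 0.5.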
[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3638#discussion_r22693927 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala --- @@ -456,10 +459,18 @@ private[spark] class TaskSetManager( } // Serialize and return the task val startTime = clock.getTime() - // We rely on the DAGScheduler to catch non-serializable closures and RDDs, so in here - // we assume the task can be serialized without exceptions. - val serializedTask = Task.serializeWithDependencies( task, sched.sc.addedFiles, sched.sc.addedJars, ser) + val serializedTask: ByteBuffer = try { +Task.serializeWithDependencies(task, sched.sc.addedFiles, +sched.sc.addedJars, ser) + } catch { +// If the task cannot be serialized, then there's no point to re-attempt the task, +// as it will always fail. So just abort the whole task-set. +case NonFatal(e) => + logError(s"Failed to serialize task $taskId, not attempting to retry it.", e) + abort(s"Failed to serialize task $taskId, not attempting to retry it. Exception " + +s"during serialization is: $e") --- End diff -- Looks like there's some duplication here. Can you put this in a val:
```
val msg = s"Failed to serialize task $taskId, not attempting to retry it."
logError(msg, e)
abort(s"$msg Exception during serialization: $e")
```
[GitHub] spark pull request: [WIP][SPARK-4912][SQL] Persistent tables for t...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3960#issuecomment-69276273 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25277/ Test FAILed.
[GitHub] spark pull request: [WIP][SPARK-4912][SQL] Persistent tables for t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3960#issuecomment-69276271 [Test build #25277 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25277/consoleFull) for PR 3960 at commit [`49bf1ac`](https://github.com/apache/spark/commit/49bf1acc700d454f894edf55cd8fa88aee4d63da). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class DefaultSource extends SchemaRelationProvider ` * `case class ParquetRelation2(` * `trait SchemaRelationProvider ` * ` case class TableIdent(database: String, name: String) ` * `case class CreateMetastoreDataSource(`
[GitHub] spark pull request: [SPARK-5122] Remove Shark from spark-ec2
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3939#issuecomment-69277310 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25275/ Test PASSed.
[GitHub] spark pull request: [SPARK-5122] Remove Shark from spark-ec2
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3939#issuecomment-69277301 [Test build #25275 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25275/consoleFull) for PR 3939 at commit [`66e0841`](https://github.com/apache/spark/commit/66e0841132331d0283ffdbd7a8e8203a67bd9d77). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5118][SQL] Fix: create table test store...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/3921#discussion_r22695485 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala --- @@ -520,6 +520,7 @@ https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C TOK_TBLTEXTFILE, // Stored as TextFile TOK_TBLRCFILE, // Stored as RCFile TOK_TBLORCFILE, // Stored as ORC File +TOK_TBLPARQUETFILE, // Stored as PARQUET --- End diff -- This token was introduced with Hive 13. What will happen if a user is using Hive 12?
[GitHub] spark pull request: [SPARK-4574][SQL] Adding support for defining ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3431#issuecomment-69280879 [Test build #25283 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25283/consoleFull) for PR 3431 at commit [`f336a16`](https://github.com/apache/spark/commit/f336a16c4b1e6241d160d2c149cdb13dba4b9263). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...
Github user WangTaoTheTonic commented on a diff in the pull request: https://github.com/apache/spark/pull/3607#discussion_r22697617 --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala --- @@ -87,6 +92,21 @@ private[spark] class ClientArguments(args: Array[String], sparkConf: SparkConf) throw new IllegalArgumentException( "You must specify at least 1 executor!\n" + getUsageMessage()) } +if (isClusterMode) { + for (key <- Seq(amMemKey, amMemOverheadKey)) { +if (sparkConf.getOption(key).isDefined) { + println(s"$key is set but does not apply in cluster mode.") --- End diff -- Since `ClientArguments.scala` doesn't extend the Logging class, only `println` can be used here. Yes, if the user sets config values that are never used in that mode, we should give a prompt. BTW, `spark.driver.memory` is used in both modes, so I deleted the message about it.
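The check discussed above amounts to scanning a short list of client-mode-only keys and warning when any are set. A hedged sketch, with a plain `Map` standing in for `SparkConf` (the key names match the PR's `amMemKey`/`amMemOverheadKey` discussion but are assumptions here):

```scala
object AmConfCheck {
  // Assumed key names for illustration.
  val amMemKey = "spark.yarn.am.memory"
  val amMemOverheadKey = "spark.yarn.am.memoryOverhead"

  // Returns the warnings that would be printed; ClientArguments itself
  // can only println them since it does not extend Logging.
  def warnings(conf: Map[String, String], isClusterMode: Boolean): Seq[String] =
    if (isClusterMode)
      Seq(amMemKey, amMemOverheadKey).filter(conf.contains).map { key =>
        s"$key is set but does not apply in cluster mode."
      }
    else Seq.empty
}
```

Keeping the check as a pure function makes the "warn but do not fail" behavior easy to unit-test separately from argument parsing.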
[GitHub] spark pull request: [SPARK-4955]With executor dynamic scaling enab...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3962#issuecomment-69285225 [Test build #25290 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25290/consoleFull) for PR 3962 at commit [`6dfeeec`](https://github.com/apache/spark/commit/6dfeeecd4a206b9a82952e3b9f78128a0013d3c9). * This patch **fails to build**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` protected class YarnSchedulerActor(isDriver: Boolean) extends Actor `
[GitHub] spark pull request: [SPARK-4955]With executor dynamic scaling enab...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3962#issuecomment-69285227 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25290/ Test FAILed.
[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3607#issuecomment-69286990 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25285/ Test PASSed.
[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3607#issuecomment-69286987 [Test build #25285 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25285/consoleFull) for PR 3607 at commit [`d5ceb1b`](https://github.com/apache/spark/commit/d5ceb1b2f181628fe0096202ffb31d95f0afcef8). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5118][SQL] Fix: create table test store...
Github user guowei2 commented on the pull request: https://github.com/apache/spark/pull/3921#issuecomment-69290028 I think I should remove the test case, since `stored as parquet` can only pass in hive-0.13.
[GitHub] spark pull request: [SPARK-5145][Mllib] Add BLAS.dsyr and use it i...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/3949#issuecomment-69290026 @jkbradley Thanks. The unit test is added.
[GitHub] spark pull request: [SPARK-5123] Expose only one version of the da...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3958#issuecomment-69267461 [Test build #25272 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25272/consoleFull) for PR 3958 at commit [`b4f9649`](https://github.com/apache/spark/commit/b4f96490f5044873aa593c6178a75d446f923493). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3541][MLLIB] New ALS implementation wit...
Github user coderxiang commented on a diff in the pull request: https://github.com/apache/spark/pull/3720#discussion_r22692604 --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala --- @@ -0,0 +1,964 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.recommendation + +import java.{util => javaUtil} + +import scala.collection.mutable + +import com.github.fommil.netlib.BLAS.{getInstance => blas} +import com.github.fommil.netlib.LAPACK.{getInstance => lapack} +import org.netlib.util.intW + +import org.apache.spark.{HashPartitioner, Logging, Partitioner} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{SchemaRDD, StructType} +import org.apache.spark.sql.catalyst.dsl._ +import org.apache.spark.sql.catalyst.expressions.Cast +import org.apache.spark.sql.catalyst.plans.LeftOuter +import org.apache.spark.sql.catalyst.types.{DoubleType, FloatType, IntegerType, StructField} +import org.apache.spark.util.Utils +import org.apache.spark.util.collection.{OpenHashMap, OpenHashSet, SortDataFormat, Sorter} +import org.apache.spark.util.random.XORShiftRandom + +/** + * Common params for ALS. 
+ */ +private[recommendation] trait ALSParams extends Params with HasMaxIter with HasRegParam + with HasPredictionCol { + + /** Param for rank of the matrix factorization. */ + val rank = new IntParam(this, "rank", "rank of the factorization", Some(10)) + def getRank: Int = get(rank) + + /** Param for number of user blocks. */ + val numUserBlocks = new IntParam(this, "numUserBlocks", "number of user blocks", Some(10)) + def getNumUserBlocks: Int = get(numUserBlocks) + + /** Param for number of item blocks. */ + val numItemBlocks = +new IntParam(this, "numItemBlocks", "number of item blocks", Some(10)) + def getNumItemBlocks: Int = get(numItemBlocks) + + /** Param to decide whether to use implicit preference. */ + val implicitPrefs = +new BooleanParam(this, "implicitPrefs", "whether to use implicit preference", Some(false)) + def getImplicitPrefs: Boolean = get(implicitPrefs) + + /** Param for the alpha parameter in the implicit preference formulation. */ + val alpha = new DoubleParam(this, "alpha", "alpha for implicit preference", Some(1.0)) + def getAlpha: Double = get(alpha) + + /** Param for the column name for user ids. */ + val userCol = new Param[String](this, "userCol", "column name for user ids", Some("user")) + def getUserCol: String = get(userCol) + + /** Param for the column name for item ids. */ + val itemCol = +new Param[String](this, "itemCol", "column name for item ids", Some("item")) + def getItemCol: String = get(itemCol) + + /** Param for the column name for ratings. */ + val ratingCol = new Param[String](this, "ratingCol", "column name for ratings", Some("rating")) + def getRatingCol: String = get(ratingCol) + + /** + * Validates and transforms the input schema. 
+ * @param schema input schema + * @param paramMap extra params + * @return output schema + */ + protected def validateAndTransformSchema(schema: StructType, paramMap: ParamMap): StructType = { +val map = this.paramMap ++ paramMap +assert(schema(map(userCol)).dataType == IntegerType) +assert(schema(map(itemCol)).dataType == IntegerType) +val ratingType = schema(map(ratingCol)).dataType +assert(ratingType == FloatType || ratingType == DoubleType) +val predictionColName = map(predictionCol) +assert(!schema.fieldNames.contains(predictionColName), + s"Prediction column $predictionColName already exists.") +val newFields = schema.fields :+ StructField(map(predictionCol), FloatType, nullable = false) +StructType(newFields) + } +} + +/** + * Model fitted by ALS. + */ +class ALSModel
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r22693062 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/package.scala --- @@ -0,0 +1,145 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib --- End diff -- Should it be `mllib.clustering` as the file is under `clustering/`?
[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/3637#discussion_r22693073 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala --- @@ -0,0 +1,198 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.classification + +import org.apache.spark.annotation.{DeveloperApi, AlphaComponent} +import org.apache.spark.ml.impl.estimator.{PredictionModel, Predictor, PredictorParams} +import org.apache.spark.ml.param.{Params, ParamMap, HasRawPredictionCol} +import org.apache.spark.mllib.linalg.{Vector, VectorUDT} +import org.apache.spark.sql._ +import org.apache.spark.sql.catalyst.analysis.Star + +/** + * :: DeveloperApi :: + * Params for classification. 
+ */ +@DeveloperApi +trait ClassifierParams extends PredictorParams + with HasRawPredictionCol { + + override protected def validateAndTransformSchema( + schema: StructType, + paramMap: ParamMap, + fitting: Boolean, + featuresDataType: DataType): StructType = { +val parentSchema = super.validateAndTransformSchema(schema, paramMap, fitting, featuresDataType) +val map = this.paramMap ++ paramMap +addOutputColumn(parentSchema, map(rawPredictionCol), new VectorUDT) + } +} + +/** + * :: AlphaComponent :: + * Single-label binary or multiclass classification. + * Classes are indexed {0, 1, ..., numClasses - 1}. + * + * @tparam FeaturesType Type of input features. E.g., [[Vector]] + * @tparam Learner Concrete Estimator type + * @tparam M Concrete Model type + */ +@AlphaComponent +abstract class Classifier[ +FeaturesType, +Learner <: Classifier[FeaturesType, Learner, M], +M <: ClassificationModel[FeaturesType, M]] + extends Predictor[FeaturesType, Learner, M] + with ClassifierParams { + + def setRawPredictionCol(value: String): Learner = +set(rawPredictionCol, value).asInstanceOf[Learner] + + // TODO: defaultEvaluator (follow-up PR) +} + +/** + * :: AlphaComponent :: + * Model produced by a [[Classifier]]. + * Classes are indexed {0, 1, ..., numClasses - 1}. + * + * @tparam FeaturesType Type of input features. E.g., [[Vector]] + * @tparam M Concrete Model type + */ +@AlphaComponent +abstract class ClassificationModel[FeaturesType, M <: ClassificationModel[FeaturesType, M]] + extends PredictionModel[FeaturesType, M] with ClassifierParams { + + def setRawPredictionCol(value: String): M = set(rawPredictionCol, value).asInstanceOf[M] + + /** Number of classes (values which the label can take). */ + def numClasses: Int --- End diff -- How hard/weird would it be to make labels an Enumeration? 
This class could be inferred from the training set at run-time or supplied by the user; then the user doesn't pass the number of classes to the model, but instead passes what the set of labels actually is.
[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/3637#discussion_r22693524 --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/sharedParams.scala --- @@ -17,6 +17,10 @@ package org.apache.spark.ml.param +/* NOTE TO DEVELOPERS: + * If you add these parameter traits into your algorithm, you need to add a setter method as well. --- End diff -- Maybe we should update this comment and explain *why* the setter must be added? Code will still compile, right?
[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3638#discussion_r22693713 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala --- @@ -249,13 +250,12 @@ private[spark] class TaskSchedulerImpl( // of locality levels so that it gets a chance to launch local tasks on all of them. // NOTE: the preferredLocality order: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY var launchedTask = false -for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) { - do { -launchedTask = false -for (i <- 0 until shuffledOffers.size) { - val execId = shuffledOffers(i).executorId - val host = shuffledOffers(i).host - if (availableCpus(i) >= CPUS_PER_TASK) { +def resourceOfferSingleTaskSet(taskSet: TaskSetManager, maxLocality: TaskLocality) : Unit = { --- End diff -- no space before `:`
[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3638#discussion_r22693694 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala --- @@ -266,8 +266,21 @@ private[spark] class TaskSchedulerImpl( assert(availableCpus(i) >= 0) launchedTask = true } + } catch { +case e: TaskNotSerializableException => { + logError(s"Resource offer failed, task set ${taskSet.name} was not serializable") + // Do not offer resources for this task, but don't throw an error to allow other + // task sets to be submitted. + return +} } } + } +} --- End diff -- can you define this function as a `private def` outside of `resourceOffers`? The nesting here makes this hard to read.
[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/3638#discussion_r22695144 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -865,26 +865,6 @@ class DAGScheduler( } if (tasks.size > 0) { - // Preemptively serialize a task to make sure it can be serialized. We are catching this - // exception here because it would be fairly hard to catch the non-serializable exception - // down the road, where we have several different implementations for local scheduler and - // cluster schedulers. - // - // We've already serialized RDDs and closures in taskBinary, but here we check for all other - // objects such as Partition. - try { -closureSerializer.serialize(tasks.head) - } catch { -case e: NotSerializableException => - abortStage(stage, "Task not serializable: " + e.toString) - runningStages -= stage - return -case NonFatal(e) => // Other exceptions, such as IllegalArgumentException from Kryo. - abortStage(stage, s"Task serialization failed: $e\n${e.getStackTraceString}") - runningStages -= stage - return - } - --- End diff -- This is the main addition in the patch - to make it so that task serialization error handling is only done when the serialization actually occurs. It turns out there are many scenarios where this selective sampling does not actually work. For example, when you create an RDD from an in-memory collection, perhaps some of the items are serializable but others are not. E.g. consider a list of containers, where the first item in the list is an empty container, and the second item in the list is a non-empty container with non-serializable things.
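The failure mode mccheah describes above, where probing only the first task (`tasks.head`) for serializability misses a non-serializable element later in the collection, can be reproduced with plain Java serialization. A hypothetical sketch with made-up container types, not Spark's actual code path:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// A container whose serializability depends on its contents: an empty one
// serializes fine, one holding a non-serializable element does not.
class NoMarker // deliberately does NOT extend java.io.Serializable
case class Container(items: List[Any]) // case classes are Serializable by default

object HeadOnlyCheck {
  private def serializable(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch { case _: NotSerializableException => false }

  // Mimics the old-style check: probe only the first task in the set.
  def headPasses(tasks: Seq[AnyRef]): Boolean = serializable(tasks.head)

  // What actually matters at launch time: every task must serialize.
  def allPass(tasks: Seq[AnyRef]): Boolean = tasks.forall(serializable)
}
```

With `Seq(Container(Nil), Container(List(new NoMarker)))`, the head-only probe passes while serializing the second element fails, which illustrates why the patch moves the error handling to where serialization actually occurs.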
[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3638#discussion_r22695628 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala --- @@ -266,8 +266,21 @@ private[spark] class TaskSchedulerImpl( assert(availableCpus(i) >= 0) launchedTask = true } + } catch { +case e: TaskNotSerializableException => { + logError(s"Resource offer failed, task set ${taskSet.name} was not serializable") --- End diff -- yeah you're right
[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3638#discussion_r22696100 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala --- @@ -209,6 +210,42 @@ private[spark] class TaskSchedulerImpl( .format(manager.taskSet.id, manager.parent.name)) } + private def resourceOfferSingleTaskSet( + taskSet: TaskSetManager, + maxLocality: TaskLocality, + shuffledOffers: Seq[WorkerOffer], + availableCpus: Array[Int], + tasks: Seq[ArrayBuffer[TaskDescription]]) +: Boolean = + { --- End diff -- small nit: ``` tasks: Seq[...]): Boolean = { ... } ```
[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/3638#issuecomment-69278693 @mccheah @JoshRosen high level question. So what happens now when a task is not serializable? Before it would throw a loud exception and fail the task, but now we catch the task not serializable exception and silently not schedule it. I may be missing something, but do we ever abort the stage or fail the task?
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an Arti...
Github user loachli commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-69278665 Hi jkbradley: Could you tell the JIRA number related to the "new spark.ml package and its design doc"? From: jkbradley [mailto:notificati...@github.com] Sent: January 9, 2015 3:51 To: apache/spark Cc: Lizhengbing (bing, BIPA) Subject: Re: [spark] [MLLIB] [spark-2352] Implementation of an Artificial Neural Network (ANN) (#1290) @bgreeven <https://github.com/bgreeven> I'm not too surprised that the majority vote (a.k.a. one vs. all) did not do very well; it does not scale well with the number of classes. A tree (or better yet, error-corrected output codes) generally works better, in my experience. @avulanov <https://github.com/avulanov> True, we try for consistency with APIs, except where we're changing the norm. There is not a clear write-up about the "norm," although the new spark.ml package and its design doc (in the JIRA) give an overview of some parts. Basically, we're aiming to make things more pluggable and extensible, while minimizing API change. If that requires short-term API changes (such as switching away from ANNWithX method names), that can be acceptable. @bgreeven <https://github.com/bgreeven> @avulanov <https://github.com/avulanov> The test results look pretty good, though I'm not sure what to expect for accuracy. I think the main item remaining is figuring out the public API. It's tough since neural networks / deep learning are a rapidly evolving field, and there are a lot of model algorithm variants out there. Ideally, we could put together a design doc (to be linked from the JIRA) for this big feature which would: * Design a public API for neural networks and deep learning * Comparison of other major libraries' APIs * Minimum viable product API for an initial PR * Path for the future: * What extensions might we need to do, and can we keep the public API stable for these? * What extensions might users want to do? 
Is the API easily extensible and/or pluggable, or can we make it so in the future without changing the existing public API? * Briefly discuss the algorithm * Alg sketch, limitations, etc. * Alternative algorithms, and a path for making the optimization algorithm pluggable in the future (as we've discussed a bit in the PR conversation) I realize it takes quite a while to get a big new feature ready. If you'd like to encourage early adoption, you could also post this for now as a package for Spark, while the PR is made fully ready. CC: @mengxr <https://github.com/mengxr> Reply to this email directly or view it on GitHub <https://github.com/apache/spark/pull/1290#issuecomment-69237765>.
[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3638#discussion_r22696128 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala --- @@ -251,23 +288,8 @@ private[spark] class TaskSchedulerImpl( var launchedTask = false for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) { do { -launchedTask = false -for (i <- 0 until shuffledOffers.size) { - val execId = shuffledOffers(i).executorId - val host = shuffledOffers(i).host - if (availableCpus(i) >= CPUS_PER_TASK) { -for (task <- taskSet.resourceOffer(execId, host, maxLocality)) { - tasks(i) += task - val tid = task.taskId - taskIdToTaskSetId(tid) = taskSet.taskSet.id - taskIdToExecutorId(tid) = execId - executorsByHost(host) += execId - availableCpus(i) -= CPUS_PER_TASK - assert(availableCpus(i) >= 0) - launchedTask = true -} - } -} +launchedTask = resourceOfferSingleTaskSet(taskSet, maxLocality, shuffledOffers, + availableCpus, tasks) --- End diff -- another small style nit ``` launchedTask = resourceOfferSingleTaskSet( taskSet, maxLocality ... tasks) ```
[GitHub] spark pull request: [SPARK-4989][CORE] avoid wrong eventlog conf c...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/3824#issuecomment-69280965 Yes that would be great. It seems that not all of the changes in this PR are applicable there, however.
[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3607#discussion_r22697722 --- Diff: yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala --- @@ -68,8 +68,6 @@ private[spark] class YarnClientSchedulerBackend( // List of (target Client argument, environment variable, Spark property) val optionTuples = List( -("--driver-memory", "SPARK_MASTER_MEMORY", "spark.master.memory"), -("--driver-memory", "SPARK_DRIVER_MEMORY", "spark.driver.memory"), --- End diff -- ah ok. Also it doesn't really make sense to pass driver memory on in client mode anyway, because the driver by definition has already started when `YarnClientSchedulerBackend` is created.
[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3607#issuecomment-69282417 [Test build #25285 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25285/consoleFull) for PR 3607 at commit [`d5ceb1b`](https://github.com/apache/spark/commit/d5ceb1b2f181628fe0096202ffb31d95f0afcef8). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4989][CORE] avoid wrong eventlog conf c...
Github user liyezhang556520 commented on the pull request: https://github.com/apache/spark/pull/3824#issuecomment-69283439 ok, I'll make new PRs for those old branches 1.0, 1.1, and 1.2.
[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/3783#issuecomment-69284237 Can you explain how (1) is related to SPARK-4951? It seems to me that (2) is sufficient to trigger the issue. The original implementation would mark an executor idle when receiving `SparkListenerBlockManagerAdded`. So if `SparkListenerTaskStart` is received before `SparkListenerBlockManagerAdded`, then when `SparkListenerBlockManagerAdded` arrives, the executor will be marked idle even if there is a task running on it. Therefore, the executor will be killed when its idle timeout expires. That's why I said it's related. Of course, we can also say (1) is a special case of (2).
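The out-of-order listener events described above can be guarded against by tracking running tasks per executor and only marking an executor idle if nothing is running on it. This is a hedged sketch in the spirit of the fix, not the actual `ExecutorAllocationManager` code; all names are illustrative.

```scala
import scala.collection.mutable

// Sketch: an executor whose SparkListenerTaskStart arrived before its
// SparkListenerBlockManagerAdded must not be marked idle on registration.
class IdleTracker {
  private val runningTasks = mutable.Map.empty[String, Int].withDefaultValue(0)
  private val idleExecutors = mutable.Set.empty[String]

  def onTaskStart(execId: String): Unit = {
    runningTasks(execId) += 1
    idleExecutors -= execId // a busy executor is never idle
  }

  def onBlockManagerAdded(execId: String): Unit = {
    // Guard against the task-start event having arrived first.
    if (runningTasks(execId) == 0) idleExecutors += execId
  }

  def isIdle(execId: String): Boolean = idleExecutors.contains(execId)
}
```

With this guard, an executor registered after its first task has started is never eligible for idle-timeout removal while that task runs.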
[GitHub] spark pull request: [SPARK-4574][SQL] Adding support for defining ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3431#issuecomment-69285760 [Test build #25283 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25283/consoleFull) for PR 3431 at commit [`f336a16`](https://github.com/apache/spark/commit/f336a16c4b1e6241d160d2c149cdb13dba4b9263). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class DefaultSource extends SchemaRelationProvider ` * `case class ParquetRelation2(` * `trait SchemaRelationProvider `
[GitHub] spark pull request: [SPARK-4574][SQL] Adding support for defining ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3431#issuecomment-69285764 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25283/ Test PASSed.
[GitHub] spark pull request: [SPARK-4048] Enhance and extend hadoop-provide...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2982#issuecomment-69269314 [Test build #25273 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25273/consoleFull) for PR 2982 at commit [`82eb688`](https://github.com/apache/spark/commit/82eb688f44d2df63a7b7ff311e5d40970f67fc43). * This patch merges cleanly.
[GitHub] spark pull request: [Spark-3490] Disable SparkUI for tests (backpo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3959#issuecomment-69270938 [Test build #25274 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25274/consoleFull) for PR 3959 at commit [`5425314`](https://github.com/apache/spark/commit/542531483312b77ed941c277f3e05c4ef1867534). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5122] Remove Shark from spark-ec2
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3939#issuecomment-69270905 [Test build #25275 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25275/consoleFull) for PR 3939 at commit [`66e0841`](https://github.com/apache/spark/commit/66e0841132331d0283ffdbd7a8e8203a67bd9d77). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/3638#discussion_r22694937 --- Diff: core/src/main/scala/org/apache/spark/TaskNotSerializableException.scala --- @@ -0,0 +1,27 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark + +import org.apache.spark.annotation.DeveloperApi + +/** + * :: DeveloperApi :: + * Exception thrown when a task cannot be serialized + */ +@DeveloperApi +class TaskNotSerializableException(error: Throwable) extends Exception(error) --- End diff -- I perhaps misunderstood the semantics of DeveloperApi - what I believed it meant was that the class should not be used by end-users, but is only to be thrown from Spark. However the exception class would be visible when we log it...
[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3638#discussion_r22694961 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -865,26 +865,6 @@ class DAGScheduler( } if (tasks.size > 0) { - // Preemptively serialize a task to make sure it can be serialized. We are catching this - // exception here because it would be fairly hard to catch the non-serializable exception - // down the road, where we have several different implementations for local scheduler and - // cluster schedulers. - // - // We've already serialized RDDs and closures in taskBinary, but here we check for all other - // objects such as Partition. - try { -closureSerializer.serialize(tasks.head) - } catch { -case e: NotSerializableException => - abortStage(stage, "Task not serializable: " + e.toString) - runningStages -= stage - return -case NonFatal(e) => // Other exceptions, such as IllegalArgumentException from Kryo. - abortStage(stage, s"Task serialization failed: $e\n${e.getStackTraceString}") - runningStages -= stage - return - } - --- End diff -- Can you explain why this is removed? It used to provide a way to fail fast if the task is not serializable.
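The fail-fast behavior being discussed can be sketched in isolation. This is a hedged, self-contained approximation of what the removed block did (serialize one representative task up front and abort with a message on failure), using plain Java serialization in place of Spark's `closureSerializer`:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.util.control.NonFatal

// Sketch: pre-serialize a representative task so a non-serializable object
// (e.g. a Partition) aborts the stage immediately, instead of failing later
// on the executors. Names and the serializer choice are illustrative.
object FailFastSerialization {
  // Returns None on success, or the failure message a stage abort would carry.
  def preSerializeCheck(task: AnyRef): Option[String] = {
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(task) // mirrors closureSerializer.serialize(tasks.head)
      None
    } catch {
      case e: java.io.NotSerializableException =>
        Some("Task not serializable: " + e.toString)
      case NonFatal(e) => // e.g. IllegalArgumentException from Kryo
        Some("Task serialization failed: " + e.toString)
    }
  }
}
```

The design question in the review is where this check should live once the scheduler itself can surface `TaskNotSerializableException`, not whether failing fast is useful.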
[GitHub] spark pull request: [SPARK-5118][SQL] Fix: create table test store...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/3921#discussion_r22695098 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala --- @@ -520,6 +520,7 @@ https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C TOK_TBLTEXTFILE, // Stored as TextFile TOK_TBLRCFILE, // Stored as RCFile TOK_TBLORCFILE, // Stored as ORC File +TOK_TBLPARQUETFILE, // Stored as PARQUET --- End diff -- Seems this line is added only for the completeness of the parser for Hive 0.13.
[GitHub] spark pull request: [SPARK-5118][SQL] Fix: create table test store...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/3921#discussion_r22695310 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala --- @@ -520,6 +520,7 @@ https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C TOK_TBLTEXTFILE, // Stored as TextFile TOK_TBLRCFILE, // Stored as RCFile TOK_TBLORCFILE, // Stored as ORC File +TOK_TBLPARQUETFILE, // Stored as PARQUET --- End diff -- For these tokens, only `TOK_TABNAME`, `TOK_QUERY`, and `TOK_IFNOTEXISTS` are actually used by `HiveQl`. Since we ask Hive's SemanticAnalyzer to create the `CreateTableDesc`, we basically ignore the other tokens here.
[GitHub] spark pull request: [Spark-3490] Disable SparkUI for tests (backpo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3959#issuecomment-69277801 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25274/ Test FAILed.
[GitHub] spark pull request: [Spark-3490] Disable SparkUI for tests (backpo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3959#issuecomment-69277796 [Test build #25274 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25274/consoleFull) for PR 3959 at commit [`5425314`](https://github.com/apache/spark/commit/542531483312b77ed941c277f3e05c4ef1867534). * This patch **fails some tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [WIP][SPARK-4912][SQL] Persistent tables for t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3960#issuecomment-69280036 [Test build #25282 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25282/consoleFull) for PR 3960 at commit [`172db80`](https://github.com/apache/spark/commit/172db80cf71ba4a853a42993e87bc52e6c08b94f). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3607#issuecomment-69281668 [Test build #25284 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25284/consoleFull) for PR 3607 at commit [`6c1b264`](https://github.com/apache/spark/commit/6c1b264efe76483ffa0c2c589c51b4c42de18c59). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...
Github user WangTaoTheTonic commented on a diff in the pull request: https://github.com/apache/spark/pull/3607#discussion_r22697386 --- Diff: yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala --- @@ -68,8 +68,6 @@ private[spark] class YarnClientSchedulerBackend( // List of (target Client argument, environment variable, Spark property) val optionTuples = List( -("--driver-memory", "SPARK_MASTER_MEMORY", "spark.master.memory"), -("--driver-memory", "SPARK_DRIVER_MEMORY", "spark.driver.memory"), --- End diff -- Since in `Client.scala` the `--driver-memory` passed by spark-submit is not used anymore. I thought we've discussed it before. @andrewor14 Ok, I got what you mean; I think I had a misunderstanding before. To solve this problem, should we just delete ("--driver-memory", "SPARK_MASTER_MEMORY", "spark.master.memory"), ("--driver-memory", "SPARK_DRIVER_MEMORY", "spark.driver.memory") in YarnClientSchedulerBackend.scala? @WangTaoTheTonic that would fix it, but I think in addition to that we should also add a check in ClientArguments itself in case the user calls into the Client main class and specifies --driver-memory manually.
[GitHub] spark pull request: [SPARK-5122] Remove Shark from spark-ec2
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/3939#issuecomment-69281611 Ok LGTM, I'm merging this into master, thanks.
[GitHub] spark pull request: [SPARK-5163] [CORE] Load properties from confi...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/3963 [SPARK-5163] [CORE] Load properties from configuration file for example spark-defaults.conf when creating SparkConf object I create and run a Spark program which does not use SparkSubmit. When I create a SparkConf object with `new SparkConf()`, it will not automatically load properties from a configuration file such as spark-defaults.conf. You can merge this pull request into a Git repository by running: $ git pull https://github.com/YanTangZhai/spark SPARK-5163 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3963.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3963
commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-06T13:07:08Z Merge pull request #1 from apache/master update
commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-08-20T13:14:08Z Merge pull request #3 from apache/master Update
commit 8a0010691b669495b4c327cf83124cabb7da1405 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-12T06:54:58Z Merge pull request #6 from apache/master Update
commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-09-16T12:03:22Z Merge pull request #7 from apache/master Update
commit 76d40277d51f709247df1d3734093bf2c047737d Author: YanTangZhai hakeemz...@tencent.com Date: 2014-10-20T12:52:22Z Merge pull request #8 from apache/master update
commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-04T09:00:31Z Merge pull request #9 from apache/master Update
commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a Author: YanTangZhai hakeemz...@tencent.com Date: 2014-11-11T03:18:24Z Merge pull request #10 from apache/master Update
commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-01T11:23:56Z Merge pull request #11 from apache/master Update
commit 718afebe364bd54ac33be425e24183eb1c76b5d3 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-05T11:08:31Z Merge pull request #12 from apache/master update
commit e4c2c0a18bdc78cc17823cbc2adf3926944e6bc5 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-24T03:15:22Z Merge pull request #15 from apache/master update
commit d4bca32bf4b06d3694a5de3cf5b69bac606dda39 Author: YanTangZhai hakeemz...@tencent.com Date: 2014-12-31T03:50:26Z Merge pull request #19 from apache/master Update
commit ac9579ca434f559bf173ad219bd04b48a7db226f Author: yantangzhai tyz0...@163.com Date: 2015-01-09T03:17:51Z [SPARK-5163] [CORE] Load properties from configuration file for example spark-defaults.conf when creating SparkConf object
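What this PR describes, picking up spark-defaults.conf-style properties when a SparkConf is created outside spark-submit, can be sketched as a small helper. This is an illustrative sketch only: `LoadDefaults` and `loadDefaults` are hypothetical names, not Spark API, and it mimics spark-submit's behavior of keeping only `spark.*` keys.

```scala
import java.io.{File, FileInputStream, InputStreamReader}
import java.util.Properties

// Sketch: read whitespace- or '='-separated key/value pairs (the
// spark-defaults.conf format is a Java properties file) and keep the
// spark.* entries, as spark-submit does before seeding a SparkConf.
object LoadDefaults {
  def loadDefaults(file: File): Map[String, String] = {
    val props = new Properties()
    val reader = new InputStreamReader(new FileInputStream(file), "UTF-8")
    try props.load(reader) finally reader.close()
    var result = Map.empty[String, String]
    val it = props.stringPropertyNames().iterator()
    while (it.hasNext) {
      val k = it.next()
      if (k.startsWith("spark.")) result += (k -> props.getProperty(k).trim)
    }
    result
  }
}
```

A program that bypasses SparkSubmit could apply such a map to its `SparkConf` with `setAll` before use, which is the gap the PR aims to close.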
[GitHub] spark pull request: [SPARK-4574][SQL] Adding support for defining ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3431#issuecomment-69288718 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25287/ Test PASSed.
[GitHub] spark pull request: [SPARK-4574][SQL] Adding support for defining ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3431#issuecomment-69288714 [Test build #25287 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25287/consoleFull) for PR 3431 at commit [`a852b10`](https://github.com/apache/spark/commit/a852b100b5fc6ddd6a19271f01c8df12c00553a6). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class DefaultSource extends SchemaRelationProvider ` * `case class ParquetRelation2(` * `trait SchemaRelationProvider `
[GitHub] spark pull request: [SPARK-5118][SQL] Fix: create table test store...
Github user guowei2 commented on a diff in the pull request: https://github.com/apache/spark/pull/3921#discussion_r22700232 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala --- @@ -520,6 +520,7 @@ https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C TOK_TBLTEXTFILE, // Stored as TextFile TOK_TBLRCFILE, // Stored as RCFile TOK_TBLORCFILE, // Stored as ORC File +TOK_TBLPARQUETFILE, // Stored as PARQUET --- End diff -- @yhuai Thanks for the reply. The token can be parsed in Hive 12 too, but the DDL `create table ... stored as parquet` will fail when calling the Hive API. This is why I do not know how to add a test case that only runs against hive-0.13.1
[GitHub] spark pull request: [SPARK-5088] Use spark-class for running execu...
Github user jongyoul commented on the pull request: https://github.com/apache/spark/pull/3897#issuecomment-69289262 @JoshRosen @tgravescs @andrewor14 Could anyone review this PR? It makes the Mesos code cleaner.
[GitHub] spark pull request: [SPARK-4284] BinaryClassificationMetrics preci...
Github user Lewuathe commented on the pull request: https://github.com/apache/spark/pull/3933#issuecomment-69273220 @mengxr @srowen Thank you for reviewing. I agree with the reasoning that the `pr` method is also reasonable in terms of drawing a curve, so I'll keep it as-is. But I still want to make it explicit in the [official documentation](https://spark.apache.org/docs/latest/mllib-guide.html), although there seems to be no `mllib.evaluation` item there. Is there already documentation about `mllib.evaluation`?
[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3638#discussion_r22694007 --- Diff: core/src/test/scala/org/apache/spark/scheduler/NotSerializableFakeTask.scala --- @@ -0,0 +1,40 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler + +import java.io.{ObjectInputStream, ObjectOutputStream, IOException} + +import org.apache.spark.TaskContext + +/** + * A Task implementation that fails to serialize. + */ +class NotSerializableFakeTask(myId: Int, stageId: Int) extends Task[Array[Byte]](stageId, 0) { --- End diff -- can you make this `private[spark]`
[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3638#discussion_r22693981 --- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala --- @@ -887,6 +891,23 @@ class RDDSuite extends FunSuite with SharedSparkContext { assert(ancestors6.count(_.isInstanceOf[CyclicalDependencyRDD[_]]) === 3) } + test("parallelize with exception thrown on serialization should not hang") { +class BadSerializable extends Serializable { + @throws(classOf[IOException]) + private def writeObject(out: ObjectOutputStream) : Unit = throw new KryoException("Bad serialization") + + @throws(classOf[IOException]) + private def readObject(in: ObjectInputStream) : Unit = {} --- End diff -- no space before `:` here and in L897
[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3638#discussion_r22693969 --- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala --- @@ -887,6 +891,23 @@ class RDDSuite extends FunSuite with SharedSparkContext { assert(ancestors6.count(_.isInstanceOf[CyclicalDependencyRDD[_]]) === 3) } + test("parallelize with exception thrown on serialization should not hang") { --- End diff -- this name is a little too specific. I'd leave out the parallelize and just call this something like ``` serialization exception should not hang scheduler ```
[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3638#discussion_r22694761 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala --- @@ -266,8 +266,21 @@ private[spark] class TaskSchedulerImpl( assert(availableCpus(i) >= 0) launchedTask = true } + } catch { +case e: TaskNotSerializableException => { + logError(s"Resource offer failed, task set ${taskSet.name} was not serializable") --- End diff -- Do we expect the exception to contain any useful information? It might be good to `logError(..., e)`
[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3638#discussion_r22694839 --- Diff: core/src/main/scala/org/apache/spark/TaskNotSerializableException.scala --- @@ -0,0 +1,27 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark + +import org.apache.spark.annotation.DeveloperApi + +/** + * :: DeveloperApi :: + * Exception thrown when a task cannot be serialized. + */ +@DeveloperApi +class TaskNotSerializableException(error: Throwable) extends Exception(error) --- End diff -- any reason why this is exposed as `DeveloperApi`? IIUC we don't throw this to the user
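The `extends Exception(error)` constructor quoted above uses the standard cause-wrapping pattern: whatever threw during serialization stays reachable through `getCause`, which is what would make a `logError(..., e)` call informative. A small sketch under an illustrative class name, not the actual Spark class:

```scala
// Cause-wrapping sketch: the scheduler-level exception keeps the original
// serialization failure reachable via getCause, rather than discarding it.
class NotSerializableWrapper(error: Throwable) extends Exception(error)

object WrapperDemo {
  def main(args: Array[String]): Unit = {
    val root = new java.io.NotSerializableException("closure captured a socket")
    val wrapped = new NotSerializableWrapper(root)
    // The cause is the original error object, not a copy.
    println(wrapped.getCause eq root)
    // Exception(Throwable) defaults the message to the cause's toString.
    println(wrapped.getMessage)
  }
}
```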
[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/3638#discussion_r22694827 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala --- @@ -266,8 +266,21 @@ private[spark] class TaskSchedulerImpl( assert(availableCpus(i) >= 0) launchedTask = true } + } catch { +case e: TaskNotSerializableException => { + logError(s"Resource offer failed, task set ${taskSet.name} was not serializable") --- End diff -- The place where the error is thrown is already logging it (TaskSetManager).
[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3638#issuecomment-69277042 [Test build #25281 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25281/consoleFull) for PR 3638 at commit [`5267929`](https://github.com/apache/spark/commit/5267929054cce06dd1c422a6a010e82b81b22a13). * This patch merges cleanly.
[GitHub] spark pull request: [Spark-3490] Disable SparkUI for tests (backpo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3961#issuecomment-69277048 [Test build #25280 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25280/consoleFull) for PR 3961 at commit [`8644997`](https://github.com/apache/spark/commit/8644997624af1739890ec902f7e2e36278d158fa). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5118][SQL] Fix: create table test store...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/3921#discussion_r22695453 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala --- @@ -68,6 +68,17 @@ class SQLQuerySuite extends QueryTest { """CREATE TABLE IF NOT EXISTS ctas4 AS | SELECT key, value FROM src ORDER BY key, value""".stripMargin).collect +sql( + """CREATE TABLE ctas5 +| ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' +| STORED AS +| INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' +| OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' +| AS +| SELECT key, value +| FROM src +| ORDER BY key, value""".stripMargin).collect + --- End diff -- I guess you want to test `STORED AS PARQUET`?
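The `|` prefixes in the quoted test rely on Scala's `stripMargin`, which removes leading whitespace up to and including the margin character so multi-line SQL can stay indented in source. A small sketch using the `STORED AS PARQUET` shorthand suggested above; the table name and SQL text are illustrative, and nothing here actually runs against Hive:

```scala
object StripMarginDemo {
  // stripMargin strips everything up to and including '|' on each line,
  // so the resulting SQL string carries no source-code indentation.
  val ctas: String =
    """CREATE TABLE demo5
      |STORED AS PARQUET
      |AS SELECT key, value FROM src""".stripMargin

  def main(args: Array[String]): Unit = println(ctas)
}
```

This is why the test bodies can be indented to match the surrounding Scala while the string handed to `sql(...)` stays clean.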