[GitHub] spark pull request: [SPARK-5122] Remove Shark from spark-ec2

2015-01-08 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/3939#issuecomment-69270641
  
retest this please





[GitHub] spark pull request: [WIP][SPARK-4912][SQL] Persistent tables for t...

2015-01-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3960#issuecomment-69272469
  
  [Test build #25277 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25277/consoleFull) for PR 3960 at commit [`49bf1ac`](https://github.com/apache/spark/commit/49bf1acc700d454f894edf55cd8fa88aee4d63da).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r22693139
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala ---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering.metrics
+
+import breeze.linalg.{ DenseVector => BDV, SparseVector => BSV, Vector => BV }
+
+import org.apache.spark.mllib.base._
+import org.apache.spark.mllib.linalg.{ SparseVector, DenseVector, Vector }
+import org.apache.spark.mllib.base.{ Centroid, FPoint, PointOps, Infinity, Zero }
+
+class FastEUPoint(raw: BV[Double], weight: Double) extends FPoint(raw, weight) {
+  val norm = if (weight == Zero) Zero else raw.dot(raw) / (weight * weight)
+}
+
+/**
+ * Euclidean distance measure, expedited by pre-computing vector norms
+ */
+class FastEuclideanOps extends PointOps[FastEUPoint, FastEUPoint] with Serializable {
+
+  type C = FastEUPoint
+  type P = FastEUPoint
+
+  val epsilon = 1e-4
+
+  /* compute a lower bound on the Euclidean distance */
+
+  def distance(p: P, c: C, upperBound: Double): Double = {
+    val d = if (p.weight == Zero || c.weight == Zero) {
+      p.norm + c.norm
+    } else {
+      val x = p.raw.dot(c.raw) / (p.weight * c.weight)
--- End diff --

same question about using `weight` in `distance`
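
The optimization under discussion relies on the identity ||p - c||^2 = ||p||^2 + ||c||^2 - 2<p, c>: caching each point's squared norm reduces every distance evaluation to a single dot product. A minimal standalone sketch of that idea (illustrative names, not the PR's `weight`-normalized variant):

```scala
import breeze.linalg.{Vector => BV}

// Cache the squared norm once per point; reuse it for every distance call.
class CachedPoint(val vec: BV[Double]) {
  val normSq: Double = vec.dot(vec)
}

// ||p - c||^2 = ||p||^2 + ||c||^2 - 2 * <p, c>; clamp tiny negative values
// produced by floating-point cancellation.
def fastSquaredDistance(p: CachedPoint, c: CachedPoint): Double =
  math.max(0.0, p.normSq + c.normSq - 2.0 * p.vec.dot(c.vec))
```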





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r22693112
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala ---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering.metrics
+
+import breeze.linalg.{ DenseVector => BDV, SparseVector => BSV, Vector => BV }
+
+import org.apache.spark.mllib.base._
+import org.apache.spark.mllib.linalg.{ SparseVector, DenseVector, Vector }
+import org.apache.spark.mllib.base.{ Centroid, FPoint, PointOps, Infinity, Zero }
+
+class FastEUPoint(raw: BV[Double], weight: Double) extends FPoint(raw, weight) {
+  val norm = if (weight == Zero) Zero else raw.dot(raw) / (weight * weight)
--- End diff --

Should `weight` only be used in aggregation, rather than in distance computation?





[GitHub] spark pull request: [SPARK-4574][SQL] Adding support for defining ...

2015-01-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3431#issuecomment-69279963
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25279/
Test FAILed.





[GitHub] spark pull request: [SPARK-4574][SQL] Adding support for defining ...

2015-01-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3431#issuecomment-69279959
  
  [Test build #25279 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25279/consoleFull) for PR 3431 at commit [`f336a16`](https://github.com/apache/spark/commit/f336a16c4b1e6241d160d2c149cdb13dba4b9263).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class DefaultSource extends SchemaRelationProvider `
  * `case class ParquetRelation2(`
  * `trait SchemaRelationProvider `






[GitHub] spark pull request: [SPARK-4989][CORE] avoid wrong eventlog conf c...

2015-01-08 Thread liyezhang556520
Github user liyezhang556520 commented on the pull request:

https://github.com/apache/spark/pull/3824#issuecomment-69280380
  
@andrewor14 , I received an email notification of your comment about creating PRs to fix this issue for the other, older branches, but I can't find it on this page. I think you might have removed that comment. Should I still make the new PRs, or just ignore that message?





[GitHub] spark pull request: [SPARK-5122] Remove Shark from spark-ec2

2015-01-08 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/3939





[GitHub] spark pull request: [SPARK-4697][YARN]System properties should ove...

2015-01-08 Thread WangTaoTheTonic
Github user WangTaoTheTonic commented on the pull request:

https://github.com/apache/spark/pull/3557#issuecomment-69282649
  
@vanzin
Note what I note :-)

Note: In test cases I didn't use SparkConf.setAppName in application code.





[GitHub] spark pull request: [SPARK-4574][SQL] Adding support for defining ...

2015-01-08 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3431#discussion_r22697822
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala ---
@@ -83,10 +118,73 @@ private[sql] class DDLParser extends StandardTokenParsers with PackratParsers wi
 
   protected lazy val className: Parser[String] = repsep(ident, ".") ^^ { case s => s.mkString(".")}
 
   protected lazy val pair: Parser[(String, String)] = ident ~ stringLit ^^ { case k ~ v => (k,v) }
+
+  protected lazy val column: Parser[StructField] =
+    ident ~ dataType ^^ { case columnName ~ typ =>
+      StructField(cleanIdentifier(columnName), typ)
+    }
+
+  protected lazy val primitiveType: Parser[DataType] =
+    "STRING" ^^^ StringType |
+    "BINARY" ^^^ BinaryType |
+    "BOOLEAN" ^^^ BooleanType |
+    "TINYINT" ^^^ ByteType |
+    "SMALLINT" ^^^ ShortType |
+    "INT" ^^^ IntegerType |
+    "BIGINT" ^^^ LongType |
+    "FLOAT" ^^^ FloatType |
+    "DOUBLE" ^^^ DoubleType |
+    fixedDecimalType |   // decimal with precision/scale
+    "DECIMAL" ^^^ DecimalType.Unlimited |  // decimal with no precision/scale
+    "DATE" ^^^ DateType |
+    "TIMESTAMP" ^^^ TimestampType |
+    "VARCHAR" ~ "(" ~ numericLit ~ ")" ^^^ StringType
+
+  protected lazy val fixedDecimalType: Parser[DataType] =
+    ("DECIMAL" ~ "(" ~> numericLit) ~ ("," ~> numericLit <~ ")") ^^ {
+      case precision ~ scale => DecimalType(precision.toInt, scale.toInt)
+    }
+
+  protected lazy val arrayType: Parser[DataType] =
+    "ARRAY" ~> "<" ~> dataType <~ ">" ^^ {
+      case tpe => ArrayType(tpe)
+    }
+
+  protected lazy val mapType: Parser[DataType] =
+    "MAP" ~> "<" ~> dataType ~ "," ~ dataType <~ ">" ^^ {
+      case t1 ~ _ ~ t2 => MapType(t1, t2)
+    }
+
+  protected lazy val structField: Parser[StructField] =
+    ident ~ ":" ~ dataType ^^ {
+      case fieldName ~ _ ~ tpe => StructField(cleanIdentifier(fieldName), tpe, nullable = true)
+    }
+
+  protected lazy val structType: Parser[DataType] =
+    ("STRUCT" ~> "<" ~> repsep(structField, ",") <~ ">" ^^ {
+      case fields => new StructType(fields)
+    }) |
+    ("STRUCT" ~ "<>" ^^ {
+      case fields => new StructType(Nil)
+    })
+
+  private[sql] lazy val dataType: Parser[DataType] =
+    arrayType |
+    mapType |
+    structType |
+    primitiveType
+
+  protected val escapedIdentifier = "`([^`]+)`".r
+  /** Strips backticks from ident if present */
+  protected def cleanIdentifier(ident: String): String = ident match {
+    case escapedIdentifier(i) => i
+    case plainIdent => plainIdent
+  }
--- End diff --

Thank you:)
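
For readers trying this part of the patch out of context: the diff builds a recursive-descent parser for nested SQL type strings out of combinators. A self-contained sketch of the same pattern (a simplified stand-in, not the actual `DDLParser`):

```scala
import scala.util.parsing.combinator.RegexParsers

sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType
case class ArrayType(elementType: DataType) extends DataType
case class MapType(keyType: DataType, valueType: DataType) extends DataType

// Mirrors the structure above: primitives, plus ARRAY<...> and MAP<...,...>
// defined recursively in terms of dataType.
object TypeParser extends RegexParsers {
  lazy val primitive: Parser[DataType] =
    "(?i)INT".r ^^^ IntegerType | "(?i)STRING".r ^^^ StringType
  lazy val array: Parser[DataType] =
    "(?i)ARRAY".r ~> "<" ~> dataType <~ ">" ^^ ArrayType
  lazy val map: Parser[DataType] =
    "(?i)MAP".r ~> "<" ~> dataType ~ ("," ~> dataType) <~ ">" ^^ {
      case k ~ v => MapType(k, v)
    }
  lazy val dataType: Parser[DataType] = array | map | primitive
}

// TypeParser.parseAll(TypeParser.dataType, "ARRAY<MAP<INT, STRING>>")
// yields ArrayType(MapType(IntegerType, StringType)).
```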





[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2015-01-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3638#issuecomment-69282680
  
  [Test build #25281 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25281/consoleFull) for PR 3638 at commit [`5267929`](https://github.com/apache/spark/commit/5267929054cce06dd1c422a6a010e82b81b22a13).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2015-01-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3638#issuecomment-69282684
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25281/
Test PASSed.





[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-08 Thread etrain
Github user etrain commented on a diff in the pull request:

https://github.com/apache/spark/pull/3637#discussion_r22693295
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -80,69 +50,157 @@ class LogisticRegression extends Estimator[LogisticRegressionModel] with Logisti
 
   def setRegParam(value: Double): this.type = set(regParam, value)
   def setMaxIter(value: Int): this.type = set(maxIter, value)
-  def setLabelCol(value: String): this.type = set(labelCol, value)
   def setThreshold(value: Double): this.type = set(threshold, value)
-  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
-  def setScoreCol(value: String): this.type = set(scoreCol, value)
-  def setPredictionCol(value: String): this.type = set(predictionCol, value)
 
   override def fit(dataset: SchemaRDD, paramMap: ParamMap): LogisticRegressionModel = {
+    // Check schema
     transformSchema(dataset.schema, paramMap, logging = true)
-    import dataset.sqlContext._
+
+    // Extract columns from data.  If dataset is persisted, do not persist oldDataset.
+    val oldDataset = extractLabeledPoints(dataset, paramMap)
     val map = this.paramMap ++ paramMap
-    val instances = dataset.select(map(labelCol).attr, map(featuresCol).attr)
-      .map { case Row(label: Double, features: Vector) =>
-        LabeledPoint(label, features)
-      }.persist(StorageLevel.MEMORY_AND_DISK)
+    val handlePersistence = dataset.getStorageLevel == StorageLevel.NONE
+    if (handlePersistence) {
+      oldDataset.persist(StorageLevel.MEMORY_AND_DISK)
+    }
+
+    // Train model
     val lr = new LogisticRegressionWithLBFGS
     lr.optimizer
       .setRegParam(map(regParam))
       .setNumIterations(map(maxIter))
-    val lrm = new LogisticRegressionModel(this, map, lr.run(instances).weights)
-    instances.unpersist()
+    val oldModel = lr.run(oldDataset)
+    val lrm = new LogisticRegressionModel(this, map, oldModel.weights, oldModel.intercept)
+
+    if (handlePersistence) {
+      oldDataset.unpersist()
+    }
+
     // copy model params
     Params.inheritValues(map, this, lrm)
     lrm
   }
 
-  private[ml] override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = {
-    validateAndTransformSchema(schema, paramMap, fitting = true)
-  }
+  override protected def featuresDataType: DataType = new VectorUDT
--- End diff --

Ehh... never mind, I think I got it. It still feels very strange, though: if we must have this, can't we make `VectorUDT` the default?
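
One pattern worth noting in this diff: the estimator persists the extracted RDD only when the caller has not already cached the input, and unpersists exactly what it persisted. A hedged sketch of that idiom in isolation (the helper name is ours, not the PR's):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Cache `data` only if the caller has not done so, and release the cache when
// training finishes, even if it throws.
def withTransientCache[T, R](data: RDD[T], inputAlreadyCached: Boolean)(train: RDD[T] => R): R = {
  val handlePersistence = !inputAlreadyCached
  if (handlePersistence) data.persist(StorageLevel.MEMORY_AND_DISK)
  try train(data)
  finally if (handlePersistence) data.unpersist()
}
```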





[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2015-01-08 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3638#discussion_r22693876
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -456,10 +459,18 @@ private[spark] class TaskSetManager(
       }
       // Serialize and return the task
       val startTime = clock.getTime()
-      // We rely on the DAGScheduler to catch non-serializable closures and RDDs, so in here
-      // we assume the task can be serialized without exceptions.
-      val serializedTask = Task.serializeWithDependencies(
-        task, sched.sc.addedFiles, sched.sc.addedJars, ser)
+      val serializedTask: ByteBuffer = try {
+        Task.serializeWithDependencies(task, sched.sc.addedFiles,
+          sched.sc.addedJars, ser)
--- End diff --

bump this up 1 line





[GitHub] spark pull request: [SPARK-4574][SQL] Adding support for defining ...

2015-01-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3431#issuecomment-69275198
  
  [Test build #25279 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25279/consoleFull) for PR 3431 at commit [`f336a16`](https://github.com/apache/spark/commit/f336a16c4b1e6241d160d2c149cdb13dba4b9263).
 * This patch merges cleanly.





[GitHub] spark pull request: Spark 3299 add to SQLContext API to show table...

2015-01-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3872#issuecomment-69278134
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25276/
Test PASSed.





[GitHub] spark pull request: Spark 3299 add to SQLContext API to show table...

2015-01-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3872#issuecomment-69278128
  
  [Test build #25276 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25276/consoleFull) for PR 3872 at commit [`c5609fa`](https://github.com/apache/spark/commit/c5609faec0647332243151ab7513ccdc04893f46).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [Spark-3490] Disable SparkUI for tests (backpo...

2015-01-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3961#issuecomment-69279153
  
  [Test build #25280 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25280/consoleFull) for PR 3961 at commit [`8644997`](https://github.com/apache/spark/commit/8644997624af1739890ec902f7e2e36278d158fa).
 * This patch **fails some tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2015-01-08 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/3638#issuecomment-69279237
  
Ah never mind, I found the abort [here](https://github.com/mccheah/spark/blob/5267929054cce06dd1c422a6a010e82b81b22a13/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L470).





[GitHub] spark pull request: [Spark-3490] Disable SparkUI for tests (backpo...

2015-01-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3961#issuecomment-69279159
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25280/
Test FAILed.





[GitHub] spark pull request: [SPARK-5122] Remove Shark from spark-ec2

2015-01-08 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/3939#issuecomment-69281768
  
The wiki location seems fine. Maybe others disagree.





[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...

2015-01-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3607#issuecomment-69286150
  
  [Test build #25284 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25284/consoleFull) for PR 3607 at commit [`6c1b264`](https://github.com/apache/spark/commit/6c1b264efe76483ffa0c2c589c51b4c42de18c59).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...

2015-01-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3607#issuecomment-69286154
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25284/
Test PASSed.





[GitHub] spark pull request: [SPARK-4990][Deploy]to find default properties...

2015-01-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3823#issuecomment-69287254
  
  [Test build #25286 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25286/consoleFull) for PR 3823 at commit [`4cc7f34`](https://github.com/apache/spark/commit/4cc7f3467ed78bb4b3a1a404c0b1daf1bd009c35).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4990][Deploy]to find default properties...

2015-01-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3823#issuecomment-69287260
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25286/
Test PASSed.





[GitHub] spark pull request: [SPARK-5122] Remove Shark from spark-ec2

2015-01-08 Thread nchammas
Github user nchammas commented on the pull request:

https://github.com/apache/spark/pull/3939#issuecomment-69287974
  
Okie doke, thank you @andrewor14.





[GitHub] spark pull request: [SPARK-3586][streaming]Support nested director...

2015-01-08 Thread wangxiaojing
Github user wangxiaojing commented on the pull request:

https://github.com/apache/spark/pull/2765#issuecomment-69288062
  
@tdas Rebased onto the latest master and updated.





[GitHub] spark pull request: [SPARK-4033][Examples]Input of the SparkPi too...

2015-01-08 Thread SaintBacchus
Github user SaintBacchus commented on the pull request:

https://github.com/apache/spark/pull/2874#issuecomment-69289017
  
@andrewor14 I had explained why it cannot use `Long` instead of `Int`: both `Range` and `Partition` work only with `Int` and cannot be converted to `Long`.
Can we restrict the input and log an error to exit from the process?
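
A sketch of the guard being proposed, assuming the slice count arrives as a command-line argument (the argument handling here is illustrative, not SparkPi's actual code):

```scala
// Range and RDD partition counts are Int-based, so reject inputs that would
// overflow instead of silently truncating them.
val requested: Long = args.headOption.map(_.toLong).getOrElse(2L)
require(requested > 0 && requested <= Int.MaxValue,
  s"Number of slices must fit in a positive Int, got $requested")
val slices: Int = requested.toInt
```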





[GitHub] spark pull request: [SPARK-4955]With executor dynamic scaling enab...

2015-01-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3962#issuecomment-69289902
  
  [Test build #25291 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25291/consoleFull) for PR 3962 at commit [`2164ea8`](https://github.com/apache/spark/commit/2164ea88edd33c833fbbd0c7baa86426ef3534c0).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  protected class YarnSchedulerActor(isDriver: Boolean) extends Actor `






[GitHub] spark pull request: [SPARK-4955]With executor dynamic scaling enab...

2015-01-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3962#issuecomment-69289908
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25291/
Test FAILed.





[GitHub] spark pull request: [SPARK-4048] Enhance and extend hadoop-provide...

2015-01-08 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/2982#discussion_r22691343
  
--- Diff: yarn/pom.xml ---
@@ -131,13 +131,6 @@
             <skip>true</skip>
           </configuration>
         </plugin>
-        <plugin>
-          <groupId>org.apache.maven.plugins</groupId>
-          <artifactId>maven-install-plugin</artifactId>
-          <configuration>
-            <skip>true</skip>
--- End diff --


https://github.com/vanzin/spark/commit/1adf91c401890d6a93d3950d98f951db11304cb3





[GitHub] spark pull request: [SPARK-3910] Remove pyspark/mllib/ from sys.pa...

2015-01-08 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/3940#issuecomment-69270182
  
@mengxr I think so; it's better to backport that into 1.1.





[GitHub] spark pull request: [SPARK-3541][MLLIB] New ALS implementation wit...

2015-01-08 Thread coderxiang
Github user coderxiang commented on a diff in the pull request:

https://github.com/apache/spark/pull/3720#discussion_r22692999
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
@@ -0,0 +1,964 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.recommendation
+
+import java.{util => javaUtil}
+
+import scala.collection.mutable
+
+import com.github.fommil.netlib.BLAS.{getInstance => blas}
+import com.github.fommil.netlib.LAPACK.{getInstance => lapack}
+import org.netlib.util.intW
+
+import org.apache.spark.{HashPartitioner, Logging, Partitioner}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{SchemaRDD, StructType}
+import org.apache.spark.sql.catalyst.dsl._
+import org.apache.spark.sql.catalyst.expressions.Cast
+import org.apache.spark.sql.catalyst.plans.LeftOuter
+import org.apache.spark.sql.catalyst.types.{DoubleType, FloatType, IntegerType, StructField}
+import org.apache.spark.util.Utils
+import org.apache.spark.util.collection.{OpenHashMap, OpenHashSet, SortDataFormat, Sorter}
+import org.apache.spark.util.random.XORShiftRandom
+
+/**
+ * Common params for ALS.
+ */
+private[recommendation] trait ALSParams extends Params with HasMaxIter with HasRegParam
+  with HasPredictionCol {
+
+  /** Param for rank of the matrix factorization. */
+  val rank = new IntParam(this, "rank", "rank of the factorization", Some(10))
+  def getRank: Int = get(rank)
+
+  /** Param for number of user blocks. */
+  val numUserBlocks = new IntParam(this, "numUserBlocks", "number of user blocks", Some(10))
+  def getNumUserBlocks: Int = get(numUserBlocks)
+
+  /** Param for number of item blocks. */
+  val numItemBlocks =
+    new IntParam(this, "numItemBlocks", "number of item blocks", Some(10))
+  def getNumItemBlocks: Int = get(numItemBlocks)
+
+  /** Param to decide whether to use implicit preference. */
+  val implicitPrefs =
+    new BooleanParam(this, "implicitPrefs", "whether to use implicit preference", Some(false))
+  def getImplicitPrefs: Boolean = get(implicitPrefs)
+
+  /** Param for the alpha parameter in the implicit preference formulation. */
+  val alpha = new DoubleParam(this, "alpha", "alpha for implicit preference", Some(1.0))
+  def getAlpha: Double = get(alpha)
+
+  /** Param for the column name for user ids. */
+  val userCol = new Param[String](this, "userCol", "column name for user ids", Some("user"))
+  def getUserCol: String = get(userCol)
+
+  /** Param for the column name for item ids. */
+  val itemCol =
+    new Param[String](this, "itemCol", "column name for item ids", Some("item"))
+  def getItemCol: String = get(itemCol)
+
+  /** Param for the column name for ratings. */
+  val ratingCol = new Param[String](this, "ratingCol", "column name for ratings", Some("rating"))
+  def getRatingCol: String = get(ratingCol)
+
+  /**
+   * Validates and transforms the input schema.
+   * @param schema input schema
+   * @param paramMap extra params
+   * @return output schema
+   */
+  protected def validateAndTransformSchema(schema: StructType, paramMap: ParamMap): StructType = {
+    val map = this.paramMap ++ paramMap
+    assert(schema(map(userCol)).dataType == IntegerType)
+    assert(schema(map(itemCol)).dataType == IntegerType)
+    val ratingType = schema(map(ratingCol)).dataType
+    assert(ratingType == FloatType || ratingType == DoubleType)
+    val predictionColName = map(predictionCol)
+    assert(!schema.fieldNames.contains(predictionColName),
+      s"Prediction column $predictionColName already exists.")
+    val newFields = schema.fields :+ StructField(map(predictionCol), FloatType, nullable = false)
+    StructType(newFields)
+  }
+}
+
+/**
+ * Model fitted by ALS.
+ */
+class ALSModel 
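
The message is truncated here in the archive. For orientation, a hedged sketch of how the `Param` declarations above are read back, based only on the getters and the `this.paramMap ++ paramMap` merge shown in the diff (the `ALS` estimator itself is not in the excerpt):

```scala
// Each Param carries its own default (e.g. rank defaults to Some(10));
// getRank reads it through get(rank).
val als = new ALS()
assert(als.getRank == 10)

// At validation/fit time, caller-supplied params override embedded ones:
//   val map = this.paramMap ++ paramMap
//   val k = map(rank)   // the caller's value if given, else the declared default
```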

[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-08 Thread etrain
Github user etrain commented on a diff in the pull request:

https://github.com/apache/spark/pull/3637#discussion_r22692953
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala ---
@@ -0,0 +1,198 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import org.apache.spark.annotation.{DeveloperApi, AlphaComponent}
+import org.apache.spark.ml.impl.estimator.{PredictionModel, Predictor, PredictorParams}
+import org.apache.spark.ml.param.{Params, ParamMap, HasRawPredictionCol}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.analysis.Star
+
+/**
+ * :: DeveloperApi ::
+ * Params for classification.
+ */
+@DeveloperApi
+trait ClassifierParams extends PredictorParams
+  with HasRawPredictionCol {
+
+  override protected def validateAndTransformSchema(
+      schema: StructType,
+      paramMap: ParamMap,
+      fitting: Boolean,
+      featuresDataType: DataType): StructType = {
+    val parentSchema = super.validateAndTransformSchema(schema, paramMap, fitting, featuresDataType)
+    val map = this.paramMap ++ paramMap
+    addOutputColumn(parentSchema, map(rawPredictionCol), new VectorUDT)
+  }
+}
+
+/**
+ * :: AlphaComponent ::
+ * Single-label binary or multiclass classification.
+ * Classes are indexed {0, 1, ..., numClasses - 1}.
+ *
+ * @tparam FeaturesType  Type of input features.  E.g., [[Vector]]
+ * @tparam Learner  Concrete Estimator type
+ * @tparam M  Concrete Model type
+ */
+@AlphaComponent
+abstract class Classifier[
--- End diff --

I don't have a concrete suggestion here, but these abstract types are starting to get complicated and look redundant. Is `Learner` only there to make subclassing cleaner?
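
For context, `Learner` is the usual F-bounded type parameter: it lets setters defined once in the abstract class return the concrete subtype, so call chains keep their static type. A stripped-down sketch (ours, not the PR's API, which uses `asInstanceOf` instead of a self-type):

```scala
// Without the self-referential bound, setProbabilityCol would return the
// abstract type and chained calls on the concrete learner would not compile.
abstract class ProbLearner[L <: ProbLearner[L]] { self: L =>
  def setProbabilityCol(value: String): L = this // a real impl would store value
}

class LogReg extends ProbLearner[LogReg] {
  def setMaxIter(n: Int): LogReg = this
}

val lr: LogReg = new LogReg().setProbabilityCol("prob").setMaxIter(10)
```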





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-08 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-69272196
  
@derrickburns I like the improvements implemented in this PR. But as @srowen mentioned, we have to resolve conflicts with the master branch before we can merge any PR. I compared the performance of this PR with master on mnist-digits (6x784, sparse, 10 clusters) locally and found that master runs 2-3x faster. I guess this is mainly caused by two changes:

1. We replaced Breeze operations with our own implementation, which is about 2-3x faster.
1. Running k-means++ distributively has noticeable overhead with small k and feature dimension.

I think it is still feasible to include the features here through separate PRs:

1. remember previously computed best distances in k-means++ initialization
1. allow fixing the random seed (addressed in #3610)
1. variable number of clusters. We should discuss whether we want to have fewer than k clusters, or split the biggest one if there are more than k points.
1. parallelize k-means++. Whether we should replace local k-means++ or make it configurable requires some discussion and performance comparison.
1. support Bregman divergences

Putting all of them together would certainly delay the review process and 
require resolving conflicts. I may have some time to prepare PRs for some of 
the features here, if you don't mind.

For Bregman divergences, I'm thinking we can alter the formulation to support sparse vectors:

~~~
d(x, y) = f(x) - f(y) - <x - y, g(y)> = f(x) - (f(y) - <y, g(y)>) - <x, g(y)>
~~~

where `f(x)`, `g(y)`, and `f(y) - <y, g(y)>` could be pre-computed and cached, and `<x, g(y)>` can take advantage of sparse `x`. But I'm not sure whether this formulation is really useful on any Bregman divergence other than the squared distance and the Mahalanobis distance. For KL-divergence and generalized I-divergence, the domain is R^d_+ and hence the points cannot be sparse.
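
A small sketch of the caching this formulation enables, assuming a sparse `x` given as (index, value) pairs (all names here are ours):

```scala
// Per center y, cache g(y) = grad f(y) and offset = f(y) - <y, g(y)>.
// Then d(x, y) = f(x) - offset - <x, g(y)>, where the dot product only
// touches the nonzero entries of x.
final case class CachedCenter(gy: Array[Double], offset: Double)

def bregmanDistance(fx: Double, x: Iterator[(Int, Double)], center: CachedCenter): Double =
  fx - center.offset - x.map { case (i, v) => v * center.gy(i) }.sum
```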

Besides those comments, I'm going to make some minor comments inline.





[GitHub] spark pull request: [WIP][SPARK-4912][SQL] Persistent tables for t...

2015-01-08 Thread yhuai
GitHub user yhuai opened a pull request:

https://github.com/apache/spark/pull/3960

[WIP][SPARK-4912][SQL] Persistent tables for the Spark SQL data sources api

This one subsumes #3752. It currently contains the changes made in #3431; I will clean it up once #3431 is in.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yhuai/spark persistantTablesWithSchema2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3960.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3960


commit d7da491713a83f25de5c07639de7985a96c801a6
Author: Michael Armbrust mich...@databricks.com
Date:   2014-12-20T20:45:28Z

First draft of persistent tables.

commit 6edc71026c4a10cce338adaa7b807fef0ee2857b
Author: Michael Armbrust mich...@databricks.com
Date:   2014-12-20T21:03:59Z

Add tests.

commit 1ea6e7bbf04c04f7c51884ca0ec819cddfaac10b
Author: Michael Armbrust mich...@databricks.com
Date:   2014-12-21T22:23:34Z

Don't fail when trying to uncache a table that doesn't exist

commit c00bb1bf25b8f9875fc3e8b58d007d67496f1b2f
Author: Michael Armbrust mich...@databricks.com
Date:   2014-12-22T19:05:46Z

Don't use reflection to read options

commit 2b5972353a47ca1577a0ddcd3aab5c9dbd1d10d4
Author: Michael Armbrust mich...@databricks.com
Date:   2014-12-22T19:08:13Z

Set external when creating tables

commit 8f8f1a167360bfab3198b086d4608f5b3517f249
Author: Yin Huai yh...@databricks.com
Date:   2015-01-08T00:53:02Z

[SPARK-4574][SQL] Adding support for defining schema in foreign DDL 
commands. #3431

commit f47fda1f5e34dd73d7e5db9949eceb21cdd1ce89
Author: Yin Huai yh...@databricks.com
Date:   2015-01-08T01:58:00Z

Unit tests.

commit 49bf1acc700d454f894edf55cd8fa88aee4d63da
Author: Yin Huai yh...@databricks.com
Date:   2015-01-08T01:58:00Z

Unit tests.







[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-08 Thread etrain
Github user etrain commented on a diff in the pull request:

https://github.com/apache/spark/pull/3637#discussion_r22693428
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala ---
@@ -0,0 +1,143 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import org.apache.spark.annotation.{AlphaComponent, DeveloperApi}
+import org.apache.spark.ml.param.{HasProbabilityCol, ParamMap, Params}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.analysis.Star
+
+/**
+ * Params for probabilistic classification.
+ */
+private[classification] trait ProbabilisticClassifierParams
+  extends ClassifierParams with HasProbabilityCol {
+
+  override protected def validateAndTransformSchema(
+      schema: StructType,
+      paramMap: ParamMap,
+      fitting: Boolean,
+      featuresDataType: DataType): StructType = {
+    val parentSchema = super.validateAndTransformSchema(schema, paramMap, fitting, featuresDataType)
+    val map = this.paramMap ++ paramMap
+    addOutputColumn(parentSchema, map(probabilityCol), new VectorUDT)
+  }
+}
+
+
+/**
+ * :: AlphaComponent ::
+ *
+ * Single-label binary or multiclass classifier which can output class conditional probabilities.
+ *
+ * @tparam FeaturesType  Type of input features.  E.g., [[Vector]]
+ * @tparam Learner  Concrete Estimator type
+ * @tparam M  Concrete Model type
+ */
+@AlphaComponent
+abstract class ProbabilisticClassifier[
+    FeaturesType,
+    Learner <: ProbabilisticClassifier[FeaturesType, Learner, M],
+    M <: ProbabilisticClassificationModel[FeaturesType, M]]
+  extends Classifier[FeaturesType, Learner, M] with ProbabilisticClassifierParams {
+
+  def setProbabilityCol(value: String): Learner = set(probabilityCol, value).asInstanceOf[Learner]
+}
+
+
+/**
+ * :: AlphaComponent ::
+ *
+ * Model produced by a [[ProbabilisticClassifier]].
+ * Classes are indexed {0, 1, ..., numClasses - 1}.
+ *
+ * @tparam FeaturesType  Type of input features.  E.g., [[Vector]]
+ * @tparam M  Concrete Model type
+ */
+@AlphaComponent
+abstract class ProbabilisticClassificationModel[
+    FeaturesType,
+    M <: ProbabilisticClassificationModel[FeaturesType, M]]
+  extends ClassificationModel[FeaturesType, M] with ProbabilisticClassifierParams {
+
+  def setProbabilityCol(value: String): M = set(probabilityCol, value).asInstanceOf[M]
+
+  /**
+   * Transforms dataset by reading from [[featuresCol]], and appending new columns as specified by
+   * parameters:
+   *  - predicted labels as [[predictionCol]] of type [[Double]]
+   *  - raw predictions (confidences) as [[rawPredictionCol]] of type [[Vector]]
+   *  - probability of each class as [[probabilityCol]] of type [[Vector]].
+   *
+   * @param dataset input dataset
+   * @param paramMap additional parameters, overwrite embedded params
+   * @return transformed dataset
+   */
+  override def transform(dataset: SchemaRDD, paramMap: ParamMap): SchemaRDD = {
+    // This default implementation should be overridden as needed.
+    import dataset.sqlContext._
+    import org.apache.spark.sql.catalyst.dsl._
+
+    // Check schema
+    transformSchema(dataset.schema, paramMap, logging = true)
+    val map = this.paramMap ++ paramMap
+
+    // Prepare model
+    val tmpModel = if (paramMap.size != 0) {
+      val tmpModel = this.copy()
+      Params.inheritValues(paramMap, parent, tmpModel)
+      tmpModel
+    } else {
+      this
+    }
+
+    val (numColsOutput, outputData) =
+      ClassificationModel.transformColumnsImpl[FeaturesType](dataset, tmpModel, map)
+
+    // Output selected columns only.
+    if (map(probabilityCol) != "") {
+      // 

[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-08 Thread etrain
Github user etrain commented on a diff in the pull request:

https://github.com/apache/spark/pull/3637#discussion_r22693403
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -80,69 +50,157 @@ class LogisticRegression extends 
Estimator[LogisticRegressionModel] with Logisti
 
   def setRegParam(value: Double): this.type = set(regParam, value)
   def setMaxIter(value: Int): this.type = set(maxIter, value)
-  def setLabelCol(value: String): this.type = set(labelCol, value)
   def setThreshold(value: Double): this.type = set(threshold, value)
-  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
-  def setScoreCol(value: String): this.type = set(scoreCol, value)
-  def setPredictionCol(value: String): this.type = set(predictionCol, 
value)
 
   override def fit(dataset: SchemaRDD, paramMap: ParamMap): 
LogisticRegressionModel = {
+// Check schema
 transformSchema(dataset.schema, paramMap, logging = true)
-import dataset.sqlContext._
+
+// Extract columns from data.  If dataset is persisted, do not persist 
oldDataset.
+val oldDataset = extractLabeledPoints(dataset, paramMap)
 val map = this.paramMap ++ paramMap
-val instances = dataset.select(map(labelCol).attr, 
map(featuresCol).attr)
-  .map { case Row(label: Double, features: Vector) =
-LabeledPoint(label, features)
-  }.persist(StorageLevel.MEMORY_AND_DISK)
+val handlePersistence = dataset.getStorageLevel == StorageLevel.NONE
+if (handlePersistence) {
+  oldDataset.persist(StorageLevel.MEMORY_AND_DISK)
+}
+
+// Train model
 val lr = new LogisticRegressionWithLBFGS
 lr.optimizer
   .setRegParam(map(regParam))
   .setNumIterations(map(maxIter))
-val lrm = new LogisticRegressionModel(this, map, 
lr.run(instances).weights)
-instances.unpersist()
+val oldModel = lr.run(oldDataset)
+val lrm = new LogisticRegressionModel(this, map, oldModel.weights, 
oldModel.intercept)
+
+if (handlePersistence) {
+  oldDataset.unpersist()
+}
+
 // copy model params
 Params.inheritValues(map, this, lrm)
 lrm
   }
 
-  private[ml] override def transformSchema(schema: StructType, paramMap: 
ParamMap): StructType = {
-validateAndTransformSchema(schema, paramMap, fitting = true)
-  }
+  override protected def featuresDataType: DataType = new VectorUDT
 }
 
+
 /**
  * :: AlphaComponent ::
+ *
  * Model produced by [[LogisticRegression]].
  */
 @AlphaComponent
 class LogisticRegressionModel private[ml] (
 override val parent: LogisticRegression,
 override val fittingParamMap: ParamMap,
-weights: Vector)
-  extends Model[LogisticRegressionModel] with LogisticRegressionParams {
+val weights: Vector,
+val intercept: Double)
+  extends ProbabilisticClassificationModel[Vector, LogisticRegressionModel]
+  with LogisticRegressionParams {
+
+  setThreshold(0.5)
 
   def setThreshold(value: Double): this.type = set(threshold, value)
-  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
-  def setScoreCol(value: String): this.type = set(scoreCol, value)
-  def setPredictionCol(value: String): this.type = set(predictionCol, 
value)
 
-  private[ml] override def transformSchema(schema: StructType, paramMap: 
ParamMap): StructType = {
-validateAndTransformSchema(schema, paramMap, fitting = false)
+  private val margin: Vector => Double = (features) => {
+BLAS.dot(features, weights) + intercept
+  }
+
+  private val score: Vector => Double = (features) => {
+val m = margin(features)
+1.0 / (1.0 + math.exp(-m))
   }
 
   override def transform(dataset: SchemaRDD, paramMap: ParamMap): 
SchemaRDD = {
+// Check schema
 transformSchema(dataset.schema, paramMap, logging = true)
+
 import dataset.sqlContext._
 val map = this.paramMap ++ paramMap
-val score: Vector => Double = (v) => {
-  val margin = BLAS.dot(v, weights)
-  1.0 / (1.0 + math.exp(-margin))
+
+// Output selected columns only.
+// This is a bit complicated since it tries to avoid repeated 
computation.
+//   rawPrediction (-margin, margin)
+//   probability (1.0-score, score)
+//   prediction (max margin)
+var tmpData = dataset
+var numColsOutput = 0
+if (map(rawPredictionCol) != "") {
+  val features2raw: Vector => Vector = predictRaw
+  tmpData = tmpData.select(Star(None),
+features2raw.call(map(featuresCol).attr) as map(rawPredictionCol))
+  numColsOutput += 1
+}
+if 

[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2015-01-08 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3638#discussion_r22693927
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -456,10 +459,18 @@ private[spark] class TaskSetManager(
   }
   // Serialize and return the task
   val startTime = clock.getTime()
-  // We rely on the DAGScheduler to catch non-serializable 
closures and RDDs, so in here
-  // we assume the task can be serialized without exceptions.
-  val serializedTask = Task.serializeWithDependencies(
-task, sched.sc.addedFiles, sched.sc.addedJars, ser)
+  val serializedTask: ByteBuffer = try {
+Task.serializeWithDependencies(task, sched.sc.addedFiles,
+sched.sc.addedJars, ser)
+  } catch {
+// If the task cannot be serialized, then there's no point to 
re-attempt the task,
+// as it will always fail. So just abort the whole task-set.
+case NonFatal(e) =>
+  logError(s"Failed to serialize task $taskId, not attempting 
to retry it.", e)
+  abort(s"Failed to serialize task $taskId, not attempting to 
retry it. Exception " +
+s"during serialization is: $e")
--- End diff --

Looks like there's some duplication here. Can you put this in a val:
```
val msg = s"Failed to serialize task $taskId, not attempting to retry it."
logError(msg, e)
abort(s"$msg Exception during serialization: $e")
```



[GitHub] spark pull request: [WIP][SPARK-4912][SQL] Persistent tables for t...

2015-01-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3960#issuecomment-69276273
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25277/
Test FAILed.



[GitHub] spark pull request: [WIP][SPARK-4912][SQL] Persistent tables for t...

2015-01-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3960#issuecomment-69276271
  
  [Test build #25277 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25277/consoleFull)
 for   PR 3960 at commit 
[`49bf1ac`](https://github.com/apache/spark/commit/49bf1acc700d454f894edf55cd8fa88aee4d63da).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class DefaultSource extends SchemaRelationProvider `
  * `case class ParquetRelation2(`
  * `trait SchemaRelationProvider `
  * `  case class TableIdent(database: String, name: String) `
  * `case class CreateMetastoreDataSource(`




[GitHub] spark pull request: [SPARK-5122] Remove Shark from spark-ec2

2015-01-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3939#issuecomment-69277310
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25275/
Test PASSed.



[GitHub] spark pull request: [SPARK-5122] Remove Shark from spark-ec2

2015-01-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3939#issuecomment-69277301
  
  [Test build #25275 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25275/consoleFull)
 for   PR 3939 at commit 
[`66e0841`](https://github.com/apache/spark/commit/66e0841132331d0283ffdbd7a8e8203a67bd9d77).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-5118][SQL] Fix: create table test store...

2015-01-08 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3921#discussion_r22695485
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---
@@ -520,6 +520,7 @@ 
https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C
 TOK_TBLTEXTFILE, // Stored as TextFile
 TOK_TBLRCFILE, // Stored as RCFile
 TOK_TBLORCFILE, // Stored as ORC File
+TOK_TBLPARQUETFILE, // Stored as PARQUET
--- End diff --

This token was introduced with Hive 13. What will happen if a user is using 
Hive 12?



[GitHub] spark pull request: [SPARK-4574][SQL] Adding support for defining ...

2015-01-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3431#issuecomment-69280879
  
  [Test build #25283 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25283/consoleFull)
 for   PR 3431 at commit 
[`f336a16`](https://github.com/apache/spark/commit/f336a16c4b1e6241d160d2c149cdb13dba4b9263).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...

2015-01-08 Thread WangTaoTheTonic
Github user WangTaoTheTonic commented on a diff in the pull request:

https://github.com/apache/spark/pull/3607#discussion_r22697617
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala ---
@@ -87,6 +92,21 @@ private[spark] class ClientArguments(args: 
Array[String], sparkConf: SparkConf)
   throw new IllegalArgumentException(
 "You must specify at least 1 executor!\n" + getUsageMessage())
 }
+if (isClusterMode) {
+  for (key <- Seq(amMemKey, amMemOverheadKey)) {
+if (sparkConf.getOption(key).isDefined) {
+  println(s"$key is set but does not apply in cluster mode.")
--- End diff --

As `ClientArguments.scala` doesn't extend the Logging class, only `println` can 
be used here.
Yep, if the user sets config values that are never used in that mode, we 
should give a prompt.

BTW, `spark.driver.memory` is used in both modes, so I deleted the message 
about it.
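
For illustration, a minimal standalone sketch of the check under discussion; 
the key names and the `isClusterMode` flag are assumptions for this sketch, 
not the PR's exact code:

```
// Sketch only: warn about settings that are read in one deploy mode but
// silently ignored in the other. The key names below are assumed.
object ConfigModeCheck {
  def warnIgnoredKeys(conf: Map[String, String], isClusterMode: Boolean): Unit = {
    val clientModeOnlyKeys = Seq("spark.yarn.am.memory", "spark.yarn.am.memoryOverhead")
    if (isClusterMode) {
      for (key <- clientModeOnlyKeys if conf.contains(key)) {
        // No Logging trait is available in this context, hence plain println.
        println(s"$key is set but does not apply in cluster mode.")
      }
    }
  }
}
```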



[GitHub] spark pull request: [SPARK-4955]With executor dynamic scaling enab...

2015-01-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3962#issuecomment-69285225
  
  [Test build #25290 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25290/consoleFull)
 for   PR 3962 at commit 
[`6dfeeec`](https://github.com/apache/spark/commit/6dfeeecd4a206b9a82952e3b9f78128a0013d3c9).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  protected class YarnSchedulerActor(isDriver: Boolean)  extends Actor 
`




[GitHub] spark pull request: [SPARK-4955]With executor dynamic scaling enab...

2015-01-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3962#issuecomment-69285227
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25290/
Test FAILed.



[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...

2015-01-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3607#issuecomment-69286990
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25285/
Test PASSed.



[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...

2015-01-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3607#issuecomment-69286987
  
  [Test build #25285 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25285/consoleFull)
 for   PR 3607 at commit 
[`d5ceb1b`](https://github.com/apache/spark/commit/d5ceb1b2f181628fe0096202ffb31d95f0afcef8).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-5118][SQL] Fix: create table test store...

2015-01-08 Thread guowei2
Github user guowei2 commented on the pull request:

https://github.com/apache/spark/pull/3921#issuecomment-69290028
  
I think I should remove the test case, since `stored as parquet` can only 
pass with Hive 0.13.



[GitHub] spark pull request: [SPARK-5145][Mllib] Add BLAS.dsyr and use it i...

2015-01-08 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/3949#issuecomment-69290026
  
@jkbradley Thanks. The unit test is added.



[GitHub] spark pull request: [SPARK-5123] Expose only one version of the da...

2015-01-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3958#issuecomment-69267461
  
  [Test build #25272 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25272/consoleFull)
 for   PR 3958 at commit 
[`b4f9649`](https://github.com/apache/spark/commit/b4f96490f5044873aa593c6178a75d446f923493).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3541][MLLIB] New ALS implementation wit...

2015-01-08 Thread coderxiang
Github user coderxiang commented on a diff in the pull request:

https://github.com/apache/spark/pull/3720#discussion_r22692604
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala 
---
@@ -0,0 +1,964 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.recommendation
+
+import java.{util => javaUtil}
+
+import scala.collection.mutable
+
+import com.github.fommil.netlib.BLAS.{getInstance => blas}
+import com.github.fommil.netlib.LAPACK.{getInstance => lapack}
+import org.netlib.util.intW
+
+import org.apache.spark.{HashPartitioner, Logging, Partitioner}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{SchemaRDD, StructType}
+import org.apache.spark.sql.catalyst.dsl._
+import org.apache.spark.sql.catalyst.expressions.Cast
+import org.apache.spark.sql.catalyst.plans.LeftOuter
+import org.apache.spark.sql.catalyst.types.{DoubleType, FloatType, 
IntegerType, StructField}
+import org.apache.spark.util.Utils
+import org.apache.spark.util.collection.{OpenHashMap, OpenHashSet, 
SortDataFormat, Sorter}
+import org.apache.spark.util.random.XORShiftRandom
+
+/**
+ * Common params for ALS.
+ */
+private[recommendation] trait ALSParams extends Params with HasMaxIter 
with HasRegParam
+  with HasPredictionCol {
+
+  /** Param for rank of the matrix factorization. */
+  val rank = new IntParam(this, "rank", "rank of the factorization", Some(10))
+  def getRank: Int = get(rank)
+
+  /** Param for number of user blocks. */
+  val numUserBlocks = new IntParam(this, "numUserBlocks", "number of user 
blocks", Some(10))
+  def getNumUserBlocks: Int = get(numUserBlocks)
+
+  /** Param for number of item blocks. */
+  val numItemBlocks =
+new IntParam(this, "numItemBlocks", "number of item blocks", Some(10))
+  def getNumItemBlocks: Int = get(numItemBlocks)
+
+  /** Param to decide whether to use implicit preference. */
+  val implicitPrefs =
+new BooleanParam(this, "implicitPrefs", "whether to use implicit 
preference", Some(false))
+  def getImplicitPrefs: Boolean = get(implicitPrefs)
+
+  /** Param for the alpha parameter in the implicit preference 
formulation. */
+  val alpha = new DoubleParam(this, "alpha", "alpha for implicit 
preference", Some(1.0))
+  def getAlpha: Double = get(alpha)
+
+  /** Param for the column name for user ids. */
+  val userCol = new Param[String](this, "userCol", "column name for user 
ids", Some("user"))
+  def getUserCol: String = get(userCol)
+
+  /** Param for the column name for item ids. */
+  val itemCol =
+new Param[String](this, "itemCol", "column name for item ids", 
Some("item"))
+  def getItemCol: String = get(itemCol)
+
+  /** Param for the column name for ratings. */
+  val ratingCol = new Param[String](this, "ratingCol", "column name for 
ratings", Some("rating"))
+  def getRatingCol: String = get(ratingCol)
+
+  /**
+   * Validates and transforms the input schema.
+   * @param schema input schema
+   * @param paramMap extra params
+   * @return output schema
+   */
+  protected def validateAndTransformSchema(schema: StructType, paramMap: 
ParamMap): StructType = {
+val map = this.paramMap ++ paramMap
+assert(schema(map(userCol)).dataType == IntegerType)
+assert(schema(map(itemCol)).dataType == IntegerType)
+val ratingType = schema(map(ratingCol)).dataType
+assert(ratingType == FloatType || ratingType == DoubleType)
+val predictionColName = map(predictionCol)
+assert(!schema.fieldNames.contains(predictionColName),
+  s"Prediction column $predictionColName already exists.")
+val newFields = schema.fields :+ StructField(map(predictionCol), 
FloatType, nullable = false)
+StructType(newFields)
+  }
+}
+
+/**
+ * Model fitted by ALS.
+ */
+class ALSModel 

[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r22693062
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/package.scala ---
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib
--- End diff --

Should it be `mllib.clustering` as the file is under `clustering/`?



[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-08 Thread etrain
Github user etrain commented on a diff in the pull request:

https://github.com/apache/spark/pull/3637#discussion_r22693073
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala ---
@@ -0,0 +1,198 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import org.apache.spark.annotation.{DeveloperApi, AlphaComponent}
+import org.apache.spark.ml.impl.estimator.{PredictionModel, Predictor, 
PredictorParams}
+import org.apache.spark.ml.param.{Params, ParamMap, HasRawPredictionCol}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.analysis.Star
+
+/**
+ * :: DeveloperApi ::
+ * Params for classification.
+ */
+@DeveloperApi
+trait ClassifierParams extends PredictorParams
+  with HasRawPredictionCol {
+
+  override protected def validateAndTransformSchema(
+  schema: StructType,
+  paramMap: ParamMap,
+  fitting: Boolean,
+  featuresDataType: DataType): StructType = {
+val parentSchema = super.validateAndTransformSchema(schema, paramMap, 
fitting, featuresDataType)
+val map = this.paramMap ++ paramMap
+addOutputColumn(parentSchema, map(rawPredictionCol), new VectorUDT)
+  }
+}
+
+/**
+ * :: AlphaComponent ::
+ * Single-label binary or multiclass classification.
+ * Classes are indexed {0, 1, ..., numClasses - 1}.
+ *
+ * @tparam FeaturesType  Type of input features.  E.g., [[Vector]]
+ * @tparam Learner  Concrete Estimator type
+ * @tparam M  Concrete Model type
+ */
+@AlphaComponent
+abstract class Classifier[
+FeaturesType,
+Learner <: Classifier[FeaturesType, Learner, M],
+M <: ClassificationModel[FeaturesType, M]]
+  extends Predictor[FeaturesType, Learner, M]
+  with ClassifierParams {
+
+  def setRawPredictionCol(value: String): Learner =
+set(rawPredictionCol, value).asInstanceOf[Learner]
+
+  // TODO: defaultEvaluator (follow-up PR)
+}
+
+/**
+ * :: AlphaComponent ::
+ * Model produced by a [[Classifier]].
+ * Classes are indexed {0, 1, ..., numClasses - 1}.
+ *
+ * @tparam FeaturesType  Type of input features.  E.g., [[Vector]]
+ * @tparam M  Concrete Model type
+ */
+@AlphaComponent
+abstract class ClassificationModel[FeaturesType, M <: 
ClassificationModel[FeaturesType, M]]
+  extends PredictionModel[FeaturesType, M] with ClassifierParams {
+
+  def setRawPredictionCol(value: String): M = set(rawPredictionCol, 
value).asInstanceOf[M]
+
+  /** Number of classes (values which the label can take). */
+  def numClasses: Int
--- End diff --

How hard/weird would it be to make the labels an Enumeration? This class could 
be inferred from the training set at run-time or supplied by the user, and then 
the user doesn't pass the number of classes to the model, but instead the 
set of labels themselves (see the sketch below).
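
For illustration, one shape the suggestion could take, inferring the label set 
from the training data instead of taking `numClasses` from the user. This is a 
hypothetical sketch, not the spark.ml API:

```
// Hypothetical sketch: derive the set of labels (and hence numClasses)
// from the training data in a single pass.
case class Point(label: Double, features: Array[Double])

case class LabelSet(labels: Set[Double]) {
  def numClasses: Int = labels.size
}

object LabelSet {
  def infer(data: Seq[Point]): LabelSet = LabelSet(data.map(_.label).toSet)
}

object LabelSetDemo extends App {
  val train = Seq(Point(0.0, Array(1.0)), Point(1.0, Array(2.0)), Point(1.0, Array(3.0)))
  println(LabelSet.infer(train).numClasses) // 2
}
```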



[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-08 Thread etrain
Github user etrain commented on a diff in the pull request:

https://github.com/apache/spark/pull/3637#discussion_r22693524
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/sharedParams.scala 
---
@@ -17,6 +17,10 @@
 
 package org.apache.spark.ml.param
 
+/* NOTE TO DEVELOPERS:
+ * If you add these parameter traits into your algorithm, you need to add 
a setter method as well.
--- End diff --

Maybe we should update this comment and explain *why* the setter must be 
added? Code will still compile, right?
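
For context, a simplified sketch of the pattern the note refers to (not the 
real Params machinery): the trait supplies the param and its getter, while 
each concrete class re-declares the fluent setter so that chained calls keep 
the concrete return type. Omitting the setter still compiles; users just 
silently lose the fluent API, which is presumably what the comment should 
explain.

```
// Simplified sketch of the shared-param pattern.
trait HasMaxIter {
  protected var maxIterValue: Int = 100
  def getMaxIter: Int = maxIterValue
}

class MyEstimator extends HasMaxIter {
  // Returning this.type keeps chained calls typed as MyEstimator.
  // Forgetting this setter is a silent API gap, not a compile error.
  def setMaxIter(value: Int): this.type = { maxIterValue = value; this }
}

object SetterDemo extends App {
  val est = new MyEstimator().setMaxIter(50)
  println(est.getMaxIter) // 50
}
```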



[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2015-01-08 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3638#discussion_r22693713
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -249,13 +250,12 @@ private[spark] class TaskSchedulerImpl(
 // of locality levels so that it gets a chance to launch local tasks 
on all of them.
 // NOTE: the preferredLocality order: PROCESS_LOCAL, NODE_LOCAL, 
NO_PREF, RACK_LOCAL, ANY
 var launchedTask = false
-for (taskSet <- sortedTaskSets; maxLocality <- 
taskSet.myLocalityLevels) {
-  do {
-launchedTask = false
-for (i <- 0 until shuffledOffers.size) {
-  val execId = shuffledOffers(i).executorId
-  val host = shuffledOffers(i).host
-  if (availableCpus(i) >= CPUS_PER_TASK) {
+def resourceOfferSingleTaskSet(taskSet: TaskSetManager, maxLocality: 
TaskLocality) : Unit = {
--- End diff --

no space before `:`



[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2015-01-08 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3638#discussion_r22693694
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -266,8 +266,21 @@ private[spark] class TaskSchedulerImpl(
   assert(availableCpus(i) >= 0)
   launchedTask = true
 }
+  } catch {
+case e: TaskNotSerializableException => {
+  logError(s"Resource offer failed, task set ${taskSet.name} 
was not serializable")
+  // Do not offer resources for this task, but don't throw an 
error to allow other
+  // task sets to be submitted.
+  return
+}
   }
 }
+  }
+}
--- End diff --

can you define this function as a `private def` outside of 
`resourceOffers`? The nesting here makes this hard to read.



[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2015-01-08 Thread mccheah
Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/3638#discussion_r22695144
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -865,26 +865,6 @@ class DAGScheduler(
 }
 
 if (tasks.size > 0) {
-  // Preemptively serialize a task to make sure it can be serialized. 
We are catching this
-  // exception here because it would be fairly hard to catch the 
non-serializable exception
-  // down the road, where we have several different implementations 
for local scheduler and
-  // cluster schedulers.
-  //
-  // We've already serialized RDDs and closures in taskBinary, but 
here we check for all other
-  // objects such as Partition.
-  try {
-closureSerializer.serialize(tasks.head)
-  } catch {
-case e: NotSerializableException =>
-  abortStage(stage, "Task not serializable: " + e.toString)
-  runningStages -= stage
-  return
-case NonFatal(e) => // Other exceptions, such as 
IllegalArgumentException from Kryo.
-  abortStage(stage, s"Task serialization failed: 
$e\n${e.getStackTraceString}")
-  runningStages -= stage
-  return
-  }
-
--- End diff --

This is the main addition in the patch: task serialization error handling is 
now done only where the serialization actually occurs.

It turns out there are many scenarios where this selective sampling does 
not actually work. For example, when you create an RDD from an in-memory 
collection, some of the items may be serializable while others are not. E.g. 
consider a list of containers, where the first item in the list is an empty 
container, and the second item is a non-empty container with 
non-serializable things (see the sketch below).
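
A self-contained sketch of that failure mode, with plain Java serialization 
standing in for the task serializer: checking only the head element passes, 
while a later element would still fail at task time.

```
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

class NotSer                          // deliberately not Serializable
case class Box(contents: Seq[AnyRef]) // case classes are Serializable

object HeadCheckDemo extends App {
  val items: Seq[Box] = Seq(Box(Nil), Box(Seq(new NotSer)))

  def serialize(o: AnyRef): Unit =
    new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(o)

  serialize(items.head)   // succeeds: the empty container is fine
  try serialize(items(1)) // throws: this container holds a NotSer
  catch { case e: NotSerializableException => println(s"caught: $e") }
}
```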



[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2015-01-08 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3638#discussion_r22695628
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -266,8 +266,21 @@ private[spark] class TaskSchedulerImpl(
   assert(availableCpus(i) >= 0)
   launchedTask = true
 }
+  } catch {
+case e: TaskNotSerializableException => {
+  logError(s"Resource offer failed, task set ${taskSet.name} 
was not serializable")
--- End diff --

yeah you're right



[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2015-01-08 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3638#discussion_r22696100
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -209,6 +210,42 @@ private[spark] class TaskSchedulerImpl(
   .format(manager.taskSet.id, manager.parent.name))
   }
 
+  private def resourceOfferSingleTaskSet(
+  taskSet: TaskSetManager,
+  maxLocality: TaskLocality,
+  shuffledOffers: Seq[WorkerOffer],
+  availableCpus: Array[Int],
+  tasks: Seq[ArrayBuffer[TaskDescription]])
+: Boolean =
+  {
--- End diff --

small nit:
```
tasks: Seq[...]): Boolean = {
  ...
}
```



[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2015-01-08 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/3638#issuecomment-69278693
  
@mccheah @JoshRosen high level question. So what happens now when a task is 
not serializable? Before it would throw a loud exception and fail the task, but 
now we catch the task not serializable exception and silently not schedule it. 
I may be missing something, but do we ever abort the stage or fail the task?



[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an Arti...

2015-01-08 Thread loachli
Github user loachli commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-69278665
  
Hi jkbradley:
  Could you tell me the JIRA number related to the "new spark.ml package and 
its design doc"?


From: jkbradley [mailto:notificati...@github.com]
Sent: January 9, 2015 3:51
To: apache/spark
Cc: Lizhengbing (bing, BIPA)
Subject: Re: [spark] [MLLIB] [spark-2352] Implementation of an Artificial 
Neural Network (ANN) (#1290)


@bgreeven <https://github.com/bgreeven> I’m not too surprised that the 
majority vote (a.k.a. one vs. all) did not do very well; it does not scale well 
with the number of classes. A tree (or better yet, error-corrected output 
codes) generally works better, in my experience.

@avulanov <https://github.com/avulanov> True, we try for consistency with 
APIs, except where we’re changing the norm. There is not a clear write-up 
about the “norm,” although the new spark.ml package and its design doc (in 
the JIRA) give an overview of some parts. Basically, we’re aiming to make 
things more pluggable and extensible, while minimizing API change. If that 
requires short-term API changes (such as switching away from ANNWithX method 
names), that can be acceptable.

@bgreeven <https://github.com/bgreeven> 
@avulanov <https://github.com/avulanov> The test results look pretty good, 
though I’m not sure what to expect for accuracy. I think the main item 
remaining is figuring out the public API. It’s tough since neural networks / 
deep learning are a rapidly evolving field, and there are a lot of model and 
algorithm variants out there. Ideally, we could put together a design doc (to 
be linked from the JIRA) for this big feature which would:

  *   Design a public API for neural networks and deep learning
 *   Comparison of other major libraries’ APIs
 *   Minimum viable product API for an initial PR
 *   Path for the future:
*   What extensions might we need to do, and can we keep the public 
API stable for these?
*   What extensions might users want to do? Is the API easily 
extensible and/or pluggable, or can we make it so in the future without 
changing the existing public API?
  *   Briefly discuss the algorithm
 *   Alg sketch, limitations, etc.
 *   Alternative algorithms, and a path for making the optimization 
algorithm pluggable in the future (as we’ve discussed a bit in the PR 
conversation)

I realize it takes quite a while to get a big new feature ready. If you’d 
like to encourage early adoption, you could also post this for now as a package 
for Spark, while the PR is made fully ready.

CC: @mengxrhttps://github.com/mengxr

—
Reply to this email directly or view it on 
GitHub <https://github.com/apache/spark/pull/1290#issuecomment-69237765>.




[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2015-01-08 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3638#discussion_r22696128
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -251,23 +288,8 @@ private[spark] class TaskSchedulerImpl(
 var launchedTask = false
 for (taskSet <- sortedTaskSets; maxLocality <- 
taskSet.myLocalityLevels) {
   do {
-launchedTask = false
-for (i <- 0 until shuffledOffers.size) {
-  val execId = shuffledOffers(i).executorId
-  val host = shuffledOffers(i).host
-  if (availableCpus(i) >= CPUS_PER_TASK) {
-for (task <- taskSet.resourceOffer(execId, host, maxLocality)) 
{
-  tasks(i) += task
-  val tid = task.taskId
-  taskIdToTaskSetId(tid) = taskSet.taskSet.id
-  taskIdToExecutorId(tid) = execId
-  executorsByHost(host) += execId
-  availableCpus(i) -= CPUS_PER_TASK
-  assert(availableCpus(i) >= 0)
-  launchedTask = true
-}
-  }
-}
+launchedTask = resourceOfferSingleTaskSet(taskSet, maxLocality, 
shuffledOffers,
+  availableCpus, tasks)
--- End diff --

another small style nit
```
launchedTask = resourceOfferSingleTaskSet(
  taskSet, maxLocality ... tasks)
```  



[GitHub] spark pull request: [SPARK-4989][CORE] avoid wrong eventlog conf c...

2015-01-08 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/3824#issuecomment-69280965
  
Yes that would be great. It seems that not all of the changes in this PR 
are applicable there, however.



[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...

2015-01-08 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3607#discussion_r22697722
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala
 ---
@@ -68,8 +68,6 @@ private[spark] class YarnClientSchedulerBackend(
 // List of (target Client argument, environment variable, Spark 
property)
 val optionTuples =
   List(
-(--driver-memory, SPARK_MASTER_MEMORY, spark.master.memory),
-(--driver-memory, SPARK_DRIVER_MEMORY, spark.driver.memory),
--- End diff --

ah ok. Also it doesn't really make sense to pass driver memory on in client 
mode anyway, because the driver by definition has already started when 
`YarnClientSchedulerBackend` is created.



[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...

2015-01-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3607#issuecomment-69282417
  
  [Test build #25285 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25285/consoleFull)
 for   PR 3607 at commit 
[`d5ceb1b`](https://github.com/apache/spark/commit/d5ceb1b2f181628fe0096202ffb31d95f0afcef8).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-4989][CORE] avoid wrong eventlog conf c...

2015-01-08 Thread liyezhang556520
Github user liyezhang556520 commented on the pull request:

https://github.com/apache/spark/pull/3824#issuecomment-69283439
  
ok, I'll make new PRs for those old branches 1.0, 1.1, and 1.2.



[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

2015-01-08 Thread zsxwing
Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/3783#issuecomment-69284237
  
> Can you explain how (1) is related to SPARK-4951? It seems to me that (2) 
> is sufficient in triggering the issue.

The original implementation will mark an executor idle when receiving 
`SparkListenerBlockManagerAdded`.

So if `SparkListenerTaskStart` is received before 
`SparkListenerBlockManagerAdded`, the executor will be marked idle even if 
there is a task running on it, and it will therefore be killed when it 
expires.

That's why I said they are related. Of course, we can also say (1) is a 
special case of (2).
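
For illustration, a minimal sketch of the fix direction (not the real 
ExecutorAllocationManager code): track running tasks per executor and guard 
the idle transition, so a late `SparkListenerBlockManagerAdded` cannot mark a 
busy executor idle.

```
import scala.collection.mutable

// Sketch only: the idle transition is guarded by the running-task count,
// so listener-event ordering cannot produce a spuriously idle executor.
class IdleTracker {
  private val runningTasks = mutable.Map.empty[String, Int].withDefaultValue(0)
  private val idle = mutable.Set.empty[String]

  def onTaskStart(execId: String): Unit = {
    runningTasks(execId) += 1
    idle -= execId
  }

  def onTaskEnd(execId: String): Unit = {
    runningTasks(execId) -= 1
    if (runningTasks(execId) == 0) idle += execId
  }

  // May arrive after onTaskStart; the guard keeps a busy executor non-idle.
  def onBlockManagerAdded(execId: String): Unit =
    if (runningTasks(execId) == 0) idle += execId

  def isIdle(execId: String): Boolean = idle(execId)
}
```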



[GitHub] spark pull request: [SPARK-4574][SQL] Adding support for defining ...

2015-01-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3431#issuecomment-69285760
  
  [Test build #25283 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25283/consoleFull)
 for   PR 3431 at commit 
[`f336a16`](https://github.com/apache/spark/commit/f336a16c4b1e6241d160d2c149cdb13dba4b9263).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class DefaultSource extends SchemaRelationProvider `
  * `case class ParquetRelation2(`
  * `trait SchemaRelationProvider `




[GitHub] spark pull request: [SPARK-4574][SQL] Adding support for defining ...

2015-01-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3431#issuecomment-69285764
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25283/
Test PASSed.



[GitHub] spark pull request: [SPARK-4048] Enhance and extend hadoop-provide...

2015-01-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2982#issuecomment-69269314
  
  [Test build #25273 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25273/consoleFull)
 for   PR 2982 at commit 
[`82eb688`](https://github.com/apache/spark/commit/82eb688f44d2df63a7b7ff311e5d40970f67fc43).
 * This patch merges cleanly.



[GitHub] spark pull request: [Spark-3490] Disable SparkUI for tests (backpo...

2015-01-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3959#issuecomment-69270938
  
  [Test build #25274 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25274/consoleFull)
 for   PR 3959 at commit 
[`5425314`](https://github.com/apache/spark/commit/542531483312b77ed941c277f3e05c4ef1867534).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-5122] Remove Shark from spark-ec2

2015-01-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3939#issuecomment-69270905
  
  [Test build #25275 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25275/consoleFull)
 for   PR 3939 at commit 
[`66e0841`](https://github.com/apache/spark/commit/66e0841132331d0283ffdbd7a8e8203a67bd9d77).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2015-01-08 Thread mccheah
Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/3638#discussion_r22694937
  
--- Diff: 
core/src/main/scala/org/apache/spark/TaskNotSerializableException.scala ---
@@ -0,0 +1,27 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import org.apache.spark.annotation.DeveloperApi
+
+/**
+ * :: DeveloperApi ::
+ * Exception thrown when a task cannot be serialized
+ */
+@DeveloperApi
+class TaskNotSerializableException(error: Throwable) extends 
Exception(error)
--- End diff --

I perhaps misunderstood the semantics of DeveloperApi - what I believed it 
meant is that the class should not be used by end-users, but is only to be 
thrown from Spark. However the exception class would be visible when we log 
it...



[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2015-01-08 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3638#discussion_r22694961
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -865,26 +865,6 @@ class DAGScheduler(
 }
 
 if (tasks.size > 0) {
-  // Preemptively serialize a task to make sure it can be serialized. 
We are catching this
-  // exception here because it would be fairly hard to catch the 
non-serializable exception
-  // down the road, where we have several different implementations 
for local scheduler and
-  // cluster schedulers.
-  //
-  // We've already serialized RDDs and closures in taskBinary, but 
here we check for all other
-  // objects such as Partition.
-  try {
-closureSerializer.serialize(tasks.head)
-  } catch {
-case e: NotSerializableException =>
-  abortStage(stage, "Task not serializable: " + e.toString)
-  runningStages -= stage
-  return
-case NonFatal(e) => // Other exceptions, such as 
IllegalArgumentException from Kryo.
-  abortStage(stage, s"Task serialization failed: 
$e\n${e.getStackTraceString}")
-  runningStages -= stage
-  return
-  }
-
--- End diff --

Can you explain why this is removed? It used to provide a way to fail fast 
if the task is not serializable.



[GitHub] spark pull request: [SPARK-5118][SQL] Fix: create table test store...

2015-01-08 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3921#discussion_r22695098
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---
@@ -520,6 +520,7 @@ 
https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C
 TOK_TBLTEXTFILE, // Stored as TextFile
 TOK_TBLRCFILE, // Stored as RCFile
 TOK_TBLORCFILE, // Stored as ORC File
+TOK_TBLPARQUETFILE, // Stored as PARQUET
--- End diff --

Seems this line is added only for the completeness of the parser for Hive 
0.13. 



[GitHub] spark pull request: [SPARK-5118][SQL] Fix: create table test store...

2015-01-08 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3921#discussion_r22695310
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---
@@ -520,6 +520,7 @@ https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C
 TOK_TBLTEXTFILE, // Stored as TextFile
 TOK_TBLRCFILE, // Stored as RCFile
 TOK_TBLORCFILE, // Stored as ORC File
+TOK_TBLPARQUETFILE, // Stored as PARQUET
--- End diff --

For these tokens, only `TOK_TABNAME`, `TOK_QUERY`, and `TOK_IFNOTEXISTS` are
actually used by `HiveQl`. Since we ask Hive's SemanticAnalyzer to create the
`CreateTableDesc`, we basically ignore the other tokens here.
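
For illustration, this is the kind of DDL the new token admits past the parser so that Hive's SemanticAnalyzer can build the `CreateTableDesc` from it (a sketch: the table name and columns are made up, and `hiveContext` is assumed to be an existing `HiveContext`):

```scala
// With TOK_TBLPARQUETFILE recognized, this no longer fails at the parse step;
// the full statement is handed off to Hive's SemanticAnalyzer.
hiveContext.sql(
  """CREATE TABLE events_parquet (id INT, payload STRING)
    |STORED AS PARQUET""".stripMargin)
```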





[GitHub] spark pull request: [Spark-3490] Disable SparkUI for tests (backpo...

2015-01-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3959#issuecomment-69277801
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25274/
Test FAILed.





[GitHub] spark pull request: [Spark-3490] Disable SparkUI for tests (backpo...

2015-01-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3959#issuecomment-69277796
  
  [Test build #25274 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25274/consoleFull)
 for   PR 3959 at commit 
[`5425314`](https://github.com/apache/spark/commit/542531483312b77ed941c277f3e05c4ef1867534).
 * This patch **fails some tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [WIP][SPARK-4912][SQL] Persistent tables for t...

2015-01-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3960#issuecomment-69280036
  
  [Test build #25282 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25282/consoleFull)
 for   PR 3960 at commit 
[`172db80`](https://github.com/apache/spark/commit/172db80cf71ba4a853a42993e87bc52e6c08b94f).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...

2015-01-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3607#issuecomment-69281668
  
  [Test build #25284 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25284/consoleFull)
 for   PR 3607 at commit 
[`6c1b264`](https://github.com/apache/spark/commit/6c1b264efe76483ffa0c2c589c51b4c42de18c59).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...

2015-01-08 Thread WangTaoTheTonic
Github user WangTaoTheTonic commented on a diff in the pull request:

https://github.com/apache/spark/pull/3607#discussion_r22697386
  
--- Diff: yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala ---
@@ -68,8 +68,6 @@ private[spark] class YarnClientSchedulerBackend(
     // List of (target Client argument, environment variable, Spark property)
     val optionTuples =
       List(
-        ("--driver-memory", "SPARK_MASTER_MEMORY", "spark.master.memory"),
-        ("--driver-memory", "SPARK_DRIVER_MEMORY", "spark.driver.memory"),
--- End diff --

In `Client.scala`, the `--driver-memory` passed by spark-submit is not used
anymore. I thought we had discussed this before.

@andrewor14 Ok, I got what you mean; I had a misunderstanding before.
To solve this problem, should we just delete

    ("--driver-memory", "SPARK_MASTER_MEMORY", "spark.master.memory"),
    ("--driver-memory", "SPARK_DRIVER_MEMORY", "spark.driver.memory"),

in YarnClientSchedulerBackend.scala?

@WangTaoTheTonic that would fix it, but I think in addition to that we should
also add a check in ClientArguments itself, in case the user calls into the
Client main class and specifies --driver-memory manually.
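
A rough sketch of the check suggested above; the class name `ClientArguments` is real, but this validation logic and its error message are hypothetical:

```scala
// Reject --driver-memory when the Client main class is invoked directly in
// yarn-client mode: the driver JVM is already running, so the flag can no
// longer take effect.
private def validateDriverMemoryArg(args: Seq[String], isClusterMode: Boolean): Unit = {
  if (!isClusterMode && args.contains("--driver-memory")) {
    throw new IllegalArgumentException(
      "--driver-memory has no effect in yarn-client mode; " +
        "set spark.driver.memory before the driver JVM starts instead.")
  }
}
```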





[GitHub] spark pull request: [SPARK-5122] Remove Shark from spark-ec2

2015-01-08 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/3939#issuecomment-69281611
  
Ok LGTM, I'm merging this into master. Thanks.





[GitHub] spark pull request: [SPARK-5163] [CORE] Load properties from confi...

2015-01-08 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/3963

[SPARK-5163] [CORE] Load properties from configuration file for example 
spark-defaults.conf when creating SparkConf object

I create and run a Spark program that does not use SparkSubmit.
When I create a SparkConf object with `new SparkConf()`, it does not
automatically load properties from a configuration file such as
spark-defaults.conf.
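
A sketch of the behavior being requested, assuming a standard layout under SPARK_HOME; this is illustrative loading code, not the patch itself:

```scala
import java.io.FileInputStream
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.spark.SparkConf

// Fold spark.* entries from spark-defaults.conf into a SparkConf, for
// programs that construct their own SparkConf without going through
// SparkSubmit. Values already set on the conf keep priority over defaults.
def confWithDefaults(path: String): SparkConf = {
  val conf = new SparkConf()
  val props = new Properties()
  val in = new FileInputStream(path)
  try props.load(in) finally in.close()
  props.stringPropertyNames().asScala.foreach { key =>
    if (key.startsWith("spark.") && !conf.contains(key)) {
      conf.set(key, props.getProperty(key))
    }
  }
  conf
}
```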

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-5163

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3963.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3963


commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-06T13:07:08Z

Merge pull request #1 from apache/master

update

commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-20T13:14:08Z

Merge pull request #3 from apache/master

Update

commit 8a0010691b669495b4c327cf83124cabb7da1405
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-12T06:54:58Z

Merge pull request #6 from apache/master

Update

commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-09-16T12:03:22Z

Merge pull request #7 from apache/master

Update

commit 76d40277d51f709247df1d3734093bf2c047737d
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-10-20T12:52:22Z

Merge pull request #8 from apache/master

update

commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-04T09:00:31Z

Merge pull request #9 from apache/master

Update

commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-11-11T03:18:24Z

Merge pull request #10 from apache/master

Update

commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-01T11:23:56Z

Merge pull request #11 from apache/master

Update

commit 718afebe364bd54ac33be425e24183eb1c76b5d3
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-05T11:08:31Z

Merge pull request #12 from apache/master

update

commit e4c2c0a18bdc78cc17823cbc2adf3926944e6bc5
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-24T03:15:22Z

Merge pull request #15 from apache/master

update

commit d4bca32bf4b06d3694a5de3cf5b69bac606dda39
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-12-31T03:50:26Z

Merge pull request #19 from apache/master

Update

commit ac9579ca434f559bf173ad219bd04b48a7db226f
Author: yantangzhai tyz0...@163.com
Date:   2015-01-09T03:17:51Z

[SPARK-5163] [CORE] Load properties from configuration file for example 
spark-defaults.conf when creating SparkConf object







[GitHub] spark pull request: [SPARK-4574][SQL] Adding support for defining ...

2015-01-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3431#issuecomment-69288718
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25287/
Test PASSed.





[GitHub] spark pull request: [SPARK-4574][SQL] Adding support for defining ...

2015-01-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3431#issuecomment-69288714
  
  [Test build #25287 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25287/consoleFull)
 for   PR 3431 at commit 
[`a852b10`](https://github.com/apache/spark/commit/a852b100b5fc6ddd6a19271f01c8df12c00553a6).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class DefaultSource extends SchemaRelationProvider `
  * `case class ParquetRelation2(`
  * `trait SchemaRelationProvider `






[GitHub] spark pull request: [SPARK-5118][SQL] Fix: create table test store...

2015-01-08 Thread guowei2
Github user guowei2 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3921#discussion_r22700232
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---
@@ -520,6 +520,7 @@ https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C
 TOK_TBLTEXTFILE, // Stored as TextFile
 TOK_TBLRCFILE, // Stored as RCFile
 TOK_TBLORCFILE, // Stored as ORC File
+TOK_TBLPARQUETFILE, // Stored as PARQUET
--- End diff --

@yhuai Thanks for the reply. The token can be parsed in Hive 0.12 too, but
the DDL `create table ... stored as parquet` will fail when it reaches the
Hive API. This is why I do not know how to add a test case that runs only on
Hive 0.13.1.
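
One way to express such a guard, sketched with a hypothetical `hiveVersion` value; in a real suite this would come from the build profile or a shim layer:

```scala
// Only exercise STORED AS PARQUET when running against Hive 0.13.x; Hive 0.12
// parses the token but fails once the DDL reaches the Hive API.
val hiveVersion = "0.13.1" // hypothetical: supplied by the active build profile
if (hiveVersion.startsWith("0.13")) {
  sql("CREATE TABLE parquet_ctas_test STORED AS PARQUET AS SELECT key, value FROM src")
  sql("DROP TABLE parquet_ctas_test")
}
```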






[GitHub] spark pull request: [SPARK-5088] Use spark-class for running execu...

2015-01-08 Thread jongyoul
Github user jongyoul commented on the pull request:

https://github.com/apache/spark/pull/3897#issuecomment-69289262
  
@JoshRosen @tgravescs @andrewor14 Could anyone review this PR? It makes the
Mesos code cleaner.





[GitHub] spark pull request: [SPARK-4284] BinaryClassificationMetrics preci...

2015-01-08 Thread Lewuathe
Github user Lewuathe commented on the pull request:

https://github.com/apache/spark/pull/3933#issuecomment-69273220
  
@mengxr @srowen Thank you for reviewing. I agree with the reasoning that the
`pr` method is also reasonable for drawing the curve, so I'll keep it as-is.
But I still want to make this explicit in the [official
documentation](https://spark.apache.org/docs/latest/mllib-guide.html), although
there seems to be no `mllib.evaluation` entry there. Is there already
documentation about `mllib.evaluation`?
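
For reference, a small usage sketch of the `pr` method under discussion (assumes an existing `SparkContext` named `sc`; the scores and labels are made up):

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// (score, label) pairs; pr() returns the precision-recall curve as an RDD of
// (recall, precision) points, ready for plotting.
val scoreAndLabels = sc.parallelize(Seq(
  (0.9, 1.0), (0.8, 1.0), (0.6, 0.0), (0.3, 1.0), (0.1, 0.0)))
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
metrics.pr().collect().foreach { case (recall, precision) =>
  println(s"recall=$recall precision=$precision")
}
```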





[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2015-01-08 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3638#discussion_r22694007
  
--- Diff: core/src/test/scala/org/apache/spark/scheduler/NotSerializableFakeTask.scala ---
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import java.io.{ObjectInputStream, ObjectOutputStream, IOException}
+
+import org.apache.spark.TaskContext
+
+/**
+ * A Task implementation that fails to serialize.
+ */
+class NotSerializableFakeTask(myId: Int, stageId: Int) extends Task[Array[Byte]](stageId, 0) {
--- End diff --

can you make this `private[spark]`





[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2015-01-08 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3638#discussion_r22693981
  
--- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
@@ -887,6 +891,23 @@ class RDDSuite extends FunSuite with SharedSparkContext {
     assert(ancestors6.count(_.isInstanceOf[CyclicalDependencyRDD[_]]) === 3)
   }
 
+  test("parallelize with exception thrown on serialization should not hang") {
+    class BadSerializable extends Serializable {
+      @throws(classOf[IOException])
+      private def writeObject(out: ObjectOutputStream) : Unit = throw new KryoException("Bad serialization")
+
+      @throws(classOf[IOException])
+      private def readObject(in: ObjectInputStream) : Unit = {}
--- End diff --

no space before `:` here and in L897





[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2015-01-08 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3638#discussion_r22693969
  
--- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
@@ -887,6 +891,23 @@ class RDDSuite extends FunSuite with SharedSparkContext {
     assert(ancestors6.count(_.isInstanceOf[CyclicalDependencyRDD[_]]) === 3)
   }
 
+  test("parallelize with exception thrown on serialization should not hang") {
--- End diff --

this name is a little too specific. I'd leave out the "parallelize" part and
just call it something like
```
serialization exception should not hang scheduler
```





[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2015-01-08 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3638#discussion_r22694761
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -266,8 +266,21 @@ private[spark] class TaskSchedulerImpl(
           assert(availableCpus(i) >= 0)
           launchedTask = true
         }
+      } catch {
+        case e: TaskNotSerializableException => {
+          logError(s"Resource offer failed, task set ${taskSet.name} was not serializable")
--- End diff --

Do we expect the exception to contain any useful information? It might be 
good to `logError(..., e)`
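
That is, something along these lines (a one-line sketch; `logError(msg, throwable)` comes from Spark's `Logging` trait):

```scala
// Passing the exception preserves its stack trace in the scheduler logs.
logError(s"Resource offer failed, task set ${taskSet.name} was not serializable", e)
```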





[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2015-01-08 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3638#discussion_r22694839
  
--- Diff: core/src/main/scala/org/apache/spark/TaskNotSerializableException.scala ---
@@ -0,0 +1,27 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import org.apache.spark.annotation.DeveloperApi
+
+/**
+ * :: DeveloperApi ::
+ * Exception thrown when a task cannot be serialized
+ */
+@DeveloperApi
+class TaskNotSerializableException(error: Throwable) extends Exception(error)
--- End diff --

any reason why this is exposed as `DeveloperApi`? IIUC we don't throw this 
to the user





[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2015-01-08 Thread mccheah
Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/3638#discussion_r22694827
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -266,8 +266,21 @@ private[spark] class TaskSchedulerImpl(
           assert(availableCpus(i) >= 0)
           launchedTask = true
         }
+      } catch {
+        case e: TaskNotSerializableException => {
+          logError(s"Resource offer failed, task set ${taskSet.name} was not serializable")
--- End diff --

The place where the error is thrown is already logging it (TaskSetManager).





[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2015-01-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3638#issuecomment-69277042
  
  [Test build #25281 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25281/consoleFull)
 for   PR 3638 at commit 
[`5267929`](https://github.com/apache/spark/commit/5267929054cce06dd1c422a6a010e82b81b22a13).
 * This patch merges cleanly.





[GitHub] spark pull request: [Spark-3490] Disable SparkUI for tests (backpo...

2015-01-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3961#issuecomment-69277048
  
  [Test build #25280 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25280/consoleFull)
 for   PR 3961 at commit 
[`8644997`](https://github.com/apache/spark/commit/8644997624af1739890ec902f7e2e36278d158fa).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5118][SQL] Fix: create table test store...

2015-01-08 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3921#discussion_r22695453
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -68,6 +68,17 @@ class SQLQuerySuite extends QueryTest {
       """CREATE TABLE IF NOT EXISTS ctas4 AS
         | SELECT key, value FROM src ORDER BY key, value""".stripMargin).collect
 
+    sql(
+      """CREATE TABLE ctas5
+        | ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
+        | STORED AS
+        | INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
+        | OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
+        | AS
+        |   SELECT key, value
+        |   FROM src
+        |   ORDER BY key, value""".stripMargin).collect
+
--- End diff --

I guess you want to test `STORED AS PARQUET`?




