[GitHub] spark pull request: [SPARK-5478][UI][Minor] Add missing right pare...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4267 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-5492. Thread statistics can break with o...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4305#issuecomment-72417193 LGTM pending tests. Thanks, Sandy
[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3637#issuecomment-72416892 @shivaram In the weakly typed API, fit() will take a DataFrame (containing attributes info) + Params. In the strongly typed API, train() would take an RDD[LabeledPoint], separate attributes info, + Params. Since the weakly typed API takes Params, it would be best not to duplicate the attributes info in the Params.
[GitHub] spark pull request: [SPARK-5212][SQL] Add support of schema-less, ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/4014#discussion_r23910289

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/ScriptTransformation.scala ---
@@ -25,9 +25,18 @@ import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
  * @param input the set of expression that should be passed to the script.
  * @param script the command that should be executed.
  * @param output the attributes that are produced by the script.
+ * @param ioschema the input and output schema applied in the execution of the script.
  */
 case class ScriptTransformation(
     input: Seq[Expression],
     script: String,
     output: Seq[Attribute],
-    child: LogicalPlan) extends UnaryNode
+    child: LogicalPlan,
+    ioschema: Option[ScriptInputOutputSchema]) extends UnaryNode
--- End diff --

In the Hive case, it is not. But I think it may be for other cases?
[GitHub] spark pull request: SPARK-4687. [WIP] Add an addDirectory API
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3670#issuecomment-72416125 [Test build #26496 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26496/consoleFull) for PR 3670 at commit [`21504f9`](https://github.com/apache/spark/commit/21504f9381fc7c73486dfdbc51be023023213e91). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-5492. Thread statistics can break with o...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4305#issuecomment-72416110 [Test build #26495 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26495/consoleFull) for PR 4305 at commit [`b7d4497`](https://github.com/apache/spark/commit/b7d4497cf3a62d5c289a5b0e31148619162d2e14). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5212][SQL] Add support of schema-less, ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/4014#discussion_r23910154

--- Diff: sql/hive/v0.12.0/src/main/scala/org/apache/spark/sql/hive/Shim12.scala ---
@@ -241,8 +241,14 @@ private[hive] object HiveShim {
       Decimal(hdoi.getPrimitiveJavaObject(data).bigDecimalValue())
     }
   }
+
+  implicit def prepareWritable(shimW: ShimWritable): Writable = {
+    shimW.writable
+  }
 }
+
+case class ShimWritable(writable: Writable)
--- End diff --

If we skip `ShimWritable`, we then need to remove `implicit` from `prepareWritable` and call it explicitly to do the fixing. Is that better? If so, I can do it that way. It does not break Hive 0.12 because we just pass the underlying writable object through without touching it. We only do the fixing on Hive 0.13.
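The pattern under discussion can be sketched outside of Scala and Hive entirely. The following is a hypothetical Python illustration (not the actual Spark shim code, and `prepare_writable` is an invented name): a version-gated "prepare" step that passes the object through untouched on the old version and applies a fix only on the new one, which is why the old code path cannot break.

```python
# Hypothetical sketch of a version-gated shim: a no-op pass-through for the
# old version, a fix applied only for the new one. Names and the dict-based
# "writable" stand-in are illustrative assumptions, not Spark APIs.
def prepare_writable(writable, hive_version):
    if hive_version == "0.12":
        return writable            # pass-through: old behavior is untouched
    fixed = dict(writable)         # model "fixing" as a defensive copy + patch
    fixed["fixed"] = True
    return fixed

w = {"payload": b"row-bytes"}
assert prepare_writable(w, "0.12") is w       # identical object on 0.12
assert prepare_writable(w, "0.13")["fixed"]   # fix applied on 0.13
assert "fixed" not in w                       # original never mutated
```

The same trade-off raised in the comment applies here: an implicit conversion hides the call site, while an explicit call makes the version-dependent fix visible to readers.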
[GitHub] spark pull request: SPARK-5492. Thread statistics can break with o...
GitHub user sryza opened a pull request: https://github.com/apache/spark/pull/4305 SPARK-5492. Thread statistics can break with older Hadoop versions

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sryza/spark sandy-spark-5492

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4305.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4305

commit b7d4497cf3a62d5c289a5b0e31148619162d2e14
Author: Sandy Ryza
Date: 2015-02-02T07:29:27Z

    SPARK-5492. Thread statistics can break with older Hadoop versions
[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/4233#discussion_r23910101

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
       case None => score
     }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Create JSON metadata.
+    val metadata = LogisticRegressionModel.Metadata(
+      clazz = this.getClass.getName, version = Exportable.latestVersion)
+    val metadataRDD: DataFrame = sc.parallelize(Seq(metadata))
+    metadataRDD.toJSON.saveAsTextFile(path + "/metadata")
+    // Create Parquet data.
+    val data = LogisticRegressionModel.Data(weights, intercept, threshold)
+    val dataRDD: DataFrame = sc.parallelize(Seq(data))
+    dataRDD.saveAsParquetFile(path + "/data")
+  }
+}
+
+object LogisticRegressionModel extends Importable[LogisticRegressionModel] {
+
+  private case class Metadata(clazz: String, version: String)
+
+  private case class Data(weights: Vector, intercept: Double, threshold: Option[Double])
+
+  override def load(sc: SparkContext, path: String): LogisticRegressionModel = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Load JSON metadata.
+    val metadataRDD = sqlContext.jsonFile(path + "/metadata")
--- End diff --

(I guess these are conflicting since using DataFrame and toJSON will mean 1 record per text file line, but that's OK with me.)
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3976#issuecomment-72415530 [Test build #26490 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26490/consoleFull) for PR 3976 at commit [`67f8cee`](https://github.com/apache/spark/commit/67f8cee9e25b5bd05c0252705b1f67cb63b0fa01). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3976#issuecomment-72415533 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26490/
[GitHub] spark pull request: [SPARK-5512][Mllib] Run the PIC algorithm with...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4301#issuecomment-72415114 [Test build #26494 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26494/consoleFull) for PR 4301 at commit [`19cf94e`](https://github.com/apache/spark/commit/19cf94ecfd6d879cbceb52f0abc0a32461e7d871). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5512][Mllib] Run the PIC algorithm with...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/4301#issuecomment-72414811 @mengxr I think it is better to keep both and leave it as an option users can switch between.
[GitHub] spark pull request: [SPARK-5341] Use maven coordinates as dependen...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4215#issuecomment-72414442 [Test build #26493 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26493/consoleFull) for PR 4215 at commit [`c08dc9f`](https://github.com/apache/spark/commit/c08dc9fb8d85a7d9a58f980af99687e13c4d766a). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5341] Use maven coordinates as dependen...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4215#issuecomment-72414128 [Test build #26492 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26492/consoleFull) for PR 4215 at commit [`3ada19a`](https://github.com/apache/spark/commit/3ada19ac3d569e4d5af35c309436be36ba211f94). * This patch **does not merge cleanly**.
[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/4233#discussion_r23909298

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
       case None => score
     }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Create JSON metadata.
+    val metadata = LogisticRegressionModel.Metadata(
+      clazz = this.getClass.getName, version = Exportable.latestVersion)
+    val metadataRDD: DataFrame = sc.parallelize(Seq(metadata))
+    metadataRDD.toJSON.saveAsTextFile(path + "/metadata")
+    // Create Parquet data.
+    val data = LogisticRegressionModel.Data(weights, intercept, threshold)
+    val dataRDD: DataFrame = sc.parallelize(Seq(data))
+    dataRDD.saveAsParquetFile(path + "/data")
+  }
+}
+
+object LogisticRegressionModel extends Importable[LogisticRegressionModel] {
+
+  private case class Metadata(clazz: String, version: String)
+
+  private case class Data(weights: Vector, intercept: Double, threshold: Option[Double])
+
+  override def load(sc: SparkContext, path: String): LogisticRegressionModel = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Load JSON metadata.
+    val metadataRDD = sqlContext.jsonFile(path + "/metadata")
--- End diff --

Also, I get the motivation for using json4s directly rather than going through DataFrame and DataFrame.toJSON in terms of reducing dependencies. However, I like the idea of using DataFrame, since it will be helpful when we add other types of metadata, such as info about each feature.
[GitHub] spark pull request: [SPARK-2309][MLlib] Multinomial Logistic Regre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3833#issuecomment-72413396 [Test build #26491 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26491/consoleFull) for PR 3833 at commit [`4ce4d33`](https://github.com/apache/spark/commit/4ce4d33f6d8119f4b68d6e436a398e0f975d9b40). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/3715#discussion_r23909226

--- Diff: core/src/test/scala/org/apache/spark/api/python/PythonRDDSuite.scala ---
@@ -23,11 +23,21 @@ import org.scalatest.FunSuite
 class PythonRDDSuite extends FunSuite {
 
-  test("Writing large strings to the worker") {
-    val input: List[String] = List("a"*10)
-    val buffer = new DataOutputStream(new ByteArrayOutputStream)
-    PythonRDD.writeIteratorToStream(input.iterator, buffer)
-  }
+  test("Writing large strings to the worker") {
+    val input: List[String] = List("a"*10)
+    val buffer = new DataOutputStream(new ByteArrayOutputStream)
+    PythonRDD.writeIteratorToStream(input.iterator, buffer)
+  }
 
-}
+  test("Handle nulls gracefully") {
+    val buffer = new DataOutputStream(new ByteArrayOutputStream)
+    PythonRDD.writeIteratorToStream(List("a", null).iterator, buffer)
+    PythonRDD.writeIteratorToStream(List(null, "a").iterator, buffer)
+    PythonRDD.writeIteratorToStream(List("a".getBytes, null).iterator, buffer)
+    PythonRDD.writeIteratorToStream(List(null, "a".getBytes).iterator, buffer)
+    PythonRDD.writeIteratorToStream(List((null, null), ("a", null), (null, "b")).iterator, buffer)
--- End diff --

There is a test in Python to verify that (though it does not cover all the cases).
[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/4233#discussion_r23909199

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
       case None => score
     }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Create JSON metadata.
+    val metadata = LogisticRegressionModel.Metadata(
+      clazz = this.getClass.getName, version = Exportable.latestVersion)
+    val metadataRDD: DataFrame = sc.parallelize(Seq(metadata))
+    metadataRDD.toJSON.saveAsTextFile(path + "/metadata")
+    // Create Parquet data.
+    val data = LogisticRegressionModel.Data(weights, intercept, threshold)
+    val dataRDD: DataFrame = sc.parallelize(Seq(data))
+    dataRDD.saveAsParquetFile(path + "/data")
+  }
+}
+
+object LogisticRegressionModel extends Importable[LogisticRegressionModel] {
+
+  private case class Metadata(clazz: String, version: String)
+
+  private case class Data(weights: Vector, intercept: Double, threshold: Option[Double])
+
+  override def load(sc: SparkContext, path: String): LogisticRegressionModel = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Load JSON metadata.
+    val metadataRDD = sqlContext.jsonFile(path + "/metadata")
--- End diff --

That's not quite my question: I think the confusion is mixing "row" (line) in a text file vs. "row" (or record) in an RDD. How about we store the metadata in a single record in an RDD, but print that RDD as multi-line JSON to a single text file? It will be easier for humans to read and will be easy to load as a single record as well.
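The single-record-but-multi-line proposal above can be shown with plain JSON, independent of Spark. This is an illustrative sketch (using Python's standard `json` module, not the Spark code under review): one metadata record is written pretty-printed across several lines, yet still parses back as exactly one record.

```python
import json

# One logical metadata record, written as multi-line JSON for readability.
# The field names mirror the Metadata case class in the diff; the file-on-disk
# step is elided and only the serialization round trip is shown.
metadata = {"clazz": "LogisticRegressionModel", "version": "1.0"}

text = json.dumps(metadata, indent=2)    # pretty-printed, multi-line
assert len(text.splitlines()) > 1        # human-readable: spans several lines
assert json.loads(text) == metadata      # still loads as a single record
```

This is the crux of the comment: "one record per line" is a property of how the JSON is printed, not of how many records there are.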
[GitHub] spark pull request: [SPARK-2309][MLlib] Multinomial Logistic Regre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/3833#discussion_r23909157

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
@@ -61,20 +79,58 @@ class LogisticRegressionModel (
   override protected def predictPoint(dataMatrix: Vector, weightMatrix: Vector,
       intercept: Double) = {
-    val margin = weightMatrix.toBreeze.dot(dataMatrix.toBreeze) + intercept
-    val score = 1.0 / (1.0 + math.exp(-margin))
-    threshold match {
-      case Some(t) => if (score > t) 1.0 else 0.0
-      case None => score
+    // If dataMatrix and weightMatrix have the same dimension, it's binary logistic regression.
+    if (dataMatrix.size == weightMatrix.size) {
+      val margin = dot(weights, dataMatrix) + intercept
+      val score = 1.0 / (1.0 + math.exp(-margin))
+      threshold match {
+        case Some(t) => if (score > t) 1.0 else 0.0
+        case None => score
+      }
+    } else {
+      val dataWithBiasSize = weightMatrix.size / (nClasses - 1)
+      val dataWithBias = if (dataWithBiasSize == dataMatrix.size) {
+        dataMatrix
+      } else {
+        assert(dataMatrix.size + 1 == dataWithBiasSize)
+        MLUtils.appendBias(dataMatrix)
--- End diff --

This can be done without creating the temp matrix w. See the updated PR.
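The binary branch of `predictPoint` in the diff above boils down to a small amount of arithmetic. The following is a minimal Python sketch of that logic only (the multinomial branch and the Spark `Vector` types are omitted; `predict_point` is an illustrative name, not a Spark API): dot product plus intercept gives the margin, the sigmoid of the margin gives the score, and an optional threshold turns the score into a 0/1 label.

```python
import math

# Sketch of the binary logistic-regression prediction: sigmoid of the margin,
# with an optional decision threshold applied to the resulting score.
def predict_point(data, weights, intercept, threshold=None):
    margin = sum(w * x for w, x in zip(weights, data)) + intercept
    score = 1.0 / (1.0 + math.exp(-margin))
    if threshold is None:
        return score                        # raw probability
    return 1.0 if score > threshold else 0.0

# margin = 2*1 + 2*(-1) + 0 = 0, so score = 0.5, which is not > 0.5 -> label 0.0
assert predict_point([1.0, -1.0], [2.0, 2.0], 0.0, threshold=0.5) == 0.0
assert abs(predict_point([1.0, -1.0], [2.0, 2.0], 0.0) - 0.5) < 1e-12
```

Note the strict `>` in the threshold comparison, matching the `score > t` in the Scala code: a score exactly at the threshold is classified as 0.0.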
[GitHub] spark pull request: [SPARK-5208][DOC] Add more documentation to Ne...
Github user sarutak closed the pull request at: https://github.com/apache/spark/pull/4012
[GitHub] spark pull request: [SPARK-5208][DOC] Add more documentation to Ne...
Github user sarutak commented on the pull request: https://github.com/apache/spark/pull/4012#issuecomment-72411859 OK, I'll close.
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3976#issuecomment-72410777 [Test build #26490 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26490/consoleFull) for PR 3976 at commit [`67f8cee`](https://github.com/apache/spark/commit/67f8cee9e25b5bd05c0252705b1f67cb63b0fa01). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5498][SQL]fix bug when query the data w...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4289#issuecomment-72410655 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26489/
[GitHub] spark pull request: [SPARK-5498][SQL]fix bug when query the data w...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4289#issuecomment-72410648 [Test build #26489 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26489/consoleFull) for PR 4289 at commit [`afc7da5`](https://github.com/apache/spark/commit/afc7da53be4b7bcb9cd5ce8d72b6855544b96596).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class Rating[@specialized(Int, Long) ID](user: ID, item: ID, rating: Float)`
  * `class StandardScalerModel (`
[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/3637#issuecomment-72410436 @mengxr - Would the Attribute be per data point, or something that is set once per algorithm? The latter sounds like something the `ParamMap` should be able to handle. If it's per element, then it's like another column in the table? Sorry if I'm missing something, but it would be great if you could give an example.
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/3715#discussion_r23908186

--- Diff: core/src/test/scala/org/apache/spark/api/python/PythonRDDSuite.scala ---
@@ -23,11 +23,21 @@ import org.scalatest.FunSuite
 class PythonRDDSuite extends FunSuite {
 
-  test("Writing large strings to the worker") {
-    val input: List[String] = List("a"*10)
-    val buffer = new DataOutputStream(new ByteArrayOutputStream)
-    PythonRDD.writeIteratorToStream(input.iterator, buffer)
-  }
+  test("Writing large strings to the worker") {
+    val input: List[String] = List("a"*10)
+    val buffer = new DataOutputStream(new ByteArrayOutputStream)
+    PythonRDD.writeIteratorToStream(input.iterator, buffer)
+  }
 
-}
+  test("Handle nulls gracefully") {
+    val buffer = new DataOutputStream(new ByteArrayOutputStream)
+    PythonRDD.writeIteratorToStream(List("a", null).iterator, buffer)
+    PythonRDD.writeIteratorToStream(List(null, "a").iterator, buffer)
+    PythonRDD.writeIteratorToStream(List("a".getBytes, null).iterator, buffer)
+    PythonRDD.writeIteratorToStream(List(null, "a".getBytes).iterator, buffer)
+    PythonRDD.writeIteratorToStream(List((null, null), ("a", null), (null, "b")).iterator, buffer)
--- End diff --

This issue still has not been addressed. There are no asserts to check whether the nulls can be read back properly.
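The review point above is that writing nulls without reading them back proves little. The following is an illustrative round-trip sketch in Python (not the actual PythonRDD wire format; the framing and function names are assumptions for the example): strings are length-prefixed, null is encoded as a length of -1, and an assert confirms the nulls survive.

```python
import io
import struct

# Illustrative length-prefixed framing: each string is written as a big-endian
# 4-byte length followed by its UTF-8 bytes; None is encoded as length -1.
def write_items(items, out):
    for s in items:
        if s is None:
            out.write(struct.pack(">i", -1))
        else:
            data = s.encode("utf-8")
            out.write(struct.pack(">i", len(data)))
            out.write(data)

def read_items(inp):
    items = []
    while True:
        header = inp.read(4)
        if not header:
            return items
        (n,) = struct.unpack(">i", header)
        items.append(None if n == -1 else inp.read(n).decode("utf-8"))

# The assert is the point: the nulls are not just written, they are read back.
buf = io.BytesIO()
write_items(["a", None, "b"], buf)
buf.seek(0)
assert read_items(buf) == ["a", None, "b"]
```

A test shaped like this (write, rewind, read, assert) is what the comment asks the Scala suite to add on top of the existing write-only calls.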
[GitHub] spark pull request: [Spark-5406][MLlib] LocalLAPACK mode in RowMat...
Github user hhbyyh commented on the pull request: https://github.com/apache/spark/pull/4200#issuecomment-72410109 Thanks
[GitHub] spark pull request: Disabling Utils.chmod700 for Windows
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/4299#issuecomment-72409984 Hey @MartinWeindel any ideas why the diff for this PR is almost 2k lines? Is your IDE changing the line end characters somehow?
[GitHub] spark pull request: [WIP][SPARK-5501][SPARK-5420][SQL] Write suppo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4294#issuecomment-72409068

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26488/
[GitHub] spark pull request: [WIP][SPARK-5501][SPARK-5420][SQL] Write suppo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4294#issuecomment-72409064

[Test build #26488 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26488/consoleFull) for PR 4294 at commit [`9203ec2`](https://github.com/apache/spark/commit/9203ec2f5bfca2cdedb7b9042996db5d59edeb34).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class FPGrowthModel(val freqItemsets: RDD[(Array[String], Long)]) extends Serializable`
  * `class Node[T](val parent: Node[T]) extends Serializable`
  * `protected[sql] class DDLException(message: String) extends Exception(message)`
  * `trait TableScan extends BaseRelation`
  * `trait PrunedScan extends BaseRelation`
  * `trait PrunedFilteredScan extends BaseRelation`
  * `trait CatalystScan extends BaseRelation`
  * `trait InsertableRelation extends BaseRelation`
  * `case class CreateMetastoreDataSourceAsSelect(`
[GitHub] spark pull request: [SPARK-5353] Log failures in REPL class loadin...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4130
[GitHub] spark pull request: [SPARK-5012][MLLib][PySpark]Python API for Gau...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4059#issuecomment-72408935

Yes, Array should work.
[GitHub] spark pull request: [SPARK-5353] Log failures in REPL class loadin...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4130#issuecomment-72408916

I can merge it.
[GitHub] spark pull request: Add a config option to print DAG.
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4257#issuecomment-72408451

@rxin I have noticed that very few users know about `toDebugString`. Maybe we should open a JIRA to add better documentation for that function (i.e. discuss it in the programming guide). Overall, I agree with you and @ScrapCodes in that I'm not sure this particular flag is super useful.
[GitHub] spark pull request: [SPARK-5353] Log failures in REPL class loadin...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4130#issuecomment-72408322

LGTM
[GitHub] spark pull request: [SPARK-5208][DOC] Add more documentation to Ne...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4012#issuecomment-72408184

Okay @sarutak, can you close this issue then? Looks like we intentionally left these out for now.
[GitHub] spark pull request: [SPARK-5341] Use maven coordinates as dependen...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4215#discussion_r23907239

--- Diff: core/pom.xml ---

```
@@ -225,6 +225,16 @@
       <scope>test</scope>
     </dependency>
+    <dependency>
+      <groupId>org.apache.ivy</groupId>
+      <artifactId>ivy</artifactId>
+      <version>${ivy.version}</version>
+    </dependency>
+    <dependency>
+      <groupId>oro</groupId>
```

--- End diff --

@brkyvz add a comment here:

```
```
[GitHub] spark pull request: [SPARK-3996]: Shade Jetty in Spark deliverable...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4285
[GitHub] spark pull request: [SPARK-3996]: Shade Jetty in Spark deliverable...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4285#issuecomment-72407319

Okay - let's try this for take 2.
[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/3798#discussion_r23907051

--- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaUtils.scala ---

```
@@ -144,4 +150,249 @@ object KafkaUtils {
     createStream[K, V, U, T](
       jssc.ssc, kafkaParams.toMap, Map(topics.mapValues(_.intValue()).toSeq: _*), storageLevel)
   }
+
+  /** A batch-oriented interface for consuming from Kafka.
+   * Starting and ending offsets are specified in advance,
+   * so that you can control exactly-once semantics.
+   * @param sc SparkContext object
+   * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
+   * configuration parameters</a>.
+   * Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
+   * NOT zookeeper servers, specified in host1:port1,host2:port2 form.
+   * @param batch Each OffsetRange in the batch corresponds to a
+   * range of offsets for a given Kafka topic/partition
+   */
+  @Experimental
+  def createRDD[
+    K: ClassTag,
+    V: ClassTag,
+    U <: Decoder[_]: ClassTag,
+    T <: Decoder[_]: ClassTag,
+    R: ClassTag] (
```

--- End diff --

Good catch!
[GitHub] spark pull request: [SPARK-5498][SQL]fix bug when query the data w...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4289#issuecomment-72406797

[Test build #26489 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26489/consoleFull) for PR 4289 at commit [`afc7da5`](https://github.com/apache/spark/commit/afc7da53be4b7bcb9cd5ce8d72b6855544b96596).

* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5012][MLLib][PySpark]Python API for Gau...
Github user FlytxtRnD commented on the pull request: https://github.com/apache/spark/pull/4059#issuecomment-72406403

So I will go with the current approach. I tried to change Array to ArrayBuffer but it ends up in exceptions. So can I go with Array itself?
[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/3798#discussion_r23906731

--- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaRDD.scala ---

```
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.kafka
+
+import scala.reflect.{classTag, ClassTag}
+
+import org.apache.spark.{Logging, Partition, SparkContext, SparkException, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.NextIterator
+
+import java.util.Properties
+import kafka.api.{FetchRequestBuilder, FetchResponse}
+import kafka.common.{ErrorMapping, TopicAndPartition}
+import kafka.consumer.{ConsumerConfig, SimpleConsumer}
+import kafka.message.{MessageAndMetadata, MessageAndOffset}
+import kafka.serializer.Decoder
+import kafka.utils.VerifiableProperties
+
+/**
+ * A batch-oriented interface for consuming from Kafka.
+ * Starting and ending offsets are specified in advance,
+ * so that you can control exactly-once semantics.
+ * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
+ * configuration parameters</a>.
+ * Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
+ * NOT zookeeper servers, specified in host1:port1,host2:port2 form.
+ * @param batch Each KafkaRDDPartition in the batch corresponds to a
+ * range of offsets for a given Kafka topic/partition
+ * @param messageHandler function for translating each message into the desired type
+ */
+private[spark]
+class KafkaRDD[
+  K: ClassTag,
+  V: ClassTag,
+  U <: Decoder[_]: ClassTag,
+  T <: Decoder[_]: ClassTag,
+  R: ClassTag] private[spark] (
+    sc: SparkContext,
+    kafkaParams: Map[String, String],
+    private[spark] val batch: Array[KafkaRDDPartition],
```

--- End diff --

Actually, this is not the desired way to create RDDs. The partition objects are generated by the RDD itself, not provided from outside. Although this is not a written hard rule, it is generally the norm followed by all types of RDDs. Example: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala#L65
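The convention tdas describes, where an RDD takes a description of the work as input and derives its own Partition objects in `getPartitions` rather than accepting pre-built partitions, can be sketched without Spark. Everything below (`SketchRDD`, `OffsetRange`, `KafkaLikeRDD`) is invented for illustration and is not Spark's actual API:

```scala
// Describes a slice of work, analogous to a Kafka topic/partition offset range.
case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

trait Partition { def index: Int }

// Minimal stand-in for Spark's RDD contract: each RDD builds its own partitions.
abstract class SketchRDD[T] {
  protected def getPartitions: Array[Partition]
  final lazy val partitions: Array[Partition] = getPartitions
}

// The constructor takes the *description* (offset ranges); the Partition
// objects are derived inside getPartitions, one per range, following the
// norm tdas points to rather than being passed in by the caller.
class KafkaLikeRDD(offsetRanges: Seq[OffsetRange]) extends SketchRDD[String] {
  private case class KafkaLikePartition(index: Int, range: OffsetRange) extends Partition

  override protected def getPartitions: Array[Partition] =
    offsetRanges.zipWithIndex
      .map { case (r, i) => KafkaLikePartition(i, r) }
      .toArray[Partition]
}
```

The design point is that callers never see or construct partition objects; they describe the input (here, offset ranges) and the RDD owns the mapping from description to partitions, as UnionRDD does in the linked example.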
[GitHub] spark pull request: [SQL] Improve DataFrame API error reporting
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4296#issuecomment-72405811

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26487/
[GitHub] spark pull request: [SPARK-5012][MLLib][PySpark]Python API for Gau...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4059#issuecomment-72405813

They are not attributes but public methods. Did you try `mu()` and `sigma()`? I think the current approach looks good except for the minor issues commented on. We can try other approaches in a later PR.
[GitHub] spark pull request: [SQL] Improve DataFrame API error reporting
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4296#issuecomment-72405807

[Test build #26487 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26487/consoleFull) for PR 4296 at commit [`17f6bae`](https://github.com/apache/spark/commit/17f6bae783362076c977aae834792dc94cffca94).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `trait Column extends DataFrame with ExpressionApi`
  * `class ColumnName(name: String) extends IncomputableColumn(name)`
  * `trait DataFrame extends DataFrameSpecificApi with RDDApi[Row]`
  * `class GroupedDataFrame protected[sql](df: DataFrameImpl, groupingExprs: Seq[Expression])`
  * `protected[sql] class QueryExecution(val logical: LogicalPlan)`
[GitHub] spark pull request: [SPARK-5470][Core]use defaultClassLoader to lo...
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/4258#issuecomment-72405667

LGTM
[GitHub] spark pull request: PCA wrapper for easy transform vectors
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4304#issuecomment-72405573

@catap This is nice to have. Could you follow the steps in https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark for contributing to Spark? For example, you need to create a JIRA (and get assigned) and put the JIRA number in the PR title. For the public APIs, please follow other transformers under `mllib.feature`.
[GitHub] spark pull request: [WIP][SPARK-5501][SQL] Write support for the d...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4294#issuecomment-72405506

[Test build #26488 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26488/consoleFull) for PR 4294 at commit [`9203ec2`](https://github.com/apache/spark/commit/9203ec2f5bfca2cdedb7b9042996db5d59edeb34).

* This patch merges cleanly.
[GitHub] spark pull request: Disabling Utils.chmod700 for Windows
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4299#issuecomment-72405340

[Test build #26483 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26483/consoleFull) for PR 4299 at commit [`fe2740b`](https://github.com/apache/spark/commit/fe2740bef2320195a64fbaa7f29d6493cc6337c8).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: Disabling Utils.chmod700 for Windows
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4299#issuecomment-72405341

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26483/
[GitHub] spark pull request: [SQL] Improve DataFrame API error reporting
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4296#issuecomment-72405245

[Test build #26487 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26487/consoleFull) for PR 4296 at commit [`17f6bae`](https://github.com/apache/spark/commit/17f6bae783362076c977aae834792dc94cffca94).

* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3637#issuecomment-72405143

About the metadata, I'm thinking of creating ML Attribute/VectorAttribute classes that store feature information and can be loaded from/saved to Spark SQL's metadata. It is similar to Weka's Attribute implementation. Since `RDD[LabeledPoint]` doesn't carry this extra information, could we make ML attributes an input argument to the `train` method? For example:

~~~
def train(dataset: RDD[LabeledPoint], attributes: (Attribute, VectorAttribute))
~~~
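To make the proposal above concrete, here is a purely hypothetical sketch of what such Attribute/VectorAttribute classes might look like. Every name and field is invented for illustration and is not Spark ML's actual attribute API; a plain `Seq` stands in for `RDD[LabeledPoint]`:

```scala
// Hypothetical attribute metadata, loosely in the spirit of Weka's Attribute.
sealed trait AttrType
case object Numeric extends AttrType
case object Nominal extends AttrType

case class Attribute(name: String, attrType: AttrType)

// Per-element metadata for a feature vector.
case class VectorAttribute(elements: IndexedSeq[Attribute]) {
  def numFeatures: Int = elements.length
}

case class LabeledPoint(label: Double, features: Array[Double])

object AttrSketch {
  // A train method that carries the metadata alongside the data, matching the
  // proposed (label attribute, feature attributes) tuple argument. Here it only
  // validates the data against the metadata and returns the feature count it
  // checked, as a stand-in for returning a trained model.
  def train(dataset: Seq[LabeledPoint], attributes: (Attribute, VectorAttribute)): Int = {
    val (_, featureAttrs) = attributes
    require(dataset.forall(_.features.length == featureAttrs.numFeatures),
      s"each point must have ${featureAttrs.numFeatures} features")
    featureAttrs.numFeatures
  }
}
```

The point of the signature is that the strongly typed API can consume the same metadata the weakly typed DataFrame API would carry in column metadata, without duplicating it in Params.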
[GitHub] spark pull request: PCA wrapper for easy transform vectors
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4304#issuecomment-72404997

Can one of the admins verify this patch?
[GitHub] spark pull request: PCA wrapper for easy transform vectors
GitHub user catap opened a pull request: https://github.com/apache/spark/pull/4304

PCA wrapper for easy transform vectors

I implemented a simple PCA wrapper for easily transforming vectors with PCA, for example the features of a LabeledPoint or another more complicated structure. Example of usage:

```
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.feature.PCA

val data = sc.textFile("data/mllib/ridge-data/lpsa.data").map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

val pca = PCA.create(training.first().features.size / 2, data.map(_.features))
val training_pca = training.map(p => p.copy(features = pca.transform(p.features)))
val test_pca = test.map(p => p.copy(features = pca.transform(p.features)))

val numIterations = 100
val model = LinearRegressionWithSGD.train(training, numIterations)
val model_pca = LinearRegressionWithSGD.train(training_pca, numIterations)

val valuesAndPreds = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
val valuesAndPreds_pca = test_pca.map { point =>
  val score = model_pca.predict(point.features)
  (score, point.label)
}

val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()
val MSE_pca = valuesAndPreds_pca.map { case (v, p) => math.pow((v - p), 2) }.mean()

println("Mean Squared Error = " + MSE)
println("PCA Mean Squared Error = " + MSE_pca)
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/catap/spark pca

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4304.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #4304

commit c71af4ad718be60e231bb10e39211f1acb1b04ab
Author: Kirill A. Korinskiy
Date: 2015-02-02T04:24:52Z

    PCA wrapper for easy transform vectors
[GitHub] spark pull request: [SPARK-5012][MLLib][PySpark]Python API for Gau...
Github user FlytxtRnD commented on the pull request: https://github.com/apache/spark/pull/4059#issuecomment-72404786

Instead of passing mu & sigma as arrays, I tried to directly pass "gaussians" (Array[MultivariateGaussian]) from PythonMLLibAPI. But I was not able to access the attributes of the MultivariateGaussian class object in Python, so I converted "gaussians" into two arrays of mu and sigma and returned those to Python. Which method is better? And is it possible to access the attributes mu & sigma in Python by passing "gaussians" directly?
[GitHub] spark pull request: [SPARK-4001][MLlib] adding parallel FP-Growth ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/2847
[GitHub] spark pull request: [SPARK-4001][MLlib] adding parallel FP-Growth ...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2847#issuecomment-72404067

LGTM. Merged into master. Thanks!! (The failed test is a known flaky test. All relevant tests passed.)
[GitHub] spark pull request: [SPARK-5324][SQL] Results of describe can't be...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4249#issuecomment-72404020

[Test build #26485 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26485/consoleFull) for PR 4249 at commit [`11559ae`](https://github.com/apache/spark/commit/11559ae5b8356e0b50b1647af1623b04ca42523a).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5324][SQL] Results of describe can't be...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4249#issuecomment-72404022 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26485/
[GitHub] spark pull request: [SPARK-4001][MLlib] adding parallel FP-Growth ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2847#issuecomment-72403864

[Test build #26486 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26486/consoleFull) for PR 2847 at commit [`bee3093`](https://github.com/apache/spark/commit/bee3093daa4c8473a9f531c5fdee353c06cd1bf0).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class FPGrowthModel(val freqItemsets: RDD[(Array[String], Long)]) extends Serializable`
  * `class Node[T](val parent: Node[T]) extends Serializable`
[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4233#discussion_r23906177

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
~~~
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
       case None => score
     }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Create JSON metadata.
+    val metadata = LogisticRegressionModel.Metadata(
+      clazz = this.getClass.getName, version = Exportable.latestVersion)
+    val metadataRDD: DataFrame = sc.parallelize(Seq(metadata))
+    metadataRDD.toJSON.saveAsTextFile(path + "/metadata")
+    // Create Parquet data.
+    val data = LogisticRegressionModel.Data(weights, intercept, threshold)
+    val dataRDD: DataFrame = sc.parallelize(Seq(data))
+    dataRDD.saveAsParquetFile(path + "/data")
+  }
+}
+
+object LogisticRegressionModel extends Importable[LogisticRegressionModel] {
+
+  private case class Metadata(clazz: String, version: String)
+
+  private case class Data(weights: Vector, intercept: Double, threshold: Option[Double])
+
+  override def load(sc: SparkContext, path: String): LogisticRegressionModel = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Load JSON metadata.
+    val metadataRDD = sqlContext.jsonFile(path + "/metadata")
~~~
--- End diff --

We want to use RDD to avoid talking to fs directly. If you use json4s, you can render single-line JSON easily:

~~~
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

val json = ("a\n" -> "b\n")
println(compact(render(json)))
~~~

outputs

~~~
{"a\n":"b\n"}
~~~

So the metadata won't span multiple lines.
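The single-line property mengxr relies on above holds the same way in plain Python, shown here for comparison (a stdlib-only sketch, not part of the PR): `json.dumps` escapes embedded newlines, so each metadata record occupies exactly one physical line of the saved text file.

```python
import json

# Keys/values containing raw newlines, mirroring the json4s example above.
metadata = {"a\n": "b\n"}
line = json.dumps(metadata)  # newlines are escaped to \n, so 'line' has no breaks

# Round-trip check: parsing the single line recovers the original mapping.
restored = json.loads(line)
```

Because every serialized record is one line, a line-oriented reader (such as an RDD of text lines) can parse records independently without scanning for record boundaries.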
[GitHub] spark pull request: [SPARK-4001][MLlib] adding parallel FP-Growth ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2847#issuecomment-72403870 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26486/
[GitHub] spark pull request: [SPARK-4943][SPARK-5251][SQL] Allow table name...
Github user scwf commented on the pull request: https://github.com/apache/spark/pull/4062#issuecomment-72403143 ping
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3976#issuecomment-72403008

[Test build #26482 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26482/consoleFull) for PR 3976 at commit [`0319ae3`](https://github.com/apache/spark/commit/0319ae328b2db694684ea586cbb7d49fb2b487c7).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3976#issuecomment-72403015 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26482/
[GitHub] spark pull request: [Spark-5406][MLlib] LocalLAPACK mode in RowMat...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4200#issuecomment-72402760 The changes look good to me. We may want to investigate more on the limits, but the current setting is certainly better than master. I've merged it. Thanks for testing!
[GitHub] spark pull request: [Spark-5406][MLlib] LocalLAPACK mode in RowMat...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4200
[GitHub] spark pull request: [SPARK-5212][SQL] Add support of schema-less, ...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/4014#issuecomment-72402698 Thanks for working on this! It would be great if this could be updated soon so we can include it in 1.3.
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3976#issuecomment-72402028 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26480/
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3976#issuecomment-72402026

[Test build #26480 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26480/consoleFull) for PR 3976 at commit [`2385ef6`](https://github.com/apache/spark/commit/2385ef679638fcb0b544a3de7744c9f4f2c242f0).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SQL] Improve DataFrame API error reporting
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/4296#discussion_r23905446

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
~~~
@@ -28,6 +28,21 @@ import scala.language.postfixOps
 class DataFrameSuite extends QueryTest {
   import org.apache.spark.sql.TestData._

+  test("analysis error should be eagerly reported") {
+    intercept[Exception] { testData.select('nonExistentName) }
+    intercept[Exception] {
+      testData.groupBy('key).agg(Map("nonExistentName" -> "sum"))
+    }
+    intercept[Exception] {
+      testData.groupBy("nonExistentName").agg(Map("key" -> "sum"))
~~~
--- End diff --

Why isn't this `(String, String)*`?
[GitHub] spark pull request: [SPARK-5278][SQL] complete the check of ambigu...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/4068#discussion_r23905429

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
~~~
@@ -285,11 +285,22 @@ class Analyzer(catalog: Catalog,
       result

       // Resolve field names using the resolver.
-      case f @ GetField(child, fieldName) if !f.resolved && child.resolved =>
+      case f @ GetField(child, fieldName) if child.resolved =>
         child.dataType match {
           case StructType(fields) =>
-            val resolvedFieldName = fields.map(_.name).find(resolver(_, fieldName))
-            resolvedFieldName.map(n => f.copy(fieldName = n)).getOrElse(f)
+            val actualField = fields.filter(f => resolver(f.name, fieldName))
+            if (actualField.length == 0) {
+              sys.error(
+                s"No such struct field $fieldName in ${fields.map(_.name).mkString(", ")}")
~~~
--- End diff --

ping @marmbrus
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user lianhuiwang commented on the pull request: https://github.com/apache/spark/pull/3976#issuecomment-72401201 For a Python application, if the SPARK_HOME of the submission node is different from that of the NodeManager, it does not work in my test. Example: the submission node's version is 1.2, but Spark's version on the NodeManager is 1.1; that combination does not work now. I think this is a separate problem that does not belong to this PR, because it also exists in yarn-client mode.
[GitHub] spark pull request: [SPARK-4001][MLlib] adding parallel FP-Growth ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2847#issuecomment-72401147 [Test build #26486 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26486/consoleFull) for PR 2847 at commit [`bee3093`](https://github.com/apache/spark/commit/bee3093daa4c8473a9f531c5fdee353c06cd1bf0). * This patch merges cleanly.
[GitHub] spark pull request: [Spark-5406][MLlib] LocalLAPACK mode in RowMat...
Github user hhbyyh commented on the pull request: https://github.com/apache/spark/pull/4200#issuecomment-72400944 @mengxr Sorry to disturb. I know you are probably quite busy with many PRs in review. Can you please provide some comments if you get a minute? I will close the PR if it's regarded as unnecessary for now~ Thanks.
[GitHub] spark pull request: [SPARK-5498][SQL]fix bug when query the data w...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4289#issuecomment-72400702 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26484/
[GitHub] spark pull request: [SPARK-5498][SQL]fix bug when query the data w...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4289#issuecomment-72400699

[Test build #26484 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26484/consoleFull) for PR 4289 at commit [`b1527d5`](https://github.com/apache/spark/commit/b1527d58349ccdc0b986705b93d7658822211994).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5324][SQL] Results of describe can't be...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4249#issuecomment-72400616 [Test build #26485 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26485/consoleFull) for PR 4249 at commit [`11559ae`](https://github.com/apache/spark/commit/11559ae5b8356e0b50b1647af1623b04ca42523a). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5404] [SQL] update the default statisti...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/4199#issuecomment-72400547 I don't think that I agree with this change. In general it is always safe to do a shuffle join, whereas a broadcast join could possibly cause the driver to OOM. I'm worried that this change will make us faster for some workloads but possibly also unstable.
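The trade-off in this comment can be sketched as the size-based rule the planner applies (a hedged illustration, not Spark's actual planner code; the function name is made up, though the real knob is the `spark.sql.autoBroadcastJoinThreshold` setting): a table whose estimated size is under the threshold is broadcast, while anything larger, or of unknown size, falls back to the always-safe shuffle join.

```python
# Spark's long-standing default broadcast threshold is 10 MB.
DEFAULT_AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024

def choose_join_strategy(estimated_size_bytes, threshold=DEFAULT_AUTO_BROADCAST_THRESHOLD):
    """Pick a join strategy from a (possibly missing) table-size estimate.

    Broadcasting a table whose size was under-estimated risks driver OOM,
    which is why a pessimistic (large) default statistic is the safe choice
    the comment above is defending.
    """
    if estimated_size_bytes is None:  # no statistics available: stay safe
        return "shuffle"
    if estimated_size_bytes <= threshold:
        return "broadcast"
    return "shuffle"
```

Lowering the default statistic makes more tables look broadcastable, which speeds up joins that really are small but destabilizes the driver whenever the estimate is wrong.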
[GitHub] spark pull request: [SPARK-5465] [SQL] Fixes filter push-down for ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4255
[GitHub] spark pull request: [SPARK-5498][SQL]fix bug when query the data w...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4289#issuecomment-72400374 [Test build #26484 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26484/consoleFull) for PR 4289 at commit [`b1527d5`](https://github.com/apache/spark/commit/b1527d58349ccdc0b986705b93d7658822211994). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5324][SQL] Results of describe can't be...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/4249#issuecomment-72400392 ok to test
[GitHub] spark pull request: [SPARK-5262] [SPARK-5244] [SQL] add coalesce i...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4057
[GitHub] spark pull request: [SPARK-5262] [SPARK-5244] [SQL] add coalesce i...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/4057#issuecomment-72400328 Thanks! Merged to master.
[GitHub] spark pull request: [SPARK-5515] Build fails with spark-ganglia-lg...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4303#issuecomment-72400202 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26478/
[GitHub] spark pull request: [SPARK-5515] Build fails with spark-ganglia-lg...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4303#issuecomment-72400197

[Test build #26478 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26478/consoleFull) for PR 4303 at commit [`5cf455f`](https://github.com/apache/spark/commit/5cf455f08eae005d48b8420d7aeec30520bd30df).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class Rating[@specialized(Int, Long) ID](user: ID, item: ID, rating: Float)`
  * `class StandardScalerModel (`
[GitHub] spark pull request: Disabling Utils.chmod700 for Windows
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4299#issuecomment-72400107 [Test build #26483 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26483/consoleFull) for PR 4299 at commit [`fe2740b`](https://github.com/apache/spark/commit/fe2740bef2320195a64fbaa7f29d6493cc6337c8). * This patch merges cleanly.
[GitHub] spark pull request: Disabling Utils.chmod700 for Windows
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/4299#issuecomment-72400044 ok to test
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user lianhuiwang commented on a diff in the pull request: https://github.com/apache/spark/pull/3976#discussion_r23905022

--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
~~~
@@ -134,12 +136,29 @@ object SparkSubmit {
       }
     }

+    val isYarnCluster = clusterManager == YARN && deployMode == CLUSTER
+
+    // Require all python files to be local, so we can add them to the PYTHONPATH
+    // when yarn-cluster, all python files can be non-local
+    if (args.isPython && !isYarnCluster) {
+      if (Utils.nonLocalPaths(args.primaryResource).nonEmpty) {
+        SparkSubmit.printErrorAndExit(
~~~
--- End diff --

If we move it to SparkSubmitArguments, we need to get clusterManager and deployMode. But this work has already been done in SparkSubmit, so there would be some repeated work.
[GitHub] spark pull request: [SPARK-5196][SQL] Support `comment` in Create ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3999
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/3976#issuecomment-72399690 @lianhuiwang what happens now if the submission node uses a different SPARK_HOME from the machines? Does it still work?
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3976#discussion_r23904936

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala ---
~~~
@@ -185,6 +192,7 @@ private[spark] class ClientArguments(args: Array[String], sparkConf: SparkConf)
       | --jar JAR_PATH           Path to your application's JAR file (required in yarn-cluster
       |                          mode)
       | --class CLASS_NAME       Name of your application's main class (required)
+      | --primary-py-file        A primary Python file
~~~
--- End diff --

same here
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3976#discussion_r23904930

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMasterArguments.scala ---
@@ -81,6 +91,9 @@ class ApplicationMasterArguments(val args: Array[String]) {
     |Options:
     |  --jar JAR_PATH       Path to your application's JAR file
     |  --class CLASS_NAME   Name of your application's main class
+    |  --primary-py-file    A primary Python file
--- End diff --

The main python file
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3976#discussion_r23904922

--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -134,12 +136,29 @@ object SparkSubmit {
     }
   }

+    val isYarnCluster = clusterManager == YARN && deployMode == CLUSTER
+
+    // Require all python files to be local, so we can add them to the PYTHONPATH
+    // when yarn-cluster, all python files can be non-local
+    if (args.isPython && !isYarnCluster) {
+      if (Utils.nonLocalPaths(args.primaryResource).nonEmpty) {
+        SparkSubmit.printErrorAndExit(
--- End diff --

Also, not a big deal but I actually think this check belongs better in `SparkSubmitArguments`.
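The check under discussion rejects non-local Python resources unless the app runs in yarn-cluster mode, where YARN distributes the files itself. A minimal Python sketch of that logic, for illustration only (Spark's actual `Utils.nonLocalPaths` is Scala, and the helper names here are hypothetical):

```python
from urllib.parse import urlparse

def non_local_paths(paths):
    """Return the subset of comma-separated paths whose URI scheme is not local.
    Illustrative stand-in for Spark's Utils.nonLocalPaths."""
    local_schemes = ("", "file", "local")
    return [p for p in paths.split(",") if urlparse(p).scheme not in local_schemes]

def check_python_resources(primary_resource, is_python, is_yarn_cluster):
    # Require all python files to be local so they can go on the PYTHONPATH;
    # yarn-cluster distributes files itself, so non-local paths are allowed there.
    if is_python and not is_yarn_cluster and non_local_paths(primary_resource):
        raise SystemExit("Only local python files are supported: " + primary_resource)
```

For example, `non_local_paths("app.py,hdfs://nn/app2.py")` flags only the HDFS path, so a client-mode submission with that primary resource would be rejected.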
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user lianhuiwang commented on a diff in the pull request: https://github.com/apache/spark/pull/3976#discussion_r23904929

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -430,6 +430,10 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments,
   private def startUserClass(): Thread = {
     logInfo("Starting the user JAR in a separate Thread")
--- End diff --

ok, i got it. i will update it. thanks.
[GitHub] spark pull request: [SPARK-5388] Provide a stable application subm...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4216#issuecomment-72399572

[Test build #26477 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26477/consoleFull) for PR 4216 at commit [`42e5de4`](https://github.com/apache/spark/commit/42e5de43c26806fb36aced9bf70e23e2eadbac41).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class MasterStateResponse(`
  * `class LocalSparkCluster(`
  * `* (4) the main class for the child`
  * `case class BoundPortsResponse(actorPort: Int, webUIPort: Int, restPort: Option[Int])`
  * `class DriverStatusRequest extends SubmitRestProtocolRequest`
  * `class DriverStatusResponse extends SubmitRestProtocolResponse`
  * `class ErrorResponse extends SubmitRestProtocolResponse`
  * `class KillDriverRequest extends SubmitRestProtocolRequest`
  * `class KillDriverResponse extends SubmitRestProtocolResponse`
  * `throw new SubmitRestMissingFieldException("Main class must be set in submit request.")`
  * `class SubmitDriverRequest extends SubmitRestProtocolRequest`
  * `class SubmitDriverResponse extends SubmitRestProtocolResponse`
  * `class SubmitRestProtocolException(message: String, cause: Exception = null)`
  * `class SubmitRestMissingFieldException(message: String) extends SubmitRestProtocolException(message)`
  * `abstract class SubmitRestProtocolMessage`
  * `abstract class SubmitRestProtocolRequest extends SubmitRestProtocolMessage`
  * `abstract class SubmitRestProtocolResponse extends SubmitRestProtocolMessage`
[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...
Github user koeninger commented on a diff in the pull request: https://github.com/apache/spark/pull/3798#discussion_r23904918

--- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaUtils.scala ---
@@ -144,4 +150,249 @@ object KafkaUtils {
     createStream[K, V, U, T](
       jssc.ssc, kafkaParams.toMap, Map(topics.mapValues(_.intValue()).toSeq: _*), storageLevel)
   }
+
+  /**
+   * A batch-oriented interface for consuming from Kafka.
+   * Starting and ending offsets are specified in advance,
+   * so that you can control exactly-once semantics.
+   * @param sc SparkContext object
+   * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
+   *   configuration parameters</a>.
+   *   Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
+   *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
+   * @param batch Each OffsetRange in the batch corresponds to a
+   *   range of offsets for a given Kafka topic/partition
+   */
+  @Experimental
+  def createRDD[
+    K: ClassTag,
+    V: ClassTag,
+    U <: Decoder[_]: ClassTag,
+    T <: Decoder[_]: ClassTag,
+    R: ClassTag] (
+      sc: SparkContext,
+      kafkaParams: Map[String, String],
+      batch: Array[OffsetRange]
+  ): RDD[(K, V)] with HasOffsetRanges = {
+    val messageHandler = (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message)
+    val kc = new KafkaCluster(kafkaParams)
+    val topics = batch.map(o => TopicAndPartition(o.topic, o.partition)).toSet
+    val leaderMap = kc.findLeaders(topics).fold(
+      errs => throw new SparkException(errs.mkString("\n")),
+      ok => ok
+    )
+    val rddParts = batch.zipWithIndex.map { case (o, i) =>
+      val tp = TopicAndPartition(o.topic, o.partition)
+      val (host, port) = leaderMap(tp)
+      new KafkaRDDPartition(i, o.topic, o.partition, o.fromOffset, o.untilOffset, host, port)
+    }.toArray
+    new KafkaRDD[K, V, U, T, (K, V)](sc, kafkaParams, rddParts, messageHandler)
+  }
+
+  /**
+   * A batch-oriented interface for consuming from Kafka.
+   * Starting and ending offsets are specified in advance,
+   * so that you can control exactly-once semantics.
+   * @param sc SparkContext object
+   * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
+   *   configuration parameters</a>.
+   *   Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
+   *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
+   * @param batch Each OffsetRange in the batch corresponds to a
+   *   range of offsets for a given Kafka topic/partition
+   * @param leaders Kafka leaders for each offset range in batch
+   * @param messageHandler function for translating each message into the desired type
+   */
+  @Experimental
+  def createRDD[
+    K: ClassTag,
+    V: ClassTag,
+    U <: Decoder[_]: ClassTag,
+    T <: Decoder[_]: ClassTag,
+    R: ClassTag] (
+      sc: SparkContext,
+      kafkaParams: Map[String, String],
+      batch: Array[OffsetRange],
+      leaders: Array[Leader],
+      messageHandler: MessageAndMetadata[K, V] => R
+  ): RDD[R] with HasOffsetRanges = {
+    val leaderMap = leaders.map(l => (l.topic, l.partition) -> (l.host, l.port)).toMap
+    val rddParts = batch.zipWithIndex.map { case (o, i) =>
+      val (host, port) = leaderMap((o.topic, o.partition))
+      new KafkaRDDPartition(i, o.topic, o.partition, o.fromOffset, o.untilOffset, host, port)
+    }.toArray
+
+    new KafkaRDD[K, V, U, T, R](sc, kafkaParams, rddParts, messageHandler)
+  }
+
+  /**
+   * This stream can guarantee that each message from Kafka is included in transformations
+   * (as opposed to output actions) exactly once, even in most failure situations.
+   *
+   * Points to note:
+   *
+   * Failure Recovery - You must checkpoint this stream, or save offsets yourself and provide them
+   * as the fromOffsets parameter on restart.
+   * Kafka must have sufficient log retention to obtain messages after failure.
+   *
+   * Getting offsets from the stream - see programming guide
+   *
+   * Zookeeper - This does not use Zookeeper to store offsets. For interop with Kafka monitors
+   * that depend on Zookeeper, you must store offsets in ZK yourself.
+   *
+   * End-to-end semantics - This does not guarantee that any output operation will push each record
+   * exactly once. To ensure end-to-end exactly-once semantics (that is, receiving exactly once and
+   * outputting exactly once), you have to either ensure that the output operation is
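The second `createRDD` overload above takes the leaders from the caller and builds one RDD partition per offset range by zipping the ranges with their index and looking up each range's leader. A minimal Python sketch of just that zip-and-lookup step, with plain tuples standing in for the Scala case classes (`OffsetRange`, `Leader`, `KafkaRDDPartition`); the names here are illustrative, not Spark API:

```python
def build_partitions(ranges, leaders):
    """ranges:  list of (topic, partition, from_offset, until_offset)
    leaders: list of (topic, partition, host, port)
    Returns one partition descriptor per offset range, preserving range order,
    mirroring the zipWithIndex/leaderMap step in the createRDD overload above."""
    # Index leaders by (topic, partition), like leaders.map(...).toMap
    leader_map = {(t, p): (host, port) for (t, p, host, port) in leaders}
    parts = []
    for i, (topic, part, from_off, until_off) in enumerate(ranges):
        # KeyError here corresponds to a missing leader for a range
        host, port = leader_map[(topic, part)]
        parts.append((i, topic, part, from_off, until_off, host, port))
    return parts
```

The point of the design is that the caller controls exactly which offsets each partition covers, which is what makes the batch replayable and hence compatible with exactly-once processing.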
[GitHub] spark pull request: [SPARK-5388] Provide a stable application subm...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4216#issuecomment-72399576 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26477/ Test PASSed.