[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19186
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81762/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19186
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19227: [SPARK-20060][CORE] Support accessing secure Hadoop clus...

2017-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19227
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81750/
Test PASSed.


---




[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19186
  
**[Test build #81762 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81762/testReport)** for PR 19186 at commit [`3f11c67`](https://github.com/apache/spark/commit/3f11c67630dfc5402e49d7bf43d1ce9a31b400da).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19227: [SPARK-20060][CORE] Support accessing secure Hadoop clus...

2017-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19227
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19227: [SPARK-20060][CORE] Support accessing secure Hadoop clus...

2017-09-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19227
  
**[Test build #81750 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81750/testReport)** for PR 19227 at commit [`2b3d2f2`](https://github.com/apache/spark/commit/2b3d2f24f94a1cee63fff9733b27f479673d7a90).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19185: [Spark-21854] Added LogisticRegressionTrainingSum...

2017-09-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19185


---




[GitHub] spark pull request #19215: [MINOR][SQL] Only populate type metadata for requ...

2017-09-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19215


---




[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19186
  
**[Test build #81762 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81762/testReport)** for PR 19186 at commit [`3f11c67`](https://github.com/apache/spark/commit/3f11c67630dfc5402e49d7bf43d1ce9a31b400da).


---




[GitHub] spark issue #19215: [MINOR][SQL] Only populate type metadata for required ty...

2017-09-13 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/19215
  
many thanks @gatorsmile 


---




[GitHub] spark pull request #19130: [SPARK-21917][CORE][YARN] Supporting adding http(...

2017-09-13 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19130#discussion_r138801682
  
--- Diff: core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala ---
@@ -897,6 +897,80 @@ class SparkSubmitSuite
     sysProps("spark.submit.pyFiles") should (startWith("/"))
   }
 
+  test("handle remote http(s) resources in yarn mode") {
+    val hadoopConf = new Configuration()
+    updateConfWithFakeS3Fs(hadoopConf)
+
+    val tmpDir = Utils.createTempDir()
+    val mainResource = File.createTempFile("tmpPy", ".py", tmpDir)
+    val tmpJar = TestUtils.createJarWithFiles(Map("test.resource" -> "USER"), tmpDir)
+    val tmpJarPath = s"s3a://${new File(tmpJar.toURI).getAbsolutePath}"
+    // This assumes UT environment could access external network.
--- End diff --

Yes, that's my concern, let me think out another way to handle this.


---




[GitHub] spark pull request #19130: [SPARK-21917][CORE][YARN] Supporting adding http(...

2017-09-13 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19130#discussion_r138801550
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -367,6 +368,53 @@ object SparkSubmit extends CommandLineUtils with Logging {
       }.orNull
     }
 
+    // When running in YARN cluster manager,
+    if (clusterManager == YARN) {
+      sparkConf.setIfMissing(SecurityManager.SPARK_AUTH_SECRET_CONF, "unused")
+      val secMgr = new SecurityManager(sparkConf)
+      val forceDownloadSchemes = sparkConf.get(FORCE_DOWNLOAD_SCHEMES)
+
+      // Check the scheme list provided by "spark.yarn.dist.forceDownloadSchemes" to see if current
+      // resource's scheme is included in this list, or Hadoop FileSystem doesn't support current
+      // scheme, if so Spark will download the resources to local disk and upload to Hadoop FS.
+      def shouldDownload(scheme: String): Boolean = {
+        val isFsAvailable = Try { FileSystem.getFileSystemClass(scheme, hadoopConf) }
+          .map(_ => true).getOrElse(false)
+        forceDownloadSchemes.contains(scheme) || !isFsAvailable
+      }
+
+      def downloadResource(resource: String): String = {
+        val uri = Utils.resolveURI(resource)
+        uri.getScheme match {
+          case "local" | "file" => resource
+          case e if shouldDownload(e) =>
+            if (deployMode == CLIENT) {
+              // In client mode, we already download the resources, so figuring out the local one
+              // should be enough.
+              val fileName = new Path(uri).getName
+              new File(targetDir, fileName).toURI.toString
+            } else {
+              downloadFile(resource, targetDir, sparkConf, hadoopConf, secMgr)
+            }
+          case _ => uri.toString
+        }
+      }
+
+      args.primaryResource = Option(args.primaryResource).map { downloadResource }.orNull
+      args.files = Option(args.files).map { files =>
+        files.split(",").map(_.trim).filter(_.nonEmpty).map { downloadResource }.mkString(",")
+      }.orNull
+      args.pyFiles = Option(args.pyFiles).map { files =>
+        files.split(",").map(_.trim).filter(_.nonEmpty).map { downloadResource }.mkString(",")
+      }.orNull
+      args.jars = Option(args.jars).map { files =>
+        files.split(",").map(_.trim).filter(_.nonEmpty).map { downloadResource }.mkString(",")
+      }.orNull
+      args.archives = Option(args.archives).map { files =>
+        files.split(",").map(_.trim).filter(_.nonEmpty).map { downloadResource }.mkString(",")
+      }.orNull
--- End diff --

From the code, `--files` and `--jars` have overwritten `spark.yarn.*` for a long time, AFAIK. I think we should make `spark.yarn.*` internal configurations to reduce the discrepancy.


---




[GitHub] spark issue #19215: [MINOR][SQL] Only populate type metadata for required ty...

2017-09-13 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19215
  
Thanks! Merged to master.


---




[GitHub] spark issue #19215: [MINOR][SQL] Only populate type metadata for required ty...

2017-09-13 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19215
  
LGTM


---




[GitHub] spark issue #19185: [Spark-21854] Added LogisticRegressionTrainingSummary fo...

2017-09-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19185
  
**[Test build #81757 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81757/testReport)** for PR 19185 at commit [`6529fa6`](https://github.com/apache/spark/commit/6529fa6ecb7d607d3b38e68c8007bc22d9e27907).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19185: [Spark-21854] Added LogisticRegressionTrainingSummary fo...

2017-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19185
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19185: [Spark-21854] Added LogisticRegressionTrainingSummary fo...

2017-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19185
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81757/
Test PASSed.


---




[GitHub] spark issue #19223: [SPARK-21513][SQL][FOLLOWUP] Allow UDF to_json support c...

2017-09-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19223
  
**[Test build #81761 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81761/testReport)** for PR 19223 at commit [`158140e`](https://github.com/apache/spark/commit/158140e2b9c4adc8906dd25d9ec9fe37306b8436).


---




[GitHub] spark issue #19135: [SPARK-21923][CORE]Avoid calling reserveUnrollMemoryForT...

2017-09-13 Thread ConeyLiu
Github user ConeyLiu commented on the issue:

https://github.com/apache/spark/pull/19135
  
Hi @jerryshao, thanks for your review.

> So it somehow reflects that CPU core contention is the main issue for memory pre-occupation

I have modified the code. It no longer requests more memory; it just reduces the number of calls to `reserveUnrollMemoryForThisTask`, following @cloud-fan's comments. The method is also the same as `putIteratorAsValues`.

Yeah, its impact will be small with few cores. In the above test results, it doesn't introduce any regressions, and it is also better with many cores. For machine learning, we need to cache the source data to OFF_HEAP in order to reduce the GC problem.

As for the configuration, I think it may differ across application scenarios.


---




[GitHub] spark pull request #19231: [SPARK-22002][SQL] Read JDBC table use custom sch...

2017-09-13 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19231#discussion_r138800677
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala ---
@@ -993,7 +996,10 @@ class JDBCSuite extends SparkFunSuite
       Seq(StructField("NAME", StringType, true), StructField("THEID", IntegerType, true)))
     val df = sql("select * from people_view")
     assert(df.schema.size === 2)
-    assert(df.schema === schema)
--- End diff --

Revert this change.

Then change the following line:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L309
to
```
fields(i) = StructField(columnName, columnType, nullable)
```

You will also need to update some test cases due to the above change, I think.


---




[GitHub] spark pull request #19223: [SPARK-21513][SQL][FOLLOWUP] Allow UDF to_json su...

2017-09-13 Thread goldmedal
Github user goldmedal commented on a diff in the pull request:

https://github.com/apache/spark/pull/19223#discussion_r138800321
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1921,10 +1921,12 @@ def from_json(col, schema, options={}):
 @since(2.1)
 def to_json(col, options={}):
     """
-    Converts a column containing a [[StructType]] or [[ArrayType]] of [[StructType]]s into a
-    JSON string. Throws an exception, in the case of an unsupported type.
+    Converts a column containing a [[StructType]], [[ArrayType]] of [[StructType]]s,
+    a [[MapType]] or [[ArrayType]] of [[MapType]] into a JSON string.
+    Throws an exception, in the case of an unsupported type.
--- End diff --

ok Thanks.


---




[GitHub] spark pull request #19223: [SPARK-21513][SQL][FOLLOWUP] Allow UDF to_json su...

2017-09-13 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19223#discussion_r13870
  
--- Diff: sql/core/src/test/resources/sql-tests/results/json-functions.sql.out ---
@@ -26,13 +26,13 @@ Extended Usage:
 {"time":"26/08/2015"}
   > SELECT to_json(array(named_struct('a', 1, 'b', 2));
 [{"a":1,"b":2}]
-  > SELECT to_json(map('a',named_struct('b',1)));
+  > SELECT to_json(map('a', named_struct('b', 1)));
--- End diff --

Oh. I see.


---




[GitHub] spark pull request #19223: [SPARK-21513][SQL][FOLLOWUP] Allow UDF to_json su...

2017-09-13 Thread goldmedal
Github user goldmedal commented on a diff in the pull request:

https://github.com/apache/spark/pull/19223#discussion_r138799591
  
--- Diff: R/pkg/R/functions.R ---
@@ -1715,7 +1717,15 @@ setMethod("to_date",
 #'
 #' # Converts an array of structs into a JSON array
 #' df2 <- sql("SELECT array(named_struct('name', 'Bob'), named_struct('name', 'Alice')) as people")
-#' df2 <- mutate(df2, people_json = to_json(df2$people))}
+#' df2 <- mutate(df2, people_json = to_json(df2$people))
+#'
+#' # Converts a map into a JSON object
+#' df2 <- sql("SELECT map('name', 'Bob') as people")
+#' df2 <- mutate(df2, people_json = to_json(df2$people))
+#'
+#' # Converts an array of maps into a JSON array
+#' df2 <- sql("SELECT array(map('name', 'Bob'), map('name', 'Alice')) as people")
+#' df2 <- mutate(df2, people_json = to_json(df2$people))
--- End diff --

ok  Thanks for careful review :)


---




[GitHub] spark pull request #19223: [SPARK-21513][SQL][FOLLOWUP] Allow UDF to_json su...

2017-09-13 Thread goldmedal
Github user goldmedal commented on a diff in the pull request:

https://github.com/apache/spark/pull/19223#discussion_r138799483
  
--- Diff: sql/core/src/test/resources/sql-tests/results/json-functions.sql.out ---
@@ -26,13 +26,13 @@ Extended Usage:
 {"time":"26/08/2015"}
   > SELECT to_json(array(named_struct('a', 1, 'b', 2));
 [{"a":1,"b":2}]
-  > SELECT to_json(map('a',named_struct('b',1)));
+  > SELECT to_json(map('a', named_struct('b', 1)));
--- End diff --

Umm, I modified the `ExpressionDescription` of `StructsToJson` per @HyukjinKwon's suggestions, which weren't merged in the last PR. This is the test for `describe function extended to_json`, so I needed to regenerate the golden file for it. So this change isn't from `json-functions.sql`.


---




[GitHub] spark issue #19226: [SPARK-21985][PySpark] PairDeserializer is broken for do...

2017-09-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19226
  
**[Test build #81760 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81760/testReport)** for PR 19226 at commit [`e99ed23`](https://github.com/apache/spark/commit/e99ed23ffa887311b8c77d57733ff005d6987bdb).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19226: [SPARK-21985][PySpark] PairDeserializer is broken for do...

2017-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19226
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81760/
Test PASSed.


---




[GitHub] spark issue #19226: [SPARK-21985][PySpark] PairDeserializer is broken for do...

2017-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19226
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #19230: [SPARK-22003][SQL] support array column in vector...

2017-09-13 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19230#discussion_r138799219
  
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java ---
@@ -99,73 +100,18 @@ public ArrayData copy() {
     @Override
     public Object[] array() {
       DataType dt = data.dataType();
+      Function<Integer, Object> getAtMethod = (Function<Integer, Object>) i -> get(i, dt);
       Object[] list = new Object[length];
-
-      if (dt instanceof BooleanType) {
-        for (int i = 0; i < length; i++) {
-          if (!data.isNullAt(offset + i)) {
-            list[i] = data.getBoolean(offset + i);
-          }
-        }
-      } else if (dt instanceof ByteType) {
-        for (int i = 0; i < length; i++) {
-          if (!data.isNullAt(offset + i)) {
-            list[i] = data.getByte(offset + i);
-          }
-        }
-      } else if (dt instanceof ShortType) {
-        for (int i = 0; i < length; i++) {
-          if (!data.isNullAt(offset + i)) {
-            list[i] = data.getShort(offset + i);
-          }
-        }
-      } else if (dt instanceof IntegerType) {
-        for (int i = 0; i < length; i++) {
-          if (!data.isNullAt(offset + i)) {
-            list[i] = data.getInt(offset + i);
-          }
-        }
-      } else if (dt instanceof FloatType) {
-        for (int i = 0; i < length; i++) {
-          if (!data.isNullAt(offset + i)) {
-            list[i] = data.getFloat(offset + i);
-          }
-        }
-      } else if (dt instanceof DoubleType) {
+      try {
         for (int i = 0; i < length; i++) {
           if (!data.isNullAt(offset + i)) {
-            list[i] = data.getDouble(offset + i);
+            list[i] = getAtMethod.call(i);
           }
         }
-      } else if (dt instanceof LongType) {
-        for (int i = 0; i < length; i++) {
-          if (!data.isNullAt(offset + i)) {
-            list[i] = data.getLong(offset + i);
-          }
-        }
-      } else if (dt instanceof DecimalType) {
-        DecimalType decType = (DecimalType)dt;
-        for (int i = 0; i < length; i++) {
-          if (!data.isNullAt(offset + i)) {
-            list[i] = getDecimal(i, decType.precision(), decType.scale());
-          }
-        }
-      } else if (dt instanceof StringType) {
-        for (int i = 0; i < length; i++) {
-          if (!data.isNullAt(offset + i)) {
-            list[i] = getUTF8String(i).toString();
--- End diff --

This looks suspicious. Why did we get a `String` before? It seems we should get a `UTF8String`.
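The shape of the refactoring in the diff above, stated independently of Spark: bind a type-specific accessor once, then run a single generic null-checking loop, instead of duplicating the loop once per element type. A minimal sketch with hypothetical names:

```python
def materialize(length, is_null, get_at):
    """One generic loop; get_at encapsulates the per-type element access that
    the old code repeated in a separate loop for each supported type."""
    out = [None] * length
    for i in range(length):
        if not is_null[i]:
            out[i] = get_at(i)
    return out

# The accessor is chosen once up front (here: integer access into a plain list),
# mirroring how the diff binds getAtMethod from the column's DataType.
ints = [10, 20, 30]
result = materialize(3, [False, True, False], lambda i: ints[i])
# result == [10, None, 30]
```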


---




[GitHub] spark pull request #19231: [SPARK-22002][SQL] Read JDBC table use custom sch...

2017-09-13 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19231#discussion_r138797882
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1333,7 +1333,7 @@ the following case-insensitive options:
 
 customSchema
 
- The custom schema to use for reading data from JDBC connectors. For example, "id DECIMAL(38, 0), name STRING"). The column names should be identical to the corresponding column names of JDBC table. Users can specify the corresponding data types of Spark SQL instead of using the defaults. This option applies only to reading.
+ The custom schema to use for reading data from JDBC connectors. For example, "id DECIMAL(38, 0), name STRING". You can also specify partial fields, others use default values. For example, "id DECIMAL(38, 0)". The column names should be identical to the corresponding column names of JDBC table. Users can specify the corresponding data types of Spark SQL instead of using the defaults. This option applies only to reading.
--- End diff --

`others` -> `and the others use the default type mapping`


---




[GitHub] spark pull request #19188: [SPARK-21973][SQL] Add an new option to filter qu...

2017-09-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19188


---




[GitHub] spark pull request #19226: [SPARK-21985][PySpark] PairDeserializer is broken...

2017-09-13 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19226#discussion_r138797273
  
--- Diff: python/pyspark/serializers.py ---
@@ -343,6 +343,8 @@ def _load_stream_without_unbatching(self, stream):
         key_batch_stream = self.key_ser._load_stream_without_unbatching(stream)
         val_batch_stream = self.val_ser._load_stream_without_unbatching(stream)
         for (key_batch, val_batch) in zip(key_batch_stream, val_batch_stream):
+            key_batch = list(key_batch)
+            val_batch = list(val_batch)
--- End diff --

Should we fix the doc in `Serializer._load_stream_without_unbatching` to say it returns an iterator of iterables?
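The failure mode this patch addresses can be shown without Spark at all: if a batch is a one-shot iterator, zipping over it more than once silently yields nothing, whereas materializing it as a list (as the two added lines do) keeps it reusable. A standalone sketch with plain generators:

```python
def batches():
    # A generator of one-shot iterators: each inner iterator can only be consumed once.
    yield iter([1, 2, 3])

batch = next(batches())
first_pass = list(zip(batch, "abc"))   # pairs up all three elements
second_pass = list(zip(batch, "abc"))  # empty: the iterator is already spent

# Materializing the batch as a list, as the patch does, makes it reusable.
batch = list(next(batches()))
third_pass = list(zip(batch, "abc"))
fourth_pass = list(zip(batch, "abc"))  # identical to third_pass
```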


---




[GitHub] spark pull request #19188: [SPARK-21973][SQL] Add an new option to filter qu...

2017-09-13 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19188#discussion_r138797271
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/TPCDSQueryBenchmarkArguments.scala ---
@@ -29,7 +33,11 @@ class TPCDSQueryBenchmarkArguments(val args: Array[String]) {
     while(args.nonEmpty) {
       args match {
         case ("--data-location") :: value :: tail =>
-          dataLocation = value
+          dataLocation = value.toLowerCase(Locale.ROOT)
--- End diff --

ok


---




[GitHub] spark pull request #19226: [SPARK-21985][PySpark] PairDeserializer is broken...

2017-09-13 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19226#discussion_r138797113
  
--- Diff: python/pyspark/tests.py ---
@@ -644,6 +644,18 @@ def test_cartesian_chaining(self):
 set([(x, (y, y)) for x in range(10) for y in range(10)])
 )
 
+def test_zip_chaining(self):
+# Tests for SPARK-21985
+rdd = self.sc.parallelize(range(10), 2)
--- End diff --

This test case already passes, doesn't it?


---




[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-13 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19229
  
@zhengruifeng Yeah, it is better. Actually the difference between running 
multiple `withColumn`  and one `withColumns` is mainly in the cost of query 
analysis and plan/dataset initialization. I will re-run the benchmark.
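A toy model (not Spark code) of the cost difference under discussion: if every transformation call pays one analysis/initialization pass over the plan, then n chained `withColumn` calls pay n passes while a single batched `withColumns`-style call pays one. All names below are illustrative:

```python
class Plan:
    """Toy stand-in for a DataFrame plan; each transformation triggers one analysis pass."""
    analyses = 0  # global counter of analysis passes, for illustration

    def __init__(self, columns):
        self.columns = list(columns)

    def _analyze(self):
        Plan.analyses += 1  # stand-in for query analysis + plan/dataset initialization

    def with_column(self, name):
        self._analyze()
        return Plan(self.columns + [name])

    def with_columns(self, names):
        self._analyze()
        return Plan(self.columns + list(names))

df = Plan(["a"])
for c in ["x1", "x2", "x3", "x4"]:
    df = df.with_column(c)                    # four separate analysis passes
one_shot = Plan(["a"]).with_columns(["x1", "x2", "x3", "x4"])  # one pass
# Plan.analyses is now 5: four from the loop plus one from with_columns.
```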


---




[GitHub] spark issue #19188: [SPARK-21973][SQL] Add an new option to filter queries i...

2017-09-13 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19188
  
LGTM


---




[GitHub] spark issue #19188: [SPARK-21973][SQL] Add an new option to filter queries i...

2017-09-13 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19188
  
Merging to master.


---




[GitHub] spark pull request #19188: [SPARK-21973][SQL] Add an new option to filter qu...

2017-09-13 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19188#discussion_r138796870
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/TPCDSQueryBenchmarkArguments.scala ---
@@ -29,7 +33,11 @@ class TPCDSQueryBenchmarkArguments(val args: Array[String]) {
     while(args.nonEmpty) {
       args match {
         case ("--data-location") :: value :: tail =>
-          dataLocation = value
+          dataLocation = value.toLowerCase(Locale.ROOT)
--- End diff --

I am not sure about that one. 


---




[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-13 Thread zhengruifeng
Github user zhengruifeng commented on the issue:

https://github.com/apache/spark/pull/19229
  
In the test code, should we use `model.transform(df).count` instead?


---




[GitHub] spark issue #19230: [SPARK-22003][SQL] support array column in vectorized re...

2017-09-13 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19230
  
Add a test for it?


---




[GitHub] spark issue #19226: [SPARK-21985][PySpark] PairDeserializer is broken for do...

2017-09-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19226
  
**[Test build #81760 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81760/testReport)** for PR 19226 at commit [`e99ed23`](https://github.com/apache/spark/commit/e99ed23ffa887311b8c77d57733ff005d6987bdb).


---




[GitHub] spark issue #19231: [SPARK-22002][SQL] Read JDBC table use custom schema sup...

2017-09-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19231
  
**[Test build #81758 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81758/testReport)** for PR 19231 at commit [`9e7a8a4`](https://github.com/apache/spark/commit/9e7a8a471835d5e93a729c15d166451e79567447).


---




[GitHub] spark issue #19230: [SPARK-22003][SQL] support array column in vectorized re...

2017-09-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19230
  
**[Test build #81759 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81759/testReport)** for PR 19230 at commit [`adbaeab`](https://github.com/apache/spark/commit/adbaeabf18ee1f96611ecbd6ee627bc0a457289d).


---




[GitHub] spark pull request #19230: [SPARK-22003][SQL] support array column in vector...

2017-09-13 Thread liufengdb
GitHub user liufengdb opened a pull request:

https://github.com/apache/spark/pull/19230

[SPARK-22003][SQL] support array column in vectorized reader with UDF

## What changes were proposed in this pull request?

The UDF needs to deserialize the `UnsafeRow`. When the column type is Array, the `get` method of the `ColumnVector` used by the vectorized reader is called, but that method is not implemented.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liufengdb/spark fix_array_open

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19230.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19230


commit adbaeabf18ee1f96611ecbd6ee627bc0a457289d
Author: Feng Liu 
Date:   2017-09-12T21:56:55Z

init




---




[GitHub] spark pull request #19231: [SPARK-22002][SQL] Read JDBC table use custom sch...

2017-09-13 Thread wangyum
GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/19231

[SPARK-22002][SQL] Read JDBC table use custom schema support specify 
partial fields.

## What changes were proposed in this pull request?

https://github.com/apache/spark/pull/18266 added a new feature to support 
reading a JDBC table with a custom schema, but all the fields must be 
specified. For simplicity, this PR supports specifying only a subset of the 
fields.
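The intended semantics can be sketched in plain Python (a hypothetical helper, not Spark code): types the user specifies override the inferred ones, and columns left unspecified keep their JDBC-inferred type.

```python
def merge_custom_schema(inferred, custom):
    """Overlay a partial user-specified schema on an inferred one.

    inferred / custom: mapping of column name -> type string.
    Columns absent from `custom` keep their inferred type.
    """
    unknown = set(custom) - set(inferred)
    if unknown:
        raise ValueError(f"columns not found in table: {sorted(unknown)}")
    return {col: custom.get(col, typ) for col, typ in inferred.items()}


inferred = {"id": "DECIMAL(38,0)", "name": "VARCHAR(255)", "age": "INT"}
print(merge_custom_schema(inferred, {"id": "STRING"}))
# {'id': 'STRING', 'name': 'VARCHAR(255)', 'age': 'INT'}
```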

## How was this patch tested?
unit tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark SPARK-22002

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19231.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19231


commit 9e7a8a471835d5e93a729c15d166451e79567447
Author: Yuming Wang 
Date:   2017-09-14T04:26:46Z

Read JDBC table use custom schema support specify partial fields.




---




[GitHub] spark issue #19216: [SPARK-21990][SQL] QueryPlanConstraints misses some cons...

2017-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19216
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81749/
Test PASSed.


---




[GitHub] spark issue #19216: [SPARK-21990][SQL] QueryPlanConstraints misses some cons...

2017-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19216
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19216: [SPARK-21990][SQL] QueryPlanConstraints misses some cons...

2017-09-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19216
  
**[Test build #81749 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81749/testReport)**
 for PR 19216 at commit 
[`e4cffda`](https://github.com/apache/spark/commit/e4cffda91cf9ab3673e12f1427ad1d02c5e5b71e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19185: [Spark-21854] Added LogisticRegressionTrainingSummary fo...

2017-09-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19185
  
**[Test build #81757 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81757/testReport)**
 for PR 19185 at commit 
[`6529fa6`](https://github.com/apache/spark/commit/6529fa6ecb7d607d3b38e68c8007bc22d9e27907).


---




[GitHub] spark pull request #19223: [SPARK-21513][SQL][FOLLOWUP] Allow UDF to_json su...

2017-09-13 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19223#discussion_r138795133
  
--- Diff: 
sql/core/src/test/resources/sql-tests/results/json-functions.sql.out ---
@@ -26,13 +26,13 @@ Extended Usage:
{"time":"26/08/2015"}
   > SELECT to_json(array(named_struct('a', 1, 'b', 2));
[{"a":1,"b":2}]
-  > SELECT to_json(map('a',named_struct('b',1)));
+  > SELECT to_json(map('a', named_struct('b', 1)));
--- End diff --

Or did you forget to commit `json-functions.sql`?


---




[GitHub] spark pull request #19223: [SPARK-21513][SQL][FOLLOWUP] Allow UDF to_json su...

2017-09-13 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19223#discussion_r138795006
  
--- Diff: 
sql/core/src/test/resources/sql-tests/results/json-functions.sql.out ---
@@ -26,13 +26,13 @@ Extended Usage:
{"time":"26/08/2015"}
   > SELECT to_json(array(named_struct('a', 1, 'b', 2));
[{"a":1,"b":2}]
-  > SELECT to_json(map('a',named_struct('b',1)));
+  > SELECT to_json(map('a', named_struct('b', 1)));
--- End diff --

I think you committed an unrelated change?


---




[GitHub] spark issue #19213: [SPARK-17642] [SQL] [FOLLOWUP] drop test tables and impr...

2017-09-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19213
  
**[Test build #81747 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81747/testReport)**
 for PR 19213 at commit 
[`d922c85`](https://github.com/apache/spark/commit/d922c85fe6e462df122450ed015c0a7e722d2e2c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-13 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19229
  
FYI, the `withColumns` API was proposed in #17819.


---




[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-13 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19229
  
Ran a benchmark similar to 
https://github.com/apache/spark/pull/18902#issuecomment-321727416:



numColumns | Old Mean | Old Median | New Mean | New Median
-- | -- | -- | -- | --
1 | 0.1290674059002 | 0.087246649 | 0.1263591766 | 0.05826856929996
10 | 0.4222436709003 | 0.2957120874 | 0.1382999133002 | 0.0752307166
100 | 6.93127441728 | 7.2270134943 | 0.3018686074 | 0.2554692345

The test code is basically the same, but now measures transform time:

import org.apache.spark.ml.feature._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import spark.implicits._
import scala.util.Random

val seed = 123l
val random = new Random(seed)
val n = 1
val m = 100
val rows = sc.parallelize(1 to n).map(i => Row(Array.fill(m)(random.nextDouble): _*))
val struct = new StructType(Array.range(0, m, 1).map(i => StructField(s"c$i", DoubleType, true)))
val df = spark.createDataFrame(rows, struct)
df.persist()
df.count()

for (strategy <- Seq("mean", "median"); k <- Seq(1,10,100)) {
  val imputer = new Imputer()
    .setStrategy(strategy)
    .setInputCols(Array.range(0, k, 1).map(i => s"c$i"))
    .setOutputCols(Array.range(0, k, 1).map(i => s"o$i"))
  var duration = 0.0
  for (i <- 0 until 10) {
val model = imputer.fit(df)
val start = System.nanoTime()
model.transform(df)
val end = System.nanoTime()
duration += (end - start) / 1e9
  }
  println((strategy, k, duration/10))
}


---




[GitHub] spark pull request #19222: [SPARK-10399][CORE][SQL] Introduce multiple Memor...

2017-09-13 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19222#discussion_r138794370
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/hash/Murmur3_x86_32.java ---
@@ -59,6 +60,18 @@ public static int hashUnsafeWords(Object base, long 
offset, int lengthInBytes, i
 return fmix(h1, lengthInBytes);
   }
 
+  public static int hashUnsafeBytes(MemoryBlock base, long offset, int 
lengthInBytes, int seed) {
--- End diff --

It makes sense. Would it be better to add the postfix `MB` to the other 
version of the method?


---




[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19229
  
**[Test build #81756 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81756/testReport)**
 for PR 19229 at commit 
[`4b47709`](https://github.com/apache/spark/commit/4b477093737e9d9fae16c82836e421b5e0e7c63e).


---




[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-13 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19229
  
cc @MLnick @zhengruifeng @yanboliang 


---




[GitHub] spark issue #19213: [SPARK-17642] [SQL] [FOLLOWUP] drop test tables and impr...

2017-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19213
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81747/
Test PASSed.


---




[GitHub] spark issue #19211: [SPARK-18838][core] Add separate listener queues to Live...

2017-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19211
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #19222: [SPARK-10399][CORE][SQL] Introduce multiple Memor...

2017-09-13 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19222#discussion_r138794058
  
--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java 
---
@@ -75,67 +76,131 @@ public static boolean unaligned() {
 return unaligned;
   }
 
+  public static int getInt(MemoryBlock object, long offset) {
--- End diff --

Do you want to move them (i.e. methods with `MemoryBlock` argument)  into 
`unsafe/memory/MemoryBlock.java`?


---




[GitHub] spark issue #19213: [SPARK-17642] [SQL] [FOLLOWUP] drop test tables and impr...

2017-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19213
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #19223: [SPARK-21513][SQL][FOLLOWUP] Allow UDF to_json su...

2017-09-13 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/19223#discussion_r138794052
  
--- Diff: R/pkg/R/functions.R ---
@@ -1715,7 +1717,15 @@ setMethod("to_date",
 #'
 #' # Converts an array of structs into a JSON array
 #' df2 <- sql("SELECT array(named_struct('name', 'Bob'), 
named_struct('name', 'Alice')) as people")
-#' df2 <- mutate(df2, people_json = to_json(df2$people))}
+#' df2 <- mutate(df2, people_json = to_json(df2$people))
+#'
+#' # Converts a map into a JSON object
+#' df2 <- sql("SELECT map('name', 'Bob')) as people")
+#' df2 <- mutate(df2, people_json = to_json(df2$people))
+#'
+#' # Converts an array of maps into a JSON array
+#' df2 <- sql("SELECT array(map('name', 'Bob'), map('name', 'Alice')) as 
people")
+#' df2 <- mutate(df2, people_json = to_json(df2$people))
--- End diff --

... meaning `}`


---




[GitHub] spark issue #19211: [SPARK-18838][core] Add separate listener queues to Live...

2017-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19211
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81744/
Test PASSed.


---




[GitHub] spark issue #19211: [SPARK-18838][core] Add separate listener queues to Live...

2017-09-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19211
  
**[Test build #81744 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81744/testReport)**
 for PR 19211 at commit 
[`20b8382`](https://github.com/apache/spark/commit/20b83826a70ac8574e289db9fdcae37c305c01bd).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19229
  
**[Test build #81755 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81755/testReport)**
 for PR 19229 at commit 
[`4efb643`](https://github.com/apache/spark/commit/4efb64374b7c93bae3e9b0d2fc0ebc4f5ad1e1d5).


---




[GitHub] spark issue #19228: [SPARK-21985][PYTHON] Fix zip-chained RDD to work

2017-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19228
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81753/
Test PASSed.


---




[GitHub] spark issue #19228: [SPARK-21985][PYTHON] Fix zip-chained RDD to work

2017-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19228
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19228: [SPARK-21985][PYTHON] Fix zip-chained RDD to work

2017-09-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19228
  
**[Test build #81753 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81753/testReport)**
 for PR 19228 at commit 
[`0703b67`](https://github.com/apache/spark/commit/0703b67405fa721230af80421509a55eb88c5763).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19229: [SPARK-22001][ML][SQL] ImputerModel can do withCo...

2017-09-13 Thread viirya
GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/19229

[SPARK-22001][ML][SQL] ImputerModel can do withColumn for all input columns 
at one pass

## What changes were proposed in this pull request?

SPARK-21690 made `Imputer` one-pass by parallelizing the computation over 
all input columns. However, when we transform a dataset with `ImputerModel`, 
we still call `withColumn` on each input column sequentially. We can instead 
do this for all input columns at once by adding a `withColumns` API to 
`Dataset`.

The new `withColumns` API is for internal use only for now.

## How was this patch tested?

Existing tests cover the `ImputerModel` change. Added tests for the new 
`withColumns` API.
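The difference can be sketched in plain Python. This is a toy model of query-plan growth, not Spark's `Dataset`: calling `with_column` once per output column stacks one projection node per call, while a `with_columns` variant adds a single node covering all of them.

```python
class Plan:
    """Toy logical plan: each projection adds one node on top of its child."""

    def __init__(self, columns, child=None):
        self.columns = columns
        self.child = child

    def with_column(self, name, expr):
        # one new projection node per added column
        return Plan({**self.columns, name: expr}, child=self)

    def with_columns(self, new_cols):
        # a single projection node covering all new columns at once
        return Plan({**self.columns, **new_cols}, child=self)

    def depth(self):
        return 1 + (self.child.depth() if self.child else 0)


base = Plan({"c0": "c0", "c1": "c1", "c2": "c2"})

seq = base
for i in range(3):  # sequential: one projection per column
    seq = seq.with_column(f"o{i}", f"fill(c{i})")

batch = base.with_columns({f"o{i}": f"fill(c{i})" for i in range(3)})

print(seq.depth(), batch.depth())  # 4 2
```

With many input columns, the single projection keeps the plan shallow, which is what the benchmark numbers above reflect.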


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 SPARK-22001

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19229.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19229


commit 4efb64374b7c93bae3e9b0d2fc0ebc4f5ad1e1d5
Author: Liang-Chi Hsieh 
Date:   2017-09-14T03:49:16Z

Do withColumn on all input columns at once.




---




[GitHub] spark pull request #19185: [Spark-21854] Added LogisticRegressionTrainingSum...

2017-09-13 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/19185#discussion_r138792851
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1464,20 +1464,79 @@ def test_logistic_regression_summary(self):
 self.assertEqual(s.probabilityCol, "probability")
 self.assertEqual(s.labelCol, "label")
 self.assertEqual(s.featuresCol, "features")
+self.assertEqual(s.predictionCol, "prediction")
 objHist = s.objectiveHistory
 self.assertTrue(isinstance(objHist, list) and 
isinstance(objHist[0], float))
 self.assertGreater(s.totalIterations, 0)
+self.assertTrue(isinstance(s.labels, list))
+self.assertTrue(isinstance(s.truePositiveRateByLabel, list))
+self.assertTrue(isinstance(s.falsePositiveRateByLabel, list))
+self.assertTrue(isinstance(s.precisionByLabel, list))
+self.assertTrue(isinstance(s.recallByLabel, list))
+self.assertTrue(isinstance(s.fMeasureByLabel(), list))
+self.assertTrue(isinstance(s.fMeasureByLabel(1.0), list))
 self.assertTrue(isinstance(s.roc, DataFrame))
 self.assertAlmostEqual(s.areaUnderROC, 1.0, 2)
 self.assertTrue(isinstance(s.pr, DataFrame))
 self.assertTrue(isinstance(s.fMeasureByThreshold, DataFrame))
 self.assertTrue(isinstance(s.precisionByThreshold, DataFrame))
 self.assertTrue(isinstance(s.recallByThreshold, DataFrame))
+self.assertAlmostEqual(s.accuracy, 1.0, 2)
+self.assertAlmostEqual(s.weightedTruePositiveRate, 1.0, 2)
+self.assertAlmostEqual(s.weightedFalsePositiveRate, 0.0, 2)
+self.assertAlmostEqual(s.weightedRecall, 1.0, 2)
+self.assertAlmostEqual(s.weightedPrecision, 1.0, 2)
+self.assertAlmostEqual(s.weightedFMeasure(), 1.0, 2)
+self.assertAlmostEqual(s.weightedFMeasure(1.0), 1.0, 2)
 # test evaluation (with training dataset) produces a summary with 
same values
 # one check is enough to verify a summary is returned, Scala 
version runs full test
 sameSummary = model.evaluate(df)
 self.assertAlmostEqual(sameSummary.areaUnderROC, s.areaUnderROC)
 
+def test_multiclass_logistic_regression_summary(self):
+df = self.spark.createDataFrame([(1.0, 2.0, Vectors.dense(1.0)),
+ (0.0, 2.0, Vectors.sparse(1, [], 
[])),
+ (2.0, 2.0, Vectors.dense(2.0)),
+ (2.0, 2.0, Vectors.dense(1.9))],
+["label", "weight", "features"])
+lr = LogisticRegression(maxIter=5, regParam=0.01, 
weightCol="weight", fitIntercept=False)
+model = lr.fit(df)
+self.assertTrue(model.hasSummary)
+s = model.summary
+# test that api is callable and returns expected types
+self.assertTrue(isinstance(s.predictions, DataFrame))
+self.assertEqual(s.probabilityCol, "probability")
+self.assertEqual(s.labelCol, "label")
+self.assertEqual(s.featuresCol, "features")
+self.assertEqual(s.predictionCol, "prediction")
+objHist = s.objectiveHistory
+self.assertTrue(isinstance(objHist, list) and 
isinstance(objHist[0], float))
+self.assertGreater(s.totalIterations, 0)
+self.assertTrue(isinstance(s.labels, list))
+self.assertTrue(isinstance(s.truePositiveRateByLabel, list))
+self.assertTrue(isinstance(s.falsePositiveRateByLabel, list))
+self.assertTrue(isinstance(s.precisionByLabel, list))
+self.assertTrue(isinstance(s.recallByLabel, list))
+self.assertTrue(isinstance(s.fMeasureByLabel(), list))
+self.assertTrue(isinstance(s.fMeasureByLabel(1.0), list))
+self.assertAlmostEqual(s.accuracy, 0.75, 2)
+self.assertAlmostEqual(s.weightedTruePositiveRate, 0.75, 2)
+self.assertAlmostEqual(s.weightedFalsePositiveRate, 0.25, 2)
+self.assertAlmostEqual(s.weightedRecall, 0.75, 2)
+self.assertAlmostEqual(s.weightedPrecision, 0.583, 2)
+self.assertAlmostEqual(s.weightedFMeasure(), 0.65, 2)
+self.assertAlmostEqual(s.weightedFMeasure(1.0), 0.65, 2)
+# test evaluation (with training dataset) produces a summary with 
same values
+# one check is enough to verify a summary is returned, Scala 
version runs full test
+sameSummary = model.evaluate(df)
+self.assertAlmostEqual(sameSummary.accuracy, s.accuracy)
--- End diff --

Nit: As mentioned in the comment, one check is enough to verify a summary 
is returned; let's remove the others to simplify the test. Thanks.


---


[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19186
  
**[Test build #81752 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81752/testReport)**
 for PR 19186 at commit 
[`74445cd`](https://github.com/apache/spark/commit/74445cdfec15bef2413ea88b712b9490a2997874).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19186
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81752/
Test FAILed.


---




[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19186
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19210: Fix Graphite re-connects for Graphite instances behind E...

2017-09-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19210
  
**[Test build #81754 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81754/testReport)**
 for PR 19210 at commit 
[`8e982c7`](https://github.com/apache/spark/commit/8e982c7d450498580ab857baeed2650488ea1837).


---




[GitHub] spark issue #19210: Fix Graphite re-connects for Graphite instances behind E...

2017-09-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19210
  
ok to test


---




[GitHub] spark pull request #19130: [SPARK-21917][CORE][YARN] Supporting adding http(...

2017-09-13 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19130#discussion_r138791246
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -367,6 +368,53 @@ object SparkSubmit extends CommandLineUtils with 
Logging {
   }.orNull
 }
 
+// When running in YARN cluster manager,
--- End diff --

Sorry for the broken comment, my bad, I will fix it.


---




[GitHub] spark issue #19188: [SPARK-21973][SQL] Add an new option to filter queries i...

2017-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19188
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81745/
Test PASSed.


---




[GitHub] spark issue #19188: [SPARK-21973][SQL] Add an new option to filter queries i...

2017-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19188
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19210: Fix Graphite re-connects for Graphite instances behind E...

2017-09-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19210
  
Sure.

ok to test


---




[GitHub] spark issue #19188: [SPARK-21973][SQL] Add an new option to filter queries i...

2017-09-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19188
  
**[Test build #81745 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81745/testReport)**
 for PR 19188 at commit 
[`b543e71`](https://github.com/apache/spark/commit/b543e710dae79da33a9334d5bbe4bb474a44b39c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19226: [SPARK-21985][PySpark] PairDeserializer is broken...

2017-09-13 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19226#discussion_r138790747
  
--- Diff: python/pyspark/serializers.py ---
@@ -343,9 +346,6 @@ def _load_stream_without_unbatching(self, stream):
 key_batch_stream = 
self.key_ser._load_stream_without_unbatching(stream)
 val_batch_stream = 
self.val_ser._load_stream_without_unbatching(stream)
 for (key_batch, val_batch) in zip(key_batch_stream, 
val_batch_stream):
-if len(key_batch) != len(val_batch):
-raise ValueError("Can not deserialize PairRDD with 
different number of items"
- " in batches: (%d, %d)" % 
(len(key_batch), len(val_batch)))
 # for correctness with repeated cartesian/zip this must be 
returned as one batch
 yield zip(key_batch, val_batch)
--- End diff --

How about returning this batch as a list (and as described in the doc)?


---




[GitHub] spark pull request #19228: [SPARK-21985][PYTHON] Fix zip-chained RDD to work

2017-09-13 Thread HyukjinKwon
Github user HyukjinKwon closed the pull request at:

https://github.com/apache/spark/pull/19228


---




[GitHub] spark issue #19228: [SPARK-21985][PYTHON] Fix zip-chained RDD to work

2017-09-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19228
  
Doh, sorry @holdenk and @aray, I didn't know a PR was already open and in 
progress. Although the approach looks different from 
https://github.com/apache/spark/pull/19226, let me close mine and discuss there first.


---




[GitHub] spark issue #19228: [SPARK-21985][PYTHON] Fix zip-chained RDD to work

2017-09-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19228
  
**[Test build #81753 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81753/testReport)**
 for PR 19228 at commit 
[`0703b67`](https://github.com/apache/spark/commit/0703b67405fa721230af80421509a55eb88c5763).


---




[GitHub] spark pull request #19228: [SPARK-21985][PYTHON] Fix zip-chained RDD to work

2017-09-13 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/19228

[SPARK-21985][PYTHON] Fix zip-chained RDD to work

## What changes were proposed in this pull request?

This PR proposes to return an iterator of lists (batches) of objects in 
`CartesianDeserializer` and `PairDeserializer` rather than an iterator of 
iterators (batches) of objects so that `zip` chaining works.

## How was this patch tested?

Unit tests added.
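The change can be illustrated in plain Python (a toy model of the deserializers in `pyspark/serializers.py`; the function names here are hypothetical). Yielding lazy `zip` objects breaks downstream code that needs `len()` or repeated iteration over a batch, while yielding concrete lists works:

```python
def pair_batches_lazy(key_batches, val_batches):
    # pre-fix behavior: yields lazy zip objects per batch
    for kb, vb in zip(key_batches, val_batches):
        yield zip(kb, vb)


def pair_batches_eager(key_batches, val_batches):
    # post-fix behavior: yields concrete lists, so len() and
    # re-iteration (e.g. a further zip over the stream) both work
    for kb, vb in zip(key_batches, val_batches):
        yield list(zip(kb, vb))


keys, vals = [[1, 2], [3]], [["a", "b"], ["c"]]

lazy = next(pair_batches_lazy(iter(keys), iter(vals)))
try:
    len(lazy)  # zip objects have no length in Python 3
except TypeError:
    print("lazy batch has no len()")

eager = next(pair_batches_eager(iter(keys), iter(vals)))
print(len(eager), eager)  # 2 [(1, 'a'), (2, 'b')]
```

This is also why the removed length check in the review diff above could never succeed on Python 3: the batches it measured were iterators, not sized collections.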


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-21985

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19228.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19228


commit 0703b67405fa721230af80421509a55eb88c5763
Author: hyukjinkwon 
Date:   2017-09-14T03:29:39Z

Returns an iterator of lists




---




[GitHub] spark issue #19210: Fix Graphite re-connects for Graphite instances behind E...

2017-09-13 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19210
  
@HyukjinKwon would you please help to trigger the Jenkins? Thanks!


---




[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19186
  
**[Test build #81752 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81752/testReport)**
 for PR 19186 at commit 
[`74445cd`](https://github.com/apache/spark/commit/74445cdfec15bef2413ea88b712b9490a2997874).


---




[GitHub] spark pull request #19132: [SPARK-21922] Fix duration always updating when t...

2017-09-13 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19132#discussion_r138789492
  
--- Diff: 
core/src/main/scala/org/apache/spark/status/api/v1/OneStageResource.scala ---
@@ -81,7 +83,8 @@ private[v1] class OneStageResource(ui: SparkUI) {
   @DefaultValue("20") @QueryParam("length") length: Int,
   @DefaultValue("ID") @QueryParam("sortBy") sortBy: TaskSorting): 
Seq[TaskData] = {
 withStageAttempt(stageId, stageAttemptId) { stage =>
-  val tasks = 
stage.ui.taskData.values.map{AllStagesResource.convertTaskData}.toIndexedSeq
+  val tasks = stage.ui.taskData.values.map{
--- End diff --

The style should be changed to `map { AllStagesResource.convertTaskData(_, ui.lastUpdateTime) }`; the style requires whitespace after `{` and before `}`. You can check other similar code for this style.


---




[GitHub] spark issue #19226: [SPARK-21985][PySpark] PairDeserializer is broken for do...

2017-09-13 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/19226
  
Sure, no worries. I think we should keep the test for now and we can hope 
this goes into RC2 (I assume something will be missing from RC1 or I'll screw 
up its packaging in some way). Otherwise the fix can go out into 2.2.1 if 
somehow RC1 magically passes :)


---




[GitHub] spark pull request #19132: [SPARK-21922] Fix duration always updating when t...

2017-09-13 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19132#discussion_r138789213
  
--- Diff: core/src/main/scala/org/apache/spark/ui/jobs/UIData.scala ---
@@ -97,6 +97,7 @@ private[spark] object UIData {
 var memoryBytesSpilled: Long = _
 var diskBytesSpilled: Long = _
 var isBlacklisted: Int = _
+var jobLastUpdateTime: Option[Long] = None
--- End diff --

Is it better to rename it to `stageLastUpdateTime` or just `lastUpdateTime`? Since this structure is unrelated to jobs, it would be better not to involve "job" in the name.


---




[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19186
  
**[Test build #81751 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81751/testReport)** for PR 19186 at commit [`aa04d4b`](https://github.com/apache/spark/commit/aa04d4bb5124ccd570775076047849b49025735f).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19186
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81751/
Test FAILed.


---




[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19186
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19220: [SPARK-18608][ML][FOLLOWUP] Fix double caching for PySpa...

2017-09-13 Thread zhengruifeng
Github user zhengruifeng commented on the issue:

https://github.com/apache/spark/pull/19220
  
LGTM Thanks for this catch!


---




[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19186
  
**[Test build #81751 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81751/testReport)** for PR 19186 at commit [`aa04d4b`](https://github.com/apache/spark/commit/aa04d4bb5124ccd570775076047849b49025735f).


---




[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-09-13 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/18902
  
@MLnick Thanks for pinging me.

I went through this quickly. The basic idea is the same: performing the operations on multiple input columns in one single Dataset/DataFrame operation.

Unlike `Bucketizer`, `Imputer` has no compatibility concern because it already supports multiple input columns (`HasInputCols`). In `Bucketizer`, we don't want to break the current API, which makes things a bit more complicated.

Actually, I noticed that `ImputerModel` also applies `withColumn` sequentially on each input column. I'd like to address this part with the `withColumns` API proposed in #17819. What do you think @MLnick?
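The "one-pass" idea above can be sketched in plain Python (this is an illustrative mock with hypothetical names, not Spark's `Imputer` implementation): compute the surrogate (mean) values for several input columns in a single scan over the rows, instead of one scan per column.

```python
def one_pass_means(rows, input_cols):
    """Compute per-column means over `rows` (a list of dicts) in one pass."""
    sums = {c: 0.0 for c in input_cols}
    counts = {c: 0 for c in input_cols}
    for row in rows:              # single scan over the data
        for c in input_cols:
            v = row.get(c)
            if v is not None:     # skip missing values, as an imputer would
                sums[c] += v
                counts[c] += 1
    return {c: sums[c] / counts[c] for c in input_cols if counts[c]}

rows = [
    {"a": 1.0, "b": 10.0},
    {"a": 3.0, "b": None},
    {"a": None, "b": 20.0},
]
print(one_pass_means(rows, ["a", "b"]))  # {'a': 2.0, 'b': 15.0}
```

In Spark terms, this corresponds to aggregating all input columns in one Dataset action rather than triggering a separate job per column.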







---




[GitHub] spark issue #19226: [SPARK-21985][PySpark] PairDeserializer is broken for do...

2017-09-13 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/19226
  
@holdenk I'm not going to be able to solve this tonight (short of just 
removing the failing test). 


---




[GitHub] spark issue #19152: [SPARK-21915][ML][PySpark] Model 1 and Model 2 ParamMaps...

2017-09-13 Thread marktab
Github user marktab commented on the issue:

https://github.com/apache/spark/pull/19152
  
@srowen  -- may I close this pull request? 


---




[GitHub] spark pull request #19152: [SPARK-21915][ML][PySpark] Model 1 and Model 2 Pa...

2017-09-13 Thread marktab
GitHub user marktab reopened a pull request:

https://github.com/apache/spark/pull/19152

[SPARK-21915][ML][PySpark] Model 1 and Model 2 ParamMaps Missing

@dongjoon-hyun @HyukjinKwon

Error in PySpark example code:
/examples/src/main/python/ml/estimator_transformer_param_example.py

The original Scala code says
println("Model 2 was fit using parameters: " + 
model2.parent.extractParamMap)

The parent is lr

There is no method for accessing parent as is done in Scala.

This code has been tested in Python, and returns values consistent with 
Scala

## What changes were proposed in this pull request?

Proposing to call the lr variable instead of model1 or model2

## How was this patch tested?

This patch was tested with Spark 2.1.0, comparing the Scala and PySpark results. PySpark currently returns nothing for those two print lines.

The output for model2 in PySpark should be

{Param(parent='LogisticRegression_4187be538f744d5a9090', name='tol', 
doc='the convergence tolerance for iterative algorithms (>= 0).'): 1e-06,
Param(parent='LogisticRegression_4187be538f744d5a9090', 
name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. 
For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 
penalty.'): 0.0,
Param(parent='LogisticRegression_4187be538f744d5a9090', 
name='predictionCol', doc='prediction column name.'): 'prediction',
Param(parent='LogisticRegression_4187be538f744d5a9090', name='featuresCol', 
doc='features column name.'): 'features',
Param(parent='LogisticRegression_4187be538f744d5a9090', name='labelCol', 
doc='label column name.'): 'label',
Param(parent='LogisticRegression_4187be538f744d5a9090', 
name='probabilityCol', doc='Column name for predicted class conditional 
probabilities. Note: Not all models output well-calibrated probability 
estimates! These probabilities should be treated as confidences, not precise 
probabilities.'): 'myProbability',
Param(parent='LogisticRegression_4187be538f744d5a9090', 
name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column 
name.'): 'rawPrediction',
Param(parent='LogisticRegression_4187be538f744d5a9090', name='family', 
doc='The name of family which is a description of the label distribution to be 
used in the model. Supported options: auto, binomial, multinomial'): 'auto',
Param(parent='LogisticRegression_4187be538f744d5a9090', 
name='fitIntercept', doc='whether to fit an intercept term.'): True,
Param(parent='LogisticRegression_4187be538f744d5a9090', name='threshold', 
doc='Threshold in binary classification prediction, in range [0, 1]. If 
threshold and thresholds are both set, they must match.e.g. if threshold is p, 
then thresholds must be equal to [1-p, p].'): 0.55,
Param(parent='LogisticRegression_4187be538f744d5a9090', 
name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2,
Param(parent='LogisticRegression_4187be538f744d5a9090', name='maxIter', 
doc='max number of iterations (>= 0).'): 30,
Param(parent='LogisticRegression_4187be538f744d5a9090', name='regParam', 
doc='regularization parameter (>= 0).'): 0.1,
Param(parent='LogisticRegression_4187be538f744d5a9090', 
name='standardization', doc='whether to standardize the training features 
before fitting the model.'): True}

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/marktab/spark branch-2.2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19152.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19152


commit a2ccb8a83d13d39c95f0ac1cac1c74dca064
Author: MarkTab marktab.net 
Date:   2017-09-07T02:20:59Z

Model 1 and Model 2 ParamMaps Missing

@dongjoon-hyun @HyukjinKwon

Error in PySpark example code:

[https://github.com/apache/spark/blob/master/examples/src/main/python/ml/estimator_transformer_param_example.py]

The original Scala code says
println("Model 2 was fit using parameters: " + 
model2.parent.extractParamMap)

The parent is lr

There is no method for accessing parent as is done in Scala.

This code has been tested in Python, and returns values consistent with 
Scala




---




[GitHub] spark pull request #19152: [SPARK-21915][ML][PySpark] Model 1 and Model 2 Pa...

2017-09-13 Thread marktab
Github user marktab closed the pull request at:

https://github.com/apache/spark/pull/19152


---




[GitHub] spark issue #19219: [SPARK-21993][SQL] Close sessionState in shutdown hook.

2017-09-13 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/19219
  
cc @cloud-fan @jiangxb1987 
Could you please take a look at this?


---



