spark git commit: [SPARK-20098][PYSPARK] dataType's typeName fix

2017-09-10 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master f76790557 -> 520d92a19


[SPARK-20098][PYSPARK] dataType's typeName fix

## What changes were proposed in this pull request?
`StructField.typeName` has been fixed: it now raises a `TypeError` that directs callers to use `typeName` on the field's `dataType` instead.

## How was this patch tested?
local build

Author: Peter Szalai 

Closes #17435 from szalai1/datatype-gettype-fix.
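
For illustration (an editorial sketch, not part of the commit message), a minimal example of the behaviour after this patch: asking a `StructField` for its type name now fails, and the field's `dataType` should be queried instead.

```python
from pyspark.sql.types import IntegerType, StructField

field = StructField("a", IntegerType())

# StructField.typeName() now raises TypeError per this patch.
try:
    field.typeName()
except TypeError as e:
    print(e)

# Query the wrapped data type instead; prints "integer".
print(field.dataType.typeName())
```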


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/520d92a1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/520d92a1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/520d92a1

Branch: refs/heads/master
Commit: 520d92a191c3148498087d751aeeddd683055622
Parents: f767905
Author: Peter Szalai 
Authored: Sun Sep 10 17:47:45 2017 +0900
Committer: hyukjinkwon 
Committed: Sun Sep 10 17:47:45 2017 +0900

--
 python/pyspark/sql/tests.py | 4 ++++
 python/pyspark/sql/types.py | 5 +++++
 2 files changed, 9 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/520d92a1/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 4d65abc..6e7ddf9 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -209,6 +209,10 @@ class DataTypeTests(unittest.TestCase):
         row = Row()
         self.assertEqual(len(row), 0)
 
+    def test_struct_field_type_name(self):
+        struct_field = StructField("a", IntegerType())
+        self.assertRaises(TypeError, struct_field.typeName)
+
 
 class SQLTests(ReusedPySparkTestCase):
 

http://git-wip-us.apache.org/repos/asf/spark/blob/520d92a1/python/pyspark/sql/types.py
--
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 51bf7be..920cf00 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -440,6 +440,11 @@ class StructField(DataType):
     def fromInternal(self, obj):
         return self.dataType.fromInternal(obj)
 
+    def typeName(self):
+        raise TypeError(
+            "StructField does not have typeName. "
+            "Use typeName on its type explicitly instead.")
+
 
 class StructType(DataType):
 """Struct type, consisting of a list of :class:`StructField`.





spark git commit: [SPARK-20098][PYSPARK] dataType's typeName fix

2017-09-10 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 182478e03 -> b1b5a7fdc


[SPARK-20098][PYSPARK] dataType's typeName fix

## What changes were proposed in this pull request?
`StructField.typeName` has been fixed: it now raises a `TypeError` that directs callers to use `typeName` on the field's `dataType` instead.

## How was this patch tested?
local build

Author: Peter Szalai 

Closes #17435 from szalai1/datatype-gettype-fix.

(cherry picked from commit 520d92a191c3148498087d751aeeddd683055622)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b1b5a7fd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b1b5a7fd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b1b5a7fd

Branch: refs/heads/branch-2.2
Commit: b1b5a7fdc0f8fabfb235f0b31bde0f1bfb71591a
Parents: 182478e
Author: Peter Szalai 
Authored: Sun Sep 10 17:47:45 2017 +0900
Committer: hyukjinkwon 
Committed: Sun Sep 10 17:48:00 2017 +0900

--
 python/pyspark/sql/tests.py | 4 ++++
 python/pyspark/sql/types.py | 5 +++++
 2 files changed, 9 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b1b5a7fd/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index a100dc0..39655a5 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -188,6 +188,10 @@ class DataTypeTests(unittest.TestCase):
         row = Row()
         self.assertEqual(len(row), 0)
 
+    def test_struct_field_type_name(self):
+        struct_field = StructField("a", IntegerType())
+        self.assertRaises(TypeError, struct_field.typeName)
+
 
 class SQLTests(ReusedPySparkTestCase):
 

http://git-wip-us.apache.org/repos/asf/spark/blob/b1b5a7fd/python/pyspark/sql/types.py
--
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 26b54a7..d9206dd 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -438,6 +438,11 @@ class StructField(DataType):
     def fromInternal(self, obj):
         return self.dataType.fromInternal(obj)
 
+    def typeName(self):
+        raise TypeError(
+            "StructField does not have typeName. "
+            "Use typeName on its type explicitly instead.")
+
 
 class StructType(DataType):
 """Struct type, consisting of a list of :class:`StructField`.





[3/9] spark git commit: [SPARKR][BACKPORT-2.1] backporting package and test changes

2017-09-10 Thread felixcheung
http://git-wip-us.apache.org/repos/asf/spark/blob/ae4e8ae4/R/pkg/tests/fulltests/test_shuffle.R
--
diff --git a/R/pkg/tests/fulltests/test_shuffle.R 
b/R/pkg/tests/fulltests/test_shuffle.R
new file mode 100644
index 000..d38efab
--- /dev/null
+++ b/R/pkg/tests/fulltests/test_shuffle.R
@@ -0,0 +1,224 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+context("partitionBy, groupByKey, reduceByKey etc.")
+
+# JavaSparkContext handle
+sparkSession <- sparkR.session(enableHiveSupport = FALSE)
+sc <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", 
"getJavaSparkContext", sparkSession)
+
+# Data
+intPairs <- list(list(1L, -1), list(2L, 100), list(2L, 1), list(1L, 200))
+intRdd <- parallelize(sc, intPairs, 2L)
+
+doublePairs <- list(list(1.5, -1), list(2.5, 100), list(2.5, 1), list(1.5, 
200))
+doubleRdd <- parallelize(sc, doublePairs, 2L)
+
+numPairs <- list(list(1L, 100), list(2L, 200), list(4L, -1), list(3L, 1),
+ list(3L, 0))
+numPairsRdd <- parallelize(sc, numPairs, length(numPairs))
+
+strList <- list("Dexter Morgan: Blood. Sometimes it sets my teeth on edge and 
",
+"Dexter Morgan: Harry and Dorris Morgan did a wonderful job ")
+strListRDD <- parallelize(sc, strList, 4)
+
+test_that("groupByKey for integers", {
+  grouped <- groupByKey(intRdd, 2L)
+
+  actual <- collectRDD(grouped)
+
+  expected <- list(list(2L, list(100, 1)), list(1L, list(-1, 200)))
+  expect_equal(sortKeyValueList(actual), sortKeyValueList(expected))
+})
+
+test_that("groupByKey for doubles", {
+  grouped <- groupByKey(doubleRdd, 2L)
+
+  actual <- collectRDD(grouped)
+
+  expected <- list(list(1.5, list(-1, 200)), list(2.5, list(100, 1)))
+  expect_equal(sortKeyValueList(actual), sortKeyValueList(expected))
+})
+
+test_that("reduceByKey for ints", {
+  reduced <- reduceByKey(intRdd, "+", 2L)
+
+  actual <- collectRDD(reduced)
+
+  expected <- list(list(2L, 101), list(1L, 199))
+  expect_equal(sortKeyValueList(actual), sortKeyValueList(expected))
+})
+
+test_that("reduceByKey for doubles", {
+  reduced <- reduceByKey(doubleRdd, "+", 2L)
+  actual <- collectRDD(reduced)
+
+  expected <- list(list(1.5, 199), list(2.5, 101))
+  expect_equal(sortKeyValueList(actual), sortKeyValueList(expected))
+})
+
+test_that("combineByKey for ints", {
+  reduced <- combineByKey(intRdd, function(x) { x }, "+", "+", 2L)
+
+  actual <- collectRDD(reduced)
+
+  expected <- list(list(2L, 101), list(1L, 199))
+  expect_equal(sortKeyValueList(actual), sortKeyValueList(expected))
+})
+
+test_that("combineByKey for doubles", {
+  reduced <- combineByKey(doubleRdd, function(x) { x }, "+", "+", 2L)
+  actual <- collectRDD(reduced)
+
+  expected <- list(list(1.5, 199), list(2.5, 101))
+  expect_equal(sortKeyValueList(actual), sortKeyValueList(expected))
+})
+
+test_that("combineByKey for characters", {
+  stringKeyRDD <- parallelize(sc,
+  list(list("max", 1L), list("min", 2L),
+   list("other", 3L), list("max", 4L)), 2L)
+  reduced <- combineByKey(stringKeyRDD,
+  function(x) { x }, "+", "+", 2L)
+  actual <- collectRDD(reduced)
+
+  expected <- list(list("max", 5L), list("min", 2L), list("other", 3L))
+  expect_equal(sortKeyValueList(actual), sortKeyValueList(expected))
+})
+
+test_that("aggregateByKey", {
+  # test aggregateByKey for int keys
+  rdd <- parallelize(sc, list(list(1, 1), list(1, 2), list(2, 3), list(2, 4)))
+
+  zeroValue <- list(0, 0)
+  seqOp <- function(x, y) { list(x[[1]] + y, x[[2]] + 1) }
+  combOp <- function(x, y) { list(x[[1]] + y[[1]], x[[2]] + y[[2]]) }
+  aggregatedRDD <- aggregateByKey(rdd, zeroValue, seqOp, combOp, 2L)
+
+  actual <- collectRDD(aggregatedRDD)
+
+  expected <- list(list(1, list(3, 2)), list(2, list(7, 2)))
+  expect_equal(sortKeyValueList(actual), sortKeyValueList(expected))
+
+  # test aggregateByKey for string keys
+  rdd <- parallelize(sc, list(list("a", 1), list("a", 2), list("b", 3), 
list("b", 4)))
+
+  zeroValue <- list(0, 0)
+  seqOp <- function(x, y) { list(x[[1]] + y, x[[2]] + 1) }
+  combOp <- function(x, y) { list(x[[1]] + y[[1]], x[[2]] + y[[2]]) }
+  aggregatedRDD <- aggregateByK

[1/9] spark git commit: [SPARKR][BACKPORT-2.1] backporting package and test changes

2017-09-10 Thread felixcheung
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 6a8a726f3 -> ae4e8ae41


http://git-wip-us.apache.org/repos/asf/spark/blob/ae4e8ae4/R/pkg/tests/fulltests/test_take.R
--
diff --git a/R/pkg/tests/fulltests/test_take.R 
b/R/pkg/tests/fulltests/test_take.R
new file mode 100644
index 000..aaa5328
--- /dev/null
+++ b/R/pkg/tests/fulltests/test_take.R
@@ -0,0 +1,69 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+context("tests RDD function take()")
+
+# Mock data
+numVector <- c(-10:97)
+numList <- list(sqrt(1), sqrt(2), sqrt(3), 4 ** 10)
+strVector <- c("Dexter Morgan: I suppose I should be upset, even feel",
+   "violated, but I'm not. No, in fact, I think this is a 
friendly",
+   "message, like \"Hey, wanna play?\" and yes, I want to play. ",
+   "I really, really do.")
+strList <- list("Dexter Morgan: Blood. Sometimes it sets my teeth on edge, ",
+"other times it helps me control the chaos.",
+"Dexter Morgan: Harry and Dorris Morgan did a wonderful job ",
+"raising me. But they're both dead now. I didn't kill them. 
Honest.")
+
+# JavaSparkContext handle
+sparkSession <- sparkR.session(enableHiveSupport = FALSE)
+sc <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", 
"getJavaSparkContext", sparkSession)
+
+test_that("take() gives back the original elements in correct count and 
order", {
+  numVectorRDD <- parallelize(sc, numVector, 10)
+  # case: number of elements to take is less than the size of the first 
partition
+  expect_equal(takeRDD(numVectorRDD, 1), as.list(head(numVector, n = 1)))
+  # case: number of elements to take is the same as the size of the first 
partition
+  expect_equal(takeRDD(numVectorRDD, 11), as.list(head(numVector, n = 11)))
+  # case: number of elements to take is greater than all elements
+  expect_equal(takeRDD(numVectorRDD, length(numVector)), as.list(numVector))
+  expect_equal(takeRDD(numVectorRDD, length(numVector) + 1), 
as.list(numVector))
+
+  numListRDD <- parallelize(sc, numList, 1)
+  numListRDD2 <- parallelize(sc, numList, 4)
+  expect_equal(takeRDD(numListRDD, 3), takeRDD(numListRDD2, 3))
+  expect_equal(takeRDD(numListRDD, 5), takeRDD(numListRDD2, 5))
+  expect_equal(takeRDD(numListRDD, 1), as.list(head(numList, n = 1)))
+  expect_equal(takeRDD(numListRDD2, 999), numList)
+
+  strVectorRDD <- parallelize(sc, strVector, 2)
+  strVectorRDD2 <- parallelize(sc, strVector, 3)
+  expect_equal(takeRDD(strVectorRDD, 4), as.list(strVector))
+  expect_equal(takeRDD(strVectorRDD2, 2), as.list(head(strVector, n = 2)))
+
+  strListRDD <- parallelize(sc, strList, 4)
+  strListRDD2 <- parallelize(sc, strList, 1)
+  expect_equal(takeRDD(strListRDD, 3), as.list(head(strList, n = 3)))
+  expect_equal(takeRDD(strListRDD2, 1), as.list(head(strList, n = 1)))
+
+  expect_equal(length(takeRDD(strListRDD, 0)), 0)
+  expect_equal(length(takeRDD(strVectorRDD, 0)), 0)
+  expect_equal(length(takeRDD(numListRDD, 0)), 0)
+  expect_equal(length(takeRDD(numVectorRDD, 0)), 0)
+})
+
+sparkR.session.stop()

http://git-wip-us.apache.org/repos/asf/spark/blob/ae4e8ae4/R/pkg/tests/fulltests/test_textFile.R
--
diff --git a/R/pkg/tests/fulltests/test_textFile.R 
b/R/pkg/tests/fulltests/test_textFile.R
new file mode 100644
index 000..3b46606
--- /dev/null
+++ b/R/pkg/tests/fulltests/test_textFile.R
@@ -0,0 +1,164 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the spe

[6/9] spark git commit: [SPARKR][BACKPORT-2.1] backporting package and test changes

2017-09-10 Thread felixcheung
http://git-wip-us.apache.org/repos/asf/spark/blob/ae4e8ae4/R/pkg/inst/tests/testthat/test_sparkSQL.R
--
diff --git a/R/pkg/inst/tests/testthat/test_sparkSQL.R 
b/R/pkg/inst/tests/testthat/test_sparkSQL.R
deleted file mode 100644
index 6285124..000
--- a/R/pkg/inst/tests/testthat/test_sparkSQL.R
+++ /dev/null
@@ -1,2883 +0,0 @@
-#
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements.  See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License.  You may obtain a copy of the License at
-#
-#http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-library(testthat)
-
-context("SparkSQL functions")
-
-# Utility function for easily checking the values of a StructField
-checkStructField <- function(actual, expectedName, expectedType, 
expectedNullable) {
-  expect_equal(class(actual), "structField")
-  expect_equal(actual$name(), expectedName)
-  expect_equal(actual$dataType.toString(), expectedType)
-  expect_equal(actual$nullable(), expectedNullable)
-}
-
-markUtf8 <- function(s) {
-  Encoding(s) <- "UTF-8"
-  s
-}
-
-setHiveContext <- function(sc) {
-  if (exists(".testHiveSession", envir = .sparkREnv)) {
-hiveSession <- get(".testHiveSession", envir = .sparkREnv)
-  } else {
-# initialize once and reuse
-ssc <- callJMethod(sc, "sc")
-hiveCtx <- tryCatch({
-  newJObject("org.apache.spark.sql.hive.test.TestHiveContext", ssc, FALSE)
-},
-error = function(err) {
-  skip("Hive is not build with SparkSQL, skipped")
-})
-hiveSession <- callJMethod(hiveCtx, "sparkSession")
-  }
-  previousSession <- get(".sparkRsession", envir = .sparkREnv)
-  assign(".sparkRsession", hiveSession, envir = .sparkREnv)
-  assign(".prevSparkRsession", previousSession, envir = .sparkREnv)
-  hiveSession
-}
-
-unsetHiveContext <- function() {
-  previousSession <- get(".prevSparkRsession", envir = .sparkREnv)
-  assign(".sparkRsession", previousSession, envir = .sparkREnv)
-  remove(".prevSparkRsession", envir = .sparkREnv)
-}
-
-# Tests for SparkSQL functions in SparkR
-
-filesBefore <- list.files(path = sparkRDir, all.files = TRUE)
-sparkSession <- sparkR.session()
-sc <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", 
"getJavaSparkContext", sparkSession)
-
-mockLines <- c("{\"name\":\"Michael\"}",
-   "{\"name\":\"Andy\", \"age\":30}",
-   "{\"name\":\"Justin\", \"age\":19}")
-jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
-parquetPath <- tempfile(pattern = "sparkr-test", fileext = ".parquet")
-orcPath <- tempfile(pattern = "sparkr-test", fileext = ".orc")
-writeLines(mockLines, jsonPath)
-
-# For test nafunctions, like dropna(), fillna(),...
-mockLinesNa <- c("{\"name\":\"Bob\",\"age\":16,\"height\":176.5}",
- "{\"name\":\"Alice\",\"age\":null,\"height\":164.3}",
- "{\"name\":\"David\",\"age\":60,\"height\":null}",
- "{\"name\":\"Amy\",\"age\":null,\"height\":null}",
- "{\"name\":null,\"age\":null,\"height\":null}")
-jsonPathNa <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
-writeLines(mockLinesNa, jsonPathNa)
-
-# For test complex types in DataFrame
-mockLinesComplexType <-
-  c("{\"c1\":[1, 2, 3], \"c2\":[\"a\", \"b\", \"c\"], \"c3\":[1.0, 2.0, 3.0]}",
-"{\"c1\":[4, 5, 6], \"c2\":[\"d\", \"e\", \"f\"], \"c3\":[4.0, 5.0, 6.0]}",
-"{\"c1\":[7, 8, 9], \"c2\":[\"g\", \"h\", \"i\"], \"c3\":[7.0, 8.0, 9.0]}")
-complexTypeJsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
-writeLines(mockLinesComplexType, complexTypeJsonPath)
-
-test_that("calling sparkRSQL.init returns existing SQL context", {
-  sqlContext <- suppressWarnings(sparkRSQL.init(sc))
-  expect_equal(suppressWarnings(sparkRSQL.init(sc)), sqlContext)
-})
-
-test_that("calling sparkRSQL.init returns existing SparkSession", {
-  expect_equal(suppressWarnings(sparkRSQL.init(sc)), sparkSession)
-})
-
-test_that("calling sparkR.session returns existing SparkSession", {
-  expect_equal(sparkR.session(), sparkSession)
-})
-
-test_that("infer types and check types", {
-  expect_equal(infer_type(1L), "integer")
-  expect_equal(infer_type(1.0), "double")
-  expect_equal(infer_type("abc"), "string")
-  expect_equal(infer_type(TRUE), "boolean")
-  expect_equal(infer_type(as.Date("2015-03-11")), "date")
-  expect_equal(infer_type(as.POSIXlt("2015-03-11 12:13:04.043")), "time

[2/9] spark git commit: [SPARKR][BACKPORT-2.1] backporting package and test changes

2017-09-10 Thread felixcheung
http://git-wip-us.apache.org/repos/asf/spark/blob/ae4e8ae4/R/pkg/tests/fulltests/test_sparkSQL.R
--
diff --git a/R/pkg/tests/fulltests/test_sparkSQL.R 
b/R/pkg/tests/fulltests/test_sparkSQL.R
new file mode 100644
index 000..07de45b
--- /dev/null
+++ b/R/pkg/tests/fulltests/test_sparkSQL.R
@@ -0,0 +1,2887 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+library(testthat)
+
+context("SparkSQL functions")
+
+# Utility function for easily checking the values of a StructField
+checkStructField <- function(actual, expectedName, expectedType, 
expectedNullable) {
+  expect_equal(class(actual), "structField")
+  expect_equal(actual$name(), expectedName)
+  expect_equal(actual$dataType.toString(), expectedType)
+  expect_equal(actual$nullable(), expectedNullable)
+}
+
+markUtf8 <- function(s) {
+  Encoding(s) <- "UTF-8"
+  s
+}
+
+setHiveContext <- function(sc) {
+  if (exists(".testHiveSession", envir = .sparkREnv)) {
+hiveSession <- get(".testHiveSession", envir = .sparkREnv)
+  } else {
+# initialize once and reuse
+ssc <- callJMethod(sc, "sc")
+hiveCtx <- tryCatch({
+  newJObject("org.apache.spark.sql.hive.test.TestHiveContext", ssc, FALSE)
+},
+error = function(err) {
+  skip("Hive is not build with SparkSQL, skipped")
+})
+hiveSession <- callJMethod(hiveCtx, "sparkSession")
+  }
+  previousSession <- get(".sparkRsession", envir = .sparkREnv)
+  assign(".sparkRsession", hiveSession, envir = .sparkREnv)
+  assign(".prevSparkRsession", previousSession, envir = .sparkREnv)
+  hiveSession
+}
+
+unsetHiveContext <- function() {
+  previousSession <- get(".prevSparkRsession", envir = .sparkREnv)
+  assign(".sparkRsession", previousSession, envir = .sparkREnv)
+  remove(".prevSparkRsession", envir = .sparkREnv)
+}
+
+# Tests for SparkSQL functions in SparkR
+
+filesBefore <- list.files(path = sparkRDir, all.files = TRUE)
+sparkSession <- sparkR.session()
+sc <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", 
"getJavaSparkContext", sparkSession)
+
+mockLines <- c("{\"name\":\"Michael\"}",
+   "{\"name\":\"Andy\", \"age\":30}",
+   "{\"name\":\"Justin\", \"age\":19}")
+jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
+parquetPath <- tempfile(pattern = "sparkr-test", fileext = ".parquet")
+orcPath <- tempfile(pattern = "sparkr-test", fileext = ".orc")
+writeLines(mockLines, jsonPath)
+
+# For test nafunctions, like dropna(), fillna(),...
+mockLinesNa <- c("{\"name\":\"Bob\",\"age\":16,\"height\":176.5}",
+ "{\"name\":\"Alice\",\"age\":null,\"height\":164.3}",
+ "{\"name\":\"David\",\"age\":60,\"height\":null}",
+ "{\"name\":\"Amy\",\"age\":null,\"height\":null}",
+ "{\"name\":null,\"age\":null,\"height\":null}")
+jsonPathNa <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
+writeLines(mockLinesNa, jsonPathNa)
+
+# For test complex types in DataFrame
+mockLinesComplexType <-
+  c("{\"c1\":[1, 2, 3], \"c2\":[\"a\", \"b\", \"c\"], \"c3\":[1.0, 2.0, 3.0]}",
+"{\"c1\":[4, 5, 6], \"c2\":[\"d\", \"e\", \"f\"], \"c3\":[4.0, 5.0, 6.0]}",
+"{\"c1\":[7, 8, 9], \"c2\":[\"g\", \"h\", \"i\"], \"c3\":[7.0, 8.0, 9.0]}")
+complexTypeJsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
+writeLines(mockLinesComplexType, complexTypeJsonPath)
+
+if (.Platform$OS.type == "windows") {
+  Sys.setenv(TZ = "GMT")
+}
+
+test_that("calling sparkRSQL.init returns existing SQL context", {
+  sqlContext <- suppressWarnings(sparkRSQL.init(sc))
+  expect_equal(suppressWarnings(sparkRSQL.init(sc)), sqlContext)
+})
+
+test_that("calling sparkRSQL.init returns existing SparkSession", {
+  expect_equal(suppressWarnings(sparkRSQL.init(sc)), sparkSession)
+})
+
+test_that("calling sparkR.session returns existing SparkSession", {
+  expect_equal(sparkR.session(), sparkSession)
+})
+
+test_that("infer types and check types", {
+  expect_equal(infer_type(1L), "integer")
+  expect_equal(infer_type(1.0), "double")
+  expect_equal(infer_type("abc"), "string")
+  expect_equal(infer_type(TRUE), "boolean")
+  expect_equal(infer_type(as.Date("2015-03-11")), "date")
+  expect_equal(infer_

[9/9] spark git commit: [SPARKR][BACKPORT-2.1] backporting package and test changes

2017-09-10 Thread felixcheung
[SPARKR][BACKPORT-2.1] backporting package and test changes

## What changes were proposed in this pull request?

Cherry-pick or manually port the SparkR package and test changes to branch-2.1.

## How was this patch tested?

Jenkins

Author: Felix Cheung 
Author: hyukjinkwon 
Author: Wayne Zhang 

Closes #19165 from felixcheung/rbackportpkg21.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ae4e8ae4
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ae4e8ae4
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ae4e8ae4

Branch: refs/heads/branch-2.1
Commit: ae4e8ae41d1ba135159afe9ffb1302971343efd1
Parents: 6a8a726
Author: Felix Cheung 
Authored: Sun Sep 10 10:24:46 2017 -0700
Committer: Felix Cheung 
Committed: Sun Sep 10 10:24:46 2017 -0700

--
 R/pkg/.Rbuildignore |2 +
 R/pkg/DESCRIPTION   |2 +-
 R/pkg/R/install.R   |6 +-
 R/pkg/inst/tests/testthat/jarTest.R |   32 -
 R/pkg/inst/tests/testthat/packageInAJarTest.R   |   30 -
 R/pkg/inst/tests/testthat/test_Serde.R  |   79 -
 R/pkg/inst/tests/testthat/test_Windows.R|   27 -
 R/pkg/inst/tests/testthat/test_basic.R  |   72 +
 R/pkg/inst/tests/testthat/test_binaryFile.R |   92 -
 .../inst/tests/testthat/test_binary_function.R  |  104 -
 R/pkg/inst/tests/testthat/test_broadcast.R  |   51 -
 R/pkg/inst/tests/testthat/test_client.R |   43 -
 R/pkg/inst/tests/testthat/test_context.R|  210 --
 R/pkg/inst/tests/testthat/test_includePackage.R |   60 -
 R/pkg/inst/tests/testthat/test_jvm_api.R|   36 -
 R/pkg/inst/tests/testthat/test_mllib.R  | 1205 
 .../tests/testthat/test_parallelize_collect.R   |  112 -
 R/pkg/inst/tests/testthat/test_rdd.R|  804 -
 R/pkg/inst/tests/testthat/test_shuffle.R|  224 --
 R/pkg/inst/tests/testthat/test_sparkR.R |   46 -
 R/pkg/inst/tests/testthat/test_sparkSQL.R   | 2883 -
 R/pkg/inst/tests/testthat/test_take.R   |   69 -
 R/pkg/inst/tests/testthat/test_textFile.R   |  164 -
 R/pkg/inst/tests/testthat/test_utils.R  |  242 --
 R/pkg/tests/fulltests/jarTest.R |   32 +
 R/pkg/tests/fulltests/packageInAJarTest.R   |   30 +
 R/pkg/tests/fulltests/test_Serde.R  |   79 +
 R/pkg/tests/fulltests/test_Windows.R|   27 +
 R/pkg/tests/fulltests/test_binaryFile.R |   92 +
 R/pkg/tests/fulltests/test_binary_function.R|  104 +
 R/pkg/tests/fulltests/test_broadcast.R  |   51 +
 R/pkg/tests/fulltests/test_client.R |   43 +
 R/pkg/tests/fulltests/test_context.R|  210 ++
 R/pkg/tests/fulltests/test_includePackage.R |   60 +
 R/pkg/tests/fulltests/test_jvm_api.R|   36 +
 R/pkg/tests/fulltests/test_mllib.R  | 1205 
 .../tests/fulltests/test_parallelize_collect.R  |  112 +
 R/pkg/tests/fulltests/test_rdd.R|  804 +
 R/pkg/tests/fulltests/test_shuffle.R|  224 ++
 R/pkg/tests/fulltests/test_sparkR.R |   46 +
 R/pkg/tests/fulltests/test_sparkSQL.R   | 2887 ++
 R/pkg/tests/fulltests/test_take.R   |   69 +
 R/pkg/tests/fulltests/test_textFile.R   |  164 +
 R/pkg/tests/fulltests/test_utils.R  |  242 ++
 R/pkg/tests/run-all.R   |   22 +-
 R/pkg/vignettes/sparkr-vignettes.Rmd|   47 +-
 R/run-tests.sh  |2 +-
 appveyor.yml|3 +
 .../apache/spark/deploy/SparkSubmitSuite.scala  |6 +-
 49 files changed, 6653 insertions(+), 6539 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ae4e8ae4/R/pkg/.Rbuildignore
--
diff --git a/R/pkg/.Rbuildignore b/R/pkg/.Rbuildignore
index f12f8c2..280ab58 100644
--- a/R/pkg/.Rbuildignore
+++ b/R/pkg/.Rbuildignore
@@ -6,3 +6,5 @@
 ^README\.Rmd$
 ^src-native$
 ^html$
+^tests/fulltests/*
+

http://git-wip-us.apache.org/repos/asf/spark/blob/ae4e8ae4/R/pkg/DESCRIPTION
--
diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION
index 2d461ca..899d410 100644
--- a/R/pkg/DESCRIPTION
+++ b/R/pkg/DESCRIPTION
@@ -2,7 +2,7 @@ Package: SparkR
 Type: Package
 Version: 2.1.2
 Title: R Frontend for Apache Spark
-Description: The SparkR package provides an R Frontend for Apache Spark.
+Description: Provides an R Frontend for Apache Spark.
 Authors@R: c(person("Shivaram", "Venkataraman", role = c("aut", "cre"),
 email = "shiva...@cs.berkeley.edu"),
  person("Xiangrui", "Meng", role = "aut",

http://git-wip-us.apache.org/repo

[5/9] spark git commit: [SPARKR][BACKPORT-2.1] backporting package and test changes

2017-09-10 Thread felixcheung
http://git-wip-us.apache.org/repos/asf/spark/blob/ae4e8ae4/R/pkg/inst/tests/testthat/test_take.R
--
diff --git a/R/pkg/inst/tests/testthat/test_take.R 
b/R/pkg/inst/tests/testthat/test_take.R
deleted file mode 100644
index aaa5328..000
--- a/R/pkg/inst/tests/testthat/test_take.R
+++ /dev/null
@@ -1,69 +0,0 @@
-#
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements.  See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License.  You may obtain a copy of the License at
-#
-#http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-context("tests RDD function take()")
-
-# Mock data
-numVector <- c(-10:97)
-numList <- list(sqrt(1), sqrt(2), sqrt(3), 4 ** 10)
-strVector <- c("Dexter Morgan: I suppose I should be upset, even feel",
-   "violated, but I'm not. No, in fact, I think this is a 
friendly",
-   "message, like \"Hey, wanna play?\" and yes, I want to play. ",
-   "I really, really do.")
-strList <- list("Dexter Morgan: Blood. Sometimes it sets my teeth on edge, ",
-"other times it helps me control the chaos.",
-"Dexter Morgan: Harry and Dorris Morgan did a wonderful job ",
-"raising me. But they're both dead now. I didn't kill them. 
Honest.")
-
-# JavaSparkContext handle
-sparkSession <- sparkR.session(enableHiveSupport = FALSE)
-sc <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", 
"getJavaSparkContext", sparkSession)
-
-test_that("take() gives back the original elements in correct count and 
order", {
-  numVectorRDD <- parallelize(sc, numVector, 10)
-  # case: number of elements to take is less than the size of the first 
partition
-  expect_equal(takeRDD(numVectorRDD, 1), as.list(head(numVector, n = 1)))
-  # case: number of elements to take is the same as the size of the first 
partition
-  expect_equal(takeRDD(numVectorRDD, 11), as.list(head(numVector, n = 11)))
-  # case: number of elements to take is greater than all elements
-  expect_equal(takeRDD(numVectorRDD, length(numVector)), as.list(numVector))
-  expect_equal(takeRDD(numVectorRDD, length(numVector) + 1), 
as.list(numVector))
-
-  numListRDD <- parallelize(sc, numList, 1)
-  numListRDD2 <- parallelize(sc, numList, 4)
-  expect_equal(takeRDD(numListRDD, 3), takeRDD(numListRDD2, 3))
-  expect_equal(takeRDD(numListRDD, 5), takeRDD(numListRDD2, 5))
-  expect_equal(takeRDD(numListRDD, 1), as.list(head(numList, n = 1)))
-  expect_equal(takeRDD(numListRDD2, 999), numList)
-
-  strVectorRDD <- parallelize(sc, strVector, 2)
-  strVectorRDD2 <- parallelize(sc, strVector, 3)
-  expect_equal(takeRDD(strVectorRDD, 4), as.list(strVector))
-  expect_equal(takeRDD(strVectorRDD2, 2), as.list(head(strVector, n = 2)))
-
-  strListRDD <- parallelize(sc, strList, 4)
-  strListRDD2 <- parallelize(sc, strList, 1)
-  expect_equal(takeRDD(strListRDD, 3), as.list(head(strList, n = 3)))
-  expect_equal(takeRDD(strListRDD2, 1), as.list(head(strList, n = 1)))
-
-  expect_equal(length(takeRDD(strListRDD, 0)), 0)
-  expect_equal(length(takeRDD(strVectorRDD, 0)), 0)
-  expect_equal(length(takeRDD(numListRDD, 0)), 0)
-  expect_equal(length(takeRDD(numVectorRDD, 0)), 0)
-})
-
-sparkR.session.stop()

http://git-wip-us.apache.org/repos/asf/spark/blob/ae4e8ae4/R/pkg/inst/tests/testthat/test_textFile.R
--
diff --git a/R/pkg/inst/tests/testthat/test_textFile.R 
b/R/pkg/inst/tests/testthat/test_textFile.R
deleted file mode 100644
index 3b46606..000
--- a/R/pkg/inst/tests/testthat/test_textFile.R
+++ /dev/null
@@ -1,164 +0,0 @@
-#
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements.  See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License.  You may obtain a copy of the License at
-#
-#http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# l

[7/9] spark git commit: [SPARKR][BACKPORT-2.1] backporting package and test changes

2017-09-10 Thread felixcheung
http://git-wip-us.apache.org/repos/asf/spark/blob/ae4e8ae4/R/pkg/inst/tests/testthat/test_shuffle.R
--
diff --git a/R/pkg/inst/tests/testthat/test_shuffle.R 
b/R/pkg/inst/tests/testthat/test_shuffle.R
deleted file mode 100644
index d38efab..000
--- a/R/pkg/inst/tests/testthat/test_shuffle.R
+++ /dev/null
@@ -1,224 +0,0 @@
-#
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements.  See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License.  You may obtain a copy of the License at
-#
-#http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-context("partitionBy, groupByKey, reduceByKey etc.")
-
-# JavaSparkContext handle
-sparkSession <- sparkR.session(enableHiveSupport = FALSE)
-sc <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", 
"getJavaSparkContext", sparkSession)
-
-# Data
-intPairs <- list(list(1L, -1), list(2L, 100), list(2L, 1), list(1L, 200))
-intRdd <- parallelize(sc, intPairs, 2L)
-
-doublePairs <- list(list(1.5, -1), list(2.5, 100), list(2.5, 1), list(1.5, 
200))
-doubleRdd <- parallelize(sc, doublePairs, 2L)
-
-numPairs <- list(list(1L, 100), list(2L, 200), list(4L, -1), list(3L, 1),
- list(3L, 0))
-numPairsRdd <- parallelize(sc, numPairs, length(numPairs))
-
-strList <- list("Dexter Morgan: Blood. Sometimes it sets my teeth on edge and 
",
-"Dexter Morgan: Harry and Dorris Morgan did a wonderful job ")
-strListRDD <- parallelize(sc, strList, 4)
-
-test_that("groupByKey for integers", {
-  grouped <- groupByKey(intRdd, 2L)
-
-  actual <- collectRDD(grouped)
-
-  expected <- list(list(2L, list(100, 1)), list(1L, list(-1, 200)))
-  expect_equal(sortKeyValueList(actual), sortKeyValueList(expected))
-})
-
-test_that("groupByKey for doubles", {
-  grouped <- groupByKey(doubleRdd, 2L)
-
-  actual <- collectRDD(grouped)
-
-  expected <- list(list(1.5, list(-1, 200)), list(2.5, list(100, 1)))
-  expect_equal(sortKeyValueList(actual), sortKeyValueList(expected))
-})
-
-test_that("reduceByKey for ints", {
-  reduced <- reduceByKey(intRdd, "+", 2L)
-
-  actual <- collectRDD(reduced)
-
-  expected <- list(list(2L, 101), list(1L, 199))
-  expect_equal(sortKeyValueList(actual), sortKeyValueList(expected))
-})
-
-test_that("reduceByKey for doubles", {
-  reduced <- reduceByKey(doubleRdd, "+", 2L)
-  actual <- collectRDD(reduced)
-
-  expected <- list(list(1.5, 199), list(2.5, 101))
-  expect_equal(sortKeyValueList(actual), sortKeyValueList(expected))
-})
-
-test_that("combineByKey for ints", {
-  reduced <- combineByKey(intRdd, function(x) { x }, "+", "+", 2L)
-
-  actual <- collectRDD(reduced)
-
-  expected <- list(list(2L, 101), list(1L, 199))
-  expect_equal(sortKeyValueList(actual), sortKeyValueList(expected))
-})
-
-test_that("combineByKey for doubles", {
-  reduced <- combineByKey(doubleRdd, function(x) { x }, "+", "+", 2L)
-  actual <- collectRDD(reduced)
-
-  expected <- list(list(1.5, 199), list(2.5, 101))
-  expect_equal(sortKeyValueList(actual), sortKeyValueList(expected))
-})
-
-test_that("combineByKey for characters", {
-  stringKeyRDD <- parallelize(sc,
-  list(list("max", 1L), list("min", 2L),
-   list("other", 3L), list("max", 4L)), 2L)
-  reduced <- combineByKey(stringKeyRDD,
-  function(x) { x }, "+", "+", 2L)
-  actual <- collectRDD(reduced)
-
-  expected <- list(list("max", 5L), list("min", 2L), list("other", 3L))
-  expect_equal(sortKeyValueList(actual), sortKeyValueList(expected))
-})
-
-test_that("aggregateByKey", {
-  # test aggregateByKey for int keys
-  rdd <- parallelize(sc, list(list(1, 1), list(1, 2), list(2, 3), list(2, 4)))
-
-  zeroValue <- list(0, 0)
-  seqOp <- function(x, y) { list(x[[1]] + y, x[[2]] + 1) }
-  combOp <- function(x, y) { list(x[[1]] + y[[1]], x[[2]] + y[[2]]) }
-  aggregatedRDD <- aggregateByKey(rdd, zeroValue, seqOp, combOp, 2L)
-
-  actual <- collectRDD(aggregatedRDD)
-
-  expected <- list(list(1, list(3, 2)), list(2, list(7, 2)))
-  expect_equal(sortKeyValueList(actual), sortKeyValueList(expected))
-
-  # test aggregateByKey for string keys
-  rdd <- parallelize(sc, list(list("a", 1), list("a", 2), list("b", 3), 
list("b", 4)))
-
-  zeroValue <- list(0, 0)
-  seqOp <- function(x, y) { list(x[[1]] + y, x[[2]] + 1) }
-  combOp <- function(x, y) { list(x[[1]] + y[[1]], x[[2]] + y[[2]]) }
-  aggregate

[8/9] spark git commit: [SPARKR][BACKPORT-2.1] backporting package and test changes

2017-09-10 Thread felixcheung
http://git-wip-us.apache.org/repos/asf/spark/blob/ae4e8ae4/R/pkg/inst/tests/testthat/test_mllib.R
--
diff --git a/R/pkg/inst/tests/testthat/test_mllib.R 
b/R/pkg/inst/tests/testthat/test_mllib.R
deleted file mode 100644
index 8fe3a87..000
--- a/R/pkg/inst/tests/testthat/test_mllib.R
+++ /dev/null
@@ -1,1205 +0,0 @@
-#
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements.  See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License.  You may obtain a copy of the License at
-#
-#http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-library(testthat)
-
-context("MLlib functions")
-
-# Tests for MLlib functions in SparkR
-sparkSession <- sparkR.session(enableHiveSupport = FALSE)
-
-absoluteSparkPath <- function(x) {
-  sparkHome <- sparkR.conf("spark.home")
-  file.path(sparkHome, x)
-}
-
-test_that("formula of spark.glm", {
-  training <- suppressWarnings(createDataFrame(iris))
-  # directly calling the spark API
-  # dot minus and intercept vs native glm
-  model <- spark.glm(training, Sepal_Width ~ . - Species + 0)
-  vals <- collect(select(predict(model, training), "prediction"))
-  rVals <- predict(glm(Sepal.Width ~ . - Species + 0, data = iris), iris)
-  expect_true(all(abs(rVals - vals) < 1e-6), rVals - vals)
-
-  # feature interaction vs native glm
-  model <- spark.glm(training, Sepal_Width ~ Species:Sepal_Length)
-  vals <- collect(select(predict(model, training), "prediction"))
-  rVals <- predict(glm(Sepal.Width ~ Species:Sepal.Length, data = iris), iris)
-  expect_true(all(abs(rVals - vals) < 1e-6), rVals - vals)
-
-  # glm should work with long formula
-  training <- suppressWarnings(createDataFrame(iris))
-  training$LongLongLongLongLongName <- training$Sepal_Width
-  training$VeryLongLongLongLonLongName <- training$Sepal_Length
-  training$AnotherLongLongLongLongName <- training$Species
-  model <- spark.glm(training, LongLongLongLongLongName ~ 
VeryLongLongLongLonLongName +
-AnotherLongLongLongLongName)
-  vals <- collect(select(predict(model, training), "prediction"))
-  rVals <- predict(glm(Sepal.Width ~ Sepal.Length + Species, data = iris), 
iris)
-  expect_true(all(abs(rVals - vals) < 1e-6), rVals - vals)
-})
-
-test_that("spark.glm and predict", {
-  training <- suppressWarnings(createDataFrame(iris))
-  # gaussian family
-  model <- spark.glm(training, Sepal_Width ~ Sepal_Length + Species)
-  prediction <- predict(model, training)
-  expect_equal(typeof(take(select(prediction, "prediction"), 1)$prediction), 
"double")
-  vals <- collect(select(prediction, "prediction"))
-  rVals <- predict(glm(Sepal.Width ~ Sepal.Length + Species, data = iris), 
iris)
-  expect_true(all(abs(rVals - vals) < 1e-6), rVals - vals)
-
-  # poisson family
-  model <- spark.glm(training, Sepal_Width ~ Sepal_Length + Species,
-  family = poisson(link = identity))
-  prediction <- predict(model, training)
-  expect_equal(typeof(take(select(prediction, "prediction"), 1)$prediction), 
"double")
-  vals <- collect(select(prediction, "prediction"))
-  rVals <- suppressWarnings(predict(glm(Sepal.Width ~ Sepal.Length + Species,
-  data = iris, family = poisson(link = identity)), iris))
-  expect_true(all(abs(rVals - vals) < 1e-6), rVals - vals)
-
-  # Gamma family
-  x <- runif(100, -1, 1)
-  y <- rgamma(100, rate = 10 / exp(0.5 + 1.2 * x), shape = 10)
-  df <- as.DataFrame(as.data.frame(list(x = x, y = y)))
-  model <- glm(y ~ x, family = Gamma, df)
-  out <- capture.output(print(summary(model)))
-  expect_true(any(grepl("Dispersion parameter for gamma family", out)))
-
-  # Test stats::predict is working
-  x <- rnorm(15)
-  y <- x + rnorm(15)
-  expect_equal(length(predict(lm(y ~ x))), 15)
-})
-
-test_that("spark.glm summary", {
-  # gaussian family
-  training <- suppressWarnings(createDataFrame(iris))
-  stats <- summary(spark.glm(training, Sepal_Width ~ Sepal_Length + Species))
-
-  rStats <- summary(glm(Sepal.Width ~ Sepal.Length + Species, data = iris))
-
-  coefs <- unlist(stats$coefficients)
-  rCoefs <- unlist(rStats$coefficients)
-  expect_true(all(abs(rCoefs - coefs) < 1e-4))
-  expect_true(all(
-rownames(stats$coefficients) ==
-c("(Intercept)", "Sepal_Length", "Species_versicolor", 
"Species_virginica")))
-  expect_equal(stats$dispersion, rStats$dispersion)
-  expect_equal(stats$null.deviance, rStats$null.deviance)
-  expect_equal(stats$deviance, rSta

[4/9] spark git commit: [SPARKR][BACKPORT-2.1] backporting package and test changes

2017-09-10 Thread felixcheung
http://git-wip-us.apache.org/repos/asf/spark/blob/ae4e8ae4/R/pkg/tests/fulltests/test_mllib.R
--
diff --git a/R/pkg/tests/fulltests/test_mllib.R 
b/R/pkg/tests/fulltests/test_mllib.R
new file mode 100644
index 000..8fe3a87
--- /dev/null
+++ b/R/pkg/tests/fulltests/test_mllib.R
@@ -0,0 +1,1205 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+library(testthat)
+
+context("MLlib functions")
+
+# Tests for MLlib functions in SparkR
+sparkSession <- sparkR.session(enableHiveSupport = FALSE)
+
+absoluteSparkPath <- function(x) {
+  sparkHome <- sparkR.conf("spark.home")
+  file.path(sparkHome, x)
+}
+
+test_that("formula of spark.glm", {
+  training <- suppressWarnings(createDataFrame(iris))
+  # directly calling the spark API
+  # dot minus and intercept vs native glm
+  model <- spark.glm(training, Sepal_Width ~ . - Species + 0)
+  vals <- collect(select(predict(model, training), "prediction"))
+  rVals <- predict(glm(Sepal.Width ~ . - Species + 0, data = iris), iris)
+  expect_true(all(abs(rVals - vals) < 1e-6), rVals - vals)
+
+  # feature interaction vs native glm
+  model <- spark.glm(training, Sepal_Width ~ Species:Sepal_Length)
+  vals <- collect(select(predict(model, training), "prediction"))
+  rVals <- predict(glm(Sepal.Width ~ Species:Sepal.Length, data = iris), iris)
+  expect_true(all(abs(rVals - vals) < 1e-6), rVals - vals)
+
+  # glm should work with long formula
+  training <- suppressWarnings(createDataFrame(iris))
+  training$LongLongLongLongLongName <- training$Sepal_Width
+  training$VeryLongLongLongLonLongName <- training$Sepal_Length
+  training$AnotherLongLongLongLongName <- training$Species
+  model <- spark.glm(training, LongLongLongLongLongName ~ 
VeryLongLongLongLonLongName +
+AnotherLongLongLongLongName)
+  vals <- collect(select(predict(model, training), "prediction"))
+  rVals <- predict(glm(Sepal.Width ~ Sepal.Length + Species, data = iris), 
iris)
+  expect_true(all(abs(rVals - vals) < 1e-6), rVals - vals)
+})
+
+test_that("spark.glm and predict", {
+  training <- suppressWarnings(createDataFrame(iris))
+  # gaussian family
+  model <- spark.glm(training, Sepal_Width ~ Sepal_Length + Species)
+  prediction <- predict(model, training)
+  expect_equal(typeof(take(select(prediction, "prediction"), 1)$prediction), 
"double")
+  vals <- collect(select(prediction, "prediction"))
+  rVals <- predict(glm(Sepal.Width ~ Sepal.Length + Species, data = iris), 
iris)
+  expect_true(all(abs(rVals - vals) < 1e-6), rVals - vals)
+
+  # poisson family
+  model <- spark.glm(training, Sepal_Width ~ Sepal_Length + Species,
+  family = poisson(link = identity))
+  prediction <- predict(model, training)
+  expect_equal(typeof(take(select(prediction, "prediction"), 1)$prediction), 
"double")
+  vals <- collect(select(prediction, "prediction"))
+  rVals <- suppressWarnings(predict(glm(Sepal.Width ~ Sepal.Length + Species,
+  data = iris, family = poisson(link = identity)), iris))
+  expect_true(all(abs(rVals - vals) < 1e-6), rVals - vals)
+
+  # Gamma family
+  x <- runif(100, -1, 1)
+  y <- rgamma(100, rate = 10 / exp(0.5 + 1.2 * x), shape = 10)
+  df <- as.DataFrame(as.data.frame(list(x = x, y = y)))
+  model <- glm(y ~ x, family = Gamma, df)
+  out <- capture.output(print(summary(model)))
+  expect_true(any(grepl("Dispersion parameter for gamma family", out)))
+
+  # Test stats::predict is working
+  x <- rnorm(15)
+  y <- x + rnorm(15)
+  expect_equal(length(predict(lm(y ~ x))), 15)
+})
+
+test_that("spark.glm summary", {
+  # gaussian family
+  training <- suppressWarnings(createDataFrame(iris))
+  stats <- summary(spark.glm(training, Sepal_Width ~ Sepal_Length + Species))
+
+  rStats <- summary(glm(Sepal.Width ~ Sepal.Length + Species, data = iris))
+
+  coefs <- unlist(stats$coefficients)
+  rCoefs <- unlist(rStats$coefficients)
+  expect_true(all(abs(rCoefs - coefs) < 1e-4))
+  expect_true(all(
+rownames(stats$coefficients) ==
+c("(Intercept)", "Sepal_Length", "Species_versicolor", 
"Species_virginica")))
+  expect_equal(stats$dispersion, rStats$dispersion)
+  expect_equal(stats$null.deviance, rStats$null.deviance)
+  expect_equal(stats$deviance, rStats$deviance)
+  expe

spark git commit: [SPARK-21610][SQL] Corrupt records are not handled properly when creating a dataframe from a file

2017-09-10 Thread lixiao
Repository: spark
Updated Branches:
  refs/heads/master 520d92a19 -> 6273a711b


[SPARK-21610][SQL] Corrupt records are not handled properly when creating a 
dataframe from a file

## What changes were proposed in this pull request?
```
echo '{"field": 1}
{"field": 2}
{"field": "3"}' >/tmp/sample.json
```

```scala
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("field", ByteType)
  .add("_corrupt_record", StringType)

val file = "/tmp/sample.json"

val dfFromFile = spark.read.schema(schema).json(file)

scala> dfFromFile.show(false)
+-----+---------------+
|field|_corrupt_record|
+-----+---------------+
|1    |null           |
|2    |null           |
|null |{"field": "3"} |
+-----+---------------+

scala> dfFromFile.filter($"_corrupt_record".isNotNull).count()
res1: Long = 0

scala> dfFromFile.filter($"_corrupt_record".isNull).count()
res2: Long = 3
```
When the `requiredSchema` only contains `_corrupt_record`, the derived
`actualSchema` is empty and `_corrupt_record` is null for all rows. This PR
detects that situation and raises an exception with a reasonable workaround
message so that users know what happened and how to fix the query.
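
For illustration (an editorial sketch, not part of the patch), the workaround
suggested in the migration-guide note added below, assuming a spark-shell
session so that `spark` and `$` are in scope: cache (or save) the fully parsed
result first, then run the `_corrupt_record` queries against it.

```scala
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("field", ByteType)
  .add("_corrupt_record", StringType)

// Cache the parsed DataFrame before querying only the corrupt record column.
val df = spark.read.schema(schema).json("/tmp/sample.json").cache()
df.filter($"_corrupt_record".isNotNull).count()
df.select("_corrupt_record").show(false)
```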

## How was this patch tested?

Added test case.

Author: Jen-Ming Chung 

Closes #18865 from jmchung/SPARK-21610.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6273a711
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6273a711
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6273a711

Branch: refs/heads/master
Commit: 6273a711b69139ef0210f59759030a0b4a26b118
Parents: 520d92a
Author: Jen-Ming Chung 
Authored: Sun Sep 10 17:26:43 2017 -0700
Committer: gatorsmile 
Committed: Sun Sep 10 17:26:43 2017 -0700

--
 docs/sql-programming-guide.md                   |  4 ++++
 .../datasources/json/JsonFileFormat.scala       | 14 ++++++++++++++
 .../execution/datasources/json/JsonSuite.scala  | 29 +++++++++++++++++++++++++
 3 files changed, 47 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6273a711/docs/sql-programming-guide.md
--
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 45ba4d1..0a8acbb 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1543,6 +1543,10 @@ options.
 
 # Migration Guide
 
+## Upgrading From Spark SQL 2.2 to 2.3
+
+  - Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when 
the referenced columns only include the internal corrupt record column (named 
`_corrupt_record` by default). For example, 
`spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()`
 and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. 
Instead, you can cache or save the parsed results and then send the same query. 
For example, `val df = spark.read.schema(schema).json(file).cache()` and then 
`df.filter($"_corrupt_record".isNotNull).count()`.
+
 ## Upgrading From Spark SQL 2.1 to 2.2
 
   - Spark 2.1.1 introduced a new configuration key: 
`spark.sql.hive.caseSensitiveInferenceMode`. It had a default setting of 
`NEVER_INFER`, which kept behavior identical to 2.1.0. However, Spark 2.2.0 
changes this setting's default value to `INFER_AND_SAVE` to restore 
compatibility with reading Hive metastore tables whose underlying file schema 
have mixed-case column names. With the `INFER_AND_SAVE` configuration value, on 
first access Spark will perform schema inference on any Hive metastore table 
for which it has not already saved an inferred schema. Note that schema 
inference can be a very time consuming operation for tables with thousands of 
partitions. If compatibility with mixed-case column names is not a concern, you 
can safely set `spark.sql.hive.caseSensitiveInferenceMode` to `NEVER_INFER` to 
avoid the initial overhead of schema inference. Note that with the new default 
`INFER_AND_SAVE` setting, the results of the schema inference are saved as a 
metastore key for future use
 . Therefore, the initial schema inference occurs only at a table's first 
access.

http://git-wip-us.apache.org/repos/asf/spark/blob/6273a711/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala
index 53d62d8..b5ed6e4 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala
@@ -113,6 +113,20

spark git commit: [BUILD][TEST][SPARKR] add sparksubmitsuite to appveyor tests

2017-09-10 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 6273a711b -> 828fab035


[BUILD][TEST][SPARKR] add sparksubmitsuite to appveyor tests

## What changes were proposed in this pull request?

Add one more file path pattern (`SparkSubmitSuite.scala`) to the AppVeyor `only_commits` filter so changes to it also trigger the AppVeyor run.

## How was this patch tested?

Jenkins, AppVeyor

Author: Felix Cheung 

Closes #19177 from felixcheung/rmoduletotest.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/828fab03
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/828fab03
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/828fab03

Branch: refs/heads/master
Commit: 828fab03567ecc245a65c4d295a677ce0ba26c19
Parents: 6273a71
Author: Felix Cheung 
Authored: Mon Sep 11 09:32:25 2017 +0900
Committer: hyukjinkwon 
Committed: Mon Sep 11 09:32:25 2017 +0900

--
 appveyor.yml | 1 +
 1 file changed, 1 insertion(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/828fab03/appveyor.yml
--
diff --git a/appveyor.yml b/appveyor.yml
index 43dad9b..dc2d81f 100644
--- a/appveyor.yml
+++ b/appveyor.yml
@@ -32,6 +32,7 @@ only_commits:
 - sql/core/src/main/scala/org/apache/spark/sql/api/r/
 - core/src/main/scala/org/apache/spark/api/r/
 - mllib/src/main/scala/org/apache/spark/ml/r/
+- core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala
 
 cache:
   - C:\Users\appveyor\.m2

