spark git commit: [SPARK-22268][BUILD] Fix lint-java

2017-10-19 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 5a07aca4d -> 7fae7995b


[SPARK-22268][BUILD] Fix lint-java

## What changes were proposed in this pull request?

Fix Java style issues.

## How was this patch tested?

Ran `./dev/lint-java` locally, since it's not run on Jenkins.

Author: Andrew Ash 

Closes #19486 from ash211/aash/fix-lint-java.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7fae7995
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7fae7995
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7fae7995

Branch: refs/heads/master
Commit: 7fae7995ba05e0333d1decb7ca74ddb7c1b448d7
Parents: 5a07aca
Author: Andrew Ash 
Authored: Fri Oct 20 09:40:00 2017 +0900
Committer: hyukjinkwon 
Committed: Fri Oct 20 09:40:00 2017 +0900

--
 .../unsafe/sort/UnsafeInMemorySorter.java   |  9 +
 .../unsafe/sort/UnsafeExternalSorterSuite.java  | 21 +++-
 .../unsafe/sort/UnsafeInMemorySorterSuite.java  |  3 ++-
 .../v2/reader/SupportsPushDownFilters.java  |  1 -
 4 files changed, 19 insertions(+), 15 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7fae7995/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java
--
diff --git 
a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java
 
b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java
index 869ec90..3bb87a6 100644
--- 
a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java
+++ 
b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java
@@ -172,10 +172,11 @@ public final class UnsafeInMemorySorter {
   public void reset() {
 if (consumer != null) {
   consumer.freeArray(array);
-  // the call to consumer.allocateArray may trigger a spill
-  // which in turn access this instance and eventually re-enter this 
method and try to free the array again.
-  // by setting the array to null and its length to 0 we effectively make 
the spill code-path a no-op.
-  // setting the array to null also indicates that it has already been 
de-allocated which prevents a double de-allocation in free().
+  // the call to consumer.allocateArray may trigger a spill which in turn 
access this instance
+  // and eventually re-enter this method and try to free the array again.  
by setting the array
+  // to null and its length to 0 we effectively make the spill code-path a 
no-op.  setting the
+  // array to null also indicates that it has already been de-allocated 
which prevents a double
+  // de-allocation in free().
   array = null;
   usableCapacity = 0;
   pos = 0;

http://git-wip-us.apache.org/repos/asf/spark/blob/7fae7995/core/src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java
--
diff --git 
a/core/src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java
 
b/core/src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java
index 6c5451d..d0d0334 100644
--- 
a/core/src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java
+++ 
b/core/src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java
@@ -516,12 +516,13 @@ public class UnsafeExternalSorterSuite {
 for (int i = 0; sorter.hasSpaceForAnotherRecord(); ++i) {
   insertNumber(sorter, i);
 }
-// we expect the next insert to attempt growing the pointerssArray
-// first allocation is expected to fail, then a spill is triggered which 
attempts another allocation
-// which also fails and we expect to see this OOM here.
-// the original code messed with a released array within the spill code
-// and ended up with a failed assertion.
-// we also expect the location of the OOM to be 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset
+// we expect the next insert to attempt growing the pointerssArray first
+// allocation is expected to fail, then a spill is triggered which
+// attempts another allocation which also fails and we expect to see this
+// OOM here.  the original code messed with a released array within the
+// spill code and ended up with a failed assertion.  we also expect the
+// location of the OOM to be
+// org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset
 memoryManager.markconsequentOOM(2);
 try {
   insertNumber(sorter, 1024);
@@ -530,9 +531,11 

spark git commit: [SPARK-17902][R] Revive stringsAsFactors option for collect() in SparkR

2017-10-26 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 3073344a2 -> a83d8d5ad


[SPARK-17902][R] Revive stringsAsFactors option for collect() in SparkR

## What changes were proposed in this pull request?

This PR proposes to revive the `stringsAsFactors` option in the collect API, which was 
mistakenly removed in 
https://github.com/apache/spark/commit/71a138cd0e0a14e8426f97877e3b52a562bbd02c.

Simply, it casts `character` to `factor` when the condition 
`stringsAsFactors && is.character(vec)` is met during primitive type conversion.

## How was this patch tested?

Unit test in `R/pkg/tests/fulltests/test_sparkSQL.R`.

Author: hyukjinkwon 

Closes #19551 from HyukjinKwon/SPARK-17902.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a83d8d5a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a83d8d5a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a83d8d5a

Branch: refs/heads/master
Commit: a83d8d5adcb4e0061e43105767242ba9770dda96
Parents: 3073344
Author: hyukjinkwon 
Authored: Thu Oct 26 20:54:36 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Oct 26 20:54:36 2017 +0900

--
 R/pkg/R/DataFrame.R   | 3 +++
 R/pkg/tests/fulltests/test_sparkSQL.R | 6 ++
 2 files changed, 9 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a83d8d5a/R/pkg/R/DataFrame.R
--
diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index 176bb3b..aaa3349 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -1191,6 +1191,9 @@ setMethod("collect",
 vec <- do.call(c, col)
 stopifnot(class(vec) != "list")
 class(vec) <- PRIMITIVE_TYPES[[colType]]
+if (is.character(vec) && stringsAsFactors) {
+  vec <- as.factor(vec)
+}
 df[[colIndex]] <- vec
   } else {
 df[[colIndex]] <- col

http://git-wip-us.apache.org/repos/asf/spark/blob/a83d8d5a/R/pkg/tests/fulltests/test_sparkSQL.R
--
diff --git a/R/pkg/tests/fulltests/test_sparkSQL.R 
b/R/pkg/tests/fulltests/test_sparkSQL.R
index 4382ef2..0c8118a 100644
--- a/R/pkg/tests/fulltests/test_sparkSQL.R
+++ b/R/pkg/tests/fulltests/test_sparkSQL.R
@@ -499,6 +499,12 @@ test_that("create DataFrame with different data types", {
   expect_equal(collect(df), data.frame(l, stringsAsFactors = FALSE))
 })
 
+test_that("SPARK-17902: collect() with stringsAsFactors enabled", {
+  df <- suppressWarnings(collect(createDataFrame(iris), stringsAsFactors = 
TRUE))
+  expect_equal(class(iris$Species), class(df$Species))
+  expect_equal(iris$Species, df$Species)
+})
+
 test_that("SPARK-17811: can create DataFrame containing NA as date and time", {
   df <- data.frame(
 id = 1:2,





spark git commit: [SPARK-17902][R] Revive stringsAsFactors option for collect() in SparkR

2017-10-26 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 d2dc175a1 -> 24fe7ccba


[SPARK-17902][R] Revive stringsAsFactors option for collect() in SparkR

## What changes were proposed in this pull request?

This PR proposes to revive the `stringsAsFactors` option in the collect API, which was 
mistakenly removed in 
https://github.com/apache/spark/commit/71a138cd0e0a14e8426f97877e3b52a562bbd02c.

Simply, it casts `character` to `factor` when the condition 
`stringsAsFactors && is.character(vec)` is met during primitive type conversion.

## How was this patch tested?

Unit test in `R/pkg/tests/fulltests/test_sparkSQL.R`.

Author: hyukjinkwon 

Closes #19551 from HyukjinKwon/SPARK-17902.

(cherry picked from commit a83d8d5adcb4e0061e43105767242ba9770dda96)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/24fe7ccb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/24fe7ccb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/24fe7ccb

Branch: refs/heads/branch-2.2
Commit: 24fe7ccbacd913c19fa40199fd5511aaf55c6bfa
Parents: d2dc175
Author: hyukjinkwon 
Authored: Thu Oct 26 20:54:36 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Oct 26 20:55:00 2017 +0900

--
 R/pkg/R/DataFrame.R   | 3 +++
 R/pkg/tests/fulltests/test_sparkSQL.R | 6 ++
 2 files changed, 9 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/24fe7ccb/R/pkg/R/DataFrame.R
--
diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index 3859fa8..c0a954d 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -1174,6 +1174,9 @@ setMethod("collect",
 vec <- do.call(c, col)
 stopifnot(class(vec) != "list")
 class(vec) <- PRIMITIVE_TYPES[[colType]]
+if (is.character(vec) && stringsAsFactors) {
+  vec <- as.factor(vec)
+}
 df[[colIndex]] <- vec
   } else {
 df[[colIndex]] <- col

http://git-wip-us.apache.org/repos/asf/spark/blob/24fe7ccb/R/pkg/tests/fulltests/test_sparkSQL.R
--
diff --git a/R/pkg/tests/fulltests/test_sparkSQL.R 
b/R/pkg/tests/fulltests/test_sparkSQL.R
index 12d8fef..50c60fe 100644
--- a/R/pkg/tests/fulltests/test_sparkSQL.R
+++ b/R/pkg/tests/fulltests/test_sparkSQL.R
@@ -483,6 +483,12 @@ test_that("create DataFrame with different data types", {
   expect_equal(collect(df), data.frame(l, stringsAsFactors = FALSE))
 })
 
+test_that("SPARK-17902: collect() with stringsAsFactors enabled", {
+  df <- suppressWarnings(collect(createDataFrame(iris), stringsAsFactors = 
TRUE))
+  expect_equal(class(iris$Species), class(df$Species))
+  expect_equal(iris$Species, df$Species)
+})
+
 test_that("SPARK-17811: can create DataFrame containing NA as date and time", {
   df <- data.frame(
 id = 1:2,





spark git commit: [SPARK-17902][R] Revive stringsAsFactors option for collect() in SparkR

2017-10-26 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 3e77b7481 -> aa023fddb


[SPARK-17902][R] Revive stringsAsFactors option for collect() in SparkR

## What changes were proposed in this pull request?

This PR proposes to revive the `stringsAsFactors` option in the collect API, which was 
mistakenly removed in 
https://github.com/apache/spark/commit/71a138cd0e0a14e8426f97877e3b52a562bbd02c.

Simply, it casts `character` to `factor` when the condition 
`stringsAsFactors && is.character(vec)` is met during primitive type conversion.

## How was this patch tested?

Unit test in `R/pkg/tests/fulltests/test_sparkSQL.R`.

Author: hyukjinkwon 

Closes #19551 from HyukjinKwon/SPARK-17902.

(cherry picked from commit a83d8d5adcb4e0061e43105767242ba9770dda96)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/aa023fdd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/aa023fdd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/aa023fdd

Branch: refs/heads/branch-2.1
Commit: aa023fddb0abb6cf8ded94ac695ba7b0edb02022
Parents: 3e77b74
Author: hyukjinkwon 
Authored: Thu Oct 26 20:54:36 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Oct 26 20:55:14 2017 +0900

--
 R/pkg/R/DataFrame.R   | 3 +++
 R/pkg/tests/fulltests/test_sparkSQL.R | 6 ++
 2 files changed, 9 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/aa023fdd/R/pkg/R/DataFrame.R
--
diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index d0f0979..5899fa8 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -1173,6 +1173,9 @@ setMethod("collect",
 vec <- do.call(c, col)
 stopifnot(class(vec) != "list")
 class(vec) <- PRIMITIVE_TYPES[[colType]]
+if (is.character(vec) && stringsAsFactors) {
+  vec <- as.factor(vec)
+}
 df[[colIndex]] <- vec
   } else {
 df[[colIndex]] <- col

http://git-wip-us.apache.org/repos/asf/spark/blob/aa023fdd/R/pkg/tests/fulltests/test_sparkSQL.R
--
diff --git a/R/pkg/tests/fulltests/test_sparkSQL.R 
b/R/pkg/tests/fulltests/test_sparkSQL.R
index fedca67..0b88e47 100644
--- a/R/pkg/tests/fulltests/test_sparkSQL.R
+++ b/R/pkg/tests/fulltests/test_sparkSQL.R
@@ -417,6 +417,12 @@ test_that("create DataFrame with different data types", {
   expect_equal(collect(df), data.frame(l, stringsAsFactors = FALSE))
 })
 
+test_that("SPARK-17902: collect() with stringsAsFactors enabled", {
+  df <- suppressWarnings(collect(createDataFrame(iris), stringsAsFactors = 
TRUE))
+  expect_equal(class(iris$Species), class(df$Species))
+  expect_equal(iris$Species, df$Species)
+})
+
 test_that("SPARK-17811: can create DataFrame containing NA as date and time", {
   df <- data.frame(
 id = 1:2,





spark git commit: [SPARK-24709][SQL] schema_of_json() - schema inference from an example

2018-07-03 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 5585c5765 -> 776f299fc


[SPARK-24709][SQL] schema_of_json() - schema inference from an example

## What changes were proposed in this pull request?

In this PR, I propose to add a new function, *schema_of_json()*, which infers the 
schema of a JSON string literal. The result of the function is a string containing 
a schema in DDL format.

One of the use cases is combining *schema_of_json()* with *from_json()*. Currently, 
_from_json()_ requires a schema as a mandatory argument. The *schema_of_json()* 
function allows passing a JSON string as an example that has the same schema as the 
first argument of _from_json()_. For instance:

```sql
select from_json(json_column, schema_of_json('{"c1": [0], "c2": [{"c3":0}]}'))
from json_table;
```
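
For reference, a minimal PySpark sketch of the same pattern (a sketch only, assuming 
Spark 2.4+ where `schema_of_json` and the new `Column` schema argument to `from_json` 
are available; the sample DataFrame below is made up to stand in for `json_table`):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, lit, schema_of_json

spark = SparkSession.builder.getOrCreate()

# illustrative stand-in for json_table above
df = spark.createDataFrame([('{"c1": [0], "c2": [{"c3": 0}]}',)], ["json_column"])

# infer a DDL schema from an example literal, then reuse it to parse the column
ddl_schema = schema_of_json(lit('{"c1": [0], "c2": [{"c3": 0}]}'))
parsed = df.select(from_json(df.json_column, ddl_schema).alias("parsed"))
parsed.printSchema()
```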

## How was this patch tested?

Added new tests to `JsonFunctionsSuite` and `JsonExpressionsSuite`, and SQL tests to 
`json-functions.sql`.

Author: Maxim Gekk 

Closes #21686 from MaxGekk/infer_schema_json.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/776f299f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/776f299f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/776f299f

Branch: refs/heads/master
Commit: 776f299fc8146b400e97185b1577b0fc8f06e14b
Parents: 5585c57
Author: Maxim Gekk 
Authored: Wed Jul 4 09:38:18 2018 +0800
Committer: hyukjinkwon 
Committed: Wed Jul 4 09:38:18 2018 +0800

--
 python/pyspark/sql/functions.py |  27 ++
 .../catalyst/analysis/FunctionRegistry.scala|   1 +
 .../catalyst/expressions/jsonExpressions.scala  |  52 ++-
 .../sql/catalyst/json/JsonInferSchema.scala | 348 ++
 .../expressions/JsonExpressionsSuite.scala  |   7 +
 .../datasources/json/JsonDataSource.scala   |   2 +-
 .../datasources/json/JsonInferSchema.scala  | 349 ---
 .../scala/org/apache/spark/sql/functions.scala  |  42 +++
 .../sql-tests/inputs/json-functions.sql |   4 +
 .../sql-tests/results/json-functions.sql.out|  20 +-
 .../apache/spark/sql/JsonFunctionsSuite.scala   |  17 +-
 .../execution/datasources/json/JsonSuite.scala  |   4 +-
 12 files changed, 509 insertions(+), 364 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/776f299f/python/pyspark/sql/functions.py
--
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 9652d3e..4d37197 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -2189,11 +2189,16 @@ def from_json(col, schema, options={}):
 >>> df = spark.createDataFrame(data, ("key", "value"))
 >>> df.select(from_json(df.value, schema).alias("json")).collect()
 [Row(json=[Row(a=1)])]
+>>> schema = schema_of_json(lit('''{"a": 0}'''))
+>>> df.select(from_json(df.value, schema).alias("json")).collect()
+[Row(json=Row(a=1))]
 """
 
 sc = SparkContext._active_spark_context
 if isinstance(schema, DataType):
 schema = schema.json()
+elif isinstance(schema, Column):
+schema = _to_java_column(schema)
 jc = sc._jvm.functions.from_json(_to_java_column(col), schema, options)
 return Column(jc)
 
@@ -2235,6 +2240,28 @@ def to_json(col, options={}):
 return Column(jc)
 
 
+@ignore_unicode_prefix
+@since(2.4)
+def schema_of_json(col):
+"""
+Parses a column containing a JSON string and infers its schema in DDL 
format.
+
+:param col: string column in json format
+
+>>> from pyspark.sql.types import *
+>>> data = [(1, '{"a": 1}')]
+>>> df = spark.createDataFrame(data, ("key", "value"))
+>>> df.select(schema_of_json(df.value).alias("json")).collect()
+[Row(json=u'struct<a:bigint>')]
+>>> df.select(schema_of_json(lit('{"a": 0}')).alias("json")).collect()
+[Row(json=u'struct<a:bigint>')]
+"""
+
+sc = SparkContext._active_spark_context
+jc = sc._jvm.functions.schema_of_json(_to_java_column(col))
+return Column(jc)
+
+
 @since(1.5)
 def size(col):
 """

http://git-wip-us.apache.org/repos/asf/spark/blob/776f299f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
index a574d8a..80a0af6 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
@@ -505,6 +505,7 @@ object FunctionRegistry {
 // json
 

[4/7] spark-website git commit: Fix signature description broken in PySpark API documentation in 2.3.1

2018-07-03 Thread gurwls223
http://git-wip-us.apache.org/repos/asf/spark-website/blob/5660fb9a/site/docs/2.3.1/api/python/pyspark.mllib.html
--
diff --git a/site/docs/2.3.1/api/python/pyspark.mllib.html 
b/site/docs/2.3.1/api/python/pyspark.mllib.html
index c449f16..662b562 100644
--- a/site/docs/2.3.1/api/python/pyspark.mllib.html
+++ b/site/docs/2.3.1/api/python/pyspark.mllib.html
@@ -5,14 +5,14 @@
 http://www.w3.org/1999/xhtml;>
   
 
-pyspark.mllib package  PySpark master documentation
+pyspark.mllib package  PySpark 2.3.1 documentation
 
 
 
 

[1/7] spark-website git commit: Fix signature description broken in PySpark API documentation in 2.3.1

2018-07-03 Thread gurwls223
Repository: spark-website
Updated Branches:
  refs/heads/asf-site 26b527127 -> 5660fb9a4


http://git-wip-us.apache.org/repos/asf/spark-website/blob/5660fb9a/site/docs/2.3.1/api/python/searchindex.js
--
diff --git a/site/docs/2.3.1/api/python/searchindex.js 
b/site/docs/2.3.1/api/python/searchindex.js
index 0a5ec65..b5c8344 100644
--- a/site/docs/2.3.1/api/python/searchindex.js
+++ b/site/docs/2.3.1/api/python/searchindex.js
@@ -1 +1 @@
-Search.setIndex({docnames:["index","pyspark","pyspark.ml","pyspark.mllib","pyspark.sql","pyspark.streaming"],envversion:52,filenames:["index.rst","pyspark.rst","pyspark.ml.rst","pyspark.mllib.rst","pyspark.sql.rst","pyspark.streaming.rst"],objects:{"":{pyspark:[1,0,0,"-"]},"pyspark.Accumulator":{add:[1,2,1,""],value:[1,3,1,""]},"pyspark.AccumulatorParam":{addInPlace:[1,2,1,""],zero:[1,2,1,""]},"pyspark.BasicProfiler":{profile:[1,2,1,""],stats:[1,2,1,""]},"pyspark.Broadcast":{destroy:[1,2,1,""],dump:[1,2,1,""],load:[1,2,1,""],unpersist:[1,2,1,""],value:[1,3,1,""]},"pyspark.MarshalSerializer":{dumps:[1,2,1,""],loads:[1,2,1,""]},"pyspark.PickleSerializer":{dumps:[1,2,1,""],loads:[1,2,1,""]},"pyspark.Profiler":{dump:[1,2,1,""],profile:[1,2,1,""],show:[1,2,1,""],stats:[1,2,1,""]},"pyspark.RDD":{aggregate:[1,2,1,""],aggregateByKey:[1,2,1,""],cache:[1,2,1,""],cartesian:[1,2,1,""],checkpoint:[1,2,1,""],coalesce:[1,2,1,""],cogroup:[1,2,1,""],collect:[1,2,1,""],collectAsMap:[1,2,1,""],combine
 
ByKey:[1,2,1,""],context:[1,3,1,""],count:[1,2,1,""],countApprox:[1,2,1,""],countApproxDistinct:[1,2,1,""],countByKey:[1,2,1,""],countByValue:[1,2,1,""],distinct:[1,2,1,""],filter:[1,2,1,""],first:[1,2,1,""],flatMap:[1,2,1,""],flatMapValues:[1,2,1,""],fold:[1,2,1,""],foldByKey:[1,2,1,""],foreach:[1,2,1,""],foreachPartition:[1,2,1,""],fullOuterJoin:[1,2,1,""],getCheckpointFile:[1,2,1,""],getNumPartitions:[1,2,1,""],getStorageLevel:[1,2,1,""],glom:[1,2,1,""],groupBy:[1,2,1,""],groupByKey:[1,2,1,""],groupWith:[1,2,1,""],histogram:[1,2,1,""],id:[1,2,1,""],intersection:[1,2,1,""],isCheckpointed:[1,2,1,""],isEmpty:[1,2,1,""],isLocallyCheckpointed:[1,2,1,""],join:[1,2,1,""],keyBy:[1,2,1,""],keys:[1,2,1,""],leftOuterJoin:[1,2,1,""],localCheckpoint:[1,2,1,""],lookup:[1,2,1,""],map:[1,2,1,""],mapPartitions:[1,2,1,""],mapPartitionsWithIndex:[1,2,1,""],mapPartitionsWithSplit:[1,2,1,""],mapValues:[1,2,1,""],max:[1,2,1,""],mean:[1,2,1,""],meanApprox:[1,2,1,""],min:[1,2,1,""],name:[1,2,1,""],parti
 
tionBy:[1,2,1,""],persist:[1,2,1,""],pipe:[1,2,1,""],randomSplit:[1,2,1,""],reduce:[1,2,1,""],reduceByKey:[1,2,1,""],reduceByKeyLocally:[1,2,1,""],repartition:[1,2,1,""],repartitionAndSortWithinPartitions:[1,2,1,""],rightOuterJoin:[1,2,1,""],sample:[1,2,1,""],sampleByKey:[1,2,1,""],sampleStdev:[1,2,1,""],sampleVariance:[1,2,1,""],saveAsHadoopDataset:[1,2,1,""],saveAsHadoopFile:[1,2,1,""],saveAsNewAPIHadoopDataset:[1,2,1,""],saveAsNewAPIHadoopFile:[1,2,1,""],saveAsPickleFile:[1,2,1,""],saveAsSequenceFile:[1,2,1,""],saveAsTextFile:[1,2,1,""],setName:[1,2,1,""],sortBy:[1,2,1,""],sortByKey:[1,2,1,""],stats:[1,2,1,""],stdev:[1,2,1,""],subtract:[1,2,1,""],subtractByKey:[1,2,1,""],sum:[1,2,1,""],sumApprox:[1,2,1,""],take:[1,2,1,""],takeOrdered:[1,2,1,""],takeSample:[1,2,1,""],toDebugString:[1,2,1,""],toLocalIterator:[1,2,1,""],top:[1,2,1,""],treeAggregate:[1,2,1,""],treeReduce:[1,2,1,""],union:[1,2,1,""],unpersist:[1,2,1,""],values:[1,2,1,""],variance:[1,2,1,""],zip:[1,2,1,""],zipWithIndex
 
:[1,2,1,""],zipWithUniqueId:[1,2,1,""]},"pyspark.SparkConf":{contains:[1,2,1,""],get:[1,2,1,""],getAll:[1,2,1,""],set:[1,2,1,""],setAll:[1,2,1,""],setAppName:[1,2,1,""],setExecutorEnv:[1,2,1,""],setIfMissing:[1,2,1,""],setMaster:[1,2,1,""],setSparkHome:[1,2,1,""],toDebugString:[1,2,1,""]},"pyspark.SparkContext":{PACKAGE_EXTENSIONS:[1,3,1,""],accumulator:[1,2,1,""],addFile:[1,2,1,""],addPyFile:[1,2,1,""],applicationId:[1,3,1,""],binaryFiles:[1,2,1,""],binaryRecords:[1,2,1,""],broadcast:[1,2,1,""],cancelAllJobs:[1,2,1,""],cancelJobGroup:[1,2,1,""],defaultMinPartitions:[1,3,1,""],defaultParallelism:[1,3,1,""],dump_profiles:[1,2,1,""],emptyRDD:[1,2,1,""],getConf:[1,2,1,""],getLocalProperty:[1,2,1,""],getOrCreate:[1,4,1,""],hadoopFile:[1,2,1,""],hadoopRDD:[1,2,1,""],newAPIHadoopFile:[1,2,1,""],newAPIHadoopRDD:[1,2,1,""],parallelize:[1,2,1,""],pickleFile:[1,2,1,""],range:[1,2,1,""],runJob:[1,2,1,""],sequenceFile:[1,2,1,""],setCheckpointDir:[1,2,1,""],setJobDescription:[1,2,1,""],setJobGro
 

[3/7] spark-website git commit: Fix signature description broken in PySpark API documentation in 2.3.1

2018-07-03 Thread gurwls223
http://git-wip-us.apache.org/repos/asf/spark-website/blob/5660fb9a/site/docs/2.3.1/api/python/pyspark.sql.html
--
diff --git a/site/docs/2.3.1/api/python/pyspark.sql.html 
b/site/docs/2.3.1/api/python/pyspark.sql.html
index 43c51be..6716867 100644
--- a/site/docs/2.3.1/api/python/pyspark.sql.html
+++ b/site/docs/2.3.1/api/python/pyspark.sql.html
@@ -5,14 +5,14 @@
 http://www.w3.org/1999/xhtml;>
   
 
-pyspark.sql module  PySpark master documentation
+pyspark.sql module  PySpark 2.3.1 documentation
 
 
 
 

[2/7] spark-website git commit: Fix signature description broken in PySpark API documentation in 2.3.1

2018-07-03 Thread gurwls223
http://git-wip-us.apache.org/repos/asf/spark-website/blob/5660fb9a/site/docs/2.3.1/api/python/pyspark.streaming.html
--
diff --git a/site/docs/2.3.1/api/python/pyspark.streaming.html 
b/site/docs/2.3.1/api/python/pyspark.streaming.html
index 7f1dee5..411799a 100644
--- a/site/docs/2.3.1/api/python/pyspark.streaming.html
+++ b/site/docs/2.3.1/api/python/pyspark.streaming.html
@@ -5,14 +5,14 @@
 http://www.w3.org/1999/xhtml;>
   
 
-pyspark.streaming module  PySpark master 
documentation
+pyspark.streaming module  PySpark 2.3.1 documentation
 
 
 
 

[6/7] spark-website git commit: Fix signature description broken in PySpark API documentation in 2.3.1

2018-07-03 Thread gurwls223
http://git-wip-us.apache.org/repos/asf/spark-website/blob/5660fb9a/site/docs/2.3.1/api/python/_modules/pyspark/profiler.html
--
diff --git a/site/docs/2.3.1/api/python/_modules/pyspark/profiler.html 
b/site/docs/2.3.1/api/python/_modules/pyspark/profiler.html
index b7ac6ff..84aa845 100644
--- a/site/docs/2.3.1/api/python/_modules/pyspark/profiler.html
+++ b/site/docs/2.3.1/api/python/_modules/pyspark/profiler.html
@@ -5,14 +5,14 @@
 http://www.w3.org/1999/xhtml;>
   
 
-pyspark.profiler  PySpark master documentation
+pyspark.profiler  PySpark 2.3.1 documentation
 
 
 
 

[5/7] spark-website git commit: Fix signature description broken in PySpark API documentation in 2.3.1

2018-07-03 Thread gurwls223
http://git-wip-us.apache.org/repos/asf/spark-website/blob/5660fb9a/site/docs/2.3.1/api/python/pyspark.ml.html
--
diff --git a/site/docs/2.3.1/api/python/pyspark.ml.html 
b/site/docs/2.3.1/api/python/pyspark.ml.html
index 4ada723..986c949 100644
--- a/site/docs/2.3.1/api/python/pyspark.ml.html
+++ b/site/docs/2.3.1/api/python/pyspark.ml.html
@@ -5,14 +5,14 @@
 http://www.w3.org/1999/xhtml;>
   
 
-pyspark.ml package  PySpark master documentation
+pyspark.ml package  PySpark 2.3.1 documentation
 
 
 
 

[7/7] spark-website git commit: Fix signature description broken in PySpark API documentation in 2.2.1

2018-07-03 Thread gurwls223
Fix signature description broken in PySpark API documentation in 2.2.1


Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/26b52712
Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/26b52712
Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/26b52712

Branch: refs/heads/asf-site
Commit: 26b5271279a72e7d78948abf96f69ea3a99db209
Parents: 8857572
Author: hyukjinkwon 
Authored: Tue Jul 3 01:53:07 2018 +0800
Committer: hyukjinkwon 
Committed: Wed Jul 4 12:40:02 2018 +0800

--
 site/docs/2.2.1/api/python/_modules/index.html  |   8 +-
 .../python/_modules/pyspark/accumulators.html   |   8 +-
 .../api/python/_modules/pyspark/broadcast.html  |   8 +-
 .../2.2.1/api/python/_modules/pyspark/conf.html |   8 +-
 .../api/python/_modules/pyspark/context.html|   8 +-
 .../api/python/_modules/pyspark/files.html  |   8 +-
 .../api/python/_modules/pyspark/ml/base.html|   8 +-
 .../_modules/pyspark/ml/classification.html |   8 +-
 .../python/_modules/pyspark/ml/clustering.html  |   8 +-
 .../python/_modules/pyspark/ml/evaluation.html  |   8 +-
 .../api/python/_modules/pyspark/ml/feature.html |   8 +-
 .../api/python/_modules/pyspark/ml/fpm.html |   8 +-
 .../api/python/_modules/pyspark/ml/linalg.html  |   8 +-
 .../api/python/_modules/pyspark/ml/param.html   |   8 +-
 .../_modules/pyspark/ml/param/shared.html   |   8 +-
 .../python/_modules/pyspark/ml/pipeline.html|   8 +-
 .../_modules/pyspark/ml/recommendation.html |   8 +-
 .../python/_modules/pyspark/ml/regression.html  |   8 +-
 .../api/python/_modules/pyspark/ml/stat.html|   8 +-
 .../api/python/_modules/pyspark/ml/tuning.html  |   8 +-
 .../api/python/_modules/pyspark/ml/util.html|   8 +-
 .../api/python/_modules/pyspark/ml/wrapper.html |   8 +-
 .../_modules/pyspark/mllib/classification.html  |   8 +-
 .../_modules/pyspark/mllib/clustering.html  |   8 +-
 .../python/_modules/pyspark/mllib/common.html   |   8 +-
 .../_modules/pyspark/mllib/evaluation.html  |   8 +-
 .../python/_modules/pyspark/mllib/feature.html  |   8 +-
 .../api/python/_modules/pyspark/mllib/fpm.html  |   8 +-
 .../python/_modules/pyspark/mllib/linalg.html   |   8 +-
 .../pyspark/mllib/linalg/distributed.html   |   8 +-
 .../python/_modules/pyspark/mllib/random.html   |   8 +-
 .../_modules/pyspark/mllib/recommendation.html  |   8 +-
 .../_modules/pyspark/mllib/regression.html  |   8 +-
 .../pyspark/mllib/stat/KernelDensity.html   |   8 +-
 .../pyspark/mllib/stat/distribution.html|   8 +-
 .../_modules/pyspark/mllib/stat/test.html   |   8 +-
 .../api/python/_modules/pyspark/mllib/tree.html |   8 +-
 .../api/python/_modules/pyspark/mllib/util.html |   8 +-
 .../api/python/_modules/pyspark/profiler.html   |   8 +-
 .../2.2.1/api/python/_modules/pyspark/rdd.html  |   8 +-
 .../python/_modules/pyspark/serializers.html|   8 +-
 .../api/python/_modules/pyspark/sql/column.html |   8 +-
 .../python/_modules/pyspark/sql/context.html|   8 +-
 .../python/_modules/pyspark/sql/dataframe.html  |   8 +-
 .../python/_modules/pyspark/sql/functions.html  |   8 +-
 .../api/python/_modules/pyspark/sql/group.html  |   8 +-
 .../python/_modules/pyspark/sql/readwriter.html |   8 +-
 .../python/_modules/pyspark/sql/session.html|   8 +-
 .../python/_modules/pyspark/sql/streaming.html  |   8 +-
 .../api/python/_modules/pyspark/sql/types.html  |   8 +-
 .../api/python/_modules/pyspark/sql/window.html |   8 +-
 .../api/python/_modules/pyspark/status.html |   8 +-
 .../python/_modules/pyspark/storagelevel.html   |   8 +-
 .../_modules/pyspark/streaming/context.html |   8 +-
 .../_modules/pyspark/streaming/dstream.html |   8 +-
 .../_modules/pyspark/streaming/flume.html   |   8 +-
 .../_modules/pyspark/streaming/kafka.html   |   8 +-
 .../_modules/pyspark/streaming/kinesis.html |   8 +-
 .../_modules/pyspark/streaming/listener.html|   8 +-
 .../python/_modules/pyspark/taskcontext.html|   8 +-
 site/docs/2.2.1/api/python/index.html   |   8 +-
 site/docs/2.2.1/api/python/pyspark.html |  30 +-
 site/docs/2.2.1/api/python/pyspark.ml.html  | 164 +--
 site/docs/2.2.1/api/python/pyspark.mllib.html   |  36 +--
 site/docs/2.2.1/api/python/pyspark.sql.html | 272 +--
 .../2.2.1/api/python/pyspark.streaming.html |  11 +-
 site/docs/2.2.1/api/python/search.html  |   8 +-
 site/docs/2.2.1/api/python/searchindex.js   |   2 +-
 68 files changed, 506 insertions(+), 505 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark-website/blob/26b52712/site/docs/2.2.1/api/python/_modules/index.html
--
diff --git 

[5/7] spark-website git commit: Fix signature description broken in PySpark API documentation in 2.2.1

2018-07-03 Thread gurwls223
http://git-wip-us.apache.org/repos/asf/spark-website/blob/26b52712/site/docs/2.2.1/api/python/pyspark.ml.html
--
diff --git a/site/docs/2.2.1/api/python/pyspark.ml.html 
b/site/docs/2.2.1/api/python/pyspark.ml.html
index 1398703..a5757cd 100644
--- a/site/docs/2.2.1/api/python/pyspark.ml.html
+++ b/site/docs/2.2.1/api/python/pyspark.ml.html
@@ -5,14 +5,14 @@
 http://www.w3.org/1999/xhtml;>
   
 
-pyspark.ml package  PySpark  documentation
+pyspark.ml package  PySpark 2.2.1 documentation
 
 
 
 

[6/7] spark-website git commit: Fix signature description broken in PySpark API documentation in 2.2.1

2018-07-03 Thread gurwls223
http://git-wip-us.apache.org/repos/asf/spark-website/blob/26b52712/site/docs/2.2.1/api/python/_modules/pyspark/rdd.html
--
diff --git a/site/docs/2.2.1/api/python/_modules/pyspark/rdd.html 
b/site/docs/2.2.1/api/python/_modules/pyspark/rdd.html
index ee22d01..17adf92 100644
--- a/site/docs/2.2.1/api/python/_modules/pyspark/rdd.html
+++ b/site/docs/2.2.1/api/python/_modules/pyspark/rdd.html
@@ -5,14 +5,14 @@
 http://www.w3.org/1999/xhtml;>
   
 
-pyspark.rdd  PySpark  documentation
+pyspark.rdd  PySpark 2.2.1 documentation
 
 
 
 

[2/7] spark-website git commit: Fix signature description broken in PySpark API documentation in 2.2.1

2018-07-03 Thread gurwls223
http://git-wip-us.apache.org/repos/asf/spark-website/blob/26b52712/site/docs/2.2.1/api/python/pyspark.streaming.html
--
diff --git a/site/docs/2.2.1/api/python/pyspark.streaming.html 
b/site/docs/2.2.1/api/python/pyspark.streaming.html
index 6254899..f5543b5 100644
--- a/site/docs/2.2.1/api/python/pyspark.streaming.html
+++ b/site/docs/2.2.1/api/python/pyspark.streaming.html
@@ -5,14 +5,14 @@
 http://www.w3.org/1999/xhtml;>
   
 
-pyspark.streaming module  PySpark  documentation
+pyspark.streaming module  PySpark 2.2.1 documentation
 
 
 
 

[4/7] spark-website git commit: Fix signature description broken in PySpark API documentation in 2.2.1

2018-07-03 Thread gurwls223
http://git-wip-us.apache.org/repos/asf/spark-website/blob/26b52712/site/docs/2.2.1/api/python/pyspark.mllib.html
--
diff --git a/site/docs/2.2.1/api/python/pyspark.mllib.html 
b/site/docs/2.2.1/api/python/pyspark.mllib.html
index cd27d38..baf0804 100644
--- a/site/docs/2.2.1/api/python/pyspark.mllib.html
+++ b/site/docs/2.2.1/api/python/pyspark.mllib.html
@@ -5,14 +5,14 @@
 http://www.w3.org/1999/xhtml;>
   
 
-pyspark.mllib package  PySpark  documentation
+pyspark.mllib package  PySpark 2.2.1 documentation
 
 
 
 

spark git commit: [SPARK-23698] Remove raw_input() from Python 2

2018-07-03 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 776f299fc -> b42fda8ab


[SPARK-23698] Remove raw_input() from Python 2

Signed-off-by: cclauss 

## What changes were proposed in this pull request?

Humans will be able to enter text at Python 3 prompts, which they cannot do today.
The Python builtin __raw_input()__ was removed in Python 3 in favor of __input()__. 
This PR does the same in the Python 2 code paths by aliasing `input = raw_input`.
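
A minimal sketch of the compatibility idiom applied here (the prompt below is 
hypothetical, added only for illustration):

```python
import sys

# Python 2's input() evaluates the typed text as an expression, so alias the
# safe raw_input() under the Python 3 name and call input() everywhere.
if sys.version < '3':
    input = raw_input  # noqa: F821  (raw_input exists only on Python 2)

answer = input("Proceed? [y/n]: ")  # hypothetical prompt for illustration
print("You typed: %s" % answer)
```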

## How was this patch tested?

flake8 testing

Author: cclauss 

Closes #21702 from cclauss/python-fix-raw_input.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b42fda8a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b42fda8a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b42fda8a

Branch: refs/heads/master
Commit: b42fda8ab3b5f82b33b96fce3f584c50f2ed5a3a
Parents: 776f299
Author: cclauss 
Authored: Wed Jul 4 09:40:58 2018 +0800
Committer: hyukjinkwon 
Committed: Wed Jul 4 09:40:58 2018 +0800

--
 dev/create-release/releaseutils.py |  5 -
 dev/merge_spark_pr.py  | 21 -
 2 files changed, 16 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b42fda8a/dev/create-release/releaseutils.py
--
diff --git a/dev/create-release/releaseutils.py 
b/dev/create-release/releaseutils.py
index 32f6cbb..ab812e1 100755
--- a/dev/create-release/releaseutils.py
+++ b/dev/create-release/releaseutils.py
@@ -49,13 +49,16 @@ except ImportError:
 print("Install using 'sudo pip install unidecode'")
 sys.exit(-1)
 
+if sys.version < '3':
+input = raw_input
+
 # Contributors list file name
 contributors_file_name = "contributors.txt"
 
 
 # Prompt the user to answer yes or no until they do so
 def yesOrNoPrompt(msg):
-response = raw_input("%s [y/n]: " % msg)
+response = input("%s [y/n]: " % msg)
 while response != "y" and response != "n":
 return yesOrNoPrompt(msg)
 return response == "y"

http://git-wip-us.apache.org/repos/asf/spark/blob/b42fda8a/dev/merge_spark_pr.py
--
diff --git a/dev/merge_spark_pr.py b/dev/merge_spark_pr.py
index 7f46a1c..79c7c02 100755
--- a/dev/merge_spark_pr.py
+++ b/dev/merge_spark_pr.py
@@ -39,6 +39,9 @@ try:
 except ImportError:
 JIRA_IMPORTED = False
 
+if sys.version < '3':
+input = raw_input
+
 # Location of your Spark git development area
 SPARK_HOME = os.environ.get("SPARK_HOME", os.getcwd())
 # Remote name which points to the Gihub site
@@ -95,7 +98,7 @@ def run_cmd(cmd):
 
 
 def continue_maybe(prompt):
-result = raw_input("\n%s (y/n): " % prompt)
+result = input("\n%s (y/n): " % prompt)
 if result.lower() != "y":
 fail("Okay, exiting")
 
@@ -134,7 +137,7 @@ def merge_pr(pr_num, target_ref, title, body, pr_repo_desc):
  '--pretty=format:%an <%ae>']).split("\n")
 distinct_authors = sorted(set(commit_authors),
   key=lambda x: commit_authors.count(x), 
reverse=True)
-primary_author = raw_input(
+primary_author = input(
 "Enter primary author in the format of \"name \" [%s]: " %
 distinct_authors[0])
 if primary_author == "":
@@ -184,7 +187,7 @@ def merge_pr(pr_num, target_ref, title, body, pr_repo_desc):
 
 
 def cherry_pick(pr_num, merge_hash, default_branch):
-pick_ref = raw_input("Enter a branch name [%s]: " % default_branch)
+pick_ref = input("Enter a branch name [%s]: " % default_branch)
 if pick_ref == "":
 pick_ref = default_branch
 
@@ -231,7 +234,7 @@ def resolve_jira_issue(merge_branches, comment, 
default_jira_id=""):
 asf_jira = jira.client.JIRA({'server': JIRA_API_BASE},
 basic_auth=(JIRA_USERNAME, JIRA_PASSWORD))
 
-jira_id = raw_input("Enter a JIRA id [%s]: " % default_jira_id)
+jira_id = input("Enter a JIRA id [%s]: " % default_jira_id)
 if jira_id == "":
 jira_id = default_jira_id
 
@@ -276,7 +279,7 @@ def resolve_jira_issue(merge_branches, comment, 
default_jira_id=""):
 default_fix_versions = filter(lambda x: x != v, 
default_fix_versions)
 default_fix_versions = ",".join(default_fix_versions)
 
-fix_versions = raw_input("Enter comma-separated fix version(s) [%s]: " % 
default_fix_versions)
+fix_versions = input("Enter comma-separated fix version(s) [%s]: " % 
default_fix_versions)
 if fix_versions == "":
 

spark git commit: [BUILD] Close stale PRs

2018-07-03 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master b42fda8ab -> 5bf95f2a3


[BUILD] Close stale PRs

Closes #20932
Closes #17843
Closes #13477
Closes #14291
Closes #20919
Closes #17907
Closes #18766
Closes #20809
Closes #8849
Closes #21076
Closes #21507
Closes #21336
Closes #21681
Closes #21691

Author: Sean Owen 

Closes #21708 from srowen/CloseStalePRs.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5bf95f2a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5bf95f2a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5bf95f2a

Branch: refs/heads/master
Commit: 5bf95f2a37e624eb6fb0ef6fbd2a40a129d5a470
Parents: b42fda8
Author: Sean Owen 
Authored: Wed Jul 4 09:53:04 2018 +0800
Committer: hyukjinkwon 
Committed: Wed Jul 4 09:53:04 2018 +0800

--

--






[3/7] spark-website git commit: Fix signature description broken in PySpark API documentation in 2.2.1

2018-07-03 Thread gurwls223
http://git-wip-us.apache.org/repos/asf/spark-website/blob/26b52712/site/docs/2.2.1/api/python/pyspark.sql.html
--
diff --git a/site/docs/2.2.1/api/python/pyspark.sql.html 
b/site/docs/2.2.1/api/python/pyspark.sql.html
index 8b349cc..2174c25 100644
--- a/site/docs/2.2.1/api/python/pyspark.sql.html
+++ b/site/docs/2.2.1/api/python/pyspark.sql.html
@@ -5,14 +5,14 @@
 http://www.w3.org/1999/xhtml;>
   
 
-pyspark.sql module  PySpark  documentation
+pyspark.sql module  PySpark 2.2.1 documentation
 
 
 
 

[1/7] spark-website git commit: Fix signature description broken in PySpark API documentation in 2.2.1

2018-07-03 Thread gurwls223
Repository: spark-website
Updated Branches:
  refs/heads/asf-site 8857572df -> 26b527127


http://git-wip-us.apache.org/repos/asf/spark-website/blob/26b52712/site/docs/2.2.1/api/python/searchindex.js
--
diff --git a/site/docs/2.2.1/api/python/searchindex.js 
b/site/docs/2.2.1/api/python/searchindex.js
index b40aeb8..345db45 100644
--- a/site/docs/2.2.1/api/python/searchindex.js
+++ b/site/docs/2.2.1/api/python/searchindex.js
@@ -1 +1 @@
-Search.setIndex({docnames:["index","pyspark","pyspark.ml","pyspark.mllib","pyspark.sql","pyspark.streaming"],envversion:52,filenames:["index.rst","pyspark.rst","pyspark.ml.rst","pyspark.mllib.rst","pyspark.sql.rst","pyspark.streaming.rst"],objects:{"":{pyspark:[1,0,0,"-"]},"pyspark.Accumulator":{add:[1,2,1,""],value:[1,3,1,""]},"pyspark.AccumulatorParam":{addInPlace:[1,2,1,""],zero:[1,2,1,""]},"pyspark.BasicProfiler":{profile:[1,2,1,""],stats:[1,2,1,""]},"pyspark.Broadcast":{destroy:[1,2,1,""],dump:[1,2,1,""],load:[1,2,1,""],unpersist:[1,2,1,""],value:[1,3,1,""]},"pyspark.MarshalSerializer":{dumps:[1,2,1,""],loads:[1,2,1,""]},"pyspark.PickleSerializer":{dumps:[1,2,1,""],loads:[1,2,1,""]},"pyspark.Profiler":{dump:[1,2,1,""],profile:[1,2,1,""],show:[1,2,1,""],stats:[1,2,1,""]},"pyspark.RDD":{aggregate:[1,2,1,""],aggregateByKey:[1,2,1,""],cache:[1,2,1,""],cartesian:[1,2,1,""],checkpoint:[1,2,1,""],coalesce:[1,2,1,""],cogroup:[1,2,1,""],collect:[1,2,1,""],collectAsMap:[1,2,1,""],combine
 
ByKey:[1,2,1,""],context:[1,3,1,""],count:[1,2,1,""],countApprox:[1,2,1,""],countApproxDistinct:[1,2,1,""],countByKey:[1,2,1,""],countByValue:[1,2,1,""],distinct:[1,2,1,""],filter:[1,2,1,""],first:[1,2,1,""],flatMap:[1,2,1,""],flatMapValues:[1,2,1,""],fold:[1,2,1,""],foldByKey:[1,2,1,""],foreach:[1,2,1,""],foreachPartition:[1,2,1,""],fullOuterJoin:[1,2,1,""],getCheckpointFile:[1,2,1,""],getNumPartitions:[1,2,1,""],getStorageLevel:[1,2,1,""],glom:[1,2,1,""],groupBy:[1,2,1,""],groupByKey:[1,2,1,""],groupWith:[1,2,1,""],histogram:[1,2,1,""],id:[1,2,1,""],intersection:[1,2,1,""],isCheckpointed:[1,2,1,""],isEmpty:[1,2,1,""],isLocallyCheckpointed:[1,2,1,""],join:[1,2,1,""],keyBy:[1,2,1,""],keys:[1,2,1,""],leftOuterJoin:[1,2,1,""],localCheckpoint:[1,2,1,""],lookup:[1,2,1,""],map:[1,2,1,""],mapPartitions:[1,2,1,""],mapPartitionsWithIndex:[1,2,1,""],mapPartitionsWithSplit:[1,2,1,""],mapValues:[1,2,1,""],max:[1,2,1,""],mean:[1,2,1,""],meanApprox:[1,2,1,""],min:[1,2,1,""],name:[1,2,1,""],parti
 
tionBy:[1,2,1,""],persist:[1,2,1,""],pipe:[1,2,1,""],randomSplit:[1,2,1,""],reduce:[1,2,1,""],reduceByKey:[1,2,1,""],reduceByKeyLocally:[1,2,1,""],repartition:[1,2,1,""],repartitionAndSortWithinPartitions:[1,2,1,""],rightOuterJoin:[1,2,1,""],sample:[1,2,1,""],sampleByKey:[1,2,1,""],sampleStdev:[1,2,1,""],sampleVariance:[1,2,1,""],saveAsHadoopDataset:[1,2,1,""],saveAsHadoopFile:[1,2,1,""],saveAsNewAPIHadoopDataset:[1,2,1,""],saveAsNewAPIHadoopFile:[1,2,1,""],saveAsPickleFile:[1,2,1,""],saveAsSequenceFile:[1,2,1,""],saveAsTextFile:[1,2,1,""],setName:[1,2,1,""],sortBy:[1,2,1,""],sortByKey:[1,2,1,""],stats:[1,2,1,""],stdev:[1,2,1,""],subtract:[1,2,1,""],subtractByKey:[1,2,1,""],sum:[1,2,1,""],sumApprox:[1,2,1,""],take:[1,2,1,""],takeOrdered:[1,2,1,""],takeSample:[1,2,1,""],toDebugString:[1,2,1,""],toLocalIterator:[1,2,1,""],top:[1,2,1,""],treeAggregate:[1,2,1,""],treeReduce:[1,2,1,""],union:[1,2,1,""],unpersist:[1,2,1,""],values:[1,2,1,""],variance:[1,2,1,""],zip:[1,2,1,""],zipWithIndex
 
:[1,2,1,""],zipWithUniqueId:[1,2,1,""]},"pyspark.SparkConf":{contains:[1,2,1,""],get:[1,2,1,""],getAll:[1,2,1,""],set:[1,2,1,""],setAll:[1,2,1,""],setAppName:[1,2,1,""],setExecutorEnv:[1,2,1,""],setIfMissing:[1,2,1,""],setMaster:[1,2,1,""],setSparkHome:[1,2,1,""],toDebugString:[1,2,1,""]},"pyspark.SparkContext":{PACKAGE_EXTENSIONS:[1,3,1,""],accumulator:[1,2,1,""],addFile:[1,2,1,""],addPyFile:[1,2,1,""],applicationId:[1,3,1,""],binaryFiles:[1,2,1,""],binaryRecords:[1,2,1,""],broadcast:[1,2,1,""],cancelAllJobs:[1,2,1,""],cancelJobGroup:[1,2,1,""],defaultMinPartitions:[1,3,1,""],defaultParallelism:[1,3,1,""],dump_profiles:[1,2,1,""],emptyRDD:[1,2,1,""],getConf:[1,2,1,""],getLocalProperty:[1,2,1,""],getOrCreate:[1,4,1,""],hadoopFile:[1,2,1,""],hadoopRDD:[1,2,1,""],newAPIHadoopFile:[1,2,1,""],newAPIHadoopRDD:[1,2,1,""],parallelize:[1,2,1,""],pickleFile:[1,2,1,""],range:[1,2,1,""],runJob:[1,2,1,""],sequenceFile:[1,2,1,""],setCheckpointDir:[1,2,1,""],setJobGroup:[1,2,1,""],setLocalPropert
 

spark git commit: [SPARK-24732][SQL] Type coercion between MapTypes.

2018-07-03 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 5bf95f2a3 -> 7c08eb6d6


[SPARK-24732][SQL] Type coercion between MapTypes.

## What changes were proposed in this pull request?

Currently we don't allow type coercion between maps. This PR adds support for type 
coercion between MapTypes where both the key types and the value types are compatible.
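
As a user-facing sketch of what this enables (illustrative query, assuming a build 
that includes this change), the key types and the value types are widened 
independently:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# map<int, double> unioned with map<bigint, int>: with map type coercion the
# result column can be widened to map<bigint, double>.
df = spark.sql("""
    SELECT map(1, CAST(1.0 AS DOUBLE)) AS m
    UNION ALL
    SELECT map(CAST(2 AS BIGINT), 2) AS m
""")
df.printSchema()
```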

## How was this patch tested?

Added tests.

Author: Takuya UESHIN 

Closes #21703 from ueshin/issues/SPARK-24732/maptypecoercion.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7c08eb6d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7c08eb6d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7c08eb6d

Branch: refs/heads/master
Commit: 7c08eb6d61d55ce45229f3302e6d463e7669183d
Parents: 5bf95f2
Author: Takuya UESHIN 
Authored: Wed Jul 4 12:21:26 2018 +0800
Committer: hyukjinkwon 
Committed: Wed Jul 4 12:21:26 2018 +0800

--
 .../sql/catalyst/analysis/TypeCoercion.scala| 12 ++
 .../catalyst/analysis/TypeCoercionSuite.scala   | 45 +++-
 2 files changed, 56 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7c08eb6d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
index 3ebab43..cf90e6e 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
@@ -179,6 +179,12 @@ object TypeCoercion {
   .orElse((t1, t2) match {
 case (ArrayType(et1, containsNull1), ArrayType(et2, containsNull2)) =>
   findWiderTypeForTwo(et1, et2).map(ArrayType(_, containsNull1 || 
containsNull2))
+case (MapType(kt1, vt1, valueContainsNull1), MapType(kt2, vt2, 
valueContainsNull2)) =>
+  findWiderTypeForTwo(kt1, kt2).flatMap { kt =>
+findWiderTypeForTwo(vt1, vt2).map { vt =>
+  MapType(kt, vt, valueContainsNull1 || valueContainsNull2)
+}
+  }
 case _ => None
   })
   }
@@ -220,6 +226,12 @@ object TypeCoercion {
 case (ArrayType(et1, containsNull1), ArrayType(et2, containsNull2)) =>
   findWiderTypeWithoutStringPromotionForTwo(et1, et2)
 .map(ArrayType(_, containsNull1 || containsNull2))
+case (MapType(kt1, vt1, valueContainsNull1), MapType(kt2, vt2, 
valueContainsNull2)) =>
+  findWiderTypeWithoutStringPromotionForTwo(kt1, kt2).flatMap { kt =>
+findWiderTypeWithoutStringPromotionForTwo(vt1, vt2).map { vt =>
+  MapType(kt, vt, valueContainsNull1 || valueContainsNull2)
+}
+  }
 case _ => None
   })
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/7c08eb6d/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala
--
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala
index 0acd3b4..4e5ca1b 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala
@@ -54,8 +54,9 @@ class TypeCoercionSuite extends AnalysisTest {
   // | NullType | ByteType | ShortType | IntegerType | LongType | 
DoubleType | FloatType | Dec(10, 2) | BinaryType | BooleanType | StringType | 
DateType | TimestampType | ArrayType  | MapType  | StructType  | NullType | 
CalendarIntervalType | DecimalType(38, 18) | DoubleType  | IntegerType  |
   // | CalendarIntervalType | X| X | X   | X| 
X  | X | X  | X  | X   | X  | X 
   | X | X  | X| X   | X| 
CalendarIntervalType | X   | X   | X|
   // 
+--+--+---+-+--++---+++-++--+---++--+-+--+--+-+-+--+
-  // Note: MapType*, StructType* are castable only when the internal child 
types also match; otherwise, not castable.
+  // Note: StructType* is castable only when the internal child types also 
match; 

spark git commit: [SPARK-22924][SPARKR] R API for sortWithinPartitions

2017-12-30 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master fd7d141d8 -> ea0a5eef2


[SPARK-22924][SPARKR] R API for sortWithinPartitions

## What changes were proposed in this pull request?

Add to `arrange` an option to sort only within each partition.
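
For comparison, the new option delegates to the same `Dataset.sortWithinPartitions` 
that the PySpark API already exposes; a small illustrative sketch (data made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# rough equivalent of arrange(df, "id", withinPartitions = TRUE) in the R API
df = spark.range(100).repartition(10)
sorted_within = df.sortWithinPartitions("id")
print(sorted_within.rdd.getNumPartitions())  # partitioning preserved: 10
```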

## How was this patch tested?

manual, unit tests

Author: Felix Cheung 

Closes #20118 from felixcheung/rsortwithinpartition.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ea0a5eef
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ea0a5eef
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ea0a5eef

Branch: refs/heads/master
Commit: ea0a5eef2238daa68a15b60a6f1a74c361216140
Parents: fd7d141
Author: Felix Cheung 
Authored: Sun Dec 31 02:50:00 2017 +0900
Committer: hyukjinkwon 
Committed: Sun Dec 31 02:50:00 2017 +0900

--
 R/pkg/R/DataFrame.R   | 14 ++
 R/pkg/tests/fulltests/test_sparkSQL.R |  5 +
 2 files changed, 15 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ea0a5eef/R/pkg/R/DataFrame.R
--
diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index ace49da..fe238f6 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -2297,6 +2297,7 @@ setClassUnion("characterOrColumn", c("character", 
"Column"))
 #' @param ... additional sorting fields
 #' @param decreasing a logical argument indicating sorting order for columns 
when
 #'   a character vector is specified for col
+#' @param withinPartitions a logical argument indicating whether to sort only 
within each partition
 #' @return A SparkDataFrame where all elements are sorted.
 #' @family SparkDataFrame functions
 #' @aliases arrange,SparkDataFrame,Column-method
@@ -2312,16 +2313,21 @@ setClassUnion("characterOrColumn", c("character", 
"Column"))
 #' arrange(df, asc(df$col1), desc(abs(df$col2)))
 #' arrange(df, "col1", decreasing = TRUE)
 #' arrange(df, "col1", "col2", decreasing = c(TRUE, FALSE))
+#' arrange(df, "col1", "col2", withinPartitions = TRUE)
 #' }
 #' @note arrange(SparkDataFrame, Column) since 1.4.0
 setMethod("arrange",
   signature(x = "SparkDataFrame", col = "Column"),
-  function(x, col, ...) {
+  function(x, col, ..., withinPartitions = FALSE) {
   jcols <- lapply(list(col, ...), function(c) {
 c@jc
   })
 
-sdf <- callJMethod(x@sdf, "sort", jcols)
+if (withinPartitions) {
+  sdf <- callJMethod(x@sdf, "sortWithinPartitions", jcols)
+} else {
+  sdf <- callJMethod(x@sdf, "sort", jcols)
+}
 dataFrame(sdf)
   })
 
@@ -2332,7 +2338,7 @@ setMethod("arrange",
 #' @note arrange(SparkDataFrame, character) since 1.4.0
 setMethod("arrange",
   signature(x = "SparkDataFrame", col = "character"),
-  function(x, col, ..., decreasing = FALSE) {
+  function(x, col, ..., decreasing = FALSE, withinPartitions = FALSE) {
 
 # all sorting columns
 by <- list(col, ...)
@@ -2356,7 +2362,7 @@ setMethod("arrange",
   }
 })
 
-do.call("arrange", c(x, jcols))
+do.call("arrange", c(x, jcols, withinPartitions = 
withinPartitions))
   })
 
 #' @rdname arrange

http://git-wip-us.apache.org/repos/asf/spark/blob/ea0a5eef/R/pkg/tests/fulltests/test_sparkSQL.R
--
diff --git a/R/pkg/tests/fulltests/test_sparkSQL.R 
b/R/pkg/tests/fulltests/test_sparkSQL.R
index 1b7d53f..5197838 100644
--- a/R/pkg/tests/fulltests/test_sparkSQL.R
+++ b/R/pkg/tests/fulltests/test_sparkSQL.R
@@ -2130,6 +2130,11 @@ test_that("arrange() and orderBy() on a DataFrame", {
 
   sorted7 <- arrange(df, "name", decreasing = FALSE)
   expect_equal(collect(sorted7)[2, "age"], 19)
+
+  df <- createDataFrame(cars, numPartitions = 10)
+  expect_equal(getNumPartitions(df), 10)
+  sorted8 <- arrange(df, "dist", withinPartitions = TRUE)
+  expect_equal(collect(sorted8)[5:6, "dist"], c(22, 10))
 })
 
 test_that("filter() on a DataFrame", {





spark git commit: [SPARK-22370][SQL][PYSPARK][FOLLOW-UP] Fix a test failure when xmlrunner is installed.

2017-12-29 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master dbd492b7e -> 11a849b3a


[SPARK-22370][SQL][PYSPARK][FOLLOW-UP] Fix a test failure when xmlrunner is 
installed.

## What changes were proposed in this pull request?

This is a follow-up PR of #19587.

If `xmlrunner` is installed, 
`VectorizedUDFTests.test_vectorized_udf_check_config` fails with the following error, 
because `self` (an instance of a `unittest.TestCase` subclass) referenced in the UDF 
`check_records_per_batch` can no longer be pickled.

```
PicklingError: Cannot pickle files that are not opened for reading: w
```

This changes the UDF so that it does not refer to `self`.
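
A minimal sketch of the resulting pattern (the standalone setup and names below are 
illustrative, not the exact test code): keep assertions out of the UDF so it does not 
capture `self`, and verify the collected result on the driver instead.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 3)
df = spark.range(10)

# The UDF only closes over module-level names, so it stays picklable even when
# the enclosing TestCase holds unpicklable state (e.g. xmlrunner's output file).
@pandas_udf(returnType=LongType())
def batch_size(x):
    # emit the Arrow batch size once per row instead of asserting inside the UDF
    return pd.Series(x.size).repeat(x.size)

# verify on the driver, outside any closure over `self`
for (size,) in df.select(batch_size(col("id"))).collect():
    assert size <= 3
```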

## How was this patch tested?

Tested locally.

Author: Takuya UESHIN 

Closes #20115 from ueshin/issues/SPARK-22370_fup1.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/11a849b3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/11a849b3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/11a849b3

Branch: refs/heads/master
Commit: 11a849b3a7b3d03c48d3e17c8a721acedfd89285
Parents: dbd492b
Author: Takuya UESHIN 
Authored: Fri Dec 29 23:04:28 2017 +0900
Committer: hyukjinkwon 
Committed: Fri Dec 29 23:04:28 2017 +0900

--
 python/pyspark/sql/tests.py | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/11a849b3/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 3ef1522..1c34c89 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -3825,6 +3825,7 @@ class VectorizedUDFTests(ReusedSQLTestCase):
 
 def test_vectorized_udf_check_config(self):
 from pyspark.sql.functions import pandas_udf, col
+import pandas as pd
 orig_value = 
self.spark.conf.get("spark.sql.execution.arrow.maxRecordsPerBatch", None)
 self.spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 3)
 try:
@@ -3832,11 +3833,11 @@ class VectorizedUDFTests(ReusedSQLTestCase):
 
 @pandas_udf(returnType=LongType())
 def check_records_per_batch(x):
-self.assertTrue(x.size <= 3)
-return x
+return pd.Series(x.size).repeat(x.size)
 
-result = df.select(check_records_per_batch(col("id")))
-self.assertEqual(df.collect(), result.collect())
+result = df.select(check_records_per_batch(col("id"))).collect()
+for (r,) in result:
+self.assertTrue(r <= 3)
 finally:
 if orig_value is None:
 
self.spark.conf.unset("spark.sql.execution.arrow.maxRecordsPerBatch")





spark git commit: [HOTFIX] Fix Scala style checks

2017-12-23 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master ea2642eb0 -> f6084a88f


[HOTFIX] Fix Scala style checks

## What changes were proposed in this pull request?

This PR fixes a style issue that broke the build.

## How was this patch tested?

Manually tested.

Author: hyukjinkwon 

Closes #20065 from HyukjinKwon/minor-style.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f6084a88
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f6084a88
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f6084a88

Branch: refs/heads/master
Commit: f6084a88f0fe69111df8a016bc81c9884d3d3402
Parents: ea2642e
Author: hyukjinkwon 
Authored: Sun Dec 24 01:16:12 2017 +0900
Committer: hyukjinkwon 
Committed: Sun Dec 24 01:16:12 2017 +0900

--
 .../org/apache/spark/examples/sql/hive/SparkHiveExample.scala  | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f6084a88/examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
--
diff --git 
a/examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 
b/examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
index 51df5dd..b193bd5 100644
--- 
a/examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
+++ 
b/examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
@@ -135,7 +135,7 @@ object SparkHiveExample {
 hiveTableDF.coalesce(10).write.mode(SaveMode.Overwrite)
   .partitionBy("key").parquet(hiveExternalTableLocation)
 // $example off:spark_hive$
-
+
 spark.stop()
   }
 }





spark git commit: [SPARK-22844][R] Adds date_trunc in R API

2017-12-23 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master f6084a88f -> aeb45df66


[SPARK-22844][R] Adds date_trunc in R API

## What changes were proposed in this pull request?

This PR adds `date_trunc` to the R API, as below:

```r
> df <- createDataFrame(list(list(a = as.POSIXlt("2012-12-13 12:34:00"))))
> head(select(df, date_trunc("hour", df$a)))
  date_trunc(hour, a)
1 2012-12-13 12:00:00
```

## How was this patch tested?

Unit tests added in `R/pkg/tests/fulltests/test_sparkSQL.R`.

Author: hyukjinkwon 

Closes #20031 from HyukjinKwon/r-datetrunc.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/aeb45df6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/aeb45df6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/aeb45df6

Branch: refs/heads/master
Commit: aeb45df668a97a2d48cfd4079ed62601390979ba
Parents: f6084a8
Author: hyukjinkwon 
Authored: Sun Dec 24 01:18:11 2017 +0900
Committer: hyukjinkwon 
Committed: Sun Dec 24 01:18:11 2017 +0900

--
 R/pkg/NAMESPACE   |  1 +
 R/pkg/R/functions.R   | 34 ++
 R/pkg/R/generics.R|  5 +
 R/pkg/tests/fulltests/test_sparkSQL.R |  3 +++
 4 files changed, 39 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/aeb45df6/R/pkg/NAMESPACE
--
diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE
index 57838f5..dce64e1 100644
--- a/R/pkg/NAMESPACE
+++ b/R/pkg/NAMESPACE
@@ -230,6 +230,7 @@ exportMethods("%<=>%",
   "date_add",
   "date_format",
   "date_sub",
+  "date_trunc",
   "datediff",
   "dayofmonth",
   "dayofweek",

http://git-wip-us.apache.org/repos/asf/spark/blob/aeb45df6/R/pkg/R/functions.R
--
diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R
index 237ef06..3a96f94 100644
--- a/R/pkg/R/functions.R
+++ b/R/pkg/R/functions.R
@@ -40,10 +40,17 @@ NULL
 #'
 #' @param x Column to compute on. In \code{window}, it must be a time Column of
 #'  \code{TimestampType}.
-#' @param format For \code{to_date} and \code{to_timestamp}, it is the string 
to use to parse
-#'   Column \code{x} to DateType or TimestampType. For 
\code{trunc}, it is the string
-#'   to use to specify the truncation method. For example, "year", 
"yyyy", "yy" for
-#'   truncate by year, or "month", "mon", "mm" for truncate by 
month.
+#' @param format The format for the given dates or timestamps in Column 
\code{x}. See the
+#'   format used in the following methods:
+#'   \itemize{
+#'   \item \code{to_date} and \code{to_timestamp}: it is the 
string to use to parse
+#'Column \code{x} to DateType or TimestampType.
+#'   \item \code{trunc}: it is the string to use to specify the 
truncation method.
+#'For example, "year", "yyyy", "yy" for truncate by year, 
or "month", "mon",
+#'"mm" for truncate by month.
+#'   \item \code{date_trunc}: it is similar with \code{trunc}'s 
but additionally
+#'supports "day", "dd", "second", "minute", "hour", "week" 
and "quarter".
+#'   }
 #' @param ... additional argument(s).
 #' @name column_datetime_functions
 #' @rdname column_datetime_functions
@@ -3478,3 +3485,22 @@ setMethod("trunc",
   x@jc, as.character(format))
 column(jc)
   })
+
+#' @details
+#' \code{date_trunc}: Returns timestamp truncated to the unit specified by the 
format.
+#'
+#' @rdname column_datetime_functions
+#' @aliases date_trunc date_trunc,character,Column-method
+#' @export
+#' @examples
+#'
+#' \dontrun{
+#' head(select(df, df$time, date_trunc("hour", df$time), date_trunc("minute", 
df$time),
+#' date_trunc("week", df$time), date_trunc("quarter", df$time)))}
+#' @note date_trunc since 2.3.0
+setMethod("date_trunc",
+  signature(format = "character", x = "Column"),
+  function(format, x) {
+jc <- callJStatic("org.apache.spark.sql.functions", "date_trunc", 
format, x@jc)
+column(jc)
+  })

http://git-wip-us.apache.org/repos/asf/spark/blob/aeb45df6/R/pkg/R/generics.R
--
diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R
index 8fcf269..5ddaa66 100644
--- a/R/pkg/R/generics.R
+++ b/R/pkg/R/generics.R
@@ -1046,6 +1046,11 @@ setGeneric("date_sub", function(y, x) { 
standardGeneric("date_sub") })
 #' @rdname 

spark git commit: [SPARK-22967][TESTS] Fix VersionSuite's unit tests by change Windows path into URI path

2018-01-11 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 1c70da3bf -> 0552c36e0


[SPARK-22967][TESTS] Fix VersionSuite's unit tests by change Windows path into 
URI path

## What changes were proposed in this pull request?

Two unit tests fail due to Windows-format paths:

1. test(s"$version: read avro file containing decimal")
```
org.apache.hadoop.hive.ql.metadata.HiveException: 
MetaException(message:java.lang.IllegalArgumentException: Can not create a Path 
from an empty string);
```

2. test(s"$version: SPARK-17920: Insert into/overwrite avro table")
```
Unable to infer the schema. The schema specification is required to create the 
table `default`.`tab2`.;
org.apache.spark.sql.AnalysisException: Unable to infer the schema. The schema 
specification is required to create the table `default`.`tab2`.;
```

This PR fixes these two unit tests by changing the Windows paths into URI paths.
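
For context, the distinction the fix relies on can be sketched in a few lines of plain Python (standard library only; the path below is made up for illustration):

```python
# Illustration only: a raw Windows path vs. the file URI form that the
# Hadoop/Hive path handling in the tests accepts. The path is hypothetical.
from pathlib import PureWindowsPath

p = PureWindowsPath(r"C:\tmp\avroDecimal")
print(str(p))      # C:\tmp\avroDecimal        (backslash form that broke the tests)
print(p.as_uri())  # file:///C:/tmp/avroDecimal (URI form the patch switches to)
```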

## How was this patch tested?
Existing tests.

Please review http://spark.apache.org/contributing.html before opening a pull 
request.

Author: wuyi5 

Closes #20199 from Ngone51/SPARK-22967.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0552c36e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0552c36e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0552c36e

Branch: refs/heads/master
Commit: 0552c36e02434c60dad82024334d291f6008b822
Parents: 1c70da3
Author: wuyi5 
Authored: Thu Jan 11 22:17:15 2018 +0900
Committer: hyukjinkwon 
Committed: Thu Jan 11 22:17:15 2018 +0900

--
 .../org/apache/spark/sql/hive/client/VersionsSuite.scala | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/0552c36e/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala
--
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala
index ff90e9d..e64389e 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala
@@ -811,7 +811,7 @@ class VersionsSuite extends SparkFunSuite with Logging {
 
 test(s"$version: read avro file containing decimal") {
   val url = 
Thread.currentThread().getContextClassLoader.getResource("avroDecimal")
-  val location = new File(url.getFile)
+  val location = new File(url.getFile).toURI.toString
 
   val tableName = "tab1"
   val avroSchema =
@@ -851,6 +851,8 @@ class VersionsSuite extends SparkFunSuite with Logging {
 }
 
 test(s"$version: SPARK-17920: Insert into/overwrite avro table") {
+  // skipped because it's failed in the condition on Windows
+  assume(!(Utils.isWindows && version == "0.12"))
   withTempDir { dir =>
 val avroSchema =
   """
@@ -875,10 +877,10 @@ class VersionsSuite extends SparkFunSuite with Logging {
 val writer = new PrintWriter(schemaFile)
 writer.write(avroSchema)
 writer.close()
-val schemaPath = schemaFile.getCanonicalPath
+val schemaPath = schemaFile.toURI.toString
 
 val url = 
Thread.currentThread().getContextClassLoader.getResource("avroDecimal")
-val srcLocation = new File(url.getFile).getCanonicalPath
+val srcLocation = new File(url.getFile).toURI.toString
 val destTableName = "tab1"
 val srcTableName = "tab2"
 





spark git commit: [SPARK-22967][TESTS] Fix VersionSuite's unit tests by change Windows path into URI path

2018-01-11 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 b78130123 -> 799598905


[SPARK-22967][TESTS] Fix VersionSuite's unit tests by change Windows path into 
URI path

## What changes were proposed in this pull request?

Two unit tests fail due to Windows-format paths:

1. test(s"$version: read avro file containing decimal")
```
org.apache.hadoop.hive.ql.metadata.HiveException: 
MetaException(message:java.lang.IllegalArgumentException: Can not create a Path 
from an empty string);
```

2. test(s"$version: SPARK-17920: Insert into/overwrite avro table")
```
Unable to infer the schema. The schema specification is required to create the 
table `default`.`tab2`.;
org.apache.spark.sql.AnalysisException: Unable to infer the schema. The schema 
specification is required to create the table `default`.`tab2`.;
```

This PR fixes these two unit tests by changing the Windows paths into URI paths.

## How was this patch tested?
Existing tests.

Please review http://spark.apache.org/contributing.html before opening a pull 
request.

Author: wuyi5 

Closes #20199 from Ngone51/SPARK-22967.

(cherry picked from commit 0552c36e02434c60dad82024334d291f6008b822)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/79959890
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/79959890
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/79959890

Branch: refs/heads/branch-2.3
Commit: 79959890570d216c33069c8382b29d53977665b1
Parents: b781301
Author: wuyi5 
Authored: Thu Jan 11 22:17:15 2018 +0900
Committer: hyukjinkwon 
Committed: Thu Jan 11 22:17:28 2018 +0900

--
 .../org/apache/spark/sql/hive/client/VersionsSuite.scala | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/79959890/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala
--
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala
index ff90e9d..e64389e 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala
@@ -811,7 +811,7 @@ class VersionsSuite extends SparkFunSuite with Logging {
 
 test(s"$version: read avro file containing decimal") {
   val url = 
Thread.currentThread().getContextClassLoader.getResource("avroDecimal")
-  val location = new File(url.getFile)
+  val location = new File(url.getFile).toURI.toString
 
   val tableName = "tab1"
   val avroSchema =
@@ -851,6 +851,8 @@ class VersionsSuite extends SparkFunSuite with Logging {
 }
 
 test(s"$version: SPARK-17920: Insert into/overwrite avro table") {
+  // skipped because it's failed in the condition on Windows
+  assume(!(Utils.isWindows && version == "0.12"))
   withTempDir { dir =>
 val avroSchema =
   """
@@ -875,10 +877,10 @@ class VersionsSuite extends SparkFunSuite with Logging {
 val writer = new PrintWriter(schemaFile)
 writer.write(avroSchema)
 writer.close()
-val schemaPath = schemaFile.getCanonicalPath
+val schemaPath = schemaFile.toURI.toString
 
 val url = 
Thread.currentThread().getContextClassLoader.getResource("avroDecimal")
-val srcLocation = new File(url.getFile).getCanonicalPath
+val srcLocation = new File(url.getFile).toURI.toString
 val destTableName = "tab1"
 val srcTableName = "tab2"
 





spark git commit: [SPARK-19732][FOLLOW-UP] Document behavior changes made in na.fill and fillna

2018-01-11 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 76892bcf2 -> b46e58b74


[SPARK-19732][FOLLOW-UP] Document behavior changes made in na.fill and fillna

## What changes were proposed in this pull request?
https://github.com/apache/spark/pull/18164 introduces behavior changes that need to be documented.
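
The main behavior change being documented can be sketched as follows (a minimal example, assuming a running SparkSession named `spark`; on Spark 2.3 the boolean is applied, while older releases silently ignored it):

```python
# Sketch of the documented behavior change (assumes a running SparkSession `spark`).
df = spark.createDataFrame([(1, True), (2, None)], ["id", "flag"])

# Since Spark 2.3, na.fill()/fillna() accept a boolean and replace nulls with it;
# earlier versions ignored boolean values and returned the DataFrame unchanged.
df.na.fill(True).show()  # both rows now have flag=true
```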

## How was this patch tested?
N/A

Author: gatorsmile 

Closes #20234 from gatorsmile/docBehaviorChange.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b46e58b7
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b46e58b7
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b46e58b7

Branch: refs/heads/master
Commit: b46e58b74c82dac37b7b92284ea3714919c5a886
Parents: 76892bc
Author: gatorsmile 
Authored: Thu Jan 11 22:33:42 2018 +0900
Committer: hyukjinkwon 
Committed: Thu Jan 11 22:33:42 2018 +0900

--
 docs/sql-programming-guide.md | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b46e58b7/docs/sql-programming-guide.md
--
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 72f79d6..258c769 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1788,12 +1788,10 @@ options.
 Note that, for DecimalType(38,0)*, the table above intentionally 
does not cover all other combinations of scales and precisions because 
currently we only infer decimal type like `BigInteger`/`BigInt`. For example, 
1.1 is inferred as double type.
   - In PySpark, now we need Pandas 0.19.2 or upper if you want to use Pandas 
related functionalities, such as `toPandas`, `createDataFrame` from Pandas 
DataFrame, etc.
   - In PySpark, the behavior of timestamp values for Pandas related 
functionalities was changed to respect session timezone. If you want to use the 
old behavior, you need to set a configuration 
`spark.sql.execution.pandas.respectSessionTimeZone` to `False`. See 
[SPARK-22395](https://issues.apache.org/jira/browse/SPARK-22395) for details.
-
- - Since Spark 2.3, when either broadcast hash join or broadcast nested loop 
join is applicable, we prefer to broadcasting the table that is explicitly 
specified in a broadcast hint. For details, see the section [Broadcast 
Hint](#broadcast-hint-for-sql-queries) and 
[SPARK-22489](https://issues.apache.org/jira/browse/SPARK-22489).
-
- - Since Spark 2.3, when all inputs are binary, `functions.concat()` returns 
an output as binary. Otherwise, it returns as a string. Until Spark 2.3, it 
always returns as a string despite of input types. To keep the old behavior, 
set `spark.sql.function.concatBinaryAsString` to `true`.
-
- - Since Spark 2.3, when all inputs are binary, SQL `elt()` returns an output 
as binary. Otherwise, it returns as a string. Until Spark 2.3, it always 
returns as a string despite of input types. To keep the old behavior, set 
`spark.sql.function.eltOutputAsString` to `true`.
+  - In PySpark, `na.fill()` or `fillna` also accepts boolean and replaces 
nulls with booleans. In prior Spark versions, PySpark just ignores it and 
returns the original Dataset/DataFrame.  
+  - Since Spark 2.3, when either broadcast hash join or broadcast nested loop 
join is applicable, we prefer to broadcasting the table that is explicitly 
specified in a broadcast hint. For details, see the section [Broadcast 
Hint](#broadcast-hint-for-sql-queries) and 
[SPARK-22489](https://issues.apache.org/jira/browse/SPARK-22489).
+  - Since Spark 2.3, when all inputs are binary, `functions.concat()` returns 
an output as binary. Otherwise, it returns as a string. Until Spark 2.3, it 
always returns as a string despite of input types. To keep the old behavior, 
set `spark.sql.function.concatBinaryAsString` to `true`.
+  - Since Spark 2.3, when all inputs are binary, SQL `elt()` returns an output 
as binary. Otherwise, it returns as a string. Until Spark 2.3, it always 
returns as a string despite of input types. To keep the old behavior, set 
`spark.sql.function.eltOutputAsString` to `true`.
 
 ## Upgrading From Spark SQL 2.1 to 2.2
 





spark git commit: [SPARK-19732][FOLLOW-UP] Document behavior changes made in na.fill and fillna

2018-01-11 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 9ca0f6eaf -> f624850fe


[SPARK-19732][FOLLOW-UP] Document behavior changes made in na.fill and fillna

## What changes were proposed in this pull request?
https://github.com/apache/spark/pull/18164 introduces behavior changes that need to be documented.

## How was this patch tested?
N/A

Author: gatorsmile 

Closes #20234 from gatorsmile/docBehaviorChange.

(cherry picked from commit b46e58b74c82dac37b7b92284ea3714919c5a886)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f624850f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f624850f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f624850f

Branch: refs/heads/branch-2.3
Commit: f624850fe8acce52240217f376316734a23be00b
Parents: 9ca0f6e
Author: gatorsmile 
Authored: Thu Jan 11 22:33:42 2018 +0900
Committer: hyukjinkwon 
Committed: Thu Jan 11 22:33:57 2018 +0900

--
 docs/sql-programming-guide.md | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f624850f/docs/sql-programming-guide.md
--
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 72f79d6..258c769 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1788,12 +1788,10 @@ options.
 Note that, for DecimalType(38,0)*, the table above intentionally 
does not cover all other combinations of scales and precisions because 
currently we only infer decimal type like `BigInteger`/`BigInt`. For example, 
1.1 is inferred as double type.
   - In PySpark, now we need Pandas 0.19.2 or upper if you want to use Pandas 
related functionalities, such as `toPandas`, `createDataFrame` from Pandas 
DataFrame, etc.
   - In PySpark, the behavior of timestamp values for Pandas related 
functionalities was changed to respect session timezone. If you want to use the 
old behavior, you need to set a configuration 
`spark.sql.execution.pandas.respectSessionTimeZone` to `False`. See 
[SPARK-22395](https://issues.apache.org/jira/browse/SPARK-22395) for details.
-
- - Since Spark 2.3, when either broadcast hash join or broadcast nested loop 
join is applicable, we prefer to broadcasting the table that is explicitly 
specified in a broadcast hint. For details, see the section [Broadcast 
Hint](#broadcast-hint-for-sql-queries) and 
[SPARK-22489](https://issues.apache.org/jira/browse/SPARK-22489).
-
- - Since Spark 2.3, when all inputs are binary, `functions.concat()` returns 
an output as binary. Otherwise, it returns as a string. Until Spark 2.3, it 
always returns as a string despite of input types. To keep the old behavior, 
set `spark.sql.function.concatBinaryAsString` to `true`.
-
- - Since Spark 2.3, when all inputs are binary, SQL `elt()` returns an output 
as binary. Otherwise, it returns as a string. Until Spark 2.3, it always 
returns as a string despite of input types. To keep the old behavior, set 
`spark.sql.function.eltOutputAsString` to `true`.
+  - In PySpark, `na.fill()` or `fillna` also accepts boolean and replaces 
nulls with booleans. In prior Spark versions, PySpark just ignores it and 
returns the original Dataset/DataFrame.  
+  - Since Spark 2.3, when either broadcast hash join or broadcast nested loop 
join is applicable, we prefer to broadcasting the table that is explicitly 
specified in a broadcast hint. For details, see the section [Broadcast 
Hint](#broadcast-hint-for-sql-queries) and 
[SPARK-22489](https://issues.apache.org/jira/browse/SPARK-22489).
+  - Since Spark 2.3, when all inputs are binary, `functions.concat()` returns 
an output as binary. Otherwise, it returns as a string. Until Spark 2.3, it 
always returns as a string despite of input types. To keep the old behavior, 
set `spark.sql.function.concatBinaryAsString` to `true`.
+  - Since Spark 2.3, when all inputs are binary, SQL `elt()` returns an output 
as binary. Otherwise, it returns as a string. Until Spark 2.3, it always 
returns as a string despite of input types. To keep the old behavior, set 
`spark.sql.function.eltOutputAsString` to `true`.
 
 ## Upgrading From Spark SQL 2.1 to 2.2
 





spark git commit: [SPARK-23009][PYTHON] Fix for non-str col names to createDataFrame from Pandas

2018-01-10 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 eb4fa551e -> 551ccfba5


[SPARK-23009][PYTHON] Fix for non-str col names to createDataFrame from Pandas

## What changes were proposed in this pull request?

This fixes the case of calling `SparkSession.createDataFrame` with a Pandas DataFrame that has non-str column labels.

The column-name conversion logic for handling non-string or unicode columns in Python 2 is:
```
if column is not any type of string:
    name = str(column)
else if column is unicode in Python 2:
    name = column.encode('utf-8')
```
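
A minimal sketch of the scenario being fixed (assuming a running SparkSession named `spark`; the column labels are plain ints, as in the new test):

```python
# Sketch of the fixed scenario (assumes a running SparkSession `spark`).
# The Pandas DataFrame has integer column labels; with the fix they are
# converted via str() instead of failing on .encode('utf-8').
import numpy as np
import pandas as pd

pdf = pd.DataFrame(np.random.rand(4, 2))  # columns are the ints 0 and 1
df = spark.createDataFrame(pdf)
print(df.columns)                         # ['0', '1']
```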

## How was this patch tested?

Added a new test with a Pandas DataFrame that has int column labels

Author: Bryan Cutler 

Closes #20210 from BryanCutler/python-createDataFrame-int-col-error-SPARK-23009.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/551ccfba
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/551ccfba
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/551ccfba

Branch: refs/heads/branch-2.3
Commit: 551ccfba529996e987c4d2e8d4dd61c4ab9a2e95
Parents: eb4fa55
Author: Bryan Cutler 
Authored: Wed Jan 10 14:55:24 2018 +0900
Committer: hyukjinkwon 
Committed: Thu Jan 11 09:46:50 2018 +0900

--
 python/pyspark/sql/session.py | 4 +++-
 python/pyspark/sql/tests.py   | 9 +
 2 files changed, 12 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/551ccfba/python/pyspark/sql/session.py
--
diff --git a/python/pyspark/sql/session.py b/python/pyspark/sql/session.py
index 3e45747..604021c 100644
--- a/python/pyspark/sql/session.py
+++ b/python/pyspark/sql/session.py
@@ -648,7 +648,9 @@ class SparkSession(object):
 
 # If no schema supplied by user then get the names of columns only
 if schema is None:
-schema = [x.encode('utf-8') if not isinstance(x, str) else x 
for x in data.columns]
+schema = [str(x) if not isinstance(x, basestring) else
+  (x.encode('utf-8') if not isinstance(x, str) else x)
+  for x in data.columns]
 
 if self.conf.get("spark.sql.execution.arrow.enabled", 
"false").lower() == "true" \
 and len(data) > 0:

http://git-wip-us.apache.org/repos/asf/spark/blob/551ccfba/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 13576ff..80a94a9 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -3532,6 +3532,15 @@ class ArrowTests(ReusedSQLTestCase):
 self.assertTrue(expected[r][e] == result_arrow[r][e] and
 result[r][e] == result_arrow[r][e])
 
+def test_createDataFrame_with_int_col_names(self):
+import numpy as np
+import pandas as pd
+pdf = pd.DataFrame(np.random.rand(4, 2))
+df, df_arrow = self._createDataFrame_toggle(pdf)
+pdf_col_names = [str(c) for c in pdf.columns]
+self.assertEqual(pdf_col_names, df.columns)
+self.assertEqual(pdf_col_names, df_arrow.columns)
+
 
 @unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not 
installed")
 class PandasUDFTests(ReusedSQLTestCase):





spark git commit: [SPARK-23141][SQL][PYSPARK] Support data type string as a returnType for registerJavaFunction.

2018-01-18 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 8a9827482 -> e0421c650


[SPARK-23141][SQL][PYSPARK] Support data type string as a returnType for 
registerJavaFunction.

## What changes were proposed in this pull request?

Currently `UDFRegistration.registerJavaFunction` doesn't support data type 
string as a `returnType` whereas `UDFRegistration.register`, `udf`, or 
`pandas_udf` does.
We can support it for `UDFRegistration.registerJavaFunction` as well.
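
A sketch of the new usage (assuming a running SparkSession named `spark`; the Java class below is the test helper used in the doctest, so substitute your own UDF class in practice):

```python
# Sketch of the new capability: returnType may be a DDL-formatted type string
# instead of a DataType object. "test.org.apache.spark.sql.JavaStringLength" is
# the test helper from the doctest; use your own UDF class on the classpath.
spark.udf.registerJavaFunction(
    "javaStringLength", "test.org.apache.spark.sql.JavaStringLength", "integer")
spark.sql("SELECT javaStringLength('test')").collect()
# [Row(UDF:javaStringLength(test)=4)]
```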

## How was this patch tested?

Added a doctest and existing tests.

Author: Takuya UESHIN 

Closes #20307 from ueshin/issues/SPARK-23141.

(cherry picked from commit 5063b7481173ad72bd0dc941b5cf3c9b26a591e4)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e0421c65
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e0421c65
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e0421c65

Branch: refs/heads/branch-2.3
Commit: e0421c65093f66b365539358dd9be38d2006fa47
Parents: 8a98274
Author: Takuya UESHIN 
Authored: Thu Jan 18 22:33:04 2018 +0900
Committer: hyukjinkwon 
Committed: Thu Jan 18 22:33:25 2018 +0900

--
 python/pyspark/sql/functions.py |  6 --
 python/pyspark/sql/udf.py   | 14 --
 2 files changed, 16 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e0421c65/python/pyspark/sql/functions.py
--
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 988c1d2..961b326 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -2108,7 +2108,8 @@ def udf(f=None, returnType=StringType()):
 can fail on special rows, the workaround is to incorporate the 
condition into the functions.
 
 :param f: python function if used as a standalone function
-:param returnType: a :class:`pyspark.sql.types.DataType` object
+:param returnType: the return type of the user-defined function. The value 
can be either a
+:class:`pyspark.sql.types.DataType` object or a DDL-formatted type 
string.
 
 >>> from pyspark.sql.types import IntegerType
 >>> slen = udf(lambda s: len(s), IntegerType())
@@ -2148,7 +2149,8 @@ def pandas_udf(f=None, returnType=None, 
functionType=None):
 Creates a vectorized user defined function (UDF).
 
 :param f: user-defined function. A python function if used as a standalone 
function
-:param returnType: a :class:`pyspark.sql.types.DataType` object
+:param returnType: the return type of the user-defined function. The value 
can be either a
+:class:`pyspark.sql.types.DataType` object or a DDL-formatted type 
string.
 :param functionType: an enum value in 
:class:`pyspark.sql.functions.PandasUDFType`.
  Default: SCALAR.
 

http://git-wip-us.apache.org/repos/asf/spark/blob/e0421c65/python/pyspark/sql/udf.py
--
diff --git a/python/pyspark/sql/udf.py b/python/pyspark/sql/udf.py
index 1943bb7..c77f19f8 100644
--- a/python/pyspark/sql/udf.py
+++ b/python/pyspark/sql/udf.py
@@ -206,7 +206,8 @@ class UDFRegistration(object):
 :param f: a Python function, or a user-defined function. The 
user-defined function can
 be either row-at-a-time or vectorized. See 
:meth:`pyspark.sql.functions.udf` and
 :meth:`pyspark.sql.functions.pandas_udf`.
-:param returnType: the return type of the registered user-defined 
function.
+:param returnType: the return type of the registered user-defined 
function. The value can
+be either a :class:`pyspark.sql.types.DataType` object or a 
DDL-formatted type string.
 :return: a user-defined function.
 
 `returnType` can be optionally specified when `f` is a Python function 
but not
@@ -303,21 +304,30 @@ class UDFRegistration(object):
 
 :param name: name of the user-defined function
 :param javaClassName: fully qualified name of java class
-:param returnType: a :class:`pyspark.sql.types.DataType` object
+:param returnType: the return type of the registered Java function. 
The value can be either
+a :class:`pyspark.sql.types.DataType` object or a DDL-formatted 
type string.
 
 >>> from pyspark.sql.types import IntegerType
 >>> spark.udf.registerJavaFunction(
 ... "javaStringLength", 
"test.org.apache.spark.sql.JavaStringLength", IntegerType())
 >>> spark.sql("SELECT javaStringLength('test')").collect()
 [Row(UDF:javaStringLength(test)=4)]
+
 >>> spark.udf.registerJavaFunction(
 ... 

spark git commit: [SPARK-23141][SQL][PYSPARK] Support data type string as a returnType for registerJavaFunction.

2018-01-18 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master e28eb4311 -> 5063b7481


[SPARK-23141][SQL][PYSPARK] Support data type string as a returnType for 
registerJavaFunction.

## What changes were proposed in this pull request?

Currently `UDFRegistration.registerJavaFunction` doesn't support data type 
string as a `returnType` whereas `UDFRegistration.register`, `udf`, or 
`pandas_udf` does.
We can support it for `UDFRegistration.registerJavaFunction` as well.

## How was this patch tested?

Added a doctest and existing tests.

Author: Takuya UESHIN 

Closes #20307 from ueshin/issues/SPARK-23141.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5063b748
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5063b748
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5063b748

Branch: refs/heads/master
Commit: 5063b7481173ad72bd0dc941b5cf3c9b26a591e4
Parents: e28eb43
Author: Takuya UESHIN 
Authored: Thu Jan 18 22:33:04 2018 +0900
Committer: hyukjinkwon 
Committed: Thu Jan 18 22:33:04 2018 +0900

--
 python/pyspark/sql/functions.py |  6 --
 python/pyspark/sql/udf.py   | 14 --
 2 files changed, 16 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/5063b748/python/pyspark/sql/functions.py
--
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 988c1d2..961b326 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -2108,7 +2108,8 @@ def udf(f=None, returnType=StringType()):
 can fail on special rows, the workaround is to incorporate the 
condition into the functions.
 
 :param f: python function if used as a standalone function
-:param returnType: a :class:`pyspark.sql.types.DataType` object
+:param returnType: the return type of the user-defined function. The value 
can be either a
+:class:`pyspark.sql.types.DataType` object or a DDL-formatted type 
string.
 
 >>> from pyspark.sql.types import IntegerType
 >>> slen = udf(lambda s: len(s), IntegerType())
@@ -2148,7 +2149,8 @@ def pandas_udf(f=None, returnType=None, 
functionType=None):
 Creates a vectorized user defined function (UDF).
 
 :param f: user-defined function. A python function if used as a standalone 
function
-:param returnType: a :class:`pyspark.sql.types.DataType` object
+:param returnType: the return type of the user-defined function. The value 
can be either a
+:class:`pyspark.sql.types.DataType` object or a DDL-formatted type 
string.
 :param functionType: an enum value in 
:class:`pyspark.sql.functions.PandasUDFType`.
  Default: SCALAR.
 

http://git-wip-us.apache.org/repos/asf/spark/blob/5063b748/python/pyspark/sql/udf.py
--
diff --git a/python/pyspark/sql/udf.py b/python/pyspark/sql/udf.py
index 1943bb7..c77f19f8 100644
--- a/python/pyspark/sql/udf.py
+++ b/python/pyspark/sql/udf.py
@@ -206,7 +206,8 @@ class UDFRegistration(object):
 :param f: a Python function, or a user-defined function. The 
user-defined function can
 be either row-at-a-time or vectorized. See 
:meth:`pyspark.sql.functions.udf` and
 :meth:`pyspark.sql.functions.pandas_udf`.
-:param returnType: the return type of the registered user-defined 
function.
+:param returnType: the return type of the registered user-defined 
function. The value can
+be either a :class:`pyspark.sql.types.DataType` object or a 
DDL-formatted type string.
 :return: a user-defined function.
 
 `returnType` can be optionally specified when `f` is a Python function 
but not
@@ -303,21 +304,30 @@ class UDFRegistration(object):
 
 :param name: name of the user-defined function
 :param javaClassName: fully qualified name of java class
-:param returnType: a :class:`pyspark.sql.types.DataType` object
+:param returnType: the return type of the registered Java function. 
The value can be either
+a :class:`pyspark.sql.types.DataType` object or a DDL-formatted 
type string.
 
 >>> from pyspark.sql.types import IntegerType
 >>> spark.udf.registerJavaFunction(
 ... "javaStringLength", 
"test.org.apache.spark.sql.JavaStringLength", IntegerType())
 >>> spark.sql("SELECT javaStringLength('test')").collect()
 [Row(UDF:javaStringLength(test)=4)]
+
 >>> spark.udf.registerJavaFunction(
 ... "javaStringLength2", 
"test.org.apache.spark.sql.JavaStringLength")
 >>> spark.sql("SELECT 

spark git commit: [SPARK-23094] Fix invalid character handling in JsonDataSource

2018-01-18 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 b8c6d9303 -> a295034da


[SPARK-23094] Fix invalid character handling in JsonDataSource

## What changes were proposed in this pull request?

There were two related fixes regarding `from_json`, `get_json_object` and 
`json_tuple` ([Fix 
#1](https://github.com/apache/spark/commit/c8803c06854683c8761fdb3c0e4c55d5a9e22a95),
 [Fix 
#2](https://github.com/apache/spark/commit/86174ea89b39a300caaba6baffac70f3dc702788)),
 but it seems they weren't comprehensive. This change extends those fixes to all the parsers and adds tests for each case.
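
The behaviour the new tests cover can also be sketched from PySpark (assuming a running SparkSession named `spark` and a writable scratch path; the path below is made up):

```python
# Sketch of the behaviour covered by the new tests (assumes a running
# SparkSession `spark`; the output path is a made-up scratch location).
# A record starting with invalid characters should land in _corrupt_record
# instead of failing the whole read.
from pyspark.sql.types import IntegerType, StringType, StructType

bad_json = "\u0000\u0000\u0000A\u0001AAA"
spark.createDataFrame([(bad_json,), ('{"a":1}',)], ["value"]) \
    .write.mode("overwrite").text("/tmp/json-invfix-demo")

schema = StructType().add("a", IntegerType()).add("_corrupt_record", StringType())
df = spark.read.schema(schema).json("/tmp/json-invfix-demo")
df.show()  # the bad record appears under _corrupt_record; {"a":1} parses as a=1
```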

## How was this patch tested?

Regression tests

Author: Burak Yavuz 

Closes #20302 from brkyvz/json-invfix.

(cherry picked from commit e01919e834d301e13adc8919932796ebae900576)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a295034d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a295034d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a295034d

Branch: refs/heads/branch-2.3
Commit: a295034da6178f8654c3977903435384b3765b5e
Parents: b8c6d93
Author: Burak Yavuz 
Authored: Fri Jan 19 07:36:06 2018 +0900
Committer: hyukjinkwon 
Committed: Fri Jan 19 07:36:21 2018 +0900

--
 .../sql/catalyst/json/CreateJacksonParser.scala |  5 +--
 .../sql/sources/JsonHadoopFsRelationSuite.scala | 34 
 2 files changed, 37 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a295034d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/CreateJacksonParser.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/CreateJacksonParser.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/CreateJacksonParser.scala
index 025a388..b1672e7 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/CreateJacksonParser.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/CreateJacksonParser.scala
@@ -40,10 +40,11 @@ private[sql] object CreateJacksonParser extends 
Serializable {
   }
 
   def text(jsonFactory: JsonFactory, record: Text): JsonParser = {
-jsonFactory.createParser(record.getBytes, 0, record.getLength)
+val bain = new ByteArrayInputStream(record.getBytes, 0, record.getLength)
+jsonFactory.createParser(new InputStreamReader(bain, "UTF-8"))
   }
 
   def inputStream(jsonFactory: JsonFactory, record: InputStream): JsonParser = 
{
-jsonFactory.createParser(record)
+jsonFactory.createParser(new InputStreamReader(record, "UTF-8"))
   }
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/a295034d/sql/hive/src/test/scala/org/apache/spark/sql/sources/JsonHadoopFsRelationSuite.scala
--
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/sources/JsonHadoopFsRelationSuite.scala
 
b/sql/hive/src/test/scala/org/apache/spark/sql/sources/JsonHadoopFsRelationSuite.scala
index 49be304..27f398e 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/sources/JsonHadoopFsRelationSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/sources/JsonHadoopFsRelationSuite.scala
@@ -28,6 +28,8 @@ import org.apache.spark.sql.types._
 class JsonHadoopFsRelationSuite extends HadoopFsRelationTest {
   override val dataSourceName: String = "json"
 
+  private val badJson = "\u0000\u0000\u0000A\u0001AAA"
+
   // JSON does not write data of NullType and does not play well with 
BinaryType.
   override protected def supportsDataType(dataType: DataType): Boolean = 
dataType match {
 case _: NullType => false
@@ -105,4 +107,36 @@ class JsonHadoopFsRelationSuite extends 
HadoopFsRelationTest {
   )
 }
   }
+
+  test("invalid json with leading nulls - from file (multiLine=true)") {
+import testImplicits._
+withTempDir { tempDir =>
+  val path = tempDir.getAbsolutePath
+  Seq(badJson, """{"a":1}""").toDS().write.mode("overwrite").text(path)
+  val expected = s"""$badJson\n{"a":1}\n"""
+  val schema = new StructType().add("a", 
IntegerType).add("_corrupt_record", StringType)
+  val df =
+spark.read.format(dataSourceName).option("multiLine", 
true).schema(schema).load(path)
+  checkAnswer(df, Row(null, expected))
+}
+  }
+
+  test("invalid json with leading nulls - from file (multiLine=false)") {
+import testImplicits._
+withTempDir { tempDir =>
+  val path = tempDir.getAbsolutePath
+  Seq(badJson, """{"a":1}""").toDS().write.mode("overwrite").text(path)
+  val schema = new StructType().add("a", 
IntegerType).add("_corrupt_record", StringType)
+  

spark git commit: [SPARK-23094] Fix invalid character handling in JsonDataSource

2018-01-18 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master f568e9cf7 -> e01919e83


[SPARK-23094] Fix invalid character handling in JsonDataSource

## What changes were proposed in this pull request?

There were two related fixes regarding `from_json`, `get_json_object` and 
`json_tuple` ([Fix 
#1](https://github.com/apache/spark/commit/c8803c06854683c8761fdb3c0e4c55d5a9e22a95),
 [Fix 
#2](https://github.com/apache/spark/commit/86174ea89b39a300caaba6baffac70f3dc702788)),
 but it seems they weren't comprehensive. This change extends those fixes to all the parsers and adds tests for each case.

## How was this patch tested?

Regression tests

Author: Burak Yavuz 

Closes #20302 from brkyvz/json-invfix.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e01919e8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e01919e8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e01919e8

Branch: refs/heads/master
Commit: e01919e834d301e13adc8919932796ebae900576
Parents: f568e9c
Author: Burak Yavuz 
Authored: Fri Jan 19 07:36:06 2018 +0900
Committer: hyukjinkwon 
Committed: Fri Jan 19 07:36:06 2018 +0900

--
 .../sql/catalyst/json/CreateJacksonParser.scala |  5 +--
 .../sql/sources/JsonHadoopFsRelationSuite.scala | 34 
 2 files changed, 37 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e01919e8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/CreateJacksonParser.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/CreateJacksonParser.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/CreateJacksonParser.scala
index 025a388..b1672e7 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/CreateJacksonParser.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/CreateJacksonParser.scala
@@ -40,10 +40,11 @@ private[sql] object CreateJacksonParser extends 
Serializable {
   }
 
   def text(jsonFactory: JsonFactory, record: Text): JsonParser = {
-jsonFactory.createParser(record.getBytes, 0, record.getLength)
+val bain = new ByteArrayInputStream(record.getBytes, 0, record.getLength)
+jsonFactory.createParser(new InputStreamReader(bain, "UTF-8"))
   }
 
   def inputStream(jsonFactory: JsonFactory, record: InputStream): JsonParser = 
{
-jsonFactory.createParser(record)
+jsonFactory.createParser(new InputStreamReader(record, "UTF-8"))
   }
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/e01919e8/sql/hive/src/test/scala/org/apache/spark/sql/sources/JsonHadoopFsRelationSuite.scala
--
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/sources/JsonHadoopFsRelationSuite.scala
 
b/sql/hive/src/test/scala/org/apache/spark/sql/sources/JsonHadoopFsRelationSuite.scala
index 49be304..27f398e 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/sources/JsonHadoopFsRelationSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/sources/JsonHadoopFsRelationSuite.scala
@@ -28,6 +28,8 @@ import org.apache.spark.sql.types._
 class JsonHadoopFsRelationSuite extends HadoopFsRelationTest {
   override val dataSourceName: String = "json"
 
+  private val badJson = "\u0000\u0000\u0000A\u0001AAA"
+
   // JSON does not write data of NullType and does not play well with 
BinaryType.
   override protected def supportsDataType(dataType: DataType): Boolean = 
dataType match {
 case _: NullType => false
@@ -105,4 +107,36 @@ class JsonHadoopFsRelationSuite extends 
HadoopFsRelationTest {
   )
 }
   }
+
+  test("invalid json with leading nulls - from file (multiLine=true)") {
+import testImplicits._
+withTempDir { tempDir =>
+  val path = tempDir.getAbsolutePath
+  Seq(badJson, """{"a":1}""").toDS().write.mode("overwrite").text(path)
+  val expected = s"""$badJson\n{"a":1}\n"""
+  val schema = new StructType().add("a", 
IntegerType).add("_corrupt_record", StringType)
+  val df =
+spark.read.format(dataSourceName).option("multiLine", 
true).schema(schema).load(path)
+  checkAnswer(df, Row(null, expected))
+}
+  }
+
+  test("invalid json with leading nulls - from file (multiLine=false)") {
+import testImplicits._
+withTempDir { tempDir =>
+  val path = tempDir.getAbsolutePath
+  Seq(badJson, """{"a":1}""").toDS().write.mode("overwrite").text(path)
+  val schema = new StructType().add("a", 
IntegerType).add("_corrupt_record", StringType)
+  val df =
+spark.read.format(dataSourceName).option("multiLine", 
false).schema(schema).load(path)
+  checkAnswer(df, 

spark git commit: [SPARK-23080][SQL] Improve error message for built-in functions

2018-01-15 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 6c81fe227 -> 8ab2d7ea9


[SPARK-23080][SQL] Improve error message for built-in functions

## What changes were proposed in this pull request?

When a user passes the wrong number of parameters to a function, an AnalysisException is thrown. If the function is a UDF, the user is told how many parameters the function expected and how many were supplied. If the function is a built-in one, however, no information about the expected and actual number of parameters is provided. Having that information helps debug such errors (e.g. bad quote escaping may lead to a different number of parameters than expected).

The PR adds information about the number of parameters passed and the number expected, analogously to what already happens for UDFs.
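
The difference is easy to see from any SQL front end; a small PySpark sketch (assuming a running SparkSession named `spark`):

```python
# Sketch of the improved message (assumes a running SparkSession `spark`).
from pyspark.sql.utils import AnalysisException

try:
    spark.sql("SELECT to_json()")
except AnalysisException as e:
    print(e)
# Before: Invalid number of arguments for function to_json; line 1 pos 7
# After:  Invalid number of arguments for function to_json.
#         Expected: one of 1, 2 and 3; Found: 0; line 1 pos 7
```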

## How was this patch tested?

modified existing UT + manual test

Author: Marco Gaido 

Closes #20271 from mgaido91/SPARK-23080.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8ab2d7ea
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8ab2d7ea
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8ab2d7ea

Branch: refs/heads/master
Commit: 8ab2d7ea99b2cff8b54b2cb3a1dbf7580845986a
Parents: 6c81fe2
Author: Marco Gaido 
Authored: Tue Jan 16 11:47:42 2018 +0900
Committer: hyukjinkwon 
Committed: Tue Jan 16 11:47:42 2018 +0900

--
 .../spark/sql/catalyst/analysis/FunctionRegistry.scala| 10 +-
 .../resources/sql-tests/results/json-functions.sql.out|  4 ++--
 .../src/test/scala/org/apache/spark/sql/UDFSuite.scala|  4 ++--
 3 files changed, 13 insertions(+), 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8ab2d7ea/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
index 5ddb398..747016b 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
@@ -526,7 +526,15 @@ object FunctionRegistry {
 // Otherwise, find a constructor method that matches the number of 
arguments, and use that.
 val params = Seq.fill(expressions.size)(classOf[Expression])
 val f = constructors.find(_.getParameterTypes.toSeq == 
params).getOrElse {
-  throw new AnalysisException(s"Invalid number of arguments for 
function $name")
+  val validParametersCount = 
constructors.map(_.getParameterCount).distinct.sorted
+  val expectedNumberOfParameters = if (validParametersCount.length == 
1) {
+validParametersCount.head.toString
+  } else {
+validParametersCount.init.mkString("one of ", ", ", " and ") +
+  validParametersCount.last
+  }
+  throw new AnalysisException(s"Invalid number of arguments for 
function $name. " +
+s"Expected: $expectedNumberOfParameters; Found: ${params.length}")
 }
 Try(f.newInstance(expressions : _*).asInstanceOf[Expression]) match {
   case Success(e) => e

http://git-wip-us.apache.org/repos/asf/spark/blob/8ab2d7ea/sql/core/src/test/resources/sql-tests/results/json-functions.sql.out
--
diff --git 
a/sql/core/src/test/resources/sql-tests/results/json-functions.sql.out 
b/sql/core/src/test/resources/sql-tests/results/json-functions.sql.out
index d9dc728..581dddc 100644
--- a/sql/core/src/test/resources/sql-tests/results/json-functions.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/json-functions.sql.out
@@ -129,7 +129,7 @@ select to_json()
 struct<>
 -- !query 12 output
 org.apache.spark.sql.AnalysisException
-Invalid number of arguments for function to_json; line 1 pos 7
+Invalid number of arguments for function to_json. Expected: one of 1, 2 and 3; 
Found: 0; line 1 pos 7
 
 
 -- !query 13
@@ -225,7 +225,7 @@ select from_json()
 struct<>
 -- !query 21 output
 org.apache.spark.sql.AnalysisException
-Invalid number of arguments for function from_json; line 1 pos 7
+Invalid number of arguments for function from_json. Expected: one of 2, 3 and 
4; Found: 0; line 1 pos 7
 
 
 -- !query 22

http://git-wip-us.apache.org/repos/asf/spark/blob/8ab2d7ea/sql/core/src/test/scala/org/apache/spark/sql/UDFSuite.scala
--
diff --git 

spark git commit: [SPARK-23080][SQL] Improve error message for built-in functions

2018-01-15 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 706a308bd -> bb8e5addc


[SPARK-23080][SQL] Improve error message for built-in functions

## What changes were proposed in this pull request?

When a user passes the wrong number of parameters to a function, an AnalysisException is thrown. If the function is a UDF, the user is told how many parameters the function expected and how many were supplied. If the function is a built-in one, however, no information about the expected and actual number of parameters is provided. Having that information helps debug such errors (e.g. bad quote escaping may lead to a different number of parameters than expected).

The PR adds information about the number of parameters passed and the number expected, analogously to what already happens for UDFs.

## How was this patch tested?

modified existing UT + manual test

Author: Marco Gaido 

Closes #20271 from mgaido91/SPARK-23080.

(cherry picked from commit 8ab2d7ea99b2cff8b54b2cb3a1dbf7580845986a)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bb8e5add
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bb8e5add
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bb8e5add

Branch: refs/heads/branch-2.3
Commit: bb8e5addc79652308169532c33baa8117c2464ca
Parents: 706a308
Author: Marco Gaido 
Authored: Tue Jan 16 11:47:42 2018 +0900
Committer: hyukjinkwon 
Committed: Tue Jan 16 11:47:58 2018 +0900

--
 .../spark/sql/catalyst/analysis/FunctionRegistry.scala| 10 +-
 .../resources/sql-tests/results/json-functions.sql.out|  4 ++--
 .../src/test/scala/org/apache/spark/sql/UDFSuite.scala|  4 ++--
 3 files changed, 13 insertions(+), 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/bb8e5add/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
index 5ddb398..747016b 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
@@ -526,7 +526,15 @@ object FunctionRegistry {
 // Otherwise, find a constructor method that matches the number of 
arguments, and use that.
 val params = Seq.fill(expressions.size)(classOf[Expression])
 val f = constructors.find(_.getParameterTypes.toSeq == 
params).getOrElse {
-  throw new AnalysisException(s"Invalid number of arguments for 
function $name")
+  val validParametersCount = 
constructors.map(_.getParameterCount).distinct.sorted
+  val expectedNumberOfParameters = if (validParametersCount.length == 
1) {
+validParametersCount.head.toString
+  } else {
+validParametersCount.init.mkString("one of ", ", ", " and ") +
+  validParametersCount.last
+  }
+  throw new AnalysisException(s"Invalid number of arguments for 
function $name. " +
+s"Expected: $expectedNumberOfParameters; Found: ${params.length}")
 }
 Try(f.newInstance(expressions : _*).asInstanceOf[Expression]) match {
   case Success(e) => e

http://git-wip-us.apache.org/repos/asf/spark/blob/bb8e5add/sql/core/src/test/resources/sql-tests/results/json-functions.sql.out
--
diff --git 
a/sql/core/src/test/resources/sql-tests/results/json-functions.sql.out 
b/sql/core/src/test/resources/sql-tests/results/json-functions.sql.out
index d9dc728..581dddc 100644
--- a/sql/core/src/test/resources/sql-tests/results/json-functions.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/json-functions.sql.out
@@ -129,7 +129,7 @@ select to_json()
 struct<>
 -- !query 12 output
 org.apache.spark.sql.AnalysisException
-Invalid number of arguments for function to_json; line 1 pos 7
+Invalid number of arguments for function to_json. Expected: one of 1, 2 and 3; 
Found: 0; line 1 pos 7
 
 
 -- !query 13
@@ -225,7 +225,7 @@ select from_json()
 struct<>
 -- !query 21 output
 org.apache.spark.sql.AnalysisException
-Invalid number of arguments for function from_json; line 1 pos 7
+Invalid number of arguments for function from_json. Expected: one of 2, 3 and 
4; Found: 0; line 1 pos 7
 
 
 -- !query 22

http://git-wip-us.apache.org/repos/asf/spark/blob/bb8e5add/sql/core/src/test/scala/org/apache/spark/sql/UDFSuite.scala

spark git commit: [SPARK-22978][PYSPARK] Register Vectorized UDFs for SQL Statement

2018-01-16 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 20c69816a -> 5c06ee2d4


[SPARK-22978][PYSPARK] Register Vectorized UDFs for SQL Statement

## What changes were proposed in this pull request?
Register Vectorized UDFs for SQL Statement. For example,

```Python
>>> from pyspark.sql.functions import pandas_udf, PandasUDFType
>>> pandas_udf("integer", PandasUDFType.SCALAR)
... def add_one(x):
... return x + 1
...
>>> _ = spark.udf.register("add_one", add_one)
>>> spark.sql("SELECT add_one(id) FROM range(3)").collect()
[Row(add_one(id)=1), Row(add_one(id)=2), Row(add_one(id)=3)]
```

## How was this patch tested?
Added test cases

Author: gatorsmile 

Closes #20171 from gatorsmile/supportVectorizedUDF.

(cherry picked from commit b85eb946ac298e711dad25db0d04eee41d7fd236)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5c06ee2d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5c06ee2d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5c06ee2d

Branch: refs/heads/branch-2.3
Commit: 5c06ee2d49987c297e93de87f99c701e178ba294
Parents: 20c6981
Author: gatorsmile 
Authored: Tue Jan 16 20:20:33 2018 +0900
Committer: hyukjinkwon 
Committed: Tue Jan 16 20:21:36 2018 +0900

--
 python/pyspark/sql/catalog.py | 75 ++---
 python/pyspark/sql/context.py | 51 +
 python/pyspark/sql/tests.py   | 76 --
 3 files changed, 155 insertions(+), 47 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/5c06ee2d/python/pyspark/sql/catalog.py
--
diff --git a/python/pyspark/sql/catalog.py b/python/pyspark/sql/catalog.py
index 1566031..35fbe9e 100644
--- a/python/pyspark/sql/catalog.py
+++ b/python/pyspark/sql/catalog.py
@@ -226,18 +226,23 @@ class Catalog(object):
 
 @ignore_unicode_prefix
 @since(2.0)
-def registerFunction(self, name, f, returnType=StringType()):
+def registerFunction(self, name, f, returnType=None):
 """Registers a Python function (including lambda function) or a 
:class:`UserDefinedFunction`
-as a UDF. The registered UDF can be used in SQL statement.
+as a UDF. The registered UDF can be used in SQL statements.
 
-In addition to a name and the function itself, the return type can be 
optionally specified.
-When the return type is not given it default to a string and 
conversion will automatically
-be done.  For any other return type, the produced object must match 
the specified type.
+:func:`spark.udf.register` is an alias for 
:func:`spark.catalog.registerFunction`.
 
-:param name: name of the UDF
-:param f: a Python function, or a wrapped/native UserDefinedFunction
-:param returnType: a :class:`pyspark.sql.types.DataType` object
-:return: a wrapped :class:`UserDefinedFunction`
+In addition to a name and the function itself, `returnType` can be 
optionally specified.
+1) When f is a Python function, `returnType` defaults to a string. The 
produced object must
+match the specified type. 2) When f is a :class:`UserDefinedFunction`, 
Spark uses the return
+type of the given UDF as the return type of the registered UDF. The 
input parameter
+`returnType` is None by default. If given by users, the value must be 
None.
+
+:param name: name of the UDF in SQL statements.
+:param f: a Python function, or a wrapped/native UserDefinedFunction. 
The UDF can be either
+row-at-a-time or vectorized.
+:param returnType: the return type of the registered UDF.
+:return: a wrapped/native :class:`UserDefinedFunction`
 
 >>> strlen = spark.catalog.registerFunction("stringLengthString", len)
 >>> spark.sql("SELECT stringLengthString('test')").collect()
@@ -256,27 +261,55 @@ class Catalog(object):
 >>> spark.sql("SELECT stringLengthInt('test')").collect()
 [Row(stringLengthInt(test)=4)]
 
+>>> from pyspark.sql.types import IntegerType
+>>> from pyspark.sql.functions import udf
+>>> slen = udf(lambda s: len(s), IntegerType())
+>>> _ = spark.udf.register("slen", slen)
+>>> spark.sql("SELECT slen('test')").collect()
+[Row(slen(test)=4)]
+
 >>> import random
 >>> from pyspark.sql.functions import udf
->>> from pyspark.sql.types import IntegerType, StringType
+>>> from pyspark.sql.types import IntegerType
 >>> random_udf = udf(lambda: random.randint(0, 100), 
IntegerType()).asNondeterministic()
->>> newRandom_udf = 

spark git commit: [SPARK-22978][PYSPARK] Register Vectorized UDFs for SQL Statement

2018-01-16 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 66217dac4 -> b85eb946a


[SPARK-22978][PYSPARK] Register Vectorized UDFs for SQL Statement

## What changes were proposed in this pull request?
Register Vectorized UDFs for SQL Statement. For example,

```Python
>>> from pyspark.sql.functions import pandas_udf, PandasUDFType
>>> pandas_udf("integer", PandasUDFType.SCALAR)
... def add_one(x):
... return x + 1
...
>>> _ = spark.udf.register("add_one", add_one)
>>> spark.sql("SELECT add_one(id) FROM range(3)").collect()
[Row(add_one(id)=1), Row(add_one(id)=2), Row(add_one(id)=3)]
```

## How was this patch tested?
Added test cases

Author: gatorsmile 

Closes #20171 from gatorsmile/supportVectorizedUDF.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b85eb946
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b85eb946
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b85eb946

Branch: refs/heads/master
Commit: b85eb946ac298e711dad25db0d04eee41d7fd236
Parents: 66217da
Author: gatorsmile 
Authored: Tue Jan 16 20:20:33 2018 +0900
Committer: hyukjinkwon 
Committed: Tue Jan 16 20:20:33 2018 +0900

--
 python/pyspark/sql/catalog.py | 75 ++---
 python/pyspark/sql/context.py | 51 +
 python/pyspark/sql/tests.py   | 76 --
 3 files changed, 155 insertions(+), 47 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b85eb946/python/pyspark/sql/catalog.py
--
diff --git a/python/pyspark/sql/catalog.py b/python/pyspark/sql/catalog.py
index 1566031..35fbe9e 100644
--- a/python/pyspark/sql/catalog.py
+++ b/python/pyspark/sql/catalog.py
@@ -226,18 +226,23 @@ class Catalog(object):
 
 @ignore_unicode_prefix
 @since(2.0)
-def registerFunction(self, name, f, returnType=StringType()):
+def registerFunction(self, name, f, returnType=None):
 """Registers a Python function (including lambda function) or a 
:class:`UserDefinedFunction`
-as a UDF. The registered UDF can be used in SQL statement.
+as a UDF. The registered UDF can be used in SQL statements.
 
-In addition to a name and the function itself, the return type can be 
optionally specified.
-When the return type is not given it default to a string and 
conversion will automatically
-be done.  For any other return type, the produced object must match 
the specified type.
+:func:`spark.udf.register` is an alias for 
:func:`spark.catalog.registerFunction`.
 
-:param name: name of the UDF
-:param f: a Python function, or a wrapped/native UserDefinedFunction
-:param returnType: a :class:`pyspark.sql.types.DataType` object
-:return: a wrapped :class:`UserDefinedFunction`
+In addition to a name and the function itself, `returnType` can be 
optionally specified.
+1) When f is a Python function, `returnType` defaults to a string. The 
produced object must
+match the specified type. 2) When f is a :class:`UserDefinedFunction`, 
Spark uses the return
+type of the given UDF as the return type of the registered UDF. The 
input parameter
+`returnType` is None by default. If given by users, the value must be 
None.
+
+:param name: name of the UDF in SQL statements.
+:param f: a Python function, or a wrapped/native UserDefinedFunction. 
The UDF can be either
+row-at-a-time or vectorized.
+:param returnType: the return type of the registered UDF.
+:return: a wrapped/native :class:`UserDefinedFunction`
 
 >>> strlen = spark.catalog.registerFunction("stringLengthString", len)
 >>> spark.sql("SELECT stringLengthString('test')").collect()
@@ -256,27 +261,55 @@ class Catalog(object):
 >>> spark.sql("SELECT stringLengthInt('test')").collect()
 [Row(stringLengthInt(test)=4)]
 
+>>> from pyspark.sql.types import IntegerType
+>>> from pyspark.sql.functions import udf
+>>> slen = udf(lambda s: len(s), IntegerType())
+>>> _ = spark.udf.register("slen", slen)
+>>> spark.sql("SELECT slen('test')").collect()
+[Row(slen(test)=4)]
+
 >>> import random
 >>> from pyspark.sql.functions import udf
->>> from pyspark.sql.types import IntegerType, StringType
+>>> from pyspark.sql.types import IntegerType
 >>> random_udf = udf(lambda: random.randint(0, 100), 
IntegerType()).asNondeterministic()
->>> newRandom_udf = spark.catalog.registerFunction("random_udf", 
random_udf, StringType())
+>>> new_random_udf = 

spark git commit: [SPARK-23069][DOCS][SPARKR] fix R doc for describe missing text

2018-01-14 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 7a3d0aad2 -> 66738d29c


[SPARK-23069][DOCS][SPARKR] fix R doc for describe missing text

## What changes were proposed in this pull request?

Fix R documentation truncated by unescaped `%` characters in the roxygen comments.

## How was this patch tested?

manually

Author: Felix Cheung 

Closes #20263 from felixcheung/r23docfix.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/66738d29
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/66738d29
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/66738d29

Branch: refs/heads/master
Commit: 66738d29c59871b29d26fc3756772b95ef536248
Parents: 7a3d0aa
Author: Felix Cheung 
Authored: Sun Jan 14 19:43:10 2018 +0900
Committer: hyukjinkwon 
Committed: Sun Jan 14 19:43:10 2018 +0900

--
 R/pkg/R/DataFrame.R | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/66738d29/R/pkg/R/DataFrame.R
--
diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index 9956f7e..6caa125 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -3054,10 +3054,10 @@ setMethod("describe",
 #' \item stddev
 #' \item min
 #' \item max
-#' \item arbitrary approximate percentiles specified as a percentage (eg, 
"75%")
+#' \item arbitrary approximate percentiles specified as a percentage (eg, 
"75\%")
 #' }
 #' If no statistics are given, this function computes count, mean, stddev, min,
-#' approximate quartiles (percentiles at 25%, 50%, and 75%), and max.
+#' approximate quartiles (percentiles at 25\%, 50\%, and 75\%), and max.
 #' This function is meant for exploratory data analysis, as we make no 
guarantee about the
 #' backward compatibility of the schema of the resulting Dataset. If you want 
to
 #' programmatically compute summary statistics, use the \code{agg} function 
instead.
@@ -4019,9 +4019,9 @@ setMethod("broadcast",
 #'
 #' Spark will use this watermark for several purposes:
 #' \itemize{
-#'  \item{-} To know when a given time window aggregation can be finalized and 
thus can be emitted
+#'  \item To know when a given time window aggregation can be finalized and 
thus can be emitted
 #' when using output modes that do not allow updates.
-#'  \item{-} To minimize the amount of state that we need to keep for on-going 
aggregations.
+#'  \item To minimize the amount of state that we need to keep for on-going 
aggregations.
 #' }
 #' The current watermark is computed by looking at the \code{MAX(eventTime)} 
seen across
 #' all of the partitions in the query minus a user specified 
\code{delayThreshold}. Due to the cost





spark git commit: [SPARK-23069][DOCS][SPARKR] fix R doc for describe missing text

2018-01-14 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 a335a49ce -> 0d425c336


[SPARK-23069][DOCS][SPARKR] fix R doc for describe missing text

## What changes were proposed in this pull request?

Fix R documentation truncated by unescaped `%` characters in the roxygen comments.

## How was this patch tested?

manually

Author: Felix Cheung 

Closes #20263 from felixcheung/r23docfix.

(cherry picked from commit 66738d29c59871b29d26fc3756772b95ef536248)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0d425c33
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0d425c33
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0d425c33

Branch: refs/heads/branch-2.3
Commit: 0d425c3362dc648d5c85b2b07af1a9df23b6d422
Parents: a335a49
Author: Felix Cheung 
Authored: Sun Jan 14 19:43:10 2018 +0900
Committer: hyukjinkwon 
Committed: Sun Jan 14 19:43:23 2018 +0900

--
 R/pkg/R/DataFrame.R | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/0d425c33/R/pkg/R/DataFrame.R
--
diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index 9956f7e..6caa125 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -3054,10 +3054,10 @@ setMethod("describe",
 #' \item stddev
 #' \item min
 #' \item max
-#' \item arbitrary approximate percentiles specified as a percentage (eg, 
"75%")
+#' \item arbitrary approximate percentiles specified as a percentage (eg, 
"75\%")
 #' }
 #' If no statistics are given, this function computes count, mean, stddev, min,
-#' approximate quartiles (percentiles at 25%, 50%, and 75%), and max.
+#' approximate quartiles (percentiles at 25\%, 50\%, and 75\%), and max.
 #' This function is meant for exploratory data analysis, as we make no 
guarantee about the
 #' backward compatibility of the schema of the resulting Dataset. If you want 
to
 #' programmatically compute summary statistics, use the \code{agg} function 
instead.
@@ -4019,9 +4019,9 @@ setMethod("broadcast",
 #'
 #' Spark will use this watermark for several purposes:
 #' \itemize{
-#'  \item{-} To know when a given time window aggregation can be finalized and 
thus can be emitted
+#'  \item To know when a given time window aggregation can be finalized and 
thus can be emitted
 #' when using output modes that do not allow updates.
-#'  \item{-} To minimize the amount of state that we need to keep for on-going 
aggregations.
+#'  \item To minimize the amount of state that we need to keep for on-going 
aggregations.
 #' }
 #' The current watermark is computed by looking at the \code{MAX(eventTime)} 
seen across
 #' all of the partitions in the query minus a user specified 
\code{delayThreshold}. Due to the cost





spark git commit: [SPARK-20947][PYTHON] Fix encoding/decoding error in pipe action

2018-01-21 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 12faae295 -> 602c6d82d


[SPARK-20947][PYTHON] Fix encoding/decoding error in pipe action

## What changes were proposed in this pull request?

The pipe action converts objects into strings in a way that is affected by the
default encoding setting of the Python environment.

This patch fixes the problem. A detailed description is available here:

https://issues.apache.org/jira/browse/SPARK-20947
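
A minimal Python 2 sketch of the underlying issue (it assumes the common ASCII default encoding and is illustrative only, not the patch itself):

```python
# Python 2 sketch; assumes the default encoding is ASCII, as on most installations.
obj = u'\u6d4b\u8bd5'

try:
    s = str(obj).rstrip('\n') + '\n'      # old conversion: depends on the default codec
except UnicodeEncodeError as e:
    print('str() fails on non-ASCII unicode: %s' % e)

s = unicode(obj).rstrip('\n') + '\n'      # patched conversion
print(s.encode('utf-8'))                  # explicit UTF-8 bytes, independent of the default
```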

## How was this patch tested?

Run the following statement in the PySpark shell; it will NOT raise an exception
if this patch is applied:

```python
sc.parallelize([u'\u6d4b\u8bd5']).pipe('cat').collect()
```

Author: 王晓哲 

Closes #18277 from chaoslawful/fix_pipe_encoding_error.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/602c6d82
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/602c6d82
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/602c6d82

Branch: refs/heads/master
Commit: 602c6d82d893a7f34b37d674642669048eb59b03
Parents: 12faae2
Author: 王晓哲 
Authored: Mon Jan 22 10:43:12 2018 +0900
Committer: hyukjinkwon 
Committed: Mon Jan 22 10:43:12 2018 +0900

--
 python/pyspark/rdd.py   | 2 +-
 python/pyspark/tests.py | 7 +++
 2 files changed, 8 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/602c6d82/python/pyspark/rdd.py
--
diff --git a/python/pyspark/rdd.py b/python/pyspark/rdd.py
index 340bc3a..1b39155 100644
--- a/python/pyspark/rdd.py
+++ b/python/pyspark/rdd.py
@@ -766,7 +766,7 @@ class RDD(object):
 
 def pipe_objs(out):
 for obj in iterator:
-s = str(obj).rstrip('\n') + '\n'
+s = unicode(obj).rstrip('\n') + '\n'
 out.write(s.encode('utf-8'))
 out.close()
 Thread(target=pipe_objs, args=[pipe.stdin]).start()

http://git-wip-us.apache.org/repos/asf/spark/blob/602c6d82/python/pyspark/tests.py
--
diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py
index da99872..5115857 100644
--- a/python/pyspark/tests.py
+++ b/python/pyspark/tests.py
@@ -1239,6 +1239,13 @@ class RDDTests(ReusedPySparkTestCase):
 self.assertRaises(Py4JJavaError, rdd.pipe('grep 4', 
checkCode=True).collect)
 self.assertEqual([], rdd.pipe('grep 4').collect())
 
+def test_pipe_unicode(self):
+# Regression test for SPARK-20947
+data = [u'\u6d4b\u8bd5', '1']
+rdd = self.sc.parallelize(data)
+result = rdd.pipe('cat').collect()
+self.assertEqual(data, result)
+
 
 class ProfilerTests(PySparkTestCase):
 





spark git commit: [SPARK-23169][INFRA][R] Run lintr on the changes of lint-r script and .lintr configuration

2018-01-21 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 2239d7a41 -> 12faae295


[SPARK-23169][INFRA][R] Run lintr on the changes of lint-r script and .lintr 
configuration

## What changes were proposed in this pull request?

When running the `run-tests` script, it seems we don't run lintr when the
`lint-r` script or the `.lintr` configuration changes.
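
A standalone sketch of the new trigger condition (the changed-file list below is hypothetical):

```python
# Hypothetical changed-file list; mirrors the condition added to dev/run-tests.py.
changed_files = ["dev/lint-r", "R/pkg/R/DataFrame.R"]

run_r_style_checks = not changed_files or any(
    f.endswith(".R") or f.endswith("lint-r") or f.endswith(".lintr")
    for f in changed_files)

print(run_r_style_checks)  # True: lintr now also runs when lint-r or .lintr changes
```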

## How was this patch tested?

Jenkins builds

Author: hyukjinkwon 

Closes #20339 from HyukjinKwon/check-r-changed.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/12faae29
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/12faae29
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/12faae29

Branch: refs/heads/master
Commit: 12faae295e42820b99a695ba49826051944244e1
Parents: 2239d7a
Author: hyukjinkwon 
Authored: Mon Jan 22 09:45:27 2018 +0900
Committer: hyukjinkwon 
Committed: Mon Jan 22 09:45:27 2018 +0900

--
 dev/run-tests.py | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/12faae29/dev/run-tests.py
--
diff --git a/dev/run-tests.py b/dev/run-tests.py
index 7e6f7ff..fb270c4 100755
--- a/dev/run-tests.py
+++ b/dev/run-tests.py
@@ -578,7 +578,10 @@ def main():
 pass
 if not changed_files or any(f.endswith(".py") for f in changed_files):
 run_python_style_checks()
-if not changed_files or any(f.endswith(".R") for f in changed_files):
+if not changed_files or any(f.endswith(".R")
+or f.endswith("lint-r")
+or f.endswith(".lintr")
+for f in changed_files):
 run_sparkr_style_checks()
 
 # determine if docs were changed and if we're inside the amplab environment





spark git commit: [SPARK-20749][SQL][FOLLOW-UP] Override prettyName for bit_length and octet_length

2018-01-23 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 96cb60bc3 -> ee572ba8c


[SPARK-20749][SQL][FOLLOW-UP] Override prettyName for bit_length and 
octet_length

## What changes were proposed in this pull request?
We need to override `prettyName` for `bit_length` and `octet_length` to get the
expected auto-generated alias names.
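
A quick way to see the effect from PySpark (a hedged sketch; it assumes a SparkSession named `spark`, and the expected aliases are inferred from the updated golden files):

```python
# With prettyName overridden, auto-generated aliases use the SQL-style names.
print(spark.sql("SELECT BIT_LENGTH('abc'), OCTET_LENGTH('abc')").columns)
# expected: ['bit_length(abc)', 'octet_length(abc)']
```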

## How was this patch tested?
The existing tests

Author: gatorsmile 

Closes #20358 from gatorsmile/test2.3More.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ee572ba8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ee572ba8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ee572ba8

Branch: refs/heads/master
Commit: ee572ba8c1339d21c592001ec4f7f270005ff1cf
Parents: 96cb60b
Author: gatorsmile 
Authored: Tue Jan 23 21:36:20 2018 +0900
Committer: hyukjinkwon 
Committed: Tue Jan 23 21:36:20 2018 +0900

--
 .../apache/spark/sql/catalyst/parser/SqlBase.g4 |  2 +-
 .../expressions/stringExpressions.scala |  4 ++
 .../sql-tests/results/operators.sql.out |  4 +-
 .../scalar-subquery-predicate.sql.out   | 45 ++--
 4 files changed, 30 insertions(+), 25 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ee572ba8/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
--
diff --git 
a/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 
b/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
index 39d5e4e..5fa75fe 100644
--- 
a/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
+++ 
b/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
@@ -141,7 +141,7 @@ statement
 (LIKE? pattern=STRING)?
#showTables
 | SHOW TABLE EXTENDED ((FROM | IN) db=identifier)?
 LIKE pattern=STRING partitionSpec? 
#showTable
-| SHOW DATABASES (LIKE? pattern=STRING)?
#showDatabases
+| SHOW DATABASES (LIKE? pattern=STRING)?   
#showDatabases
 | SHOW TBLPROPERTIES table=tableIdentifier
 ('(' key=tablePropertyKey ')')?
#showTblProperties
 | SHOW COLUMNS (FROM | IN) tableIdentifier

http://git-wip-us.apache.org/repos/asf/spark/blob/ee572ba8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
index e004bfc..5cf783f 100755
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
@@ -1708,6 +1708,8 @@ case class BitLength(child: Expression) extends 
UnaryExpression with ImplicitCas
   case BinaryType => defineCodeGen(ctx, ev, c => s"($c).length * 8")
 }
   }
+
+  override def prettyName: String = "bit_length"
 }
 
 /**
@@ -1735,6 +1737,8 @@ case class OctetLength(child: Expression) extends 
UnaryExpression with ImplicitC
   case BinaryType => defineCodeGen(ctx, ev, c => s"($c).length")
 }
   }
+
+  override def prettyName: String = "octet_length"
 }
 
 /**

http://git-wip-us.apache.org/repos/asf/spark/blob/ee572ba8/sql/core/src/test/resources/sql-tests/results/operators.sql.out
--
diff --git a/sql/core/src/test/resources/sql-tests/results/operators.sql.out 
b/sql/core/src/test/resources/sql-tests/results/operators.sql.out
index 237b618..840655b 100644
--- a/sql/core/src/test/resources/sql-tests/results/operators.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/operators.sql.out
@@ -425,7 +425,7 @@ struct<(7 % 2):int,(7 % 0):int,(0 % 2):int,(7 % CAST(NULL 
AS INT)):int,(CAST(NUL
 -- !query 51
 select BIT_LENGTH('abc')
 -- !query 51 schema
-struct
+struct
 -- !query 51 output
 24
 
@@ -449,7 +449,7 @@ struct
 -- !query 54
 select OCTET_LENGTH('abc')
 -- !query 54 schema
-struct
+struct
 -- !query 54 output
 3
 

http://git-wip-us.apache.org/repos/asf/spark/blob/ee572ba8/sql/core/src/test/resources/sql-tests/results/subquery/scalar-subquery/scalar-subquery-predicate.sql.out
--
diff --git 

spark git commit: [SPARK-23047][PYTHON][SQL] Change MapVector to NullableMapVector in ArrowColumnVector

2018-01-17 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 79ccd0cad -> 6e509fde3


[SPARK-23047][PYTHON][SQL] Change MapVector to NullableMapVector in 
ArrowColumnVector

## What changes were proposed in this pull request?
This PR changes the usage of `MapVector` in the Spark codebase to
`NullableMapVector`.

`MapVector` is an internal Arrow class that is not supposed to be used 
directly. We should use `NullableMapVector` instead.

## How was this patch tested?

Existing test.

Author: Li Jin 

Closes #20239 from icexelloss/arrow-map-vector.

(cherry picked from commit 4e6f8fb150ae09c7d1de6beecb2b98e5afa5da19)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6e509fde
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6e509fde
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6e509fde

Branch: refs/heads/branch-2.3
Commit: 6e509fde3f056316f46c71b672a7d69adb1b4f8e
Parents: 79ccd0c
Author: Li Jin 
Authored: Thu Jan 18 07:26:43 2018 +0900
Committer: hyukjinkwon 
Committed: Thu Jan 18 07:26:57 2018 +0900

--
 .../spark/sql/vectorized/ArrowColumnVector.java | 13 +--
 .../vectorized/ArrowColumnVectorSuite.scala | 36 
 2 files changed, 46 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6e509fde/sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java
--
diff --git 
a/sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java 
b/sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java
index 7083332..eb69001 100644
--- 
a/sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java
+++ 
b/sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java
@@ -247,8 +247,8 @@ public final class ArrowColumnVector extends ColumnVector {
 
   childColumns = new ArrowColumnVector[1];
   childColumns[0] = new ArrowColumnVector(listVector.getDataVector());
-} else if (vector instanceof MapVector) {
-  MapVector mapVector = (MapVector) vector;
+} else if (vector instanceof NullableMapVector) {
+  NullableMapVector mapVector = (NullableMapVector) vector;
   accessor = new StructAccessor(mapVector);
 
   childColumns = new ArrowColumnVector[mapVector.size()];
@@ -553,9 +553,16 @@ public final class ArrowColumnVector extends ColumnVector {
 }
   }
 
+  /**
+   * Any call to "get" method will throw UnsupportedOperationException.
+   *
+   * Access struct values in a ArrowColumnVector doesn't use this accessor. 
Instead, it uses getStruct() method defined
+   * in the parent class. Any call to "get" method in this class is a bug in 
the code.
+   *
+   */
   private static class StructAccessor extends ArrowVectorAccessor {
 
-StructAccessor(MapVector vector) {
+StructAccessor(NullableMapVector vector) {
   super(vector);
 }
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/6e509fde/sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ArrowColumnVectorSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ArrowColumnVectorSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ArrowColumnVectorSuite.scala
index 7304803..5343266 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ArrowColumnVectorSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ArrowColumnVectorSuite.scala
@@ -322,6 +322,42 @@ class ArrowColumnVectorSuite extends SparkFunSuite {
 allocator.close()
   }
 
+  test("non nullable struct") {
+val allocator = ArrowUtils.rootAllocator.newChildAllocator("struct", 0, 
Long.MaxValue)
+val schema = new StructType().add("int", IntegerType).add("long", LongType)
+val vector = ArrowUtils.toArrowField("struct", schema, nullable = false, 
null)
+  .createVector(allocator).asInstanceOf[NullableMapVector]
+
+vector.allocateNew()
+val intVector = vector.getChildByOrdinal(0).asInstanceOf[IntVector]
+val longVector = vector.getChildByOrdinal(1).asInstanceOf[BigIntVector]
+
+vector.setIndexDefined(0)
+intVector.setSafe(0, 1)
+longVector.setSafe(0, 1L)
+
+vector.setIndexDefined(1)
+intVector.setSafe(1, 2)
+longVector.setNull(1)
+
+vector.setValueCount(2)
+
+val columnVector = new ArrowColumnVector(vector)
+assert(columnVector.dataType === schema)
+assert(columnVector.numNulls === 0)
+
+val row0 = columnVector.getStruct(0, 2)
+assert(row0.getInt(0) 

spark git commit: [SPARK-23047][PYTHON][SQL] Change MapVector to NullableMapVector in ArrowColumnVector

2018-01-17 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master e946c63dd -> 4e6f8fb15


[SPARK-23047][PYTHON][SQL] Change MapVector to NullableMapVector in 
ArrowColumnVector

## What changes were proposed in this pull request?
This PR changes the usage of `MapVector` in the Spark codebase to
`NullableMapVector`.

`MapVector` is an internal Arrow class that is not supposed to be used 
directly. We should use `NullableMapVector` instead.

## How was this patch tested?

Existing test.

Author: Li Jin 

Closes #20239 from icexelloss/arrow-map-vector.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4e6f8fb1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4e6f8fb1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4e6f8fb1

Branch: refs/heads/master
Commit: 4e6f8fb150ae09c7d1de6beecb2b98e5afa5da19
Parents: e946c63
Author: Li Jin 
Authored: Thu Jan 18 07:26:43 2018 +0900
Committer: hyukjinkwon 
Committed: Thu Jan 18 07:26:43 2018 +0900

--
 .../spark/sql/vectorized/ArrowColumnVector.java | 13 +--
 .../vectorized/ArrowColumnVectorSuite.scala | 36 
 2 files changed, 46 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/4e6f8fb1/sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java
--
diff --git 
a/sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java 
b/sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java
index 7083332..eb69001 100644
--- 
a/sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java
+++ 
b/sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java
@@ -247,8 +247,8 @@ public final class ArrowColumnVector extends ColumnVector {
 
   childColumns = new ArrowColumnVector[1];
   childColumns[0] = new ArrowColumnVector(listVector.getDataVector());
-} else if (vector instanceof MapVector) {
-  MapVector mapVector = (MapVector) vector;
+} else if (vector instanceof NullableMapVector) {
+  NullableMapVector mapVector = (NullableMapVector) vector;
   accessor = new StructAccessor(mapVector);
 
   childColumns = new ArrowColumnVector[mapVector.size()];
@@ -553,9 +553,16 @@ public final class ArrowColumnVector extends ColumnVector {
 }
   }
 
+  /**
+   * Any call to "get" method will throw UnsupportedOperationException.
+   *
+   * Access struct values in a ArrowColumnVector doesn't use this accessor. 
Instead, it uses getStruct() method defined
+   * in the parent class. Any call to "get" method in this class is a bug in 
the code.
+   *
+   */
   private static class StructAccessor extends ArrowVectorAccessor {
 
-StructAccessor(MapVector vector) {
+StructAccessor(NullableMapVector vector) {
   super(vector);
 }
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/4e6f8fb1/sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ArrowColumnVectorSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ArrowColumnVectorSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ArrowColumnVectorSuite.scala
index 7304803..5343266 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ArrowColumnVectorSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ArrowColumnVectorSuite.scala
@@ -322,6 +322,42 @@ class ArrowColumnVectorSuite extends SparkFunSuite {
 allocator.close()
   }
 
+  test("non nullable struct") {
+val allocator = ArrowUtils.rootAllocator.newChildAllocator("struct", 0, 
Long.MaxValue)
+val schema = new StructType().add("int", IntegerType).add("long", LongType)
+val vector = ArrowUtils.toArrowField("struct", schema, nullable = false, 
null)
+  .createVector(allocator).asInstanceOf[NullableMapVector]
+
+vector.allocateNew()
+val intVector = vector.getChildByOrdinal(0).asInstanceOf[IntVector]
+val longVector = vector.getChildByOrdinal(1).asInstanceOf[BigIntVector]
+
+vector.setIndexDefined(0)
+intVector.setSafe(0, 1)
+longVector.setSafe(0, 1L)
+
+vector.setIndexDefined(1)
+intVector.setSafe(1, 2)
+longVector.setNull(1)
+
+vector.setValueCount(2)
+
+val columnVector = new ArrowColumnVector(vector)
+assert(columnVector.dataType === schema)
+assert(columnVector.numNulls === 0)
+
+val row0 = columnVector.getStruct(0, 2)
+assert(row0.getInt(0) === 1)
+assert(row0.getLong(1) === 1L)
+
+val row1 = columnVector.getStruct(1, 2)
+assert(row1.getInt(0) === 2)
+   

spark git commit: [SPARK-23132][PYTHON][ML] Run doctests in ml.image when testing

2018-01-17 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 4e6f8fb15 -> 45ad97df8


[SPARK-23132][PYTHON][ML] Run doctests in ml.image when testing

## What changes were proposed in this pull request?

This PR proposes to actually run the doctests in `ml/image.py`.

## How was this patch tested?

doctests in `python/pyspark/ml/image.py`.

Author: hyukjinkwon 

Closes #20294 from HyukjinKwon/trigger-image.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/45ad97df
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/45ad97df
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/45ad97df

Branch: refs/heads/master
Commit: 45ad97df87c89cb94ce9564e5773897b6d9326f5
Parents: 4e6f8fb
Author: hyukjinkwon 
Authored: Thu Jan 18 07:30:54 2018 +0900
Committer: hyukjinkwon 
Committed: Thu Jan 18 07:30:54 2018 +0900

--
 python/pyspark/ml/image.py | 26 --
 1 file changed, 24 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/45ad97df/python/pyspark/ml/image.py
--
diff --git a/python/pyspark/ml/image.py b/python/pyspark/ml/image.py
index c9b8402..2d86c7f 100644
--- a/python/pyspark/ml/image.py
+++ b/python/pyspark/ml/image.py
@@ -194,9 +194,9 @@ class _ImageSchema(object):
 :return: a :class:`DataFrame` with a single column of "images",
see ImageSchema for details.
 
->>> df = ImageSchema.readImages('python/test_support/image/kittens', 
recursive=True)
+>>> df = ImageSchema.readImages('data/mllib/images/kittens', 
recursive=True)
 >>> df.count()
-4
+5
 
 .. versionadded:: 2.3.0
 """
@@ -216,3 +216,25 @@ ImageSchema = _ImageSchema()
 def _disallow_instance(_):
 raise RuntimeError("Creating instance of _ImageSchema class is 
disallowed.")
 _ImageSchema.__init__ = _disallow_instance
+
+
+def _test():
+import doctest
+import pyspark.ml.image
+globs = pyspark.ml.image.__dict__.copy()
+spark = SparkSession.builder\
+.master("local[2]")\
+.appName("ml.image tests")\
+.getOrCreate()
+globs['spark'] = spark
+
+(failure_count, test_count) = doctest.testmod(
+pyspark.ml.image, globs=globs,
+optionflags=doctest.ELLIPSIS | doctest.NORMALIZE_WHITESPACE)
+spark.stop()
+if failure_count:
+exit(-1)
+
+
+if __name__ == "__main__":
+_test()





spark git commit: [SPARK-23132][PYTHON][ML] Run doctests in ml.image when testing

2018-01-17 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 6e509fde3 -> b84c2a306


[SPARK-23132][PYTHON][ML] Run doctests in ml.image when testing

## What changes were proposed in this pull request?

This PR proposes to actually run the doctests in `ml/image.py`.

## How was this patch tested?

doctests in `python/pyspark/ml/image.py`.

Author: hyukjinkwon 

Closes #20294 from HyukjinKwon/trigger-image.

(cherry picked from commit 45ad97df87c89cb94ce9564e5773897b6d9326f5)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b84c2a30
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b84c2a30
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b84c2a30

Branch: refs/heads/branch-2.3
Commit: b84c2a30665ebbd65feb7418826501f6c959eb96
Parents: 6e509fd
Author: hyukjinkwon 
Authored: Thu Jan 18 07:30:54 2018 +0900
Committer: hyukjinkwon 
Committed: Thu Jan 18 07:31:10 2018 +0900

--
 python/pyspark/ml/image.py | 26 --
 1 file changed, 24 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b84c2a30/python/pyspark/ml/image.py
--
diff --git a/python/pyspark/ml/image.py b/python/pyspark/ml/image.py
index c9b8402..2d86c7f 100644
--- a/python/pyspark/ml/image.py
+++ b/python/pyspark/ml/image.py
@@ -194,9 +194,9 @@ class _ImageSchema(object):
 :return: a :class:`DataFrame` with a single column of "images",
see ImageSchema for details.
 
->>> df = ImageSchema.readImages('python/test_support/image/kittens', 
recursive=True)
+>>> df = ImageSchema.readImages('data/mllib/images/kittens', 
recursive=True)
 >>> df.count()
-4
+5
 
 .. versionadded:: 2.3.0
 """
@@ -216,3 +216,25 @@ ImageSchema = _ImageSchema()
 def _disallow_instance(_):
 raise RuntimeError("Creating instance of _ImageSchema class is 
disallowed.")
 _ImageSchema.__init__ = _disallow_instance
+
+
+def _test():
+import doctest
+import pyspark.ml.image
+globs = pyspark.ml.image.__dict__.copy()
+spark = SparkSession.builder\
+.master("local[2]")\
+.appName("ml.image tests")\
+.getOrCreate()
+globs['spark'] = spark
+
+(failure_count, test_count) = doctest.testmod(
+pyspark.ml.image, globs=globs,
+optionflags=doctest.ELLIPSIS | doctest.NORMALIZE_WHITESPACE)
+spark.stop()
+if failure_count:
+exit(-1)
+
+
+if __name__ == "__main__":
+_test()





spark git commit: [SPARK-23081][PYTHON] Add colRegex API to PySpark

2018-01-25 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 8532e26f3 -> 8480c0c57


[SPARK-23081][PYTHON] Add colRegex API to PySpark

## What changes were proposed in this pull request?

Add colRegex API to PySpark
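
A brief hedged sketch of the new API (it assumes a SparkSession named `spark`; the column names are illustrative):

```python
# Select every column whose name matches the backquoted regex.
df = spark.createDataFrame([("a", 1), ("b", 2)], ["Col1", "Col2"])
df.select(df.colRegex("`Col.*`")).show()
# +----+----+
# |Col1|Col2|
# +----+----+
# |   a|   1|
# |   b|   2|
# +----+----+
```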

## How was this patch tested?

add a test in sql/tests.py

Author: Huaxin Gao 

Closes #20390 from huaxingao/spark-23081.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8480c0c5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8480c0c5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8480c0c5

Branch: refs/heads/master
Commit: 8480c0c57698b7dcccec5483d67b17cf2c7527ed
Parents: 8532e26
Author: Huaxin Gao 
Authored: Fri Jan 26 07:50:48 2018 +0900
Committer: hyukjinkwon 
Committed: Fri Jan 26 07:50:48 2018 +0900

--
 python/pyspark/sql/dataframe.py | 23 
 .../scala/org/apache/spark/sql/Dataset.scala|  8 +++
 2 files changed, 27 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8480c0c5/python/pyspark/sql/dataframe.py
--
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 2d5e9b9..ac40308 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -819,6 +819,29 @@ class DataFrame(object):
 """
 return [f.name for f in self.schema.fields]
 
+@since(2.3)
+def colRegex(self, colName):
+"""
+Selects column based on the column name specified as a regex and 
returns it
+as :class:`Column`.
+
+:param colName: string, column name specified as a regex.
+
+>>> df = spark.createDataFrame([("a", 1), ("b", 2), ("c",  3)], 
["Col1", "Col2"])
+>>> df.select(df.colRegex("`(Col1)?+.+`")).show()
++----+
+|Col2|
++----+
+|   1|
+|   2|
+|   3|
++----+
+"""
+if not isinstance(colName, basestring):
+raise ValueError("colName should be provided as string")
+jc = self._jdf.colRegex(colName)
+return Column(jc)
+
 @ignore_unicode_prefix
 @since(1.3)
 def alias(self, alias):

http://git-wip-us.apache.org/repos/asf/spark/blob/8480c0c5/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index 912f411..edb6644 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -1194,7 +1194,7 @@ class Dataset[T] private[sql](
   def orderBy(sortExprs: Column*): Dataset[T] = sort(sortExprs : _*)
 
   /**
-   * Selects column based on the column name and return it as a [[Column]].
+   * Selects column based on the column name and returns it as a [[Column]].
*
* @note The column name can also reference to a nested column like `a.b`.
*
@@ -1220,7 +1220,7 @@ class Dataset[T] private[sql](
   }
 
   /**
-   * Selects column based on the column name and return it as a [[Column]].
+   * Selects column based on the column name and returns it as a [[Column]].
*
* @note The column name can also reference to a nested column like `a.b`.
*
@@ -1240,7 +1240,7 @@ class Dataset[T] private[sql](
   }
 
   /**
-   * Selects column based on the column name specified as a regex and return 
it as [[Column]].
+   * Selects column based on the column name specified as a regex and returns 
it as [[Column]].
* @group untypedrel
* @since 2.3.0
*/
@@ -2729,7 +2729,7 @@ class Dataset[T] private[sql](
   }
 
   /**
-   * Return an iterator that contains all rows in this Dataset.
+   * Returns an iterator that contains all rows in this Dataset.
*
* The iterator will consume as much memory as the largest partition in this 
Dataset.
*





spark git commit: [SPARK-23081][PYTHON] Add colRegex API to PySpark

2018-01-25 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 8866f9c24 -> 2f65c20ea


[SPARK-23081][PYTHON] Add colRegex API to PySpark

## What changes were proposed in this pull request?

Add colRegex API to PySpark

## How was this patch tested?

add a test in sql/tests.py

Author: Huaxin Gao 

Closes #20390 from huaxingao/spark-23081.

(cherry picked from commit 8480c0c57698b7dcccec5483d67b17cf2c7527ed)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2f65c20e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2f65c20e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2f65c20e

Branch: refs/heads/branch-2.3
Commit: 2f65c20ea74a87729eaf3c9b2aebcfb10c0ecf4b
Parents: 8866f9c
Author: Huaxin Gao 
Authored: Fri Jan 26 07:50:48 2018 +0900
Committer: hyukjinkwon 
Committed: Fri Jan 26 07:51:01 2018 +0900

--
 python/pyspark/sql/dataframe.py | 23 
 .../scala/org/apache/spark/sql/Dataset.scala|  8 +++
 2 files changed, 27 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/2f65c20e/python/pyspark/sql/dataframe.py
--
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 2d5e9b9..ac40308 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -819,6 +819,29 @@ class DataFrame(object):
 """
 return [f.name for f in self.schema.fields]
 
+@since(2.3)
+def colRegex(self, colName):
+"""
+Selects column based on the column name specified as a regex and 
returns it
+as :class:`Column`.
+
+:param colName: string, column name specified as a regex.
+
+>>> df = spark.createDataFrame([("a", 1), ("b", 2), ("c",  3)], 
["Col1", "Col2"])
+>>> df.select(df.colRegex("`(Col1)?+.+`")).show()
++----+
+|Col2|
++----+
+|   1|
+|   2|
+|   3|
++----+
+"""
+if not isinstance(colName, basestring):
+raise ValueError("colName should be provided as string")
+jc = self._jdf.colRegex(colName)
+return Column(jc)
+
 @ignore_unicode_prefix
 @since(1.3)
 def alias(self, alias):

http://git-wip-us.apache.org/repos/asf/spark/blob/2f65c20e/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index 912f411..edb6644 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -1194,7 +1194,7 @@ class Dataset[T] private[sql](
   def orderBy(sortExprs: Column*): Dataset[T] = sort(sortExprs : _*)
 
   /**
-   * Selects column based on the column name and return it as a [[Column]].
+   * Selects column based on the column name and returns it as a [[Column]].
*
* @note The column name can also reference to a nested column like `a.b`.
*
@@ -1220,7 +1220,7 @@ class Dataset[T] private[sql](
   }
 
   /**
-   * Selects column based on the column name and return it as a [[Column]].
+   * Selects column based on the column name and returns it as a [[Column]].
*
* @note The column name can also reference to a nested column like `a.b`.
*
@@ -1240,7 +1240,7 @@ class Dataset[T] private[sql](
   }
 
   /**
-   * Selects column based on the column name specified as a regex and return 
it as [[Column]].
+   * Selects column based on the column name specified as a regex and returns 
it as [[Column]].
* @group untypedrel
* @since 2.3.0
*/
@@ -2729,7 +2729,7 @@ class Dataset[T] private[sql](
   }
 
   /**
-   * Return an iterator that contains all rows in this Dataset.
+   * Returns an iterator that contains all rows in this Dataset.
*
* The iterator will consume as much memory as the largest partition in this 
Dataset.
*





spark git commit: [SPARK-23009][PYTHON] Fix for non-str col names to createDataFrame from Pandas

2018-01-09 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 7bcc26668 -> e59983724


[SPARK-23009][PYTHON] Fix for non-str col names to createDataFrame from Pandas

## What changes were proposed in this pull request?

This fixes the case of calling `SparkSession.createDataFrame` with a Pandas
DataFrame that has non-str column labels.

The column-name conversion logic to handle non-string or unicode labels in
Python 2 is:
```
if column is not any type of string:
    name = str(column)
else if column is unicode in Python 2:
    name = column.encode('utf-8')
```
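
A runnable Python 2 rendering of that logic (the helper name is hypothetical; `basestring` and `unicode` do not exist in Python 3):

```python
def _column_label_to_name(column):
    # Non-string labels (e.g. ints from a default pandas RangeIndex) become str.
    if not isinstance(column, basestring):
        return str(column)
    # Unicode labels are encoded to UTF-8 bytes in Python 2.
    if isinstance(column, unicode):
        return column.encode('utf-8')
    return column

print([_column_label_to_name(c) for c in [0, 1, u'name']])  # ['0', '1', 'name']
```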

## How was this patch tested?

Added a new test with a Pandas DataFrame that has int column labels

Author: Bryan Cutler 

Closes #20210 from BryanCutler/python-createDataFrame-int-col-error-SPARK-23009.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e5998372
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e5998372
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e5998372

Branch: refs/heads/master
Commit: e5998372487af20114e160264a594957344ff433
Parents: 7bcc266
Author: Bryan Cutler 
Authored: Wed Jan 10 14:55:24 2018 +0900
Committer: hyukjinkwon 
Committed: Wed Jan 10 14:55:24 2018 +0900

--
 python/pyspark/sql/session.py | 4 +++-
 python/pyspark/sql/tests.py   | 9 +
 2 files changed, 12 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e5998372/python/pyspark/sql/session.py
--
diff --git a/python/pyspark/sql/session.py b/python/pyspark/sql/session.py
index 3e45747..604021c 100644
--- a/python/pyspark/sql/session.py
+++ b/python/pyspark/sql/session.py
@@ -648,7 +648,9 @@ class SparkSession(object):
 
 # If no schema supplied by user then get the names of columns only
 if schema is None:
-schema = [x.encode('utf-8') if not isinstance(x, str) else x 
for x in data.columns]
+schema = [str(x) if not isinstance(x, basestring) else
+  (x.encode('utf-8') if not isinstance(x, str) else x)
+  for x in data.columns]
 
 if self.conf.get("spark.sql.execution.arrow.enabled", 
"false").lower() == "true" \
 and len(data) > 0:

http://git-wip-us.apache.org/repos/asf/spark/blob/e5998372/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 13576ff..80a94a9 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -3532,6 +3532,15 @@ class ArrowTests(ReusedSQLTestCase):
 self.assertTrue(expected[r][e] == result_arrow[r][e] and
 result[r][e] == result_arrow[r][e])
 
+def test_createDataFrame_with_int_col_names(self):
+import numpy as np
+import pandas as pd
+pdf = pd.DataFrame(np.random.rand(4, 2))
+df, df_arrow = self._createDataFrame_toggle(pdf)
+pdf_col_names = [str(c) for c in pdf.columns]
+self.assertEqual(pdf_col_names, df.columns)
+self.assertEqual(pdf_col_names, df_arrow.columns)
+
 
 @unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not 
installed")
 class PandasUDFTests(ReusedSQLTestCase):





spark git commit: [SPARK-22980][PYTHON][SQL] Clarify the length of each series is of each batch within scalar Pandas UDF

2018-01-12 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 55dbfbca3 -> cd9f49a2a


[SPARK-22980][PYTHON][SQL] Clarify the length of each series is of each batch 
within scalar Pandas UDF

## What changes were proposed in this pull request?

This PR proposes to add a note saying that the length of a scalar Pandas UDF's
`Series` is not that of the whole input column but of the batch.

This is fine for a group map UDF because its usage differs from our typical
UDFs, but scalar UDFs might be confused with normal UDFs.

For example, please consider this example:

```python
from pyspark.sql.functions import pandas_udf, col, lit

df = spark.range(1)
f = pandas_udf(lambda x, y: len(x) + y, LongType())
df.select(f(lit('text'), col('id'))).show()
```

```
+------------------+
|<lambda>(text, id)|
+------------------+
|                 1|
+------------------+
```

```python
from pyspark.sql.functions import udf, col, lit

df = spark.range(1)
f = udf(lambda x, y: len(x) + y, "long")
df.select(f(lit('text'), col('id'))).show()
```

```
+------------------+
|<lambda>(text, id)|
+------------------+
|                 4|
+------------------+
```

## How was this patch tested?

Manually built the doc and checked the output.

Author: hyukjinkwon 

Closes #20237 from HyukjinKwon/SPARK-22980.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cd9f49a2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cd9f49a2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cd9f49a2

Branch: refs/heads/master
Commit: cd9f49a2aed3799964976ead06080a0f7044a0c3
Parents: 55dbfbc
Author: hyukjinkwon 
Authored: Sat Jan 13 16:13:44 2018 +0900
Committer: hyukjinkwon 
Committed: Sat Jan 13 16:13:44 2018 +0900

--
 python/pyspark/sql/functions.py | 5 +
 1 file changed, 5 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/cd9f49a2/python/pyspark/sql/functions.py
--
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 733e32b..e1ad659 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -2184,6 +2184,11 @@ def pandas_udf(f=None, returnType=None, 
functionType=None):
| 8|  JOHN DOE|  22|
+--+--++
 
+   .. note:: The length of `pandas.Series` within a scalar UDF is not that 
of the whole input
+   column, but is the length of an internal batch used for each call 
to the function.
+   Therefore, this can be used, for example, to ensure the length of 
each returned
+   `pandas.Series`, and can not be used as the column length.
+
 2. GROUP_MAP
 
A group map UDF defines transformation: A `pandas.DataFrame` -> A 
`pandas.DataFrame`





spark git commit: [SPARK-22980][PYTHON][SQL] Clarify the length of each series is of each batch within scalar Pandas UDF

2018-01-12 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 60bcb4685 -> ca27d9cb5


[SPARK-22980][PYTHON][SQL] Clarify the length of each series is of each batch 
within scalar Pandas UDF

## What changes were proposed in this pull request?

This PR proposes to add a note saying that the length of a scalar Pandas UDF's
`Series` is not that of the whole input column but of the batch.

This is fine for a group map UDF because its usage differs from our typical
UDFs, but scalar UDFs might be confused with normal UDFs.

For example, please consider this example:

```python
from pyspark.sql.functions import pandas_udf, col, lit

df = spark.range(1)
f = pandas_udf(lambda x, y: len(x) + y, LongType())
df.select(f(lit('text'), col('id'))).show()
```

```
+------------------+
|<lambda>(text, id)|
+------------------+
|                 1|
+------------------+
```

```python
from pyspark.sql.functions import udf, col, lit

df = spark.range(1)
f = udf(lambda x, y: len(x) + y, "long")
df.select(f(lit('text'), col('id'))).show()
```

```
+------------------+
|<lambda>(text, id)|
+------------------+
|                 4|
+------------------+
```

## How was this patch tested?

Manually built the doc and checked the output.

Author: hyukjinkwon 

Closes #20237 from HyukjinKwon/SPARK-22980.

(cherry picked from commit cd9f49a2aed3799964976ead06080a0f7044a0c3)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ca27d9cb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ca27d9cb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ca27d9cb

Branch: refs/heads/branch-2.3
Commit: ca27d9cb5e30b6a50a4c8b7d10ac28f4f51d44ee
Parents: 60bcb46
Author: hyukjinkwon 
Authored: Sat Jan 13 16:13:44 2018 +0900
Committer: hyukjinkwon 
Committed: Sat Jan 13 16:13:57 2018 +0900

--
 python/pyspark/sql/functions.py | 5 +
 1 file changed, 5 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ca27d9cb/python/pyspark/sql/functions.py
--
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 733e32b..e1ad659 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -2184,6 +2184,11 @@ def pandas_udf(f=None, returnType=None, 
functionType=None):
| 8|  JOHN DOE|  22|
+--+--++
 
+   .. note:: The length of `pandas.Series` within a scalar UDF is not that 
of the whole input
+   column, but is the length of an internal batch used for each call 
to the function.
+   Therefore, this can be used, for example, to ensure the length of 
each returned
+   `pandas.Series`, and can not be used as the column length.
+
 2. GROUP_MAP
 
A group map UDF defines transformation: A `pandas.DataFrame` -> A 
`pandas.DataFrame`





spark git commit: [SPARK-23261][PYSPARK] Rename Pandas UDFs

2018-01-30 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 0a9ac0248 -> 7a2ada223


[SPARK-23261][PYSPARK] Rename Pandas UDFs

## What changes were proposed in this pull request?
Rename the public APIs and names of Pandas UDFs; a short usage sketch with the
new names follows the list below.

- `PANDAS SCALAR UDF` -> `SCALAR PANDAS UDF`
- `PANDAS GROUP MAP UDF` -> `GROUPED MAP PANDAS UDF`
- `PANDAS GROUP AGG UDF` -> `GROUPED AGG PANDAS UDF`
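
A short sketch using the renamed grouped-map type (hedged; it assumes a SparkSession named `spark` with pandas and PyArrow available, and the sample data is illustrative):

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)  # formerly GROUP_MAP
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame holding one group
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").apply(subtract_mean).show()
```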

## How was this patch tested?
The existing tests

Author: gatorsmile 

Closes #20428 from gatorsmile/renamePandasUDFs.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7a2ada22
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7a2ada22
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7a2ada22

Branch: refs/heads/master
Commit: 7a2ada223e14d09271a76091be0338b2d375081e
Parents: 0a9ac02
Author: gatorsmile 
Authored: Tue Jan 30 21:55:55 2018 +0900
Committer: hyukjinkwon 
Committed: Tue Jan 30 21:55:55 2018 +0900

--
 .../apache/spark/api/python/PythonRunner.scala  | 12 +--
 docs/sql-programming-guide.md   |  8 +-
 examples/src/main/python/sql/arrow.py   | 12 +--
 python/pyspark/rdd.py   |  6 +-
 python/pyspark/sql/functions.py | 34 
 python/pyspark/sql/group.py | 10 +--
 python/pyspark/sql/tests.py | 92 ++--
 python/pyspark/sql/udf.py   | 25 +++---
 python/pyspark/worker.py| 24 ++---
 .../sql/catalyst/expressions/PythonUDF.scala|  4 +-
 .../spark/sql/catalyst/planning/patterns.scala  |  1 -
 .../spark/sql/RelationalGroupedDataset.scala|  4 +-
 .../python/AggregateInPandasExec.scala  |  2 +-
 .../execution/python/ArrowEvalPythonExec.scala  |  2 +-
 .../execution/python/ExtractPythonUDFs.scala|  2 +-
 .../python/FlatMapGroupsInPandasExec.scala  |  2 +-
 16 files changed, 120 insertions(+), 120 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7a2ada22/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala
--
diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala 
b/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala
index 29148a7..f075a7e 100644
--- a/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala
+++ b/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala
@@ -37,16 +37,16 @@ private[spark] object PythonEvalType {
 
   val SQL_BATCHED_UDF = 100
 
-  val SQL_PANDAS_SCALAR_UDF = 200
-  val SQL_PANDAS_GROUP_MAP_UDF = 201
-  val SQL_PANDAS_GROUP_AGG_UDF = 202
+  val SQL_SCALAR_PANDAS_UDF = 200
+  val SQL_GROUPED_MAP_PANDAS_UDF = 201
+  val SQL_GROUPED_AGG_PANDAS_UDF = 202
 
   def toString(pythonEvalType: Int): String = pythonEvalType match {
 case NON_UDF => "NON_UDF"
 case SQL_BATCHED_UDF => "SQL_BATCHED_UDF"
-case SQL_PANDAS_SCALAR_UDF => "SQL_PANDAS_SCALAR_UDF"
-case SQL_PANDAS_GROUP_MAP_UDF => "SQL_PANDAS_GROUP_MAP_UDF"
-case SQL_PANDAS_GROUP_AGG_UDF => "SQL_PANDAS_GROUP_AGG_UDF"
+case SQL_SCALAR_PANDAS_UDF => "SQL_SCALAR_PANDAS_UDF"
+case SQL_GROUPED_MAP_PANDAS_UDF => "SQL_GROUPED_MAP_PANDAS_UDF"
+case SQL_GROUPED_AGG_PANDAS_UDF => "SQL_GROUPED_AGG_PANDAS_UDF"
   }
 }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/7a2ada22/docs/sql-programming-guide.md
--
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index d49c8d8..a0e221b 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1684,7 +1684,7 @@ Spark will fall back to create the DataFrame without 
Arrow.
 Pandas UDFs are user defined functions that are executed by Spark using Arrow 
to transfer data and
 Pandas to work with the data. A Pandas UDF is defined using the keyword 
`pandas_udf` as a decorator
 or to wrap the function, no additional configuration is required. Currently, 
there are two types of
-Pandas UDF: Scalar and Group Map.
+Pandas UDF: Scalar and Grouped Map.
 
 ### Scalar
 
@@ -1702,8 +1702,8 @@ The following example shows how to create a scalar Pandas 
UDF that computes the
 
 
 
-### Group Map
-Group map Pandas UDFs are used with `groupBy().apply()` which implements the 
"split-apply-combine" pattern.
+### Grouped Map
+Grouped map Pandas UDFs are used with `groupBy().apply()` which implements the 
"split-apply-combine" pattern.
 Split-apply-combine consists of three steps:
 * Split the data into groups by using `DataFrame.groupBy`.
 * Apply a function on each group. The input and output of the function are 
both `pandas.DataFrame`. The
@@ -1723,7 +1723,7 @@ The following 

spark git commit: [SPARK-23174][BUILD][PYTHON][FOLLOWUP] Add pycodestyle*.py to .gitignore file.

2018-01-30 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 84bcf9dc8 -> a23187f53


[SPARK-23174][BUILD][PYTHON][FOLLOWUP] Add pycodestyle*.py to .gitignore file.

## What changes were proposed in this pull request?

This is a follow-up PR of #20338, which changed the downloaded file name of the
Python code style checker. The new name is not listed in the .gitignore file, so
the file remains untracked by git after running the checker. This PR adds the
file name to the .gitignore file.

## How was this patch tested?

Tested manually.

Author: Takuya UESHIN 

Closes #20432 from ueshin/issues/SPARK-23174/fup1.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a23187f5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a23187f5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a23187f5

Branch: refs/heads/master
Commit: a23187f53037425c61f1180b5e7990a116f86a42
Parents: 84bcf9d
Author: Takuya UESHIN 
Authored: Wed Jan 31 00:51:00 2018 +0900
Committer: hyukjinkwon 
Committed: Wed Jan 31 00:51:00 2018 +0900

--
 dev/.gitignore | 1 +
 1 file changed, 1 insertion(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a23187f5/dev/.gitignore
--
diff --git a/dev/.gitignore b/dev/.gitignore
index 4a60274..c673922 100644
--- a/dev/.gitignore
+++ b/dev/.gitignore
@@ -1 +1,2 @@
 pep8*.py
+pycodestyle*.py





spark git commit: [MINOR] Fix typos in dev/* scripts.

2018-01-30 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 58fcb5a95 -> 9623a9824


[MINOR] Fix typos in dev/* scripts.

## What changes were proposed in this pull request?

Consistency in style, grammar and removal of extraneous characters.

## How was this patch tested?

Manually as this is a doc change.

Author: Shashwat Anand 

Closes #20436 from ashashwat/SPARK-23174.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9623a982
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9623a982
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9623a982

Branch: refs/heads/master
Commit: 9623a98248837da302ba4ec240335d1c4268ee21
Parents: 58fcb5a
Author: Shashwat Anand 
Authored: Wed Jan 31 07:37:25 2018 +0900
Committer: hyukjinkwon 
Committed: Wed Jan 31 07:37:25 2018 +0900

--
 dev/appveyor-guide.md|  6 +++---
 dev/lint-python  | 12 ++--
 dev/run-pip-tests|  4 ++--
 dev/run-tests-jenkins|  2 +-
 dev/sparktestsupport/modules.py  |  8 
 dev/sparktestsupport/toposort.py |  6 +++---
 dev/tests/pr_merge_ability.sh|  4 ++--
 dev/tests/pr_public_classes.sh   |  4 ++--
 8 files changed, 23 insertions(+), 23 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9623a982/dev/appveyor-guide.md
--
diff --git a/dev/appveyor-guide.md b/dev/appveyor-guide.md
index d2e00b4..a842f39 100644
--- a/dev/appveyor-guide.md
+++ b/dev/appveyor-guide.md
@@ -1,6 +1,6 @@
 # AppVeyor Guides
 
-Currently, SparkR on Windows is being tested with 
[AppVeyor](https://ci.appveyor.com). This page describes how to set up AppVeyor 
with Spark, how to run the build, check the status and stop the build via this 
tool. There is the documenation for AppVeyor 
[here](https://www.appveyor.com/docs). Please refer this for full details.
+Currently, SparkR on Windows is being tested with 
[AppVeyor](https://ci.appveyor.com). This page describes how to set up AppVeyor 
with Spark, how to run the build, check the status and stop the build via this 
tool. There is the documentation for AppVeyor 
[here](https://www.appveyor.com/docs). Please refer this for full details.
 
 
 ### Setting up AppVeyor
@@ -45,7 +45,7 @@ Currently, SparkR on Windows is being tested with 
[AppVeyor](https://ci.appveyor
   
   https://cloud.githubusercontent.com/assets/6477701/18075026/3ee57bc6-6eac-11e6-826e-5dd09aeb0e7c.png;>
 
-- Since we will use Github here, click the "GITHUB" button and then click 
"Authorize Github" so that AppVeyor can access to the Github logs (e.g. 
commits).
+- Since we will use Github here, click the "GITHUB" button and then click 
"Authorize Github" so that AppVeyor can access the Github logs (e.g. commits).
 
   https://cloud.githubusercontent.com/assets/6477701/18228819/9a4d5722-7299-11e6-900c-c5ff6b0450b1.png;>
 
@@ -87,7 +87,7 @@ Currently, SparkR on Windows is being tested with 
[AppVeyor](https://ci.appveyor
 
   https://cloud.githubusercontent.com/assets/6477701/18075336/de618b52-6eae-11e6-8f01-e4ce48963087.png;>
 
-- If the build is running, "CANCEL BUILD" buttom appears. Click this button 
top cancel the current build.
+- If the build is running, "CANCEL BUILD" button appears. Click this button to 
cancel the current build.
 
   https://cloud.githubusercontent.com/assets/6477701/18075806/4de68564-6eb3-11e6-855b-ee22918767f9.png;>
 

http://git-wip-us.apache.org/repos/asf/spark/blob/9623a982/dev/lint-python
--
diff --git a/dev/lint-python b/dev/lint-python
index e069caf..f738af9 100755
--- a/dev/lint-python
+++ b/dev/lint-python
@@ -34,8 +34,8 @@ python -B -m compileall -q -l $PATHS_TO_CHECK > 
"$PYCODESTYLE_REPORT_PATH"
 compile_status="${PIPESTATUS[0]}"
 
 # Get pycodestyle at runtime so that we don't rely on it being installed on 
the build server.
-#+ See: https://github.com/apache/spark/pull/1744#issuecomment-50982162
-# Updated to latest official version for pep8. pep8 is formally renamed to 
pycodestyle.
+# See: https://github.com/apache/spark/pull/1744#issuecomment-50982162
+# Updated to the latest official version of pep8. pep8 is formally renamed to 
pycodestyle.
 PYCODESTYLE_VERSION="2.3.1"
 
PYCODESTYLE_SCRIPT_PATH="$SPARK_ROOT_DIR/dev/pycodestyle-$PYCODESTYLE_VERSION.py"
 
PYCODESTYLE_SCRIPT_REMOTE_PATH="https://raw.githubusercontent.com/PyCQA/pycodestyle/$PYCODESTYLE_VERSION/pycodestyle.py;
@@ -60,9 +60,9 @@ export "PYLINT_HOME=$PYTHONPATH"
 export "PATH=$PYTHONPATH:$PATH"
 
 # There is no need to write this output to a file
-#+ first, but we do so so that the check status can
-#+ be output before the report, like with the
-#+ scalastyle and 

spark git commit: [SPARK-23238][SQL] Externalize SQLConf configurations exposed in documentation

2018-01-29 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 49b0207dc -> 39d2c6b03


[SPARK-23238][SQL] Externalize SQLConf configurations exposed in documentation

## What changes were proposed in this pull request?

This PR proposes to expose a few internal configurations found in the
documentation.

It also fixes the description for `spark.sql.execution.arrow.enabled`; the change
is self-explanatory.
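
A minimal usage sketch of the now-externalized flag (not part of this commit; the
example data is made up, and pandas plus pyarrow are assumed to be installed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# No longer an internal config; it still defaults to false.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
pdf = df.toPandas()  # the columnar transfer goes through Arrow for supported types
print(pdf)
```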

## How was this patch tested?

N/A

Author: hyukjinkwon 

Closes #20403 from HyukjinKwon/minor-doc-arrow.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/39d2c6b0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/39d2c6b0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/39d2c6b0

Branch: refs/heads/master
Commit: 39d2c6b03488895a0acb1dd3c46329db00fdd357
Parents: 49b0207
Author: hyukjinkwon 
Authored: Mon Jan 29 21:09:05 2018 +0900
Committer: hyukjinkwon 
Committed: Mon Jan 29 21:09:05 2018 +0900

--
 .../scala/org/apache/spark/sql/internal/SQLConf.scala   | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/39d2c6b0/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 2c70b00..61ea03d 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -123,14 +123,12 @@ object SQLConf {
   .createWithDefault(10)
 
   val COMPRESS_CACHED = 
buildConf("spark.sql.inMemoryColumnarStorage.compressed")
-.internal()
 .doc("When set to true Spark SQL will automatically select a compression 
codec for each " +
   "column based on statistics of the data.")
 .booleanConf
 .createWithDefault(true)
 
   val COLUMN_BATCH_SIZE = 
buildConf("spark.sql.inMemoryColumnarStorage.batchSize")
-.internal()
 .doc("Controls the size of batches for columnar caching.  Larger batch 
sizes can improve " +
   "memory utilization and compression, but risk OOMs when caching data.")
 .intConf
@@ -1043,11 +1041,11 @@ object SQLConf {
 
   val ARROW_EXECUTION_ENABLE =
 buildConf("spark.sql.execution.arrow.enabled")
-  .internal()
-  .doc("Make use of Apache Arrow for columnar data transfers. Currently 
available " +
-"for use with pyspark.sql.DataFrame.toPandas with the following data 
types: " +
-"StringType, BinaryType, BooleanType, DoubleType, FloatType, ByteType, 
IntegerType, " +
-"LongType, ShortType")
+  .doc("When true, make use of Apache Arrow for columnar data transfers. 
Currently available " +
+"for use with pyspark.sql.DataFrame.toPandas, and " +
+"pyspark.sql.SparkSession.createDataFrame when its input is a Pandas 
DataFrame. " +
+"The following data types are unsupported: " +
+"MapType, ArrayType of TimestampType, and nested StructType.")
   .booleanConf
   .createWithDefault(false)
 





spark git commit: [SPARK-23238][SQL] Externalize SQLConf configurations exposed in documentation

2018-01-29 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 5dda5db12 -> 8229e155d


[SPARK-23238][SQL] Externalize SQLConf configurations exposed in documentation

## What changes were proposed in this pull request?

This PR proposes to expose a few internal configurations found in the
documentation.

It also fixes the description for `spark.sql.execution.arrow.enabled`; the change
is self-explanatory.

## How was this patch tested?

N/A

Author: hyukjinkwon 

Closes #20403 from HyukjinKwon/minor-doc-arrow.

(cherry picked from commit 39d2c6b03488895a0acb1dd3c46329db00fdd357)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8229e155
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8229e155
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8229e155

Branch: refs/heads/branch-2.3
Commit: 8229e155d84cf02479c5dd0df6d577aff5075c00
Parents: 5dda5db
Author: hyukjinkwon 
Authored: Mon Jan 29 21:09:05 2018 +0900
Committer: hyukjinkwon 
Committed: Mon Jan 29 21:10:21 2018 +0900

--
 .../scala/org/apache/spark/sql/internal/SQLConf.scala   | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8229e155/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 2c70b00..61ea03d 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -123,14 +123,12 @@ object SQLConf {
   .createWithDefault(10)
 
   val COMPRESS_CACHED = 
buildConf("spark.sql.inMemoryColumnarStorage.compressed")
-.internal()
 .doc("When set to true Spark SQL will automatically select a compression 
codec for each " +
   "column based on statistics of the data.")
 .booleanConf
 .createWithDefault(true)
 
   val COLUMN_BATCH_SIZE = 
buildConf("spark.sql.inMemoryColumnarStorage.batchSize")
-.internal()
 .doc("Controls the size of batches for columnar caching.  Larger batch 
sizes can improve " +
   "memory utilization and compression, but risk OOMs when caching data.")
 .intConf
@@ -1043,11 +1041,11 @@ object SQLConf {
 
   val ARROW_EXECUTION_ENABLE =
 buildConf("spark.sql.execution.arrow.enabled")
-  .internal()
-  .doc("Make use of Apache Arrow for columnar data transfers. Currently 
available " +
-"for use with pyspark.sql.DataFrame.toPandas with the following data 
types: " +
-"StringType, BinaryType, BooleanType, DoubleType, FloatType, ByteType, 
IntegerType, " +
-"LongType, ShortType")
+  .doc("When true, make use of Apache Arrow for columnar data transfers. 
Currently available " +
+"for use with pyspark.sql.DataFrame.toPandas, and " +
+"pyspark.sql.SparkSession.createDataFrame when its input is a Pandas 
DataFrame. " +
+"The following data types are unsupported: " +
+"MapType, ArrayType of TimestampType, and nested StructType.")
   .booleanConf
   .createWithDefault(false)
 





spark git commit: [SPARK-23248][PYTHON][EXAMPLES] Relocate module docstrings to the top in PySpark examples

2018-01-27 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 3b6fc286d -> 8ff0cc48b


[SPARK-23248][PYTHON][EXAMPLES] Relocate module docstrings to the top in 
PySpark examples

## What changes were proposed in this pull request?

This PR proposes to relocate the module docstrings in the PySpark example scripts
to the top of each file; they appear to have been misplaced by mistake. For
example, the code below

```python
>>> help(aft_survival_regression)
```

shows the module docstrings for examples as below:

**Before**

```
Help on module aft_survival_regression:

NAME
aft_survival_regression

...

DESCRIPTION
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

...

(END)
```

**After**

```
Help on module aft_survival_regression:

NAME
aft_survival_regression

...

DESCRIPTION
An example demonstrating aft survival regression.
Run with:
  bin/spark-submit examples/src/main/python/ml/aft_survival_regression.py

(END)
```
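
After the relocation, the docstring is the first statement in each example module,
above the imports, so `help()` picks it up. A minimal sketch of the resulting
layout (license header omitted; the app name and body are illustrative only):

```python
"""
An example demonstrating aft survival regression.
Run with:
  bin/spark-submit examples/src/main/python/ml/aft_survival_regression.py
"""
from __future__ import print_function

from pyspark.sql import SparkSession

if __name__ == "__main__":
    # The docstring above is now the module docstring seen by help().
    spark = SparkSession.builder.appName("AFTSurvivalRegressionExample").getOrCreate()
    # ... example body elided ...
    spark.stop()
```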

## How was this patch tested?

Manually checked.

Author: hyukjinkwon 

Closes #20416 from HyukjinKwon/module-docstring-example.

(cherry picked from commit b8c32dc57368e49baaacf660b7e8836eedab2df7)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8ff0cc48
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8ff0cc48
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8ff0cc48

Branch: refs/heads/branch-2.3
Commit: 8ff0cc48b1b45ed41914822ffaaf8de8dff87b72
Parents: 3b6fc28
Author: hyukjinkwon 
Authored: Sun Jan 28 10:33:06 2018 +0900
Committer: hyukjinkwon 
Committed: Sun Jan 28 10:33:24 2018 +0900

--
 examples/src/main/python/avro_inputformat.py | 14 +++---
 .../src/main/python/ml/aft_survival_regression.py| 11 +--
 .../src/main/python/ml/bisecting_k_means_example.py  | 11 +--
 .../ml/bucketed_random_projection_lsh_example.py | 12 +---
 .../src/main/python/ml/chi_square_test_example.py| 10 +-
 examples/src/main/python/ml/correlation_example.py   | 10 +-
 examples/src/main/python/ml/cross_validator.py   | 15 +++
 examples/src/main/python/ml/fpgrowth_example.py  |  9 -
 .../src/main/python/ml/gaussian_mixture_example.py   | 11 +--
 .../ml/generalized_linear_regression_example.py  | 11 +--
 examples/src/main/python/ml/imputer_example.py   |  9 -
 .../main/python/ml/isotonic_regression_example.py|  9 +++--
 examples/src/main/python/ml/kmeans_example.py| 15 +++
 examples/src/main/python/ml/lda_example.py   | 12 +---
 .../python/ml/logistic_regression_summary_example.py | 11 +--
 examples/src/main/python/ml/min_hash_lsh_example.py  | 12 +---
 examples/src/main/python/ml/one_vs_rest_example.py   | 13 ++---
 .../src/main/python/ml/train_validation_split.py | 13 ++---
 examples/src/main/python/parquet_inputformat.py  | 12 ++--
 examples/src/main/python/sql/basic.py| 11 +--
 examples/src/main/python/sql/datasource.py   | 11 +--
 examples/src/main/python/sql/hive.py | 11 +--
 22 files changed, 115 insertions(+), 138 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8ff0cc48/examples/src/main/python/avro_inputformat.py
--
diff --git a/examples/src/main/python/avro_inputformat.py 
b/examples/src/main/python/avro_inputformat.py
index 4422f9e..6286ba6 100644
--- a/examples/src/main/python/avro_inputformat.py
+++ b/examples/src/main/python/avro_inputformat.py
@@ -15,13 +15,6 @@
 # limitations under the License.
 #
 
-from __future__ import print_function
-
-import sys
-
-from functools import reduce
-from pyspark.sql import SparkSession
-
 """
 Read data file users.avro in local Spark distro:
 
@@ -50,6 +43,13 @@ $ ./bin/spark-submit --driver-class-path 

spark git commit: [SPARK-23248][PYTHON][EXAMPLES] Relocate module docstrings to the top in PySpark examples

2018-01-27 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 3227d14fe -> b8c32dc57


[SPARK-23248][PYTHON][EXAMPLES] Relocate module docstrings to the top in 
PySpark examples

## What changes were proposed in this pull request?

This PR proposes to relocate the module docstrings in the PySpark example scripts
to the top of each file; they appear to have been misplaced by mistake. For
example, the code below

```python
>>> help(aft_survival_regression)
```

shows the module docstrings for examples as below:

**Before**

```
Help on module aft_survival_regression:

NAME
aft_survival_regression

...

DESCRIPTION
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

...

(END)
```

**After**

```
Help on module aft_survival_regression:

NAME
aft_survival_regression

...

DESCRIPTION
An example demonstrating aft survival regression.
Run with:
  bin/spark-submit examples/src/main/python/ml/aft_survival_regression.py

(END)
```

## How was this patch tested?

Manually checked.

Author: hyukjinkwon 

Closes #20416 from HyukjinKwon/module-docstring-example.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b8c32dc5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b8c32dc5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b8c32dc5

Branch: refs/heads/master
Commit: b8c32dc57368e49baaacf660b7e8836eedab2df7
Parents: 3227d14
Author: hyukjinkwon 
Authored: Sun Jan 28 10:33:06 2018 +0900
Committer: hyukjinkwon 
Committed: Sun Jan 28 10:33:06 2018 +0900

--
 examples/src/main/python/avro_inputformat.py | 14 +++---
 .../src/main/python/ml/aft_survival_regression.py| 11 +--
 .../src/main/python/ml/bisecting_k_means_example.py  | 11 +--
 .../ml/bucketed_random_projection_lsh_example.py | 12 +---
 .../src/main/python/ml/chi_square_test_example.py| 10 +-
 examples/src/main/python/ml/correlation_example.py   | 10 +-
 examples/src/main/python/ml/cross_validator.py   | 15 +++
 examples/src/main/python/ml/fpgrowth_example.py  |  9 -
 .../src/main/python/ml/gaussian_mixture_example.py   | 11 +--
 .../ml/generalized_linear_regression_example.py  | 11 +--
 examples/src/main/python/ml/imputer_example.py   |  9 -
 .../main/python/ml/isotonic_regression_example.py|  9 +++--
 examples/src/main/python/ml/kmeans_example.py| 15 +++
 examples/src/main/python/ml/lda_example.py   | 12 +---
 .../python/ml/logistic_regression_summary_example.py | 11 +--
 examples/src/main/python/ml/min_hash_lsh_example.py  | 12 +---
 examples/src/main/python/ml/one_vs_rest_example.py   | 13 ++---
 .../src/main/python/ml/train_validation_split.py | 13 ++---
 examples/src/main/python/parquet_inputformat.py  | 12 ++--
 examples/src/main/python/sql/basic.py| 11 +--
 examples/src/main/python/sql/datasource.py   | 11 +--
 examples/src/main/python/sql/hive.py | 11 +--
 22 files changed, 115 insertions(+), 138 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b8c32dc5/examples/src/main/python/avro_inputformat.py
--
diff --git a/examples/src/main/python/avro_inputformat.py 
b/examples/src/main/python/avro_inputformat.py
index 4422f9e..6286ba6 100644
--- a/examples/src/main/python/avro_inputformat.py
+++ b/examples/src/main/python/avro_inputformat.py
@@ -15,13 +15,6 @@
 # limitations under the License.
 #
 
-from __future__ import print_function
-
-import sys
-
-from functools import reduce
-from pyspark.sql import SparkSession
-
 """
 Read data file users.avro in local Spark distro:
 
@@ -50,6 +43,13 @@ $ ./bin/spark-submit --driver-class-path 
/path/to/example/jar \
 {u'favorite_color': None, u'name': u'Alyssa'}
 {u'favorite_color': u'red', u'name': u'Ben'}
 """
+from __future__ 

spark git commit: [SPARK-23157][SQL][FOLLOW-UP] DataFrame -> SparkDataFrame in R comment

2018-01-31 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 9ff1d96f0 -> f470df2fc


[SPARK-23157][SQL][FOLLOW-UP] DataFrame -> SparkDataFrame in R comment

Author: Henry Robinson 

Closes #20443 from henryr/SPARK-23157.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f470df2f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f470df2f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f470df2f

Branch: refs/heads/master
Commit: f470df2fcf14e6234c577dc1bdfac27d49b441f5
Parents: 9ff1d96
Author: Henry Robinson 
Authored: Thu Feb 1 11:15:17 2018 +0900
Committer: hyukjinkwon 
Committed: Thu Feb 1 11:15:17 2018 +0900

--
 R/pkg/R/DataFrame.R | 4 ++--
 python/pyspark/sql/dataframe.py | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f470df2f/R/pkg/R/DataFrame.R
--
diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index 547b5ea..41c3c3a 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -2090,8 +2090,8 @@ setMethod("selectExpr",
 #'
 #' @param x a SparkDataFrame.
 #' @param colName a column name.
-#' @param col a Column expression (which must refer only to this DataFrame), 
or an atomic vector in
-#' the length of 1 as literal value.
+#' @param col a Column expression (which must refer only to this 
SparkDataFrame), or an atomic
+#' vector in the length of 1 as literal value.
 #' @return A SparkDataFrame with the new column added or the existing column 
replaced.
 #' @family SparkDataFrame functions
 #' @aliases withColumn,SparkDataFrame,character-method

http://git-wip-us.apache.org/repos/asf/spark/blob/f470df2f/python/pyspark/sql/dataframe.py
--
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 055b2c4..1496cba 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -1829,7 +1829,7 @@ class DataFrame(object):
 Returns a new :class:`DataFrame` by adding a column or replacing the
 existing column that has the same name.
 
-The column expression must be an expression over this dataframe; 
attempting to add
+The column expression must be an expression over this DataFrame; 
attempting to add
 a column from some other dataframe will raise an error.
 
 :param colName: string, name of the new column.





spark git commit: [SPARK-23157][SQL][FOLLOW-UP] DataFrame -> SparkDataFrame in R comment

2018-01-31 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 8ee3a71c9 -> 7ccfc7530


[SPARK-23157][SQL][FOLLOW-UP] DataFrame -> SparkDataFrame in R comment

Author: Henry Robinson 

Closes #20443 from henryr/SPARK-23157.

(cherry picked from commit f470df2fcf14e6234c577dc1bdfac27d49b441f5)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7ccfc753
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7ccfc753
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7ccfc753

Branch: refs/heads/branch-2.3
Commit: 7ccfc753086c3859abe358c87f2e7b7a30422d5e
Parents: 8ee3a71
Author: Henry Robinson 
Authored: Thu Feb 1 11:15:17 2018 +0900
Committer: hyukjinkwon 
Committed: Thu Feb 1 11:15:32 2018 +0900

--
 R/pkg/R/DataFrame.R | 4 ++--
 python/pyspark/sql/dataframe.py | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7ccfc753/R/pkg/R/DataFrame.R
--
diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index 547b5ea..41c3c3a 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -2090,8 +2090,8 @@ setMethod("selectExpr",
 #'
 #' @param x a SparkDataFrame.
 #' @param colName a column name.
-#' @param col a Column expression (which must refer only to this DataFrame), 
or an atomic vector in
-#' the length of 1 as literal value.
+#' @param col a Column expression (which must refer only to this 
SparkDataFrame), or an atomic
+#' vector in the length of 1 as literal value.
 #' @return A SparkDataFrame with the new column added or the existing column 
replaced.
 #' @family SparkDataFrame functions
 #' @aliases withColumn,SparkDataFrame,character-method

http://git-wip-us.apache.org/repos/asf/spark/blob/7ccfc753/python/pyspark/sql/dataframe.py
--
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 055b2c4..1496cba 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -1829,7 +1829,7 @@ class DataFrame(object):
 Returns a new :class:`DataFrame` by adding a column or replacing the
 existing column that has the same name.
 
-The column expression must be an expression over this dataframe; 
attempting to add
+The column expression must be an expression over this DataFrame; 
attempting to add
 a column from some other dataframe will raise an error.
 
 :param colName: string, name of the new column.





spark git commit: [SPARK-23228][PYSPARK] Add Python Created jsparkSession to JVM's defaultSession

2018-01-31 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 161a3f2ae -> 3d0911bbe


[SPARK-23228][PYSPARK] Add Python Created jsparkSession to JVM's defaultSession

## What changes were proposed in this pull request?

In the current PySpark code, the Python-created `jsparkSession` is not added to
the JVM's defaultSession, so this `SparkSession` object cannot be fetched from the
Java side, and the Scala code below fails when loaded in a PySpark application.

```scala
class TestSparkSession extends SparkListener with Logging {
  override def onOtherEvent(event: SparkListenerEvent): Unit = {
event match {
  case CreateTableEvent(db, table) =>
val session = 
SparkSession.getActiveSession.orElse(SparkSession.getDefaultSession)
assert(session.isDefined)
val tableInfo = session.get.sharedState.externalCatalog.getTable(db, 
table)
logInfo(s"Table info ${tableInfo}")

  case e =>
logInfo(s"event $e")

}
  }
}
```

This PR therefore proposes to register the freshly created `jsparkSession` as the
JVM's `defaultSession`.
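
A rough sketch of what this enables from the Python side (a minimal check, not
part of the patch; `_jvm` is internal API and is used here only for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With this change the Python-created session is also registered as the JVM's
# default session, so JVM-side code such as the listener above can look it up.
assert spark._jvm.SparkSession.getDefaultSession().isDefined()

# stop() now also clears the JVM default session (see SPARK-23228).
spark.stop()
```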

## How was this patch tested?

Manual verification.

Author: jerryshao 
Author: hyukjinkwon 
Author: Saisai Shao 

Closes #20404 from jerryshao/SPARK-23228.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3d0911bb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3d0911bb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3d0911bb

Branch: refs/heads/master
Commit: 3d0911bbe47f76c341c090edad3737e88a67e3d7
Parents: 161a3f2
Author: jerryshao 
Authored: Wed Jan 31 20:04:51 2018 +0900
Committer: hyukjinkwon 
Committed: Wed Jan 31 20:04:51 2018 +0900

--
 python/pyspark/sql/session.py | 10 +-
 python/pyspark/sql/tests.py   | 28 +++-
 2 files changed, 36 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3d0911bb/python/pyspark/sql/session.py
--
diff --git a/python/pyspark/sql/session.py b/python/pyspark/sql/session.py
index 6c84023..1ed0429 100644
--- a/python/pyspark/sql/session.py
+++ b/python/pyspark/sql/session.py
@@ -213,7 +213,12 @@ class SparkSession(object):
 self._jsc = self._sc._jsc
 self._jvm = self._sc._jvm
 if jsparkSession is None:
-jsparkSession = self._jvm.SparkSession(self._jsc.sc())
+if self._jvm.SparkSession.getDefaultSession().isDefined() \
+and not self._jvm.SparkSession.getDefaultSession().get() \
+.sparkContext().isStopped():
+jsparkSession = 
self._jvm.SparkSession.getDefaultSession().get()
+else:
+jsparkSession = self._jvm.SparkSession(self._jsc.sc())
 self._jsparkSession = jsparkSession
 self._jwrapped = self._jsparkSession.sqlContext()
 self._wrapped = SQLContext(self._sc, self, self._jwrapped)
@@ -225,6 +230,7 @@ class SparkSession(object):
 if SparkSession._instantiatedSession is None \
 or SparkSession._instantiatedSession._sc._jsc is None:
 SparkSession._instantiatedSession = self
+self._jvm.SparkSession.setDefaultSession(self._jsparkSession)
 
 def _repr_html_(self):
 return """
@@ -759,6 +765,8 @@ class SparkSession(object):
 """Stop the underlying :class:`SparkContext`.
 """
 self._sc.stop()
+# We should clean the default session up. See SPARK-23228.
+self._jvm.SparkSession.clearDefaultSession()
 SparkSession._instantiatedSession = None
 
 @since(2.0)

http://git-wip-us.apache.org/repos/asf/spark/blob/3d0911bb/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index dc80870..dc26b96 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -69,7 +69,7 @@ from pyspark.sql.types import UserDefinedType, _infer_type, 
_make_type_verifier
 from pyspark.sql.types import _array_signed_int_typecode_ctype_mappings, 
_array_type_mappings
 from pyspark.sql.types import _array_unsigned_int_typecode_ctype_mappings
 from pyspark.sql.types import _merge_type
-from pyspark.tests import QuietTest, ReusedPySparkTestCase, SparkSubmitTests
+from pyspark.tests import QuietTest, ReusedPySparkTestCase, PySparkTestCase, 
SparkSubmitTests
 from pyspark.sql.functions import UserDefinedFunction, sha2, lit
 from pyspark.sql.window import Window
 from pyspark.sql.utils import AnalysisException, ParseException, 
IllegalArgumentException
@@ -2925,6 +2925,32 @@ class SQLTests2(ReusedSQLTestCase):
 sc.stop()
 
 
+class 

spark git commit: [SPARK-23300][TESTS][BRANCH-2.3] Prints out if Pandas and PyArrow are installed or not in PySpark SQL tests

2018-02-07 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 05239afc9 -> 2ba07d5b1


[SPARK-23300][TESTS][BRANCH-2.3] Prints out if Pandas and PyArrow are installed 
or not in PySpark SQL tests

This PR backports https://github.com/apache/spark/pull/20473 to branch-2.3.

Author: hyukjinkwon 

Closes #20533 from HyukjinKwon/backport-20473.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2ba07d5b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2ba07d5b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2ba07d5b

Branch: refs/heads/branch-2.3
Commit: 2ba07d5b101c44382e0db6d660da756c2f5ce627
Parents: 05239af
Author: hyukjinkwon 
Authored: Thu Feb 8 09:29:31 2018 +0900
Committer: hyukjinkwon 
Committed: Thu Feb 8 09:29:31 2018 +0900

--
 python/run-tests.py | 56 +++-
 1 file changed, 55 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/2ba07d5b/python/run-tests.py
--
diff --git a/python/run-tests.py b/python/run-tests.py
index 1341086..3539c76 100755
--- a/python/run-tests.py
+++ b/python/run-tests.py
@@ -31,6 +31,7 @@ if sys.version < '3':
 import Queue
 else:
 import queue as Queue
+from distutils.version import LooseVersion
 
 
 # Append `SPARK_HOME/dev` to the Python path so that we can import the 
sparktestsupport module
@@ -39,7 +40,7 @@ 
sys.path.append(os.path.join(os.path.dirname(os.path.realpath(__file__)), "../de
 
 from sparktestsupport import SPARK_HOME  # noqa (suppress pep8 warnings)
 from sparktestsupport.shellutils import which, subprocess_check_output  # noqa
-from sparktestsupport.modules import all_modules  # noqa
+from sparktestsupport.modules import all_modules, pyspark_sql  # noqa
 
 
 python_modules = dict((m.name, m) for m in all_modules if m.python_test_goals 
if m.name != 'root')
@@ -151,6 +152,55 @@ def parse_opts():
 return opts
 
 
+def _check_dependencies(python_exec, modules_to_test):
+# If we should test 'pyspark-sql', it checks if PyArrow and Pandas are 
installed and
+# explicitly prints out. See SPARK-23300.
+if pyspark_sql in modules_to_test:
+# TODO(HyukjinKwon): Relocate and deduplicate these version 
specifications.
+minimum_pyarrow_version = '0.8.0'
+minimum_pandas_version = '0.19.2'
+
+try:
+pyarrow_version = subprocess_check_output(
+[python_exec, "-c", "import pyarrow; 
print(pyarrow.__version__)"],
+universal_newlines=True,
+stderr=open(os.devnull, 'w')).strip()
+if LooseVersion(pyarrow_version) >= 
LooseVersion(minimum_pyarrow_version):
+LOGGER.info("Will test PyArrow related features against Python 
executable "
+"'%s' in '%s' module." % (python_exec, 
pyspark_sql.name))
+else:
+LOGGER.warning(
+"Will skip PyArrow related features against Python 
executable "
+"'%s' in '%s' module. PyArrow >= %s is required; however, 
PyArrow "
+"%s was found." % (
+python_exec, pyspark_sql.name, 
minimum_pyarrow_version, pyarrow_version))
+except:
+LOGGER.warning(
+"Will skip PyArrow related features against Python executable "
+"'%s' in '%s' module. PyArrow >= %s is required; however, 
PyArrow "
+"was not found." % (python_exec, pyspark_sql.name, 
minimum_pyarrow_version))
+
+try:
+pandas_version = subprocess_check_output(
+[python_exec, "-c", "import pandas; 
print(pandas.__version__)"],
+universal_newlines=True,
+stderr=open(os.devnull, 'w')).strip()
+if LooseVersion(pandas_version) >= 
LooseVersion(minimum_pandas_version):
+LOGGER.info("Will test Pandas related features against Python 
executable "
+"'%s' in '%s' module." % (python_exec, 
pyspark_sql.name))
+else:
+LOGGER.warning(
+"Will skip Pandas related features against Python 
executable "
+"'%s' in '%s' module. Pandas >= %s is required; however, 
Pandas "
+"%s was found." % (
+python_exec, pyspark_sql.name, minimum_pandas_version, 
pandas_version))
+except:
+LOGGER.warning(
+"Will skip Pandas related features against Python executable "
+"'%s' in '%s' module. Pandas >= %s is required; however, 
Pandas "
+"was not found." % (python_exec, pyspark_sql.name, 

spark git commit: [SPARK-23319][TESTS][BRANCH-2.3] Explicitly specify Pandas and PyArrow versions in PySpark tests (to skip or test)

2018-02-07 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 db59e5542 -> 053830256


[SPARK-23319][TESTS][BRANCH-2.3] Explicitly specify Pandas and PyArrow versions 
in PySpark tests (to skip or test)

This PR backports https://github.com/apache/spark/pull/20487 to branch-2.3.

Author: hyukjinkwon 
Author: Takuya UESHIN 

Closes #20534 from HyukjinKwon/PR_TOOL_PICK_PR_20487_BRANCH-2.3.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/05383025
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/05383025
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/05383025

Branch: refs/heads/branch-2.3
Commit: 0538302561c4d77b2856b1ce73b3ccbcb6688ac6
Parents: db59e55
Author: hyukjinkwon 
Authored: Thu Feb 8 16:47:12 2018 +0900
Committer: hyukjinkwon 
Committed: Thu Feb 8 16:47:12 2018 +0900

--
 pom.xml |  4 ++
 python/pyspark/sql/dataframe.py |  3 ++
 python/pyspark/sql/session.py   |  3 ++
 python/pyspark/sql/tests.py | 83 +++-
 python/pyspark/sql/utils.py | 30 +
 python/setup.py | 10 -
 6 files changed, 86 insertions(+), 47 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/05383025/pom.xml
--
diff --git a/pom.xml b/pom.xml
index a8e448a..9aa531e 100644
--- a/pom.xml
+++ b/pom.xml
@@ -185,6 +185,10 @@
 2.8
 1.8
 1.0.0
+
 0.8.0
 
 ${java.home}

http://git-wip-us.apache.org/repos/asf/spark/blob/05383025/python/pyspark/sql/dataframe.py
--
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 59a4170..8ec24db 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -1913,6 +1913,9 @@ class DataFrame(object):
 02  Alice
 15Bob
 """
+from pyspark.sql.utils import require_minimum_pandas_version
+require_minimum_pandas_version()
+
 import pandas as pd
 
 if 
self.sql_ctx.getConf("spark.sql.execution.pandas.respectSessionTimeZone").lower()
 \

http://git-wip-us.apache.org/repos/asf/spark/blob/05383025/python/pyspark/sql/session.py
--
diff --git a/python/pyspark/sql/session.py b/python/pyspark/sql/session.py
index 6c84023..2ac2ec2 100644
--- a/python/pyspark/sql/session.py
+++ b/python/pyspark/sql/session.py
@@ -640,6 +640,9 @@ class SparkSession(object):
 except Exception:
 has_pandas = False
 if has_pandas and isinstance(data, pandas.DataFrame):
+from pyspark.sql.utils import require_minimum_pandas_version
+require_minimum_pandas_version()
+
 if 
self.conf.get("spark.sql.execution.pandas.respectSessionTimeZone").lower() \
== "true":
 timezone = self.conf.get("spark.sql.session.timeZone")

http://git-wip-us.apache.org/repos/asf/spark/blob/05383025/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 878d402..0e1b2ec 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -48,19 +48,26 @@ if sys.version_info[:2] <= (2, 6):
 else:
 import unittest
 
-_have_pandas = False
-_have_old_pandas = False
+_pandas_requirement_message = None
 try:
-import pandas
-try:
-from pyspark.sql.utils import require_minimum_pandas_version
-require_minimum_pandas_version()
-_have_pandas = True
-except:
-_have_old_pandas = True
-except:
-# No Pandas, but that's okay, we'll skip those tests
-pass
+from pyspark.sql.utils import require_minimum_pandas_version
+require_minimum_pandas_version()
+except ImportError as e:
+from pyspark.util import _exception_message
+# If Pandas version requirement is not satisfied, skip related tests.
+_pandas_requirement_message = _exception_message(e)
+
+_pyarrow_requirement_message = None
+try:
+from pyspark.sql.utils import require_minimum_pyarrow_version
+require_minimum_pyarrow_version()
+except ImportError as e:
+from pyspark.util import _exception_message
+# If Arrow version requirement is not satisfied, skip related tests.
+_pyarrow_requirement_message = _exception_message(e)
+
+_have_pandas = _pandas_requirement_message is None
+_have_pyarrow = _pyarrow_requirement_message is None
 
 from pyspark import SparkContext
 from pyspark.sql import SparkSession, SQLContext, HiveContext, Column, Row
@@ -75,15 +82,6 @@ from 

spark git commit: [SPARK-23256][ML][PYTHON] Add columnSchema method to PySpark image reader

2018-02-04 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 551dff2bc -> 715047b02


[SPARK-23256][ML][PYTHON] Add columnSchema method to PySpark image reader

## What changes were proposed in this pull request?

This PR proposes to add `columnSchema` on the Python side too.

```python
>>> from pyspark.ml.image import ImageSchema
>>> ImageSchema.columnSchema.simpleString()
'struct'
```

## How was this patch tested?

Manually tested, and a unit test was added in `python/pyspark/ml/tests.py`.

Author: hyukjinkwon 

Closes #20475 from HyukjinKwon/SPARK-23256.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/715047b0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/715047b0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/715047b0

Branch: refs/heads/master
Commit: 715047b02df0ac9ec16ab2a73481ab7f36ffc6ca
Parents: 551dff2
Author: hyukjinkwon 
Authored: Sun Feb 4 17:53:31 2018 +0900
Committer: hyukjinkwon 
Committed: Sun Feb 4 17:53:31 2018 +0900

--
 python/pyspark/ml/image.py | 20 +++-
 python/pyspark/ml/tests.py |  1 +
 2 files changed, 20 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/715047b0/python/pyspark/ml/image.py
--
diff --git a/python/pyspark/ml/image.py b/python/pyspark/ml/image.py
index 2d86c7f..45c9366 100644
--- a/python/pyspark/ml/image.py
+++ b/python/pyspark/ml/image.py
@@ -40,6 +40,7 @@ class _ImageSchema(object):
 def __init__(self):
 self._imageSchema = None
 self._ocvTypes = None
+self._columnSchema = None
 self._imageFields = None
 self._undefinedImageType = None
 
@@ -49,7 +50,7 @@ class _ImageSchema(object):
 Returns the image schema.
 
 :return: a :class:`StructType` with a single column of images
-   named "image" (nullable).
+   named "image" (nullable) and having the same type returned by 
:meth:`columnSchema`.
 
 .. versionadded:: 2.3.0
 """
@@ -76,6 +77,23 @@ class _ImageSchema(object):
 return self._ocvTypes
 
 @property
+def columnSchema(self):
+"""
+Returns the schema for the image column.
+
+:return: a :class:`StructType` for image column,
+``struct``.
+
+.. versionadded:: 2.4.0
+"""
+
+if self._columnSchema is None:
+ctx = SparkContext._active_spark_context
+jschema = 
ctx._jvm.org.apache.spark.ml.image.ImageSchema.columnSchema()
+self._columnSchema = _parse_datatype_json_string(jschema.json())
+return self._columnSchema
+
+@property
 def imageFields(self):
 """
 Returns field names of image columns.

http://git-wip-us.apache.org/repos/asf/spark/blob/715047b0/python/pyspark/ml/tests.py
--
diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py
index 1af2b91..75d0478 100755
--- a/python/pyspark/ml/tests.py
+++ b/python/pyspark/ml/tests.py
@@ -1852,6 +1852,7 @@ class ImageReaderTest(SparkSessionTestCase):
 self.assertEqual(len(array), first_row[1])
 self.assertEqual(ImageSchema.toImage(array, origin=first_row[0]), 
first_row)
 self.assertEqual(df.schema, ImageSchema.imageSchema)
+self.assertEqual(df.schema["image"].dataType, ImageSchema.columnSchema)
 expected = {'CV_8UC3': 16, 'Undefined': -1, 'CV_8U': 0, 'CV_8UC1': 0, 
'CV_8UC4': 24}
 self.assertEqual(ImageSchema.ocvTypes, expected)
 expected = ['origin', 'height', 'width', 'nChannels', 'mode', 'data']





spark git commit: [SPARK-23290][SQL][PYTHON][BACKPORT-2.3] Use datetime.date for date type when converting Spark DataFrame to Pandas DataFrame.

2018-02-06 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 521494d7b -> 44933033e


[SPARK-23290][SQL][PYTHON][BACKPORT-2.3] Use datetime.date for date type when 
converting Spark DataFrame to Pandas DataFrame.

## What changes were proposed in this pull request?

This is a backport of #20506.

In #18664, there was a change in how `DateType` is returned to users ([line 1968 in
dataframe.py](https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968)).
This can cause client code that works in Spark 2.2 to fail.
See 
[SPARK-23290](https://issues.apache.org/jira/browse/SPARK-23290?focusedCommentId=16350917=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16350917)
 for an example.

This PR changes the conversion to use `datetime.date` for date type, as Spark 2.2 does.
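
A small illustrative sketch of the restored behaviour (the example data is made up):

```python
import datetime

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(datetime.date(2018, 2, 6),)], ["d"])

pdf = df.toPandas()
# With this change each cell of the DateType column is a datetime.date again,
# matching the Spark 2.2 behaviour.
print(type(pdf["d"][0]))
```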

## How was this patch tested?

Tests modified to fit the new behavior, plus existing tests.

Author: Takuya UESHIN 

Closes #20515 from ueshin/issues/SPARK-23290_2.3.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/44933033
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/44933033
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/44933033

Branch: refs/heads/branch-2.3
Commit: 44933033e9216ccb2e533b9dc6e6cb03cce39817
Parents: 521494d
Author: Takuya UESHIN 
Authored: Tue Feb 6 18:29:37 2018 +0900
Committer: hyukjinkwon 
Committed: Tue Feb 6 18:29:37 2018 +0900

--
 python/pyspark/serializers.py   |  9 --
 python/pyspark/sql/dataframe.py |  7 ++---
 python/pyspark/sql/tests.py | 57 ++--
 python/pyspark/sql/types.py | 15 ++
 4 files changed, 66 insertions(+), 22 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/44933033/python/pyspark/serializers.py
--
diff --git a/python/pyspark/serializers.py b/python/pyspark/serializers.py
index 88d6a19..e870325 100644
--- a/python/pyspark/serializers.py
+++ b/python/pyspark/serializers.py
@@ -267,12 +267,15 @@ class ArrowStreamPandasSerializer(Serializer):
 """
 Deserialize ArrowRecordBatches to an Arrow table and return as a list 
of pandas.Series.
 """
-from pyspark.sql.types import _check_dataframe_localize_timestamps
+from pyspark.sql.types import from_arrow_schema, 
_check_dataframe_convert_date, \
+_check_dataframe_localize_timestamps
 import pyarrow as pa
 reader = pa.open_stream(stream)
+schema = from_arrow_schema(reader.schema)
 for batch in reader:
-# NOTE: changed from pa.Columns.to_pandas, timezone issue in 
conversion fixed in 0.7.1
-pdf = _check_dataframe_localize_timestamps(batch.to_pandas(), 
self._timezone)
+pdf = batch.to_pandas()
+pdf = _check_dataframe_convert_date(pdf, schema)
+pdf = _check_dataframe_localize_timestamps(pdf, self._timezone)
 yield [c for _, c in pdf.iteritems()]
 
 def __repr__(self):

http://git-wip-us.apache.org/repos/asf/spark/blob/44933033/python/pyspark/sql/dataframe.py
--
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 2e55407..59a4170 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -1923,7 +1923,8 @@ class DataFrame(object):
 
 if self.sql_ctx.getConf("spark.sql.execution.arrow.enabled", 
"false").lower() == "true":
 try:
-from pyspark.sql.types import 
_check_dataframe_localize_timestamps
+from pyspark.sql.types import _check_dataframe_convert_date, \
+_check_dataframe_localize_timestamps
 from pyspark.sql.utils import require_minimum_pyarrow_version
 import pyarrow
 require_minimum_pyarrow_version()
@@ -1931,6 +1932,7 @@ class DataFrame(object):
 if tables:
 table = pyarrow.concat_tables(tables)
 pdf = table.to_pandas()
+pdf = _check_dataframe_convert_date(pdf, self.schema)
 return _check_dataframe_localize_timestamps(pdf, timezone)
 else:
 return pd.DataFrame.from_records([], columns=self.columns)
@@ -2009,7 +2011,6 @@ def _to_corrected_pandas_type(dt):
 """
 When converting Spark SQL records to Pandas DataFrame, the inferred data 
type may be wrong.
 This method gets the corrected data type for Pandas if that type may be 
inferred uncorrectly.
-NOTE: DateType is inferred incorrectly as 'object', TimestampType is 
correct with 

spark git commit: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return type StringType() to handle str type properly in Python 2.

2018-02-06 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 44933033e -> a51154482


[SPARK-23334][SQL][PYTHON] Fix pandas_udf with return type StringType() to 
handle str type properly in Python 2.

## What changes were proposed in this pull request?

In Python 2, when a `pandas_udf` tries to return a string-type value created in
the UDF with `".."`, the execution fails. E.g.,

```python
from pyspark.sql.functions import pandas_udf, col
import pandas as pd

df = spark.range(10)
str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string")
df.select(str_f(col('id'))).show()
```

raises the following exception:

```
...

java.lang.AssertionError: assertion failed: Invalid schema from pandas_udf: 
expected StringType, got BinaryType
at scala.Predef$.assert(Predef.scala:170)
at 
org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.(ArrowEvalPythonExec.scala:93)

...
```

It seems that pyarrow ignores the `type` parameter of `pa.Array.from_pandas()` and
treats the values as binary type when the declared type is string type but the
string values are `str` rather than `unicode` in Python 2.

This PR adds a workaround for this case.

## How was this patch tested?

Added a test and existing tests.

Author: Takuya UESHIN 

Closes #20507 from ueshin/issues/SPARK-23334.

(cherry picked from commit 63c5bf13ce5cd3b8d7e7fb88de881ed207fde720)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a5115448
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a5115448
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a5115448

Branch: refs/heads/branch-2.3
Commit: a511544822be6e3fc9c6bb5080a163b9acbb41f2
Parents: 4493303
Author: Takuya UESHIN 
Authored: Tue Feb 6 18:30:50 2018 +0900
Committer: hyukjinkwon 
Committed: Tue Feb 6 18:31:06 2018 +0900

--
 python/pyspark/serializers.py | 4 
 python/pyspark/sql/tests.py   | 9 +
 2 files changed, 13 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a5115448/python/pyspark/serializers.py
--
diff --git a/python/pyspark/serializers.py b/python/pyspark/serializers.py
index e870325..91a7f09 100644
--- a/python/pyspark/serializers.py
+++ b/python/pyspark/serializers.py
@@ -230,6 +230,10 @@ def _create_batch(series, timezone):
 s = _check_series_convert_timestamps_internal(s.fillna(0), 
timezone)
 # TODO: need cast after Arrow conversion, ns values cause error 
with pandas 0.19.2
 return pa.Array.from_pandas(s, mask=mask).cast(t, safe=False)
+elif t is not None and pa.types.is_string(t) and sys.version < '3':
+# TODO: need decode before converting to Arrow in Python 2
+return pa.Array.from_pandas(s.apply(
+lambda v: v.decode("utf-8") if isinstance(v, str) else v), 
mask=mask, type=t)
 return pa.Array.from_pandas(s, mask=mask, type=t)
 
 arrs = [create_array(s, t) for s, t in series]

http://git-wip-us.apache.org/repos/asf/spark/blob/a5115448/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 95b9c0e..2577ed7 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -3896,6 +3896,15 @@ class VectorizedUDFTests(ReusedSQLTestCase):
 res = df.select(str_f(col('str')))
 self.assertEquals(df.collect(), res.collect())
 
+def test_vectorized_udf_string_in_udf(self):
+from pyspark.sql.functions import pandas_udf, col
+import pandas as pd
+df = self.spark.range(10)
+str_f = pandas_udf(lambda x: pd.Series(map(str, x)), StringType())
+actual = df.select(str_f(col('id')))
+expected = df.select(col('id').cast('string'))
+self.assertEquals(expected.collect(), actual.collect())
+
 def test_vectorized_udf_datatype_string(self):
 from pyspark.sql.functions import pandas_udf, col
 df = self.spark.range(10).select(





spark git commit: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return type StringType() to handle str type properly in Python 2.

2018-02-06 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 8141c3e3d -> 63c5bf13c


[SPARK-23334][SQL][PYTHON] Fix pandas_udf with return type StringType() to 
handle str type properly in Python 2.

## What changes were proposed in this pull request?

In Python 2, when a `pandas_udf` tries to return a string-type value created in
the UDF with `".."`, the execution fails. E.g.,

```python
from pyspark.sql.functions import pandas_udf, col
import pandas as pd

df = spark.range(10)
str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string")
df.select(str_f(col('id'))).show()
```

raises the following exception:

```
...

java.lang.AssertionError: assertion failed: Invalid schema from pandas_udf: 
expected StringType, got BinaryType
at scala.Predef$.assert(Predef.scala:170)
at 
org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.(ArrowEvalPythonExec.scala:93)

...
```

It seems that pyarrow ignores the `type` parameter of `pa.Array.from_pandas()` and
treats the values as binary type when the declared type is string type but the
string values are `str` rather than `unicode` in Python 2.

This PR adds a workaround for this case.

## How was this patch tested?

Added a test and existing tests.

Author: Takuya UESHIN 

Closes #20507 from ueshin/issues/SPARK-23334.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/63c5bf13
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/63c5bf13
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/63c5bf13

Branch: refs/heads/master
Commit: 63c5bf13ce5cd3b8d7e7fb88de881ed207fde720
Parents: 8141c3e
Author: Takuya UESHIN 
Authored: Tue Feb 6 18:30:50 2018 +0900
Committer: hyukjinkwon 
Committed: Tue Feb 6 18:30:50 2018 +0900

--
 python/pyspark/serializers.py | 4 
 python/pyspark/sql/tests.py   | 9 +
 2 files changed, 13 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/63c5bf13/python/pyspark/serializers.py
--
diff --git a/python/pyspark/serializers.py b/python/pyspark/serializers.py
index e870325..91a7f09 100644
--- a/python/pyspark/serializers.py
+++ b/python/pyspark/serializers.py
@@ -230,6 +230,10 @@ def _create_batch(series, timezone):
 s = _check_series_convert_timestamps_internal(s.fillna(0), 
timezone)
 # TODO: need cast after Arrow conversion, ns values cause error 
with pandas 0.19.2
 return pa.Array.from_pandas(s, mask=mask).cast(t, safe=False)
+elif t is not None and pa.types.is_string(t) and sys.version < '3':
+# TODO: need decode before converting to Arrow in Python 2
+return pa.Array.from_pandas(s.apply(
+lambda v: v.decode("utf-8") if isinstance(v, str) else v), 
mask=mask, type=t)
 return pa.Array.from_pandas(s, mask=mask, type=t)
 
 arrs = [create_array(s, t) for s, t in series]

http://git-wip-us.apache.org/repos/asf/spark/blob/63c5bf13/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 545ec5a..89b7c21 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -3922,6 +3922,15 @@ class ScalarPandasUDF(ReusedSQLTestCase):
 res = df.select(str_f(col('str')))
 self.assertEquals(df.collect(), res.collect())
 
+def test_vectorized_udf_string_in_udf(self):
+from pyspark.sql.functions import pandas_udf, col
+import pandas as pd
+df = self.spark.range(10)
+str_f = pandas_udf(lambda x: pd.Series(map(str, x)), StringType())
+actual = df.select(str_f(col('id')))
+expected = df.select(col('id').cast('string'))
+self.assertEquals(expected.collect(), actual.collect())
+
 def test_vectorized_udf_datatype_string(self):
 from pyspark.sql.functions import pandas_udf, col
 df = self.spark.range(10).select(





spark git commit: [SPARK-23352][PYTHON] Explicitly specify supported types in Pandas UDFs

2018-02-12 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 6efd5d117 -> c338c8cf8


[SPARK-23352][PYTHON] Explicitly specify supported types in Pandas UDFs

## What changes were proposed in this pull request?

This PR aims to explicitly specify the supported types in Pandas UDFs.
The main change here is to add deduplicated, explicit type checking of `returnType`
up front and to document it; in the process, it also fixes multiple things.

1. Currently, we don't support `BinaryType` in Pandas UDFs, for example, see:

```python
from pyspark.sql.functions import pandas_udf
pudf = pandas_udf(lambda x: x, "binary")
df = spark.createDataFrame([[bytearray(1)]])
df.select(pudf("_1")).show()
```
```
...
TypeError: Unsupported type in conversion to Arrow: BinaryType
```

We can document this behaviour in the guide.

2. Also, the grouped aggregate Pandas UDF currently fails fast on `ArrayType`, but
it seems we can support this case.

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType
foo = pandas_udf(lambda v: v.mean(), 'array', 
PandasUDFType.GROUPED_AGG)
df = spark.range(100).selectExpr("id", "array(id) as value")
df.groupBy("id").agg(foo("value")).show()
```

```
...
 NotImplementedError: ArrayType, StructType and MapType are not supported 
with PandasUDFType.GROUPED_AGG
```

3. Since we can check the return type ahead of time, we can fail fast before actual
execution.

```python
# we can fail fast at this stage because we know the schema ahead
pandas_udf(lambda x: x, BinaryType())
```

## How was this patch tested?

Manually tested and unit tests for `BinaryType` and `ArrayType(...)` were added.

Author: hyukjinkwon 

Closes #20531 from HyukjinKwon/pudf-cleanup.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c338c8cf
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c338c8cf
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c338c8cf

Branch: refs/heads/master
Commit: c338c8cf8253c037ecd4f39bbd58ed5a86581b37
Parents: 6efd5d1
Author: hyukjinkwon 
Authored: Mon Feb 12 20:49:36 2018 +0900
Committer: hyukjinkwon 
Committed: Mon Feb 12 20:49:36 2018 +0900

--
 docs/sql-programming-guide.md   |   4 +-
 python/pyspark/sql/tests.py | 130 +++
 python/pyspark/sql/types.py |   4 +
 python/pyspark/sql/udf.py   |  36 +++--
 python/pyspark/worker.py|   2 +-
 .../org/apache/spark/sql/internal/SQLConf.scala |   2 +-
 6 files changed, 111 insertions(+), 67 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c338c8cf/docs/sql-programming-guide.md
--
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index eab4030..6174a93 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1676,7 +1676,7 @@ Using the above optimizations with Arrow will produce the 
same results as when A
 enabled. Note that even with Arrow, `toPandas()` results in the collection of 
all records in the
 DataFrame to the driver program and should be done on a small subset of the 
data. Not all Spark
 data types are currently supported and an error can be raised if a column has 
an unsupported type,
-see [Supported Types](#supported-sql-arrow-types). If an error occurs during 
`createDataFrame()`,
+see [Supported SQL Types](#supported-sql-arrow-types). If an error occurs 
during `createDataFrame()`,
 Spark will fall back to create the DataFrame without Arrow.
 
 ## Pandas UDFs (a.k.a. Vectorized UDFs)
@@ -1734,7 +1734,7 @@ For detailed usage, please see 
[`pyspark.sql.functions.pandas_udf`](api/python/p
 
 ### Supported SQL Types
 
-Currently, all Spark SQL data types are supported by Arrow-based conversion 
except `MapType`,
+Currently, all Spark SQL data types are supported by Arrow-based conversion 
except `BinaryType`, `MapType`,
 `ArrayType` of `TimestampType`, and nested `StructType`.
 
 ### Setting Arrow Batch Size

http://git-wip-us.apache.org/repos/asf/spark/blob/c338c8cf/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index fe89bd0..2af218a 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -3790,10 +3790,10 @@ class PandasUDFTests(ReusedSQLTestCase):
 self.assertEqual(foo.returnType, schema)
 self.assertEqual(foo.evalType, 
PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF)
 
-@pandas_udf(returnType='v double', functionType=PandasUDFType.SCALAR)
+

spark git commit: [SPARK-23084][PYTHON] Add unboundedPreceding(), unboundedFollowing() and currentRow() to PySpark

2018-02-11 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master a34fce19b -> 8acb51f08


[SPARK-23084][PYTHON] Add unboundedPreceding(), unboundedFollowing() and 
currentRow() to PySpark

## What changes were proposed in this pull request?

Added unboundedPreceding(), unboundedFollowing() and currentRow() to PySpark, and 
also updated the rangeBetween API.

## How was this patch tested?

Unit tested locally. Please let me know if a unit test should be added in 
tests.py.

Author: Huaxin Gao 

Closes #20400 from huaxingao/spark_23084.
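
For context, a minimal usage sketch of window frame boundaries. It uses the 
long-standing `Window.unboundedPreceding` and `Window.currentRow` constants, which 
the new functions mirror; the data and column names are made up.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10), (1, 20), (2, 30)], ["grp", "val"])

# Running sum from the start of each partition up to the current row.
w = (Window.partitionBy("grp")
     .orderBy("val")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))
df.withColumn("running_sum", F.sum("val").over(w)).show()
```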


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8acb51f0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8acb51f0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8acb51f0

Branch: refs/heads/master
Commit: 8acb51f08b448628b65e90af3b268994f9550e45
Parents: a34fce1
Author: Huaxin Gao 
Authored: Sun Feb 11 18:55:38 2018 +0900
Committer: hyukjinkwon 
Committed: Sun Feb 11 18:55:38 2018 +0900

--
 python/pyspark/sql/functions.py | 30 
 python/pyspark/sql/window.py| 70 ++--
 2 files changed, 82 insertions(+), 18 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8acb51f0/python/pyspark/sql/functions.py
--
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 05031f5..9bb9c32 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -809,6 +809,36 @@ def ntile(n):
 return Column(sc._jvm.functions.ntile(int(n)))
 
 
+@since(2.4)
+def unboundedPreceding():
+"""
+Window function: returns the special frame boundary that represents the 
first row
+in the window partition.
+"""
+sc = SparkContext._active_spark_context
+return Column(sc._jvm.functions.unboundedPreceding())
+
+
+@since(2.4)
+def unboundedFollowing():
+"""
+Window function: returns the special frame boundary that represents the 
last row
+in the window partition.
+"""
+sc = SparkContext._active_spark_context
+return Column(sc._jvm.functions.unboundedFollowing())
+
+
+@since(2.4)
+def currentRow():
+"""
+Window function: returns the special frame boundary that represents the 
current row
+in the window partition.
+"""
+sc = SparkContext._active_spark_context
+return Column(sc._jvm.functions.currentRow())
+
+
 # -- Date/Timestamp functions 
--
 
 @since(1.5)

http://git-wip-us.apache.org/repos/asf/spark/blob/8acb51f0/python/pyspark/sql/window.py
--
diff --git a/python/pyspark/sql/window.py b/python/pyspark/sql/window.py
index 7ce27f9..bb841a9 100644
--- a/python/pyspark/sql/window.py
+++ b/python/pyspark/sql/window.py
@@ -16,9 +16,11 @@
 #
 
 import sys
+if sys.version >= '3':
+long = int
 
 from pyspark import since, SparkContext
-from pyspark.sql.column import _to_seq, _to_java_column
+from pyspark.sql.column import Column, _to_seq, _to_java_column
 
 __all__ = ["Window", "WindowSpec"]
 
@@ -120,20 +122,45 @@ class Window(object):
 and "5" means the five off after the current row.
 
 We recommend users use ``Window.unboundedPreceding``, 
``Window.unboundedFollowing``,
-and ``Window.currentRow`` to specify special boundary values, rather 
than using integral
-values directly.
+``Window.currentRow``, ``pyspark.sql.functions.unboundedPreceding``,
+``pyspark.sql.functions.unboundedFollowing`` and 
``pyspark.sql.functions.currentRow``
+to specify special boundary values, rather than using integral values 
directly.
 
 :param start: boundary start, inclusive.
-  The frame is unbounded if this is 
``Window.unboundedPreceding``, or
+  The frame is unbounded if this is 
``Window.unboundedPreceding``,
+  a column returned by 
``pyspark.sql.functions.unboundedPreceding``, or
   any value less than or equal to max(-sys.maxsize, 
-9223372036854775808).
 :param end: boundary end, inclusive.
-The frame is unbounded if this is 
``Window.unboundedFollowing``, or
+The frame is unbounded if this is 
``Window.unboundedFollowing``,
+a column returned by 
``pyspark.sql.functions.unboundedFollowing``, or
 any value greater than or equal to min(sys.maxsize, 
9223372036854775807).
+
+>>> from pyspark.sql import functions as F, SparkSession, Window
+>>> spark = SparkSession.builder.getOrCreate()
+>>> df = spark.createDataFrame(
+... 

spark git commit: [SPARK-23387][SQL][PYTHON][TEST][BRANCH-2.3] Backport assertPandasEqual to branch-2.3.

2018-02-11 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 9fa7b0e10 -> 8875e47ce


[SPARK-23387][SQL][PYTHON][TEST][BRANCH-2.3] Backport assertPandasEqual to 
branch-2.3.

## What changes were proposed in this pull request?

When backporting a PR whose tests use `assertPandasEqual` from master to 
branch-2.3, the tests fail because `assertPandasEqual` doesn't exist in 
branch-2.3.
We should backport `assertPandasEqual` to branch-2.3 to avoid the failures.

## How was this patch tested?

Modified tests.

Author: Takuya UESHIN 

Closes #20577 from ueshin/issues/SPARK-23387/branch-2.3.
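
As a standalone illustration of what the helper checks, outside Spark's test 
harness and with toy data (assuming only pandas is installed): pandas' 
`DataFrame.equals` compares values and dtypes together.

```python
import pandas as pd

expected = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
result = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})

# Mirrors the assertion style of the backported helper: equals() is strict about
# both values and dtypes, so an int column compared with a float column fails.
assert expected.equals(result), "DataFrames are not equal"
print("DataFrames are equal")
```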


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8875e47c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8875e47c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8875e47c

Branch: refs/heads/branch-2.3
Commit: 8875e47cec01ae8da4ffb855409b54089e1016fb
Parents: 9fa7b0e
Author: Takuya UESHIN 
Authored: Sun Feb 11 22:16:47 2018 +0900
Committer: hyukjinkwon 
Committed: Sun Feb 11 22:16:47 2018 +0900

--
 python/pyspark/sql/tests.py | 44 +---
 1 file changed, 19 insertions(+), 25 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8875e47c/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 0f76c96..5480144 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -195,6 +195,12 @@ class ReusedSQLTestCase(ReusedPySparkTestCase):
 ReusedPySparkTestCase.tearDownClass()
 cls.spark.stop()
 
+def assertPandasEqual(self, expected, result):
+msg = ("DataFrames are not equal: " +
+   "\n\nExpected:\n%s\n%s" % (expected, expected.dtypes) +
+   "\n\nResult:\n%s\n%s" % (result, result.dtypes))
+self.assertTrue(expected.equals(result), msg=msg)
+
 
 class DataTypeTests(unittest.TestCase):
 # regression test for SPARK-6055
@@ -3422,12 +3428,6 @@ class ArrowTests(ReusedSQLTestCase):
 time.tzset()
 ReusedSQLTestCase.tearDownClass()
 
-def assertFramesEqual(self, df_with_arrow, df_without):
-msg = ("DataFrame from Arrow is not equal" +
-   ("\n\nWith Arrow:\n%s\n%s" % (df_with_arrow, 
df_with_arrow.dtypes)) +
-   ("\n\nWithout:\n%s\n%s" % (df_without, df_without.dtypes)))
-self.assertTrue(df_without.equals(df_with_arrow), msg=msg)
-
 def create_pandas_data_frame(self):
 import pandas as pd
 import numpy as np
@@ -3466,8 +3466,8 @@ class ArrowTests(ReusedSQLTestCase):
 df = self.spark.createDataFrame(self.data, schema=self.schema)
 pdf, pdf_arrow = self._toPandas_arrow_toggle(df)
 expected = self.create_pandas_data_frame()
-self.assertFramesEqual(expected, pdf)
-self.assertFramesEqual(expected, pdf_arrow)
+self.assertPandasEqual(expected, pdf)
+self.assertPandasEqual(expected, pdf_arrow)
 
 def test_toPandas_respect_session_timezone(self):
 df = self.spark.createDataFrame(self.data, schema=self.schema)
@@ -3478,11 +3478,11 @@ class ArrowTests(ReusedSQLTestCase):
 
self.spark.conf.set("spark.sql.execution.pandas.respectSessionTimeZone", 
"false")
 try:
 pdf_la, pdf_arrow_la = self._toPandas_arrow_toggle(df)
-self.assertFramesEqual(pdf_arrow_la, pdf_la)
+self.assertPandasEqual(pdf_arrow_la, pdf_la)
 finally:
 
self.spark.conf.set("spark.sql.execution.pandas.respectSessionTimeZone", "true")
 pdf_ny, pdf_arrow_ny = self._toPandas_arrow_toggle(df)
-self.assertFramesEqual(pdf_arrow_ny, pdf_ny)
+self.assertPandasEqual(pdf_arrow_ny, pdf_ny)
 
 self.assertFalse(pdf_ny.equals(pdf_la))
 
@@ -3492,7 +3492,7 @@ class ArrowTests(ReusedSQLTestCase):
 if isinstance(field.dataType, TimestampType):
 pdf_la_corrected[field.name] = 
_check_series_convert_timestamps_local_tz(
 pdf_la_corrected[field.name], timezone)
-self.assertFramesEqual(pdf_ny, pdf_la_corrected)
+self.assertPandasEqual(pdf_ny, pdf_la_corrected)
 finally:
 self.spark.conf.set("spark.sql.session.timeZone", orig_tz)
 
@@ -3500,7 +3500,7 @@ class ArrowTests(ReusedSQLTestCase):
 pdf = self.create_pandas_data_frame()
 df = self.spark.createDataFrame(self.data, schema=self.schema)
 pdf_arrow = df.toPandas()
-self.assertFramesEqual(pdf_arrow, pdf)
+self.assertPandasEqual(pdf_arrow, pdf)
 
 def test_filtered_frame(self):
 

spark git commit: [SPARK-22624][PYSPARK] Expose range partitioning shuffle introduced by spark-22614

2018-02-11 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 8acb51f08 -> eacb62fbb


[SPARK-22624][PYSPARK] Expose range partitioning shuffle introduced by 
spark-22614

## What changes were proposed in this pull request?

Expose the range partitioning shuffle introduced by SPARK-22614.

## How was this patch tested?

Unit test in dataframe.py

Please review http://spark.apache.org/contributing.html before opening a pull 
request.

Author: xubo245 <601450...@qq.com>

Closes #20456 from xubo245/SPARK22624_PysparkRangePartition.
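
A minimal usage sketch of the new API, assuming a Spark build that includes this 
change (2.4.0+); the data and column name are made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 100).withColumnRenamed("id", "age")

# Range partitioning: each of the 4 partitions holds a contiguous range of "age".
parts = df.repartitionByRange(4, "age")
print(parts.rdd.getNumPartitions())  # 4
```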


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/eacb62fb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/eacb62fb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/eacb62fb

Branch: refs/heads/master
Commit: eacb62fbbed317fd0e972102838af231385d54d8
Parents: 8acb51f
Author: xubo245 <601450...@qq.com>
Authored: Sun Feb 11 19:23:15 2018 +0900
Committer: hyukjinkwon 
Committed: Sun Feb 11 19:23:15 2018 +0900

--
 python/pyspark/sql/dataframe.py | 45 
 python/pyspark/sql/tests.py | 28 ++
 2 files changed, 73 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/eacb62fb/python/pyspark/sql/dataframe.py
--
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index faee870..5cc8b63 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -667,6 +667,51 @@ class DataFrame(object):
 else:
 raise TypeError("numPartitions should be an int or Column")
 
+@since("2.4.0")
+def repartitionByRange(self, numPartitions, *cols):
+"""
+Returns a new :class:`DataFrame` partitioned by the given partitioning 
expressions. The
+resulting DataFrame is range partitioned.
+
+``numPartitions`` can be an int to specify the target number of 
partitions or a Column.
+If it is a Column, it will be used as the first partitioning column. 
If not specified,
+the default number of partitions is used.
+
+At least one partition-by expression must be specified.
+When no explicit sort order is specified, "ascending nulls first" is 
assumed.
+
+>>> df.repartitionByRange(2, "age").rdd.getNumPartitions()
+2
+>>> df.show()
++---+-+
+|age| name|
++---+-+
+|  2|Alice|
+|  5|  Bob|
++---+-+
+>>> df.repartitionByRange(1, "age").rdd.getNumPartitions()
+1
+>>> data = df.repartitionByRange("age")
+>>> df.show()
++---+-+
+|age| name|
++---+-+
+|  2|Alice|
+|  5|  Bob|
++---+-+
+"""
+if isinstance(numPartitions, int):
+if len(cols) == 0:
+return ValueError("At least one partition-by expression must 
be specified.")
+else:
+return DataFrame(
+self._jdf.repartitionByRange(numPartitions, 
self._jcols(*cols)), self.sql_ctx)
+elif isinstance(numPartitions, (basestring, Column)):
+cols = (numPartitions,) + cols
+return DataFrame(self._jdf.repartitionByRange(self._jcols(*cols)), 
self.sql_ctx)
+else:
+raise TypeError("numPartitions should be an int, string or Column")
+
 @since(1.3)
 def distinct(self):
 """Returns a new :class:`DataFrame` containing the distinct rows in 
this :class:`DataFrame`.

http://git-wip-us.apache.org/repos/asf/spark/blob/eacb62fb/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 4bc59fd..fe89bd0 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -2148,6 +2148,34 @@ class SQLTests(ReusedSQLTestCase):
 result = df.select(functions.expr("length(a)")).collect()[0].asDict()
 self.assertEqual(13, result["length(a)"])
 
+def test_repartitionByRange_dataframe(self):
+schema = StructType([
+StructField("name", StringType(), True),
+StructField("age", IntegerType(), True),
+StructField("height", DoubleType(), True)])
+
+df1 = self.spark.createDataFrame(
+[(u'Bob', 27, 66.0), (u'Alice', 10, 10.0), (u'Bob', 10, 66.0)], 
schema)
+df2 = self.spark.createDataFrame(
+[(u'Alice', 10, 10.0), (u'Bob', 10, 66.0), (u'Bob', 27, 66.0)], 
schema)
+
+# test repartitionByRange(numPartitions, *cols)
+df3 = df1.repartitionByRange(2, "name", "age")
+self.assertEqual(df3.rdd.getNumPartitions(), 2)
+

spark git commit: [SPARK-23314][PYTHON] Add ambiguous=False when localizing tz-naive timestamps in Arrow codepath to deal with dst

2018-02-11 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 0783876c8 -> a34fce19b


[SPARK-23314][PYTHON] Add ambiguous=False when localizing tz-naive timestamps 
in Arrow codepath to deal with dst

## What changes were proposed in this pull request?
When tz_localize is called on a tz-naive timestamp, pandas throws an exception if 
the timestamp falls within the daylight saving time transition, e.g., 
`2015-11-01 01:30:00`. This PR fixes the issue by setting `ambiguous=False` when 
calling tz_localize, which matches the default behavior of pytz.

## How was this patch tested?
Add `test_timestamp_dst`

Author: Li Jin 

Closes #20537 from icexelloss/SPARK-23314.
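
A standalone pandas illustration of the ambiguity being worked around (no Spark 
required): 2015-11-01 01:30 occurs twice in America/Los_Angeles because clocks 
fall back at 2:00 am.

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2015-11-01 01:30:00"]))

try:
    # Default is ambiguous='raise', so the repeated wall-clock time errors out.
    s.dt.tz_localize("America/Los_Angeles")
except Exception as e:
    print(type(e).__name__)  # AmbiguousTimeError

# ambiguous=False picks the standard-time (non-DST) instant, matching pytz's
# default and the behavior now used in the Arrow code path.
print(s.dt.tz_localize("America/Los_Angeles", ambiguous=False))
```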


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a34fce19
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a34fce19
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a34fce19

Branch: refs/heads/master
Commit: a34fce19bc0ee5a7e36c6ecba75d2aeb70fdcbc7
Parents: 0783876
Author: Li Jin 
Authored: Sun Feb 11 17:31:35 2018 +0900
Committer: hyukjinkwon 
Committed: Sun Feb 11 17:31:35 2018 +0900

--
 python/pyspark/sql/tests.py | 39 +++
 python/pyspark/sql/types.py | 37 ++---
 2 files changed, 73 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a34fce19/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 1087c3f..4bc59fd 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -3670,6 +3670,21 @@ class ArrowTests(ReusedSQLTestCase):
 self.assertEqual(pdf_col_names, df.columns)
 self.assertEqual(pdf_col_names, df_arrow.columns)
 
+# Regression test for SPARK-23314
+def test_timestamp_dst(self):
+import pandas as pd
+# Daylight saving time for Los Angeles for 2015 is Sun, Nov 1 at 2:00 
am
+dt = [datetime.datetime(2015, 11, 1, 0, 30),
+  datetime.datetime(2015, 11, 1, 1, 30),
+  datetime.datetime(2015, 11, 1, 2, 30)]
+pdf = pd.DataFrame({'time': dt})
+
+df_from_python = self.spark.createDataFrame(dt, 
'timestamp').toDF('time')
+df_from_pandas = self.spark.createDataFrame(pdf)
+
+self.assertPandasEqual(pdf, df_from_python.toPandas())
+self.assertPandasEqual(pdf, df_from_pandas.toPandas())
+
 
 @unittest.skipIf(
 not _have_pandas or not _have_pyarrow,
@@ -4311,6 +4326,18 @@ class ScalarPandasUDFTests(ReusedSQLTestCase):
 self.assertEquals(expected.collect(), res1.collect())
 self.assertEquals(expected.collect(), res2.collect())
 
+# Regression test for SPARK-23314
+def test_timestamp_dst(self):
+from pyspark.sql.functions import pandas_udf
+# Daylight saving time for Los Angeles for 2015 is Sun, Nov 1 at 2:00 
am
+dt = [datetime.datetime(2015, 11, 1, 0, 30),
+  datetime.datetime(2015, 11, 1, 1, 30),
+  datetime.datetime(2015, 11, 1, 2, 30)]
+df = self.spark.createDataFrame(dt, 'timestamp').toDF('time')
+foo_udf = pandas_udf(lambda x: x, 'timestamp')
+result = df.withColumn('time', foo_udf(df.time))
+self.assertEquals(df.collect(), result.collect())
+
 
 @unittest.skipIf(
 not _have_pandas or not _have_pyarrow,
@@ -4482,6 +4509,18 @@ class GroupedMapPandasUDFTests(ReusedSQLTestCase):
 with self.assertRaisesRegexp(Exception, 'Unsupported data type'):
 df.groupby('id').apply(f).collect()
 
+# Regression test for SPARK-23314
+def test_timestamp_dst(self):
+from pyspark.sql.functions import pandas_udf, PandasUDFType
+# Daylight saving time for Los Angeles for 2015 is Sun, Nov 1 at 2:00 
am
+dt = [datetime.datetime(2015, 11, 1, 0, 30),
+  datetime.datetime(2015, 11, 1, 1, 30),
+  datetime.datetime(2015, 11, 1, 2, 30)]
+df = self.spark.createDataFrame(dt, 'timestamp').toDF('time')
+foo_udf = pandas_udf(lambda pdf: pdf, 'time timestamp', 
PandasUDFType.GROUPED_MAP)
+result = df.groupby('time').apply(foo_udf).sort('time')
+self.assertPandasEqual(df.toPandas(), result.toPandas())
+
 
 @unittest.skipIf(
 not _have_pandas or not _have_pyarrow,

http://git-wip-us.apache.org/repos/asf/spark/blob/a34fce19/python/pyspark/sql/types.py
--
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 2599dc5..f7141b4 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -1759,8 +1759,38 @@ def 

spark git commit: [SPARK-23314][PYTHON] Add ambiguous=False when localizing tz-naive timestamps in Arrow codepath to deal with dst

2018-02-11 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 b7571b9bf -> 9fa7b0e10


[SPARK-23314][PYTHON] Add ambiguous=False when localizing tz-naive timestamps 
in Arrow codepath to deal with dst

## What changes were proposed in this pull request?
When tz_localize is called on a tz-naive timestamp, pandas throws an exception if 
the timestamp falls within the daylight saving time transition, e.g., 
`2015-11-01 01:30:00`. This PR fixes the issue by setting `ambiguous=False` when 
calling tz_localize, which matches the default behavior of pytz.

## How was this patch tested?
Add `test_timestamp_dst`

Author: Li Jin 

Closes #20537 from icexelloss/SPARK-23314.

(cherry picked from commit a34fce19bc0ee5a7e36c6ecba75d2aeb70fdcbc7)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9fa7b0e1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9fa7b0e1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9fa7b0e1

Branch: refs/heads/branch-2.3
Commit: 9fa7b0e107c283557648160195ce179077752e4c
Parents: b7571b9
Author: Li Jin 
Authored: Sun Feb 11 17:31:35 2018 +0900
Committer: hyukjinkwon 
Committed: Sun Feb 11 17:31:48 2018 +0900

--
 python/pyspark/sql/tests.py | 39 +++
 python/pyspark/sql/types.py | 37 ++---
 2 files changed, 73 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9fa7b0e1/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 95a057d..0f76c96 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -3644,6 +3644,21 @@ class ArrowTests(ReusedSQLTestCase):
 self.assertEqual(pdf_col_names, df.columns)
 self.assertEqual(pdf_col_names, df_arrow.columns)
 
+# Regression test for SPARK-23314
+def test_timestamp_dst(self):
+import pandas as pd
+# Daylight saving time for Los Angeles for 2015 is Sun, Nov 1 at 2:00 
am
+dt = [datetime.datetime(2015, 11, 1, 0, 30),
+  datetime.datetime(2015, 11, 1, 1, 30),
+  datetime.datetime(2015, 11, 1, 2, 30)]
+pdf = pd.DataFrame({'time': dt})
+
+df_from_python = self.spark.createDataFrame(dt, 
'timestamp').toDF('time')
+df_from_pandas = self.spark.createDataFrame(pdf)
+
+self.assertPandasEqual(pdf, df_from_python.toPandas())
+self.assertPandasEqual(pdf, df_from_pandas.toPandas())
+
 
 @unittest.skipIf(
 not _have_pandas or not _have_pyarrow,
@@ -4285,6 +4300,18 @@ class ScalarPandasUDFTests(ReusedSQLTestCase):
 self.assertEquals(expected.collect(), res1.collect())
 self.assertEquals(expected.collect(), res2.collect())
 
+# Regression test for SPARK-23314
+def test_timestamp_dst(self):
+from pyspark.sql.functions import pandas_udf
+# Daylight saving time for Los Angeles for 2015 is Sun, Nov 1 at 2:00 
am
+dt = [datetime.datetime(2015, 11, 1, 0, 30),
+  datetime.datetime(2015, 11, 1, 1, 30),
+  datetime.datetime(2015, 11, 1, 2, 30)]
+df = self.spark.createDataFrame(dt, 'timestamp').toDF('time')
+foo_udf = pandas_udf(lambda x: x, 'timestamp')
+result = df.withColumn('time', foo_udf(df.time))
+self.assertEquals(df.collect(), result.collect())
+
 
 @unittest.skipIf(
 not _have_pandas or not _have_pyarrow,
@@ -4462,6 +4489,18 @@ class GroupedMapPandasUDFTests(ReusedSQLTestCase):
 with self.assertRaisesRegexp(Exception, 'Unsupported data type'):
 df.groupby('id').apply(f).collect()
 
+# Regression test for SPARK-23314
+def test_timestamp_dst(self):
+from pyspark.sql.functions import pandas_udf, PandasUDFType
+# Daylight saving time for Los Angeles for 2015 is Sun, Nov 1 at 2:00 
am
+dt = [datetime.datetime(2015, 11, 1, 0, 30),
+  datetime.datetime(2015, 11, 1, 1, 30),
+  datetime.datetime(2015, 11, 1, 2, 30)]
+df = self.spark.createDataFrame(dt, 'timestamp').toDF('time')
+foo_udf = pandas_udf(lambda pdf: pdf, 'time timestamp', 
PandasUDFType.GROUPED_MAP)
+result = df.groupby('time').apply(foo_udf).sort('time')
+self.assertPandasEqual(df.toPandas(), result.toPandas())
+
 
 if __name__ == "__main__":
 from pyspark.sql.tests import *

http://git-wip-us.apache.org/repos/asf/spark/blob/9fa7b0e1/python/pyspark/sql/types.py
--
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 2599dc5..f7141b4 100644
--- 

spark git commit: [SPARK-20090][FOLLOW-UP] Revert the deprecation of `names` in PySpark

2018-02-12 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master f17b936f0 -> 407f67249


[SPARK-20090][FOLLOW-UP] Revert the deprecation of `names` in PySpark

## What changes were proposed in this pull request?
Deprecating the `names` attribute in PySpark was not intended. This PR reverts 
the change.

## How was this patch tested?
N/A

Author: gatorsmile 

Closes #20595 from gatorsmile/removeDeprecate.
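
For reference, a minimal sketch showing that both accessors remain available 
after the revert (only pyspark's type classes are needed, no running session):

```python
from pyspark.sql.types import StructType, StructField, StringType

struct = StructType([StructField("f1", StringType(), True)])
print(struct.names)         # ['f1'] -- the attribute whose deprecation note is reverted
print(struct.fieldNames())  # ['f1'] -- the equivalent method
```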


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/407f6724
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/407f6724
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/407f6724

Branch: refs/heads/master
Commit: 407f67249639709c40c46917700ed6dd736daa7d
Parents: f17b936
Author: gatorsmile 
Authored: Tue Feb 13 15:05:13 2018 +0900
Committer: hyukjinkwon 
Committed: Tue Feb 13 15:05:13 2018 +0900

--
 python/pyspark/sql/types.py | 3 ---
 1 file changed, 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/407f6724/python/pyspark/sql/types.py
--
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index e25941c..cd85740 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -455,9 +455,6 @@ class StructType(DataType):
 Iterating a :class:`StructType` will iterate its :class:`StructField`\\s.
 A contained :class:`StructField` can be accessed by name or position.
 
-.. note:: `names` attribute is deprecated in 2.3. Use `fieldNames` method 
instead
-to get a list of field names.
-
 >>> struct1 = StructType([StructField("f1", StringType(), True)])
 >>> struct1["f1"]
 StructField(f1,StringType,true)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-20090][FOLLOW-UP] Revert the deprecation of `names` in PySpark

2018-02-12 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 43f5e4067 -> 3737c3d32


[SPARK-20090][FOLLOW-UP] Revert the deprecation of `names` in PySpark

## What changes were proposed in this pull request?
Deprecating the `names` attribute in PySpark was not intended. This PR reverts 
the change.

## How was this patch tested?
N/A

Author: gatorsmile 

Closes #20595 from gatorsmile/removeDeprecate.

(cherry picked from commit 407f67249639709c40c46917700ed6dd736daa7d)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3737c3d3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3737c3d3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3737c3d3

Branch: refs/heads/branch-2.3
Commit: 3737c3d32bb92e73cadaf3b1b9759d9be00b288d
Parents: 43f5e40
Author: gatorsmile 
Authored: Tue Feb 13 15:05:13 2018 +0900
Committer: hyukjinkwon 
Committed: Tue Feb 13 15:05:33 2018 +0900

--
 python/pyspark/sql/types.py | 3 ---
 1 file changed, 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3737c3d3/python/pyspark/sql/types.py
--
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index e25941c..cd85740 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -455,9 +455,6 @@ class StructType(DataType):
 Iterating a :class:`StructType` will iterate its :class:`StructField`\\s.
 A contained :class:`StructField` can be accessed by name or position.
 
-.. note:: `names` attribute is deprecated in 2.3. Use `fieldNames` method 
instead
-to get a list of field names.
-
 >>> struct1 = StructType([StructField("f1", StringType(), True)])
 >>> struct1["f1"]
 StructField(f1,StringType,true)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-23360][SQL][PYTHON] Get local timezone from environment via pytz, or dateutil.

2018-02-10 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 f3a9a7f6b -> b7571b9bf


[SPARK-23360][SQL][PYTHON] Get local timezone from environment via pytz, or 
dateutil.

## What changes were proposed in this pull request?

Currently we use `tzlocal()` to get the Python local timezone, but it sometimes 
causes unexpected behavior.
This PR changes the lookup to use pytz when the timezone is specified in the `TZ` 
environment variable, and otherwise to read the system timezone file via dateutil.

## How was this patch tested?

Added a test and existing tests.

Author: Takuya UESHIN 

Closes #20559 from ueshin/issues/SPARK-23360/master.

(cherry picked from commit 97a224a855c4410b2dfb9c0bcc6aae583bd28e92)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b7571b9b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b7571b9b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b7571b9b

Branch: refs/heads/branch-2.3
Commit: b7571b9bfcced2e08f87e815c2ea9474bfd5fe2a
Parents: f3a9a7f
Author: Takuya UESHIN 
Authored: Sun Feb 11 01:08:02 2018 +0900
Committer: hyukjinkwon 
Committed: Sun Feb 11 01:08:16 2018 +0900

--
 python/pyspark/sql/tests.py | 28 
 python/pyspark/sql/types.py | 23 +++
 2 files changed, 47 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b7571b9b/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 59e08c7..95a057d 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -2862,6 +2862,34 @@ class SQLTests(ReusedSQLTestCase):
 "d": [pd.Timestamp.now().date()]})
 self.spark.createDataFrame(pdf)
 
+# Regression test for SPARK-23360
+@unittest.skipIf(not _have_pandas, _pandas_requirement_message)
+def test_create_dateframe_from_pandas_with_dst(self):
+import pandas as pd
+from datetime import datetime
+
+pdf = pd.DataFrame({'time': [datetime(2015, 10, 31, 22, 30)]})
+
+df = self.spark.createDataFrame(pdf)
+self.assertPandasEqual(pdf, df.toPandas())
+
+orig_env_tz = os.environ.get('TZ', None)
+orig_session_tz = self.spark.conf.get('spark.sql.session.timeZone')
+try:
+tz = 'America/Los_Angeles'
+os.environ['TZ'] = tz
+time.tzset()
+self.spark.conf.set('spark.sql.session.timeZone', tz)
+
+df = self.spark.createDataFrame(pdf)
+self.assertPandasEqual(pdf, df.toPandas())
+finally:
+del os.environ['TZ']
+if orig_env_tz is not None:
+os.environ['TZ'] = orig_env_tz
+time.tzset()
+self.spark.conf.set('spark.sql.session.timeZone', orig_session_tz)
+
 
 class HiveSparkSubmitTests(SparkSubmitTests):
 

http://git-wip-us.apache.org/repos/asf/spark/blob/b7571b9b/python/pyspark/sql/types.py
--
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 093dae5..2599dc5 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -1709,6 +1709,21 @@ def _check_dataframe_convert_date(pdf, schema):
 return pdf
 
 
+def _get_local_timezone():
+""" Get local timezone using pytz with environment variable, or dateutil.
+
+If there is a 'TZ' environment variable, pass it to pandas to use pytz and 
use it as timezone
+string, otherwise use the special word 'dateutil/:' which means that 
pandas uses dateutil and
+it reads system configuration to know the system local timezone.
+
+See also:
+- https://github.com/pandas-dev/pandas/blob/0.19.x/pandas/tslib.pyx#L1753
+- https://github.com/dateutil/dateutil/blob/2.6.1/dateutil/tz/tz.py#L1338
+"""
+import os
+return os.environ.get('TZ', 'dateutil/:')
+
+
 def _check_dataframe_localize_timestamps(pdf, timezone):
 """
 Convert timezone aware timestamps to timezone-naive in the specified 
timezone or local timezone
@@ -1721,7 +1736,7 @@ def _check_dataframe_localize_timestamps(pdf, timezone):
 require_minimum_pandas_version()
 
 from pandas.api.types import is_datetime64tz_dtype
-tz = timezone or 'tzlocal()'
+tz = timezone or _get_local_timezone()
 for column, series in pdf.iteritems():
 # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
 if is_datetime64tz_dtype(series.dtype):
@@ -1744,7 +1759,7 @@ def _check_series_convert_timestamps_internal(s, 
timezone):
 from pandas.api.types 

spark git commit: [SPARK-23360][SQL][PYTHON] Get local timezone from environment via pytz, or dateutil.

2018-02-10 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 6d7c38330 -> 97a224a85


[SPARK-23360][SQL][PYTHON] Get local timezone from environment via pytz, or 
dateutil.

## What changes were proposed in this pull request?

Currently we use `tzlocal()` to get the Python local timezone, but it sometimes 
causes unexpected behavior.
This PR changes the lookup to use pytz when the timezone is specified in the `TZ` 
environment variable, and otherwise to read the system timezone file via dateutil.

## How was this patch tested?

Added a test and existing tests.

Author: Takuya UESHIN 

Closes #20559 from ueshin/issues/SPARK-23360/master.
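
A standalone sketch of the lookup order described above, assuming only pandas and 
dateutil are installed: prefer the `TZ` environment variable (a pytz-style name 
such as 'America/Los_Angeles'), otherwise fall back to `'dateutil/:'`, which asks 
pandas to resolve the system local timezone through dateutil.

```python
import os
import pandas as pd

# Same fallback as the new _get_local_timezone() helper.
tz = os.environ.get('TZ', 'dateutil/:')
print("resolved timezone string:", tz)

s = pd.Series(pd.to_datetime(['2018-02-10 12:00:00']))
print(s.dt.tz_localize(tz))
```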


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/97a224a8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/97a224a8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/97a224a8

Branch: refs/heads/master
Commit: 97a224a855c4410b2dfb9c0bcc6aae583bd28e92
Parents: 6d7c383
Author: Takuya UESHIN 
Authored: Sun Feb 11 01:08:02 2018 +0900
Committer: hyukjinkwon 
Committed: Sun Feb 11 01:08:02 2018 +0900

--
 python/pyspark/sql/tests.py | 28 
 python/pyspark/sql/types.py | 23 +++
 2 files changed, 47 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/97a224a8/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 6ace169..1087c3f 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -2868,6 +2868,34 @@ class SQLTests(ReusedSQLTestCase):
 "d": [pd.Timestamp.now().date()]})
 self.spark.createDataFrame(pdf)
 
+# Regression test for SPARK-23360
+@unittest.skipIf(not _have_pandas, _pandas_requirement_message)
+def test_create_dateframe_from_pandas_with_dst(self):
+import pandas as pd
+from datetime import datetime
+
+pdf = pd.DataFrame({'time': [datetime(2015, 10, 31, 22, 30)]})
+
+df = self.spark.createDataFrame(pdf)
+self.assertPandasEqual(pdf, df.toPandas())
+
+orig_env_tz = os.environ.get('TZ', None)
+orig_session_tz = self.spark.conf.get('spark.sql.session.timeZone')
+try:
+tz = 'America/Los_Angeles'
+os.environ['TZ'] = tz
+time.tzset()
+self.spark.conf.set('spark.sql.session.timeZone', tz)
+
+df = self.spark.createDataFrame(pdf)
+self.assertPandasEqual(pdf, df.toPandas())
+finally:
+del os.environ['TZ']
+if orig_env_tz is not None:
+os.environ['TZ'] = orig_env_tz
+time.tzset()
+self.spark.conf.set('spark.sql.session.timeZone', orig_session_tz)
+
 
 class HiveSparkSubmitTests(SparkSubmitTests):
 

http://git-wip-us.apache.org/repos/asf/spark/blob/97a224a8/python/pyspark/sql/types.py
--
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 093dae5..2599dc5 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -1709,6 +1709,21 @@ def _check_dataframe_convert_date(pdf, schema):
 return pdf
 
 
+def _get_local_timezone():
+""" Get local timezone using pytz with environment variable, or dateutil.
+
+If there is a 'TZ' environment variable, pass it to pandas to use pytz and 
use it as timezone
+string, otherwise use the special word 'dateutil/:' which means that 
pandas uses dateutil and
+it reads system configuration to know the system local timezone.
+
+See also:
+- https://github.com/pandas-dev/pandas/blob/0.19.x/pandas/tslib.pyx#L1753
+- https://github.com/dateutil/dateutil/blob/2.6.1/dateutil/tz/tz.py#L1338
+"""
+import os
+return os.environ.get('TZ', 'dateutil/:')
+
+
 def _check_dataframe_localize_timestamps(pdf, timezone):
 """
 Convert timezone aware timestamps to timezone-naive in the specified 
timezone or local timezone
@@ -1721,7 +1736,7 @@ def _check_dataframe_localize_timestamps(pdf, timezone):
 require_minimum_pandas_version()
 
 from pandas.api.types import is_datetime64tz_dtype
-tz = timezone or 'tzlocal()'
+tz = timezone or _get_local_timezone()
 for column, series in pdf.iteritems():
 # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
 if is_datetime64tz_dtype(series.dtype):
@@ -1744,7 +1759,7 @@ def _check_series_convert_timestamps_internal(s, 
timezone):
 from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
 # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
   

spark git commit: [SPARK-23300][TESTS] Prints out if Pandas and PyArrow are installed or not in PySpark SQL tests

2018-02-05 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master a24c03138 -> 8141c3e3d


[SPARK-23300][TESTS] Prints out if Pandas and PyArrow are installed or not in 
PySpark SQL tests

## What changes were proposed in this pull request?

This PR proposes to log whether PyArrow and Pandas are installed, so we can tell 
whether the related tests are going to be skipped.

## How was this patch tested?

Manually tested:

I don't have PyArrow installed in PyPy.
```bash
$ ./run-tests --python-executables=python3
```

```
...
Will test against the following Python executables: ['python3']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 
'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Will test PyArrow related features against Python executable 'python3' in 
'pyspark-sql' module.
Will test Pandas related features against Python executable 'python3' in 
'pyspark-sql' module.
Starting test(python3): pyspark.mllib.tests
Starting test(python3): pyspark.sql.tests
Starting test(python3): pyspark.streaming.tests
Starting test(python3): pyspark.tests
```

```bash
$ ./run-tests --modules=pyspark-streaming
```

```
...
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python modules: ['pyspark-streaming']
Starting test(pypy): pyspark.streaming.tests
Starting test(pypy): pyspark.streaming.util
Starting test(python2.7): pyspark.streaming.tests
Starting test(python2.7): pyspark.streaming.util
```

```bash
$ ./run-tests
```

```
...
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 
'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Will test PyArrow related features against Python executable 'python2.7' in 
'pyspark-sql' module.
Will test Pandas related features against Python executable 'python2.7' in 
'pyspark-sql' module.
Will skip PyArrow related features against Python executable 'pypy' in 
'pyspark-sql' module. PyArrow >= 0.8.0 is required; however, PyArrow was not 
found.
Will test Pandas related features against Python executable 'pypy' in 
'pyspark-sql' module.
Starting test(pypy): pyspark.streaming.tests
Starting test(pypy): pyspark.sql.tests
Starting test(pypy): pyspark.tests
Starting test(python2.7): pyspark.mllib.tests
```

```bash
$ ./run-tests --modules=pyspark-sql --python-executables=pypy
```

```
...
Will test against the following Python executables: ['pypy']
Will test the following Python modules: ['pyspark-sql']
Will skip PyArrow related features against Python executable 'pypy' in 
'pyspark-sql' module. PyArrow >= 0.8.0 is required; however, PyArrow was not 
found.
Will test Pandas related features against Python executable 'pypy' in 
'pyspark-sql' module.
Starting test(pypy): pyspark.sql.tests
Starting test(pypy): pyspark.sql.catalog
Starting test(pypy): pyspark.sql.column
Starting test(pypy): pyspark.sql.conf
```

After some modification to produce other cases:

```bash
$ ./run-tests
```

```
...
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 
'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Will skip PyArrow related features against Python executable 'python2.7' in 
'pyspark-sql' module. PyArrow >= 20.0.0 is required; however, PyArrow 0.8.0 was 
found.
Will skip Pandas related features against Python executable 'python2.7' in 
'pyspark-sql' module. Pandas >= 20.0.0 is required; however, Pandas 0.20.2 was 
found.
Will skip PyArrow related features against Python executable 'pypy' in 
'pyspark-sql' module. PyArrow >= 20.0.0 is required; however, PyArrow was not 
found.
Will skip Pandas related features against Python executable 'pypy' in 
'pyspark-sql' module. Pandas >= 20.0.0 is required; however, Pandas 0.22.0 was 
found.
Starting test(pypy): pyspark.sql.tests
Starting test(pypy): pyspark.streaming.tests
Starting test(pypy): pyspark.tests
Starting test(python2.7): pyspark.mllib.tests
```

```bash
./run-tests-with-coverage
```
```
...
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 
'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Will test PyArrow related features against Python executable 'python2.7' in 
'pyspark-sql' module.
Will test Pandas related features against Python executable 'python2.7' in 
'pyspark-sql' module.
Coverage is not installed in Python executable 'pypy' but 
'COVERAGE_PROCESS_START' environment variable is set, exiting.
```

Author: hyukjinkwon 

Closes #20473 from HyukjinKwon/SPARK-23300.
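
A hedged sketch of the reporting idea; the function name and messages below are 
illustrative, not the test script's actual code. It probes each optional 
dependency and prints whether the related tests would run or be skipped.

```python
def probe(module_name, minimum_version):
    # Try to import the optional dependency and report the outcome.
    try:
        module = __import__(module_name)
        version = getattr(module, "__version__", "unknown")
        print("Will test %s related features (version %s found)."
              % (module_name, version))
    except ImportError:
        print("Will skip %s related features. %s >= %s is required; "
              "however, it was not found."
              % (module_name, module_name, minimum_version))

probe("pandas", "0.19.2")
probe("pyarrow", "0.8.0")
```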


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8141c3e3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8141c3e3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8141c3e3

Branch: refs/heads/master
Commit: 

spark git commit: [SPARK-23122][PYSPARK][FOLLOWUP] Replace registerTempTable by createOrReplaceTempView

2018-02-07 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master c36fecc3b -> 9775df67f


[SPARK-23122][PYSPARK][FOLLOWUP] Replace registerTempTable by 
createOrReplaceTempView

## What changes were proposed in this pull request?
Replace `registerTempTable` by `createOrReplaceTempView`.

## How was this patch tested?
N/A

Author: gatorsmile 

Closes #20523 from gatorsmile/updateExamples.
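
For reference, a minimal sketch of the non-deprecated call used in the updated 
examples (toy data, made-up view name):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "name"])

# createOrReplaceTempView replaces the deprecated registerTempTable.
df.createOrReplaceTempView("people")
spark.sql("SELECT name, count(*) AS n FROM people GROUP BY name").show()
```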


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9775df67
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9775df67
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9775df67

Branch: refs/heads/master
Commit: 9775df67f924663598d51723a878557ddafb8cfd
Parents: c36fecc
Author: gatorsmile 
Authored: Wed Feb 7 23:24:16 2018 +0900
Committer: hyukjinkwon 
Committed: Wed Feb 7 23:24:16 2018 +0900

--
 python/pyspark/sql/udf.py  | 2 +-
 .../src/test/java/test/org/apache/spark/sql/JavaUDAFSuite.java | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9775df67/python/pyspark/sql/udf.py
--
diff --git a/python/pyspark/sql/udf.py b/python/pyspark/sql/udf.py
index 0f759c4..08c6b9e 100644
--- a/python/pyspark/sql/udf.py
+++ b/python/pyspark/sql/udf.py
@@ -356,7 +356,7 @@ class UDFRegistration(object):
 
 >>> spark.udf.registerJavaUDAF("javaUDAF", 
"test.org.apache.spark.sql.MyDoubleAvg")
 >>> df = spark.createDataFrame([(1, "a"),(2, "b"), (3, "a")],["id", 
"name"])
->>> df.registerTempTable("df")
+>>> df.createOrReplaceTempView("df")
 >>> spark.sql("SELECT name, javaUDAF(id) as avg from df group by 
name").collect()
 [Row(name=u'b', avg=102.0), Row(name=u'a', avg=102.0)]
 """

http://git-wip-us.apache.org/repos/asf/spark/blob/9775df67/sql/core/src/test/java/test/org/apache/spark/sql/JavaUDAFSuite.java
--
diff --git 
a/sql/core/src/test/java/test/org/apache/spark/sql/JavaUDAFSuite.java 
b/sql/core/src/test/java/test/org/apache/spark/sql/JavaUDAFSuite.java
index ddbaa45..08dc129 100644
--- a/sql/core/src/test/java/test/org/apache/spark/sql/JavaUDAFSuite.java
+++ b/sql/core/src/test/java/test/org/apache/spark/sql/JavaUDAFSuite.java
@@ -46,7 +46,7 @@ public class JavaUDAFSuite {
   @SuppressWarnings("unchecked")
   @Test
   public void udf1Test() {
-spark.range(1, 10).toDF("value").registerTempTable("df");
+spark.range(1, 10).toDF("value").createOrReplaceTempView("df");
 spark.udf().registerJavaUDAF("myDoubleAvg", MyDoubleAvg.class.getName());
 Row result = spark.sql("SELECT myDoubleAvg(value) as my_avg from 
df").head();
 Assert.assertEquals(105.0, result.getDouble(0), 1.0e-6);


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-23122][PYSPARK][FOLLOWUP] Replace registerTempTable by createOrReplaceTempView

2018-02-07 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 874d3f89f -> cb22e830b


[SPARK-23122][PYSPARK][FOLLOWUP] Replace registerTempTable by 
createOrReplaceTempView

## What changes were proposed in this pull request?
Replace `registerTempTable` by `createOrReplaceTempView`.

## How was this patch tested?
N/A

Author: gatorsmile 

Closes #20523 from gatorsmile/updateExamples.

(cherry picked from commit 9775df67f924663598d51723a878557ddafb8cfd)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cb22e830
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cb22e830
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cb22e830

Branch: refs/heads/branch-2.3
Commit: cb22e830b0af3f2d760beffea9a79a6d349e4661
Parents: 874d3f8
Author: gatorsmile 
Authored: Wed Feb 7 23:24:16 2018 +0900
Committer: hyukjinkwon 
Committed: Wed Feb 7 23:24:30 2018 +0900

--
 python/pyspark/sql/udf.py  | 2 +-
 .../src/test/java/test/org/apache/spark/sql/JavaUDAFSuite.java | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/cb22e830/python/pyspark/sql/udf.py
--
diff --git a/python/pyspark/sql/udf.py b/python/pyspark/sql/udf.py
index 82a28c8..5a848c2 100644
--- a/python/pyspark/sql/udf.py
+++ b/python/pyspark/sql/udf.py
@@ -348,7 +348,7 @@ class UDFRegistration(object):
 
 >>> spark.udf.registerJavaUDAF("javaUDAF", 
"test.org.apache.spark.sql.MyDoubleAvg")
 >>> df = spark.createDataFrame([(1, "a"),(2, "b"), (3, "a")],["id", 
"name"])
->>> df.registerTempTable("df")
+>>> df.createOrReplaceTempView("df")
 >>> spark.sql("SELECT name, javaUDAF(id) as avg from df group by 
name").collect()
 [Row(name=u'b', avg=102.0), Row(name=u'a', avg=102.0)]
 """

http://git-wip-us.apache.org/repos/asf/spark/blob/cb22e830/sql/core/src/test/java/test/org/apache/spark/sql/JavaUDAFSuite.java
--
diff --git 
a/sql/core/src/test/java/test/org/apache/spark/sql/JavaUDAFSuite.java 
b/sql/core/src/test/java/test/org/apache/spark/sql/JavaUDAFSuite.java
index ddbaa45..08dc129 100644
--- a/sql/core/src/test/java/test/org/apache/spark/sql/JavaUDAFSuite.java
+++ b/sql/core/src/test/java/test/org/apache/spark/sql/JavaUDAFSuite.java
@@ -46,7 +46,7 @@ public class JavaUDAFSuite {
   @SuppressWarnings("unchecked")
   @Test
   public void udf1Test() {
-spark.range(1, 10).toDF("value").registerTempTable("df");
+spark.range(1, 10).toDF("value").createOrReplaceTempView("df");
 spark.udf().registerJavaUDAF("myDoubleAvg", MyDoubleAvg.class.getName());
 Row result = spark.sql("SELECT myDoubleAvg(value) as my_avg from 
df").head();
 Assert.assertEquals(105.0, result.getDouble(0), 1.0e-6);


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-23319][TESTS] Explicitly specify Pandas and PyArrow versions in PySpark tests (to skip or test)

2018-02-07 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 9775df67f -> 71cfba04a


[SPARK-23319][TESTS] Explicitly specify Pandas and PyArrow versions in PySpark 
tests (to skip or test)

## What changes were proposed in this pull request?

This PR proposes to explicitly specify the Pandas and PyArrow versions required 
by PySpark tests, so each test is run or skipped accordingly.

We declared the extra dependencies:

https://github.com/apache/spark/blob/b8bfce51abf28c66ba1fc67b0f25fe1617c81025/python/setup.py#L204

In case of PyArrow:

Currently we only check whether pyarrow is installed, without checking its 
version, so tests already fail to run. For example, if PyArrow 0.7.0 is 
installed:

```
==
ERROR: test_vectorized_udf_wrong_return_type (pyspark.sql.tests.ScalarPandasUDF)
--
Traceback (most recent call last):
  File "/.../spark/python/pyspark/sql/tests.py", line 4019, in 
test_vectorized_udf_wrong_return_type
f = pandas_udf(lambda x: x * 1.0, MapType(LongType(), LongType()))
  File "/.../spark/python/pyspark/sql/functions.py", line 2309, in pandas_udf
return _create_udf(f=f, returnType=return_type, evalType=eval_type)
  File "/.../spark/python/pyspark/sql/udf.py", line 47, in _create_udf
require_minimum_pyarrow_version()
  File "/.../spark/python/pyspark/sql/utils.py", line 132, in 
require_minimum_pyarrow_version
"however, your version was %s." % pyarrow.__version__)
ImportError: pyarrow >= 0.8.0 must be installed on calling Python process; 
however, your version was 0.7.0.

--
Ran 33 tests in 8.098s

FAILED (errors=33)
```

In case of Pandas:

There are a few tests for old Pandas that were exercised only when the Pandas 
version was lower; I rewrote them so they are exercised both when the Pandas 
version is lower and when Pandas is missing.

## How was this patch tested?

Manually tested by modifying the condition:

```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... 
skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... 
skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) 
... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 
0.19.2.'
```

```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... 
skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... 
skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) 
... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
```

```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... 
skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... 
skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) 
... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 
0.8.0.'
```

```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... 
skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... 
skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) 
... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
```

Author: hyukjinkwon 

Closes #20487 from HyukjinKwon/pyarrow-pandas-skip.
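
A hedged sketch of a minimum-version gate like the ones these tests now rely on; 
the helper name below is illustrative (the real checks live in 
`pyspark.sql.utils`).

```python
from distutils.version import LooseVersion

def require_minimum_version(module_name, minimum):
    # Import the dependency and compare its reported version to the minimum.
    module = __import__(module_name)
    if LooseVersion(module.__version__) < LooseVersion(minimum):
        raise ImportError(
            "%s >= %s must be installed; however, your version was %s."
            % (module_name, minimum, module.__version__))

require_minimum_version("pandas", "0.19.2")
require_minimum_version("pyarrow", "0.8.0")
```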


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/71cfba04
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/71cfba04
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/71cfba04

Branch: refs/heads/master
Commit: 71cfba04aeec5ae9b85a507b13996e80f8750edc
Parents: 9775df6
Author: hyukjinkwon 
Authored: Wed Feb 7 23:28:10 2018 +0900
Committer: hyukjinkwon 
Committed: Wed Feb 7 23:28:10 2018 +0900

--
 pom.xml |  4 ++
 python/pyspark/sql/dataframe.py |  3 ++
 python/pyspark/sql/session.py   |  3 ++
 python/pyspark/sql/tests.py | 87 
 python/pyspark/sql/utils.py | 30 +
 python/setup.py | 10 -
 6 files changed, 89 insertions(+), 48 deletions(-)

spark git commit: [SPARK-23240][PYTHON] Better error message when extraneous data in pyspark.daemon's stdout

2018-02-20 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master aadf9535b -> 862fa697d


[SPARK-23240][PYTHON] Better error message when extraneous data in 
pyspark.daemon's stdout

## What changes were proposed in this pull request?

Print a more helpful message when the daemon module's stdout is empty or contains 
a bad port number.

## How was this patch tested?

Manually recreated the environmental issues that caused the mysterious 
exceptions at one site. Tested that the expected messages are logged.

Also, ran all scala unit tests.

Please review http://spark.apache.org/contributing.html before opening a pull 
request.

Author: Bruce Robbins 

Closes #20424 from bersprockets/SPARK-23240_prop2.
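
A standalone Python sketch of the failure mode being diagnosed, under the 
assumption that the daemon's first four stdout bytes should be a big-endian port 
number; any stray output (e.g. a print from `sitecustomize.py`) decodes to a 
nonsense value that the new check rejects. The helper below is illustrative, not 
Spark code.

```python
import io
import struct

def read_port(stream):
    raw = stream.read(4)
    if len(raw) < 4:
        raise RuntimeError("No port number in the daemon's stdout")
    # Big-endian signed int, like DataInputStream.readInt on the Scala side.
    port = struct.unpack(">i", raw)[0]
    if port < 1 or port > 0xFFFF:
        raise RuntimeError("Bad data in the daemon's stdout: invalid port %d" % port)
    return port

print(read_port(io.BytesIO(struct.pack(">i", 50007))))  # valid port
try:
    read_port(io.BytesIO(b"Hello from sitecustomize.py\n"))
except RuntimeError as e:
    print(e)  # Bad data ... invalid port ...
```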


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/862fa697
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/862fa697
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/862fa697

Branch: refs/heads/master
Commit: 862fa697d829cdddf0f25e5613c91b040f9d9652
Parents: aadf953
Author: Bruce Robbins 
Authored: Tue Feb 20 20:26:26 2018 +0900
Committer: hyukjinkwon 
Committed: Tue Feb 20 20:26:26 2018 +0900

--
 .../spark/api/python/PythonWorkerFactory.scala  | 29 ++--
 1 file changed, 26 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/862fa697/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala 
b/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala
index 30976ac..2340580 100644
--- a/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala
+++ b/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala
@@ -17,7 +17,7 @@
 
 package org.apache.spark.api.python
 
-import java.io.{DataInputStream, DataOutputStream, InputStream, 
OutputStreamWriter}
+import java.io.{DataInputStream, DataOutputStream, EOFException, InputStream, 
OutputStreamWriter}
 import java.net.{InetAddress, ServerSocket, Socket, SocketException}
 import java.nio.charset.StandardCharsets
 import java.util.Arrays
@@ -182,7 +182,8 @@ private[spark] class PythonWorkerFactory(pythonExec: 
String, envVars: Map[String
 
   try {
 // Create and start the daemon
-val pb = new ProcessBuilder(Arrays.asList(pythonExec, "-m", 
daemonModule))
+val command = Arrays.asList(pythonExec, "-m", daemonModule)
+val pb = new ProcessBuilder(command)
 val workerEnv = pb.environment()
 workerEnv.putAll(envVars.asJava)
 workerEnv.put("PYTHONPATH", pythonPath)
@@ -191,7 +192,29 @@ private[spark] class PythonWorkerFactory(pythonExec: 
String, envVars: Map[String
 daemon = pb.start()
 
 val in = new DataInputStream(daemon.getInputStream)
-daemonPort = in.readInt()
+try {
+  daemonPort = in.readInt()
+} catch {
+  case _: EOFException =>
+throw new SparkException(s"No port number in $daemonModule's 
stdout")
+}
+
+// test that the returned port number is within a valid range.
+// note: this does not cover the case where the port number
+// is arbitrary data but is also coincidentally within range
+if (daemonPort < 1 || daemonPort > 0x) {
+  val exceptionMessage = f"""
+|Bad data in $daemonModule's standard output. Invalid port number:
+|  $daemonPort (0x$daemonPort%08x)
+|Python command to execute the daemon was:
+|  ${command.asScala.mkString(" ")}
+|Check that you don't have any unexpected modules or libraries in
+|your PYTHONPATH:
+|  $pythonPath
+|Also, check if you have a sitecustomize.py module in your python 
path,
+|or in your python installation, that is printing to standard 
output"""
+  throw new SparkException(exceptionMessage.stripMargin)
+}
 
 // Redirect daemon stdout and stderr
 redirectStreamsToStderr(in, daemon.getErrorStream)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22843][R] Adds localCheckpoint in R

2017-12-28 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master ded6d27e4 -> 76e8a1d7e


[SPARK-22843][R] Adds localCheckpoint in R

## What changes were proposed in this pull request?

This PR proposes to add `localCheckpoint(..)` to the R API.

```r
df <- localCheckpoint(createDataFrame(iris))
```

## How was this patch tested?

Unit tests added in `R/pkg/tests/fulltests/test_sparkSQL.R`

Author: hyukjinkwon 

Closes #20073 from HyukjinKwon/SPARK-22843.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/76e8a1d7
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/76e8a1d7
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/76e8a1d7

Branch: refs/heads/master
Commit: 76e8a1d7e2619c1e6bd75c399314d2583a86b93b
Parents: ded6d27
Author: hyukjinkwon 
Authored: Thu Dec 28 20:17:26 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Dec 28 20:17:26 2017 +0900

--
 R/pkg/NAMESPACE   |  1 +
 R/pkg/R/DataFrame.R   | 27 +++
 R/pkg/R/generics.R|  4 
 R/pkg/tests/fulltests/test_sparkSQL.R | 22 ++
 4 files changed, 54 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/76e8a1d7/R/pkg/NAMESPACE
--
diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE
index dce64e1..4b699de 100644
--- a/R/pkg/NAMESPACE
+++ b/R/pkg/NAMESPACE
@@ -133,6 +133,7 @@ exportMethods("arrange",
   "isStreaming",
   "join",
   "limit",
+  "localCheckpoint",
   "merge",
   "mutate",
   "na.omit",

http://git-wip-us.apache.org/repos/asf/spark/blob/76e8a1d7/R/pkg/R/DataFrame.R
--
diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index b8d732a..ace49da 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -3782,6 +3782,33 @@ setMethod("checkpoint",
 dataFrame(df)
   })
 
+#' localCheckpoint
+#'
+#' Returns a locally checkpointed version of this SparkDataFrame. 
Checkpointing can be used to
+#' truncate the logical plan, which is especially useful in iterative 
algorithms where the plan
+#' may grow exponentially. Local checkpoints are stored in the executors using 
the caching
+#' subsystem and therefore they are not reliable.
+#'
+#' @param x A SparkDataFrame
+#' @param eager whether to locally checkpoint this SparkDataFrame immediately
+#' @return a new locally checkpointed SparkDataFrame
+#' @family SparkDataFrame functions
+#' @aliases localCheckpoint,SparkDataFrame-method
+#' @rdname localCheckpoint
+#' @name localCheckpoint
+#' @export
+#' @examples
+#'\dontrun{
+#' df <- localCheckpoint(df)
+#' }
+#' @note localCheckpoint since 2.3.0
+setMethod("localCheckpoint",
+  signature(x = "SparkDataFrame"),
+  function(x, eager = TRUE) {
+df <- callJMethod(x@sdf, "localCheckpoint", as.logical(eager))
+dataFrame(df)
+  })
+
 #' cube
 #'
 #' Create a multi-dimensional cube for the SparkDataFrame using the specified 
columns.

http://git-wip-us.apache.org/repos/asf/spark/blob/76e8a1d7/R/pkg/R/generics.R
--
diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R
index 5ddaa66..d5d0bc9 100644
--- a/R/pkg/R/generics.R
+++ b/R/pkg/R/generics.R
@@ -611,6 +611,10 @@ setGeneric("isStreaming", function(x) { 
standardGeneric("isStreaming") })
 #' @export
 setGeneric("limit", function(x, num) {standardGeneric("limit") })
 
+#' @rdname localCheckpoint
+#' @export
+setGeneric("localCheckpoint", function(x, eager = TRUE) { 
standardGeneric("localCheckpoint") })
+
 #' @rdname merge
 #' @export
 setGeneric("merge")

http://git-wip-us.apache.org/repos/asf/spark/blob/76e8a1d7/R/pkg/tests/fulltests/test_sparkSQL.R
--
diff --git a/R/pkg/tests/fulltests/test_sparkSQL.R 
b/R/pkg/tests/fulltests/test_sparkSQL.R
index 6cc0188..650e7c0 100644
--- a/R/pkg/tests/fulltests/test_sparkSQL.R
+++ b/R/pkg/tests/fulltests/test_sparkSQL.R
@@ -957,6 +957,28 @@ test_that("setCheckpointDir(), checkpoint() on a 
DataFrame", {
   }
 })
 
+test_that("localCheckpoint() on a DataFrame", {
+  if (windows_with_hadoop()) {
+# Checkpoint directory shouldn't matter in localCheckpoint.
+checkpointDir <- file.path(tempdir(), "lcproot")
+expect_true(length(list.files(path = checkpointDir, all.files = TRUE, 
recursive = TRUE)) == 0)
+setCheckpointDir(checkpointDir)
+
+textPath <- tempfile(pattern = "textPath", fileext = ".txt")
+writeLines(mockLines, textPath)
+# Read it lazily and 

spark git commit: [SPARK-21208][R] Adds setLocalProperty and getLocalProperty in R

2017-12-28 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 76e8a1d7e -> 1eebfbe19


[SPARK-21208][R] Adds setLocalProperty and getLocalProperty in R

## What changes were proposed in this pull request?

This PR adds `setLocalProperty` and `getLocalProperty` in R.

```R
> df <- createDataFrame(iris)
> setLocalProperty("spark.job.description", "Hello world!")
> count(df)
> setLocalProperty("spark.job.description", "Hi !!")
> count(df)
```

Screenshot: https://user-images.githubusercontent.com/6477701/34335213-60655a7c-e990-11e7-88aa-12debe311627.png

```R
> print(getLocalProperty("spark.job.description"))
NULL
> setLocalProperty("spark.job.description", "Hello world!")
> print(getLocalProperty("spark.job.description"))
[1] "Hello world!"
> setLocalProperty("spark.job.description", "Hi !!")
> print(getLocalProperty("spark.job.description"))
[1] "Hi !!"
```

## How was this patch tested?

Manually tested and a test in `R/pkg/tests/fulltests/test_context.R`.

Author: hyukjinkwon 

Closes #20075 from HyukjinKwon/SPARK-21208.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1eebfbe1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1eebfbe1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1eebfbe1

Branch: refs/heads/master
Commit: 1eebfbe192060af3c81cd086bc5d5a7e80d09e77
Parents: 76e8a1d
Author: hyukjinkwon 
Authored: Thu Dec 28 20:18:47 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Dec 28 20:18:47 2017 +0900

--
 R/pkg/NAMESPACE  |  4 ++-
 R/pkg/R/sparkR.R | 45 +++
 R/pkg/tests/fulltests/test_context.R | 33 ++-
 3 files changed, 80 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1eebfbe1/R/pkg/NAMESPACE
--
diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE
index 4b699de..ce3eec0 100644
--- a/R/pkg/NAMESPACE
+++ b/R/pkg/NAMESPACE
@@ -76,7 +76,9 @@ exportMethods("glm",
 export("setJobGroup",
"clearJobGroup",
"cancelJobGroup",
-   "setJobDescription")
+   "setJobDescription",
+   "setLocalProperty",
+   "getLocalProperty")
 
 # Export Utility methods
 export("setLogLevel")

http://git-wip-us.apache.org/repos/asf/spark/blob/1eebfbe1/R/pkg/R/sparkR.R
--
diff --git a/R/pkg/R/sparkR.R b/R/pkg/R/sparkR.R
index fb5f1d2..965471f 100644
--- a/R/pkg/R/sparkR.R
+++ b/R/pkg/R/sparkR.R
@@ -560,10 +560,55 @@ cancelJobGroup <- function(sc, groupId) {
 #'}
 #' @note setJobDescription since 2.3.0
 setJobDescription <- function(value) {
+  if (!is.null(value)) {
+value <- as.character(value)
+  }
   sc <- getSparkContext()
   invisible(callJMethod(sc, "setJobDescription", value))
 }
 
+#' Set a local property that affects jobs submitted from this thread, such as 
the
+#' Spark fair scheduler pool.
+#'
+#' @param key The key for a local property.
+#' @param value The value for a local property.
+#' @rdname setLocalProperty
+#' @name setLocalProperty
+#' @examples
+#'\dontrun{
+#' setLocalProperty("spark.scheduler.pool", "poolA")
+#'}
+#' @note setLocalProperty since 2.3.0
+setLocalProperty <- function(key, value) {
+  if (is.null(key) || is.na(key)) {
+stop("key should not be NULL or NA.")
+  }
+  if (!is.null(value)) {
+value <- as.character(value)
+  }
+  sc <- getSparkContext()
+  invisible(callJMethod(sc, "setLocalProperty", as.character(key), value))
+}
+
+#' Get a local property set in this thread, or \code{NULL} if it is missing. 
See
+#' \code{setLocalProperty}.
+#'
+#' @param key The key for a local property.
+#' @rdname getLocalProperty
+#' @name getLocalProperty
+#' @examples
+#'\dontrun{
+#' getLocalProperty("spark.scheduler.pool")
+#'}
+#' @note getLocalProperty since 2.3.0
+getLocalProperty <- function(key) {
+  if (is.null(key) || is.na(key)) {
+stop("key should not be NULL or NA.")
+  }
+  sc <- getSparkContext()
+  callJMethod(sc, "getLocalProperty", as.character(key))
+}
+
 sparkConfToSubmitOps <- new.env()
 sparkConfToSubmitOps[["spark.driver.memory"]]   <- "--driver-memory"
 sparkConfToSubmitOps[["spark.driver.extraClassPath"]]   <- 
"--driver-class-path"

http://git-wip-us.apache.org/repos/asf/spark/blob/1eebfbe1/R/pkg/tests/fulltests/test_context.R
--
diff --git a/R/pkg/tests/fulltests/test_context.R 
b/R/pkg/tests/fulltests/test_context.R
index 77635c5..f0d0a51 100644
--- a/R/pkg/tests/fulltests/test_context.R
+++ b/R/pkg/tests/fulltests/test_context.R
@@ -100,7 +100,6 @@ test_that("job group functions can be called", {
   setJobGroup("groupId", "job description", 

spark git commit: [SPARK-21552][SQL] Add DecimalType support to ArrowWriter.

2017-12-26 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 0e6833006 -> eb386be1e


[SPARK-21552][SQL] Add DecimalType support to ArrowWriter.

## What changes were proposed in this pull request?

The decimal type is not yet supported in `ArrowWriter`; this PR adds that support.
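
For context, a minimal PySpark sketch of the round trip this enables (a sketch that assumes a live `spark` session with Arrow execution enabled; the column name is illustrative):

```python
# Minimal sketch: with decimal support in ArrowWriter, decimal columns can
# take the Arrow-based conversion path.
from decimal import Decimal
from pyspark.sql.types import StructType, StructField, DecimalType

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

schema = StructType([StructField("amount", DecimalType(38, 18), True)])
df = spark.createDataFrame([(Decimal("2.0"),), (Decimal("4.0"),)], schema=schema)
pdf = df.toPandas()  # values arrive as Python Decimal objects
print(pdf.dtypes)
```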

## How was this patch tested?

Added a test to `ArrowConvertersSuite`.

Author: Takuya UESHIN 

Closes #18754 from ueshin/issues/SPARK-21552.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/eb386be1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/eb386be1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/eb386be1

Branch: refs/heads/master
Commit: eb386be1ed383323da6e757f63f3b8a7ced38cc4
Parents: 0e68330
Author: Takuya UESHIN 
Authored: Tue Dec 26 21:37:25 2017 +0900
Committer: hyukjinkwon 
Committed: Tue Dec 26 21:37:25 2017 +0900

--
 python/pyspark/sql/tests.py | 61 --
 python/pyspark/sql/types.py |  2 +-
 .../spark/sql/execution/arrow/ArrowWriter.scala | 21 ++
 .../execution/arrow/ArrowConvertersSuite.scala  | 67 +++-
 .../sql/execution/arrow/ArrowWriterSuite.scala  |  2 +
 5 files changed, 131 insertions(+), 22 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/eb386be1/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index b977160..b811a0f 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -3142,6 +3142,7 @@ class ArrowTests(ReusedSQLTestCase):
 @classmethod
 def setUpClass(cls):
 from datetime import datetime
+from decimal import Decimal
 ReusedSQLTestCase.setUpClass()
 
 # Synchronize default timezone between Python and Java
@@ -3158,11 +3159,15 @@ class ArrowTests(ReusedSQLTestCase):
 StructField("3_long_t", LongType(), True),
 StructField("4_float_t", FloatType(), True),
 StructField("5_double_t", DoubleType(), True),
-StructField("6_date_t", DateType(), True),
-StructField("7_timestamp_t", TimestampType(), True)])
-cls.data = [(u"a", 1, 10, 0.2, 2.0, datetime(1969, 1, 1), 
datetime(1969, 1, 1, 1, 1, 1)),
-(u"b", 2, 20, 0.4, 4.0, datetime(2012, 2, 2), 
datetime(2012, 2, 2, 2, 2, 2)),
-(u"c", 3, 30, 0.8, 6.0, datetime(2100, 3, 3), 
datetime(2100, 3, 3, 3, 3, 3))]
+StructField("6_decimal_t", DecimalType(38, 18), True),
+StructField("7_date_t", DateType(), True),
+StructField("8_timestamp_t", TimestampType(), True)])
+cls.data = [(u"a", 1, 10, 0.2, 2.0, Decimal("2.0"),
+ datetime(1969, 1, 1), datetime(1969, 1, 1, 1, 1, 1)),
+(u"b", 2, 20, 0.4, 4.0, Decimal("4.0"),
+ datetime(2012, 2, 2), datetime(2012, 2, 2, 2, 2, 2)),
+(u"c", 3, 30, 0.8, 6.0, Decimal("6.0"),
+ datetime(2100, 3, 3), datetime(2100, 3, 3, 3, 3, 3))]
 
 @classmethod
 def tearDownClass(cls):
@@ -3190,10 +3195,11 @@ class ArrowTests(ReusedSQLTestCase):
 return pd.DataFrame(data=data_dict)
 
 def test_unsupported_datatype(self):
-schema = StructType([StructField("decimal", DecimalType(), True)])
+schema = StructType([StructField("map", MapType(StringType(), 
IntegerType()), True)])
 df = self.spark.createDataFrame([(None,)], schema=schema)
 with QuietTest(self.sc):
-self.assertRaises(Exception, lambda: df.toPandas())
+with self.assertRaisesRegexp(Exception, 'Unsupported data type'):
+df.toPandas()
 
 def test_null_conversion(self):
 df_null = self.spark.createDataFrame([tuple([None for _ in 
range(len(self.data[0]))])] +
@@ -3293,7 +3299,7 @@ class ArrowTests(ReusedSQLTestCase):
 self.assertNotEqual(result_ny, result_la)
 
 # Correct result_la by adjusting 3 hours difference between Los 
Angeles and New York
-result_la_corrected = [Row(**{k: v - timedelta(hours=3) if k == 
'7_timestamp_t' else v
+result_la_corrected = [Row(**{k: v - timedelta(hours=3) if k == 
'8_timestamp_t' else v
   for k, v in row.asDict().items()})
for row in result_la]
 self.assertEqual(result_ny, result_la_corrected)
@@ -3317,11 +3323,11 @@ class ArrowTests(ReusedSQLTestCase):
 def test_createDataFrame_with_names(self):
 pdf = self.create_pandas_data_frame()
 # Test that schema as a list of column names gets applied
-   

spark git commit: [SPARK-22874][PYSPARK][SQL] Modify checking pandas version to use LooseVersion.

2017-12-22 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 8df1da396 -> 13190a4f6


[SPARK-22874][PYSPARK][SQL] Modify checking pandas version to use LooseVersion.

## What changes were proposed in this pull request?

Currently we check the pandas version by catching whether an `ImportError` is raised for specific imports, but we can instead compare `LooseVersion` of the version strings, in the same way we already check the pyarrow version.
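
A rough illustration of the approach (the helper name and minimum version below are illustrative, not the actual `pyspark.sql.utils` code):

```python
# Illustrative sketch: compare version strings with LooseVersion instead of
# relying on an ImportError raised by a version-specific import.
from distutils.version import LooseVersion

def require_minimum_version(package_name, minimum_version):
    module = __import__(package_name)
    if LooseVersion(module.__version__) < LooseVersion(minimum_version):
        raise ImportError("%s >= %s must be installed; however, your version is %s."
                          % (package_name, minimum_version, module.__version__))

require_minimum_version("pandas", "0.19.2")
```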

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN 

Closes #20054 from ueshin/issues/SPARK-22874.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/13190a4f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/13190a4f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/13190a4f

Branch: refs/heads/master
Commit: 13190a4f60c081a68812df6df1d8262779cd6fcb
Parents: 8df1da3
Author: Takuya UESHIN 
Authored: Fri Dec 22 20:09:51 2017 +0900
Committer: hyukjinkwon 
Committed: Fri Dec 22 20:09:51 2017 +0900

--
 python/pyspark/sql/dataframe.py |  4 ++--
 python/pyspark/sql/session.py   | 15 +++
 python/pyspark/sql/tests.py |  7 ---
 python/pyspark/sql/types.py | 33 +
 python/pyspark/sql/udf.py   |  4 ++--
 python/pyspark/sql/utils.py | 11 ++-
 6 files changed, 38 insertions(+), 36 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/13190a4f/python/pyspark/sql/dataframe.py
--
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 440684d..95eca76 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -1906,9 +1906,9 @@ class DataFrame(object):
 if self.sql_ctx.getConf("spark.sql.execution.arrow.enabled", 
"false").lower() == "true":
 try:
 from pyspark.sql.types import 
_check_dataframe_localize_timestamps
-from pyspark.sql.utils import _require_minimum_pyarrow_version
+from pyspark.sql.utils import require_minimum_pyarrow_version
 import pyarrow
-_require_minimum_pyarrow_version()
+require_minimum_pyarrow_version()
 tables = self._collectAsArrow()
 if tables:
 table = pyarrow.concat_tables(tables)

http://git-wip-us.apache.org/repos/asf/spark/blob/13190a4f/python/pyspark/sql/session.py
--
diff --git a/python/pyspark/sql/session.py b/python/pyspark/sql/session.py
index 86db16e..6e5eec4 100644
--- a/python/pyspark/sql/session.py
+++ b/python/pyspark/sql/session.py
@@ -493,15 +493,14 @@ class SparkSession(object):
 data types will be used to coerce the data in Pandas to Arrow 
conversion.
 """
 from pyspark.serializers import ArrowSerializer, _create_batch
-from pyspark.sql.types import from_arrow_schema, to_arrow_type, \
-_old_pandas_exception_message, TimestampType
-from pyspark.sql.utils import _require_minimum_pyarrow_version
-try:
-from pandas.api.types import is_datetime64_dtype, 
is_datetime64tz_dtype
-except ImportError as e:
-raise ImportError(_old_pandas_exception_message(e))
+from pyspark.sql.types import from_arrow_schema, to_arrow_type, 
TimestampType
+from pyspark.sql.utils import require_minimum_pandas_version, \
+require_minimum_pyarrow_version
+
+require_minimum_pandas_version()
+require_minimum_pyarrow_version()
 
-_require_minimum_pyarrow_version()
+from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
 
 # Determine arrow types to coerce data when creating batches
 if isinstance(schema, StructType):

http://git-wip-us.apache.org/repos/asf/spark/blob/13190a4f/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 6fdfda1..b977160 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -53,7 +53,8 @@ _have_old_pandas = False
 try:
 import pandas
 try:
-import pandas.api
+from pyspark.sql.utils import require_minimum_pandas_version
+require_minimum_pandas_version()
 _have_pandas = True
 except:
 _have_old_pandas = True
@@ -2600,7 +2601,7 @@ class SQLTests(ReusedSQLTestCase):
 @unittest.skipIf(not _have_old_pandas, "Old Pandas not installed")
 def test_to_pandas_old(self):
 with QuietTest(self.sc):
-with self.assertRaisesRegexp(ImportError, 

spark git commit: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py file.

2017-12-27 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 6674acd1e -> b8bfce51a


[SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py file.

## What changes were proposed in this pull request?

This is a follow-up PR of #19884, updating the setup.py file to add the pyarrow dependency.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN 

Closes #20089 from ueshin/issues/SPARK-22324/fup1.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b8bfce51
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b8bfce51
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b8bfce51

Branch: refs/heads/master
Commit: b8bfce51abf28c66ba1fc67b0f25fe1617c81025
Parents: 6674acd
Author: Takuya UESHIN 
Authored: Wed Dec 27 20:51:26 2017 +0900
Committer: hyukjinkwon 
Committed: Wed Dec 27 20:51:26 2017 +0900

--
 python/README.md | 2 +-
 python/setup.py  | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b8bfce51/python/README.md
--
diff --git a/python/README.md b/python/README.md
index 84ec881..3f17fdb 100644
--- a/python/README.md
+++ b/python/README.md
@@ -29,4 +29,4 @@ The Python packaging for Spark is not intended to replace all 
of the other use c
 
 ## Python Requirements
 
-At its core PySpark depends on Py4J (currently version 0.10.6), but additional 
sub-packages have their own requirements (including numpy and pandas).
+At its core PySpark depends on Py4J (currently version 0.10.6), but some 
additional sub-packages have their own extra requirements for some features 
(including numpy, pandas, and pyarrow).

http://git-wip-us.apache.org/repos/asf/spark/blob/b8bfce51/python/setup.py
--
diff --git a/python/setup.py b/python/setup.py
index 310670e..251d452 100644
--- a/python/setup.py
+++ b/python/setup.py
@@ -201,7 +201,7 @@ try:
 extras_require={
 'ml': ['numpy>=1.7'],
 'mllib': ['numpy>=1.7'],
-'sql': ['pandas>=0.19.2']
+'sql': ['pandas>=0.19.2', 'pyarrow>=0.8.0']
 },
 classifiers=[
 'Development Status :: 5 - Production/Stable',
@@ -210,6 +210,7 @@ try:
 'Programming Language :: Python :: 3',
 'Programming Language :: Python :: 3.4',
 'Programming Language :: Python :: 3.5',
+'Programming Language :: Python :: 3.6',
 'Programming Language :: Python :: Implementation :: CPython',
 'Programming Language :: Python :: Implementation :: PyPy']
 )


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-21616][SPARKR][DOCS] update R migration guide and vignettes

2018-01-01 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master f5b7714e0 -> 7a702d8d5


[SPARK-21616][SPARKR][DOCS] update R migration guide and vignettes

## What changes were proposed in this pull request?

update R migration guide and vignettes

## How was this patch tested?

manually

Author: Felix Cheung 

Closes #20106 from felixcheung/rreleasenote23.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7a702d8d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7a702d8d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7a702d8d

Branch: refs/heads/master
Commit: 7a702d8d5ed830de5d2237f136b08bd18deae037
Parents: f5b7714
Author: Felix Cheung 
Authored: Tue Jan 2 07:00:31 2018 +0900
Committer: hyukjinkwon 
Committed: Tue Jan 2 07:00:31 2018 +0900

--
 R/pkg/tests/fulltests/test_Windows.R | 1 +
 R/pkg/vignettes/sparkr-vignettes.Rmd | 3 +--
 docs/sparkr.md   | 6 ++
 3 files changed, 8 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7a702d8d/R/pkg/tests/fulltests/test_Windows.R
--
diff --git a/R/pkg/tests/fulltests/test_Windows.R 
b/R/pkg/tests/fulltests/test_Windows.R
index b2ec6c6..209827d 100644
--- a/R/pkg/tests/fulltests/test_Windows.R
+++ b/R/pkg/tests/fulltests/test_Windows.R
@@ -14,6 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
+
 context("Windows-specific tests")
 
 test_that("sparkJars tag in SparkContext", {

http://git-wip-us.apache.org/repos/asf/spark/blob/7a702d8d/R/pkg/vignettes/sparkr-vignettes.Rmd
--
diff --git a/R/pkg/vignettes/sparkr-vignettes.Rmd 
b/R/pkg/vignettes/sparkr-vignettes.Rmd
index 8c4ea2f..2e66242 100644
--- a/R/pkg/vignettes/sparkr-vignettes.Rmd
+++ b/R/pkg/vignettes/sparkr-vignettes.Rmd
@@ -391,8 +391,7 @@ We convert `mpg` to `kmpg` (kilometers per gallon). 
`carsSubDF` is a `SparkDataF
 
 ```{r}
 carsSubDF <- select(carsDF, "model", "mpg")
-schema <- structType(structField("model", "string"), structField("mpg", 
"double"),
- structField("kmpg", "double"))
+schema <- "model STRING, mpg DOUBLE, kmpg DOUBLE"
 out <- dapply(carsSubDF, function(x) { x <- cbind(x, x$mpg * 1.61) }, schema)
 head(collect(out))
 ```

http://git-wip-us.apache.org/repos/asf/spark/blob/7a702d8d/docs/sparkr.md
--
diff --git a/docs/sparkr.md b/docs/sparkr.md
index a3254e7..997ea60 100644
--- a/docs/sparkr.md
+++ b/docs/sparkr.md
@@ -657,3 +657,9 @@ You can inspect the search path in R with 
[`search()`](https://stat.ethz.ch/R-ma
  - By default, derby.log is now saved to `tempdir()`. This will be created 
when instantiating the SparkSession with `enableHiveSupport` set to `TRUE`.
  - `spark.lda` was not setting the optimizer correctly. It has been corrected.
  - Several model summary outputs are updated to have `coefficients` as 
`matrix`. This includes `spark.logit`, `spark.kmeans`, `spark.glm`. Model 
summary outputs for `spark.gaussianMixture` have added log-likelihood as 
`loglik`.
+
+## Upgrading to SparkR 2.3.0
+
+ - The `stringsAsFactors` parameter was previously ignored with `collect`, for 
example, in `collect(createDataFrame(iris), stringsAsFactors = TRUE))`. It has 
been corrected.
+ - For `summary`, option for statistics to compute has been added. Its output 
is changed from that from `describe`.
+ - A warning can be raised if versions of SparkR package and the Spark JVM do 
not match.


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [MINOR] Fix a bunch of typos

2018-01-01 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 7a702d8d5 -> c284c4e1f


[MINOR] Fix a bunch of typos


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c284c4e1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c284c4e1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c284c4e1

Branch: refs/heads/master
Commit: c284c4e1f6f684ca8db1cc446fdcc43b46e3413c
Parents: 7a702d8
Author: Sean Owen 
Authored: Sun Dec 31 17:00:41 2017 -0600
Committer: hyukjinkwon 
Committed: Tue Jan 2 07:10:19 2018 +0900

--
 bin/find-spark-home | 2 +-
 .../java/org/apache/spark/util/kvstore/LevelDBIterator.java | 2 +-
 .../org/apache/spark/network/protocol/MessageWithHeader.java| 4 ++--
 .../main/java/org/apache/spark/network/sasl/SaslEncryption.java | 4 ++--
 .../org/apache/spark/network/util/TransportFrameDecoder.java| 2 +-
 .../network/shuffle/ExternalShuffleBlockResolverSuite.java  | 2 +-
 .../src/main/java/org/apache/spark/util/sketch/BloomFilter.java | 2 +-
 .../java/org/apache/spark/unsafe/array/ByteArrayMethods.java| 2 +-
 core/src/main/scala/org/apache/spark/SparkContext.scala | 2 +-
 core/src/main/scala/org/apache/spark/status/storeTypes.scala| 2 +-
 .../test/scala/org/apache/spark/util/FileAppenderSuite.scala| 2 +-
 dev/github_jira_sync.py | 2 +-
 dev/lint-python | 2 +-
 examples/src/main/python/ml/linearsvc.py| 2 +-
 .../scala/org/apache/spark/sql/kafka010/KafkaSourceRDD.scala| 2 +-
 .../scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala| 2 +-
 .../org/apache/spark/streaming/kafka010/JavaKafkaRDDSuite.java  | 2 +-
 .../scala/org/apache/spark/streaming/kinesis/KinesisUtils.scala | 4 ++--
 .../main/java/org/apache/spark/launcher/ChildProcAppHandle.java | 2 +-
 .../main/scala/org/apache/spark/ml/tuning/CrossValidator.scala  | 2 +-
 python/pyspark/ml/image.py  | 2 +-
 .../scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala   | 2 +-
 .../apache/spark/sql/catalyst/expressions/UnsafeArrayData.java  | 2 +-
 .../scala/org/apache/spark/sql/catalyst/analysis/view.scala | 2 +-
 .../apache/spark/sql/catalyst/expressions/objects/objects.scala | 5 +++--
 .../catalyst/expressions/aggregate/CountMinSketchAggSuite.scala | 2 +-
 .../spark/sql/sources/v2/streaming/MicroBatchWriteSupport.java  | 2 +-
 .../org/apache/spark/sql/execution/ui/static/spark-sql-viz.css  | 2 +-
 .../org/apache/spark/sql/execution/datasources/FileFormat.scala | 2 +-
 .../spark/sql/execution/datasources/csv/CSVInferSchema.scala| 2 +-
 .../org/apache/spark/sql/execution/joins/HashedRelation.scala   | 2 +-
 .../execution/streaming/StreamingSymmetricHashJoinHelper.scala  | 2 +-
 .../apache/spark/sql/execution/ui/SQLAppStatusListener.scala| 2 +-
 .../scala/org/apache/spark/sql/expressions/Aggregator.scala | 2 +-
 .../main/scala/org/apache/spark/sql/streaming/progress.scala| 2 +-
 .../src/test/java/test/org/apache/spark/sql/MyDoubleAvg.java| 2 +-
 .../sql-tests/inputs/typeCoercion/native/implicitTypeCasts.sql  | 2 +-
 .../sql/execution/streaming/CompactibleFileStreamLogSuite.scala | 2 +-
 .../org/apache/spark/sql/sources/fakeExternalSources.scala  | 2 +-
 .../org/apache/spark/sql/streaming/FileStreamSinkSuite.scala| 2 +-
 .../src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala | 2 +-
 .../src/main/scala/org/apache/spark/sql/hive/HiveShim.scala | 2 +-
 sql/hive/src/test/resources/data/conf/hive-log4j.properties | 2 +-
 .../scala/org/apache/spark/streaming/rdd/MapWithStateRDD.scala  | 2 +-
 .../main/scala/org/apache/spark/streaming/util/StateMap.scala   | 2 +-
 45 files changed, 50 insertions(+), 49 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c284c4e1/bin/find-spark-home
--
diff --git a/bin/find-spark-home b/bin/find-spark-home
index fa78407..617dbaa 100755
--- a/bin/find-spark-home
+++ b/bin/find-spark-home
@@ -21,7 +21,7 @@
 
 FIND_SPARK_HOME_PYTHON_SCRIPT="$(cd "$(dirname "$0")"; pwd)/find_spark_home.py"
 
-# Short cirtuit if the user already has this set.
+# Short circuit if the user already has this set.
 if [ ! -z "${SPARK_HOME}" ]; then
exit 0
 elif [ ! -f "$FIND_SPARK_HOME_PYTHON_SCRIPT" ]; then

http://git-wip-us.apache.org/repos/asf/spark/blob/c284c4e1/common/kvstore/src/main/java/org/apache/spark/util/kvstore/LevelDBIterator.java
--
diff --git 
a/common/kvstore/src/main/java/org/apache/spark/util/kvstore/LevelDBIterator.java
 

spark git commit: [SPARK-21893][SPARK-22142][TESTS][FOLLOWUP] Enables PySpark tests for Flume and Kafka in Jenkins

2018-01-01 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 1c9f95cb7 -> e734a4b9c


[SPARK-21893][SPARK-22142][TESTS][FOLLOWUP] Enables PySpark tests for Flume and 
Kafka in Jenkins

## What changes were proposed in this pull request?

This PR proposes to enable PySpark tests for Flume and Kafka in Jenkins by 
explicitly setting the environment variables in `modules.py`.

It seems we are not taking the dependencies into account when calculating the environment variables:

https://github.com/apache/spark/blob/3a07eff5af601511e97a05e6fea0e3d48f74c4f0/dev/run-tests.py#L554-L561
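
As a rough sketch of the idea (illustrative names only, not the actual `dev/run-tests.py` code), the per-module `environ` dictionaries are merged into the environment used for the Python tests:

```python
# Illustrative sketch: collect per-module environment variables so that
# dependent test suites (e.g. Flume and Kafka) are enabled explicitly.
def collect_test_environ(modules_to_test):
    env = {}
    for module in modules_to_test:
        env.update(getattr(module, "environ", {}))
    return env

class FakeStreamingModule(object):
    environ = {"ENABLE_FLUME_TESTS": "1", "ENABLE_KAFKA_0_8_TESTS": "1"}

print(collect_test_environ([FakeStreamingModule()]))
```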

## How was this patch tested?

Manual tests with Jenkins in https://github.com/apache/spark/pull/20126.

**Before** - 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85559/consoleFull

```
[info] Setup the following environment variables for tests:
...
```

**After** - 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85560/consoleFull

```
[info] Setup the following environment variables for tests:
ENABLE_KAFKA_0_8_TESTS=1
ENABLE_FLUME_TESTS=1
...
```

Author: hyukjinkwon 

Closes #20128 from HyukjinKwon/SPARK-21893.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e734a4b9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e734a4b9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e734a4b9

Branch: refs/heads/master
Commit: e734a4b9c23463a7fea61011027a822bc9e11c98
Parents: 1c9f95c
Author: hyukjinkwon 
Authored: Tue Jan 2 07:20:05 2018 +0900
Committer: hyukjinkwon 
Committed: Tue Jan 2 07:20:05 2018 +0900

--
 dev/sparktestsupport/modules.py | 4 
 1 file changed, 4 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e734a4b9/dev/sparktestsupport/modules.py
--
diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index 44f990e..f834563 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -418,6 +418,10 @@ pyspark_streaming = Module(
 source_file_regexes=[
 "python/pyspark/streaming"
 ],
+environ={
+"ENABLE_FLUME_TESTS": "1",
+"ENABLE_KAFKA_0_8_TESTS": "1"
+},
 python_test_goals=[
 "pyspark.streaming.util",
 "pyspark.streaming.tests",


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22530][PYTHON][SQL] Adding Arrow support for ArrayType

2018-01-01 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master c284c4e1f -> 1c9f95cb7


[SPARK-22530][PYTHON][SQL] Adding Arrow support for ArrayType

## What changes were proposed in this pull request?

This change adds `ArrayType` support for working with Arrow in pyspark when 
creating a DataFrame, calling `toPandas()`, and using vectorized `pandas_udf`.
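
As a quick illustration (a sketch that assumes a live `spark` session with Arrow enabled; column and variable names are illustrative):

```python
# Sketch: array columns now go through Arrow when converting to pandas and
# when used with a vectorized (pandas) UDF.
from pyspark.sql.functions import pandas_udf, col
from pyspark.sql.types import ArrayType, IntegerType

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.createDataFrame([([1, 2],), ([3, 4],)], ["arr"])
pdf = df.toPandas()  # each cell holds the array values

double_all = pandas_udf(lambda s: s.apply(lambda xs: [x * 2 for x in xs]),
                        ArrayType(IntegerType()))
df.select(double_all(col("arr"))).show()
```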

## How was this patch tested?

Added new Python unit tests using Array data.

Author: Bryan Cutler 

Closes #20114 from BryanCutler/arrow-ArrayType-support-SPARK-22530.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1c9f95cb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1c9f95cb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1c9f95cb

Branch: refs/heads/master
Commit: 1c9f95cb771ac78775a77edd1abfeb2d8ae2a124
Parents: c284c4e
Author: Bryan Cutler 
Authored: Tue Jan 2 07:13:27 2018 +0900
Committer: hyukjinkwon 
Committed: Tue Jan 2 07:13:27 2018 +0900

--
 python/pyspark/sql/tests.py | 47 +++-
 python/pyspark/sql/types.py |  4 ++
 .../execution/vectorized/ArrowColumnVector.java | 13 +-
 3 files changed, 61 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1c9f95cb/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 1c34c89..67bdb3d 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -3372,6 +3372,31 @@ class ArrowTests(ReusedSQLTestCase):
 schema_rt = from_arrow_schema(arrow_schema)
 self.assertEquals(self.schema, schema_rt)
 
+def test_createDataFrame_with_array_type(self):
+import pandas as pd
+pdf = pd.DataFrame({"a": [[1, 2], [3, 4]], "b": [[u"x", u"y"], [u"y", 
u"z"]]})
+df, df_arrow = self._createDataFrame_toggle(pdf)
+result = df.collect()
+result_arrow = df_arrow.collect()
+expected = [tuple(list(e) for e in rec) for rec in 
pdf.to_records(index=False)]
+for r in range(len(expected)):
+for e in range(len(expected[r])):
+self.assertTrue(expected[r][e] == result_arrow[r][e] and
+result[r][e] == result_arrow[r][e])
+
+def test_toPandas_with_array_type(self):
+expected = [([1, 2], [u"x", u"y"]), ([3, 4], [u"y", u"z"])]
+array_schema = StructType([StructField("a", ArrayType(IntegerType())),
+   StructField("b", ArrayType(StringType()))])
+df = self.spark.createDataFrame(expected, schema=array_schema)
+pdf, pdf_arrow = self._toPandas_arrow_toggle(df)
+result = [tuple(list(e) for e in rec) for rec in 
pdf.to_records(index=False)]
+result_arrow = [tuple(list(e) for e in rec) for rec in 
pdf_arrow.to_records(index=False)]
+for r in range(len(expected)):
+for e in range(len(expected[r])):
+self.assertTrue(expected[r][e] == result_arrow[r][e] and
+result[r][e] == result_arrow[r][e])
+
 
 @unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not 
installed")
 class PandasUDFTests(ReusedSQLTestCase):
@@ -3651,6 +3676,24 @@ class VectorizedUDFTests(ReusedSQLTestCase):
 bool_f(col('bool')))
 self.assertEquals(df.collect(), res.collect())
 
+def test_vectorized_udf_array_type(self):
+from pyspark.sql.functions import pandas_udf, col
+data = [([1, 2],), ([3, 4],)]
+array_schema = StructType([StructField("array", 
ArrayType(IntegerType()))])
+df = self.spark.createDataFrame(data, schema=array_schema)
+array_f = pandas_udf(lambda x: x, ArrayType(IntegerType()))
+result = df.select(array_f(col('array')))
+self.assertEquals(df.collect(), result.collect())
+
+def test_vectorized_udf_null_array(self):
+from pyspark.sql.functions import pandas_udf, col
+data = [([1, 2],), (None,), (None,), ([3, 4],), (None,)]
+array_schema = StructType([StructField("array", 
ArrayType(IntegerType()))])
+df = self.spark.createDataFrame(data, schema=array_schema)
+array_f = pandas_udf(lambda x: x, ArrayType(IntegerType()))
+result = df.select(array_f(col('array')))
+self.assertEquals(df.collect(), result.collect())
+
 def test_vectorized_udf_complex(self):
 from pyspark.sql.functions import pandas_udf, col, expr
 df = self.spark.range(10).select(
@@ -3705,7 +3748,7 @@ class VectorizedUDFTests(ReusedSQLTestCase):
 def test_vectorized_udf_wrong_return_type(self):
 from pyspark.sql.functions 

spark git commit: [SPARK-24624][SQL][PYTHON] Support mixture of Python UDF and Scalar Pandas UDF

2018-07-27 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 6424b146c -> e8752095a


[SPARK-24624][SQL][PYTHON] Support mixture of Python UDF and Scalar Pandas UDF

## What changes were proposed in this pull request?

This PR adds support for mixing Python UDFs and Scalar Pandas UDFs, in the following two cases:

(1)
```
from pyspark.sql.functions import udf, pandas_udf

udf('int')
def f1(x):
return x + 1

pandas_udf('int')
def f2(x):
return x + 1

df = spark.range(0, 1).toDF('v') \
.withColumn('foo', f1(col('v'))) \
.withColumn('bar', f2(col('v')))

```

QueryPlan:
```
>>> df.explain(True)
== Parsed Logical Plan ==
'Project [v#2L, foo#5, f2('v) AS bar#9]
+- AnalysisBarrier
  +- Project [v#2L, f1(v#2L) AS foo#5]
 +- Project [id#0L AS v#2L]
+- Range (0, 1, step=1, splits=Some(4))

== Analyzed Logical Plan ==
v: bigint, foo: int, bar: int
Project [v#2L, foo#5, f2(v#2L) AS bar#9]
+- Project [v#2L, f1(v#2L) AS foo#5]
   +- Project [id#0L AS v#2L]
  +- Range (0, 1, step=1, splits=Some(4))

== Optimized Logical Plan ==
Project [id#0L AS v#2L, f1(id#0L) AS foo#5, f2(id#0L) AS bar#9]
+- Range (0, 1, step=1, splits=Some(4))

== Physical Plan ==
*(2) Project [id#0L AS v#2L, pythonUDF0#13 AS foo#5, pythonUDF0#14 AS bar#9]
+- ArrowEvalPython [f2(id#0L)], [id#0L, pythonUDF0#13, pythonUDF0#14]
   +- BatchEvalPython [f1(id#0L)], [id#0L, pythonUDF0#13]
  +- *(1) Range (0, 1, step=1, splits=4)
```

(2)
```
from pyspark.sql.functions import udf, pandas_udf
udf('int')
def f1(x):
return x + 1

pandas_udf('int')
def f2(x):
return x + 1

df = spark.range(0, 1).toDF('v')
df = df.withColumn('foo', f2(f1(df['v'])))
```

QueryPlan:
```
>>> df.explain(True)
== Parsed Logical Plan ==
Project [v#21L, f2(f1(v#21L)) AS foo#46]
+- AnalysisBarrier
  +- Project [v#21L, f1(f2(v#21L)) AS foo#39]
 +- Project [v#21L, ((v#21L)) AS foo#32]
+- Project [v#21L, ((v#21L)) AS foo#25]
   +- Project [id#19L AS v#21L]
  +- Range (0, 1, step=1, splits=Some(4))

== Analyzed Logical Plan ==
v: bigint, foo: int
Project [v#21L, f2(f1(v#21L)) AS foo#46]
+- Project [v#21L, f1(f2(v#21L)) AS foo#39]
   +- Project [v#21L, ((v#21L)) AS foo#32]
  +- Project [v#21L, ((v#21L)) AS foo#25]
 +- Project [id#19L AS v#21L]
+- Range (0, 1, step=1, splits=Some(4))

== Optimized Logical Plan ==
Project [id#19L AS v#21L, f2(f1(id#19L)) AS foo#46]
+- Range (0, 1, step=1, splits=Some(4))

== Physical Plan ==
*(2) Project [id#19L AS v#21L, pythonUDF0#50 AS foo#46]
+- ArrowEvalPython [f2(pythonUDF0#49)], [id#19L, pythonUDF0#49, pythonUDF0#50]
   +- BatchEvalPython [f1(id#19L)], [id#19L, pythonUDF0#49]
  +- *(1) Range (0, 1, step=1, splits=4)
```

## How was this patch tested?

New tests are added to BatchEvalPythonExecSuite and ScalarPandasUDFTests

Author: Li Jin 

Closes #21650 from icexelloss/SPARK-24624-mix-udf.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e8752095
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e8752095
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e8752095

Branch: refs/heads/master
Commit: e8752095a00aba453a92bc822131c001602f0829
Parents: 6424b14
Author: Li Jin 
Authored: Sat Jul 28 13:41:07 2018 +0800
Committer: hyukjinkwon 
Committed: Sat Jul 28 13:41:07 2018 +0800

--
 python/pyspark/sql/tests.py | 186 +--
 .../execution/python/ExtractPythonUDFs.scala|  42 +++--
 .../python/BatchEvalPythonExecSuite.scala   |   7 +
 .../python/ExtractPythonUDFsSuite.scala |  92 +
 4 files changed, 304 insertions(+), 23 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e8752095/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 2d6b9f0..a294d70 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -4763,17 +4763,6 @@ class ScalarPandasUDFTests(ReusedSQLTestCase):
 'Result vector from pandas_udf was not the required 
length'):
 df.select(raise_exception(col('id'))).collect()
 
-def test_vectorized_udf_mix_udf(self):
-from pyspark.sql.functions import pandas_udf, udf, col
-df = self.spark.range(10)
-row_by_row_udf = udf(lambda x: x, LongType())
-pd_udf = pandas_udf(lambda x: x, LongType())
-with QuietTest(self.sc):
-with self.assertRaisesRegexp(
-Exception,
-'Can not mix vectorized and non-vectorized UDFs'):
-df.select(row_by_row_udf(col('id')), 
pd_udf(col('id'))).collect()
-
 def test_vectorized_udf_chained(self):
 from 

spark git commit: [SPARK-24924][SQL][FOLLOW-UP] Add mapping for built-in Avro data source

2018-07-27 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master e8752095a -> c6a3db2fb


[SPARK-24924][SQL][FOLLOW-UP] Add mapping for built-in Avro data source

## What changes were proposed in this pull request?
Add one more test case for `com.databricks.spark.avro`.
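
The mapping is visible from PySpark as well; a hedged sketch (assumes a live `spark` session with the Avro module on the classpath; the path is illustrative):

```python
# Sketch: the legacy data source name resolves to the built-in Avro
# implementation, so both reads return the same data.
path = "/path/to/episodes.avro"  # illustrative path
df_new = spark.read.format("avro").load(path)
df_old = spark.read.format("com.databricks.spark.avro").load(path)
assert df_new.count() == df_old.count()
```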

## How was this patch tested?
N/A

Author: Xiao Li 

Closes #21906 from gatorsmile/avro.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c6a3db2f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c6a3db2f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c6a3db2f

Branch: refs/heads/master
Commit: c6a3db2fb6d9df1a377a1d3385343f70f9e237e4
Parents: e875209
Author: Xiao Li 
Authored: Sat Jul 28 13:43:32 2018 +0800
Committer: hyukjinkwon 
Committed: Sat Jul 28 13:43:32 2018 +0800

--
 .../src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala  | 7 +++
 1 file changed, 7 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c6a3db2f/external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
--
diff --git 
a/external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala 
b/external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
index 2f478c7..f59c2cc 100644
--- a/external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
+++ b/external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
@@ -394,6 +394,13 @@ class AvroSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils {
 assert(results.length === 8)
   }
 
+  test("old avro data source name works") {
+val results =
+  spark.read.format("com.databricks.spark.avro")
+.load(episodesAvro).select("title").collect()
+assert(results.length === 8)
+  }
+
   test("support of various data types") {
 // This test uses data from test.avro. You can see the data and the schema 
of this file in
 // test.json and test.avsc


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-24945][SQL] Switching to uniVocity 2.7.3

2018-08-02 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 7cf16a7fa -> b3f2911ee


[SPARK-24945][SQL] Switching to uniVocity 2.7.3

## What changes were proposed in this pull request?

In this PR, I propose to upgrade the uniVocity parser from **2.6.3** to **2.7.3**. The newer version includes a fix for the SPARK-24645 issue and has better performance.

Before changes:
```
Parsing quoted values:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative

One quoted string                       6 / 34122            0.0      666727.0       1.0X

Wide rows with 1000 columns:        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative

Select 1000 columns                 90287 / 91713            0.0       90286.9       1.0X
Select 100 columns                  31826 / 36589            0.0       31826.4       2.8X
Select one column                   25738 / 25872            0.0       25737.9       3.5X
count()                               6931 / 7269            0.1        6931.5      13.0X
```
After changes:
```
Parsing quoted values:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative

One quoted string                   33411 / 33510            0.0      668211.4       1.0X

Wide rows with 1000 columns:        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative

Select 1000 columns                 88028 / 89311            0.0       88028.1       1.0X
Select 100 columns                  29010 / 32755            0.0       29010.1       3.0X
Select one column                   22936 / 22953            0.0       22936.5       3.8X
count()                               6657 / 6740            0.2        6656.6      13.5X
```
Closes #21892

## How was this patch tested?

It was tested by `CSVSuite` and `CSVBenchmarks`

Author: Maxim Gekk 

Closes #21969 from MaxGekk/univocity-2_7_3.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b3f2911e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b3f2911e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b3f2911e

Branch: refs/heads/master
Commit: b3f2911eebeb418631ce296f68a7cc68083659cd
Parents: 7cf16a7
Author: Maxim Gekk 
Authored: Fri Aug 3 08:33:28 2018 +0800
Committer: hyukjinkwon 
Committed: Fri Aug 3 08:33:28 2018 +0800

--
 dev/deps/spark-deps-hadoop-2.6 | 2 +-
 dev/deps/spark-deps-hadoop-2.7 | 2 +-
 dev/deps/spark-deps-hadoop-3.1 | 2 +-
 sql/core/pom.xml   | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b3f2911e/dev/deps/spark-deps-hadoop-2.6
--
diff --git a/dev/deps/spark-deps-hadoop-2.6 b/dev/deps/spark-deps-hadoop-2.6
index 4ef61b2..54cdcfc 100644
--- a/dev/deps/spark-deps-hadoop-2.6
+++ b/dev/deps/spark-deps-hadoop-2.6
@@ -191,7 +191,7 @@ stax-api-1.0.1.jar
 stream-2.7.0.jar
 stringtemplate-3.2.1.jar
 super-csv-2.2.0.jar
-univocity-parsers-2.6.3.jar
+univocity-parsers-2.7.3.jar
 validation-api-1.1.0.Final.jar
 xbean-asm6-shaded-4.8.jar
 xercesImpl-2.9.1.jar

http://git-wip-us.apache.org/repos/asf/spark/blob/b3f2911e/dev/deps/spark-deps-hadoop-2.7
--
diff --git a/dev/deps/spark-deps-hadoop-2.7 b/dev/deps/spark-deps-hadoop-2.7
index a74ce1f..fda13db 100644
--- a/dev/deps/spark-deps-hadoop-2.7
+++ b/dev/deps/spark-deps-hadoop-2.7
@@ -192,7 +192,7 @@ stax-api-1.0.1.jar
 stream-2.7.0.jar
 stringtemplate-3.2.1.jar
 super-csv-2.2.0.jar
-univocity-parsers-2.6.3.jar
+univocity-parsers-2.7.3.jar
 validation-api-1.1.0.Final.jar
 xbean-asm6-shaded-4.8.jar
 xercesImpl-2.9.1.jar

http://git-wip-us.apache.org/repos/asf/spark/blob/b3f2911e/dev/deps/spark-deps-hadoop-3.1
--
diff --git a/dev/deps/spark-deps-hadoop-3.1 b/dev/deps/spark-deps-hadoop-3.1
index e0fcca0..90602fc 100644
--- a/dev/deps/spark-deps-hadoop-3.1
+++ b/dev/deps/spark-deps-hadoop-3.1
@@ -212,7 +212,7 @@ stream-2.7.0.jar
 stringtemplate-3.2.1.jar
 super-csv-2.2.0.jar
 token-provider-1.0.1.jar
-univocity-parsers-2.6.3.jar
+univocity-parsers-2.7.3.jar
 validation-api-1.1.0.Final.jar
 woodstox-core-5.0.3.jar
 xbean-asm6-shaded-4.8.jar

http://git-wip-us.apache.org/repos/asf/spark/blob/b3f2911e/sql/core/pom.xml

spark git commit: [SPARK-24773] Avro: support logical timestamp type with different precisions

2018-08-02 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 29077a1d1 -> 7cf16a7fa


[SPARK-24773] Avro: support logical timestamp type with different precisions

## What changes were proposed in this pull request?

Support reading/writing Avro logical timestamp type with different precisions
https://avro.apache.org/docs/1.8.2/spec.html#Timestamp+%28millisecond+precision%29

To specify the output timestamp type, use the DataFrame option `outputTimestampType` or the SQL config `spark.sql.avro.outputTimestampType`. The supported values are
* `TIMESTAMP_MICROS`
* `TIMESTAMP_MILLIS`

The default output type is `TIMESTAMP_MICROS`.
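
A hedged PySpark usage sketch (assumes a live `spark` session with the external Avro module on the classpath; the output path is illustrative):

```python
# Sketch: write timestamps with millisecond precision instead of the
# default microsecond precision.
df = spark.sql("SELECT current_timestamp() AS ts")
(df.write
   .format("avro")
   .option("outputTimestampType", "TIMESTAMP_MILLIS")
   .save("/tmp/ts_millis_avro"))
```
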
## How was this patch tested?

Unit test

Author: Gengliang Wang 

Closes #21935 from gengliangwang/avro_timestamp.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7cf16a7f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7cf16a7f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7cf16a7f

Branch: refs/heads/master
Commit: 7cf16a7fa4eb4145c0c5d1dd2555f78a2fdd8d8b
Parents: 29077a1
Author: Gengliang Wang 
Authored: Fri Aug 3 08:32:08 2018 +0800
Committer: hyukjinkwon 
Committed: Fri Aug 3 08:32:08 2018 +0800

--
 .../spark/sql/avro/AvroDeserializer.scala   |  15 ++-
 .../apache/spark/sql/avro/AvroFileFormat.scala  |   4 +-
 .../org/apache/spark/sql/avro/AvroOptions.scala |  11 ++
 .../apache/spark/sql/avro/AvroSerializer.scala  |  12 ++-
 .../spark/sql/avro/SchemaConverters.scala   |  33 --
 external/avro/src/test/resources/timestamp.avro | Bin 0 -> 375 bytes
 .../org/apache/spark/sql/avro/AvroSuite.scala   | 107 +--
 .../org/apache/spark/sql/internal/SQLConf.scala |  18 
 8 files changed, 178 insertions(+), 22 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7cf16a7f/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
--
diff --git 
a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala 
b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
index b31149a..394a62b 100644
--- 
a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
+++ 
b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
@@ -23,6 +23,7 @@ import scala.collection.JavaConverters._
 import scala.collection.mutable.ArrayBuffer
 
 import org.apache.avro.{Schema, SchemaBuilder}
+import org.apache.avro.LogicalTypes.{TimestampMicros, TimestampMillis}
 import org.apache.avro.Schema.Type._
 import org.apache.avro.generic._
 import org.apache.avro.util.Utf8
@@ -86,8 +87,18 @@ class AvroDeserializer(rootAvroType: Schema, 
rootCatalystType: DataType) {
   case (LONG, LongType) => (updater, ordinal, value) =>
 updater.setLong(ordinal, value.asInstanceOf[Long])
 
-  case (LONG, TimestampType) => (updater, ordinal, value) =>
-updater.setLong(ordinal, value.asInstanceOf[Long] * 1000)
+  case (LONG, TimestampType) => avroType.getLogicalType match {
+case _: TimestampMillis => (updater, ordinal, value) =>
+  updater.setLong(ordinal, value.asInstanceOf[Long] * 1000)
+case _: TimestampMicros => (updater, ordinal, value) =>
+  updater.setLong(ordinal, value.asInstanceOf[Long])
+case null => (updater, ordinal, value) =>
+  // For backward compatibility, if the Avro type is Long and it is 
not logical type,
+  // the value is processed as timestamp type with millisecond 
precision.
+  updater.setLong(ordinal, value.asInstanceOf[Long] * 1000)
+case other => throw new IncompatibleSchemaException(
+  s"Cannot convert Avro logical type ${other} to Catalyst Timestamp 
type.")
+  }
 
   case (LONG, DateType) => (updater, ordinal, value) =>
 updater.setInt(ordinal, (value.asInstanceOf[Long] / 
DateTimeUtils.MILLIS_PER_DAY).toInt)

http://git-wip-us.apache.org/repos/asf/spark/blob/7cf16a7f/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala
--
diff --git 
a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala 
b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala
index 6776516..6ffcf37 100755
--- 
a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala
+++ 
b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala
@@ -113,8 +113,8 @@ private[avro] class AvroFileFormat extends FileFormat
   options: Map[String, String],
   dataSchema: StructType): OutputWriterFactory = {
 val parsedOptions = new AvroOptions(options, 
spark.sessionState.newHadoopConf())
-val outputAvroSchema = 

spark git commit: [SPARK-25011][ML] add prefix to __all__ in fpm.py

2018-08-03 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 19a453191 -> ebf33a333


[SPARK-25011][ML] add prefix to __all__ in fpm.py

## What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-25011

add prefix to __all__ in fpm.py
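
For illustration, this is what the change makes possible with a star import (a sketch that assumes an active SparkSession so the JVM-backed params can be created):

```python
# Sketch: with "PrefixSpan" listed in __all__, it is exposed by a star import.
from pyspark.ml.fpm import *  # noqa: F403

ps = PrefixSpan(minSupport=0.5, maxPatternLength=5)
print(type(ps).__name__)  # PrefixSpan
```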

## How was this patch tested?

existing unit test.

Author: Yuhao Yang 

Closes #21981 from hhbyyh/prefixall.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ebf33a33
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ebf33a33
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ebf33a33

Branch: refs/heads/master
Commit: ebf33a333e9f7ad46f37233eee843e31028a1d62
Parents: 19a4531
Author: Yuhao Yang 
Authored: Fri Aug 3 15:02:41 2018 +0800
Committer: hyukjinkwon 
Committed: Fri Aug 3 15:02:41 2018 +0800

--
 python/pyspark/ml/fpm.py | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ebf33a33/python/pyspark/ml/fpm.py
--
diff --git a/python/pyspark/ml/fpm.py b/python/pyspark/ml/fpm.py
index fd19fd9..f939442 100644
--- a/python/pyspark/ml/fpm.py
+++ b/python/pyspark/ml/fpm.py
@@ -21,7 +21,7 @@ from pyspark.ml.util import *
 from pyspark.ml.wrapper import JavaEstimator, JavaModel, JavaParams, _jvm
 from pyspark.ml.param.shared import *
 
-__all__ = ["FPGrowth", "FPGrowthModel"]
+__all__ = ["FPGrowth", "FPGrowthModel", "PrefixSpan"]
 
 
 class HasMinSupport(Params):
@@ -313,14 +313,15 @@ class PrefixSpan(JavaParams):
 def findFrequentSequentialPatterns(self, dataset):
 """
 .. note:: Experimental
+
 Finds the complete set of frequent sequential patterns in the input 
sequences of itemsets.
 
 :param dataset: A dataframe containing a sequence column which is
 `ArrayType(ArrayType(T))` type, T is the item type for 
the input dataset.
 :return: A `DataFrame` that contains columns of sequence and 
corresponding frequency.
  The schema of it will be:
-  - `sequence: ArrayType(ArrayType(T))` (T is the item type)
-  - `freq: Long`
+ - `sequence: ArrayType(ArrayType(T))` (T is the item type)
+ - `freq: Long`
 
 >>> from pyspark.ml.fpm import PrefixSpan
 >>> from pyspark.sql import Row


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-24952][SQL] Support LZMA2 compression by Avro datasource

2018-07-30 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 2fbe294cf -> d20c10fdf


[SPARK-24952][SQL] Support LZMA2 compression by Avro datasource

## What changes were proposed in this pull request?

In this PR, I propose to support the `LZMA2` (`XZ`) and `BZIP2` compression codecs in the `AVRO` datasource on write, since these codecs may have better characteristics, such as compression ratio and speed, compared to the already supported `snappy` and `deflate` codecs.
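
A hedged PySpark usage sketch (assumes a live `spark` session with the external Avro module on the classpath; the config name follows the SQL conf this PR touches, and the paths are illustrative):

```python
# Sketch: choose the Avro compression codec via the writer's SQL config.
df = spark.range(10)

spark.conf.set("spark.sql.avro.compression.codec", "xz")  # LZMA2
df.write.format("avro").save("/tmp/events_xz_avro")

spark.conf.set("spark.sql.avro.compression.codec", "bzip2")
df.write.format("avro").save("/tmp/events_bzip2_avro")
```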

## How was this patch tested?

It was tested manually and by an existing test which was extended to check the 
`xz` and `bzip2` compressions.

Author: Maxim Gekk 

Closes #21902 from MaxGekk/avro-xz-bzip2.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d20c10fd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d20c10fd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d20c10fd

Branch: refs/heads/master
Commit: d20c10fdf382acf43a7e6a541923bd078e19ca75
Parents: 2fbe294
Author: Maxim Gekk 
Authored: Tue Jul 31 09:12:57 2018 +0800
Committer: hyukjinkwon 
Committed: Tue Jul 31 09:12:57 2018 +0800

--
 .../apache/spark/sql/avro/AvroFileFormat.scala  | 40 +---
 .../org/apache/spark/sql/avro/AvroOptions.scala |  2 +-
 .../org/apache/spark/sql/avro/AvroSuite.scala   | 14 ++-
 .../org/apache/spark/sql/internal/SQLConf.scala |  6 ++-
 4 files changed, 36 insertions(+), 26 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d20c10fd/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala
--
diff --git 
a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala 
b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala
index e0159b9..7db452b 100755
--- 
a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala
+++ 
b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala
@@ -23,7 +23,8 @@ import java.net.URI
 import scala.util.control.NonFatal
 
 import org.apache.avro.Schema
-import org.apache.avro.file.{DataFileConstants, DataFileReader}
+import org.apache.avro.file.DataFileConstants._
+import org.apache.avro.file.DataFileReader
 import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
 import org.apache.avro.mapred.{AvroOutputFormat, FsInput}
 import org.apache.avro.mapreduce.AvroJob
@@ -116,27 +117,22 @@ private[avro] class AvroFileFormat extends FileFormat
   dataSchema, nullable = false, parsedOptions.recordName, 
parsedOptions.recordNamespace)
 
 AvroJob.setOutputKeySchema(job, outputAvroSchema)
-val COMPRESS_KEY = "mapred.output.compress"
-
-parsedOptions.compression match {
-  case "uncompressed" =>
-logInfo("writing uncompressed Avro records")
-job.getConfiguration.setBoolean(COMPRESS_KEY, false)
-
-  case "snappy" =>
-logInfo("compressing Avro output using Snappy")
-job.getConfiguration.setBoolean(COMPRESS_KEY, true)
-job.getConfiguration.set(AvroJob.CONF_OUTPUT_CODEC, 
DataFileConstants.SNAPPY_CODEC)
-
-  case "deflate" =>
-val deflateLevel = spark.sessionState.conf.avroDeflateLevel
-logInfo(s"compressing Avro output using deflate (level=$deflateLevel)")
-job.getConfiguration.setBoolean(COMPRESS_KEY, true)
-job.getConfiguration.set(AvroJob.CONF_OUTPUT_CODEC, 
DataFileConstants.DEFLATE_CODEC)
-job.getConfiguration.setInt(AvroOutputFormat.DEFLATE_LEVEL_KEY, 
deflateLevel)
-
-  case unknown: String =>
-logError(s"unsupported compression codec $unknown")
+
+if (parsedOptions.compression == "uncompressed") {
+  job.getConfiguration.setBoolean("mapred.output.compress", false)
+} else {
+  job.getConfiguration.setBoolean("mapred.output.compress", true)
+  logInfo(s"Compressing Avro output using the ${parsedOptions.compression} 
codec")
+  val codec = parsedOptions.compression match {
+case DEFLATE_CODEC =>
+  val deflateLevel = spark.sessionState.conf.avroDeflateLevel
+  logInfo(s"Avro compression level $deflateLevel will be used for 
$DEFLATE_CODEC codec.")
+  job.getConfiguration.setInt(AvroOutputFormat.DEFLATE_LEVEL_KEY, 
deflateLevel)
+  DEFLATE_CODEC
+case codec @ (SNAPPY_CODEC | BZIP2_CODEC | XZ_CODEC) => codec
+case unknown => throw new IllegalArgumentException(s"Invalid 
compression codec: $unknown")
+  }
+  job.getConfiguration.set(AvroJob.CONF_OUTPUT_CODEC, codec)
 }
 
 new AvroOutputWriterFactory(dataSchema, outputAvroSchema.toString)

http://git-wip-us.apache.org/repos/asf/spark/blob/d20c10fd/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala

spark git commit: [SPARK-24956][BUILD][FOLLOWUP] Upgrade Maven version to 3.5.4 for AppVeyor as well

2018-07-30 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master d20c10fdf -> f1550aaf1


[SPARK-24956][BUILD][FOLLOWUP] Upgrade Maven version to 3.5.4 for AppVeyor as 
well

## What changes were proposed in this pull request?

The Maven version was upgraded, and AppVeyor should also use the upgraded Maven version.

Currently, the AppVeyor build appears broken because of this:

https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/2458-master

```
[WARNING] Rule 0: org.apache.maven.plugins.enforcer.RequireMavenVersion failed 
with message:
Detected Maven Version: 3.3.9 is not in the allowed range 3.5.4.
[INFO] 
[INFO] Reactor Summary:
```

## How was this patch tested?

AppVeyor tests

Author: hyukjinkwon 

Closes #21920 from HyukjinKwon/SPARK-24956.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f1550aaf
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f1550aaf
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f1550aaf

Branch: refs/heads/master
Commit: f1550aaf1506c0115c8d95cd8bc784ed6c734ea5
Parents: d20c10f
Author: hyukjinkwon 
Authored: Tue Jul 31 09:14:29 2018 +0800
Committer: hyukjinkwon 
Committed: Tue Jul 31 09:14:29 2018 +0800

--
 dev/appveyor-install-dependencies.ps1 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f1550aaf/dev/appveyor-install-dependencies.ps1
--
diff --git a/dev/appveyor-install-dependencies.ps1 
b/dev/appveyor-install-dependencies.ps1
index e6afb18..8a04b62 100644
--- a/dev/appveyor-install-dependencies.ps1
+++ b/dev/appveyor-install-dependencies.ps1
@@ -81,7 +81,7 @@ if (!(Test-Path $tools)) {
 # == Maven
 Push-Location $tools
 
-$mavenVer = "3.3.9"
+$mavenVer = "3.5.4"
 Start-FileDownload 
"https://archive.apache.org/dist/maven/maven-3/$mavenVer/binaries/apache-maven-$mavenVer-bin.zip;
 "maven.zip"
 
 # extract


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-23633][SQL] Update Pandas UDFs section in sql-programming-guide

2018-07-30 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master f1550aaf1 -> 8141d5592


[SPARK-23633][SQL] Update Pandas UDFs section in sql-programming-guide

## What changes were proposed in this pull request?

Update the Pandas UDFs section in sql-programming-guide. Add a section for grouped aggregate Pandas UDFs.

## How was this patch tested?

Author: Li Jin 

Closes #21887 from icexelloss/SPARK-23633-sql-programming-guide.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8141d559
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8141d559
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8141d559

Branch: refs/heads/master
Commit: 8141d55926e95c06cd66bf82098895e1ed419449
Parents: f1550aa
Author: Li Jin 
Authored: Tue Jul 31 10:10:38 2018 +0800
Committer: hyukjinkwon 
Committed: Tue Jul 31 10:10:38 2018 +0800

--
 docs/sql-programming-guide.md | 19 +++
 examples/src/main/python/sql/arrow.py | 37 ++
 python/pyspark/sql/functions.py   |  5 ++--
 3 files changed, 59 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8141d559/docs/sql-programming-guide.md
--
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index cff521c..5f1eee8 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1811,6 +1811,25 @@ The following example shows how to use `groupby().apply()` to subtract the mean
 For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf) and
 [`pyspark.sql.GroupedData.apply`](api/python/pyspark.sql.html#pyspark.sql.GroupedData.apply).
 
+### Grouped Aggregate
+
+Grouped aggregate Pandas UDFs are similar to Spark aggregate functions. Grouped aggregate Pandas UDFs are used with `groupBy().agg()` and
+[`pyspark.sql.Window`](api/python/pyspark.sql.html#pyspark.sql.Window). It defines an aggregation from one or more `pandas.Series`
+to a scalar value, where each `pandas.Series` represents a column within the group or window.
+
+Note that this type of UDF does not support partial aggregation and all data for a group or window will be loaded into memory. Also,
+only unbounded window is supported with Grouped aggregate Pandas UDFs currently.
+
+The following example shows how to use this type of UDF to compute mean with groupBy and window operations:
+
+
+
+{% include_example grouped_agg_pandas_udf python/sql/arrow.py %}
+
+
+
+For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf)
+
 ## Usage Notes
 
 ### Supported SQL Types

http://git-wip-us.apache.org/repos/asf/spark/blob/8141d559/examples/src/main/python/sql/arrow.py
--
diff --git a/examples/src/main/python/sql/arrow.py 
b/examples/src/main/python/sql/arrow.py
index 4c5aefb..6c4510d 100644
--- a/examples/src/main/python/sql/arrow.py
+++ b/examples/src/main/python/sql/arrow.py
@@ -113,6 +113,43 @@ def grouped_map_pandas_udf_example(spark):
 # $example off:grouped_map_pandas_udf$
 
 
+def grouped_agg_pandas_udf_example(spark):
+    # $example on:grouped_agg_pandas_udf$
+    from pyspark.sql.functions import pandas_udf, PandasUDFType
+    from pyspark.sql import Window
+
+    df = spark.createDataFrame(
+        [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
+        ("id", "v"))
+
+    @pandas_udf("double", PandasUDFType.GROUPED_AGG)
+    def mean_udf(v):
+        return v.mean()
+
+df.groupby("id").agg(mean_udf(df['v'])).show()
+# +---+---+
+# | id|mean_udf(v)|
+# +---+---+
+# |  1|1.5|
+# |  2|6.0|
+# +---+---+
+
+    w = Window \
+        .partitionBy('id') \
+        .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
+    df.withColumn('mean_v', mean_udf(df['v']).over(w)).show()
+    # +---+----+------+
+    # | id|   v|mean_v|
+    # +---+----+------+
+    # |  1| 1.0|   1.5|
+    # |  1| 2.0|   1.5|
+    # |  2| 3.0|   6.0|
+    # |  2| 5.0|   6.0|
+    # |  2|10.0|   6.0|
+    # +---+----+------+
+    # $example off:grouped_agg_pandas_udf$
+
+
 if __name__ == "__main__":
     spark = SparkSession \
         .builder \
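
To make the window restriction described in the documentation change concrete, here is a minimal standalone sketch, not part of the patch. It repeats the `mean_udf` pattern from the example above over an unbounded frame; the app name and sample data are illustrative, and a Spark version that supports `PandasUDFType.GROUPED_AGG` (with PyArrow installed) is assumed.

```
# Minimal standalone sketch (not from the patch) of a grouped aggregate Pandas
# UDF over a window. Only unbounded frames are supported, as noted in the docs.
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("grouped-agg-window-sketch").getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def mean_udf(v):
    return v.mean()

# Supported: an unbounded frame over each partition.
w = Window.partitionBy("id") \
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn("mean_v", mean_udf(df["v"]).over(w)).show()

# Not supported at the time of this change: a bounded (sliding) frame such as
# Window.partitionBy("id").rowsBetween(-1, 1) would fail analysis.

spark.stop()
```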

http://git-wip-us.apache.org/repos/asf/spark/blob/8141d559/python/pyspark/sql/functions.py
--
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 0a88e48..dd7daf9 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -2810,8 +2810,9 @@ def pandas_udf(f=None, returnType=None, 
