spark git commit: [SPARK-25591][PYSPARK][SQL] Avoid overwriting deserialized accumulator
Repository: spark
Updated Branches:
  refs/heads/master 219922422 -> cb90617f8

[SPARK-25591][PYSPARK][SQL] Avoid overwriting deserialized accumulator

## What changes were proposed in this pull request?

If we use accumulators in more than one UDF, it is possible to overwrite deserialized accumulators and their values. We should check whether an accumulator was already deserialized before overwriting it in the accumulator registry.

## How was this patch tested?

Added test.

Closes #22635 from viirya/SPARK-25591.

Authored-by: Liang-Chi Hsieh
Signed-off-by: hyukjinkwon

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cb90617f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cb90617f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cb90617f

Branch: refs/heads/master
Commit: cb90617f894fd51a092710271823ec7d1cd3a668
Parents: 2199224
Author: Liang-Chi Hsieh
Authored: Mon Oct 8 15:18:08 2018 +0800
Committer: hyukjinkwon
Committed: Mon Oct 8 15:18:08 2018 +0800

----------------------------------------------------------------------
 python/pyspark/accumulators.py | 12 ++++++++----
 python/pyspark/sql/tests.py    | 25 +++++++++++++++++++++++++
 2 files changed, 33 insertions(+), 4 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/cb90617f/python/pyspark/accumulators.py

diff --git a/python/pyspark/accumulators.py b/python/pyspark/accumulators.py
index 30ad042..00ec094 100644
--- a/python/pyspark/accumulators.py
+++ b/python/pyspark/accumulators.py
@@ -109,10 +109,14 @@ _accumulatorRegistry = {}

 def _deserialize_accumulator(aid, zero_value, accum_param):
     from pyspark.accumulators import _accumulatorRegistry
-    accum = Accumulator(aid, zero_value, accum_param)
-    accum._deserialized = True
-    _accumulatorRegistry[aid] = accum
-    return accum
+    # If this certain accumulator was deserialized, don't overwrite it.
+    if aid in _accumulatorRegistry:
+        return _accumulatorRegistry[aid]
+    else:
+        accum = Accumulator(aid, zero_value, accum_param)
+        accum._deserialized = True
+        _accumulatorRegistry[aid] = accum
+        return accum


 class Accumulator(object):

http://git-wip-us.apache.org/repos/asf/spark/blob/cb90617f/python/pyspark/sql/tests.py

diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index d3c29d0..ac87ccd 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -3603,6 +3603,31 @@ class SQLTests(ReusedSQLTestCase):
         self.assertEquals(None, df._repr_html_())
         self.assertEquals(expected, df.__repr__())

+    # SPARK-25591
+    def test_same_accumulator_in_udfs(self):
+        from pyspark.sql.functions import udf
+
+        data_schema = StructType([StructField("a", IntegerType(), True),
+                                  StructField("b", IntegerType(), True)])
+        data = self.spark.createDataFrame([[1, 2]], schema=data_schema)
+
+        test_accum = self.sc.accumulator(0)
+
+        def first_udf(x):
+            test_accum.add(1)
+            return x
+
+        def second_udf(x):
+            test_accum.add(100)
+            return x
+
+        func_udf = udf(first_udf, IntegerType())
+        func_udf2 = udf(second_udf, IntegerType())
+        data = data.withColumn("out1", func_udf(data["a"]))
+        data = data.withColumn("out2", func_udf2(data["b"]))
+        data.collect()
+        self.assertEqual(test_accum.value, 101)
+

 class HiveSparkSubmitTests(SparkSubmitTests):

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
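The essence of the fix above — make deserialization idempotent by consulting a process-wide registry before creating a new accumulator — can be illustrated outside Spark with a plain dictionary. This is a hedged sketch, not PySpark's actual `Accumulator` class; the `FakeAccumulator` name and its methods are invented for the demo.

```python
# Sketch of the SPARK-25591 fix: a process-wide registry consulted before
# creating a new deserialized accumulator, so two UDFs referring to the
# same accumulator id share one object (and therefore one running value).

_registry = {}  # stand-in for pyspark.accumulators._accumulatorRegistry


class FakeAccumulator(object):
    """Hypothetical stand-in for pyspark.accumulators.Accumulator."""

    def __init__(self, aid, zero_value):
        self.aid = aid
        self.value = zero_value
        self._deserialized = False

    def add(self, term):
        self.value += term


def deserialize_accumulator(aid, zero_value):
    # If this accumulator was already deserialized, don't overwrite it:
    # returning the existing object preserves updates made by earlier UDFs.
    if aid in _registry:
        return _registry[aid]
    accum = FakeAccumulator(aid, zero_value)
    accum._deserialized = True
    _registry[aid] = accum
    return accum


# Two "UDFs" deserializing the same accumulator id now share one object;
# before the fix the second deserialization would reset the value to zero.
first = deserialize_accumulator(0, 0)
first.add(1)
second = deserialize_accumulator(0, 0)
second.add(100)
print(first is second, first.value)  # True 101
```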
spark git commit: [SPARK-25673][BUILD] Remove Travis CI which enables Java lint check
Repository: spark
Updated Branches:
  refs/heads/branch-2.4 c8b94099a -> 4214ddd34

[SPARK-25673][BUILD] Remove Travis CI which enables Java lint check

## What changes were proposed in this pull request?

https://github.com/apache/spark/pull/12980 added a Travis CI file, mainly for the linter, because the Java lint check was disabled in Jenkins. It has been enabled as of https://github.com/apache/spark/pull/21399 and SBT now runs it, so it looks like we can remove the file added before.

## How was this patch tested?

N/A

Closes #22665
Closes #22667 from HyukjinKwon/SPARK-25673.

Authored-by: hyukjinkwon
Signed-off-by: hyukjinkwon
(cherry picked from commit 219922422003e59cc8b3bece60778536759fa669)
Signed-off-by: hyukjinkwon

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4214ddd3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4214ddd3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4214ddd3

Branch: refs/heads/branch-2.4
Commit: 4214ddd34514351a58cf6a0254f33c6d5c8fd924
Parents: c8b9409
Author: hyukjinkwon
Authored: Mon Oct 8 15:07:06 2018 +0800
Committer: hyukjinkwon
Committed: Mon Oct 8 15:07:35 2018 +0800

----------------------------------------------------------------------
 .travis.yml | 50 --------------------------------------------------
 1 file changed, 50 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/4214ddd3/.travis.yml

diff --git a/.travis.yml b/.travis.yml
deleted file mode 100644
index 05b94ade..0000000
--- a/.travis.yml
+++ /dev/null
@@ -1,50 +0,0 @@
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements.  See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License.  You may obtain a copy of the License at
-#
-#    http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-# Spark provides this Travis CI configuration file to help contributors
-# check Scala/Java style conformance and JDK7/8 compilation easily
-# during their preparing pull requests.
-#   - Scalastyle is executed during `maven install` implicitly.
-#   - Java Checkstyle is executed by `lint-java`.
-# See the related discussion here.
-# https://github.com/apache/spark/pull/12980
-
-# 1. Choose OS (Ubuntu 14.04.3 LTS Server Edition 64bit, ~2 CORE, 7.5GB RAM)
-sudo: required
-dist: trusty
-
-# 2. Choose language and target JDKs for parallel builds.
-language: java
-jdk:
-  - oraclejdk8
-
-# 3. Setup cache directory for SBT and Maven.
-cache:
-  directories:
-  - $HOME/.sbt
-  - $HOME/.m2
-
-# 4. Turn off notifications.
-notifications:
-  email: false
-
-# 5. Run maven install before running lint-java.
-install:
-  - export MAVEN_SKIP_RC=1
-  - build/mvn -T 4 -q -DskipTests -Pkubernetes -Pmesos -Pyarn -Pkinesis-asl -Phive -Phive-thriftserver install
-
-# 6. Run lint-java.
-script:
-  - dev/lint-java
spark git commit: [SPARK-25673][BUILD] Remove Travis CI which enables Java lint check
Repository: spark
Updated Branches:
  refs/heads/master ebd899b8a -> 219922422

[SPARK-25673][BUILD] Remove Travis CI which enables Java lint check

## What changes were proposed in this pull request?

https://github.com/apache/spark/pull/12980 added a Travis CI file, mainly for the linter, because the Java lint check was disabled in Jenkins. It has been enabled as of https://github.com/apache/spark/pull/21399 and SBT now runs it, so it looks like we can remove the file added before.

## How was this patch tested?

N/A

Closes #22665
Closes #22667 from HyukjinKwon/SPARK-25673.

Authored-by: hyukjinkwon
Signed-off-by: hyukjinkwon

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/21992242
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/21992242
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/21992242

Branch: refs/heads/master
Commit: 219922422003e59cc8b3bece60778536759fa669
Parents: ebd899b
Author: hyukjinkwon
Authored: Mon Oct 8 15:07:06 2018 +0800
Committer: hyukjinkwon
Committed: Mon Oct 8 15:07:06 2018 +0800

----------------------------------------------------------------------
 .travis.yml | 50 --------------------------------------------------
 1 file changed, 50 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/21992242/.travis.yml

diff --git a/.travis.yml b/.travis.yml
deleted file mode 100644
index 05b94ade..0000000
--- a/.travis.yml
+++ /dev/null
@@ -1,50 +0,0 @@
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements.  See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License.  You may obtain a copy of the License at
-#
-#    http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-# Spark provides this Travis CI configuration file to help contributors
-# check Scala/Java style conformance and JDK7/8 compilation easily
-# during their preparing pull requests.
-#   - Scalastyle is executed during `maven install` implicitly.
-#   - Java Checkstyle is executed by `lint-java`.
-# See the related discussion here.
-# https://github.com/apache/spark/pull/12980
-
-# 1. Choose OS (Ubuntu 14.04.3 LTS Server Edition 64bit, ~2 CORE, 7.5GB RAM)
-sudo: required
-dist: trusty
-
-# 2. Choose language and target JDKs for parallel builds.
-language: java
-jdk:
-  - oraclejdk8
-
-# 3. Setup cache directory for SBT and Maven.
-cache:
-  directories:
-  - $HOME/.sbt
-  - $HOME/.m2
-
-# 4. Turn off notifications.
-notifications:
-  email: false
-
-# 5. Run maven install before running lint-java.
-install:
-  - export MAVEN_SKIP_RC=1
-  - build/mvn -T 4 -q -DskipTests -Pkubernetes -Pmesos -Pyarn -Pkinesis-asl -Phive -Phive-thriftserver install
-
-# 6. Run lint-java.
-script:
-  - dev/lint-java
spark git commit: [SPARK-25591][PYSPARK][SQL] Avoid overwriting deserialized accumulator
Repository: spark
Updated Branches:
  refs/heads/branch-2.4 4214ddd34 -> 692ddb3f9

[SPARK-25591][PYSPARK][SQL] Avoid overwriting deserialized accumulator

## What changes were proposed in this pull request?

If we use accumulators in more than one UDF, it is possible to overwrite deserialized accumulators and their values. We should check whether an accumulator was already deserialized before overwriting it in the accumulator registry.

## How was this patch tested?

Added test.

Closes #22635 from viirya/SPARK-25591.

Authored-by: Liang-Chi Hsieh
Signed-off-by: hyukjinkwon
(cherry picked from commit cb90617f894fd51a092710271823ec7d1cd3a668)
Signed-off-by: hyukjinkwon

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/692ddb3f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/692ddb3f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/692ddb3f

Branch: refs/heads/branch-2.4
Commit: 692ddb3f92ad6ee5ceca2f5ee4ea67d636c32d88
Parents: 4214ddd
Author: Liang-Chi Hsieh
Authored: Mon Oct 8 15:18:08 2018 +0800
Committer: hyukjinkwon
Committed: Mon Oct 8 15:18:27 2018 +0800

----------------------------------------------------------------------
 python/pyspark/accumulators.py | 12 ++++++++----
 python/pyspark/sql/tests.py    | 25 +++++++++++++++++++++++++
 2 files changed, 33 insertions(+), 4 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/692ddb3f/python/pyspark/accumulators.py

diff --git a/python/pyspark/accumulators.py b/python/pyspark/accumulators.py
index 30ad042..00ec094 100644
--- a/python/pyspark/accumulators.py
+++ b/python/pyspark/accumulators.py
@@ -109,10 +109,14 @@ _accumulatorRegistry = {}

 def _deserialize_accumulator(aid, zero_value, accum_param):
     from pyspark.accumulators import _accumulatorRegistry
-    accum = Accumulator(aid, zero_value, accum_param)
-    accum._deserialized = True
-    _accumulatorRegistry[aid] = accum
-    return accum
+    # If this certain accumulator was deserialized, don't overwrite it.
+    if aid in _accumulatorRegistry:
+        return _accumulatorRegistry[aid]
+    else:
+        accum = Accumulator(aid, zero_value, accum_param)
+        accum._deserialized = True
+        _accumulatorRegistry[aid] = accum
+        return accum


 class Accumulator(object):

http://git-wip-us.apache.org/repos/asf/spark/blob/692ddb3f/python/pyspark/sql/tests.py

diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index e991032..b05de54 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -3556,6 +3556,31 @@ class SQLTests(ReusedSQLTestCase):
         self.assertEquals(None, df._repr_html_())
         self.assertEquals(expected, df.__repr__())

+    # SPARK-25591
+    def test_same_accumulator_in_udfs(self):
+        from pyspark.sql.functions import udf
+
+        data_schema = StructType([StructField("a", IntegerType(), True),
+                                  StructField("b", IntegerType(), True)])
+        data = self.spark.createDataFrame([[1, 2]], schema=data_schema)
+
+        test_accum = self.sc.accumulator(0)
+
+        def first_udf(x):
+            test_accum.add(1)
+            return x
+
+        def second_udf(x):
+            test_accum.add(100)
+            return x
+
+        func_udf = udf(first_udf, IntegerType())
+        func_udf2 = udf(second_udf, IntegerType())
+        data = data.withColumn("out1", func_udf(data["a"]))
+        data = data.withColumn("out2", func_udf2(data["b"]))
+        data.collect()
+        self.assertEqual(test_accum.value, 101)
+

 class HiveSparkSubmitTests(SparkSubmitTests):
spark git commit: [SPARK-25677][DOC] spark.io.compression.codec = org.apache.spark.io.ZstdCompressionCodec throwing IllegalArgumentException Exception
Repository: spark
Updated Branches:
  refs/heads/branch-2.4 692ddb3f9 -> 193ce77fc

[SPARK-25677][DOC] spark.io.compression.codec = org.apache.spark.io.ZstdCompressionCodec throwing IllegalArgumentException Exception

## What changes were proposed in this pull request?

The documentation is updated with the proper class name, org.apache.spark.io.ZStdCompressionCodec.

## How was this patch tested?

We set spark.io.compression.codec = org.apache.spark.io.ZStdCompressionCodec and verified the logs.

Closes #22669 from shivusondur/CompressionIssue.

Authored-by: shivusondur
Signed-off-by: hyukjinkwon
(cherry picked from commit 1a6815cd9f421a106f8d96a36a53042a00f02386)
Signed-off-by: hyukjinkwon

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/193ce77f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/193ce77f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/193ce77f

Branch: refs/heads/branch-2.4
Commit: 193ce77fccf54cfdacdc011db13655c28e524458
Parents: 692ddb3
Author: shivusondur
Authored: Mon Oct 8 15:43:08 2018 +0800
Committer: hyukjinkwon
Committed: Mon Oct 8 15:43:35 2018 +0800

----------------------------------------------------------------------
 docs/configuration.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/193ce77f/docs/configuration.md

diff --git a/docs/configuration.md b/docs/configuration.md
index 5577393..613e214 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -954,7 +954,7 @@ Apart from these, the following properties are also available, and may be useful
     org.apache.spark.io.LZ4CompressionCodec,
     org.apache.spark.io.LZFCompressionCodec,
     org.apache.spark.io.SnappyCompressionCodec,
-    and org.apache.spark.io.ZstdCompressionCodec.
+    and org.apache.spark.io.ZStdCompressionCodec.
spark git commit: [SPARK-25677][DOC] spark.io.compression.codec = org.apache.spark.io.ZstdCompressionCodec throwing IllegalArgumentException Exception
Repository: spark
Updated Branches:
  refs/heads/master cb90617f8 -> 1a6815cd9

[SPARK-25677][DOC] spark.io.compression.codec = org.apache.spark.io.ZstdCompressionCodec throwing IllegalArgumentException Exception

## What changes were proposed in this pull request?

The documentation is updated with the proper class name, org.apache.spark.io.ZStdCompressionCodec.

## How was this patch tested?

We set spark.io.compression.codec = org.apache.spark.io.ZStdCompressionCodec and verified the logs.

Closes #22669 from shivusondur/CompressionIssue.

Authored-by: shivusondur
Signed-off-by: hyukjinkwon

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1a6815cd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1a6815cd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1a6815cd

Branch: refs/heads/master
Commit: 1a6815cd9f421a106f8d96a36a53042a00f02386
Parents: cb90617
Author: shivusondur
Authored: Mon Oct 8 15:43:08 2018 +0800
Committer: hyukjinkwon
Committed: Mon Oct 8 15:43:08 2018 +0800

----------------------------------------------------------------------
 docs/configuration.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/1a6815cd/docs/configuration.md

diff --git a/docs/configuration.md b/docs/configuration.md
index 5577393..613e214 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -954,7 +954,7 @@ Apart from these, the following properties are also available, and may be useful
     org.apache.spark.io.LZ4CompressionCodec,
     org.apache.spark.io.LZFCompressionCodec,
     org.apache.spark.io.SnappyCompressionCodec,
-    and org.apache.spark.io.ZstdCompressionCodec.
+    and org.apache.spark.io.ZStdCompressionCodec.
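For readers who hit the IllegalArgumentException mentioned in the title: the casing matters, and the class is `ZStdCompressionCodec` (capital S). A hedged example of setting the codec — the application JAR path here is a placeholder:

```shell
# Full class name, with the corrected casing from this commit:
spark-submit \
  --conf spark.io.compression.codec=org.apache.spark.io.ZStdCompressionCodec \
  my-app.jar   # placeholder application jar

# Spark also accepts a short alias for its built-in codecs, which
# sidesteps the casing issue entirely:
#   --conf spark.io.compression.codec=zstd
```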
spark git commit: [SPARK-25666][PYTHON] Internally document type conversion between Python data and SQL types in normal UDFs
Repository: spark
Updated Branches:
  refs/heads/master 1a6815cd9 -> a853a8020

[SPARK-25666][PYTHON] Internally document type conversion between Python data and SQL types in normal UDFs

### What changes were proposed in this pull request?

We are facing some problems with type conversions between Python data and SQL types in UDFs (Pandas UDFs as well). It's even difficult to identify the problems (see https://github.com/apache/spark/pull/20163 and https://github.com/apache/spark/pull/22610).

This PR targets to internally document the type conversion table. Some of the conversions look buggy and we should fix them.

```python
import sys
import array
import datetime
from decimal import Decimal

from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql.functions import udf

if sys.version >= '3':
    long = int

data = [
    None, True, 1, long(1), "a", u"a", datetime.date(1970, 1, 1),
    datetime.datetime(1970, 1, 1, 0, 0), 1.0, array.array("i", [1]), [1],
    (1,), bytearray([65, 66, 67]), Decimal(1), {"a": 1}, Row(kwargs=1),
    Row("namedtuple")(1),
]

types = [
    BooleanType(), ByteType(), ShortType(), IntegerType(), LongType(),
    StringType(), DateType(), TimestampType(), FloatType(), DoubleType(),
    ArrayType(IntegerType()), BinaryType(), DecimalType(10, 0),
    MapType(StringType(), IntegerType()),
    StructType([StructField("_1", IntegerType())]),
]

df = spark.range(1)
results = []
count = 0
total = len(types) * len(data)
spark.sparkContext.setLogLevel("FATAL")
for t in types:
    result = []
    for v in data:
        try:
            row = df.select(udf(lambda: v, t)()).first()
            ret_str = repr(row[0])
        except Exception:
            ret_str = "X"
        result.append(ret_str)
        progress = "SQL Type: [%s]\n  Python Value: [%s(%s)]\n  Result Python Value: [%s]" % (
            t.simpleString(), str(v), type(v).__name__, ret_str)
        count += 1
        print("%s/%s:\n  %s" % (count, total, progress))
    results.append([t.simpleString()] + list(map(str, result)))

schema = ["SQL Type \\ Python Value(Type)"] + list(map(
    lambda v: "%s(%s)" % (str(v), type(v).__name__), data))
strings = spark.createDataFrame(results, schema=schema)._jdf.showString(20, 20, False)
print("\n".join(map(lambda line: "# %s  # noqa" % line, strings.strip().split("\n"))))
```

This table was generated under Python 2, but the code above is Python 3 compatible as well.

## How was this patch tested?

Manually tested and lint check.

Closes #22655 from HyukjinKwon/SPARK-25666.

Authored-by: hyukjinkwon
Signed-off-by: hyukjinkwon

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a853a802
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a853a802
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a853a802

Branch: refs/heads/master
Commit: a853a80202032083ad411eec5ec97b304f732a61
Parents: 1a6815c
Author: hyukjinkwon
Authored: Mon Oct 8 15:47:15 2018 +0800
Committer: hyukjinkwon
Committed: Mon Oct 8 15:47:15 2018 +0800

----------------------------------------------------------------------
 python/pyspark/sql/functions.py | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/a853a802/python/pyspark/sql/functions.py

diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index be089ee..5425d31 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -2733,6 +2733,39 @@ def udf(f=None, returnType=StringType()):
     | 8|      JOHN DOE|          22|
     +--+--------------+------------+
     """
+
+    # The following table shows most of Python data and SQL type conversions in normal UDFs that
+    # are not yet visible to the user. Some of behaviors are buggy and might be changed in the near
+    # future. The table might have to be eventually documented externally.
+    # Please see SPARK-25666's PR to see the codes in order to generate the table below.
+    #
+    # +-+--+--+--+---+---+---++-+--+--+-++-++--+--+--+  # noqa
+    # |SQL Type \ Python Value(Type)|None(NoneType)|True(bool)|1(int)|1(long)| a(str)| a(unicode)|1970-01-01(date)|1970-01-01 00:00:00(datetime)|1.0(float)|array('i', [1])(array)|[1](list)| (1,)(tuple)| ABC(bytearray)| 1(Decimal)|{'a': 1}(dict)|Row(kwargs=1)(Row)|Row(namedtuple=1)(Row)|  # noqa
+    #
spark git commit: [SPARK-25684][SQL] Organize header related codes in CSV datasource
Repository: spark
Updated Branches:
  refs/heads/master a00181418 -> 39872af88

[SPARK-25684][SQL] Organize header related codes in CSV datasource

## What changes were proposed in this pull request?

1. Move `CSVDataSource.makeSafeHeader` to `CSVUtils.makeSafeHeader` (as is).
   - Historically, and at the first place of refactoring (which I did), I intended to put all CSV-specific handling (like options), filtering, extracting the header, etc. here.
   - See `JsonDataSource`. Now `CSVDataSource` is quite consistent with `JsonDataSource`. Since CSV's code path is quite complicated, we should match them as much as possible.

2. Create `CSVHeaderChecker` and put the `enforceSchema` logic into it.
   - The header-checking and column-pruning code was added (per https://github.com/apache/spark/pull/20894 and https://github.com/apache/spark/pull/21296), but some of the code, such as https://github.com/apache/spark/pull/22123, is duplicated.
   - Also, the header-checking code is scattered here and there. We should put it in a single place, since the current arrangement is quite error-prone. See https://github.com/apache/spark/pull/22656.

3. Move `CSVDataSource.checkHeaderColumnNames` to `CSVHeaderChecker.checkHeaderColumnNames` (as is).
   - Similar reasons as in 1.

## How was this patch tested?

Existing tests should cover this.

Closes #22676 from HyukjinKwon/refactoring-csv.

Authored-by: hyukjinkwon
Signed-off-by: hyukjinkwon

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/39872af8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/39872af8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/39872af8

Branch: refs/heads/master
Commit: 39872af882e3d73667acfab93c9de962c9c8939d
Parents: a001814
Author: hyukjinkwon
Authored: Fri Oct 12 09:16:41 2018 +0800
Committer: hyukjinkwon
Committed: Fri Oct 12 09:16:41 2018 +0800

----------------------------------------------------------------------
 .../org/apache/spark/sql/DataFrameReader.scala  |  18 +--
 .../datasources/csv/CSVDataSource.scala         | 161 ++-----------------
 .../datasources/csv/CSVFileFormat.scala         |  11 +-
 .../datasources/csv/CSVHeaderChecker.scala      | 131 +++++++++++++++
 .../execution/datasources/csv/CSVUtils.scala    |  44 ++++-
 .../datasources/csv/UnivocityParser.scala       |  34 ++--
 6 files changed, 217 insertions(+), 182 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/39872af8/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
index 7269446..3af70b5 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -505,20 +505,14 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
     val actualSchema =
       StructType(schema.filterNot(_.name == parsedOptions.columnNameOfCorruptRecord))

-    val linesWithoutHeader = if (parsedOptions.headerFlag && maybeFirstLine.isDefined) {
-      val firstLine = maybeFirstLine.get
-      val parser = new CsvParser(parsedOptions.asParserSettings)
-      val columnNames = parser.parseLine(firstLine)
-      CSVDataSource.checkHeaderColumnNames(
+    val linesWithoutHeader: RDD[String] = maybeFirstLine.map { firstLine =>
+      val headerChecker = new CSVHeaderChecker(
         actualSchema,
-        columnNames,
-        csvDataset.getClass.getCanonicalName,
-        parsedOptions.enforceSchema,
-        sparkSession.sessionState.conf.caseSensitiveAnalysis)
+        parsedOptions,
+        source = s"CSV source: $csvDataset")
+      headerChecker.checkHeaderColumnNames(firstLine)
       filteredLines.rdd.mapPartitions(CSVUtils.filterHeaderLine(_, firstLine, parsedOptions))
-    } else {
-      filteredLines.rdd
-    }
+    }.getOrElse(filteredLines.rdd)

     val parsed = linesWithoutHeader.mapPartitions { iter =>
       val rawParser = new UnivocityParser(actualSchema, parsedOptions)

http://git-wip-us.apache.org/repos/asf/spark/blob/39872af8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
index b93f418..0b5a719 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
@@ -51,11 +51,8 @@ abstract class
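The responsibility being centralized here — compare a CSV header row against the expected schema, honoring an `enforceSchema` flag and case sensitivity — can be sketched in a few lines. This is a hedged Python illustration of the check's logic, not the Scala `CSVHeaderChecker` itself; the function and parameter names are invented.

```python
def check_header_column_names(schema_names, header, enforce_schema=True,
                              case_sensitive=False):
    """Sketch of the header-vs-schema check that CSVHeaderChecker centralizes.

    With enforce_schema=True the declared schema wins and mismatches are only
    reported (Spark logs them as warnings); with enforce_schema=False any
    name mismatch is an error.
    """
    problems = []
    if len(header) != len(schema_names):
        problems.append("CSV header has %d columns, schema has %d"
                        % (len(header), len(schema_names)))
    norm = (lambda s: s) if case_sensitive else (lambda s: s.lower())
    for i, (got, want) in enumerate(zip(header, schema_names)):
        if norm(got) != norm(want):
            problems.append("column %d: CSV file is '%s', schema is '%s'"
                            % (i, got, want))
    if problems and not enforce_schema:
        raise ValueError("; ".join(problems))
    return problems  # callers would surface these as warnings


# enforceSchema=true: a mismatch is tolerated and only reported; note that
# with case_sensitive=False "a" vs "A" is not a mismatch at all.
warnings = check_header_column_names(["a", "b"], ["A", "c"])
print(warnings)
```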
spark git commit: [SPARK-25372][YARN][K8S][FOLLOW-UP] Deprecate and generalize keytab / principal config
Repository: spark
Updated Branches:
  refs/heads/master 6c3f2c6a6 -> 9426fd0c2

[SPARK-25372][YARN][K8S][FOLLOW-UP] Deprecate and generalize keytab / principal config

## What changes were proposed in this pull request?

Update the next version of Spark from 2.5 to 3.0.

## How was this patch tested?

N/A

Closes #22717 from gatorsmile/followupSPARK-25372.

Authored-by: gatorsmile
Signed-off-by: hyukjinkwon

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9426fd0c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9426fd0c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9426fd0c

Branch: refs/heads/master
Commit: 9426fd0c244480e52881e4bc8b36bd261ec851c7
Parents: 6c3f2c6
Author: gatorsmile
Authored: Sun Oct 14 15:20:01 2018 +0800
Committer: hyukjinkwon
Committed: Sun Oct 14 15:20:01 2018 +0800

----------------------------------------------------------------------
 core/src/main/scala/org/apache/spark/SparkConf.scala | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/9426fd0c/core/src/main/scala/org/apache/spark/SparkConf.scala

diff --git a/core/src/main/scala/org/apache/spark/SparkConf.scala b/core/src/main/scala/org/apache/spark/SparkConf.scala
index 81aa31d..5166543 100644
--- a/core/src/main/scala/org/apache/spark/SparkConf.scala
+++ b/core/src/main/scala/org/apache/spark/SparkConf.scala
@@ -729,9 +729,9 @@ private[spark] object SparkConf extends Logging {
     EXECUTOR_MEMORY_OVERHEAD.key -> Seq(
       AlternateConfig("spark.yarn.executor.memoryOverhead", "2.3")),
     KEYTAB.key -> Seq(
-      AlternateConfig("spark.yarn.keytab", "2.5")),
+      AlternateConfig("spark.yarn.keytab", "3.0")),
     PRINCIPAL.key -> Seq(
-      AlternateConfig("spark.yarn.principal", "2.5"))
+      AlternateConfig("spark.yarn.principal", "3.0"))
   )

   /**
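The `AlternateConfig` mechanism touched by this commit maps a deprecated key (e.g. `spark.yarn.keytab`) to its replacement and records the version in which the rename happened, so reads of the new key fall back to the old one with a deprecation warning. A hedged Python sketch of that lookup pattern follows — the generalized key names are assumptions for the demo, and this is not Spark's Scala implementation:

```python
import warnings

# Map: current key -> (deprecated alternate key, version of the rename).
# The generalized key names below are assumed for illustration.
ALTERNATES = {
    "spark.kerberos.keytab": ("spark.yarn.keytab", "3.0"),
    "spark.kerberos.principal": ("spark.yarn.principal", "3.0"),
}


def get_conf(settings, key, default=None):
    """Read `key`, falling back to its deprecated alternate with a warning."""
    if key in settings:
        return settings[key]
    alt = ALTERNATES.get(key)
    if alt and alt[0] in settings:
        old_key, since = alt
        warnings.warn("%s is deprecated as of Spark %s; use %s instead"
                      % (old_key, since, key))
        return settings[old_key]
    return default


# Old configurations keep working: the deprecated key is picked up when the
# new one is absent, and ignored when the new one is set.
conf = {"spark.yarn.keytab": "/etc/security/app.keytab"}
print(get_conf(conf, "spark.kerberos.keytab"))  # /etc/security/app.keytab
```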
spark git commit: [SPARK-25629][TEST] Reduce ParquetFilterSuite: filter pushdown test time costs in Jenkins
Repository: spark Updated Branches: refs/heads/master fdaa99897 -> 5c7f6b663 [SPARK-25629][TEST] Reduce ParquetFilterSuite: filter pushdown test time costs in Jenkins ## What changes were proposed in this pull request? Only test these 4 cases is enough: https://github.com/apache/spark/blob/be2238fb502b0f49a8a1baa6da9bc3e99540b40e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L269-L279 ## How was this patch tested? Manual tests on my local machine. before: ``` - filter pushdown - decimal (13 seconds, 683 milliseconds) ``` after: ``` - filter pushdown - decimal (9 seconds, 713 milliseconds) ``` Closes #22636 from wangyum/SPARK-25629. Authored-by: Yuming Wang Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5c7f6b66 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5c7f6b66 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5c7f6b66 Branch: refs/heads/master Commit: 5c7f6b66368a956accfc34636c84ca3825f8d0b1 Parents: fdaa998 Author: Yuming Wang Authored: Tue Oct 16 12:30:02 2018 +0800 Committer: hyukjinkwon Committed: Tue Oct 16 12:30:02 2018 +0800 -- .../parquet/ParquetFilterSuite.scala| 67 ++-- 1 file changed, 33 insertions(+), 34 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5c7f6b66/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala index 01e41b3..9cfc943 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala @@ -524,41 +524,40 @@ class ParquetFilterSuite extends 
QueryTest with ParquetTest with SharedSQLContex } test("filter pushdown - decimal") { -Seq(true, false).foreach { legacyFormat => +Seq( + (false, Decimal.MAX_INT_DIGITS), // int32Writer + (false, Decimal.MAX_LONG_DIGITS), // int64Writer + (true, Decimal.MAX_LONG_DIGITS), // binaryWriterUsingUnscaledLong + (false, DecimalType.MAX_PRECISION) // binaryWriterUsingUnscaledBytes +).foreach { case (legacyFormat, precision) => withSQLConf(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key -> legacyFormat.toString) { -Seq( - s"a decimal(${Decimal.MAX_INT_DIGITS}, 2)", // 32BitDecimalType - s"a decimal(${Decimal.MAX_LONG_DIGITS}, 2)", // 64BitDecimalType - "a decimal(38, 18)" // ByteArrayDecimalType -).foreach { schemaDDL => - val schema = StructType.fromDDL(schemaDDL) - val rdd = -spark.sparkContext.parallelize((1 to 4).map(i => Row(new java.math.BigDecimal(i - val dataFrame = spark.createDataFrame(rdd, schema) - testDecimalPushDown(dataFrame) { implicit df => -assert(df.schema === schema) -checkFilterPredicate('a.isNull, classOf[Eq[_]], Seq.empty[Row]) -checkFilterPredicate('a.isNotNull, classOf[NotEq[_]], (1 to 4).map(Row.apply(_))) - -checkFilterPredicate('a === 1, classOf[Eq[_]], 1) -checkFilterPredicate('a <=> 1, classOf[Eq[_]], 1) -checkFilterPredicate('a =!= 1, classOf[NotEq[_]], (2 to 4).map(Row.apply(_))) - -checkFilterPredicate('a < 2, classOf[Lt[_]], 1) -checkFilterPredicate('a > 3, classOf[Gt[_]], 4) -checkFilterPredicate('a <= 1, classOf[LtEq[_]], 1) -checkFilterPredicate('a >= 4, classOf[GtEq[_]], 4) - -checkFilterPredicate(Literal(1) === 'a, classOf[Eq[_]], 1) -checkFilterPredicate(Literal(1) <=> 'a, classOf[Eq[_]], 1) -checkFilterPredicate(Literal(2) > 'a, classOf[Lt[_]], 1) -checkFilterPredicate(Literal(3) < 'a, classOf[Gt[_]], 4) -checkFilterPredicate(Literal(1) >= 'a, classOf[LtEq[_]], 1) -checkFilterPredicate(Literal(4) <= 'a, classOf[GtEq[_]], 4) - -checkFilterPredicate(!('a < 4), classOf[GtEq[_]], 4) -checkFilterPredicate('a < 2 || 'a > 3, classOf[Operators.Or], 
Seq(Row(1), Row(4))) - } +val schema = StructType.fromDDL(s"a decimal($precision, 2)") +val rdd = + spark.sparkContext.parallelize((1 to 4).map(i => Row(new java.math.BigDecimal(i +val dataFrame = spark.createDataFrame(rdd, schema) +testDecimalPushDown(dataFrame) { implicit df => +
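For context on the rewritten test matrix above: each `(legacyFormat, precision)` tuple is chosen to exercise one Parquet decimal writer. A plain-Python sketch of the mapping follows (the digit limits and writer names come from the test's own comments; the selection logic is an assumption, not Spark's actual code):

```python
# Sketch, NOT Spark's implementation: map a decimal's precision and the
# legacy-format flag to the Parquet writer named in the test comments.
MAX_INT_DIGITS = 9    # Decimal.MAX_INT_DIGITS: unscaled value fits an int32
MAX_LONG_DIGITS = 18  # Decimal.MAX_LONG_DIGITS: unscaled value fits an int64
MAX_PRECISION = 38    # DecimalType.MAX_PRECISION

def decimal_writer(precision, legacy_format=False):
    if legacy_format:
        # The legacy format always writes fixed-length binary.
        if precision <= MAX_LONG_DIGITS:
            return "binaryWriterUsingUnscaledLong"
        return "binaryWriterUsingUnscaledBytes"
    if precision <= MAX_INT_DIGITS:
        return "int32Writer"
    if precision <= MAX_LONG_DIGITS:
        return "int64Writer"
    return "binaryWriterUsingUnscaledBytes"
```

The four tuples in the rewritten `Seq` hit exactly these four branches, which is why the old three-schema loop could be collapsed.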
spark git commit: [SPARK-25736][SQL][TEST] add tests to verify the behavior of multi-column count
Repository: spark Updated Branches: refs/heads/master 5c7f6b663 -> e028fd3ae [SPARK-25736][SQL][TEST] add tests to verify the behavior of multi-column count ## What changes were proposed in this pull request? AFAIK multi-column count is not widely supported by the mainstream databases (PostgreSQL doesn't support it), and the SQL standard doesn't define it clearly, as near as I can tell. Since Spark supports it, we should clearly document the current behavior and add tests to verify it. ## How was this patch tested? N/A Closes #22728 from cloud-fan/doc. Authored-by: Wenchen Fan Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e028fd3a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e028fd3a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e028fd3a Branch: refs/heads/master Commit: e028fd3aed9e5e4c478f307f0a467b54b73ff0d5 Parents: 5c7f6b6 Author: Wenchen Fan Authored: Tue Oct 16 15:13:01 2018 +0800 Committer: hyukjinkwon Committed: Tue Oct 16 15:13:01 2018 +0800 -- .../catalyst/expressions/aggregate/Count.scala | 2 +- .../test/resources/sql-tests/inputs/count.sql | 27 ++ .../resources/sql-tests/results/count.sql.out | 55 3 files changed, 83 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e028fd3a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala index 40582d0..8cab8e4 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala @@ -52,7 +52,7 @@ abstract class CountLike extends DeclarativeAggregate { usage = """ _FUNC_(*) - Returns the total number of
retrieved rows, including rows containing null. -_FUNC_(expr) - Returns the number of rows for which the supplied expression is non-null. +_FUNC_(expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are all non-null. _FUNC_(DISTINCT expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are unique and non-null. """) http://git-wip-us.apache.org/repos/asf/spark/blob/e028fd3a/sql/core/src/test/resources/sql-tests/inputs/count.sql -- diff --git a/sql/core/src/test/resources/sql-tests/inputs/count.sql b/sql/core/src/test/resources/sql-tests/inputs/count.sql new file mode 100644 index 000..9f9ee4a --- /dev/null +++ b/sql/core/src/test/resources/sql-tests/inputs/count.sql @@ -0,0 +1,27 @@ +-- Test data. +CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES +(1, 1), (1, 2), (2, 1), (1, 1), (null, 2), (1, null), (null, null) +AS testData(a, b); + +-- count with single expression +SELECT + count(*), count(1), count(null), count(a), count(b), count(a + b), count((a, b)) +FROM testData; + +-- distinct count with single expression +SELECT + count(DISTINCT 1), + count(DISTINCT null), + count(DISTINCT a), + count(DISTINCT b), + count(DISTINCT (a + b)), + count(DISTINCT (a, b)) +FROM testData; + +-- count with multiple expressions +SELECT count(a, b), count(b, a), count(testData.*) FROM testData; + +-- distinct count with multiple expressions +SELECT + count(DISTINCT a, b), count(DISTINCT b, a), count(DISTINCT *), count(DISTINCT testData.*) +FROM testData; http://git-wip-us.apache.org/repos/asf/spark/blob/e028fd3a/sql/core/src/test/resources/sql-tests/results/count.sql.out -- diff --git a/sql/core/src/test/resources/sql-tests/results/count.sql.out b/sql/core/src/test/resources/sql-tests/results/count.sql.out new file mode 100644 index 000..b8a86d4 --- /dev/null +++ b/sql/core/src/test/resources/sql-tests/results/count.sql.out @@ -0,0 +1,55 @@ +-- Automatically generated by SQLQueryTestSuite +-- Number of 
queries: 5 + + +-- !query 0 +CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES +(1, 1), (1, 2), (2, 1), (1, 1), (null, 2), (1, null), (null, null) +AS testData(a, b) +-- !query 0 schema +struct<> +-- !query 0 output + + + +-- !query 1 +SELECT + count(*), count(1), count(null), count(a), count(b), count(a + b), count((a, b)) +FROM testData +-- !query 1 schema +struct +-- !query 1 output +7 7 0 5 5 4 7 + + +-- !query 2 +SELECT + count(DISTINCT 1), + count(DISTINCT null), +
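The expected outputs above pin down the multi-column count semantics. They can be sketched in plain Python (no Spark required; the rows mirror the test data, and the DISTINCT behavior follows the documented "unique and non-null" rule):

```python
# Plain-Python sketch of the multi-column count semantics the new
# count.sql test verifies; rows mirror the test data (columns a, b).
rows = [(1, 1), (1, 2), (2, 1), (1, 1), (None, 2), (1, None), (None, None)]

def count_multi(rows, *cols):
    """count(a, b, ...): rows where every listed column is non-null."""
    return sum(1 for r in rows if all(r[c] is not None for c in cols))

def count_distinct_multi(rows, *cols):
    """count(DISTINCT a, b, ...): distinct tuples with no null component."""
    return len({tuple(r[c] for c in cols)
                for r in rows if all(r[c] is not None for c in cols)})

assert count_multi(rows, 0) == 5       # count(a)
assert count_multi(rows, 1) == 5       # count(b)
assert count_multi(rows, 0, 1) == 4    # count(a, b): both non-null
# count((a, b)) wraps the columns in a struct, and the struct itself is
# never null, so it counts every row, like count(*):
assert len(rows) == 7
```

This reproduces the `7 7 0 5 5 4 7` row of query 1 above.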
spark git commit: [SPARK-25736][SQL][TEST] add tests to verify the behavior of multi-column count
Repository: spark Updated Branches: refs/heads/branch-2.4 8bc7ab03d -> 77156f8c8 [SPARK-25736][SQL][TEST] add tests to verify the behavior of multi-column count ## What changes were proposed in this pull request? AFAIK multi-column count is not widely supported by the mainstream databases (PostgreSQL doesn't support it), and the SQL standard doesn't define it clearly, as near as I can tell. Since Spark supports it, we should clearly document the current behavior and add tests to verify it. ## How was this patch tested? N/A Closes #22728 from cloud-fan/doc. Authored-by: Wenchen Fan Signed-off-by: hyukjinkwon (cherry picked from commit e028fd3aed9e5e4c478f307f0a467b54b73ff0d5) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/77156f8c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/77156f8c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/77156f8c Branch: refs/heads/branch-2.4 Commit: 77156f8c81720ec7364b386a95ef1b30713fe55c Parents: 8bc7ab0 Author: Wenchen Fan Authored: Tue Oct 16 15:13:01 2018 +0800 Committer: hyukjinkwon Committed: Tue Oct 16 15:13:19 2018 +0800 -- .../catalyst/expressions/aggregate/Count.scala | 2 +- .../test/resources/sql-tests/inputs/count.sql | 27 ++ .../resources/sql-tests/results/count.sql.out | 55 3 files changed, 83 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/77156f8c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala index 40582d0..8cab8e4 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala @@ -52,7 +52,7 @@
abstract class CountLike extends DeclarativeAggregate { usage = """ _FUNC_(*) - Returns the total number of retrieved rows, including rows containing null. -_FUNC_(expr) - Returns the number of rows for which the supplied expression is non-null. +_FUNC_(expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are all non-null. _FUNC_(DISTINCT expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are unique and non-null. """) http://git-wip-us.apache.org/repos/asf/spark/blob/77156f8c/sql/core/src/test/resources/sql-tests/inputs/count.sql -- diff --git a/sql/core/src/test/resources/sql-tests/inputs/count.sql b/sql/core/src/test/resources/sql-tests/inputs/count.sql new file mode 100644 index 000..9f9ee4a --- /dev/null +++ b/sql/core/src/test/resources/sql-tests/inputs/count.sql @@ -0,0 +1,27 @@ +-- Test data. +CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES +(1, 1), (1, 2), (2, 1), (1, 1), (null, 2), (1, null), (null, null) +AS testData(a, b); + +-- count with single expression +SELECT + count(*), count(1), count(null), count(a), count(b), count(a + b), count((a, b)) +FROM testData; + +-- distinct count with single expression +SELECT + count(DISTINCT 1), + count(DISTINCT null), + count(DISTINCT a), + count(DISTINCT b), + count(DISTINCT (a + b)), + count(DISTINCT (a, b)) +FROM testData; + +-- count with multiple expressions +SELECT count(a, b), count(b, a), count(testData.*) FROM testData; + +-- distinct count with multiple expressions +SELECT + count(DISTINCT a, b), count(DISTINCT b, a), count(DISTINCT *), count(DISTINCT testData.*) +FROM testData; http://git-wip-us.apache.org/repos/asf/spark/blob/77156f8c/sql/core/src/test/resources/sql-tests/results/count.sql.out -- diff --git a/sql/core/src/test/resources/sql-tests/results/count.sql.out b/sql/core/src/test/resources/sql-tests/results/count.sql.out new file mode 100644 index 000..b8a86d4 --- /dev/null +++ 
b/sql/core/src/test/resources/sql-tests/results/count.sql.out @@ -0,0 +1,55 @@ +-- Automatically generated by SQLQueryTestSuite +-- Number of queries: 5 + + +-- !query 0 +CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES +(1, 1), (1, 2), (2, 1), (1, 1), (null, 2), (1, null), (null, null) +AS testData(a, b) +-- !query 0 schema +struct<> +-- !query 0 output + + + +-- !query 1 +SELECT + count(*), count(1), count(null), count(a), count(b), count(a + b), count((a, b)) +FROM testData +-- !query 1 schema +struct +-- !query 1 output +7 7 0 5
spark git commit: [SQL][CATALYST][MINOR] update some error comments
Repository: spark Updated Branches: refs/heads/master a9f685bb7 -> e9332f600 [SQL][CATALYST][MINOR] update some error comments ## What changes were proposed in this pull request? This PR corrects some comment errors: 1. change from "as low a possible" to "as low as possible" in RewriteDistinctAggregates.scala 2. delete the redundant word "with" in HiveTableScanExec's doExecute() method ## How was this patch tested? Existing unit tests. Closes #22694 from CarolinePeng/update_comment. Authored-by: 彭灿00244106 <00244106@zte.intra> Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e9332f60 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e9332f60 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e9332f60 Branch: refs/heads/master Commit: e9332f600eb4f275b3bff368863a68c2a4349182 Parents: a9f685b Author: 彭灿00244106 <00244106@zte.intra> Authored: Wed Oct 17 12:45:13 2018 +0800 Committer: hyukjinkwon Committed: Wed Oct 17 12:45:13 2018 +0800 -- .../spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala | 4 ++-- .../org/apache/spark/sql/hive/execution/HiveTableScanExec.scala | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e9332f60/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala index 4448ace..b946800 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala @@ -95,7 +95,7 @@ import org.apache.spark.sql.types.IntegerType * * This rule duplicates the input data by two or more times
(# distinct groups + an optional * non-distinct group). This will put quite a bit of memory pressure of the used aggregate and - * exchange operators. Keeping the number of distinct groups as low a possible should be priority, + * exchange operators. Keeping the number of distinct groups as low as possible should be priority, * we could improve this in the current rule by applying more advanced expression canonicalization * techniques. */ @@ -241,7 +241,7 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] { groupByAttrs ++ distinctAggChildAttrs ++ Seq(gid) ++ regularAggChildAttrMap.map(_._2), a.child) - // Construct the first aggregate operator. This de-duplicates the all the children of + // Construct the first aggregate operator. This de-duplicates all the children of // distinct operators, and applies the regular aggregate operators. val firstAggregateGroupBy = groupByAttrs ++ distinctAggChildAttrs :+ gid val firstAggregate = Aggregate( http://git-wip-us.apache.org/repos/asf/spark/blob/e9332f60/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala index b3795b4..92c6632 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala @@ -182,7 +182,7 @@ case class HiveTableScanExec( protected override def doExecute(): RDD[InternalRow] = { // Using dummyCallSite, as getCallSite can turn out to be expensive with -// with multiple partitions. +// multiple partitions. val rdd = if (!relation.isPartitioned) { Utils.withDummyCallSite(sqlContext.sparkContext) { hadoopReader.makeRDDForTable(hiveQlTable) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[2/2] spark git commit: [SPARK-25393][SQL] Adding new function from_csv()
[SPARK-25393][SQL] Adding new function from_csv() ## What changes were proposed in this pull request? The PR adds new function `from_csv()` similar to `from_json()` to parse columns with CSV strings. I added the following methods: ```Scala def from_csv(e: Column, schema: StructType, options: Map[String, String]): Column ``` and this signature to call it from Python, R and Java: ```Scala def from_csv(e: Column, schema: String, options: java.util.Map[String, String]): Column ``` ## How was this patch tested? Added new test suites `CsvExpressionsSuite`, `CsvFunctionsSuite` and sql tests. Closes #22379 from MaxGekk/from_csv. Lead-authored-by: Maxim Gekk Co-authored-by: Maxim Gekk Co-authored-by: Hyukjin Kwon Co-authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e9af9460 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e9af9460 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e9af9460 Branch: refs/heads/master Commit: e9af9460bc008106b670abac44a869721bfde42a Parents: 9d4dd79 Author: Maxim Gekk Authored: Wed Oct 17 09:32:05 2018 +0800 Committer: hyukjinkwon Committed: Wed Oct 17 09:32:05 2018 +0800 -- R/pkg/NAMESPACE | 1 + R/pkg/R/functions.R | 40 ++- R/pkg/R/generics.R | 4 + R/pkg/tests/fulltests/test_sparkSQL.R | 7 + python/pyspark/sql/functions.py | 37 +- sql/catalyst/pom.xml| 6 + .../catalyst/analysis/FunctionRegistry.scala| 5 +- .../spark/sql/catalyst/csv/CSVExprUtils.scala | 82 + .../sql/catalyst/csv/CSVHeaderChecker.scala | 131 +++ .../spark/sql/catalyst/csv/CSVOptions.scala | 217 .../sql/catalyst/csv/UnivocityParser.scala | 351 ++ .../sql/catalyst/expressions/ExprUtils.scala| 45 +++ .../catalyst/expressions/csvExpressions.scala | 120 +++ .../catalyst/expressions/jsonExpressions.scala | 21 +- .../sql/catalyst/util/FailureSafeParser.scala | 80 + .../sql/catalyst/csv/CSVExprUtilsSuite.scala| 61 
.../expressions/CsvExpressionsSuite.scala | 158 + .../org/apache/spark/sql/DataFrameReader.scala | 5 +- .../datasources/FailureSafeParser.scala | 82 - .../datasources/csv/CSVDataSource.scala | 1 + .../datasources/csv/CSVFileFormat.scala | 1 + .../datasources/csv/CSVHeaderChecker.scala | 131 --- .../datasources/csv/CSVInferSchema.scala| 1 + .../execution/datasources/csv/CSVOptions.scala | 217 .../execution/datasources/csv/CSVUtils.scala| 67 +--- .../datasources/csv/UnivocityGenerator.scala| 1 + .../datasources/csv/UnivocityParser.scala | 352 --- .../datasources/json/JsonDataSource.scala | 1 + .../scala/org/apache/spark/sql/functions.scala | 32 ++ .../sql-tests/inputs/csv-functions.sql | 9 + .../sql-tests/results/csv-functions.sql.out | 69 .../apache/spark/sql/CsvFunctionsSuite.scala| 62 .../datasources/csv/CSVInferSchemaSuite.scala | 1 + .../datasources/csv/CSVUtilsSuite.scala | 61 .../datasources/csv/UnivocityParserSuite.scala | 2 +- 35 files changed, 1531 insertions(+), 930 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e9af9460/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index 96ff389..c512284 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -274,6 +274,7 @@ exportMethods("%<=>%", "floor", "format_number", "format_string", + "from_csv", "from_json", "from_unixtime", "from_utc_timestamp", http://git-wip-us.apache.org/repos/asf/spark/blob/e9af9460/R/pkg/R/functions.R -- diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index 6a8fef5..d2ca1d6 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -188,6 +188,7 @@ NULL #' \item \code{to_json}: it is the column containing the struct, array of the structs, #' the map or array of maps. #' \item \code{from_json}: it is the column containing the JSON string. +#' \item \code{from_csv}: it is the column containing the CSV string. #' } #' @param y Column to compute on. #' @param value A value to compute on. 
@@ -196,6 +197,13 @@ NULL #' \item \code{array_position}: a value to locate in the given array. #' \item \code{array_remove}: a value to remove in the given array. #'
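As a rough illustration of what the new `from_csv()` computes, here is a hypothetical plain-Python stand-in (a behavioral sketch only; Spark's implementation reuses the Univocity parser, and the `schema` shape used here is a simplification, not Spark's `StructType`):

```python
import csv
import io

# Hypothetical stand-in for from_csv(): parse one CSV-encoded string into
# a dict shaped like the given schema, casting by the declared type.
CASTS = {"int": int, "double": float, "string": str}

def from_csv(value, schema, sep=","):
    """schema is a list of (name, type) pairs, e.g. [("a", "int")]."""
    fields = next(csv.reader(io.StringIO(value), delimiter=sep))
    row = {}
    for (name, typ), raw in zip(schema, fields):
        # Empty fields become null, mirroring typical CSV parsing.
        row[name] = CASTS[typ](raw) if raw != "" else None
    return row

print(from_csv("1,0.8", [("a", "int"), ("b", "double")]))
# {'a': 1, 'b': 0.8}
```

The real API additionally takes an options map (delimiter, mode, and so on), which is omitted here.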
[1/2] spark git commit: [SPARK-25393][SQL] Adding new function from_csv()
Repository: spark Updated Branches: refs/heads/master 9d4dd7992 -> e9af9460b http://git-wip-us.apache.org/repos/asf/spark/blob/e9af9460/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala deleted file mode 100644 index 492a21b..000 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala +++ /dev/null @@ -1,217 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - *http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.spark.sql.execution.datasources.csv - -import java.nio.charset.StandardCharsets -import java.util.{Locale, TimeZone} - -import com.univocity.parsers.csv.{CsvParserSettings, CsvWriterSettings, UnescapedQuoteHandling} -import org.apache.commons.lang3.time.FastDateFormat - -import org.apache.spark.internal.Logging -import org.apache.spark.sql.catalyst.util._ - -class CSVOptions( -@transient val parameters: CaseInsensitiveMap[String], -val columnPruning: Boolean, -defaultTimeZoneId: String, -defaultColumnNameOfCorruptRecord: String) - extends Logging with Serializable { - - def this( -parameters: Map[String, String], -columnPruning: Boolean, -defaultTimeZoneId: String, -defaultColumnNameOfCorruptRecord: String = "") = { - this( -CaseInsensitiveMap(parameters), -columnPruning, -defaultTimeZoneId, -defaultColumnNameOfCorruptRecord) - } - - private def getChar(paramName: String, default: Char): Char = { -val paramValue = parameters.get(paramName) -paramValue match { - case None => default - case Some(null) => default - case Some(value) if value.length == 0 => '\u0000' - case Some(value) if value.length == 1 => value.charAt(0) - case _ => throw new RuntimeException(s"$paramName cannot be more than one character") -} - } - - private def getInt(paramName: String, default: Int): Int = { -val paramValue = parameters.get(paramName) -paramValue match { - case None => default - case Some(null) => default - case Some(value) => try { -value.toInt - } catch { -case e: NumberFormatException => - throw new RuntimeException(s"$paramName should be an integer. 
Found $value") - } -} - } - - private def getBool(paramName: String, default: Boolean = false): Boolean = { -val param = parameters.getOrElse(paramName, default.toString) -if (param == null) { - default -} else if (param.toLowerCase(Locale.ROOT) == "true") { - true -} else if (param.toLowerCase(Locale.ROOT) == "false") { - false -} else { - throw new Exception(s"$paramName flag can be true or false") -} - } - - val delimiter = CSVUtils.toChar( -parameters.getOrElse("sep", parameters.getOrElse("delimiter", ","))) - val parseMode: ParseMode = -parameters.get("mode").map(ParseMode.fromString).getOrElse(PermissiveMode) - val charset = parameters.getOrElse("encoding", -parameters.getOrElse("charset", StandardCharsets.UTF_8.name())) - - val quote = getChar("quote", '\"') - val escape = getChar("escape", '\\') - val charToEscapeQuoteEscaping = parameters.get("charToEscapeQuoteEscaping") match { -case None => None -case Some(null) => None -case Some(value) if value.length == 0 => None -case Some(value) if value.length == 1 => Some(value.charAt(0)) -case _ => - throw new RuntimeException("charToEscapeQuoteEscaping cannot be more than one character") - } - val comment = getChar("comment", '\u0000') - - val headerFlag = getBool("header") - val inferSchemaFlag = getBool("inferSchema") - val ignoreLeadingWhiteSpaceInRead = getBool("ignoreLeadingWhiteSpace", default = false) - val ignoreTrailingWhiteSpaceInRead = getBool("ignoreTrailingWhiteSpace", default = false) - - // For write, both options were `true` by default. We leave it as `true` for - // backwards compatibility. - val ignoreLeadingWhiteSpaceFlagInWrite =
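The `getChar`/`getBool` helpers in the relocated `CSVOptions` class above validate single-character and boolean options; the same logic reads as follows in a Python sketch (names and error messages mirror the Scala, but this is illustrative only, not Spark code):

```python
# Sketch of CSVOptions' option validation: single-character options fall
# back to a default when missing, map an empty string to '\0', and reject
# anything longer than one character; booleans accept only true/false.
def get_char(params, name, default):
    value = params.get(name)
    if value is None:
        return default
    if len(value) == 0:
        return "\u0000"
    if len(value) == 1:
        return value
    raise ValueError(f"{name} cannot be more than one character")

def get_bool(params, name, default=False):
    value = params.get(name, str(default))
    if value is None:
        return default
    if value.lower() == "true":
        return True
    if value.lower() == "false":
        return False
    raise ValueError(f"{name} flag can be true or false")

opts = {"quote": "'", "comment": "", "header": "TRUE"}
assert get_char(opts, "quote", '"') == "'"
assert get_char(opts, "escape", "\\") == "\\"            # missing -> default
assert get_char(opts, "comment", "\u0000") == "\u0000"   # empty -> '\0'
assert get_bool(opts, "header") is True
```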
spark git commit: [SQL][CATALYST][MINOR] update some error comments
Repository: spark Updated Branches: refs/heads/branch-2.4 144cb949d -> 3591bd229 [SQL][CATALYST][MINOR] update some error comments ## What changes were proposed in this pull request? This PR corrects some comment errors: 1. change from "as low a possible" to "as low as possible" in RewriteDistinctAggregates.scala 2. delete the redundant word "with" in HiveTableScanExec's doExecute() method ## How was this patch tested? Existing unit tests. Closes #22694 from CarolinePeng/update_comment. Authored-by: 彭灿00244106 <00244106@zte.intra> Signed-off-by: hyukjinkwon (cherry picked from commit e9332f600eb4f275b3bff368863a68c2a4349182) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3591bd22 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3591bd22 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3591bd22 Branch: refs/heads/branch-2.4 Commit: 3591bd2293f49ac8023166597704ad1bd21dabe9 Parents: 144cb94 Author: 彭灿00244106 <00244106@zte.intra> Authored: Wed Oct 17 12:45:13 2018 +0800 Committer: hyukjinkwon Committed: Wed Oct 17 12:45:30 2018 +0800 -- .../spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala | 4 ++-- .../org/apache/spark/sql/hive/execution/HiveTableScanExec.scala | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3591bd22/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala index 4448ace..b946800 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala @@ -95,7 +95,7 @@
import org.apache.spark.sql.types.IntegerType * * This rule duplicates the input data by two or more times (# distinct groups + an optional * non-distinct group). This will put quite a bit of memory pressure of the used aggregate and - * exchange operators. Keeping the number of distinct groups as low a possible should be priority, + * exchange operators. Keeping the number of distinct groups as low as possible should be priority, * we could improve this in the current rule by applying more advanced expression canonicalization * techniques. */ @@ -241,7 +241,7 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] { groupByAttrs ++ distinctAggChildAttrs ++ Seq(gid) ++ regularAggChildAttrMap.map(_._2), a.child) - // Construct the first aggregate operator. This de-duplicates the all the children of + // Construct the first aggregate operator. This de-duplicates all the children of // distinct operators, and applies the regular aggregate operators. val firstAggregateGroupBy = groupByAttrs ++ distinctAggChildAttrs :+ gid val firstAggregate = Aggregate( http://git-wip-us.apache.org/repos/asf/spark/blob/3591bd22/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala index b3795b4..92c6632 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala @@ -182,7 +182,7 @@ case class HiveTableScanExec( protected override def doExecute(): RDD[InternalRow] = { // Using dummyCallSite, as getCallSite can turn out to be expensive with -// with multiple partitions. +// multiple partitions. 
val rdd = if (!relation.isPartitioned) { Utils.withDummyCallSite(sqlContext.sparkContext) { hadoopReader.makeRDDForTable(hiveQlTable)
spark git commit: [SPARK-25493][SQL] Use auto-detection for CRLF in CSV datasource multiline mode
Repository: spark Updated Branches: refs/heads/master d0ecff285 -> 1e6c1d8bf [SPARK-25493][SQL] Use auto-detection for CRLF in CSV datasource multiline mode ## What changes were proposed in this pull request? CSVs with Windows-style CRLF ('\r\n') don't work in multiline mode. They work fine in single line mode because the line separation is done by Hadoop, which can handle all the different types of line separators. This PR fixes it by enabling Univocity's line separator detection in multiline mode, which will detect '\r\n', '\r', or '\n' automatically, as is done by Hadoop in single line mode. ## How was this patch tested? Unit test with a file with CRLF line endings. Closes #22503 from justinuang/fix-clrf-multiline. Authored-by: Justin Uang Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1e6c1d8b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1e6c1d8b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1e6c1d8b Branch: refs/heads/master Commit: 1e6c1d8bfb7841596452e25b870823b9a4b267f4 Parents: d0ecff2 Author: Justin Uang Authored: Fri Oct 19 11:13:02 2018 +0800 Committer: hyukjinkwon Committed: Fri Oct 19 11:13:02 2018 +0800 -- .../org/apache/spark/sql/catalyst/csv/CSVOptions.scala | 2 ++ sql/core/src/test/resources/test-data/cars-crlf.csv | 7 +++ .../spark/sql/execution/datasources/csv/CSVSuite.scala | 12 3 files changed, 21 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1e6c1d8b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala index 3e25d82..cdaaa17 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala @@ 
-212,6 +212,8 @@ class CSVOptions( settings.setEmptyValue(emptyValueInRead) settings.setMaxCharsPerColumn(maxCharsPerColumn) settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER) +settings.setLineSeparatorDetectionEnabled(multiLine == true) + settings } } http://git-wip-us.apache.org/repos/asf/spark/blob/1e6c1d8b/sql/core/src/test/resources/test-data/cars-crlf.csv -- diff --git a/sql/core/src/test/resources/test-data/cars-crlf.csv b/sql/core/src/test/resources/test-data/cars-crlf.csv new file mode 100644 index 000..d018d08 --- /dev/null +++ b/sql/core/src/test/resources/test-data/cars-crlf.csv @@ -0,0 +1,7 @@ + +year,make,model,comment,blank +"2012","Tesla","S","No comment", + +1997,Ford,E350,"Go get one now they are going fast", +2015,Chevy,Volt + http://git-wip-us.apache.org/repos/asf/spark/blob/1e6c1d8b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala index d59035b..d43efc8 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala @@ -52,6 +52,7 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te private val carsNullFile = "test-data/cars-null.csv" private val carsEmptyValueFile = "test-data/cars-empty-value.csv" private val carsBlankColName = "test-data/cars-blank-column-name.csv" + private val carsCrlf = "test-data/cars-crlf.csv" private val emptyFile = "test-data/empty.csv" private val commentsFile = "test-data/comments.csv" private val disableCommentsFile = "test-data/disable_comments.csv" @@ -220,6 +221,17 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te } } + test("crlf line separators in multiline mode") { +val 
cars = spark + .read + .format("csv") + .option("multiLine", "true") + .option("header", "true") + .load(testFile(carsCrlf)) + +verifyCars(cars, withHeader = true) + } + test("test aliases sep and encoding for delimiter and charset") { // scalastyle:off val cars = spark
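What the enabled line-separator detection buys multiline mode can be sketched in Python: one splitter that treats '\r\n', '\r', and '\n' uniformly (this mimics the detection outcome only, not Univocity's implementation):

```python
# Sketch of separator-agnostic record splitting: normalize CRLF first,
# then bare CR, so all three conventions yield the same records.
def split_records(text):
    return text.replace("\r\n", "\n").replace("\r", "\n").split("\n")

crlf_csv = "year,make\r\n2012,Tesla\r\n1997,Ford"
lf_csv = "year,make\n2012,Tesla\n1997,Ford"

assert split_records(crlf_csv) == split_records(lf_csv)
assert split_records(crlf_csv) == ["year,make", "2012,Tesla", "1997,Ford"]
```

In single line mode Hadoop's line reader already performs this normalization, which is why the bug only showed up with `multiLine` enabled.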
spark git commit: [MINOR][DOC] Spacing items in migration guide for readability and consistency
Repository: spark Updated Branches: refs/heads/branch-2.4 36307b1e4 -> 9ed2e4204 [MINOR][DOC] Spacing items in migration guide for readability and consistency ## What changes were proposed in this pull request? Currently, the migration guide has no space between items, which looks too compact and is hard to read. Some items already had spaces between them in the migration guide. This PR suggests formatting them consistently for readability. Before: ![screen shot 2018-10-18 at 10 00 04 am](https://user-images.githubusercontent.com/6477701/47126768-9e84fb80-d2bc-11e8-9211-84703486c553.png) After: ![screen shot 2018-10-18 at 9 53 55 am](https://user-images.githubusercontent.com/6477701/47126708-4fd76180-d2bc-11e8-9aa5-546f0622ca20.png) ## How was this patch tested? Manually tested. Closes #22761 from HyukjinKwon/minor-migration-doc. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon (cherry picked from commit c8f7691c64a28174a54e8faa159b50a3836a7225) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9ed2e420 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9ed2e420 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9ed2e420 Branch: refs/heads/branch-2.4 Commit: 9ed2e42044a1105a1c8b5868dbb320b1b477bcf0 Parents: 36307b1 Author: hyukjinkwon Authored: Fri Oct 19 13:55:27 2018 +0800 Committer: hyukjinkwon Committed: Fri Oct 19 13:55:43 2018 +0800 -- docs/sql-migration-guide-upgrade.md | 54 1 file changed, 54 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9ed2e420/docs/sql-migration-guide-upgrade.md -- diff --git a/docs/sql-migration-guide-upgrade.md b/docs/sql-migration-guide-upgrade.md index 3476aa8..062e07b 100644 --- a/docs/sql-migration-guide-upgrade.md +++ b/docs/sql-migration-guide-upgrade.md @@ -70,26 +70,47 @@ displayTitle: Spark SQL Upgrading Guide - Since Spark 2.4, when there is a struct field in front of the IN operator before a
subquery, the inner query must contain a struct field as well. In previous versions, instead, the fields of the struct were compared to the output of the inner query. E.g., if `a` is a `struct(a string, b int)`, in Spark 2.4 `a in (select (1 as a, 'a' as b) from range(1))` is a valid query, while `a in (select 1, 'a' from range(1))` is not. In previous versions it was the opposite. + - In versions 2.2.1+ and 2.3, if `spark.sql.caseSensitive` is set to true, then the `CURRENT_DATE` and `CURRENT_TIMESTAMP` functions incorrectly became case-sensitive and would resolve to columns (unless typed in lower case). In Spark 2.4 this has been fixed and the functions are no longer case-sensitive. + - Since Spark 2.4, Spark will evaluate the set operations referenced in a query by following a precedence rule as per the SQL standard. If the order is not specified by parentheses, set operations are performed from left to right with the exception that all INTERSECT operations are performed before any UNION, EXCEPT or MINUS operations. The old behaviour of giving equal precedence to all the set operations is preserved under a newly added configuration `spark.sql.legacy.setopsPrecedence.enabled` with a default value of `false`. When this property is set to `true`, Spark will evaluate the set operators from left to right as they appear in the query when no explicit ordering is enforced by parentheses. + - Since Spark 2.4, Spark will display the table description column Last Access value as UNKNOWN when the value was Jan 01 1970. + - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively. + - In PySpark, when Arrow optimization is enabled, previously `toPandas` simply failed when Arrow optimization could not be used, whereas `createDataFrame` from a Pandas DataFrame allowed the fallback to non-optimization.
Now, both `toPandas` and `createDataFrame` from Pandas DataFrame allow the fallback by default, which can be switched off by `spark.sql.execution.arrow.fallback.enabled`. + - Since Spark 2.4, writing an empty dataframe to a directory launches at least one write task, even if physically the dataframe has no partition. This introduces a small behavior change that for self-describing file formats like Parquet and Orc, Spark creates a metadata-only file in the target directory when writing a 0-partition dataframe, so that schema inference can still work if users read that directory later. The new behavior is more reasonable and more consistent regarding writing empty
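The set-operation precedence item above can be modeled outside Spark. The following is a hypothetical plain-Python sketch (not Spark code) of the SQL-standard rule the guide describes: every INTERSECT is folded into its left operand first, and the remaining UNION/EXCEPT operators are then applied left to right. Under the legacy `spark.sql.legacy.setopsPrecedence.enabled=true` behavior, all operators would instead apply strictly left to right.

```python
# Hypothetical model of SQL-standard set-operation precedence (not Spark code):
# all INTERSECT operations bind tighter than UNION/EXCEPT/MINUS, which are
# then evaluated left to right.
def evaluate(terms, ops):
    """terms: list of Python sets; ops: 'UNION'|'EXCEPT'|'INTERSECT' between them."""
    # First pass: fold every INTERSECT into its left operand.
    folded_terms, folded_ops = [terms[0]], []
    for op, term in zip(ops, terms[1:]):
        if op == "INTERSECT":
            folded_terms[-1] = folded_terms[-1] & term
        else:
            folded_ops.append(op)
            folded_terms.append(term)
    # Second pass: remaining UNION/EXCEPT are applied left to right.
    result = folded_terms[0]
    for op, term in zip(folded_ops, folded_terms[1:]):
        result = result | term if op == "UNION" else result - term
    return result

# a UNION b INTERSECT c  ==  a UNION (b INTERSECT c) under the 2.4 rule
a, b, c = {1}, {2, 3}, {3, 4}
print(evaluate([a, b, c], ["UNION", "INTERSECT"]))  # {1, 3}
# The legacy left-to-right rule would instead compute (a UNION b) INTERSECT c = {3}.
```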
spark git commit: [MINOR][DOC] Spacing items in migration guide for readability and consistency
Repository: spark Updated Branches: refs/heads/master 1e6c1d8bf -> c8f7691c6 [MINOR][DOC] Spacing items in migration guide for readability and consistency ## What changes were proposed in this pull request? Currently, the migration guide has no space between items, which looks too compact and is hard to read. Some items already had spaces between them in the migration guide. This PR suggests formatting them consistently for readability. Before: ![screen shot 2018-10-18 at 10 00 04 am](https://user-images.githubusercontent.com/6477701/47126768-9e84fb80-d2bc-11e8-9211-84703486c553.png) After: ![screen shot 2018-10-18 at 9 53 55 am](https://user-images.githubusercontent.com/6477701/47126708-4fd76180-d2bc-11e8-9aa5-546f0622ca20.png) ## How was this patch tested? Manually tested. Closes #22761 from HyukjinKwon/minor-migration-doc. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c8f7691c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c8f7691c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c8f7691c Branch: refs/heads/master Commit: c8f7691c64a28174a54e8faa159b50a3836a7225 Parents: 1e6c1d8 Author: hyukjinkwon Authored: Fri Oct 19 13:55:27 2018 +0800 Committer: hyukjinkwon Committed: Fri Oct 19 13:55:27 2018 +0800 -- docs/sql-migration-guide-upgrade.md | 54 1 file changed, 54 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c8f7691c/docs/sql-migration-guide-upgrade.md -- diff --git a/docs/sql-migration-guide-upgrade.md b/docs/sql-migration-guide-upgrade.md index 7faf8bd..7871a49 100644 --- a/docs/sql-migration-guide-upgrade.md +++ b/docs/sql-migration-guide-upgrade.md @@ -74,26 +74,47 @@ displayTitle: Spark SQL Upgrading Guide - Since Spark 2.4, when there is a struct field in front of the IN operator before a subquery, the inner query must contain a struct field as well.
In previous versions, instead, the fields of the struct were compared to the output of the inner query. E.g., if `a` is a `struct(a string, b int)`, in Spark 2.4 `a in (select (1 as a, 'a' as b) from range(1))` is a valid query, while `a in (select 1, 'a' from range(1))` is not. In previous versions it was the opposite. + - In versions 2.2.1+ and 2.3, if `spark.sql.caseSensitive` is set to true, then the `CURRENT_DATE` and `CURRENT_TIMESTAMP` functions incorrectly became case-sensitive and would resolve to columns (unless typed in lower case). In Spark 2.4 this has been fixed and the functions are no longer case-sensitive. + - Since Spark 2.4, Spark will evaluate the set operations referenced in a query by following a precedence rule as per the SQL standard. If the order is not specified by parentheses, set operations are performed from left to right with the exception that all INTERSECT operations are performed before any UNION, EXCEPT or MINUS operations. The old behaviour of giving equal precedence to all the set operations is preserved under a newly added configuration `spark.sql.legacy.setopsPrecedence.enabled` with a default value of `false`. When this property is set to `true`, Spark will evaluate the set operators from left to right as they appear in the query when no explicit ordering is enforced by parentheses. + - Since Spark 2.4, Spark will display the table description column Last Access value as UNKNOWN when the value was Jan 01 1970. + - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively. + - In PySpark, when Arrow optimization is enabled, previously `toPandas` simply failed when Arrow optimization could not be used, whereas `createDataFrame` from a Pandas DataFrame allowed the fallback to non-optimization.
Now, both `toPandas` and `createDataFrame` from Pandas DataFrame allow the fallback by default, which can be switched off by `spark.sql.execution.arrow.fallback.enabled`. + - Since Spark 2.4, writing an empty dataframe to a directory launches at least one write task, even if physically the dataframe has no partition. This introduces a small behavior change that for self-describing file formats like Parquet and Orc, Spark creates a metadata-only file in the target directory when writing a 0-partition dataframe, so that schema inference can still work if users read that directory later. The new behavior is more reasonable and more consistent regarding writing empty dataframe. + - Since Spark 2.4, expression IDs in UDF arguments do not appear in column names. For example,
spark git commit: [SPARK-25040][SQL] Empty string for non string types should be disallowed
Repository: spark Updated Branches: refs/heads/master c391dc65e -> 03e82e368 [SPARK-25040][SQL] Empty string for non string types should be disallowed ## What changes were proposed in this pull request? This takes over the original PR at #22019. The original proposal was to return null for float and double types; a later, more reasonable proposal is to disallow empty strings. This patch adds logic to throw an exception when an empty string is found for a non-string type. ## How was this patch tested? Added test. Closes #22787 from viirya/SPARK-25040. Authored-by: Liang-Chi Hsieh Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/03e82e36 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/03e82e36 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/03e82e36 Branch: refs/heads/master Commit: 03e82e36896afb43cc42c8d065ebe41a19ec62a7 Parents: c391dc6 Author: Liang-Chi Hsieh Authored: Tue Oct 23 13:43:53 2018 +0800 Committer: hyukjinkwon Committed: Tue Oct 23 13:43:53 2018 +0800 -- docs/sql-migration-guide-upgrade.md | 2 ++ .../spark/sql/catalyst/json/JacksonParser.scala | 19 +- .../execution/datasources/json/JsonSuite.scala | 37 +++- 3 files changed, 48 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/03e82e36/docs/sql-migration-guide-upgrade.md -- diff --git a/docs/sql-migration-guide-upgrade.md b/docs/sql-migration-guide-upgrade.md index 68a897c..b8b9ad8 100644 --- a/docs/sql-migration-guide-upgrade.md +++ b/docs/sql-migration-guide-upgrade.md @@ -11,6 +11,8 @@ displayTitle: Spark SQL Upgrading Guide - In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them.
Since 3.0, the builder no longer updates the configurations. This is the same behavior as the Java/Scala API in 2.3 and above. If you want to update them, you need to do so prior to creating a `SparkSession`. + - In Spark version 2.4 and earlier, the parser of the JSON data source treats empty strings as null for some data types such as `IntegerType`. For `FloatType` and `DoubleType`, it fails on empty strings and throws exceptions. Since Spark 3.0, we disallow empty strings and will throw exceptions for data types except for `StringType` and `BinaryType`. + ## Upgrading From Spark SQL 2.3 to 2.4 - In Spark version 2.3 and earlier, the second parameter to the array_contains function is implicitly promoted to the element type of the first array type parameter. This type promotion can be lossy and may cause the `array_contains` function to return a wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and is illustrated in the table below. http://git-wip-us.apache.org/repos/asf/spark/blob/03e82e36/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala index 984979a..918c9e7 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala @@ -168,7 +168,7 @@ class JacksonParser( case VALUE_NUMBER_INT | VALUE_NUMBER_FLOAT => parser.getFloatValue -case VALUE_STRING => +case VALUE_STRING if parser.getTextLength >= 1 => // Special case handling for NaN and Infinity.
parser.getText match { case "NaN" => Float.NaN @@ -184,7 +184,7 @@ class JacksonParser( case VALUE_NUMBER_INT | VALUE_NUMBER_FLOAT => parser.getDoubleValue -case VALUE_STRING => +case VALUE_STRING if parser.getTextLength >= 1 => // Special case handling for NaN and Infinity. parser.getText match { case "NaN" => Double.NaN @@ -211,7 +211,7 @@ class JacksonParser( case TimestampType => (parser: JsonParser) => parseJsonToken[java.lang.Long](parser, dataType) { -case VALUE_STRING => +case VALUE_STRING if parser.getTextLength >= 1 => val stringValue = parser.getText // This one will lose microseconds parts. // See
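The JacksonParser change above adds a `parser.getTextLength >= 1` guard, so an empty string token no longer reaches the string branch for non-string types. A rough, hypothetical Python analogue of the float branch (not Spark's actual parser code) looks like this:

```python
# Hypothetical analogue of JacksonParser's float branch after the patch:
# an empty token raises instead of being silently coerced, while
# "NaN"/"Infinity" keep their special-case handling.
def parse_float_token(text):
    # Mirrors `case VALUE_STRING if parser.getTextLength >= 1`:
    # an empty token falls through to an error case.
    if len(text) < 1:
        raise ValueError("empty string is not allowed for non-string types")
    specials = {"NaN": float("nan"),
                "Infinity": float("inf"),
                "-Infinity": float("-inf")}
    if text in specials:
        return specials[text]
    return float(text)

print(parse_float_token("3.14"))  # 3.14
print(parse_float_token("NaN"))   # nan
```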
spark git commit: [SPARK-25785][SQL] Add prettyNames for from_json, to_json, from_csv, and schema_of_json
Repository: spark Updated Branches: refs/heads/master 4acbda4a9 -> 3370865b0 [SPARK-25785][SQL] Add prettyNames for from_json, to_json, from_csv, and schema_of_json ## What changes were proposed in this pull request? This PR adds `prettyNames` for `from_json`, `to_json`, `from_csv`, and `schema_of_json` so that appropriate names are used. ## How was this patch tested? Unit tests Closes #22773 from HyukjinKwon/minor-prettyNames. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3370865b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3370865b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3370865b Branch: refs/heads/master Commit: 3370865b0ebe9b04c6671631aee5917b41ceba9c Parents: 4acbda4 Author: hyukjinkwon Authored: Sat Oct 20 10:15:53 2018 +0800 Committer: hyukjinkwon Committed: Sat Oct 20 10:15:53 2018 +0800 -- .../catalyst/expressions/csvExpressions.scala | 2 + .../catalyst/expressions/jsonExpressions.scala | 6 +++ .../sql-tests/results/csv-functions.sql.out | 4 +- .../sql-tests/results/json-functions.sql.out| 50 ++-- .../native/stringCastAndExpressions.sql.out | 2 +- 5 files changed, 36 insertions(+), 28 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3370865b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala index a63b624..853b1ea 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala @@ -117,4 +117,6 @@ case class CsvToStructs( } override def inputTypes: Seq[AbstractDataType] = StringType :: Nil + + override def prettyName: 
String = "from_csv" } http://git-wip-us.apache.org/repos/asf/spark/blob/3370865b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala index 9f28483..b4815b4 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala @@ -610,6 +610,8 @@ case class JsonToStructs( case _: MapType => "entries" case _ => super.sql } + + override def prettyName: String = "from_json" } /** @@ -730,6 +732,8 @@ case class StructsToJson( override def nullSafeEval(value: Any): Any = converter(value) override def inputTypes: Seq[AbstractDataType] = TypeCollection(ArrayType, StructType) :: Nil + + override def prettyName: String = "to_json" } /** @@ -774,6 +778,8 @@ case class SchemaOfJson( UTF8String.fromString(dt.catalogString) } + + override def prettyName: String = "schema_of_json" } object JsonExprUtils { http://git-wip-us.apache.org/repos/asf/spark/blob/3370865b/sql/core/src/test/resources/sql-tests/results/csv-functions.sql.out -- diff --git a/sql/core/src/test/resources/sql-tests/results/csv-functions.sql.out b/sql/core/src/test/resources/sql-tests/results/csv-functions.sql.out index 15dbe36..f19f34a 100644 --- a/sql/core/src/test/resources/sql-tests/results/csv-functions.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/csv-functions.sql.out @@ -5,7 +5,7 @@ -- !query 0 select from_csv('1, 3.14', 'a INT, f FLOAT') -- !query 0 schema -struct> +struct> -- !query 0 output {"a":1,"f":3.14} @@ -13,7 +13,7 @@ struct> -- !query 1 select from_csv('26/08/2015', 'time Timestamp', map('timestampFormat', 'dd/MM/')) -- !query 1 schema -struct> +struct> -- !query 1 output {"time":2015-08-26 00:00:00.0} 
http://git-wip-us.apache.org/repos/asf/spark/blob/3370865b/sql/core/src/test/resources/sql-tests/results/json-functions.sql.out -- diff --git a/sql/core/src/test/resources/sql-tests/results/json-functions.sql.out b/sql/core/src/test/resources/sql-tests/results/json-functions.sql.out index 77e9000..868eee8 100644 --- a/sql/core/src/test/resources/sql-tests/results/json-functions.sql.out +++
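As a rough illustration of what `prettyName` affects, here is a hypothetical Python sketch (class and helper names are illustrative, not Spark's API): the displayed column name of an unaliased expression is rendered from the expression's pretty name, so an expression like `CsvToStructs` shows up in result schemas as `from_csv(...)` rather than a name derived from its class name.

```python
# Illustrative sketch only -- models how a short "pretty" name overrides the
# default class-derived name used when rendering result schemas.
class Expression:
    def pretty_name(self):
        # Default: the lowercased class name, e.g. "csvtostructs".
        return type(self).__name__.lower()

class CsvToStructs(Expression):
    def pretty_name(self):
        return "from_csv"  # overridden, mirroring the Scala change above

def schema_column_name(expr, args):
    # Unaliased expressions are displayed as prettyName(arg, ...).
    return "{}({})".format(expr.pretty_name(), ", ".join(args))

print(schema_column_name(CsvToStructs(), ["1, 3.14"]))  # from_csv(1, 3.14)
```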
spark git commit: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark
Repository: spark Updated Branches: refs/heads/master 7d425b190 -> c3eaee776 [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark Master ## What changes were proposed in this pull request? Previously, PySpark used the private constructor for SparkSession when building that object. This resulted in a SparkSession that never checked the sql.extensions parameter for additional session extensions. To fix this, we instead use the Session.builder() path, as SparkR does; this loads the extensions and allows their use in PySpark. ## How was this patch tested? An integration test was added which mimics the Scala test for the same feature. Closes #21990 from RussellSpitzer/SPARK-25003-master. Authored-by: Russell Spitzer Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c3eaee77 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c3eaee77 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c3eaee77 Branch: refs/heads/master Commit: c3eaee776509b0a23d0ba7a575575516bab4aa4e Parents: 7d425b1 Author: Russell Spitzer Authored: Thu Oct 18 12:29:09 2018 +0800 Committer: hyukjinkwon Committed: Thu Oct 18 12:29:09 2018 +0800 -- python/pyspark/sql/tests.py | 42 +++ .../org/apache/spark/sql/SparkSession.scala | 56 +--- 2 files changed, 80 insertions(+), 18 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c3eaee77/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index 85712df..8065d82 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -3837,6 +3837,48 @@ class QueryExecutionListenerTests(unittest.TestCase, SQLTestUtils): "The callback from the query execution listener should be called after 'toPandas'") +class SparkExtensionsTest(unittest.TestCase): +# These tests are separate because it uses
'spark.sql.extensions' which is +# static and immutable. This can't be set or unset, for example, via `spark.conf`. + +@classmethod +def setUpClass(cls): +import glob +from pyspark.find_spark_home import _find_spark_home + +SPARK_HOME = _find_spark_home() +filename_pattern = ( +"sql/core/target/scala-*/test-classes/org/apache/spark/sql/" +"SparkSessionExtensionSuite.class") +if not glob.glob(os.path.join(SPARK_HOME, filename_pattern)): +raise unittest.SkipTest( +"'org.apache.spark.sql.SparkSessionExtensionSuite' is not " +"available. Will skip the related tests.") + +# Note that 'spark.sql.extensions' is a static immutable configuration. +cls.spark = SparkSession.builder \ +.master("local[4]") \ +.appName(cls.__name__) \ +.config( +"spark.sql.extensions", +"org.apache.spark.sql.MyExtensions") \ +.getOrCreate() + +@classmethod +def tearDownClass(cls): +cls.spark.stop() + +def test_use_custom_class_for_extensions(self): +self.assertTrue( + self.spark._jsparkSession.sessionState().planner().strategies().contains( + self.spark._jvm.org.apache.spark.sql.MySparkStrategy(self.spark._jsparkSession)), +"MySparkStrategy not found in active planner strategies") +self.assertTrue( + self.spark._jsparkSession.sessionState().analyzer().extendedResolutionRules().contains( + self.spark._jvm.org.apache.spark.sql.MyRule(self.spark._jsparkSession)), +"MyRule not found in extended resolution rules") + + class SparkSessionTests(PySparkTestCase): # This test is separate because it's closely related with session's start and stop. 
http://git-wip-us.apache.org/repos/asf/spark/blob/c3eaee77/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala b/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala index 2b847fb..71f967a 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala @@ -84,8 +84,17 @@ class SparkSession private( // The call site where this SparkSession was constructed. private val creationSite: CallSite = Utils.getCallSite() + /** + * Constructor used in Pyspark. Contains explicit application of Spark Session Extensions + * which otherwise only occurs during
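The mechanism this patch routes PySpark through can be sketched in plain Python (no Spark). In this hypothetical model, all names are illustrative, not Spark's API: the builder reads the configured extensions name, lets the named extension mutate the session being built, and only then returns it, instead of constructing the session privately and skipping extensions entirely.

```python
# Hypothetical plain-Python model of builder-driven session extensions.
EXTENSIONS = {}  # registered extension name -> callable(session)

def register_extension(name):
    def deco(fn):
        EXTENSIONS[name] = fn
        return fn
    return deco

class Session:
    def __init__(self):
        self.strategies = []  # stands in for the planner's strategy list

def get_or_create(config):
    session = Session()
    name = config.get("sql.extensions")
    if name:
        EXTENSIONS[name](session)  # let the extension inject extra rules
    return session

@register_extension("my.Extensions")
def my_extensions(session):
    session.strategies.append("MySparkStrategy")

spark = get_or_create({"sql.extensions": "my.Extensions"})
print(spark.strategies)  # ['MySparkStrategy']
```

The design point the patch makes is that the extension hook lives in the builder path, so any entry point that bypasses the builder silently loses extensions.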
spark git commit: [SPARK-25579][SQL] Use quoted attribute names if needed in pushed ORC predicates
Repository: spark Updated Branches: refs/heads/master e028fd3ae -> 2c664edc0 [SPARK-25579][SQL] Use quoted attribute names if needed in pushed ORC predicates ## What changes were proposed in this pull request? This PR aims to fix an ORC performance regression at Spark 2.4.0 RCs from Spark 2.3.2. Currently, for column names with `.`, the pushed predicates are ignored. **Test Data** ```scala scala> val df = spark.range(Int.MaxValue).sample(0.2).toDF("col.with.dot") scala> df.write.mode("overwrite").orc("/tmp/orc") ``` **Spark 2.3.2** ```scala scala> spark.sql("set spark.sql.orc.impl=native") scala> spark.sql("set spark.sql.orc.filterPushdown=true") scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) ++ |col.with.dot| ++ | 5| | 7| | 8| ++ Time taken: 1542 ms scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) ++ |col.with.dot| ++ | 5| | 7| | 8| ++ Time taken: 152 ms ``` **Spark 2.4.0 RC3** ```scala scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) ++ |col.with.dot| ++ | 5| | 7| | 8| ++ Time taken: 4074 ms scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) ++ |col.with.dot| ++ | 5| | 7| | 8| ++ Time taken: 1771 ms ``` ## How was this patch tested? Pass the Jenkins with a newly added test case. Closes #22597 from dongjoon-hyun/SPARK-25579. 
Authored-by: Dongjoon Hyun Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2c664edc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2c664edc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2c664edc Branch: refs/heads/master Commit: 2c664edc060a41340eb374fd44b5d32c3c06a15c Parents: e028fd3 Author: Dongjoon Hyun Authored: Tue Oct 16 20:30:23 2018 +0800 Committer: hyukjinkwon Committed: Tue Oct 16 20:30:23 2018 +0800 -- .../execution/datasources/orc/OrcFilters.scala | 37 +++- .../datasources/orc/OrcQuerySuite.scala | 28 +-- .../sql/execution/datasources/orc/OrcTest.scala | 10 ++ 3 files changed, 46 insertions(+), 29 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2c664edc/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala index 2b17b47..0a64981 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala @@ -67,6 +67,16 @@ private[sql] object OrcFilters { } } + // Since ORC 1.5.0 (ORC-323), we need to quote for column names with `.` characters + // in order to distinguish predicate pushdown for nested columns. + private def quoteAttributeNameIfNeeded(name: String) : String = { +if (!name.contains("`") && name.contains(".")) { + s"`$name`" +} else { + name +} + } + /** * Create ORC filter as a SearchArgument instance. */ @@ -215,38 +225,47 @@ private[sql] object OrcFilters { // wrapped by a "parent" predicate (`And`, `Or`, or `Not`). 
case EqualTo(attribute, value) if isSearchableType(dataTypeMap(attribute)) => +val quotedName = quoteAttributeNameIfNeeded(attribute) val castedValue = castLiteralValue(value, dataTypeMap(attribute)) -Some(builder.startAnd().equals(attribute, getType(attribute), castedValue).end()) +Some(builder.startAnd().equals(quotedName, getType(attribute), castedValue).end()) case EqualNullSafe(attribute, value) if isSearchableType(dataTypeMap(attribute)) => +val quotedName = quoteAttributeNameIfNeeded(attribute) val castedValue = castLiteralValue(value, dataTypeMap(attribute)) -Some(builder.startAnd().nullSafeEquals(attribute, getType(attribute), castedValue).end()) +Some(builder.startAnd().nullSafeEquals(quotedName, getType(attribute), castedValue).end()) case LessThan(attribute, value) if isSearchableType(dataTypeMap(attribute)) => +val quotedName = quoteAttributeNameIfNeeded(attribute) val castedValue = castLiteralValue(value, dataTypeMap(attribute)) -Some(builder.startAnd().lessThan(attribute, getType(attribute), castedValue).end()) +
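The helper added above is small enough to transcribe directly; this is a Python rendering of `quoteAttributeNameIfNeeded` for illustration:

```python
# Python transcription of the Scala helper above: column names containing `.`
# are backquoted so ORC 1.5.0+ (ORC-323) does not misread them as
# nested-column paths; names that already contain a backquote are left alone.
def quote_attribute_name_if_needed(name):
    if "`" not in name and "." in name:
        return "`{}`".format(name)
    return name

print(quote_attribute_name_if_needed("col.with.dot"))  # `col.with.dot`
print(quote_attribute_name_if_needed("plain"))         # plain
```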
spark git commit: [SPARK-25579][SQL] Use quoted attribute names if needed in pushed ORC predicates
Repository: spark Updated Branches: refs/heads/branch-2.4 77156f8c8 -> 144cb949d [SPARK-25579][SQL] Use quoted attribute names if needed in pushed ORC predicates ## What changes were proposed in this pull request? This PR aims to fix an ORC performance regression at Spark 2.4.0 RCs from Spark 2.3.2. Currently, for column names with `.`, the pushed predicates are ignored. **Test Data** ```scala scala> val df = spark.range(Int.MaxValue).sample(0.2).toDF("col.with.dot") scala> df.write.mode("overwrite").orc("/tmp/orc") ``` **Spark 2.3.2** ```scala scala> spark.sql("set spark.sql.orc.impl=native") scala> spark.sql("set spark.sql.orc.filterPushdown=true") scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) ++ |col.with.dot| ++ | 5| | 7| | 8| ++ Time taken: 1542 ms scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) ++ |col.with.dot| ++ | 5| | 7| | 8| ++ Time taken: 152 ms ``` **Spark 2.4.0 RC3** ```scala scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) ++ |col.with.dot| ++ | 5| | 7| | 8| ++ Time taken: 4074 ms scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) ++ |col.with.dot| ++ | 5| | 7| | 8| ++ Time taken: 1771 ms ``` ## How was this patch tested? Pass the Jenkins with a newly added test case. Closes #22597 from dongjoon-hyun/SPARK-25579. 
Authored-by: Dongjoon Hyun Signed-off-by: hyukjinkwon (cherry picked from commit 2c664edc060a41340eb374fd44b5d32c3c06a15c) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/144cb949 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/144cb949 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/144cb949 Branch: refs/heads/branch-2.4 Commit: 144cb949d597e6cd0e662f2320e983cb6903ecfb Parents: 77156f8 Author: Dongjoon Hyun Authored: Tue Oct 16 20:30:23 2018 +0800 Committer: hyukjinkwon Committed: Tue Oct 16 20:30:40 2018 +0800 -- .../execution/datasources/orc/OrcFilters.scala | 37 +++- .../datasources/orc/OrcQuerySuite.scala | 28 +-- .../sql/execution/datasources/orc/OrcTest.scala | 10 ++ 3 files changed, 46 insertions(+), 29 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/144cb949/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala index dbafc46..5b93a60 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala @@ -67,6 +67,16 @@ private[sql] object OrcFilters { } } + // Since ORC 1.5.0 (ORC-323), we need to quote for column names with `.` characters + // in order to distinguish predicate pushdown for nested columns. + private def quoteAttributeNameIfNeeded(name: String) : String = { +if (!name.contains("`") && name.contains(".")) { + s"`$name`" +} else { + name +} + } + /** * Create ORC filter as a SearchArgument instance. */ @@ -178,38 +188,47 @@ private[sql] object OrcFilters { // wrapped by a "parent" predicate (`And`, `Or`, or `Not`). 
case EqualTo(attribute, value) if isSearchableType(dataTypeMap(attribute)) => +val quotedName = quoteAttributeNameIfNeeded(attribute) val castedValue = castLiteralValue(value, dataTypeMap(attribute)) -Some(builder.startAnd().equals(attribute, getType(attribute), castedValue).end()) +Some(builder.startAnd().equals(quotedName, getType(attribute), castedValue).end()) case EqualNullSafe(attribute, value) if isSearchableType(dataTypeMap(attribute)) => +val quotedName = quoteAttributeNameIfNeeded(attribute) val castedValue = castLiteralValue(value, dataTypeMap(attribute)) -Some(builder.startAnd().nullSafeEquals(attribute, getType(attribute), castedValue).end()) +Some(builder.startAnd().nullSafeEquals(quotedName, getType(attribute), castedValue).end()) case LessThan(attribute, value) if isSearchableType(dataTypeMap(attribute)) => +val quotedName = quoteAttributeNameIfNeeded(attribute) val castedValue = castLiteralValue(value,
spark git commit: [MINOR][SQL] Avoid hardcoded configuration keys in SQLConf's `doc`
Repository: spark Updated Branches: refs/heads/master 5e5d886a2 -> 5bd5e1b9c [MINOR][SQL] Avoid hardcoded configuration keys in SQLConf's `doc` ## What changes were proposed in this pull request? This PR proposes to avoid hardcoded configuration keys in SQLConf's `doc`. ## How was this patch tested? Manually verified. Closes #22877 from HyukjinKwon/minor-conf-name. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5bd5e1b9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5bd5e1b9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5bd5e1b9 Branch: refs/heads/master Commit: 5bd5e1b9c84b5f7d4d67ab94e02d49ebdd02f177 Parents: 5e5d886 Author: hyukjinkwon Authored: Tue Oct 30 07:38:26 2018 +0800 Committer: hyukjinkwon Committed: Tue Oct 30 07:38:26 2018 +0800 -- .../org/apache/spark/sql/internal/SQLConf.scala | 41 +++- 1 file changed, 23 insertions(+), 18 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5bd5e1b9/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala index 4edffce..535ec51 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala @@ -408,7 +408,8 @@ object SQLConf { val PARQUET_FILTER_PUSHDOWN_DATE_ENABLED = buildConf("spark.sql.parquet.filterPushdown.date") .doc("If true, enables Parquet filter push-down optimization for Date.
" + - "This configuration only has an effect when 'spark.sql.parquet.filterPushdown' is enabled.") + s"This configuration only has an effect when '${PARQUET_FILTER_PUSHDOWN_ENABLED.key}' is " + + "enabled.") .internal() .booleanConf .createWithDefault(true) @@ -416,7 +417,7 @@ object SQLConf { val PARQUET_FILTER_PUSHDOWN_TIMESTAMP_ENABLED = buildConf("spark.sql.parquet.filterPushdown.timestamp") .doc("If true, enables Parquet filter push-down optimization for Timestamp. " + -"This configuration only has an effect when 'spark.sql.parquet.filterPushdown' is " + +s"This configuration only has an effect when '${PARQUET_FILTER_PUSHDOWN_ENABLED.key}' is " + "enabled and Timestamp stored as TIMESTAMP_MICROS or TIMESTAMP_MILLIS type.") .internal() .booleanConf @@ -425,7 +426,8 @@ object SQLConf { val PARQUET_FILTER_PUSHDOWN_DECIMAL_ENABLED = buildConf("spark.sql.parquet.filterPushdown.decimal") .doc("If true, enables Parquet filter push-down optimization for Decimal. " + -"This configuration only has an effect when 'spark.sql.parquet.filterPushdown' is enabled.") +s"This configuration only has an effect when '${PARQUET_FILTER_PUSHDOWN_ENABLED.key}' is " + +"enabled.") .internal() .booleanConf .createWithDefault(true) @@ -433,7 +435,8 @@ object SQLConf { val PARQUET_FILTER_PUSHDOWN_STRING_STARTSWITH_ENABLED = buildConf("spark.sql.parquet.filterPushdown.string.startsWith") .doc("If true, enables Parquet filter push-down optimization for string startsWith function. " + - "This configuration only has an effect when 'spark.sql.parquet.filterPushdown' is enabled.") + s"This configuration only has an effect when '${PARQUET_FILTER_PUSHDOWN_ENABLED.key}' is " + + "enabled.") .internal() .booleanConf .createWithDefault(true) @@ -444,7 +447,8 @@ object SQLConf { "Large threshold won't necessarily provide much better performance. " + "The experiment argued that 300 is the limit threshold. " + "By setting this value to 0 this feature can be disabled. 
" + -"This configuration only has an effect when 'spark.sql.parquet.filterPushdown' is enabled.") +s"This configuration only has an effect when '${PARQUET_FILTER_PUSHDOWN_ENABLED.key}' is " + +"enabled.") .internal() .intConf .checkValue(threshold => threshold >= 0, "The threshold must not be negative.") @@ -459,14 +463,6 @@ object SQLConf { .booleanConf .createWithDefault(false) - val PARQUET_RECORD_FILTER_ENABLED = buildConf("spark.sql.parquet.recordLevelFilter.enabled") -.doc("If true, enables Parquet's native record-level filtering using the pushed down " + - "filters. This configuration only has an effect when 'spark.sql.parquet.filterPushdown' " + - "is enabled and the vectorized reader is not used. You can ensure the vectorized reader " + - "is not used by setting
spark git commit: [SPARK-25672][SQL] schema_of_csv() - schema inference from an example
Repository: spark Updated Branches: refs/heads/master c5ef477d2 -> c9667aff4 [SPARK-25672][SQL] schema_of_csv() - schema inference from an example ## What changes were proposed in this pull request? In the PR, I propose to add new function - *schema_of_csv()* which infers schema of CSV string literal. The result of the function is a string containing a schema in DDL format. For example: ```sql select schema_of_csv('1|abc', map('delimiter', '|')) ``` ``` struct<_c0:int,_c1:string> ``` ## How was this patch tested? Added new tests to `CsvFunctionsSuite`, `CsvExpressionsSuite` and SQL tests to `csv-functions.sql` Closes #22666 from MaxGekk/schema_of_csv-function. Lead-authored-by: hyukjinkwon Co-authored-by: Maxim Gekk Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c9667aff Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c9667aff Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c9667aff Branch: refs/heads/master Commit: c9667aff4f4888b650fad2ed41698025b1e84166 Parents: c5ef477 Author: hyukjinkwon Authored: Thu Nov 1 09:14:16 2018 +0800 Committer: hyukjinkwon Committed: Thu Nov 1 09:14:16 2018 +0800 -- python/pyspark/sql/functions.py | 41 +++- .../catalyst/analysis/FunctionRegistry.scala| 3 +- .../spark/sql/catalyst/csv/CSVInferSchema.scala | 220 +++ .../sql/catalyst/expressions/ExprUtils.scala| 33 ++- .../catalyst/expressions/csvExpressions.scala | 54 + .../catalyst/expressions/jsonExpressions.scala | 16 +- .../sql/catalyst/csv/CSVInferSchemaSuite.scala | 142 .../sql/catalyst/csv/UnivocityParserSuite.scala | 199 + .../expressions/CsvExpressionsSuite.scala | 10 + .../datasources/csv/CSVDataSource.scala | 2 +- .../datasources/csv/CSVInferSchema.scala| 214 -- .../scala/org/apache/spark/sql/functions.scala | 35 +++ .../sql-tests/inputs/csv-functions.sql | 8 + .../sql-tests/results/csv-functions.sql.out | 54 - .../apache/spark/sql/CsvFunctionsSuite.scala| 15 ++ 
.../datasources/csv/CSVInferSchemaSuite.scala | 143 .../datasources/csv/UnivocityParserSuite.scala | 200 - 17 files changed, 803 insertions(+), 586 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c9667aff/python/pyspark/sql/functions.py -- diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index ca2a256..beb1a06 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -2364,6 +2364,33 @@ def schema_of_json(json, options={}): return Column(jc) +@ignore_unicode_prefix +@since(3.0) +def schema_of_csv(csv, options={}): +""" +Parses a CSV string and infers its schema in DDL format. + +:param col: a CSV string or a string literal containing a CSV string. +:param options: options to control parsing. accepts the same options as the CSV datasource + +>>> df = spark.range(1) +>>> df.select(schema_of_csv(lit('1|a'), {'sep':'|'}).alias("csv")).collect() +[Row(csv=u'struct<_c0:int,_c1:string>')] +>>> df.select(schema_of_csv('1|a', {'sep':'|'}).alias("csv")).collect() +[Row(csv=u'struct<_c0:int,_c1:string>')] +""" +if isinstance(csv, basestring): +col = _create_column_from_literal(csv) +elif isinstance(csv, Column): +col = _to_java_column(csv) +else: +raise TypeError("schema argument should be a column or string") + +sc = SparkContext._active_spark_context +jc = sc._jvm.functions.schema_of_csv(col, options) +return Column(jc) + + @since(1.5) def size(col): """ @@ -2664,13 +2691,13 @@ def from_csv(col, schema, options={}): :param schema: a string with schema in DDL format to use when parsing the CSV column. :param options: options to control parsing. 
accepts the same options as the CSV datasource ->>> data = [(1, '1')] ->>> df = spark.createDataFrame(data, ("key", "value")) ->>> df.select(from_csv(df.value, "a INT").alias("csv")).collect() -[Row(csv=Row(a=1))] ->>> df = spark.createDataFrame(data, ("key", "value")) ->>> df.select(from_csv(df.value, lit("a INT")).alias("csv")).collect() -[Row(csv=Row(a=1))] +>>> data = [("1,2,3",)] +>>> df = spark.createDataFrame(data, ("value",)) +>>> df.select(from_csv(df.value, "a INT, b INT, c INT").alias("csv")).collect() +[Row(csv=Row(a=1, b=2, c=3))] +>>> value = data[0][0] +>>> df.select(from_csv(df.value, schema_of_csv(value)).alias("csv")).collect() +[Row(csv=Row(_c0=1, _c1=2, _c2=3))] """ sc = SparkContext._active_spark_context
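To make the new function's behavior concrete, here is a toy re-implementation of the inference step in plain Python. It only distinguishes int/double/string and only honors a delimiter option, whereas Spark's real `CSVInferSchema` supports the full CSV option set and many more types — treat it as a sketch of the idea, not the actual algorithm:

```python
# Toy sketch of schema_of_csv('1|abc', map('delimiter', '|')): split one
# example record on the delimiter, guess a type per field, and render the
# result in Spark's DDL struct notation.

def toy_schema_of_csv(record, delimiter=","):
    def infer(value):
        try:
            int(value)
            return "int"
        except ValueError:
            pass
        try:
            float(value)
            return "double"
        except ValueError:
            return "string"

    fields = record.split(delimiter)
    cols = ",".join("_c%d:%s" % (i, infer(v)) for i, v in enumerate(fields))
    return "struct<%s>" % cols

print(toy_schema_of_csv("1|abc", delimiter="|"))  # struct<_c0:int,_c1:string>
```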
spark git commit: [SPARK-25886][SQL][MINOR] Improve error message of `FailureSafeParser` and `from_avro` in FAILFAST mode
Repository: spark Updated Branches: refs/heads/master 3c0e9ce94 -> 57eddc718 [SPARK-25886][SQL][MINOR] Improve error message of `FailureSafeParser` and `from_avro` in FAILFAST mode ## What changes were proposed in this pull request? Currently in `FailureSafeParser` and `from_avro`, the exception is created with such code ``` throw new SparkException("Malformed records are detected in record parsing. " + s"Parse Mode: ${FailFastMode.name}.", e.cause) ``` 1. The cause part should be `e` instead of `e.cause` 2. If `e` contains non-null message, it should be shown in `from_json`/`from_csv`/`from_avro`, e.g. ``` com.fasterxml.jackson.core.JsonParseException: Unexpected character ('1' (code 49)): was expecting a colon to separate field name and value at [Source: (InputStreamReader); line: 1, column: 7] ``` 3.Kindly show hint for trying PERMISSIVE in error message. ## How was this patch tested? Unit test. Closes #22895 from gengliangwang/improve_error_msg. Authored-by: Gengliang Wang Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/57eddc71 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/57eddc71 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/57eddc71 Branch: refs/heads/master Commit: 57eddc7182ece0030f6d0cc02339c0b8d8c0be5c Parents: 3c0e9ce Author: Gengliang Wang Authored: Wed Oct 31 20:22:57 2018 +0800 Committer: hyukjinkwon Committed: Wed Oct 31 20:22:57 2018 +0800 -- .../main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala | 2 +- .../org/apache/spark/sql/catalyst/util/FailureSafeParser.scala| 3 ++- 2 files changed, 3 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/57eddc71/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala -- diff --git a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala 
b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala index ae61587..5656ac7 100644 --- a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala +++ b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala @@ -102,7 +102,7 @@ case class AvroDataToCatalyst( case FailFastMode => throw new SparkException("Malformed records are detected in record parsing. " + s"Current parse Mode: ${FailFastMode.name}. To process malformed records as null " + -"result, try setting the option 'mode' as 'PERMISSIVE'.", e.getCause) +"result, try setting the option 'mode' as 'PERMISSIVE'.", e) case _ => throw new AnalysisException(unacceptableModeMessage(parseMode.name)) } http://git-wip-us.apache.org/repos/asf/spark/blob/57eddc71/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala index fecfff5..76745b1 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala @@ -73,7 +73,8 @@ class FailureSafeParser[IN]( Iterator.empty case FailFastMode => throw new SparkException("Malformed records are detected in record parsing. " + -s"Parse Mode: ${FailFastMode.name}.", e.cause) +s"Parse Mode: ${FailFastMode.name}. To process malformed records as null " + +"result, try setting the option 'mode' as 'PERMISSIVE'.", e) } } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
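The essence of the fix is exception chaining: the wrapper exception must carry the caught exception `e` itself as its cause, not `e.cause`, which silently drops the parser's useful message. A Python analogue of the corrected FAILFAST behavior — the `SparkException` class and `parse_record` helper below are stand-ins for illustration, not PySpark APIs:

```python
# The corrected chaining: wrap the caught exception `e` itself (`raise ... from e`),
# so the original parser error stays reachable as __cause__.

class SparkException(Exception):
    pass

def parse_record(raw):
    return int(raw)  # raises ValueError with a useful message for bad input

def fail_fast_parse(raw):
    try:
        return parse_record(raw)
    except ValueError as e:
        # Keep `e` as the cause and hint at PERMISSIVE, as the patch does.
        raise SparkException(
            "Malformed records are detected in record parsing. "
            "Parse Mode: FAILFAST. To process malformed records as null "
            "result, try setting the option 'mode' as 'PERMISSIVE'.") from e

try:
    fail_fast_parse("not-a-number")
except SparkException as exc:
    print(type(exc.__cause__).__name__)  # ValueError — original error preserved
```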
spark git commit: [SPARKR] found some extra whitespace in the R tests
Repository: spark Updated Branches: refs/heads/master f6ff6329e -> 243ce319a [SPARKR] found some extra whitespace in the R tests ## What changes were proposed in this pull request? during my ubuntu-port testing, i found some extra whitespace that for some reason wasn't getting caught on the centos lint-r build step. ## How was this patch tested? the build system will test this! i used one of my ubuntu testing builds and scped over the modified file. before my fix: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-ubuntu-testing/22/console after my fix: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-ubuntu-testing/23/console Closes #22896 from shaneknapp/remove-extra-whitespace. Authored-by: shane knapp Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/243ce319 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/243ce319 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/243ce319 Branch: refs/heads/master Commit: 243ce319a06f20365d5b08d479642d75748645d9 Parents: f6ff632 Author: shane knapp Authored: Wed Oct 31 10:32:26 2018 +0800 Committer: hyukjinkwon Committed: Wed Oct 31 10:32:26 2018 +0800 -- R/pkg/tests/fulltests/test_sparkSQL_eager.R | 16 1 file changed, 8 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/243ce319/R/pkg/tests/fulltests/test_sparkSQL_eager.R -- diff --git a/R/pkg/tests/fulltests/test_sparkSQL_eager.R b/R/pkg/tests/fulltests/test_sparkSQL_eager.R index df7354f..9b4489a 100644 --- a/R/pkg/tests/fulltests/test_sparkSQL_eager.R +++ b/R/pkg/tests/fulltests/test_sparkSQL_eager.R @@ -22,12 +22,12 @@ context("test show SparkDataFrame when eager execution is enabled.") test_that("eager execution is not enabled", { # Start Spark session without eager execution enabled sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE) - + df <- 
createDataFrame(faithful) expect_is(df, "SparkDataFrame") expected <- "eruptions:double, waiting:double" expect_output(show(df), expected) - + # Stop Spark session sparkR.session.stop() }) @@ -35,9 +35,9 @@ test_that("eager execution is enabled", { # Start Spark session with eager execution enabled sparkConfig <- list(spark.sql.repl.eagerEval.enabled = "true") - + sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, sparkConfig = sparkConfig) - + df <- createDataFrame(faithful) expect_is(df, "SparkDataFrame") expected <- paste0("(+-+---+\n", @@ -45,7 +45,7 @@ test_that("eager execution is enabled", { "+-+---+\n)*", "(only showing top 20 rows)") expect_output(show(df), expected) - + # Stop Spark session sparkR.session.stop() }) @@ -55,9 +55,9 @@ test_that("eager execution is enabled with maxNumRows and truncate set", { sparkConfig <- list(spark.sql.repl.eagerEval.enabled = "true", spark.sql.repl.eagerEval.maxNumRows = as.integer(5), spark.sql.repl.eagerEval.truncate = as.integer(2)) - + sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, sparkConfig = sparkConfig) - + df <- arrange(createDataFrame(faithful), "waiting") expect_is(df, "SparkDataFrame") expected <- paste0("(+-+---+\n", @@ -66,7 +66,7 @@ test_that("eager execution is enabled with maxNumRows and truncate set", { "| 1.| 43|\n)*", "(only showing top 5 rows)") expect_output(show(df), expected) - + # Stop Spark session sparkR.session.stop() })
spark git commit: [SPARK-25847][SQL][TEST] Refactor JSONBenchmarks to use main method
Repository: spark Updated Branches: refs/heads/master 891032da6 -> f6ff6329e [SPARK-25847][SQL][TEST] Refactor JSONBenchmarks to use main method ## What changes were proposed in this pull request? Refactor JSONBenchmark to use main method use spark-submit: `bin/spark-submit --class org.apache.spark.sql.execution.datasources.json.JSONBenchmark --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar,./sql/catalyst/target/spark-catalyst_2.11-3.0.0-SNAPSHOT-tests.jar ./sql/core/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar` Generate benchmark result: `SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.datasources.json.JSONBenchmark"` ## How was this patch tested? manual tests Closes #22844 from heary-cao/JSONBenchmarks. Lead-authored-by: caoxuewen Co-authored-by: heary Co-authored-by: Dongjoon Hyun Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f6ff6329 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f6ff6329 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f6ff6329 Branch: refs/heads/master Commit: f6ff6329eee720e19a56b90c0ffda9da5cecca5b Parents: 891032d Author: caoxuewen Authored: Wed Oct 31 10:28:17 2018 +0800 Committer: hyukjinkwon Committed: Wed Oct 31 10:28:17 2018 +0800 -- sql/core/benchmarks/JSONBenchmark-results.txt | 37 .../datasources/json/JsonBenchmark.scala| 183 .../datasources/json/JsonBenchmarks.scala | 217 --- 3 files changed, 220 insertions(+), 217 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f6ff6329/sql/core/benchmarks/JSONBenchmark-results.txt -- diff --git a/sql/core/benchmarks/JSONBenchmark-results.txt b/sql/core/benchmarks/JSONBenchmark-results.txt new file mode 100644 index 000..9993730 --- /dev/null +++ b/sql/core/benchmarks/JSONBenchmark-results.txt @@ -0,0 +1,37 @@ + +Benchmark for performance of JSON parsing + + +Preparing data for benchmarking ... 
+OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64 +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz +JSON schema inferring: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative + +No encoding 62946 / 63310 1.6 629.5 1.0X +UTF-8 is set 112814 / 112866 0.9 1128.1 0.6X + +Preparing data for benchmarking ... +OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64 +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz +JSON per-line parsing: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative + +No encoding 16468 / 16553 6.1 164.7 1.0X +UTF-8 is set16420 / 16441 6.1 164.2 1.0X + +Preparing data for benchmarking ... +OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64 +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz +JSON parsing of wide lines: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative + +No encoding 39789 / 40053 0.3 3978.9 1.0X +UTF-8 is set39505 / 39584 0.3 3950.5 1.0X + +OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64 +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz +Count a dataset with 10 columns: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative + +Select 10 columns + count() 15997 / 16015 0.6 1599.7 1.0X +Select 1 column + count() 13280 / 13326 0.8 1328.0 1.2X +count() 3006 / 3021 3.3 300.6 5.3X + + http://git-wip-us.apache.org/repos/asf/spark/blob/f6ff6329/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala -- diff --git
spark git commit: [SPARK-24709][SQL][2.4] use str instead of basestring in isinstance
Repository: spark Updated Branches: refs/heads/branch-2.4 f575616db -> 0f74bac64 [SPARK-24709][SQL][2.4] use str instead of basestring in isinstance ## What changes were proposed in this pull request? after backport https://github.com/apache/spark/pull/22775 to 2.4, the 2.4 sbt Jenkins QA job is broken, see https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.4-test-sbt-hadoop-2.7/147/console This PR adds `if sys.version >= '3': basestring = str` which only exists in master. ## How was this patch tested? existing test Closes #22858 from cloud-fan/python. Authored-by: Wenchen Fan Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0f74bac6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0f74bac6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0f74bac6 Branch: refs/heads/branch-2.4 Commit: 0f74bac647c9f8fce112eada7913504b2c6d08fa Parents: f575616 Author: Wenchen Fan Authored: Sun Oct 28 10:50:46 2018 +0800 Committer: hyukjinkwon Committed: Sun Oct 28 10:50:46 2018 +0800 -- python/pyspark/sql/functions.py | 3 +++ 1 file changed, 3 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0f74bac6/python/pyspark/sql/functions.py -- diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index 9583a98..e1d6ea3 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -25,6 +25,9 @@ import warnings if sys.version < "3": from itertools import imap as map +if sys.version >= '3': +basestring = str + from pyspark import since, SparkContext from pyspark.rdd import ignore_unicode_prefix, PythonEvalType from pyspark.sql.column import Column, _to_java_column, _to_seq, _create_column_from_literal
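For context, here is the shim in isolation and why it helps: on Python 3 the name `basestring` no longer exists, so aliasing it to `str` lets a single `isinstance(x, basestring)` check run unchanged on both Python 2 and Python 3. The `describe` helper below is hypothetical, added only to exercise the check:

```python
# The Python 2/3 compatibility shim from the patch, used by a toy helper that
# accepts either a DDL schema string or some other schema object.
import sys

if sys.version >= '3':
    basestring = str  # on Py3, basestring becomes an alias for str

def describe(schema):
    if isinstance(schema, basestring):  # works on both Py2 and Py3
        return "DDL string schema"
    return "column-based schema"

print(describe("a INT"))  # DDL string schema
```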
spark git commit: [SPARK-25638][SQL] Adding new function - to_csv()
Repository: spark Updated Branches: refs/heads/master 1a7abf3f4 -> 39399f40b [SPARK-25638][SQL] Adding new function - to_csv() ## What changes were proposed in this pull request? New functions takes a struct and converts it to a CSV strings using passed CSV options. It accepts the same CSV options as CSV data source does. ## How was this patch tested? Added `CsvExpressionsSuite`, `CsvFunctionsSuite` as well as R, Python and SQL tests similar to tests for `to_json()` Closes #22626 from MaxGekk/to_csv. Lead-authored-by: Maxim Gekk Co-authored-by: Maxim Gekk Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/39399f40 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/39399f40 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/39399f40 Branch: refs/heads/master Commit: 39399f40b861f7d8e60d0e25d2f8801343477834 Parents: 1a7abf3 Author: Maxim Gekk Authored: Sun Nov 4 14:57:38 2018 +0800 Committer: hyukjinkwon Committed: Sun Nov 4 14:57:38 2018 +0800 -- R/pkg/NAMESPACE | 1 + R/pkg/R/functions.R | 31 +-- R/pkg/R/generics.R | 4 + R/pkg/tests/fulltests/test_sparkSQL.R | 5 ++ python/pyspark/sql/functions.py | 22 + .../catalyst/analysis/FunctionRegistry.scala| 3 +- .../sql/catalyst/csv/UnivocityGenerator.scala | 93 .../catalyst/expressions/csvExpressions.scala | 67 ++ .../expressions/CsvExpressionsSuite.scala | 44 + .../datasources/csv/CSVFileFormat.scala | 2 +- .../datasources/csv/UnivocityGenerator.scala| 90 --- .../scala/org/apache/spark/sql/functions.scala | 26 ++ .../sql-tests/inputs/csv-functions.sql | 6 ++ .../sql-tests/results/csv-functions.sql.out | 36 +++- .../apache/spark/sql/CsvFunctionsSuite.scala| 14 ++- 15 files changed, 345 insertions(+), 99 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/39399f40/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index f9f556e..9d4f05a 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ 
-380,6 +380,7 @@ exportMethods("%<=>%", "tanh", "toDegrees", "toRadians", + "to_csv", "to_date", "to_json", "to_timestamp", http://git-wip-us.apache.org/repos/asf/spark/blob/39399f40/R/pkg/R/functions.R -- diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index d2ca1d6..9292363 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -187,6 +187,7 @@ NULL #' \itemize{ #' \item \code{to_json}: it is the column containing the struct, array of the structs, #' the map or array of maps. +#' \item \code{to_csv}: it is the column containing the struct. #' \item \code{from_json}: it is the column containing the JSON string. #' \item \code{from_csv}: it is the column containing the CSV string. #' } @@ -204,11 +205,11 @@ NULL #' also supported for the schema. #' \item \code{from_csv}: a DDL-formatted string #' } -#' @param ... additional argument(s). In \code{to_json} and \code{from_json}, this contains -#'additional named properties to control how it is converted, accepts the same -#'options as the JSON data source. Additionally \code{to_json} supports the "pretty" -#'option which enables pretty JSON generation. In \code{arrays_zip}, this contains -#'additional Columns of arrays to be merged. +#' @param ... additional argument(s). In \code{to_json}, \code{to_csv} and \code{from_json}, +#'this contains additional named properties to control how it is converted, accepts +#'the same options as the JSON/CSV data source. Additionally \code{to_json} supports +#'the "pretty" option which enables pretty JSON generation. In \code{arrays_zip}, +#'this contains additional Columns of arrays to be merged. #' @name column_collection_functions #' @rdname column_collection_functions #' @family collection functions @@ -1741,6 +1742,26 @@ setMethod("to_json", signature(x = "Column"), }) #' @details +#' \code{to_csv}: Converts a column containing a \code{structType} into a Column of CSV string. +#' Resolving the Column can fail if an unsupported type is encountered. 
+#' +#' @rdname column_collection_functions +#' @aliases to_csv to_csv,Column-method +#' @examples +#' +#' \dontrun{ +#' # Converts a
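As a rough illustration of what `to_csv` does to a single struct value, the sketch below renders a list of field values as one CSV line. Only a `sep` option is modeled; the real function accepts the full CSV data source option set, and `toy_to_csv` is not part of any Spark API:

```python
# Minimal sketch: serialize one struct's field values as a CSV line, with a
# configurable separator and standard quoting when a value contains it.
import csv
import io

def toy_to_csv(struct_values, sep=","):
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=sep, lineterminator="")
    writer.writerow(struct_values)
    return buf.getvalue()

print(toy_to_csv([2, "hello"]))          # 2,hello
print(toy_to_csv([1, "a|b"], sep="|"))   # 1|"a|b"  (value containing sep is quoted)
```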
spark git commit: [INFRA] Close stale PRs
Repository: spark Updated Branches: refs/heads/master 39399f40b -> 463a67668 [INFRA] Close stale PRs Closes https://github.com/apache/spark/pull/22859 Closes https://github.com/apache/spark/pull/22849 Closes https://github.com/apache/spark/pull/22591 Closes https://github.com/apache/spark/pull/22322 Closes https://github.com/apache/spark/pull/22312 Closes https://github.com/apache/spark/pull/19590 Closes #22934 from wangyum/CloseStalePRs. Authored-by: Yuming Wang Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/463a6766 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/463a6766 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/463a6766 Branch: refs/heads/master Commit: 463a6766876942e90f10d1ce2d1e36a8284bfbc2 Parents: 39399f4 Author: Yuming Wang Authored: Sun Nov 4 14:59:33 2018 +0800 Committer: hyukjinkwon Committed: Sun Nov 4 14:59:33 2018 +0800 -- --
spark git commit: [SPARK-25819][SQL] Support parse mode option for the function `from_avro`
Repository: spark Updated Branches: refs/heads/master 79f3babcc -> 24e8c27df [SPARK-25819][SQL] Support parse mode option for the function `from_avro` ## What changes were proposed in this pull request? Current the function `from_avro` throws exception on reading corrupt records. In practice, there could be various reasons of data corruption. It would be good to support `PERMISSIVE` mode and allow the function from_avro to process all the input file/streaming, which is consistent with from_json and from_csv. There is no obvious down side for supporting `PERMISSIVE` mode. Different from `from_csv` and `from_json`, the default parse mode is `FAILFAST` for the following reasons: 1. Since Avro is structured data format, input data is usually able to be parsed by certain schema. In such case, exposing the problems of input data to users is better than hiding it. 2. For `PERMISSIVE` mode, we have to force the data schema as fully nullable. This seems quite unnecessary for Avro. Reversing non-null schema might archive more perf optimizations in Spark. 3. To be consistent with the behavior in Spark 2.4 . ## How was this patch tested? Unit test Manual previewing generated html for the Avro data source doc: ![image](https://user-images.githubusercontent.com/1097932/47510100-02558880-d8aa-11e8-9d57-a43daee4c6b9.png) Closes #22814 from gengliangwang/improve_from_avro. 
Authored-by: Gengliang Wang Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/24e8c27d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/24e8c27d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/24e8c27d Branch: refs/heads/master Commit: 24e8c27dfe31e6e0a53c89e6ddc36327e537931b Parents: 79f3bab Author: Gengliang Wang Authored: Fri Oct 26 11:39:38 2018 +0800 Committer: hyukjinkwon Committed: Fri Oct 26 11:39:38 2018 +0800 -- docs/sql-data-sources-avro.md | 18 +++- .../spark/sql/avro/AvroDataToCatalyst.scala | 90 +--- .../org/apache/spark/sql/avro/AvroOptions.scala | 16 +++- .../org/apache/spark/sql/avro/package.scala | 28 +- .../avro/AvroCatalystDataConversionSuite.scala | 58 +++-- .../spark/sql/avro/AvroFunctionsSuite.scala | 36 +++- 6 files changed, 219 insertions(+), 27 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/24e8c27d/docs/sql-data-sources-avro.md -- diff --git a/docs/sql-data-sources-avro.md b/docs/sql-data-sources-avro.md index d3b81f0..bfe641d 100644 --- a/docs/sql-data-sources-avro.md +++ b/docs/sql-data-sources-avro.md @@ -142,7 +142,10 @@ StreamingQuery query = output ## Data Source Option -Data source options of Avro can be set using the `.option` method on `DataFrameReader` or `DataFrameWriter`. +Data source options of Avro can be set via: + * the `.option` method on `DataFrameReader` or `DataFrameWriter`. + * the `options` parameter in function `from_avro`. + Property NameDefaultMeaningScope @@ -177,6 +180,19 @@ Data source options of Avro can be set using the `.option` method on `DataFrameR Currently supported codecs are uncompressed, snappy, deflate, bzip2 and xz. If the option is not set, the configuration spark.sql.avro.compression.codec config is taken into account. write + +mode +FAILFAST +The mode option allows to specify parse mode for function from_avro. 
+ Currently supported modes are: + +FAILFAST: Throws an exception on processing corrupted record. +PERMISSIVE: Corrupt records are processed as null result. Therefore, the +data schema is forced to be fully nullable, which might be different from the one user provided. + + +function from_avro + ## Configuration http://git-wip-us.apache.org/repos/asf/spark/blob/24e8c27d/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala -- diff --git a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala index 915769f..43d3f6e 100644 --- a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala +++ b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala @@ -17,20 +17,37 @@ package org.apache.spark.sql.avro +import scala.util.control.NonFatal + import org.apache.avro.Schema import org.apache.avro.generic.GenericDatumReader import org.apache.avro.io.{BinaryDecoder, DecoderFactory} -import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, UnaryExpression} +import
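The mode dispatch described above can be rendered schematically in Python. `decode_avro` and `from_avro_sketch` are illustrative stand-ins — real decoding happens in `AvroDataToCatalyst` on the JVM — but they show why FAILFAST is the default (corruption surfaces immediately, with the original cause attached) while PERMISSIVE turns a corrupt record into a null result:

```python
# Schematic parse-mode dispatch for from_avro: FAILFAST re-raises with the
# original error chained as the cause; PERMISSIVE maps corruption to None.

def decode_avro(record):
    if record == "corrupt":
        raise ValueError("malformed Avro payload")
    return {"id": int(record)}

def from_avro_sketch(record, mode="FAILFAST"):
    try:
        return decode_avro(record)
    except ValueError as e:
        if mode == "PERMISSIVE":
            return None  # corrupt record becomes a null row
        if mode == "FAILFAST":
            raise RuntimeError(
                "Malformed records are detected in record parsing. "
                "Current parse Mode: FAILFAST. To process malformed records "
                "as null result, try setting the option 'mode' as "
                "'PERMISSIVE'.") from e
        raise ValueError("unacceptable parse mode: %s" % mode)

print(from_avro_sketch("7"))                      # {'id': 7}
print(from_avro_sketch("corrupt", "PERMISSIVE"))  # None
```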
spark git commit: [SPARK-25763][SQL][PYSPARK][TEST] Use more `@contextmanager` to ensure clean-up each test.
Repository: spark Updated Branches: refs/heads/master 1117fc35f -> e80f18dbd [SPARK-25763][SQL][PYSPARK][TEST] Use more `@contextmanager` to ensure clean-up each test. ## What changes were proposed in this pull request? Currently each test in `SQLTest` in PySpark is not cleaned properly. We should introduce and use more `contextmanager` to be convenient to clean up the context properly. ## How was this patch tested? Modified tests. Closes #22762 from ueshin/issues/SPARK-25763/cleanup_sqltests. Authored-by: Takuya UESHIN Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e80f18db Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e80f18db Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e80f18db Branch: refs/heads/master Commit: e80f18dbd8bc4c2aca9ba6dd487b50e95c55d2e6 Parents: 1117fc3 Author: Takuya UESHIN Authored: Fri Oct 19 00:31:01 2018 +0800 Committer: hyukjinkwon Committed: Fri Oct 19 00:31:01 2018 +0800 -- python/pyspark/sql/tests.py | 556 ++- 1 file changed, 318 insertions(+), 238 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e80f18db/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index 8065d82..82dc5a6 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -225,6 +225,63 @@ class SQLTestUtils(object): else: self.spark.conf.set(key, old_value) +@contextmanager +def database(self, *databases): +""" +A convenient context manager to test with some specific databases. This drops the given +databases if exist and sets current database to "default" when it exits. +""" +assert hasattr(self, "spark"), "it should have 'spark' attribute, having a spark session." 
+ +try: +yield +finally: +for db in databases: +self.spark.sql("DROP DATABASE IF EXISTS %s CASCADE" % db) +self.spark.catalog.setCurrentDatabase("default") + +@contextmanager +def table(self, *tables): +""" +A convenient context manager to test with some specific tables. This drops the given tables +if exist when it exits. +""" +assert hasattr(self, "spark"), "it should have 'spark' attribute, having a spark session." + +try: +yield +finally: +for t in tables: +self.spark.sql("DROP TABLE IF EXISTS %s" % t) + +@contextmanager +def tempView(self, *views): +""" +A convenient context manager to test with some specific views. This drops the given views +if exist when it exits. +""" +assert hasattr(self, "spark"), "it should have 'spark' attribute, having a spark session." + +try: +yield +finally: +for v in views: +self.spark.catalog.dropTempView(v) + +@contextmanager +def function(self, *functions): +""" +A convenient context manager to test with some specific functions. This drops the given +functions if exist when it exits. +""" +assert hasattr(self, "spark"), "it should have 'spark' attribute, having a spark session." 
+ +try: +yield +finally: +for f in functions: +self.spark.sql("DROP FUNCTION IF EXISTS %s" % f) + class ReusedSQLTestCase(ReusedPySparkTestCase, SQLTestUtils): @classmethod @@ -332,6 +389,7 @@ class SQLTests(ReusedSQLTestCase): @classmethod def setUpClass(cls): ReusedSQLTestCase.setUpClass() +cls.spark.catalog._reset() cls.tempdir = tempfile.NamedTemporaryFile(delete=False) os.unlink(cls.tempdir.name) cls.testData = [Row(key=i, value=str(i)) for i in range(100)] @@ -347,12 +405,6 @@ class SQLTests(ReusedSQLTestCase): sqlContext2 = SQLContext(self.sc) self.assertTrue(sqlContext1.sparkSession is sqlContext2.sparkSession) -def tearDown(self): -super(SQLTests, self).tearDown() - -# tear down test_bucketed_write state -self.spark.sql("DROP TABLE IF EXISTS pyspark_bucket") - def test_row_should_be_read_only(self): row = Row(a=1, b=2) self.assertEqual(1, row.a) @@ -473,11 +525,12 @@ class SQLTests(ReusedSQLTestCase): self.assertEqual(row[0], 4) def test_udf2(self): -self.spark.catalog.registerFunction("strlen", lambda string: len(string), IntegerType()) -self.spark.createDataFrame(self.sc.parallelize([Row(a="test")]))\ -
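The refactoring hinges on a standard `contextlib.contextmanager` pattern: clean-up goes in a `finally` block after the `yield`, so it runs even when the test body raises. A self-contained sketch mirroring `SQLTestUtils.table()`, with a plain dict standing in for the Spark catalog:

```python
# Context-manager clean-up pattern from the patch: the finally block drops the
# named tables whether or not the test body inside the `with` succeeds.
from contextlib import contextmanager

catalog = {}  # stand-in for the Spark catalog

@contextmanager
def table(*tables):
    try:
        yield
    finally:
        for t in tables:
            catalog.pop(t, None)  # DROP TABLE IF EXISTS

try:
    with table("pyspark_bucket"):
        catalog["pyspark_bucket"] = "rows"
        raise AssertionError("test body failed")
except AssertionError:
    pass

print("pyspark_bucket" in catalog)  # False — dropped despite the failure
```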
spark git commit: [HOTFIX] Fix PySpark pip packaging tests by non-ascii compatible character
Repository: spark Updated Branches: refs/heads/master 3b4f35f56 -> 5330c192b [HOTFIX] Fix PySpark pip packaging tests by non-ascii compatible character ## What changes were proposed in this pull request? pip installation requires packaging the bin scripts together. https://github.com/apache/spark/blob/master/python/setup.py#L71 The recent fix introduced a non-ascii character (a non-breakable space, I guess) at https://github.com/apache/spark/commit/ec96d34e74148803190db8dcf9fda527eeea9255. This is usually not a problem, but it looks like Jenkins's default encoding is `ascii`, and while copying the script there is an implicit conversion between bytes and strings in which the default encoding is used: https://github.com/pypa/setuptools/blob/v40.4.3/setuptools/command/develop.py#L185-L189 ## How was this patch tested? Jenkins Closes #22782 from HyukjinKwon/pip-failure-fix. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5330c192 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5330c192 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5330c192 Branch: refs/heads/master Commit: 5330c192bd87eb18351e72e390baf29855d99b0a Parents: 3b4f35f Author: hyukjinkwon Authored: Sun Oct 21 02:04:45 2018 +0800 Committer: hyukjinkwon Committed: Sun Oct 21 02:04:45 2018 +0800 -- bin/docker-image-tool.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5330c192/bin/docker-image-tool.sh -- diff --git a/bin/docker-image-tool.sh b/bin/docker-image-tool.sh index 001590a..7256355 100755 --- a/bin/docker-image-tool.sh +++ b/bin/docker-image-tool.sh @@ -79,7 +79,7 @@ function build { fi # Verify that Spark has actually been built/is a runnable distribution - #Â i.e. the Spark JARs that the Docker files will place into the image are present + # i.e. 
the Spark JARs that the Docker files will place into the image are present local TOTAL_JARS=$(ls $JARS/spark-* | wc -l) TOTAL_JARS=$(( $TOTAL_JARS )) if [ "${TOTAL_JARS}" -eq 0 ]; then - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
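The failure mode this hotfix describes is easy to reproduce in isolation: a UTF-8 no-break space (bytes `0xC2 0xA0`, the character rendered as `Â` in the removed diff line above) cannot pass through an implicit decode under the `ascii` codec. A small sketch (the byte string below is a stand-in for the real script line, not a quote of it):

```python
# A script line containing a UTF-8 no-break space (0xC2 0xA0) after the "#".
line = b"#\xc2\xa0 i.e. the Spark JARs are present\n"

def copy_line(raw, encoding):
    """Simulate a copy step with an implicit bytes -> str conversion."""
    return raw.decode(encoding)

# Under an ascii default encoding (as on the Jenkins workers) the copy fails:
try:
    copy_line(line, "ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Under utf-8 the same bytes decode fine, yielding U+00A0 NO-BREAK SPACE:
utf8_text = copy_line(line, "utf-8")
```

Replacing the character with a plain ASCII space, as the one-line diff does, makes the script safe under either default encoding.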
spark git commit: [SPARK-25950][SQL] from_csv should respect to spark.sql.columnNameOfCorruptRecord
Repository: spark Updated Branches: refs/heads/master 63ca4bbe7 -> 76813cfa1 [SPARK-25950][SQL] from_csv should respect to spark.sql.columnNameOfCorruptRecord ## What changes were proposed in this pull request? Fix for `CsvToStructs` to take into account SQL config `spark.sql.columnNameOfCorruptRecord` similar to `from_json`. ## How was this patch tested? Added new test where `spark.sql.columnNameOfCorruptRecord` is set to corrupt column name different from default. Closes #22956 from MaxGekk/csv-tests. Authored-by: Maxim Gekk Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/76813cfa Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/76813cfa Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/76813cfa Branch: refs/heads/master Commit: 76813cfa1e2607ea3b669a79e59b568e96395b2e Parents: 63ca4bb Author: Maxim Gekk Authored: Wed Nov 7 11:26:17 2018 +0800 Committer: hyukjinkwon Committed: Wed Nov 7 11:26:17 2018 +0800 -- .../catalyst/expressions/csvExpressions.scala | 9 +- .../apache/spark/sql/CsvFunctionsSuite.scala| 31 2 files changed, 39 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/76813cfa/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala index 74b670a..aff372b 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala @@ -27,6 +27,7 @@ import org.apache.spark.sql.catalyst.analysis.TypeCheckResult import org.apache.spark.sql.catalyst.csv._ import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback import 
org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types._ import org.apache.spark.unsafe.types.UTF8String @@ -92,8 +93,14 @@ case class CsvToStructs( } } + val nameOfCorruptRecord = SQLConf.get.getConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD) + @transient lazy val parser = { -val parsedOptions = new CSVOptions(options, columnPruning = true, timeZoneId.get) +val parsedOptions = new CSVOptions( + options, + columnPruning = true, + defaultTimeZoneId = timeZoneId.get, + defaultColumnNameOfCorruptRecord = nameOfCorruptRecord) val mode = parsedOptions.parseMode if (mode != PermissiveMode && mode != FailFastMode) { throw new AnalysisException(s"from_csv() doesn't support the ${mode.name} mode. " + http://git-wip-us.apache.org/repos/asf/spark/blob/76813cfa/sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala index eb6b248..1dd8ec3 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala @@ -19,7 +19,9 @@ package org.apache.spark.sql import scala.collection.JavaConverters._ +import org.apache.spark.SparkException import org.apache.spark.sql.functions._ +import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.test.SharedSQLContext import org.apache.spark.sql.types._ @@ -86,4 +88,33 @@ class CsvFunctionsSuite extends QueryTest with SharedSQLContext { checkAnswer(df.select(to_csv($"a", options)), Row("26/08/2015 18:00") :: Nil) } + + test("from_csv invalid csv - check modes") { +withSQLConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD.key -> "_unparsed") { + val schema = new StructType() +.add("a", IntegerType) +.add("b", IntegerType) +.add("_unparsed", StringType) + val badRec = "\"" + val df = Seq(badRec, "2,12").toDS() + + checkAnswer( 
+df.select(from_csv($"value", schema, Map("mode" -> "PERMISSIVE"))), +Row(Row(null, null, badRec)) :: Row(Row(2, 12, null)) :: Nil) + + val exception1 = intercept[SparkException] { +df.select(from_csv($"value", schema, Map("mode" -> "FAILFAST"))).collect() + }.getMessage + assert(exception1.contains( +"Malformed records are detected in record parsing. Parse Mode: FAILFAST.")) + + val exception2 = intercept[SparkException] { +df.select(from_csv($"value", schema, Map("mode" ->
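The two modes exercised by this test can be modelled without Spark: in PERMISSIVE mode a malformed record yields a row whose data columns are null and whose corrupt-record column holds the raw input, while FAILFAST raises on the first malformed record. A toy Python sketch of those semantics (`parse_csv` illustrates the behaviour; it is not Spark's `FailureSafeParser`):

```python
import csv
import io

def parse_csv(lines, fields, corrupt_col="_corrupt_record", mode="PERMISSIVE"):
    """Toy model of Spark's PERMISSIVE vs FAILFAST CSV parse modes."""
    rows = []
    for line in lines:
        try:
            values = next(csv.reader(io.StringIO(line), strict=True))
            if len(values) != len(fields):
                raise ValueError("wrong number of columns")
            record = {f: int(v) for f, v in zip(fields, values)}
            record[corrupt_col] = None
        except (csv.Error, ValueError, StopIteration):
            if mode == "FAILFAST":
                raise RuntimeError(
                    "Malformed records are detected in record parsing. "
                    "Parse Mode: FAILFAST.")
            # PERMISSIVE: null out the data columns, keep the raw record.
            record = {f: None for f in fields}
            record[corrupt_col] = line
        rows.append(record)
    return rows
```

With the inputs from the test above (`'"'` and `"2,12"`), PERMISSIVE produces one corrupt row and one parsed row, and FAILFAST raises.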
spark git commit: [SPARK-25962][BUILD][PYTHON] Specify minimum versions for both pydocstyle and flake8 in 'lint-python' script
Repository: spark Updated Branches: refs/heads/master e4561e1c5 -> a8e1c9815 [SPARK-25962][BUILD][PYTHON] Specify minimum versions for both pydocstyle and flake8 in 'lint-python' script ## What changes were proposed in this pull request? This PR explicitly specifies `flake8` and `pydocstyle` versions. - It checks flake8 binary executable - flake8 version check >= 3.5.0 - pydocstyle >= 3.0.0 (previously it was == 3.0.0) ## How was this patch tested? Manually tested. Closes #22963 from HyukjinKwon/SPARK-25962. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a8e1c981 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a8e1c981 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a8e1c981 Branch: refs/heads/master Commit: a8e1c9815fef0deb45c9a516d415cea6be511415 Parents: e4561e1 Author: hyukjinkwon Authored: Thu Nov 8 12:26:21 2018 +0800 Committer: hyukjinkwon Committed: Thu Nov 8 12:26:21 2018 +0800 -- dev/lint-python | 58 +--- 1 file changed, 41 insertions(+), 17 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a8e1c981/dev/lint-python -- diff --git a/dev/lint-python b/dev/lint-python index 2e353e1..27d87f6 100755 --- a/dev/lint-python +++ b/dev/lint-python @@ -26,9 +26,13 @@ PYCODESTYLE_REPORT_PATH="$SPARK_ROOT_DIR/dev/pycodestyle-report.txt" PYDOCSTYLE_REPORT_PATH="$SPARK_ROOT_DIR/dev/pydocstyle-report.txt" PYLINT_REPORT_PATH="$SPARK_ROOT_DIR/dev/pylint-report.txt" PYLINT_INSTALL_INFO="$SPARK_ROOT_DIR/dev/pylint-info.txt" + PYDOCSTYLEBUILD="pydocstyle" -EXPECTED_PYDOCSTYLEVERSION="3.0.0" -PYDOCSTYLEVERSION=$(python -c 'import pkg_resources; print(pkg_resources.get_distribution("pydocstyle").version)' 2> /dev/null) +MINIMUM_PYDOCSTYLEVERSION="3.0.0" + +FLAKE8BUILD="flake8" +MINIMUM_FLAKE8="3.5.0" + SPHINXBUILD=${SPHINXBUILD:=sphinx-build} SPHINX_REPORT_PATH="$SPARK_ROOT_DIR/dev/sphinx-report.txt" @@ -87,27 +91,47 @@ else 
rm "$PYCODESTYLE_REPORT_PATH" fi -# stop the build if there are Python syntax errors or undefined names -flake8 . --count --select=E901,E999,F821,F822,F823 --max-line-length=100 --show-source --statistics -flake8_status="${PIPESTATUS[0]}" +# Check by flake8 +if hash "$FLAKE8BUILD" 2> /dev/null; then +FLAKE8VERSION="$( $FLAKE8BUILD --version 2> /dev/null )" +VERSION=($FLAKE8VERSION) +IS_EXPECTED_FLAKE8=$(python -c 'from distutils.version import LooseVersion; \ +print(LooseVersion("""'${VERSION[0]}'""") >= LooseVersion("""'$MINIMUM_FLAKE8'"""))' 2> /dev/null) +if [[ "$IS_EXPECTED_FLAKE8" == "True" ]]; then +# stop the build if there are Python syntax errors or undefined names +$FLAKE8BUILD . --count --select=E901,E999,F821,F822,F823 --max-line-length=100 --show-source --statistics +flake8_status="${PIPESTATUS[0]}" + +if [ "$flake8_status" -eq 0 ]; then +lint_status=0 +else +lint_status=1 +fi -if [ "$flake8_status" -eq 0 ]; then -lint_status=0 +if [ "$lint_status" -ne 0 ]; then +echo "flake8 checks failed." +exit "$lint_status" +else +echo "flake8 checks passed." +fi +else +echo "The flake8 version needs to be "$MINIMUM_FLAKE8" at latest. Your current version is '"$FLAKE8VERSION"'." +echo "flake8 checks failed." +exit 1 +fi else -lint_status=1 -fi - -if [ "$lint_status" -ne 0 ]; then +echo >&2 "The flake8 command was not found." echo "flake8 checks failed." -exit "$lint_status" -else -echo "flake8 checks passed." +exit 1 fi # Check python document style, skip check if pydocstyle is not installed. 
if hash "$PYDOCSTYLEBUILD" 2> /dev/null; then -if [[ "$PYDOCSTYLEVERSION" == "$EXPECTED_PYDOCSTYLEVERSION" ]]; then -pydocstyle --config=dev/tox.ini $DOC_PATHS_TO_CHECK >> "$PYDOCSTYLE_REPORT_PATH" +PYDOCSTYLEVERSION="$( $PYDOCSTYLEBUILD --version 2> /dev/null )" +IS_EXPECTED_PYDOCSTYLEVERSION=$(python -c 'from distutils.version import LooseVersion; \ +print(LooseVersion("""'$PYDOCSTYLEVERSION'""") >= LooseVersion("""'$MINIMUM_PYDOCSTYLEVERSION'"""))') +if [[ "$IS_EXPECTED_PYDOCSTYLEVERSION" == "True" ]]; then +$PYDOCSTYLEBUILD --config=dev/tox.ini $DOC_PATHS_TO_CHECK >> "$PYDOCSTYLE_REPORT_PATH" pydocstyle_status="${PIPESTATUS[0]}" if [ "$compile_status" -eq 0 -a "$pydocstyle_status" -eq 0 ]; then @@ -121,7 +145,7 @@ if hash "$PYDOCSTYLEBUILD" 2> /dev/null; then fi else -echo "The pydocstyle version needs to be latest 3.0.0. Skipping pydoc checks for now" +
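Both checks above compare versions with `LooseVersion` rather than plain string comparison, because lexicographic comparison mishandles multi-digit components. A dependency-free sketch of why that matters:

```python
def version_tuple(v):
    """Split "3.5.0" into (3, 5, 0) so comparison is numeric per component."""
    return tuple(int(part) for part in v.split("."))

def meets_minimum(current, minimum):
    return version_tuple(current) >= version_tuple(minimum)

# String comparison is the trap the script avoids: "3.10.0" sorts before
# "3.5.0" lexicographically, although 3.10.0 is the newer release.
string_says_ok = "3.10.0" >= "3.5.0"               # wrong answer
tuple_says_ok = meets_minimum("3.10.0", "3.5.0")   # right answer
```

This only handles purely numeric versions; `LooseVersion` additionally tolerates suffixes like `3.5.0b1`, which is why the script shells the comparison out to Python instead of comparing in bash.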
spark git commit: [SPARK-25955][TEST] Porting JSON tests for CSV functions
Repository: spark Updated Branches: refs/heads/master 17449a2e6 -> ee03f760b [SPARK-25955][TEST] Porting JSON tests for CSV functions ## What changes were proposed in this pull request? In the PR, I propose to port existing JSON tests from `JsonFunctionsSuite` that are applicable for CSV, and put them to `CsvFunctionsSuite`. In particular: - roundtrip `from_csv` to `to_csv`, and `to_csv` to `from_csv` - using `schema_of_csv` in `from_csv` - Java API `from_csv` - using `from_csv` and `to_csv` in exprs. Closes #22960 from MaxGekk/csv-additional-tests. Authored-by: Maxim Gekk Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ee03f760 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ee03f760 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ee03f760 Branch: refs/heads/master Commit: ee03f760b305e70a57c3b4409ec25897af348600 Parents: 17449a2 Author: Maxim Gekk Authored: Thu Nov 8 14:51:29 2018 +0800 Committer: hyukjinkwon Committed: Thu Nov 8 14:51:29 2018 +0800 -- .../apache/spark/sql/CsvFunctionsSuite.scala| 47 1 file changed, 47 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ee03f760/sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala index 1dd8ec3..b97ac38 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala @@ -117,4 +117,51 @@ class CsvFunctionsSuite extends QueryTest with SharedSQLContext { "Acceptable modes are PERMISSIVE and FAILFAST.")) } } + + test("from_csv uses DDL strings for defining a schema - java") { +val df = Seq("""1,"haa"""").toDS() +checkAnswer( + df.select( +from_csv($"value", lit("a INT, b STRING"), new java.util.HashMap[String, String]())), + 
Row(Row(1, "haa")) :: Nil) + } + + test("roundtrip to_csv -> from_csv") { +val df = Seq(Tuple1(Tuple1(1)), Tuple1(null)).toDF("struct") +val schema = df.schema(0).dataType.asInstanceOf[StructType] +val options = Map.empty[String, String] +val readback = df.select(to_csv($"struct").as("csv")) + .select(from_csv($"csv", schema, options).as("struct")) + +checkAnswer(df, readback) + } + + test("roundtrip from_csv -> to_csv") { +val df = Seq(Some("1"), None).toDF("csv") +val schema = new StructType().add("a", IntegerType) +val options = Map.empty[String, String] +val readback = df.select(from_csv($"csv", schema, options).as("struct")) + .select(to_csv($"struct").as("csv")) + +checkAnswer(df, readback) + } + + test("infers schemas of a CSV string and pass it to from_csv") { +val in = Seq("""0.123456789,987654321,"San Francisco"""").toDS() +val options = Map.empty[String, String].asJava +val out = in.select(from_csv('value, schema_of_csv("0.1,1,a"), options) as "parsed") +val expected = StructType(Seq(StructField( + "parsed", + StructType(Seq( +StructField("_c0", DoubleType, true), +StructField("_c1", IntegerType, true), +StructField("_c2", StringType, true)))))) + +assert(out.schema == expected) + } + + test("Support to_csv in SQL") { +val df1 = Seq(Tuple1(Tuple1(1))).toDF("a") +checkAnswer(df1.selectExpr("to_csv(a)"), Row("1") :: Nil) + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
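The roundtrip tests above assert that `to_csv` followed by `from_csv` (and vice versa) is the identity on the tested values. The same property can be sketched with Python's standard `csv` module (this mimics the shape of the test, not Spark's CSV dialect or API):

```python
import csv
import io

def to_csv(row):
    """Serialize one row to a CSV line (no trailing newline)."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="").writerow(row)
    return buf.getvalue()

def from_csv(line, types):
    """Parse one CSV line back into typed values, one converter per column."""
    values = next(csv.reader(io.StringIO(line)))
    return [t(v) for t, v in zip(types, values)]

# Roundtrip: from_csv(to_csv(x)) == x for these values.
original = [1, "San Francisco", 0.5]
roundtrip = from_csv(to_csv(original), [int, str, float])
```

The `types` list plays the role of the explicit `StructType` schema in the Scala tests: without it, every parsed value would come back as a string.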
spark git commit: [SPARK-25952][SQL] Passing actual schema to JacksonParser
Repository: spark Updated Branches: refs/heads/master d68f3a726 -> 17449a2e6 [SPARK-25952][SQL] Passing actual schema to JacksonParser ## What changes were proposed in this pull request? The PR fixes an issue when the corrupt record column specified via `spark.sql.columnNameOfCorruptRecord` or JSON options `columnNameOfCorruptRecord` is propagated to JacksonParser, and returned row breaks an assumption in `FailureSafeParser` that the row must contain only actual data. The issue is fixed by passing actual schema without the corrupt record field into `JacksonParser`. ## How was this patch tested? Added a test with the corrupt record column in the middle of user's schema. Closes #22958 from MaxGekk/from_json-corrupt-record-schema. Authored-by: Maxim Gekk Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/17449a2e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/17449a2e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/17449a2e Branch: refs/heads/master Commit: 17449a2e6b28ecce7a273284eab037e8aceb3611 Parents: d68f3a7 Author: Maxim Gekk Authored: Thu Nov 8 14:48:23 2018 +0800 Committer: hyukjinkwon Committed: Thu Nov 8 14:48:23 2018 +0800 -- .../sql/catalyst/expressions/jsonExpressions.scala| 14 -- .../org/apache/spark/sql/JsonFunctionsSuite.scala | 13 + 2 files changed, 21 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/17449a2e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala index eafcb61..52d0677 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala +++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala @@ -569,14 +569,16 @@ case class JsonToStructs( throw new IllegalArgumentException(s"from_json() doesn't support the ${mode.name} mode. " + s"Acceptable modes are ${PermissiveMode.name} and ${FailFastMode.name}.") } -val rawParser = new JacksonParser(nullableSchema, parsedOptions, allowArrayAsStructs = false) -val createParser = CreateJacksonParser.utf8String _ - -val parserSchema = nullableSchema match { - case s: StructType => s - case other => StructType(StructField("value", other) :: Nil) +val (parserSchema, actualSchema) = nullableSchema match { + case s: StructType => +(s, StructType(s.filterNot(_.name == parsedOptions.columnNameOfCorruptRecord))) + case other => +(StructType(StructField("value", other) :: Nil), other) } +val rawParser = new JacksonParser(actualSchema, parsedOptions, allowArrayAsStructs = false) +val createParser = CreateJacksonParser.utf8String _ + new FailureSafeParser[UTF8String]( input => rawParser.parse(input, createParser, identity[UTF8String]), mode, http://git-wip-us.apache.org/repos/asf/spark/blob/17449a2e/sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala index 2b09782..d6b7338 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala @@ -578,4 +578,17 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext { "Acceptable modes are PERMISSIVE and FAILFAST.")) } } + + test("corrupt record column in the middle") { +val schema = new StructType() + .add("a", IntegerType) + .add("_unparsed", StringType) + .add("b", IntegerType) +val badRec = """{"a" 1, "b": 11}""" +val df = Seq(badRec, """{"a": 2, "b": 12}""").toDS() + +checkAnswer( + 
df.select(from_json($"value", schema, Map("columnNameOfCorruptRecord" -> "_unparsed"))), + Row(Row(null, badRec, null)) :: Row(Row(2, null, 12)) :: Nil) + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
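The fix can be paraphrased as: the parser only ever sees the schema with the corrupt-record field filtered out (`actualSchema`), and the raw text is spliced back into the corrupt column's position afterwards, which is exactly what the "corrupt record column in the middle" test checks. A hedged Python sketch of that splicing (illustrative semantics only, not the JacksonParser API):

```python
import json

def from_json(text, schema, corrupt_col="_unparsed"):
    """Toy model of the JsonToStructs fix: parse against the schema minus the
    corrupt-record column, then splice the raw text back into its position."""
    actual = [f for f in schema if f != corrupt_col]  # what the parser sees
    try:
        obj = json.loads(text)
        parsed = {f: obj.get(f) for f in actual}
        raw = None
    except json.JSONDecodeError:
        parsed = {f: None for f in actual}
        raw = text
    # Reassemble in the user's column order, even with the corrupt column
    # in the middle of the schema.
    return tuple(raw if f == corrupt_col else parsed[f] for f in schema)

schema = ["a", "_unparsed", "b"]   # corrupt column deliberately in the middle
bad = '{"a" 1, "b": 11}'
good = '{"a": 2, "b": 12}'
```

The same two records as in the Scala test produce `(null, raw, null)` for the malformed input and `(2, null, 12)` for the well-formed one.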
spark git commit: Revert "[SPARK-23831][SQL] Add org.apache.derby to IsolatedClientLoader"
Repository: spark Updated Branches: refs/heads/master ee03f760b -> 0a2e45fdb Revert "[SPARK-23831][SQL] Add org.apache.derby to IsolatedClientLoader" This reverts commit a75571b46f813005a6d4b076ec39081ffab11844. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0a2e45fd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0a2e45fd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0a2e45fd Branch: refs/heads/master Commit: 0a2e45fdb8baadf7a57eb06f319e96f95eedf298 Parents: ee03f76 Author: hyukjinkwon Authored: Thu Nov 8 16:32:25 2018 +0800 Committer: hyukjinkwon Committed: Thu Nov 8 16:32:25 2018 +0800 -- .../apache/spark/sql/hive/client/IsolatedClientLoader.scala| 1 - .../org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala | 6 -- 2 files changed, 7 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0a2e45fd/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala index 1e7a0b1..c1d8fe5 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala @@ -186,7 +186,6 @@ private[hive] class IsolatedClientLoader( name.startsWith("org.slf4j") || name.startsWith("org.apache.log4j") || // log4j1.x name.startsWith("org.apache.logging.log4j") || // log4j2 -name.startsWith("org.apache.derby.") || name.startsWith("org.apache.spark.") || (sharesHadoopClasses && isHadoopClass) || name.startsWith("scala.") || http://git-wip-us.apache.org/repos/asf/spark/blob/0a2e45fd/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala index 1de258f..0a522b6 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala @@ -113,10 +113,4 @@ class HiveExternalCatalogSuite extends ExternalCatalogSuite { catalog.createDatabase(newDb("dbWithNullDesc").copy(description = null), ignoreIfExists = false) assert(catalog.getDatabase("dbWithNullDesc").description == "") } - - test("SPARK-23831: Add org.apache.derby to IsolatedClientLoader") { -val client1 = HiveUtils.newClientForMetadata(new SparkConf, new Configuration) -val client2 = HiveUtils.newClientForMetadata(new SparkConf, new Configuration) -assert(!client1.equals(client2)) - } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: Revert "[SPARK-23831][SQL] Add org.apache.derby to IsolatedClientLoader"
Repository: spark Updated Branches: refs/heads/branch-2.4 4c91b224a -> 947462f5a Revert "[SPARK-23831][SQL] Add org.apache.derby to IsolatedClientLoader" This reverts commit a75571b46f813005a6d4b076ec39081ffab11844. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/947462f5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/947462f5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/947462f5 Branch: refs/heads/branch-2.4 Commit: 947462f5a36e2751f5a9160c676efbd4e5b08eb4 Parents: 4c91b22 Author: hyukjinkwon Authored: Thu Nov 8 16:32:25 2018 +0800 Committer: hyukjinkwon Committed: Thu Nov 8 16:35:41 2018 +0800 -- .../apache/spark/sql/hive/client/IsolatedClientLoader.scala| 1 - .../org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala | 6 -- 2 files changed, 7 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/947462f5/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala index 6a90c44..2f34f69 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala @@ -182,7 +182,6 @@ private[hive] class IsolatedClientLoader( name.startsWith("org.slf4j") || name.startsWith("org.apache.log4j") || // log4j1.x name.startsWith("org.apache.logging.log4j") || // log4j2 -name.startsWith("org.apache.derby.") || name.startsWith("org.apache.spark.") || (sharesHadoopClasses && isHadoopClass) || name.startsWith("scala.") || http://git-wip-us.apache.org/repos/asf/spark/blob/947462f5/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala index 1de258f..0a522b6 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala @@ -113,10 +113,4 @@ class HiveExternalCatalogSuite extends ExternalCatalogSuite { catalog.createDatabase(newDb("dbWithNullDesc").copy(description = null), ignoreIfExists = false) assert(catalog.getDatabase("dbWithNullDesc").description == "") } - - test("SPARK-23831: Add org.apache.derby to IsolatedClientLoader") { -val client1 = HiveUtils.newClientForMetadata(new SparkConf, new Configuration) -val client2 = HiveUtils.newClientForMetadata(new SparkConf, new Configuration) -assert(!client1.equals(client2)) - } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25510][SQL][TEST][FOLLOW-UP] Remove BenchmarkWithCodegen
Repository: spark Updated Branches: refs/heads/master 79551f558 -> 0558d021c [SPARK-25510][SQL][TEST][FOLLOW-UP] Remove BenchmarkWithCodegen ## What changes were proposed in this pull request? Remove `BenchmarkWithCodegen` as we don't use it anymore. More details: https://github.com/apache/spark/pull/22484#discussion_r221397904 ## How was this patch tested? N/A Closes #22985 from wangyum/SPARK-25510. Authored-by: Yuming Wang Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0558d021 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0558d021 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0558d021 Branch: refs/heads/master Commit: 0558d021cc0aeae37ef0e043d244fd0300a57cd5 Parents: 79551f5 Author: Yuming Wang Authored: Fri Nov 9 11:45:03 2018 +0800 Committer: hyukjinkwon Committed: Fri Nov 9 11:45:03 2018 +0800 -- .../benchmark/BenchmarkWithCodegen.scala| 54 1 file changed, 54 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0558d021/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BenchmarkWithCodegen.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BenchmarkWithCodegen.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BenchmarkWithCodegen.scala deleted file mode 100644 index 5133150..000 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BenchmarkWithCodegen.scala +++ /dev/null @@ -1,54 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - *http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.spark.sql.execution.benchmark - -import org.apache.spark.SparkFunSuite -import org.apache.spark.benchmark.Benchmark -import org.apache.spark.sql.SparkSession - -/** - * Common base trait for micro benchmarks that are supposed to run standalone (i.e. not together - * with other test suites). - */ -private[benchmark] trait BenchmarkWithCodegen extends SparkFunSuite { - - lazy val sparkSession = SparkSession.builder -.master("local[1]") -.appName("microbenchmark") -.config("spark.sql.shuffle.partitions", 1) -.config("spark.sql.autoBroadcastJoinThreshold", 1) -.getOrCreate() - - /** Runs function `f` with whole stage codegen on and off. */ - def runBenchmark(name: String, cardinality: Long)(f: => Unit): Unit = { -val benchmark = new Benchmark(name, cardinality) - -benchmark.addCase(s"$name wholestage off", numIters = 2) { iter => - sparkSession.conf.set("spark.sql.codegen.wholeStage", value = false) - f -} - -benchmark.addCase(s"$name wholestage on", numIters = 5) { iter => - sparkSession.conf.set("spark.sql.codegen.wholeStage", value = true) - f -} - -benchmark.run() - } - -} - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
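The removed trait's `runBenchmark` wrapped the same body in two cases, one with whole-stage codegen off (2 iterations) and one with it on (5 iterations). The general pattern, timing one function under two configurations, can be sketched in Python with `timeit` (the `config` flag here is purely illustrative and stands in for the `spark.sql.codegen.wholeStage` setting):

```python
import timeit

# Illustrative stand-in for a session config such as spark.sql.codegen.wholeStage.
config = {"wholeStage": False}

def run_benchmark(name, f, num_iters_off=2, num_iters_on=5):
    """Time `f` with the feature flag off, then on, mirroring the removed
    trait's "wholestage off" / "wholestage on" cases."""
    results = {}
    config["wholeStage"] = False
    results[f"{name} wholestage off"] = min(
        timeit.repeat(f, number=1, repeat=num_iters_off))
    config["wholeStage"] = True
    results[f"{name} wholestage on"] = min(
        timeit.repeat(f, number=1, repeat=num_iters_on))
    return results

results = run_benchmark("sum", lambda: sum(range(10000)))
```

Taking the minimum over several repeats (rather than the mean) is the usual way to reduce noise from other processes when micro-benchmarking.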
spark git commit: [SPARK-25945][SQL] Support locale while parsing date/timestamp from CSV/JSON
Repository: spark Updated Branches: refs/heads/master 973f7c01d -> 79551f558 [SPARK-25945][SQL] Support locale while parsing date/timestamp from CSV/JSON ## What changes were proposed in this pull request? In the PR, I propose to add a new option `locale` into CSVOptions/JSONOptions to make parsing dates/timestamps written in local languages possible. Currently the locale is hard coded to `Locale.US`. ## How was this patch tested? Added two tests for parsing a date from CSV/JSON - `ноя 2018`. Closes #22951 from MaxGekk/locale. Authored-by: Maxim Gekk Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/79551f55 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/79551f55 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/79551f55 Branch: refs/heads/master Commit: 79551f558dafed41177b605b0436e9340edf5712 Parents: 973f7c0 Author: Maxim Gekk Authored: Fri Nov 9 09:45:06 2018 +0800 Committer: hyukjinkwon Committed: Fri Nov 9 09:45:06 2018 +0800 -- python/pyspark/sql/readwriter.py | 15 +++ python/pyspark/sql/streaming.py | 14 ++ .../spark/sql/catalyst/csv/CSVOptions.scala | 7 +-- .../spark/sql/catalyst/json/JSONOptions.scala| 7 +-- .../expressions/CsvExpressionsSuite.scala| 19 ++- .../expressions/JsonExpressionsSuite.scala | 19 ++- .../org/apache/spark/sql/DataFrameReader.scala | 4 .../spark/sql/streaming/DataStreamReader.scala | 4 .../org/apache/spark/sql/CsvFunctionsSuite.scala | 17 + .../apache/spark/sql/JsonFunctionsSuite.scala| 17 + 10 files changed, 109 insertions(+), 14 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/79551f55/python/pyspark/sql/readwriter.py -- diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py index 690b130..726de4a 100644 --- a/python/pyspark/sql/readwriter.py +++ b/python/pyspark/sql/readwriter.py @@ -177,7 +177,7 @@ class DataFrameReader(OptionUtils): allowNumericLeadingZero=None, 
allowBackslashEscapingAnyCharacter=None, mode=None, columnNameOfCorruptRecord=None, dateFormat=None, timestampFormat=None, multiLine=None, allowUnquotedControlChars=None, lineSep=None, samplingRatio=None, - dropFieldIfAllNull=None, encoding=None): + dropFieldIfAllNull=None, encoding=None, locale=None): """ Loads JSON files and returns the results as a :class:`DataFrame`. @@ -249,6 +249,9 @@ class DataFrameReader(OptionUtils): :param dropFieldIfAllNull: whether to ignore column of all null values or empty array/struct during schema inference. If None is set, it uses the default value, ``false``. +:param locale: sets a locale as language tag in IETF BCP 47 format. If None is set, + it uses the default value, ``en-US``. For instance, ``locale`` is used while + parsing dates and timestamps. >>> df1 = spark.read.json('python/test_support/sql/people.json') >>> df1.dtypes @@ -267,7 +270,8 @@ class DataFrameReader(OptionUtils): mode=mode, columnNameOfCorruptRecord=columnNameOfCorruptRecord, dateFormat=dateFormat, timestampFormat=timestampFormat, multiLine=multiLine, allowUnquotedControlChars=allowUnquotedControlChars, lineSep=lineSep, -samplingRatio=samplingRatio, dropFieldIfAllNull=dropFieldIfAllNull, encoding=encoding) +samplingRatio=samplingRatio, dropFieldIfAllNull=dropFieldIfAllNull, encoding=encoding, +locale=locale) if isinstance(path, basestring): path = [path] if type(path) == list: @@ -349,7 +353,7 @@ class DataFrameReader(OptionUtils): negativeInf=None, dateFormat=None, timestampFormat=None, maxColumns=None, maxCharsPerColumn=None, maxMalformedLogPerPartition=None, mode=None, columnNameOfCorruptRecord=None, multiLine=None, charToEscapeQuoteEscaping=None, -samplingRatio=None, enforceSchema=None, emptyValue=None): +samplingRatio=None, enforceSchema=None, emptyValue=None, locale=None): r"""Loads a CSV file and returns the result as a :class:`DataFrame`. 
This function will go through the input once to determine the input schema if @@ -446,6 +450,9 @@ class DataFrameReader(OptionUtils): If None is set, it uses the default value, ``1.0``. :param emptyValue: sets the string representation of
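The effect of the new `locale` option can be illustrated outside Spark with a small sketch. This is a toy model, not Spark's implementation (Spark resolves the BCP 47 tag to a Java locale and lets the Java date formatter supply the month names); the abbreviation tables below are hand-written stand-ins, not real locale data.

```python
# Toy illustration of locale-sensitive date parsing. Assumption: the real code
# delegates to Java formatting with Locale.forLanguageTag; these tables only
# stand in for the locale's abbreviated month names.
MONTH_ABBREVIATIONS = {
    "en-US": {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
              "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12},
    "ru-RU": {"янв": 1, "фев": 2, "мар": 3, "апр": 4, "май": 5, "июн": 6,
              "июл": 7, "авг": 8, "сен": 9, "окт": 10, "ноя": 11, "дек": 12},
}

def parse_month_year(text, locale="en-US"):
    """Parse an '<abbreviated month> <year>' string under the given locale tag."""
    month, year = text.split()
    return int(year), MONTH_ABBREVIATIONS[locale][month]
```

With `locale="ru-RU"` the Russian date `ноя 2018` from the new tests parses to November 2018, while the hard-coded `en-US` default would reject it.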
spark git commit: [INFRA] Close stale PRs
Repository: spark Updated Branches: refs/heads/master 6cd23482d -> a3ba3a899 [INFRA] Close stale PRs Closes https://github.com/apache/spark/pull/21766 Closes https://github.com/apache/spark/pull/21679 Closes https://github.com/apache/spark/pull/21161 Closes https://github.com/apache/spark/pull/20846 Closes https://github.com/apache/spark/pull/19434 Closes https://github.com/apache/spark/pull/18080 Closes https://github.com/apache/spark/pull/17648 Closes https://github.com/apache/spark/pull/17169 Add: Closes #22813 Closes #21994 Closes #22005 Closes #22463 Add: Closes #15899 Add: Closes #22539 Closes #21868 Closes #21514 Closes #21402 Closes #21322 Closes #21257 Closes #20163 Closes #19691 Closes #18697 Closes #18636 Closes #17176 Closes #23001 from wangyum/CloseStalePRs. Authored-by: Yuming Wang Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a3ba3a89 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a3ba3a89 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a3ba3a89 Branch: refs/heads/master Commit: a3ba3a899b3b43958820dc82fcdd3a8b28653bcb Parents: 6cd2348 Author: Yuming Wang Authored: Sun Nov 11 14:05:19 2018 +0800 Committer: hyukjinkwon Committed: Sun Nov 11 14:05:19 2018 +0800 -- -- - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25972][PYTHON] Missed JSON options in streaming.py
Repository: spark Updated Branches: refs/heads/master a3ba3a899 -> aec0af4a9 [SPARK-25972][PYTHON] Missed JSON options in streaming.py ## What changes were proposed in this pull request? Added JSON options for `json()` in streaming.py that are presented in the similar method in readwriter.py. In particular, missed options are `dropFieldIfAllNull` and `encoding`. Closes #22973 from MaxGekk/streaming-missed-options. Authored-by: Maxim Gekk Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/aec0af4a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/aec0af4a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/aec0af4a Branch: refs/heads/master Commit: aec0af4a952df2957e21d39d1e0546a36ab7ab86 Parents: a3ba3a8 Author: Maxim Gekk Authored: Sun Nov 11 21:01:29 2018 +0800 Committer: hyukjinkwon Committed: Sun Nov 11 21:01:29 2018 +0800 -- python/pyspark/sql/streaming.py | 13 +++-- 1 file changed, 11 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/aec0af4a/python/pyspark/sql/streaming.py -- diff --git a/python/pyspark/sql/streaming.py b/python/pyspark/sql/streaming.py index 02b14ea..58ca7b8 100644 --- a/python/pyspark/sql/streaming.py +++ b/python/pyspark/sql/streaming.py @@ -404,7 +404,8 @@ class DataStreamReader(OptionUtils): allowComments=None, allowUnquotedFieldNames=None, allowSingleQuotes=None, allowNumericLeadingZero=None, allowBackslashEscapingAnyCharacter=None, mode=None, columnNameOfCorruptRecord=None, dateFormat=None, timestampFormat=None, - multiLine=None, allowUnquotedControlChars=None, lineSep=None, locale=None): + multiLine=None, allowUnquotedControlChars=None, lineSep=None, locale=None, + dropFieldIfAllNull=None, encoding=None): """ Loads a JSON file stream and returns the results as a :class:`DataFrame`. 
@@ -472,6 +473,13 @@ class DataStreamReader(OptionUtils): :param locale: sets a locale as language tag in IETF BCP 47 format. If None is set, it uses the default value, ``en-US``. For instance, ``locale`` is used while parsing dates and timestamps. +:param dropFieldIfAllNull: whether to ignore column of all null values or empty + array/struct during schema inference. If None is set, it + uses the default value, ``false``. +:param encoding: allows to forcibly set one of standard basic or extended encoding for + the JSON files. For example UTF-16BE, UTF-32LE. If None is set, + the encoding of input JSON will be detected automatically + when the multiLine option is set to ``true``. >>> json_sdf = spark.readStream.json(tempfile.mkdtemp(), schema = sdf_schema) >>> json_sdf.isStreaming @@ -486,7 +494,8 @@ class DataStreamReader(OptionUtils): allowBackslashEscapingAnyCharacter=allowBackslashEscapingAnyCharacter, mode=mode, columnNameOfCorruptRecord=columnNameOfCorruptRecord, dateFormat=dateFormat, timestampFormat=timestampFormat, multiLine=multiLine, -allowUnquotedControlChars=allowUnquotedControlChars, lineSep=lineSep, locale=locale) +allowUnquotedControlChars=allowUnquotedControlChars, lineSep=lineSep, locale=locale, +dropFieldIfAllNull=dropFieldIfAllNull, encoding=encoding) if isinstance(path, basestring): return self._df(self._jreader.json(path)) else: - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
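The Python reader methods above funnel every keyword argument through a helper that skips `None` values, which is why a newly exposed option such as `dropFieldIfAllNull` can default to `None` and still keep the JVM-side default. A stand-alone approximation of that pattern (in pyspark this role is played by `OptionUtils._set_opts`; the function below is a simplified sketch, not the actual helper):

```python
def set_opts(options, **kwargs):
    # Only explicitly supplied options are forwarded to the underlying reader;
    # None means "keep the default on the JVM side" (e.g. locale -> en-US).
    for key, value in kwargs.items():
        if value is not None:
            options[key] = value
    return options

# Options left as None are simply not set.
opts = set_opts({}, locale=None, dropFieldIfAllNull=True, encoding="UTF-16BE")
```

This is what keeps `readwriter.py` and `streaming.py` behaviorally in sync once both forward the same keyword list.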
spark git commit: [SPARK-26007][SQL] DataFrameReader.csv() respects to spark.sql.columnNameOfCorruptRecord
Repository: spark Updated Branches: refs/heads/master 88c826272 -> c49193437 [SPARK-26007][SQL] DataFrameReader.csv() respects to spark.sql.columnNameOfCorruptRecord ## What changes were proposed in this pull request? Passing current value of SQL config `spark.sql.columnNameOfCorruptRecord` to `CSVOptions` inside of `DataFrameReader`.`csv()`. ## How was this patch tested? Added a test where default value of `spark.sql.columnNameOfCorruptRecord` is changed. Closes #23006 from MaxGekk/csv-corrupt-sql-config. Lead-authored-by: Maxim Gekk Co-authored-by: Dongjoon Hyun Co-authored-by: Maxim Gekk Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c4919343 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c4919343 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c4919343 Branch: refs/heads/master Commit: c49193437745f072767d26e6b9099f4949cabf95 Parents: 88c8262 Author: Maxim Gekk Authored: Tue Nov 13 12:26:19 2018 +0800 Committer: hyukjinkwon Committed: Tue Nov 13 12:26:19 2018 +0800 -- .../apache/spark/sql/catalyst/csv/CSVOptions.scala| 14 +- .../sql/execution/datasources/csv/CSVSuite.scala | 11 +++ 2 files changed, 24 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c4919343/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala index 6428235..6bb50b4 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala @@ -25,6 +25,7 @@ import org.apache.commons.lang3.time.FastDateFormat import org.apache.spark.internal.Logging import org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.internal.SQLConf class 
CSVOptions( @transient val parameters: CaseInsensitiveMap[String], @@ -36,8 +37,19 @@ class CSVOptions( def this( parameters: Map[String, String], columnPruning: Boolean, +defaultTimeZoneId: String) = { +this( + CaseInsensitiveMap(parameters), + columnPruning, + defaultTimeZoneId, + SQLConf.get.columnNameOfCorruptRecord) + } + + def this( +parameters: Map[String, String], +columnPruning: Boolean, defaultTimeZoneId: String, -defaultColumnNameOfCorruptRecord: String = "") = { +defaultColumnNameOfCorruptRecord: String) = { this( CaseInsensitiveMap(parameters), columnPruning, http://git-wip-us.apache.org/repos/asf/spark/blob/c4919343/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala index d43efc8..2efe1dd 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala @@ -1848,4 +1848,15 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te val schema = new StructType().add("a", StringType).add("b", IntegerType) checkAnswer(spark.read.schema(schema).option("delimiter", delimiter).csv(input), Row("abc", 1)) } + + test("using spark.sql.columnNameOfCorruptRecord") { +withSQLConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD.key -> "_unparsed") { + val csv = "\"" + val df = spark.read +.schema("a int, _unparsed string") +.csv(Seq(csv).toDS()) + + checkAnswer(df, Row(null, csv)) +} + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
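The behavior the new test exercises can be mimicked in plain Python: in permissive parsing, a malformed record yields nulls for the data columns and the raw text in whichever column name the session config designates. This is a toy sketch, not Spark's CSV parser; columns are simplified to integers and "malformed" to an unbalanced quote:

```python
def parse_csv_row(line, columns, corrupt_col="_corrupt_record"):
    """Permissive parse: on failure, null out data columns, keep the raw text."""
    row = {c: None for c in columns}
    row[corrupt_col] = None
    try:
        if line.count('"') % 2 != 0:  # unbalanced quote => malformed record
            raise ValueError("unclosed quoted field")
        for c, v in zip(columns, line.split(",")):
            row[c] = int(v)
    except ValueError:
        row = {c: None for c in columns}
        row[corrupt_col] = line       # the raw record lands here
    return row
```

With the corrupt column renamed to `_unparsed`, the lone `"` input produces a row of `(None, '"')`, matching the `checkAnswer` in the test.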
[2/2] spark git commit: [SPARK-26035][PYTHON] Break large streaming/tests.py files into smaller files
[SPARK-26035][PYTHON] Break large streaming/tests.py files into smaller files ## What changes were proposed in this pull request? This PR continues to break down a large file into smaller files. See https://github.com/apache/spark/pull/23021. It targets to follow the layout of https://github.com/numpy/numpy/tree/master/numpy. Basically this PR proposes to break down `pyspark/streaming/tests.py` into ...:
```
pyspark
├── __init__.py
...
├── streaming
│   ├── __init__.py
...
│   └── tests
│       ├── __init__.py
│       ├── test_context.py
│       ├── test_dstream.py
│       ├── test_kinesis.py
│       └── test_listener.py
...
├── testing
...
│   └── streamingutils.py
...
```
## How was this patch tested? Existing tests should cover. `cd python` and `./run-tests-with-coverage`. Manually checked they are actually being run. Each test can (unofficially) be run via: ```bash SPARK_TESTING=1 ./bin/pyspark pyspark.tests.test_context ``` Note that if you're using Mac and Python 3, you might have to set `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`. Closes #23034 from HyukjinKwon/SPARK-26035.
Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3649fe59 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3649fe59 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3649fe59 Branch: refs/heads/master Commit: 3649fe599f1aa27fea0abd61c18d3ffa275d267b Parents: 9a5fda6 Author: hyukjinkwon Authored: Fri Nov 16 07:58:09 2018 +0800 Committer: hyukjinkwon Committed: Fri Nov 16 07:58:09 2018 +0800 -- dev/sparktestsupport/modules.py |7 +- python/pyspark/streaming/tests.py | 1185 -- python/pyspark/streaming/tests/__init__.py | 16 + python/pyspark/streaming/tests/test_context.py | 184 +++ python/pyspark/streaming/tests/test_dstream.py | 640 ++ python/pyspark/streaming/tests/test_kinesis.py | 89 ++ python/pyspark/streaming/tests/test_listener.py | 158 +++ python/pyspark/testing/streamingutils.py| 190 +++ 8 files changed, 1283 insertions(+), 1186 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3649fe59/dev/sparktestsupport/modules.py -- diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py index d5fcc06..58b48f4 100644 --- a/dev/sparktestsupport/modules.py +++ b/dev/sparktestsupport/modules.py @@ -398,8 +398,13 @@ pyspark_streaming = Module( "python/pyspark/streaming" ], python_test_goals=[ +# doctests "pyspark.streaming.util", -"pyspark.streaming.tests", +# unittests +"pyspark.streaming.tests.test_context", +"pyspark.streaming.tests.test_dstream", +"pyspark.streaming.tests.test_kinesis", +"pyspark.streaming.tests.test_listener", ] ) http://git-wip-us.apache.org/repos/asf/spark/blob/3649fe59/python/pyspark/streaming/tests.py -- diff --git a/python/pyspark/streaming/tests.py b/python/pyspark/streaming/tests.py deleted file mode 100644 index 8df00bc..000 --- a/python/pyspark/streaming/tests.py +++ /dev/null @@ -1,1185 +0,0 @@ -# -# Licensed to the Apache Software Foundation (ASF) under one or more -# 
contributor license agreements. See the NOTICE file distributed with -# this work for additional information regarding copyright ownership. -# The ASF licenses this file to You under the Apache License, Version 2.0 -# (the "License"); you may not use this file except in compliance with -# the License. You may obtain a copy of the License at -# -#http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# - -import glob -import os -import sys -from itertools import chain -import time -import operator -import tempfile -import random -import struct -import shutil -from functools import reduce - -try: -import xmlrunner -except ImportError: -xmlrunner = None - -if sys.version_info[:2] <= (2, 6): -try: -import unittest2 as unittest -except ImportError: -sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier') -sys.exit(1) -else: -import unittest - -if sys.version >= "3": -long = int - -from pyspark.context import
[1/2] spark git commit: [SPARK-26035][PYTHON] Break large streaming/tests.py files into smaller files
Repository: spark Updated Branches: refs/heads/master 9a5fda60e -> 3649fe599 http://git-wip-us.apache.org/repos/asf/spark/blob/3649fe59/python/pyspark/streaming/tests/test_listener.py -- diff --git a/python/pyspark/streaming/tests/test_listener.py b/python/pyspark/streaming/tests/test_listener.py new file mode 100644 index 000..7c874b6 --- /dev/null +++ b/python/pyspark/streaming/tests/test_listener.py @@ -0,0 +1,158 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# +from pyspark.streaming import StreamingListener +from pyspark.testing.streamingutils import PySparkStreamingTestCase + + +class StreamingListenerTests(PySparkStreamingTestCase): + +duration = .5 + +class BatchInfoCollector(StreamingListener): + +def __init__(self): +super(StreamingListener, self).__init__() +self.batchInfosCompleted = [] +self.batchInfosStarted = [] +self.batchInfosSubmitted = [] +self.streamingStartedTime = [] + +def onStreamingStarted(self, streamingStarted): +self.streamingStartedTime.append(streamingStarted.time) + +def onBatchSubmitted(self, batchSubmitted): +self.batchInfosSubmitted.append(batchSubmitted.batchInfo()) + +def onBatchStarted(self, batchStarted): +self.batchInfosStarted.append(batchStarted.batchInfo()) + +def onBatchCompleted(self, batchCompleted): +self.batchInfosCompleted.append(batchCompleted.batchInfo()) + +def test_batch_info_reports(self): +batch_collector = self.BatchInfoCollector() +self.ssc.addStreamingListener(batch_collector) +input = [[1], [2], [3], [4]] + +def func(dstream): +return dstream.map(int) +expected = [[1], [2], [3], [4]] +self._test_func(input, func, expected) + +batchInfosSubmitted = batch_collector.batchInfosSubmitted +batchInfosStarted = batch_collector.batchInfosStarted +batchInfosCompleted = batch_collector.batchInfosCompleted +streamingStartedTime = batch_collector.streamingStartedTime + +self.wait_for(batchInfosCompleted, 4) + +self.assertEqual(len(streamingStartedTime), 1) + +self.assertGreaterEqual(len(batchInfosSubmitted), 4) +for info in batchInfosSubmitted: +self.assertGreaterEqual(info.batchTime().milliseconds(), 0) +self.assertGreaterEqual(info.submissionTime(), 0) + +for streamId in info.streamIdToInputInfo(): +streamInputInfo = info.streamIdToInputInfo()[streamId] +self.assertGreaterEqual(streamInputInfo.inputStreamId(), 0) +self.assertGreaterEqual(streamInputInfo.numRecords, 0) +for key in streamInputInfo.metadata(): +self.assertIsNotNone(streamInputInfo.metadata()[key]) 
+self.assertIsNotNone(streamInputInfo.metadataDescription()) + +for outputOpId in info.outputOperationInfos(): +outputInfo = info.outputOperationInfos()[outputOpId] +self.assertGreaterEqual(outputInfo.batchTime().milliseconds(), 0) +self.assertGreaterEqual(outputInfo.id(), 0) +self.assertIsNotNone(outputInfo.name()) +self.assertIsNotNone(outputInfo.description()) +self.assertGreaterEqual(outputInfo.startTime(), -1) +self.assertGreaterEqual(outputInfo.endTime(), -1) +self.assertIsNone(outputInfo.failureReason()) + +self.assertEqual(info.schedulingDelay(), -1) +self.assertEqual(info.processingDelay(), -1) +self.assertEqual(info.totalDelay(), -1) +self.assertEqual(info.numRecords(), 0) + +self.assertGreaterEqual(len(batchInfosStarted), 4) +for info in batchInfosStarted: +self.assertGreaterEqual(info.batchTime().milliseconds(), 0) +self.assertGreaterEqual(info.submissionTime(), 0) + +for streamId in info.streamIdToInputInfo(): +streamInputInfo = info.streamIdToInputInfo()[streamId] +
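The listener tests above follow a plain observer pattern: a listener subclass accumulates every batch event it is notified about, and the assertions then run over the collected lists. Stripped of Spark, the pattern looks like this (a toy event bus; the real `StreamingListener` callbacks are driven by the JVM scheduler, not by Python code):

```python
class BatchCollector:
    """Observer that records each notification it receives, by kind."""
    def __init__(self):
        self.started, self.completed = [], []

    def on_batch_started(self, info):
        self.started.append(info)

    def on_batch_completed(self, info):
        self.completed.append(info)


class EventBus:
    """Minimal stand-in for the streaming context's listener bus."""
    def __init__(self):
        self.listeners = []

    def add_listener(self, listener):
        self.listeners.append(listener)

    def run_batch(self, batch_id):
        for listener in self.listeners:
            listener.on_batch_started(batch_id)
        for listener in self.listeners:
            listener.on_batch_completed(batch_id)


bus = EventBus()
collector = BatchCollector()
bus.add_listener(collector)
for i in range(4):  # mirrors the four input batches in the test
    bus.run_batch(i)
```

The test then asserts over `collector.started`/`collector.completed`, just as `test_batch_info_reports` does over `batchInfosStarted` and `batchInfosCompleted`.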
spark git commit: [SPARK-25883][BACKPORT][SQL][MINOR] Override method `prettyName` in `from_avro`/`to_avro`
Repository: spark Updated Branches: refs/heads/branch-2.4 96834fb77 -> 6148a77a5 [SPARK-25883][BACKPORT][SQL][MINOR] Override method `prettyName` in `from_avro`/`to_avro` Back port https://github.com/apache/spark/pull/22890 to branch-2.4. It is a bug fix for this issue: https://issues.apache.org/jira/browse/SPARK-26063 ## What changes were proposed in this pull request? Previously in from_avro/to_avro, we override the method `simpleString` and `sql` for the string output. However, the override only affects the alias naming: ``` Project [from_avro('col, ... , (mode,PERMISSIVE)) AS from_avro(col, struct, Map(mode -> PERMISSIVE))#11] ``` It only makes the alias name quite long: `from_avro(col, struct, Map(mode -> PERMISSIVE))`). We should follow `from_csv`/`from_json` here, to override the method prettyName only, and we will get a clean alias name ``` ... AS from_avro(col)#11 ``` ## How was this patch tested? Manual check Closes #23047 from gengliangwang/backport_avro_pretty_name. Authored-by: Gengliang Wang Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6148a77a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6148a77a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6148a77a Branch: refs/heads/branch-2.4 Commit: 6148a77a5da9ca33fb115269f1cba29cddfc652e Parents: 96834fb Author: Gengliang Wang Authored: Fri Nov 16 08:35:00 2018 +0800 Committer: hyukjinkwon Committed: Fri Nov 16 08:35:00 2018 +0800 -- .../scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala | 8 +--- .../scala/org/apache/spark/sql/avro/CatalystDataToAvro.scala | 8 +--- 2 files changed, 2 insertions(+), 14 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6148a77a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala -- diff --git a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala 
b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala index 915769f..8641b9f 100644 --- a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala +++ b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala @@ -51,13 +51,7 @@ case class AvroDataToCatalyst(child: Expression, jsonFormatSchema: String) deserializer.deserialize(result) } - override def simpleString: String = { -s"from_avro(${child.sql}, ${dataType.simpleString})" - } - - override def sql: String = { -s"from_avro(${child.sql}, ${dataType.catalogString})" - } + override def prettyName: String = "from_avro" override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { val expr = ctx.addReferenceObj("this", this) http://git-wip-us.apache.org/repos/asf/spark/blob/6148a77a/external/avro/src/main/scala/org/apache/spark/sql/avro/CatalystDataToAvro.scala -- diff --git a/external/avro/src/main/scala/org/apache/spark/sql/avro/CatalystDataToAvro.scala b/external/avro/src/main/scala/org/apache/spark/sql/avro/CatalystDataToAvro.scala index 141ff37..6ed330d 100644 --- a/external/avro/src/main/scala/org/apache/spark/sql/avro/CatalystDataToAvro.scala +++ b/external/avro/src/main/scala/org/apache/spark/sql/avro/CatalystDataToAvro.scala @@ -52,13 +52,7 @@ case class CatalystDataToAvro(child: Expression) extends UnaryExpression { out.toByteArray } - override def simpleString: String = { -s"to_avro(${child.sql}, ${child.dataType.simpleString})" - } - - override def sql: String = { -s"to_avro(${child.sql}, ${child.dataType.catalogString})" - } + override def prettyName: String = "to_avro" override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { val expr = ctx.addReferenceObj("this", this) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
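The effect of the fix is easy to model: when an expression only overrides a short `prettyName`, the generated alias stays compact, whereas baking the schema string into `simpleString`/`sql` produced aliases like `from_avro(col, struct, Map(mode -> PERMISSIVE))`. A toy model of the two naming strategies (this is not Catalyst's actual `Expression` API, just an illustration of the naming contract):

```python
class Expr:
    """Toy expression: holds the child's SQL text and its data-type string."""
    def __init__(self, child_sql, type_str):
        self.child_sql, self.type_str = child_sql, type_str

    def pretty_name(self):
        # Default short name; subclasses override it, as the patch does.
        return type(self).__name__.lower()

    def verbose_alias(self):
        # Old behavior: schema baked into the alias, making it very long.
        return "%s(%s, %s)" % (self.pretty_name(), self.child_sql, self.type_str)

    def alias(self):
        # New behavior: just the pretty name and the child.
        return "%s(%s)" % (self.pretty_name(), self.child_sql)


class FromAvro(Expr):
    def pretty_name(self):
        return "from_avro"
```

Only the alias naming changes; evaluation and codegen are untouched, which is why a manual check of the plan output was sufficient testing.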
spark git commit: [SPARK-25906][SHELL] Documents '-I' option (from Scala REPL) in spark-shell
Repository: spark Updated Branches: refs/heads/master 78fa1be29 -> cc38abc27 [SPARK-25906][SHELL] Documents '-I' option (from Scala REPL) in spark-shell ## What changes were proposed in this pull request? This PR targets to document the `-I` option from Spark 2.4.x (previously the `-i` option until Spark 2.3.x). After we upgraded Scala to 2.11.12, the `-i` option (`:load`) was replaced by `-I` (SI-7898). The existing `-i` became `:paste`, which does not respect Spark's implicit imports (for instance `toDF`, symbol as column, etc.). Therefore, the `-i` option does not work correctly from Spark 2.4.x, and it is not documented. I checked other Scala REPL options, but from quick tests they look either not applicable or not working. This PR only targets to document `-I` for now. ## How was this patch tested? Manually tested. **Mac:** ```bash $ ./bin/spark-shell --help Usage: ./bin/spark-shell [options] Scala REPL options: -I <file> preload <file>, enforcing line-by-line interpretation Options: --master MASTER_URL spark://host:port, mesos://host:port, yarn, k8s://https://host:port, or local (Default: local[*]). --deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client). ... ``` **Windows:** ```cmd C:\...\spark>.\bin\spark-shell --help Usage: .\bin\spark-shell.cmd [options] Scala REPL options: -I <file> preload <file>, enforcing line-by-line interpretation Options: --master MASTER_URL spark://host:port, mesos://host:port, yarn, k8s://https://host:port, or local (Default: local[*]). --deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client). ... ``` Closes #22919 from HyukjinKwon/SPARK-25906.
Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cc38abc2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cc38abc2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cc38abc2 Branch: refs/heads/master Commit: cc38abc27a671f345e3b4c170977a1976a02a0d0 Parents: 78fa1be Author: hyukjinkwon Authored: Tue Nov 6 10:39:58 2018 +0800 Committer: hyukjinkwon Committed: Tue Nov 6 10:39:58 2018 +0800 -- bin/spark-shell | 5 - bin/spark-shell2.cmd | 8 +++- 2 files changed, 11 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/cc38abc2/bin/spark-shell -- diff --git a/bin/spark-shell b/bin/spark-shell index 421f36c..e920137 100755 --- a/bin/spark-shell +++ b/bin/spark-shell @@ -32,7 +32,10 @@ if [ -z "${SPARK_HOME}" ]; then source "$(dirname "$0")"/find-spark-home fi -export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]" +export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options] + +Scala REPL options: + -Ipreload , enforcing line-by-line interpretation" # SPARK-4161: scala does not assume use of the java classpath, # so we need to add the "-Dscala.usejavacp=true" flag manually. We http://git-wip-us.apache.org/repos/asf/spark/blob/cc38abc2/bin/spark-shell2.cmd -- diff --git a/bin/spark-shell2.cmd b/bin/spark-shell2.cmd index aaf7190..549bf43 100644 --- a/bin/spark-shell2.cmd +++ b/bin/spark-shell2.cmd @@ -20,7 +20,13 @@ rem rem Figure out where the Spark framework is installed call "%~dp0find-spark-home.cmd" -set _SPARK_CMD_USAGE=Usage: .\bin\spark-shell.cmd [options] +set LF=^ + + +rem two empty lines are required +set _SPARK_CMD_USAGE=Usage: .\bin\spark-shell.cmd [options]^%LF%%LF%^%LF%%LF%^ +Scala REPL options:^%LF%%LF%^ + -I ^ preload ^, enforcing line-by-line interpretation rem SPARK-4161: scala does not assume use of the java classpath, rem so we need to add the "-Dscala.usejavacp=true" flag manually. 
We
spark git commit: [SPARK-25906][SHELL] Documents '-I' option (from Scala REPL) in spark-shell
Repository: spark Updated Branches: refs/heads/branch-2.4 8526f2ee5 -> f98c0ad02 [SPARK-25906][SHELL] Documents '-I' option (from Scala REPL) in spark-shell ## What changes were proposed in this pull request? This PR targets to document the `-I` option from Spark 2.4.x (previously the `-i` option until Spark 2.3.x). After we upgraded Scala to 2.11.12, the `-i` option (`:load`) was replaced by `-I` (SI-7898). The existing `-i` became `:paste`, which does not respect Spark's implicit imports (for instance `toDF`, symbol as column, etc.). Therefore, the `-i` option does not work correctly from Spark 2.4.x, and it is not documented. I checked other Scala REPL options, but from quick tests they look either not applicable or not working. This PR only targets to document `-I` for now. ## How was this patch tested? Manually tested. **Mac:** ```bash $ ./bin/spark-shell --help Usage: ./bin/spark-shell [options] Scala REPL options: -I <file> preload <file>, enforcing line-by-line interpretation Options: --master MASTER_URL spark://host:port, mesos://host:port, yarn, k8s://https://host:port, or local (Default: local[*]). --deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client). ... ``` **Windows:** ```cmd C:\...\spark>.\bin\spark-shell --help Usage: .\bin\spark-shell.cmd [options] Scala REPL options: -I <file> preload <file>, enforcing line-by-line interpretation Options: --master MASTER_URL spark://host:port, mesos://host:port, yarn, k8s://https://host:port, or local (Default: local[*]). --deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client). ... ``` Closes #22919 from HyukjinKwon/SPARK-25906.
Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon (cherry picked from commit cc38abc27a671f345e3b4c170977a1976a02a0d0) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f98c0ad0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f98c0ad0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f98c0ad0 Branch: refs/heads/branch-2.4 Commit: f98c0ad02ea087ae79fef277801d0b71a5019b48 Parents: 8526f2e Author: hyukjinkwon Authored: Tue Nov 6 10:39:58 2018 +0800 Committer: hyukjinkwon Committed: Tue Nov 6 10:40:17 2018 +0800 -- bin/spark-shell | 5 - bin/spark-shell2.cmd | 8 +++- 2 files changed, 11 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f98c0ad0/bin/spark-shell -- diff --git a/bin/spark-shell b/bin/spark-shell index 421f36c..e920137 100755 --- a/bin/spark-shell +++ b/bin/spark-shell @@ -32,7 +32,10 @@ if [ -z "${SPARK_HOME}" ]; then source "$(dirname "$0")"/find-spark-home fi -export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]" +export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options] + +Scala REPL options: + -Ipreload , enforcing line-by-line interpretation" # SPARK-4161: scala does not assume use of the java classpath, # so we need to add the "-Dscala.usejavacp=true" flag manually. 
We http://git-wip-us.apache.org/repos/asf/spark/blob/f98c0ad0/bin/spark-shell2.cmd -- diff --git a/bin/spark-shell2.cmd b/bin/spark-shell2.cmd index aaf7190..549bf43 100644 --- a/bin/spark-shell2.cmd +++ b/bin/spark-shell2.cmd @@ -20,7 +20,13 @@ rem rem Figure out where the Spark framework is installed call "%~dp0find-spark-home.cmd" -set _SPARK_CMD_USAGE=Usage: .\bin\spark-shell.cmd [options] +set LF=^ + + +rem two empty lines are required +set _SPARK_CMD_USAGE=Usage: .\bin\spark-shell.cmd [options]^%LF%%LF%^%LF%%LF%^ +Scala REPL options:^%LF%%LF%^ + -I ^ preload ^, enforcing line-by-line interpretation rem SPARK-4161: scala does not assume use of the java classpath, rem so we need to add the "-Dscala.usejavacp=true" flag manually. We
[5/7] spark git commit: [SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files
http://git-wip-us.apache.org/repos/asf/spark/blob/a7a331df/python/pyspark/sql/tests/__init__.py -- diff --git a/python/pyspark/sql/tests/__init__.py b/python/pyspark/sql/tests/__init__.py new file mode 100644 index 000..cce3aca --- /dev/null +++ b/python/pyspark/sql/tests/__init__.py @@ -0,0 +1,16 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# http://git-wip-us.apache.org/repos/asf/spark/blob/a7a331df/python/pyspark/sql/tests/test_appsubmit.py -- diff --git a/python/pyspark/sql/tests/test_appsubmit.py b/python/pyspark/sql/tests/test_appsubmit.py new file mode 100644 index 000..3c71151 --- /dev/null +++ b/python/pyspark/sql/tests/test_appsubmit.py @@ -0,0 +1,96 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. 
You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import os +import subprocess +import tempfile + +import py4j + +from pyspark import SparkContext +from pyspark.tests import SparkSubmitTests + + +class HiveSparkSubmitTests(SparkSubmitTests): + +@classmethod +def setUpClass(cls): +# get a SparkContext to check for availability of Hive +sc = SparkContext('local[4]', cls.__name__) +cls.hive_available = True +try: +sc._jvm.org.apache.hadoop.hive.conf.HiveConf() +except py4j.protocol.Py4JError: +cls.hive_available = False +except TypeError: +cls.hive_available = False +finally: +# we don't need this SparkContext for the test +sc.stop() + +def setUp(self): +super(HiveSparkSubmitTests, self).setUp() +if not self.hive_available: +self.skipTest("Hive is not available.") + +def test_hivecontext(self): +# This test checks that HiveContext is using Hive metastore (SPARK-16224). +# It sets a metastore url and checks if there is a derby dir created by +# Hive metastore. If this derby dir exists, HiveContext is using +# Hive metastore. 
+metastore_path = os.path.join(tempfile.mkdtemp(), "spark16224_metastore_db") +metastore_URL = "jdbc:derby:;databaseName=" + metastore_path + ";create=true" +hive_site_dir = os.path.join(self.programDir, "conf") +hive_site_file = self.createTempFile("hive-site.xml", (""" +| +| +| javax.jdo.option.ConnectionURL +| %s +| +| +""" % metastore_URL).lstrip(), "conf") +script = self.createTempFile("test.py", """ +|import os +| +|from pyspark.conf import SparkConf +|from pyspark.context import SparkContext +|from pyspark.sql import HiveContext +| +|conf = SparkConf() +|sc = SparkContext(conf=conf) +|hive_context = HiveContext(sc) +|print(hive_context.sql("show databases").collect()) +""") +proc = subprocess.Popen( +self.sparkSubmit + ["--master", "local-cluster[1,1,1024]", +"--driver-class-path", hive_site_dir, script], +stdout=subprocess.PIPE) +out, err = proc.communicate() +self.assertEqual(0, proc.returncode) +self.assertIn("default", out.decode('utf-8')) +self.assertTrue(os.path.exists(metastore_path)) + + +if __name__ == "__main__": +import unittest +from
[1/7] spark git commit: [SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files
Repository: spark Updated Branches: refs/heads/master f26cd1881 -> a7a331df6 http://git-wip-us.apache.org/repos/asf/spark/blob/a7a331df/python/pyspark/sql/tests/test_udf.py -- diff --git a/python/pyspark/sql/tests/test_udf.py b/python/pyspark/sql/tests/test_udf.py new file mode 100644 index 000..630b215 --- /dev/null +++ b/python/pyspark/sql/tests/test_udf.py @@ -0,0 +1,654 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# + +import functools +import pydoc +import shutil +import tempfile +import unittest + +from pyspark import SparkContext +from pyspark.sql import SparkSession, Column, Row +from pyspark.sql.functions import UserDefinedFunction +from pyspark.sql.types import * +from pyspark.sql.utils import AnalysisException +from pyspark.testing.sqlutils import ReusedSQLTestCase, test_compiled, test_not_compiled_message +from pyspark.tests import QuietTest + + +class UDFTests(ReusedSQLTestCase): + +def test_udf_with_callable(self): +d = [Row(number=i, squared=i**2) for i in range(10)] +rdd = self.sc.parallelize(d) +data = self.spark.createDataFrame(rdd) + +class PlusFour: +def __call__(self, col): +if col is not None: +return col + 4 + +call = PlusFour() +pudf = UserDefinedFunction(call, LongType()) +res = data.select(pudf(data['number']).alias('plus_four')) +self.assertEqual(res.agg({'plus_four': 'sum'}).collect()[0][0], 85) + +def test_udf_with_partial_function(self): +d = [Row(number=i, squared=i**2) for i in range(10)] +rdd = self.sc.parallelize(d) +data = self.spark.createDataFrame(rdd) + +def some_func(col, param): +if col is not None: +return col + param + +pfunc = functools.partial(some_func, param=4) +pudf = UserDefinedFunction(pfunc, LongType()) +res = data.select(pudf(data['number']).alias('plus_four')) +self.assertEqual(res.agg({'plus_four': 'sum'}).collect()[0][0], 85) + +def test_udf(self): +self.spark.catalog.registerFunction("twoArgs", lambda x, y: len(x) + y, IntegerType()) +[row] = self.spark.sql("SELECT twoArgs('test', 1)").collect() +self.assertEqual(row[0], 5) + +# This is to check if a deprecated 'SQLContext.registerFunction' can call its alias. 
+sqlContext = self.spark._wrapped +sqlContext.registerFunction("oneArg", lambda x: len(x), IntegerType()) +[row] = sqlContext.sql("SELECT oneArg('test')").collect() +self.assertEqual(row[0], 4) + +def test_udf2(self): +with self.tempView("test"): +self.spark.catalog.registerFunction("strlen", lambda string: len(string), IntegerType()) +self.spark.createDataFrame(self.sc.parallelize([Row(a="test")]))\ +.createOrReplaceTempView("test") +[res] = self.spark.sql("SELECT strlen(a) FROM test WHERE strlen(a) > 1").collect() +self.assertEqual(4, res[0]) + +def test_udf3(self): +two_args = self.spark.catalog.registerFunction( +"twoArgs", UserDefinedFunction(lambda x, y: len(x) + y)) +self.assertEqual(two_args.deterministic, True) +[row] = self.spark.sql("SELECT twoArgs('test', 1)").collect() +self.assertEqual(row[0], u'5') + +def test_udf_registration_return_type_none(self): +two_args = self.spark.catalog.registerFunction( +"twoArgs", UserDefinedFunction(lambda x, y: len(x) + y, "integer"), None) +self.assertEqual(two_args.deterministic, True) +[row] = self.spark.sql("SELECT twoArgs('test', 1)").collect() +self.assertEqual(row[0], 5) + +def test_udf_registration_return_type_not_none(self): +with QuietTest(self.sc): +with self.assertRaisesRegexp(TypeError, "Invalid returnType"): +self.spark.catalog.registerFunction( +"f", UserDefinedFunction(lambda x, y: len(x) + y, StringType()), StringType()) + +def test_nondeterministic_udf(self): +# Test that nondeterministic UDFs are evaluated only once in chained UDF evaluations
[6/7] spark git commit: [SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files
http://git-wip-us.apache.org/repos/asf/spark/blob/a7a331df/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py deleted file mode 100644 index ea02691..000 --- a/python/pyspark/sql/tests.py +++ /dev/null @@ -1,7079 +0,0 @@ -# -*- encoding: utf-8 -*- -# -# Licensed to the Apache Software Foundation (ASF) under one or more -# contributor license agreements. See the NOTICE file distributed with -# this work for additional information regarding copyright ownership. -# The ASF licenses this file to You under the Apache License, Version 2.0 -# (the "License"); you may not use this file except in compliance with -# the License. You may obtain a copy of the License at -# -#http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# - -""" -Unit tests for pyspark.sql; additional tests are implemented as doctests in -individual modules. 
-""" -import os -import sys -import subprocess -import pydoc -import shutil -import tempfile -import threading -import pickle -import functools -import time -import datetime -import array -import ctypes -import warnings -import py4j -from contextlib import contextmanager - -try: -import xmlrunner -except ImportError: -xmlrunner = None - -if sys.version_info[:2] <= (2, 6): -try: -import unittest2 as unittest -except ImportError: -sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier') -sys.exit(1) -else: -import unittest - -from pyspark.util import _exception_message - -_pandas_requirement_message = None -try: -from pyspark.sql.utils import require_minimum_pandas_version -require_minimum_pandas_version() -except ImportError as e: -# If Pandas version requirement is not satisfied, skip related tests. -_pandas_requirement_message = _exception_message(e) - -_pyarrow_requirement_message = None -try: -from pyspark.sql.utils import require_minimum_pyarrow_version -require_minimum_pyarrow_version() -except ImportError as e: -# If Arrow version requirement is not satisfied, skip related tests. 
-_pyarrow_requirement_message = _exception_message(e) - -_test_not_compiled_message = None -try: -from pyspark.sql.utils import require_test_compiled -require_test_compiled() -except Exception as e: -_test_not_compiled_message = _exception_message(e) - -_have_pandas = _pandas_requirement_message is None -_have_pyarrow = _pyarrow_requirement_message is None -_test_compiled = _test_not_compiled_message is None - -from pyspark import SparkConf, SparkContext -from pyspark.sql import SparkSession, SQLContext, HiveContext, Column, Row -from pyspark.sql.types import * -from pyspark.sql.types import UserDefinedType, _infer_type, _make_type_verifier -from pyspark.sql.types import _array_signed_int_typecode_ctype_mappings, _array_type_mappings -from pyspark.sql.types import _array_unsigned_int_typecode_ctype_mappings -from pyspark.sql.types import _merge_type -from pyspark.tests import QuietTest, ReusedPySparkTestCase, PySparkTestCase, SparkSubmitTests -from pyspark.sql.functions import UserDefinedFunction, sha2, lit -from pyspark.sql.window import Window -from pyspark.sql.utils import AnalysisException, ParseException, IllegalArgumentException - - -class UTCOffsetTimezone(datetime.tzinfo): -""" -Specifies timezone in UTC offset -""" - -def __init__(self, offset=0): -self.ZERO = datetime.timedelta(hours=offset) - -def utcoffset(self, dt): -return self.ZERO - -def dst(self, dt): -return self.ZERO - - -class ExamplePointUDT(UserDefinedType): -""" -User-defined type (UDT) for ExamplePoint. -""" - -@classmethod -def sqlType(self): -return ArrayType(DoubleType(), False) - -@classmethod -def module(cls): -return 'pyspark.sql.tests' - -@classmethod -def scalaUDT(cls): -return 'org.apache.spark.sql.test.ExamplePointUDT' - -def serialize(self, obj): -return [obj.x, obj.y] - -def deserialize(self, datum): -return ExamplePoint(datum[0], datum[1]) - - -class ExamplePoint: -""" -An example class to demonstrate UDT in Scala, Java, and Python. 
-""" - -__UDT__ = ExamplePointUDT() - -def __init__(self, x, y): -self.x = x -self.y = y - -def __repr__(self): -return "ExamplePoint(%s,%s)" % (self.x, self.y) - -def __str__(self): -return "(%s,%s)" % (self.x, self.y) - -def __eq__(self, other): -return isinstance(other, self.__class__) and \ -other.x == self.x and other.y ==
[3/7] spark git commit: [SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files
http://git-wip-us.apache.org/repos/asf/spark/blob/a7a331df/python/pyspark/sql/tests/test_pandas_udf_grouped_map.py -- diff --git a/python/pyspark/sql/tests/test_pandas_udf_grouped_map.py b/python/pyspark/sql/tests/test_pandas_udf_grouped_map.py new file mode 100644 index 000..4d44388 --- /dev/null +++ b/python/pyspark/sql/tests/test_pandas_udf_grouped_map.py @@ -0,0 +1,530 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# + +import datetime +import unittest + +from pyspark.sql import Row +from pyspark.sql.types import * +from pyspark.testing.sqlutils import ReusedSQLTestCase, have_pandas, have_pyarrow, \ +pandas_requirement_message, pyarrow_requirement_message +from pyspark.tests import QuietTest + + +@unittest.skipIf( +not have_pandas or not have_pyarrow, +pandas_requirement_message or pyarrow_requirement_message) +class GroupedMapPandasUDFTests(ReusedSQLTestCase): + +@property +def data(self): +from pyspark.sql.functions import array, explode, col, lit +return self.spark.range(10).toDF('id') \ +.withColumn("vs", array([lit(i) for i in range(20, 30)])) \ +.withColumn("v", explode(col('vs'))).drop('vs') + +def test_supported_types(self): +from decimal import Decimal +from distutils.version import LooseVersion +import pyarrow as pa +from pyspark.sql.functions import pandas_udf, PandasUDFType + +values = [ +1, 2, 3, +4, 5, 1.1, +2.2, Decimal(1.123), +[1, 2, 2], True, 'hello' +] +output_fields = [ +('id', IntegerType()), ('byte', ByteType()), ('short', ShortType()), +('int', IntegerType()), ('long', LongType()), ('float', FloatType()), +('double', DoubleType()), ('decim', DecimalType(10, 3)), +('array', ArrayType(IntegerType())), ('bool', BooleanType()), ('str', StringType()) +] + +# TODO: Add BinaryType to variables above once minimum pyarrow version is 0.10.0 +if LooseVersion(pa.__version__) >= LooseVersion("0.10.0"): +values.append(bytearray([0x01, 0x02])) +output_fields.append(('bin', BinaryType())) + +output_schema = StructType([StructField(*x) for x in output_fields]) +df = self.spark.createDataFrame([values], schema=output_schema) + +# Different forms of group map pandas UDF, results of these are the same +udf1 = pandas_udf( +lambda pdf: pdf.assign( +byte=pdf.byte * 2, +short=pdf.short * 2, +int=pdf.int * 2, +long=pdf.long * 2, +float=pdf.float * 2, +double=pdf.double * 2, +decim=pdf.decim * 2, +bool=False if pdf.bool else True, +str=pdf.str + 'there', +array=pdf.array, +), 
+output_schema, +PandasUDFType.GROUPED_MAP +) + +udf2 = pandas_udf( +lambda _, pdf: pdf.assign( +byte=pdf.byte * 2, +short=pdf.short * 2, +int=pdf.int * 2, +long=pdf.long * 2, +float=pdf.float * 2, +double=pdf.double * 2, +decim=pdf.decim * 2, +bool=False if pdf.bool else True, +str=pdf.str + 'there', +array=pdf.array, +), +output_schema, +PandasUDFType.GROUPED_MAP +) + +udf3 = pandas_udf( +lambda key, pdf: pdf.assign( +id=key[0], +byte=pdf.byte * 2, +short=pdf.short * 2, +int=pdf.int * 2, +long=pdf.long * 2, +float=pdf.float * 2, +double=pdf.double * 2, +decim=pdf.decim * 2, +bool=False if pdf.bool else True, +str=pdf.str + 'there', +array=pdf.array, +), +output_schema, +PandasUDFType.GROUPED_MAP +) + +result1 =
[7/7] spark git commit: [SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files
[SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files ## What changes were proposed in this pull request? This is the first official attempt to break up the huge single `tests.py` file - I tried it locally a few times before and gave up. Currently it makes the unittests very hard to read and difficult to check; it even bothers me to scroll through the big file. It's one single 7000-line file! This is not only a readability issue: since one big test module takes most of the test time, the tests don't run fully in parallel - although splitting will cost some extra context start and stop time. We could pick one example and follow it. Given my investigation, the proposed style looks close to NumPy's structure and is easier to follow. Please see https://github.com/numpy/numpy/tree/master/numpy. Basically this PR proposes to break down `pyspark/sql/tests.py` into the following layout:
```bash
pyspark
...
├── sql
...
│   └── tests                # Includes all tests broken down from 'pyspark/sql/tests.py'.
│       │                    # Each matches a module in 'pyspark/sql'. Additionally, some logical
│       │                    # groups can be added, for instance 'test_arrow.py', 'test_datasources.py' ...
│       ├── __init__.py
│       ├── test_appsubmit.py
│       ├── test_arrow.py
│       ├── test_catalog.py
│       ├── test_column.py
│       ├── test_conf.py
│       ├── test_context.py
│       ├── test_dataframe.py
│       ├── test_datasources.py
│       ├── test_functions.py
│       ├── test_group.py
│       ├── test_pandas_udf.py
│       ├── test_pandas_udf_grouped_agg.py
│       ├── test_pandas_udf_grouped_map.py
│       ├── test_pandas_udf_scalar.py
│       ├── test_pandas_udf_window.py
│       ├── test_readwriter.py
│       ├── test_serde.py
│       ├── test_session.py
│       ├── test_streaming.py
│       ├── test_types.py
│       ├── test_udf.py
│       └── test_utils.py
...
└── testing                  # Includes testing utils that can be used in unittests.
    ├── __init__.py
    └── sqlutils.py
...
```
## How was this patch tested? Existing tests should cover.
`cd python` and `./run-tests-with-coverage`. Manually checked the tests are actually being run. Each test module can (unofficially) be run individually via: ``` SPARK_TESTING=1 ./bin/pyspark pyspark.sql.tests.test_pandas_udf_scalar ``` Note that if you're using macOS and Python 3, you might have to set `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`. Closes #23021 from HyukjinKwon/SPARK-25344. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a7a331df Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a7a331df Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a7a331df Branch: refs/heads/master Commit: a7a331df6e6fbcb181caf2363bffc3e05bdfc009 Parents: f26cd18 Author: hyukjinkwon Authored: Wed Nov 14 14:51:11 2018 +0800 Committer: hyukjinkwon Committed: Wed Nov 14 14:51:11 2018 +0800 -- dev/sparktestsupport/modules.py | 25 +- python/pyspark/sql/tests.py | 7079 -- python/pyspark/sql/tests/__init__.py| 16 + python/pyspark/sql/tests/test_appsubmit.py | 96 + python/pyspark/sql/tests/test_arrow.py | 399 + python/pyspark/sql/tests/test_catalog.py| 199 + python/pyspark/sql/tests/test_column.py | 157 + python/pyspark/sql/tests/test_conf.py | 55 + python/pyspark/sql/tests/test_context.py| 263 + python/pyspark/sql/tests/test_dataframe.py | 737 ++ python/pyspark/sql/tests/test_datasources.py| 170 + python/pyspark/sql/tests/test_functions.py | 278 + python/pyspark/sql/tests/test_group.py | 45 + python/pyspark/sql/tests/test_pandas_udf.py | 216 + .../sql/tests/test_pandas_udf_grouped_agg.py| 503 ++ .../sql/tests/test_pandas_udf_grouped_map.py| 530 ++ .../pyspark/sql/tests/test_pandas_udf_scalar.py | 807 ++ .../pyspark/sql/tests/test_pandas_udf_window.py | 262 + python/pyspark/sql/tests/test_readwriter.py | 153 + python/pyspark/sql/tests/test_serde.py | 138 + python/pyspark/sql/tests/test_session.py| 320 + python/pyspark/sql/tests/test_streaming.py | 566 ++
python/pyspark/sql/tests/test_types.py | 944 +++ python/pyspark/sql/tests/test_udf.py| 654 ++ python/pyspark/sql/tests/test_utils.py | 54 + python/pyspark/testing/__init__.py | 16 + python/pyspark/testing/sqlutils.py
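Each broken-out module ends with a small runner stanza (truncated in the `test_appsubmit.py` diff above) so it can be executed standalone. A minimal, self-contained sketch of that boilerplate - `ExampleTests` is an illustrative test class, not one taken from the PR:

```python
# Sketch of the per-module runner boilerplate each broken-out test file ends
# with. The xmlrunner fallback mirrors the optional JUnit-style XML reporting
# used by the PySpark test suite; ExampleTests is a stand-in test class.
import unittest


class ExampleTests(unittest.TestCase):
    def test_trivial(self):
        self.assertEqual(1 + 1, 2)


if __name__ == "__main__":
    try:
        import xmlrunner  # optional: emits JUnit-style XML reports
        runner = xmlrunner.XMLTestRunner(output='target/test-reports')
    except ImportError:
        runner = None  # fall back to the default text runner
    # exit=False keeps unittest.main from calling sys.exit() here.
    unittest.main(testRunner=runner, verbosity=2, exit=False)
```

With this pattern, `SPARK_TESTING=1 ./bin/pyspark pyspark.sql.tests.test_udf` (and friends) can run one module at a time instead of the whole 7000-line file.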
[4/7] spark git commit: [SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files
http://git-wip-us.apache.org/repos/asf/spark/blob/a7a331df/python/pyspark/sql/tests/test_dataframe.py -- diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py new file mode 100644 index 000..eba00b5 --- /dev/null +++ b/python/pyspark/sql/tests/test_dataframe.py @@ -0,0 +1,737 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# + +import os +import pydoc +import time +import unittest + +from pyspark.sql import SparkSession, Row +from pyspark.sql.types import * +from pyspark.sql.utils import AnalysisException, IllegalArgumentException +from pyspark.testing.sqlutils import ReusedSQLTestCase, SQLTestUtils, have_pyarrow, have_pandas, \ +pandas_requirement_message, pyarrow_requirement_message +from pyspark.tests import QuietTest + + +class DataFrameTests(ReusedSQLTestCase): + +def test_range(self): +self.assertEqual(self.spark.range(1, 1).count(), 0) +self.assertEqual(self.spark.range(1, 0, -1).count(), 1) +self.assertEqual(self.spark.range(0, 1 << 40, 1 << 39).count(), 2) +self.assertEqual(self.spark.range(-2).count(), 0) +self.assertEqual(self.spark.range(3).count(), 3) + +def test_duplicated_column_names(self): +df = self.spark.createDataFrame([(1, 2)], ["c", "c"]) +row = df.select('*').first() +self.assertEqual(1, row[0]) +self.assertEqual(2, row[1]) +self.assertEqual("Row(c=1, c=2)", str(row)) +# Cannot access columns +self.assertRaises(AnalysisException, lambda: df.select(df[0]).first()) +self.assertRaises(AnalysisException, lambda: df.select(df.c).first()) +self.assertRaises(AnalysisException, lambda: df.select(df["c"]).first()) + +def test_freqItems(self): +vals = [Row(a=1, b=-2.0) if i % 2 == 0 else Row(a=i, b=i * 1.0) for i in range(100)] +df = self.sc.parallelize(vals).toDF() +items = df.stat.freqItems(("a", "b"), 0.4).collect()[0] +self.assertTrue(1 in items[0]) +self.assertTrue(-2.0 in items[1]) + +def test_help_command(self): +# Regression test for SPARK-5464 +rdd = self.sc.parallelize(['{"foo":"bar"}', '{"foo":"baz"}']) +df = self.spark.read.json(rdd) +# render_doc() reproduces the help() exception without printing output +pydoc.render_doc(df) +pydoc.render_doc(df.foo) +pydoc.render_doc(df.take(1)) + +def test_dropna(self): +schema = StructType([ +StructField("name", StringType(), True), +StructField("age", IntegerType(), True), +StructField("height", DoubleType(), True)]) + 
+# shouldn't drop a non-null row +self.assertEqual(self.spark.createDataFrame( +[(u'Alice', 50, 80.1)], schema).dropna().count(), +1) + +# dropping rows with a single null value +self.assertEqual(self.spark.createDataFrame( +[(u'Alice', None, 80.1)], schema).dropna().count(), +0) +self.assertEqual(self.spark.createDataFrame( +[(u'Alice', None, 80.1)], schema).dropna(how='any').count(), +0) + +# if how = 'all', only drop rows if all values are null +self.assertEqual(self.spark.createDataFrame( +[(u'Alice', None, 80.1)], schema).dropna(how='all').count(), +1) +self.assertEqual(self.spark.createDataFrame( +[(None, None, None)], schema).dropna(how='all').count(), +0) + +# how and subset +self.assertEqual(self.spark.createDataFrame( +[(u'Alice', 50, None)], schema).dropna(how='any', subset=['name', 'age']).count(), +1) +self.assertEqual(self.spark.createDataFrame( +[(u'Alice', None, None)], schema).dropna(how='any', subset=['name', 'age']).count(), +0) + +# threshold +self.assertEqual(self.spark.createDataFrame( +[(u'Alice', None, 80.1)], schema).dropna(thresh=2).count(), +1) +self.assertEqual(self.spark.createDataFrame( +[(u'Alice', None, None)], schema).dropna(thresh=2).count(), +
[2/7] spark git commit: [SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files
http://git-wip-us.apache.org/repos/asf/spark/blob/a7a331df/python/pyspark/sql/tests/test_session.py -- diff --git a/python/pyspark/sql/tests/test_session.py b/python/pyspark/sql/tests/test_session.py new file mode 100644 index 000..b811047 --- /dev/null +++ b/python/pyspark/sql/tests/test_session.py @@ -0,0 +1,320 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import os +import unittest + +from pyspark import SparkConf, SparkContext +from pyspark.sql import SparkSession, SQLContext, Row +from pyspark.testing.sqlutils import ReusedSQLTestCase +from pyspark.tests import PySparkTestCase + + +class SparkSessionTests(ReusedSQLTestCase): +def test_sqlcontext_reuses_sparksession(self): +sqlContext1 = SQLContext(self.sc) +sqlContext2 = SQLContext(self.sc) +self.assertTrue(sqlContext1.sparkSession is sqlContext2.sparkSession) + + +class SparkSessionTests1(ReusedSQLTestCase): + +# We can't include this test into SQLTests because we will stop class's SparkContext and cause +# other tests failed. 
+def test_sparksession_with_stopped_sparkcontext(self): +self.sc.stop() +sc = SparkContext('local[4]', self.sc.appName) +spark = SparkSession.builder.getOrCreate() +try: +df = spark.createDataFrame([(1, 2)], ["c", "c"]) +df.collect() +finally: +spark.stop() +sc.stop() + + +class SparkSessionTests2(PySparkTestCase): + +# This test is separate because it's closely related with session's start and stop. +# See SPARK-23228. +def test_set_jvm_default_session(self): +spark = SparkSession.builder.getOrCreate() +try: + self.assertTrue(spark._jvm.SparkSession.getDefaultSession().isDefined()) +finally: +spark.stop() + self.assertTrue(spark._jvm.SparkSession.getDefaultSession().isEmpty()) + +def test_jvm_default_session_already_set(self): +# Here, we assume there is the default session already set in JVM. +jsession = self.sc._jvm.SparkSession(self.sc._jsc.sc()) +self.sc._jvm.SparkSession.setDefaultSession(jsession) + +spark = SparkSession.builder.getOrCreate() +try: + self.assertTrue(spark._jvm.SparkSession.getDefaultSession().isDefined()) +# The session should be the same with the exiting one. 
+ self.assertTrue(jsession.equals(spark._jvm.SparkSession.getDefaultSession().get())) +finally: +spark.stop() + + +class SparkSessionTests3(unittest.TestCase): + +def test_active_session(self): +spark = SparkSession.builder \ +.master("local") \ +.getOrCreate() +try: +activeSession = SparkSession.getActiveSession() +df = activeSession.createDataFrame([(1, 'Alice')], ['age', 'name']) +self.assertEqual(df.collect(), [Row(age=1, name=u'Alice')]) +finally: +spark.stop() + +def test_get_active_session_when_no_active_session(self): +active = SparkSession.getActiveSession() +self.assertEqual(active, None) +spark = SparkSession.builder \ +.master("local") \ +.getOrCreate() +active = SparkSession.getActiveSession() +self.assertEqual(active, spark) +spark.stop() +active = SparkSession.getActiveSession() +self.assertEqual(active, None) + +def test_SparkSession(self): +spark = SparkSession.builder \ +.master("local") \ +.config("some-config", "v2") \ +.getOrCreate() +try: +self.assertEqual(spark.conf.get("some-config"), "v2") +self.assertEqual(spark.sparkContext._conf.get("some-config"), "v2") +self.assertEqual(spark.version, spark.sparkContext.version) +spark.sql("CREATE DATABASE test_db") +spark.catalog.setCurrentDatabase("test_db") +self.assertEqual(spark.catalog.currentDatabase(), "test_db") +spark.sql("CREATE TABLE table1 (name STRING, age INT) USING parquet") +
spark git commit: [MINOR][SQL] Add disable bucketedRead workaround when throw RuntimeException
Repository: spark Updated Branches: refs/heads/master ad853c567 -> f6255d7b7 [MINOR][SQL] Add disable bucketedRead workaround when throw RuntimeException ## What changes were proposed in this pull request? Reading from a bucketed table (about 1.7 GB per bucket file) can throw a `RuntimeException`: ![image](https://user-images.githubusercontent.com/5399861/48346889-8041ce00-e6b7-11e8-83b0-ead83fb15821.png) Default (bucketed read enabled): ![image](https://user-images.githubusercontent.com/5399861/48347084-2c83b480-e6b8-11e8-913a-9cafc043e9e4.png) With bucketed read disabled: ![image](https://user-images.githubusercontent.com/5399861/48347099-3a393a00-e6b8-11e8-94af-cb814e1ba277.png) The reason is that each bucket file is too big; a workaround is to disable bucketed reads. This PR adds that workaround to the error message. ## How was this patch tested? Manual tests. Closes #23014 from wangyum/anotherWorkaround. Authored-by: Yuming Wang Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f6255d7b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f6255d7b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f6255d7b Branch: refs/heads/master Commit: f6255d7b7cc4cc5d1f4fe0e5e493a1efee22f38f Parents: ad853c5 Author: Yuming Wang Authored: Thu Nov 15 08:33:06 2018 +0800 Committer: hyukjinkwon Committed: Thu Nov 15 08:33:06 2018 +0800 -- .../spark/sql/execution/vectorized/WritableColumnVector.java| 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f6255d7b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java -- diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java index b0e119d..4f5e72c 100644 ---
a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java +++ b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java @@ -101,10 +101,11 @@ public abstract class WritableColumnVector extends ColumnVector { String message = "Cannot reserve additional contiguous bytes in the vectorized reader (" + (requiredCapacity >= 0 ? "requested " + requiredCapacity + " bytes" : "integer overflow") + "). As a workaround, you can reduce the vectorized reader batch size, or disable the " + -"vectorized reader. For parquet file format, refer to " + +"vectorized reader, or disable " + SQLConf.BUCKETING_ENABLED().key() + " if you read " + +"from bucket table. For Parquet file format, refer to " + SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE().key() + " (default " + SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE().defaultValueString() + -") and " + SQLConf.PARQUET_VECTORIZED_READER_ENABLED().key() + "; for orc file format, " + +") and " + SQLConf.PARQUET_VECTORIZED_READER_ENABLED().key() + "; for ORC file format, " + "refer to " + SQLConf.ORC_VECTORIZED_READER_BATCH_SIZE().key() + " (default " + SQLConf.ORC_VECTORIZED_READER_BATCH_SIZE().defaultValueString() + ") and " + SQLConf.ORC_VECTORIZED_READER_ENABLED().key() + "."; - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
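The improved message points at three escape hatches. As a hedged illustration - the keys are the real Spark SQL option names referenced in the patch (`SQLConf.BUCKETING_ENABLED`, `SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE`, `SQLConf.PARQUET_VECTORIZED_READER_ENABLED`), while the values and the `--conf` rendering are just an example, not recommendations:

```python
# The three knobs the improved error message suggests, expressed as plain
# key/value pairs that could be passed to spark-submit via --conf (or set
# on spark.conf at runtime). Values here are illustrative only.
bucketed_read_workarounds = {
    # SQLConf.BUCKETING_ENABLED: disable bucketed reads entirely.
    "spark.sql.sources.bucketing.enabled": "false",
    # SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE: shrink the batch (default 4096 rows).
    "spark.sql.parquet.columnarReaderBatchSize": "1024",
    # SQLConf.PARQUET_VECTORIZED_READER_ENABLED: fall back to the row-based reader.
    "spark.sql.parquet.enableVectorizedReader": "false",
}

# Render as spark-submit arguments.
for key, value in sorted(bucketed_read_workarounds.items()):
    print("--conf %s=%s" % (key, value))
```

In practice you would pick one workaround at a time, starting with the smallest behavioral change (reducing the batch size) before disabling bucketing outright.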
[2/4] spark git commit: [SPARK-26036][PYTHON] Break large tests.py files into smaller files
http://git-wip-us.apache.org/repos/asf/spark/blob/03306a6d/python/pyspark/tests/__init__.py -- diff --git a/python/pyspark/tests/__init__.py b/python/pyspark/tests/__init__.py new file mode 100644 index 000..12bdf0d --- /dev/null +++ b/python/pyspark/tests/__init__.py @@ -0,0 +1,16 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# http://git-wip-us.apache.org/repos/asf/spark/blob/03306a6d/python/pyspark/tests/test_appsubmit.py -- diff --git a/python/pyspark/tests/test_appsubmit.py b/python/pyspark/tests/test_appsubmit.py new file mode 100644 index 000..92bcb11 --- /dev/null +++ b/python/pyspark/tests/test_appsubmit.py @@ -0,0 +1,248 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. 
You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import os +import re +import shutil +import subprocess +import tempfile +import unittest +import zipfile + + +class SparkSubmitTests(unittest.TestCase): + +def setUp(self): +self.programDir = tempfile.mkdtemp() +tmp_dir = tempfile.gettempdir() +self.sparkSubmit = [ +os.path.join(os.environ.get("SPARK_HOME"), "bin", "spark-submit"), +"--conf", "spark.driver.extraJavaOptions=-Djava.io.tmpdir={0}".format(tmp_dir), +"--conf", "spark.executor.extraJavaOptions=-Djava.io.tmpdir={0}".format(tmp_dir), +] + +def tearDown(self): +shutil.rmtree(self.programDir) + +def createTempFile(self, name, content, dir=None): +""" +Create a temp file with the given name and content and return its path. +Strips leading spaces from content up to the first '|' in each line. +""" +pattern = re.compile(r'^ *\|', re.MULTILINE) +content = re.sub(pattern, '', content.strip()) +if dir is None: +path = os.path.join(self.programDir, name) +else: +os.makedirs(os.path.join(self.programDir, dir)) +path = os.path.join(self.programDir, dir, name) +with open(path, "w") as f: +f.write(content) +return path + +def createFileInZip(self, name, content, ext=".zip", dir=None, zip_name=None): +""" +Create a zip archive containing a file with the given content and return its path. +Strips leading spaces from content up to the first '|' in each line. 
+""" +pattern = re.compile(r'^ *\|', re.MULTILINE) +content = re.sub(pattern, '', content.strip()) +if dir is None: +path = os.path.join(self.programDir, name + ext) +else: +path = os.path.join(self.programDir, dir, zip_name + ext) +zip = zipfile.ZipFile(path, 'w') +zip.writestr(name, content) +zip.close() +return path + +def create_spark_package(self, artifact_name): +group_id, artifact_id, version = artifact_name.split(":") +self.createTempFile("%s-%s.pom" % (artifact_id, version), (""" +| +|http://maven.apache.org/POM/4.0.0; +| xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance; +| xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 +| http://maven.apache.org/xsd/maven-4.0.0.xsd;> +| 4.0.0 +| %s +| %s +| %s +| +""" % (group_id, artifact_id, version)).lstrip(), +
[3/4] spark git commit: [SPARK-26036][PYTHON] Break large tests.py files into smaller files
http://git-wip-us.apache.org/repos/asf/spark/blob/03306a6d/python/pyspark/tests.py -- diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py deleted file mode 100644 index 131c51e..000 --- a/python/pyspark/tests.py +++ /dev/null @@ -1,2502 +0,0 @@ -# -# Licensed to the Apache Software Foundation (ASF) under one or more -# contributor license agreements. See the NOTICE file distributed with -# this work for additional information regarding copyright ownership. -# The ASF licenses this file to You under the Apache License, Version 2.0 -# (the "License"); you may not use this file except in compliance with -# the License. You may obtain a copy of the License at -# -#http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# - -""" -Unit tests for PySpark; additional tests are implemented as doctests in -individual modules. 
-""" - -from array import array -from glob import glob -import os -import re -import shutil -import subprocess -import sys -import tempfile -import time -import zipfile -import random -import threading -import hashlib - -from py4j.protocol import Py4JJavaError -try: -import xmlrunner -except ImportError: -xmlrunner = None - -if sys.version_info[:2] <= (2, 6): -try: -import unittest2 as unittest -except ImportError: -sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier') -sys.exit(1) -else: -import unittest -if sys.version_info[0] >= 3: -xrange = range -basestring = str - -if sys.version >= "3": -from io import StringIO -else: -from StringIO import StringIO - - -from pyspark import keyword_only -from pyspark.conf import SparkConf -from pyspark.context import SparkContext -from pyspark.rdd import RDD -from pyspark.files import SparkFiles -from pyspark.serializers import read_int, BatchedSerializer, MarshalSerializer, PickleSerializer, \ -CloudPickleSerializer, CompressedSerializer, UTF8Deserializer, NoOpSerializer, \ -PairDeserializer, CartesianDeserializer, AutoBatchedSerializer, AutoSerializer, \ -FlattenedValuesSerializer -from pyspark.shuffle import Aggregator, ExternalMerger, ExternalSorter -from pyspark import shuffle -from pyspark.profiler import BasicProfiler -from pyspark.taskcontext import BarrierTaskContext, TaskContext - -_have_scipy = False -_have_numpy = False -try: -import scipy.sparse -_have_scipy = True -except: -# No SciPy, but that's okay, we'll skip those tests -pass -try: -import numpy as np -_have_numpy = True -except: -# No NumPy, but that's okay, we'll skip those tests -pass - - -SPARK_HOME = os.environ["SPARK_HOME"] - - -class MergerTests(unittest.TestCase): - -def setUp(self): -self.N = 1 << 12 -self.l = [i for i in xrange(self.N)] -self.data = list(zip(self.l, self.l)) -self.agg = Aggregator(lambda x: [x], - lambda x, y: x.append(y) or x, - lambda x, y: x.extend(y) or x) - -def test_small_dataset(self): -m = 
ExternalMerger(self.agg, 1000) -m.mergeValues(self.data) -self.assertEqual(m.spills, 0) -self.assertEqual(sum(sum(v) for k, v in m.items()), - sum(xrange(self.N))) - -m = ExternalMerger(self.agg, 1000) -m.mergeCombiners(map(lambda x_y1: (x_y1[0], [x_y1[1]]), self.data)) -self.assertEqual(m.spills, 0) -self.assertEqual(sum(sum(v) for k, v in m.items()), - sum(xrange(self.N))) - -def test_medium_dataset(self): -m = ExternalMerger(self.agg, 20) -m.mergeValues(self.data) -self.assertTrue(m.spills >= 1) -self.assertEqual(sum(sum(v) for k, v in m.items()), - sum(xrange(self.N))) - -m = ExternalMerger(self.agg, 10) -m.mergeCombiners(map(lambda x_y2: (x_y2[0], [x_y2[1]]), self.data * 3)) -self.assertTrue(m.spills >= 1) -self.assertEqual(sum(sum(v) for k, v in m.items()), - sum(xrange(self.N)) * 3) - -def test_huge_dataset(self): -m = ExternalMerger(self.agg, 5, partitions=3) -m.mergeCombiners(map(lambda k_v: (k_v[0], [str(k_v[1])]), self.data * 10)) -self.assertTrue(m.spills >= 1) -self.assertEqual(sum(len(v) for k, v in m.items()), - self.N * 10) -m._cleanup() - -def test_group_by_key(self): - -def gen_data(N, step): -for i in range(1, N + 1, step): -for j in range(i): -yield (i, [j]) - -
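The `Aggregator` exercised by `MergerTests` above is the classic create / merge-value / merge-combiners triple. A minimal pure-Python sketch of those semantics (no Spark required; `combine_by_key` is an illustrative name, not PySpark API):

```python
# Mirrors Aggregator(lambda x: [x], lambda x, y: x.append(y) or x,
#                    lambda x, y: x.extend(y) or x) from the tests.
create = lambda v: [v]
merge_value = lambda acc, v: acc.append(v) or acc  # append returns None, so `or acc` yields acc
merge_combiners = lambda a, b: a.extend(b) or a

def combine_by_key(pairs):
    out = {}
    for k, v in pairs:
        out[k] = merge_value(out[k], v) if k in out else create(v)
    return out

n = 1 << 12
combined = combine_by_key((k % 8, k) for k in range(n))
# The invariant the merger tests assert: merging preserves the total.
assert sum(sum(vs) for vs in combined.values()) == sum(range(n))
```

`ExternalMerger` implements the same contract but spills partitions to disk when memory is tight, which is what the `m.spills >= 1` assertions check.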
[1/4] spark git commit: [SPARK-26036][PYTHON] Break large tests.py files into smaller files
Repository: spark Updated Branches: refs/heads/master f6255d7b7 -> 03306a6df http://git-wip-us.apache.org/repos/asf/spark/blob/03306a6d/python/pyspark/tests/test_readwrite.py -- diff --git a/python/pyspark/tests/test_readwrite.py b/python/pyspark/tests/test_readwrite.py new file mode 100644 index 000..e45f5b3 --- /dev/null +++ b/python/pyspark/tests/test_readwrite.py @@ -0,0 +1,499 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# +import os +import shutil +import sys +import tempfile +import unittest +from array import array + +from pyspark.testing.utils import ReusedPySparkTestCase, SPARK_HOME + + +class InputFormatTests(ReusedPySparkTestCase): + +@classmethod +def setUpClass(cls): +ReusedPySparkTestCase.setUpClass() +cls.tempdir = tempfile.NamedTemporaryFile(delete=False) +os.unlink(cls.tempdir.name) + cls.sc._jvm.WriteInputFormatTestDataGenerator.generateData(cls.tempdir.name, cls.sc._jsc) + +@classmethod +def tearDownClass(cls): +ReusedPySparkTestCase.tearDownClass() +shutil.rmtree(cls.tempdir.name) + +@unittest.skipIf(sys.version >= "3", "serialize array of byte") +def test_sequencefiles(self): +basepath = self.tempdir.name +ints = sorted(self.sc.sequenceFile(basepath + "/sftestdata/sfint/", + "org.apache.hadoop.io.IntWritable", + "org.apache.hadoop.io.Text").collect()) +ei = [(1, u'aa'), (1, u'aa'), (2, u'aa'), (2, u'bb'), (2, u'bb'), (3, u'cc')] +self.assertEqual(ints, ei) + +doubles = sorted(self.sc.sequenceFile(basepath + "/sftestdata/sfdouble/", + "org.apache.hadoop.io.DoubleWritable", + "org.apache.hadoop.io.Text").collect()) +ed = [(1.0, u'aa'), (1.0, u'aa'), (2.0, u'aa'), (2.0, u'bb'), (2.0, u'bb'), (3.0, u'cc')] +self.assertEqual(doubles, ed) + +bytes = sorted(self.sc.sequenceFile(basepath + "/sftestdata/sfbytes/", +"org.apache.hadoop.io.IntWritable", + "org.apache.hadoop.io.BytesWritable").collect()) +ebs = [(1, bytearray('aa', 'utf-8')), + (1, bytearray('aa', 'utf-8')), + (2, bytearray('aa', 'utf-8')), + (2, bytearray('bb', 'utf-8')), + (2, bytearray('bb', 'utf-8')), + (3, bytearray('cc', 'utf-8'))] +self.assertEqual(bytes, ebs) + +text = sorted(self.sc.sequenceFile(basepath + "/sftestdata/sftext/", + "org.apache.hadoop.io.Text", + "org.apache.hadoop.io.Text").collect()) +et = [(u'1', u'aa'), + (u'1', u'aa'), + (u'2', u'aa'), + (u'2', u'bb'), + (u'2', u'bb'), + (u'3', u'cc')] +self.assertEqual(text, et) + +bools = sorted(self.sc.sequenceFile(basepath + 
"/sftestdata/sfbool/", +"org.apache.hadoop.io.IntWritable", + "org.apache.hadoop.io.BooleanWritable").collect()) +eb = [(1, False), (1, True), (2, False), (2, False), (2, True), (3, True)] +self.assertEqual(bools, eb) + +nulls = sorted(self.sc.sequenceFile(basepath + "/sftestdata/sfnull/", +"org.apache.hadoop.io.IntWritable", + "org.apache.hadoop.io.BooleanWritable").collect()) +en = [(1, None), (1, None), (2, None), (2, None), (2, None), (3, None)] +self.assertEqual(nulls, en) + +maps = self.sc.sequenceFile(basepath + "/sftestdata/sfmap/", +"org.apache.hadoop.io.IntWritable", + "org.apache.hadoop.io.MapWritable").collect() +em = [(1, {}), + (1, {3.0: u'bb'}), + (2, {1.0: u'aa'}), + (2, {1.0: u'cc'}), + (3, {2.0:
spark git commit: [SPARK-26014][R] Deprecate R prior to version 3.4 in SparkR
Repository: spark Updated Branches: refs/heads/master 03306a6df -> d4130ec1f [SPARK-26014][R] Deprecate R prior to version 3.4 in SparkR ## What changes were proposed in this pull request? This PR proposes to deprecate R versions prior to 3.4, ahead of bumping the minimum supported R version from 3.1 to 3.4. R 3.1.x is too old: it was released 4.5 years ago, while R 3.4.0 was released 1.5 years ago. Considering the timing for Spark 3.0, deprecating lower versions and bumping the minimum R version to 3.4 is a reasonable option. It should be good to deprecate now and drop support for R < 3.4 later. ## How was this patch tested? Jenkins tests. Closes #23012 from HyukjinKwon/SPARK-26014. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d4130ec1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d4130ec1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d4130ec1 Branch: refs/heads/master Commit: d4130ec1f3461dcc961eee9802005ba7a15212d1 Parents: 03306a6 Author: hyukjinkwon Authored: Thu Nov 15 17:20:49 2018 +0800 Committer: hyukjinkwon Committed: Thu Nov 15 17:20:49 2018 +0800 -- R/WINDOWS.md | 2 +- R/pkg/DESCRIPTION| 2 +- R/pkg/inst/profile/general.R | 4 R/pkg/inst/profile/shell.R | 4 docs/index.md| 3 ++- 5 files changed, 12 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d4130ec1/R/WINDOWS.md -- diff --git a/R/WINDOWS.md b/R/WINDOWS.md index da668a6..33a4c85 100644 --- a/R/WINDOWS.md +++ b/R/WINDOWS.md @@ -3,7 +3,7 @@ To build SparkR on Windows, the following steps are required 1. Install R (>= 3.1) and [Rtools](http://cran.r-project.org/bin/windows/Rtools/). Make sure to -include Rtools and R in `PATH`. +include Rtools and R in `PATH`. Note that support for R prior to version 3.4 is deprecated as of Spark 3.0.0. 2.
Install [JDK8](http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html) and set http://git-wip-us.apache.org/repos/asf/spark/blob/d4130ec1/R/pkg/DESCRIPTION -- diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION index cdaaa61..736da46 100644 --- a/R/pkg/DESCRIPTION +++ b/R/pkg/DESCRIPTION @@ -15,7 +15,7 @@ URL: http://www.apache.org/ http://spark.apache.org/ BugReports: http://spark.apache.org/contributing.html SystemRequirements: Java (== 8) Depends: -R (>= 3.0), +R (>= 3.1), methods Suggests: knitr, http://git-wip-us.apache.org/repos/asf/spark/blob/d4130ec1/R/pkg/inst/profile/general.R -- diff --git a/R/pkg/inst/profile/general.R b/R/pkg/inst/profile/general.R index 8c75c19..3efb460 100644 --- a/R/pkg/inst/profile/general.R +++ b/R/pkg/inst/profile/general.R @@ -16,6 +16,10 @@ # .First <- function() { + if (utils::compareVersion(paste0(R.version$major, ".", R.version$minor), "3.4.0") == -1) { +warning("Support for R prior to version 3.4 is deprecated since Spark 3.0.0") + } + packageDir <- Sys.getenv("SPARKR_PACKAGE_DIR") dirs <- strsplit(packageDir, ",")[[1]] .libPaths(c(dirs, .libPaths())) http://git-wip-us.apache.org/repos/asf/spark/blob/d4130ec1/R/pkg/inst/profile/shell.R -- diff --git a/R/pkg/inst/profile/shell.R b/R/pkg/inst/profile/shell.R index 8a8111a..32eb367 100644 --- a/R/pkg/inst/profile/shell.R +++ b/R/pkg/inst/profile/shell.R @@ -16,6 +16,10 @@ # .First <- function() { + if (utils::compareVersion(paste0(R.version$major, ".", R.version$minor), "3.4.0") == -1) { +warning("Support for R prior to version 3.4 is deprecated since Spark 3.0.0") + } + home <- Sys.getenv("SPARK_HOME") .libPaths(c(file.path(home, "R", "lib"), .libPaths())) Sys.setenv(NOAWT = 1) http://git-wip-us.apache.org/repos/asf/spark/blob/d4130ec1/docs/index.md -- diff --git a/docs/index.md b/docs/index.md index ac38f1d..bd287e3 100644 --- a/docs/index.md +++ b/docs/index.md @@ -31,7 +31,8 @@ Spark runs on both Windows and UNIX-like systems (e.g. 
Linux, Mac OS). It's easy to run locally on one machine --- all you need is to have `java` installed on your system `PATH`, or the `JAVA_HOME` environment variable pointing to a Java installation. -Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark {{site.SPARK_VERSION}} +Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. R prior to version 3.4 support is deprecated as of Spark 3.0.0. +For the Scala API, Spark {{site.SPARK_VERSION}} uses Scala {{site.SCALA_BINARY_VERSION}}. You will need to use a
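The guard added to `general.R` and `shell.R` uses `utils::compareVersion`, a three-way comparison that returns -1, 0, or 1. A pure-Python sketch of the same idea (illustrative only; it compares dot-separated numeric components and assumes equal-length versions behave like R's):

```python
def compare_version(a, b):
    """Three-way compare like R's utils::compareVersion: -1, 0, or 1."""
    pa = tuple(int(x) for x in a.split("."))
    pb = tuple(int(x) for x in b.split("."))
    return (pa > pb) - (pa < pb)

# The .First hook warns exactly when the running R predates 3.4.0:
assert compare_version("3.1.0", "3.4.0") == -1   # would warn
assert compare_version("3.4.0", "3.4.0") == 0    # no warning
assert compare_version("3.5.1", "3.4.0") == 1    # no warning
```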
spark git commit: [SPARK-26013][R][BUILD] Upgrade R tools version from 3.4.0 to 3.5.1 in AppVeyor build
Repository: spark Updated Branches: refs/heads/master 0ba9715c7 -> f9ff75653 [SPARK-26013][R][BUILD] Upgrade R tools version from 3.4.0 to 3.5.1 in AppVeyor build ## What changes were proposed in this pull request? Rtools 3.5.1 was released a few months ago, while Spark currently uses 3.4.0, so we should upgrade it in AppVeyor. ## How was this patch tested? AppVeyor builds. Closes #23011 from HyukjinKwon/SPARK-26013. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f9ff7565 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f9ff7565 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f9ff7565 Branch: refs/heads/master Commit: f9ff75653fa8cd055fbcbfe94243049c38c60507 Parents: 0ba9715 Author: hyukjinkwon Authored: Tue Nov 13 01:21:03 2018 +0800 Committer: hyukjinkwon Committed: Tue Nov 13 01:21:03 2018 +0800 -- dev/appveyor-install-dependencies.ps1 | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f9ff7565/dev/appveyor-install-dependencies.ps1 -- diff --git a/dev/appveyor-install-dependencies.ps1 b/dev/appveyor-install-dependencies.ps1 index 06d9d70..cc68ffb 100644 --- a/dev/appveyor-install-dependencies.ps1 +++ b/dev/appveyor-install-dependencies.ps1 @@ -116,7 +116,7 @@ Pop-Location # == R $rVer = "3.5.1" -$rToolsVer = "3.4.0" +$rToolsVer = "3.5.1" InstallR InstallRtools
spark git commit: [SPARK-24601] Update Jackson to 2.9.6
Repository: spark Updated Branches: refs/heads/master 459700727 -> ab1650d29 [SPARK-24601] Update Jackson to 2.9.6 Hi all, Spark's current Jackson version is incompatible with more recent upstream releases, so this bumps Jackson to a newer one. I ran into some issues with Azure CosmosDB, which uses a more recent version of Jackson. This can be fixed by adding exclusions, and then it works without any issues, so there are no breaking changes in the APIs. I would also suggest keeping dependencies up to date, since otherwise this issue will pop up more frequently in the future. ## What changes were proposed in this pull request? Bump Jackson to 2.9.6. ## How was this patch tested? Compiled and tested locally to see if anything broke. Closes #21596 from Fokko/fd-bump-jackson. Authored-by: Fokko Driesprong Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ab1650d2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ab1650d2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ab1650d2 Branch: refs/heads/master Commit: ab1650d2938db4901b8c28df945d6a0691a19d31 Parents: 4597007 Author: Fokko Driesprong Authored: Fri Oct 5 16:40:08 2018 +0800 Committer: hyukjinkwon Committed: Fri Oct 5 16:40:08 2018 +0800 -- .../deploy/rest/SubmitRestProtocolMessage.scala | 2 +- .../apache/spark/rdd/RDDOperationScope.scala| 2 +- .../scala/org/apache/spark/status/KVUtils.scala | 2 +- .../status/api/v1/JacksonMessageWriter.scala| 2 +- .../org/apache/spark/status/api/v1/api.scala| 3 ++ dev/deps/spark-deps-hadoop-2.6 | 16 +- dev/deps/spark-deps-hadoop-2.7 | 16 +- dev/deps/spark-deps-hadoop-3.1 | 16 +- pom.xml | 7 ++--- .../expressions/JsonExpressionsSuite.scala | 7 + .../datasources/json/JsonBenchmarks.scala | 33 +++- 11 files changed, 59 insertions(+), 47 deletions(-) --
http://git-wip-us.apache.org/repos/asf/spark/blob/ab1650d2/core/src/main/scala/org/apache/spark/deploy/rest/SubmitRestProtocolMessage.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/rest/SubmitRestProtocolMessage.scala b/core/src/main/scala/org/apache/spark/deploy/rest/SubmitRestProtocolMessage.scala index ef5a7e3..97b689c 100644 --- a/core/src/main/scala/org/apache/spark/deploy/rest/SubmitRestProtocolMessage.scala +++ b/core/src/main/scala/org/apache/spark/deploy/rest/SubmitRestProtocolMessage.scala @@ -36,7 +36,7 @@ import org.apache.spark.util.Utils * (2) the Spark version of the client / server * (3) an optional message */ -@JsonInclude(Include.NON_NULL) +@JsonInclude(Include.NON_ABSENT) @JsonAutoDetect(getterVisibility = Visibility.ANY, setterVisibility = Visibility.ANY) @JsonPropertyOrder(alphabetic = true) private[rest] abstract class SubmitRestProtocolMessage { http://git-wip-us.apache.org/repos/asf/spark/blob/ab1650d2/core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala -- diff --git a/core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala b/core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala index 53d69ba..3abb2d8 100644 --- a/core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala +++ b/core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala @@ -41,7 +41,7 @@ import org.apache.spark.internal.Logging * There is no particular relationship between an operation scope and a stage or a job. * A scope may live inside one stage (e.g. map) or span across multiple jobs (e.g. take). 
*/ -@JsonInclude(Include.NON_NULL) +@JsonInclude(Include.NON_ABSENT) @JsonPropertyOrder(Array("id", "name", "parent")) private[spark] class RDDOperationScope( val name: String, http://git-wip-us.apache.org/repos/asf/spark/blob/ab1650d2/core/src/main/scala/org/apache/spark/status/KVUtils.scala -- diff --git a/core/src/main/scala/org/apache/spark/status/KVUtils.scala b/core/src/main/scala/org/apache/spark/status/KVUtils.scala index 99b1843..45348be 100644 --- a/core/src/main/scala/org/apache/spark/status/KVUtils.scala +++ b/core/src/main/scala/org/apache/spark/status/KVUtils.scala @@ -42,7 +42,7 @@ private[spark] object KVUtils extends Logging { private[spark] class KVStoreScalaSerializer extends KVStoreSerializer { mapper.registerModule(DefaultScalaModule) -
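The diff above swaps `Include.NON_NULL` for `Include.NON_ABSENT` in several annotations. The distinction these names draw is between a literal null and an "absent" wrapper value (such as an empty `Optional`, or a Scala `None` inside an `Option`): `NON_ABSENT` excludes both. A pure-Python analogy of that behavior, not Jackson's actual implementation — `ABSENT` and `serialize` are hypothetical names:

```python
# ABSENT stands in for an "absent" wrapper value (e.g. an empty Optional),
# as opposed to a literal null (None here).
ABSENT = object()

def serialize(record, include):
    if include == "NON_NULL":    # skips literal nulls only
        return {k: v for k, v in record.items() if v is not None}
    if include == "NON_ABSENT":  # skips nulls AND absent wrapper values
        return {k: v for k, v in record.items()
                if v is not None and v is not ABSENT}
    return dict(record)

rec = {"id": 1, "name": None, "parent": ABSENT}
assert serialize(rec, "NON_NULL") == {"id": 1, "parent": ABSENT}
assert serialize(rec, "NON_ABSENT") == {"id": 1}
```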
spark git commit: [SPARK-25659][PYTHON][TEST] Test type inference specification for createDataFrame in PySpark
Repository: spark Updated Branches: refs/heads/master f9935a3f8 -> f3fed2823 [SPARK-25659][PYTHON][TEST] Test type inference specification for createDataFrame in PySpark ## What changes were proposed in this pull request? This PR proposes to specify the type inference behavior and add simple end-to-end tests, since it looks like we are not cleanly testing this logic. For instance, see https://github.com/apache/spark/blob/08c76b5d39127ae207d9d1fff99c2551e6ce2581/python/pyspark/sql/types.py#L894-L905 It looks like we intended to support datetime.time and None in type inference too, but neither works: ``` >>> spark.createDataFrame([[datetime.time()]]) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/session.py", line 751, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, data), schema) File "/.../spark/python/pyspark/sql/session.py", line 432, in _createFromLocal data = [schema.toInternal(row) for row in data] File "/.../spark/python/pyspark/sql/types.py", line 604, in toInternal for f, v, c in zip(self.fields, obj, self._needConversion)) File "/.../spark/python/pyspark/sql/types.py", line 604, in <genexpr> for f, v, c in zip(self.fields, obj, self._needConversion)) File "/.../spark/python/pyspark/sql/types.py", line 442, in toInternal return self.dataType.toInternal(obj) File "/.../spark/python/pyspark/sql/types.py", line 193, in toInternal else time.mktime(dt.timetuple())) AttributeError: 'datetime.time' object has no attribute 'timetuple' >>> spark.createDataFrame([[None]]) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/session.py", line 751, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, data), schema) File "/.../spark/python/pyspark/sql/session.py", line 419, in _createFromLocal struct = self._inferSchemaFromList(data, names=schema) File "/.../python/pyspark/sql/session.py", line 353, in _inferSchemaFromList raise ValueError("Some of types cannot be determined after inferring") ValueError:
Some of types cannot be determined after inferring ``` ## How was this patch tested? Manual tests and unit tests were added. Closes #22653 from HyukjinKwon/SPARK-25659. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f3fed282 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f3fed282 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f3fed282 Branch: refs/heads/master Commit: f3fed28230e4e5e08d182715e8cf901daf8f3b73 Parents: f9935a3 Author: hyukjinkwon Authored: Tue Oct 9 07:45:02 2018 +0800 Committer: hyukjinkwon Committed: Tue Oct 9 07:45:02 2018 +0800 -- python/pyspark/sql/tests.py | 69 1 file changed, 69 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f3fed282/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index ac87ccd..85712df 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -1149,6 +1149,75 @@ class SQLTests(ReusedSQLTestCase): result = self.spark.sql("SELECT l[0].a from test2 where d['key'].d = '2'") self.assertEqual(1, result.head()[0]) +def test_infer_schema_specification(self): +from decimal import Decimal + +class A(object): +def __init__(self): +self.a = 1 + +data = [ +True, +1, +"a", +u"a", +datetime.date(1970, 1, 1), +datetime.datetime(1970, 1, 1, 0, 0), +1.0, +array.array("d", [1]), +[1], +(1, ), +{"a": 1}, +bytearray(1), +Decimal(1), +Row(a=1), +Row("a")(1), +A(), +] + +df = self.spark.createDataFrame([data]) +actual = list(map(lambda x: x.dataType.simpleString(), df.schema)) +expected = [ +'boolean', +'bigint', +'string', +'string', +'date', +'timestamp', +'double', +'array', +'array', +'struct<_1:bigint>', +'map', +'binary', +'decimal(38,18)', +'struct', +'struct', +'struct', +] +self.assertEqual(actual, expected) + +actual = list(df.first()) +expected = [ +True, +1, +'a', +u"a", +datetime.date(1970, 1, 1), 
+datetime.datetime(1970, 1, 1, 0, 0), +1.0, +[1.0],
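The expected schema strings in the test above follow PySpark's value-to-type mapping. A simplified pure-Python sketch of that mapping — not PySpark's actual `_infer_type`, and covering only the exact scalar types from the test:

```python
import datetime
import decimal

# Exact-type lookup: `type(v) is t` keeps datetime.datetime (a subclass of
# datetime.date) from accidentally matching 'date'.
_SIMPLE_TYPES = {
    bool: "boolean",
    int: "bigint",
    str: "string",
    float: "double",
    bytearray: "binary",
    datetime.date: "date",
    datetime.datetime: "timestamp",
    decimal.Decimal: "decimal(38,18)",
}

def infer_simple_type(value):
    for t, name in _SIMPLE_TYPES.items():
        if type(value) is t:
            return name
    raise TypeError("not a simple scalar: %r" % (value,))

assert infer_simple_type(True) == "boolean"
assert infer_simple_type(1.0) == "double"
assert infer_simple_type(datetime.date(1970, 1, 1)) == "date"
assert infer_simple_type(datetime.datetime(1970, 1, 1)) == "timestamp"
```

Containers (`array`, `list`, `tuple`, `dict`, `Row`, plain objects) recurse into element types in the real implementation, producing the `array<...>`, `map<...>`, and `struct<...>` strings the test expects.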
spark git commit: [SPARK-25669][SQL] Check CSV header only when it exists
Repository: spark Updated Branches: refs/heads/branch-2.4 4baa4d42a -> 404c84039 [SPARK-25669][SQL] Check CSV header only when it exists ## What changes were proposed in this pull request? Currently the first row of dataset of CSV strings is compared to field names of user specified or inferred schema independently of presence of CSV header. It causes false-positive error messages. For example, parsing `"1,2"` outputs the error: ```java java.lang.IllegalArgumentException: CSV header does not conform to the schema. Header: 1, 2 Schema: _c0, _c1 Expected: _c0 but found: 1 ``` In the PR, I propose: - Checking CSV header only when it exists - Filter header from the input dataset only if it exists ## How was this patch tested? Added a test to `CSVSuite` which reproduces the issue. Closes #22656 from MaxGekk/inferred-header-check. Authored-by: Maxim Gekk Signed-off-by: hyukjinkwon (cherry picked from commit 46fe40838aa682a7073dd6f1373518b0c8498a94) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/404c8403 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/404c8403 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/404c8403 Branch: refs/heads/branch-2.4 Commit: 404c840393086290cf975652f596b4768aa5d4eb Parents: 4baa4d4 Author: Maxim Gekk Authored: Tue Oct 9 14:35:00 2018 +0800 Committer: hyukjinkwon Committed: Tue Oct 9 14:36:33 2018 +0800 -- .../src/main/scala/org/apache/spark/sql/DataFrameReader.scala | 7 +-- .../apache/spark/sql/execution/datasources/csv/CSVSuite.scala | 6 ++ 2 files changed, 11 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/404c8403/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala index 27a1af2..869c584 100644 --- 
a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala @@ -505,7 +505,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { val actualSchema = StructType(schema.filterNot(_.name == parsedOptions.columnNameOfCorruptRecord)) -val linesWithoutHeader: RDD[String] = maybeFirstLine.map { firstLine => +val linesWithoutHeader = if (parsedOptions.headerFlag && maybeFirstLine.isDefined) { + val firstLine = maybeFirstLine.get val parser = new CsvParser(parsedOptions.asParserSettings) val columnNames = parser.parseLine(firstLine) CSVDataSource.checkHeaderColumnNames( @@ -515,7 +516,9 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { parsedOptions.enforceSchema, sparkSession.sessionState.conf.caseSensitiveAnalysis) filteredLines.rdd.mapPartitions(CSVUtils.filterHeaderLine(_, firstLine, parsedOptions)) -}.getOrElse(filteredLines.rdd) +} else { + filteredLines.rdd +} val parsed = linesWithoutHeader.mapPartitions { iter => val rawParser = new UnivocityParser(actualSchema, parsedOptions) http://git-wip-us.apache.org/repos/asf/spark/blob/404c8403/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala index f70df0b..5d4746c 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala @@ -1820,4 +1820,10 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te checkAnswer(spark.read.option("multiLine", true).schema(schema).csv(input), Row(null)) assert(spark.read.csv(input).collect().toSet == Set(Row())) } + + test("field names of inferred schema shouldn't compare 
to the first row") { +val input = Seq("1,2").toDS() +val df = spark.read.option("enforceSchema", false).csv(input) +checkAnswer(df, Row("1", "2")) + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25669][SQL] Check CSV header only when it exists
Repository: spark Updated Branches: refs/heads/master a4b14a9cf -> 46fe40838 [SPARK-25669][SQL] Check CSV header only when it exists ## What changes were proposed in this pull request? Currently the first row of dataset of CSV strings is compared to field names of user specified or inferred schema independently of presence of CSV header. It causes false-positive error messages. For example, parsing `"1,2"` outputs the error: ```java java.lang.IllegalArgumentException: CSV header does not conform to the schema. Header: 1, 2 Schema: _c0, _c1 Expected: _c0 but found: 1 ``` In the PR, I propose: - Checking CSV header only when it exists - Filter header from the input dataset only if it exists ## How was this patch tested? Added a test to `CSVSuite` which reproduces the issue. Closes #22656 from MaxGekk/inferred-header-check. Authored-by: Maxim Gekk Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/46fe4083 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/46fe4083 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/46fe4083 Branch: refs/heads/master Commit: 46fe40838aa682a7073dd6f1373518b0c8498a94 Parents: a4b14a9 Author: Maxim Gekk Authored: Tue Oct 9 14:35:00 2018 +0800 Committer: hyukjinkwon Committed: Tue Oct 9 14:35:00 2018 +0800 -- .../src/main/scala/org/apache/spark/sql/DataFrameReader.scala | 7 +-- .../apache/spark/sql/execution/datasources/csv/CSVSuite.scala | 6 ++ 2 files changed, 11 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/46fe4083/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala index fe69f25..7269446 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala +++ 
b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala @@ -505,7 +505,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { val actualSchema = StructType(schema.filterNot(_.name == parsedOptions.columnNameOfCorruptRecord)) -val linesWithoutHeader: RDD[String] = maybeFirstLine.map { firstLine => +val linesWithoutHeader = if (parsedOptions.headerFlag && maybeFirstLine.isDefined) { + val firstLine = maybeFirstLine.get val parser = new CsvParser(parsedOptions.asParserSettings) val columnNames = parser.parseLine(firstLine) CSVDataSource.checkHeaderColumnNames( @@ -515,7 +516,9 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { parsedOptions.enforceSchema, sparkSession.sessionState.conf.caseSensitiveAnalysis) filteredLines.rdd.mapPartitions(CSVUtils.filterHeaderLine(_, firstLine, parsedOptions)) -}.getOrElse(filteredLines.rdd) +} else { + filteredLines.rdd +} val parsed = linesWithoutHeader.mapPartitions { iter => val rawParser = new UnivocityParser(actualSchema, parsedOptions) http://git-wip-us.apache.org/repos/asf/spark/blob/46fe4083/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala index f70df0b..5d4746c 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala @@ -1820,4 +1820,10 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te checkAnswer(spark.read.option("multiLine", true).schema(schema).csv(input), Row(null)) assert(spark.read.csv(input).collect().toSet == Set(Row())) } + + test("field names of inferred schema shouldn't compare to the first row") { +val input = Seq("1,2").toDS() +val df = 
spark.read.option("enforceSchema", false).csv(input) +checkAnswer(df, Row("1", "2")) + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
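The control-flow change above is easier to see outside of Spark. The following is a minimal, pure-Python sketch of the fixed logic (function and variable names are illustrative, not Spark's actual API): the header row is validated against the schema, and filtered out of the input, only when the `header` option is enabled.

```python
def lines_without_header(lines, schema, header_flag, enforce_schema=True):
    """Validate and drop the CSV header only if one actually exists."""
    if header_flag and lines:
        header_cols = lines[0].split(",")
        if enforce_schema and header_cols != schema:
            # Mirrors the IllegalArgumentException shown in the PR description.
            raise ValueError(
                "CSV header does not conform to the schema. Header: %s Schema: %s"
                % (", ".join(header_cols), ", ".join(schema)))
        return lines[1:]  # drop the header row from the data
    # No header: nothing to check and nothing to filter -- this branch is the fix.
    return lines

# Before the fix, the check ran unconditionally, so headerless input "1,2" was
# compared against the inferred names _c0,_c1 and failed with a false positive.
assert lines_without_header(["1,2"], ["_c0", "_c1"], header_flag=False) == ["1,2"]
assert lines_without_header(["a,b", "1,2"], ["a", "b"], header_flag=True) == ["1,2"]
```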
spark git commit: [SPARK-23401][PYTHON][TESTS] Add more data types for PandasUDFTests
Repository: spark Updated Branches: refs/heads/branch-2.4 82990e5ef -> 426c2bd35 [SPARK-23401][PYTHON][TESTS] Add more data types for PandasUDFTests ## What changes were proposed in this pull request? Add more data types for Pandas UDF Tests for PySpark SQL ## How was this patch tested? manual tests Closes #22568 from AlexanderKoryagin/new_types_for_pandas_udf_tests. Lead-authored-by: Aleksandr Koriagin Co-authored-by: hyukjinkwon Co-authored-by: Alexander Koryagin Signed-off-by: hyukjinkwon (cherry picked from commit 30f5d0f2ddfe56266ea81e4255f9b4f373dab237) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/426c2bd3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/426c2bd3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/426c2bd3 Branch: refs/heads/branch-2.4 Commit: 426c2bd35937add1a26e77d2f2879f0e3f0c2f45 Parents: 82990e5 Author: Aleksandr Koriagin Authored: Mon Oct 1 17:18:45 2018 +0800 Committer: hyukjinkwon Committed: Mon Oct 1 17:19:00 2018 +0800 -- python/pyspark/sql/tests.py | 107 +-- 1 file changed, 79 insertions(+), 28 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/426c2bd3/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index dece1da..690035a 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -5478,32 +5478,81 @@ class GroupedMapPandasUDFTests(ReusedSQLTestCase): .withColumn("v", explode(col('vs'))).drop('vs') def test_supported_types(self): -from pyspark.sql.functions import pandas_udf, PandasUDFType, array, col -df = self.data.withColumn("arr", array(col("id"))) +from decimal import Decimal +from distutils.version import LooseVersion +import pyarrow as pa +from pyspark.sql.functions import pandas_udf, PandasUDFType -# Different forms of group map pandas UDF, results of these are the same +values = [ +1, 2, 3, +4, 5, 1.1, +2.2, Decimal(1.123), 
+[1, 2, 2], True, 'hello' +] +output_fields = [ +('id', IntegerType()), ('byte', ByteType()), ('short', ShortType()), +('int', IntegerType()), ('long', LongType()), ('float', FloatType()), +('double', DoubleType()), ('decim', DecimalType(10, 3)), +('array', ArrayType(IntegerType())), ('bool', BooleanType()), ('str', StringType()) +] -output_schema = StructType( -[StructField('id', LongType()), - StructField('v', IntegerType()), - StructField('arr', ArrayType(LongType())), - StructField('v1', DoubleType()), - StructField('v2', LongType())]) +# TODO: Add BinaryType to variables above once minimum pyarrow version is 0.10.0 +if LooseVersion(pa.__version__) >= LooseVersion("0.10.0"): +values.append(bytearray([0x01, 0x02])) +output_fields.append(('bin', BinaryType())) +output_schema = StructType([StructField(*x) for x in output_fields]) +df = self.spark.createDataFrame([values], schema=output_schema) + +# Different forms of group map pandas UDF, results of these are the same udf1 = pandas_udf( -lambda pdf: pdf.assign(v1=pdf.v * pdf.id * 1.0, v2=pdf.v + pdf.id), +lambda pdf: pdf.assign( +byte=pdf.byte * 2, +short=pdf.short * 2, +int=pdf.int * 2, +long=pdf.long * 2, +float=pdf.float * 2, +double=pdf.double * 2, +decim=pdf.decim * 2, +bool=False if pdf.bool else True, +str=pdf.str + 'there', +array=pdf.array, +), output_schema, PandasUDFType.GROUPED_MAP ) udf2 = pandas_udf( -lambda _, pdf: pdf.assign(v1=pdf.v * pdf.id * 1.0, v2=pdf.v + pdf.id), +lambda _, pdf: pdf.assign( +byte=pdf.byte * 2, +short=pdf.short * 2, +int=pdf.int * 2, +long=pdf.long * 2, +float=pdf.float * 2, +double=pdf.double * 2, +decim=pdf.decim * 2, +bool=False if pdf.bool else True, +str=pdf.str + 'there', +array=pdf.array, +), output_schema, PandasUDFType.GROUPED_MAP ) udf3 = pandas_udf( -lambda key, pdf: pdf.assign(id=key[0], v1=pdf.v * pdf.id * 1.0, v2=pdf.v + pdf.id), +lambda
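The version gate in the test above (only exercising `BinaryType` once the installed pyarrow is at least 0.10.0, per the TODO in the diff) is a reusable pattern. A small stand-alone sketch, with a simple tuple comparison standing in for `LooseVersion` and illustrative field names:

```python
def parse_version(v):
    """Parse 'X.Y.Z' into a comparable tuple -- a stand-in for LooseVersion."""
    return tuple(int(p) for p in v.split(".")[:3])

def build_test_columns(pyarrow_version):
    values = [1, 2.2, True, "hello"]
    fields = [("int", "int"), ("double", "double"),
              ("bool", "boolean"), ("str", "string")]
    # Mirror the TODO in the diff: add a BinaryType column only once the
    # installed pyarrow is at least 0.10.0.
    if parse_version(pyarrow_version) >= (0, 10, 0):
        values.append(bytearray([0x01, 0x02]))
        fields.append(("bin", "binary"))
    return values, fields

values, fields = build_test_columns("0.8.0")
assert ("bin", "binary") not in fields       # old pyarrow: binary column skipped
values, fields = build_test_columns("0.10.0")
assert fields[-1] == ("bin", "binary")       # new pyarrow: binary column exercised
```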
spark git commit: [SPARK-23401][PYTHON][TESTS] Add more data types for PandasUDFTests
Repository: spark Updated Branches: refs/heads/master 21f0b73db -> 30f5d0f2d [SPARK-23401][PYTHON][TESTS] Add more data types for PandasUDFTests ## What changes were proposed in this pull request? Add more data types for Pandas UDF Tests for PySpark SQL ## How was this patch tested? manual tests Closes #22568 from AlexanderKoryagin/new_types_for_pandas_udf_tests. Lead-authored-by: Aleksandr Koriagin Co-authored-by: hyukjinkwon Co-authored-by: Alexander Koryagin Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/30f5d0f2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/30f5d0f2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/30f5d0f2 Branch: refs/heads/master Commit: 30f5d0f2ddfe56266ea81e4255f9b4f373dab237 Parents: 21f0b73 Author: Aleksandr Koriagin Authored: Mon Oct 1 17:18:45 2018 +0800 Committer: hyukjinkwon Committed: Mon Oct 1 17:18:45 2018 +0800 -- python/pyspark/sql/tests.py | 107 +-- 1 file changed, 79 insertions(+), 28 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/30f5d0f2/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index b88a655..815772d 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -5525,32 +5525,81 @@ class GroupedMapPandasUDFTests(ReusedSQLTestCase): .withColumn("v", explode(col('vs'))).drop('vs') def test_supported_types(self): -from pyspark.sql.functions import pandas_udf, PandasUDFType, array, col -df = self.data.withColumn("arr", array(col("id"))) +from decimal import Decimal +from distutils.version import LooseVersion +import pyarrow as pa +from pyspark.sql.functions import pandas_udf, PandasUDFType -# Different forms of group map pandas UDF, results of these are the same +values = [ +1, 2, 3, +4, 5, 1.1, +2.2, Decimal(1.123), +[1, 2, 2], True, 'hello' +] +output_fields = [ +('id', IntegerType()), ('byte', ByteType()), ('short', 
ShortType()), +('int', IntegerType()), ('long', LongType()), ('float', FloatType()), +('double', DoubleType()), ('decim', DecimalType(10, 3)), +('array', ArrayType(IntegerType())), ('bool', BooleanType()), ('str', StringType()) +] -output_schema = StructType( -[StructField('id', LongType()), - StructField('v', IntegerType()), - StructField('arr', ArrayType(LongType())), - StructField('v1', DoubleType()), - StructField('v2', LongType())]) +# TODO: Add BinaryType to variables above once minimum pyarrow version is 0.10.0 +if LooseVersion(pa.__version__) >= LooseVersion("0.10.0"): +values.append(bytearray([0x01, 0x02])) +output_fields.append(('bin', BinaryType())) +output_schema = StructType([StructField(*x) for x in output_fields]) +df = self.spark.createDataFrame([values], schema=output_schema) + +# Different forms of group map pandas UDF, results of these are the same udf1 = pandas_udf( -lambda pdf: pdf.assign(v1=pdf.v * pdf.id * 1.0, v2=pdf.v + pdf.id), +lambda pdf: pdf.assign( +byte=pdf.byte * 2, +short=pdf.short * 2, +int=pdf.int * 2, +long=pdf.long * 2, +float=pdf.float * 2, +double=pdf.double * 2, +decim=pdf.decim * 2, +bool=False if pdf.bool else True, +str=pdf.str + 'there', +array=pdf.array, +), output_schema, PandasUDFType.GROUPED_MAP ) udf2 = pandas_udf( -lambda _, pdf: pdf.assign(v1=pdf.v * pdf.id * 1.0, v2=pdf.v + pdf.id), +lambda _, pdf: pdf.assign( +byte=pdf.byte * 2, +short=pdf.short * 2, +int=pdf.int * 2, +long=pdf.long * 2, +float=pdf.float * 2, +double=pdf.double * 2, +decim=pdf.decim * 2, +bool=False if pdf.bool else True, +str=pdf.str + 'there', +array=pdf.array, +), output_schema, PandasUDFType.GROUPED_MAP ) udf3 = pandas_udf( -lambda key, pdf: pdf.assign(id=key[0], v1=pdf.v * pdf.id * 1.0, v2=pdf.v + pdf.id), +lambda key, pdf: pdf.assign( +id=key[0], +byte=pdf.byte * 2, +
spark git commit: [SPARK-25048][SQL] Pivoting by multiple columns in Scala/Java
Repository: spark Updated Branches: refs/heads/master dcb9a97f3 -> 623c2ec4e [SPARK-25048][SQL] Pivoting by multiple columns in Scala/Java ## What changes were proposed in this pull request? In the PR, I propose to extend the implementation of the existing method: ``` def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset ``` to support values of the struct type. This allows pivoting by multiple columns combined by `struct`: ``` trainingSales .groupBy($"sales.year") .pivot( pivotColumn = struct(lower($"sales.course"), $"training"), values = Seq( struct(lit("dotnet"), lit("Experts")), struct(lit("java"), lit("Dummies"))) ).agg(sum($"sales.earnings")) ``` ## How was this patch tested? Added a test for values specified via `struct` in Java and Scala. Closes #22316 from MaxGekk/pivoting-by-multiple-columns2. Lead-authored-by: Maxim Gekk Co-authored-by: Maxim Gekk Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/623c2ec4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/623c2ec4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/623c2ec4 Branch: refs/heads/master Commit: 623c2ec4ef3776bc5e2cac2c66300ddc6264db54 Parents: dcb9a97 Author: Maxim Gekk Authored: Sat Sep 29 21:50:35 2018 +0800 Committer: hyukjinkwon Committed: Sat Sep 29 21:50:35 2018 +0800 -- .../spark/sql/RelationalGroupedDataset.scala| 17 +-- .../apache/spark/sql/JavaDataFrameSuite.java| 16 ++ .../apache/spark/sql/DataFramePivotSuite.scala | 23 3 files changed, 54 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/623c2ec4/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala index d700fb8..dbacdbf 100644 --- 
a/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala @@ -330,6 +330,15 @@ class RelationalGroupedDataset protected[sql]( * df.groupBy("year").pivot("course").sum("earnings") * }}} * + * From Spark 2.5.0, values can be literal columns, for instance, struct. For pivoting by + * multiple columns, use the `struct` function to combine the columns and values: + * + * {{{ + * df.groupBy("year") + * .pivot("trainingCourse", Seq(struct(lit("java"), lit("Experts" + * .agg(sum($"earnings")) + * }}} + * * @param pivotColumn Name of the column to pivot. * @param values List of values that will be translated to columns in the output DataFrame. * @since 1.6.0 @@ -413,10 +422,14 @@ class RelationalGroupedDataset protected[sql]( def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset = { groupType match { case RelationalGroupedDataset.GroupByType => +val valueExprs = values.map(_ match { + case c: Column => c.expr + case v => Literal.apply(v) +}) new RelationalGroupedDataset( df, groupingExprs, - RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(Literal.apply))) + RelationalGroupedDataset.PivotType(pivotColumn.expr, valueExprs)) case _: RelationalGroupedDataset.PivotType => throw new UnsupportedOperationException("repeated pivots are not supported") case _ => @@ -561,5 +574,5 @@ private[sql] object RelationalGroupedDataset { /** * To indicate it's the PIVOT */ - private[sql] case class PivotType(pivotCol: Expression, values: Seq[Literal]) extends GroupType + private[sql] case class PivotType(pivotCol: Expression, values: Seq[Expression]) extends GroupType } http://git-wip-us.apache.org/repos/asf/spark/blob/623c2ec4/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java -- diff --git a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java 
b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java index 3f37e58..00f41d6 100644 --- a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java +++ b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java @@ -317,6 +317,22 @@ public class JavaDataFrameSuite { Assert.assertEquals(3.0, actual.get(1).getDouble(2), 0.01); } + @Test + public void pivotColumnValues() { +Dataset df = spark.table("courseSales"); +List actual =
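Conceptually, pivoting by a `struct` of several columns means the pivot key is a tuple rather than a single value. A rough pure-Python sketch of those semantics, with toy data modeled on the `trainingSales` example from the commit message (not Spark code):

```python
rows = [
    ("dotnet", "Experts", 2012, 10000.0),
    ("java",   "Dummies", 2012, 20000.0),
    ("dotnet", "Experts", 2013, 48000.0),
]
# Composite pivot values, analogous to Seq(struct(lit("dotnet"), lit("Experts")), ...)
pivot_values = [("dotnet", "Experts"), ("java", "Dummies")]

table = {}
for course, training, year, earnings in rows:
    key = (course.lower(), training)   # like struct(lower($"sales.course"), $"training")
    if key in pivot_values:            # only requested pivot values become columns
        row = table.setdefault(year, {v: 0.0 for v in pivot_values})
        row[key] += earnings           # the sum($"sales.earnings") aggregate

assert table[2012][("dotnet", "Experts")] == 10000.0
assert table[2012][("java", "Dummies")] == 20000.0
```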
spark git commit: [SPARK-25262][DOC][FOLLOWUP] Fix link tags in html table
Repository: spark Updated Branches: refs/heads/branch-2.4 ec2c17abf -> a14306b1d [SPARK-25262][DOC][FOLLOWUP] Fix link tags in html table ## What changes were proposed in this pull request? Markdown links are not working inside an HTML table. We should use HTML link tags instead. ## How was this patch tested? Verified in IntelliJ IDEA's markdown editor and online markdown editor. Closes #22588 from viirya/SPARK-25262-followup. Authored-by: Liang-Chi Hsieh Signed-off-by: hyukjinkwon (cherry picked from commit dcb9a97f3e16d4645529ac619c3197fcba1c9806) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a14306b1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a14306b1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a14306b1 Branch: refs/heads/branch-2.4 Commit: a14306b1d5a135cff0441c1c953032d0c6a51c47 Parents: ec2c17a Author: Liang-Chi Hsieh Authored: Sat Sep 29 18:18:37 2018 +0800 Committer: hyukjinkwon Committed: Sat Sep 29 18:18:52 2018 +0800 -- docs/running-on-kubernetes.md | 8 1 file changed, 4 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a14306b1/docs/running-on-kubernetes.md -- diff --git a/docs/running-on-kubernetes.md b/docs/running-on-kubernetes.md index fc7c9a5..f19aa41 100644 --- a/docs/running-on-kubernetes.md +++ b/docs/running-on-kubernetes.md @@ -667,15 +667,15 @@ specific to Spark on Kubernetes. spark.kubernetes.driver.limit.cores (none) -Specify a hard cpu [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for the driver pod. +Specify a hard cpu <a href="https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container">limit</a> for the driver pod. spark.kubernetes.executor.request.cores (none) -Specify the cpu request for each executor pod.
Values conform to the Kubernetes [convention](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu). -Example values include 0.1, 500m, 1.5, 5, etc., with the definition of cpu units documented in [CPU units](https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/#cpu-units). +Specify the cpu request for each executor pod. Values conform to the Kubernetes <a href="https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu">convention</a>. +Example values include 0.1, 500m, 1.5, 5, etc., with the definition of cpu units documented in <a href="https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/#cpu-units">CPU units</a>. This is distinct from spark.executor.cores: it is only used and takes precedence over spark.executor.cores for specifying the executor pod cpu request if set. Task parallelism, e.g., number of tasks an executor can run concurrently is not affected by this. @@ -684,7 +684,7 @@ specific to Spark on Kubernetes. spark.kubernetes.executor.limit.cores (none) -Specify a hard cpu [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for each executor pod launched for the Spark Application. +Specify a hard cpu <a href="https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container">limit</a> for each executor pod launched for the Spark Application.
spark git commit: [SPARK-25447][SQL] Support JSON options by schema_of_json()
Repository: spark Updated Branches: refs/heads/master 1e437835e -> 1007cae20 [SPARK-25447][SQL] Support JSON options by schema_of_json() ## What changes were proposed in this pull request? In the PR, I propose to extend the `schema_of_json()` function to accept JSON options, since they can impact schema inference. The purpose is to support the same options that `from_json` can use during schema inference. ## How was this patch tested? Added SQL, Python and Scala tests (`JsonExpressionsSuite` and `JsonFunctionsSuite`) that check JSON options are used. Closes #22442 from MaxGekk/schema_of_json-options. Authored-by: Maxim Gekk Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1007cae2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1007cae2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1007cae2 Branch: refs/heads/master Commit: 1007cae20e8f566e7d7c25f0f81c9b84f352b6d5 Parents: 1e43783 Author: Maxim Gekk Authored: Sat Sep 29 17:53:30 2018 +0800 Committer: hyukjinkwon Committed: Sat Sep 29 17:53:30 2018 +0800 -- python/pyspark/sql/functions.py | 11 ++-- .../catalyst/expressions/jsonExpressions.scala | 28 +++- .../expressions/JsonExpressionsSuite.scala | 12 +++-- .../scala/org/apache/spark/sql/functions.scala | 15 +++ .../sql-tests/inputs/json-functions.sql | 4 +++ .../sql-tests/results/json-functions.sql.out| 18 - .../apache/spark/sql/JsonFunctionsSuite.scala | 8 ++ 7 files changed, 85 insertions(+), 11 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1007cae2/python/pyspark/sql/functions.py -- diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index e5bc1ea..74f0685 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -2348,11 +2348,15 @@ def to_json(col, options={}): @ignore_unicode_prefix @since(2.4) -def schema_of_json(col): +def schema_of_json(col, options={}): """ Parses a 
column containing a JSON string and infers its schema in DDL format. :param col: string column in json format +:param options: options to control parsing. accepts the same options as the JSON datasource + +.. versionchanged:: 2.5 + It accepts `options` parameter to control schema inferring. >>> from pyspark.sql.types import * >>> data = [(1, '{"a": 1}')] @@ -2361,10 +2365,13 @@ def schema_of_json(col): [Row(json=u'struct')] >>> df.select(schema_of_json(lit('{"a": 0}')).alias("json")).collect() [Row(json=u'struct')] +>>> schema = schema_of_json(lit('{a: 1}'), {'allowUnquotedFieldNames':'true'}) +>>> df.select(schema.alias("json")).collect() +[Row(json=u'struct')] """ sc = SparkContext._active_spark_context -jc = sc._jvm.functions.schema_of_json(_to_java_column(col)) +jc = sc._jvm.functions.schema_of_json(_to_java_column(col), options) return Column(jc) http://git-wip-us.apache.org/repos/asf/spark/blob/1007cae2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala index bd9090a..f5297dd 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala @@ -740,15 +740,31 @@ case class StructsToJson( examples = """ Examples: > SELECT _FUNC_('[{"col":0}]'); - array> + array> + > SELECT _FUNC_('[{"col":01}]', map('allowNumericLeadingZeros', 'true')); + array> """, since = "2.4.0") -case class SchemaOfJson(child: Expression) +case class SchemaOfJson( +child: Expression, +options: Map[String, String]) extends UnaryExpression with String2StringExpression with CodegenFallback { - private val jsonOptions = new JSONOptions(Map.empty, "UTC") - private val jsonFactory = new JsonFactory() - 
jsonOptions.setJacksonOptions(jsonFactory) + def this(child: Expression) = this(child, Map.empty[String, String]) + + def this(child: Expression, options: Expression) = this( + child = child, + options = JsonExprUtils.convertToMapData(options)) + + @transient + private lazy val jsonOptions = new JSONOptions(options, "UTC") + + @transient + private lazy val jsonFactory = { +val factory = new JsonFactory() +
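As a rough illustration of what `schema_of_json` computes, here is a toy, stdlib-only analogue that infers a DDL-style struct schema from one flat JSON record. The real function additionally threads the `options` map into the Jackson parser before inference (e.g. `allowUnquotedFieldNames`), which the stdlib `json` module cannot emulate, so the parameter is accepted but unused in this sketch:

```python
import json

# Map Python-side JSON types to Spark SQL DDL type names (JSON integers
# are inferred as bigint by Spark's schema inference).
TYPE_NAMES = {bool: "boolean", int: "bigint", float: "double", str: "string"}

def schema_of_json(s, options=None):
    """Toy schema inference for a single flat JSON record (options unused here)."""
    obj = json.loads(s)  # Spark parses with a Jackson parser configured by `options`
    fields = ",".join("%s:%s" % (k, TYPE_NAMES[type(v)]) for k, v in obj.items())
    return "struct<%s>" % fields

assert schema_of_json('{"a": 1}') == "struct<a:bigint>"
assert schema_of_json('{"a": 0.5, "b": "x"}') == "struct<a:double,b:string>"
```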
spark git commit: [SPARK-25262][DOC][FOLLOWUP] Fix link tags in html table
Repository: spark Updated Branches: refs/heads/master 1007cae20 -> dcb9a97f3 [SPARK-25262][DOC][FOLLOWUP] Fix link tags in html table ## What changes were proposed in this pull request? Markdown links are not working inside an HTML table. We should use HTML link tags instead. ## How was this patch tested? Verified in IntelliJ IDEA's markdown editor and online markdown editor. Closes #22588 from viirya/SPARK-25262-followup. Authored-by: Liang-Chi Hsieh Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dcb9a97f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dcb9a97f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dcb9a97f Branch: refs/heads/master Commit: dcb9a97f3e16d4645529ac619c3197fcba1c9806 Parents: 1007cae Author: Liang-Chi Hsieh Authored: Sat Sep 29 18:18:37 2018 +0800 Committer: hyukjinkwon Committed: Sat Sep 29 18:18:37 2018 +0800 -- docs/running-on-kubernetes.md | 8 1 file changed, 4 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dcb9a97f/docs/running-on-kubernetes.md -- diff --git a/docs/running-on-kubernetes.md b/docs/running-on-kubernetes.md index c7aea27..b4088d7 100644 --- a/docs/running-on-kubernetes.md +++ b/docs/running-on-kubernetes.md @@ -680,15 +680,15 @@ specific to Spark on Kubernetes. spark.kubernetes.driver.limit.cores (none) -Specify a hard cpu [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for the driver pod. +Specify a hard cpu <a href="https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container">limit</a> for the driver pod. spark.kubernetes.executor.request.cores (none) -Specify the cpu request for each executor pod.
Values conform to the Kubernetes [convention](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu). -Example values include 0.1, 500m, 1.5, 5, etc., with the definition of cpu units documented in [CPU units](https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/#cpu-units). +Specify the cpu request for each executor pod. Values conform to the Kubernetes <a href="https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu">convention</a>. +Example values include 0.1, 500m, 1.5, 5, etc., with the definition of cpu units documented in <a href="https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/#cpu-units">CPU units</a>. This is distinct from spark.executor.cores: it is only used and takes precedence over spark.executor.cores for specifying the executor pod cpu request if set. Task parallelism, e.g., number of tasks an executor can run concurrently is not affected by this. @@ -697,7 +697,7 @@ specific to Spark on Kubernetes. spark.kubernetes.executor.limit.cores (none) -Specify a hard cpu [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for each executor pod launched for the Spark Application. +Specify a hard cpu <a href="https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container">limit</a> for each executor pod launched for the Spark Application.
spark git commit: [SPARK-25565][BUILD] Add scalastyle rule to check add Locale.ROOT to .toLowerCase and .toUpperCase for internal calls
Repository: spark Updated Branches: refs/heads/master b6b8a6632 -> a2f502cf5 [SPARK-25565][BUILD] Add scalastyle rule to check add Locale.ROOT to .toLowerCase and .toUpperCase for internal calls ## What changes were proposed in this pull request? This PR adds a rule to force `.toLowerCase(Locale.ROOT)` or `toUpperCase(Locale.ROOT)`. It produces an error as below: ``` [error] Are you sure that you want to use toUpperCase or toLowerCase without the root locale? In most cases, you [error] should use toUpperCase(Locale.ROOT) or toLowerCase(Locale.ROOT) instead. [error] If you must use toUpperCase or toLowerCase without the root locale, wrap the code block with [error] // scalastyle:off caselocale [error] .toUpperCase [error] .toLowerCase [error] // scalastyle:on caselocale ``` This PR excludes the cases above for SQL code path for external calls like table name, column name and etc. For test suites, or when it's clear there's no locale problem like Turkish locale problem, it uses `Locale.ROOT`. One minor problem is, `UTF8String` has both methods, `toLowerCase` and `toUpperCase`, and the new rule detects them as well. They are ignored. ## How was this patch tested? Manually tested, and Jenkins tests. Closes #22581 from HyukjinKwon/SPARK-25565. 
Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a2f502cf Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a2f502cf Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a2f502cf Branch: refs/heads/master Commit: a2f502cf53b6b00af7cb80b6f38e64cf46367595 Parents: b6b8a66 Author: hyukjinkwon Authored: Sun Sep 30 14:31:04 2018 +0800 Committer: hyukjinkwon Committed: Sun Sep 30 14:31:04 2018 +0800 -- .../types/UTF8StringPropertyCheckSuite.scala| 2 ++ .../apache/spark/metrics/sink/StatsdSink.scala | 5 ++-- .../apache/spark/rdd/OrderedRDDFunctions.scala | 3 +- .../scala/org/apache/spark/util/Utils.scala | 2 +- .../deploy/history/FsHistoryProviderSuite.scala | 4 +-- .../spark/ml/feature/StopWordsRemover.scala | 2 ++ .../org/apache/spark/ml/feature/Tokenizer.scala | 4 +++ .../submit/KubernetesClientApplication.scala| 4 +-- .../cluster/k8s/ExecutorPodsSnapshot.scala | 4 ++- .../deploy/mesos/MesosClusterDispatcher.scala | 3 +- scalastyle-config.xml | 13 + .../analysis/higherOrderFunctions.scala | 2 ++ .../expressions/stringExpressions.scala | 6 .../spark/sql/catalyst/parser/AstBuilder.scala | 2 ++ .../spark/sql/catalyst/util/StringUtils.scala | 2 ++ .../org/apache/spark/sql/internal/SQLConf.scala | 3 +- .../org/apache/spark/sql/util/SchemaUtils.scala | 2 ++ .../spark/sql/util/SchemaUtilsSuite.scala | 4 ++- .../InsertIntoHadoopFsRelationCommand.scala | 2 ++ .../datasources/csv/CSVDataSource.scala | 6 .../execution/streaming/WatermarkTracker.scala | 4 ++- .../state/SymmetricHashJoinStateManager.scala | 4 ++- .../spark/sql/ColumnExpressionSuite.scala | 4 +-- .../apache/spark/sql/DataFramePivotSuite.scala | 4 ++- .../scala/org/apache/spark/sql/JoinSuite.scala | 4 ++- .../sql/streaming/EventTimeWatermarkSuite.scala | 4 +-- .../spark/sql/hive/HiveExternalCatalog.scala| 16 +++ .../spark/sql/hive/HiveMetastoreCatalog.scala | 4 +++ 
.../spark/sql/hive/CompressionCodecSuite.scala | 29 .../sql/hive/HiveSchemaInferenceSuite.scala | 9 +++--- .../apache/spark/sql/hive/StatisticsSuite.scala | 15 ++ 31 files changed, 132 insertions(+), 40 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a2f502cf/common/unsafe/src/test/scala/org/apache/spark/unsafe/types/UTF8StringPropertyCheckSuite.scala -- diff --git a/common/unsafe/src/test/scala/org/apache/spark/unsafe/types/UTF8StringPropertyCheckSuite.scala b/common/unsafe/src/test/scala/org/apache/spark/unsafe/types/UTF8StringPropertyCheckSuite.scala index 7d3331f..9656951 100644 --- a/common/unsafe/src/test/scala/org/apache/spark/unsafe/types/UTF8StringPropertyCheckSuite.scala +++ b/common/unsafe/src/test/scala/org/apache/spark/unsafe/types/UTF8StringPropertyCheckSuite.scala @@ -63,6 +63,7 @@ class UTF8StringPropertyCheckSuite extends FunSuite with GeneratorDrivenProperty } } + // scalastyle:off caselocale test("toUpperCase") { forAll { (s: String) => assert(toUTF8(s).toUpperCase === toUTF8(s.toUpperCase)) @@ -74,6 +75,7 @@ class UTF8StringPropertyCheckSuite extends FunSuite with GeneratorDrivenProperty assert(toUTF8(s).toLowerCase === toUTF8(s.toLowerCase)) } } + //
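The pitfall the new scalastyle rule guards against is easiest to see with Turkish casing: under a Turkish locale, Java's `String.toLowerCase()` maps 'I' to dotless 'ı' (U+0131), so case-normalized comparisons of internal identifiers silently break, while `Locale.ROOT` pins the stable mapping. A toy Python simulation of the two behaviors (Python's own `str.lower` is locale-independent, so the Turkish mapping is spelled out by hand):

```python
def to_lower_turkish(s):
    """Simulate java.lang.String.toLowerCase(new Locale("tr")) for ASCII input."""
    return s.replace("I", "\u0131").replace("\u0130", "i").lower()

def to_lower_root(s):
    """Simulate toLowerCase(Locale.ROOT): the stable, locale-independent mapping."""
    return s.lower()

assert to_lower_root("INFO") == "info"          # what internal comparisons expect
assert to_lower_turkish("INFO") == "\u0131nfo"  # 'I' -> 'ı': no longer equals "info"
assert to_lower_turkish("INFO") != "info"
```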
spark git commit: [SPARK-25601][PYTHON] Register Grouped aggregate UDF Vectorized UDFs for SQL Statement
Repository: spark Updated Branches: refs/heads/master 79dd4c964 -> 927e52793 [SPARK-25601][PYTHON] Register Grouped aggregate UDF Vectorized UDFs for SQL Statement ## What changes were proposed in this pull request? This PR proposes to register Grouped aggregate UDF Vectorized UDFs for SQL Statement, for instance: ```python from pyspark.sql.functions import pandas_udf, PandasUDFType @pandas_udf("integer", PandasUDFType.GROUPED_AGG) def sum_udf(v): return v.sum() spark.udf.register("sum_udf", sum_udf) q = "SELECT v2, sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2" spark.sql(q).show() ``` ``` +---+---+ | v2|sum_udf(v1)| +---+---+ | 1| 1| | 0| 5| +---+---+ ``` ## How was this patch tested? Manual test and unit test. Closes #22620 from HyukjinKwon/SPARK-25601. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/927e5279 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/927e5279 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/927e5279 Branch: refs/heads/master Commit: 927e527934a882fab89ca661c4eb31f84c45d830 Parents: 79dd4c9 Author: hyukjinkwon Authored: Thu Oct 4 09:38:06 2018 +0800 Committer: hyukjinkwon Committed: Thu Oct 4 09:38:06 2018 +0800 -- --
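Stripped of Spark machinery, the feature is: an aggregate function registered under a name, then applied per group from SQL. A plain-Python sketch of those semantics using the same data as the example above (`register`/`registry` are illustrative names, not PySpark API):

```python
from collections import defaultdict

registry = {}

def register(name, fn):
    """Toy analogue of spark.udf.register(name, fn)."""
    registry[name] = fn
    return fn

register("sum_udf", lambda values: sum(values))

# SELECT v2, sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2
rows = [(3, 0), (2, 0), (1, 1)]  # (v1, v2)
groups = defaultdict(list)
for v1, v2 in rows:
    groups[v2].append(v1)        # GROUP BY v2

result = {v2: registry["sum_udf"](v1s) for v2, v1s in groups.items()}
assert result == {0: 5, 1: 1}    # matches the result table in the commit message
```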
spark git commit: [SPARK-25601][PYTHON] Register Grouped aggregate UDF Vectorized UDFs for SQL Statement
Repository: spark Updated Branches: refs/heads/master 075dd620e -> 79dd4c964 [SPARK-25601][PYTHON] Register Grouped aggregate UDF Vectorized UDFs for SQL Statement ## What changes were proposed in this pull request? This PR proposes to register Grouped aggregate UDF Vectorized UDFs for SQL Statement, for instance: ```python from pyspark.sql.functions import pandas_udf, PandasUDFType @pandas_udf("integer", PandasUDFType.GROUPED_AGG) def sum_udf(v): return v.sum() spark.udf.register("sum_udf", sum_udf) q = "SELECT v2, sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2" spark.sql(q).show() ``` ``` +---+---+ | v2|sum_udf(v1)| +---+---+ | 1| 1| | 0| 5| +---+---+ ``` ## How was this patch tested? Manual test and unit test. Closes #22620 from HyukjinKwon/SPARK-25601. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/79dd4c96 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/79dd4c96 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/79dd4c96 Branch: refs/heads/master Commit: 79dd4c96484c9be7ad9250b64f3fd8e088707641 Parents: 075dd62 Author: hyukjinkwon Authored: Thu Oct 4 09:36:23 2018 +0800 Committer: hyukjinkwon Committed: Thu Oct 4 09:36:23 2018 +0800 -- python/pyspark/sql/tests.py | 20 ++-- python/pyspark/sql/udf.py | 15 +-- 2 files changed, 31 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/79dd4c96/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index 815772d..d3c29d0 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -5642,8 +5642,9 @@ class GroupedMapPandasUDFTests(ReusedSQLTestCase): foo_udf = pandas_udf(lambda x: x, "id long", PandasUDFType.GROUPED_MAP) with QuietTest(self.sc): -with self.assertRaisesRegexp(ValueError, 'f must be either SQL_BATCHED_UDF or ' - 'SQL_SCALAR_PANDAS_UDF'): +with 
self.assertRaisesRegexp( +ValueError, + 'f.*SQL_BATCHED_UDF.*SQL_SCALAR_PANDAS_UDF.*SQL_GROUPED_AGG_PANDAS_UDF.*'): self.spark.catalog.registerFunction("foo_udf", foo_udf) def test_decorator(self): @@ -6459,6 +6460,21 @@ class GroupedAggPandasUDFTests(ReusedSQLTestCase): 'mixture.*aggregate function.*group aggregate pandas UDF'): df.groupby(df.id).agg(mean_udf(df.v), mean(df.v)).collect() +def test_register_vectorized_udf_basic(self): +from pyspark.sql.functions import pandas_udf +from pyspark.rdd import PythonEvalType + +sum_pandas_udf = pandas_udf( +lambda v: v.sum(), "integer", PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF) + +self.assertEqual(sum_pandas_udf.evalType, PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF) +group_agg_pandas_udf = self.spark.udf.register("sum_pandas_udf", sum_pandas_udf) +self.assertEqual(group_agg_pandas_udf.evalType, PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF) +q = "SELECT sum_pandas_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2" +actual = sorted(map(lambda r: r[0], self.spark.sql(q).collect())) +expected = [1, 5] +self.assertEqual(actual, expected) + @unittest.skipIf( not _have_pandas or not _have_pyarrow, http://git-wip-us.apache.org/repos/asf/spark/blob/79dd4c96/python/pyspark/sql/udf.py -- diff --git a/python/pyspark/sql/udf.py b/python/pyspark/sql/udf.py index 9dbe49b..58f4e0d 100644 --- a/python/pyspark/sql/udf.py +++ b/python/pyspark/sql/udf.py @@ -298,6 +298,15 @@ class UDFRegistration(object): >>> spark.sql("SELECT add_one(id) FROM range(3)").collect() # doctest: +SKIP [Row(add_one(id)=1), Row(add_one(id)=2), Row(add_one(id)=3)] +>>> @pandas_udf("integer", PandasUDFType.GROUPED_AGG) # doctest: +SKIP +... def sum_udf(v): +... return v.sum() +... +>>> _ = spark.udf.register("sum_udf", sum_udf) # doctest: +SKIP +>>> q = "SELECT sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2" +>>> spark.sql(q).collect() # doctest: +SKIP +[Row(sum_udf(v1)=1), Row(sum_udf(v1)=5)] + .. 
note:: Registration for a user-defined function (case 2.) was added from Spark 2.3.0. """ @@ -310,9 +319,11 @@ class UDFRegistration(object): "Invalid
spark git commit: [SPARK-25601][PYTHON] Register Grouped aggregate UDF Vectorized UDFs for SQL Statement
Repository: spark Updated Branches: refs/heads/branch-2.4 443d12dbb -> 0763b758d [SPARK-25601][PYTHON] Register Grouped aggregate UDF Vectorized UDFs for SQL Statement ## What changes were proposed in this pull request? This PR proposes to register Grouped aggregate UDF Vectorized UDFs for SQL Statement, for instance: ```python from pyspark.sql.functions import pandas_udf, PandasUDFType pandas_udf("integer", PandasUDFType.GROUPED_AGG) def sum_udf(v): return v.sum() spark.udf.register("sum_udf", sum_udf) q = "SELECT v2, sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2" spark.sql(q).show() ``` ``` +---+---+ | v2|sum_udf(v1)| +---+---+ | 1| 1| | 0| 5| +---+---+ ``` ## How was this patch tested? Manual test and unit test. Closes #22620 from HyukjinKwon/SPARK-25601. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0763b758 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0763b758 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0763b758 Branch: refs/heads/branch-2.4 Commit: 0763b758de55fd14d7da4832d01b5713e582b257 Parents: 443d12d Author: hyukjinkwon Authored: Thu Oct 4 09:36:23 2018 +0800 Committer: hyukjinkwon Committed: Thu Oct 4 09:43:42 2018 +0800 -- python/pyspark/sql/tests.py | 20 ++-- python/pyspark/sql/udf.py | 15 +-- 2 files changed, 31 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0763b758/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index 690035a..e991032 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -5595,8 +5595,9 @@ class GroupedMapPandasUDFTests(ReusedSQLTestCase): foo_udf = pandas_udf(lambda x: x, "id long", PandasUDFType.GROUPED_MAP) with QuietTest(self.sc): -with self.assertRaisesRegexp(ValueError, 'f must be either SQL_BATCHED_UDF or ' - 'SQL_SCALAR_PANDAS_UDF'): +with 
self.assertRaisesRegexp( +ValueError, + 'f.*SQL_BATCHED_UDF.*SQL_SCALAR_PANDAS_UDF.*SQL_GROUPED_AGG_PANDAS_UDF.*'): self.spark.catalog.registerFunction("foo_udf", foo_udf) def test_decorator(self): @@ -6412,6 +6413,21 @@ class GroupedAggPandasUDFTests(ReusedSQLTestCase): 'mixture.*aggregate function.*group aggregate pandas UDF'): df.groupby(df.id).agg(mean_udf(df.v), mean(df.v)).collect() +def test_register_vectorized_udf_basic(self): +from pyspark.sql.functions import pandas_udf +from pyspark.rdd import PythonEvalType + +sum_pandas_udf = pandas_udf( +lambda v: v.sum(), "integer", PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF) + +self.assertEqual(sum_pandas_udf.evalType, PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF) +group_agg_pandas_udf = self.spark.udf.register("sum_pandas_udf", sum_pandas_udf) +self.assertEqual(group_agg_pandas_udf.evalType, PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF) +q = "SELECT sum_pandas_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2" +actual = sorted(map(lambda r: r[0], self.spark.sql(q).collect())) +expected = [1, 5] +self.assertEqual(actual, expected) + @unittest.skipIf( not _have_pandas or not _have_pyarrow, http://git-wip-us.apache.org/repos/asf/spark/blob/0763b758/python/pyspark/sql/udf.py -- diff --git a/python/pyspark/sql/udf.py b/python/pyspark/sql/udf.py index 9dbe49b..58f4e0d 100644 --- a/python/pyspark/sql/udf.py +++ b/python/pyspark/sql/udf.py @@ -298,6 +298,15 @@ class UDFRegistration(object): >>> spark.sql("SELECT add_one(id) FROM range(3)").collect() # doctest: +SKIP [Row(add_one(id)=1), Row(add_one(id)=2), Row(add_one(id)=3)] +>>> @pandas_udf("integer", PandasUDFType.GROUPED_AGG) # doctest: +SKIP +... def sum_udf(v): +... return v.sum() +... +>>> _ = spark.udf.register("sum_udf", sum_udf) # doctest: +SKIP +>>> q = "SELECT sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2" +>>> spark.sql(q).collect() # doctest: +SKIP +[Row(sum_udf(v1)=1), Row(sum_udf(v1)=5)] + .. 
note:: Registration for a user-defined function (case 2.) was added from Spark 2.3.0. """ @@ -310,9 +319,11 @@ class UDFRegistration(object):
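The SQL query in the description performs a per-group aggregation. For readers without a Spark session at hand, the same computation can be sketched in plain Python (the `grouped_agg` helper below is purely illustrative, not part of PySpark):

```python
from collections import defaultdict

def grouped_agg(rows, key, value, agg):
    """Mimic `SELECT agg(value) ... GROUP BY key` over a list of dicts."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: agg(vs) for k, vs in groups.items()}

# The same data as `VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2)`:
rows = [{"v1": 3, "v2": 0}, {"v1": 2, "v2": 0}, {"v1": 1, "v2": 1}]
print(grouped_agg(rows, key="v2", value="v1", agg=sum))  # {0: 5, 1: 1}
```

This matches the expected output of the registered `sum_udf` query above: group `0` sums to `5`, group `1` to `1`.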
spark git commit: [SPARK-25595] Ignore corrupt Avro files if flag IGNORE_CORRUPT_FILES enabled
Repository: spark Updated Branches: refs/heads/master d6be46eb9 -> 928d0739c [SPARK-25595] Ignore corrupt Avro files if flag IGNORE_CORRUPT_FILES enabled ## What changes were proposed in this pull request? With flag `IGNORE_CORRUPT_FILES` enabled, schema inference should ignore corrupt Avro files, which is consistent with Parquet and Orc data source. ## How was this patch tested? Unit test Closes #22611 from gengliangwang/ignoreCorruptAvro. Authored-by: Gengliang Wang Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/928d0739 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/928d0739 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/928d0739 Branch: refs/heads/master Commit: 928d0739c45d0fbb1d3bfc09c0ed7a213f09f3e5 Parents: d6be46e Author: Gengliang Wang Authored: Wed Oct 3 17:08:55 2018 +0800 Committer: hyukjinkwon Committed: Wed Oct 3 17:08:55 2018 +0800 -- .../apache/spark/sql/avro/AvroFileFormat.scala | 78 +--- .../org/apache/spark/sql/avro/AvroSuite.scala | 43 +++ 2 files changed, 93 insertions(+), 28 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/928d0739/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala -- diff --git a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala index 6df23c9..e60fa88 100755 --- a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala +++ b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala @@ -32,14 +32,14 @@ import org.apache.hadoop.conf.Configuration import org.apache.hadoop.fs.{FileStatus, Path} import org.apache.hadoop.mapreduce.Job -import org.apache.spark.TaskContext +import org.apache.spark.{SparkException, TaskContext} import org.apache.spark.internal.Logging import org.apache.spark.sql.SparkSession import 
org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriterFactory, PartitionedFile} import org.apache.spark.sql.sources.{DataSourceRegister, Filter} import org.apache.spark.sql.types.StructType -import org.apache.spark.util.SerializableConfiguration +import org.apache.spark.util.{SerializableConfiguration, Utils} private[avro] class AvroFileFormat extends FileFormat with DataSourceRegister with Logging with Serializable { @@ -59,36 +59,13 @@ private[avro] class AvroFileFormat extends FileFormat val conf = spark.sessionState.newHadoopConf() val parsedOptions = new AvroOptions(options, conf) -// Schema evolution is not supported yet. Here we only pick a single random sample file to -// figure out the schema of the whole dataset. -val sampleFile = - if (parsedOptions.ignoreExtension) { -files.headOption.getOrElse { - throw new FileNotFoundException("Files for schema inferring have been not found.") -} - } else { -files.find(_.getPath.getName.endsWith(".avro")).getOrElse { - throw new FileNotFoundException( -"No Avro files found. If files don't have .avro extension, set ignoreExtension to true") -} - } - // User can specify an optional avro json schema. 
val avroSchema = parsedOptions.schema .map(new Schema.Parser().parse) .getOrElse { -val in = new FsInput(sampleFile.getPath, conf) -try { - val reader = DataFileReader.openReader(in, new GenericDatumReader[GenericRecord]()) - try { -reader.getSchema - } finally { -reader.close() - } -} finally { - in.close() -} - } +inferAvroSchemaFromFiles(files, conf, parsedOptions.ignoreExtension, + spark.sessionState.conf.ignoreCorruptFiles) +} SchemaConverters.toSqlType(avroSchema).dataType match { case t: StructType => Some(t) @@ -100,6 +77,51 @@ private[avro] class AvroFileFormat extends FileFormat } } + private def inferAvroSchemaFromFiles( + files: Seq[FileStatus], + conf: Configuration, + ignoreExtension: Boolean, + ignoreCorruptFiles: Boolean): Schema = { +// Schema evolution is not supported yet. Here we only pick first random readable sample file to +// figure out the schema of the whole dataset. +val avroReader = files.iterator.map { f => + val path = f.getPath + if (!ignoreExtension && !path.getName.endsWith(".avro")) { +None + } else { +Utils.tryWithResource { + new FsInput(path, conf) +} { in => + try { +
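The new `inferAvroSchemaFromFiles` walks the file list lazily and returns the schema of the first readable sample file, skipping corrupt files only when the flag is enabled. A minimal Python sketch of that control flow (all names here — `infer_schema_from_files`, `fake_reader` — are hypothetical illustrations, not Spark APIs):

```python
def infer_schema_from_files(paths, read_schema, ignore_corrupt_files):
    """Return the schema of the first readable file.

    Corrupt files are skipped only when `ignore_corrupt_files` is set,
    mirroring the IGNORE_CORRUPT_FILES behavior described above.
    """
    for path in paths:
        try:
            return read_schema(path)
        except IOError:
            if not ignore_corrupt_files:
                raise
            # otherwise skip this file and try the next candidate
    raise FileNotFoundError("No readable files found for schema inference.")

def fake_reader(path):  # hypothetical reader for demonstration
    if "corrupt" in path:
        raise IOError("invalid Avro block")
    return {"fields": ["a", "b"]}

print(infer_schema_from_files(["corrupt.avro", "ok.avro"], fake_reader, True))
# {'fields': ['a', 'b']}
```

With the flag disabled, the first corrupt file fails schema inference immediately — the behavior the unit tests in `AvroSuite` exercise from both sides.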
spark git commit: [SPARK-25655][BUILD] Add -Pspark-ganglia-lgpl to the scala style check.
Repository: spark Updated Branches: refs/heads/master 58287a398 -> 44cf800c8 [SPARK-25655][BUILD] Add -Pspark-ganglia-lgpl to the scala style check. ## What changes were proposed in this pull request? Our lint failed due to the following errors: ``` [INFO] --- scalastyle-maven-plugin:1.0.0:check (default) spark-ganglia-lgpl_2.11 --- error file=/home/jenkins/workspace/spark-master-maven-snapshots/spark/external/spark-ganglia-lgpl/src/main/scala/org/apache/spark/metrics/sink/GangliaSink.scala message= Are you sure that you want to use toUpperCase or toLowerCase without the root locale? In most cases, you should use toUpperCase(Locale.ROOT) or toLowerCase(Locale.ROOT) instead. If you must use toUpperCase or toLowerCase without the root locale, wrap the code block with // scalastyle:off caselocale .toUpperCase .toLowerCase // scalastyle:on caselocale line=67 column=49 error file=/home/jenkins/workspace/spark-master-maven-snapshots/spark/external/spark-ganglia-lgpl/src/main/scala/org/apache/spark/metrics/sink/GangliaSink.scala message= Are you sure that you want to use toUpperCase or toLowerCase without the root locale? In most cases, you should use toUpperCase(Locale.ROOT) or toLowerCase(Locale.ROOT) instead. If you must use toUpperCase or toLowerCase without the root locale, wrap the code block with // scalastyle:off caselocale .toUpperCase .toLowerCase // scalastyle:on caselocale line=71 column=32 Saving to outputFile=/home/jenkins/workspace/spark-master-maven-snapshots/spark/external/spark-ganglia-lgpl/target/scalastyle-output.xml ``` See https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/8890/ ## How was this patch tested? N/A Closes #22647 from gatorsmile/fixLint. 
Authored-by: gatorsmile Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/44cf800c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/44cf800c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/44cf800c Branch: refs/heads/master Commit: 44cf800c831588b1f7940dd8eef7ecb6cde28f23 Parents: 58287a3 Author: gatorsmile Authored: Sat Oct 6 14:25:48 2018 +0800 Committer: hyukjinkwon Committed: Sat Oct 6 14:25:48 2018 +0800 -- dev/scalastyle| 1 + .../scala/org/apache/spark/metrics/sink/GangliaSink.scala | 7 --- 2 files changed, 5 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/44cf800c/dev/scalastyle -- diff --git a/dev/scalastyle b/dev/scalastyle index b8053df..b0ad025 100755 --- a/dev/scalastyle +++ b/dev/scalastyle @@ -29,6 +29,7 @@ ERRORS=$(echo -e "q\n" \ -Pflume \ -Phive \ -Phive-thriftserver \ +-Pspark-ganglia-lgpl \ scalastyle test:scalastyle \ | awk '{if($1~/error/)print}' \ ) http://git-wip-us.apache.org/repos/asf/spark/blob/44cf800c/external/spark-ganglia-lgpl/src/main/scala/org/apache/spark/metrics/sink/GangliaSink.scala -- diff --git a/external/spark-ganglia-lgpl/src/main/scala/org/apache/spark/metrics/sink/GangliaSink.scala b/external/spark-ganglia-lgpl/src/main/scala/org/apache/spark/metrics/sink/GangliaSink.scala index 0cd795f..93db477 100644 --- a/external/spark-ganglia-lgpl/src/main/scala/org/apache/spark/metrics/sink/GangliaSink.scala +++ b/external/spark-ganglia-lgpl/src/main/scala/org/apache/spark/metrics/sink/GangliaSink.scala @@ -17,7 +17,7 @@ package org.apache.spark.metrics.sink -import java.util.Properties +import java.util.{Locale, Properties} import java.util.concurrent.TimeUnit import com.codahale.metrics.MetricRegistry @@ -64,11 +64,12 @@ class GangliaSink(val property: Properties, val registry: MetricRegistry, val ttl = propertyToOption(GANGLIA_KEY_TTL).map(_.toInt).getOrElse(GANGLIA_DEFAULT_TTL) val dmax 
= propertyToOption(GANGLIA_KEY_DMAX).map(_.toInt).getOrElse(GANGLIA_DEFAULT_DMAX) val mode: UDPAddressingMode = propertyToOption(GANGLIA_KEY_MODE) -.map(u => GMetric.UDPAddressingMode.valueOf(u.toUpperCase)).getOrElse(GANGLIA_DEFAULT_MODE) +.map(u => GMetric.UDPAddressingMode.valueOf(u.toUpperCase(Locale.ROOT))) +.getOrElse(GANGLIA_DEFAULT_MODE) val pollPeriod = propertyToOption(GANGLIA_KEY_PERIOD).map(_.toInt) .getOrElse(GANGLIA_DEFAULT_PERIOD) val pollUnit: TimeUnit = propertyToOption(GANGLIA_KEY_UNIT) -.map(u => TimeUnit.valueOf(u.toUpperCase)) +.map(u => TimeUnit.valueOf(u.toUpperCase(Locale.ROOT))) .getOrElse(GANGLIA_DEFAULT_UNIT) MetricsSystem.checkMinimalPollingPeriod(pollUnit,
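The underlying issue is that Java's `String.toUpperCase()` without an explicit locale is locale-sensitive: under a Turkish default locale, `"unicast".toUpperCase()` yields `"UNİCAST"` (dotted capital İ) and the subsequent enum lookup fails, which is why the scalastyle rule demands `Locale.ROOT`. A rough Python sketch of the config-to-enum canonicalization (the helper is hypothetical; Python's `str.upper()` already ignores the process locale, which is exactly the behavior `Locale.ROOT` pins down in Java):

```python
UDP_MODES = {"UNICAST", "MULTICAST"}

def parse_mode(value, default="MULTICAST"):
    """Canonicalize a user-supplied mode string before matching enum names."""
    if value is None:
        return default
    mode = value.upper()  # locale-independent in Python, unlike Java's default
    if mode not in UDP_MODES:
        raise ValueError("unknown UDP addressing mode: %r" % value)
    return mode

print(parse_mode("unicast"))  # UNICAST
print(parse_mode(None))       # MULTICAST
```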
spark git commit: [SPARK-25621][SPARK-25622][TEST] Reduce test time of BucketedReadWithHiveSupportSuite
Repository: spark Updated Branches: refs/heads/master f2f4e7afe -> 1ee472eec [SPARK-25621][SPARK-25622][TEST] Reduce test time of BucketedReadWithHiveSupportSuite ## What changes were proposed in this pull request? By replacing loops with random possible value. - `read partitioning bucketed tables with bucket pruning filters` reduce from 55s to 7s - `read partitioning bucketed tables having composite filters` reduce from 54s to 8s - total time: reduce from 288s to 192s ## How was this patch tested? Unit test Closes #22640 from gengliangwang/fastenBucketedReadSuite. Authored-by: Gengliang Wang Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1ee472ee Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1ee472ee Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1ee472ee Branch: refs/heads/master Commit: 1ee472eec15e104c4cd087179a9491dc542e15d7 Parents: f2f4e7a Author: Gengliang Wang Authored: Sat Oct 6 14:54:04 2018 +0800 Committer: hyukjinkwon Committed: Sat Oct 6 14:54:04 2018 +0800 -- .../spark/sql/sources/BucketedReadSuite.scala | 181 ++- 1 file changed, 91 insertions(+), 90 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1ee472ee/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala index a941420..a2bc651 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala @@ -20,6 +20,8 @@ package org.apache.spark.sql.sources import java.io.File import java.net.URI +import scala.util.Random + import org.apache.spark.sql._ import org.apache.spark.sql.catalyst.catalog.BucketSpec import org.apache.spark.sql.catalyst.expressions @@ -47,11 +49,13 @@ 
class BucketedReadWithoutHiveSupportSuite extends BucketedReadSuite with SharedS abstract class BucketedReadSuite extends QueryTest with SQLTestUtils { import testImplicits._ - private lazy val df = (0 until 50).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", "k") + private val maxI = 5 + private val maxJ = 13 + private lazy val df = (0 until 50).map(i => (i % maxI, i % maxJ, i.toString)).toDF("i", "j", "k") private lazy val nullDF = (for { i <- 0 to 50 s <- Seq(null, "a", "b", "c", "d", "e", "f", null, "g") - } yield (i % 5, s, i % 13)).toDF("i", "j", "k") + } yield (i % maxI, s, i % maxJ)).toDF("i", "j", "k") // number of buckets that doesn't yield empty buckets when bucketing on column j on df/nullDF // empty buckets before filtering might hide bugs in pruning logic @@ -66,23 +70,22 @@ abstract class BucketedReadSuite extends QueryTest with SQLTestUtils { .bucketBy(8, "j", "k") .saveAsTable("bucketed_table") - for (i <- 0 until 5) { -val table = spark.table("bucketed_table").filter($"i" === i) -val query = table.queryExecution -val output = query.analyzed.output -val rdd = query.toRdd - -assert(rdd.partitions.length == 8) - -val attrs = table.select("j", "k").queryExecution.analyzed.output -val checkBucketId = rdd.mapPartitionsWithIndex((index, rows) => { - val getBucketId = UnsafeProjection.create( -HashPartitioning(attrs, 8).partitionIdExpression :: Nil, -output) - rows.map(row => getBucketId(row).getInt(0) -> index) -}) -checkBucketId.collect().foreach(r => assert(r._1 == r._2)) - } + val bucketValue = Random.nextInt(maxI) + val table = spark.table("bucketed_table").filter($"i" === bucketValue) + val query = table.queryExecution + val output = query.analyzed.output + val rdd = query.toRdd + + assert(rdd.partitions.length == 8) + + val attrs = table.select("j", "k").queryExecution.analyzed.output + val checkBucketId = rdd.mapPartitionsWithIndex((index, rows) => { +val getBucketId = UnsafeProjection.create( + HashPartitioning(attrs, 
8).partitionIdExpression :: Nil, + output) +rows.map(row => getBucketId(row).getInt(0) -> index) + }) + checkBucketId.collect().foreach(r => assert(r._1 == r._2)) } } @@ -145,36 +148,36 @@ abstract class BucketedReadSuite extends QueryTest with SQLTestUtils { .bucketBy(numBuckets, "j") .saveAsTable("bucketed_table") - for (j <- 0 until 13) { -// Case 1: EqualTo -checkPrunedAnswers( - bucketSpec, - bucketValues = j :: Nil, -
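The speed-up technique is simple: instead of asserting the property for every possible bucket value on every run, assert it for one randomly sampled value per run, relying on repeated CI runs to cover the whole domain. A toy sketch (the `check` function stands in for the expensive per-value query assertion):

```python
import random

MAX_I = 5  # the bucket column takes values 0..4, as in the suite's test data

def check(i):
    """Stand-in for the expensive per-value query-plan assertion."""
    assert 0 <= i < MAX_I
    return i

# Before the patch: the assertion ran once per possible value.
for i in range(MAX_I):
    check(i)

# After the patch: one randomly sampled value per run; repeated CI runs
# still exercise the whole domain, at a fraction of the wall-clock cost.
bucket_value = random.randrange(MAX_I)
check(bucket_value)
```

The trade-off is that any single run covers only one value, so a value-specific regression may take a few runs to surface — acceptable for a property expected to hold uniformly.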
spark git commit: [SPARK-25202][SQL] Implements split with limit sql function
Repository: spark Updated Branches: refs/heads/master 44cf800c8 -> 17781d753 [SPARK-25202][SQL] Implements split with limit sql function ## What changes were proposed in this pull request? Adds support for the setting limit in the sql split function ## How was this patch tested? 1. Updated unit tests 2. Tested using Scala spark shell Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #7 from phegstrom/master. Authored-by: Parker Hegstrom Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/17781d75 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/17781d75 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/17781d75 Branch: refs/heads/master Commit: 17781d75308c328b11cab3658ca4f358539414f2 Parents: 44cf800 Author: Parker Hegstrom Authored: Sat Oct 6 14:30:43 2018 +0800 Committer: hyukjinkwon Committed: Sat Oct 6 14:30:43 2018 +0800 -- R/pkg/R/functions.R | 15 +-- R/pkg/R/generics.R | 2 +- R/pkg/tests/fulltests/test_sparkSQL.R | 8 .../apache/spark/unsafe/types/UTF8String.java | 6 +++ .../spark/unsafe/types/UTF8StringSuite.java | 14 --- python/pyspark/sql/functions.py | 28 + .../expressions/regexpExpressions.scala | 44 ++-- .../expressions/RegexpExpressionsSuite.scala| 15 +-- .../scala/org/apache/spark/sql/functions.scala | 32 -- .../sql-tests/inputs/string-functions.sql | 6 ++- .../sql-tests/results/string-functions.sql.out | 18 +++- .../apache/spark/sql/StringFunctionsSuite.scala | 44 ++-- 12 files changed, 189 insertions(+), 43 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/17781d75/R/pkg/R/functions.R -- diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index 2cb4cb8..6a8fef5 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -3473,13 +3473,21 @@ setMethod("collect_set", #' @details #' \code{split_string}: Splits string on regular expression. 
-#' Equivalent to \code{split} SQL function. +#' Equivalent to \code{split} SQL function. Optionally a +#' \code{limit} can be specified #' #' @rdname column_string_functions +#' @param limit determines the length of the returned array. +#' \itemize{ +#' \item \code{limit > 0}: length of the array will be at most \code{limit} +#' \item \code{limit <= 0}: the returned array can have any length +#' } +#' #' @aliases split_string split_string,Column-method #' @examples #' #' \dontrun{ +#' head(select(df, split_string(df$Class, "\\d", 2))) #' head(select(df, split_string(df$Sex, "a"))) #' head(select(df, split_string(df$Class, "\\d"))) #' # This is equivalent to the following SQL expression @@ -3487,8 +3495,9 @@ setMethod("collect_set", #' @note split_string 2.3.0 setMethod("split_string", signature(x = "Column", pattern = "character"), - function(x, pattern) { -jc <- callJStatic("org.apache.spark.sql.functions", "split", x@jc, pattern) + function(x, pattern, limit = -1) { +jc <- callJStatic("org.apache.spark.sql.functions", + "split", x@jc, pattern, as.integer(limit)) column(jc) }) http://git-wip-us.apache.org/repos/asf/spark/blob/17781d75/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index 27c1b31..697d124 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -1258,7 +1258,7 @@ setGeneric("sort_array", function(x, asc = TRUE) { standardGeneric("sort_array") #' @rdname column_string_functions #' @name NULL -setGeneric("split_string", function(x, pattern) { standardGeneric("split_string") }) +setGeneric("split_string", function(x, pattern, ...) 
{ standardGeneric("split_string") }) #' @rdname column_string_functions #' @name NULL http://git-wip-us.apache.org/repos/asf/spark/blob/17781d75/R/pkg/tests/fulltests/test_sparkSQL.R -- diff --git a/R/pkg/tests/fulltests/test_sparkSQL.R b/R/pkg/tests/fulltests/test_sparkSQL.R index 50eff37..5cc75aa 100644 --- a/R/pkg/tests/fulltests/test_sparkSQL.R +++ b/R/pkg/tests/fulltests/test_sparkSQL.R @@ -1819,6 +1819,14 @@ test_that("string operators", { collect(select(df4, split_string(df4$a, "")))[1, 1], list(list("a.b@c.d 1", "b")) ) + expect_equal( +collect(select(df4, split_string(df4$a, "\\.", 2)))[1, 1], +list(list("a", "b@c.d 1\\b")) + ) +
spark git commit: [SPARK-25600][SQL][MINOR] Make use of TypeCoercion.findTightestCommonType while inferring CSV schema.
Repository: spark Updated Branches: refs/heads/master 17781d753 -> f2f4e7afe [SPARK-25600][SQL][MINOR] Make use of TypeCoercion.findTightestCommonType while inferring CSV schema. ## What changes were proposed in this pull request? Current the CSV's infer schema code inlines `TypeCoercion.findTightestCommonType`. This is a minor refactor to make use of the common type coercion code when applicable. This way we can take advantage of any improvement to the base method. Thanks to MaxGekk for finding this while reviewing another PR. ## How was this patch tested? This is a minor refactor. Existing tests are used to verify the change. Closes #22619 from dilipbiswal/csv_minor. Authored-by: Dilip Biswal Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f2f4e7af Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f2f4e7af Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f2f4e7af Branch: refs/heads/master Commit: f2f4e7afe730badaf443f459b27fe40879947d51 Parents: 17781d7 Author: Dilip Biswal Authored: Sat Oct 6 14:49:51 2018 +0800 Committer: hyukjinkwon Committed: Sat Oct 6 14:49:51 2018 +0800 -- .../datasources/csv/CSVInferSchema.scala| 37 1 file changed, 14 insertions(+), 23 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f2f4e7af/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala index a585cbe..3596ff1 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala @@ -70,7 +70,7 @@ private[csv] object CSVInferSchema { def mergeRowTypes(first: Array[DataType], second: 
Array[DataType]): Array[DataType] = { first.zipAll(second, NullType, NullType).map { case (a, b) => - findTightestCommonType(a, b).getOrElse(NullType) + compatibleType(a, b).getOrElse(NullType) } } @@ -88,7 +88,7 @@ private[csv] object CSVInferSchema { case LongType => tryParseLong(field, options) case _: DecimalType => // DecimalTypes have different precisions and scales, so we try to find the common type. - findTightestCommonType(typeSoFar, tryParseDecimal(field, options)).getOrElse(StringType) + compatibleType(typeSoFar, tryParseDecimal(field, options)).getOrElse(StringType) case DoubleType => tryParseDouble(field, options) case TimestampType => tryParseTimestamp(field, options) case BooleanType => tryParseBoolean(field, options) @@ -172,35 +172,27 @@ private[csv] object CSVInferSchema { StringType } - private val numericPrecedence: IndexedSeq[DataType] = TypeCoercion.numericPrecedence + /** + * Returns the common data type given two input data types so that the return type + * is compatible with both input data types. + */ + private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = { +TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) + } /** - * Copied from internal Spark api - * [[org.apache.spark.sql.catalyst.analysis.TypeCoercion]] + * The following pattern matching represents additional type promotion rules that + * are CSV specific. 
*/ - val findTightestCommonType: (DataType, DataType) => Option[DataType] = { -case (t1, t2) if t1 == t2 => Some(t1) -case (NullType, t1) => Some(t1) -case (t1, NullType) => Some(t1) + private val findCompatibleTypeForCSV: (DataType, DataType) => Option[DataType] = { case (StringType, t2) => Some(StringType) case (t1, StringType) => Some(StringType) -// Promote numeric types to the highest of the two and all numeric types to unlimited decimal -case (t1, t2) if Seq(t1, t2).forall(numericPrecedence.contains) => - val index = numericPrecedence.lastIndexWhere(t => t == t1 || t == t2) - Some(numericPrecedence(index)) - -// These two cases below deal with when `DecimalType` is larger than `IntegralType`. -case (t1: IntegralType, t2: DecimalType) if t2.isWiderThan(t1) => - Some(t2) -case (t1: DecimalType, t2: IntegralType) if t1.isWiderThan(t2) => - Some(t1) - // These two cases below deal with when `IntegralType` is larger than `DecimalType`. case (t1: IntegralType, t2: DecimalType) => - findTightestCommonType(DecimalType.forType(t1), t2) +
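The refactor boils down to composing the shared coercion rules with a CSV-only fallback via `orElse`. A simplified Python sketch with strings standing in for Catalyst types (all three functions are illustrative stand-ins, not Spark APIs):

```python
_PRECEDENCE = ["int", "long", "double"]  # toy numeric widening order

def find_tightest_common_type(t1, t2):
    """Stand-in for the shared TypeCoercion.findTightestCommonType rules."""
    if t1 == t2:
        return t1
    if "null" in (t1, t2):
        return t2 if t1 == "null" else t1
    if t1 in _PRECEDENCE and t2 in _PRECEDENCE:
        return _PRECEDENCE[max(_PRECEDENCE.index(t1), _PRECEDENCE.index(t2))]
    return None

def find_compatible_type_for_csv(t1, t2):
    """CSV-specific promotions, consulted only when the shared rules give up."""
    if "string" in (t1, t2):
        return "string"
    return None

def compatible_type(t1, t2):
    # The shape of the refactor: shared rules first, CSV fallback second.
    return find_tightest_common_type(t1, t2) or find_compatible_type_for_csv(t1, t2)

print(compatible_type("int", "double"))   # double
print(compatible_type("long", "string"))  # string
```

Because the shared rules are consulted first, any future improvement to `TypeCoercion.findTightestCommonType` automatically benefits CSV schema inference — the motivation stated in the PR description.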
spark git commit: [SPARK-25461][PYSPARK][SQL] Add document for mismatch between return type of Pandas.Series and return type of pandas udf
Repository: spark Updated Branches: refs/heads/master fba722e31 -> 3eb842969 [SPARK-25461][PYSPARK][SQL] Add document for mismatch between return type of Pandas.Series and return type of pandas udf ## What changes were proposed in this pull request? For Pandas UDFs, we derive the Arrow type from the UDF's declared Catalyst return data type and use that Arrow type to serialize the data. If the declared return data type does not match the actual return type of the Pandas.Series produced by the UDF, there is a risk of returning incorrect data from the Python side. We currently have no reliable way to check whether the data conversion is safe, so for now we document this caveat for users. Once a PyArrow upgrade that can perform this check becomes available, we should add an option to enable it. ## How was this patch tested? Only document change. Closes #22610 from viirya/SPARK-25461. Authored-by: Liang-Chi Hsieh Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3eb84296 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3eb84296 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3eb84296 Branch: refs/heads/master Commit: 3eb842969906d6e81a137af6dc4339881df0a315 Parents: fba722e Author: Liang-Chi Hsieh Authored: Sun Oct 7 23:18:46 2018 +0800 Committer: hyukjinkwon Committed: Sun Oct 7 23:18:46 2018 +0800 -- python/pyspark/sql/functions.py | 6 ++ 1 file changed, 6 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3eb84296/python/pyspark/sql/functions.py -- diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index 7685264..be089ee 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -2948,6 +2948,12 @@ def pandas_udf(f=None, returnType=None, functionType=None): can fail on special rows, the workaround is to incorporate the condition into the functions. .. 
note:: The user-defined functions do not take keyword arguments on the calling side. + +.. note:: The data type of returned `pandas.Series` from the user-defined functions should be +matched with defined returnType (see :meth:`types.to_arrow_type` and +:meth:`types.from_arrow_type`). When there is mismatch between them, Spark might do +conversion on returned data. The conversion is not guaranteed to be correct and results +should be checked for accuracy by users. """ # decorator @pandas_udf(returnType, functionType) is_decorator = f is None or isinstance(f, (str, DataType)) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
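To see why the new documentation note matters, here is a toy illustration (plain Python, not PySpark or Arrow code) of how a declared return type can silently reshape a UDF's actual output during serialization:

```python
def serialize(values, declared_type):
    """Toy serializer: the declared returnType wins, silently coercing values."""
    return [declared_type(v) for v in values]

udf_output = [0.5, 1.5, 2.5]            # the UDF actually returned floats...
sent_back = serialize(udf_output, int)  # ...but was declared "integer"
print(sent_back)  # [0, 1, 2] -- data silently changed, no error raised
```

The real Arrow conversion is more involved, but the failure mode is the same: no exception, just quietly altered values — hence the advice that users verify results when the declared `returnType` and the actual `pandas.Series` dtype disagree.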
spark git commit: [SPARK-25262][DOC][FOLLOWUP] Fix missing markup tag
Repository: spark Updated Branches: refs/heads/master 5d726b865 -> e99ba8d7c [SPARK-25262][DOC][FOLLOWUP] Fix missing markup tag ## What changes were proposed in this pull request? This adds a missing end markup tag. This should go to the `master` branch only. ## How was this patch tested? This is a doc-only change. Manual via `SKIP_API=1 jekyll build`. Closes #22584 from dongjoon-hyun/SPARK-25262. Authored-by: Dongjoon Hyun Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e99ba8d7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e99ba8d7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e99ba8d7 Branch: refs/heads/master Commit: e99ba8d7c8ec4b4cdd63fd1621f54be993bb0404 Parents: 5d726b8 Author: Dongjoon Hyun Authored: Sat Sep 29 11:23:37 2018 +0800 Committer: hyukjinkwon Committed: Sat Sep 29 11:23:37 2018 +0800 -- docs/running-on-kubernetes.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e99ba8d7/docs/running-on-kubernetes.md -- diff --git a/docs/running-on-kubernetes.md b/docs/running-on-kubernetes.md index 840e306..c7aea27 100644 --- a/docs/running-on-kubernetes.md +++ b/docs/running-on-kubernetes.md @@ -800,7 +800,7 @@ specific to Spark on Kubernetes. spark.kubernetes.local.dirs.tmpfs - false + false Configure the emptyDir volumes used to back SPARK_LOCAL_DIRS within the Spark driver and executor pods to use tmpfs backing i.e. RAM. See Local Storage earlier on this page for more discussion of this.
spark git commit: [SPARK-25570][SQL][TEST] Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite
Repository: spark Updated Branches: refs/heads/master e99ba8d7c -> 1e437835e [SPARK-25570][SQL][TEST] Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? This PR aims to prevent test slowdowns at `HiveExternalCatalogVersionsSuite` by using the latest Apache Spark 2.3.2 link because the Apache mirrors will remove the old Spark 2.3.1 binaries eventually. `HiveExternalCatalogVersionsSuite` will not fail because [SPARK-24813](https://issues.apache.org/jira/browse/SPARK-24813) implements a fallback logic. However, it will cause many trials and fallbacks in all builds over `branch-2.3/branch-2.4/master`. We had better fix this issue. ## How was this patch tested? Pass the Jenkins with the updated version. Closes #22587 from dongjoon-hyun/SPARK-25570. Authored-by: Dongjoon Hyun Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1e437835 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1e437835 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1e437835 Branch: refs/heads/master Commit: 1e437835e96c4417117f44c29eba5ebc0112926f Parents: e99ba8d Author: Dongjoon Hyun Authored: Sat Sep 29 11:43:58 2018 +0800 Committer: hyukjinkwon Committed: Sat Sep 29 11:43:58 2018 +0800 -- .../apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1e437835/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala index a7d6972..fd4985d 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala +++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala @@ -206,7 +206,7 @@ class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils { object PROCESS_TABLES extends QueryTest with SQLTestUtils { // Tests the latest version of every release line. - val testingVersions = Seq("2.1.3", "2.2.2", "2.3.1") + val testingVersions = Seq("2.1.3", "2.2.2", "2.3.2") protected var spark: SparkSession = _
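The mirror-fallback behavior this commit leans on (SPARK-24813) can be sketched in Python. This is an illustrative sketch only — the URL layout and the `candidate_urls` helper are assumptions for demonstration, not the suite's actual Scala code:

```python
# Hypothetical sketch of the download fallback: try a regular Apache mirror
# first; if the release has been removed there, fall back to the Apache
# archive, which retains binaries for every released version.
def candidate_urls(version):
    path = "spark/spark-{0}/spark-{0}-bin-hadoop2.7.tgz".format(version)
    return [
        "https://dlcdn.apache.org/" + path,        # preferred mirror (assumed)
        "https://archive.apache.org/dist/" + path,  # archive fallback (assumed)
    ]

for url in candidate_urls("2.3.2"):
    print(url)
```

Pinning `testingVersions` to the latest release of each line, as the patch does, keeps the first URL valid and avoids repeated fallback round-trips in CI.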
spark git commit: [SPARK-25570][SQL][TEST] Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite
Repository: spark Updated Branches: refs/heads/branch-2.4 7614313c9 -> ec2c17abf [SPARK-25570][SQL][TEST] Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? This PR aims to prevent test slowdowns at `HiveExternalCatalogVersionsSuite` by using the latest Apache Spark 2.3.2 link because the Apache mirrors will remove the old Spark 2.3.1 binaries eventually. `HiveExternalCatalogVersionsSuite` will not fail because [SPARK-24813](https://issues.apache.org/jira/browse/SPARK-24813) implements a fallback logic. However, it will cause many trials and fallbacks in all builds over `branch-2.3/branch-2.4/master`. We had better fix this issue. ## How was this patch tested? Pass the Jenkins with the updated version. Closes #22587 from dongjoon-hyun/SPARK-25570. Authored-by: Dongjoon Hyun Signed-off-by: hyukjinkwon (cherry picked from commit 1e437835e96c4417117f44c29eba5ebc0112926f) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ec2c17ab Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ec2c17ab Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ec2c17ab Branch: refs/heads/branch-2.4 Commit: ec2c17abf43d304fab26dde3ae624f553cdbd32e Parents: 7614313 Author: Dongjoon Hyun Authored: Sat Sep 29 11:43:58 2018 +0800 Committer: hyukjinkwon Committed: Sat Sep 29 11:44:12 2018 +0800 -- .../apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ec2c17ab/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala index 25df333..46b66c1 100644 --- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala @@ -203,7 +203,7 @@ class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils { object PROCESS_TABLES extends QueryTest with SQLTestUtils { // Tests the latest version of every release line. - val testingVersions = Seq("2.1.3", "2.2.2", "2.3.1") + val testingVersions = Seq("2.1.3", "2.2.2", "2.3.2") protected var spark: SparkSession = _
spark git commit: [SPARK-25570][SQL][TEST] Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite
Repository: spark Updated Branches: refs/heads/branch-2.3 f13565b6e -> eb78380c0 [SPARK-25570][SQL][TEST] Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? This PR aims to prevent test slowdowns at `HiveExternalCatalogVersionsSuite` by using the latest Apache Spark 2.3.2 link because the Apache mirrors will remove the old Spark 2.3.1 binaries eventually. `HiveExternalCatalogVersionsSuite` will not fail because [SPARK-24813](https://issues.apache.org/jira/browse/SPARK-24813) implements a fallback logic. However, it will cause many trials and fallbacks in all builds over `branch-2.3/branch-2.4/master`. We had better fix this issue. ## How was this patch tested? Pass the Jenkins with the updated version. Closes #22587 from dongjoon-hyun/SPARK-25570. Authored-by: Dongjoon Hyun Signed-off-by: hyukjinkwon (cherry picked from commit 1e437835e96c4417117f44c29eba5ebc0112926f) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/eb78380c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/eb78380c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/eb78380c Branch: refs/heads/branch-2.3 Commit: eb78380c0e1e620e996435a4c08acb652c868795 Parents: f13565b Author: Dongjoon Hyun Authored: Sat Sep 29 11:43:58 2018 +0800 Committer: hyukjinkwon Committed: Sat Sep 29 11:44:27 2018 +0800 -- .../apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/eb78380c/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala index 5103aa8..af15da6 100644 --- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala @@ -203,7 +203,7 @@ class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils { object PROCESS_TABLES extends QueryTest with SQLTestUtils { // Tests the latest version of every release line. - val testingVersions = Seq("2.1.3", "2.2.2", "2.3.1") + val testingVersions = Seq("2.1.3", "2.2.2", "2.3.2") protected var spark: SparkSession = _
spark git commit: [SPARK-25273][DOC] How to install testthat 1.0.2
Repository: spark Updated Branches: refs/heads/master e9fce2a4c -> 3c67cb0b5 [SPARK-25273][DOC] How to install testthat 1.0.2 ## What changes were proposed in this pull request? R tests require `testthat` v1.0.2. In the PR, I described how to install the version in the section http://spark.apache.org/docs/latest/building-spark.html#running-r-tests. Closes #22272 from MaxGekk/r-testthat-doc. Authored-by: Maxim Gekk Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3c67cb0b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3c67cb0b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3c67cb0b Branch: refs/heads/master Commit: 3c67cb0b52c14f1cee1a0aaf74d6d71f28cbb5f2 Parents: e9fce2a Author: Maxim Gekk Authored: Thu Aug 30 20:25:26 2018 +0800 Committer: hyukjinkwon Committed: Thu Aug 30 20:25:26 2018 +0800 -- docs/README.md | 3 ++- docs/building-spark.md | 3 ++- 2 files changed, 4 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3c67cb0b/docs/README.md -- diff --git a/docs/README.md b/docs/README.md index 7da543d..fb67c4b 100644 --- a/docs/README.md +++ b/docs/README.md @@ -22,8 +22,9 @@ $ sudo gem install jekyll jekyll-redirect-from pygments.rb $ sudo pip install Pygments # Following is needed only for generating API docs $ sudo pip install sphinx pypandoc mkdocs -$ sudo Rscript -e 'install.packages(c("knitr", "devtools", "testthat", "rmarkdown"), repos="http://cran.stat.ucla.edu/;)' +$ sudo Rscript -e 'install.packages(c("knitr", "devtools", "rmarkdown"), repos="http://cran.stat.ucla.edu/;)' $ sudo Rscript -e 'devtools::install_version("roxygen2", version = "5.0.1", repos="http://cran.stat.ucla.edu/;)' +$ sudo Rscript -e 'devtools::install_version("testthat", version = "1.0.2", repos="http://cran.stat.ucla.edu/;)' ``` Note: If you are on a system with both Ruby 1.9 and Ruby 2.0 you may need to replace gem with gem2.0. 
http://git-wip-us.apache.org/repos/asf/spark/blob/3c67cb0b/docs/building-spark.md -- diff --git a/docs/building-spark.md b/docs/building-spark.md index d3dfd49..0086aea 100644 --- a/docs/building-spark.md +++ b/docs/building-spark.md @@ -236,7 +236,8 @@ The run-tests script also can be limited to a specific Python version or a speci To run the SparkR tests you will need to install the [knitr](https://cran.r-project.org/package=knitr), [rmarkdown](https://cran.r-project.org/package=rmarkdown), [testthat](https://cran.r-project.org/package=testthat), [e1071](https://cran.r-project.org/package=e1071) and [survival](https://cran.r-project.org/package=survival) packages first: -R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 'survival'), repos='http://cran.us.r-project.org')" +R -e "install.packages(c('knitr', 'rmarkdown', 'devtools', 'e1071', 'survival'), repos='http://cran.us.r-project.org')" +R -e "devtools::install_version('testthat', version = '1.0.2', repos='http://cran.us.r-project.org')" You can run just the SparkR tests using the command: - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25273][DOC] How to install testthat 1.0.2
Repository: spark Updated Branches: refs/heads/branch-2.3 306e881b6 -> b072717b3 [SPARK-25273][DOC] How to install testthat 1.0.2 ## What changes were proposed in this pull request? R tests require `testthat` v1.0.2. In the PR, I described how to install the version in the section http://spark.apache.org/docs/latest/building-spark.html#running-r-tests. Closes #22272 from MaxGekk/r-testthat-doc. Authored-by: Maxim Gekk Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b072717b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b072717b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b072717b Branch: refs/heads/branch-2.3 Commit: b072717b3f6178e728c0bf855aca243c275e58f0 Parents: 306e881 Author: Maxim Gekk Authored: Thu Aug 30 20:25:26 2018 +0800 Committer: hyukjinkwon Committed: Thu Aug 30 20:26:36 2018 +0800 -- docs/README.md | 3 ++- docs/building-spark.md | 3 ++- 2 files changed, 4 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b072717b/docs/README.md -- diff --git a/docs/README.md b/docs/README.md index 166a7e5..174a735 100644 --- a/docs/README.md +++ b/docs/README.md @@ -22,8 +22,9 @@ $ sudo gem install jekyll jekyll-redirect-from pygments.rb $ sudo pip install Pygments # Following is needed only for generating API docs $ sudo pip install sphinx pypandoc mkdocs -$ sudo Rscript -e 'install.packages(c("knitr", "devtools", "testthat", "rmarkdown"), repos="http://cran.stat.ucla.edu/;)' +$ sudo Rscript -e 'install.packages(c("knitr", "devtools", "rmarkdown"), repos="http://cran.stat.ucla.edu/;)' $ sudo Rscript -e 'devtools::install_version("roxygen2", version = "5.0.1", repos="http://cran.stat.ucla.edu/;)' +$ sudo Rscript -e 'devtools::install_version("testthat", version = "1.0.2", repos="http://cran.stat.ucla.edu/;)' ``` Note: If you are on a system with both Ruby 1.9 and Ruby 2.0 you may need to replace gem with gem2.0. 
http://git-wip-us.apache.org/repos/asf/spark/blob/b072717b/docs/building-spark.md -- diff --git a/docs/building-spark.md b/docs/building-spark.md index 9f78c04..cd80835 100644 --- a/docs/building-spark.md +++ b/docs/building-spark.md @@ -232,7 +232,8 @@ The run-tests script also can be limited to a specific Python version or a speci To run the SparkR tests you will need to install the [knitr](https://cran.r-project.org/package=knitr), [rmarkdown](https://cran.r-project.org/package=rmarkdown), [testthat](https://cran.r-project.org/package=testthat), [e1071](https://cran.r-project.org/package=e1071) and [survival](https://cran.r-project.org/package=survival) packages first: -R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 'survival'), repos='http://cran.us.r-project.org')" +R -e "install.packages(c('knitr', 'rmarkdown', 'devtools', 'e1071', 'survival'), repos='http://cran.us.r-project.org')" +R -e "devtools::install_version('testthat', version = '1.0.2', repos='http://cran.us.r-project.org')" You can run just the SparkR tests using the command: - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25471][PYTHON][TEST] Fix pyspark-sql test error when using Python 3.6 and Pandas 0.23
Repository: spark Updated Branches: refs/heads/branch-2.4 a9a8d3a4b -> 99ae693b3 [SPARK-25471][PYTHON][TEST] Fix pyspark-sql test error when using Python 3.6 and Pandas 0.23 ## What changes were proposed in this pull request? Fix test that constructs a Pandas DataFrame by specifying the column order. Previously this test assumed the columns would be sorted alphabetically, however when using Python 3.6 with Pandas 0.23 or higher, the original column order is maintained. This causes the columns to get mixed up and the test errors. Manually tested with `python/run-tests` using Python 3.6.6 and Pandas 0.23.4 Closes #22477 from BryanCutler/pyspark-tests-py36-pd23-SPARK-25471. Authored-by: Bryan Cutler Signed-off-by: hyukjinkwon (cherry picked from commit 90e3955f384ca07bdf24faa6cdb60ded944cf0d8) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/99ae693b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/99ae693b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/99ae693b Branch: refs/heads/branch-2.4 Commit: 99ae693b3722db6e01825b8cf2c3f2ef74a65ddb Parents: a9a8d3a Author: Bryan Cutler Authored: Thu Sep 20 09:29:29 2018 +0800 Committer: hyukjinkwon Committed: Thu Sep 20 09:29:49 2018 +0800 -- python/pyspark/sql/tests.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/99ae693b/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index 08d7cfa..603f994 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -3266,7 +3266,7 @@ class SQLTests(ReusedSQLTestCase): import pandas as pd from datetime import datetime pdf = pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)], -"d": [pd.Timestamp.now().date()]}) +"d": [pd.Timestamp.now().date()]}, columns=["d", "ts"]) # test types are inferred correctly without specifying schema df = 
self.spark.createDataFrame(pdf) self.assertTrue(isinstance(df.schema['ts'].dataType, TimestampType))
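The root cause described above — dict insertion order now surviving into the DataFrame's column order — can be seen with plain Python dictionaries; pandas itself is not needed to observe the ordering change (a minimal sketch with illustrative values):

```python
# On CPython 3.6+ dict keys keep insertion order, so a DataFrame built from
# this dict gets columns in the written order ("ts", "d") rather than the
# alphabetical order ("d", "ts") that the old test implicitly assumed.
data = {"ts": "2017-10-31 01:01:01", "d": "2018-09-20"}

insertion_order = list(data)   # what Pandas >= 0.23 on Python 3.6 preserves
alphabetical = sorted(data)    # what older Pandas produced

print(insertion_order)  # ['ts', 'd']
print(alphabetical)     # ['d', 'ts']
```

Passing `columns=["d", "ts"]` explicitly, as the patch does, removes the dependence on either behavior.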
spark git commit: [SPARK-25471][PYTHON][TEST] Fix pyspark-sql test error when using Python 3.6 and Pandas 0.23
Repository: spark Updated Branches: refs/heads/master 6f681d429 -> 90e3955f3 [SPARK-25471][PYTHON][TEST] Fix pyspark-sql test error when using Python 3.6 and Pandas 0.23 ## What changes were proposed in this pull request? Fix test that constructs a Pandas DataFrame by specifying the column order. Previously this test assumed the columns would be sorted alphabetically, however when using Python 3.6 with Pandas 0.23 or higher, the original column order is maintained. This causes the columns to get mixed up and the test errors. Manually tested with `python/run-tests` using Python 3.6.6 and Pandas 0.23.4 Closes #22477 from BryanCutler/pyspark-tests-py36-pd23-SPARK-25471. Authored-by: Bryan Cutler Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/90e3955f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/90e3955f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/90e3955f Branch: refs/heads/master Commit: 90e3955f384ca07bdf24faa6cdb60ded944cf0d8 Parents: 6f681d4 Author: Bryan Cutler Authored: Thu Sep 20 09:29:29 2018 +0800 Committer: hyukjinkwon Committed: Thu Sep 20 09:29:29 2018 +0800 -- python/pyspark/sql/tests.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/90e3955f/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index 08d7cfa..603f994 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -3266,7 +3266,7 @@ class SQLTests(ReusedSQLTestCase): import pandas as pd from datetime import datetime pdf = pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)], -"d": [pd.Timestamp.now().date()]}) +"d": [pd.Timestamp.now().date()]}, columns=["d", "ts"]) # test types are inferred correctly without specifying schema df = self.spark.createDataFrame(pdf) self.assertTrue(isinstance(df.schema['ts'].dataType, TimestampType)) - To unsubscribe, 
spark git commit: [SPARK-25471][PYTHON][TEST] Fix pyspark-sql test error when using Python 3.6 and Pandas 0.23
Repository: spark Updated Branches: refs/heads/branch-2.3 7b5da37c0 -> e319a624e [SPARK-25471][PYTHON][TEST] Fix pyspark-sql test error when using Python 3.6 and Pandas 0.23 ## What changes were proposed in this pull request? Fix test that constructs a Pandas DataFrame by specifying the column order. Previously this test assumed the columns would be sorted alphabetically, however when using Python 3.6 with Pandas 0.23 or higher, the original column order is maintained. This causes the columns to get mixed up and the test errors. Manually tested with `python/run-tests` using Python 3.6.6 and Pandas 0.23.4 Closes #22477 from BryanCutler/pyspark-tests-py36-pd23-SPARK-25471. Authored-by: Bryan Cutler Signed-off-by: hyukjinkwon (cherry picked from commit 90e3955f384ca07bdf24faa6cdb60ded944cf0d8) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e319a624 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e319a624 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e319a624 Branch: refs/heads/branch-2.3 Commit: e319a624e2f366a941bd92a685e1b48504c887b1 Parents: 7b5da37 Author: Bryan Cutler Authored: Thu Sep 20 09:29:29 2018 +0800 Committer: hyukjinkwon Committed: Thu Sep 20 09:30:06 2018 +0800 -- python/pyspark/sql/tests.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e319a624/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index 6bfb329..3c5fc97 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -2885,7 +2885,7 @@ class SQLTests(ReusedSQLTestCase): import pandas as pd from datetime import datetime pdf = pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)], -"d": [pd.Timestamp.now().date()]}) +"d": [pd.Timestamp.now().date()]}, columns=["d", "ts"]) # test types are inferred correctly without specifying schema df = 
self.spark.createDataFrame(pdf) self.assertTrue(isinstance(df.schema['ts'].dataType, TimestampType))
spark git commit: [MINOR][PYTHON][TEST] Use collect() instead of show() to make the output silent
Repository: spark Updated Branches: refs/heads/master 0e31a6f25 -> 7ff5386ed [MINOR][PYTHON][TEST] Use collect() instead of show() to make the output silent ## What changes were proposed in this pull request? This PR replace an effective `show()` to `collect()` to make the output silent. **Before:** ``` test_simple_udt_in_df (pyspark.sql.tests.SQLTests) ... +---+--+ |key| val| +---+--+ | 0|[0.0, 0.0]| | 1|[1.0, 1.0]| | 2|[2.0, 2.0]| | 0|[3.0, 3.0]| | 1|[4.0, 4.0]| | 2|[5.0, 5.0]| | 0|[6.0, 6.0]| | 1|[7.0, 7.0]| | 2|[8.0, 8.0]| | 0|[9.0, 9.0]| +---+--+ ``` **After:** ``` test_simple_udt_in_df (pyspark.sql.tests.SQLTests) ... ok ``` ## How was this patch tested? Manually tested. Closes #22479 from HyukjinKwon/minor-udf-test. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7ff5386e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7ff5386e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7ff5386e Branch: refs/heads/master Commit: 7ff5386ed934190344b2cda1069bde4bc68a3e63 Parents: 0e31a6f Author: hyukjinkwon Authored: Thu Sep 20 15:03:16 2018 +0800 Committer: hyukjinkwon Committed: Thu Sep 20 15:03:16 2018 +0800 -- python/pyspark/sql/tests.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7ff5386e/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index 603f994..8724bbc 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -1168,7 +1168,7 @@ class SQLTests(ReusedSQLTestCase): df = self.spark.createDataFrame( [(i % 3, PythonOnlyPoint(float(i), float(i))) for i in range(10)], schema=schema) -df.show() +df.collect() def test_nested_udt_in_df(self): schema = StructType().add("key", LongType()).add("val", ArrayType(PythonOnlyUDT())) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional 
spark git commit: [MINOR][PYTHON][TEST] Use collect() instead of show() to make the output silent
Repository: spark Updated Branches: refs/heads/branch-2.4 dfcff3839 -> e07042a35 [MINOR][PYTHON][TEST] Use collect() instead of show() to make the output silent ## What changes were proposed in this pull request? This PR replace an effective `show()` to `collect()` to make the output silent. **Before:** ``` test_simple_udt_in_df (pyspark.sql.tests.SQLTests) ... +---+--+ |key| val| +---+--+ | 0|[0.0, 0.0]| | 1|[1.0, 1.0]| | 2|[2.0, 2.0]| | 0|[3.0, 3.0]| | 1|[4.0, 4.0]| | 2|[5.0, 5.0]| | 0|[6.0, 6.0]| | 1|[7.0, 7.0]| | 2|[8.0, 8.0]| | 0|[9.0, 9.0]| +---+--+ ``` **After:** ``` test_simple_udt_in_df (pyspark.sql.tests.SQLTests) ... ok ``` ## How was this patch tested? Manually tested. Closes #22479 from HyukjinKwon/minor-udf-test. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon (cherry picked from commit 7ff5386ed934190344b2cda1069bde4bc68a3e63) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e07042a3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e07042a3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e07042a3 Branch: refs/heads/branch-2.4 Commit: e07042a3593199f5045e1476b6b324f7f0901143 Parents: dfcff38 Author: hyukjinkwon Authored: Thu Sep 20 15:03:16 2018 +0800 Committer: hyukjinkwon Committed: Thu Sep 20 15:03:34 2018 +0800 -- python/pyspark/sql/tests.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e07042a3/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index 603f994..8724bbc 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -1168,7 +1168,7 @@ class SQLTests(ReusedSQLTestCase): df = self.spark.createDataFrame( [(i % 3, PythonOnlyPoint(float(i), float(i))) for i in range(10)], schema=schema) -df.show() +df.collect() def test_nested_udt_in_df(self): schema = StructType().add("key", LongType()).add("val", 
ArrayType(PythonOnlyUDT()))
spark git commit: [SPARKR] Match pyspark features in SparkR communication protocol
Repository: spark Updated Branches: refs/heads/branch-2.4 c64e7506d -> 36e7c8fcc [SPARKR] Match pyspark features in SparkR communication protocol Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/36e7c8fc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/36e7c8fc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/36e7c8fc Branch: refs/heads/branch-2.4 Commit: 36e7c8fcc1aeff0b15deb1243bd9615a202d320f Parents: c64e750 Author: hyukjinkwon Authored: Mon Sep 24 19:25:02 2018 +0800 Committer: hyukjinkwon Committed: Mon Sep 24 19:28:31 2018 +0800 -- R/pkg/R/context.R | 43 ++-- R/pkg/tests/fulltests/test_Serde.R | 32 +++ R/pkg/tests/fulltests/test_sparkSQL.R | 12 -- .../scala/org/apache/spark/api/r/RRDD.scala | 33 ++- .../scala/org/apache/spark/api/r/RUtils.scala | 4 ++ 5 files changed, 98 insertions(+), 26 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/36e7c8fc/R/pkg/R/context.R -- diff --git a/R/pkg/R/context.R b/R/pkg/R/context.R index f168ca7..e991367 100644 --- a/R/pkg/R/context.R +++ b/R/pkg/R/context.R @@ -167,18 +167,30 @@ parallelize <- function(sc, coll, numSlices = 1) { # 2-tuples of raws serializedSlices <- lapply(slices, serialize, connection = NULL) - # The PRC backend cannot handle arguments larger than 2GB (INT_MAX) + # The RPC backend cannot handle arguments larger than 2GB (INT_MAX) # If serialized data is safely less than that threshold we send it over the PRC channel. 
# Otherwise, we write it to a file and send the file name if (objectSize < sizeLimit) { jrdd <- callJStatic("org.apache.spark.api.r.RRDD", "createRDDFromArray", sc, serializedSlices) } else { -fileName <- writeToTempFile(serializedSlices) -jrdd <- tryCatch(callJStatic( -"org.apache.spark.api.r.RRDD", "createRDDFromFile", sc, fileName, as.integer(numSlices)), - finally = { -file.remove(fileName) -}) +if (callJStatic("org.apache.spark.api.r.RUtils", "getEncryptionEnabled", sc)) { + # the length of slices here is the parallelism to use in the jvm's sc.parallelize() + parallelism <- as.integer(numSlices) + jserver <- newJObject("org.apache.spark.api.r.RParallelizeServer", sc, parallelism) + authSecret <- callJMethod(jserver, "secret") + port <- callJMethod(jserver, "port") + conn <- socketConnection(port = port, blocking = TRUE, open = "wb", timeout = 1500) + doServerAuth(conn, authSecret) + writeToConnection(serializedSlices, conn) + jrdd <- callJMethod(jserver, "getResult") +} else { + fileName <- writeToTempFile(serializedSlices) + jrdd <- tryCatch(callJStatic( + "org.apache.spark.api.r.RRDD", "createRDDFromFile", sc, fileName, as.integer(numSlices)), +finally = { + file.remove(fileName) + }) +} } RDD(jrdd, "byte") @@ -194,14 +206,21 @@ getMaxAllocationLimit <- function(sc) { )) } +writeToConnection <- function(serializedSlices, conn) { + tryCatch({ +for (slice in serializedSlices) { + writeBin(as.integer(length(slice)), conn, endian = "big") + writeBin(slice, conn, endian = "big") +} + }, finally = { +close(conn) + }) +} + writeToTempFile <- function(serializedSlices) { fileName <- tempfile() conn <- file(fileName, "wb") - for (slice in serializedSlices) { -writeBin(as.integer(length(slice)), conn, endian = "big") -writeBin(slice, conn, endian = "big") - } - close(conn) + writeToConnection(serializedSlices, conn) fileName } http://git-wip-us.apache.org/repos/asf/spark/blob/36e7c8fc/R/pkg/tests/fulltests/test_Serde.R -- diff --git 
a/R/pkg/tests/fulltests/test_Serde.R b/R/pkg/tests/fulltests/test_Serde.R index 3577929..1525bdb 100644 --- a/R/pkg/tests/fulltests/test_Serde.R +++ b/R/pkg/tests/fulltests/test_Serde.R @@ -124,3 +124,35 @@ test_that("SerDe of list of lists", { }) sparkR.session.stop() + +# Note that this test should be at the end of tests since the configruations used here are not +# specific to sessions, and the Spark context is restarted. +test_that("createDataFrame large objects", { + for (encryptionEnabled in list("true", "false")) { +# To simulate a large object scenario, we set spark.r.maxAllocationLimit to a smaller value +conf <- list(spark.r.maxAllocationLimit = "100", + spark.io.encryption.enabled = encryptionEnabled) + +suppressWarnings(sparkR.session(master = sparkRTestMaster, +sparkConfig = conf, +
spark git commit: [SPARKR] Match pyspark features in SparkR communication protocol
Repository: spark Updated Branches: refs/heads/master c79072aaf -> c3b4a94a9 [SPARKR] Match pyspark features in SparkR communication protocol Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c3b4a94a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c3b4a94a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c3b4a94a Branch: refs/heads/master Commit: c3b4a94a91d66c172cf332321d3a78dba29ef8f0 Parents: c79072a Author: hyukjinkwon Authored: Mon Sep 24 19:25:02 2018 +0800 Committer: hyukjinkwon Committed: Mon Sep 24 19:25:02 2018 +0800 -- R/pkg/R/context.R | 43 ++-- R/pkg/tests/fulltests/test_Serde.R | 32 +++ R/pkg/tests/fulltests/test_sparkSQL.R | 12 -- .../scala/org/apache/spark/api/r/RRDD.scala | 33 ++- .../scala/org/apache/spark/api/r/RUtils.scala | 4 ++ 5 files changed, 98 insertions(+), 26 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c3b4a94a/R/pkg/R/context.R -- diff --git a/R/pkg/R/context.R b/R/pkg/R/context.R index f168ca7..e991367 100644 --- a/R/pkg/R/context.R +++ b/R/pkg/R/context.R @@ -167,18 +167,30 @@ parallelize <- function(sc, coll, numSlices = 1) { # 2-tuples of raws serializedSlices <- lapply(slices, serialize, connection = NULL) - # The PRC backend cannot handle arguments larger than 2GB (INT_MAX) + # The RPC backend cannot handle arguments larger than 2GB (INT_MAX) # If serialized data is safely less than that threshold we send it over the PRC channel. 
# Otherwise, we write it to a file and send the file name if (objectSize < sizeLimit) { jrdd <- callJStatic("org.apache.spark.api.r.RRDD", "createRDDFromArray", sc, serializedSlices) } else { -fileName <- writeToTempFile(serializedSlices) -jrdd <- tryCatch(callJStatic( -"org.apache.spark.api.r.RRDD", "createRDDFromFile", sc, fileName, as.integer(numSlices)), - finally = { -file.remove(fileName) -}) +if (callJStatic("org.apache.spark.api.r.RUtils", "getEncryptionEnabled", sc)) { + # the length of slices here is the parallelism to use in the jvm's sc.parallelize() + parallelism <- as.integer(numSlices) + jserver <- newJObject("org.apache.spark.api.r.RParallelizeServer", sc, parallelism) + authSecret <- callJMethod(jserver, "secret") + port <- callJMethod(jserver, "port") + conn <- socketConnection(port = port, blocking = TRUE, open = "wb", timeout = 1500) + doServerAuth(conn, authSecret) + writeToConnection(serializedSlices, conn) + jrdd <- callJMethod(jserver, "getResult") +} else { + fileName <- writeToTempFile(serializedSlices) + jrdd <- tryCatch(callJStatic( + "org.apache.spark.api.r.RRDD", "createRDDFromFile", sc, fileName, as.integer(numSlices)), +finally = { + file.remove(fileName) + }) +} } RDD(jrdd, "byte") @@ -194,14 +206,21 @@ getMaxAllocationLimit <- function(sc) { )) } +writeToConnection <- function(serializedSlices, conn) { + tryCatch({ +for (slice in serializedSlices) { + writeBin(as.integer(length(slice)), conn, endian = "big") + writeBin(slice, conn, endian = "big") +} + }, finally = { +close(conn) + }) +} + writeToTempFile <- function(serializedSlices) { fileName <- tempfile() conn <- file(fileName, "wb") - for (slice in serializedSlices) { -writeBin(as.integer(length(slice)), conn, endian = "big") -writeBin(slice, conn, endian = "big") - } - close(conn) + writeToConnection(serializedSlices, conn) fileName } http://git-wip-us.apache.org/repos/asf/spark/blob/c3b4a94a/R/pkg/tests/fulltests/test_Serde.R -- diff --git 
a/R/pkg/tests/fulltests/test_Serde.R b/R/pkg/tests/fulltests/test_Serde.R index 3577929..1525bdb 100644 --- a/R/pkg/tests/fulltests/test_Serde.R +++ b/R/pkg/tests/fulltests/test_Serde.R @@ -124,3 +124,35 @@ test_that("SerDe of list of lists", { }) sparkR.session.stop() + +# Note that this test should be at the end of tests since the configurations used here are not +# specific to sessions, and the Spark context is restarted. +test_that("createDataFrame large objects", { + for (encryptionEnabled in list("true", "false")) { +# To simulate a large object scenario, we set spark.r.maxAllocationLimit to a smaller value +conf <- list(spark.r.maxAllocationLimit = "100", + spark.io.encryption.enabled = encryptionEnabled) + +suppressWarnings(sparkR.session(master = sparkRTestMaster, +sparkConfig = conf, +
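The `writeToConnection` helper added in the SparkR patch above uses a simple wire format: each serialized slice is preceded by its length as a 4-byte big-endian integer (`writeBin(..., endian = "big")`). A minimal Python sketch of the same framing, useful for seeing what the JVM side has to parse; the function name `frame_slices` is illustrative and not part of either codebase:

```python
import struct

def frame_slices(slices):
    """Length-prefix each serialized slice with a 4-byte big-endian int,
    mirroring the wire format written by SparkR's writeToConnection."""
    out = bytearray()
    for s in slices:
        out += struct.pack(">i", len(s))  # 4-byte big-endian length header
        out += s                          # raw slice bytes
    return bytes(out)

framed = frame_slices([b"abc", b"defgh"])
```

With this framing, the receiver can read a fixed 4-byte header, then exactly that many payload bytes, and repeat until the connection closes.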
spark git commit: [MINOR][PYSPARK] Always Close the tempFile in _serialize_to_jvm
Repository: spark Updated Branches: refs/heads/branch-2.4 1303eb5c8 -> c64e7506d [MINOR][PYSPARK] Always Close the tempFile in _serialize_to_jvm ## What changes were proposed in this pull request? Always close the tempFile after `serializer.dump_stream(data, tempFile)` in _serialize_to_jvm ## How was this patch tested? N/A Closes #22523 from gatorsmile/fixMinor. Authored-by: gatorsmile Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c64e7506 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c64e7506 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c64e7506 Branch: refs/heads/branch-2.4 Commit: c64e7506dabaccc60f8140aeae589053645f23a6 Parents: 1303eb5 Author: gatorsmile Authored: Sun Sep 23 10:16:33 2018 +0800 Committer: hyukjinkwon Committed: Sun Sep 23 10:18:00 2018 +0800 -- python/pyspark/context.py | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c64e7506/python/pyspark/context.py -- diff --git a/python/pyspark/context.py b/python/pyspark/context.py index 87255c4..0924d3d 100644 --- a/python/pyspark/context.py +++ b/python/pyspark/context.py @@ -537,8 +537,10 @@ class SparkContext(object): # parallelize from there. tempFile = NamedTemporaryFile(delete=False, dir=self._temp_dir) try: -serializer.dump_stream(data, tempFile) -tempFile.close() +try: +serializer.dump_stream(data, tempFile) +finally: +tempFile.close() return reader_func(tempFile.name) finally: # we eagerily reads the file so we can delete right after. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
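The patch above nests a second `try/finally` inside the existing one so that the temp file handle is closed even when `dump_stream` raises. A self-contained sketch of that pattern, with hypothetical `dump`/`read` callables standing in for the serializer and `reader_func`:

```python
import os
import tempfile

def serialize_via_temp_file(data, dump, read):
    """Sketch of the corrected pattern in SparkContext._serialize_to_jvm:
    the inner try/finally closes the temp file even if dump() raises,
    and the outer finally deletes the file in every case."""
    temp = tempfile.NamedTemporaryFile(delete=False)
    try:
        try:
            dump(data, temp)   # may raise; the file handle is still closed
        finally:
            temp.close()
        return read(temp.name)
    finally:
        # the reader consumes the file eagerly, so delete it right after
        os.unlink(temp.name)

result = serialize_via_temp_file(
    b"hello",
    lambda d, fh: fh.write(d),
    lambda name: open(name, "rb").read(),
)
```

Without the inner `finally`, an exception in `dump` would skip `temp.close()` and leak the open handle until garbage collection, which is exactly what this minor fix prevents.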
spark git commit: [SPARK-25473][PYTHON][SS][TEST] ForeachWriter tests failed on Python 3.6 and macOS High Sierra
Repository: spark Updated Branches: refs/heads/master 0fbba76fa -> a72d118cd [SPARK-25473][PYTHON][SS][TEST] ForeachWriter tests failed on Python 3.6 and macOS High Sierra ## What changes were proposed in this pull request? This PR does not fix the problem itself; it just targets adding a few comments about running PySpark tests on Python 3.6 and macOS High Sierra, since the problem actually blocks running tests in this environment. It does not target fixing the problem yet. The problem here looks to be because we fork Python workers and the forked workers somehow call Objective-C libraries in some code in CPython's implementation. After debugging a while, I suspect `pickle` in Python 3.6 has some changes: https://github.com/apache/spark/blob/58419b92673c46911c25bc6c6b13397f880c6424/python/pyspark/serializers.py#L577 In particular, it also looks related to which objects are serialized or not. This link (http://sealiesoftware.com/blog/archive/2017/6/5/Objective-C_and_fork_in_macOS_1013.html) and this link (https://blog.phusion.nl/2017/10/13/why-ruby-app-servers-break-on-macos-high-sierra-and-what-can-be-done-about-it/) were helpful for me to understand this. I am still debugging this, but my gut says it is difficult to fix or work around on the Spark side. ## How was this patch tested? Manually tested: Before `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`: ``` /usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py:766: ResourceWarning: subprocess 27563 is still running ResourceWarning, source=self) [Stage 0:> (0 + 1) / 1]objc[27586]: +[__NSPlaceholderDictionary initialize] may have been in progress in another thread when fork() was called. objc[27586]: +[__NSPlaceholderDictionary initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug. 
ERROR == ERROR: test_streaming_foreach_with_simple_function (pyspark.sql.tests.SQLTests) -- Traceback (most recent call last): File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value format(target_id, ".", name), value) py4j.protocol.Py4JJavaError: An error occurred while calling o54.processAllAvailable. : org.apache.spark.sql.streaming.StreamingQueryException: Writing job aborted. === Streaming Query === Identifier: [id = f508d634-407c-4232-806b-70e54b055c42, runId = 08d1435b-5358-4fb6-b167-811584a3163e] Current Committed Offsets: {} Current Available Offsets: {FileStreamSource[file:/var/folders/71/484zt4z10ks1vydt03bhp6hrgp/T/tmpolebys1s]: {"logOffset":0}} Current State: ACTIVE Thread State: RUNNABLE Logical Plan: FileStreamSource[file:/var/folders/71/484zt4z10ks1vydt03bhp6hrgp/T/tmpolebys1s] at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) Caused by: org.apache.spark.SparkException: Writing job aborted. at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2Exec.scala:91) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) ``` After `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`: ``` test_streaming_foreach_with_simple_function (pyspark.sql.tests.SQLTests) ... ok ``` Closes #22480 from HyukjinKwon/SPARK-25473. 
Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a72d118c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a72d118c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a72d118c Branch: refs/heads/master Commit: a72d118cd96cd44d37cb8f8b6c444953a99aab3f Parents: 0fbba76 Author: hyukjinkwon Authored: Sun Sep 23 11:14:27 2018 +0800 Committer: hyukjinkwon Committed: Sun Sep 23 11:14:27 2018 +0800 -- python/pyspark/sql/tests.py | 3 +++ 1 file changed, 3 insertions(+) --
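As the test output above shows, exporting `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES` before launching the forking PySpark tests works around the macOS High Sierra fork-safety abort. A minimal invocation sketch; the `run-tests` command line is illustrative and assumes a Spark source checkout:

```shell
# Disable the Objective-C fork-safety check that crashes forked Python
# workers on macOS High Sierra and later.
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

# Illustrative test invocation (requires a Spark checkout):
# python/run-tests --modules=pyspark-sql

echo "$OBJC_DISABLE_INITIALIZE_FORK_SAFETY"
```

Note this only suppresses the safety check; the underlying fork-plus-Objective-C interaction the commit message describes is not changed.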
spark git commit: [MINOR][PYSPARK] Always Close the tempFile in _serialize_to_jvm
Repository: spark Updated Branches: refs/heads/master 6ca87eb2e -> 0fbba76fa [MINOR][PYSPARK] Always Close the tempFile in _serialize_to_jvm ## What changes were proposed in this pull request? Always close the tempFile after `serializer.dump_stream(data, tempFile)` in _serialize_to_jvm ## How was this patch tested? N/A Closes #22523 from gatorsmile/fixMinor. Authored-by: gatorsmile Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0fbba76f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0fbba76f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0fbba76f Branch: refs/heads/master Commit: 0fbba76faa00a18eef5d8c2ef2e673744d0d490b Parents: 6ca87eb Author: gatorsmile Authored: Sun Sep 23 10:16:33 2018 +0800 Committer: hyukjinkwon Committed: Sun Sep 23 10:16:33 2018 +0800 -- python/pyspark/context.py | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0fbba76f/python/pyspark/context.py -- diff --git a/python/pyspark/context.py b/python/pyspark/context.py index 87255c4..0924d3d 100644 --- a/python/pyspark/context.py +++ b/python/pyspark/context.py @@ -537,8 +537,10 @@ class SparkContext(object): # parallelize from there. tempFile = NamedTemporaryFile(delete=False, dir=self._temp_dir) try: -serializer.dump_stream(data, tempFile) -tempFile.close() +try: +serializer.dump_stream(data, tempFile) +finally: +tempFile.close() return reader_func(tempFile.name) finally: # we eagerily reads the file so we can delete right after. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org