spark git commit: [SPARK-25591][PYSPARK][SQL] Avoid overwriting deserialized accumulator
Repository: spark
Updated Branches:
  refs/heads/master 219922422 -> cb90617f8

[SPARK-25591][PYSPARK][SQL] Avoid overwriting deserialized accumulator

## What changes were proposed in this pull request?

If we use accumulators in more than one UDF, it is possible to overwrite deserialized accumulators and their values. We should check whether an accumulator was already deserialized before overwriting it in the accumulator registry.

## How was this patch tested?

Added test.

Closes #22635 from viirya/SPARK-25591.

Authored-by: Liang-Chi Hsieh
Signed-off-by: hyukjinkwon

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cb90617f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cb90617f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cb90617f

Branch: refs/heads/master
Commit: cb90617f894fd51a092710271823ec7d1cd3a668
Parents: 2199224
Author: Liang-Chi Hsieh
Authored: Mon Oct 8 15:18:08 2018 +0800
Committer: hyukjinkwon
Committed: Mon Oct 8 15:18:08 2018 +0800

----------------------------------------------------------------------
 python/pyspark/accumulators.py | 12 ++++++++----
 python/pyspark/sql/tests.py    | 25 +++++++++++++++++++++++++
 2 files changed, 33 insertions(+), 4 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/cb90617f/python/pyspark/accumulators.py

diff --git a/python/pyspark/accumulators.py b/python/pyspark/accumulators.py
index 30ad042..00ec094 100644
--- a/python/pyspark/accumulators.py
+++ b/python/pyspark/accumulators.py
@@ -109,10 +109,14 @@ _accumulatorRegistry = {}

 def _deserialize_accumulator(aid, zero_value, accum_param):
     from pyspark.accumulators import _accumulatorRegistry
-    accum = Accumulator(aid, zero_value, accum_param)
-    accum._deserialized = True
-    _accumulatorRegistry[aid] = accum
-    return accum
+    # If this certain accumulator was deserialized, don't overwrite it.
+    if aid in _accumulatorRegistry:
+        return _accumulatorRegistry[aid]
+    else:
+        accum = Accumulator(aid, zero_value, accum_param)
+        accum._deserialized = True
+        _accumulatorRegistry[aid] = accum
+        return accum


 class Accumulator(object):

http://git-wip-us.apache.org/repos/asf/spark/blob/cb90617f/python/pyspark/sql/tests.py

diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index d3c29d0..ac87ccd 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -3603,6 +3603,31 @@ class SQLTests(ReusedSQLTestCase):
         self.assertEquals(None, df._repr_html_())
         self.assertEquals(expected, df.__repr__())

+    # SPARK-25591
+    def test_same_accumulator_in_udfs(self):
+        from pyspark.sql.functions import udf
+
+        data_schema = StructType([StructField("a", IntegerType(), True),
+                                  StructField("b", IntegerType(), True)])
+        data = self.spark.createDataFrame([[1, 2]], schema=data_schema)
+
+        test_accum = self.sc.accumulator(0)
+
+        def first_udf(x):
+            test_accum.add(1)
+            return x
+
+        def second_udf(x):
+            test_accum.add(100)
+            return x
+
+        func_udf = udf(first_udf, IntegerType())
+        func_udf2 = udf(second_udf, IntegerType())
+        data = data.withColumn("out1", func_udf(data["a"]))
+        data = data.withColumn("out2", func_udf2(data["b"]))
+        data.collect()
+        self.assertEqual(test_accum.value, 101)
+

 class HiveSparkSubmitTests(SparkSubmitTests):

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
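The essence of the fix above — make deserialization idempotent by consulting a process-wide registry before creating a new accumulator — can be illustrated outside Spark with a plain dictionary. This is a hedged sketch, not PySpark's actual `Accumulator` class; the `FakeAccumulator` name and its methods are invented for the demo.

```python
# Sketch of the SPARK-25591 fix: a process-wide registry consulted before
# creating a new deserialized accumulator, so two UDFs referring to the
# same accumulator id share one object (and therefore one running value).

_registry = {}  # stand-in for pyspark.accumulators._accumulatorRegistry


class FakeAccumulator(object):
    """Hypothetical stand-in for pyspark.accumulators.Accumulator."""

    def __init__(self, aid, zero_value):
        self.aid = aid
        self.value = zero_value
        self._deserialized = False

    def add(self, term):
        self.value += term


def deserialize_accumulator(aid, zero_value):
    # If this accumulator was already deserialized, don't overwrite it:
    # returning the existing object preserves updates made by earlier UDFs.
    if aid in _registry:
        return _registry[aid]
    accum = FakeAccumulator(aid, zero_value)
    accum._deserialized = True
    _registry[aid] = accum
    return accum


# Two "UDFs" deserializing the same accumulator id now share one object;
# before the fix the second deserialization would reset the value to zero.
first = deserialize_accumulator(0, 0)
first.add(1)
second = deserialize_accumulator(0, 0)
second.add(100)
print(first is second, first.value)  # True 101
```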
spark git commit: [SPARK-25673][BUILD] Remove Travis CI which enables Java lint check
Repository: spark
Updated Branches:
  refs/heads/branch-2.4 c8b94099a -> 4214ddd34

[SPARK-25673][BUILD] Remove Travis CI which enables Java lint check

## What changes were proposed in this pull request?

https://github.com/apache/spark/pull/12980 added a Travis CI file, mainly for the linter, because the Java lint check was disabled in Jenkins. It has been enabled as of https://github.com/apache/spark/pull/21399 and SBT now runs it, so it looks like we can remove the file added before.

## How was this patch tested?

N/A

Closes #22665
Closes #22667 from HyukjinKwon/SPARK-25673.

Authored-by: hyukjinkwon
Signed-off-by: hyukjinkwon
(cherry picked from commit 219922422003e59cc8b3bece60778536759fa669)
Signed-off-by: hyukjinkwon

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4214ddd3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4214ddd3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4214ddd3

Branch: refs/heads/branch-2.4
Commit: 4214ddd34514351a58cf6a0254f33c6d5c8fd924
Parents: c8b9409
Author: hyukjinkwon
Authored: Mon Oct 8 15:07:06 2018 +0800
Committer: hyukjinkwon
Committed: Mon Oct 8 15:07:35 2018 +0800

----------------------------------------------------------------------
 .travis.yml | 50 --------------------------------------------------
 1 file changed, 50 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/4214ddd3/.travis.yml

diff --git a/.travis.yml b/.travis.yml
deleted file mode 100644
index 05b94ade..0000000
--- a/.travis.yml
+++ /dev/null
@@ -1,50 +0,0 @@
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements.  See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License.  You may obtain a copy of the License at
-#
-#    http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-# Spark provides this Travis CI configuration file to help contributors
-# check Scala/Java style conformance and JDK7/8 compilation easily
-# during their preparing pull requests.
-#   - Scalastyle is executed during `maven install` implicitly.
-#   - Java Checkstyle is executed by `lint-java`.
-# See the related discussion here.
-# https://github.com/apache/spark/pull/12980
-
-# 1. Choose OS (Ubuntu 14.04.3 LTS Server Edition 64bit, ~2 CORE, 7.5GB RAM)
-sudo: required
-dist: trusty
-
-# 2. Choose language and target JDKs for parallel builds.
-language: java
-jdk:
-  - oraclejdk8
-
-# 3. Setup cache directory for SBT and Maven.
-cache:
-  directories:
-  - $HOME/.sbt
-  - $HOME/.m2
-
-# 4. Turn off notifications.
-notifications:
-  email: false
-
-# 5. Run maven install before running lint-java.
-install:
-  - export MAVEN_SKIP_RC=1
-  - build/mvn -T 4 -q -DskipTests -Pkubernetes -Pmesos -Pyarn -Pkinesis-asl -Phive -Phive-thriftserver install
-
-# 6. Run lint-java.
-script:
-  - dev/lint-java
spark git commit: [SPARK-25673][BUILD] Remove Travis CI which enables Java lint check
Repository: spark
Updated Branches:
  refs/heads/master ebd899b8a -> 219922422

[SPARK-25673][BUILD] Remove Travis CI which enables Java lint check

## What changes were proposed in this pull request?

https://github.com/apache/spark/pull/12980 added a Travis CI file, mainly for the linter, because the Java lint check was disabled in Jenkins. It has been enabled as of https://github.com/apache/spark/pull/21399 and SBT now runs it, so it looks like we can remove the file added before.

## How was this patch tested?

N/A

Closes #22665
Closes #22667 from HyukjinKwon/SPARK-25673.

Authored-by: hyukjinkwon
Signed-off-by: hyukjinkwon

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/21992242
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/21992242
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/21992242

Branch: refs/heads/master
Commit: 219922422003e59cc8b3bece60778536759fa669
Parents: ebd899b
Author: hyukjinkwon
Authored: Mon Oct 8 15:07:06 2018 +0800
Committer: hyukjinkwon
Committed: Mon Oct 8 15:07:06 2018 +0800

----------------------------------------------------------------------
 .travis.yml | 50 --------------------------------------------------
 1 file changed, 50 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/21992242/.travis.yml

diff --git a/.travis.yml b/.travis.yml
deleted file mode 100644
index 05b94ade..0000000
--- a/.travis.yml
+++ /dev/null
@@ -1,50 +0,0 @@
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements.  See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License.  You may obtain a copy of the License at
-#
-#    http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-# Spark provides this Travis CI configuration file to help contributors
-# check Scala/Java style conformance and JDK7/8 compilation easily
-# during their preparing pull requests.
-#   - Scalastyle is executed during `maven install` implicitly.
-#   - Java Checkstyle is executed by `lint-java`.
-# See the related discussion here.
-# https://github.com/apache/spark/pull/12980
-
-# 1. Choose OS (Ubuntu 14.04.3 LTS Server Edition 64bit, ~2 CORE, 7.5GB RAM)
-sudo: required
-dist: trusty
-
-# 2. Choose language and target JDKs for parallel builds.
-language: java
-jdk:
-  - oraclejdk8
-
-# 3. Setup cache directory for SBT and Maven.
-cache:
-  directories:
-  - $HOME/.sbt
-  - $HOME/.m2
-
-# 4. Turn off notifications.
-notifications:
-  email: false
-
-# 5. Run maven install before running lint-java.
-install:
-  - export MAVEN_SKIP_RC=1
-  - build/mvn -T 4 -q -DskipTests -Pkubernetes -Pmesos -Pyarn -Pkinesis-asl -Phive -Phive-thriftserver install
-
-# 6. Run lint-java.
-script:
-  - dev/lint-java
spark git commit: [SPARK-25591][PYSPARK][SQL] Avoid overwriting deserialized accumulator
Repository: spark
Updated Branches:
  refs/heads/branch-2.4 4214ddd34 -> 692ddb3f9

[SPARK-25591][PYSPARK][SQL] Avoid overwriting deserialized accumulator

## What changes were proposed in this pull request?

If we use accumulators in more than one UDF, it is possible to overwrite deserialized accumulators and their values. We should check whether an accumulator was already deserialized before overwriting it in the accumulator registry.

## How was this patch tested?

Added test.

Closes #22635 from viirya/SPARK-25591.

Authored-by: Liang-Chi Hsieh
Signed-off-by: hyukjinkwon
(cherry picked from commit cb90617f894fd51a092710271823ec7d1cd3a668)
Signed-off-by: hyukjinkwon

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/692ddb3f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/692ddb3f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/692ddb3f

Branch: refs/heads/branch-2.4
Commit: 692ddb3f92ad6ee5ceca2f5ee4ea67d636c32d88
Parents: 4214ddd
Author: Liang-Chi Hsieh
Authored: Mon Oct 8 15:18:08 2018 +0800
Committer: hyukjinkwon
Committed: Mon Oct 8 15:18:27 2018 +0800

----------------------------------------------------------------------
 python/pyspark/accumulators.py | 12 ++++++++----
 python/pyspark/sql/tests.py    | 25 +++++++++++++++++++++++++
 2 files changed, 33 insertions(+), 4 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/692ddb3f/python/pyspark/accumulators.py

diff --git a/python/pyspark/accumulators.py b/python/pyspark/accumulators.py
index 30ad042..00ec094 100644
--- a/python/pyspark/accumulators.py
+++ b/python/pyspark/accumulators.py
@@ -109,10 +109,14 @@ _accumulatorRegistry = {}

 def _deserialize_accumulator(aid, zero_value, accum_param):
     from pyspark.accumulators import _accumulatorRegistry
-    accum = Accumulator(aid, zero_value, accum_param)
-    accum._deserialized = True
-    _accumulatorRegistry[aid] = accum
-    return accum
+    # If this certain accumulator was deserialized, don't overwrite it.
+    if aid in _accumulatorRegistry:
+        return _accumulatorRegistry[aid]
+    else:
+        accum = Accumulator(aid, zero_value, accum_param)
+        accum._deserialized = True
+        _accumulatorRegistry[aid] = accum
+        return accum


 class Accumulator(object):

http://git-wip-us.apache.org/repos/asf/spark/blob/692ddb3f/python/pyspark/sql/tests.py

diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index e991032..b05de54 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -3556,6 +3556,31 @@ class SQLTests(ReusedSQLTestCase):
         self.assertEquals(None, df._repr_html_())
         self.assertEquals(expected, df.__repr__())

+    # SPARK-25591
+    def test_same_accumulator_in_udfs(self):
+        from pyspark.sql.functions import udf
+
+        data_schema = StructType([StructField("a", IntegerType(), True),
+                                  StructField("b", IntegerType(), True)])
+        data = self.spark.createDataFrame([[1, 2]], schema=data_schema)
+
+        test_accum = self.sc.accumulator(0)
+
+        def first_udf(x):
+            test_accum.add(1)
+            return x
+
+        def second_udf(x):
+            test_accum.add(100)
+            return x
+
+        func_udf = udf(first_udf, IntegerType())
+        func_udf2 = udf(second_udf, IntegerType())
+        data = data.withColumn("out1", func_udf(data["a"]))
+        data = data.withColumn("out2", func_udf2(data["b"]))
+        data.collect()
+        self.assertEqual(test_accum.value, 101)
+

 class HiveSparkSubmitTests(SparkSubmitTests):
spark git commit: [SPARK-25677][DOC] spark.io.compression.codec = org.apache.spark.io.ZstdCompressionCodec throwing IllegalArgumentException Exception
Repository: spark
Updated Branches:
  refs/heads/branch-2.4 692ddb3f9 -> 193ce77fc

[SPARK-25677][DOC] spark.io.compression.codec = org.apache.spark.io.ZstdCompressionCodec throwing IllegalArgumentException Exception

## What changes were proposed in this pull request?

The documentation is updated with the proper class name, org.apache.spark.io.ZStdCompressionCodec.

## How was this patch tested?

We set spark.io.compression.codec = org.apache.spark.io.ZStdCompressionCodec and verified the logs.

Closes #22669 from shivusondur/CompressionIssue.

Authored-by: shivusondur
Signed-off-by: hyukjinkwon
(cherry picked from commit 1a6815cd9f421a106f8d96a36a53042a00f02386)
Signed-off-by: hyukjinkwon

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/193ce77f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/193ce77f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/193ce77f

Branch: refs/heads/branch-2.4
Commit: 193ce77fccf54cfdacdc011db13655c28e524458
Parents: 692ddb3
Author: shivusondur
Authored: Mon Oct 8 15:43:08 2018 +0800
Committer: hyukjinkwon
Committed: Mon Oct 8 15:43:35 2018 +0800

----------------------------------------------------------------------
 docs/configuration.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/193ce77f/docs/configuration.md

diff --git a/docs/configuration.md b/docs/configuration.md
index 5577393..613e214 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -954,7 +954,7 @@ Apart from these, the following properties are also available, and may be useful
     org.apache.spark.io.LZ4CompressionCodec,
     org.apache.spark.io.LZFCompressionCodec,
     org.apache.spark.io.SnappyCompressionCodec,
-    and org.apache.spark.io.ZstdCompressionCodec.
+    and org.apache.spark.io.ZStdCompressionCodec.
spark git commit: [SPARK-25677][DOC] spark.io.compression.codec = org.apache.spark.io.ZstdCompressionCodec throwing IllegalArgumentException Exception
Repository: spark
Updated Branches:
  refs/heads/master cb90617f8 -> 1a6815cd9

[SPARK-25677][DOC] spark.io.compression.codec = org.apache.spark.io.ZstdCompressionCodec throwing IllegalArgumentException Exception

## What changes were proposed in this pull request?

The documentation is updated with the proper class name, org.apache.spark.io.ZStdCompressionCodec.

## How was this patch tested?

We set spark.io.compression.codec = org.apache.spark.io.ZStdCompressionCodec and verified the logs.

Closes #22669 from shivusondur/CompressionIssue.

Authored-by: shivusondur
Signed-off-by: hyukjinkwon

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1a6815cd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1a6815cd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1a6815cd

Branch: refs/heads/master
Commit: 1a6815cd9f421a106f8d96a36a53042a00f02386
Parents: cb90617
Author: shivusondur
Authored: Mon Oct 8 15:43:08 2018 +0800
Committer: hyukjinkwon
Committed: Mon Oct 8 15:43:08 2018 +0800

----------------------------------------------------------------------
 docs/configuration.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/1a6815cd/docs/configuration.md

diff --git a/docs/configuration.md b/docs/configuration.md
index 5577393..613e214 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -954,7 +954,7 @@ Apart from these, the following properties are also available, and may be useful
     org.apache.spark.io.LZ4CompressionCodec,
     org.apache.spark.io.LZFCompressionCodec,
     org.apache.spark.io.SnappyCompressionCodec,
-    and org.apache.spark.io.ZstdCompressionCodec.
+    and org.apache.spark.io.ZStdCompressionCodec.
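For readers who hit the IllegalArgumentException mentioned in the title: the casing matters, and the class is `ZStdCompressionCodec` (capital S). A hedged example of setting the codec — the application JAR path here is a placeholder:

```shell
# Full class name, with the corrected casing from this commit:
spark-submit \
  --conf spark.io.compression.codec=org.apache.spark.io.ZStdCompressionCodec \
  my-app.jar   # placeholder application jar

# Spark also accepts a short alias for its built-in codecs, which
# sidesteps the casing issue entirely:
#   --conf spark.io.compression.codec=zstd
```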
spark git commit: [SPARK-25666][PYTHON] Internally document type conversion between Python data and SQL types in normal UDFs
Repository: spark
Updated Branches:
  refs/heads/master 1a6815cd9 -> a853a8020

[SPARK-25666][PYTHON] Internally document type conversion between Python data and SQL types in normal UDFs

### What changes were proposed in this pull request?

We are facing some problems with type conversions between Python data and SQL types in UDFs (Pandas UDFs as well). It's even difficult to identify the problems (see https://github.com/apache/spark/pull/20163 and https://github.com/apache/spark/pull/22610).

This PR targets to internally document the type conversion table. Some of the conversions look buggy and we should fix them.

```python
import sys
import array
import datetime
from decimal import Decimal

from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql.functions import udf

if sys.version >= '3':
    long = int

data = [
    None, True, 1, long(1), "a", u"a", datetime.date(1970, 1, 1),
    datetime.datetime(1970, 1, 1, 0, 0), 1.0, array.array("i", [1]), [1],
    (1,), bytearray([65, 66, 67]), Decimal(1), {"a": 1}, Row(kwargs=1),
    Row("namedtuple")(1),
]

types = [
    BooleanType(), ByteType(), ShortType(), IntegerType(), LongType(),
    StringType(), DateType(), TimestampType(), FloatType(), DoubleType(),
    ArrayType(IntegerType()), BinaryType(), DecimalType(10, 0),
    MapType(StringType(), IntegerType()),
    StructType([StructField("_1", IntegerType())]),
]

df = spark.range(1)
results = []
count = 0
total = len(types) * len(data)
spark.sparkContext.setLogLevel("FATAL")
for t in types:
    result = []
    for v in data:
        try:
            row = df.select(udf(lambda: v, t)()).first()
            ret_str = repr(row[0])
        except Exception:
            ret_str = "X"
        result.append(ret_str)
        progress = "SQL Type: [%s]\n  Python Value: [%s(%s)]\n  Result Python Value: [%s]" % (
            t.simpleString(), str(v), type(v).__name__, ret_str)
        count += 1
        print("%s/%s:\n  %s" % (count, total, progress))
    results.append([t.simpleString()] + list(map(str, result)))

schema = ["SQL Type \\ Python Value(Type)"] + list(map(
    lambda v: "%s(%s)" % (str(v), type(v).__name__), data))
strings = spark.createDataFrame(results, schema=schema)._jdf.showString(20, 20, False)
print("\n".join(map(lambda line: "# %s  # noqa" % line, strings.strip().split("\n"))))
```

This table was generated under Python 2, but the code above is Python 3 compatible as well.

## How was this patch tested?

Manually tested and lint check.

Closes #22655 from HyukjinKwon/SPARK-25666.

Authored-by: hyukjinkwon
Signed-off-by: hyukjinkwon

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a853a802
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a853a802
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a853a802

Branch: refs/heads/master
Commit: a853a80202032083ad411eec5ec97b304f732a61
Parents: 1a6815c
Author: hyukjinkwon
Authored: Mon Oct 8 15:47:15 2018 +0800
Committer: hyukjinkwon
Committed: Mon Oct 8 15:47:15 2018 +0800

----------------------------------------------------------------------
 python/pyspark/sql/functions.py | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/a853a802/python/pyspark/sql/functions.py

diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index be089ee..5425d31 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -2733,6 +2733,39 @@ def udf(f=None, returnType=StringType()):
     | 8|      JOHN DOE|          22|
     +--+--------------+------------+
     """
+
+    # The following table shows most of Python data and SQL type conversions in normal UDFs that
+    # are not yet visible to the user. Some of behaviors are buggy and might be changed in the near
+    # future. The table might have to be eventually documented externally.
+    # Please see SPARK-25666's PR to see the codes in order to generate the table below.
+    #
+    # +-+--+--+--+---+---+---++-+--+--+-++-++--+--+--+  # noqa
+    # |SQL Type \ Python Value(Type)|None(NoneType)|True(bool)|1(int)|1(long)| a(str)| a(unicode)|1970-01-01(date)|1970-01-01 00:00:00(datetime)|1.0(float)|array('i', [1])(array)|[1](list)| (1,)(tuple)| ABC(bytearray)| 1(Decimal)|{'a': 1}(dict)|Row(kwargs=1)(Row)|Row(namedtuple=1)(Row)|  # noqa
+    #
spark git commit: [SPARK-25684][SQL] Organize header related codes in CSV datasource
Repository: spark
Updated Branches:
  refs/heads/master a00181418 -> 39872af88

[SPARK-25684][SQL] Organize header related codes in CSV datasource

## What changes were proposed in this pull request?

1. Move `CSVDataSource.makeSafeHeader` to `CSVUtils.makeSafeHeader` (as is).
   - Historically, and at the first place of refactoring (which I did), I intended to put all CSV-specific handling (like options), filtering, extracting the header, etc. here.
   - See `JsonDataSource`. Now `CSVDataSource` is quite consistent with `JsonDataSource`. Since CSV's code path is quite complicated, we should match them as much as possible.

2. Create `CSVHeaderChecker` and put the `enforceSchema` logic into it.
   - The header-checking and column-pruning code was added (per https://github.com/apache/spark/pull/20894 and https://github.com/apache/spark/pull/21296), but some of the code, such as https://github.com/apache/spark/pull/22123, is duplicated.
   - Also, the header-checking code is scattered here and there. We should put it in a single place, since the current arrangement is quite error-prone. See https://github.com/apache/spark/pull/22656.

3. Move `CSVDataSource.checkHeaderColumnNames` to `CSVHeaderChecker.checkHeaderColumnNames` (as is).
   - Similar reasons as in 1.

## How was this patch tested?

Existing tests should cover this.

Closes #22676 from HyukjinKwon/refactoring-csv.

Authored-by: hyukjinkwon
Signed-off-by: hyukjinkwon

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/39872af8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/39872af8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/39872af8

Branch: refs/heads/master
Commit: 39872af882e3d73667acfab93c9de962c9c8939d
Parents: a001814
Author: hyukjinkwon
Authored: Fri Oct 12 09:16:41 2018 +0800
Committer: hyukjinkwon
Committed: Fri Oct 12 09:16:41 2018 +0800

----------------------------------------------------------------------
 .../org/apache/spark/sql/DataFrameReader.scala  |  18 +--
 .../datasources/csv/CSVDataSource.scala         | 161 ++-----------------
 .../datasources/csv/CSVFileFormat.scala         |  11 +-
 .../datasources/csv/CSVHeaderChecker.scala      | 131 +++++++++++++++
 .../execution/datasources/csv/CSVUtils.scala    |  44 ++++-
 .../datasources/csv/UnivocityParser.scala       |  34 ++--
 6 files changed, 217 insertions(+), 182 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/39872af8/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
index 7269446..3af70b5 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -505,20 +505,14 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
     val actualSchema =
       StructType(schema.filterNot(_.name == parsedOptions.columnNameOfCorruptRecord))

-    val linesWithoutHeader = if (parsedOptions.headerFlag && maybeFirstLine.isDefined) {
-      val firstLine = maybeFirstLine.get
-      val parser = new CsvParser(parsedOptions.asParserSettings)
-      val columnNames = parser.parseLine(firstLine)
-      CSVDataSource.checkHeaderColumnNames(
+    val linesWithoutHeader: RDD[String] = maybeFirstLine.map { firstLine =>
+      val headerChecker = new CSVHeaderChecker(
         actualSchema,
-        columnNames,
-        csvDataset.getClass.getCanonicalName,
-        parsedOptions.enforceSchema,
-        sparkSession.sessionState.conf.caseSensitiveAnalysis)
+        parsedOptions,
+        source = s"CSV source: $csvDataset")
+      headerChecker.checkHeaderColumnNames(firstLine)
       filteredLines.rdd.mapPartitions(CSVUtils.filterHeaderLine(_, firstLine, parsedOptions))
-    } else {
-      filteredLines.rdd
-    }
+    }.getOrElse(filteredLines.rdd)

     val parsed = linesWithoutHeader.mapPartitions { iter =>
       val rawParser = new UnivocityParser(actualSchema, parsedOptions)

http://git-wip-us.apache.org/repos/asf/spark/blob/39872af8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
index b93f418..0b5a719 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
@@ -51,11 +51,8 @@ abstract class
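The responsibility being centralized here — compare a CSV header row against the expected schema, honoring an `enforceSchema` flag and case sensitivity — can be sketched in a few lines. This is a hedged Python illustration of the check's logic, not the Scala `CSVHeaderChecker` itself; the function and parameter names are invented.

```python
def check_header_column_names(schema_names, header, enforce_schema=True,
                              case_sensitive=False):
    """Sketch of the header-vs-schema check that CSVHeaderChecker centralizes.

    With enforce_schema=True the declared schema wins and mismatches are only
    reported (Spark logs them as warnings); with enforce_schema=False any
    name mismatch is an error.
    """
    problems = []
    if len(header) != len(schema_names):
        problems.append("CSV header has %d columns, schema has %d"
                        % (len(header), len(schema_names)))
    norm = (lambda s: s) if case_sensitive else (lambda s: s.lower())
    for i, (got, want) in enumerate(zip(header, schema_names)):
        if norm(got) != norm(want):
            problems.append("column %d: CSV file is '%s', schema is '%s'"
                            % (i, got, want))
    if problems and not enforce_schema:
        raise ValueError("; ".join(problems))
    return problems  # callers would surface these as warnings


# enforceSchema=true: a mismatch is tolerated and only reported; note that
# with case_sensitive=False "a" vs "A" is not a mismatch at all.
warnings = check_header_column_names(["a", "b"], ["A", "c"])
print(warnings)
```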
spark git commit: [SPARK-25372][YARN][K8S][FOLLOW-UP] Deprecate and generalize keytab / principal config
Repository: spark
Updated Branches:
  refs/heads/master 6c3f2c6a6 -> 9426fd0c2

[SPARK-25372][YARN][K8S][FOLLOW-UP] Deprecate and generalize keytab / principal config

## What changes were proposed in this pull request?

Update the next version of Spark from 2.5 to 3.0.

## How was this patch tested?

N/A

Closes #22717 from gatorsmile/followupSPARK-25372.

Authored-by: gatorsmile
Signed-off-by: hyukjinkwon

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9426fd0c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9426fd0c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9426fd0c

Branch: refs/heads/master
Commit: 9426fd0c244480e52881e4bc8b36bd261ec851c7
Parents: 6c3f2c6
Author: gatorsmile
Authored: Sun Oct 14 15:20:01 2018 +0800
Committer: hyukjinkwon
Committed: Sun Oct 14 15:20:01 2018 +0800

----------------------------------------------------------------------
 core/src/main/scala/org/apache/spark/SparkConf.scala | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/9426fd0c/core/src/main/scala/org/apache/spark/SparkConf.scala

diff --git a/core/src/main/scala/org/apache/spark/SparkConf.scala b/core/src/main/scala/org/apache/spark/SparkConf.scala
index 81aa31d..5166543 100644
--- a/core/src/main/scala/org/apache/spark/SparkConf.scala
+++ b/core/src/main/scala/org/apache/spark/SparkConf.scala
@@ -729,9 +729,9 @@ private[spark] object SparkConf extends Logging {
     EXECUTOR_MEMORY_OVERHEAD.key -> Seq(
       AlternateConfig("spark.yarn.executor.memoryOverhead", "2.3")),
     KEYTAB.key -> Seq(
-      AlternateConfig("spark.yarn.keytab", "2.5")),
+      AlternateConfig("spark.yarn.keytab", "3.0")),
     PRINCIPAL.key -> Seq(
-      AlternateConfig("spark.yarn.principal", "2.5"))
+      AlternateConfig("spark.yarn.principal", "3.0"))
   )

   /**
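The `AlternateConfig` mechanism touched by this commit maps a deprecated key (e.g. `spark.yarn.keytab`) to its replacement and records the version in which the rename happened, so reads of the new key fall back to the old one with a deprecation warning. A hedged Python sketch of that lookup pattern follows — the generalized key names are assumptions for the demo, and this is not Spark's Scala implementation:

```python
import warnings

# Map: current key -> (deprecated alternate key, version of the rename).
# The generalized key names below are assumed for illustration.
ALTERNATES = {
    "spark.kerberos.keytab": ("spark.yarn.keytab", "3.0"),
    "spark.kerberos.principal": ("spark.yarn.principal", "3.0"),
}


def get_conf(settings, key, default=None):
    """Read `key`, falling back to its deprecated alternate with a warning."""
    if key in settings:
        return settings[key]
    alt = ALTERNATES.get(key)
    if alt and alt[0] in settings:
        old_key, since = alt
        warnings.warn("%s is deprecated as of Spark %s; use %s instead"
                      % (old_key, since, key))
        return settings[old_key]
    return default


# Old configurations keep working: the deprecated key is picked up when the
# new one is absent, and ignored when the new one is set.
conf = {"spark.yarn.keytab": "/etc/security/app.keytab"}
print(get_conf(conf, "spark.kerberos.keytab"))  # /etc/security/app.keytab
```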
spark git commit: [SPARK-25629][TEST] Reduce ParquetFilterSuite: filter pushdown test time costs in Jenkins
Repository: spark Updated Branches: refs/heads/master fdaa99897 -> 5c7f6b663 [SPARK-25629][TEST] Reduce ParquetFilterSuite: filter pushdown test time costs in Jenkins ## What changes were proposed in this pull request? Only test these 4 cases is enough: https://github.com/apache/spark/blob/be2238fb502b0f49a8a1baa6da9bc3e99540b40e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L269-L279 ## How was this patch tested? Manual tests on my local machine. before: ``` - filter pushdown - decimal (13 seconds, 683 milliseconds) ``` after: ``` - filter pushdown - decimal (9 seconds, 713 milliseconds) ``` Closes #22636 from wangyum/SPARK-25629. Authored-by: Yuming Wang Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5c7f6b66 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5c7f6b66 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5c7f6b66 Branch: refs/heads/master Commit: 5c7f6b66368a956accfc34636c84ca3825f8d0b1 Parents: fdaa998 Author: Yuming Wang Authored: Tue Oct 16 12:30:02 2018 +0800 Committer: hyukjinkwon Committed: Tue Oct 16 12:30:02 2018 +0800 -- .../parquet/ParquetFilterSuite.scala| 67 ++-- 1 file changed, 33 insertions(+), 34 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5c7f6b66/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala index 01e41b3..9cfc943 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala @@ -524,41 +524,40 @@ class ParquetFilterSuite extends 
QueryTest with ParquetTest with SharedSQLContex } test("filter pushdown - decimal") { -Seq(true, false).foreach { legacyFormat => +Seq( + (false, Decimal.MAX_INT_DIGITS), // int32Writer + (false, Decimal.MAX_LONG_DIGITS), // int64Writer + (true, Decimal.MAX_LONG_DIGITS), // binaryWriterUsingUnscaledLong + (false, DecimalType.MAX_PRECISION) // binaryWriterUsingUnscaledBytes +).foreach { case (legacyFormat, precision) => withSQLConf(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key -> legacyFormat.toString) { -Seq( - s"a decimal(${Decimal.MAX_INT_DIGITS}, 2)", // 32BitDecimalType - s"a decimal(${Decimal.MAX_LONG_DIGITS}, 2)", // 64BitDecimalType - "a decimal(38, 18)" // ByteArrayDecimalType -).foreach { schemaDDL => - val schema = StructType.fromDDL(schemaDDL) - val rdd = -spark.sparkContext.parallelize((1 to 4).map(i => Row(new java.math.BigDecimal(i - val dataFrame = spark.createDataFrame(rdd, schema) - testDecimalPushDown(dataFrame) { implicit df => -assert(df.schema === schema) -checkFilterPredicate('a.isNull, classOf[Eq[_]], Seq.empty[Row]) -checkFilterPredicate('a.isNotNull, classOf[NotEq[_]], (1 to 4).map(Row.apply(_))) - -checkFilterPredicate('a === 1, classOf[Eq[_]], 1) -checkFilterPredicate('a <=> 1, classOf[Eq[_]], 1) -checkFilterPredicate('a =!= 1, classOf[NotEq[_]], (2 to 4).map(Row.apply(_))) - -checkFilterPredicate('a < 2, classOf[Lt[_]], 1) -checkFilterPredicate('a > 3, classOf[Gt[_]], 4) -checkFilterPredicate('a <= 1, classOf[LtEq[_]], 1) -checkFilterPredicate('a >= 4, classOf[GtEq[_]], 4) - -checkFilterPredicate(Literal(1) === 'a, classOf[Eq[_]], 1) -checkFilterPredicate(Literal(1) <=> 'a, classOf[Eq[_]], 1) -checkFilterPredicate(Literal(2) > 'a, classOf[Lt[_]], 1) -checkFilterPredicate(Literal(3) < 'a, classOf[Gt[_]], 4) -checkFilterPredicate(Literal(1) >= 'a, classOf[LtEq[_]], 1) -checkFilterPredicate(Literal(4) <= 'a, classOf[GtEq[_]], 4) - -checkFilterPredicate(!('a < 4), classOf[GtEq[_]], 4) -checkFilterPredicate('a < 2 || 'a > 3, classOf[Operators.Or], 
Seq(Row(1), Row(4))) - } +val schema = StructType.fromDDL(s"a decimal($precision, 2)") +val rdd = + spark.sparkContext.parallelize((1 to 4).map(i => Row(new java.math.BigDecimal(i +val dataFrame = spark.createDataFrame(rdd, schema) +testDecimalPushDown(dataFrame) { implicit df => +
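For context on the rewritten test matrix above: each `(legacyFormat, precision)` tuple is chosen to exercise one Parquet decimal writer. A plain-Python sketch of the mapping follows (the digit limits and writer names come from the test's own comments; the selection logic is an assumption, not Spark's actual code):

```python
# Sketch, NOT Spark's implementation: map a decimal's precision and the
# legacy-format flag to the Parquet writer named in the test comments.
MAX_INT_DIGITS = 9    # Decimal.MAX_INT_DIGITS: unscaled value fits an int32
MAX_LONG_DIGITS = 18  # Decimal.MAX_LONG_DIGITS: unscaled value fits an int64
MAX_PRECISION = 38    # DecimalType.MAX_PRECISION

def decimal_writer(precision, legacy_format=False):
    if legacy_format:
        # The legacy format always writes fixed-length binary.
        if precision <= MAX_LONG_DIGITS:
            return "binaryWriterUsingUnscaledLong"
        return "binaryWriterUsingUnscaledBytes"
    if precision <= MAX_INT_DIGITS:
        return "int32Writer"
    if precision <= MAX_LONG_DIGITS:
        return "int64Writer"
    return "binaryWriterUsingUnscaledBytes"
```

The four tuples in the rewritten `Seq` hit exactly these four branches, which is why the old three-schema loop could be collapsed.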
spark git commit: [SPARK-25736][SQL][TEST] add tests to verify the behavior of multi-column count
Repository: spark Updated Branches: refs/heads/master 5c7f6b663 -> e028fd3ae [SPARK-25736][SQL][TEST] add tests to verify the behavior of multi-column count ## What changes were proposed in this pull request? AFAIK multi-column count is not widely supported by the mainstream databases (PostgreSQL doesn't support it), and the SQL standard doesn't define it clearly, as near as I can tell. Since Spark supports it, we should clearly document the current behavior and add tests to verify it. ## How was this patch tested? N/A Closes #22728 from cloud-fan/doc. Authored-by: Wenchen Fan Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e028fd3a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e028fd3a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e028fd3a Branch: refs/heads/master Commit: e028fd3aed9e5e4c478f307f0a467b54b73ff0d5 Parents: 5c7f6b6 Author: Wenchen Fan Authored: Tue Oct 16 15:13:01 2018 +0800 Committer: hyukjinkwon Committed: Tue Oct 16 15:13:01 2018 +0800 -- .../catalyst/expressions/aggregate/Count.scala | 2 +- .../test/resources/sql-tests/inputs/count.sql | 27 ++ .../resources/sql-tests/results/count.sql.out | 55 3 files changed, 83 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e028fd3a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala index 40582d0..8cab8e4 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala @@ -52,7 +52,7 @@ abstract class CountLike extends DeclarativeAggregate { usage = """ _FUNC_(*) - Returns the total number of
retrieved rows, including rows containing null. -_FUNC_(expr) - Returns the number of rows for which the supplied expression is non-null. +_FUNC_(expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are all non-null. _FUNC_(DISTINCT expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are unique and non-null. """) http://git-wip-us.apache.org/repos/asf/spark/blob/e028fd3a/sql/core/src/test/resources/sql-tests/inputs/count.sql -- diff --git a/sql/core/src/test/resources/sql-tests/inputs/count.sql b/sql/core/src/test/resources/sql-tests/inputs/count.sql new file mode 100644 index 000..9f9ee4a --- /dev/null +++ b/sql/core/src/test/resources/sql-tests/inputs/count.sql @@ -0,0 +1,27 @@ +-- Test data. +CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES +(1, 1), (1, 2), (2, 1), (1, 1), (null, 2), (1, null), (null, null) +AS testData(a, b); + +-- count with single expression +SELECT + count(*), count(1), count(null), count(a), count(b), count(a + b), count((a, b)) +FROM testData; + +-- distinct count with single expression +SELECT + count(DISTINCT 1), + count(DISTINCT null), + count(DISTINCT a), + count(DISTINCT b), + count(DISTINCT (a + b)), + count(DISTINCT (a, b)) +FROM testData; + +-- count with multiple expressions +SELECT count(a, b), count(b, a), count(testData.*) FROM testData; + +-- distinct count with multiple expressions +SELECT + count(DISTINCT a, b), count(DISTINCT b, a), count(DISTINCT *), count(DISTINCT testData.*) +FROM testData; http://git-wip-us.apache.org/repos/asf/spark/blob/e028fd3a/sql/core/src/test/resources/sql-tests/results/count.sql.out -- diff --git a/sql/core/src/test/resources/sql-tests/results/count.sql.out b/sql/core/src/test/resources/sql-tests/results/count.sql.out new file mode 100644 index 000..b8a86d4 --- /dev/null +++ b/sql/core/src/test/resources/sql-tests/results/count.sql.out @@ -0,0 +1,55 @@ +-- Automatically generated by SQLQueryTestSuite +-- Number of 
queries: 5 + + +-- !query 0 +CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES +(1, 1), (1, 2), (2, 1), (1, 1), (null, 2), (1, null), (null, null) +AS testData(a, b) +-- !query 0 schema +struct<> +-- !query 0 output + + + +-- !query 1 +SELECT + count(*), count(1), count(null), count(a), count(b), count(a + b), count((a, b)) +FROM testData +-- !query 1 schema +struct +-- !query 1 output +7 7 0 5 5 4 7 + + +-- !query 2 +SELECT + count(DISTINCT 1), + count(DISTINCT null), +
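The expected outputs above pin down the multi-column count semantics. They can be sketched in plain Python (no Spark required; the rows mirror the test data, and the DISTINCT behavior follows the documented "unique and non-null" rule):

```python
# Plain-Python sketch of the multi-column count semantics the new
# count.sql test verifies; rows mirror the test data (columns a, b).
rows = [(1, 1), (1, 2), (2, 1), (1, 1), (None, 2), (1, None), (None, None)]

def count_multi(rows, *cols):
    """count(a, b, ...): rows where every listed column is non-null."""
    return sum(1 for r in rows if all(r[c] is not None for c in cols))

def count_distinct_multi(rows, *cols):
    """count(DISTINCT a, b, ...): distinct tuples with no null component."""
    return len({tuple(r[c] for c in cols)
                for r in rows if all(r[c] is not None for c in cols)})

assert count_multi(rows, 0) == 5       # count(a)
assert count_multi(rows, 1) == 5       # count(b)
assert count_multi(rows, 0, 1) == 4    # count(a, b): both non-null
# count((a, b)) wraps the columns in a struct, and the struct itself is
# never null, so it counts every row, like count(*):
assert len(rows) == 7
```

This reproduces the `7 7 0 5 5 4 7` row of query 1 above.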
spark git commit: [SPARK-25736][SQL][TEST] add tests to verify the behavior of multi-column count
Repository: spark Updated Branches: refs/heads/branch-2.4 8bc7ab03d -> 77156f8c8 [SPARK-25736][SQL][TEST] add tests to verify the behavior of multi-column count ## What changes were proposed in this pull request? AFAIK multi-column count is not widely supported by the mainstream databases (PostgreSQL doesn't support it), and the SQL standard doesn't define it clearly, as near as I can tell. Since Spark supports it, we should clearly document the current behavior and add tests to verify it. ## How was this patch tested? N/A Closes #22728 from cloud-fan/doc. Authored-by: Wenchen Fan Signed-off-by: hyukjinkwon (cherry picked from commit e028fd3aed9e5e4c478f307f0a467b54b73ff0d5) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/77156f8c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/77156f8c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/77156f8c Branch: refs/heads/branch-2.4 Commit: 77156f8c81720ec7364b386a95ef1b30713fe55c Parents: 8bc7ab0 Author: Wenchen Fan Authored: Tue Oct 16 15:13:01 2018 +0800 Committer: hyukjinkwon Committed: Tue Oct 16 15:13:19 2018 +0800 -- .../catalyst/expressions/aggregate/Count.scala | 2 +- .../test/resources/sql-tests/inputs/count.sql | 27 ++ .../resources/sql-tests/results/count.sql.out | 55 3 files changed, 83 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/77156f8c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala index 40582d0..8cab8e4 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala @@ -52,7 +52,7 @@
abstract class CountLike extends DeclarativeAggregate { usage = """ _FUNC_(*) - Returns the total number of retrieved rows, including rows containing null. -_FUNC_(expr) - Returns the number of rows for which the supplied expression is non-null. +_FUNC_(expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are all non-null. _FUNC_(DISTINCT expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are unique and non-null. """) http://git-wip-us.apache.org/repos/asf/spark/blob/77156f8c/sql/core/src/test/resources/sql-tests/inputs/count.sql -- diff --git a/sql/core/src/test/resources/sql-tests/inputs/count.sql b/sql/core/src/test/resources/sql-tests/inputs/count.sql new file mode 100644 index 000..9f9ee4a --- /dev/null +++ b/sql/core/src/test/resources/sql-tests/inputs/count.sql @@ -0,0 +1,27 @@ +-- Test data. +CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES +(1, 1), (1, 2), (2, 1), (1, 1), (null, 2), (1, null), (null, null) +AS testData(a, b); + +-- count with single expression +SELECT + count(*), count(1), count(null), count(a), count(b), count(a + b), count((a, b)) +FROM testData; + +-- distinct count with single expression +SELECT + count(DISTINCT 1), + count(DISTINCT null), + count(DISTINCT a), + count(DISTINCT b), + count(DISTINCT (a + b)), + count(DISTINCT (a, b)) +FROM testData; + +-- count with multiple expressions +SELECT count(a, b), count(b, a), count(testData.*) FROM testData; + +-- distinct count with multiple expressions +SELECT + count(DISTINCT a, b), count(DISTINCT b, a), count(DISTINCT *), count(DISTINCT testData.*) +FROM testData; http://git-wip-us.apache.org/repos/asf/spark/blob/77156f8c/sql/core/src/test/resources/sql-tests/results/count.sql.out -- diff --git a/sql/core/src/test/resources/sql-tests/results/count.sql.out b/sql/core/src/test/resources/sql-tests/results/count.sql.out new file mode 100644 index 000..b8a86d4 --- /dev/null +++ 
b/sql/core/src/test/resources/sql-tests/results/count.sql.out @@ -0,0 +1,55 @@ +-- Automatically generated by SQLQueryTestSuite +-- Number of queries: 5 + + +-- !query 0 +CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES +(1, 1), (1, 2), (2, 1), (1, 1), (null, 2), (1, null), (null, null) +AS testData(a, b) +-- !query 0 schema +struct<> +-- !query 0 output + + + +-- !query 1 +SELECT + count(*), count(1), count(null), count(a), count(b), count(a + b), count((a, b)) +FROM testData +-- !query 1 schema +struct +-- !query 1 output +7 7 0 5
spark git commit: [SQL][CATALYST][MINOR] update some error comments
Repository: spark Updated Branches: refs/heads/master a9f685bb7 -> e9332f600 [SQL][CATALYST][MINOR] update some error comments ## What changes were proposed in this pull request? This PR corrects some comment errors: 1. change from "as low a possible" to "as low as possible" in RewriteDistinctAggregates.scala 2. delete the redundant word "with" in HiveTableScanExec's doExecute() method ## How was this patch tested? Existing unit tests. Closes #22694 from CarolinePeng/update_comment. Authored-by: 彭灿00244106 <00244106@zte.intra> Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e9332f60 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e9332f60 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e9332f60 Branch: refs/heads/master Commit: e9332f600eb4f275b3bff368863a68c2a4349182 Parents: a9f685b Author: 彭灿00244106 <00244106@zte.intra> Authored: Wed Oct 17 12:45:13 2018 +0800 Committer: hyukjinkwon Committed: Wed Oct 17 12:45:13 2018 +0800 -- .../spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala | 4 ++-- .../org/apache/spark/sql/hive/execution/HiveTableScanExec.scala | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e9332f60/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala index 4448ace..b946800 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala @@ -95,7 +95,7 @@ import org.apache.spark.sql.types.IntegerType * * This rule duplicates the input data by two or more times
(# distinct groups + an optional * non-distinct group). This will put quite a bit of memory pressure of the used aggregate and - * exchange operators. Keeping the number of distinct groups as low a possible should be priority, + * exchange operators. Keeping the number of distinct groups as low as possible should be priority, * we could improve this in the current rule by applying more advanced expression canonicalization * techniques. */ @@ -241,7 +241,7 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] { groupByAttrs ++ distinctAggChildAttrs ++ Seq(gid) ++ regularAggChildAttrMap.map(_._2), a.child) - // Construct the first aggregate operator. This de-duplicates the all the children of + // Construct the first aggregate operator. This de-duplicates all the children of // distinct operators, and applies the regular aggregate operators. val firstAggregateGroupBy = groupByAttrs ++ distinctAggChildAttrs :+ gid val firstAggregate = Aggregate( http://git-wip-us.apache.org/repos/asf/spark/blob/e9332f60/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala index b3795b4..92c6632 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala @@ -182,7 +182,7 @@ case class HiveTableScanExec( protected override def doExecute(): RDD[InternalRow] = { // Using dummyCallSite, as getCallSite can turn out to be expensive with -// with multiple partitions. +// multiple partitions. val rdd = if (!relation.isPartitioned) { Utils.withDummyCallSite(sqlContext.sparkContext) { hadoopReader.makeRDDForTable(hiveQlTable) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[2/2] spark git commit: [SPARK-25393][SQL] Adding new function from_csv()
[SPARK-25393][SQL] Adding new function from_csv() ## What changes were proposed in this pull request? The PR adds new function `from_csv()` similar to `from_json()` to parse columns with CSV strings. I added the following methods: ```Scala def from_csv(e: Column, schema: StructType, options: Map[String, String]): Column ``` and this signature to call it from Python, R and Java: ```Scala def from_csv(e: Column, schema: String, options: java.util.Map[String, String]): Column ``` ## How was this patch tested? Added new test suites `CsvExpressionsSuite`, `CsvFunctionsSuite` and sql tests. Closes #22379 from MaxGekk/from_csv. Lead-authored-by: Maxim Gekk Co-authored-by: Maxim Gekk Co-authored-by: Hyukjin Kwon Co-authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e9af9460 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e9af9460 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e9af9460 Branch: refs/heads/master Commit: e9af9460bc008106b670abac44a869721bfde42a Parents: 9d4dd79 Author: Maxim Gekk Authored: Wed Oct 17 09:32:05 2018 +0800 Committer: hyukjinkwon Committed: Wed Oct 17 09:32:05 2018 +0800 -- R/pkg/NAMESPACE | 1 + R/pkg/R/functions.R | 40 ++- R/pkg/R/generics.R | 4 + R/pkg/tests/fulltests/test_sparkSQL.R | 7 + python/pyspark/sql/functions.py | 37 +- sql/catalyst/pom.xml| 6 + .../catalyst/analysis/FunctionRegistry.scala| 5 +- .../spark/sql/catalyst/csv/CSVExprUtils.scala | 82 + .../sql/catalyst/csv/CSVHeaderChecker.scala | 131 +++ .../spark/sql/catalyst/csv/CSVOptions.scala | 217 .../sql/catalyst/csv/UnivocityParser.scala | 351 ++ .../sql/catalyst/expressions/ExprUtils.scala| 45 +++ .../catalyst/expressions/csvExpressions.scala | 120 +++ .../catalyst/expressions/jsonExpressions.scala | 21 +- .../sql/catalyst/util/FailureSafeParser.scala | 80 + .../sql/catalyst/csv/CSVExprUtilsSuite.scala| 61 
.../expressions/CsvExpressionsSuite.scala | 158 + .../org/apache/spark/sql/DataFrameReader.scala | 5 +- .../datasources/FailureSafeParser.scala | 82 - .../datasources/csv/CSVDataSource.scala | 1 + .../datasources/csv/CSVFileFormat.scala | 1 + .../datasources/csv/CSVHeaderChecker.scala | 131 --- .../datasources/csv/CSVInferSchema.scala| 1 + .../execution/datasources/csv/CSVOptions.scala | 217 .../execution/datasources/csv/CSVUtils.scala| 67 +--- .../datasources/csv/UnivocityGenerator.scala| 1 + .../datasources/csv/UnivocityParser.scala | 352 --- .../datasources/json/JsonDataSource.scala | 1 + .../scala/org/apache/spark/sql/functions.scala | 32 ++ .../sql-tests/inputs/csv-functions.sql | 9 + .../sql-tests/results/csv-functions.sql.out | 69 .../apache/spark/sql/CsvFunctionsSuite.scala| 62 .../datasources/csv/CSVInferSchemaSuite.scala | 1 + .../datasources/csv/CSVUtilsSuite.scala | 61 .../datasources/csv/UnivocityParserSuite.scala | 2 +- 35 files changed, 1531 insertions(+), 930 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e9af9460/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index 96ff389..c512284 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -274,6 +274,7 @@ exportMethods("%<=>%", "floor", "format_number", "format_string", + "from_csv", "from_json", "from_unixtime", "from_utc_timestamp", http://git-wip-us.apache.org/repos/asf/spark/blob/e9af9460/R/pkg/R/functions.R -- diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index 6a8fef5..d2ca1d6 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -188,6 +188,7 @@ NULL #' \item \code{to_json}: it is the column containing the struct, array of the structs, #' the map or array of maps. #' \item \code{from_json}: it is the column containing the JSON string. +#' \item \code{from_csv}: it is the column containing the CSV string. #' } #' @param y Column to compute on. #' @param value A value to compute on. 
@@ -196,6 +197,13 @@ NULL #' \item \code{array_position}: a value to locate in the given array. #' \item \code{array_remove}: a value to remove in the given array. #'
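As a rough illustration of what the new `from_csv()` computes, here is a hypothetical plain-Python stand-in (a behavioral sketch only; Spark's implementation reuses the Univocity parser, and the `schema` shape used here is a simplification, not Spark's `StructType`):

```python
import csv
import io

# Hypothetical stand-in for from_csv(): parse one CSV-encoded string into
# a dict shaped like the given schema, casting by the declared type.
CASTS = {"int": int, "double": float, "string": str}

def from_csv(value, schema, sep=","):
    """schema is a list of (name, type) pairs, e.g. [("a", "int")]."""
    fields = next(csv.reader(io.StringIO(value), delimiter=sep))
    row = {}
    for (name, typ), raw in zip(schema, fields):
        # Empty fields become null, mirroring typical CSV parsing.
        row[name] = CASTS[typ](raw) if raw != "" else None
    return row

print(from_csv("1,0.8", [("a", "int"), ("b", "double")]))
# {'a': 1, 'b': 0.8}
```

The real API additionally takes an options map (delimiter, mode, and so on), which is omitted here.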
[1/2] spark git commit: [SPARK-25393][SQL] Adding new function from_csv()
Repository: spark Updated Branches: refs/heads/master 9d4dd7992 -> e9af9460b http://git-wip-us.apache.org/repos/asf/spark/blob/e9af9460/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala deleted file mode 100644 index 492a21b..000 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala +++ /dev/null @@ -1,217 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - *http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.spark.sql.execution.datasources.csv - -import java.nio.charset.StandardCharsets -import java.util.{Locale, TimeZone} - -import com.univocity.parsers.csv.{CsvParserSettings, CsvWriterSettings, UnescapedQuoteHandling} -import org.apache.commons.lang3.time.FastDateFormat - -import org.apache.spark.internal.Logging -import org.apache.spark.sql.catalyst.util._ - -class CSVOptions( -@transient val parameters: CaseInsensitiveMap[String], -val columnPruning: Boolean, -defaultTimeZoneId: String, -defaultColumnNameOfCorruptRecord: String) - extends Logging with Serializable { - - def this( -parameters: Map[String, String], -columnPruning: Boolean, -defaultTimeZoneId: String, -defaultColumnNameOfCorruptRecord: String = "") = { - this( -CaseInsensitiveMap(parameters), -columnPruning, -defaultTimeZoneId, -defaultColumnNameOfCorruptRecord) - } - - private def getChar(paramName: String, default: Char): Char = { -val paramValue = parameters.get(paramName) -paramValue match { - case None => default - case Some(null) => default - case Some(value) if value.length == 0 => '\u0000' - case Some(value) if value.length == 1 => value.charAt(0) - case _ => throw new RuntimeException(s"$paramName cannot be more than one character") -} - } - - private def getInt(paramName: String, default: Int): Int = { -val paramValue = parameters.get(paramName) -paramValue match { - case None => default - case Some(null) => default - case Some(value) => try { -value.toInt - } catch { -case e: NumberFormatException => - throw new RuntimeException(s"$paramName should be an integer. 
Found $value") - } -} - } - - private def getBool(paramName: String, default: Boolean = false): Boolean = { -val param = parameters.getOrElse(paramName, default.toString) -if (param == null) { - default -} else if (param.toLowerCase(Locale.ROOT) == "true") { - true -} else if (param.toLowerCase(Locale.ROOT) == "false") { - false -} else { - throw new Exception(s"$paramName flag can be true or false") -} - } - - val delimiter = CSVUtils.toChar( -parameters.getOrElse("sep", parameters.getOrElse("delimiter", ","))) - val parseMode: ParseMode = -parameters.get("mode").map(ParseMode.fromString).getOrElse(PermissiveMode) - val charset = parameters.getOrElse("encoding", -parameters.getOrElse("charset", StandardCharsets.UTF_8.name())) - - val quote = getChar("quote", '\"') - val escape = getChar("escape", '\\') - val charToEscapeQuoteEscaping = parameters.get("charToEscapeQuoteEscaping") match { -case None => None -case Some(null) => None -case Some(value) if value.length == 0 => None -case Some(value) if value.length == 1 => Some(value.charAt(0)) -case _ => - throw new RuntimeException("charToEscapeQuoteEscaping cannot be more than one character") - } - val comment = getChar("comment", '\u0000') - - val headerFlag = getBool("header") - val inferSchemaFlag = getBool("inferSchema") - val ignoreLeadingWhiteSpaceInRead = getBool("ignoreLeadingWhiteSpace", default = false) - val ignoreTrailingWhiteSpaceInRead = getBool("ignoreTrailingWhiteSpace", default = false) - - // For write, both options were `true` by default. We leave it as `true` for - // backwards compatibility. - val ignoreLeadingWhiteSpaceFlagInWrite =
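The `getChar`/`getBool` helpers in the relocated `CSVOptions` class above validate single-character and boolean options; the same logic reads as follows in a Python sketch (names and error messages mirror the Scala, but this is illustrative only, not Spark code):

```python
# Sketch of CSVOptions' option validation: single-character options fall
# back to a default when missing, map an empty string to '\0', and reject
# anything longer than one character; booleans accept only true/false.
def get_char(params, name, default):
    value = params.get(name)
    if value is None:
        return default
    if len(value) == 0:
        return "\u0000"
    if len(value) == 1:
        return value
    raise ValueError(f"{name} cannot be more than one character")

def get_bool(params, name, default=False):
    value = params.get(name, str(default))
    if value is None:
        return default
    if value.lower() == "true":
        return True
    if value.lower() == "false":
        return False
    raise ValueError(f"{name} flag can be true or false")

opts = {"quote": "'", "comment": "", "header": "TRUE"}
assert get_char(opts, "quote", '"') == "'"
assert get_char(opts, "escape", "\\") == "\\"            # missing -> default
assert get_char(opts, "comment", "\u0000") == "\u0000"   # empty -> '\0'
assert get_bool(opts, "header") is True
```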
spark git commit: [SQL][CATALYST][MINOR] update some error comments
Repository: spark Updated Branches: refs/heads/branch-2.4 144cb949d -> 3591bd229 [SQL][CATALYST][MINOR] update some error comments ## What changes were proposed in this pull request? This PR corrects some comment errors: 1. change from "as low a possible" to "as low as possible" in RewriteDistinctAggregates.scala 2. delete the redundant word "with" in HiveTableScanExec's doExecute() method ## How was this patch tested? Existing unit tests. Closes #22694 from CarolinePeng/update_comment. Authored-by: 彭灿00244106 <00244106@zte.intra> Signed-off-by: hyukjinkwon (cherry picked from commit e9332f600eb4f275b3bff368863a68c2a4349182) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3591bd22 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3591bd22 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3591bd22 Branch: refs/heads/branch-2.4 Commit: 3591bd2293f49ac8023166597704ad1bd21dabe9 Parents: 144cb94 Author: 彭灿00244106 <00244106@zte.intra> Authored: Wed Oct 17 12:45:13 2018 +0800 Committer: hyukjinkwon Committed: Wed Oct 17 12:45:30 2018 +0800 -- .../spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala | 4 ++-- .../org/apache/spark/sql/hive/execution/HiveTableScanExec.scala | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3591bd22/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala index 4448ace..b946800 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala @@ -95,7 +95,7 @@
import org.apache.spark.sql.types.IntegerType * * This rule duplicates the input data by two or more times (# distinct groups + an optional * non-distinct group). This will put quite a bit of memory pressure of the used aggregate and - * exchange operators. Keeping the number of distinct groups as low a possible should be priority, + * exchange operators. Keeping the number of distinct groups as low as possible should be priority, * we could improve this in the current rule by applying more advanced expression canonicalization * techniques. */ @@ -241,7 +241,7 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] { groupByAttrs ++ distinctAggChildAttrs ++ Seq(gid) ++ regularAggChildAttrMap.map(_._2), a.child) - // Construct the first aggregate operator. This de-duplicates the all the children of + // Construct the first aggregate operator. This de-duplicates all the children of // distinct operators, and applies the regular aggregate operators. val firstAggregateGroupBy = groupByAttrs ++ distinctAggChildAttrs :+ gid val firstAggregate = Aggregate( http://git-wip-us.apache.org/repos/asf/spark/blob/3591bd22/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala index b3795b4..92c6632 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala @@ -182,7 +182,7 @@ case class HiveTableScanExec( protected override def doExecute(): RDD[InternalRow] = { // Using dummyCallSite, as getCallSite can turn out to be expensive with -// with multiple partitions. +// multiple partitions. 
val rdd = if (!relation.isPartitioned) { Utils.withDummyCallSite(sqlContext.sparkContext) { hadoopReader.makeRDDForTable(hiveQlTable)
spark git commit: [SPARK-25493][SQL] Use auto-detection for CRLF in CSV datasource multiline mode
Repository: spark Updated Branches: refs/heads/master d0ecff285 -> 1e6c1d8bf [SPARK-25493][SQL] Use auto-detection for CRLF in CSV datasource multiline mode ## What changes were proposed in this pull request? CSVs with Windows-style CRLF ('\r\n') don't work in multiline mode. They work fine in single line mode because the line separation is done by Hadoop, which can handle all the different types of line separators. This PR fixes it by enabling Univocity's line separator detection in multiline mode, which will detect '\r\n', '\r', or '\n' automatically, as is done by Hadoop in single line mode. ## How was this patch tested? Unit test with a file with CRLF line endings. Closes #22503 from justinuang/fix-clrf-multiline. Authored-by: Justin Uang Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1e6c1d8b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1e6c1d8b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1e6c1d8b Branch: refs/heads/master Commit: 1e6c1d8bfb7841596452e25b870823b9a4b267f4 Parents: d0ecff2 Author: Justin Uang Authored: Fri Oct 19 11:13:02 2018 +0800 Committer: hyukjinkwon Committed: Fri Oct 19 11:13:02 2018 +0800 -- .../org/apache/spark/sql/catalyst/csv/CSVOptions.scala | 2 ++ sql/core/src/test/resources/test-data/cars-crlf.csv | 7 +++ .../spark/sql/execution/datasources/csv/CSVSuite.scala | 12 3 files changed, 21 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1e6c1d8b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala index 3e25d82..cdaaa17 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala @@ 
-212,6 +212,8 @@ class CSVOptions( settings.setEmptyValue(emptyValueInRead) settings.setMaxCharsPerColumn(maxCharsPerColumn) settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER) +settings.setLineSeparatorDetectionEnabled(multiLine == true) + settings } } http://git-wip-us.apache.org/repos/asf/spark/blob/1e6c1d8b/sql/core/src/test/resources/test-data/cars-crlf.csv -- diff --git a/sql/core/src/test/resources/test-data/cars-crlf.csv b/sql/core/src/test/resources/test-data/cars-crlf.csv new file mode 100644 index 000..d018d08 --- /dev/null +++ b/sql/core/src/test/resources/test-data/cars-crlf.csv @@ -0,0 +1,7 @@ + +year,make,model,comment,blank +"2012","Tesla","S","No comment", + +1997,Ford,E350,"Go get one now they are going fast", +2015,Chevy,Volt + http://git-wip-us.apache.org/repos/asf/spark/blob/1e6c1d8b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala index d59035b..d43efc8 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala @@ -52,6 +52,7 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te private val carsNullFile = "test-data/cars-null.csv" private val carsEmptyValueFile = "test-data/cars-empty-value.csv" private val carsBlankColName = "test-data/cars-blank-column-name.csv" + private val carsCrlf = "test-data/cars-crlf.csv" private val emptyFile = "test-data/empty.csv" private val commentsFile = "test-data/comments.csv" private val disableCommentsFile = "test-data/disable_comments.csv" @@ -220,6 +221,17 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te } } + test("crlf line separators in multiline mode") { +val 
cars = spark + .read + .format("csv") + .option("multiLine", "true") + .option("header", "true") + .load(testFile(carsCrlf)) + +verifyCars(cars, withHeader = true) + } + test("test aliases sep and encoding for delimiter and charset") { // scalastyle:off val cars = spark
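What the enabled line-separator detection buys multiline mode can be sketched in Python: one splitter that treats '\r\n', '\r', and '\n' uniformly (this mimics the detection outcome only, not Univocity's implementation):

```python
# Sketch of separator-agnostic record splitting: normalize CRLF first,
# then bare CR, so all three conventions yield the same records.
def split_records(text):
    return text.replace("\r\n", "\n").replace("\r", "\n").split("\n")

crlf_csv = "year,make\r\n2012,Tesla\r\n1997,Ford"
lf_csv = "year,make\n2012,Tesla\n1997,Ford"

assert split_records(crlf_csv) == split_records(lf_csv)
assert split_records(crlf_csv) == ["year,make", "2012,Tesla", "1997,Ford"]
```

In single line mode Hadoop's line reader already performs this normalization, which is why the bug only showed up with `multiLine` enabled.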
spark git commit: [MINOR][DOC] Spacing items in migration guide for readability and consistency
Repository: spark Updated Branches: refs/heads/branch-2.4 36307b1e4 -> 9ed2e4204 [MINOR][DOC] Spacing items in migration guide for readability and consistency ## What changes were proposed in this pull request? Currently, the migration guide has no space between items, which looks too compact and is hard to read. Some items already had spaces between them in the migration guide. This PR suggests formatting them consistently for readability. Before: ![screen shot 2018-10-18 at 10 00 04 am](https://user-images.githubusercontent.com/6477701/47126768-9e84fb80-d2bc-11e8-9211-84703486c553.png) After: ![screen shot 2018-10-18 at 9 53 55 am](https://user-images.githubusercontent.com/6477701/47126708-4fd76180-d2bc-11e8-9aa5-546f0622ca20.png) ## How was this patch tested? Manually tested. Closes #22761 from HyukjinKwon/minor-migration-doc. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon (cherry picked from commit c8f7691c64a28174a54e8faa159b50a3836a7225) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9ed2e420 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9ed2e420 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9ed2e420 Branch: refs/heads/branch-2.4 Commit: 9ed2e42044a1105a1c8b5868dbb320b1b477bcf0 Parents: 36307b1 Author: hyukjinkwon Authored: Fri Oct 19 13:55:27 2018 +0800 Committer: hyukjinkwon Committed: Fri Oct 19 13:55:43 2018 +0800 -- docs/sql-migration-guide-upgrade.md | 54 1 file changed, 54 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9ed2e420/docs/sql-migration-guide-upgrade.md -- diff --git a/docs/sql-migration-guide-upgrade.md b/docs/sql-migration-guide-upgrade.md index 3476aa8..062e07b 100644 --- a/docs/sql-migration-guide-upgrade.md +++ b/docs/sql-migration-guide-upgrade.md @@ -70,26 +70,47 @@ displayTitle: Spark SQL Upgrading Guide - Since Spark 2.4, when there is a struct field in front of the IN operator before a
subquery, the inner query must contain a struct field as well. In previous versions, instead, the fields of the struct were compared to the output of the inner query. E.g., if `a` is a `struct(a string, b int)`, in Spark 2.4 `a in (select (1 as a, 'a' as b) from range(1))` is a valid query, while `a in (select 1, 'a' from range(1))` is not. In previous versions it was the opposite. + - In versions 2.2.1+ and 2.3, if `spark.sql.caseSensitive` is set to true, then the `CURRENT_DATE` and `CURRENT_TIMESTAMP` functions incorrectly became case-sensitive and would resolve to columns (unless typed in lower case). In Spark 2.4 this has been fixed and the functions are no longer case-sensitive. + - Since Spark 2.4, Spark will evaluate the set operations referenced in a query by following a precedence rule as per the SQL standard. If the order is not specified by parentheses, set operations are performed from left to right with the exception that all INTERSECT operations are performed before any UNION, EXCEPT or MINUS operations. The old behaviour of giving equal precedence to all the set operations is preserved under a newly added configuration `spark.sql.legacy.setopsPrecedence.enabled` with a default value of `false`. When this property is set to `true`, Spark will evaluate the set operators from left to right as they appear in the query when no explicit ordering is enforced by parentheses. + - Since Spark 2.4, Spark will display the table description column Last Access value as UNKNOWN when the value was Jan 01 1970. + - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively. + - In PySpark, when Arrow optimization is enabled, previously `toPandas` simply failed when Arrow optimization could not be used, whereas `createDataFrame` from a Pandas DataFrame allowed the fallback to non-optimization.
Now, both `toPandas` and `createDataFrame` from Pandas DataFrame allow the fallback by default, which can be switched off by `spark.sql.execution.arrow.fallback.enabled`. + - Since Spark 2.4, writing an empty dataframe to a directory launches at least one write task, even if physically the dataframe has no partition. This introduces a small behavior change that for self-describing file formats like Parquet and Orc, Spark creates a metadata-only file in the target directory when writing a 0-partition dataframe, so that schema inference can still work if users read that directory later. The new behavior is more reasonable and more consistent regarding writing empty
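The set-operation precedence item above can be modeled outside Spark. The following is a hypothetical plain-Python sketch (not Spark code) of the SQL-standard rule the guide describes: every INTERSECT is folded into its left operand first, and the remaining UNION/EXCEPT operators are then applied left to right. Under the legacy `spark.sql.legacy.setopsPrecedence.enabled=true` behavior, all operators would instead apply strictly left to right.

```python
# Hypothetical model of SQL-standard set-operation precedence (not Spark code):
# all INTERSECT operations bind tighter than UNION/EXCEPT/MINUS, which are
# then evaluated left to right.
def evaluate(terms, ops):
    """terms: list of Python sets; ops: 'UNION'|'EXCEPT'|'INTERSECT' between them."""
    # First pass: fold every INTERSECT into its left operand.
    folded_terms, folded_ops = [terms[0]], []
    for op, term in zip(ops, terms[1:]):
        if op == "INTERSECT":
            folded_terms[-1] = folded_terms[-1] & term
        else:
            folded_ops.append(op)
            folded_terms.append(term)
    # Second pass: remaining UNION/EXCEPT are applied left to right.
    result = folded_terms[0]
    for op, term in zip(folded_ops, folded_terms[1:]):
        result = result | term if op == "UNION" else result - term
    return result

# a UNION b INTERSECT c  ==  a UNION (b INTERSECT c) under the 2.4 rule
a, b, c = {1}, {2, 3}, {3, 4}
print(evaluate([a, b, c], ["UNION", "INTERSECT"]))  # {1, 3}
# The legacy left-to-right rule would instead compute (a UNION b) INTERSECT c = {3}.
```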
spark git commit: [MINOR][DOC] Spacing items in migration guide for readability and consistency
Repository: spark Updated Branches: refs/heads/master 1e6c1d8bf -> c8f7691c6 [MINOR][DOC] Spacing items in migration guide for readability and consistency ## What changes were proposed in this pull request? Currently, the migration guide has no space between items, which looks too compact and is hard to read. Some items already had spaces between them in the migration guide. This PR suggests formatting them consistently for readability. Before: ![screen shot 2018-10-18 at 10 00 04 am](https://user-images.githubusercontent.com/6477701/47126768-9e84fb80-d2bc-11e8-9211-84703486c553.png) After: ![screen shot 2018-10-18 at 9 53 55 am](https://user-images.githubusercontent.com/6477701/47126708-4fd76180-d2bc-11e8-9aa5-546f0622ca20.png) ## How was this patch tested? Manually tested. Closes #22761 from HyukjinKwon/minor-migration-doc. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c8f7691c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c8f7691c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c8f7691c Branch: refs/heads/master Commit: c8f7691c64a28174a54e8faa159b50a3836a7225 Parents: 1e6c1d8 Author: hyukjinkwon Authored: Fri Oct 19 13:55:27 2018 +0800 Committer: hyukjinkwon Committed: Fri Oct 19 13:55:27 2018 +0800 -- docs/sql-migration-guide-upgrade.md | 54 1 file changed, 54 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c8f7691c/docs/sql-migration-guide-upgrade.md -- diff --git a/docs/sql-migration-guide-upgrade.md b/docs/sql-migration-guide-upgrade.md index 7faf8bd..7871a49 100644 --- a/docs/sql-migration-guide-upgrade.md +++ b/docs/sql-migration-guide-upgrade.md @@ -74,26 +74,47 @@ displayTitle: Spark SQL Upgrading Guide - Since Spark 2.4, when there is a struct field in front of the IN operator before a subquery, the inner query must contain a struct field as well.
In previous versions, instead, the fields of the struct were compared to the output of the inner query. E.g., if `a` is a `struct(a string, b int)`, in Spark 2.4 `a in (select (1 as a, 'a' as b) from range(1))` is a valid query, while `a in (select 1, 'a' from range(1))` is not. In previous versions it was the opposite. + - In versions 2.2.1+ and 2.3, if `spark.sql.caseSensitive` is set to true, then the `CURRENT_DATE` and `CURRENT_TIMESTAMP` functions incorrectly became case-sensitive and would resolve to columns (unless typed in lower case). In Spark 2.4 this has been fixed and the functions are no longer case-sensitive. + - Since Spark 2.4, Spark will evaluate the set operations referenced in a query by following a precedence rule as per the SQL standard. If the order is not specified by parentheses, set operations are performed from left to right with the exception that all INTERSECT operations are performed before any UNION, EXCEPT or MINUS operations. The old behaviour of giving equal precedence to all the set operations is preserved under a newly added configuration `spark.sql.legacy.setopsPrecedence.enabled` with a default value of `false`. When this property is set to `true`, Spark will evaluate the set operators from left to right as they appear in the query when no explicit ordering is enforced by parentheses. + - Since Spark 2.4, Spark will display the table description column Last Access value as UNKNOWN when the value was Jan 01 1970. + - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively. + - In PySpark, when Arrow optimization is enabled, previously `toPandas` simply failed when Arrow optimization could not be used, whereas `createDataFrame` from a Pandas DataFrame allowed the fallback to non-optimization.
Now, both `toPandas` and `createDataFrame` from Pandas DataFrame allow the fallback by default, which can be switched off by `spark.sql.execution.arrow.fallback.enabled`. + - Since Spark 2.4, writing an empty dataframe to a directory launches at least one write task, even if physically the dataframe has no partition. This introduces a small behavior change that for self-describing file formats like Parquet and Orc, Spark creates a metadata-only file in the target directory when writing a 0-partition dataframe, so that schema inference can still work if users read that directory later. The new behavior is more reasonable and more consistent regarding writing empty dataframe. + - Since Spark 2.4, expression IDs in UDF arguments do not appear in column names. For example,
spark git commit: [SPARK-25040][SQL] Empty string for non string types should be disallowed
Repository: spark Updated Branches: refs/heads/master c391dc65e -> 03e82e368 [SPARK-25040][SQL] Empty string for non string types should be disallowed ## What changes were proposed in this pull request? This takes over the original PR at #22019. The original proposal was to return null for float and double types; a later, more reasonable proposal is to disallow empty strings. This patch adds logic to throw an exception when an empty string is found for a non-string type. ## How was this patch tested? Added test. Closes #22787 from viirya/SPARK-25040. Authored-by: Liang-Chi Hsieh Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/03e82e36 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/03e82e36 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/03e82e36 Branch: refs/heads/master Commit: 03e82e36896afb43cc42c8d065ebe41a19ec62a7 Parents: c391dc6 Author: Liang-Chi Hsieh Authored: Tue Oct 23 13:43:53 2018 +0800 Committer: hyukjinkwon Committed: Tue Oct 23 13:43:53 2018 +0800 -- docs/sql-migration-guide-upgrade.md | 2 ++ .../spark/sql/catalyst/json/JacksonParser.scala | 19 +- .../execution/datasources/json/JsonSuite.scala | 37 +++- 3 files changed, 48 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/03e82e36/docs/sql-migration-guide-upgrade.md -- diff --git a/docs/sql-migration-guide-upgrade.md b/docs/sql-migration-guide-upgrade.md index 68a897c..b8b9ad8 100644 --- a/docs/sql-migration-guide-upgrade.md +++ b/docs/sql-migration-guide-upgrade.md @@ -11,6 +11,8 @@ displayTitle: Spark SQL Upgrading Guide - In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them.
Since 3.0, the builder no longer updates the configurations. This is the same behavior as the Java/Scala API in 2.3 and above. If you want to update them, you need to do so prior to creating a `SparkSession`. + - In Spark version 2.4 and earlier, the parser of the JSON data source treats empty strings as null for some data types such as `IntegerType`. For `FloatType` and `DoubleType`, it fails on empty strings and throws exceptions. Since Spark 3.0, we disallow empty strings and will throw exceptions for data types except for `StringType` and `BinaryType`. + ## Upgrading From Spark SQL 2.3 to 2.4 - In Spark version 2.3 and earlier, the second parameter to the array_contains function is implicitly promoted to the element type of the first array type parameter. This type promotion can be lossy and may cause the `array_contains` function to return a wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and is illustrated in the table below. http://git-wip-us.apache.org/repos/asf/spark/blob/03e82e36/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala index 984979a..918c9e7 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala @@ -168,7 +168,7 @@ class JacksonParser( case VALUE_NUMBER_INT | VALUE_NUMBER_FLOAT => parser.getFloatValue -case VALUE_STRING => +case VALUE_STRING if parser.getTextLength >= 1 => // Special case handling for NaN and Infinity.
parser.getText match { case "NaN" => Float.NaN @@ -184,7 +184,7 @@ class JacksonParser( case VALUE_NUMBER_INT | VALUE_NUMBER_FLOAT => parser.getDoubleValue -case VALUE_STRING => +case VALUE_STRING if parser.getTextLength >= 1 => // Special case handling for NaN and Infinity. parser.getText match { case "NaN" => Double.NaN @@ -211,7 +211,7 @@ class JacksonParser( case TimestampType => (parser: JsonParser) => parseJsonToken[java.lang.Long](parser, dataType) { -case VALUE_STRING => +case VALUE_STRING if parser.getTextLength >= 1 => val stringValue = parser.getText // This one will lose microseconds parts. // See
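The JacksonParser change above adds a `parser.getTextLength >= 1` guard, so an empty string token no longer reaches the string branch for non-string types. A rough, hypothetical Python analogue of the float branch (not Spark's actual parser code) looks like this:

```python
# Hypothetical analogue of JacksonParser's float branch after the patch:
# an empty token raises instead of being silently coerced, while
# "NaN"/"Infinity" keep their special-case handling.
def parse_float_token(text):
    # Mirrors `case VALUE_STRING if parser.getTextLength >= 1`:
    # an empty token falls through to an error case.
    if len(text) < 1:
        raise ValueError("empty string is not allowed for non-string types")
    specials = {"NaN": float("nan"),
                "Infinity": float("inf"),
                "-Infinity": float("-inf")}
    if text in specials:
        return specials[text]
    return float(text)

print(parse_float_token("3.14"))  # 3.14
print(parse_float_token("NaN"))   # nan
```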
spark git commit: [SPARK-25785][SQL] Add prettyNames for from_json, to_json, from_csv, and schema_of_json
Repository: spark Updated Branches: refs/heads/master 4acbda4a9 -> 3370865b0 [SPARK-25785][SQL] Add prettyNames for from_json, to_json, from_csv, and schema_of_json ## What changes were proposed in this pull request? This PR adds `prettyNames` for `from_json`, `to_json`, `from_csv`, and `schema_of_json` so that appropriate names are used. ## How was this patch tested? Unit tests Closes #22773 from HyukjinKwon/minor-prettyNames. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3370865b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3370865b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3370865b Branch: refs/heads/master Commit: 3370865b0ebe9b04c6671631aee5917b41ceba9c Parents: 4acbda4 Author: hyukjinkwon Authored: Sat Oct 20 10:15:53 2018 +0800 Committer: hyukjinkwon Committed: Sat Oct 20 10:15:53 2018 +0800 -- .../catalyst/expressions/csvExpressions.scala | 2 + .../catalyst/expressions/jsonExpressions.scala | 6 +++ .../sql-tests/results/csv-functions.sql.out | 4 +- .../sql-tests/results/json-functions.sql.out| 50 ++-- .../native/stringCastAndExpressions.sql.out | 2 +- 5 files changed, 36 insertions(+), 28 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3370865b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala index a63b624..853b1ea 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala @@ -117,4 +117,6 @@ case class CsvToStructs( } override def inputTypes: Seq[AbstractDataType] = StringType :: Nil + + override def prettyName: 
String = "from_csv" } http://git-wip-us.apache.org/repos/asf/spark/blob/3370865b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala index 9f28483..b4815b4 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala @@ -610,6 +610,8 @@ case class JsonToStructs( case _: MapType => "entries" case _ => super.sql } + + override def prettyName: String = "from_json" } /** @@ -730,6 +732,8 @@ case class StructsToJson( override def nullSafeEval(value: Any): Any = converter(value) override def inputTypes: Seq[AbstractDataType] = TypeCollection(ArrayType, StructType) :: Nil + + override def prettyName: String = "to_json" } /** @@ -774,6 +778,8 @@ case class SchemaOfJson( UTF8String.fromString(dt.catalogString) } + + override def prettyName: String = "schema_of_json" } object JsonExprUtils { http://git-wip-us.apache.org/repos/asf/spark/blob/3370865b/sql/core/src/test/resources/sql-tests/results/csv-functions.sql.out -- diff --git a/sql/core/src/test/resources/sql-tests/results/csv-functions.sql.out b/sql/core/src/test/resources/sql-tests/results/csv-functions.sql.out index 15dbe36..f19f34a 100644 --- a/sql/core/src/test/resources/sql-tests/results/csv-functions.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/csv-functions.sql.out @@ -5,7 +5,7 @@ -- !query 0 select from_csv('1, 3.14', 'a INT, f FLOAT') -- !query 0 schema -struct> +struct> -- !query 0 output {"a":1,"f":3.14} @@ -13,7 +13,7 @@ struct> -- !query 1 select from_csv('26/08/2015', 'time Timestamp', map('timestampFormat', 'dd/MM/')) -- !query 1 schema -struct> +struct> -- !query 1 output {"time":2015-08-26 00:00:00.0} 
http://git-wip-us.apache.org/repos/asf/spark/blob/3370865b/sql/core/src/test/resources/sql-tests/results/json-functions.sql.out -- diff --git a/sql/core/src/test/resources/sql-tests/results/json-functions.sql.out b/sql/core/src/test/resources/sql-tests/results/json-functions.sql.out index 77e9000..868eee8 100644 --- a/sql/core/src/test/resources/sql-tests/results/json-functions.sql.out +++
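As a rough illustration of what `prettyName` affects, here is a hypothetical Python sketch (class and helper names are illustrative, not Spark's API): the displayed column name of an unaliased expression is rendered from the expression's pretty name, so an expression like `CsvToStructs` shows up in result schemas as `from_csv(...)` rather than a name derived from its class name.

```python
# Illustrative sketch only -- models how a short "pretty" name overrides the
# default class-derived name used when rendering result schemas.
class Expression:
    def pretty_name(self):
        # Default: the lowercased class name, e.g. "csvtostructs".
        return type(self).__name__.lower()

class CsvToStructs(Expression):
    def pretty_name(self):
        return "from_csv"  # overridden, mirroring the Scala change above

def schema_column_name(expr, args):
    # Unaliased expressions are displayed as prettyName(arg, ...).
    return "{}({})".format(expr.pretty_name(), ", ".join(args))

print(schema_column_name(CsvToStructs(), ["1, 3.14"]))  # from_csv(1, 3.14)
```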
spark git commit: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark
Repository: spark Updated Branches: refs/heads/master 7d425b190 -> c3eaee776 [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark Master ## What changes were proposed in this pull request? Previously, PySpark used the private constructor for SparkSession when building that object. This resulted in a SparkSession that never checked the sql.extensions parameter for additional session extensions. To fix this, we instead use the Session.builder() path, as SparkR does; this loads the extensions and allows their use in PySpark. ## How was this patch tested? An integration test was added which mimics the Scala test for the same feature. Closes #21990 from RussellSpitzer/SPARK-25003-master. Authored-by: Russell Spitzer Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c3eaee77 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c3eaee77 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c3eaee77 Branch: refs/heads/master Commit: c3eaee776509b0a23d0ba7a575575516bab4aa4e Parents: 7d425b1 Author: Russell Spitzer Authored: Thu Oct 18 12:29:09 2018 +0800 Committer: hyukjinkwon Committed: Thu Oct 18 12:29:09 2018 +0800 -- python/pyspark/sql/tests.py | 42 +++ .../org/apache/spark/sql/SparkSession.scala | 56 +--- 2 files changed, 80 insertions(+), 18 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c3eaee77/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index 85712df..8065d82 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -3837,6 +3837,48 @@ class QueryExecutionListenerTests(unittest.TestCase, SQLTestUtils): "The callback from the query execution listener should be called after 'toPandas'") +class SparkExtensionsTest(unittest.TestCase): +# These tests are separate because it uses
'spark.sql.extensions' which is +# static and immutable. This can't be set or unset, for example, via `spark.conf`. + +@classmethod +def setUpClass(cls): +import glob +from pyspark.find_spark_home import _find_spark_home + +SPARK_HOME = _find_spark_home() +filename_pattern = ( +"sql/core/target/scala-*/test-classes/org/apache/spark/sql/" +"SparkSessionExtensionSuite.class") +if not glob.glob(os.path.join(SPARK_HOME, filename_pattern)): +raise unittest.SkipTest( +"'org.apache.spark.sql.SparkSessionExtensionSuite' is not " +"available. Will skip the related tests.") + +# Note that 'spark.sql.extensions' is a static immutable configuration. +cls.spark = SparkSession.builder \ +.master("local[4]") \ +.appName(cls.__name__) \ +.config( +"spark.sql.extensions", +"org.apache.spark.sql.MyExtensions") \ +.getOrCreate() + +@classmethod +def tearDownClass(cls): +cls.spark.stop() + +def test_use_custom_class_for_extensions(self): +self.assertTrue( + self.spark._jsparkSession.sessionState().planner().strategies().contains( + self.spark._jvm.org.apache.spark.sql.MySparkStrategy(self.spark._jsparkSession)), +"MySparkStrategy not found in active planner strategies") +self.assertTrue( + self.spark._jsparkSession.sessionState().analyzer().extendedResolutionRules().contains( + self.spark._jvm.org.apache.spark.sql.MyRule(self.spark._jsparkSession)), +"MyRule not found in extended resolution rules") + + class SparkSessionTests(PySparkTestCase): # This test is separate because it's closely related with session's start and stop. 
http://git-wip-us.apache.org/repos/asf/spark/blob/c3eaee77/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala b/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala index 2b847fb..71f967a 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala @@ -84,8 +84,17 @@ class SparkSession private( // The call site where this SparkSession was constructed. private val creationSite: CallSite = Utils.getCallSite() + /** + * Constructor used in Pyspark. Contains explicit application of Spark Session Extensions + * which otherwise only occurs during
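The mechanism this patch routes PySpark through can be sketched in plain Python (no Spark). In this hypothetical model, all names are illustrative, not Spark's API: the builder reads the configured extensions name, lets the named extension mutate the session being built, and only then returns it, instead of constructing the session privately and skipping extensions entirely.

```python
# Hypothetical plain-Python model of builder-driven session extensions.
EXTENSIONS = {}  # registered extension name -> callable(session)

def register_extension(name):
    def deco(fn):
        EXTENSIONS[name] = fn
        return fn
    return deco

class Session:
    def __init__(self):
        self.strategies = []  # stands in for the planner's strategy list

def get_or_create(config):
    session = Session()
    name = config.get("sql.extensions")
    if name:
        EXTENSIONS[name](session)  # let the extension inject extra rules
    return session

@register_extension("my.Extensions")
def my_extensions(session):
    session.strategies.append("MySparkStrategy")

spark = get_or_create({"sql.extensions": "my.Extensions"})
print(spark.strategies)  # ['MySparkStrategy']
```

The design point the patch makes is that the extension hook lives in the builder path, so any entry point that bypasses the builder silently loses extensions.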
spark git commit: [SPARK-25579][SQL] Use quoted attribute names if needed in pushed ORC predicates
Repository: spark Updated Branches: refs/heads/master e028fd3ae -> 2c664edc0 [SPARK-25579][SQL] Use quoted attribute names if needed in pushed ORC predicates ## What changes were proposed in this pull request? This PR aims to fix an ORC performance regression at Spark 2.4.0 RCs from Spark 2.3.2. Currently, for column names with `.`, the pushed predicates are ignored. **Test Data** ```scala scala> val df = spark.range(Int.MaxValue).sample(0.2).toDF("col.with.dot") scala> df.write.mode("overwrite").orc("/tmp/orc") ``` **Spark 2.3.2** ```scala scala> spark.sql("set spark.sql.orc.impl=native") scala> spark.sql("set spark.sql.orc.filterPushdown=true") scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) ++ |col.with.dot| ++ | 5| | 7| | 8| ++ Time taken: 1542 ms scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) ++ |col.with.dot| ++ | 5| | 7| | 8| ++ Time taken: 152 ms ``` **Spark 2.4.0 RC3** ```scala scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) ++ |col.with.dot| ++ | 5| | 7| | 8| ++ Time taken: 4074 ms scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) ++ |col.with.dot| ++ | 5| | 7| | 8| ++ Time taken: 1771 ms ``` ## How was this patch tested? Pass the Jenkins with a newly added test case. Closes #22597 from dongjoon-hyun/SPARK-25579. 
Authored-by: Dongjoon Hyun Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2c664edc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2c664edc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2c664edc Branch: refs/heads/master Commit: 2c664edc060a41340eb374fd44b5d32c3c06a15c Parents: e028fd3 Author: Dongjoon Hyun Authored: Tue Oct 16 20:30:23 2018 +0800 Committer: hyukjinkwon Committed: Tue Oct 16 20:30:23 2018 +0800 -- .../execution/datasources/orc/OrcFilters.scala | 37 +++- .../datasources/orc/OrcQuerySuite.scala | 28 +-- .../sql/execution/datasources/orc/OrcTest.scala | 10 ++ 3 files changed, 46 insertions(+), 29 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2c664edc/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala index 2b17b47..0a64981 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala @@ -67,6 +67,16 @@ private[sql] object OrcFilters { } } + // Since ORC 1.5.0 (ORC-323), we need to quote for column names with `.` characters + // in order to distinguish predicate pushdown for nested columns. + private def quoteAttributeNameIfNeeded(name: String) : String = { +if (!name.contains("`") && name.contains(".")) { + s"`$name`" +} else { + name +} + } + /** * Create ORC filter as a SearchArgument instance. */ @@ -215,38 +225,47 @@ private[sql] object OrcFilters { // wrapped by a "parent" predicate (`And`, `Or`, or `Not`). 
case EqualTo(attribute, value) if isSearchableType(dataTypeMap(attribute)) => +val quotedName = quoteAttributeNameIfNeeded(attribute) val castedValue = castLiteralValue(value, dataTypeMap(attribute)) -Some(builder.startAnd().equals(attribute, getType(attribute), castedValue).end()) +Some(builder.startAnd().equals(quotedName, getType(attribute), castedValue).end()) case EqualNullSafe(attribute, value) if isSearchableType(dataTypeMap(attribute)) => +val quotedName = quoteAttributeNameIfNeeded(attribute) val castedValue = castLiteralValue(value, dataTypeMap(attribute)) -Some(builder.startAnd().nullSafeEquals(attribute, getType(attribute), castedValue).end()) +Some(builder.startAnd().nullSafeEquals(quotedName, getType(attribute), castedValue).end()) case LessThan(attribute, value) if isSearchableType(dataTypeMap(attribute)) => +val quotedName = quoteAttributeNameIfNeeded(attribute) val castedValue = castLiteralValue(value, dataTypeMap(attribute)) -Some(builder.startAnd().lessThan(attribute, getType(attribute), castedValue).end()) +
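The helper added above is small enough to transcribe directly; this is a Python rendering of `quoteAttributeNameIfNeeded` for illustration:

```python
# Python transcription of the Scala helper above: column names containing `.`
# are backquoted so ORC 1.5.0+ (ORC-323) does not misread them as
# nested-column paths; names that already contain a backquote are left alone.
def quote_attribute_name_if_needed(name):
    if "`" not in name and "." in name:
        return "`{}`".format(name)
    return name

print(quote_attribute_name_if_needed("col.with.dot"))  # `col.with.dot`
print(quote_attribute_name_if_needed("plain"))         # plain
```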
spark git commit: [SPARK-25579][SQL] Use quoted attribute names if needed in pushed ORC predicates
Repository: spark Updated Branches: refs/heads/branch-2.4 77156f8c8 -> 144cb949d [SPARK-25579][SQL] Use quoted attribute names if needed in pushed ORC predicates ## What changes were proposed in this pull request? This PR aims to fix an ORC performance regression at Spark 2.4.0 RCs from Spark 2.3.2. Currently, for column names with `.`, the pushed predicates are ignored. **Test Data** ```scala scala> val df = spark.range(Int.MaxValue).sample(0.2).toDF("col.with.dot") scala> df.write.mode("overwrite").orc("/tmp/orc") ``` **Spark 2.3.2** ```scala scala> spark.sql("set spark.sql.orc.impl=native") scala> spark.sql("set spark.sql.orc.filterPushdown=true") scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) ++ |col.with.dot| ++ | 5| | 7| | 8| ++ Time taken: 1542 ms scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) ++ |col.with.dot| ++ | 5| | 7| | 8| ++ Time taken: 152 ms ``` **Spark 2.4.0 RC3** ```scala scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) ++ |col.with.dot| ++ | 5| | 7| | 8| ++ Time taken: 4074 ms scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) ++ |col.with.dot| ++ | 5| | 7| | 8| ++ Time taken: 1771 ms ``` ## How was this patch tested? Pass the Jenkins with a newly added test case. Closes #22597 from dongjoon-hyun/SPARK-25579. 
Authored-by: Dongjoon Hyun Signed-off-by: hyukjinkwon (cherry picked from commit 2c664edc060a41340eb374fd44b5d32c3c06a15c) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/144cb949 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/144cb949 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/144cb949 Branch: refs/heads/branch-2.4 Commit: 144cb949d597e6cd0e662f2320e983cb6903ecfb Parents: 77156f8 Author: Dongjoon Hyun Authored: Tue Oct 16 20:30:23 2018 +0800 Committer: hyukjinkwon Committed: Tue Oct 16 20:30:40 2018 +0800 -- .../execution/datasources/orc/OrcFilters.scala | 37 +++- .../datasources/orc/OrcQuerySuite.scala | 28 +-- .../sql/execution/datasources/orc/OrcTest.scala | 10 ++ 3 files changed, 46 insertions(+), 29 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/144cb949/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala index dbafc46..5b93a60 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala @@ -67,6 +67,16 @@ private[sql] object OrcFilters { } } + // Since ORC 1.5.0 (ORC-323), we need to quote for column names with `.` characters + // in order to distinguish predicate pushdown for nested columns. + private def quoteAttributeNameIfNeeded(name: String) : String = { +if (!name.contains("`") && name.contains(".")) { + s"`$name`" +} else { + name +} + } + /** * Create ORC filter as a SearchArgument instance. */ @@ -178,38 +188,47 @@ private[sql] object OrcFilters { // wrapped by a "parent" predicate (`And`, `Or`, or `Not`). 
case EqualTo(attribute, value) if isSearchableType(dataTypeMap(attribute)) => +val quotedName = quoteAttributeNameIfNeeded(attribute) val castedValue = castLiteralValue(value, dataTypeMap(attribute)) -Some(builder.startAnd().equals(attribute, getType(attribute), castedValue).end()) +Some(builder.startAnd().equals(quotedName, getType(attribute), castedValue).end()) case EqualNullSafe(attribute, value) if isSearchableType(dataTypeMap(attribute)) => +val quotedName = quoteAttributeNameIfNeeded(attribute) val castedValue = castLiteralValue(value, dataTypeMap(attribute)) -Some(builder.startAnd().nullSafeEquals(attribute, getType(attribute), castedValue).end()) +Some(builder.startAnd().nullSafeEquals(quotedName, getType(attribute), castedValue).end()) case LessThan(attribute, value) if isSearchableType(dataTypeMap(attribute)) => +val quotedName = quoteAttributeNameIfNeeded(attribute) val castedValue = castLiteralValue(value,
spark git commit: [MINOR][SQL] Avoid hardcoded configuration keys in SQLConf's `doc`
Repository: spark Updated Branches: refs/heads/master 5e5d886a2 -> 5bd5e1b9c [MINOR][SQL] Avoid hardcoded configuration keys in SQLConf's `doc` ## What changes were proposed in this pull request? This PR proposes to avoid hardcoded configuration keys in SQLConf's `doc`. ## How was this patch tested? Manually verified. Closes #22877 from HyukjinKwon/minor-conf-name. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5bd5e1b9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5bd5e1b9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5bd5e1b9 Branch: refs/heads/master Commit: 5bd5e1b9c84b5f7d4d67ab94e02d49ebdd02f177 Parents: 5e5d886 Author: hyukjinkwon Authored: Tue Oct 30 07:38:26 2018 +0800 Committer: hyukjinkwon Committed: Tue Oct 30 07:38:26 2018 +0800 -- .../org/apache/spark/sql/internal/SQLConf.scala | 41 +++- 1 file changed, 23 insertions(+), 18 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5bd5e1b9/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala index 4edffce..535ec51 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala @@ -408,7 +408,8 @@ object SQLConf { val PARQUET_FILTER_PUSHDOWN_DATE_ENABLED = buildConf("spark.sql.parquet.filterPushdown.date") .doc("If true, enables Parquet filter push-down optimization for Date.
" + - "This configuration only has an effect when 'spark.sql.parquet.filterPushdown' is enabled.") + s"This configuration only has an effect when '${PARQUET_FILTER_PUSHDOWN_ENABLED.key}' is " + + "enabled.") .internal() .booleanConf .createWithDefault(true) @@ -416,7 +417,7 @@ object SQLConf { val PARQUET_FILTER_PUSHDOWN_TIMESTAMP_ENABLED = buildConf("spark.sql.parquet.filterPushdown.timestamp") .doc("If true, enables Parquet filter push-down optimization for Timestamp. " + -"This configuration only has an effect when 'spark.sql.parquet.filterPushdown' is " + +s"This configuration only has an effect when '${PARQUET_FILTER_PUSHDOWN_ENABLED.key}' is " + "enabled and Timestamp stored as TIMESTAMP_MICROS or TIMESTAMP_MILLIS type.") .internal() .booleanConf @@ -425,7 +426,8 @@ object SQLConf { val PARQUET_FILTER_PUSHDOWN_DECIMAL_ENABLED = buildConf("spark.sql.parquet.filterPushdown.decimal") .doc("If true, enables Parquet filter push-down optimization for Decimal. " + -"This configuration only has an effect when 'spark.sql.parquet.filterPushdown' is enabled.") +s"This configuration only has an effect when '${PARQUET_FILTER_PUSHDOWN_ENABLED.key}' is " + +"enabled.") .internal() .booleanConf .createWithDefault(true) @@ -433,7 +435,8 @@ object SQLConf { val PARQUET_FILTER_PUSHDOWN_STRING_STARTSWITH_ENABLED = buildConf("spark.sql.parquet.filterPushdown.string.startsWith") .doc("If true, enables Parquet filter push-down optimization for string startsWith function. " + - "This configuration only has an effect when 'spark.sql.parquet.filterPushdown' is enabled.") + s"This configuration only has an effect when '${PARQUET_FILTER_PUSHDOWN_ENABLED.key}' is " + + "enabled.") .internal() .booleanConf .createWithDefault(true) @@ -444,7 +447,8 @@ object SQLConf { "Large threshold won't necessarily provide much better performance. " + "The experiment argued that 300 is the limit threshold. " + "By setting this value to 0 this feature can be disabled. 
" + -"This configuration only has an effect when 'spark.sql.parquet.filterPushdown' is enabled.") +s"This configuration only has an effect when '${PARQUET_FILTER_PUSHDOWN_ENABLED.key}' is " + +"enabled.") .internal() .intConf .checkValue(threshold => threshold >= 0, "The threshold must not be negative.") @@ -459,14 +463,6 @@ object SQLConf { .booleanConf .createWithDefault(false) - val PARQUET_RECORD_FILTER_ENABLED = buildConf("spark.sql.parquet.recordLevelFilter.enabled") -.doc("If true, enables Parquet's native record-level filtering using the pushed down " + - "filters. This configuration only has an effect when 'spark.sql.parquet.filterPushdown' " + - "is enabled and the vectorized reader is not used. You can ensure the vectorized reader " + - "is not used by setting
spark git commit: [SPARK-25672][SQL] schema_of_csv() - schema inference from an example
Repository: spark Updated Branches: refs/heads/master c5ef477d2 -> c9667aff4 [SPARK-25672][SQL] schema_of_csv() - schema inference from an example ## What changes were proposed in this pull request? In the PR, I propose to add new function - *schema_of_csv()* which infers schema of CSV string literal. The result of the function is a string containing a schema in DDL format. For example: ```sql select schema_of_csv('1|abc', map('delimiter', '|')) ``` ``` struct<_c0:int,_c1:string> ``` ## How was this patch tested? Added new tests to `CsvFunctionsSuite`, `CsvExpressionsSuite` and SQL tests to `csv-functions.sql` Closes #22666 from MaxGekk/schema_of_csv-function. Lead-authored-by: hyukjinkwon Co-authored-by: Maxim Gekk Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c9667aff Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c9667aff Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c9667aff Branch: refs/heads/master Commit: c9667aff4f4888b650fad2ed41698025b1e84166 Parents: c5ef477 Author: hyukjinkwon Authored: Thu Nov 1 09:14:16 2018 +0800 Committer: hyukjinkwon Committed: Thu Nov 1 09:14:16 2018 +0800 -- python/pyspark/sql/functions.py | 41 +++- .../catalyst/analysis/FunctionRegistry.scala| 3 +- .../spark/sql/catalyst/csv/CSVInferSchema.scala | 220 +++ .../sql/catalyst/expressions/ExprUtils.scala| 33 ++- .../catalyst/expressions/csvExpressions.scala | 54 + .../catalyst/expressions/jsonExpressions.scala | 16 +- .../sql/catalyst/csv/CSVInferSchemaSuite.scala | 142 .../sql/catalyst/csv/UnivocityParserSuite.scala | 199 + .../expressions/CsvExpressionsSuite.scala | 10 + .../datasources/csv/CSVDataSource.scala | 2 +- .../datasources/csv/CSVInferSchema.scala| 214 -- .../scala/org/apache/spark/sql/functions.scala | 35 +++ .../sql-tests/inputs/csv-functions.sql | 8 + .../sql-tests/results/csv-functions.sql.out | 54 - .../apache/spark/sql/CsvFunctionsSuite.scala| 15 ++ 
.../datasources/csv/CSVInferSchemaSuite.scala | 143 .../datasources/csv/UnivocityParserSuite.scala | 200 - 17 files changed, 803 insertions(+), 586 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c9667aff/python/pyspark/sql/functions.py -- diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index ca2a256..beb1a06 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -2364,6 +2364,33 @@ def schema_of_json(json, options={}): return Column(jc) +@ignore_unicode_prefix +@since(3.0) +def schema_of_csv(csv, options={}): +""" +Parses a CSV string and infers its schema in DDL format. + +:param col: a CSV string or a string literal containing a CSV string. +:param options: options to control parsing. accepts the same options as the CSV datasource + +>>> df = spark.range(1) +>>> df.select(schema_of_csv(lit('1|a'), {'sep':'|'}).alias("csv")).collect() +[Row(csv=u'struct<_c0:int,_c1:string>')] +>>> df.select(schema_of_csv('1|a', {'sep':'|'}).alias("csv")).collect() +[Row(csv=u'struct<_c0:int,_c1:string>')] +""" +if isinstance(csv, basestring): +col = _create_column_from_literal(csv) +elif isinstance(csv, Column): +col = _to_java_column(csv) +else: +raise TypeError("schema argument should be a column or string") + +sc = SparkContext._active_spark_context +jc = sc._jvm.functions.schema_of_csv(col, options) +return Column(jc) + + @since(1.5) def size(col): """ @@ -2664,13 +2691,13 @@ def from_csv(col, schema, options={}): :param schema: a string with schema in DDL format to use when parsing the CSV column. :param options: options to control parsing. 
accepts the same options as the CSV datasource ->>> data = [(1, '1')] ->>> df = spark.createDataFrame(data, ("key", "value")) ->>> df.select(from_csv(df.value, "a INT").alias("csv")).collect() -[Row(csv=Row(a=1))] ->>> df = spark.createDataFrame(data, ("key", "value")) ->>> df.select(from_csv(df.value, lit("a INT")).alias("csv")).collect() -[Row(csv=Row(a=1))] +>>> data = [("1,2,3",)] +>>> df = spark.createDataFrame(data, ("value",)) +>>> df.select(from_csv(df.value, "a INT, b INT, c INT").alias("csv")).collect() +[Row(csv=Row(a=1, b=2, c=3))] +>>> value = data[0][0] +>>> df.select(from_csv(df.value, schema_of_csv(value)).alias("csv")).collect() +[Row(csv=Row(_c0=1, _c1=2, _c2=3))] """ sc = SparkContext._active_spark_context
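To make the new function's behavior concrete, here is a toy re-implementation of the inference step in plain Python. It only distinguishes int/double/string and only honors a delimiter option, whereas Spark's real `CSVInferSchema` supports the full CSV option set and many more types — treat it as a sketch of the idea, not the actual algorithm:

```python
# Toy sketch of schema_of_csv('1|abc', map('delimiter', '|')): split one
# example record on the delimiter, guess a type per field, and render the
# result in Spark's DDL struct notation.

def toy_schema_of_csv(record, delimiter=","):
    def infer(value):
        try:
            int(value)
            return "int"
        except ValueError:
            pass
        try:
            float(value)
            return "double"
        except ValueError:
            return "string"

    fields = record.split(delimiter)
    cols = ",".join("_c%d:%s" % (i, infer(v)) for i, v in enumerate(fields))
    return "struct<%s>" % cols

print(toy_schema_of_csv("1|abc", delimiter="|"))  # struct<_c0:int,_c1:string>
```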
spark git commit: [SPARK-25886][SQL][MINOR] Improve error message of `FailureSafeParser` and `from_avro` in FAILFAST mode
Repository: spark Updated Branches: refs/heads/master 3c0e9ce94 -> 57eddc718 [SPARK-25886][SQL][MINOR] Improve error message of `FailureSafeParser` and `from_avro` in FAILFAST mode ## What changes were proposed in this pull request? Currently in `FailureSafeParser` and `from_avro`, the exception is created with such code ``` throw new SparkException("Malformed records are detected in record parsing. " + s"Parse Mode: ${FailFastMode.name}.", e.cause) ``` 1. The cause part should be `e` instead of `e.cause` 2. If `e` contains non-null message, it should be shown in `from_json`/`from_csv`/`from_avro`, e.g. ``` com.fasterxml.jackson.core.JsonParseException: Unexpected character ('1' (code 49)): was expecting a colon to separate field name and value at [Source: (InputStreamReader); line: 1, column: 7] ``` 3.Kindly show hint for trying PERMISSIVE in error message. ## How was this patch tested? Unit test. Closes #22895 from gengliangwang/improve_error_msg. Authored-by: Gengliang Wang Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/57eddc71 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/57eddc71 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/57eddc71 Branch: refs/heads/master Commit: 57eddc7182ece0030f6d0cc02339c0b8d8c0be5c Parents: 3c0e9ce Author: Gengliang Wang Authored: Wed Oct 31 20:22:57 2018 +0800 Committer: hyukjinkwon Committed: Wed Oct 31 20:22:57 2018 +0800 -- .../main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala | 2 +- .../org/apache/spark/sql/catalyst/util/FailureSafeParser.scala| 3 ++- 2 files changed, 3 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/57eddc71/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala -- diff --git a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala 
b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala index ae61587..5656ac7 100644 --- a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala +++ b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala @@ -102,7 +102,7 @@ case class AvroDataToCatalyst( case FailFastMode => throw new SparkException("Malformed records are detected in record parsing. " + s"Current parse Mode: ${FailFastMode.name}. To process malformed records as null " + -"result, try setting the option 'mode' as 'PERMISSIVE'.", e.getCause) +"result, try setting the option 'mode' as 'PERMISSIVE'.", e) case _ => throw new AnalysisException(unacceptableModeMessage(parseMode.name)) } http://git-wip-us.apache.org/repos/asf/spark/blob/57eddc71/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala index fecfff5..76745b1 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala @@ -73,7 +73,8 @@ class FailureSafeParser[IN]( Iterator.empty case FailFastMode => throw new SparkException("Malformed records are detected in record parsing. " + -s"Parse Mode: ${FailFastMode.name}.", e.cause) +s"Parse Mode: ${FailFastMode.name}. To process malformed records as null " + +"result, try setting the option 'mode' as 'PERMISSIVE'.", e) } } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
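The essence of the fix is exception chaining: the wrapper exception must carry the caught exception `e` itself as its cause, not `e.cause`, which silently drops the parser's useful message. A Python analogue of the corrected FAILFAST behavior — the `SparkException` class and `parse_record` helper below are stand-ins for illustration, not PySpark APIs:

```python
# The corrected chaining: wrap the caught exception `e` itself (`raise ... from e`),
# so the original parser error stays reachable as __cause__.

class SparkException(Exception):
    pass

def parse_record(raw):
    return int(raw)  # raises ValueError with a useful message for bad input

def fail_fast_parse(raw):
    try:
        return parse_record(raw)
    except ValueError as e:
        # Keep `e` as the cause and hint at PERMISSIVE, as the patch does.
        raise SparkException(
            "Malformed records are detected in record parsing. "
            "Parse Mode: FAILFAST. To process malformed records as null "
            "result, try setting the option 'mode' as 'PERMISSIVE'.") from e

try:
    fail_fast_parse("not-a-number")
except SparkException as exc:
    print(type(exc.__cause__).__name__)  # ValueError — original error preserved
```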
spark git commit: [SPARKR] found some extra whitespace in the R tests
Repository: spark Updated Branches: refs/heads/master f6ff6329e -> 243ce319a [SPARKR] found some extra whitespace in the R tests ## What changes were proposed in this pull request? during my ubuntu-port testing, i found some extra whitespace that for some reason wasn't getting caught on the centos lint-r build step. ## How was this patch tested? the build system will test this! i used one of my ubuntu testing builds and scped over the modified file. before my fix: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-ubuntu-testing/22/console after my fix: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-ubuntu-testing/23/console Closes #22896 from shaneknapp/remove-extra-whitespace. Authored-by: shane knapp Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/243ce319 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/243ce319 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/243ce319 Branch: refs/heads/master Commit: 243ce319a06f20365d5b08d479642d75748645d9 Parents: f6ff632 Author: shane knapp Authored: Wed Oct 31 10:32:26 2018 +0800 Committer: hyukjinkwon Committed: Wed Oct 31 10:32:26 2018 +0800 -- R/pkg/tests/fulltests/test_sparkSQL_eager.R | 16 1 file changed, 8 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/243ce319/R/pkg/tests/fulltests/test_sparkSQL_eager.R -- diff --git a/R/pkg/tests/fulltests/test_sparkSQL_eager.R b/R/pkg/tests/fulltests/test_sparkSQL_eager.R index df7354f..9b4489a 100644 --- a/R/pkg/tests/fulltests/test_sparkSQL_eager.R +++ b/R/pkg/tests/fulltests/test_sparkSQL_eager.R @@ -22,12 +22,12 @@ context("test show SparkDataFrame when eager execution is enabled.") test_that("eager execution is not enabled", { # Start Spark session without eager execution enabled sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE) - + df <- 
createDataFrame(faithful) expect_is(df, "SparkDataFrame") expected <- "eruptions:double, waiting:double" expect_output(show(df), expected) - + # Stop Spark session sparkR.session.stop() }) @@ -35,9 +35,9 @@ test_that("eager execution is enabled", { # Start Spark session with eager execution enabled sparkConfig <- list(spark.sql.repl.eagerEval.enabled = "true") - + sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, sparkConfig = sparkConfig) - + df <- createDataFrame(faithful) expect_is(df, "SparkDataFrame") expected <- paste0("(+-+---+\n", @@ -45,7 +45,7 @@ test_that("eager execution is enabled", { "+-+---+\n)*", "(only showing top 20 rows)") expect_output(show(df), expected) - + # Stop Spark session sparkR.session.stop() }) @@ -55,9 +55,9 @@ test_that("eager execution is enabled with maxNumRows and truncate set", { sparkConfig <- list(spark.sql.repl.eagerEval.enabled = "true", spark.sql.repl.eagerEval.maxNumRows = as.integer(5), spark.sql.repl.eagerEval.truncate = as.integer(2)) - + sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, sparkConfig = sparkConfig) - + df <- arrange(createDataFrame(faithful), "waiting") expect_is(df, "SparkDataFrame") expected <- paste0("(+-+---+\n", @@ -66,7 +66,7 @@ test_that("eager execution is enabled with maxNumRows and truncate set", { "| 1.| 43|\n)*", "(only showing top 5 rows)") expect_output(show(df), expected) - + # Stop Spark session sparkR.session.stop() })
spark git commit: [SPARK-25847][SQL][TEST] Refactor JSONBenchmarks to use main method
Repository: spark Updated Branches: refs/heads/master 891032da6 -> f6ff6329e [SPARK-25847][SQL][TEST] Refactor JSONBenchmarks to use main method ## What changes were proposed in this pull request? Refactor JSONBenchmark to use main method use spark-submit: `bin/spark-submit --class org.apache.spark.sql.execution.datasources.json.JSONBenchmark --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar,./sql/catalyst/target/spark-catalyst_2.11-3.0.0-SNAPSHOT-tests.jar ./sql/core/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar` Generate benchmark result: `SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.datasources.json.JSONBenchmark"` ## How was this patch tested? manual tests Closes #22844 from heary-cao/JSONBenchmarks. Lead-authored-by: caoxuewen Co-authored-by: heary Co-authored-by: Dongjoon Hyun Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f6ff6329 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f6ff6329 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f6ff6329 Branch: refs/heads/master Commit: f6ff6329eee720e19a56b90c0ffda9da5cecca5b Parents: 891032d Author: caoxuewen Authored: Wed Oct 31 10:28:17 2018 +0800 Committer: hyukjinkwon Committed: Wed Oct 31 10:28:17 2018 +0800 -- sql/core/benchmarks/JSONBenchmark-results.txt | 37 .../datasources/json/JsonBenchmark.scala| 183 .../datasources/json/JsonBenchmarks.scala | 217 --- 3 files changed, 220 insertions(+), 217 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f6ff6329/sql/core/benchmarks/JSONBenchmark-results.txt -- diff --git a/sql/core/benchmarks/JSONBenchmark-results.txt b/sql/core/benchmarks/JSONBenchmark-results.txt new file mode 100644 index 000..9993730 --- /dev/null +++ b/sql/core/benchmarks/JSONBenchmark-results.txt @@ -0,0 +1,37 @@ + +Benchmark for performance of JSON parsing + + +Preparing data for benchmarking ... 
+OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64 +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz +JSON schema inferring: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative + +No encoding 62946 / 63310 1.6 629.5 1.0X +UTF-8 is set 112814 / 112866 0.9 1128.1 0.6X + +Preparing data for benchmarking ... +OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64 +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz +JSON per-line parsing: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative + +No encoding 16468 / 16553 6.1 164.7 1.0X +UTF-8 is set16420 / 16441 6.1 164.2 1.0X + +Preparing data for benchmarking ... +OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64 +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz +JSON parsing of wide lines: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative + +No encoding 39789 / 40053 0.3 3978.9 1.0X +UTF-8 is set39505 / 39584 0.3 3950.5 1.0X + +OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64 +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz +Count a dataset with 10 columns: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative + +Select 10 columns + count() 15997 / 16015 0.6 1599.7 1.0X +Select 1 column + count() 13280 / 13326 0.8 1328.0 1.2X +count() 3006 / 3021 3.3 300.6 5.3X + + http://git-wip-us.apache.org/repos/asf/spark/blob/f6ff6329/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala -- diff --git
spark git commit: [SPARK-24709][SQL][2.4] use str instead of basestring in isinstance
Repository: spark Updated Branches: refs/heads/branch-2.4 f575616db -> 0f74bac64 [SPARK-24709][SQL][2.4] use str instead of basestring in isinstance ## What changes were proposed in this pull request? after backport https://github.com/apache/spark/pull/22775 to 2.4, the 2.4 sbt Jenkins QA job is broken, see https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.4-test-sbt-hadoop-2.7/147/console This PR adds `if sys.version >= '3': basestring = str` which only exists in master. ## How was this patch tested? existing test Closes #22858 from cloud-fan/python. Authored-by: Wenchen Fan Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0f74bac6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0f74bac6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0f74bac6 Branch: refs/heads/branch-2.4 Commit: 0f74bac647c9f8fce112eada7913504b2c6d08fa Parents: f575616 Author: Wenchen Fan Authored: Sun Oct 28 10:50:46 2018 +0800 Committer: hyukjinkwon Committed: Sun Oct 28 10:50:46 2018 +0800 -- python/pyspark/sql/functions.py | 3 +++ 1 file changed, 3 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0f74bac6/python/pyspark/sql/functions.py -- diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index 9583a98..e1d6ea3 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -25,6 +25,9 @@ import warnings if sys.version < "3": from itertools import imap as map +if sys.version >= '3': +basestring = str + from pyspark import since, SparkContext from pyspark.rdd import ignore_unicode_prefix, PythonEvalType from pyspark.sql.column import Column, _to_java_column, _to_seq, _create_column_from_literal
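For context, here is the shim in isolation and why it helps: on Python 3 the name `basestring` no longer exists, so aliasing it to `str` lets a single `isinstance(x, basestring)` check run unchanged on both Python 2 and Python 3. The `describe` helper below is hypothetical, added only to exercise the check:

```python
# The Python 2/3 compatibility shim from the patch, used by a toy helper that
# accepts either a DDL schema string or some other schema object.
import sys

if sys.version >= '3':
    basestring = str  # on Py3, basestring becomes an alias for str

def describe(schema):
    if isinstance(schema, basestring):  # works on both Py2 and Py3
        return "DDL string schema"
    return "column-based schema"

print(describe("a INT"))  # DDL string schema
```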
spark git commit: [SPARK-25638][SQL] Adding new function - to_csv()
Repository: spark Updated Branches: refs/heads/master 1a7abf3f4 -> 39399f40b [SPARK-25638][SQL] Adding new function - to_csv() ## What changes were proposed in this pull request? New functions takes a struct and converts it to a CSV strings using passed CSV options. It accepts the same CSV options as CSV data source does. ## How was this patch tested? Added `CsvExpressionsSuite`, `CsvFunctionsSuite` as well as R, Python and SQL tests similar to tests for `to_json()` Closes #22626 from MaxGekk/to_csv. Lead-authored-by: Maxim Gekk Co-authored-by: Maxim Gekk Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/39399f40 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/39399f40 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/39399f40 Branch: refs/heads/master Commit: 39399f40b861f7d8e60d0e25d2f8801343477834 Parents: 1a7abf3 Author: Maxim Gekk Authored: Sun Nov 4 14:57:38 2018 +0800 Committer: hyukjinkwon Committed: Sun Nov 4 14:57:38 2018 +0800 -- R/pkg/NAMESPACE | 1 + R/pkg/R/functions.R | 31 +-- R/pkg/R/generics.R | 4 + R/pkg/tests/fulltests/test_sparkSQL.R | 5 ++ python/pyspark/sql/functions.py | 22 + .../catalyst/analysis/FunctionRegistry.scala| 3 +- .../sql/catalyst/csv/UnivocityGenerator.scala | 93 .../catalyst/expressions/csvExpressions.scala | 67 ++ .../expressions/CsvExpressionsSuite.scala | 44 + .../datasources/csv/CSVFileFormat.scala | 2 +- .../datasources/csv/UnivocityGenerator.scala| 90 --- .../scala/org/apache/spark/sql/functions.scala | 26 ++ .../sql-tests/inputs/csv-functions.sql | 6 ++ .../sql-tests/results/csv-functions.sql.out | 36 +++- .../apache/spark/sql/CsvFunctionsSuite.scala| 14 ++- 15 files changed, 345 insertions(+), 99 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/39399f40/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index f9f556e..9d4f05a 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ 
-380,6 +380,7 @@ exportMethods("%<=>%", "tanh", "toDegrees", "toRadians", + "to_csv", "to_date", "to_json", "to_timestamp", http://git-wip-us.apache.org/repos/asf/spark/blob/39399f40/R/pkg/R/functions.R -- diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index d2ca1d6..9292363 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -187,6 +187,7 @@ NULL #' \itemize{ #' \item \code{to_json}: it is the column containing the struct, array of the structs, #' the map or array of maps. +#' \item \code{to_csv}: it is the column containing the struct. #' \item \code{from_json}: it is the column containing the JSON string. #' \item \code{from_csv}: it is the column containing the CSV string. #' } @@ -204,11 +205,11 @@ NULL #' also supported for the schema. #' \item \code{from_csv}: a DDL-formatted string #' } -#' @param ... additional argument(s). In \code{to_json} and \code{from_json}, this contains -#'additional named properties to control how it is converted, accepts the same -#'options as the JSON data source. Additionally \code{to_json} supports the "pretty" -#'option which enables pretty JSON generation. In \code{arrays_zip}, this contains -#'additional Columns of arrays to be merged. +#' @param ... additional argument(s). In \code{to_json}, \code{to_csv} and \code{from_json}, +#'this contains additional named properties to control how it is converted, accepts +#'the same options as the JSON/CSV data source. Additionally \code{to_json} supports +#'the "pretty" option which enables pretty JSON generation. In \code{arrays_zip}, +#'this contains additional Columns of arrays to be merged. #' @name column_collection_functions #' @rdname column_collection_functions #' @family collection functions @@ -1741,6 +1742,26 @@ setMethod("to_json", signature(x = "Column"), }) #' @details +#' \code{to_csv}: Converts a column containing a \code{structType} into a Column of CSV string. +#' Resolving the Column can fail if an unsupported type is encountered. 
+#' +#' @rdname column_collection_functions +#' @aliases to_csv to_csv,Column-method +#' @examples +#' +#' \dontrun{ +#' # Converts a
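As a rough illustration of what `to_csv` does to a single struct value, the sketch below renders a list of field values as one CSV line. Only a `sep` option is modeled; the real function accepts the full CSV data source option set, and `toy_to_csv` is not part of any Spark API:

```python
# Minimal sketch: serialize one struct's field values as a CSV line, with a
# configurable separator and standard quoting when a value contains it.
import csv
import io

def toy_to_csv(struct_values, sep=","):
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=sep, lineterminator="")
    writer.writerow(struct_values)
    return buf.getvalue()

print(toy_to_csv([2, "hello"]))          # 2,hello
print(toy_to_csv([1, "a|b"], sep="|"))   # 1|"a|b"  (value containing sep is quoted)
```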
spark git commit: [INFRA] Close stale PRs
Repository: spark Updated Branches: refs/heads/master 39399f40b -> 463a67668 [INFRA] Close stale PRs Closes https://github.com/apache/spark/pull/22859 Closes https://github.com/apache/spark/pull/22849 Closes https://github.com/apache/spark/pull/22591 Closes https://github.com/apache/spark/pull/22322 Closes https://github.com/apache/spark/pull/22312 Closes https://github.com/apache/spark/pull/19590 Closes #22934 from wangyum/CloseStalePRs. Authored-by: Yuming Wang Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/463a6766 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/463a6766 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/463a6766 Branch: refs/heads/master Commit: 463a6766876942e90f10d1ce2d1e36a8284bfbc2 Parents: 39399f4 Author: Yuming Wang Authored: Sun Nov 4 14:59:33 2018 +0800 Committer: hyukjinkwon Committed: Sun Nov 4 14:59:33 2018 +0800 -- --
spark git commit: [SPARK-25819][SQL] Support parse mode option for the function `from_avro`
Repository: spark Updated Branches: refs/heads/master 79f3babcc -> 24e8c27df [SPARK-25819][SQL] Support parse mode option for the function `from_avro` ## What changes were proposed in this pull request? Current the function `from_avro` throws exception on reading corrupt records. In practice, there could be various reasons of data corruption. It would be good to support `PERMISSIVE` mode and allow the function from_avro to process all the input file/streaming, which is consistent with from_json and from_csv. There is no obvious down side for supporting `PERMISSIVE` mode. Different from `from_csv` and `from_json`, the default parse mode is `FAILFAST` for the following reasons: 1. Since Avro is structured data format, input data is usually able to be parsed by certain schema. In such case, exposing the problems of input data to users is better than hiding it. 2. For `PERMISSIVE` mode, we have to force the data schema as fully nullable. This seems quite unnecessary for Avro. Reversing non-null schema might archive more perf optimizations in Spark. 3. To be consistent with the behavior in Spark 2.4 . ## How was this patch tested? Unit test Manual previewing generated html for the Avro data source doc: ![image](https://user-images.githubusercontent.com/1097932/47510100-02558880-d8aa-11e8-9d57-a43daee4c6b9.png) Closes #22814 from gengliangwang/improve_from_avro. 
Authored-by: Gengliang Wang Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/24e8c27d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/24e8c27d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/24e8c27d Branch: refs/heads/master Commit: 24e8c27dfe31e6e0a53c89e6ddc36327e537931b Parents: 79f3bab Author: Gengliang Wang Authored: Fri Oct 26 11:39:38 2018 +0800 Committer: hyukjinkwon Committed: Fri Oct 26 11:39:38 2018 +0800 -- docs/sql-data-sources-avro.md | 18 +++- .../spark/sql/avro/AvroDataToCatalyst.scala | 90 +--- .../org/apache/spark/sql/avro/AvroOptions.scala | 16 +++- .../org/apache/spark/sql/avro/package.scala | 28 +- .../avro/AvroCatalystDataConversionSuite.scala | 58 +++-- .../spark/sql/avro/AvroFunctionsSuite.scala | 36 +++- 6 files changed, 219 insertions(+), 27 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/24e8c27d/docs/sql-data-sources-avro.md -- diff --git a/docs/sql-data-sources-avro.md b/docs/sql-data-sources-avro.md index d3b81f0..bfe641d 100644 --- a/docs/sql-data-sources-avro.md +++ b/docs/sql-data-sources-avro.md @@ -142,7 +142,10 @@ StreamingQuery query = output ## Data Source Option -Data source options of Avro can be set using the `.option` method on `DataFrameReader` or `DataFrameWriter`. +Data source options of Avro can be set via: + * the `.option` method on `DataFrameReader` or `DataFrameWriter`. + * the `options` parameter in function `from_avro`. + Property NameDefaultMeaningScope @@ -177,6 +180,19 @@ Data source options of Avro can be set using the `.option` method on `DataFrameR Currently supported codecs are uncompressed, snappy, deflate, bzip2 and xz. If the option is not set, the configuration spark.sql.avro.compression.codec config is taken into account. write + +mode +FAILFAST +The mode option allows to specify parse mode for function from_avro. 
+ Currently supported modes are: + +FAILFAST: Throws an exception on processing corrupted record. +PERMISSIVE: Corrupt records are processed as null result. Therefore, the +data schema is forced to be fully nullable, which might be different from the one user provided. + + +function from_avro + ## Configuration http://git-wip-us.apache.org/repos/asf/spark/blob/24e8c27d/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala -- diff --git a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala index 915769f..43d3f6e 100644 --- a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala +++ b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala @@ -17,20 +17,37 @@ package org.apache.spark.sql.avro +import scala.util.control.NonFatal + import org.apache.avro.Schema import org.apache.avro.generic.GenericDatumReader import org.apache.avro.io.{BinaryDecoder, DecoderFactory} -import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, UnaryExpression} +import
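The mode dispatch described above can be rendered schematically in Python. `decode_avro` and `from_avro_sketch` are illustrative stand-ins — real decoding happens in `AvroDataToCatalyst` on the JVM — but they show why FAILFAST is the default (corruption surfaces immediately, with the original cause attached) while PERMISSIVE turns a corrupt record into a null result:

```python
# Schematic parse-mode dispatch for from_avro: FAILFAST re-raises with the
# original error chained as the cause; PERMISSIVE maps corruption to None.

def decode_avro(record):
    if record == "corrupt":
        raise ValueError("malformed Avro payload")
    return {"id": int(record)}

def from_avro_sketch(record, mode="FAILFAST"):
    try:
        return decode_avro(record)
    except ValueError as e:
        if mode == "PERMISSIVE":
            return None  # corrupt record becomes a null row
        if mode == "FAILFAST":
            raise RuntimeError(
                "Malformed records are detected in record parsing. "
                "Current parse Mode: FAILFAST. To process malformed records "
                "as null result, try setting the option 'mode' as "
                "'PERMISSIVE'.") from e
        raise ValueError("unacceptable parse mode: %s" % mode)

print(from_avro_sketch("7"))                      # {'id': 7}
print(from_avro_sketch("corrupt", "PERMISSIVE"))  # None
```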
spark git commit: [SPARK-25763][SQL][PYSPARK][TEST] Use more `@contextmanager` to ensure clean-up each test.
Repository: spark Updated Branches: refs/heads/master 1117fc35f -> e80f18dbd [SPARK-25763][SQL][PYSPARK][TEST] Use more `@contextmanager` to ensure clean-up each test. ## What changes were proposed in this pull request? Currently each test in `SQLTest` in PySpark is not cleaned properly. We should introduce and use more `contextmanager` to be convenient to clean up the context properly. ## How was this patch tested? Modified tests. Closes #22762 from ueshin/issues/SPARK-25763/cleanup_sqltests. Authored-by: Takuya UESHIN Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e80f18db Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e80f18db Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e80f18db Branch: refs/heads/master Commit: e80f18dbd8bc4c2aca9ba6dd487b50e95c55d2e6 Parents: 1117fc3 Author: Takuya UESHIN Authored: Fri Oct 19 00:31:01 2018 +0800 Committer: hyukjinkwon Committed: Fri Oct 19 00:31:01 2018 +0800 -- python/pyspark/sql/tests.py | 556 ++- 1 file changed, 318 insertions(+), 238 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e80f18db/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index 8065d82..82dc5a6 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -225,6 +225,63 @@ class SQLTestUtils(object): else: self.spark.conf.set(key, old_value) +@contextmanager +def database(self, *databases): +""" +A convenient context manager to test with some specific databases. This drops the given +databases if exist and sets current database to "default" when it exits. +""" +assert hasattr(self, "spark"), "it should have 'spark' attribute, having a spark session." 
+ +try: +yield +finally: +for db in databases: +self.spark.sql("DROP DATABASE IF EXISTS %s CASCADE" % db) +self.spark.catalog.setCurrentDatabase("default") + +@contextmanager +def table(self, *tables): +""" +A convenient context manager to test with some specific tables. This drops the given tables +if exist when it exits. +""" +assert hasattr(self, "spark"), "it should have 'spark' attribute, having a spark session." + +try: +yield +finally: +for t in tables: +self.spark.sql("DROP TABLE IF EXISTS %s" % t) + +@contextmanager +def tempView(self, *views): +""" +A convenient context manager to test with some specific views. This drops the given views +if exist when it exits. +""" +assert hasattr(self, "spark"), "it should have 'spark' attribute, having a spark session." + +try: +yield +finally: +for v in views: +self.spark.catalog.dropTempView(v) + +@contextmanager +def function(self, *functions): +""" +A convenient context manager to test with some specific functions. This drops the given +functions if exist when it exits. +""" +assert hasattr(self, "spark"), "it should have 'spark' attribute, having a spark session." 
+ +try: +yield +finally: +for f in functions: +self.spark.sql("DROP FUNCTION IF EXISTS %s" % f) + class ReusedSQLTestCase(ReusedPySparkTestCase, SQLTestUtils): @classmethod @@ -332,6 +389,7 @@ class SQLTests(ReusedSQLTestCase): @classmethod def setUpClass(cls): ReusedSQLTestCase.setUpClass() +cls.spark.catalog._reset() cls.tempdir = tempfile.NamedTemporaryFile(delete=False) os.unlink(cls.tempdir.name) cls.testData = [Row(key=i, value=str(i)) for i in range(100)] @@ -347,12 +405,6 @@ class SQLTests(ReusedSQLTestCase): sqlContext2 = SQLContext(self.sc) self.assertTrue(sqlContext1.sparkSession is sqlContext2.sparkSession) -def tearDown(self): -super(SQLTests, self).tearDown() - -# tear down test_bucketed_write state -self.spark.sql("DROP TABLE IF EXISTS pyspark_bucket") - def test_row_should_be_read_only(self): row = Row(a=1, b=2) self.assertEqual(1, row.a) @@ -473,11 +525,12 @@ class SQLTests(ReusedSQLTestCase): self.assertEqual(row[0], 4) def test_udf2(self): -self.spark.catalog.registerFunction("strlen", lambda string: len(string), IntegerType()) -self.spark.createDataFrame(self.sc.parallelize([Row(a="test")]))\ -
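The refactoring hinges on a standard `contextlib.contextmanager` pattern: clean-up goes in a `finally` block after the `yield`, so it runs even when the test body raises. A self-contained sketch mirroring `SQLTestUtils.table()`, with a plain dict standing in for the Spark catalog:

```python
# Context-manager clean-up pattern from the patch: the finally block drops the
# named tables whether or not the test body inside the `with` succeeds.
from contextlib import contextmanager

catalog = {}  # stand-in for the Spark catalog

@contextmanager
def table(*tables):
    try:
        yield
    finally:
        for t in tables:
            catalog.pop(t, None)  # DROP TABLE IF EXISTS

try:
    with table("pyspark_bucket"):
        catalog["pyspark_bucket"] = "rows"
        raise AssertionError("test body failed")
except AssertionError:
    pass

print("pyspark_bucket" in catalog)  # False — dropped despite the failure
```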
spark git commit: [HOTFIX] Fix PySpark pip packaging tests by non-ascii compatible character
Repository: spark Updated Branches: refs/heads/master 3b4f35f56 -> 5330c192b [HOTFIX] Fix PySpark pip packaging tests by non-ascii compatible character ## What changes were proposed in this pull request? pip installation requires packaging the bin scripts together. https://github.com/apache/spark/blob/master/python/setup.py#L71 The recent fix introduced a non-ascii character (a non-breakable space, I guess) at https://github.com/apache/spark/commit/ec96d34e74148803190db8dcf9fda527eeea9255. This is usually not a problem, but it looks like Jenkins's default encoding is `ascii`, and while copying the script there is an implicit conversion between bytes and strings in which the default encoding is used: https://github.com/pypa/setuptools/blob/v40.4.3/setuptools/command/develop.py#L185-L189 ## How was this patch tested? Jenkins Closes #22782 from HyukjinKwon/pip-failure-fix. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5330c192 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5330c192 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5330c192 Branch: refs/heads/master Commit: 5330c192bd87eb18351e72e390baf29855d99b0a Parents: 3b4f35f Author: hyukjinkwon Authored: Sun Oct 21 02:04:45 2018 +0800 Committer: hyukjinkwon Committed: Sun Oct 21 02:04:45 2018 +0800 -- bin/docker-image-tool.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5330c192/bin/docker-image-tool.sh -- diff --git a/bin/docker-image-tool.sh b/bin/docker-image-tool.sh index 001590a..7256355 100755 --- a/bin/docker-image-tool.sh +++ b/bin/docker-image-tool.sh @@ -79,7 +79,7 @@ function build { fi # Verify that Spark has actually been built/is a runnable distribution - #Â i.e. the Spark JARs that the Docker files will place into the image are present + # i.e. 
the Spark JARs that the Docker files will place into the image are present local TOTAL_JARS=$(ls $JARS/spark-* | wc -l) TOTAL_JARS=$(( $TOTAL_JARS )) if [ "${TOTAL_JARS}" -eq 0 ]; then - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
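The failure mode this hotfix describes is easy to reproduce in isolation: a UTF-8 no-break space (bytes `0xC2 0xA0`, the character rendered as `Â` in the removed diff line above) cannot pass through an implicit decode under the `ascii` codec. A small sketch (the byte string below is a stand-in for the real script line, not a quote of it):

```python
# A script line containing a UTF-8 no-break space (0xC2 0xA0) after the "#".
line = b"#\xc2\xa0 i.e. the Spark JARs are present\n"

def copy_line(raw, encoding):
    """Simulate a copy step with an implicit bytes -> str conversion."""
    return raw.decode(encoding)

# Under an ascii default encoding (as on the Jenkins workers) the copy fails:
try:
    copy_line(line, "ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Under utf-8 the same bytes decode fine, yielding U+00A0 NO-BREAK SPACE:
utf8_text = copy_line(line, "utf-8")
```

Replacing the character with a plain ASCII space, as the one-line diff does, makes the script safe under either default encoding.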
spark git commit: [SPARK-25950][SQL] from_csv should respect to spark.sql.columnNameOfCorruptRecord
Repository: spark Updated Branches: refs/heads/master 63ca4bbe7 -> 76813cfa1 [SPARK-25950][SQL] from_csv should respect to spark.sql.columnNameOfCorruptRecord ## What changes were proposed in this pull request? Fix for `CsvToStructs` to take into account SQL config `spark.sql.columnNameOfCorruptRecord` similar to `from_json`. ## How was this patch tested? Added new test where `spark.sql.columnNameOfCorruptRecord` is set to corrupt column name different from default. Closes #22956 from MaxGekk/csv-tests. Authored-by: Maxim Gekk Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/76813cfa Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/76813cfa Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/76813cfa Branch: refs/heads/master Commit: 76813cfa1e2607ea3b669a79e59b568e96395b2e Parents: 63ca4bb Author: Maxim Gekk Authored: Wed Nov 7 11:26:17 2018 +0800 Committer: hyukjinkwon Committed: Wed Nov 7 11:26:17 2018 +0800 -- .../catalyst/expressions/csvExpressions.scala | 9 +- .../apache/spark/sql/CsvFunctionsSuite.scala| 31 2 files changed, 39 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/76813cfa/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala index 74b670a..aff372b 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala @@ -27,6 +27,7 @@ import org.apache.spark.sql.catalyst.analysis.TypeCheckResult import org.apache.spark.sql.catalyst.csv._ import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback import 
org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types._ import org.apache.spark.unsafe.types.UTF8String @@ -92,8 +93,14 @@ case class CsvToStructs( } } + val nameOfCorruptRecord = SQLConf.get.getConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD) + @transient lazy val parser = { -val parsedOptions = new CSVOptions(options, columnPruning = true, timeZoneId.get) +val parsedOptions = new CSVOptions( + options, + columnPruning = true, + defaultTimeZoneId = timeZoneId.get, + defaultColumnNameOfCorruptRecord = nameOfCorruptRecord) val mode = parsedOptions.parseMode if (mode != PermissiveMode && mode != FailFastMode) { throw new AnalysisException(s"from_csv() doesn't support the ${mode.name} mode. " + http://git-wip-us.apache.org/repos/asf/spark/blob/76813cfa/sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala index eb6b248..1dd8ec3 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala @@ -19,7 +19,9 @@ package org.apache.spark.sql import scala.collection.JavaConverters._ +import org.apache.spark.SparkException import org.apache.spark.sql.functions._ +import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.test.SharedSQLContext import org.apache.spark.sql.types._ @@ -86,4 +88,33 @@ class CsvFunctionsSuite extends QueryTest with SharedSQLContext { checkAnswer(df.select(to_csv($"a", options)), Row("26/08/2015 18:00") :: Nil) } + + test("from_csv invalid csv - check modes") { +withSQLConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD.key -> "_unparsed") { + val schema = new StructType() +.add("a", IntegerType) +.add("b", IntegerType) +.add("_unparsed", StringType) + val badRec = "\"" + val df = Seq(badRec, "2,12").toDS() + + checkAnswer( 
+df.select(from_csv($"value", schema, Map("mode" -> "PERMISSIVE"))), +Row(Row(null, null, badRec)) :: Row(Row(2, 12, null)) :: Nil) + + val exception1 = intercept[SparkException] { +df.select(from_csv($"value", schema, Map("mode" -> "FAILFAST"))).collect() + }.getMessage + assert(exception1.contains( +"Malformed records are detected in record parsing. Parse Mode: FAILFAST.")) + + val exception2 = intercept[SparkException] { +df.select(from_csv($"value", schema, Map("mode" ->
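The two modes exercised by this test can be modelled without Spark: in PERMISSIVE mode a malformed record yields a row whose data columns are null and whose corrupt-record column holds the raw input, while FAILFAST raises on the first malformed record. A toy Python sketch of those semantics (`parse_csv` illustrates the behaviour; it is not Spark's `FailureSafeParser`):

```python
import csv
import io

def parse_csv(lines, fields, corrupt_col="_corrupt_record", mode="PERMISSIVE"):
    """Toy model of Spark's PERMISSIVE vs FAILFAST CSV parse modes."""
    rows = []
    for line in lines:
        try:
            values = next(csv.reader(io.StringIO(line), strict=True))
            if len(values) != len(fields):
                raise ValueError("wrong number of columns")
            record = {f: int(v) for f, v in zip(fields, values)}
            record[corrupt_col] = None
        except (csv.Error, ValueError, StopIteration):
            if mode == "FAILFAST":
                raise RuntimeError(
                    "Malformed records are detected in record parsing. "
                    "Parse Mode: FAILFAST.")
            # PERMISSIVE: null out the data columns, keep the raw record.
            record = {f: None for f in fields}
            record[corrupt_col] = line
        rows.append(record)
    return rows
```

With the inputs from the test above (`'"'` and `"2,12"`), PERMISSIVE produces one corrupt row and one parsed row, and FAILFAST raises.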
spark git commit: [SPARK-25962][BUILD][PYTHON] Specify minimum versions for both pydocstyle and flake8 in 'lint-python' script
Repository: spark Updated Branches: refs/heads/master e4561e1c5 -> a8e1c9815 [SPARK-25962][BUILD][PYTHON] Specify minimum versions for both pydocstyle and flake8 in 'lint-python' script ## What changes were proposed in this pull request? This PR explicitly specifies `flake8` and `pydocstyle` versions. - It checks flake8 binary executable - flake8 version check >= 3.5.0 - pydocstyle >= 3.0.0 (previously it was == 3.0.0) ## How was this patch tested? Manually tested. Closes #22963 from HyukjinKwon/SPARK-25962. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a8e1c981 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a8e1c981 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a8e1c981 Branch: refs/heads/master Commit: a8e1c9815fef0deb45c9a516d415cea6be511415 Parents: e4561e1 Author: hyukjinkwon Authored: Thu Nov 8 12:26:21 2018 +0800 Committer: hyukjinkwon Committed: Thu Nov 8 12:26:21 2018 +0800 -- dev/lint-python | 58 +--- 1 file changed, 41 insertions(+), 17 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a8e1c981/dev/lint-python -- diff --git a/dev/lint-python b/dev/lint-python index 2e353e1..27d87f6 100755 --- a/dev/lint-python +++ b/dev/lint-python @@ -26,9 +26,13 @@ PYCODESTYLE_REPORT_PATH="$SPARK_ROOT_DIR/dev/pycodestyle-report.txt" PYDOCSTYLE_REPORT_PATH="$SPARK_ROOT_DIR/dev/pydocstyle-report.txt" PYLINT_REPORT_PATH="$SPARK_ROOT_DIR/dev/pylint-report.txt" PYLINT_INSTALL_INFO="$SPARK_ROOT_DIR/dev/pylint-info.txt" + PYDOCSTYLEBUILD="pydocstyle" -EXPECTED_PYDOCSTYLEVERSION="3.0.0" -PYDOCSTYLEVERSION=$(python -c 'import pkg_resources; print(pkg_resources.get_distribution("pydocstyle").version)' 2> /dev/null) +MINIMUM_PYDOCSTYLEVERSION="3.0.0" + +FLAKE8BUILD="flake8" +MINIMUM_FLAKE8="3.5.0" + SPHINXBUILD=${SPHINXBUILD:=sphinx-build} SPHINX_REPORT_PATH="$SPARK_ROOT_DIR/dev/sphinx-report.txt" @@ -87,27 +91,47 @@ else 
rm "$PYCODESTYLE_REPORT_PATH" fi -# stop the build if there are Python syntax errors or undefined names -flake8 . --count --select=E901,E999,F821,F822,F823 --max-line-length=100 --show-source --statistics -flake8_status="${PIPESTATUS[0]}" +# Check by flake8 +if hash "$FLAKE8BUILD" 2> /dev/null; then +FLAKE8VERSION="$( $FLAKE8BUILD --version 2> /dev/null )" +VERSION=($FLAKE8VERSION) +IS_EXPECTED_FLAKE8=$(python -c 'from distutils.version import LooseVersion; \ +print(LooseVersion("""'${VERSION[0]}'""") >= LooseVersion("""'$MINIMUM_FLAKE8'"""))' 2> /dev/null) +if [[ "$IS_EXPECTED_FLAKE8" == "True" ]]; then +# stop the build if there are Python syntax errors or undefined names +$FLAKE8BUILD . --count --select=E901,E999,F821,F822,F823 --max-line-length=100 --show-source --statistics +flake8_status="${PIPESTATUS[0]}" + +if [ "$flake8_status" -eq 0 ]; then +lint_status=0 +else +lint_status=1 +fi -if [ "$flake8_status" -eq 0 ]; then -lint_status=0 +if [ "$lint_status" -ne 0 ]; then +echo "flake8 checks failed." +exit "$lint_status" +else +echo "flake8 checks passed." +fi +else +echo "The flake8 version needs to be "$MINIMUM_FLAKE8" at latest. Your current version is '"$FLAKE8VERSION"'." +echo "flake8 checks failed." +exit 1 +fi else -lint_status=1 -fi - -if [ "$lint_status" -ne 0 ]; then +echo >&2 "The flake8 command was not found." echo "flake8 checks failed." -exit "$lint_status" -else -echo "flake8 checks passed." +exit 1 fi # Check python document style, skip check if pydocstyle is not installed. 
if hash "$PYDOCSTYLEBUILD" 2> /dev/null; then -if [[ "$PYDOCSTYLEVERSION" == "$EXPECTED_PYDOCSTYLEVERSION" ]]; then -pydocstyle --config=dev/tox.ini $DOC_PATHS_TO_CHECK >> "$PYDOCSTYLE_REPORT_PATH" +PYDOCSTYLEVERSION="$( $PYDOCSTYLEBUILD --version 2> /dev/null )" +IS_EXPECTED_PYDOCSTYLEVERSION=$(python -c 'from distutils.version import LooseVersion; \ +print(LooseVersion("""'$PYDOCSTYLEVERSION'""") >= LooseVersion("""'$MINIMUM_PYDOCSTYLEVERSION'"""))') +if [[ "$IS_EXPECTED_PYDOCSTYLEVERSION" == "True" ]]; then +$PYDOCSTYLEBUILD --config=dev/tox.ini $DOC_PATHS_TO_CHECK >> "$PYDOCSTYLE_REPORT_PATH" pydocstyle_status="${PIPESTATUS[0]}" if [ "$compile_status" -eq 0 -a "$pydocstyle_status" -eq 0 ]; then @@ -121,7 +145,7 @@ if hash "$PYDOCSTYLEBUILD" 2> /dev/null; then fi else -echo "The pydocstyle version needs to be latest 3.0.0. Skipping pydoc checks for now" +
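Both checks above compare versions with `LooseVersion` rather than plain string comparison, because lexicographic comparison mishandles multi-digit components. A dependency-free sketch of why that matters:

```python
def version_tuple(v):
    """Split "3.5.0" into (3, 5, 0) so comparison is numeric per component."""
    return tuple(int(part) for part in v.split("."))

def meets_minimum(current, minimum):
    return version_tuple(current) >= version_tuple(minimum)

# String comparison is the trap the script avoids: "3.10.0" sorts before
# "3.5.0" lexicographically, although 3.10.0 is the newer release.
string_says_ok = "3.10.0" >= "3.5.0"               # wrong answer
tuple_says_ok = meets_minimum("3.10.0", "3.5.0")   # right answer
```

This only handles purely numeric versions; `LooseVersion` additionally tolerates suffixes like `3.5.0b1`, which is why the script shells the comparison out to Python instead of comparing in bash.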
spark git commit: [SPARK-25955][TEST] Porting JSON tests for CSV functions
Repository: spark Updated Branches: refs/heads/master 17449a2e6 -> ee03f760b [SPARK-25955][TEST] Porting JSON tests for CSV functions ## What changes were proposed in this pull request? In the PR, I propose to port existing JSON tests from `JsonFunctionsSuite` that are applicable for CSV, and put them to `CsvFunctionsSuite`. In particular: - roundtrip `from_csv` to `to_csv`, and `to_csv` to `from_csv` - using `schema_of_csv` in `from_csv` - Java API `from_csv` - using `from_csv` and `to_csv` in exprs. Closes #22960 from MaxGekk/csv-additional-tests. Authored-by: Maxim Gekk Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ee03f760 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ee03f760 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ee03f760 Branch: refs/heads/master Commit: ee03f760b305e70a57c3b4409ec25897af348600 Parents: 17449a2 Author: Maxim Gekk Authored: Thu Nov 8 14:51:29 2018 +0800 Committer: hyukjinkwon Committed: Thu Nov 8 14:51:29 2018 +0800 -- .../apache/spark/sql/CsvFunctionsSuite.scala| 47 1 file changed, 47 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ee03f760/sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala index 1dd8ec3..b97ac38 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala @@ -117,4 +117,51 @@ class CsvFunctionsSuite extends QueryTest with SharedSQLContext { "Acceptable modes are PERMISSIVE and FAILFAST.")) } } + + test("from_csv uses DDL strings for defining a schema - java") { +val df = Seq("""1,"haa"""").toDS() +checkAnswer( + df.select( +from_csv($"value", lit("a INT, b STRING"), new java.util.HashMap[String, String]())), + 
Row(Row(1, "haa")) :: Nil) + } + + test("roundtrip to_csv -> from_csv") { +val df = Seq(Tuple1(Tuple1(1)), Tuple1(null)).toDF("struct") +val schema = df.schema(0).dataType.asInstanceOf[StructType] +val options = Map.empty[String, String] +val readback = df.select(to_csv($"struct").as("csv")) + .select(from_csv($"csv", schema, options).as("struct")) + +checkAnswer(df, readback) + } + + test("roundtrip from_csv -> to_csv") { +val df = Seq(Some("1"), None).toDF("csv") +val schema = new StructType().add("a", IntegerType) +val options = Map.empty[String, String] +val readback = df.select(from_csv($"csv", schema, options).as("struct")) + .select(to_csv($"struct").as("csv")) + +checkAnswer(df, readback) + } + + test("infers schemas of a CSV string and pass it to from_csv") { +val in = Seq("""0.123456789,987654321,"San Francisco"""").toDS() +val options = Map.empty[String, String].asJava +val out = in.select(from_csv('value, schema_of_csv("0.1,1,a"), options) as "parsed") +val expected = StructType(Seq(StructField( + "parsed", + StructType(Seq( +StructField("_c0", DoubleType, true), +StructField("_c1", IntegerType, true), +StructField("_c2", StringType, true)))))) + +assert(out.schema == expected) + } + + test("Support to_csv in SQL") { +val df1 = Seq(Tuple1(Tuple1(1))).toDF("a") +checkAnswer(df1.selectExpr("to_csv(a)"), Row("1") :: Nil) + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
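The roundtrip tests above assert that `to_csv` followed by `from_csv` (and vice versa) is the identity on the tested values. The same property can be sketched with Python's standard `csv` module (this mimics the shape of the test, not Spark's CSV dialect or API):

```python
import csv
import io

def to_csv(row):
    """Serialize one row to a CSV line (no trailing newline)."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="").writerow(row)
    return buf.getvalue()

def from_csv(line, types):
    """Parse one CSV line back into typed values, one converter per column."""
    values = next(csv.reader(io.StringIO(line)))
    return [t(v) for t, v in zip(types, values)]

# Roundtrip: from_csv(to_csv(x)) == x for these values.
original = [1, "San Francisco", 0.5]
roundtrip = from_csv(to_csv(original), [int, str, float])
```

The `types` list plays the role of the explicit `StructType` schema in the Scala tests: without it, every parsed value would come back as a string.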
spark git commit: [SPARK-25952][SQL] Passing actual schema to JacksonParser
Repository: spark Updated Branches: refs/heads/master d68f3a726 -> 17449a2e6 [SPARK-25952][SQL] Passing actual schema to JacksonParser ## What changes were proposed in this pull request? The PR fixes an issue when the corrupt record column specified via `spark.sql.columnNameOfCorruptRecord` or JSON options `columnNameOfCorruptRecord` is propagated to JacksonParser, and returned row breaks an assumption in `FailureSafeParser` that the row must contain only actual data. The issue is fixed by passing actual schema without the corrupt record field into `JacksonParser`. ## How was this patch tested? Added a test with the corrupt record column in the middle of user's schema. Closes #22958 from MaxGekk/from_json-corrupt-record-schema. Authored-by: Maxim Gekk Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/17449a2e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/17449a2e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/17449a2e Branch: refs/heads/master Commit: 17449a2e6b28ecce7a273284eab037e8aceb3611 Parents: d68f3a7 Author: Maxim Gekk Authored: Thu Nov 8 14:48:23 2018 +0800 Committer: hyukjinkwon Committed: Thu Nov 8 14:48:23 2018 +0800 -- .../sql/catalyst/expressions/jsonExpressions.scala| 14 -- .../org/apache/spark/sql/JsonFunctionsSuite.scala | 13 + 2 files changed, 21 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/17449a2e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala index eafcb61..52d0677 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala +++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala @@ -569,14 +569,16 @@ case class JsonToStructs( throw new IllegalArgumentException(s"from_json() doesn't support the ${mode.name} mode. " + s"Acceptable modes are ${PermissiveMode.name} and ${FailFastMode.name}.") } -val rawParser = new JacksonParser(nullableSchema, parsedOptions, allowArrayAsStructs = false) -val createParser = CreateJacksonParser.utf8String _ - -val parserSchema = nullableSchema match { - case s: StructType => s - case other => StructType(StructField("value", other) :: Nil) +val (parserSchema, actualSchema) = nullableSchema match { + case s: StructType => +(s, StructType(s.filterNot(_.name == parsedOptions.columnNameOfCorruptRecord))) + case other => +(StructType(StructField("value", other) :: Nil), other) } +val rawParser = new JacksonParser(actualSchema, parsedOptions, allowArrayAsStructs = false) +val createParser = CreateJacksonParser.utf8String _ + new FailureSafeParser[UTF8String]( input => rawParser.parse(input, createParser, identity[UTF8String]), mode, http://git-wip-us.apache.org/repos/asf/spark/blob/17449a2e/sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala index 2b09782..d6b7338 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala @@ -578,4 +578,17 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext { "Acceptable modes are PERMISSIVE and FAILFAST.")) } } + + test("corrupt record column in the middle") { +val schema = new StructType() + .add("a", IntegerType) + .add("_unparsed", StringType) + .add("b", IntegerType) +val badRec = """{"a" 1, "b": 11}""" +val df = Seq(badRec, """{"a": 2, "b": 12}""").toDS() + +checkAnswer( + 
df.select(from_json($"value", schema, Map("columnNameOfCorruptRecord" -> "_unparsed"))), + Row(Row(null, badRec, null)) :: Row(Row(2, null, 12)) :: Nil) + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
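The fix can be paraphrased as: the parser only ever sees the schema with the corrupt-record field filtered out (`actualSchema`), and the raw text is spliced back into the corrupt column's position afterwards, which is exactly what the "corrupt record column in the middle" test checks. A hedged Python sketch of that splicing (illustrative semantics only, not the JacksonParser API):

```python
import json

def from_json(text, schema, corrupt_col="_unparsed"):
    """Toy model of the JsonToStructs fix: parse against the schema minus the
    corrupt-record column, then splice the raw text back into its position."""
    actual = [f for f in schema if f != corrupt_col]  # what the parser sees
    try:
        obj = json.loads(text)
        parsed = {f: obj.get(f) for f in actual}
        raw = None
    except json.JSONDecodeError:
        parsed = {f: None for f in actual}
        raw = text
    # Reassemble in the user's column order, even with the corrupt column
    # in the middle of the schema.
    return tuple(raw if f == corrupt_col else parsed[f] for f in schema)

schema = ["a", "_unparsed", "b"]   # corrupt column deliberately in the middle
bad = '{"a" 1, "b": 11}'
good = '{"a": 2, "b": 12}'
```

The same two records as in the Scala test produce `(null, raw, null)` for the malformed input and `(2, null, 12)` for the well-formed one.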
spark git commit: Revert "[SPARK-23831][SQL] Add org.apache.derby to IsolatedClientLoader"
Repository: spark Updated Branches: refs/heads/master ee03f760b -> 0a2e45fdb Revert "[SPARK-23831][SQL] Add org.apache.derby to IsolatedClientLoader" This reverts commit a75571b46f813005a6d4b076ec39081ffab11844. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0a2e45fd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0a2e45fd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0a2e45fd Branch: refs/heads/master Commit: 0a2e45fdb8baadf7a57eb06f319e96f95eedf298 Parents: ee03f76 Author: hyukjinkwon Authored: Thu Nov 8 16:32:25 2018 +0800 Committer: hyukjinkwon Committed: Thu Nov 8 16:32:25 2018 +0800 -- .../apache/spark/sql/hive/client/IsolatedClientLoader.scala| 1 - .../org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala | 6 -- 2 files changed, 7 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0a2e45fd/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala index 1e7a0b1..c1d8fe5 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala @@ -186,7 +186,6 @@ private[hive] class IsolatedClientLoader( name.startsWith("org.slf4j") || name.startsWith("org.apache.log4j") || // log4j1.x name.startsWith("org.apache.logging.log4j") || // log4j2 -name.startsWith("org.apache.derby.") || name.startsWith("org.apache.spark.") || (sharesHadoopClasses && isHadoopClass) || name.startsWith("scala.") || http://git-wip-us.apache.org/repos/asf/spark/blob/0a2e45fd/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala index 1de258f..0a522b6 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala @@ -113,10 +113,4 @@ class HiveExternalCatalogSuite extends ExternalCatalogSuite { catalog.createDatabase(newDb("dbWithNullDesc").copy(description = null), ignoreIfExists = false) assert(catalog.getDatabase("dbWithNullDesc").description == "") } - - test("SPARK-23831: Add org.apache.derby to IsolatedClientLoader") { -val client1 = HiveUtils.newClientForMetadata(new SparkConf, new Configuration) -val client2 = HiveUtils.newClientForMetadata(new SparkConf, new Configuration) -assert(!client1.equals(client2)) - } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: Revert "[SPARK-23831][SQL] Add org.apache.derby to IsolatedClientLoader"
Repository: spark Updated Branches: refs/heads/branch-2.4 4c91b224a -> 947462f5a Revert "[SPARK-23831][SQL] Add org.apache.derby to IsolatedClientLoader" This reverts commit a75571b46f813005a6d4b076ec39081ffab11844. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/947462f5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/947462f5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/947462f5 Branch: refs/heads/branch-2.4 Commit: 947462f5a36e2751f5a9160c676efbd4e5b08eb4 Parents: 4c91b22 Author: hyukjinkwon Authored: Thu Nov 8 16:32:25 2018 +0800 Committer: hyukjinkwon Committed: Thu Nov 8 16:35:41 2018 +0800 -- .../apache/spark/sql/hive/client/IsolatedClientLoader.scala| 1 - .../org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala | 6 -- 2 files changed, 7 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/947462f5/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala index 6a90c44..2f34f69 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala @@ -182,7 +182,6 @@ private[hive] class IsolatedClientLoader( name.startsWith("org.slf4j") || name.startsWith("org.apache.log4j") || // log4j1.x name.startsWith("org.apache.logging.log4j") || // log4j2 -name.startsWith("org.apache.derby.") || name.startsWith("org.apache.spark.") || (sharesHadoopClasses && isHadoopClass) || name.startsWith("scala.") || http://git-wip-us.apache.org/repos/asf/spark/blob/947462f5/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala index 1de258f..0a522b6 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala @@ -113,10 +113,4 @@ class HiveExternalCatalogSuite extends ExternalCatalogSuite { catalog.createDatabase(newDb("dbWithNullDesc").copy(description = null), ignoreIfExists = false) assert(catalog.getDatabase("dbWithNullDesc").description == "") } - - test("SPARK-23831: Add org.apache.derby to IsolatedClientLoader") { -val client1 = HiveUtils.newClientForMetadata(new SparkConf, new Configuration) -val client2 = HiveUtils.newClientForMetadata(new SparkConf, new Configuration) -assert(!client1.equals(client2)) - } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25510][SQL][TEST][FOLLOW-UP] Remove BenchmarkWithCodegen
Repository: spark Updated Branches: refs/heads/master 79551f558 -> 0558d021c [SPARK-25510][SQL][TEST][FOLLOW-UP] Remove BenchmarkWithCodegen ## What changes were proposed in this pull request? Remove `BenchmarkWithCodegen` as we don't use it anymore. More details: https://github.com/apache/spark/pull/22484#discussion_r221397904 ## How was this patch tested? N/A Closes #22985 from wangyum/SPARK-25510. Authored-by: Yuming Wang Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0558d021 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0558d021 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0558d021 Branch: refs/heads/master Commit: 0558d021cc0aeae37ef0e043d244fd0300a57cd5 Parents: 79551f5 Author: Yuming Wang Authored: Fri Nov 9 11:45:03 2018 +0800 Committer: hyukjinkwon Committed: Fri Nov 9 11:45:03 2018 +0800 -- .../benchmark/BenchmarkWithCodegen.scala| 54 1 file changed, 54 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0558d021/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BenchmarkWithCodegen.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BenchmarkWithCodegen.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BenchmarkWithCodegen.scala deleted file mode 100644 index 5133150..000 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BenchmarkWithCodegen.scala +++ /dev/null @@ -1,54 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - *http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.spark.sql.execution.benchmark - -import org.apache.spark.SparkFunSuite -import org.apache.spark.benchmark.Benchmark -import org.apache.spark.sql.SparkSession - -/** - * Common base trait for micro benchmarks that are supposed to run standalone (i.e. not together - * with other test suites). - */ -private[benchmark] trait BenchmarkWithCodegen extends SparkFunSuite { - - lazy val sparkSession = SparkSession.builder -.master("local[1]") -.appName("microbenchmark") -.config("spark.sql.shuffle.partitions", 1) -.config("spark.sql.autoBroadcastJoinThreshold", 1) -.getOrCreate() - - /** Runs function `f` with whole stage codegen on and off. */ - def runBenchmark(name: String, cardinality: Long)(f: => Unit): Unit = { -val benchmark = new Benchmark(name, cardinality) - -benchmark.addCase(s"$name wholestage off", numIters = 2) { iter => - sparkSession.conf.set("spark.sql.codegen.wholeStage", value = false) - f -} - -benchmark.addCase(s"$name wholestage on", numIters = 5) { iter => - sparkSession.conf.set("spark.sql.codegen.wholeStage", value = true) - f -} - -benchmark.run() - } - -} - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
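The removed trait's `runBenchmark` wrapped the same body in two cases, one with whole-stage codegen off (2 iterations) and one with it on (5 iterations). The general pattern, timing one function under two configurations, can be sketched in Python with `timeit` (the `config` flag here is purely illustrative and stands in for the `spark.sql.codegen.wholeStage` setting):

```python
import timeit

# Illustrative stand-in for a session config such as spark.sql.codegen.wholeStage.
config = {"wholeStage": False}

def run_benchmark(name, f, num_iters_off=2, num_iters_on=5):
    """Time `f` with the feature flag off, then on, mirroring the removed
    trait's "wholestage off" / "wholestage on" cases."""
    results = {}
    config["wholeStage"] = False
    results[f"{name} wholestage off"] = min(
        timeit.repeat(f, number=1, repeat=num_iters_off))
    config["wholeStage"] = True
    results[f"{name} wholestage on"] = min(
        timeit.repeat(f, number=1, repeat=num_iters_on))
    return results

results = run_benchmark("sum", lambda: sum(range(10000)))
```

Taking the minimum over several repeats (rather than the mean) is the usual way to reduce noise from other processes when micro-benchmarking.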
spark git commit: [SPARK-25945][SQL] Support locale while parsing date/timestamp from CSV/JSON
Repository: spark Updated Branches: refs/heads/master 973f7c01d -> 79551f558 [SPARK-25945][SQL] Support locale while parsing date/timestamp from CSV/JSON ## What changes were proposed in this pull request? In the PR, I propose to add a new option `locale` into CSVOptions/JSONOptions to make parsing dates/timestamps written in local languages possible. Currently the locale is hard coded to `Locale.US`. ## How was this patch tested? Added two tests for parsing a date from CSV/JSON - `ноя 2018`. Closes #22951 from MaxGekk/locale. Authored-by: Maxim Gekk Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/79551f55 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/79551f55 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/79551f55 Branch: refs/heads/master Commit: 79551f558dafed41177b605b0436e9340edf5712 Parents: 973f7c0 Author: Maxim Gekk Authored: Fri Nov 9 09:45:06 2018 +0800 Committer: hyukjinkwon Committed: Fri Nov 9 09:45:06 2018 +0800 -- python/pyspark/sql/readwriter.py | 15 +++ python/pyspark/sql/streaming.py | 14 ++ .../spark/sql/catalyst/csv/CSVOptions.scala | 7 +-- .../spark/sql/catalyst/json/JSONOptions.scala| 7 +-- .../expressions/CsvExpressionsSuite.scala| 19 ++- .../expressions/JsonExpressionsSuite.scala | 19 ++- .../org/apache/spark/sql/DataFrameReader.scala | 4 .../spark/sql/streaming/DataStreamReader.scala | 4 .../org/apache/spark/sql/CsvFunctionsSuite.scala | 17 + .../apache/spark/sql/JsonFunctionsSuite.scala| 17 + 10 files changed, 109 insertions(+), 14 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/79551f55/python/pyspark/sql/readwriter.py -- diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py index 690b130..726de4a 100644 --- a/python/pyspark/sql/readwriter.py +++ b/python/pyspark/sql/readwriter.py @@ -177,7 +177,7 @@ class DataFrameReader(OptionUtils): allowNumericLeadingZero=None, 
allowBackslashEscapingAnyCharacter=None, mode=None, columnNameOfCorruptRecord=None, dateFormat=None, timestampFormat=None, multiLine=None, allowUnquotedControlChars=None, lineSep=None, samplingRatio=None, - dropFieldIfAllNull=None, encoding=None): + dropFieldIfAllNull=None, encoding=None, locale=None): """ Loads JSON files and returns the results as a :class:`DataFrame`. @@ -249,6 +249,9 @@ class DataFrameReader(OptionUtils): :param dropFieldIfAllNull: whether to ignore column of all null values or empty array/struct during schema inference. If None is set, it uses the default value, ``false``. +:param locale: sets a locale as language tag in IETF BCP 47 format. If None is set, + it uses the default value, ``en-US``. For instance, ``locale`` is used while + parsing dates and timestamps. >>> df1 = spark.read.json('python/test_support/sql/people.json') >>> df1.dtypes @@ -267,7 +270,8 @@ class DataFrameReader(OptionUtils): mode=mode, columnNameOfCorruptRecord=columnNameOfCorruptRecord, dateFormat=dateFormat, timestampFormat=timestampFormat, multiLine=multiLine, allowUnquotedControlChars=allowUnquotedControlChars, lineSep=lineSep, -samplingRatio=samplingRatio, dropFieldIfAllNull=dropFieldIfAllNull, encoding=encoding) +samplingRatio=samplingRatio, dropFieldIfAllNull=dropFieldIfAllNull, encoding=encoding, +locale=locale) if isinstance(path, basestring): path = [path] if type(path) == list: @@ -349,7 +353,7 @@ class DataFrameReader(OptionUtils): negativeInf=None, dateFormat=None, timestampFormat=None, maxColumns=None, maxCharsPerColumn=None, maxMalformedLogPerPartition=None, mode=None, columnNameOfCorruptRecord=None, multiLine=None, charToEscapeQuoteEscaping=None, -samplingRatio=None, enforceSchema=None, emptyValue=None): +samplingRatio=None, enforceSchema=None, emptyValue=None, locale=None): r"""Loads a CSV file and returns the result as a :class:`DataFrame`. 
This function will go through the input once to determine the input schema if @@ -446,6 +450,9 @@ class DataFrameReader(OptionUtils): If None is set, it uses the default value, ``1.0``. :param emptyValue: sets the string representation of
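The effect of the new `locale` option can be illustrated outside Spark with a small sketch. This is a toy model, not Spark's implementation (Spark resolves the BCP 47 tag to a Java locale and lets the Java date formatter supply the month names); the abbreviation tables below are hand-written stand-ins, not real locale data.

```python
# Toy illustration of locale-sensitive date parsing. Assumption: the real code
# delegates to Java formatting with Locale.forLanguageTag; these tables only
# stand in for the locale's abbreviated month names.
MONTH_ABBREVIATIONS = {
    "en-US": {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
              "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12},
    "ru-RU": {"янв": 1, "фев": 2, "мар": 3, "апр": 4, "май": 5, "июн": 6,
              "июл": 7, "авг": 8, "сен": 9, "окт": 10, "ноя": 11, "дек": 12},
}

def parse_month_year(text, locale="en-US"):
    """Parse an '<abbreviated month> <year>' string under the given locale tag."""
    month, year = text.split()
    return int(year), MONTH_ABBREVIATIONS[locale][month]
```

With `locale="ru-RU"` the Russian date `ноя 2018` from the new tests parses to November 2018, while the hard-coded `en-US` default would reject it.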
spark git commit: [INFRA] Close stale PRs
Repository: spark Updated Branches: refs/heads/master 6cd23482d -> a3ba3a899 [INFRA] Close stale PRs Closes https://github.com/apache/spark/pull/21766 Closes https://github.com/apache/spark/pull/21679 Closes https://github.com/apache/spark/pull/21161 Closes https://github.com/apache/spark/pull/20846 Closes https://github.com/apache/spark/pull/19434 Closes https://github.com/apache/spark/pull/18080 Closes https://github.com/apache/spark/pull/17648 Closes https://github.com/apache/spark/pull/17169 Add: Closes #22813 Closes #21994 Closes #22005 Closes #22463 Add: Closes #15899 Add: Closes #22539 Closes #21868 Closes #21514 Closes #21402 Closes #21322 Closes #21257 Closes #20163 Closes #19691 Closes #18697 Closes #18636 Closes #17176 Closes #23001 from wangyum/CloseStalePRs. Authored-by: Yuming Wang Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a3ba3a89 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a3ba3a89 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a3ba3a89 Branch: refs/heads/master Commit: a3ba3a899b3b43958820dc82fcdd3a8b28653bcb Parents: 6cd2348 Author: Yuming Wang Authored: Sun Nov 11 14:05:19 2018 +0800 Committer: hyukjinkwon Committed: Sun Nov 11 14:05:19 2018 +0800 -- -- - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25972][PYTHON] Missed JSON options in streaming.py
Repository: spark Updated Branches: refs/heads/master a3ba3a899 -> aec0af4a9 [SPARK-25972][PYTHON] Missed JSON options in streaming.py ## What changes were proposed in this pull request? Added JSON options for `json()` in streaming.py that are presented in the similar method in readwriter.py. In particular, missed options are `dropFieldIfAllNull` and `encoding`. Closes #22973 from MaxGekk/streaming-missed-options. Authored-by: Maxim Gekk Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/aec0af4a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/aec0af4a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/aec0af4a Branch: refs/heads/master Commit: aec0af4a952df2957e21d39d1e0546a36ab7ab86 Parents: a3ba3a8 Author: Maxim Gekk Authored: Sun Nov 11 21:01:29 2018 +0800 Committer: hyukjinkwon Committed: Sun Nov 11 21:01:29 2018 +0800 -- python/pyspark/sql/streaming.py | 13 +++-- 1 file changed, 11 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/aec0af4a/python/pyspark/sql/streaming.py -- diff --git a/python/pyspark/sql/streaming.py b/python/pyspark/sql/streaming.py index 02b14ea..58ca7b8 100644 --- a/python/pyspark/sql/streaming.py +++ b/python/pyspark/sql/streaming.py @@ -404,7 +404,8 @@ class DataStreamReader(OptionUtils): allowComments=None, allowUnquotedFieldNames=None, allowSingleQuotes=None, allowNumericLeadingZero=None, allowBackslashEscapingAnyCharacter=None, mode=None, columnNameOfCorruptRecord=None, dateFormat=None, timestampFormat=None, - multiLine=None, allowUnquotedControlChars=None, lineSep=None, locale=None): + multiLine=None, allowUnquotedControlChars=None, lineSep=None, locale=None, + dropFieldIfAllNull=None, encoding=None): """ Loads a JSON file stream and returns the results as a :class:`DataFrame`. 
@@ -472,6 +473,13 @@ class DataStreamReader(OptionUtils): :param locale: sets a locale as language tag in IETF BCP 47 format. If None is set, it uses the default value, ``en-US``. For instance, ``locale`` is used while parsing dates and timestamps. +:param dropFieldIfAllNull: whether to ignore column of all null values or empty + array/struct during schema inference. If None is set, it + uses the default value, ``false``. +:param encoding: allows to forcibly set one of standard basic or extended encoding for + the JSON files. For example UTF-16BE, UTF-32LE. If None is set, + the encoding of input JSON will be detected automatically + when the multiLine option is set to ``true``. >>> json_sdf = spark.readStream.json(tempfile.mkdtemp(), schema = sdf_schema) >>> json_sdf.isStreaming @@ -486,7 +494,8 @@ class DataStreamReader(OptionUtils): allowBackslashEscapingAnyCharacter=allowBackslashEscapingAnyCharacter, mode=mode, columnNameOfCorruptRecord=columnNameOfCorruptRecord, dateFormat=dateFormat, timestampFormat=timestampFormat, multiLine=multiLine, -allowUnquotedControlChars=allowUnquotedControlChars, lineSep=lineSep, locale=locale) +allowUnquotedControlChars=allowUnquotedControlChars, lineSep=lineSep, locale=locale, +dropFieldIfAllNull=dropFieldIfAllNull, encoding=encoding) if isinstance(path, basestring): return self._df(self._jreader.json(path)) else: - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
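The Python reader methods above funnel every keyword argument through a helper that skips `None` values, which is why a newly exposed option such as `dropFieldIfAllNull` can default to `None` and still keep the JVM-side default. A stand-alone approximation of that pattern (in pyspark this role is played by `OptionUtils._set_opts`; the function below is a simplified sketch, not the actual helper):

```python
def set_opts(options, **kwargs):
    # Only explicitly supplied options are forwarded to the underlying reader;
    # None means "keep the default on the JVM side" (e.g. locale -> en-US).
    for key, value in kwargs.items():
        if value is not None:
            options[key] = value
    return options

# Options left as None are simply not set.
opts = set_opts({}, locale=None, dropFieldIfAllNull=True, encoding="UTF-16BE")
```

This is what keeps `readwriter.py` and `streaming.py` behaviorally in sync once both forward the same keyword list.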
spark git commit: [SPARK-26007][SQL] DataFrameReader.csv() respects to spark.sql.columnNameOfCorruptRecord
Repository: spark Updated Branches: refs/heads/master 88c826272 -> c49193437 [SPARK-26007][SQL] DataFrameReader.csv() respects to spark.sql.columnNameOfCorruptRecord ## What changes were proposed in this pull request? Passing current value of SQL config `spark.sql.columnNameOfCorruptRecord` to `CSVOptions` inside of `DataFrameReader`.`csv()`. ## How was this patch tested? Added a test where default value of `spark.sql.columnNameOfCorruptRecord` is changed. Closes #23006 from MaxGekk/csv-corrupt-sql-config. Lead-authored-by: Maxim Gekk Co-authored-by: Dongjoon Hyun Co-authored-by: Maxim Gekk Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c4919343 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c4919343 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c4919343 Branch: refs/heads/master Commit: c49193437745f072767d26e6b9099f4949cabf95 Parents: 88c8262 Author: Maxim Gekk Authored: Tue Nov 13 12:26:19 2018 +0800 Committer: hyukjinkwon Committed: Tue Nov 13 12:26:19 2018 +0800 -- .../apache/spark/sql/catalyst/csv/CSVOptions.scala| 14 +- .../sql/execution/datasources/csv/CSVSuite.scala | 11 +++ 2 files changed, 24 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c4919343/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala index 6428235..6bb50b4 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala @@ -25,6 +25,7 @@ import org.apache.commons.lang3.time.FastDateFormat import org.apache.spark.internal.Logging import org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.internal.SQLConf class 
CSVOptions( @transient val parameters: CaseInsensitiveMap[String], @@ -36,8 +37,19 @@ class CSVOptions( def this( parameters: Map[String, String], columnPruning: Boolean, +defaultTimeZoneId: String) = { +this( + CaseInsensitiveMap(parameters), + columnPruning, + defaultTimeZoneId, + SQLConf.get.columnNameOfCorruptRecord) + } + + def this( +parameters: Map[String, String], +columnPruning: Boolean, defaultTimeZoneId: String, -defaultColumnNameOfCorruptRecord: String = "") = { +defaultColumnNameOfCorruptRecord: String) = { this( CaseInsensitiveMap(parameters), columnPruning, http://git-wip-us.apache.org/repos/asf/spark/blob/c4919343/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala index d43efc8..2efe1dd 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala @@ -1848,4 +1848,15 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te val schema = new StructType().add("a", StringType).add("b", IntegerType) checkAnswer(spark.read.schema(schema).option("delimiter", delimiter).csv(input), Row("abc", 1)) } + + test("using spark.sql.columnNameOfCorruptRecord") { +withSQLConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD.key -> "_unparsed") { + val csv = "\"" + val df = spark.read +.schema("a int, _unparsed string") +.csv(Seq(csv).toDS()) + + checkAnswer(df, Row(null, csv)) +} + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
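The behavior the new test exercises can be mimicked in plain Python: in permissive parsing, a malformed record yields nulls for the data columns and the raw text in whichever column name the session config designates. This is a toy sketch, not Spark's CSV parser; columns are simplified to integers and "malformed" to an unbalanced quote:

```python
def parse_csv_row(line, columns, corrupt_col="_corrupt_record"):
    """Permissive parse: on failure, null out data columns, keep the raw text."""
    row = {c: None for c in columns}
    row[corrupt_col] = None
    try:
        if line.count('"') % 2 != 0:  # unbalanced quote => malformed record
            raise ValueError("unclosed quoted field")
        for c, v in zip(columns, line.split(",")):
            row[c] = int(v)
    except ValueError:
        row = {c: None for c in columns}
        row[corrupt_col] = line       # the raw record lands here
    return row
```

With the corrupt column renamed to `_unparsed`, the lone `"` input produces a row of `(None, '"')`, matching the `checkAnswer` in the test.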
[2/2] spark git commit: [SPARK-26035][PYTHON] Break large streaming/tests.py files into smaller files
[SPARK-26035][PYTHON] Break large streaming/tests.py files into smaller files ## What changes were proposed in this pull request? This PR continues to break down a large file into smaller files. See https://github.com/apache/spark/pull/23021. It targets to follow the layout of https://github.com/numpy/numpy/tree/master/numpy. Basically this PR proposes to break down `pyspark/streaming/tests.py` into ...:
```
pyspark
├── __init__.py
...
├── streaming
│   ├── __init__.py
...
│   └── tests
│       ├── __init__.py
│       ├── test_context.py
│       ├── test_dstream.py
│       ├── test_kinesis.py
│       └── test_listener.py
...
├── testing
...
│   └── streamingutils.py
...
```
## How was this patch tested? Existing tests should cover. `cd python` and `./run-tests-with-coverage`. Manually checked they are actually being run. Each test can (unofficially) be run via: ```bash SPARK_TESTING=1 ./bin/pyspark pyspark.tests.test_context ``` Note that if you're using Mac and Python 3, you might have to set `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`. Closes #23034 from HyukjinKwon/SPARK-26035.
Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3649fe59 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3649fe59 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3649fe59 Branch: refs/heads/master Commit: 3649fe599f1aa27fea0abd61c18d3ffa275d267b Parents: 9a5fda6 Author: hyukjinkwon Authored: Fri Nov 16 07:58:09 2018 +0800 Committer: hyukjinkwon Committed: Fri Nov 16 07:58:09 2018 +0800 -- dev/sparktestsupport/modules.py |7 +- python/pyspark/streaming/tests.py | 1185 -- python/pyspark/streaming/tests/__init__.py | 16 + python/pyspark/streaming/tests/test_context.py | 184 +++ python/pyspark/streaming/tests/test_dstream.py | 640 ++ python/pyspark/streaming/tests/test_kinesis.py | 89 ++ python/pyspark/streaming/tests/test_listener.py | 158 +++ python/pyspark/testing/streamingutils.py| 190 +++ 8 files changed, 1283 insertions(+), 1186 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3649fe59/dev/sparktestsupport/modules.py -- diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py index d5fcc06..58b48f4 100644 --- a/dev/sparktestsupport/modules.py +++ b/dev/sparktestsupport/modules.py @@ -398,8 +398,13 @@ pyspark_streaming = Module( "python/pyspark/streaming" ], python_test_goals=[ +# doctests "pyspark.streaming.util", -"pyspark.streaming.tests", +# unittests +"pyspark.streaming.tests.test_context", +"pyspark.streaming.tests.test_dstream", +"pyspark.streaming.tests.test_kinesis", +"pyspark.streaming.tests.test_listener", ] ) http://git-wip-us.apache.org/repos/asf/spark/blob/3649fe59/python/pyspark/streaming/tests.py -- diff --git a/python/pyspark/streaming/tests.py b/python/pyspark/streaming/tests.py deleted file mode 100644 index 8df00bc..000 --- a/python/pyspark/streaming/tests.py +++ /dev/null @@ -1,1185 +0,0 @@ -# -# Licensed to the Apache Software Foundation (ASF) under one or more -# 
contributor license agreements. See the NOTICE file distributed with -# this work for additional information regarding copyright ownership. -# The ASF licenses this file to You under the Apache License, Version 2.0 -# (the "License"); you may not use this file except in compliance with -# the License. You may obtain a copy of the License at -# -#http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# - -import glob -import os -import sys -from itertools import chain -import time -import operator -import tempfile -import random -import struct -import shutil -from functools import reduce - -try: -import xmlrunner -except ImportError: -xmlrunner = None - -if sys.version_info[:2] <= (2, 6): -try: -import unittest2 as unittest -except ImportError: -sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier') -sys.exit(1) -else: -import unittest - -if sys.version >= "3": -long = int - -from pyspark.context import
[1/2] spark git commit: [SPARK-26035][PYTHON] Break large streaming/tests.py files into smaller files
Repository: spark Updated Branches: refs/heads/master 9a5fda60e -> 3649fe599 http://git-wip-us.apache.org/repos/asf/spark/blob/3649fe59/python/pyspark/streaming/tests/test_listener.py -- diff --git a/python/pyspark/streaming/tests/test_listener.py b/python/pyspark/streaming/tests/test_listener.py new file mode 100644 index 000..7c874b6 --- /dev/null +++ b/python/pyspark/streaming/tests/test_listener.py @@ -0,0 +1,158 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# +from pyspark.streaming import StreamingListener +from pyspark.testing.streamingutils import PySparkStreamingTestCase + + +class StreamingListenerTests(PySparkStreamingTestCase): + +duration = .5 + +class BatchInfoCollector(StreamingListener): + +def __init__(self): +super(StreamingListener, self).__init__() +self.batchInfosCompleted = [] +self.batchInfosStarted = [] +self.batchInfosSubmitted = [] +self.streamingStartedTime = [] + +def onStreamingStarted(self, streamingStarted): +self.streamingStartedTime.append(streamingStarted.time) + +def onBatchSubmitted(self, batchSubmitted): +self.batchInfosSubmitted.append(batchSubmitted.batchInfo()) + +def onBatchStarted(self, batchStarted): +self.batchInfosStarted.append(batchStarted.batchInfo()) + +def onBatchCompleted(self, batchCompleted): +self.batchInfosCompleted.append(batchCompleted.batchInfo()) + +def test_batch_info_reports(self): +batch_collector = self.BatchInfoCollector() +self.ssc.addStreamingListener(batch_collector) +input = [[1], [2], [3], [4]] + +def func(dstream): +return dstream.map(int) +expected = [[1], [2], [3], [4]] +self._test_func(input, func, expected) + +batchInfosSubmitted = batch_collector.batchInfosSubmitted +batchInfosStarted = batch_collector.batchInfosStarted +batchInfosCompleted = batch_collector.batchInfosCompleted +streamingStartedTime = batch_collector.streamingStartedTime + +self.wait_for(batchInfosCompleted, 4) + +self.assertEqual(len(streamingStartedTime), 1) + +self.assertGreaterEqual(len(batchInfosSubmitted), 4) +for info in batchInfosSubmitted: +self.assertGreaterEqual(info.batchTime().milliseconds(), 0) +self.assertGreaterEqual(info.submissionTime(), 0) + +for streamId in info.streamIdToInputInfo(): +streamInputInfo = info.streamIdToInputInfo()[streamId] +self.assertGreaterEqual(streamInputInfo.inputStreamId(), 0) +self.assertGreaterEqual(streamInputInfo.numRecords, 0) +for key in streamInputInfo.metadata(): +self.assertIsNotNone(streamInputInfo.metadata()[key]) 
+self.assertIsNotNone(streamInputInfo.metadataDescription()) + +for outputOpId in info.outputOperationInfos(): +outputInfo = info.outputOperationInfos()[outputOpId] +self.assertGreaterEqual(outputInfo.batchTime().milliseconds(), 0) +self.assertGreaterEqual(outputInfo.id(), 0) +self.assertIsNotNone(outputInfo.name()) +self.assertIsNotNone(outputInfo.description()) +self.assertGreaterEqual(outputInfo.startTime(), -1) +self.assertGreaterEqual(outputInfo.endTime(), -1) +self.assertIsNone(outputInfo.failureReason()) + +self.assertEqual(info.schedulingDelay(), -1) +self.assertEqual(info.processingDelay(), -1) +self.assertEqual(info.totalDelay(), -1) +self.assertEqual(info.numRecords(), 0) + +self.assertGreaterEqual(len(batchInfosStarted), 4) +for info in batchInfosStarted: +self.assertGreaterEqual(info.batchTime().milliseconds(), 0) +self.assertGreaterEqual(info.submissionTime(), 0) + +for streamId in info.streamIdToInputInfo(): +streamInputInfo = info.streamIdToInputInfo()[streamId] +
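The listener tests above follow a plain observer pattern: a listener subclass accumulates every batch event it is notified about, and the assertions then run over the collected lists. Stripped of Spark, the pattern looks like this (a toy event bus; the real `StreamingListener` callbacks are driven by the JVM scheduler, not by Python code):

```python
class BatchCollector:
    """Observer that records each notification it receives, by kind."""
    def __init__(self):
        self.started, self.completed = [], []

    def on_batch_started(self, info):
        self.started.append(info)

    def on_batch_completed(self, info):
        self.completed.append(info)


class EventBus:
    """Minimal stand-in for the streaming context's listener bus."""
    def __init__(self):
        self.listeners = []

    def add_listener(self, listener):
        self.listeners.append(listener)

    def run_batch(self, batch_id):
        for listener in self.listeners:
            listener.on_batch_started(batch_id)
        for listener in self.listeners:
            listener.on_batch_completed(batch_id)


bus = EventBus()
collector = BatchCollector()
bus.add_listener(collector)
for i in range(4):  # mirrors the four input batches in the test
    bus.run_batch(i)
```

The test then asserts over `collector.started`/`collector.completed`, just as `test_batch_info_reports` does over `batchInfosStarted` and `batchInfosCompleted`.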
spark git commit: [SPARK-25883][BACKPORT][SQL][MINOR] Override method `prettyName` in `from_avro`/`to_avro`
Repository: spark Updated Branches: refs/heads/branch-2.4 96834fb77 -> 6148a77a5 [SPARK-25883][BACKPORT][SQL][MINOR] Override method `prettyName` in `from_avro`/`to_avro` Back port https://github.com/apache/spark/pull/22890 to branch-2.4. It is a bug fix for this issue: https://issues.apache.org/jira/browse/SPARK-26063 ## What changes were proposed in this pull request? Previously in from_avro/to_avro, we override the method `simpleString` and `sql` for the string output. However, the override only affects the alias naming: ``` Project [from_avro('col, ... , (mode,PERMISSIVE)) AS from_avro(col, struct, Map(mode -> PERMISSIVE))#11] ``` It only makes the alias name quite long: `from_avro(col, struct, Map(mode -> PERMISSIVE))`). We should follow `from_csv`/`from_json` here, to override the method prettyName only, and we will get a clean alias name ``` ... AS from_avro(col)#11 ``` ## How was this patch tested? Manual check Closes #23047 from gengliangwang/backport_avro_pretty_name. Authored-by: Gengliang Wang Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6148a77a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6148a77a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6148a77a Branch: refs/heads/branch-2.4 Commit: 6148a77a5da9ca33fb115269f1cba29cddfc652e Parents: 96834fb Author: Gengliang Wang Authored: Fri Nov 16 08:35:00 2018 +0800 Committer: hyukjinkwon Committed: Fri Nov 16 08:35:00 2018 +0800 -- .../scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala | 8 +--- .../scala/org/apache/spark/sql/avro/CatalystDataToAvro.scala | 8 +--- 2 files changed, 2 insertions(+), 14 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6148a77a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala -- diff --git a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala 
b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala index 915769f..8641b9f 100644 --- a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala +++ b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala @@ -51,13 +51,7 @@ case class AvroDataToCatalyst(child: Expression, jsonFormatSchema: String) deserializer.deserialize(result) } - override def simpleString: String = { -s"from_avro(${child.sql}, ${dataType.simpleString})" - } - - override def sql: String = { -s"from_avro(${child.sql}, ${dataType.catalogString})" - } + override def prettyName: String = "from_avro" override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { val expr = ctx.addReferenceObj("this", this) http://git-wip-us.apache.org/repos/asf/spark/blob/6148a77a/external/avro/src/main/scala/org/apache/spark/sql/avro/CatalystDataToAvro.scala -- diff --git a/external/avro/src/main/scala/org/apache/spark/sql/avro/CatalystDataToAvro.scala b/external/avro/src/main/scala/org/apache/spark/sql/avro/CatalystDataToAvro.scala index 141ff37..6ed330d 100644 --- a/external/avro/src/main/scala/org/apache/spark/sql/avro/CatalystDataToAvro.scala +++ b/external/avro/src/main/scala/org/apache/spark/sql/avro/CatalystDataToAvro.scala @@ -52,13 +52,7 @@ case class CatalystDataToAvro(child: Expression) extends UnaryExpression { out.toByteArray } - override def simpleString: String = { -s"to_avro(${child.sql}, ${child.dataType.simpleString})" - } - - override def sql: String = { -s"to_avro(${child.sql}, ${child.dataType.catalogString})" - } + override def prettyName: String = "to_avro" override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { val expr = ctx.addReferenceObj("this", this) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
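The effect of the fix is easy to model: when an expression only overrides a short `prettyName`, the generated alias stays compact, whereas baking the schema string into `simpleString`/`sql` produced aliases like `from_avro(col, struct, Map(mode -> PERMISSIVE))`. A toy model of the two naming strategies (this is not Catalyst's actual `Expression` API, just an illustration of the naming contract):

```python
class Expr:
    """Toy expression: holds the child's SQL text and its data-type string."""
    def __init__(self, child_sql, type_str):
        self.child_sql, self.type_str = child_sql, type_str

    def pretty_name(self):
        # Default short name; subclasses override it, as the patch does.
        return type(self).__name__.lower()

    def verbose_alias(self):
        # Old behavior: schema baked into the alias, making it very long.
        return "%s(%s, %s)" % (self.pretty_name(), self.child_sql, self.type_str)

    def alias(self):
        # New behavior: just the pretty name and the child.
        return "%s(%s)" % (self.pretty_name(), self.child_sql)


class FromAvro(Expr):
    def pretty_name(self):
        return "from_avro"
```

Only the alias naming changes; evaluation and codegen are untouched, which is why a manual check of the plan output was sufficient testing.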
spark git commit: [SPARK-25906][SHELL] Documents '-I' option (from Scala REPL) in spark-shell
Repository: spark Updated Branches: refs/heads/master 78fa1be29 -> cc38abc27 [SPARK-25906][SHELL] Documents '-I' option (from Scala REPL) in spark-shell ## What changes were proposed in this pull request? This PR targets to document the `-I` option from Spark 2.4.x (previously the `-i` option until Spark 2.3.x). After we upgraded Scala to 2.11.12, the `-i` option (`:load`) was replaced by `-I` (SI-7898). The existing `-i` became `:paste`, which does not respect Spark's implicit imports (for instance `toDF`, symbol as column, etc.). Therefore, the `-i` option does not work correctly from Spark 2.4.x, and it is not documented. I checked other Scala REPL options, but from quick tests they look either not applicable or not working. This PR only targets to document `-I` for now. ## How was this patch tested? Manually tested. **Mac:** ```bash $ ./bin/spark-shell --help Usage: ./bin/spark-shell [options] Scala REPL options: -I <file> preload <file>, enforcing line-by-line interpretation Options: --master MASTER_URL spark://host:port, mesos://host:port, yarn, k8s://https://host:port, or local (Default: local[*]). --deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client). ... ``` **Windows:** ```cmd C:\...\spark>.\bin\spark-shell --help Usage: .\bin\spark-shell.cmd [options] Scala REPL options: -I <file> preload <file>, enforcing line-by-line interpretation Options: --master MASTER_URL spark://host:port, mesos://host:port, yarn, k8s://https://host:port, or local (Default: local[*]). --deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client). ... ``` Closes #22919 from HyukjinKwon/SPARK-25906.
Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cc38abc2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cc38abc2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cc38abc2 Branch: refs/heads/master Commit: cc38abc27a671f345e3b4c170977a1976a02a0d0 Parents: 78fa1be Author: hyukjinkwon Authored: Tue Nov 6 10:39:58 2018 +0800 Committer: hyukjinkwon Committed: Tue Nov 6 10:39:58 2018 +0800 -- bin/spark-shell | 5 - bin/spark-shell2.cmd | 8 +++- 2 files changed, 11 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/cc38abc2/bin/spark-shell -- diff --git a/bin/spark-shell b/bin/spark-shell index 421f36c..e920137 100755 --- a/bin/spark-shell +++ b/bin/spark-shell @@ -32,7 +32,10 @@ if [ -z "${SPARK_HOME}" ]; then source "$(dirname "$0")"/find-spark-home fi -export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]" +export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options] + +Scala REPL options: + -Ipreload , enforcing line-by-line interpretation" # SPARK-4161: scala does not assume use of the java classpath, # so we need to add the "-Dscala.usejavacp=true" flag manually. We http://git-wip-us.apache.org/repos/asf/spark/blob/cc38abc2/bin/spark-shell2.cmd -- diff --git a/bin/spark-shell2.cmd b/bin/spark-shell2.cmd index aaf7190..549bf43 100644 --- a/bin/spark-shell2.cmd +++ b/bin/spark-shell2.cmd @@ -20,7 +20,13 @@ rem rem Figure out where the Spark framework is installed call "%~dp0find-spark-home.cmd" -set _SPARK_CMD_USAGE=Usage: .\bin\spark-shell.cmd [options] +set LF=^ + + +rem two empty lines are required +set _SPARK_CMD_USAGE=Usage: .\bin\spark-shell.cmd [options]^%LF%%LF%^%LF%%LF%^ +Scala REPL options:^%LF%%LF%^ + -I ^ preload ^, enforcing line-by-line interpretation rem SPARK-4161: scala does not assume use of the java classpath, rem so we need to add the "-Dscala.usejavacp=true" flag manually. 
We
spark git commit: [SPARK-25906][SHELL] Documents '-I' option (from Scala REPL) in spark-shell
Repository: spark Updated Branches: refs/heads/branch-2.4 8526f2ee5 -> f98c0ad02 [SPARK-25906][SHELL] Documents '-I' option (from Scala REPL) in spark-shell ## What changes were proposed in this pull request? This PR targets to document the `-I` option from Spark 2.4.x (previously the `-i` option until Spark 2.3.x). After we upgraded Scala to 2.11.12, the `-i` option (`:load`) was replaced by `-I` (SI-7898). The existing `-i` became `:paste`, which does not respect Spark's implicit imports (for instance `toDF`, symbol as column, etc.). Therefore, the `-i` option does not work correctly from Spark 2.4.x, and it is not documented. I checked other Scala REPL options, but from quick tests they look either not applicable or not working. This PR only targets to document `-I` for now. ## How was this patch tested? Manually tested. **Mac:** ```bash $ ./bin/spark-shell --help Usage: ./bin/spark-shell [options] Scala REPL options: -I <file> preload <file>, enforcing line-by-line interpretation Options: --master MASTER_URL spark://host:port, mesos://host:port, yarn, k8s://https://host:port, or local (Default: local[*]). --deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client). ... ``` **Windows:** ```cmd C:\...\spark>.\bin\spark-shell --help Usage: .\bin\spark-shell.cmd [options] Scala REPL options: -I <file> preload <file>, enforcing line-by-line interpretation Options: --master MASTER_URL spark://host:port, mesos://host:port, yarn, k8s://https://host:port, or local (Default: local[*]). --deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client). ... ``` Closes #22919 from HyukjinKwon/SPARK-25906.
Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon (cherry picked from commit cc38abc27a671f345e3b4c170977a1976a02a0d0) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f98c0ad0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f98c0ad0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f98c0ad0 Branch: refs/heads/branch-2.4 Commit: f98c0ad02ea087ae79fef277801d0b71a5019b48 Parents: 8526f2e Author: hyukjinkwon Authored: Tue Nov 6 10:39:58 2018 +0800 Committer: hyukjinkwon Committed: Tue Nov 6 10:40:17 2018 +0800 -- bin/spark-shell | 5 - bin/spark-shell2.cmd | 8 +++- 2 files changed, 11 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f98c0ad0/bin/spark-shell -- diff --git a/bin/spark-shell b/bin/spark-shell index 421f36c..e920137 100755 --- a/bin/spark-shell +++ b/bin/spark-shell @@ -32,7 +32,10 @@ if [ -z "${SPARK_HOME}" ]; then source "$(dirname "$0")"/find-spark-home fi -export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]" +export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options] + +Scala REPL options: + -Ipreload , enforcing line-by-line interpretation" # SPARK-4161: scala does not assume use of the java classpath, # so we need to add the "-Dscala.usejavacp=true" flag manually. 
We http://git-wip-us.apache.org/repos/asf/spark/blob/f98c0ad0/bin/spark-shell2.cmd -- diff --git a/bin/spark-shell2.cmd b/bin/spark-shell2.cmd index aaf7190..549bf43 100644 --- a/bin/spark-shell2.cmd +++ b/bin/spark-shell2.cmd @@ -20,7 +20,13 @@ rem rem Figure out where the Spark framework is installed call "%~dp0find-spark-home.cmd" -set _SPARK_CMD_USAGE=Usage: .\bin\spark-shell.cmd [options] +set LF=^ + + +rem two empty lines are required +set _SPARK_CMD_USAGE=Usage: .\bin\spark-shell.cmd [options]^%LF%%LF%^%LF%%LF%^ +Scala REPL options:^%LF%%LF%^ + -I ^ preload ^, enforcing line-by-line interpretation rem SPARK-4161: scala does not assume use of the java classpath, rem so we need to add the "-Dscala.usejavacp=true" flag manually. We
[5/7] spark git commit: [SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files
http://git-wip-us.apache.org/repos/asf/spark/blob/a7a331df/python/pyspark/sql/tests/__init__.py -- diff --git a/python/pyspark/sql/tests/__init__.py b/python/pyspark/sql/tests/__init__.py new file mode 100644 index 000..cce3aca --- /dev/null +++ b/python/pyspark/sql/tests/__init__.py @@ -0,0 +1,16 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# http://git-wip-us.apache.org/repos/asf/spark/blob/a7a331df/python/pyspark/sql/tests/test_appsubmit.py -- diff --git a/python/pyspark/sql/tests/test_appsubmit.py b/python/pyspark/sql/tests/test_appsubmit.py new file mode 100644 index 000..3c71151 --- /dev/null +++ b/python/pyspark/sql/tests/test_appsubmit.py @@ -0,0 +1,96 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. 
You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import os +import subprocess +import tempfile + +import py4j + +from pyspark import SparkContext +from pyspark.tests import SparkSubmitTests + + +class HiveSparkSubmitTests(SparkSubmitTests): + +@classmethod +def setUpClass(cls): +# get a SparkContext to check for availability of Hive +sc = SparkContext('local[4]', cls.__name__) +cls.hive_available = True +try: +sc._jvm.org.apache.hadoop.hive.conf.HiveConf() +except py4j.protocol.Py4JError: +cls.hive_available = False +except TypeError: +cls.hive_available = False +finally: +# we don't need this SparkContext for the test +sc.stop() + +def setUp(self): +super(HiveSparkSubmitTests, self).setUp() +if not self.hive_available: +self.skipTest("Hive is not available.") + +def test_hivecontext(self): +# This test checks that HiveContext is using Hive metastore (SPARK-16224). +# It sets a metastore url and checks if there is a derby dir created by +# Hive metastore. If this derby dir exists, HiveContext is using +# Hive metastore. 
+metastore_path = os.path.join(tempfile.mkdtemp(), "spark16224_metastore_db") +metastore_URL = "jdbc:derby:;databaseName=" + metastore_path + ";create=true" +hive_site_dir = os.path.join(self.programDir, "conf") +hive_site_file = self.createTempFile("hive-site.xml", (""" +| +| +| javax.jdo.option.ConnectionURL +| %s +| +| +""" % metastore_URL).lstrip(), "conf") +script = self.createTempFile("test.py", """ +|import os +| +|from pyspark.conf import SparkConf +|from pyspark.context import SparkContext +|from pyspark.sql import HiveContext +| +|conf = SparkConf() +|sc = SparkContext(conf=conf) +|hive_context = HiveContext(sc) +|print(hive_context.sql("show databases").collect()) +""") +proc = subprocess.Popen( +self.sparkSubmit + ["--master", "local-cluster[1,1,1024]", +"--driver-class-path", hive_site_dir, script], +stdout=subprocess.PIPE) +out, err = proc.communicate() +self.assertEqual(0, proc.returncode) +self.assertIn("default", out.decode('utf-8')) +self.assertTrue(os.path.exists(metastore_path)) + + +if __name__ == "__main__": +import unittest +from
[1/7] spark git commit: [SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files
Repository: spark Updated Branches: refs/heads/master f26cd1881 -> a7a331df6 http://git-wip-us.apache.org/repos/asf/spark/blob/a7a331df/python/pyspark/sql/tests/test_udf.py -- diff --git a/python/pyspark/sql/tests/test_udf.py b/python/pyspark/sql/tests/test_udf.py new file mode 100644 index 000..630b215 --- /dev/null +++ b/python/pyspark/sql/tests/test_udf.py @@ -0,0 +1,654 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# + +import functools +import pydoc +import shutil +import tempfile +import unittest + +from pyspark import SparkContext +from pyspark.sql import SparkSession, Column, Row +from pyspark.sql.functions import UserDefinedFunction +from pyspark.sql.types import * +from pyspark.sql.utils import AnalysisException +from pyspark.testing.sqlutils import ReusedSQLTestCase, test_compiled, test_not_compiled_message +from pyspark.tests import QuietTest + + +class UDFTests(ReusedSQLTestCase): + +def test_udf_with_callable(self): +d = [Row(number=i, squared=i**2) for i in range(10)] +rdd = self.sc.parallelize(d) +data = self.spark.createDataFrame(rdd) + +class PlusFour: +def __call__(self, col): +if col is not None: +return col + 4 + +call = PlusFour() +pudf = UserDefinedFunction(call, LongType()) +res = data.select(pudf(data['number']).alias('plus_four')) +self.assertEqual(res.agg({'plus_four': 'sum'}).collect()[0][0], 85) + +def test_udf_with_partial_function(self): +d = [Row(number=i, squared=i**2) for i in range(10)] +rdd = self.sc.parallelize(d) +data = self.spark.createDataFrame(rdd) + +def some_func(col, param): +if col is not None: +return col + param + +pfunc = functools.partial(some_func, param=4) +pudf = UserDefinedFunction(pfunc, LongType()) +res = data.select(pudf(data['number']).alias('plus_four')) +self.assertEqual(res.agg({'plus_four': 'sum'}).collect()[0][0], 85) + +def test_udf(self): +self.spark.catalog.registerFunction("twoArgs", lambda x, y: len(x) + y, IntegerType()) +[row] = self.spark.sql("SELECT twoArgs('test', 1)").collect() +self.assertEqual(row[0], 5) + +# This is to check if a deprecated 'SQLContext.registerFunction' can call its alias. 
+sqlContext = self.spark._wrapped +sqlContext.registerFunction("oneArg", lambda x: len(x), IntegerType()) +[row] = sqlContext.sql("SELECT oneArg('test')").collect() +self.assertEqual(row[0], 4) + +def test_udf2(self): +with self.tempView("test"): +self.spark.catalog.registerFunction("strlen", lambda string: len(string), IntegerType()) +self.spark.createDataFrame(self.sc.parallelize([Row(a="test")]))\ +.createOrReplaceTempView("test") +[res] = self.spark.sql("SELECT strlen(a) FROM test WHERE strlen(a) > 1").collect() +self.assertEqual(4, res[0]) + +def test_udf3(self): +two_args = self.spark.catalog.registerFunction( +"twoArgs", UserDefinedFunction(lambda x, y: len(x) + y)) +self.assertEqual(two_args.deterministic, True) +[row] = self.spark.sql("SELECT twoArgs('test', 1)").collect() +self.assertEqual(row[0], u'5') + +def test_udf_registration_return_type_none(self): +two_args = self.spark.catalog.registerFunction( +"twoArgs", UserDefinedFunction(lambda x, y: len(x) + y, "integer"), None) +self.assertEqual(two_args.deterministic, True) +[row] = self.spark.sql("SELECT twoArgs('test', 1)").collect() +self.assertEqual(row[0], 5) + +def test_udf_registration_return_type_not_none(self): +with QuietTest(self.sc): +with self.assertRaisesRegexp(TypeError, "Invalid returnType"): +self.spark.catalog.registerFunction( +"f", UserDefinedFunction(lambda x, y: len(x) + y, StringType()), StringType()) + +def test_nondeterministic_udf(self): +# Test that nondeterministic UDFs are evaluated only once in chained UDF evaluations
[6/7] spark git commit: [SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files
http://git-wip-us.apache.org/repos/asf/spark/blob/a7a331df/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py deleted file mode 100644 index ea02691..000 --- a/python/pyspark/sql/tests.py +++ /dev/null @@ -1,7079 +0,0 @@ -# -*- encoding: utf-8 -*- -# -# Licensed to the Apache Software Foundation (ASF) under one or more -# contributor license agreements. See the NOTICE file distributed with -# this work for additional information regarding copyright ownership. -# The ASF licenses this file to You under the Apache License, Version 2.0 -# (the "License"); you may not use this file except in compliance with -# the License. You may obtain a copy of the License at -# -#http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# - -""" -Unit tests for pyspark.sql; additional tests are implemented as doctests in -individual modules. 
-""" -import os -import sys -import subprocess -import pydoc -import shutil -import tempfile -import threading -import pickle -import functools -import time -import datetime -import array -import ctypes -import warnings -import py4j -from contextlib import contextmanager - -try: -import xmlrunner -except ImportError: -xmlrunner = None - -if sys.version_info[:2] <= (2, 6): -try: -import unittest2 as unittest -except ImportError: -sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier') -sys.exit(1) -else: -import unittest - -from pyspark.util import _exception_message - -_pandas_requirement_message = None -try: -from pyspark.sql.utils import require_minimum_pandas_version -require_minimum_pandas_version() -except ImportError as e: -# If Pandas version requirement is not satisfied, skip related tests. -_pandas_requirement_message = _exception_message(e) - -_pyarrow_requirement_message = None -try: -from pyspark.sql.utils import require_minimum_pyarrow_version -require_minimum_pyarrow_version() -except ImportError as e: -# If Arrow version requirement is not satisfied, skip related tests. 
-_pyarrow_requirement_message = _exception_message(e) - -_test_not_compiled_message = None -try: -from pyspark.sql.utils import require_test_compiled -require_test_compiled() -except Exception as e: -_test_not_compiled_message = _exception_message(e) - -_have_pandas = _pandas_requirement_message is None -_have_pyarrow = _pyarrow_requirement_message is None -_test_compiled = _test_not_compiled_message is None - -from pyspark import SparkConf, SparkContext -from pyspark.sql import SparkSession, SQLContext, HiveContext, Column, Row -from pyspark.sql.types import * -from pyspark.sql.types import UserDefinedType, _infer_type, _make_type_verifier -from pyspark.sql.types import _array_signed_int_typecode_ctype_mappings, _array_type_mappings -from pyspark.sql.types import _array_unsigned_int_typecode_ctype_mappings -from pyspark.sql.types import _merge_type -from pyspark.tests import QuietTest, ReusedPySparkTestCase, PySparkTestCase, SparkSubmitTests -from pyspark.sql.functions import UserDefinedFunction, sha2, lit -from pyspark.sql.window import Window -from pyspark.sql.utils import AnalysisException, ParseException, IllegalArgumentException - - -class UTCOffsetTimezone(datetime.tzinfo): -""" -Specifies timezone in UTC offset -""" - -def __init__(self, offset=0): -self.ZERO = datetime.timedelta(hours=offset) - -def utcoffset(self, dt): -return self.ZERO - -def dst(self, dt): -return self.ZERO - - -class ExamplePointUDT(UserDefinedType): -""" -User-defined type (UDT) for ExamplePoint. -""" - -@classmethod -def sqlType(self): -return ArrayType(DoubleType(), False) - -@classmethod -def module(cls): -return 'pyspark.sql.tests' - -@classmethod -def scalaUDT(cls): -return 'org.apache.spark.sql.test.ExamplePointUDT' - -def serialize(self, obj): -return [obj.x, obj.y] - -def deserialize(self, datum): -return ExamplePoint(datum[0], datum[1]) - - -class ExamplePoint: -""" -An example class to demonstrate UDT in Scala, Java, and Python. 
-""" - -__UDT__ = ExamplePointUDT() - -def __init__(self, x, y): -self.x = x -self.y = y - -def __repr__(self): -return "ExamplePoint(%s,%s)" % (self.x, self.y) - -def __str__(self): -return "(%s,%s)" % (self.x, self.y) - -def __eq__(self, other): -return isinstance(other, self.__class__) and \ -other.x == self.x and other.y ==
[3/7] spark git commit: [SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files
http://git-wip-us.apache.org/repos/asf/spark/blob/a7a331df/python/pyspark/sql/tests/test_pandas_udf_grouped_map.py -- diff --git a/python/pyspark/sql/tests/test_pandas_udf_grouped_map.py b/python/pyspark/sql/tests/test_pandas_udf_grouped_map.py new file mode 100644 index 000..4d44388 --- /dev/null +++ b/python/pyspark/sql/tests/test_pandas_udf_grouped_map.py @@ -0,0 +1,530 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# + +import datetime +import unittest + +from pyspark.sql import Row +from pyspark.sql.types import * +from pyspark.testing.sqlutils import ReusedSQLTestCase, have_pandas, have_pyarrow, \ +pandas_requirement_message, pyarrow_requirement_message +from pyspark.tests import QuietTest + + +@unittest.skipIf( +not have_pandas or not have_pyarrow, +pandas_requirement_message or pyarrow_requirement_message) +class GroupedMapPandasUDFTests(ReusedSQLTestCase): + +@property +def data(self): +from pyspark.sql.functions import array, explode, col, lit +return self.spark.range(10).toDF('id') \ +.withColumn("vs", array([lit(i) for i in range(20, 30)])) \ +.withColumn("v", explode(col('vs'))).drop('vs') + +def test_supported_types(self): +from decimal import Decimal +from distutils.version import LooseVersion +import pyarrow as pa +from pyspark.sql.functions import pandas_udf, PandasUDFType + +values = [ +1, 2, 3, +4, 5, 1.1, +2.2, Decimal(1.123), +[1, 2, 2], True, 'hello' +] +output_fields = [ +('id', IntegerType()), ('byte', ByteType()), ('short', ShortType()), +('int', IntegerType()), ('long', LongType()), ('float', FloatType()), +('double', DoubleType()), ('decim', DecimalType(10, 3)), +('array', ArrayType(IntegerType())), ('bool', BooleanType()), ('str', StringType()) +] + +# TODO: Add BinaryType to variables above once minimum pyarrow version is 0.10.0 +if LooseVersion(pa.__version__) >= LooseVersion("0.10.0"): +values.append(bytearray([0x01, 0x02])) +output_fields.append(('bin', BinaryType())) + +output_schema = StructType([StructField(*x) for x in output_fields]) +df = self.spark.createDataFrame([values], schema=output_schema) + +# Different forms of group map pandas UDF, results of these are the same +udf1 = pandas_udf( +lambda pdf: pdf.assign( +byte=pdf.byte * 2, +short=pdf.short * 2, +int=pdf.int * 2, +long=pdf.long * 2, +float=pdf.float * 2, +double=pdf.double * 2, +decim=pdf.decim * 2, +bool=False if pdf.bool else True, +str=pdf.str + 'there', +array=pdf.array, +), 
+output_schema, +PandasUDFType.GROUPED_MAP +) + +udf2 = pandas_udf( +lambda _, pdf: pdf.assign( +byte=pdf.byte * 2, +short=pdf.short * 2, +int=pdf.int * 2, +long=pdf.long * 2, +float=pdf.float * 2, +double=pdf.double * 2, +decim=pdf.decim * 2, +bool=False if pdf.bool else True, +str=pdf.str + 'there', +array=pdf.array, +), +output_schema, +PandasUDFType.GROUPED_MAP +) + +udf3 = pandas_udf( +lambda key, pdf: pdf.assign( +id=key[0], +byte=pdf.byte * 2, +short=pdf.short * 2, +int=pdf.int * 2, +long=pdf.long * 2, +float=pdf.float * 2, +double=pdf.double * 2, +decim=pdf.decim * 2, +bool=False if pdf.bool else True, +str=pdf.str + 'there', +array=pdf.array, +), +output_schema, +PandasUDFType.GROUPED_MAP +) + +result1 =
[7/7] spark git commit: [SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files
[SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files ## What changes were proposed in this pull request? This is the first official attempt to break up the huge single `tests.py` file - I tried it locally a few times before and gave up. Currently it makes the unittests very hard to read and difficult to check; it even bothers me to scroll through the big file. It's one single 7000-line file! This is not only a readability issue: since one big test module takes most of the test time, the tests don't run fully in parallel - although splitting will cost some extra context start and stop time. We could pick one example and follow it. Given my investigation, the proposed style looks close to NumPy's structure and is easier to follow. Please see https://github.com/numpy/numpy/tree/master/numpy. Basically this PR proposes to break down `pyspark/sql/tests.py` into the following layout:
```bash
pyspark
...
├── sql
...
│   └── tests                # Includes all tests broken down from 'pyspark/sql/tests.py'.
│       │                    # Each matches a module in 'pyspark/sql'. Additionally, some logical
│       │                    # groups can be added, for instance 'test_arrow.py', 'test_datasources.py' ...
│       ├── __init__.py
│       ├── test_appsubmit.py
│       ├── test_arrow.py
│       ├── test_catalog.py
│       ├── test_column.py
│       ├── test_conf.py
│       ├── test_context.py
│       ├── test_dataframe.py
│       ├── test_datasources.py
│       ├── test_functions.py
│       ├── test_group.py
│       ├── test_pandas_udf.py
│       ├── test_pandas_udf_grouped_agg.py
│       ├── test_pandas_udf_grouped_map.py
│       ├── test_pandas_udf_scalar.py
│       ├── test_pandas_udf_window.py
│       ├── test_readwriter.py
│       ├── test_serde.py
│       ├── test_session.py
│       ├── test_streaming.py
│       ├── test_types.py
│       ├── test_udf.py
│       └── test_utils.py
...
└── testing                  # Includes testing utils that can be used in unittests.
    ├── __init__.py
    └── sqlutils.py
...
```
## How was this patch tested? Existing tests should cover.
`cd python` and `./run-tests-with-coverage`. Manually checked the tests are actually being run. Each test module can (unofficially) be run individually via: ``` SPARK_TESTING=1 ./bin/pyspark pyspark.sql.tests.test_pandas_udf_scalar ``` Note that if you're using macOS and Python 3, you might have to set `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`. Closes #23021 from HyukjinKwon/SPARK-25344. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a7a331df Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a7a331df Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a7a331df Branch: refs/heads/master Commit: a7a331df6e6fbcb181caf2363bffc3e05bdfc009 Parents: f26cd18 Author: hyukjinkwon Authored: Wed Nov 14 14:51:11 2018 +0800 Committer: hyukjinkwon Committed: Wed Nov 14 14:51:11 2018 +0800 -- dev/sparktestsupport/modules.py | 25 +- python/pyspark/sql/tests.py | 7079 -- python/pyspark/sql/tests/__init__.py| 16 + python/pyspark/sql/tests/test_appsubmit.py | 96 + python/pyspark/sql/tests/test_arrow.py | 399 + python/pyspark/sql/tests/test_catalog.py| 199 + python/pyspark/sql/tests/test_column.py | 157 + python/pyspark/sql/tests/test_conf.py | 55 + python/pyspark/sql/tests/test_context.py| 263 + python/pyspark/sql/tests/test_dataframe.py | 737 ++ python/pyspark/sql/tests/test_datasources.py| 170 + python/pyspark/sql/tests/test_functions.py | 278 + python/pyspark/sql/tests/test_group.py | 45 + python/pyspark/sql/tests/test_pandas_udf.py | 216 + .../sql/tests/test_pandas_udf_grouped_agg.py| 503 ++ .../sql/tests/test_pandas_udf_grouped_map.py| 530 ++ .../pyspark/sql/tests/test_pandas_udf_scalar.py | 807 ++ .../pyspark/sql/tests/test_pandas_udf_window.py | 262 + python/pyspark/sql/tests/test_readwriter.py | 153 + python/pyspark/sql/tests/test_serde.py | 138 + python/pyspark/sql/tests/test_session.py| 320 + python/pyspark/sql/tests/test_streaming.py | 566 ++
python/pyspark/sql/tests/test_types.py | 944 +++ python/pyspark/sql/tests/test_udf.py| 654 ++ python/pyspark/sql/tests/test_utils.py | 54 + python/pyspark/testing/__init__.py | 16 + python/pyspark/testing/sqlutils.py
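Each broken-out module ends with a small runner stanza (truncated in the `test_appsubmit.py` diff above) so it can be executed standalone. A minimal, self-contained sketch of that boilerplate - `ExampleTests` is an illustrative test class, not one taken from the PR:

```python
# Sketch of the per-module runner boilerplate each broken-out test file ends
# with. The xmlrunner fallback mirrors the optional JUnit-style XML reporting
# used by the PySpark test suite; ExampleTests is a stand-in test class.
import unittest


class ExampleTests(unittest.TestCase):
    def test_trivial(self):
        self.assertEqual(1 + 1, 2)


if __name__ == "__main__":
    try:
        import xmlrunner  # optional: emits JUnit-style XML reports
        runner = xmlrunner.XMLTestRunner(output='target/test-reports')
    except ImportError:
        runner = None  # fall back to the default text runner
    # exit=False keeps unittest.main from calling sys.exit() here.
    unittest.main(testRunner=runner, verbosity=2, exit=False)
```

With this pattern, `SPARK_TESTING=1 ./bin/pyspark pyspark.sql.tests.test_udf` (and friends) can run one module at a time instead of the whole 7000-line file.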
[4/7] spark git commit: [SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files
http://git-wip-us.apache.org/repos/asf/spark/blob/a7a331df/python/pyspark/sql/tests/test_dataframe.py -- diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py new file mode 100644 index 000..eba00b5 --- /dev/null +++ b/python/pyspark/sql/tests/test_dataframe.py @@ -0,0 +1,737 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# + +import os +import pydoc +import time +import unittest + +from pyspark.sql import SparkSession, Row +from pyspark.sql.types import * +from pyspark.sql.utils import AnalysisException, IllegalArgumentException +from pyspark.testing.sqlutils import ReusedSQLTestCase, SQLTestUtils, have_pyarrow, have_pandas, \ +pandas_requirement_message, pyarrow_requirement_message +from pyspark.tests import QuietTest + + +class DataFrameTests(ReusedSQLTestCase): + +def test_range(self): +self.assertEqual(self.spark.range(1, 1).count(), 0) +self.assertEqual(self.spark.range(1, 0, -1).count(), 1) +self.assertEqual(self.spark.range(0, 1 << 40, 1 << 39).count(), 2) +self.assertEqual(self.spark.range(-2).count(), 0) +self.assertEqual(self.spark.range(3).count(), 3) + +def test_duplicated_column_names(self): +df = self.spark.createDataFrame([(1, 2)], ["c", "c"]) +row = df.select('*').first() +self.assertEqual(1, row[0]) +self.assertEqual(2, row[1]) +self.assertEqual("Row(c=1, c=2)", str(row)) +# Cannot access columns +self.assertRaises(AnalysisException, lambda: df.select(df[0]).first()) +self.assertRaises(AnalysisException, lambda: df.select(df.c).first()) +self.assertRaises(AnalysisException, lambda: df.select(df["c"]).first()) + +def test_freqItems(self): +vals = [Row(a=1, b=-2.0) if i % 2 == 0 else Row(a=i, b=i * 1.0) for i in range(100)] +df = self.sc.parallelize(vals).toDF() +items = df.stat.freqItems(("a", "b"), 0.4).collect()[0] +self.assertTrue(1 in items[0]) +self.assertTrue(-2.0 in items[1]) + +def test_help_command(self): +# Regression test for SPARK-5464 +rdd = self.sc.parallelize(['{"foo":"bar"}', '{"foo":"baz"}']) +df = self.spark.read.json(rdd) +# render_doc() reproduces the help() exception without printing output +pydoc.render_doc(df) +pydoc.render_doc(df.foo) +pydoc.render_doc(df.take(1)) + +def test_dropna(self): +schema = StructType([ +StructField("name", StringType(), True), +StructField("age", IntegerType(), True), +StructField("height", DoubleType(), True)]) + 
+# shouldn't drop a non-null row +self.assertEqual(self.spark.createDataFrame( +[(u'Alice', 50, 80.1)], schema).dropna().count(), +1) + +# dropping rows with a single null value +self.assertEqual(self.spark.createDataFrame( +[(u'Alice', None, 80.1)], schema).dropna().count(), +0) +self.assertEqual(self.spark.createDataFrame( +[(u'Alice', None, 80.1)], schema).dropna(how='any').count(), +0) + +# if how = 'all', only drop rows if all values are null +self.assertEqual(self.spark.createDataFrame( +[(u'Alice', None, 80.1)], schema).dropna(how='all').count(), +1) +self.assertEqual(self.spark.createDataFrame( +[(None, None, None)], schema).dropna(how='all').count(), +0) + +# how and subset +self.assertEqual(self.spark.createDataFrame( +[(u'Alice', 50, None)], schema).dropna(how='any', subset=['name', 'age']).count(), +1) +self.assertEqual(self.spark.createDataFrame( +[(u'Alice', None, None)], schema).dropna(how='any', subset=['name', 'age']).count(), +0) + +# threshold +self.assertEqual(self.spark.createDataFrame( +[(u'Alice', None, 80.1)], schema).dropna(thresh=2).count(), +1) +self.assertEqual(self.spark.createDataFrame( +[(u'Alice', None, None)], schema).dropna(thresh=2).count(), +
[2/7] spark git commit: [SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files
http://git-wip-us.apache.org/repos/asf/spark/blob/a7a331df/python/pyspark/sql/tests/test_session.py -- diff --git a/python/pyspark/sql/tests/test_session.py b/python/pyspark/sql/tests/test_session.py new file mode 100644 index 000..b811047 --- /dev/null +++ b/python/pyspark/sql/tests/test_session.py @@ -0,0 +1,320 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import os +import unittest + +from pyspark import SparkConf, SparkContext +from pyspark.sql import SparkSession, SQLContext, Row +from pyspark.testing.sqlutils import ReusedSQLTestCase +from pyspark.tests import PySparkTestCase + + +class SparkSessionTests(ReusedSQLTestCase): +def test_sqlcontext_reuses_sparksession(self): +sqlContext1 = SQLContext(self.sc) +sqlContext2 = SQLContext(self.sc) +self.assertTrue(sqlContext1.sparkSession is sqlContext2.sparkSession) + + +class SparkSessionTests1(ReusedSQLTestCase): + +# We can't include this test into SQLTests because we will stop class's SparkContext and cause +# other tests failed. 
+def test_sparksession_with_stopped_sparkcontext(self): +self.sc.stop() +sc = SparkContext('local[4]', self.sc.appName) +spark = SparkSession.builder.getOrCreate() +try: +df = spark.createDataFrame([(1, 2)], ["c", "c"]) +df.collect() +finally: +spark.stop() +sc.stop() + + +class SparkSessionTests2(PySparkTestCase): + +# This test is separate because it's closely related with session's start and stop. +# See SPARK-23228. +def test_set_jvm_default_session(self): +spark = SparkSession.builder.getOrCreate() +try: + self.assertTrue(spark._jvm.SparkSession.getDefaultSession().isDefined()) +finally: +spark.stop() + self.assertTrue(spark._jvm.SparkSession.getDefaultSession().isEmpty()) + +def test_jvm_default_session_already_set(self): +# Here, we assume there is the default session already set in JVM. +jsession = self.sc._jvm.SparkSession(self.sc._jsc.sc()) +self.sc._jvm.SparkSession.setDefaultSession(jsession) + +spark = SparkSession.builder.getOrCreate() +try: + self.assertTrue(spark._jvm.SparkSession.getDefaultSession().isDefined()) +# The session should be the same with the exiting one. 
+ self.assertTrue(jsession.equals(spark._jvm.SparkSession.getDefaultSession().get())) +finally: +spark.stop() + + +class SparkSessionTests3(unittest.TestCase): + +def test_active_session(self): +spark = SparkSession.builder \ +.master("local") \ +.getOrCreate() +try: +activeSession = SparkSession.getActiveSession() +df = activeSession.createDataFrame([(1, 'Alice')], ['age', 'name']) +self.assertEqual(df.collect(), [Row(age=1, name=u'Alice')]) +finally: +spark.stop() + +def test_get_active_session_when_no_active_session(self): +active = SparkSession.getActiveSession() +self.assertEqual(active, None) +spark = SparkSession.builder \ +.master("local") \ +.getOrCreate() +active = SparkSession.getActiveSession() +self.assertEqual(active, spark) +spark.stop() +active = SparkSession.getActiveSession() +self.assertEqual(active, None) + +def test_SparkSession(self): +spark = SparkSession.builder \ +.master("local") \ +.config("some-config", "v2") \ +.getOrCreate() +try: +self.assertEqual(spark.conf.get("some-config"), "v2") +self.assertEqual(spark.sparkContext._conf.get("some-config"), "v2") +self.assertEqual(spark.version, spark.sparkContext.version) +spark.sql("CREATE DATABASE test_db") +spark.catalog.setCurrentDatabase("test_db") +self.assertEqual(spark.catalog.currentDatabase(), "test_db") +spark.sql("CREATE TABLE table1 (name STRING, age INT) USING parquet") +
spark git commit: [MINOR][SQL] Add disable bucketedRead workaround when throw RuntimeException
Repository: spark Updated Branches: refs/heads/master ad853c567 -> f6255d7b7 [MINOR][SQL] Add disable bucketedRead workaround when throw RuntimeException ## What changes were proposed in this pull request? Reading from a bucketed table (about 1.7 GB per bucket file) can throw a `RuntimeException`: ![image](https://user-images.githubusercontent.com/5399861/48346889-8041ce00-e6b7-11e8-83b0-ead83fb15821.png) Default (bucketed read enabled): ![image](https://user-images.githubusercontent.com/5399861/48347084-2c83b480-e6b8-11e8-913a-9cafc043e9e4.png) With bucketed read disabled: ![image](https://user-images.githubusercontent.com/5399861/48347099-3a393a00-e6b8-11e8-94af-cb814e1ba277.png) The reason is that each bucket file is too big; a workaround is to disable bucketed reads. This PR adds that workaround to the error message. ## How was this patch tested? Manual tests. Closes #23014 from wangyum/anotherWorkaround. Authored-by: Yuming Wang Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f6255d7b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f6255d7b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f6255d7b Branch: refs/heads/master Commit: f6255d7b7cc4cc5d1f4fe0e5e493a1efee22f38f Parents: ad853c5 Author: Yuming Wang Authored: Thu Nov 15 08:33:06 2018 +0800 Committer: hyukjinkwon Committed: Thu Nov 15 08:33:06 2018 +0800 -- .../spark/sql/execution/vectorized/WritableColumnVector.java| 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f6255d7b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java -- diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java index b0e119d..4f5e72c 100644 ---
a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java +++ b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java @@ -101,10 +101,11 @@ public abstract class WritableColumnVector extends ColumnVector { String message = "Cannot reserve additional contiguous bytes in the vectorized reader (" + (requiredCapacity >= 0 ? "requested " + requiredCapacity + " bytes" : "integer overflow") + "). As a workaround, you can reduce the vectorized reader batch size, or disable the " + -"vectorized reader. For parquet file format, refer to " + +"vectorized reader, or disable " + SQLConf.BUCKETING_ENABLED().key() + " if you read " + +"from bucket table. For Parquet file format, refer to " + SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE().key() + " (default " + SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE().defaultValueString() + -") and " + SQLConf.PARQUET_VECTORIZED_READER_ENABLED().key() + "; for orc file format, " + +") and " + SQLConf.PARQUET_VECTORIZED_READER_ENABLED().key() + "; for ORC file format, " + "refer to " + SQLConf.ORC_VECTORIZED_READER_BATCH_SIZE().key() + " (default " + SQLConf.ORC_VECTORIZED_READER_BATCH_SIZE().defaultValueString() + ") and " + SQLConf.ORC_VECTORIZED_READER_ENABLED().key() + "."; - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
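The improved message points at three escape hatches. As a hedged illustration - the keys are the real Spark SQL option names referenced in the patch (`SQLConf.BUCKETING_ENABLED`, `SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE`, `SQLConf.PARQUET_VECTORIZED_READER_ENABLED`), while the values and the `--conf` rendering are just an example, not recommendations:

```python
# The three knobs the improved error message suggests, expressed as plain
# key/value pairs that could be passed to spark-submit via --conf (or set
# on spark.conf at runtime). Values here are illustrative only.
bucketed_read_workarounds = {
    # SQLConf.BUCKETING_ENABLED: disable bucketed reads entirely.
    "spark.sql.sources.bucketing.enabled": "false",
    # SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE: shrink the batch (default 4096 rows).
    "spark.sql.parquet.columnarReaderBatchSize": "1024",
    # SQLConf.PARQUET_VECTORIZED_READER_ENABLED: fall back to the row-based reader.
    "spark.sql.parquet.enableVectorizedReader": "false",
}

# Render as spark-submit arguments.
for key, value in sorted(bucketed_read_workarounds.items()):
    print("--conf %s=%s" % (key, value))
```

In practice you would pick one workaround at a time, starting with the smallest behavioral change (reducing the batch size) before disabling bucketing outright.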
[2/4] spark git commit: [SPARK-26036][PYTHON] Break large tests.py files into smaller files
http://git-wip-us.apache.org/repos/asf/spark/blob/03306a6d/python/pyspark/tests/__init__.py -- diff --git a/python/pyspark/tests/__init__.py b/python/pyspark/tests/__init__.py new file mode 100644 index 000..12bdf0d --- /dev/null +++ b/python/pyspark/tests/__init__.py @@ -0,0 +1,16 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# http://git-wip-us.apache.org/repos/asf/spark/blob/03306a6d/python/pyspark/tests/test_appsubmit.py -- diff --git a/python/pyspark/tests/test_appsubmit.py b/python/pyspark/tests/test_appsubmit.py new file mode 100644 index 000..92bcb11 --- /dev/null +++ b/python/pyspark/tests/test_appsubmit.py @@ -0,0 +1,248 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. 
You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import os +import re +import shutil +import subprocess +import tempfile +import unittest +import zipfile + + +class SparkSubmitTests(unittest.TestCase): + +def setUp(self): +self.programDir = tempfile.mkdtemp() +tmp_dir = tempfile.gettempdir() +self.sparkSubmit = [ +os.path.join(os.environ.get("SPARK_HOME"), "bin", "spark-submit"), +"--conf", "spark.driver.extraJavaOptions=-Djava.io.tmpdir={0}".format(tmp_dir), +"--conf", "spark.executor.extraJavaOptions=-Djava.io.tmpdir={0}".format(tmp_dir), +] + +def tearDown(self): +shutil.rmtree(self.programDir) + +def createTempFile(self, name, content, dir=None): +""" +Create a temp file with the given name and content and return its path. +Strips leading spaces from content up to the first '|' in each line. +""" +pattern = re.compile(r'^ *\|', re.MULTILINE) +content = re.sub(pattern, '', content.strip()) +if dir is None: +path = os.path.join(self.programDir, name) +else: +os.makedirs(os.path.join(self.programDir, dir)) +path = os.path.join(self.programDir, dir, name) +with open(path, "w") as f: +f.write(content) +return path + +def createFileInZip(self, name, content, ext=".zip", dir=None, zip_name=None): +""" +Create a zip archive containing a file with the given content and return its path. +Strips leading spaces from content up to the first '|' in each line. 
+""" +pattern = re.compile(r'^ *\|', re.MULTILINE) +content = re.sub(pattern, '', content.strip()) +if dir is None: +path = os.path.join(self.programDir, name + ext) +else: +path = os.path.join(self.programDir, dir, zip_name + ext) +zip = zipfile.ZipFile(path, 'w') +zip.writestr(name, content) +zip.close() +return path + +def create_spark_package(self, artifact_name): +group_id, artifact_id, version = artifact_name.split(":") +self.createTempFile("%s-%s.pom" % (artifact_id, version), (""" +| +|http://maven.apache.org/POM/4.0.0; +| xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance; +| xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 +| http://maven.apache.org/xsd/maven-4.0.0.xsd;> +| 4.0.0 +| %s +| %s +| %s +| +""" % (group_id, artifact_id, version)).lstrip(), +
[3/4] spark git commit: [SPARK-26036][PYTHON] Break large tests.py files into smaller files
http://git-wip-us.apache.org/repos/asf/spark/blob/03306a6d/python/pyspark/tests.py -- diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py deleted file mode 100644 index 131c51e..000 --- a/python/pyspark/tests.py +++ /dev/null @@ -1,2502 +0,0 @@ -# -# Licensed to the Apache Software Foundation (ASF) under one or more -# contributor license agreements. See the NOTICE file distributed with -# this work for additional information regarding copyright ownership. -# The ASF licenses this file to You under the Apache License, Version 2.0 -# (the "License"); you may not use this file except in compliance with -# the License. You may obtain a copy of the License at -# -#http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# - -""" -Unit tests for PySpark; additional tests are implemented as doctests in -individual modules. 
-""" - -from array import array -from glob import glob -import os -import re -import shutil -import subprocess -import sys -import tempfile -import time -import zipfile -import random -import threading -import hashlib - -from py4j.protocol import Py4JJavaError -try: -import xmlrunner -except ImportError: -xmlrunner = None - -if sys.version_info[:2] <= (2, 6): -try: -import unittest2 as unittest -except ImportError: -sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier') -sys.exit(1) -else: -import unittest -if sys.version_info[0] >= 3: -xrange = range -basestring = str - -if sys.version >= "3": -from io import StringIO -else: -from StringIO import StringIO - - -from pyspark import keyword_only -from pyspark.conf import SparkConf -from pyspark.context import SparkContext -from pyspark.rdd import RDD -from pyspark.files import SparkFiles -from pyspark.serializers import read_int, BatchedSerializer, MarshalSerializer, PickleSerializer, \ -CloudPickleSerializer, CompressedSerializer, UTF8Deserializer, NoOpSerializer, \ -PairDeserializer, CartesianDeserializer, AutoBatchedSerializer, AutoSerializer, \ -FlattenedValuesSerializer -from pyspark.shuffle import Aggregator, ExternalMerger, ExternalSorter -from pyspark import shuffle -from pyspark.profiler import BasicProfiler -from pyspark.taskcontext import BarrierTaskContext, TaskContext - -_have_scipy = False -_have_numpy = False -try: -import scipy.sparse -_have_scipy = True -except: -# No SciPy, but that's okay, we'll skip those tests -pass -try: -import numpy as np -_have_numpy = True -except: -# No NumPy, but that's okay, we'll skip those tests -pass - - -SPARK_HOME = os.environ["SPARK_HOME"] - - -class MergerTests(unittest.TestCase): - -def setUp(self): -self.N = 1 << 12 -self.l = [i for i in xrange(self.N)] -self.data = list(zip(self.l, self.l)) -self.agg = Aggregator(lambda x: [x], - lambda x, y: x.append(y) or x, - lambda x, y: x.extend(y) or x) - -def test_small_dataset(self): -m = 
ExternalMerger(self.agg, 1000) -m.mergeValues(self.data) -self.assertEqual(m.spills, 0) -self.assertEqual(sum(sum(v) for k, v in m.items()), - sum(xrange(self.N))) - -m = ExternalMerger(self.agg, 1000) -m.mergeCombiners(map(lambda x_y1: (x_y1[0], [x_y1[1]]), self.data)) -self.assertEqual(m.spills, 0) -self.assertEqual(sum(sum(v) for k, v in m.items()), - sum(xrange(self.N))) - -def test_medium_dataset(self): -m = ExternalMerger(self.agg, 20) -m.mergeValues(self.data) -self.assertTrue(m.spills >= 1) -self.assertEqual(sum(sum(v) for k, v in m.items()), - sum(xrange(self.N))) - -m = ExternalMerger(self.agg, 10) -m.mergeCombiners(map(lambda x_y2: (x_y2[0], [x_y2[1]]), self.data * 3)) -self.assertTrue(m.spills >= 1) -self.assertEqual(sum(sum(v) for k, v in m.items()), - sum(xrange(self.N)) * 3) - -def test_huge_dataset(self): -m = ExternalMerger(self.agg, 5, partitions=3) -m.mergeCombiners(map(lambda k_v: (k_v[0], [str(k_v[1])]), self.data * 10)) -self.assertTrue(m.spills >= 1) -self.assertEqual(sum(len(v) for k, v in m.items()), - self.N * 10) -m._cleanup() - -def test_group_by_key(self): - -def gen_data(N, step): -for i in range(1, N + 1, step): -for j in range(i): -yield (i, [j]) - -
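The `Aggregator` exercised by `MergerTests` above is the classic create / merge-value / merge-combiners triple. A minimal pure-Python sketch of those semantics (no Spark required; `combine_by_key` is an illustrative name, not PySpark API):

```python
# Mirrors Aggregator(lambda x: [x], lambda x, y: x.append(y) or x,
#                    lambda x, y: x.extend(y) or x) from the tests.
create = lambda v: [v]
merge_value = lambda acc, v: acc.append(v) or acc  # append returns None, so `or acc` yields acc
merge_combiners = lambda a, b: a.extend(b) or a

def combine_by_key(pairs):
    out = {}
    for k, v in pairs:
        out[k] = merge_value(out[k], v) if k in out else create(v)
    return out

n = 1 << 12
combined = combine_by_key((k % 8, k) for k in range(n))
# The invariant the merger tests assert: merging preserves the total.
assert sum(sum(vs) for vs in combined.values()) == sum(range(n))
```

`ExternalMerger` implements the same contract but spills partitions to disk when memory is tight, which is what the `m.spills >= 1` assertions check.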
[1/4] spark git commit: [SPARK-26036][PYTHON] Break large tests.py files into smaller files
Repository: spark Updated Branches: refs/heads/master f6255d7b7 -> 03306a6df http://git-wip-us.apache.org/repos/asf/spark/blob/03306a6d/python/pyspark/tests/test_readwrite.py -- diff --git a/python/pyspark/tests/test_readwrite.py b/python/pyspark/tests/test_readwrite.py new file mode 100644 index 000..e45f5b3 --- /dev/null +++ b/python/pyspark/tests/test_readwrite.py @@ -0,0 +1,499 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# +import os +import shutil +import sys +import tempfile +import unittest +from array import array + +from pyspark.testing.utils import ReusedPySparkTestCase, SPARK_HOME + + +class InputFormatTests(ReusedPySparkTestCase): + +@classmethod +def setUpClass(cls): +ReusedPySparkTestCase.setUpClass() +cls.tempdir = tempfile.NamedTemporaryFile(delete=False) +os.unlink(cls.tempdir.name) + cls.sc._jvm.WriteInputFormatTestDataGenerator.generateData(cls.tempdir.name, cls.sc._jsc) + +@classmethod +def tearDownClass(cls): +ReusedPySparkTestCase.tearDownClass() +shutil.rmtree(cls.tempdir.name) + +@unittest.skipIf(sys.version >= "3", "serialize array of byte") +def test_sequencefiles(self): +basepath = self.tempdir.name +ints = sorted(self.sc.sequenceFile(basepath + "/sftestdata/sfint/", + "org.apache.hadoop.io.IntWritable", + "org.apache.hadoop.io.Text").collect()) +ei = [(1, u'aa'), (1, u'aa'), (2, u'aa'), (2, u'bb'), (2, u'bb'), (3, u'cc')] +self.assertEqual(ints, ei) + +doubles = sorted(self.sc.sequenceFile(basepath + "/sftestdata/sfdouble/", + "org.apache.hadoop.io.DoubleWritable", + "org.apache.hadoop.io.Text").collect()) +ed = [(1.0, u'aa'), (1.0, u'aa'), (2.0, u'aa'), (2.0, u'bb'), (2.0, u'bb'), (3.0, u'cc')] +self.assertEqual(doubles, ed) + +bytes = sorted(self.sc.sequenceFile(basepath + "/sftestdata/sfbytes/", +"org.apache.hadoop.io.IntWritable", + "org.apache.hadoop.io.BytesWritable").collect()) +ebs = [(1, bytearray('aa', 'utf-8')), + (1, bytearray('aa', 'utf-8')), + (2, bytearray('aa', 'utf-8')), + (2, bytearray('bb', 'utf-8')), + (2, bytearray('bb', 'utf-8')), + (3, bytearray('cc', 'utf-8'))] +self.assertEqual(bytes, ebs) + +text = sorted(self.sc.sequenceFile(basepath + "/sftestdata/sftext/", + "org.apache.hadoop.io.Text", + "org.apache.hadoop.io.Text").collect()) +et = [(u'1', u'aa'), + (u'1', u'aa'), + (u'2', u'aa'), + (u'2', u'bb'), + (u'2', u'bb'), + (u'3', u'cc')] +self.assertEqual(text, et) + +bools = sorted(self.sc.sequenceFile(basepath + 
"/sftestdata/sfbool/", +"org.apache.hadoop.io.IntWritable", + "org.apache.hadoop.io.BooleanWritable").collect()) +eb = [(1, False), (1, True), (2, False), (2, False), (2, True), (3, True)] +self.assertEqual(bools, eb) + +nulls = sorted(self.sc.sequenceFile(basepath + "/sftestdata/sfnull/", +"org.apache.hadoop.io.IntWritable", + "org.apache.hadoop.io.BooleanWritable").collect()) +en = [(1, None), (1, None), (2, None), (2, None), (2, None), (3, None)] +self.assertEqual(nulls, en) + +maps = self.sc.sequenceFile(basepath + "/sftestdata/sfmap/", +"org.apache.hadoop.io.IntWritable", + "org.apache.hadoop.io.MapWritable").collect() +em = [(1, {}), + (1, {3.0: u'bb'}), + (2, {1.0: u'aa'}), + (2, {1.0: u'cc'}), + (3, {2.0:
spark git commit: [SPARK-26014][R] Deprecate R prior to version 3.4 in SparkR
Repository: spark Updated Branches: refs/heads/master 03306a6df -> d4130ec1f [SPARK-26014][R] Deprecate R prior to version 3.4 in SparkR ## What changes were proposed in this pull request? This PR proposes to deprecate R versions prior to 3.4, ahead of bumping the minimum supported R version from 3.1 to 3.4. R 3.1.x is too old: it was released 4.5 years ago, while R 3.4.0 was released 1.5 years ago. Considering the timing for Spark 3.0, deprecating lower versions and bumping the minimum R version to 3.4 is a reasonable option. It should be good to deprecate now and drop support for R < 3.4 later. ## How was this patch tested? Jenkins tests. Closes #23012 from HyukjinKwon/SPARK-26014. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d4130ec1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d4130ec1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d4130ec1 Branch: refs/heads/master Commit: d4130ec1f3461dcc961eee9802005ba7a15212d1 Parents: 03306a6 Author: hyukjinkwon Authored: Thu Nov 15 17:20:49 2018 +0800 Committer: hyukjinkwon Committed: Thu Nov 15 17:20:49 2018 +0800 -- R/WINDOWS.md | 2 +- R/pkg/DESCRIPTION| 2 +- R/pkg/inst/profile/general.R | 4 R/pkg/inst/profile/shell.R | 4 docs/index.md| 3 ++- 5 files changed, 12 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d4130ec1/R/WINDOWS.md -- diff --git a/R/WINDOWS.md b/R/WINDOWS.md index da668a6..33a4c85 100644 --- a/R/WINDOWS.md +++ b/R/WINDOWS.md @@ -3,7 +3,7 @@ To build SparkR on Windows, the following steps are required 1. Install R (>= 3.1) and [Rtools](http://cran.r-project.org/bin/windows/Rtools/). Make sure to -include Rtools and R in `PATH`. +include Rtools and R in `PATH`. Note that support for R prior to version 3.4 is deprecated as of Spark 3.0.0. 2.
Install [JDK8](http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html) and set http://git-wip-us.apache.org/repos/asf/spark/blob/d4130ec1/R/pkg/DESCRIPTION -- diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION index cdaaa61..736da46 100644 --- a/R/pkg/DESCRIPTION +++ b/R/pkg/DESCRIPTION @@ -15,7 +15,7 @@ URL: http://www.apache.org/ http://spark.apache.org/ BugReports: http://spark.apache.org/contributing.html SystemRequirements: Java (== 8) Depends: -R (>= 3.0), +R (>= 3.1), methods Suggests: knitr, http://git-wip-us.apache.org/repos/asf/spark/blob/d4130ec1/R/pkg/inst/profile/general.R -- diff --git a/R/pkg/inst/profile/general.R b/R/pkg/inst/profile/general.R index 8c75c19..3efb460 100644 --- a/R/pkg/inst/profile/general.R +++ b/R/pkg/inst/profile/general.R @@ -16,6 +16,10 @@ # .First <- function() { + if (utils::compareVersion(paste0(R.version$major, ".", R.version$minor), "3.4.0") == -1) { +warning("Support for R prior to version 3.4 is deprecated since Spark 3.0.0") + } + packageDir <- Sys.getenv("SPARKR_PACKAGE_DIR") dirs <- strsplit(packageDir, ",")[[1]] .libPaths(c(dirs, .libPaths())) http://git-wip-us.apache.org/repos/asf/spark/blob/d4130ec1/R/pkg/inst/profile/shell.R -- diff --git a/R/pkg/inst/profile/shell.R b/R/pkg/inst/profile/shell.R index 8a8111a..32eb367 100644 --- a/R/pkg/inst/profile/shell.R +++ b/R/pkg/inst/profile/shell.R @@ -16,6 +16,10 @@ # .First <- function() { + if (utils::compareVersion(paste0(R.version$major, ".", R.version$minor), "3.4.0") == -1) { +warning("Support for R prior to version 3.4 is deprecated since Spark 3.0.0") + } + home <- Sys.getenv("SPARK_HOME") .libPaths(c(file.path(home, "R", "lib"), .libPaths())) Sys.setenv(NOAWT = 1) http://git-wip-us.apache.org/repos/asf/spark/blob/d4130ec1/docs/index.md -- diff --git a/docs/index.md b/docs/index.md index ac38f1d..bd287e3 100644 --- a/docs/index.md +++ b/docs/index.md @@ -31,7 +31,8 @@ Spark runs on both Windows and UNIX-like systems (e.g. 
Linux, Mac OS). It's easy to run locally on one machine --- all you need is to have `java` installed on your system `PATH`, or the `JAVA_HOME` environment variable pointing to a Java installation. -Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark {{site.SPARK_VERSION}} +Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. R prior to version 3.4 support is deprecated as of Spark 3.0.0. +For the Scala API, Spark {{site.SPARK_VERSION}} uses Scala {{site.SCALA_BINARY_VERSION}}. You will need to use a
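The guard added to `general.R` and `shell.R` uses `utils::compareVersion`, a three-way comparison that returns -1, 0, or 1. A pure-Python sketch of the same idea (illustrative only; it compares dot-separated numeric components and assumes equal-length versions behave like R's):

```python
def compare_version(a, b):
    """Three-way compare like R's utils::compareVersion: -1, 0, or 1."""
    pa = tuple(int(x) for x in a.split("."))
    pb = tuple(int(x) for x in b.split("."))
    return (pa > pb) - (pa < pb)

# The .First hook warns exactly when the running R predates 3.4.0:
assert compare_version("3.1.0", "3.4.0") == -1   # would warn
assert compare_version("3.4.0", "3.4.0") == 0    # no warning
assert compare_version("3.5.1", "3.4.0") == 1    # no warning
```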
spark git commit: [SPARK-26013][R][BUILD] Upgrade R tools version from 3.4.0 to 3.5.1 in AppVeyor build
Repository: spark Updated Branches: refs/heads/master 0ba9715c7 -> f9ff75653 [SPARK-26013][R][BUILD] Upgrade R tools version from 3.4.0 to 3.5.1 in AppVeyor build ## What changes were proposed in this pull request? Rtools 3.5.1 was released a few months ago, while Spark currently uses 3.4.0, so we should upgrade it in AppVeyor. ## How was this patch tested? AppVeyor builds. Closes #23011 from HyukjinKwon/SPARK-26013. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f9ff7565 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f9ff7565 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f9ff7565 Branch: refs/heads/master Commit: f9ff75653fa8cd055fbcbfe94243049c38c60507 Parents: 0ba9715 Author: hyukjinkwon Authored: Tue Nov 13 01:21:03 2018 +0800 Committer: hyukjinkwon Committed: Tue Nov 13 01:21:03 2018 +0800 -- dev/appveyor-install-dependencies.ps1 | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f9ff7565/dev/appveyor-install-dependencies.ps1 -- diff --git a/dev/appveyor-install-dependencies.ps1 b/dev/appveyor-install-dependencies.ps1 index 06d9d70..cc68ffb 100644 --- a/dev/appveyor-install-dependencies.ps1 +++ b/dev/appveyor-install-dependencies.ps1 @@ -116,7 +116,7 @@ Pop-Location # == R $rVer = "3.5.1" -$rToolsVer = "3.4.0" +$rToolsVer = "3.5.1" InstallR InstallRtools
spark git commit: [SPARK-24601] Update Jackson to 2.9.6
Repository: spark Updated Branches: refs/heads/master 459700727 -> ab1650d29 [SPARK-24601] Update Jackson to 2.9.6 Hi all, Spark's current Jackson version is incompatible with more recent upstream releases, so this bumps Jackson to a newer one. I ran into some issues with Azure CosmosDB, which uses a more recent version of Jackson. This can be fixed by adding exclusions, and then it works without any issues, so there are no breaking changes in the APIs. I would also suggest keeping dependencies up to date, since otherwise this issue will pop up more frequently in the future. ## What changes were proposed in this pull request? Bump Jackson to 2.9.6. ## How was this patch tested? Compiled and tested locally to see if anything broke. Closes #21596 from Fokko/fd-bump-jackson. Authored-by: Fokko Driesprong Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ab1650d2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ab1650d2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ab1650d2 Branch: refs/heads/master Commit: ab1650d2938db4901b8c28df945d6a0691a19d31 Parents: 4597007 Author: Fokko Driesprong Authored: Fri Oct 5 16:40:08 2018 +0800 Committer: hyukjinkwon Committed: Fri Oct 5 16:40:08 2018 +0800 -- .../deploy/rest/SubmitRestProtocolMessage.scala | 2 +- .../apache/spark/rdd/RDDOperationScope.scala| 2 +- .../scala/org/apache/spark/status/KVUtils.scala | 2 +- .../status/api/v1/JacksonMessageWriter.scala| 2 +- .../org/apache/spark/status/api/v1/api.scala| 3 ++ dev/deps/spark-deps-hadoop-2.6 | 16 +- dev/deps/spark-deps-hadoop-2.7 | 16 +- dev/deps/spark-deps-hadoop-3.1 | 16 +- pom.xml | 7 ++--- .../expressions/JsonExpressionsSuite.scala | 7 + .../datasources/json/JsonBenchmarks.scala | 33 +++- 11 files changed, 59 insertions(+), 47 deletions(-) --
http://git-wip-us.apache.org/repos/asf/spark/blob/ab1650d2/core/src/main/scala/org/apache/spark/deploy/rest/SubmitRestProtocolMessage.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/rest/SubmitRestProtocolMessage.scala b/core/src/main/scala/org/apache/spark/deploy/rest/SubmitRestProtocolMessage.scala index ef5a7e3..97b689c 100644 --- a/core/src/main/scala/org/apache/spark/deploy/rest/SubmitRestProtocolMessage.scala +++ b/core/src/main/scala/org/apache/spark/deploy/rest/SubmitRestProtocolMessage.scala @@ -36,7 +36,7 @@ import org.apache.spark.util.Utils * (2) the Spark version of the client / server * (3) an optional message */ -@JsonInclude(Include.NON_NULL) +@JsonInclude(Include.NON_ABSENT) @JsonAutoDetect(getterVisibility = Visibility.ANY, setterVisibility = Visibility.ANY) @JsonPropertyOrder(alphabetic = true) private[rest] abstract class SubmitRestProtocolMessage { http://git-wip-us.apache.org/repos/asf/spark/blob/ab1650d2/core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala -- diff --git a/core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala b/core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala index 53d69ba..3abb2d8 100644 --- a/core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala +++ b/core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala @@ -41,7 +41,7 @@ import org.apache.spark.internal.Logging * There is no particular relationship between an operation scope and a stage or a job. * A scope may live inside one stage (e.g. map) or span across multiple jobs (e.g. take). 
*/ -@JsonInclude(Include.NON_NULL) +@JsonInclude(Include.NON_ABSENT) @JsonPropertyOrder(Array("id", "name", "parent")) private[spark] class RDDOperationScope( val name: String, http://git-wip-us.apache.org/repos/asf/spark/blob/ab1650d2/core/src/main/scala/org/apache/spark/status/KVUtils.scala -- diff --git a/core/src/main/scala/org/apache/spark/status/KVUtils.scala b/core/src/main/scala/org/apache/spark/status/KVUtils.scala index 99b1843..45348be 100644 --- a/core/src/main/scala/org/apache/spark/status/KVUtils.scala +++ b/core/src/main/scala/org/apache/spark/status/KVUtils.scala @@ -42,7 +42,7 @@ private[spark] object KVUtils extends Logging { private[spark] class KVStoreScalaSerializer extends KVStoreSerializer { mapper.registerModule(DefaultScalaModule) -
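The diff above swaps `Include.NON_NULL` for `Include.NON_ABSENT` in several annotations. The distinction these names draw is between a literal null and an "absent" wrapper value (such as an empty `Optional`, or a Scala `None` inside an `Option`): `NON_ABSENT` excludes both. A pure-Python analogy of that behavior, not Jackson's actual implementation — `ABSENT` and `serialize` are hypothetical names:

```python
# ABSENT stands in for an "absent" wrapper value (e.g. an empty Optional),
# as opposed to a literal null (None here).
ABSENT = object()

def serialize(record, include):
    if include == "NON_NULL":    # skips literal nulls only
        return {k: v for k, v in record.items() if v is not None}
    if include == "NON_ABSENT":  # skips nulls AND absent wrapper values
        return {k: v for k, v in record.items()
                if v is not None and v is not ABSENT}
    return dict(record)

rec = {"id": 1, "name": None, "parent": ABSENT}
assert serialize(rec, "NON_NULL") == {"id": 1, "parent": ABSENT}
assert serialize(rec, "NON_ABSENT") == {"id": 1}
```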
spark git commit: [SPARK-25659][PYTHON][TEST] Test type inference specification for createDataFrame in PySpark
Repository: spark Updated Branches: refs/heads/master f9935a3f8 -> f3fed2823 [SPARK-25659][PYTHON][TEST] Test type inference specification for createDataFrame in PySpark ## What changes were proposed in this pull request? This PR proposes to specify the type inference behavior and add simple end-to-end tests, since it looks like we are not cleanly testing this logic. For instance, see https://github.com/apache/spark/blob/08c76b5d39127ae207d9d1fff99c2551e6ce2581/python/pyspark/sql/types.py#L894-L905 It looks like we intended to support datetime.time and None in type inference too, but neither works: ``` >>> spark.createDataFrame([[datetime.time()]]) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/session.py", line 751, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, data), schema) File "/.../spark/python/pyspark/sql/session.py", line 432, in _createFromLocal data = [schema.toInternal(row) for row in data] File "/.../spark/python/pyspark/sql/types.py", line 604, in toInternal for f, v, c in zip(self.fields, obj, self._needConversion)) File "/.../spark/python/pyspark/sql/types.py", line 604, in <genexpr> for f, v, c in zip(self.fields, obj, self._needConversion)) File "/.../spark/python/pyspark/sql/types.py", line 442, in toInternal return self.dataType.toInternal(obj) File "/.../spark/python/pyspark/sql/types.py", line 193, in toInternal else time.mktime(dt.timetuple())) AttributeError: 'datetime.time' object has no attribute 'timetuple' >>> spark.createDataFrame([[None]]) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/session.py", line 751, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, data), schema) File "/.../spark/python/pyspark/sql/session.py", line 419, in _createFromLocal struct = self._inferSchemaFromList(data, names=schema) File "/.../python/pyspark/sql/session.py", line 353, in _inferSchemaFromList raise ValueError("Some of types cannot be determined after inferring") ValueError:
Some of types cannot be determined after inferring ``` ## How was this patch tested? Manual tests and unit tests were added. Closes #22653 from HyukjinKwon/SPARK-25659. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f3fed282 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f3fed282 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f3fed282 Branch: refs/heads/master Commit: f3fed28230e4e5e08d182715e8cf901daf8f3b73 Parents: f9935a3 Author: hyukjinkwon Authored: Tue Oct 9 07:45:02 2018 +0800 Committer: hyukjinkwon Committed: Tue Oct 9 07:45:02 2018 +0800 -- python/pyspark/sql/tests.py | 69 1 file changed, 69 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f3fed282/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index ac87ccd..85712df 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -1149,6 +1149,75 @@ class SQLTests(ReusedSQLTestCase): result = self.spark.sql("SELECT l[0].a from test2 where d['key'].d = '2'") self.assertEqual(1, result.head()[0]) +def test_infer_schema_specification(self): +from decimal import Decimal + +class A(object): +def __init__(self): +self.a = 1 + +data = [ +True, +1, +"a", +u"a", +datetime.date(1970, 1, 1), +datetime.datetime(1970, 1, 1, 0, 0), +1.0, +array.array("d", [1]), +[1], +(1, ), +{"a": 1}, +bytearray(1), +Decimal(1), +Row(a=1), +Row("a")(1), +A(), +] + +df = self.spark.createDataFrame([data]) +actual = list(map(lambda x: x.dataType.simpleString(), df.schema)) +expected = [ +'boolean', +'bigint', +'string', +'string', +'date', +'timestamp', +'double', +'array', +'array', +'struct<_1:bigint>', +'map', +'binary', +'decimal(38,18)', +'struct', +'struct', +'struct', +] +self.assertEqual(actual, expected) + +actual = list(df.first()) +expected = [ +True, +1, +'a', +u"a", +datetime.date(1970, 1, 1), 
+datetime.datetime(1970, 1, 1, 0, 0), +1.0, +[1.0],
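The expected schema strings in the test above follow PySpark's value-to-type mapping. A simplified pure-Python sketch of that mapping — not PySpark's actual `_infer_type`, and covering only the exact scalar types from the test:

```python
import datetime
import decimal

# Exact-type lookup: `type(v) is t` keeps datetime.datetime (a subclass of
# datetime.date) from accidentally matching 'date'.
_SIMPLE_TYPES = {
    bool: "boolean",
    int: "bigint",
    str: "string",
    float: "double",
    bytearray: "binary",
    datetime.date: "date",
    datetime.datetime: "timestamp",
    decimal.Decimal: "decimal(38,18)",
}

def infer_simple_type(value):
    for t, name in _SIMPLE_TYPES.items():
        if type(value) is t:
            return name
    raise TypeError("not a simple scalar: %r" % (value,))

assert infer_simple_type(True) == "boolean"
assert infer_simple_type(1.0) == "double"
assert infer_simple_type(datetime.date(1970, 1, 1)) == "date"
assert infer_simple_type(datetime.datetime(1970, 1, 1)) == "timestamp"
```

Containers (`array`, `list`, `tuple`, `dict`, `Row`, plain objects) recurse into element types in the real implementation, producing the `array<...>`, `map<...>`, and `struct<...>` strings the test expects.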
spark git commit: [SPARK-25669][SQL] Check CSV header only when it exists
Repository: spark Updated Branches: refs/heads/branch-2.4 4baa4d42a -> 404c84039 [SPARK-25669][SQL] Check CSV header only when it exists ## What changes were proposed in this pull request? Currently the first row of dataset of CSV strings is compared to field names of user specified or inferred schema independently of presence of CSV header. It causes false-positive error messages. For example, parsing `"1,2"` outputs the error: ```java java.lang.IllegalArgumentException: CSV header does not conform to the schema. Header: 1, 2 Schema: _c0, _c1 Expected: _c0 but found: 1 ``` In the PR, I propose: - Checking CSV header only when it exists - Filter header from the input dataset only if it exists ## How was this patch tested? Added a test to `CSVSuite` which reproduces the issue. Closes #22656 from MaxGekk/inferred-header-check. Authored-by: Maxim Gekk Signed-off-by: hyukjinkwon (cherry picked from commit 46fe40838aa682a7073dd6f1373518b0c8498a94) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/404c8403 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/404c8403 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/404c8403 Branch: refs/heads/branch-2.4 Commit: 404c840393086290cf975652f596b4768aa5d4eb Parents: 4baa4d4 Author: Maxim Gekk Authored: Tue Oct 9 14:35:00 2018 +0800 Committer: hyukjinkwon Committed: Tue Oct 9 14:36:33 2018 +0800 -- .../src/main/scala/org/apache/spark/sql/DataFrameReader.scala | 7 +-- .../apache/spark/sql/execution/datasources/csv/CSVSuite.scala | 6 ++ 2 files changed, 11 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/404c8403/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala index 27a1af2..869c584 100644 --- 
a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala @@ -505,7 +505,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { val actualSchema = StructType(schema.filterNot(_.name == parsedOptions.columnNameOfCorruptRecord)) -val linesWithoutHeader: RDD[String] = maybeFirstLine.map { firstLine => +val linesWithoutHeader = if (parsedOptions.headerFlag && maybeFirstLine.isDefined) { + val firstLine = maybeFirstLine.get val parser = new CsvParser(parsedOptions.asParserSettings) val columnNames = parser.parseLine(firstLine) CSVDataSource.checkHeaderColumnNames( @@ -515,7 +516,9 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { parsedOptions.enforceSchema, sparkSession.sessionState.conf.caseSensitiveAnalysis) filteredLines.rdd.mapPartitions(CSVUtils.filterHeaderLine(_, firstLine, parsedOptions)) -}.getOrElse(filteredLines.rdd) +} else { + filteredLines.rdd +} val parsed = linesWithoutHeader.mapPartitions { iter => val rawParser = new UnivocityParser(actualSchema, parsedOptions) http://git-wip-us.apache.org/repos/asf/spark/blob/404c8403/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala index f70df0b..5d4746c 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala @@ -1820,4 +1820,10 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te checkAnswer(spark.read.option("multiLine", true).schema(schema).csv(input), Row(null)) assert(spark.read.csv(input).collect().toSet == Set(Row())) } + + test("field names of inferred schema shouldn't compare 
to the first row") { +val input = Seq("1,2").toDS() +val df = spark.read.option("enforceSchema", false).csv(input) +checkAnswer(df, Row("1", "2")) + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25669][SQL] Check CSV header only when it exists
Repository: spark Updated Branches: refs/heads/master a4b14a9cf -> 46fe40838 [SPARK-25669][SQL] Check CSV header only when it exists ## What changes were proposed in this pull request? Currently the first row of dataset of CSV strings is compared to field names of user specified or inferred schema independently of presence of CSV header. It causes false-positive error messages. For example, parsing `"1,2"` outputs the error: ```java java.lang.IllegalArgumentException: CSV header does not conform to the schema. Header: 1, 2 Schema: _c0, _c1 Expected: _c0 but found: 1 ``` In the PR, I propose: - Checking CSV header only when it exists - Filter header from the input dataset only if it exists ## How was this patch tested? Added a test to `CSVSuite` which reproduces the issue. Closes #22656 from MaxGekk/inferred-header-check. Authored-by: Maxim Gekk Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/46fe4083 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/46fe4083 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/46fe4083 Branch: refs/heads/master Commit: 46fe40838aa682a7073dd6f1373518b0c8498a94 Parents: a4b14a9 Author: Maxim Gekk Authored: Tue Oct 9 14:35:00 2018 +0800 Committer: hyukjinkwon Committed: Tue Oct 9 14:35:00 2018 +0800 -- .../src/main/scala/org/apache/spark/sql/DataFrameReader.scala | 7 +-- .../apache/spark/sql/execution/datasources/csv/CSVSuite.scala | 6 ++ 2 files changed, 11 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/46fe4083/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala index fe69f25..7269446 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala +++ 
b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala @@ -505,7 +505,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { val actualSchema = StructType(schema.filterNot(_.name == parsedOptions.columnNameOfCorruptRecord)) -val linesWithoutHeader: RDD[String] = maybeFirstLine.map { firstLine => +val linesWithoutHeader = if (parsedOptions.headerFlag && maybeFirstLine.isDefined) { + val firstLine = maybeFirstLine.get val parser = new CsvParser(parsedOptions.asParserSettings) val columnNames = parser.parseLine(firstLine) CSVDataSource.checkHeaderColumnNames( @@ -515,7 +516,9 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { parsedOptions.enforceSchema, sparkSession.sessionState.conf.caseSensitiveAnalysis) filteredLines.rdd.mapPartitions(CSVUtils.filterHeaderLine(_, firstLine, parsedOptions)) -}.getOrElse(filteredLines.rdd) +} else { + filteredLines.rdd +} val parsed = linesWithoutHeader.mapPartitions { iter => val rawParser = new UnivocityParser(actualSchema, parsedOptions) http://git-wip-us.apache.org/repos/asf/spark/blob/46fe4083/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala index f70df0b..5d4746c 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala @@ -1820,4 +1820,10 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te checkAnswer(spark.read.option("multiLine", true).schema(schema).csv(input), Row(null)) assert(spark.read.csv(input).collect().toSet == Set(Row())) } + + test("field names of inferred schema shouldn't compare to the first row") { +val input = Seq("1,2").toDS() +val df = 
spark.read.option("enforceSchema", false).csv(input) +checkAnswer(df, Row("1", "2")) + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
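The control-flow change above is easier to see outside of Spark. The following is a minimal, pure-Python sketch of the fixed logic (function and variable names are illustrative, not Spark's actual API): the header row is validated against the schema, and filtered out of the input, only when the `header` option is enabled.

```python
def lines_without_header(lines, schema, header_flag, enforce_schema=True):
    """Validate and drop the CSV header only if one actually exists."""
    if header_flag and lines:
        header_cols = lines[0].split(",")
        if enforce_schema and header_cols != schema:
            # Mirrors the IllegalArgumentException shown in the PR description.
            raise ValueError(
                "CSV header does not conform to the schema. Header: %s Schema: %s"
                % (", ".join(header_cols), ", ".join(schema)))
        return lines[1:]  # drop the header row from the data
    # No header: nothing to check and nothing to filter -- this branch is the fix.
    return lines

# Before the fix, the check ran unconditionally, so headerless input "1,2" was
# compared against the inferred names _c0,_c1 and failed with a false positive.
assert lines_without_header(["1,2"], ["_c0", "_c1"], header_flag=False) == ["1,2"]
assert lines_without_header(["a,b", "1,2"], ["a", "b"], header_flag=True) == ["1,2"]
```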
spark git commit: [SPARK-23401][PYTHON][TESTS] Add more data types for PandasUDFTests
Repository: spark Updated Branches: refs/heads/branch-2.4 82990e5ef -> 426c2bd35 [SPARK-23401][PYTHON][TESTS] Add more data types for PandasUDFTests ## What changes were proposed in this pull request? Add more data types for Pandas UDF Tests for PySpark SQL ## How was this patch tested? manual tests Closes #22568 from AlexanderKoryagin/new_types_for_pandas_udf_tests. Lead-authored-by: Aleksandr Koriagin Co-authored-by: hyukjinkwon Co-authored-by: Alexander Koryagin Signed-off-by: hyukjinkwon (cherry picked from commit 30f5d0f2ddfe56266ea81e4255f9b4f373dab237) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/426c2bd3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/426c2bd3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/426c2bd3 Branch: refs/heads/branch-2.4 Commit: 426c2bd35937add1a26e77d2f2879f0e3f0c2f45 Parents: 82990e5 Author: Aleksandr Koriagin Authored: Mon Oct 1 17:18:45 2018 +0800 Committer: hyukjinkwon Committed: Mon Oct 1 17:19:00 2018 +0800 -- python/pyspark/sql/tests.py | 107 +-- 1 file changed, 79 insertions(+), 28 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/426c2bd3/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index dece1da..690035a 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -5478,32 +5478,81 @@ class GroupedMapPandasUDFTests(ReusedSQLTestCase): .withColumn("v", explode(col('vs'))).drop('vs') def test_supported_types(self): -from pyspark.sql.functions import pandas_udf, PandasUDFType, array, col -df = self.data.withColumn("arr", array(col("id"))) +from decimal import Decimal +from distutils.version import LooseVersion +import pyarrow as pa +from pyspark.sql.functions import pandas_udf, PandasUDFType -# Different forms of group map pandas UDF, results of these are the same +values = [ +1, 2, 3, +4, 5, 1.1, +2.2, Decimal(1.123), 
+[1, 2, 2], True, 'hello' +] +output_fields = [ +('id', IntegerType()), ('byte', ByteType()), ('short', ShortType()), +('int', IntegerType()), ('long', LongType()), ('float', FloatType()), +('double', DoubleType()), ('decim', DecimalType(10, 3)), +('array', ArrayType(IntegerType())), ('bool', BooleanType()), ('str', StringType()) +] -output_schema = StructType( -[StructField('id', LongType()), - StructField('v', IntegerType()), - StructField('arr', ArrayType(LongType())), - StructField('v1', DoubleType()), - StructField('v2', LongType())]) +# TODO: Add BinaryType to variables above once minimum pyarrow version is 0.10.0 +if LooseVersion(pa.__version__) >= LooseVersion("0.10.0"): +values.append(bytearray([0x01, 0x02])) +output_fields.append(('bin', BinaryType())) +output_schema = StructType([StructField(*x) for x in output_fields]) +df = self.spark.createDataFrame([values], schema=output_schema) + +# Different forms of group map pandas UDF, results of these are the same udf1 = pandas_udf( -lambda pdf: pdf.assign(v1=pdf.v * pdf.id * 1.0, v2=pdf.v + pdf.id), +lambda pdf: pdf.assign( +byte=pdf.byte * 2, +short=pdf.short * 2, +int=pdf.int * 2, +long=pdf.long * 2, +float=pdf.float * 2, +double=pdf.double * 2, +decim=pdf.decim * 2, +bool=False if pdf.bool else True, +str=pdf.str + 'there', +array=pdf.array, +), output_schema, PandasUDFType.GROUPED_MAP ) udf2 = pandas_udf( -lambda _, pdf: pdf.assign(v1=pdf.v * pdf.id * 1.0, v2=pdf.v + pdf.id), +lambda _, pdf: pdf.assign( +byte=pdf.byte * 2, +short=pdf.short * 2, +int=pdf.int * 2, +long=pdf.long * 2, +float=pdf.float * 2, +double=pdf.double * 2, +decim=pdf.decim * 2, +bool=False if pdf.bool else True, +str=pdf.str + 'there', +array=pdf.array, +), output_schema, PandasUDFType.GROUPED_MAP ) udf3 = pandas_udf( -lambda key, pdf: pdf.assign(id=key[0], v1=pdf.v * pdf.id * 1.0, v2=pdf.v + pdf.id), +lambda
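The version gate in the test above (only exercising `BinaryType` once the installed pyarrow is at least 0.10.0, per the TODO in the diff) is a reusable pattern. A small stand-alone sketch, with a simple tuple comparison standing in for `LooseVersion` and illustrative field names:

```python
def parse_version(v):
    """Parse 'X.Y.Z' into a comparable tuple -- a stand-in for LooseVersion."""
    return tuple(int(p) for p in v.split(".")[:3])

def build_test_columns(pyarrow_version):
    values = [1, 2.2, True, "hello"]
    fields = [("int", "int"), ("double", "double"),
              ("bool", "boolean"), ("str", "string")]
    # Mirror the TODO in the diff: add a BinaryType column only once the
    # installed pyarrow is at least 0.10.0.
    if parse_version(pyarrow_version) >= (0, 10, 0):
        values.append(bytearray([0x01, 0x02]))
        fields.append(("bin", "binary"))
    return values, fields

values, fields = build_test_columns("0.8.0")
assert ("bin", "binary") not in fields       # old pyarrow: binary column skipped
values, fields = build_test_columns("0.10.0")
assert fields[-1] == ("bin", "binary")       # new pyarrow: binary column exercised
```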
spark git commit: [SPARK-23401][PYTHON][TESTS] Add more data types for PandasUDFTests
Repository: spark Updated Branches: refs/heads/master 21f0b73db -> 30f5d0f2d [SPARK-23401][PYTHON][TESTS] Add more data types for PandasUDFTests ## What changes were proposed in this pull request? Add more data types for Pandas UDF Tests for PySpark SQL ## How was this patch tested? manual tests Closes #22568 from AlexanderKoryagin/new_types_for_pandas_udf_tests. Lead-authored-by: Aleksandr Koriagin Co-authored-by: hyukjinkwon Co-authored-by: Alexander Koryagin Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/30f5d0f2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/30f5d0f2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/30f5d0f2 Branch: refs/heads/master Commit: 30f5d0f2ddfe56266ea81e4255f9b4f373dab237 Parents: 21f0b73 Author: Aleksandr Koriagin Authored: Mon Oct 1 17:18:45 2018 +0800 Committer: hyukjinkwon Committed: Mon Oct 1 17:18:45 2018 +0800 -- python/pyspark/sql/tests.py | 107 +-- 1 file changed, 79 insertions(+), 28 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/30f5d0f2/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index b88a655..815772d 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -5525,32 +5525,81 @@ class GroupedMapPandasUDFTests(ReusedSQLTestCase): .withColumn("v", explode(col('vs'))).drop('vs') def test_supported_types(self): -from pyspark.sql.functions import pandas_udf, PandasUDFType, array, col -df = self.data.withColumn("arr", array(col("id"))) +from decimal import Decimal +from distutils.version import LooseVersion +import pyarrow as pa +from pyspark.sql.functions import pandas_udf, PandasUDFType -# Different forms of group map pandas UDF, results of these are the same +values = [ +1, 2, 3, +4, 5, 1.1, +2.2, Decimal(1.123), +[1, 2, 2], True, 'hello' +] +output_fields = [ +('id', IntegerType()), ('byte', ByteType()), ('short', 
ShortType()), +('int', IntegerType()), ('long', LongType()), ('float', FloatType()), +('double', DoubleType()), ('decim', DecimalType(10, 3)), +('array', ArrayType(IntegerType())), ('bool', BooleanType()), ('str', StringType()) +] -output_schema = StructType( -[StructField('id', LongType()), - StructField('v', IntegerType()), - StructField('arr', ArrayType(LongType())), - StructField('v1', DoubleType()), - StructField('v2', LongType())]) +# TODO: Add BinaryType to variables above once minimum pyarrow version is 0.10.0 +if LooseVersion(pa.__version__) >= LooseVersion("0.10.0"): +values.append(bytearray([0x01, 0x02])) +output_fields.append(('bin', BinaryType())) +output_schema = StructType([StructField(*x) for x in output_fields]) +df = self.spark.createDataFrame([values], schema=output_schema) + +# Different forms of group map pandas UDF, results of these are the same udf1 = pandas_udf( -lambda pdf: pdf.assign(v1=pdf.v * pdf.id * 1.0, v2=pdf.v + pdf.id), +lambda pdf: pdf.assign( +byte=pdf.byte * 2, +short=pdf.short * 2, +int=pdf.int * 2, +long=pdf.long * 2, +float=pdf.float * 2, +double=pdf.double * 2, +decim=pdf.decim * 2, +bool=False if pdf.bool else True, +str=pdf.str + 'there', +array=pdf.array, +), output_schema, PandasUDFType.GROUPED_MAP ) udf2 = pandas_udf( -lambda _, pdf: pdf.assign(v1=pdf.v * pdf.id * 1.0, v2=pdf.v + pdf.id), +lambda _, pdf: pdf.assign( +byte=pdf.byte * 2, +short=pdf.short * 2, +int=pdf.int * 2, +long=pdf.long * 2, +float=pdf.float * 2, +double=pdf.double * 2, +decim=pdf.decim * 2, +bool=False if pdf.bool else True, +str=pdf.str + 'there', +array=pdf.array, +), output_schema, PandasUDFType.GROUPED_MAP ) udf3 = pandas_udf( -lambda key, pdf: pdf.assign(id=key[0], v1=pdf.v * pdf.id * 1.0, v2=pdf.v + pdf.id), +lambda key, pdf: pdf.assign( +id=key[0], +byte=pdf.byte * 2, +
spark git commit: [SPARK-25048][SQL] Pivoting by multiple columns in Scala/Java
Repository: spark Updated Branches: refs/heads/master dcb9a97f3 -> 623c2ec4e [SPARK-25048][SQL] Pivoting by multiple columns in Scala/Java ## What changes were proposed in this pull request? In the PR, I propose to extend the implementation of the existing method: ``` def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset ``` to support values of the struct type. This allows pivoting by multiple columns combined by `struct`: ``` trainingSales .groupBy($"sales.year") .pivot( pivotColumn = struct(lower($"sales.course"), $"training"), values = Seq( struct(lit("dotnet"), lit("Experts")), struct(lit("java"), lit("Dummies"))) ).agg(sum($"sales.earnings")) ``` ## How was this patch tested? Added a test for values specified via `struct` in Java and Scala. Closes #22316 from MaxGekk/pivoting-by-multiple-columns2. Lead-authored-by: Maxim Gekk Co-authored-by: Maxim Gekk Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/623c2ec4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/623c2ec4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/623c2ec4 Branch: refs/heads/master Commit: 623c2ec4ef3776bc5e2cac2c66300ddc6264db54 Parents: dcb9a97 Author: Maxim Gekk Authored: Sat Sep 29 21:50:35 2018 +0800 Committer: hyukjinkwon Committed: Sat Sep 29 21:50:35 2018 +0800 -- .../spark/sql/RelationalGroupedDataset.scala| 17 +-- .../apache/spark/sql/JavaDataFrameSuite.java| 16 ++ .../apache/spark/sql/DataFramePivotSuite.scala | 23 3 files changed, 54 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/623c2ec4/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala index d700fb8..dbacdbf 100644 --- 
a/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala @@ -330,6 +330,15 @@ class RelationalGroupedDataset protected[sql]( * df.groupBy("year").pivot("course").sum("earnings") * }}} * + * From Spark 2.5.0, values can be literal columns, for instance, struct. For pivoting by + * multiple columns, use the `struct` function to combine the columns and values: + * + * {{{ + * df.groupBy("year") + * .pivot("trainingCourse", Seq(struct(lit("java"), lit("Experts" + * .agg(sum($"earnings")) + * }}} + * * @param pivotColumn Name of the column to pivot. * @param values List of values that will be translated to columns in the output DataFrame. * @since 1.6.0 @@ -413,10 +422,14 @@ class RelationalGroupedDataset protected[sql]( def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset = { groupType match { case RelationalGroupedDataset.GroupByType => +val valueExprs = values.map(_ match { + case c: Column => c.expr + case v => Literal.apply(v) +}) new RelationalGroupedDataset( df, groupingExprs, - RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(Literal.apply))) + RelationalGroupedDataset.PivotType(pivotColumn.expr, valueExprs)) case _: RelationalGroupedDataset.PivotType => throw new UnsupportedOperationException("repeated pivots are not supported") case _ => @@ -561,5 +574,5 @@ private[sql] object RelationalGroupedDataset { /** * To indicate it's the PIVOT */ - private[sql] case class PivotType(pivotCol: Expression, values: Seq[Literal]) extends GroupType + private[sql] case class PivotType(pivotCol: Expression, values: Seq[Expression]) extends GroupType } http://git-wip-us.apache.org/repos/asf/spark/blob/623c2ec4/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java -- diff --git a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java 
b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java index 3f37e58..00f41d6 100644 --- a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java +++ b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java @@ -317,6 +317,22 @@ public class JavaDataFrameSuite { Assert.assertEquals(3.0, actual.get(1).getDouble(2), 0.01); } + @Test + public void pivotColumnValues() { +Dataset df = spark.table("courseSales"); +List actual =
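Conceptually, pivoting by a `struct` of several columns means the pivot key is a tuple rather than a single value. A rough pure-Python sketch of those semantics, with toy data modeled on the `trainingSales` example from the commit message (not Spark code):

```python
rows = [
    ("dotnet", "Experts", 2012, 10000.0),
    ("java",   "Dummies", 2012, 20000.0),
    ("dotnet", "Experts", 2013, 48000.0),
]
# Composite pivot values, analogous to Seq(struct(lit("dotnet"), lit("Experts")), ...)
pivot_values = [("dotnet", "Experts"), ("java", "Dummies")]

table = {}
for course, training, year, earnings in rows:
    key = (course.lower(), training)   # like struct(lower($"sales.course"), $"training")
    if key in pivot_values:            # only requested pivot values become columns
        row = table.setdefault(year, {v: 0.0 for v in pivot_values})
        row[key] += earnings           # the sum($"sales.earnings") aggregate

assert table[2012][("dotnet", "Experts")] == 10000.0
assert table[2012][("java", "Dummies")] == 20000.0
```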
spark git commit: [SPARK-25262][DOC][FOLLOWUP] Fix link tags in html table
Repository: spark Updated Branches: refs/heads/branch-2.4 ec2c17abf -> a14306b1d [SPARK-25262][DOC][FOLLOWUP] Fix link tags in html table ## What changes were proposed in this pull request? Markdown links are not working inside an HTML table. We should use HTML link tags instead. ## How was this patch tested? Verified in IntelliJ IDEA's markdown editor and online markdown editor. Closes #22588 from viirya/SPARK-25262-followup. Authored-by: Liang-Chi Hsieh Signed-off-by: hyukjinkwon (cherry picked from commit dcb9a97f3e16d4645529ac619c3197fcba1c9806) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a14306b1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a14306b1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a14306b1 Branch: refs/heads/branch-2.4 Commit: a14306b1d5a135cff0441c1c953032d0c6a51c47 Parents: ec2c17a Author: Liang-Chi Hsieh Authored: Sat Sep 29 18:18:37 2018 +0800 Committer: hyukjinkwon Committed: Sat Sep 29 18:18:52 2018 +0800 -- docs/running-on-kubernetes.md | 8 1 file changed, 4 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a14306b1/docs/running-on-kubernetes.md -- diff --git a/docs/running-on-kubernetes.md b/docs/running-on-kubernetes.md index fc7c9a5..f19aa41 100644 --- a/docs/running-on-kubernetes.md +++ b/docs/running-on-kubernetes.md @@ -667,15 +667,15 @@ specific to Spark on Kubernetes. spark.kubernetes.driver.limit.cores (none) -Specify a hard cpu [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for the driver pod. +Specify a hard cpu <a href="https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container">limit</a> for the driver pod. spark.kubernetes.executor.request.cores (none) -Specify the cpu request for each executor pod.
Values conform to the Kubernetes [convention](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu). -Example values include 0.1, 500m, 1.5, 5, etc., with the definition of cpu units documented in [CPU units](https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/#cpu-units). +Specify the cpu request for each executor pod. Values conform to the Kubernetes <a href="https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu">convention</a>. +Example values include 0.1, 500m, 1.5, 5, etc., with the definition of cpu units documented in <a href="https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/#cpu-units">CPU units</a>. This is distinct from spark.executor.cores: it is only used and takes precedence over spark.executor.cores for specifying the executor pod cpu request if set. Task parallelism, e.g., number of tasks an executor can run concurrently is not affected by this. @@ -684,7 +684,7 @@ specific to Spark on Kubernetes. spark.kubernetes.executor.limit.cores (none) -Specify a hard cpu [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for each executor pod launched for the Spark Application. +Specify a hard cpu <a href="https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container">limit</a> for each executor pod launched for the Spark Application.
spark git commit: [SPARK-25447][SQL] Support JSON options by schema_of_json()
Repository: spark Updated Branches: refs/heads/master 1e437835e -> 1007cae20 [SPARK-25447][SQL] Support JSON options by schema_of_json() ## What changes were proposed in this pull request? In the PR, I propose to extend the `schema_of_json()` function to accept JSON options, since they can impact schema inference. The purpose is to support the same options that `from_json` can use during schema inference. ## How was this patch tested? Added SQL, Python and Scala tests (`JsonExpressionsSuite` and `JsonFunctionsSuite`) that check JSON options are used. Closes #22442 from MaxGekk/schema_of_json-options. Authored-by: Maxim Gekk Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1007cae2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1007cae2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1007cae2 Branch: refs/heads/master Commit: 1007cae20e8f566e7d7c25f0f81c9b84f352b6d5 Parents: 1e43783 Author: Maxim Gekk Authored: Sat Sep 29 17:53:30 2018 +0800 Committer: hyukjinkwon Committed: Sat Sep 29 17:53:30 2018 +0800 -- python/pyspark/sql/functions.py | 11 ++-- .../catalyst/expressions/jsonExpressions.scala | 28 +++- .../expressions/JsonExpressionsSuite.scala | 12 +++-- .../scala/org/apache/spark/sql/functions.scala | 15 +++ .../sql-tests/inputs/json-functions.sql | 4 +++ .../sql-tests/results/json-functions.sql.out| 18 - .../apache/spark/sql/JsonFunctionsSuite.scala | 8 ++ 7 files changed, 85 insertions(+), 11 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1007cae2/python/pyspark/sql/functions.py -- diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index e5bc1ea..74f0685 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -2348,11 +2348,15 @@ def to_json(col, options={}): @ignore_unicode_prefix @since(2.4) -def schema_of_json(col): +def schema_of_json(col, options={}): """ Parses a 
column containing a JSON string and infers its schema in DDL format. :param col: string column in json format +:param options: options to control parsing. accepts the same options as the JSON datasource + +.. versionchanged:: 2.5 + It accepts `options` parameter to control schema inferring. >>> from pyspark.sql.types import * >>> data = [(1, '{"a": 1}')] @@ -2361,10 +2365,13 @@ def schema_of_json(col): [Row(json=u'struct')] >>> df.select(schema_of_json(lit('{"a": 0}')).alias("json")).collect() [Row(json=u'struct')] +>>> schema = schema_of_json(lit('{a: 1}'), {'allowUnquotedFieldNames':'true'}) +>>> df.select(schema.alias("json")).collect() +[Row(json=u'struct')] """ sc = SparkContext._active_spark_context -jc = sc._jvm.functions.schema_of_json(_to_java_column(col)) +jc = sc._jvm.functions.schema_of_json(_to_java_column(col), options) return Column(jc) http://git-wip-us.apache.org/repos/asf/spark/blob/1007cae2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala index bd9090a..f5297dd 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala @@ -740,15 +740,31 @@ case class StructsToJson( examples = """ Examples: > SELECT _FUNC_('[{"col":0}]'); - array> + array> + > SELECT _FUNC_('[{"col":01}]', map('allowNumericLeadingZeros', 'true')); + array> """, since = "2.4.0") -case class SchemaOfJson(child: Expression) +case class SchemaOfJson( +child: Expression, +options: Map[String, String]) extends UnaryExpression with String2StringExpression with CodegenFallback { - private val jsonOptions = new JSONOptions(Map.empty, "UTC") - private val jsonFactory = new JsonFactory() - 
jsonOptions.setJacksonOptions(jsonFactory) + def this(child: Expression) = this(child, Map.empty[String, String]) + + def this(child: Expression, options: Expression) = this( + child = child, + options = JsonExprUtils.convertToMapData(options)) + + @transient + private lazy val jsonOptions = new JSONOptions(options, "UTC") + + @transient + private lazy val jsonFactory = { +val factory = new JsonFactory() +
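As a rough illustration of what `schema_of_json` computes, here is a toy, stdlib-only analogue that infers a DDL-style struct schema from one flat JSON record. The real function additionally threads the `options` map into the Jackson parser before inference (e.g. `allowUnquotedFieldNames`), which the stdlib `json` module cannot emulate, so the parameter is accepted but unused in this sketch:

```python
import json

# Map Python-side JSON types to Spark SQL DDL type names (JSON integers
# are inferred as bigint by Spark's schema inference).
TYPE_NAMES = {bool: "boolean", int: "bigint", float: "double", str: "string"}

def schema_of_json(s, options=None):
    """Toy schema inference for a single flat JSON record (options unused here)."""
    obj = json.loads(s)  # Spark parses with a Jackson parser configured by `options`
    fields = ",".join("%s:%s" % (k, TYPE_NAMES[type(v)]) for k, v in obj.items())
    return "struct<%s>" % fields

assert schema_of_json('{"a": 1}') == "struct<a:bigint>"
assert schema_of_json('{"a": 0.5, "b": "x"}') == "struct<a:double,b:string>"
```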
spark git commit: [SPARK-25262][DOC][FOLLOWUP] Fix link tags in html table
Repository: spark Updated Branches: refs/heads/master 1007cae20 -> dcb9a97f3 [SPARK-25262][DOC][FOLLOWUP] Fix link tags in html table ## What changes were proposed in this pull request? Markdown links are not working inside an HTML table. We should use HTML link tags instead. ## How was this patch tested? Verified in IntelliJ IDEA's markdown editor and online markdown editor. Closes #22588 from viirya/SPARK-25262-followup. Authored-by: Liang-Chi Hsieh Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dcb9a97f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dcb9a97f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dcb9a97f Branch: refs/heads/master Commit: dcb9a97f3e16d4645529ac619c3197fcba1c9806 Parents: 1007cae Author: Liang-Chi Hsieh Authored: Sat Sep 29 18:18:37 2018 +0800 Committer: hyukjinkwon Committed: Sat Sep 29 18:18:37 2018 +0800 -- docs/running-on-kubernetes.md | 8 1 file changed, 4 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dcb9a97f/docs/running-on-kubernetes.md -- diff --git a/docs/running-on-kubernetes.md b/docs/running-on-kubernetes.md index c7aea27..b4088d7 100644 --- a/docs/running-on-kubernetes.md +++ b/docs/running-on-kubernetes.md @@ -680,15 +680,15 @@ specific to Spark on Kubernetes. spark.kubernetes.driver.limit.cores (none) -Specify a hard cpu [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for the driver pod. +Specify a hard cpu <a href="https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container">limit</a> for the driver pod. spark.kubernetes.executor.request.cores (none) -Specify the cpu request for each executor pod.
Values conform to the Kubernetes [convention](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu). -Example values include 0.1, 500m, 1.5, 5, etc., with the definition of cpu units documented in [CPU units](https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/#cpu-units). +Specify the cpu request for each executor pod. Values conform to the Kubernetes <a href="https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu">convention</a>. +Example values include 0.1, 500m, 1.5, 5, etc., with the definition of cpu units documented in <a href="https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/#cpu-units">CPU units</a>. This is distinct from spark.executor.cores: it is only used and takes precedence over spark.executor.cores for specifying the executor pod cpu request if set. Task parallelism, e.g., number of tasks an executor can run concurrently is not affected by this. @@ -697,7 +697,7 @@ specific to Spark on Kubernetes. spark.kubernetes.executor.limit.cores (none) -Specify a hard cpu [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for each executor pod launched for the Spark Application. +Specify a hard cpu <a href="https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container">limit</a> for each executor pod launched for the Spark Application.
spark git commit: [SPARK-25565][BUILD] Add scalastyle rule to check add Locale.ROOT to .toLowerCase and .toUpperCase for internal calls
Repository: spark Updated Branches: refs/heads/master b6b8a6632 -> a2f502cf5 [SPARK-25565][BUILD] Add scalastyle rule to check add Locale.ROOT to .toLowerCase and .toUpperCase for internal calls ## What changes were proposed in this pull request? This PR adds a rule to force `.toLowerCase(Locale.ROOT)` or `toUpperCase(Locale.ROOT)`. It produces an error as below: ``` [error] Are you sure that you want to use toUpperCase or toLowerCase without the root locale? In most cases, you [error] should use toUpperCase(Locale.ROOT) or toLowerCase(Locale.ROOT) instead. [error] If you must use toUpperCase or toLowerCase without the root locale, wrap the code block with [error] // scalastyle:off caselocale [error] .toUpperCase [error] .toLowerCase [error] // scalastyle:on caselocale ``` This PR excludes the cases above for SQL code path for external calls like table name, column name and etc. For test suites, or when it's clear there's no locale problem like Turkish locale problem, it uses `Locale.ROOT`. One minor problem is, `UTF8String` has both methods, `toLowerCase` and `toUpperCase`, and the new rule detects them as well. They are ignored. ## How was this patch tested? Manually tested, and Jenkins tests. Closes #22581 from HyukjinKwon/SPARK-25565. 
Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a2f502cf Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a2f502cf Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a2f502cf Branch: refs/heads/master Commit: a2f502cf53b6b00af7cb80b6f38e64cf46367595 Parents: b6b8a66 Author: hyukjinkwon Authored: Sun Sep 30 14:31:04 2018 +0800 Committer: hyukjinkwon Committed: Sun Sep 30 14:31:04 2018 +0800 -- .../types/UTF8StringPropertyCheckSuite.scala| 2 ++ .../apache/spark/metrics/sink/StatsdSink.scala | 5 ++-- .../apache/spark/rdd/OrderedRDDFunctions.scala | 3 +- .../scala/org/apache/spark/util/Utils.scala | 2 +- .../deploy/history/FsHistoryProviderSuite.scala | 4 +-- .../spark/ml/feature/StopWordsRemover.scala | 2 ++ .../org/apache/spark/ml/feature/Tokenizer.scala | 4 +++ .../submit/KubernetesClientApplication.scala| 4 +-- .../cluster/k8s/ExecutorPodsSnapshot.scala | 4 ++- .../deploy/mesos/MesosClusterDispatcher.scala | 3 +- scalastyle-config.xml | 13 + .../analysis/higherOrderFunctions.scala | 2 ++ .../expressions/stringExpressions.scala | 6 .../spark/sql/catalyst/parser/AstBuilder.scala | 2 ++ .../spark/sql/catalyst/util/StringUtils.scala | 2 ++ .../org/apache/spark/sql/internal/SQLConf.scala | 3 +- .../org/apache/spark/sql/util/SchemaUtils.scala | 2 ++ .../spark/sql/util/SchemaUtilsSuite.scala | 4 ++- .../InsertIntoHadoopFsRelationCommand.scala | 2 ++ .../datasources/csv/CSVDataSource.scala | 6 .../execution/streaming/WatermarkTracker.scala | 4 ++- .../state/SymmetricHashJoinStateManager.scala | 4 ++- .../spark/sql/ColumnExpressionSuite.scala | 4 +-- .../apache/spark/sql/DataFramePivotSuite.scala | 4 ++- .../scala/org/apache/spark/sql/JoinSuite.scala | 4 ++- .../sql/streaming/EventTimeWatermarkSuite.scala | 4 +-- .../spark/sql/hive/HiveExternalCatalog.scala| 16 +++ .../spark/sql/hive/HiveMetastoreCatalog.scala | 4 +++ 
.../spark/sql/hive/CompressionCodecSuite.scala | 29 .../sql/hive/HiveSchemaInferenceSuite.scala | 9 +++--- .../apache/spark/sql/hive/StatisticsSuite.scala | 15 ++ 31 files changed, 132 insertions(+), 40 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a2f502cf/common/unsafe/src/test/scala/org/apache/spark/unsafe/types/UTF8StringPropertyCheckSuite.scala -- diff --git a/common/unsafe/src/test/scala/org/apache/spark/unsafe/types/UTF8StringPropertyCheckSuite.scala b/common/unsafe/src/test/scala/org/apache/spark/unsafe/types/UTF8StringPropertyCheckSuite.scala index 7d3331f..9656951 100644 --- a/common/unsafe/src/test/scala/org/apache/spark/unsafe/types/UTF8StringPropertyCheckSuite.scala +++ b/common/unsafe/src/test/scala/org/apache/spark/unsafe/types/UTF8StringPropertyCheckSuite.scala @@ -63,6 +63,7 @@ class UTF8StringPropertyCheckSuite extends FunSuite with GeneratorDrivenProperty } } + // scalastyle:off caselocale test("toUpperCase") { forAll { (s: String) => assert(toUTF8(s).toUpperCase === toUTF8(s.toUpperCase)) @@ -74,6 +75,7 @@ class UTF8StringPropertyCheckSuite extends FunSuite with GeneratorDrivenProperty assert(toUTF8(s).toLowerCase === toUTF8(s.toLowerCase)) } } + //
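The pitfall the new scalastyle rule guards against is easiest to see with Turkish casing: under a Turkish locale, Java's `String.toLowerCase()` maps 'I' to dotless 'ı' (U+0131), so case-normalized comparisons of internal identifiers silently break, while `Locale.ROOT` pins the stable mapping. A toy Python simulation of the two behaviors (Python's own `str.lower` is locale-independent, so the Turkish mapping is spelled out by hand):

```python
def to_lower_turkish(s):
    """Simulate java.lang.String.toLowerCase(new Locale("tr")) for ASCII input."""
    return s.replace("I", "\u0131").replace("\u0130", "i").lower()

def to_lower_root(s):
    """Simulate toLowerCase(Locale.ROOT): the stable, locale-independent mapping."""
    return s.lower()

assert to_lower_root("INFO") == "info"          # what internal comparisons expect
assert to_lower_turkish("INFO") == "\u0131nfo"  # 'I' -> 'ı': no longer equals "info"
assert to_lower_turkish("INFO") != "info"
```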
spark git commit: [SPARK-25601][PYTHON] Register Grouped aggregate UDF Vectorized UDFs for SQL Statement
Repository: spark Updated Branches: refs/heads/master 79dd4c964 -> 927e52793 [SPARK-25601][PYTHON] Register Grouped aggregate UDF Vectorized UDFs for SQL Statement ## What changes were proposed in this pull request? This PR proposes to register Grouped aggregate UDF Vectorized UDFs for SQL Statement, for instance: ```python from pyspark.sql.functions import pandas_udf, PandasUDFType @pandas_udf("integer", PandasUDFType.GROUPED_AGG) def sum_udf(v): return v.sum() spark.udf.register("sum_udf", sum_udf) q = "SELECT v2, sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2" spark.sql(q).show() ``` ``` +---+---+ | v2|sum_udf(v1)| +---+---+ | 1| 1| | 0| 5| +---+---+ ``` ## How was this patch tested? Manual test and unit test. Closes #22620 from HyukjinKwon/SPARK-25601. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/927e5279 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/927e5279 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/927e5279 Branch: refs/heads/master Commit: 927e527934a882fab89ca661c4eb31f84c45d830 Parents: 79dd4c9 Author: hyukjinkwon Authored: Thu Oct 4 09:38:06 2018 +0800 Committer: hyukjinkwon Committed: Thu Oct 4 09:38:06 2018 +0800 -- --
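Stripped of Spark machinery, the feature is: an aggregate function registered under a name, then applied per group from SQL. A plain-Python sketch of those semantics using the same data as the example above (`register`/`registry` are illustrative names, not PySpark API):

```python
from collections import defaultdict

registry = {}

def register(name, fn):
    """Toy analogue of spark.udf.register(name, fn)."""
    registry[name] = fn
    return fn

register("sum_udf", lambda values: sum(values))

# SELECT v2, sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2
rows = [(3, 0), (2, 0), (1, 1)]  # (v1, v2)
groups = defaultdict(list)
for v1, v2 in rows:
    groups[v2].append(v1)        # GROUP BY v2

result = {v2: registry["sum_udf"](v1s) for v2, v1s in groups.items()}
assert result == {0: 5, 1: 1}    # matches the result table in the commit message
```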
spark git commit: [SPARK-25601][PYTHON] Register Grouped aggregate UDF Vectorized UDFs for SQL Statement
Repository: spark Updated Branches: refs/heads/master 075dd620e -> 79dd4c964 [SPARK-25601][PYTHON] Register Grouped aggregate UDF Vectorized UDFs for SQL Statement ## What changes were proposed in this pull request? This PR proposes to register Grouped aggregate UDF Vectorized UDFs for SQL Statement, for instance: ```python from pyspark.sql.functions import pandas_udf, PandasUDFType @pandas_udf("integer", PandasUDFType.GROUPED_AGG) def sum_udf(v): return v.sum() spark.udf.register("sum_udf", sum_udf) q = "SELECT v2, sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2" spark.sql(q).show() ``` ``` +---+---+ | v2|sum_udf(v1)| +---+---+ | 1| 1| | 0| 5| +---+---+ ``` ## How was this patch tested? Manual test and unit test. Closes #22620 from HyukjinKwon/SPARK-25601. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/79dd4c96 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/79dd4c96 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/79dd4c96 Branch: refs/heads/master Commit: 79dd4c96484c9be7ad9250b64f3fd8e088707641 Parents: 075dd62 Author: hyukjinkwon Authored: Thu Oct 4 09:36:23 2018 +0800 Committer: hyukjinkwon Committed: Thu Oct 4 09:36:23 2018 +0800 -- python/pyspark/sql/tests.py | 20 ++-- python/pyspark/sql/udf.py | 15 +-- 2 files changed, 31 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/79dd4c96/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index 815772d..d3c29d0 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -5642,8 +5642,9 @@ class GroupedMapPandasUDFTests(ReusedSQLTestCase): foo_udf = pandas_udf(lambda x: x, "id long", PandasUDFType.GROUPED_MAP) with QuietTest(self.sc): -with self.assertRaisesRegexp(ValueError, 'f must be either SQL_BATCHED_UDF or ' - 'SQL_SCALAR_PANDAS_UDF'): +with 
self.assertRaisesRegexp( +ValueError, + 'f.*SQL_BATCHED_UDF.*SQL_SCALAR_PANDAS_UDF.*SQL_GROUPED_AGG_PANDAS_UDF.*'): self.spark.catalog.registerFunction("foo_udf", foo_udf) def test_decorator(self): @@ -6459,6 +6460,21 @@ class GroupedAggPandasUDFTests(ReusedSQLTestCase): 'mixture.*aggregate function.*group aggregate pandas UDF'): df.groupby(df.id).agg(mean_udf(df.v), mean(df.v)).collect() +def test_register_vectorized_udf_basic(self): +from pyspark.sql.functions import pandas_udf +from pyspark.rdd import PythonEvalType + +sum_pandas_udf = pandas_udf( +lambda v: v.sum(), "integer", PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF) + +self.assertEqual(sum_pandas_udf.evalType, PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF) +group_agg_pandas_udf = self.spark.udf.register("sum_pandas_udf", sum_pandas_udf) +self.assertEqual(group_agg_pandas_udf.evalType, PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF) +q = "SELECT sum_pandas_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2" +actual = sorted(map(lambda r: r[0], self.spark.sql(q).collect())) +expected = [1, 5] +self.assertEqual(actual, expected) + @unittest.skipIf( not _have_pandas or not _have_pyarrow, http://git-wip-us.apache.org/repos/asf/spark/blob/79dd4c96/python/pyspark/sql/udf.py -- diff --git a/python/pyspark/sql/udf.py b/python/pyspark/sql/udf.py index 9dbe49b..58f4e0d 100644 --- a/python/pyspark/sql/udf.py +++ b/python/pyspark/sql/udf.py @@ -298,6 +298,15 @@ class UDFRegistration(object): >>> spark.sql("SELECT add_one(id) FROM range(3)").collect() # doctest: +SKIP [Row(add_one(id)=1), Row(add_one(id)=2), Row(add_one(id)=3)] +>>> @pandas_udf("integer", PandasUDFType.GROUPED_AGG) # doctest: +SKIP +... def sum_udf(v): +... return v.sum() +... +>>> _ = spark.udf.register("sum_udf", sum_udf) # doctest: +SKIP +>>> q = "SELECT sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2" +>>> spark.sql(q).collect() # doctest: +SKIP +[Row(sum_udf(v1)=1), Row(sum_udf(v1)=5)] + .. 
note:: Registration for a user-defined function (case 2.) was added from Spark 2.3.0. """ @@ -310,9 +319,11 @@ class UDFRegistration(object): "Invalid
spark git commit: [SPARK-25601][PYTHON] Register Grouped aggregate UDF Vectorized UDFs for SQL Statement
Repository: spark Updated Branches: refs/heads/branch-2.4 443d12dbb -> 0763b758d [SPARK-25601][PYTHON] Register Grouped aggregate UDF Vectorized UDFs for SQL Statement ## What changes were proposed in this pull request? This PR proposes to register Grouped aggregate UDF Vectorized UDFs for SQL Statement, for instance: ```python from pyspark.sql.functions import pandas_udf, PandasUDFType pandas_udf("integer", PandasUDFType.GROUPED_AGG) def sum_udf(v): return v.sum() spark.udf.register("sum_udf", sum_udf) q = "SELECT v2, sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2" spark.sql(q).show() ``` ``` +---+---+ | v2|sum_udf(v1)| +---+---+ | 1| 1| | 0| 5| +---+---+ ``` ## How was this patch tested? Manual test and unit test. Closes #22620 from HyukjinKwon/SPARK-25601. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0763b758 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0763b758 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0763b758 Branch: refs/heads/branch-2.4 Commit: 0763b758de55fd14d7da4832d01b5713e582b257 Parents: 443d12d Author: hyukjinkwon Authored: Thu Oct 4 09:36:23 2018 +0800 Committer: hyukjinkwon Committed: Thu Oct 4 09:43:42 2018 +0800 -- python/pyspark/sql/tests.py | 20 ++-- python/pyspark/sql/udf.py | 15 +-- 2 files changed, 31 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0763b758/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index 690035a..e991032 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -5595,8 +5595,9 @@ class GroupedMapPandasUDFTests(ReusedSQLTestCase): foo_udf = pandas_udf(lambda x: x, "id long", PandasUDFType.GROUPED_MAP) with QuietTest(self.sc): -with self.assertRaisesRegexp(ValueError, 'f must be either SQL_BATCHED_UDF or ' - 'SQL_SCALAR_PANDAS_UDF'): +with 
self.assertRaisesRegexp( +ValueError, + 'f.*SQL_BATCHED_UDF.*SQL_SCALAR_PANDAS_UDF.*SQL_GROUPED_AGG_PANDAS_UDF.*'): self.spark.catalog.registerFunction("foo_udf", foo_udf) def test_decorator(self): @@ -6412,6 +6413,21 @@ class GroupedAggPandasUDFTests(ReusedSQLTestCase): 'mixture.*aggregate function.*group aggregate pandas UDF'): df.groupby(df.id).agg(mean_udf(df.v), mean(df.v)).collect() +def test_register_vectorized_udf_basic(self): +from pyspark.sql.functions import pandas_udf +from pyspark.rdd import PythonEvalType + +sum_pandas_udf = pandas_udf( +lambda v: v.sum(), "integer", PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF) + +self.assertEqual(sum_pandas_udf.evalType, PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF) +group_agg_pandas_udf = self.spark.udf.register("sum_pandas_udf", sum_pandas_udf) +self.assertEqual(group_agg_pandas_udf.evalType, PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF) +q = "SELECT sum_pandas_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2" +actual = sorted(map(lambda r: r[0], self.spark.sql(q).collect())) +expected = [1, 5] +self.assertEqual(actual, expected) + @unittest.skipIf( not _have_pandas or not _have_pyarrow, http://git-wip-us.apache.org/repos/asf/spark/blob/0763b758/python/pyspark/sql/udf.py -- diff --git a/python/pyspark/sql/udf.py b/python/pyspark/sql/udf.py index 9dbe49b..58f4e0d 100644 --- a/python/pyspark/sql/udf.py +++ b/python/pyspark/sql/udf.py @@ -298,6 +298,15 @@ class UDFRegistration(object): >>> spark.sql("SELECT add_one(id) FROM range(3)").collect() # doctest: +SKIP [Row(add_one(id)=1), Row(add_one(id)=2), Row(add_one(id)=3)] +>>> @pandas_udf("integer", PandasUDFType.GROUPED_AGG) # doctest: +SKIP +... def sum_udf(v): +... return v.sum() +... +>>> _ = spark.udf.register("sum_udf", sum_udf) # doctest: +SKIP +>>> q = "SELECT sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2" +>>> spark.sql(q).collect() # doctest: +SKIP +[Row(sum_udf(v1)=1), Row(sum_udf(v1)=5)] + .. 
note:: Registration for a user-defined function (case 2.) was added from Spark 2.3.0. """ @@ -310,9 +319,11 @@ class UDFRegistration(object):
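The SQL query in the description performs a per-group aggregation. For readers without a Spark session at hand, the same computation can be sketched in plain Python (the `grouped_agg` helper below is purely illustrative, not part of PySpark):

```python
from collections import defaultdict

def grouped_agg(rows, key, value, agg):
    """Mimic `SELECT agg(value) ... GROUP BY key` over a list of dicts."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: agg(vs) for k, vs in groups.items()}

# The same data as `VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2)`:
rows = [{"v1": 3, "v2": 0}, {"v1": 2, "v2": 0}, {"v1": 1, "v2": 1}]
print(grouped_agg(rows, key="v2", value="v1", agg=sum))  # {0: 5, 1: 1}
```

This matches the expected output of the registered `sum_udf` query above: group `0` sums to `5`, group `1` to `1`.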
spark git commit: [SPARK-25595] Ignore corrupt Avro files if flag IGNORE_CORRUPT_FILES enabled
Repository: spark Updated Branches: refs/heads/master d6be46eb9 -> 928d0739c [SPARK-25595] Ignore corrupt Avro files if flag IGNORE_CORRUPT_FILES enabled ## What changes were proposed in this pull request? With flag `IGNORE_CORRUPT_FILES` enabled, schema inference should ignore corrupt Avro files, which is consistent with Parquet and Orc data source. ## How was this patch tested? Unit test Closes #22611 from gengliangwang/ignoreCorruptAvro. Authored-by: Gengliang Wang Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/928d0739 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/928d0739 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/928d0739 Branch: refs/heads/master Commit: 928d0739c45d0fbb1d3bfc09c0ed7a213f09f3e5 Parents: d6be46e Author: Gengliang Wang Authored: Wed Oct 3 17:08:55 2018 +0800 Committer: hyukjinkwon Committed: Wed Oct 3 17:08:55 2018 +0800 -- .../apache/spark/sql/avro/AvroFileFormat.scala | 78 +--- .../org/apache/spark/sql/avro/AvroSuite.scala | 43 +++ 2 files changed, 93 insertions(+), 28 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/928d0739/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala -- diff --git a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala index 6df23c9..e60fa88 100755 --- a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala +++ b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala @@ -32,14 +32,14 @@ import org.apache.hadoop.conf.Configuration import org.apache.hadoop.fs.{FileStatus, Path} import org.apache.hadoop.mapreduce.Job -import org.apache.spark.TaskContext +import org.apache.spark.{SparkException, TaskContext} import org.apache.spark.internal.Logging import org.apache.spark.sql.SparkSession import 
org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriterFactory, PartitionedFile} import org.apache.spark.sql.sources.{DataSourceRegister, Filter} import org.apache.spark.sql.types.StructType -import org.apache.spark.util.SerializableConfiguration +import org.apache.spark.util.{SerializableConfiguration, Utils} private[avro] class AvroFileFormat extends FileFormat with DataSourceRegister with Logging with Serializable { @@ -59,36 +59,13 @@ private[avro] class AvroFileFormat extends FileFormat val conf = spark.sessionState.newHadoopConf() val parsedOptions = new AvroOptions(options, conf) -// Schema evolution is not supported yet. Here we only pick a single random sample file to -// figure out the schema of the whole dataset. -val sampleFile = - if (parsedOptions.ignoreExtension) { -files.headOption.getOrElse { - throw new FileNotFoundException("Files for schema inferring have been not found.") -} - } else { -files.find(_.getPath.getName.endsWith(".avro")).getOrElse { - throw new FileNotFoundException( -"No Avro files found. If files don't have .avro extension, set ignoreExtension to true") -} - } - // User can specify an optional avro json schema. 
val avroSchema = parsedOptions.schema .map(new Schema.Parser().parse) .getOrElse { -val in = new FsInput(sampleFile.getPath, conf) -try { - val reader = DataFileReader.openReader(in, new GenericDatumReader[GenericRecord]()) - try { -reader.getSchema - } finally { -reader.close() - } -} finally { - in.close() -} - } +inferAvroSchemaFromFiles(files, conf, parsedOptions.ignoreExtension, + spark.sessionState.conf.ignoreCorruptFiles) +} SchemaConverters.toSqlType(avroSchema).dataType match { case t: StructType => Some(t) @@ -100,6 +77,51 @@ private[avro] class AvroFileFormat extends FileFormat } } + private def inferAvroSchemaFromFiles( + files: Seq[FileStatus], + conf: Configuration, + ignoreExtension: Boolean, + ignoreCorruptFiles: Boolean): Schema = { +// Schema evolution is not supported yet. Here we only pick first random readable sample file to +// figure out the schema of the whole dataset. +val avroReader = files.iterator.map { f => + val path = f.getPath + if (!ignoreExtension && !path.getName.endsWith(".avro")) { +None + } else { +Utils.tryWithResource { + new FsInput(path, conf) +} { in => + try { +
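The new `inferAvroSchemaFromFiles` walks the file list lazily and returns the schema of the first readable sample file, skipping corrupt files only when the flag is enabled. A minimal Python sketch of that control flow (all names here — `infer_schema_from_files`, `fake_reader` — are hypothetical illustrations, not Spark APIs):

```python
def infer_schema_from_files(paths, read_schema, ignore_corrupt_files):
    """Return the schema of the first readable file.

    Corrupt files are skipped only when `ignore_corrupt_files` is set,
    mirroring the IGNORE_CORRUPT_FILES behavior described above.
    """
    for path in paths:
        try:
            return read_schema(path)
        except IOError:
            if not ignore_corrupt_files:
                raise
            # otherwise skip this file and try the next candidate
    raise FileNotFoundError("No readable files found for schema inference.")

def fake_reader(path):  # hypothetical reader for demonstration
    if "corrupt" in path:
        raise IOError("invalid Avro block")
    return {"fields": ["a", "b"]}

print(infer_schema_from_files(["corrupt.avro", "ok.avro"], fake_reader, True))
# {'fields': ['a', 'b']}
```

With the flag disabled, the first corrupt file fails schema inference immediately — the behavior the unit tests in `AvroSuite` exercise from both sides.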
spark git commit: [SPARK-25655][BUILD] Add -Pspark-ganglia-lgpl to the scala style check.
Repository: spark Updated Branches: refs/heads/master 58287a398 -> 44cf800c8 [SPARK-25655][BUILD] Add -Pspark-ganglia-lgpl to the scala style check. ## What changes were proposed in this pull request? Our lint failed due to the following errors: ``` [INFO] --- scalastyle-maven-plugin:1.0.0:check (default) spark-ganglia-lgpl_2.11 --- error file=/home/jenkins/workspace/spark-master-maven-snapshots/spark/external/spark-ganglia-lgpl/src/main/scala/org/apache/spark/metrics/sink/GangliaSink.scala message= Are you sure that you want to use toUpperCase or toLowerCase without the root locale? In most cases, you should use toUpperCase(Locale.ROOT) or toLowerCase(Locale.ROOT) instead. If you must use toUpperCase or toLowerCase without the root locale, wrap the code block with // scalastyle:off caselocale .toUpperCase .toLowerCase // scalastyle:on caselocale line=67 column=49 error file=/home/jenkins/workspace/spark-master-maven-snapshots/spark/external/spark-ganglia-lgpl/src/main/scala/org/apache/spark/metrics/sink/GangliaSink.scala message= Are you sure that you want to use toUpperCase or toLowerCase without the root locale? In most cases, you should use toUpperCase(Locale.ROOT) or toLowerCase(Locale.ROOT) instead. If you must use toUpperCase or toLowerCase without the root locale, wrap the code block with // scalastyle:off caselocale .toUpperCase .toLowerCase // scalastyle:on caselocale line=71 column=32 Saving to outputFile=/home/jenkins/workspace/spark-master-maven-snapshots/spark/external/spark-ganglia-lgpl/target/scalastyle-output.xml ``` See https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/8890/ ## How was this patch tested? N/A Closes #22647 from gatorsmile/fixLint. 
Authored-by: gatorsmile Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/44cf800c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/44cf800c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/44cf800c Branch: refs/heads/master Commit: 44cf800c831588b1f7940dd8eef7ecb6cde28f23 Parents: 58287a3 Author: gatorsmile Authored: Sat Oct 6 14:25:48 2018 +0800 Committer: hyukjinkwon Committed: Sat Oct 6 14:25:48 2018 +0800 -- dev/scalastyle| 1 + .../scala/org/apache/spark/metrics/sink/GangliaSink.scala | 7 --- 2 files changed, 5 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/44cf800c/dev/scalastyle -- diff --git a/dev/scalastyle b/dev/scalastyle index b8053df..b0ad025 100755 --- a/dev/scalastyle +++ b/dev/scalastyle @@ -29,6 +29,7 @@ ERRORS=$(echo -e "q\n" \ -Pflume \ -Phive \ -Phive-thriftserver \ +-Pspark-ganglia-lgpl \ scalastyle test:scalastyle \ | awk '{if($1~/error/)print}' \ ) http://git-wip-us.apache.org/repos/asf/spark/blob/44cf800c/external/spark-ganglia-lgpl/src/main/scala/org/apache/spark/metrics/sink/GangliaSink.scala -- diff --git a/external/spark-ganglia-lgpl/src/main/scala/org/apache/spark/metrics/sink/GangliaSink.scala b/external/spark-ganglia-lgpl/src/main/scala/org/apache/spark/metrics/sink/GangliaSink.scala index 0cd795f..93db477 100644 --- a/external/spark-ganglia-lgpl/src/main/scala/org/apache/spark/metrics/sink/GangliaSink.scala +++ b/external/spark-ganglia-lgpl/src/main/scala/org/apache/spark/metrics/sink/GangliaSink.scala @@ -17,7 +17,7 @@ package org.apache.spark.metrics.sink -import java.util.Properties +import java.util.{Locale, Properties} import java.util.concurrent.TimeUnit import com.codahale.metrics.MetricRegistry @@ -64,11 +64,12 @@ class GangliaSink(val property: Properties, val registry: MetricRegistry, val ttl = propertyToOption(GANGLIA_KEY_TTL).map(_.toInt).getOrElse(GANGLIA_DEFAULT_TTL) val dmax 
= propertyToOption(GANGLIA_KEY_DMAX).map(_.toInt).getOrElse(GANGLIA_DEFAULT_DMAX) val mode: UDPAddressingMode = propertyToOption(GANGLIA_KEY_MODE) -.map(u => GMetric.UDPAddressingMode.valueOf(u.toUpperCase)).getOrElse(GANGLIA_DEFAULT_MODE) +.map(u => GMetric.UDPAddressingMode.valueOf(u.toUpperCase(Locale.ROOT))) +.getOrElse(GANGLIA_DEFAULT_MODE) val pollPeriod = propertyToOption(GANGLIA_KEY_PERIOD).map(_.toInt) .getOrElse(GANGLIA_DEFAULT_PERIOD) val pollUnit: TimeUnit = propertyToOption(GANGLIA_KEY_UNIT) -.map(u => TimeUnit.valueOf(u.toUpperCase)) +.map(u => TimeUnit.valueOf(u.toUpperCase(Locale.ROOT))) .getOrElse(GANGLIA_DEFAULT_UNIT) MetricsSystem.checkMinimalPollingPeriod(pollUnit,
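The underlying issue is that Java's `String.toUpperCase()` without an explicit locale is locale-sensitive: under a Turkish default locale, `"unicast".toUpperCase()` yields `"UNİCAST"` (dotted capital İ) and the subsequent enum lookup fails, which is why the scalastyle rule demands `Locale.ROOT`. A rough Python sketch of the config-to-enum canonicalization (the helper is hypothetical; Python's `str.upper()` already ignores the process locale, which is exactly the behavior `Locale.ROOT` pins down in Java):

```python
UDP_MODES = {"UNICAST", "MULTICAST"}

def parse_mode(value, default="MULTICAST"):
    """Canonicalize a user-supplied mode string before matching enum names."""
    if value is None:
        return default
    mode = value.upper()  # locale-independent in Python, unlike Java's default
    if mode not in UDP_MODES:
        raise ValueError("unknown UDP addressing mode: %r" % value)
    return mode

print(parse_mode("unicast"))  # UNICAST
print(parse_mode(None))       # MULTICAST
```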
spark git commit: [SPARK-25621][SPARK-25622][TEST] Reduce test time of BucketedReadWithHiveSupportSuite
Repository: spark Updated Branches: refs/heads/master f2f4e7afe -> 1ee472eec [SPARK-25621][SPARK-25622][TEST] Reduce test time of BucketedReadWithHiveSupportSuite ## What changes were proposed in this pull request? By replacing loops with random possible value. - `read partitioning bucketed tables with bucket pruning filters` reduce from 55s to 7s - `read partitioning bucketed tables having composite filters` reduce from 54s to 8s - total time: reduce from 288s to 192s ## How was this patch tested? Unit test Closes #22640 from gengliangwang/fastenBucketedReadSuite. Authored-by: Gengliang Wang Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1ee472ee Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1ee472ee Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1ee472ee Branch: refs/heads/master Commit: 1ee472eec15e104c4cd087179a9491dc542e15d7 Parents: f2f4e7a Author: Gengliang Wang Authored: Sat Oct 6 14:54:04 2018 +0800 Committer: hyukjinkwon Committed: Sat Oct 6 14:54:04 2018 +0800 -- .../spark/sql/sources/BucketedReadSuite.scala | 181 ++- 1 file changed, 91 insertions(+), 90 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1ee472ee/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala index a941420..a2bc651 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala @@ -20,6 +20,8 @@ package org.apache.spark.sql.sources import java.io.File import java.net.URI +import scala.util.Random + import org.apache.spark.sql._ import org.apache.spark.sql.catalyst.catalog.BucketSpec import org.apache.spark.sql.catalyst.expressions @@ -47,11 +49,13 @@ 
class BucketedReadWithoutHiveSupportSuite extends BucketedReadSuite with SharedS abstract class BucketedReadSuite extends QueryTest with SQLTestUtils { import testImplicits._ - private lazy val df = (0 until 50).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", "k") + private val maxI = 5 + private val maxJ = 13 + private lazy val df = (0 until 50).map(i => (i % maxI, i % maxJ, i.toString)).toDF("i", "j", "k") private lazy val nullDF = (for { i <- 0 to 50 s <- Seq(null, "a", "b", "c", "d", "e", "f", null, "g") - } yield (i % 5, s, i % 13)).toDF("i", "j", "k") + } yield (i % maxI, s, i % maxJ)).toDF("i", "j", "k") // number of buckets that doesn't yield empty buckets when bucketing on column j on df/nullDF // empty buckets before filtering might hide bugs in pruning logic @@ -66,23 +70,22 @@ abstract class BucketedReadSuite extends QueryTest with SQLTestUtils { .bucketBy(8, "j", "k") .saveAsTable("bucketed_table") - for (i <- 0 until 5) { -val table = spark.table("bucketed_table").filter($"i" === i) -val query = table.queryExecution -val output = query.analyzed.output -val rdd = query.toRdd - -assert(rdd.partitions.length == 8) - -val attrs = table.select("j", "k").queryExecution.analyzed.output -val checkBucketId = rdd.mapPartitionsWithIndex((index, rows) => { - val getBucketId = UnsafeProjection.create( -HashPartitioning(attrs, 8).partitionIdExpression :: Nil, -output) - rows.map(row => getBucketId(row).getInt(0) -> index) -}) -checkBucketId.collect().foreach(r => assert(r._1 == r._2)) - } + val bucketValue = Random.nextInt(maxI) + val table = spark.table("bucketed_table").filter($"i" === bucketValue) + val query = table.queryExecution + val output = query.analyzed.output + val rdd = query.toRdd + + assert(rdd.partitions.length == 8) + + val attrs = table.select("j", "k").queryExecution.analyzed.output + val checkBucketId = rdd.mapPartitionsWithIndex((index, rows) => { +val getBucketId = UnsafeProjection.create( + HashPartitioning(attrs, 
8).partitionIdExpression :: Nil, + output) +rows.map(row => getBucketId(row).getInt(0) -> index) + }) + checkBucketId.collect().foreach(r => assert(r._1 == r._2)) } } @@ -145,36 +148,36 @@ abstract class BucketedReadSuite extends QueryTest with SQLTestUtils { .bucketBy(numBuckets, "j") .saveAsTable("bucketed_table") - for (j <- 0 until 13) { -// Case 1: EqualTo -checkPrunedAnswers( - bucketSpec, - bucketValues = j :: Nil, -
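The speed-up technique is simple: instead of asserting the property for every possible bucket value on every run, assert it for one randomly sampled value per run, relying on repeated CI runs to cover the whole domain. A toy sketch (the `check` function stands in for the expensive per-value query assertion):

```python
import random

MAX_I = 5  # the bucket column takes values 0..4, as in the suite's test data

def check(i):
    """Stand-in for the expensive per-value query-plan assertion."""
    assert 0 <= i < MAX_I
    return i

# Before the patch: the assertion ran once per possible value.
for i in range(MAX_I):
    check(i)

# After the patch: one randomly sampled value per run; repeated CI runs
# still exercise the whole domain, at a fraction of the wall-clock cost.
bucket_value = random.randrange(MAX_I)
check(bucket_value)
```

The trade-off is that any single run covers only one value, so a value-specific regression may take a few runs to surface — acceptable for a property expected to hold uniformly.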
spark git commit: [SPARK-25202][SQL] Implements split with limit sql function
Repository: spark Updated Branches: refs/heads/master 44cf800c8 -> 17781d753 [SPARK-25202][SQL] Implements split with limit sql function ## What changes were proposed in this pull request? Adds support for the setting limit in the sql split function ## How was this patch tested? 1. Updated unit tests 2. Tested using Scala spark shell Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #7 from phegstrom/master. Authored-by: Parker Hegstrom Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/17781d75 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/17781d75 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/17781d75 Branch: refs/heads/master Commit: 17781d75308c328b11cab3658ca4f358539414f2 Parents: 44cf800 Author: Parker Hegstrom Authored: Sat Oct 6 14:30:43 2018 +0800 Committer: hyukjinkwon Committed: Sat Oct 6 14:30:43 2018 +0800 -- R/pkg/R/functions.R | 15 +-- R/pkg/R/generics.R | 2 +- R/pkg/tests/fulltests/test_sparkSQL.R | 8 .../apache/spark/unsafe/types/UTF8String.java | 6 +++ .../spark/unsafe/types/UTF8StringSuite.java | 14 --- python/pyspark/sql/functions.py | 28 + .../expressions/regexpExpressions.scala | 44 ++-- .../expressions/RegexpExpressionsSuite.scala| 15 +-- .../scala/org/apache/spark/sql/functions.scala | 32 -- .../sql-tests/inputs/string-functions.sql | 6 ++- .../sql-tests/results/string-functions.sql.out | 18 +++- .../apache/spark/sql/StringFunctionsSuite.scala | 44 ++-- 12 files changed, 189 insertions(+), 43 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/17781d75/R/pkg/R/functions.R -- diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index 2cb4cb8..6a8fef5 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -3473,13 +3473,21 @@ setMethod("collect_set", #' @details #' \code{split_string}: Splits string on regular expression. 
-#' Equivalent to \code{split} SQL function. +#' Equivalent to \code{split} SQL function. Optionally a +#' \code{limit} can be specified #' #' @rdname column_string_functions +#' @param limit determines the length of the returned array. +#' \itemize{ +#' \item \code{limit > 0}: length of the array will be at most \code{limit} +#' \item \code{limit <= 0}: the returned array can have any length +#' } +#' #' @aliases split_string split_string,Column-method #' @examples #' #' \dontrun{ +#' head(select(df, split_string(df$Class, "\\d", 2))) #' head(select(df, split_string(df$Sex, "a"))) #' head(select(df, split_string(df$Class, "\\d"))) #' # This is equivalent to the following SQL expression @@ -3487,8 +3495,9 @@ setMethod("collect_set", #' @note split_string 2.3.0 setMethod("split_string", signature(x = "Column", pattern = "character"), - function(x, pattern) { -jc <- callJStatic("org.apache.spark.sql.functions", "split", x@jc, pattern) + function(x, pattern, limit = -1) { +jc <- callJStatic("org.apache.spark.sql.functions", + "split", x@jc, pattern, as.integer(limit)) column(jc) }) http://git-wip-us.apache.org/repos/asf/spark/blob/17781d75/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index 27c1b31..697d124 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -1258,7 +1258,7 @@ setGeneric("sort_array", function(x, asc = TRUE) { standardGeneric("sort_array") #' @rdname column_string_functions #' @name NULL -setGeneric("split_string", function(x, pattern) { standardGeneric("split_string") }) +setGeneric("split_string", function(x, pattern, ...) 
{ standardGeneric("split_string") }) #' @rdname column_string_functions #' @name NULL http://git-wip-us.apache.org/repos/asf/spark/blob/17781d75/R/pkg/tests/fulltests/test_sparkSQL.R -- diff --git a/R/pkg/tests/fulltests/test_sparkSQL.R b/R/pkg/tests/fulltests/test_sparkSQL.R index 50eff37..5cc75aa 100644 --- a/R/pkg/tests/fulltests/test_sparkSQL.R +++ b/R/pkg/tests/fulltests/test_sparkSQL.R @@ -1819,6 +1819,14 @@ test_that("string operators", { collect(select(df4, split_string(df4$a, "")))[1, 1], list(list("a.b@c.d 1", "b")) ) + expect_equal( +collect(select(df4, split_string(df4$a, "\\.", 2)))[1, 1], +list(list("a", "b@c.d 1\\b")) + ) +
spark git commit: [SPARK-25600][SQL][MINOR] Make use of TypeCoercion.findTightestCommonType while inferring CSV schema.
Repository: spark Updated Branches: refs/heads/master 17781d753 -> f2f4e7afe [SPARK-25600][SQL][MINOR] Make use of TypeCoercion.findTightestCommonType while inferring CSV schema. ## What changes were proposed in this pull request? Current the CSV's infer schema code inlines `TypeCoercion.findTightestCommonType`. This is a minor refactor to make use of the common type coercion code when applicable. This way we can take advantage of any improvement to the base method. Thanks to MaxGekk for finding this while reviewing another PR. ## How was this patch tested? This is a minor refactor. Existing tests are used to verify the change. Closes #22619 from dilipbiswal/csv_minor. Authored-by: Dilip Biswal Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f2f4e7af Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f2f4e7af Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f2f4e7af Branch: refs/heads/master Commit: f2f4e7afe730badaf443f459b27fe40879947d51 Parents: 17781d7 Author: Dilip Biswal Authored: Sat Oct 6 14:49:51 2018 +0800 Committer: hyukjinkwon Committed: Sat Oct 6 14:49:51 2018 +0800 -- .../datasources/csv/CSVInferSchema.scala| 37 1 file changed, 14 insertions(+), 23 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f2f4e7af/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala index a585cbe..3596ff1 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala @@ -70,7 +70,7 @@ private[csv] object CSVInferSchema { def mergeRowTypes(first: Array[DataType], second: 
Array[DataType]): Array[DataType] = { first.zipAll(second, NullType, NullType).map { case (a, b) => - findTightestCommonType(a, b).getOrElse(NullType) + compatibleType(a, b).getOrElse(NullType) } } @@ -88,7 +88,7 @@ private[csv] object CSVInferSchema { case LongType => tryParseLong(field, options) case _: DecimalType => // DecimalTypes have different precisions and scales, so we try to find the common type. - findTightestCommonType(typeSoFar, tryParseDecimal(field, options)).getOrElse(StringType) + compatibleType(typeSoFar, tryParseDecimal(field, options)).getOrElse(StringType) case DoubleType => tryParseDouble(field, options) case TimestampType => tryParseTimestamp(field, options) case BooleanType => tryParseBoolean(field, options) @@ -172,35 +172,27 @@ private[csv] object CSVInferSchema { StringType } - private val numericPrecedence: IndexedSeq[DataType] = TypeCoercion.numericPrecedence + /** + * Returns the common data type given two input data types so that the return type + * is compatible with both input data types. + */ + private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = { +TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) + } /** - * Copied from internal Spark api - * [[org.apache.spark.sql.catalyst.analysis.TypeCoercion]] + * The following pattern matching represents additional type promotion rules that + * are CSV specific. 
*/ - val findTightestCommonType: (DataType, DataType) => Option[DataType] = { -case (t1, t2) if t1 == t2 => Some(t1) -case (NullType, t1) => Some(t1) -case (t1, NullType) => Some(t1) + private val findCompatibleTypeForCSV: (DataType, DataType) => Option[DataType] = { case (StringType, t2) => Some(StringType) case (t1, StringType) => Some(StringType) -// Promote numeric types to the highest of the two and all numeric types to unlimited decimal -case (t1, t2) if Seq(t1, t2).forall(numericPrecedence.contains) => - val index = numericPrecedence.lastIndexWhere(t => t == t1 || t == t2) - Some(numericPrecedence(index)) - -// These two cases below deal with when `DecimalType` is larger than `IntegralType`. -case (t1: IntegralType, t2: DecimalType) if t2.isWiderThan(t1) => - Some(t2) -case (t1: DecimalType, t2: IntegralType) if t1.isWiderThan(t2) => - Some(t1) - // These two cases below deal with when `IntegralType` is larger than `DecimalType`. case (t1: IntegralType, t2: DecimalType) => - findTightestCommonType(DecimalType.forType(t1), t2) +
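The refactor boils down to composing the shared coercion rules with a CSV-only fallback via `orElse`. A simplified Python sketch with strings standing in for Catalyst types (all three functions are illustrative stand-ins, not Spark APIs):

```python
_PRECEDENCE = ["int", "long", "double"]  # toy numeric widening order

def find_tightest_common_type(t1, t2):
    """Stand-in for the shared TypeCoercion.findTightestCommonType rules."""
    if t1 == t2:
        return t1
    if "null" in (t1, t2):
        return t2 if t1 == "null" else t1
    if t1 in _PRECEDENCE and t2 in _PRECEDENCE:
        return _PRECEDENCE[max(_PRECEDENCE.index(t1), _PRECEDENCE.index(t2))]
    return None

def find_compatible_type_for_csv(t1, t2):
    """CSV-specific promotions, consulted only when the shared rules give up."""
    if "string" in (t1, t2):
        return "string"
    return None

def compatible_type(t1, t2):
    # The shape of the refactor: shared rules first, CSV fallback second.
    return find_tightest_common_type(t1, t2) or find_compatible_type_for_csv(t1, t2)

print(compatible_type("int", "double"))   # double
print(compatible_type("long", "string"))  # string
```

Because the shared rules are consulted first, any future improvement to `TypeCoercion.findTightestCommonType` automatically benefits CSV schema inference — the motivation stated in the PR description.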
spark git commit: [SPARK-25461][PYSPARK][SQL] Add document for mismatch between return type of Pandas.Series and return type of pandas udf
Repository: spark Updated Branches: refs/heads/master fba722e31 -> 3eb842969 [SPARK-25461][PYSPARK][SQL] Add document for mismatch between return type of Pandas.Series and return type of pandas udf ## What changes were proposed in this pull request? For Pandas UDFs, we derive the Arrow type from the UDF's declared Catalyst return data type and use that Arrow type to serialize the data. If the declared return data type does not match the actual return type of the Pandas.Series produced by the UDF, there is a risk of returning incorrect data from the Python side. We currently have no reliable way to check whether the data conversion is safe, so for now we document this caveat for users. Once a PyArrow upgrade that can perform this check becomes available, we should add an option to enable it. ## How was this patch tested? Only document change. Closes #22610 from viirya/SPARK-25461. Authored-by: Liang-Chi Hsieh Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3eb84296 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3eb84296 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3eb84296 Branch: refs/heads/master Commit: 3eb842969906d6e81a137af6dc4339881df0a315 Parents: fba722e Author: Liang-Chi Hsieh Authored: Sun Oct 7 23:18:46 2018 +0800 Committer: hyukjinkwon Committed: Sun Oct 7 23:18:46 2018 +0800 -- python/pyspark/sql/functions.py | 6 ++ 1 file changed, 6 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3eb84296/python/pyspark/sql/functions.py -- diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index 7685264..be089ee 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -2948,6 +2948,12 @@ def pandas_udf(f=None, returnType=None, functionType=None): can fail on special rows, the workaround is to incorporate the condition into the functions. .. 
note:: The user-defined functions do not take keyword arguments on the calling side. + +.. note:: The data type of returned `pandas.Series` from the user-defined functions should be +matched with defined returnType (see :meth:`types.to_arrow_type` and +:meth:`types.from_arrow_type`). When there is mismatch between them, Spark might do +conversion on returned data. The conversion is not guaranteed to be correct and results +should be checked for accuracy by users. """ # decorator @pandas_udf(returnType, functionType) is_decorator = f is None or isinstance(f, (str, DataType)) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
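To see why the new documentation note matters, here is a toy illustration (plain Python, not PySpark or Arrow code) of how a declared return type can silently reshape a UDF's actual output during serialization:

```python
def serialize(values, declared_type):
    """Toy serializer: the declared returnType wins, silently coercing values."""
    return [declared_type(v) for v in values]

udf_output = [0.5, 1.5, 2.5]            # the UDF actually returned floats...
sent_back = serialize(udf_output, int)  # ...but was declared "integer"
print(sent_back)  # [0, 1, 2] -- data silently changed, no error raised
```

The real Arrow conversion is more involved, but the failure mode is the same: no exception, just quietly altered values — hence the advice that users verify results when the declared `returnType` and the actual `pandas.Series` dtype disagree.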
spark git commit: [SPARK-25262][DOC][FOLLOWUP] Fix missing markup tag
Repository: spark Updated Branches: refs/heads/master 5d726b865 -> e99ba8d7c [SPARK-25262][DOC][FOLLOWUP] Fix missing markup tag ## What changes were proposed in this pull request? This adds a missing end markup tag. This should go to the `master` branch only. ## How was this patch tested? This is a doc-only change. Manual via `SKIP_API=1 jekyll build`. Closes #22584 from dongjoon-hyun/SPARK-25262. Authored-by: Dongjoon Hyun Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e99ba8d7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e99ba8d7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e99ba8d7 Branch: refs/heads/master Commit: e99ba8d7c8ec4b4cdd63fd1621f54be993bb0404 Parents: 5d726b8 Author: Dongjoon Hyun Authored: Sat Sep 29 11:23:37 2018 +0800 Committer: hyukjinkwon Committed: Sat Sep 29 11:23:37 2018 +0800 -- docs/running-on-kubernetes.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e99ba8d7/docs/running-on-kubernetes.md -- diff --git a/docs/running-on-kubernetes.md b/docs/running-on-kubernetes.md index 840e306..c7aea27 100644 --- a/docs/running-on-kubernetes.md +++ b/docs/running-on-kubernetes.md @@ -800,7 +800,7 @@ specific to Spark on Kubernetes. spark.kubernetes.local.dirs.tmpfs - false + false Configure the emptyDir volumes used to back SPARK_LOCAL_DIRS within the Spark driver and executor pods to use tmpfs backing i.e. RAM. See Local Storage earlier on this page for more discussion of this.
spark git commit: [SPARK-25570][SQL][TEST] Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite
Repository: spark Updated Branches: refs/heads/master e99ba8d7c -> 1e437835e [SPARK-25570][SQL][TEST] Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? This PR aims to prevent test slowdowns at `HiveExternalCatalogVersionsSuite` by using the latest Apache Spark 2.3.2 link because the Apache mirrors will remove the old Spark 2.3.1 binaries eventually. `HiveExternalCatalogVersionsSuite` will not fail because [SPARK-24813](https://issues.apache.org/jira/browse/SPARK-24813) implements a fallback logic. However, it will cause many trials and fallbacks in all builds over `branch-2.3/branch-2.4/master`. We had better fix this issue. ## How was this patch tested? Pass the Jenkins with the updated version. Closes #22587 from dongjoon-hyun/SPARK-25570. Authored-by: Dongjoon Hyun Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1e437835 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1e437835 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1e437835 Branch: refs/heads/master Commit: 1e437835e96c4417117f44c29eba5ebc0112926f Parents: e99ba8d Author: Dongjoon Hyun Authored: Sat Sep 29 11:43:58 2018 +0800 Committer: hyukjinkwon Committed: Sat Sep 29 11:43:58 2018 +0800 -- .../apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1e437835/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala index a7d6972..fd4985d 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala +++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala @@ -206,7 +206,7 @@ class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils { object PROCESS_TABLES extends QueryTest with SQLTestUtils { // Tests the latest version of every release line. - val testingVersions = Seq("2.1.3", "2.2.2", "2.3.1") + val testingVersions = Seq("2.1.3", "2.2.2", "2.3.2") protected var spark: SparkSession = _
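The mirror-fallback behavior this commit leans on (SPARK-24813) can be sketched in Python. This is an illustrative sketch only — the URL layout and the `candidate_urls` helper are assumptions for demonstration, not the suite's actual Scala code:

```python
# Hypothetical sketch of the download fallback: try a regular Apache mirror
# first; if the release has been removed there, fall back to the Apache
# archive, which retains binaries for every released version.
def candidate_urls(version):
    path = "spark/spark-{0}/spark-{0}-bin-hadoop2.7.tgz".format(version)
    return [
        "https://dlcdn.apache.org/" + path,        # preferred mirror (assumed)
        "https://archive.apache.org/dist/" + path,  # archive fallback (assumed)
    ]

for url in candidate_urls("2.3.2"):
    print(url)
```

Pinning `testingVersions` to the latest release of each line, as the patch does, keeps the first URL valid and avoids repeated fallback round-trips in CI.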
spark git commit: [SPARK-25570][SQL][TEST] Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite
Repository: spark Updated Branches: refs/heads/branch-2.4 7614313c9 -> ec2c17abf [SPARK-25570][SQL][TEST] Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? This PR aims to prevent test slowdowns at `HiveExternalCatalogVersionsSuite` by using the latest Apache Spark 2.3.2 link because the Apache mirrors will remove the old Spark 2.3.1 binaries eventually. `HiveExternalCatalogVersionsSuite` will not fail because [SPARK-24813](https://issues.apache.org/jira/browse/SPARK-24813) implements a fallback logic. However, it will cause many trials and fallbacks in all builds over `branch-2.3/branch-2.4/master`. We had better fix this issue. ## How was this patch tested? Pass the Jenkins with the updated version. Closes #22587 from dongjoon-hyun/SPARK-25570. Authored-by: Dongjoon Hyun Signed-off-by: hyukjinkwon (cherry picked from commit 1e437835e96c4417117f44c29eba5ebc0112926f) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ec2c17ab Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ec2c17ab Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ec2c17ab Branch: refs/heads/branch-2.4 Commit: ec2c17abf43d304fab26dde3ae624f553cdbd32e Parents: 7614313 Author: Dongjoon Hyun Authored: Sat Sep 29 11:43:58 2018 +0800 Committer: hyukjinkwon Committed: Sat Sep 29 11:44:12 2018 +0800 -- .../apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ec2c17ab/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala index 25df333..46b66c1 100644 --- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala @@ -203,7 +203,7 @@ class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils { object PROCESS_TABLES extends QueryTest with SQLTestUtils { // Tests the latest version of every release line. - val testingVersions = Seq("2.1.3", "2.2.2", "2.3.1") + val testingVersions = Seq("2.1.3", "2.2.2", "2.3.2") protected var spark: SparkSession = _
spark git commit: [SPARK-25570][SQL][TEST] Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite
Repository: spark Updated Branches: refs/heads/branch-2.3 f13565b6e -> eb78380c0 [SPARK-25570][SQL][TEST] Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? This PR aims to prevent test slowdowns at `HiveExternalCatalogVersionsSuite` by using the latest Apache Spark 2.3.2 link because the Apache mirrors will remove the old Spark 2.3.1 binaries eventually. `HiveExternalCatalogVersionsSuite` will not fail because [SPARK-24813](https://issues.apache.org/jira/browse/SPARK-24813) implements a fallback logic. However, it will cause many trials and fallbacks in all builds over `branch-2.3/branch-2.4/master`. We had better fix this issue. ## How was this patch tested? Pass the Jenkins with the updated version. Closes #22587 from dongjoon-hyun/SPARK-25570. Authored-by: Dongjoon Hyun Signed-off-by: hyukjinkwon (cherry picked from commit 1e437835e96c4417117f44c29eba5ebc0112926f) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/eb78380c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/eb78380c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/eb78380c Branch: refs/heads/branch-2.3 Commit: eb78380c0e1e620e996435a4c08acb652c868795 Parents: f13565b Author: Dongjoon Hyun Authored: Sat Sep 29 11:43:58 2018 +0800 Committer: hyukjinkwon Committed: Sat Sep 29 11:44:27 2018 +0800 -- .../apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/eb78380c/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala index 5103aa8..af15da6 100644 --- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala @@ -203,7 +203,7 @@ class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils { object PROCESS_TABLES extends QueryTest with SQLTestUtils { // Tests the latest version of every release line. - val testingVersions = Seq("2.1.3", "2.2.2", "2.3.1") + val testingVersions = Seq("2.1.3", "2.2.2", "2.3.2") protected var spark: SparkSession = _
spark git commit: [SPARK-25273][DOC] How to install testthat 1.0.2
Repository: spark Updated Branches: refs/heads/master e9fce2a4c -> 3c67cb0b5 [SPARK-25273][DOC] How to install testthat 1.0.2 ## What changes were proposed in this pull request? R tests require `testthat` v1.0.2. In the PR, I described how to install the version in the section http://spark.apache.org/docs/latest/building-spark.html#running-r-tests. Closes #22272 from MaxGekk/r-testthat-doc. Authored-by: Maxim Gekk Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3c67cb0b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3c67cb0b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3c67cb0b Branch: refs/heads/master Commit: 3c67cb0b52c14f1cee1a0aaf74d6d71f28cbb5f2 Parents: e9fce2a Author: Maxim Gekk Authored: Thu Aug 30 20:25:26 2018 +0800 Committer: hyukjinkwon Committed: Thu Aug 30 20:25:26 2018 +0800 -- docs/README.md | 3 ++- docs/building-spark.md | 3 ++- 2 files changed, 4 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3c67cb0b/docs/README.md -- diff --git a/docs/README.md b/docs/README.md index 7da543d..fb67c4b 100644 --- a/docs/README.md +++ b/docs/README.md @@ -22,8 +22,9 @@ $ sudo gem install jekyll jekyll-redirect-from pygments.rb $ sudo pip install Pygments # Following is needed only for generating API docs $ sudo pip install sphinx pypandoc mkdocs -$ sudo Rscript -e 'install.packages(c("knitr", "devtools", "testthat", "rmarkdown"), repos="http://cran.stat.ucla.edu/;)' +$ sudo Rscript -e 'install.packages(c("knitr", "devtools", "rmarkdown"), repos="http://cran.stat.ucla.edu/;)' $ sudo Rscript -e 'devtools::install_version("roxygen2", version = "5.0.1", repos="http://cran.stat.ucla.edu/;)' +$ sudo Rscript -e 'devtools::install_version("testthat", version = "1.0.2", repos="http://cran.stat.ucla.edu/;)' ``` Note: If you are on a system with both Ruby 1.9 and Ruby 2.0 you may need to replace gem with gem2.0. 
http://git-wip-us.apache.org/repos/asf/spark/blob/3c67cb0b/docs/building-spark.md -- diff --git a/docs/building-spark.md b/docs/building-spark.md index d3dfd49..0086aea 100644 --- a/docs/building-spark.md +++ b/docs/building-spark.md @@ -236,7 +236,8 @@ The run-tests script also can be limited to a specific Python version or a speci To run the SparkR tests you will need to install the [knitr](https://cran.r-project.org/package=knitr), [rmarkdown](https://cran.r-project.org/package=rmarkdown), [testthat](https://cran.r-project.org/package=testthat), [e1071](https://cran.r-project.org/package=e1071) and [survival](https://cran.r-project.org/package=survival) packages first: -R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 'survival'), repos='http://cran.us.r-project.org')" +R -e "install.packages(c('knitr', 'rmarkdown', 'devtools', 'e1071', 'survival'), repos='http://cran.us.r-project.org')" +R -e "devtools::install_version('testthat', version = '1.0.2', repos='http://cran.us.r-project.org')" You can run just the SparkR tests using the command: - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25273][DOC] How to install testthat 1.0.2
Repository: spark Updated Branches: refs/heads/branch-2.3 306e881b6 -> b072717b3 [SPARK-25273][DOC] How to install testthat 1.0.2 ## What changes were proposed in this pull request? R tests require `testthat` v1.0.2. In the PR, I described how to install the version in the section http://spark.apache.org/docs/latest/building-spark.html#running-r-tests. Closes #22272 from MaxGekk/r-testthat-doc. Authored-by: Maxim Gekk Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b072717b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b072717b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b072717b Branch: refs/heads/branch-2.3 Commit: b072717b3f6178e728c0bf855aca243c275e58f0 Parents: 306e881 Author: Maxim Gekk Authored: Thu Aug 30 20:25:26 2018 +0800 Committer: hyukjinkwon Committed: Thu Aug 30 20:26:36 2018 +0800 -- docs/README.md | 3 ++- docs/building-spark.md | 3 ++- 2 files changed, 4 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b072717b/docs/README.md -- diff --git a/docs/README.md b/docs/README.md index 166a7e5..174a735 100644 --- a/docs/README.md +++ b/docs/README.md @@ -22,8 +22,9 @@ $ sudo gem install jekyll jekyll-redirect-from pygments.rb $ sudo pip install Pygments # Following is needed only for generating API docs $ sudo pip install sphinx pypandoc mkdocs -$ sudo Rscript -e 'install.packages(c("knitr", "devtools", "testthat", "rmarkdown"), repos="http://cran.stat.ucla.edu/;)' +$ sudo Rscript -e 'install.packages(c("knitr", "devtools", "rmarkdown"), repos="http://cran.stat.ucla.edu/;)' $ sudo Rscript -e 'devtools::install_version("roxygen2", version = "5.0.1", repos="http://cran.stat.ucla.edu/;)' +$ sudo Rscript -e 'devtools::install_version("testthat", version = "1.0.2", repos="http://cran.stat.ucla.edu/;)' ``` Note: If you are on a system with both Ruby 1.9 and Ruby 2.0 you may need to replace gem with gem2.0. 
http://git-wip-us.apache.org/repos/asf/spark/blob/b072717b/docs/building-spark.md -- diff --git a/docs/building-spark.md b/docs/building-spark.md index 9f78c04..cd80835 100644 --- a/docs/building-spark.md +++ b/docs/building-spark.md @@ -232,7 +232,8 @@ The run-tests script also can be limited to a specific Python version or a speci To run the SparkR tests you will need to install the [knitr](https://cran.r-project.org/package=knitr), [rmarkdown](https://cran.r-project.org/package=rmarkdown), [testthat](https://cran.r-project.org/package=testthat), [e1071](https://cran.r-project.org/package=e1071) and [survival](https://cran.r-project.org/package=survival) packages first: -R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 'survival'), repos='http://cran.us.r-project.org')" +R -e "install.packages(c('knitr', 'rmarkdown', 'devtools', 'e1071', 'survival'), repos='http://cran.us.r-project.org')" +R -e "devtools::install_version('testthat', version = '1.0.2', repos='http://cran.us.r-project.org')" You can run just the SparkR tests using the command: - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-25471][PYTHON][TEST] Fix pyspark-sql test error when using Python 3.6 and Pandas 0.23
Repository: spark Updated Branches: refs/heads/branch-2.4 a9a8d3a4b -> 99ae693b3 [SPARK-25471][PYTHON][TEST] Fix pyspark-sql test error when using Python 3.6 and Pandas 0.23 ## What changes were proposed in this pull request? Fix test that constructs a Pandas DataFrame by specifying the column order. Previously this test assumed the columns would be sorted alphabetically, however when using Python 3.6 with Pandas 0.23 or higher, the original column order is maintained. This causes the columns to get mixed up and the test errors. Manually tested with `python/run-tests` using Python 3.6.6 and Pandas 0.23.4 Closes #22477 from BryanCutler/pyspark-tests-py36-pd23-SPARK-25471. Authored-by: Bryan Cutler Signed-off-by: hyukjinkwon (cherry picked from commit 90e3955f384ca07bdf24faa6cdb60ded944cf0d8) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/99ae693b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/99ae693b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/99ae693b Branch: refs/heads/branch-2.4 Commit: 99ae693b3722db6e01825b8cf2c3f2ef74a65ddb Parents: a9a8d3a Author: Bryan Cutler Authored: Thu Sep 20 09:29:29 2018 +0800 Committer: hyukjinkwon Committed: Thu Sep 20 09:29:49 2018 +0800 -- python/pyspark/sql/tests.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/99ae693b/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index 08d7cfa..603f994 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -3266,7 +3266,7 @@ class SQLTests(ReusedSQLTestCase): import pandas as pd from datetime import datetime pdf = pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)], -"d": [pd.Timestamp.now().date()]}) +"d": [pd.Timestamp.now().date()]}, columns=["d", "ts"]) # test types are inferred correctly without specifying schema df = 
self.spark.createDataFrame(pdf) self.assertTrue(isinstance(df.schema['ts'].dataType, TimestampType))
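The root cause described above — dict insertion order now surviving into the DataFrame's column order — can be seen with plain Python dictionaries; pandas itself is not needed to observe the ordering change (a minimal sketch with illustrative values):

```python
# On CPython 3.6+ dict keys keep insertion order, so a DataFrame built from
# this dict gets columns in the written order ("ts", "d") rather than the
# alphabetical order ("d", "ts") that the old test implicitly assumed.
data = {"ts": "2017-10-31 01:01:01", "d": "2018-09-20"}

insertion_order = list(data)   # what Pandas >= 0.23 on Python 3.6 preserves
alphabetical = sorted(data)    # what older Pandas produced

print(insertion_order)  # ['ts', 'd']
print(alphabetical)     # ['d', 'ts']
```

Passing `columns=["d", "ts"]` explicitly, as the patch does, removes the dependence on either behavior.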
spark git commit: [SPARK-25471][PYTHON][TEST] Fix pyspark-sql test error when using Python 3.6 and Pandas 0.23
Repository: spark Updated Branches: refs/heads/master 6f681d429 -> 90e3955f3 [SPARK-25471][PYTHON][TEST] Fix pyspark-sql test error when using Python 3.6 and Pandas 0.23 ## What changes were proposed in this pull request? Fix test that constructs a Pandas DataFrame by specifying the column order. Previously this test assumed the columns would be sorted alphabetically, however when using Python 3.6 with Pandas 0.23 or higher, the original column order is maintained. This causes the columns to get mixed up and the test errors. Manually tested with `python/run-tests` using Python 3.6.6 and Pandas 0.23.4 Closes #22477 from BryanCutler/pyspark-tests-py36-pd23-SPARK-25471. Authored-by: Bryan Cutler Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/90e3955f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/90e3955f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/90e3955f Branch: refs/heads/master Commit: 90e3955f384ca07bdf24faa6cdb60ded944cf0d8 Parents: 6f681d4 Author: Bryan Cutler Authored: Thu Sep 20 09:29:29 2018 +0800 Committer: hyukjinkwon Committed: Thu Sep 20 09:29:29 2018 +0800 -- python/pyspark/sql/tests.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/90e3955f/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index 08d7cfa..603f994 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -3266,7 +3266,7 @@ class SQLTests(ReusedSQLTestCase): import pandas as pd from datetime import datetime pdf = pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)], -"d": [pd.Timestamp.now().date()]}) +"d": [pd.Timestamp.now().date()]}, columns=["d", "ts"]) # test types are inferred correctly without specifying schema df = self.spark.createDataFrame(pdf) self.assertTrue(isinstance(df.schema['ts'].dataType, TimestampType)) - To unsubscribe, 
spark git commit: [SPARK-25471][PYTHON][TEST] Fix pyspark-sql test error when using Python 3.6 and Pandas 0.23
Repository: spark Updated Branches: refs/heads/branch-2.3 7b5da37c0 -> e319a624e [SPARK-25471][PYTHON][TEST] Fix pyspark-sql test error when using Python 3.6 and Pandas 0.23 ## What changes were proposed in this pull request? Fix test that constructs a Pandas DataFrame by specifying the column order. Previously this test assumed the columns would be sorted alphabetically, however when using Python 3.6 with Pandas 0.23 or higher, the original column order is maintained. This causes the columns to get mixed up and the test errors. Manually tested with `python/run-tests` using Python 3.6.6 and Pandas 0.23.4 Closes #22477 from BryanCutler/pyspark-tests-py36-pd23-SPARK-25471. Authored-by: Bryan Cutler Signed-off-by: hyukjinkwon (cherry picked from commit 90e3955f384ca07bdf24faa6cdb60ded944cf0d8) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e319a624 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e319a624 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e319a624 Branch: refs/heads/branch-2.3 Commit: e319a624e2f366a941bd92a685e1b48504c887b1 Parents: 7b5da37 Author: Bryan Cutler Authored: Thu Sep 20 09:29:29 2018 +0800 Committer: hyukjinkwon Committed: Thu Sep 20 09:30:06 2018 +0800 -- python/pyspark/sql/tests.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e319a624/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index 6bfb329..3c5fc97 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -2885,7 +2885,7 @@ class SQLTests(ReusedSQLTestCase): import pandas as pd from datetime import datetime pdf = pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)], -"d": [pd.Timestamp.now().date()]}) +"d": [pd.Timestamp.now().date()]}, columns=["d", "ts"]) # test types are inferred correctly without specifying schema df = 
self.spark.createDataFrame(pdf) self.assertTrue(isinstance(df.schema['ts'].dataType, TimestampType))
spark git commit: [MINOR][PYTHON][TEST] Use collect() instead of show() to make the output silent
Repository: spark Updated Branches: refs/heads/master 0e31a6f25 -> 7ff5386ed [MINOR][PYTHON][TEST] Use collect() instead of show() to make the output silent ## What changes were proposed in this pull request? This PR replace an effective `show()` to `collect()` to make the output silent. **Before:** ``` test_simple_udt_in_df (pyspark.sql.tests.SQLTests) ... +---+--+ |key| val| +---+--+ | 0|[0.0, 0.0]| | 1|[1.0, 1.0]| | 2|[2.0, 2.0]| | 0|[3.0, 3.0]| | 1|[4.0, 4.0]| | 2|[5.0, 5.0]| | 0|[6.0, 6.0]| | 1|[7.0, 7.0]| | 2|[8.0, 8.0]| | 0|[9.0, 9.0]| +---+--+ ``` **After:** ``` test_simple_udt_in_df (pyspark.sql.tests.SQLTests) ... ok ``` ## How was this patch tested? Manually tested. Closes #22479 from HyukjinKwon/minor-udf-test. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7ff5386e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7ff5386e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7ff5386e Branch: refs/heads/master Commit: 7ff5386ed934190344b2cda1069bde4bc68a3e63 Parents: 0e31a6f Author: hyukjinkwon Authored: Thu Sep 20 15:03:16 2018 +0800 Committer: hyukjinkwon Committed: Thu Sep 20 15:03:16 2018 +0800 -- python/pyspark/sql/tests.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7ff5386e/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index 603f994..8724bbc 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -1168,7 +1168,7 @@ class SQLTests(ReusedSQLTestCase): df = self.spark.createDataFrame( [(i % 3, PythonOnlyPoint(float(i), float(i))) for i in range(10)], schema=schema) -df.show() +df.collect() def test_nested_udt_in_df(self): schema = StructType().add("key", LongType()).add("val", ArrayType(PythonOnlyUDT())) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional 
spark git commit: [MINOR][PYTHON][TEST] Use collect() instead of show() to make the output silent
Repository: spark Updated Branches: refs/heads/branch-2.4 dfcff3839 -> e07042a35 [MINOR][PYTHON][TEST] Use collect() instead of show() to make the output silent ## What changes were proposed in this pull request? This PR replace an effective `show()` to `collect()` to make the output silent. **Before:** ``` test_simple_udt_in_df (pyspark.sql.tests.SQLTests) ... +---+--+ |key| val| +---+--+ | 0|[0.0, 0.0]| | 1|[1.0, 1.0]| | 2|[2.0, 2.0]| | 0|[3.0, 3.0]| | 1|[4.0, 4.0]| | 2|[5.0, 5.0]| | 0|[6.0, 6.0]| | 1|[7.0, 7.0]| | 2|[8.0, 8.0]| | 0|[9.0, 9.0]| +---+--+ ``` **After:** ``` test_simple_udt_in_df (pyspark.sql.tests.SQLTests) ... ok ``` ## How was this patch tested? Manually tested. Closes #22479 from HyukjinKwon/minor-udf-test. Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon (cherry picked from commit 7ff5386ed934190344b2cda1069bde4bc68a3e63) Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e07042a3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e07042a3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e07042a3 Branch: refs/heads/branch-2.4 Commit: e07042a3593199f5045e1476b6b324f7f0901143 Parents: dfcff38 Author: hyukjinkwon Authored: Thu Sep 20 15:03:16 2018 +0800 Committer: hyukjinkwon Committed: Thu Sep 20 15:03:34 2018 +0800 -- python/pyspark/sql/tests.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e07042a3/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index 603f994..8724bbc 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -1168,7 +1168,7 @@ class SQLTests(ReusedSQLTestCase): df = self.spark.createDataFrame( [(i % 3, PythonOnlyPoint(float(i), float(i))) for i in range(10)], schema=schema) -df.show() +df.collect() def test_nested_udt_in_df(self): schema = StructType().add("key", LongType()).add("val", 
ArrayType(PythonOnlyUDT()))
spark git commit: [SPARKR] Match pyspark features in SparkR communication protocol
Repository: spark Updated Branches: refs/heads/branch-2.4 c64e7506d -> 36e7c8fcc [SPARKR] Match pyspark features in SparkR communication protocol Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/36e7c8fc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/36e7c8fc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/36e7c8fc Branch: refs/heads/branch-2.4 Commit: 36e7c8fcc1aeff0b15deb1243bd9615a202d320f Parents: c64e750 Author: hyukjinkwon Authored: Mon Sep 24 19:25:02 2018 +0800 Committer: hyukjinkwon Committed: Mon Sep 24 19:28:31 2018 +0800 -- R/pkg/R/context.R | 43 ++-- R/pkg/tests/fulltests/test_Serde.R | 32 +++ R/pkg/tests/fulltests/test_sparkSQL.R | 12 -- .../scala/org/apache/spark/api/r/RRDD.scala | 33 ++- .../scala/org/apache/spark/api/r/RUtils.scala | 4 ++ 5 files changed, 98 insertions(+), 26 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/36e7c8fc/R/pkg/R/context.R -- diff --git a/R/pkg/R/context.R b/R/pkg/R/context.R index f168ca7..e991367 100644 --- a/R/pkg/R/context.R +++ b/R/pkg/R/context.R @@ -167,18 +167,30 @@ parallelize <- function(sc, coll, numSlices = 1) { # 2-tuples of raws serializedSlices <- lapply(slices, serialize, connection = NULL) - # The PRC backend cannot handle arguments larger than 2GB (INT_MAX) + # The RPC backend cannot handle arguments larger than 2GB (INT_MAX) # If serialized data is safely less than that threshold we send it over the PRC channel. 
# Otherwise, we write it to a file and send the file name if (objectSize < sizeLimit) { jrdd <- callJStatic("org.apache.spark.api.r.RRDD", "createRDDFromArray", sc, serializedSlices) } else { -fileName <- writeToTempFile(serializedSlices) -jrdd <- tryCatch(callJStatic( -"org.apache.spark.api.r.RRDD", "createRDDFromFile", sc, fileName, as.integer(numSlices)), - finally = { -file.remove(fileName) -}) +if (callJStatic("org.apache.spark.api.r.RUtils", "getEncryptionEnabled", sc)) { + # the length of slices here is the parallelism to use in the jvm's sc.parallelize() + parallelism <- as.integer(numSlices) + jserver <- newJObject("org.apache.spark.api.r.RParallelizeServer", sc, parallelism) + authSecret <- callJMethod(jserver, "secret") + port <- callJMethod(jserver, "port") + conn <- socketConnection(port = port, blocking = TRUE, open = "wb", timeout = 1500) + doServerAuth(conn, authSecret) + writeToConnection(serializedSlices, conn) + jrdd <- callJMethod(jserver, "getResult") +} else { + fileName <- writeToTempFile(serializedSlices) + jrdd <- tryCatch(callJStatic( + "org.apache.spark.api.r.RRDD", "createRDDFromFile", sc, fileName, as.integer(numSlices)), +finally = { + file.remove(fileName) + }) +} } RDD(jrdd, "byte") @@ -194,14 +206,21 @@ getMaxAllocationLimit <- function(sc) { )) } +writeToConnection <- function(serializedSlices, conn) { + tryCatch({ +for (slice in serializedSlices) { + writeBin(as.integer(length(slice)), conn, endian = "big") + writeBin(slice, conn, endian = "big") +} + }, finally = { +close(conn) + }) +} + writeToTempFile <- function(serializedSlices) { fileName <- tempfile() conn <- file(fileName, "wb") - for (slice in serializedSlices) { -writeBin(as.integer(length(slice)), conn, endian = "big") -writeBin(slice, conn, endian = "big") - } - close(conn) + writeToConnection(serializedSlices, conn) fileName } http://git-wip-us.apache.org/repos/asf/spark/blob/36e7c8fc/R/pkg/tests/fulltests/test_Serde.R -- diff --git 
a/R/pkg/tests/fulltests/test_Serde.R b/R/pkg/tests/fulltests/test_Serde.R index 3577929..1525bdb 100644 --- a/R/pkg/tests/fulltests/test_Serde.R +++ b/R/pkg/tests/fulltests/test_Serde.R @@ -124,3 +124,35 @@ test_that("SerDe of list of lists", { }) sparkR.session.stop() + +# Note that this test should be at the end of tests since the configruations used here are not +# specific to sessions, and the Spark context is restarted. +test_that("createDataFrame large objects", { + for (encryptionEnabled in list("true", "false")) { +# To simulate a large object scenario, we set spark.r.maxAllocationLimit to a smaller value +conf <- list(spark.r.maxAllocationLimit = "100", + spark.io.encryption.enabled = encryptionEnabled) + +suppressWarnings(sparkR.session(master = sparkRTestMaster, +sparkConfig = conf, +
spark git commit: [SPARKR] Match pyspark features in SparkR communication protocol
Repository: spark Updated Branches: refs/heads/master c79072aaf -> c3b4a94a9 [SPARKR] Match pyspark features in SparkR communication protocol Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c3b4a94a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c3b4a94a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c3b4a94a Branch: refs/heads/master Commit: c3b4a94a91d66c172cf332321d3a78dba29ef8f0 Parents: c79072a Author: hyukjinkwon Authored: Mon Sep 24 19:25:02 2018 +0800 Committer: hyukjinkwon Committed: Mon Sep 24 19:25:02 2018 +0800 -- R/pkg/R/context.R | 43 ++-- R/pkg/tests/fulltests/test_Serde.R | 32 +++ R/pkg/tests/fulltests/test_sparkSQL.R | 12 -- .../scala/org/apache/spark/api/r/RRDD.scala | 33 ++- .../scala/org/apache/spark/api/r/RUtils.scala | 4 ++ 5 files changed, 98 insertions(+), 26 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c3b4a94a/R/pkg/R/context.R -- diff --git a/R/pkg/R/context.R b/R/pkg/R/context.R index f168ca7..e991367 100644 --- a/R/pkg/R/context.R +++ b/R/pkg/R/context.R @@ -167,18 +167,30 @@ parallelize <- function(sc, coll, numSlices = 1) { # 2-tuples of raws serializedSlices <- lapply(slices, serialize, connection = NULL) - # The PRC backend cannot handle arguments larger than 2GB (INT_MAX) + # The RPC backend cannot handle arguments larger than 2GB (INT_MAX) # If serialized data is safely less than that threshold we send it over the PRC channel. 
# Otherwise, we write it to a file and send the file name if (objectSize < sizeLimit) { jrdd <- callJStatic("org.apache.spark.api.r.RRDD", "createRDDFromArray", sc, serializedSlices) } else { -fileName <- writeToTempFile(serializedSlices) -jrdd <- tryCatch(callJStatic( -"org.apache.spark.api.r.RRDD", "createRDDFromFile", sc, fileName, as.integer(numSlices)), - finally = { -file.remove(fileName) -}) +if (callJStatic("org.apache.spark.api.r.RUtils", "getEncryptionEnabled", sc)) { + # the length of slices here is the parallelism to use in the jvm's sc.parallelize() + parallelism <- as.integer(numSlices) + jserver <- newJObject("org.apache.spark.api.r.RParallelizeServer", sc, parallelism) + authSecret <- callJMethod(jserver, "secret") + port <- callJMethod(jserver, "port") + conn <- socketConnection(port = port, blocking = TRUE, open = "wb", timeout = 1500) + doServerAuth(conn, authSecret) + writeToConnection(serializedSlices, conn) + jrdd <- callJMethod(jserver, "getResult") +} else { + fileName <- writeToTempFile(serializedSlices) + jrdd <- tryCatch(callJStatic( + "org.apache.spark.api.r.RRDD", "createRDDFromFile", sc, fileName, as.integer(numSlices)), +finally = { + file.remove(fileName) + }) +} } RDD(jrdd, "byte") @@ -194,14 +206,21 @@ getMaxAllocationLimit <- function(sc) { )) } +writeToConnection <- function(serializedSlices, conn) { + tryCatch({ +for (slice in serializedSlices) { + writeBin(as.integer(length(slice)), conn, endian = "big") + writeBin(slice, conn, endian = "big") +} + }, finally = { +close(conn) + }) +} + writeToTempFile <- function(serializedSlices) { fileName <- tempfile() conn <- file(fileName, "wb") - for (slice in serializedSlices) { -writeBin(as.integer(length(slice)), conn, endian = "big") -writeBin(slice, conn, endian = "big") - } - close(conn) + writeToConnection(serializedSlices, conn) fileName } http://git-wip-us.apache.org/repos/asf/spark/blob/c3b4a94a/R/pkg/tests/fulltests/test_Serde.R -- diff --git 
a/R/pkg/tests/fulltests/test_Serde.R b/R/pkg/tests/fulltests/test_Serde.R index 3577929..1525bdb 100644 --- a/R/pkg/tests/fulltests/test_Serde.R +++ b/R/pkg/tests/fulltests/test_Serde.R @@ -124,3 +124,35 @@ test_that("SerDe of list of lists", { }) sparkR.session.stop() + +# Note that this test should be at the end of tests since the configurations used here are not +# specific to sessions, and the Spark context is restarted. +test_that("createDataFrame large objects", { + for (encryptionEnabled in list("true", "false")) { +# To simulate a large object scenario, we set spark.r.maxAllocationLimit to a smaller value +conf <- list(spark.r.maxAllocationLimit = "100", + spark.io.encryption.enabled = encryptionEnabled) + +suppressWarnings(sparkR.session(master = sparkRTestMaster, +sparkConfig = conf, +
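The `writeToConnection` helper added in the SparkR patch above uses a simple wire format: each serialized slice is preceded by its length as a 4-byte big-endian integer (`writeBin(..., endian = "big")`). A minimal Python sketch of the same framing, useful for seeing what the JVM side has to parse; the function name `frame_slices` is illustrative and not part of either codebase:

```python
import struct

def frame_slices(slices):
    """Length-prefix each serialized slice with a 4-byte big-endian int,
    mirroring the wire format written by SparkR's writeToConnection."""
    out = bytearray()
    for s in slices:
        out += struct.pack(">i", len(s))  # 4-byte big-endian length header
        out += s                          # raw slice bytes
    return bytes(out)

framed = frame_slices([b"abc", b"defgh"])
```

With this framing, the receiver can read a fixed 4-byte header, then exactly that many payload bytes, and repeat until the connection closes.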
spark git commit: [MINOR][PYSPARK] Always Close the tempFile in _serialize_to_jvm
Repository: spark Updated Branches: refs/heads/branch-2.4 1303eb5c8 -> c64e7506d [MINOR][PYSPARK] Always Close the tempFile in _serialize_to_jvm ## What changes were proposed in this pull request? Always close the tempFile after `serializer.dump_stream(data, tempFile)` in _serialize_to_jvm ## How was this patch tested? N/A Closes #22523 from gatorsmile/fixMinor. Authored-by: gatorsmile Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c64e7506 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c64e7506 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c64e7506 Branch: refs/heads/branch-2.4 Commit: c64e7506dabaccc60f8140aeae589053645f23a6 Parents: 1303eb5 Author: gatorsmile Authored: Sun Sep 23 10:16:33 2018 +0800 Committer: hyukjinkwon Committed: Sun Sep 23 10:18:00 2018 +0800 -- python/pyspark/context.py | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c64e7506/python/pyspark/context.py -- diff --git a/python/pyspark/context.py b/python/pyspark/context.py index 87255c4..0924d3d 100644 --- a/python/pyspark/context.py +++ b/python/pyspark/context.py @@ -537,8 +537,10 @@ class SparkContext(object): # parallelize from there. tempFile = NamedTemporaryFile(delete=False, dir=self._temp_dir) try: -serializer.dump_stream(data, tempFile) -tempFile.close() +try: +serializer.dump_stream(data, tempFile) +finally: +tempFile.close() return reader_func(tempFile.name) finally: # we eagerily reads the file so we can delete right after. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
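The patch above nests a second `try/finally` inside the existing one so that the temp file handle is closed even when `dump_stream` raises. A self-contained sketch of that pattern, with hypothetical `dump`/`read` callables standing in for the serializer and `reader_func`:

```python
import os
import tempfile

def serialize_via_temp_file(data, dump, read):
    """Sketch of the corrected pattern in SparkContext._serialize_to_jvm:
    the inner try/finally closes the temp file even if dump() raises,
    and the outer finally deletes the file in every case."""
    temp = tempfile.NamedTemporaryFile(delete=False)
    try:
        try:
            dump(data, temp)   # may raise; the file handle is still closed
        finally:
            temp.close()
        return read(temp.name)
    finally:
        # the reader consumes the file eagerly, so delete it right after
        os.unlink(temp.name)

result = serialize_via_temp_file(
    b"hello",
    lambda d, fh: fh.write(d),
    lambda name: open(name, "rb").read(),
)
```

Without the inner `finally`, an exception in `dump` would skip `temp.close()` and leak the open handle until garbage collection, which is exactly what this minor fix prevents.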
spark git commit: [SPARK-25473][PYTHON][SS][TEST] ForeachWriter tests failed on Python 3.6 and macOS High Sierra
Repository: spark Updated Branches: refs/heads/master 0fbba76fa -> a72d118cd [SPARK-25473][PYTHON][SS][TEST] ForeachWriter tests failed on Python 3.6 and macOS High Sierra ## What changes were proposed in this pull request? This PR does not fix the problem itself; it just targets adding a few comments about running PySpark tests on Python 3.6 and macOS High Sierra, since the problem actually blocks running tests in this environment. It does not target fixing the problem yet. The problem here looks to be because we fork Python workers and the forked workers somehow call Objective-C libraries in some code in CPython's implementation. After debugging a while, I suspect `pickle` in Python 3.6 has some changes: https://github.com/apache/spark/blob/58419b92673c46911c25bc6c6b13397f880c6424/python/pyspark/serializers.py#L577 In particular, it also looks related to which objects are serialized or not. This link (http://sealiesoftware.com/blog/archive/2017/6/5/Objective-C_and_fork_in_macOS_1013.html) and this link (https://blog.phusion.nl/2017/10/13/why-ruby-app-servers-break-on-macos-high-sierra-and-what-can-be-done-about-it/) were helpful for me to understand this. I am still debugging this, but my gut says it is difficult to fix or work around on the Spark side. ## How was this patch tested? Manually tested: Before `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`: ``` /usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py:766: ResourceWarning: subprocess 27563 is still running ResourceWarning, source=self) [Stage 0:> (0 + 1) / 1]objc[27586]: +[__NSPlaceholderDictionary initialize] may have been in progress in another thread when fork() was called. objc[27586]: +[__NSPlaceholderDictionary initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug. 
ERROR == ERROR: test_streaming_foreach_with_simple_function (pyspark.sql.tests.SQLTests) -- Traceback (most recent call last): File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value format(target_id, ".", name), value) py4j.protocol.Py4JJavaError: An error occurred while calling o54.processAllAvailable. : org.apache.spark.sql.streaming.StreamingQueryException: Writing job aborted. === Streaming Query === Identifier: [id = f508d634-407c-4232-806b-70e54b055c42, runId = 08d1435b-5358-4fb6-b167-811584a3163e] Current Committed Offsets: {} Current Available Offsets: {FileStreamSource[file:/var/folders/71/484zt4z10ks1vydt03bhp6hrgp/T/tmpolebys1s]: {"logOffset":0}} Current State: ACTIVE Thread State: RUNNABLE Logical Plan: FileStreamSource[file:/var/folders/71/484zt4z10ks1vydt03bhp6hrgp/T/tmpolebys1s] at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) Caused by: org.apache.spark.SparkException: Writing job aborted. at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2Exec.scala:91) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) ``` After `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`: ``` test_streaming_foreach_with_simple_function (pyspark.sql.tests.SQLTests) ... ok ``` Closes #22480 from HyukjinKwon/SPARK-25473. 
Authored-by: hyukjinkwon Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a72d118c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a72d118c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a72d118c Branch: refs/heads/master Commit: a72d118cd96cd44d37cb8f8b6c444953a99aab3f Parents: 0fbba76 Author: hyukjinkwon Authored: Sun Sep 23 11:14:27 2018 +0800 Committer: hyukjinkwon Committed: Sun Sep 23 11:14:27 2018 +0800 -- python/pyspark/sql/tests.py | 3 +++ 1 file changed, 3 insertions(+) --
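As the test output above shows, exporting `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES` before launching the forking PySpark tests works around the macOS High Sierra fork-safety abort. A minimal invocation sketch; the `run-tests` command line is illustrative and assumes a Spark source checkout:

```shell
# Disable the Objective-C fork-safety check that crashes forked Python
# workers on macOS High Sierra and later.
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

# Illustrative test invocation (requires a Spark checkout):
# python/run-tests --modules=pyspark-sql

echo "$OBJC_DISABLE_INITIALIZE_FORK_SAFETY"
```

Note this only suppresses the safety check; the underlying fork-plus-Objective-C interaction the commit message describes is not changed.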
spark git commit: [MINOR][PYSPARK] Always Close the tempFile in _serialize_to_jvm
Repository: spark Updated Branches: refs/heads/master 6ca87eb2e -> 0fbba76fa [MINOR][PYSPARK] Always Close the tempFile in _serialize_to_jvm ## What changes were proposed in this pull request? Always close the tempFile after `serializer.dump_stream(data, tempFile)` in _serialize_to_jvm ## How was this patch tested? N/A Closes #22523 from gatorsmile/fixMinor. Authored-by: gatorsmile Signed-off-by: hyukjinkwon Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0fbba76f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0fbba76f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0fbba76f Branch: refs/heads/master Commit: 0fbba76faa00a18eef5d8c2ef2e673744d0d490b Parents: 6ca87eb Author: gatorsmile Authored: Sun Sep 23 10:16:33 2018 +0800 Committer: hyukjinkwon Committed: Sun Sep 23 10:16:33 2018 +0800 -- python/pyspark/context.py | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0fbba76f/python/pyspark/context.py -- diff --git a/python/pyspark/context.py b/python/pyspark/context.py index 87255c4..0924d3d 100644 --- a/python/pyspark/context.py +++ b/python/pyspark/context.py @@ -537,8 +537,10 @@ class SparkContext(object): # parallelize from there. tempFile = NamedTemporaryFile(delete=False, dir=self._temp_dir) try: -serializer.dump_stream(data, tempFile) -tempFile.close() +try: +serializer.dump_stream(data, tempFile) +finally: +tempFile.close() return reader_func(tempFile.name) finally: # we eagerily reads the file so we can delete right after. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org