GitHub user Suresh-Patibandla opened a pull request: https://github.com/apache/spark/pull/22248
TestPull

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22248.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22248

----
commit c5b8d54c61780af6e9e157e6c855718df972efad
Author: Chris Martin <chris@...>
Date: 2018-07-28T15:40:10Z

[SPARK-24950][SQL] DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

## What changes were proposed in this pull request?

- Update DateTimeUtilsSuite so that multiple skip dates can be specified when testing round-tripping in daysToMillis and millisToDays.
- Update the test so that both New Year's Eve 2014 and New Year's Day 2015 are skipped for Kiribati time zones. This is necessary because Java versions before 181-b13 considered New Year's Day 2015 to be skipped, while subsequent versions corrected this to New Year's Eve.

## How was this patch tested?

Unit tests

Author: Chris Martin <ch...@cmartinit.co.uk>

Closes #21901 from d80tb7/SPARK-24950_datetimeUtilsSuite_failures.

commit 8fe5d2c393f035b9e82ba42202421c9ba66d6c78
Author: Kazuaki Ishizaki <ishizaki@...>
Date: 2018-07-29T13:31:16Z

[SPARK-24956][Build][test-maven] Upgrade maven version to 3.5.4

## What changes were proposed in this pull request?

This PR updates the Maven version from 3.3.9 to 3.5.4. The current build process uses mvn 3.3.9, which was released in 2015 and looks pretty old.
We hit [an issue](https://issues.apache.org/jira/browse/SPARK-24895) that requires Maven 3.5.2 or later.

The release note of 3.5.4 is [here](https://maven.apache.org/docs/3.5.4/release-notes.html). Note that version 3.4 was skipped.

From [the release note of 3.5.0](https://maven.apache.org/docs/3.5.0/release-notes.html), the following are new features:
1. ANSI color logging for improved output visibility
1. add support for module name != artifactId in every calculated URLs (project, SCM, site): special project.directory property
1. create a slf4j-simple provider extension that supports level color rendering
1. ModelResolver interface enhancement: addition of resolveModel(Dependency) supporting version ranges

## How was this patch tested?

Existing tests

Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>

Closes #21905 from kiszk/SPARK-24956.

commit 2c54aae1bc2fa3da26917c89e6201fb2108d9fab
Author: liulijia <liutang123@...>
Date: 2018-07-29T20:13:00Z

[SPARK-24809][SQL] Serializing LongToUnsafeRowMap in executor may result in data error

When the join key is long or int in a broadcast join, Spark uses `LongToUnsafeRowMap` to store the key-values of the table which will be broadcast. But when `LongToUnsafeRowMap` is broadcast to executors and is too big to hold in memory, it is stored on disk. At that point, because `write` uses the variable `cursor` to determine how many bytes in the `page` of `LongToUnsafeRowMap` will be written out, and `cursor` was not restored when deserializing, the executor writes out nothing from the page to disk.

## What changes were proposed in this pull request?

Restore cursor value when deserializing.

Author: liulijia <liutang...@yeah.net>

Closes #21772 from liutang123/SPARK-24809.

commit 3695ba57731a669ed20e7f676edee602c292fbed
Author: Xingbo Jiang <xingbo.jiang@...>
Date: 2018-07-30T01:58:28Z

[MINOR][CORE][TEST] Fix afterEach() in TastSetManagerSuite and TaskSchedulerImplSuite

## What changes were proposed in this pull request?
In the `afterEach()` method of both `TastSetManagerSuite` and `TaskSchedulerImplSuite`, `super.afterEach()` shall be called at the end, because it shall stop the SparkContext. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93706/testReport/org.apache.spark.scheduler/TaskSchedulerImplSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ The test failure is caused by the above reason, the newly added `barrierCoordinator` required `rpcEnv` which has been stopped before `TaskSchedulerImpl` doing cleanup. ## How was this patch tested? Existing tests. Author: Xingbo Jiang <xingbo.ji...@databricks.com> Closes #21908 from jiangxb1987/afterEach. commit 3210121fed0ba256667f18f990c1a11d32c306ea Author: hyukjinkwon <gurwls223@...> Date: 2018-07-30T02:01:18Z [MINOR][BUILD] Remove -Phive-thriftserver profile within appveyor.yml ## What changes were proposed in this pull request? This PR propose to remove `-Phive-thriftserver` profile which seems not affecting the SparkR tests in AppVeyor. Originally wanted to check if there's a meaningful build time decrease but seems not. It will have but seems not meaningfully decreased. ## How was this patch tested? AppVeyor tests: ``` [00:40:49] Attaching package: 'SparkR' [00:40:49] [00:40:49] The following objects are masked from 'package:testthat': [00:40:49] [00:40:49] describe, not [00:40:49] [00:40:49] The following objects are masked from 'package:stats': [00:40:49] [00:40:49] cov, filter, lag, na.omit, predict, sd, var, window [00:40:49] [00:40:49] The following objects are masked from 'package:base': [00:40:49] [00:40:49] as.data.frame, colnames, colnames<-, drop, endsWith, intersect, [00:40:49] rank, rbind, sample, startsWith, subset, summary, transform, union [00:40:49] [00:40:49] Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:41:43] basic tests for CRAN: ............. 
[00:41:43] [00:41:43] DONE =========================================================================== [00:41:43] binary functions: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:42:05] ........... [00:42:05] functions on binary files: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:42:10] .... [00:42:10] broadcast variables: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:42:12] .. [00:42:12] functions in client.R: ..... [00:42:30] test functions in sparkR.R: .............................................. [00:42:30] include R packages: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:42:31] [00:42:31] JVM API: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:42:31] .. [00:42:31] MLlib classification algorithms, except for tree-based algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:48:48] ...................................................................... [00:48:48] MLlib clustering algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:50:12] ..................................................................... [00:50:12] MLlib frequent pattern mining: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:50:18] ..... [00:50:18] MLlib recommendation algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:50:27] ........ [00:50:27] MLlib regression algorithms, except for tree-based algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:56:00] ................................................................................................................................ [00:56:00] MLlib statistics algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:56:04] ........ [00:56:04] MLlib tree-based algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. 
[00:58:20] .............................................................................................. [00:58:20] parallelize() and collect(): Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:58:20] ............................. [00:58:20] basic RDD functions: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [01:03:35] ............................................................................................................................................................................................................................................................................................................................................................................................................................................ [01:03:35] SerDe functionality: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [01:03:39] ............................... [01:03:39] partitionBy, groupByKey, reduceByKey etc.: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [01:04:20] .................... [01:04:20] functions in sparkR.R: .... [01:04:20] SparkSQL functions: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [01:04:50] ........................................................................................................................................-chgrp: 'APPVYR-WIN\None' does not match expected pattern for group [01:04:50] Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH... [01:04:50] -chgrp: 'APPVYR-WIN\None' does not match expected pattern for group [01:04:50] Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH... [01:04:51] -chgrp: 'APPVYR-WIN\None' does not match expected pattern for group [01:04:51] Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH... 
[01:06:13] ............................................................................................................................................................................................................................................................................................................................................................-chgrp: 'APPVYR-WIN\None' does not match expected pattern for group [01:06:13] Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH... [01:06:14] .-chgrp: 'APPVYR-WIN\None' does not match expected pattern for group [01:06:14] Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH... [01:06:14] ....-chgrp: 'APPVYR-WIN\None' does not match expected pattern for group [01:06:14] Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH... [01:12:30] ................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................... [01:12:30] Structured Streaming: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [01:14:27] .......................................... [01:14:27] tests RDD function take(): Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [01:14:28] ................ [01:14:28] the textFile() function: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [01:14:44] ............. [01:14:44] functions in utils.R: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [01:14:46] ............................................ 
[01:14:46] Windows-specific tests: . [01:14:46] [01:14:46] DONE =========================================================================== [01:15:29] Build success ``` Author: hyukjinkwon <gurwls...@apache.org> Closes #21894 from HyukjinKwon/wip-build. commit 6690924c49a443cd629fcc1a4460cf443fb0a918 Author: hyukjinkwon <gurwls223@...> Date: 2018-07-30T02:02:29Z [MINOR] Avoid the 'latest' link that might vary per release in functions.scala's comment ## What changes were proposed in this pull request? This PR propose to address https://github.com/apache/spark/pull/21318#discussion_r187843125 comment. This is rather a nit but looks we better avoid to update the link for each release since it always points the latest (it doesn't look like worth enough updating release guide on the other hand as well). ## How was this patch tested? N/A Author: hyukjinkwon <gurwls...@apache.org> Closes #21907 from HyukjinKwon/minor-fix. commit 65a4bc143ab5dc2ced589dc107bbafa8a7290931 Author: Dilip Biswal <dbiswal@...> Date: 2018-07-30T05:11:01Z [SPARK-21274][SQL] Implement INTERSECT ALL clause ## What changes were proposed in this pull request? Implements INTERSECT ALL clause through query rewrites using existing operators in Spark. Please refer to [Link](https://drive.google.com/open?id=1nyW0T0b_ajUduQoPgZLAsyHK8s3_dko3ulQuxaLpUXE) for the design. Input Query ``` SQL SELECT c1 FROM ut1 INTERSECT ALL SELECT c1 FROM ut2 ``` Rewritten Query ```SQL SELECT c1 FROM ( SELECT replicate_row(min_count, c1) FROM ( SELECT c1, IF (vcol1_cnt > vcol2_cnt, vcol2_cnt, vcol1_cnt) AS min_count FROM ( SELECT c1, count(vcol1) as vcol1_cnt, count(vcol2) as vcol2_cnt FROM ( SELECT c1, true as vcol1, null as vcol2 FROM ut1 UNION ALL SELECT c1, null as vcol1, true as vcol2 FROM ut2 ) AS union_all GROUP BY c1 HAVING vcol1_cnt >= 1 AND vcol2_cnt >= 1 ) ) ) ``` ## How was this patch tested? 
Added test cases in SQLQueryTestSuite, DataFrameSuite, SetOperationSuite

Author: Dilip Biswal <dbis...@us.ibm.com>

Closes #21886 from dilipbiswal/dkb_intersect_all_final.

commit bfe60fcdb49aa48534060c38e36e06119900140d
Author: hyukjinkwon <gurwls223@...>
Date: 2018-07-30T05:20:03Z

[SPARK-24934][SQL] Explicitly whitelist supported types in upper/lower bounds for in-memory partition pruning

## What changes were proposed in this pull request?

It looks like we intentionally set `null` for the upper/lower bounds of complex types and don't use them. However, these bounds appear to be used in in-memory partition pruning, which produces incorrect results. This PR proposes to explicitly whitelist the supported types.

```scala
// Array column: the filter should keep the row [c, d].
val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol")
df.cache().filter("arrayCol > array('a', 'b')").show()
```

```scala
// Binary column: the filter should keep the single row.
val df = sql("select cast('a' as binary) as a")
df.cache().filter("a == cast('a' as binary)").show()
```

**Before:**

```
+--------+
|arrayCol|
+--------+
+--------+
```

```
+---+
|  a|
+---+
+---+
```

**After:**

```
+--------+
|arrayCol|
+--------+
|  [c, d]|
+--------+
```

```
+----+
|   a|
+----+
|[61]|
+----+
```

## How was this patch tested?

Unit tests were added and manually tested.

Author: hyukjinkwon <gurwls...@apache.org>

Closes #21882 from HyukjinKwon/stats-filter.

commit 85505fc8a58ca229bbaf240c6bc23ea876d594db
Author: Marco Gaido <marcogaido91@...>
Date: 2018-07-30T12:53:45Z

[SPARK-24957][SQL] Average with decimal followed by aggregation returns wrong result

## What changes were proposed in this pull request?

When we compute an average, the result is obtained by dividing the sum of the values by their count. When the result is a DecimalType, the way we cast and manage the precision and scale is not really optimized and is not consistent with what we normally do.
In particular, a problem can happen when the `Divide` expression returns a result whose precision and scale differ from the ones expected as the output of `Divide`. In the case reported in the JIRA, for instance, the result of `Divide` is a `Decimal(38, 36)`, while the output data type for `Divide` is `Decimal(38, 22)`. This is not an issue when `Divide` is followed by a `CheckOverflow` or a `Cast` to the right data type, as these operations return a decimal with the defined precision and scale. Although the `Average` operator does have a `Cast`, it may be bypassed if the result of `Divide` already has the type it is cast to, hence the issue reported in the JIRA may arise.

The PR proposes to use the normal rules/handling of the arithmetic operators with the Decimal data type, so that we both reuse the existing code (keeping a single logic for operations between decimals) and fix this problem, as the result is always guarded by `CheckOverflow`.

## How was this patch tested?

Added UT.

Author: Marco Gaido <marcogaid...@gmail.com>

Closes #21910 from mgaido91/SPARK-24957.

commit fca0b8528e704cfe62863a34f8bb5dcee850b046
Author: hyukjinkwon <gurwls223@...>
Date: 2018-07-30T13:13:08Z

[SPARK-24967][SQL] Avro: Use internal.Logging instead for logging

## What changes were proposed in this pull request?

It looks like Avro calls `getLogger` directly to create an SLF4J logger. It would be better to use `internal.Logging` instead.

## How was this patch tested?

Existing tests.

Author: hyukjinkwon <gurwls...@apache.org>

Closes #21914 from HyukjinKwon/avro-log.

commit b90bfe3c42eb9b51e6131a8f8923bcddfccd75bb
Author: Gengliang Wang <gengliang.wang@...>
Date: 2018-07-30T14:30:47Z

[SPARK-24771][BUILD] Upgrade Apache AVRO to 1.8.2

## What changes were proposed in this pull request?

Upgrade Apache Avro from 1.7.7 to 1.8.2. The major new features:

1. More logical types.
   From the spec of 1.8.2 https://avro.apache.org/docs/1.8.2/spec.html#Logical+Types we can see that, compared to [1.7.7](https://avro.apache.org/docs/1.7.7/spec.html#Logical+Types), the new version supports:
   - Date
   - Time (millisecond precision)
   - Time (microsecond precision)
   - Timestamp (millisecond precision)
   - Timestamp (microsecond precision)
   - Duration

2. Single-object encoding: https://avro.apache.org/docs/1.8.2/spec.html#single_object_encoding

This PR aims to update Apache Spark to support these new features.

## How was this patch tested?

Unit test

Author: Gengliang Wang <gengliang.w...@databricks.com>

Closes #21761 from gengliangwang/upgrade_avro_1.8.

commit 47d84e4d0e56e14f9402770dceaf0b4302c00e98
Author: Takeshi Yamamuro <yamamuro@...>
Date: 2018-07-30T14:42:00Z

[SPARK-22814][SQL] Support Date/Timestamp in a JDBC partition column

## What changes were proposed in this pull request?

This PR supports Date/Timestamp in a JDBC partition column (only a numeric column is supported in master).
This pr also modified code to verify a partition column type; ``` val jdbcTable = spark.read .option("partitionColumn", "text") .option("lowerBound", "aaa") .option("upperBound", "zzz") .option("numPartitions", 2) .jdbc("jdbc:postgresql:postgres", "t", options) // with this pr org.apache.spark.sql.AnalysisException: Partition column type should be numeric, date, or timestamp, but string found.; at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.verifyAndGetNormalizedPartitionColumn(JDBCRelation.scala:165) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.columnPartition(JDBCRelation.scala:85) at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:36) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:317) // without this pr java.lang.NumberFormatException: For input string: "aaa" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:589) at java.lang.Long.parseLong(Long.java:631) at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:277) ``` Closes #19999 ## How was this patch tested? Added tests in `JDBCSuite`. Author: Takeshi Yamamuro <yamam...@apache.org> Closes #21834 from maropu/SPARK-22814. commit d6b7545b5f495a496d40a982e0ab0f8053e1a4f5 Author: mcheah <mcheah@...> Date: 2018-07-30T18:41:02Z [SPARK-24963][K8S][TESTS] Don't set service account name for client mode test ## What changes were proposed in this pull request? Don't set service account name for the pod created in client mode ## How was this patch tested? Test should continue running smoothly in Jenkins. Author: mcheah <mch...@palantir.com> Closes #21900 from mccheah/fix-integration-test-service-account. 
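As a rough illustration of what a Date partition column (SPARK-22814 above) makes possible, the sketch below splits a date range into per-partition `WHERE` predicates. This is a simplified pure-Python sketch, not Spark's `JDBCRelation` code: the function name `date_partition_predicates`, the stride rounding, and the single-partition fallback are all assumptions made for illustration.

```python
from datetime import date, timedelta


def date_partition_predicates(column, lower, upper, num_partitions):
    """Split [lower, upper) into ranges and return one WHERE clause per range.

    Hypothetical helper for illustration only. The first and last ranges are
    left open-ended so that rows outside the bounds (and NULLs) still land in
    some partition.
    """
    if num_partitions <= 1 or lower >= upper:
        return ["1=1"]  # assumed fallback: one unfiltered partition
    total_days = (upper - lower).days
    stride = max(total_days // num_partitions, 1)  # days per partition
    predicates = []
    current = lower
    for i in range(num_partitions):
        nxt = current + timedelta(days=stride)
        if i == 0:
            predicates.append(f"{column} < '{nxt}' OR {column} IS NULL")
        elif i == num_partitions - 1:
            predicates.append(f"{column} >= '{current}'")
        else:
            predicates.append(f"{column} >= '{current}' AND {column} < '{nxt}'")
        current = nxt
    return predicates


if __name__ == "__main__":
    for p in date_partition_predicates("d", date(2018, 1, 1), date(2018, 1, 31), 3):
        print(p)
```

Each predicate would be attached to one partition's query, so the bounds only control how rows are spread across partitions and do not filter any rows out.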
commit abbb4ab4d8b12ba2d94b16407c0d62ae207ee4fa Author: Reynold Xin <rxin@...> Date: 2018-07-30T21:05:45Z [SPARK-24865][SQL] Remove AnalysisBarrier addendum ## What changes were proposed in this pull request? I didn't want to pollute the diff in the previous PR and left some TODOs. This is a follow-up to address those TODOs. ## How was this patch tested? Should be covered by existing tests. Author: Reynold Xin <r...@databricks.com> Closes #21896 from rxin/SPARK-24865-addendum. commit 2fbe294cf01f78b34498553d9228b57e2f992bce Author: mcheah <mcheah@...> Date: 2018-07-30T22:57:54Z [SPARK-24963][K8S][TESTS] Add user-specified service account name for client mode test driver pod ## What changes were proposed in this pull request? Adds the user-set service account name for the driver pod in the client mode integration test ## How was this patch tested? Manual test against a custom Kubernetes cluster Author: mcheah <mch...@palantir.com> Closes #21924 from mccheah/fix-service-account. commit d20c10fdf382acf43a7e6a541923bd078e19ca75 Author: Maxim Gekk <maxim.gekk@...> Date: 2018-07-31T01:12:57Z [SPARK-24952][SQL] Support LZMA2 compression by Avro datasource ## What changes were proposed in this pull request? In the PR, I propose to support `LZMA2` (`XZ`) and `BZIP2` compressions by `AVRO` datasource in write since the codecs may have better characteristics like compression ratio and speed comparing to already supported `snappy` and `deflate` codecs. ## How was this patch tested? It was tested manually and by an existing test which was extended to check the `xz` and `bzip2` compressions. Author: Maxim Gekk <maxim.g...@databricks.com> Closes #21902 from MaxGekk/avro-xz-bzip2. commit f1550aaf1506c0115c8d95cd8bc784ed6c734ea5 Author: hyukjinkwon <gurwls223@...> Date: 2018-07-31T01:14:29Z [SPARK-24956][BUILD][FOLLOWUP] Upgrade Maven version to 3.5.4 for AppVeyor as well ## What changes were proposed in this pull request? 
The Maven version was upgraded, and AppVeyor should also use the upgraded Maven version.

Currently, it looks broken by this:

https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/2458-master

```
[WARNING] Rule 0: org.apache.maven.plugins.enforcer.RequireMavenVersion failed with message:
Detected Maven Version: 3.3.9 is not in the allowed range 3.5.4.
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
```

## How was this patch tested?

AppVeyor tests

Author: hyukjinkwon <gurwls...@apache.org>

Closes #21920 from HyukjinKwon/SPARK-24956.

commit 8141d55926e95c06cd66bf82098895e1ed419449
Author: Li Jin <ice.xelloss@...>
Date: 2018-07-31T02:10:38Z

[SPARK-23633][SQL] Update Pandas UDFs section in sql-programming-guide

## What changes were proposed in this pull request?

Update Pandas UDFs section in sql-programming-guide. Add section for grouped aggregate pandas UDF.

## How was this patch tested?

Author: Li Jin <ice.xell...@gmail.com>

Closes #21887 from icexelloss/SPARK-23633-sql-programming-guide.

commit b4fd75fb9b615cfe592ad269cf20d02b483a0d33
Author: maryannxue <maryannxue@...>
Date: 2018-07-31T06:43:53Z

[SPARK-24972][SQL] PivotFirst could not handle pivot columns of complex types

## What changes were proposed in this pull request?

When the pivot column is of a complex type, the eval() result will be an UnsafeRow, while the keys of the HashMap used for column value matching are GenericInternalRows. As a result, there will be no match and the result will always be empty.
So for a pivot column of a complex type, we should:
1) If the complex type is not comparable (orderable), throw an Exception. It cannot be a pivot column.
2) Otherwise, if it goes through the `PivotFirst` code path, `PivotFirst` should use a TreeMap instead of a HashMap for such columns.

This PR has also reverted the workaround in Analyzer that had been introduced to avoid this `PivotFirst` issue.

## How was this patch tested?

Added UT.
Author: maryannxue <maryann...@apache.org> Closes #21926 from maryannxue/pivot_followup. commit 4ac2126bc64bad1b4cbe1c697b4bcafacd67c96c Author: Mauro Palsgraaf <mauropalsgraaf@...> Date: 2018-07-31T15:18:08Z [SPARK-24536] Validate that an evaluated limit clause cannot be null ## What changes were proposed in this pull request? It proposes a version in which nullable expressions are not valid in the limit clause ## How was this patch tested? It was tested with unit and e2e tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Mauro Palsgraaf <mauropalsgr...@hotmail.com> Closes #21807 from mauropalsgraaf/SPARK-24536. commit 1223a201fcb2c2f211ad96997ebb00c3554aa822 Author: zhengruifeng <ruifengz@...> Date: 2018-07-31T18:37:13Z [SPARK-24609][ML][DOC] PySpark/SparkR doc doesn't explain RandomForestClassifier.featureSubsetStrategy well ## What changes were proposed in this pull request? update doc of RandomForestClassifier.featureSubsetStrategy ## How was this patch tested? local built doc rdoc: ![default](https://user-images.githubusercontent.com/7322292/42807787-4dda6362-89e4-11e8-839f-a8519b7c1f1c.png) pydoc: ![default](https://user-images.githubusercontent.com/7322292/43112817-5f1d4d88-8f2a-11e8-93ff-de90db8afdca.png) Author: zhengruifeng <ruife...@foxmail.com> Closes #21788 from zhengruifeng/rf_doc_py_r. commit e82784d13fac7d45164dfadb00d3fa43e64e0bde Author: tedyu <yuzhihong@...> Date: 2018-07-31T20:14:14Z [SPARK-18057][SS] Update Kafka client version from 0.10.0.1 to 2.0.0 ## What changes were proposed in this pull request? This PR upgrades to the Kafka 2.0.0 release where KIP-266 is integrated. ## How was this patch tested? This PR uses existing Kafka related unit tests (Please explain how this patch was tested. E.g. 
unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. Author: tedyu <yuzhih...@gmail.com> Closes #21488 from tedyu/master. commit 42dfe4f1593767eae355e27bf969339f4ab03f56 Author: Huaxin Gao <huaxing@...> Date: 2018-07-31T20:23:11Z [SPARK-24973][PYTHON] Add numIter to Python ClusteringSummary ## What changes were proposed in this pull request? Add numIter to Python version of ClusteringSummary ## How was this patch tested? Modified existing UT test_multiclass_logistic_regression_summary Author: Huaxin Gao <huax...@us.ibm.com> Closes #21925 from huaxingao/spark-24973. commit f4772fd26f32b11ae54e7721924b5cf6eb27298a Author: hyukjinkwon <gurwls223@...> Date: 2018-08-01T00:24:24Z [SPARK-24976][PYTHON] Allow None for Decimal type conversion (specific to PyArrow 0.9.0) ## What changes were proposed in this pull request? See [ARROW-2432](https://jira.apache.org/jira/browse/ARROW-2432). 
Seems using `from_pandas` to convert decimals fails if encounters a value of `None`: ```python import pyarrow as pa import pandas as pd from decimal import Decimal pa.Array.from_pandas(pd.Series([Decimal('3.14'), None]), type=pa.decimal128(3, 2)) ``` **Arrow 0.8.0** ``` <pyarrow.lib.Decimal128Array object at 0x10a572c58> [ Decimal('3.14'), NA ] ``` **Arrow 0.9.0** ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas File "array.pxi", line 177, in pyarrow.lib.array File "error.pxi", line 77, in pyarrow.lib.check_status File "error.pxi", line 77, in pyarrow.lib.check_status pyarrow.lib.ArrowInvalid: Error converting from Python objects to Decimal: Got Python object of type NoneType but can only handle these types: decimal.Decimal ``` This PR propose to work around this via Decimal NaN: ```python pa.Array.from_pandas(pd.Series([Decimal('3.14'), Decimal('NaN')]), type=pa.decimal128(3, 2)) ``` ``` <pyarrow.lib.Decimal128Array object at 0x10ffd2e68> [ Decimal('3.14'), NA ] ``` ## How was this patch tested? Manually tested: ```bash SPARK_TESTING=1 ./bin/pyspark pyspark.sql.tests ScalarPandasUDFTests ``` **Before** ``` Traceback (most recent call last): File "/.../spark/python/pyspark/sql/tests.py", line 4672, in test_vectorized_udf_null_decimal self.assertEquals(df.collect(), res.collect()) File "/.../spark/python/pyspark/sql/dataframe.py", line 533, in collect sock_info = self._jdf.collectToPython() File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__ answer, self.gateway_client, self.target_id, self.name) File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value format(target_id, ".", name), value) Py4JJavaError: An error occurred while calling o51.collectToPython. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 (TID 7, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/.../spark/python/pyspark/worker.py", line 320, in main process() File "/.../spark/python/pyspark/worker.py", line 315, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/.../spark/python/pyspark/serializers.py", line 274, in dump_stream batch = _create_batch(series, self._timezone) File "/.../spark/python/pyspark/serializers.py", line 243, in _create_batch arrs = [create_array(s, t) for s, t in series] File "/.../spark/python/pyspark/serializers.py", line 241, in create_array return pa.Array.from_pandas(s, mask=mask, type=t) File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas File "array.pxi", line 177, in pyarrow.lib.array File "error.pxi", line 77, in pyarrow.lib.check_status File "error.pxi", line 77, in pyarrow.lib.check_status ArrowInvalid: Error converting from Python objects to Decimal: Got Python object of type NoneType but can only handle these types: decimal.Decimal ``` **After** ``` Running tests... ---------------------------------------------------------------------- Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). .......S............................. ---------------------------------------------------------------------- Ran 37 tests in 21.980s ``` Author: hyukjinkwon <gurwls...@apache.org> Closes #21928 from HyukjinKwon/SPARK-24976. commit 5f3441e542bfacd81d70bd8b34c22044c8928bff Author: DB Tsai <d_tsai@...> Date: 2018-08-01T02:31:02Z [SPARK-24893][SQL] Remove the entire CaseWhen if all the outputs are semantic equivalence ## What changes were proposed in this pull request? 
Similar to SPARK-24890, if all the outputs of `CaseWhen` are semantically equivalent, `CaseWhen` can be removed.

## How was this patch tested?

Tests added.

Author: DB Tsai <d_t...@apple.com>

Closes #21852 from dbtsai/short-circuit-when.

commit 1f7e22c72c89fc2c0e729dde0948bc6bdf8f7628
Author: Reynold Xin <rxin@...>
Date: 2018-08-01T05:25:40Z

[SPARK-24951][SQL] Table valued functions should throw AnalysisException

## What changes were proposed in this pull request?

Previously, TVF resolution could throw IllegalArgumentException if the data type is the null type. This patch replaces that exception with AnalysisException, enriched with positional information, to improve error message reporting and to be more consistent with the rest of Spark SQL.

## How was this patch tested?

Updated the test case in table-valued-functions.sql.out, which is how I identified this problem in the first place.

Author: Reynold Xin <r...@databricks.com>

Closes #21934 from rxin/SPARK-24951.

commit 1efffb7993ecebe5dc1f9ebd924e7503bfd9668c
Author: Reynold Xin <rxin@...>
Date: 2018-08-01T07:15:31Z

[SPARK-24982][SQL] UDAF resolution should not throw AssertionError

## What changes were proposed in this pull request?

When a user calls a UDAF with the wrong number of arguments, Spark previously threw an AssertionError, which is not supposed to be a user-facing exception. This patch updates it to throw AnalysisException instead, so it is consistent with a regular UDF.

## How was this patch tested?

Updated test case udaf.sql.

Author: Reynold Xin <r...@databricks.com>

Closes #21938 from rxin/SPARK-24982.

commit 1122754bd9c5aa1b434c2b0ad856bc8511cd2ee2
Author: Marcelo Vanzin <vanzin@...>
Date: 2018-08-01T07:47:46Z

[SPARK-24653][TESTS] Avoid cross-job pollution in TestUtils / SpillListener.

There is a narrow race in this code that is caused when the code being run in assertSpilled / assertNotSpilled runs more than a single job.
SpillListener assumed that only a single job was run, and so would only block waiting for that single job to finish when `numSpilledStages` was called. But some tests (like SQL tests that call `checkAnswer`) run more than one job, and so that wait was basically a no-op.

This could cause the listener installed by the next test to receive events from the previous job, which could cause test failures in certain cases.

The change fixes that race, and also uninstalls listeners after the test runs, so they don't accumulate when the SparkContext is shared among multiple tests.

Author: Marcelo Vanzin <van...@cloudera.com>

Closes #21639 from vanzin/SPARK-24653.

commit defc54c69aadc510c6f77e13e57f003646c461bc
Author: Wenchen Fan <wenchen@...>
Date: 2018-08-01T13:39:35Z

[SPARK-24971][SQL] remove SupportsDeprecatedScanRow

## What changes were proposed in this pull request?

This is a follow-up to https://github.com/apache/spark/pull/21118 .

In https://github.com/apache/spark/pull/21118 we added `SupportsDeprecatedScanRow`. Ideally, data sources should produce `InternalRow` instead of `Row` for better performance. We should remove `SupportsDeprecatedScanRow` and encourage data sources to produce `InternalRow`, which is also very easy to build.

## How was this patch tested?

Existing tests.

Author: Wenchen Fan <wenc...@databricks.com>

Closes #21921 from cloud-fan/row.

commit 95a9d5e3a5ad22c2126fee0ffc7fc789edd18a59
Author: Kazuaki Ishizaki <ishizaki@...>
Date: 2018-08-01T18:52:30Z

[SPARK-23915][SQL] Add array_except function

## What changes were proposed in this pull request?

The PR adds the SQL function `array_except`. The behavior of the function is based on Presto's.

This function returns an array of the elements in array1 but not in array2.

Note: The order of elements in the result is not defined.

## How was this patch tested?

Added UTs.

Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>

Closes #21103 from kiszk/SPARK-23915.
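
To make the `array_except` semantics from SPARK-23915 above concrete, here is a minimal pure-Python sketch. It is not Spark's implementation: the helper name `array_except_sketch` is hypothetical, it assumes duplicates are dropped from the result, and it assumes the elements are hashable. It happens to keep first-occurrence order, although the commit message states that the result order is not defined.

```python
def array_except_sketch(array1, array2):
    """Return the distinct elements of array1 that do not appear in array2."""
    excluded = set(array2)   # elements to leave out
    seen = set()             # elements already emitted (assumed de-duplication)
    result = []
    for item in array1:
        if item not in excluded and item not in seen:
            seen.add(item)
            result.append(item)
    return result


if __name__ == "__main__":
    print(array_except_sketch([1, 2, 2, 3, 4], [2, 4, 5]))
```

Callers should not rely on the order of the returned elements, since the SQL function makes no ordering guarantee.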
----