[GitHub] spark pull request #22248: TestPull

Suresh-Patibandla Mon, 27 Aug 2018 11:54:02 -0700

GitHub user Suresh-Patibandla opened a pull request:

    https://github.com/apache/spark/pull/22248


    TestPull

    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22248.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22248
    
----
commit c5b8d54c61780af6e9e157e6c855718df972efad
Author: Chris Martin <chris@...>
Date:   2018-07-28T15:40:10Z

    [SPARK-24950][SQL] DateTimeUtilsSuite daysToMillis and millisToDays fails 
w/java 8 181-b13
    
    ## What changes were proposed in this pull request?
    
    - Update DateTimeUtilsSuite so that, when testing round-tripping in 
daysToMillis and millisToDays, multiple skip dates can be specified.
    - Update the test so that both New Year's Eve 2014 and New Year's Day 2015 
are skipped for Kiribati time zones. This is necessary because Java versions 
before 181-b13 considered New Year's Day 2015 to be skipped, while subsequent 
versions corrected this to New Year's Eve.
    
    ## How was this patch tested?
    Unit tests
    
    Author: Chris Martin <ch...@cmartinit.co.uk>
    
    Closes #21901 from d80tb7/SPARK-24950_datetimeUtilsSuite_failures.

commit 8fe5d2c393f035b9e82ba42202421c9ba66d6c78
Author: Kazuaki Ishizaki <ishizaki@...>
Date:   2018-07-29T13:31:16Z

    [SPARK-24956][Build][test-maven] Upgrade maven version to 3.5.4
    
    ## What changes were proposed in this pull request?
    
    This PR updates the Maven version from 3.3.9 to 3.5.4. The current build 
process uses mvn 3.3.9, which was released in 2015 and looks pretty old.
    We hit [an issue](https://issues.apache.org/jira/browse/SPARK-24895) that 
requires Maven 3.5.2 or later.
    
    The release notes for 3.5.4 are 
[here](https://maven.apache.org/docs/3.5.4/release-notes.html). Note that 
version 3.4 was skipped.
    
    According to [the release notes for 
3.5.0](https://maven.apache.org/docs/3.5.0/release-notes.html), the following 
are new features:
    1. ANSI color logging for improved output visibility
    1. add support for module name != artifactId in every calculated URLs 
(project, SCM, site): special project.directory property
    1. create a slf4j-simple provider extension that supports level color 
rendering
    1. ModelResolver interface enhancement: addition of 
resolveModel(Dependency) supporting version ranges
    
    ## How was this patch tested?
    
    Existing tests
    
    Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>
    
    Closes #21905 from kiszk/SPARK-24956.

commit 2c54aae1bc2fa3da26917c89e6201fb2108d9fab
Author: liulijia <liutang123@...>
Date:   2018-07-29T20:13:00Z

    [SPARK-24809][SQL] Serializing LongToUnsafeRowMap in executor may result in 
data error
    
    When the join key is long or int in a broadcast join, Spark uses 
`LongToUnsafeRowMap` to store the key-values of the table which will be 
broadcast. But when `LongToUnsafeRowMap` is broadcast to executors and it 
is too big to hold in memory, it is stored on disk. At that point, because 
`write` uses the variable `cursor` to determine how many bytes of the `page` of 
`LongToUnsafeRowMap` will be written out, and `cursor` was not restored when 
deserializing, the executor writes nothing from the page to disk.
    
    ## What changes were proposed in this pull request?
    Restore cursor value when deserializing.
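
To make the failure mode concrete, here is a minimal Python sketch. It is not Spark code: `PagedMap`, `roundtrip_twice`, and the length-prefixed layout are hypothetical stand-ins chosen for illustration. It only shows why a `write` that depends on `cursor` emits nothing when `cursor` is not restored on deserialization.

```python
import io
import struct


class PagedMap:
    """Hypothetical stand-in for a map that keeps its rows in one byte page.

    `cursor` marks how many bytes of `page` are in use, so `write`
    only emits page[:cursor].
    """

    def __init__(self, capacity=64):
        self.page = bytearray(capacity)
        self.cursor = 0

    def append(self, payload):
        end = self.cursor + len(payload)
        if end > len(self.page):
            raise ValueError("page is full")
        self.page[self.cursor:end] = payload
        self.cursor = end

    def write(self, out):
        # Length prefix, then only the used part of the page.
        out.write(struct.pack(">I", self.cursor))
        out.write(bytes(self.page[:self.cursor]))

    @classmethod
    def read(cls, source, restore_cursor=True):
        (used,) = struct.unpack(">I", source.read(4))
        restored = cls(capacity=max(used, 1))
        restored.page[:used] = source.read(used)
        if restore_cursor:
            # Without this assignment a later write() sees cursor == 0
            # and emits an empty page.
            restored.cursor = used
        return restored


def roundtrip_twice(restore_cursor):
    """Serialize, deserialize, serialize again; return page bytes written."""
    original = PagedMap()
    original.append(b"key-values")
    first = io.BytesIO()
    original.write(first)
    first.seek(0)
    copy = PagedMap.read(first, restore_cursor=restore_cursor)
    second = io.BytesIO()
    copy.write(second)
    return len(second.getvalue()) - 4  # subtract the length prefix
```

Calling `roundtrip_twice(False)` models the bug (the second write carries no page bytes), while `roundtrip_twice(True)` models the behavior after the fix.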
    
    Author: liulijia <liutang...@yeah.net>
    
    Closes #21772 from liutang123/SPARK-24809.

commit 3695ba57731a669ed20e7f676edee602c292fbed
Author: Xingbo Jiang <xingbo.jiang@...>
Date:   2018-07-30T01:58:28Z

    [MINOR][CORE][TEST] Fix afterEach() in TaskSetManagerSuite and 
TaskSchedulerImplSuite
    
    ## What changes were proposed in this pull request?
    
    In the `afterEach()` method of both `TaskSetManagerSuite` and 
`TaskSchedulerImplSuite`, `super.afterEach()` must be called at the end, 
because it stops the SparkContext.
    
    
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93706/testReport/org.apache.spark.scheduler/TaskSchedulerImplSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/
    The test failure linked above has this cause: the newly added 
`barrierCoordinator` requires `rpcEnv`, which had already been stopped before 
`TaskSchedulerImpl` did its cleanup.
    
    ## How was this patch tested?
    Existing tests.
    
    Author: Xingbo Jiang <xingbo.ji...@databricks.com>
    
    Closes #21908 from jiangxb1987/afterEach.
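
The ordering issue in the commit above can be sketched in Python. This is an illustration only: `FakeContext`, `BaseSuite`, and `SchedulerSuite` are hypothetical names, not ScalaTest or Spark classes. The point is that suite-specific cleanup runs first and the parent teardown, which stops the shared context, runs last.

```python
class FakeContext:
    """Hypothetical stand-in for a SparkContext and its RPC environment."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


class BaseSuite:
    """Plays the role of the shared base suite that owns the context."""

    def before_each(self):
        self.context = FakeContext()

    def after_each(self):
        self.context.stop()


class SchedulerSuite(BaseSuite):
    """Suite whose own cleanup still needs a live context."""

    def __init__(self, super_first):
        self.super_first = super_first
        self.log = []

    def _cleanup_scheduler(self):
        if self.context.stopped:
            raise RuntimeError("scheduler cleanup needs a running context")
        self.log.append("scheduler stopped")

    def after_each(self):
        if self.super_first:
            super().after_each()       # context is stopped too early
            self._cleanup_scheduler()  # raises RuntimeError
        else:
            self._cleanup_scheduler()  # suite-specific cleanup first
            super().after_each()       # base class stops the context last
```

With `super_first=False` the teardown completes; with `super_first=True` the cleanup fails because the context is already gone.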

commit 3210121fed0ba256667f18f990c1a11d32c306ea
Author: hyukjinkwon <gurwls223@...>
Date:   2018-07-30T02:01:18Z

    [MINOR][BUILD] Remove -Phive-thriftserver profile within appveyor.yml
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to remove the `-Phive-thriftserver` profile, which does not 
seem to affect the SparkR tests in AppVeyor.
    
    I originally wanted to check whether this meaningfully decreases the build 
time, but it does not seem to. There may be some decrease, but not a meaningful 
one.
    
    ## How was this patch tested?
    
    AppVeyor tests:
    
    ```
    [00:40:49] Attaching package: 'SparkR'
    [00:40:49]
    [00:40:49] The following objects are masked from 'package:testthat':
    [00:40:49]
    [00:40:49]     describe, not
    [00:40:49]
    [00:40:49] The following objects are masked from 'package:stats':
    [00:40:49]
    [00:40:49]     cov, filter, lag, na.omit, predict, sd, var, window
    [00:40:49]
    [00:40:49] The following objects are masked from 'package:base':
    [00:40:49]
    [00:40:49]     as.data.frame, colnames, colnames<-, drop, endsWith, 
intersect,
    [00:40:49]     rank, rbind, sample, startsWith, subset, summary, transform, 
union
    [00:40:49]
    [00:40:49] Spark package found in SPARK_HOME: C:\projects\spark\bin\..
    [00:41:43] basic tests for CRAN: .............
    [00:41:43]
    [00:41:43] DONE 
===========================================================================
    [00:41:43] binary functions: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
    [00:42:05] ...........
    [00:42:05] functions on binary files: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
    [00:42:10] ....
    [00:42:10] broadcast variables: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
    [00:42:12] ..
    [00:42:12] functions in client.R: .....
    [00:42:30] test functions in sparkR.R: 
..............................................
    [00:42:30] include R packages: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
    [00:42:31]
    [00:42:31] JVM API: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
    [00:42:31] ..
    [00:42:31] MLlib classification algorithms, except for tree-based 
algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\..
    [00:48:48] 
......................................................................
    [00:48:48] MLlib clustering algorithms: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
    [00:50:12] 
.....................................................................
    [00:50:12] MLlib frequent pattern mining: Spark package found in 
SPARK_HOME: C:\projects\spark\bin\..
    [00:50:18] .....
    [00:50:18] MLlib recommendation algorithms: Spark package found in 
SPARK_HOME: C:\projects\spark\bin\..
    [00:50:27] ........
    [00:50:27] MLlib regression algorithms, except for tree-based algorithms: 
Spark package found in SPARK_HOME: C:\projects\spark\bin\..
    [00:56:00] 
................................................................................................................................
    [00:56:00] MLlib statistics algorithms: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
    [00:56:04] ........
    [00:56:04] MLlib tree-based algorithms: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
    [00:58:20] 
..............................................................................................
    [00:58:20] parallelize() and collect(): Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
    [00:58:20] .............................
    [00:58:20] basic RDD functions: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
    [01:03:35] 
............................................................................................................................................................................................................................................................................................................................................................................................................................................
    [01:03:35] SerDe functionality: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
    [01:03:39] ...............................
    [01:03:39] partitionBy, groupByKey, reduceByKey etc.: Spark package found 
in SPARK_HOME: C:\projects\spark\bin\..
    [01:04:20] ....................
    [01:04:20] functions in sparkR.R: ....
    [01:04:20] SparkSQL functions: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
    [01:04:50] 
........................................................................................................................................-chgrp:
 'APPVYR-WIN\None' does not match expected pattern for group
    [01:04:50] Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
    [01:04:50] -chgrp: 'APPVYR-WIN\None' does not match expected pattern for 
group
    [01:04:50] Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
    [01:04:51] -chgrp: 'APPVYR-WIN\None' does not match expected pattern for 
group
    [01:04:51] Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
    [01:06:13] 
............................................................................................................................................................................................................................................................................................................................................................-chgrp:
 'APPVYR-WIN\None' does not match expected pattern for group
    [01:06:13] Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
    [01:06:14] .-chgrp: 'APPVYR-WIN\None' does not match expected pattern for 
group
    [01:06:14] Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
    [01:06:14] ....-chgrp: 'APPVYR-WIN\None' does not match expected pattern 
for group
    [01:06:14] Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
    [01:12:30] 
...................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
    [01:12:30] Structured Streaming: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
    [01:14:27] ..........................................
    [01:14:27] tests RDD function take(): Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
    [01:14:28] ................
    [01:14:28] the textFile() function: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
    [01:14:44] .............
    [01:14:44] functions in utils.R: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..
    [01:14:46] ............................................
    [01:14:46] Windows-specific tests: .
    [01:14:46]
    [01:14:46] DONE 
===========================================================================
    [01:15:29] Build success
    ```
    
    Author: hyukjinkwon <gurwls...@apache.org>
    
    Closes #21894 from HyukjinKwon/wip-build.

commit 6690924c49a443cd629fcc1a4460cf443fb0a918
Author: hyukjinkwon <gurwls223@...>
Date:   2018-07-30T02:02:29Z

    [MINOR] Avoid the 'latest' link that might vary per release in 
functions.scala's comment
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to address the 
https://github.com/apache/spark/pull/21318#discussion_r187843125 comment.
    
    This is rather a nit, but it looks like we had better avoid having to update 
the link for each release, since it always points to the latest (on the other 
hand, it does not seem worth updating the release guide for this either).
    
    ## How was this patch tested?
    
    N/A
    
    Author: hyukjinkwon <gurwls...@apache.org>
    
    Closes #21907 from HyukjinKwon/minor-fix.

commit 65a4bc143ab5dc2ced589dc107bbafa8a7290931
Author: Dilip Biswal <dbiswal@...>
Date:   2018-07-30T05:11:01Z

    [SPARK-21274][SQL] Implement INTERSECT ALL clause
    
    ## What changes were proposed in this pull request?
    Implements INTERSECT ALL clause through query rewrites using existing 
operators in Spark.  Please refer to 
[Link](https://drive.google.com/open?id=1nyW0T0b_ajUduQoPgZLAsyHK8s3_dko3ulQuxaLpUXE)
 for the design.
    
    Input Query
    ```SQL
    SELECT c1 FROM ut1 INTERSECT ALL SELECT c1 FROM ut2
    ```
    Rewritten Query
    ```SQL
       SELECT c1
        FROM (
             SELECT replicate_row(min_count, c1)
             FROM (
                  SELECT c1,
                         IF (vcol1_cnt > vcol2_cnt, vcol2_cnt, vcol1_cnt) AS min_count
                  FROM (
                       SELECT   c1, count(vcol1) as vcol1_cnt, count(vcol2) as vcol2_cnt
                       FROM (
                            SELECT c1, true as vcol1, null as vcol2 FROM ut1
                            UNION ALL
                            SELECT c1, null as vcol1, true as vcol2 FROM ut2
                            ) AS union_all
                       GROUP BY c1
                       HAVING vcol1_cnt >= 1 AND vcol2_cnt >= 1
                      )
                  )
              )
    ```
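
The semantics of the rewrite can be checked against a small Python model. This is a sketch for intuition only, not the Spark implementation: `intersect_all` is a hypothetical helper, plain lists stand in for the two tables, and SQL NULL handling is ignored.

```python
from collections import Counter


def intersect_all(left, right):
    """Multiset intersection: each value is kept min(left count, right count) times."""
    # Stand-in for the UNION ALL + GROUP BY step: count each side per value.
    left_counts = Counter(left)
    right_counts = Counter(right)
    result = []
    for value, left_count in left_counts.items():
        right_count = right_counts.get(value, 0)
        # Mirrors: HAVING vcol1_cnt >= 1 AND vcol2_cnt >= 1
        if left_count >= 1 and right_count >= 1:
            # Mirrors: replicate_row(min_count, c1)
            result.extend([value] * min(left_count, right_count))
    return result
```

For example, `intersect_all([1, 1, 2, 3], [1, 1, 1, 3, 4])` keeps `1` twice and `3` once, and drops `2` and `4` because each appears on only one side.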
    
    ## How was this patch tested?
    Added test cases in SQLQueryTestSuite, DataFrameSuite, SetOperationSuite
    
    Author: Dilip Biswal <dbis...@us.ibm.com>
    
    Closes #21886 from dilipbiswal/dkb_intersect_all_final.

commit bfe60fcdb49aa48534060c38e36e06119900140d
Author: hyukjinkwon <gurwls223@...>
Date:   2018-07-30T05:20:03Z

    [SPARK-24934][SQL] Explicitly whitelist supported types in upper/lower 
bounds for in-memory partition pruning
    
    ## What changes were proposed in this pull request?
    
    It looks like we intentionally set `null` for the upper/lower bounds of 
complex types and don't use them. However, these bounds appear to be used in 
in-memory partition pruning, which ends up producing incorrect results.
    
    This PR proposes to explicitly whitelist the supported types.
    
    ```scala
    val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol")
    df.cache().filter("arrayCol > array('a', 'b')").show()
    ```
    
    ```scala
    val df = sql("select cast('a' as binary) as a")
    df.cache().filter("a == cast('a' as binary)").show()
    ```
    
    **Before:**
    
    ```
    +--------+
    |arrayCol|
    +--------+
    +--------+
    ```
    
    ```
    +---+
    |  a|
    +---+
    +---+
    ```
    
    **After:**
    
    ```
    +--------+
    |arrayCol|
    +--------+
    |  [c, d]|
    +--------+
    ```
    
    ```
    +----+
    |   a|
    +----+
    |[61]|
    +----+
    ```
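
The pruning rule can be modelled in a few lines of Python. This is a hedged sketch rather than the Spark code: `SUPPORTED_BOUND_TYPES`, `batch_stats`, and `may_contain_equal` are hypothetical names, and the whitelist shown is illustrative, not the list of types Spark actually supports. It shows that a batch without usable bounds must always be scanned.

```python
# Assumption: an illustrative whitelist, not Spark's actual supported types.
SUPPORTED_BOUND_TYPES = {int, float, str}


def batch_stats(values):
    """Return (lower, upper) for a cached batch, or (None, None) if unsupported."""
    kinds = {type(v) for v in values}
    if len(kinds) == 1 and kinds <= SUPPORTED_BOUND_TYPES:
        return min(values), max(values)
    # Complex types (arrays, binary, ...) get no bounds at all.
    return None, None


def may_contain_equal(stats, literal):
    """Decide whether a batch has to be scanned for `col == literal`."""
    lower, upper = stats
    if lower is None or upper is None:
        # No usable bounds: never prune, otherwise matching rows are lost.
        return True
    return lower <= literal <= upper
```

A batch of arrays therefore yields `(None, None)` and is never pruned, while a batch of integers is pruned only when the literal falls outside its bounds.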
    
    ## How was this patch tested?
    
    Unit tests were added and manually tested.
    
    Author: hyukjinkwon <gurwls...@apache.org>
    
    Closes #21882 from HyukjinKwon/stats-filter.

commit 85505fc8a58ca229bbaf240c6bc23ea876d594db
Author: Marco Gaido <marcogaido91@...>
Date:   2018-07-30T12:53:45Z

    [SPARK-24957][SQL] Average with decimal followed by aggregation returns 
wrong result
    
    ## What changes were proposed in this pull request?
    
    When we compute an average, the result is obtained by dividing the sum of 
the values by their count. When the result is a DecimalType, the way we 
cast and manage the precision and scale is not really optimized and is not 
consistent with what we normally do.
    
    In particular, a problem can occur when the `Divide` operator returns a 
result whose precision and scale differ from the ones 
expected as the output of `Divide`. In the case reported in the JIRA, 
for instance, the result of `Divide` is a `Decimal(38, 36)`, while 
the output data type of `Divide` is `Decimal(38, 22)`. This is not an issue when 
`Divide` is followed by a `CheckOverflow` or a `Cast` to the right data type, 
as these operations return a decimal with the defined precision and scale. 
Although the `Average` operator does have a `Cast`, it may be bypassed if 
the result of `Divide` already has the type it is cast to, hence the issue 
reported in the JIRA may arise.
    
    The PR proposes to use the normal rules/handling of the arithmetic 
operators with Decimal data type, so we both reuse the existing code (having a 
single logic for operations between decimals) and we fix this problem as the 
result is always guarded by `CheckOverflow`.
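
As a rough illustration of what guarding a division result means, the following Python sketch uses the standard `decimal` module. It is not Spark's `CheckOverflow`: the helpers `check_overflow` and `average` are hypothetical, the default precision 38 and scale 22 are taken from the example in the text, and the rounding mode is an assumption.

```python
from decimal import Decimal, ROUND_HALF_UP, localcontext


def check_overflow(value, precision, scale):
    """Round `value` to `scale` fractional digits.

    Returns None if the rounded value needs more than `precision` digits.
    """
    with localcontext() as ctx:
        ctx.prec = 80  # generous working precision for the sketch
        quantum = Decimal(1).scaleb(-scale)  # e.g. 1E-22 for scale 22
        rounded = value.quantize(quantum, rounding=ROUND_HALF_UP)
    if len(rounded.as_tuple().digits) > precision:
        return None
    return rounded


def average(values, precision=38, scale=22):
    """Divide the sum by the count, then force the declared precision/scale."""
    with localcontext() as ctx:
        ctx.prec = 80
        # The raw quotient can carry far more fractional digits than declared.
        raw = sum(values, Decimal(0)) / Decimal(len(values))
    return check_overflow(raw, precision, scale)
```

The raw quotient of `1 / 3` has many more fractional digits than the declared scale; passing it through `check_overflow` yields a value with exactly 22 fractional digits.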
    
    ## How was this patch tested?
    
    added UT
    
    Author: Marco Gaido <marcogaid...@gmail.com>
    
    Closes #21910 from mgaido91/SPARK-24957.

commit fca0b8528e704cfe62863a34f8bb5dcee850b046
Author: hyukjinkwon <gurwls223@...>
Date:   2018-07-30T13:13:08Z

    [SPARK-24967][SQL] Avro: Use internal.Logging instead for logging
    
    ## What changes were proposed in this pull request?
    
    It looks like Avro calls `getLogger` directly to create an SLF4J logger. It 
would be better to use `internal.Logging` instead.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: hyukjinkwon <gurwls...@apache.org>
    
    Closes #21914 from HyukjinKwon/avro-log.

commit b90bfe3c42eb9b51e6131a8f8923bcddfccd75bb
Author: Gengliang Wang <gengliang.wang@...>
Date:   2018-07-30T14:30:47Z

    [SPARK-24771][BUILD] Upgrade Apache AVRO to 1.8.2
    
    ## What changes were proposed in this pull request?
    
    Upgrade Apache Avro from 1.7.7 to 1.8.2. The major new features:
    
    1. More logical types. Comparing the 1.8.2 spec 
https://avro.apache.org/docs/1.8.2/spec.html#Logical+Types with 
[1.7.7](https://avro.apache.org/docs/1.7.7/spec.html#Logical+Types), the new 
version supports:
        - Date
        - Time (millisecond precision)
        - Time (microsecond precision)
        - Timestamp (millisecond precision)
        - Timestamp (microsecond precision)
        - Duration
    
    2. Single-object encoding: 
https://avro.apache.org/docs/1.8.2/spec.html#single_object_encoding
    
    This PR aims to update Apache Spark to support these new features.
    
    ## How was this patch tested?
    
    Unit test
    
    Author: Gengliang Wang <gengliang.w...@databricks.com>
    
    Closes #21761 from gengliangwang/upgrade_avro_1.8.

commit 47d84e4d0e56e14f9402770dceaf0b4302c00e98
Author: Takeshi Yamamuro <yamamuro@...>
Date:   2018-07-30T14:42:00Z

    [SPARK-22814][SQL] Support Date/Timestamp in a JDBC partition column
    
    ## What changes were proposed in this pull request?
    This PR adds support for Date/Timestamp in a JDBC partition column (only 
numeric columns are supported in master). It also modifies the code to verify 
the partition column type:
    ```
    val jdbcTable = spark.read
     .option("partitionColumn", "text")
     .option("lowerBound", "aaa")
     .option("upperBound", "zzz")
     .option("numPartitions", 2)
     .jdbc("jdbc:postgresql:postgres", "t", options)
    
    // with this pr
    org.apache.spark.sql.AnalysisException: Partition column type should be 
numeric, date, or timestamp, but string found.;
      at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.verifyAndGetNormalizedPartitionColumn(JDBCRelation.scala:165)
      at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.columnPartition(JDBCRelation.scala:85)
      at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:36)
      at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:317)
    
    // without this pr
    java.lang.NumberFormatException: For input string: "aaa"
      at 
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
      at java.lang.Long.parseLong(Long.java:589)
      at java.lang.Long.parseLong(Long.java:631)
      at 
scala.collection.immutable.StringLike$class.toLong(StringLike.scala:277)
    ```
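
To show how date bounds can drive partitioning, here is a simplified Python sketch. It is an assumption-laden illustration, not Spark's stride logic: `to_number` and `partition_predicates` are hypothetical helpers, only integer and date bounds are handled (timestamps are omitted), and the generated predicates use a generic `DATE '...'` literal.

```python
from datetime import date


def to_number(bound):
    """Map a partition bound to an integer; reject unsupported types."""
    if isinstance(bound, date):
        return bound.toordinal()
    if isinstance(bound, int):
        return bound
    raise TypeError(f"unsupported partition bound type: {type(bound).__name__}")


def partition_predicates(column, lower, upper, num_partitions):
    """Return one WHERE predicate per partition (simplified sketch)."""
    low, high = to_number(lower), to_number(upper)
    if num_partitions < 2 or low >= high:
        raise ValueError("need at least two partitions and lower < upper")
    if isinstance(lower, date):
        def render(n):
            return f"DATE '{date.fromordinal(n).isoformat()}'"
    else:
        render = str
    stride = max((high - low) // num_partitions, 1)
    boundaries = [low + stride * i for i in range(1, num_partitions)]
    # First partition also collects NULLs so that no row is dropped.
    predicates = [f"{column} < {render(boundaries[0])} OR {column} IS NULL"]
    for start, end in zip(boundaries, boundaries[1:]):
        predicates.append(
            f"{column} >= {render(start)} AND {column} < {render(end)}"
        )
    predicates.append(f"{column} >= {render(boundaries[-1])}")
    return predicates
```

Passing string bounds such as `"aaa"` and `"zzz"` fails the type check up front, instead of failing later while parsing a number.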
    
    Closes #19999
    
    ## How was this patch tested?
    Added tests in `JDBCSuite`.
    
    Author: Takeshi Yamamuro <yamam...@apache.org>
    
    Closes #21834 from maropu/SPARK-22814.

commit d6b7545b5f495a496d40a982e0ab0f8053e1a4f5
Author: mcheah <mcheah@...>
Date:   2018-07-30T18:41:02Z

    [SPARK-24963][K8S][TESTS] Don't set service account name for client mode 
test
    
    ## What changes were proposed in this pull request?
    
    Don't set service account name for the pod created in client mode
    
    ## How was this patch tested?
    
    Test should continue running smoothly in Jenkins.
    
    Author: mcheah <mch...@palantir.com>
    
    Closes #21900 from mccheah/fix-integration-test-service-account.

commit abbb4ab4d8b12ba2d94b16407c0d62ae207ee4fa
Author: Reynold Xin <rxin@...>
Date:   2018-07-30T21:05:45Z

    [SPARK-24865][SQL] Remove AnalysisBarrier addendum
    
    ## What changes were proposed in this pull request?
    I didn't want to pollute the diff in the previous PR and left some TODOs. 
This is a follow-up to address those TODOs.
    
    ## How was this patch tested?
    Should be covered by existing tests.
    
    Author: Reynold Xin <r...@databricks.com>
    
    Closes #21896 from rxin/SPARK-24865-addendum.

commit 2fbe294cf01f78b34498553d9228b57e2f992bce
Author: mcheah <mcheah@...>
Date:   2018-07-30T22:57:54Z

    [SPARK-24963][K8S][TESTS] Add user-specified service account name for 
client mode test driver pod
    
    ## What changes were proposed in this pull request?
    
    Adds the user-set service account name for the driver pod in the client 
mode integration test
    
    ## How was this patch tested?
    
    Manual test against a custom Kubernetes cluster
    
    Author: mcheah <mch...@palantir.com>
    
    Closes #21924 from mccheah/fix-service-account.

commit d20c10fdf382acf43a7e6a541923bd078e19ca75
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-07-31T01:12:57Z

    [SPARK-24952][SQL] Support LZMA2 compression by Avro datasource
    
    ## What changes were proposed in this pull request?
    
    In this PR, I propose to support `LZMA2` (`XZ`) and `BZIP2` compression in 
the `AVRO` datasource on write, since these codecs may have better 
characteristics, such as compression ratio and speed, compared to the already 
supported `snappy` and `deflate` codecs.
    
    ## How was this patch tested?
    
    It was tested manually and by an existing test which was extended to check 
the `xz` and `bzip2` compressions.
    
    Author: Maxim Gekk <maxim.g...@databricks.com>
    
    Closes #21902 from MaxGekk/avro-xz-bzip2.
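
The claim about codec characteristics can be explored with Python's standard library alone. This sketch does not touch Avro or Spark: `compressed_sizes` and `roundtrips` are hypothetical helpers, `zlib` stands in for `deflate` (it adds a small zlib wrapper around the deflate stream), and `snappy` is left out because it is not in the standard library.

```python
import bz2
import lzma
import zlib


def compressed_sizes(payload):
    """Return the compressed size in bytes for each stdlib codec."""
    return {
        "deflate": len(zlib.compress(payload)),
        "bzip2": len(bz2.compress(payload)),
        "xz": len(lzma.compress(payload, format=lzma.FORMAT_XZ)),
    }


def roundtrips(payload):
    """Check that every codec restores the original bytes."""
    return (
        zlib.decompress(zlib.compress(payload)) == payload
        and bz2.decompress(bz2.compress(payload)) == payload
        and lzma.decompress(lzma.compress(payload)) == payload
    )
```

Running `compressed_sizes` on a sample of your own data gives a first impression of the size trade-off; actual ratios and speeds depend heavily on the data.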

commit f1550aaf1506c0115c8d95cd8bc784ed6c734ea5
Author: hyukjinkwon <gurwls223@...>
Date:   2018-07-31T01:14:29Z

    [SPARK-24956][BUILD][FOLLOWUP] Upgrade Maven version to 3.5.4 for AppVeyor 
as well
    
    ## What changes were proposed in this pull request?
    
    The Maven version was upgraded, and AppVeyor should also use the upgraded 
Maven version.
    
    Currently, the build looks broken because of this:
    
    
https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/2458-master
    
    ```
    [WARNING] Rule 0: org.apache.maven.plugins.enforcer.RequireMavenVersion 
failed with message:
    Detected Maven Version: 3.3.9 is not in the allowed range 3.5.4.
    [INFO] 
------------------------------------------------------------------------
    [INFO] Reactor Summary:
    ```
    
    ## How was this patch tested?
    
    AppVeyor tests
    
    Author: hyukjinkwon <gurwls...@apache.org>
    
    Closes #21920 from HyukjinKwon/SPARK-24956.

commit 8141d55926e95c06cd66bf82098895e1ed419449
Author: Li Jin <ice.xelloss@...>
Date:   2018-07-31T02:10:38Z

    [SPARK-23633][SQL] Update Pandas UDFs section in sql-programming-guide
    
    ## What changes were proposed in this pull request?
    
    Update Pandas UDFs section in sql-programming-guide. Add section for 
grouped aggregate pandas UDF.
    
    ## How was this patch tested?
    
    Author: Li Jin <ice.xell...@gmail.com>
    
    Closes #21887 from icexelloss/SPARK-23633-sql-programming-guide.

commit b4fd75fb9b615cfe592ad269cf20d02b483a0d33
Author: maryannxue <maryannxue@...>
Date:   2018-07-31T06:43:53Z

    [SPARK-24972][SQL] PivotFirst could not handle pivot columns of complex 
types
    
    ## What changes were proposed in this pull request?
    
    When the pivot column is of a complex type, the eval() result will be an 
UnsafeRow, while the keys of the HashMap used for column value matching are 
GenericInternalRows. As a result, there will be no match and the result will 
always be empty.
    So for a pivot column of a complex type, we should:
    1) If the complex type is not comparable (orderable), throw an Exception. 
It cannot be a pivot column.
    2) Otherwise, if it goes through the `PivotFirst` code path, `PivotFirst` 
should use a TreeMap instead of a HashMap for such columns.
    
    This PR also reverts the workaround in Analyzer that had been 
introduced to avoid this `PivotFirst` issue.
    
    ## How was this patch tested?
    
    Added UT.
    
    Author: maryannxue <maryann...@apache.org>
    
    Closes #21926 from maryannxue/pivot_followup.
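
The mismatch between hash-based and order-based lookup can be reproduced in a small Python sketch. All names here (`GenericRow`, `PackedRow`, `OrderedIndex`) are hypothetical and only mimic the situation described above: two row representations that hold the same values but do not hash or compare equal as objects. Python has no built-in TreeMap, so a sorted list with `bisect` stands in for it.

```python
import bisect


class GenericRow:
    """Hypothetical row kept as a list of field values."""

    def __init__(self, *fields):
        self._fields = list(fields)

    def values(self):
        return tuple(self._fields)


class PackedRow:
    """Same logical row in a different representation (identity-based hash)."""

    def __init__(self, *fields):
        self._packed = tuple(fields)  # pretend this is a binary layout

    def values(self):
        return self._packed


class OrderedIndex:
    """Tiny TreeMap substitute: keys sorted by value, looked up with bisect."""

    def __init__(self, keys):
        pairs = sorted((key.values(), i) for i, key in enumerate(keys))
        self._keys = [pair[0] for pair in pairs]
        self._positions = [pair[1] for pair in pairs]

    def get(self, row):
        target = row.values()
        at = bisect.bisect_left(self._keys, target)
        if at < len(self._keys) and self._keys[at] == target:
            return self._positions[at]
        return None
```

A plain dictionary keyed by `GenericRow` objects never matches a `PackedRow` probe, whereas the ordered index compares field values and finds the entry. As in the commit message, this only works when the values are orderable.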

commit 4ac2126bc64bad1b4cbe1c697b4bcafacd67c96c
Author: Mauro Palsgraaf <mauropalsgraaf@...>
Date:   2018-07-31T15:18:08Z

    [SPARK-24536] Validate that an evaluated limit clause cannot be null
    
    ## What changes were proposed in this pull request?
    
    It proposes a version in which nullable expressions are not valid in the 
limit clause
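
A validation of this kind might look like the following Python sketch. It is illustrative only: `validate_limit` is a hypothetical helper, and the integer and non-negativity checks are assumptions added for completeness, since the text above only states the null check.

```python
def validate_limit(evaluated):
    """Validate the already-evaluated value of a LIMIT expression."""
    if evaluated is None:
        # The check described above: a null limit is rejected.
        raise ValueError("the evaluated limit expression must not be null")
    # Assumption: integer-only and non-negative limits, for illustration.
    if isinstance(evaluated, bool) or not isinstance(evaluated, int):
        raise TypeError("the limit expression must evaluate to an integer")
    if evaluated < 0:
        raise ValueError("the limit expression must not be negative")
    return evaluated
```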
    
    ## How was this patch tested?
    
    It was tested with unit and e2e tests.
    
    Author: Mauro Palsgraaf <mauropalsgr...@hotmail.com>
    
    Closes #21807 from mauropalsgraaf/SPARK-24536.

commit 1223a201fcb2c2f211ad96997ebb00c3554aa822
Author: zhengruifeng <ruifengz@...>
Date:   2018-07-31T18:37:13Z

    [SPARK-24609][ML][DOC] PySpark/SparkR doc doesn't explain 
RandomForestClassifier.featureSubsetStrategy well
    
    ## What changes were proposed in this pull request?
    update doc of RandomForestClassifier.featureSubsetStrategy
    
    ## How was this patch tested?
    local built doc
    
    rdoc:
    
![default](https://user-images.githubusercontent.com/7322292/42807787-4dda6362-89e4-11e8-839f-a8519b7c1f1c.png)
    
    pydoc:
    
![default](https://user-images.githubusercontent.com/7322292/43112817-5f1d4d88-8f2a-11e8-93ff-de90db8afdca.png)
    
    Author: zhengruifeng <ruife...@foxmail.com>
    
    Closes #21788 from zhengruifeng/rf_doc_py_r.

commit e82784d13fac7d45164dfadb00d3fa43e64e0bde
Author: tedyu <yuzhihong@...>
Date:   2018-07-31T20:14:14Z

    [SPARK-18057][SS] Update Kafka client version from 0.10.0.1 to 2.0.0
    
    ## What changes were proposed in this pull request?
    
    This PR upgrades to the Kafka 2.0.0 release where KIP-266 is integrated.
    
    ## How was this patch tested?
    
    This PR uses existing Kafka related unit tests
    
    Author: tedyu <yuzhih...@gmail.com>
    
    Closes #21488 from tedyu/master.

commit 42dfe4f1593767eae355e27bf969339f4ab03f56
Author: Huaxin Gao <huaxing@...>
Date:   2018-07-31T20:23:11Z

    [SPARK-24973][PYTHON] Add numIter to Python ClusteringSummary
    
    ## What changes were proposed in this pull request?
    
    Add numIter to Python version of ClusteringSummary
    
    ## How was this patch tested?
    
    Modified existing UT test_multiclass_logistic_regression_summary
    
    Author: Huaxin Gao <huax...@us.ibm.com>
    
    Closes #21925 from huaxingao/spark-24973.

commit f4772fd26f32b11ae54e7721924b5cf6eb27298a
Author: hyukjinkwon <gurwls223@...>
Date:   2018-08-01T00:24:24Z

    [SPARK-24976][PYTHON] Allow None for Decimal type conversion (specific to 
PyArrow 0.9.0)
    
    ## What changes were proposed in this pull request?
    
    See [ARROW-2432](https://jira.apache.org/jira/browse/ARROW-2432). It seems 
that using `from_pandas` to convert decimals fails if it encounters a value of 
`None`:
    
    ```python
    import pyarrow as pa
    import pandas as pd
    from decimal import Decimal
    
    pa.Array.from_pandas(pd.Series([Decimal('3.14'), None]), 
type=pa.decimal128(3, 2))
    ```
    
    **Arrow 0.8.0**
    
    ```
    <pyarrow.lib.Decimal128Array object at 0x10a572c58>
    [
      Decimal('3.14'),
      NA
    ]
    ```
    
    **Arrow 0.9.0**
    
    ```
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas
      File "array.pxi", line 177, in pyarrow.lib.array
      File "error.pxi", line 77, in pyarrow.lib.check_status
      File "error.pxi", line 77, in pyarrow.lib.check_status
    pyarrow.lib.ArrowInvalid: Error converting from Python objects to Decimal: 
Got Python object of type NoneType but can only handle these types: 
decimal.Decimal
    ```
    
    This PR proposes to work around this via Decimal NaN:
    
    ```python
    pa.Array.from_pandas(pd.Series([Decimal('3.14'), Decimal('NaN')]), 
type=pa.decimal128(3, 2))
    ```
    
    ```
    <pyarrow.lib.Decimal128Array object at 0x10ffd2e68>
    [
      Decimal('3.14'),
      NA
    ]
    ```
    
    ## How was this patch tested?
    
    Manually tested:
    
    ```bash
    SPARK_TESTING=1 ./bin/pyspark pyspark.sql.tests ScalarPandasUDFTests
    ```
    
    **Before**
    
    ```
    Traceback (most recent call last):
      File "/.../spark/python/pyspark/sql/tests.py", line 4672, in 
test_vectorized_udf_null_decimal
        self.assertEquals(df.collect(), res.collect())
      File "/.../spark/python/pyspark/sql/dataframe.py", line 533, in collect
        sock_info = self._jdf.collectToPython()
      File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", 
line 1257, in __call__
        answer, self.gateway_client, self.target_id, self.name)
      File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
        return f(*a, **kw)
      File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
328, in get_return_value
        format(target_id, ".", name), value)
    Py4JJavaError: An error occurred while calling o51.collectToPython.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 (TID 7, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "/.../spark/python/pyspark/worker.py", line 320, in main
        process()
      File "/.../spark/python/pyspark/worker.py", line 315, in process
        serializer.dump_stream(func(split_index, iterator), outfile)
      File "/.../spark/python/pyspark/serializers.py", line 274, in dump_stream
        batch = _create_batch(series, self._timezone)
      File "/.../spark/python/pyspark/serializers.py", line 243, in _create_batch
        arrs = [create_array(s, t) for s, t in series]
      File "/.../spark/python/pyspark/serializers.py", line 241, in create_array
        return pa.Array.from_pandas(s, mask=mask, type=t)
      File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas
      File "array.pxi", line 177, in pyarrow.lib.array
      File "error.pxi", line 77, in pyarrow.lib.check_status
      File "error.pxi", line 77, in pyarrow.lib.check_status
    ArrowInvalid: Error converting from Python objects to Decimal: Got Python object of type NoneType but can only handle these types: decimal.Decimal
    ```
    
    **After**
    
    ```
    Running tests...
    ----------------------------------------------------------------------
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    .......S.............................
    ----------------------------------------------------------------------
    Ran 37 tests in 21.980s
    ```
    
    Author: hyukjinkwon <gurwls...@apache.org>
    
    Closes #21928 from HyukjinKwon/SPARK-24976.
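
A minimal sketch of the substitution step behind the SPARK-24976 workaround above, using only the Python standard library. The helper name `replace_null_decimals` is hypothetical and is not the code in the patch; its output is the kind of list that would be wrapped in a `pd.Series` and passed to `pa.Array.from_pandas` as shown in the commit message.

```python
from decimal import Decimal


def replace_null_decimals(values):
    """Return a copy of `values` with every None replaced by Decimal('NaN').

    Hypothetical helper for illustration only; not the code in the patch.
    """
    return [Decimal("NaN") if v is None else v for v in values]


# Same input as the failing example in the commit message.
filled = replace_null_decimals([Decimal("3.14"), None])
```

Non-null values pass through unchanged, so only the null slots are affected by the substitution.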

commit 5f3441e542bfacd81d70bd8b34c22044c8928bff
Author: DB Tsai <d_tsai@...>
Date:   2018-08-01T02:31:02Z

    [SPARK-24893][SQL] Remove the entire CaseWhen if all the outputs are semantic equivalence
    
    ## What changes were proposed in this pull request?
    
    Similar to SPARK-24890, if all the outputs of `CaseWhen` are semantically
    equivalent, `CaseWhen` can be removed.
    
    ## How was this patch tested?
    
    Tests added.
    
    Author: DB Tsai <d_t...@apple.com>
    
    Closes #21852 from dbtsai/short-circuit-when.
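
The idea in SPARK-24893 above can be sketched on a toy expression model in plain Python. Everything here is an assumption for illustration: the `CaseWhen` dataclass and `simplify_case_when` are hypothetical stand-ins rather than Catalyst's classes, plain `==` stands in for semantic equivalence, and the sketch only collapses expressions that have an explicit ELSE. It also ignores whether the conditions themselves are safe to drop.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass(frozen=True)
class CaseWhen:
    """Toy stand-in for a CASE WHEN expression (not Catalyst's class)."""
    branches: Tuple[Tuple[Any, Any], ...]  # (condition, output) pairs
    else_value: Optional[Any] = None       # None models a missing ELSE


def simplify_case_when(expr):
    """Collapse a CaseWhen whose outputs are all equal into that output."""
    # Leave anything that is not a CaseWhen, or has no ELSE, untouched.
    if not isinstance(expr, CaseWhen) or expr.else_value is None:
        return expr
    outputs = [out for _, out in expr.branches] + [expr.else_value]
    # Plain equality is used here in place of semantic equivalence.
    if all(out == outputs[0] for out in outputs):
        return outputs[0]
    return expr


same = CaseWhen((("a > 1", "x + 1"), ("a > 2", "x + 1")), "x + 1")
different = CaseWhen((("a > 1", "x + 1"),), "x")
simplified = simplify_case_when(same)
```

Here `same` collapses to its single shared output, while `different` is returned unchanged because its branches disagree.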

commit 1f7e22c72c89fc2c0e729dde0948bc6bdf8f7628
Author: Reynold Xin <rxin@...>
Date:   2018-08-01T05:25:40Z

    [SPARK-24951][SQL] Table valued functions should throw AnalysisException
    
    ## What changes were proposed in this pull request?
    Previously, TVF resolution could throw IllegalArgumentException if the data
    type was the null type. This patch replaces that exception with
    AnalysisException, enriched with positional information, to improve error
    message reporting and to be more consistent with the rest of Spark SQL.
    
    ## How was this patch tested?
    Updated the test case in table-valued-functions.sql.out, which is how I
    identified this problem in the first place.
    
    Author: Reynold Xin <r...@databricks.com>
    
    Closes #21934 from rxin/SPARK-24951.

commit 1efffb7993ecebe5dc1f9ebd924e7503bfd9668c
Author: Reynold Xin <rxin@...>
Date:   2018-08-01T07:15:31Z

    [SPARK-24982][SQL] UDAF resolution should not throw AssertionError
    
    ## What changes were proposed in this pull request?
    When a user calls a UDAF with the wrong number of arguments, Spark previously
    threw an AssertionError, which is not supposed to be a user-facing exception.
    This patch updates it to throw AnalysisException instead, so it is consistent
    with a regular UDF.
    
    ## How was this patch tested?
    Updated test case udaf.sql.
    
    Author: Reynold Xin <r...@databricks.com>
    
    Closes #21938 from rxin/SPARK-24982.

commit 1122754bd9c5aa1b434c2b0ad856bc8511cd2ee2
Author: Marcelo Vanzin <vanzin@...>
Date:   2018-08-01T07:47:46Z

    [SPARK-24653][TESTS] Avoid cross-job pollution in TestUtils / SpillListener.
    
    There is a narrow race in this code that is caused when the code being
    run in assertSpilled / assertNotSpilled runs more than a single job.
    
    SpillListener assumed that only a single job was run, and so would only
    block waiting for that single job to finish when `numSpilledStages` was
    called. But some tests (like SQL tests that call `checkAnswer`) run more
    than one job, and so that wait was basically a no-op.
    
    This could cause the listener installed by the next test to receive events
    from the previous job, which could cause test failures in certain cases.
    
    The change fixes that race, and also uninstalls listeners after the
    test runs, so they don't accumulate when the SparkContext is shared
    among multiple tests.
    
    Author: Marcelo Vanzin <van...@cloudera.com>
    
    Closes #21639 from vanzin/SPARK-24653.

commit defc54c69aadc510c6f77e13e57f003646c461bc
Author: Wenchen Fan <wenchen@...>
Date:   2018-08-01T13:39:35Z

    [SPARK-24971][SQL] remove SupportsDeprecatedScanRow
    
    ## What changes were proposed in this pull request?
    
    This is a follow up of https://github.com/apache/spark/pull/21118 .
    
    In https://github.com/apache/spark/pull/21118 we added
    `SupportsDeprecatedScanRow`. Ideally, data sources should produce
    `InternalRow` instead of `Row` for better performance. We should remove
    `SupportsDeprecatedScanRow` and encourage data sources to produce
    `InternalRow`, which is also very easy to build.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Wenchen Fan <wenc...@databricks.com>
    
    Closes #21921 from cloud-fan/row.

commit 95a9d5e3a5ad22c2126fee0ffc7fc789edd18a59
Author: Kazuaki Ishizaki <ishizaki@...>
Date:   2018-08-01T18:52:30Z

    [SPARK-23915][SQL] Add array_except function
    
    ## What changes were proposed in this pull request?
    
    The PR adds the SQL function `array_except`. The behavior of the function
    is based on Presto's.
    
    This function returns an array of the elements in array1 but not in array2.
    
    Note: The order of elements in the result is not defined.
    
    ## How was this patch tested?
    
    Added UTs.
    
    Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>
    
    Closes #21103 from kiszk/SPARK-23915.
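
The `array_except` semantics described in SPARK-23915 above can be illustrated with a plain-Python sketch. This is not Spark's implementation: it assumes hashable elements and assumes duplicates are dropped from the result, which is not stated in the commit message. Because the commit notes that result order is undefined, callers should not rely on the order this sketch happens to produce.

```python
def array_except(array1, array2):
    """Return the elements of array1 that are not in array2.

    Illustrative sketch only: duplicates are dropped (an assumption) and
    elements must be hashable.
    """
    excluded = set(array2)
    seen = set()
    result = []
    for item in array1:
        if item not in excluded and item not in seen:
            seen.add(item)
            result.append(item)
    return result


result = array_except([1, 2, 2, 3, 4], [2, 4, 5])
```

Comparing results as sets, for example `set(result) == {1, 3}`, keeps any check independent of element order.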

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
