GitHub user sandecho reopened a pull request:

    https://github.com/apache/spark/pull/20707

    [SPARK-21209][MLLIB] Implement Incremental PCA algorithm

    ## What changes were proposed in this pull request?
    
    This PR proposes a new feature, Incremental Principal Component Analysis (IPCA).
    It splits the incoming data into batches and computes the PCA of each batch to
    build up the principal components of the entire data set.
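    
    A rough sketch of the batch-wise idea (illustrative only; the actual algorithm
    and API are in the patch, and Breeze is assumed here for brevity):
    
    ```scala
    import breeze.linalg.{svd, DenseMatrix, DenseVector}
    
    // Hypothetical sketch, not the proposed MLlib API: maintain running first and
    // second moments across batches, then eigendecompose the covariance matrix.
    class BatchwisePCA(dim: Int) {
      private var n = 0L
      private val sum = DenseVector.zeros[Double](dim)
      private val gram = DenseMatrix.zeros[Double](dim, dim)
    
      def addBatch(batch: Seq[DenseVector[Double]]): Unit = batch.foreach { x =>
        sum += x
        gram += x * x.t
        n += 1
      }
    
      // cov = E[x x^T] - E[x] E[x]^T; its top-k eigenvectors (left singular
      // vectors) are the principal components of the data seen so far.
      def principalComponents(k: Int): DenseMatrix[Double] = {
        val mean = sum / n.toDouble
        val cov = (gram / n.toDouble) - (mean * mean.t)
        val svd.SVD(u, _, _) = svd(cov)
        u(::, 0 until k)
      }
    }
    ```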
    
    ## How was this patch tested?
    Unit tests.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.3

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20707.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20707
    
----
commit 6bb22961c0c9df1a1f22e9491894895b297f5288
Author: Sameer Agarwal <sameerag@...>
Date:   2018-01-11T23:23:17Z

    Preparing development version 2.3.1-SNAPSHOT

commit 55695c7127cb2f357dfdf677cab4d21fc840aa3d
Author: WeichenXu <weichen.xu@...>
Date:   2018-01-12T00:20:30Z

    [SPARK-23008][ML] OnehotEncoderEstimator python API
    
    ## What changes were proposed in this pull request?
    
    OneHotEncoderEstimator Python API.
    
    ## How was this patch tested?
    
    doctest
    
    Author: WeichenXu <weichen...@databricks.com>
    
    Closes #20209 from WeichenXu123/ohe_py.
    
    (cherry picked from commit b5042d75c2faa5f15bc1e160d75f06dfdd6eea37)
    Signed-off-by: Joseph K. Bradley <jos...@databricks.com>

commit 3ae3e1bb71aa88be1c963b4416986ef679d7c8a2
Author: ho3rexqj <ho3rexqj@...>
Date:   2018-01-12T07:27:00Z

    [SPARK-22986][CORE] Use a cache to avoid instantiating multiple instances 
of broadcast variable values
    
    When resources happen to be constrained on an executor, the first time a
    broadcast variable is instantiated it is persisted to disk by the BlockManager.
    Consequently, every subsequent call to TorrentBroadcast::readBroadcastBlock
    from other instances of that broadcast variable spawns another instance of the
    underlying value. That is, broadcast values are instantiated once per executor
    **unless** memory is constrained, in which case every instance of a broadcast
    variable is provided with a unique copy of the underlying value.
    
    This patch fixes the above by explicitly caching the underlying values 
using weak references in a ReferenceMap.
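    
    The shape of such a cache, as a rough sketch (the actual patch wires a
    commons-collections ReferenceMap into the broadcast machinery; the names below
    are illustrative):
    
    ```scala
    import java.lang.ref.WeakReference
    import scala.collection.mutable
    
    // Illustrative only: one deserialized value per broadcast id, held weakly so
    // the JVM can still reclaim it under memory pressure.
    object BroadcastValueCache {
      private val cache = mutable.Map.empty[Long, WeakReference[AnyRef]]
    
      def getOrLoad(broadcastId: Long)(load: => AnyRef): AnyRef = synchronized {
        cache.get(broadcastId).flatMap(ref => Option(ref.get())) match {
          case Some(value) => value           // reuse the cached instance
          case None =>
            val value = load                  // e.g. read and deserialize blocks
            cache(broadcastId) = new WeakReference(value)
            value
        }
      }
    }
    ```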
    
    Author: ho3rexqj <ho3re...@gmail.com>
    
    Closes #20183 from ho3rexqj/fix/cache-broadcast-values.
    
    (cherry picked from commit cbe7c6fbf9dc2fc422b93b3644c40d449a869eea)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit d512d873b3f445845bd113272d7158388427f8a6
Author: WeichenXu <weichen.xu@...>
Date:   2018-01-12T09:27:02Z

    [SPARK-23008][ML][FOLLOW-UP] mark OneHotEncoder python API deprecated
    
    ## What changes were proposed in this pull request?
    
    Mark the OneHotEncoder Python API as deprecated.
    
    ## How was this patch tested?
    
    N/A
    
    Author: WeichenXu <weichen...@databricks.com>
    
    Closes #20241 from WeichenXu123/mark_ohe_deprecated.
    
    (cherry picked from commit a7d98d53ceaf69cabaecc6c9113f17438c4e61f6)
    Signed-off-by: Nick Pentreath <ni...@za.ibm.com>

commit 6152da3893a05b3f8dc0f13895af9be9548e5895
Author: Marco Gaido <marcogaido91@...>
Date:   2018-01-12T10:04:44Z

    [SPARK-23025][SQL] Support Null type in scala reflection
    
    ## What changes were proposed in this pull request?
    
    Add support for `Null` type in the `schemaFor` method for Scala reflection.
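    
    A minimal illustration of the new behaviour (assuming direct use of
    `ScalaReflection`; the added UT covers the real cases):
    
    ```scala
    import org.apache.spark.sql.catalyst.ScalaReflection
    import org.apache.spark.sql.types.NullType
    
    // With this change, reflecting over the Null type yields NullType instead of
    // failing with an unsupported-type error.
    val schema = ScalaReflection.schemaFor[Null]
    assert(schema.dataType == NullType)
    ```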
    
    ## How was this patch tested?
    
    Added UT
    
    Author: Marco Gaido <marcogaid...@gmail.com>
    
    Closes #20219 from mgaido91/SPARK-23025.
    
    (cherry picked from commit 505086806997b4331d4a8c2fc5e08345d869a23c)
    Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit db27a93652780f234f3c5fe750ef07bc5525d177
Author: Dongjoon Hyun <dongjoon@...>
Date:   2018-01-12T18:18:42Z

    [MINOR][BUILD] Fix Java linter errors
    
    ## What changes were proposed in this pull request?
    
    This PR cleans up the java-lint errors (for v2.3.0-rc1 tag). Hopefully, 
this will be the final one.
    
    ```
    $ dev/lint-java
    Using `mvn` from path: /usr/local/bin/mvn
    Checkstyle checks failed at following occurrences:
    [ERROR] 
src/main/java/org/apache/spark/unsafe/memory/HeapMemoryAllocator.java:[85] 
(sizes) LineLength: Line is longer than 100 characters (found 101).
    [ERROR] 
src/main/java/org/apache/spark/launcher/InProcessAppHandle.java:[20,8] 
(imports) UnusedImports: Unused import - java.io.IOException.
    [ERROR] 
src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java:[41,9]
 (modifier) ModifierOrder: 'private' modifier out of order with the JLS 
suggestions.
    [ERROR] 
src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java:[464] (sizes) 
LineLength: Line is longer than 100 characters (found 102).
    ```
    
    ## How was this patch tested?
    
    Manual.
    
    ```
    $ dev/lint-java
    Using `mvn` from path: /usr/local/bin/mvn
    Checkstyle checks passed.
    ```
    
    Author: Dongjoon Hyun <dongj...@apache.org>
    
    Closes #20242 from dongjoon-hyun/fix_lint_java_2.3_rc1.
    
    (cherry picked from commit 7bd14cfd40500a0b6462cda647bdbb686a430328)
    Signed-off-by: Sameer Agarwal <samee...@apache.org>

commit 02176f4c2f60342068669b215485ffd443346aed
Author: Marco Gaido <marcogaido91@...>
Date:   2018-01-12T19:25:37Z

    [SPARK-22975][SS] MetricsReporter should not throw exception when there was 
no progress reported
    
    ## What changes were proposed in this pull request?
    
    `MetricsReporter` assumes that there has been some progress for the query,
    i.e. `lastProgress` is not null. If this is not true, as can happen under
    particular conditions, a `NullPointerException` can be thrown.
    
    The PR checks whether there is a `lastProgress` and, if not, returns a
    default value for the metrics.
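    
    The guard has roughly this shape (illustrative only; the actual gauge wiring
    lives in `MetricsReporter`):
    
    ```scala
    import org.apache.spark.sql.streaming.StreamingQueryProgress
    
    // Illustrative sketch: report the metric from the last progress if one exists,
    // otherwise fall back to a neutral default instead of dereferencing null.
    def inputRowsPerSecond(lastProgress: () => StreamingQueryProgress): Double =
      Option(lastProgress()).map(_.inputRowsPerSecond).getOrElse(0.0)
    ```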
    
    ## How was this patch tested?
    
    added UT
    
    Author: Marco Gaido <marcogaid...@gmail.com>
    
    Closes #20189 from mgaido91/SPARK-22975.
    
    (cherry picked from commit 54277398afbde92a38ba2802f4a7a3e5910533de)
    Signed-off-by: Shixiong Zhu <zsxw...@gmail.com>

commit 60bcb4685022c29a6ddcf707b505369687ec7da6
Author: Sameer Agarwal <sameerag@...>
Date:   2018-01-12T23:07:14Z

    Revert "[SPARK-22908] Add kafka source and sink for continuous processing."
    
    This reverts commit f891ee3249e04576dd579cbab6f8f1632550e6bd.

commit ca27d9cb5e30b6a50a4c8b7d10ac28f4f51d44ee
Author: hyukjinkwon <gurwls223@...>
Date:   2018-01-13T07:13:44Z

    [SPARK-22980][PYTHON][SQL] Clarify the length of each series is of each 
batch within scalar Pandas UDF
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to add a note saying that the length of the `Series` passed
    to a scalar Pandas UDF is the length of each batch, not of the whole input
    column.
    
    This is fine for a grouped map UDF because its usage differs from a typical
    UDF, but for scalar UDFs it might cause confusion with normal UDFs.
    
    For example, please consider this example:
    
    ```python
    from pyspark.sql.functions import pandas_udf, col, lit
    
    df = spark.range(1)
    f = pandas_udf(lambda x, y: len(x) + y, LongType())
    df.select(f(lit('text'), col('id'))).show()
    ```
    
    ```
    +------------------+
    |<lambda>(text, id)|
    +------------------+
    |                 1|
    +------------------+
    ```
    
    ```python
    from pyspark.sql.functions import udf, col, lit
    
    df = spark.range(1)
    f = udf(lambda x, y: len(x) + y, "long")
    df.select(f(lit('text'), col('id'))).show()
    ```
    
    ```
    +------------------+
    |<lambda>(text, id)|
    +------------------+
    |                 4|
    +------------------+
    ```
    
    ## How was this patch tested?
    
    Manually built the doc and checked the output.
    
    Author: hyukjinkwon <gurwls...@gmail.com>
    
    Closes #20237 from HyukjinKwon/SPARK-22980.
    
    (cherry picked from commit cd9f49a2aed3799964976ead06080a0f7044a0c3)
    Signed-off-by: hyukjinkwon <gurwls...@gmail.com>

commit 801ffd799922e1c2751d3331874b88a67da8cf01
Author: Yuming Wang <yumwang@...>
Date:   2018-01-13T16:01:44Z

    [SPARK-22870][CORE] Dynamic allocation should allow 0 idle time
    
    ## What changes were proposed in this pull request?
    
    This PR makes `0` a valid value for `spark.dynamicAllocation.executorIdleTimeout`.
    For details, see the JIRA description:
    https://issues.apache.org/jira/browse/SPARK-22870.
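    
    Example configuration (illustrative values; dynamic allocation also needs the
    external shuffle service):
    
    ```scala
    import org.apache.spark.SparkConf
    
    // With this change, a zero idle timeout is accepted, so idle executors can be
    // released as soon as they have no work.
    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.executorIdleTimeout", "0")
    ```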
    
    ## How was this patch tested?
    
    N/A
    
    Author: Yuming Wang <yumw...@ebay.com>
    Author: Yuming Wang <wgy...@gmail.com>
    
    Closes #20080 from wangyum/SPARK-22870.
    
    (cherry picked from commit fc6fe8a1d0f161c4788f3db94de49a8669ba3bcc)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 8d32ed5f281317ba380aa6b8b3f3f041575022cb
Author: xubo245 <601450868@...>
Date:   2018-01-13T18:28:57Z

    [SPARK-23036][SQL][TEST] Add withGlobalTempView for testing
    
    ## What changes were proposed in this pull request?
    
    Add a `withGlobalTempView` helper for tests that create global temp views,
    analogous to `withTempView` and `withView`, and correct some improper usages.
    Please see the JIRA for details.
    There are other similar places like this; they can be fixed as well if the
    community needs it. Please confirm.
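    
    A minimal sketch of the helper's shape (illustrative; the real helper would
    live in the SQL test utilities and use the test's own session — it is passed
    explicitly here only to keep the sketch self-contained):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    // Run the test body, then drop the global temp views even if the body throws,
    // mirroring how withTempView cleans up ordinary temp views.
    def withGlobalTempView(spark: SparkSession)(viewNames: String*)(f: => Unit): Unit = {
      try f finally {
        viewNames.foreach(spark.catalog.dropGlobalTempView)
      }
    }
    ```
    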
    ## How was this patch tested?
    
    No new tests.
    
    Author: xubo245 <601450...@qq.com>
    
    Closes #20228 from xubo245/DropTempView.
    
    (cherry picked from commit bd4a21b4820c4ebaf750131574a6b2eeea36907e)
    Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit 0fc5533e53ad03eb67590ddd231f40c2713150c3
Author: CodingCat <zhunansjtu@...>
Date:   2018-01-13T18:36:32Z

    [SPARK-22790][SQL] add a configurable factor to describe HadoopFsRelation's 
size
    
    ## What changes were proposed in this pull request?
    
    As per the discussion in
    https://github.com/apache/spark/pull/19864#discussion_r156847927,
    the current `HadoopFsRelation` size estimate is based purely on the underlying
    file size, which is not accurate and makes execution vulnerable to errors like
    OOM.
    
    Users can enable CBO with the functionality in
    https://github.com/apache/spark/pull/19864 to avoid this issue.
    
    This JIRA proposes to add a configurable factor to the `sizeInBytes` method in
    the `HadoopFsRelation` class so that users can mitigate this problem without
    CBO.
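    
    Illustrative usage (the configuration key below is an assumption made for the
    example, not copied from the patch):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    
    // Hypothetical key; conceptually the relation's estimate becomes
    //   sizeInBytes = totalFileSize * factor
    // so a factor > 1 compensates for compressed files that expand in memory.
    spark.conf.set("spark.sql.sources.fileCompressionFactor", "3.0")
    ```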
    
    ## How was this patch tested?
    
    Existing tests
    
    Author: CodingCat <zhunans...@gmail.com>
    Author: Nan Zhu <nan...@uber.com>
    
    Closes #20072 from CodingCat/SPARK-22790.
    
    (cherry picked from commit ba891ec993c616dc4249fc786c56ea82ed04a827)
    Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit bcd87ae0775d16b7c3b9de0c4f2db36eb3679476
Author: Takeshi Yamamuro <yamamuro@...>
Date:   2018-01-13T21:39:38Z

    [SPARK-21213][SQL][FOLLOWUP] Use compatible types for comparisons in 
compareAndGetNewStats
    
    ## What changes were proposed in this pull request?
    This PR fixes the code that compares values in `compareAndGetNewStats`.
    The test below fails in the current master;
    ```
        val oldStats2 = CatalogStatistics(sizeInBytes = BigInt(Long.MaxValue) * 
2)
        val newStats5 = CommandUtils.compareAndGetNewStats(
          Some(oldStats2), newTotalSize = BigInt(Long.MaxValue) * 2, None)
        assert(newStats5.isEmpty)
    ```
    
    ## How was this patch tested?
    Added some tests in `CommandUtilsSuite`.
    
    Author: Takeshi Yamamuro <yamam...@apache.org>
    
    Closes #20245 from maropu/SPARK-21213-FOLLOWUP.
    
    (cherry picked from commit 0066d6f6fa604817468471832968d4339f71c5cb)
    Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit 1f4a08b15ab47cf6c3bb08c783497422f30d0709
Author: foxish <ramanathana@...>
Date:   2018-01-14T05:34:28Z

    [SPARK-23063][K8S] K8s changes for publishing scripts (and a couple of 
other misses)
    
    ## What changes were proposed in this pull request?
    
    Include the `-Pkubernetes` flag in a few places where it was missed.
    
    ## How was this patch tested?
    
    Checkstyle and MiMa, via manual tests.
    
    Author: foxish <ramanath...@google.com>
    
    Closes #20256 from foxish/SPARK-23063.
    
    (cherry picked from commit c3548d11c3c57e8f2c6ebd9d2d6a3924ddcd3cba)
    Signed-off-by: Felix Cheung <felixche...@apache.org>

commit a335a49ce4672b44e5f818145214040a67c722ba
Author: Dongjoon Hyun <dongjoon@...>
Date:   2018-01-14T07:26:12Z

    [SPARK-23038][TEST] Update docker/spark-test (JDK/OS)
    
    ## What changes were proposed in this pull request?
    
    This PR aims to update the following in `docker/spark-test`.
    
    - JDK7 -> JDK8
    Spark 2.2+ supports JDK8 only.
    
    - Ubuntu 12.04.5 LTS (precise) -> Ubuntu 16.04.3 LTS (xenial)
    The end of life of `precise` was April 28, 2017.
    
    ## How was this patch tested?
    
    Manual.
    
    * Master
    ```
    $ cd external/docker
    $ ./build
    $ export SPARK_HOME=...
    $ docker run -v $SPARK_HOME:/opt/spark spark-test-master
    CONTAINER_IP=172.17.0.3
    ...
    18/01/11 06:50:25 INFO MasterWebUI: Bound MasterWebUI to 172.17.0.3, and 
started at http://172.17.0.3:8080
    18/01/11 06:50:25 INFO Utils: Successfully started service on port 6066.
    18/01/11 06:50:25 INFO StandaloneRestServer: Started REST server for 
submitting applications on port 6066
    18/01/11 06:50:25 INFO Master: I have been elected leader! New state: ALIVE
    ```
    
    * Slave
    ```
    $ docker run -v $SPARK_HOME:/opt/spark spark-test-worker 
spark://172.17.0.3:7077
    CONTAINER_IP=172.17.0.4
    ...
    18/01/11 06:51:54 INFO Worker: Successfully registered with master 
spark://172.17.0.3:7077
    ```
    
    After slave starts, master will show
    ```
    18/01/11 06:51:54 INFO Master: Registering worker 172.17.0.4:8888 with 4 
cores, 1024.0 MB RAM
    ```
    
    Author: Dongjoon Hyun <dongj...@apache.org>
    
    Closes #20230 from dongjoon-hyun/SPARK-23038.
    
    (cherry picked from commit 7a3d0aad2b89aef54f7dd580397302e9ff984d9d)
    Signed-off-by: Felix Cheung <felixche...@apache.org>

commit 0d425c3362dc648d5c85b2b07af1a9df23b6d422
Author: Felix Cheung <felixcheung_m@...>
Date:   2018-01-14T10:43:10Z

    [SPARK-23069][DOCS][SPARKR] fix R doc for describe missing text
    
    ## What changes were proposed in this pull request?
    
    Fix truncated documentation text for `describe`.
    
    ## How was this patch tested?
    
    manually
    
    Author: Felix Cheung <felixcheun...@hotmail.com>
    
    Closes #20263 from felixcheung/r23docfix.
    
    (cherry picked from commit 66738d29c59871b29d26fc3756772b95ef536248)
    Signed-off-by: hyukjinkwon <gurwls...@gmail.com>

commit 5fbbd94d509dbbcfa1fe940569049f72ff4a6e89
Author: Takeshi Yamamuro <yamamuro@...>
Date:   2018-01-14T14:26:21Z

    [SPARK-23021][SQL] AnalysisBarrier should override innerChildren to print 
correct explain output
    
    ## What changes were proposed in this pull request?
    `AnalysisBarrier` in the current master cuts off explain results for parsed 
logical plans;
    ```
    scala> Seq((1, 1)).toDF("a", 
"b").groupBy("a").count().sample(0.1).explain(true)
    == Parsed Logical Plan ==
    Sample 0.0, 0.1, false, -7661439431999668039
    +- AnalysisBarrier Aggregate [a#5], [a#5, count(1) AS count#14L]
    ```
    To fix this, `AnalysisBarrier` needs to override `innerChildren`; this PR
    changes the output to;
    ```
    == Parsed Logical Plan ==
    Sample 0.0, 0.1, false, -5086223488015741426
    +- AnalysisBarrier
          +- Aggregate [a#5], [a#5, count(1) AS count#14L]
             +- Project [_1#2 AS a#5, _2#3 AS b#6]
                +- LocalRelation [_1#2, _2#3]
    ```
    
    ## How was this patch tested?
    Added tests in `DataFrameSuite`.
    
    Author: Takeshi Yamamuro <yamam...@apache.org>
    
    Closes #20247 from maropu/SPARK-23021-2.
    
    (cherry picked from commit 990f05c80347c6eec2ee06823cff587c9ea90b49)
    Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit 9051e1a265dc0f1dc19fd27a0127ffa47f3ac245
Author: Sandor Murakozi <smurakozi@...>
Date:   2018-01-14T14:32:35Z

    [SPARK-23051][CORE] Fix for broken job description in Spark UI
    
    ## What changes were proposed in this pull request?
    
    In 2.2, the Spark UI displayed the stage description if the job description
    was not set. This functionality was broken: the UI showed no description in
    this case. In addition, the code used jobName and jobDescription instead of
    stageName and stageDescription when JobTableRowData was created.
    
    In this PR, the logic producing values for the job rows was modified to find
    the latest stage attempt for the job and use that as a fallback if the job
    description was missing.
    StageName and stageDescription are now set using values from the stage, and
    jobName/jobDescription are used only as a fallback.
    
    ## How was this patch tested?
    Manual testing of the UI, using the code in the bug report.
    
    Author: Sandor Murakozi <smurak...@gmail.com>
    
    Closes #20251 from smurakozi/SPARK-23051.
    
    (cherry picked from commit 60eeecd7760aee6ce2fd207c83ae40054eadaf83)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 2879236b92b5712b7438b972404375bbf1993df8
Author: guoxiaolong <guo.xiaolong1@...>
Date:   2018-01-14T18:02:49Z

    [SPARK-22999][SQL] show databases like command can remove the like keyword
    
    ## What changes were proposed in this pull request?
    
    Make the LIKE keyword optional in `SHOW DATABASES (LIKE pattern = STRING)?`.
    When using this command, the LIKE keyword can be omitted.
    This mirrors the SHOW TABLES command, where both SHOW TABLES 'test*' and
    SHOW TABLES LIKE 'test*' can be used.
    Similarly, SHOW DATABASES 'test*' and SHOW DATABASES LIKE 'test*' can both be
    used.
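    
    For example (the database name is illustrative):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    
    // Both forms list databases matching the pattern; the LIKE keyword is optional.
    spark.sql("CREATE DATABASE IF NOT EXISTS test_db")
    spark.sql("SHOW DATABASES LIKE 'test*'").show()
    spark.sql("SHOW DATABASES 'test*'").show()
    ```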
    
    ## How was this patch tested?
    Unit tests and manual tests.
    
    Author: guoxiaolong <guo.xiaolo...@zte.com.cn>
    
    Closes #20194 from guoxiaolongzte/SPARK-22999.
    
    (cherry picked from commit 42a1a15d739890bdfbb367ef94198b19e98ffcb7)
    Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit 30574fd3716dbdf553cfd0f4d33164ab8fbccb77
Author: Takeshi Yamamuro <yamamuro@...>
Date:   2018-01-15T02:55:21Z

    [SPARK-23054][SQL] Fix incorrect results of casting UserDefinedType to 
String
    
    ## What changes were proposed in this pull request?
    This PR fixes the issue when casting `UserDefinedType`s into strings;
    ```
    >>> from pyspark.ml.classification import MultilayerPerceptronClassifier
    >>> from pyspark.ml.linalg import Vectors
    >>> df = spark.createDataFrame([(0.0, Vectors.dense([0.0, 0.0])), (1.0, 
Vectors.dense([0.0, 1.0]))], ["label", "features"])
    >>> df.selectExpr("CAST(features AS STRING)").show(truncate = False)
    +-------------------------------------------+
    |features                                   |
    +-------------------------------------------+
    |[6,1,0,0,2800000020,2,0,0,0]               |
    |[6,1,0,0,2800000020,2,0,0,3ff0000000000000]|
    +-------------------------------------------+
    ```
    The root cause is that `Cast` handles input data as `UserDefinedType.sqlType`
    (the underlying storage type), so we should pass the data through
    `UserDefinedType.deserialize` and then `toString`.
    This PR changes the result to;
    ```
    +---------+
    |features |
    +---------+
    |[0.0,0.0]|
    |[0.0,1.0]|
    +---------+
    ```
    
    ## How was this patch tested?
    Added tests in `UserDefinedTypeSuite`.
    
    Author: Takeshi Yamamuro <yamam...@apache.org>
    
    Closes #20246 from maropu/SPARK-23054.
    
    (cherry picked from commit b98ffa4d6dabaf787177d3f14b200fc4b118c7ce)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 81b989903af0cdcb6c19e6e8e7bdbac455a2c281
Author: Dongjoon Hyun <dongjoon@...>
Date:   2018-01-15T04:06:56Z

    [SPARK-23049][SQL] `spark.sql.files.ignoreCorruptFiles` should work for ORC 
files
    
    ## What changes were proposed in this pull request?
    
    When `spark.sql.files.ignoreCorruptFiles=true`, we should ignore corrupted 
ORC files.
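    
    Example usage (the path is a placeholder):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    
    // With the flag enabled, corrupted ORC files are skipped instead of failing
    // the whole read.
    spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
    spark.read.orc("/path/to/orc/data").count()
    ```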
    
    ## How was this patch tested?
    
    Pass the Jenkins with a newly added test case.
    
    Author: Dongjoon Hyun <dongj...@apache.org>
    
    Closes #20240 from dongjoon-hyun/SPARK-23049.
    
    (cherry picked from commit 9a96bfc8bf021cb4b6c62fac6ce1bcf87affcd43)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 188999a3401357399d8d2b30f440d8b0b0795fc5
Author: Takeshi Yamamuro <yamamuro@...>
Date:   2018-01-15T08:26:52Z

    [SPARK-23023][SQL] Cast field data to strings in showString
    
    ## What changes were proposed in this pull request?
    The current `Dataset.showString` prints rows through `RowEncoder` deserializers
    like;
    ```
    scala> Seq(Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))).toDF("a").show(false)
    +------------------------------------------------------------+
    |a                                                           |
    +------------------------------------------------------------+
    |[WrappedArray(1, 2), WrappedArray(3), WrappedArray(4, 5, 6)]|
    +------------------------------------------------------------+
    ```
    This result is incorrect because the correct one is;
    ```
    scala> Seq(Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))).toDF("a").show(false)
    +------------------------+
    |a                       |
    +------------------------+
    |[[1, 2], [3], [4, 5, 6]]|
    +------------------------+
    ```
    So, this PR fixes the code in `showString` to cast field data to strings before
    printing.
    
    ## How was this patch tested?
    Added tests in `DataFrameSuite`.
    
    Author: Takeshi Yamamuro <yamam...@apache.org>
    
    Closes #20214 from maropu/SPARK-23023.
    
    (cherry picked from commit b59808385cfe24ce768e5b3098b9034e64b99a5a)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 3491ca4fb5c2e3fecd727f7a31b8efbe74032bcc
Author: Yuming Wang <yumwang@...>
Date:   2018-01-15T13:49:34Z

    [SPARK-19550][BUILD][FOLLOW-UP] Remove MaxPermSize for sql module
    
    ## What changes were proposed in this pull request?
    
    Remove `MaxPermSize` for `sql` module
    
    ## How was this patch tested?
    
    Manually tested.
    
    Author: Yuming Wang <yumw...@ebay.com>
    
    Closes #20268 from wangyum/SPARK-19550-MaxPermSize.
    
    (cherry picked from commit a38c887ac093d7cf343d807515147d87ca931ce7)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit c6a3b9297f0246cfc02a57ec099ca23db90f343f
Author: gatorsmile <gatorsmile@...>
Date:   2018-01-15T14:32:38Z

    [SPARK-23070] Bump previousSparkVersion in MimaBuild.scala to be 2.2.0
    
    ## What changes were proposed in this pull request?
    Bump previousSparkVersion in MimaBuild.scala to be 2.2.0 and add the 
missing exclusions to `v23excludes` in `MimaExcludes`. No item can be 
un-excluded in `v23excludes`.
    
    ## How was this patch tested?
    The existing tests.
    
    Author: gatorsmile <gatorsm...@gmail.com>
    
    Closes #20264 from gatorsmile/bump22.
    
    (cherry picked from commit bd08a9e7af4137bddca638e627ad2ae531bce20f)
    Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit 706a308bdf4a2492d7e01bd3cf2710072704de0e
Author: xubo245 <601450868@...>
Date:   2018-01-15T15:13:15Z

    [SPARK-23035][SQL] Fix improper information of 
TempTableAlreadyExistsException
    
    ## What changes were proposed in this pull request?
    
    Problem: a TempTableAlreadyExistsException with the message "Temporary
    table '$table' already exists" is thrown when we create a temp view using
    org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, which is
    improper.
    
    So fix the improper message of TempTableAlreadyExistsException when creating
    a temp view:
    
    change "Temporary table" to "Temporary view"
    
    ## How was this patch tested?
    
    test("rename temporary view - destination table already exists, with: 
CREATE TEMPORARY view")
    
    test("rename temporary view - destination table with database 
name,with:CREATE TEMPORARY view")
    
    Author: xubo245 <601450...@qq.com>
    
    Closes #20227 from xubo245/fixDeprecated.
    
    (cherry picked from commit 6c81fe227a6233f5d9665d2efadf8a1cf09f700d)
    Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit bb8e5addc79652308169532c33baa8117c2464ca
Author: Marco Gaido <marcogaido91@...>
Date:   2018-01-16T02:47:42Z

    [SPARK-23080][SQL] Improve error message for built-in functions
    
    ## What changes were proposed in this pull request?
    
    When a user passes the wrong number of parameters to a function, an
    AnalysisException is thrown. If the function is a UDF, the user is told how
    many parameters the function expected and how many were passed. If the
    function is a built-in one, however, no information about the expected and
    actual number of parameters is provided. Such information can help debug the
    errors in some cases (e.g. bad quote escaping may lead to a different number
    of parameters than expected).
    
    The PR adds the information about the number of parameters passed and the
    number expected, analogously to what happens for UDFs.
    
    ## How was this patch tested?
    
    modified existing UT + manual test
    
    Author: Marco Gaido <marcogaid...@gmail.com>
    
    Closes #20271 from mgaido91/SPARK-23080.
    
    (cherry picked from commit 8ab2d7ea99b2cff8b54b2cb3a1dbf7580845986a)
    Signed-off-by: hyukjinkwon <gurwls...@gmail.com>

commit e2ffb97819612c062bdfcde12e27e9d04c1a846d
Author: Sameer Agarwal <sameerag@...>
Date:   2018-01-16T03:20:18Z

    [SPARK-23000] Use fully qualified table names in HiveMetastoreCatalogSuite
    
    ## What changes were proposed in this pull request?
    
    In another attempt to fix DataSourceWithHiveMetastoreCatalogSuite, this 
patch uses qualified table names (`default.t`) in the individual tests.
    
    ## How was this patch tested?
    
    N/A (Test Only Change)
    
    Author: Sameer Agarwal <samee...@apache.org>
    
    Closes #20273 from sameeragarwal/flaky-test.
    
    (cherry picked from commit c7572b79da0a29e502890d7618eaf805a1c9f474)
    Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit e58c4a929a5cbd2d611b3e07a29fcc93a827d980
Author: Yuanjian Li <xyliyuanjian@...>
Date:   2018-01-16T06:01:14Z

    [SPARK-22956][SS] Bug fix for 2 streams union failover scenario
    
    ## What changes were proposed in this pull request?
    
    This problem was reported by yanlin-Lynn, ivoson and LiangchangZ. Thanks!
    
    When we union 2 streams from Kafka or other sources, and one of them has no
    continuous data coming in while a task restarts at the same time, an
    `IllegalStateException` is thrown. This is mainly caused by the code in
    [MicroBatchExecution](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L190):
    when one stream has no continuous data, its committedOffset equals its
    availableOffset during `populateStartOffsets`, and `currentPartitionOffsets`
    is not handled properly in KafkaSource. We should probably also consider this
    scenario in other sources.
    
    ## How was this patch tested?
    
    Add a UT in KafkaSourceSuite.scala
    
    Author: Yuanjian Li <xyliyuanj...@gmail.com>
    
    Closes #20150 from xuanyuanking/SPARK-22956.
    
    (cherry picked from commit 07ae39d0ec1f03b1c73259373a8bb599694c7860)
    Signed-off-by: Shixiong Zhu <zsxw...@gmail.com>

commit 20c69816a63071b82b1035d4b48798c358206421
Author: Marcelo Vanzin <vanzin@...>
Date:   2018-01-16T06:40:44Z

    [SPARK-23020][CORE] Fix races in launcher code, test.
    
    The race in the code is because the handle might update
    its state to the wrong state if the connection handling
    thread is still processing incoming data; so the handle
    needs to wait for the connection to finish up before
    checking the final state.
    
    The race in the test is because when waiting for a handle
    to reach a final state, the waitFor() method needs to wait
    until all handle state is updated (which also includes
    waiting for the connection thread above to finish).
    Otherwise, waitFor() may return too early, which would cause
    a bunch of different races (like the listener not being yet
    notified of the state change, or being in the middle of
    being notified, or the handle not being properly disposed
    and causing postChecks() to assert).
    
    On top of that I found, by code inspection, a couple of
    potential races that could make a handle end up in the
    wrong state when being killed.
    
    Tested by running the existing unit tests a lot (and not
    seeing the errors I was seeing before).
    
    Author: Marcelo Vanzin <van...@cloudera.com>
    
    Closes #20223 from vanzin/SPARK-23020.
    
    (cherry picked from commit 66217dac4f8952a9923625908ad3dcb030763c81)
    Signed-off-by: Sameer Agarwal <samee...@apache.org>

commit 5c06ee2d49987c297e93de87f99c701e178ba294
Author: gatorsmile <gatorsmile@...>
Date:   2018-01-16T11:20:33Z

    [SPARK-22978][PYSPARK] Register Vectorized UDFs for SQL Statement
    
    ## What changes were proposed in this pull request?
    Register Vectorized UDFs for SQL Statement. For example,
    
    ```Python
    >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
    >>> pandas_udf("integer", PandasUDFType.SCALAR)
    ... def add_one(x):
    ...     return x + 1
    ...
    >>> _ = spark.udf.register("add_one", add_one)
    >>> spark.sql("SELECT add_one(id) FROM range(3)").collect()
    [Row(add_one(id)=1), Row(add_one(id)=2), Row(add_one(id)=3)]
    ```
    
    ## How was this patch tested?
    Added test cases
    
    Author: gatorsmile <gatorsm...@gmail.com>
    
    Closes #20171 from gatorsmile/supportVectorizedUDF.
    
    (cherry picked from commit b85eb946ac298e711dad25db0d04eee41d7fd236)
    Signed-off-by: hyukjinkwon <gurwls...@gmail.com>

----

