GitHub user dahaian opened a pull request:

    https://github.com/apache/spark/pull/19489

    Branch 2.2

    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19489.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19489
    
----
commit fafe283277b50974c26684b06449086acd0cf05a
Author: Wenchen Fan <wenc...@databricks.com>
Date:   2017-05-26T07:01:28Z

    [SPARK-20868][CORE] UnsafeShuffleWriter should verify the position after 
FileChannel.transferTo
    
    ## What changes were proposed in this pull request?
    
    A long time ago we fixed a [bug](https://issues.apache.org/jira/browse/SPARK-3948) in the shuffle writer related to `FileChannel.transferTo`. We were not very confident about that fix, so we added a position check after the write to try to discover the bug earlier.
    
     However, this check is missing in the new `UnsafeShuffleWriter`; this PR adds it.
    
    https://issues.apache.org/jira/browse/SPARK-18105 may be related to that `FileChannel.transferTo` bug; hopefully we can find the root cause after adding this position check.
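
    For context, a minimal sketch of this kind of position check (illustrative names only, not the actual `UnsafeShuffleWriter` code) could look like:
    ```Scala
    import java.io.{FileInputStream, FileOutputStream}

    // Copy `bytesToCopy` bytes via FileChannel.transferTo, then verify that the destination
    // channel advanced by exactly that amount.
    def copyWithPositionCheck(in: FileInputStream, out: FileOutputStream, bytesToCopy: Long): Unit = {
      val inChannel = in.getChannel
      val outChannel = out.getChannel
      val initialPos = outChannel.position()
      var copied = 0L
      while (copied < bytesToCopy) {
        copied += inChannel.transferTo(copied, bytesToCopy - copied, outChannel)
      }
      val finalPos = outChannel.position()
      // If the position did not advance as expected, the old transferTo bug may have been hit.
      assert(finalPos == initialPos + bytesToCopy,
        s"Expected position ${initialPos + bytesToCopy} after transferTo but got $finalPos")
    }
    ```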
    
    ## How was this patch tested?
    
    N/A
    
    Author: Wenchen Fan <wenc...@databricks.com>
    
    Closes #18091 from cloud-fan/shuffle.
    
    (cherry picked from commit d9ad78908f6189719cec69d34557f1a750d2e6af)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit f99456b5f6225a534ce52cf2b817285eb8853926
Author: NICHOLAS T. MARION <nmar...@us.ibm.com>
Date:   2017-05-10T09:59:57Z

    [SPARK-20393][WEB UI] Strengthen Spark to prevent XSS vulnerabilities
    
    ## What changes were proposed in this pull request?
    
    Add stripXSS and stripXSSMap to Spark Core's UIUtils, and call these functions wherever getParameter is called against an HttpServletRequest.
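
    A rough sketch of what such a sanitizing helper could look like (hypothetical implementation; the actual UIUtils code may differ):
    ```Scala
    object XssSanitizer {
      // Strip characters commonly used in XSS payloads from a single request parameter.
      def stripXSS(requestParameter: String): String = {
        if (requestParameter == null) null
        else requestParameter.replaceAll("[<>\"'%;()&+]", "")
      }

      // Sanitize every value obtained from an HttpServletRequest parameter map.
      def stripXSSMap(params: Map[String, Array[String]]): Map[String, Array[String]] =
        params.map { case (key, values) => key -> values.map(stripXSS) }
    }
    ```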
    
    ## How was this patch tested?
    
    Unit tests, IBM Security AppScan Standard no longer showing 
vulnerabilities, manual verification of WebUI pages.
    
    Author: NICHOLAS T. MARION <nmar...@us.ibm.com>
    
    Closes #17686 from n-marion/xss-fix.
    
    (cherry picked from commit b512233a457092b0e2a39d0b42cb021abc69d375)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 92837aeb47fc3427166e4b6e62f6130f7480d7fa
Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>
Date:   2017-05-16T21:47:21Z

    [SPARK-19372][SQL] Fix throwing a Java exception at df.filter() due to the 64KB bytecode size limit
    
    ## What changes were proposed in this pull request?
    
    When an expression for `df.filter()` has many nodes (e.g. 400), the Java bytecode of the generated code exceeds 64KB, which produces a Java exception and causes the execution to fail.
    This PR lets execution continue by disabling code generation and calling `Expression.eval()` when such an exception is caught.
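
    A hedged sketch of that fallback pattern (here `compileToPredicate` is a stand-in for the code-generation entry point, not Spark's actual API):
    ```Scala
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.catalyst.expressions.Expression

    // Try the code-generated path once; if compilation throws (e.g. because a generated
    // method exceeds the 64KB bytecode limit), fall back to interpreted Expression.eval().
    def buildFilter(
        expr: Expression,
        compileToPredicate: Expression => InternalRow => Boolean): InternalRow => Boolean = {
      try {
        compileToPredicate(expr)
      } catch {
        case _: Exception =>
          // Interpreted evaluation: slower, but not subject to the bytecode size limit.
          (row: InternalRow) => expr.eval(row).asInstanceOf[Boolean]
      }
    }
    ```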
    
    ## How was this patch tested?
    
    Add a test suite to `DataFrameSuite`.
    
    Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>
    
    Closes #17087 from kiszk/SPARK-19372.

commit 2b59ed4f1d4e859d5987b6eaaee074260b2a12f8
Author: Michael Armbrust <mich...@databricks.com>
Date:   2017-05-26T20:33:23Z

    [SPARK-20844] Remove experimental from Structured Streaming APIs
    
    Now that Structured Streaming has been out for several Spark releases and has large production use cases, the `Experimental` label is no longer appropriate. I've left `InterfaceStability.Evolving`, however, as I think we may make a few changes to the pluggable Source & Sink API in Spark 2.3.
    
    Author: Michael Armbrust <mich...@databricks.com>
    
    Closes #18065 from marmbrus/streamingGA.

commit 30922dec8a8cc598b6715f85281591208a91df00
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-26T22:01:01Z

    [SPARK-20694][DOCS][SQL] Document DataFrameWriter partitionBy, bucketBy and 
sortBy in SQL guide
    
    ## What changes were proposed in this pull request?
    
    - Add Scala, Python and Java examples for `partitionBy`, `sortBy` and `bucketBy`.
    - Add a _Bucketing, Sorting and Partitioning_ section to the SQL Programming Guide.
    - Remove bucketing from Unsupported Hive Functionalities.
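
    For reference, a minimal Scala illustration of the APIs being documented (the `people` DataFrame and its columns are assumed for the example):
    ```Scala
    // Partitioned output: one directory per distinct value of the partition column.
    people.write
      .partitionBy("country")
      .format("parquet")
      .save("/tmp/people_partitioned")

    // Bucketed and sorted output: bucketBy/sortBy require saveAsTable, not save().
    people.write
      .bucketBy(42, "name")
      .sortBy("age")
      .saveAsTable("people_bucketed")
    ```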
    
    ## How was this patch tested?
    
    Manual tests, docs build.
    
    Author: zero323 <zero...@users.noreply.github.com>
    
    Closes #17938 from zero323/DOCS-BUCKETING-AND-PARTITIONING.
    
    (cherry picked from commit ae33abf71b353c638487948b775e966c7127cd46)
    Signed-off-by: Xiao Li <gatorsm...@gmail.com>

commit fc799d730304c6a176636b414fc15184e89367d7
Author: Yu Peng <loneknigh...@gmail.com>
Date:   2017-05-26T23:28:36Z

    [SPARK-10643][CORE] Make spark-submit download remote files to local in 
client mode
    
    ## What changes were proposed in this pull request?
    
    This PR makes the spark-submit script download remote files to the local file system for local/standalone client mode.
    
    ## How was this patch tested?
    
    - Unit tests
    - Manual tests by adding s3a jar and testing against file on s3.
    
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.
    
    Author: Yu Peng <loneknigh...@gmail.com>
    
    Closes #18078 from loneknightpy/download-jar-in-spark-submit.
    
    (cherry picked from commit 4af37812915763ac3bfd91a600a7f00a4b84d29a)
    Signed-off-by: Xiao Li <gatorsm...@gmail.com>

commit 39f76657ef2967f4c87230e06cbbb1611c276375
Author: Wenchen Fan <wenc...@databricks.com>
Date:   2017-05-27T02:57:43Z

    [SPARK-19659][CORE][FOLLOW-UP] Fetch big blocks to disk when shuffle-read
    
    ## What changes were proposed in this pull request?
    
    This PR includes some minor improvements to the comments and tests in https://github.com/apache/spark/pull/16989.
    
    ## How was this patch tested?
    
    N/A
    
    Author: Wenchen Fan <wenc...@databricks.com>
    
    Closes #18117 from cloud-fan/follow.
    
    (cherry picked from commit 1d62f8aca82601506c44b6fd852f4faf3602d7e2)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit f2408bdd7a0950385ee1364e006d55bfa6e5a200
Author: Shixiong Zhu <shixi...@databricks.com>
Date:   2017-05-27T05:25:38Z

    [SPARK-20843][CORE] Add a config to set driver terminate timeout
    
    ## What changes were proposed in this pull request?
    
    Add a `worker` configuration to set how long to wait before forcibly killing the driver.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    
    Closes #18126 from zsxwing/SPARK-20843.
    
    (cherry picked from commit 6c1dbd6fc8d49acf7c1c902d2ebf89ed5e788a4e)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 25e87d80c483785dc4a79fb283bc80f68197bf4a
Author: Wenchen Fan <wenc...@databricks.com>
Date:   2017-05-27T23:16:51Z

    [SPARK-20897][SQL] cached self-join should not fail
    
    ## What changes were proposed in this pull request?
    
    In the failing test case, we have a `SortMergeJoinExec` for a self-join, which means there is a `ReusedExchange` node in the query plan. It works fine without caching, but throws an exception in `SortMergeJoinExec.outputPartitioning` if we cache it.
    
    The root cause is that `ReusedExchange` doesn't propagate the output partitioning from its child, so in `SortMergeJoinExec.outputPartitioning` we create a `PartitioningCollection` with a hash partitioning and an unknown partitioning, and fail.
    
    This bug is mostly harmless, because inserting the `ReusedExchange` is the last step in preparing the physical plan, and we won't call `SortMergeJoinExec.outputPartitioning` again after that.
    
    However, if the DataFrame is cached, its physical plan becomes `InMemoryTableScanExec`, which contains another physical plan representing the cached query; that plan has gone through the entire planning phase and may contain a `ReusedExchange`. The planner then calls `InMemoryTableScanExec.outputPartitioning`, which in turn calls `SortMergeJoinExec.outputPartitioning` and triggers this bug.
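
    A rough sketch of the failing scenario (assuming a SparkSession `spark` and `import spark.implicits._`; this is only an illustration, not the actual regression test):
    ```Scala
    // Force a sort-merge join so the plan contains exchanges (and a ReusedExchange for the
    // self-join), then cache the result so it is wrapped in InMemoryTableScanExec.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

    val df = spark.range(100).select($"id".as("key"), ($"id" * 2).as("value"))
    val joined = df.as("a").join(df.as("b"), $"a.key" === $"b.key")
    joined.cache()
    joined.count()  // before the fix, planning over the cached plan could throw
    ```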
    
    ## How was this patch tested?
    
    a new regression test
    
    Author: Wenchen Fan <wenc...@databricks.com>
    
    Closes #18121 from cloud-fan/bug.
    
    (cherry picked from commit 08ede46b897b7e52cfe8231ffc21d9515122cf49)
    Signed-off-by: Xiao Li <gatorsm...@gmail.com>

commit dc51be1e79b89d143da5df16b893df86a306f059
Author: Xiao Li <gatorsm...@gmail.com>
Date:   2017-05-28T04:32:18Z

    [SPARK-20908][SQL] Cache Manager: Hint should be ignored in plan matching
    
    ### What changes were proposed in this pull request?
    
    In Cache manager, the plan matching should ignore Hint.
    ```Scala
          val df1 = spark.range(10).join(broadcast(spark.range(10)))
          df1.cache()
          spark.range(10).join(spark.range(10)).explain()
    ```
    The output plan of the above query shows that the second query is not using the data cached by the first query.
    ```
    BroadcastNestedLoopJoin BuildRight, Inner
    :- *Range (0, 10, step=1, splits=2)
    +- BroadcastExchange IdentityBroadcastMode
       +- *Range (0, 10, step=1, splits=2)
    ```
    
    After the fix, the plan becomes
    ```
    InMemoryTableScan [id#20L, id#23L]
       +- InMemoryRelation [id#20L, id#23L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
             +- BroadcastNestedLoopJoin BuildRight, Inner
                :- *Range (0, 10, step=1, splits=2)
                +- BroadcastExchange IdentityBroadcastMode
                   +- *Range (0, 10, step=1, splits=2)
    ```
    
    ### How was this patch tested?
    Added a test.
    
    Author: Xiao Li <gatorsm...@gmail.com>
    
    Closes #18131 from gatorsmile/HintCache.
    
    (cherry picked from commit 06c155c90dc784b07002f33d98dcfe9be1e38002)
    Signed-off-by: Xiao Li <gatorsm...@gmail.com>

commit 26640a26984bac4fc1037714e60bd3607929b377
Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>
Date:   2017-05-29T19:17:14Z

    [SPARK-20907][TEST] Use testQuietly for test suites that generate long log 
output
    
    ## What changes were proposed in this pull request?
    
    Suppress console output by using `testQuietly` in test suites.
    
    ## How was this patch tested?
    
    Tested by `"SPARK-19372: Filter can be executed w/o generated code due to 
JVM code size limit"` in `DataFrameSuite`
    
    Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>
    
    Closes #18135 from kiszk/SPARK-20907.
    
    (cherry picked from commit c9749068ecf8e0acabdfeeceeedff0f1f73293b7)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 3b79e4cda74e0bf82ec55e673beb8f84e7cfaca4
Author: Yuming Wang <wgy...@gmail.com>
Date:   2017-05-29T23:10:22Z

    [SPARK-8184][SQL] Add additional function description for weekofyear
    
    ## What changes were proposed in this pull request?
    
    Add additional function description for weekofyear.
    
    ## How was this patch tested?
    
     manual tests
    
    
![weekofyear](https://cloud.githubusercontent.com/assets/5399861/26525752/08a1c278-4394-11e7-8988-7cbf82c3a999.gif)
    
    Author: Yuming Wang <wgy...@gmail.com>
    
    Closes #18132 from wangyum/SPARK-8184.
    
    (cherry picked from commit 1c7db00c74ec6a91c7eefbdba85cbf41fbe8634a)
    Signed-off-by: Reynold Xin <r...@databricks.com>

commit f6730a70cb47ebb3df7f42209df7b076aece1093
Author: Prashant Sharma <prash...@in.ibm.com>
Date:   2017-05-30T01:12:01Z

    [SPARK-19968][SS] Use a cached instance of `KafkaProducer` instead of 
creating one every batch.
    
    ## What changes were proposed in this pull request?
    
    In summary, the cost of recreating a KafkaProducer for every batch write is high, as it starts a lot of threads, makes connections, and then closes them. A KafkaProducer instance is documented to be thread safe in the Kafka docs, and reusing a KafkaProducer instance while writing from multiple threads is encouraged.
    
    Furthermore, I measured a 10x improvement in latency with this patch.
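
    A hedged sketch of the caching idea (illustrative names, not the actual cached-producer implementation):
    ```Scala
    import java.util.Properties
    import java.util.concurrent.ConcurrentHashMap

    import org.apache.kafka.clients.producer.KafkaProducer

    // Keep one KafkaProducer per distinct producer configuration and reuse it across batches
    // and threads (KafkaProducer is documented to be thread safe), instead of creating and
    // closing a new producer for every batch.
    object ProducerCache {
      private val cache = new ConcurrentHashMap[String, KafkaProducer[Array[Byte], Array[Byte]]]()

      def getOrCreate(params: Properties): KafkaProducer[Array[Byte], Array[Byte]] = {
        val key = params.toString  // crude cache key derived from the producer config
        var producer = cache.get(key)
        if (producer == null) {
          producer = new KafkaProducer[Array[Byte], Array[Byte]](params)
          val existing = cache.putIfAbsent(key, producer)
          if (existing != null) {
            producer.close()  // another thread won the race; reuse the cached instance
            producer = existing
          }
        }
        producer
      }
    }
    ```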
    
    ### These are times that addBatch took in ms. Without applying this patch
    
![with-out_patch](https://cloud.githubusercontent.com/assets/992952/23994612/a9de4a42-0a6b-11e7-9d5b-7ae18775bee4.png)
    ### These are times that addBatch took in ms. After applying this patch
    
![with_patch](https://cloud.githubusercontent.com/assets/992952/23994616/ad8c11ec-0a6b-11e7-8634-2266ebb5033f.png)
    
    ## How was this patch tested?
    Running distributed benchmarks comparing runs with this patch and without 
it.
    Added relevant unit tests.
    
    Author: Prashant Sharma <prash...@in.ibm.com>
    
    Closes #17308 from ScrapCodes/cached-kafka-producer.
    
    (cherry picked from commit 96a4d1d0827fc3fba83f174510b061684f0d00f7)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 5fdc7d80f46d51d4a8e49d9390b191fff42ec222
Author: Xiao Li <gatorsm...@gmail.com>
Date:   2017-05-30T21:06:19Z

    [SPARK-20924][SQL] Unable to call a function registered in a non-current database
    
    ### What changes were proposed in this pull request?
    We are unable to call a function registered in a non-current database.
    ```Scala
    sql("CREATE DATABASE dAtABaSe1")
    sql(s"CREATE FUNCTION dAtABaSe1.test_avg AS 
'${classOf[GenericUDAFAverage].getName}'")
    sql("SELECT dAtABaSe1.test_avg(1)")
    ```
    The above code returns an error:
    ```
    Undefined function: 'dAtABaSe1.test_avg'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7
    ```
    
    This PR is to fix the above issue.
    ### How was this patch tested?
    Added test cases.
    
    Author: Xiao Li <gatorsm...@gmail.com>
    
    Closes #18146 from gatorsmile/qualifiedFunction.
    
    (cherry picked from commit 4bb6a53ebd06de3de97139a2dbc7c85fc3aa3e66)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 287440df6816b5c9f2be2aee949a4c20ab165180
Author: jerryshao <ss...@hortonworks.com>
Date:   2017-05-31T03:24:43Z

    [SPARK-20275][UI] Do not display "Completed" column for in-progress 
applications
    
    ## What changes were proposed in this pull request?
    
    Currently the HistoryServer displays the completed date of an in-progress application as `1969-12-31 23:59:59`, which is not meaningful. Instead of unnecessarily showing this incorrect completed date, this change makes the column invisible for in-progress applications.
    
    The purpose of only making this column invisible rather than deleting the field is that the data is fetched through the REST API, whose format is shown below, in which `endTime` matches `endTimeEpoch`. So instead of changing the REST API and breaking backward compatibility, we choose the simple solution of only making this column invisible.
    
    ```
    [ {
      "id" : "local-1491805439678",
      "name" : "Spark shell",
      "attempts" : [ {
        "startTime" : "2017-04-10T06:23:57.574GMT",
        "endTime" : "1969-12-31T23:59:59.999GMT",
        "lastUpdated" : "2017-04-10T06:23:57.574GMT",
        "duration" : 0,
        "sparkUser" : "",
        "completed" : false,
        "startTimeEpoch" : 1491805437574,
        "endTimeEpoch" : -1,
        "lastUpdatedEpoch" : 1491805437574
      } ]
    } ]
    ```
    
    Here is the UI before the change:
    
    <img width="1317" alt="screen shot 2017-04-10 at 3 45 57 pm" 
src="https://cloud.githubusercontent.com/assets/850797/24851938/17d46cc0-1e08-11e7-84c7-90120e171b41.png";>
    
    And after:
    
    <img width="1281" alt="screen shot 2017-04-10 at 4 02 35 pm" 
src="https://cloud.githubusercontent.com/assets/850797/24851945/1fe9da58-1e08-11e7-8d0d-9262324f9074.png";>
    
    ## How was this patch tested?
    
    Manual verification.
    
    Author: jerryshao <ss...@hortonworks.com>
    
    Closes #17588 from jerryshao/SPARK-20275.
    
    (cherry picked from commit 52ed9b289d169219f7257795cbedc56565a39c71)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 3cad66e5e06a4020a16fa757fbf67f666b319bab
Author: Felix Cheung <felixcheun...@hotmail.com>
Date:   2017-05-31T05:33:29Z

    [SPARK-20877][SPARKR][WIP] add timestamps to test runs
    
    To investigate how long they run.
    
    Jenkins, AppVeyor
    
    Author: Felix Cheung <felixcheun...@hotmail.com>
    
    Closes #18104 from felixcheung/rtimetest.
    
    (cherry picked from commit 382fefd1879e4670f3e9e8841ec243e3eb11c578)
    Signed-off-by: Shivaram Venkataraman <shiva...@cs.berkeley.edu>

commit 3686c2e965758f471f9784b3e06223ce143b6aca
Author: David Eis <d...@bloomberg.net>
Date:   2017-05-31T12:52:55Z

    [SPARK-20790][MLLIB] Correctly handle negative values for implicit feedback 
in ALS
    
    ## What changes were proposed in this pull request?
    
    Revert the handling of negative values in ALS with implicit feedback, so 
that the confidence is the absolute value of the rating and the preference is 0 
for negative ratings. This was the original behavior.
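
    As a small worked illustration of the restored rule (not the actual mllib ALS code):
    ```Scala
    // For implicit feedback: the confidence is derived from the absolute value of the
    // rating, and the preference is 1 for positive ratings and 0 otherwise.
    def implicitConfidenceAndPreference(rating: Double): (Double, Double) = {
      val confidence = math.abs(rating)
      val preference = if (rating > 0) 1.0 else 0.0
      (confidence, preference)
    }
    ```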
    
    ## How was this patch tested?
    
    This patch was tested with the existing unit tests and an added unit test 
to ensure that negative ratings are not ignored.
    
    mengxr
    
    Author: David Eis <d...@bloomberg.net>
    
    Closes #18022 from davideis/bugfix/negative-rating.
    
    (cherry picked from commit d52f636228e833db89045bc7a0c17b72da13f138)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit f59f9a380351726de20453ab101f46e199a7079c
Author: liuxian <liu.xi...@zte.com.cn>
Date:   2017-05-31T18:43:36Z

    [SPARK-20876][SQL][BACKPORT-2.2] If the input parameter is of float type for ceil or floor, the result is not what we expected
    
    ## What changes were proposed in this pull request?
    
    This PR is to backport #18103 to Spark 2.2
    
    ## How was this patch tested?
    unit test
    
    Author: liuxian <liu.xi...@zte.com.cn>
    
    Closes #18155 from 10110346/wip-lx-0531.

commit a607a26b344470bbff1247908d49b848bb7918a0
Author: Shixiong Zhu <shixi...@databricks.com>
Date:   2017-06-01T00:26:18Z

    [SPARK-20940][CORE] Replace IllegalAccessError with IllegalStateException
    
    ## What changes were proposed in this pull request?
    
    `IllegalAccessError` is a fatal error (a subclass of `LinkageError`) and its meaning is `Thrown if an application attempts to access or modify a field, or to call a method that it does not have access to`. Throwing a fatal error for AccumulatorV2 is not necessary and is pretty bad, because it usually just kills executors or the SparkContext ([SPARK-20666](https://issues.apache.org/jira/browse/SPARK-20666) is an example of killing the SparkContext due to `IllegalAccessError`). I think the correct type of exception for AccumulatorV2 is `IllegalStateException`.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    
    Closes #18168 from zsxwing/SPARK-20940.
    
    (cherry picked from commit 24db35826a81960f08e3eb68556b0f51781144e1)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 14fda6f313c63c9d5a86595c12acfb1e36df43ad
Author: jerryshao <ss...@hortonworks.com>
Date:   2017-06-01T05:34:53Z

    [SPARK-20244][CORE] Handle incorrect bytesRead metrics when using PySpark
    
    ## What changes were proposed in this pull request?
    
    Hadoop FileSystem's statistics are based on thread-local variables, which is fine if the RDD computation chain runs in a single thread. But if a child RDD creates another thread to consume the iterator obtained from Hadoop RDDs, the bytesRead computation will be wrong, because the iterator's `next()` and `close()` may then run in different threads. This can happen when using PySpark with PythonRDD.
    
    So here we build a map to track `bytesRead` per thread and add the values together. This approach could be used in three RDDs: `HadoopRDD`, `NewHadoopRDD` and `FileScanRDD`. I assume `FileScanRDD` cannot be called directly, so I only fixed `HadoopRDD` and `NewHadoopRDD`.
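
    A hedged sketch of the approach (illustrative names, not the actual HadoopRDD code):
    ```Scala
    import java.util.concurrent.ConcurrentHashMap
    import scala.collection.JavaConverters._

    // Record the Hadoop FileSystem bytes-read statistic per consuming thread, then sum the
    // per-thread values to get the task total, so threads spawned by a child RDD (e.g.
    // PythonRDD) are still counted.
    class ThreadAwareBytesRead {
      private val bytesReadPerThread = new ConcurrentHashMap[Long, Long]()

      // Called from whichever thread is consuming the record iterator.
      def record(bytesRead: Long): Unit = {
        bytesReadPerThread.put(Thread.currentThread().getId, bytesRead)
      }

      // Total bytes read across all threads that touched this partition.
      def total: Long = bytesReadPerThread.values().asScala.sum
    }
    ```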
    
    ## How was this patch tested?
    
    Unit test and local cluster verification.
    
    Author: jerryshao <ss...@hortonworks.com>
    
    Closes #17617 from jerryshao/SPARK-20244.
    
    (cherry picked from commit 5854f77ce1d3b9491e2a6bd1f352459da294e369)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 4ab7b820bfa90bd76670715bedfd155df7d8f0fd
Author: Yuming Wang <wgy...@gmail.com>
Date:   2017-06-01T06:17:15Z

    [MINOR][SQL] Fix a few function description errors.
    
    ## What changes were proposed in this pull request?
    
    Fix a few function description errors.
    
    ## How was this patch tested?
    
    manual tests
    
    
![descissues](https://cloud.githubusercontent.com/assets/5399861/26619392/d547736c-4610-11e7-85d7-aeeb09c02cc8.gif)
    
    Author: Yuming Wang <wgy...@gmail.com>
    
    Closes #18157 from wangyum/DescIssues.
    
    (cherry picked from commit c8045f8b482e347eccf2583e0952e1d8bcb6cb96)
    Signed-off-by: Xiao Li <gatorsm...@gmail.com>

commit 6a4e023b250a86887475958093f1d3bdcbb49a03
Author: Xiao Li <gatorsm...@gmail.com>
Date:   2017-06-01T16:52:18Z

    [SPARK-20941][SQL] Fix SubqueryExec Reuse
    
    Before this PR, Subquery reuse does not work. Below are three issues:
    - Subquery reuse does not work.
    - It shares the same `SQLConf` setting (`spark.sql.exchange.reuse`) with Exchange Reuse.
    - No test case covers the Subquery Reuse rule.
    
    This PR is to fix the above three issues.
    - Ignored the physical operator `SubqueryExec` when comparing two plans.
    - Added a dedicated conf `spark.sql.subqueries.reuse` for controlling 
Subquery Reuse
    - Added a test case for verifying the behavior
    
    N/A
    
    Author: Xiao Li <gatorsm...@gmail.com>
    
    Closes #18169 from gatorsmile/subqueryReuse.
    
    (cherry picked from commit f7cf2096fdecb8edab61c8973c07c6fc877ee32d)
    Signed-off-by: Xiao Li <gatorsm...@gmail.com>

commit b81a702f44c3c0fe851a602d5c4682a82149a1f4
Author: Li Yichao <l...@zhihu.com>
Date:   2017-06-01T21:39:57Z

    [SPARK-20365][YARN] Remove the local scheme when adding a path to the ClassPath.
    
    In Spark on YARN, when configuring "spark.yarn.jars" with local jars (jars using the "local" scheme), we get an inaccurate classpath for the AM and containers. This is because we don't remove the "local" scheme when concatenating the classpath. Jobs still run, because the classpath is separated with ":" and Java treats the "local" entry as a separate jar, but we can improve this by removing the scheme.
    
    Updated `ClientSuite` to check that "local" is not in the classpath.
    
    cc jerryshao
    
    Author: Li Yichao <l...@zhihu.com>
    Author: Li Yichao <liyichao.g...@gmail.com>
    
    Closes #18129 from liyichao/SPARK-20365.
    
    (cherry picked from commit 640afa49aa349c7ebe35d365eec3ef9bb7710b1d)
    Signed-off-by: Marcelo Vanzin <van...@cloudera.com>

commit 4cba3b5a350f4d477466fc73b32cbd653eee8405
Author: Marcelo Vanzin <van...@cloudera.com>
Date:   2017-06-01T21:44:34Z

    [SPARK-20922][CORE] Add whitelist of classes that can be deserialized by 
the launcher.
    
    Blindly deserializing classes using Java serialization opens the code up to
    issues in other libraries, since just deserializing data from a stream may
    end up executing code (think readObject()).
    
    Since the launcher protocol is pretty self-contained, there's just a handful
    of classes it legitimately needs to deserialize, and they're in just two
    packages, so add a filter that throws errors if classes from any other
    package show up in the stream.
    
    This also maintains backwards compatibility (the updated launcher code can
    still communicate with the backend code in older Spark releases).
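
    A hedged sketch of the idea (class and package names here are illustrative, not the actual launcher code):
    ```Scala
    import java.io.{InputStream, ObjectInputStream, ObjectStreamClass}

    // Only allow classes from an explicit set of packages to be deserialized; anything else
    // fails fast instead of letting arbitrary readObject() logic run.
    class FilteredObjectInputStream(in: InputStream) extends ObjectInputStream(in) {
      private val allowedPackages = Seq("java.lang.", "org.apache.spark.launcher.")

      override protected def resolveClass(desc: ObjectStreamClass): Class[_] = {
        val name = desc.getName
        if (!allowedPackages.exists(p => name.startsWith(p))) {
          throw new IllegalArgumentException(s"Unexpected class in stream: $name")
        }
        super.resolveClass(desc)
      }
    }
    ```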
    
    Tested with new and existing unit tests.
    
    Author: Marcelo Vanzin <van...@cloudera.com>
    
    Closes #18166 from vanzin/SPARK-20922.
    
    (cherry picked from commit 8efc6e986554ae66eab93cd64a9035d716adbab0)
    Signed-off-by: Marcelo Vanzin <van...@cloudera.com>

commit bb3d900b48d50d27188a52662d5eb95738265669
Author: Bogdan Raducanu <bog...@databricks.com>
Date:   2017-06-01T22:50:40Z

    [SPARK-20854][SQL] Extend hint syntax to support expressions
    
    SQL hint syntax:
    * support expressions such as strings, numbers, etc. instead of only identifiers, as is currently the case.
    * support multiple hints, which was missing compared to the DataFrame 
syntax.
    
    DataFrame API:
    * support any parameters in DataFrame.hint instead of just strings
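
    For illustration (the DataFrames and table names below are assumed):
    ```Scala
    // DataFrame API: hint parameters are no longer restricted to plain strings.
    val joined = df.hint("broadcast").join(other, "id")

    // SQL: hint arguments may now be expressions such as strings or numbers, and a
    // statement may carry more than one hint.
    spark.sql("SELECT /*+ MAPJOIN(t1) */ * FROM t1 JOIN t2 ON t1.id = t2.id")
    ```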
    
    Existing tests. New tests in PlanParserSuite. New suite DataFrameHintSuite.
    
    Author: Bogdan Raducanu <bog...@databricks.com>
    
    Closes #18086 from bogdanrdc/SPARK-20854.
    
    (cherry picked from commit 2134196a9c0aca82bc3e203c09e776a8bd064d65)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 25cc80066d68190c1ced7473dd4fd40f7e8dec3a
Author: guoxiaolong <guo.xiaolo...@zte.com.cn>
Date:   2017-06-02T13:38:00Z

    [SPARK-20942][WEB-UI] The title style of the fields is wrong in the history server web UI.
    
    ## What changes were proposed in this pull request?
    
    1. The field title style is wrong.
    Before the fix:
    
![before](https://cloud.githubusercontent.com/assets/26266482/26661987/a7bed018-46b3-11e7-8a54-a5152d2df0f4.png)
    
    After the fix:
    
![fix](https://cloud.githubusercontent.com/assets/26266482/26662000/ba6cc814-46b3-11e7-8f33-cfd4cc2c60fe.png)
    
    
![fix1](https://cloud.githubusercontent.com/assets/26266482/26662080/3c732e3e-46b4-11e7-8768-20b5a6aeadcb.png)
    
    executor-page style:
    
![executor_page](https://cloud.githubusercontent.com/assets/26266482/26662384/167cbd10-46b6-11e7-9e07-bf391dbc6e08.png)
    
    2. In the title text, 'the application' should be changed to 'this application'.
    
    3. Code analysis:
     `$('#history-summary [data-toggle="tooltip"]').tooltip();`
    The id 'history-summary' does not exist; the page only contains the id 'history-summary-table'.
    
    ## How was this patch tested?
    
    manual tests
    
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.
    
    Author: guoxiaolong <guo.xiaolo...@zte.com.cn>
    Author: 郭小龙 10207633 <guo.xiaolo...@zte.com.cn>
    Author: guoxiaolongzte <guo.xiaolo...@zte.com.cn>
    
    Closes #18170 from guoxiaolongzte/SPARK-20942.
    
    (cherry picked from commit 625cebfde632361122e0db3452c4cc38147f696f)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit ae00d49afc9d6aaeabb16d905b764d705963ab50
Author: Wenchen Fan <wenc...@databricks.com>
Date:   2017-06-02T16:58:01Z

    [SPARK-20967][SQL] SharedState.externalCatalog is not really lazy
    
    ## What changes were proposed in this pull request?
    
    `SharedState.externalCatalog` is marked as a `lazy val`, but it is not actually lazy. We access `externalCatalog` while initializing `SharedState` and thus defeat the purpose of the `lazy val`. Since creating the `ExternalCatalog` tries to connect to the metastore and may throw an error, it makes sense to keep it truly lazy in `SharedState`.
    
    ## How was this patch tested?
    
    existing tests.
    
    Author: Wenchen Fan <wenc...@databricks.com>
    
    Closes #18187 from cloud-fan/minor.
    
    (cherry picked from commit d1b80ab9220d83e5fdaf33c513cc811dd17d0de1)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit f36c3ee492c6d06e86a93c8e1e4aa1bf922c4e03
Author: Wenchen Fan <wenc...@databricks.com>
Date:   2017-06-02T17:05:05Z

    [SPARK-20946][SQL] simplify the config setting logic in 
SparkSession.getOrCreate
    
    ## What changes were proposed in this pull request?
    
    The current conf setting logic is a little complex and has duplication; this PR simplifies it.
    
    ## How was this patch tested?
    
    existing tests.
    
    Author: Wenchen Fan <wenc...@databricks.com>
    
    Closes #18172 from cloud-fan/session.
    
    (cherry picked from commit e11d90bf8deb553fd41b8837e3856c11486c2503)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 7f35f5b99d9099b11e68af805a871e7b19f96df5
Author: Shixiong Zhu <shixi...@databricks.com>
Date:   2017-06-02T17:33:21Z

    [SPARK-20955][CORE] Intern "executorId" to reduce the memory usage
    
    ## What changes were proposed in this pull request?
    
    In [this line](https://github.com/apache/spark/blob/f7cf2096fdecb8edab61c8973c07c6fc877ee32d/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L128), we use the `executorId` string received from executors, and it eventually ends up in `TaskUIData`. As deserializing the `executorId` string always creates a new instance, we end up with a lot of duplicated string instances.
    
    This PR applies String interning in `TaskUIData` to reduce the memory usage.
    
    ## How was this patch tested?
    
    Manually tested using `bin/spark-shell --master local-cluster[6,1,1024]`. Test code:
    ```
    for (_ <- 1 to 10) { sc.makeRDD(1 to 1000, 1000).count() }
    Thread.sleep(2000)
    val l = sc.getClass.getMethod("jobProgressListener").invoke(sc).asInstanceOf[org.apache.spark.ui.jobs.JobProgressListener]
    org.apache.spark.util.SizeEstimator.estimate(l.stageIdToData)
    ```
    This PR reduces the size of `stageIdToData` from 3487280 to 3009744 (86.3%) 
in the above case.
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    
    Closes #18177 from zsxwing/SPARK-20955.
    
    (cherry picked from commit 16186cdcbce1a2ec8f839c550e6b571bf5dc2692)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 9a4a8e1b010bcfa187360c8331ef897195732638
Author: Xiao Li <gatorsm...@gmail.com>
Date:   2017-06-02T18:57:22Z

    [SPARK-19236][SQL][BACKPORT-2.2] Added createOrReplaceGlobalTempView method
    
    ### What changes were proposed in this pull request?
    
    This PR is to backport two PRs for adding the 
`createOrReplaceGlobalTempView` method
    https://github.com/apache/spark/pull/18147
    https://github.com/apache/spark/pull/16598
    
    ---
    Added the createOrReplaceGlobalTempView method to the Dataset API.
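
    A usage sketch of the backported API (the DataFrame `df` and view name are assumed):
    ```Scala
    // Register (or replace) a global temporary view; global temp views live in the reserved
    // `global_temp` database and are visible across SparkSessions in the same application.
    df.createOrReplaceGlobalTempView("people")
    spark.sql("SELECT * FROM global_temp.people").show()
    ```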
    
    ### How was this patch tested?
    N/A
    
    Author: Xiao Li <gatorsm...@gmail.com>
    
    Closes #18167 from gatorsmile/Backport18147.

----

