[GitHub] spark pull request #20027: Branch 2.2

Maple-Wang Tue, 19 Dec 2017 20:54:13 -0800

GitHub user Maple-Wang opened a pull request:

    https://github.com/apache/spark/pull/20027


    Branch 2.2

    When using SparkR from the R shell, the master parameter seems to be too old: I cannot run the equivalent of "spark-submit --master yarn --deploy-mode client". R is installed on all nodes.
    
    When used this way:
    if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
      Sys.setenv(SPARK_HOME = "/usr/hdp/2.6.1.0-129/spark2")
    }
    library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
    sparkR.session(master = "yarn", sparkConfig = list(spark.driver.memory = "10g"))
    
    
    
    It fails with:
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, node23.nuctech.com, executor 1): java.net.SocketTimeoutException: Accept timed out
        at java.net.PlainSocketImpl.socketAccept(Native Method)
        at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
        at java.net.ServerSocket.implAccept(ServerSocket.java:545)
        at java.net.ServerSocket.accept(ServerSocket.java:513)
        at org.apache.spark.api.r.RRunner$.createRWorker(RRunner.scala:372)
        at org.apache.spark.api.r.RRunner.compute(RRunner.scala:69)
        at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:51)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20027.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20027
    
----
commit 96c04f1edcd53798d9db5a356482248868a0a905
Author: Marcelo Vanzin <vanzin@...>
Date:   2017-06-24T05:23:43Z

    [SPARK-21159][CORE] Don't try to connect to launcher in standalone cluster mode.
    
    Monitoring for standalone cluster mode is not implemented (see SPARK-11033), but
    the same scheduler implementation is used, and if it tries to connect to the
    launcher it will fail. So fix the scheduler so it only tries that in client mode;
    cluster mode applications will be correctly launched and will work, but monitoring
    through the launcher handle will not be available.
    
    Tested by running a cluster mode app with "SparkLauncher.startApplication".
    
    Author: Marcelo Vanzin <van...@cloudera.com>
    
    Closes #18397 from vanzin/SPARK-21159.
    
    (cherry picked from commit bfd73a7c48b87456d1b84d826e04eca938a1be64)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit ad44ab5cb9cdaff836c7469d10b00a86a3e46adf
Author: gatorsmile <gatorsmile@...>
Date:   2017-06-24T14:35:59Z

    [SPARK-21203][SQL] Fix wrong results of insertion of Array of Struct
    
    ### What changes were proposed in this pull request?
    ```SQL
    CREATE TABLE `tab1`
    (`custom_fields` ARRAY<STRUCT<`id`: BIGINT, `value`: STRING>>)
    USING parquet
    
    INSERT INTO `tab1`
    SELECT ARRAY(named_struct('id', 1, 'value', 'a'), named_struct('id', 2, 'value', 'b'))
    
    SELECT custom_fields.id, custom_fields.value FROM tab1
    ```
    
    The above query always returns the last struct of the array, because the rule `SimplifyCasts` incorrectly rewrites the query. The underlying cause is that we always use the same `GenericInternalRow` object when doing the cast.
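
The aliasing effect described above can be reproduced in a few lines of plain Python. This is only an illustrative sketch, not Spark code: a dict stands in for the shared row object, and the function names `cast_structs_shared` and `cast_structs_fresh` are hypothetical. Creating a fresh object per element is shown as one way to avoid the aliasing, not necessarily what the patch does.

```python
def cast_structs_shared(structs):
    """Buggy pattern: one mutable row object is reused for every element."""
    row = {}                      # stands in for the single shared row object
    out = []
    for s in structs:
        row["id"] = s["id"]
        row["value"] = s["value"]
        out.append(row)           # every entry aliases the same object
    return out


def cast_structs_fresh(structs):
    """Alias-free pattern: a new row object is created for each element."""
    return [{"id": s["id"], "value": s["value"]} for s in structs]


data = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]
shared = cast_structs_shared(data)   # both entries show the last struct
fresh = cast_structs_fresh(data)     # entries keep their own values
```

With the shared object, every array element reports the values written last, which matches the symptom in the query above.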
    
    ### How was this patch tested?
    
    Author: gatorsmile <gatorsm...@gmail.com>
    
    Closes #18412 from gatorsmile/castStruct.
    
    (cherry picked from commit 2e1586f60a77ea0adb6f3f68ba74323f0c242199)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit d8e3a4af36f85455548e82ae4acd525f5e52f322
Author: Masha Basmanova <mbasmanova@...>
Date:   2017-06-25T05:49:35Z

    [SPARK-21079][SQL] Calculate total size of a partition table as a sum of 
individual partitions
    
    ## What changes were proposed in this pull request?
    
    Storage URI of a partitioned table may or may not point to a directory 
under which individual partitions are stored. In fact, individual partitions 
may be located in totally unrelated directories. Before this change, ANALYZE 
TABLE table COMPUTE STATISTICS command calculated total size of a table by 
adding up sizes of files found under table's storage URI. This calculation 
could produce 0 if partitions are stored elsewhere.
    
    This change uses storage URIs of individual partitions to calculate the 
sizes of all partitions of a table and adds these up to produce the total size 
of a table.
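
The difference between the two calculations can be sketched in plain Python. This is an illustration under stated assumptions, not Spark's implementation: the `file_sizes` listing, the paths, and the helper names `size_under` and `total_table_size` are all hypothetical.

```python
# Hypothetical file listing: path -> size in bytes (not a real Spark or Hadoop API).
file_sizes = {
    "/warehouse/tab/readme.txt": 0,
    "/data/a/part=1/f0.parquet": 100,
    "/other/b/part=2/f0.parquet": 250,
}


def size_under(uri, listing):
    """Sum the sizes of all files stored under the given directory URI."""
    prefix = uri.rstrip("/") + "/"
    return sum(size for path, size in listing.items() if path.startswith(prefix))


def total_table_size(partition_uris, listing):
    """Add up the sizes found under each partition's own storage URI."""
    return sum(size_under(uri, listing) for uri in partition_uris)


table_uri = "/warehouse/tab"
partition_uris = ["/data/a/part=1", "/other/b/part=2"]

old_size = size_under(table_uri, file_sizes)             # looks only under the table URI
new_size = total_table_size(partition_uris, file_sizes)  # follows each partition
```

Because neither partition lives under the table's storage URI in this example, the first calculation finds nothing while the second sees both files.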
    
    CC: wzhfy
    
    ## How was this patch tested?
    
    Added unit test.
    
    Ran ANALYZE TABLE xxx COMPUTE STATISTICS on a partitioned Hive table and 
verified that sizeInBytes is calculated correctly. Before this change, the size 
would be zero.
    
    Author: Masha Basmanova <mbasman...@fb.com>
    
    Closes #18309 from mbasmanova/mbasmanova-analyze-part-table.
    
    (cherry picked from commit b449a1d6aa322a50cf221cd7a2ae85a91d6c7e9f)
    Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit 970f68c056ce543068af28df44490d068ec3d15d
Author: Liang-Chi Hsieh <viirya@...>
Date:   2017-06-27T16:57:05Z

    [SPARK-19104][SQL] Lambda variables in ExternalMapToCatalyst should be 
global
    
    The issue happens in `ExternalMapToCatalyst`. For example, the following code creates an `ExternalMapToCatalyst` to convert a Scala Map to the Catalyst map format.
    
        val data = Seq.tabulate(10)(i => NestedData(1, Map("key" -> InnerData("name", i + 100))))
        val ds = spark.createDataset(data)
    
    The `valueConverter` in `ExternalMapToCatalyst` looks like:
    
        if (isnull(lambdavariable(ExternalMapToCatalyst_value52, 
ExternalMapToCatalyst_value_isNull52, ObjectType(class 
org.apache.spark.sql.InnerData), true))) null else named_struct(name, 
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, 
ExternalMapToCatalyst_value_isNull52, ObjectType(class 
org.apache.spark.sql.InnerData), true)).name, true), value, 
assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, 
ExternalMapToCatalyst_value_isNull52, ObjectType(class 
org.apache.spark.sql.InnerData), true)).value)
    
    There is a `CreateNamedStruct` expression (`named_struct`) that creates a row of `InnerData.name` and `InnerData.value`, both of which are referenced through `ExternalMapToCatalyst_value52`.
    
    Because `ExternalMapToCatalyst_value52` is a local variable, it can no longer be accessed once `CreateNamedStruct` splits its expressions into individual functions.
    
    Jenkins tests.
    
    Author: Liang-Chi Hsieh <vii...@gmail.com>
    
    Closes #18418 from viirya/SPARK-19104.
    
    (cherry picked from commit fd8c931a30a084ee981b75aa469fc97dda6cfaa9)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 17a04b9000f86c04f492927d4aa3b23d60e2b5b6
Author: Nick Pentreath <nickp@...>
Date:   2017-06-29T08:51:12Z

    [SPARK-21210][DOC][ML] Javadoc 8 fixes for ML shared param traits
    
    PR #15999 included fixes for doc strings in the ML shared param traits 
(occurrences of `>` and `>=`).
    
    This PR simply uses the HTML-escaped version of the param doc to embed into 
the Scaladoc, to ensure that when `SharedParamsCodeGen` is run, the generated 
javadoc will be compliant for Java 8.
    
    ## How was this patch tested?
    Existing tests
    
    Author: Nick Pentreath <ni...@za.ibm.com>
    
    Closes #18420 from MLnick/shared-params-javadoc8.
    
    (cherry picked from commit 70085e83d1ee728b23f7df15f570eb8d77f67a7a)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 20cf51194a70b1bc2446289f4e57e5d287a8e7b9
Author: Shixiong Zhu <shixiong@...>
Date:   2017-06-30T02:56:48Z

    [SPARK-21253][CORE] Fix a bug that StreamCallback may not be notified if 
network errors happen
    
    ## What changes were proposed in this pull request?
    
    If a network error happens before processing StreamResponse/StreamFailure 
events, StreamCallback.onFailure won't be called.
    
    This PR fixes `failOutstandingRequests` to also notify outstanding 
StreamCallbacks.
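
A toy model of the behaviour described above, in plain Python. The class and attribute names (`ResponseHandler`, `outstanding_rpcs`, `stream_callbacks`) are hypothetical stand-ins, not the actual Spark network classes; the sketch only shows the idea that a connection failure must reach both kinds of pending callbacks.

```python
class ResponseHandler:
    """Toy handler that tracks pending RPC callbacks and pending stream callbacks."""

    def __init__(self):
        self.outstanding_rpcs = {}    # request id -> callback
        self.stream_callbacks = []    # callbacks still waiting for stream data

    def fail_outstanding_requests(self, cause):
        # Notify every pending RPC callback.
        for callback in self.outstanding_rpcs.values():
            callback(cause)
        # Also notify pending stream callbacks, so none is left waiting forever.
        for callback in self.stream_callbacks:
            callback(cause)
        self.outstanding_rpcs.clear()
        self.stream_callbacks.clear()


failures = []
handler = ResponseHandler()
handler.outstanding_rpcs[1] = lambda cause: failures.append(("rpc", str(cause)))
handler.stream_callbacks.append(lambda cause: failures.append(("stream", str(cause))))
handler.fail_outstanding_requests(IOError("connection reset"))
```

After the call, `failures` holds one entry for the RPC callback and one for the stream callback, and both pending collections are empty.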
    
    ## How was this patch tested?
    
    The new unit tests.
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    
    Closes #18472 from zsxwing/fix-stream-2.
    
    (cherry picked from commit 4996c53949376153f9ebdc74524fed7226968808)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 8de67e3692c70e6401902cd9d9be823e1882da8d
Author: Shixiong Zhu <shixiong@...>
Date:   2017-06-30T03:02:22Z

    [SPARK-21253][CORE] Disable spark.reducer.maxReqSizeShuffleToMem
    
    Disable spark.reducer.maxReqSizeShuffleToMem because it breaks the old 
shuffle service.
    
    Credits to wangyum
    
    Closes #18466
    
    Jenkins
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    Author: Yuming Wang <wgy...@gmail.com>
    
    Closes #18467 from zsxwing/SPARK-21253.
    
    (cherry picked from commit 80f7ac3a601709dd9471092244612023363f54cd)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit c6ba647935e9108d139cc8914091790917567ad7
Author: IngoSchuster <ingo.schuster@...>
Date:   2017-06-30T03:16:09Z

    [SPARK-21176][WEB UI] Limit number of selector threads for admin ui proxy 
servlets to 8
    
    ## What changes were proposed in this pull request?
    Please see also https://issues.apache.org/jira/browse/SPARK-21176
    
    This change limits the number of selector threads that Jetty creates to a maximum of 8 per proxy servlet (the Jetty default is the number of processors / 2).
    The `newHttpClient` method of Jetty's ProxyServlet class is overridden to avoid the Jetty defaults (which are designed for high-performance HTTP servers).
    Once https://github.com/eclipse/jetty.project/issues/1643 is available, the code could be cleaned up to avoid the method override.
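
The arithmetic behind the cap can be sketched in plain Python. The function names are hypothetical, and the lower bound of one thread is an assumption of this sketch rather than something stated in the description above.

```python
def jetty_default_selectors(processors):
    """Default described above: number of processors / 2."""
    return processors // 2


def capped_selectors(processors, cap=8):
    """Half the processors, capped at `cap` (the floor of 1 is an assumption)."""
    return max(1, min(cap, processors // 2))


default_on_88 = jetty_default_selectors(88)   # uncapped value on an 88-processor node
capped_on_88 = capped_selectors(88)           # value after applying the cap
```

On an 88-processor head node the uncapped default would be 44 selector threads per proxy servlet, while the cap keeps it at 8.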
    
    I really need this on v2.1.1. What is the best way to do a backport (automatic merge works fine)? Shall I create another PR?
    
    ## How was this patch tested?
    The patch was tested manually on a Spark cluster with a head node that has 
88 processors using JMX to verify that the number of selector threads is now 
limited to 8 per proxy.
    
    gurvindersingh zsxwing can you please review the change?
    
    Author: IngoSchuster <ingo.schus...@de.ibm.com>
    Author: Ingo Schuster <ingo.schus...@de.ibm.com>
    
    Closes #18437 from IngoSchuster/master.
    
    (cherry picked from commit 88a536babf119b7e331d02aac5d52b57658803bf)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit d16e2620d1fc8a6b6b5c71401e5d452683fa7762
Author: Shixiong Zhu <shixiong@...>
Date:   2017-06-30T03:56:37Z

    [SPARK-21253][CORE][HOTFIX] Fix Scala 2.10 build
    
    ## What changes were proposed in this pull request?
    
    A follow up PR to fix Scala 2.10 build for #18472
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    
    Closes #18478 from zsxwing/SPARK-21253-2.
    
    (cherry picked from commit cfc696f4a4289acf132cb26baf7c02c5b6305277)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 8b08fd06c0e22e7967c05aee83654b6be446efb4
Author: Herman van Hovell <hvanhovell@...>
Date:   2017-06-30T04:34:09Z

    [SPARK-21258][SQL] Fix WindowExec complex object aggregation with spilling
    
    ## What changes were proposed in this pull request?
    `WindowExec` currently stores complex objects (UnsafeRow, UnsafeArrayData, UnsafeMapData, UTF8String) improperly during aggregation: the buffer used by `GeneratedMutableProjections` keeps a reference to the actual input data. Things go wrong when the input object (or its backing bytes) is reused for other things. This can happen in window functions when they start spilling to disk. When reading back the spill files, the `UnsafeSorterSpillReader` reuses the buffer to which the `UnsafeRow` points, leading to weird corruption scenarios. Note that this only happens for aggregate functions that preserve (parts of) their input, for example `FIRST`, `LAST`, `MIN` & `MAX`.
    
    This was not seen before because the spilling logic did not perform actual spills as often and instead used an in-memory page. This page was not cleaned up during window processing, which ensured that unsafe objects pointed to their own dedicated memory location. This was changed by https://github.com/apache/spark/pull/16909; after that PR, Spark spills more eagerly.
    
    This PR provides a surgical fix because we are close to releasing Spark 2.2. This change just makes sure that there cannot be any object reuse, at the expense of a little bit of performance. We will follow up with a more subtle solution at a later point.
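
The failure mode and the copy-based remedy can be modelled in plain Python. This is a sketch under stated assumptions: a `bytearray` stands in for the reader's reused buffer, and the function names are hypothetical; it is not the code of the patch.

```python
def read_spilled_rows(rows):
    """Yield every row through one reused buffer, as a spill reader might."""
    buf = bytearray(4)
    for row in rows:
        buf[:] = row          # overwrite the one shared buffer in place
        yield buf


def first_by_reference(rows):
    """FIRST-style aggregate that keeps a reference to its input."""
    first = None
    for buf in read_spilled_rows(rows):
        if first is None:
            first = buf       # still points at the reused buffer
    return bytes(first)


def first_by_copy(rows):
    """FIRST-style aggregate that copies its input before keeping it."""
    first = None
    for buf in read_spilled_rows(rows):
        if first is None:
            first = bytes(buf)   # copy: slightly slower, but stable
    return first


rows = [b"aaaa", b"bbbb", b"cccc"]
by_reference = first_by_reference(rows)
by_copy = first_by_copy(rows)
```

The reference-keeping variant ends up reporting whatever was written into the buffer last, while the copying variant keeps the first row.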
    
    ## How was this patch tested?
    Added a regression test to `DataFrameWindowFunctionsSuite`.
    
    Author: Herman van Hovell <hvanhov...@databricks.com>
    
    Closes #18470 from hvanhovell/SPARK-21258.
    
    (cherry picked from commit e2f32ee45ac907f1f53fde7e412676a849a94872)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 29a0be2b3d42bfe991f47725f077892918731e08
Author: Xiao Li <gatorsmile@...>
Date:   2017-06-30T21:23:56Z

    [SPARK-21129][SQL] Arguments of SQL function call should not be named 
expressions
    
    ### What changes were proposed in this pull request?
    
    Function arguments should not be named expressions. Allowing them could cause two issues:
    - Misleading error message
    - Unexpected query results when the column name is `distinct`, which is not a reserved word in our parser.
    
    ```
    spark-sql> select count(distinct c1, distinct c2) from t1;
    Error in query: cannot resolve '`distinct`' given input columns: [c1, c2]; 
line 1 pos 26;
    'Project [unresolvedalias('count(c1#30, 'distinct), None)]
    +- SubqueryAlias t1
       +- CatalogRelation `default`.`t1`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#30, c2#31]
    ```
    
    After the fix, the error message becomes
    ```
    spark-sql> select count(distinct c1, distinct c2) from t1;
    Error in query:
    extraneous input 'c2' expecting {')', ',', '.', '[', 'OR', 'AND', 'IN', 
NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, 
'+', '-', '*', '/', '%', 'DIV', '&', '|', '||', '^'}(line 1, pos 35)
    
    == SQL ==
    select count(distinct c1, distinct c2) from t1
    -----------------------------------^^^
    ```
    
    ### How was this patch tested?
    Added a test case to parser suite.
    
    Author: Xiao Li <gatorsm...@gmail.com>
    Author: gatorsmile <gatorsm...@gmail.com>
    
    Closes #18338 from gatorsmile/parserDistinctAggFunc.
    
    (cherry picked from commit eed9c4ef859fdb75a816a3e0ce2d593b34b23444)
    Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit a2c7b2133cfee7fa9abfaa2bfbfb637155466783
Author: Patrick Wendell <pwendell@...>
Date:   2017-06-30T22:54:34Z

    Preparing Spark release v2.2.0-rc6

commit 85fddf406429dac00ddfb2e6c30870da450455bd
Author: Patrick Wendell <pwendell@...>
Date:   2017-06-30T22:54:39Z

    Preparing development version 2.2.1-SNAPSHOT

commit 6fd39ea1c9dbf68763cb394a28d8a13c116341df
Author: Devaraj K <devaraj@...>
Date:   2017-07-01T14:53:49Z

    [SPARK-21170][CORE] Utils.tryWithSafeFinallyAndFailureCallbacks throws 
IllegalArgumentException: Self-suppression not permitted
    
    ## What changes were proposed in this pull request?
    
    Do not add the exception to the suppressed exceptions if it is the same instance as originalThrowable.
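
The guard can be modelled in plain Python. Everything here is a hypothetical stand-in: the `Throwable` class only imitates the JVM rule that an exception may not suppress itself, and `run_with_cleanup` is a simplified sketch of a try/finally helper, not the Spark utility.

```python
class Throwable(Exception):
    """Tiny model of the JVM rule that an exception cannot suppress itself."""

    def __init__(self, message):
        super().__init__(message)
        self.suppressed = []

    def add_suppressed(self, other):
        if other is self:
            raise ValueError("Self-suppression not permitted")
        self.suppressed.append(other)


def run_with_cleanup(block, cleanup):
    """Run block, then cleanup; if both fail, attach the cleanup error to the original."""
    try:
        block()
    except Throwable as original:
        try:
            cleanup()
        except Throwable as from_cleanup:
            if from_cleanup is not original:   # skip the same instance
                original.add_suppressed(from_cleanup)
        raise original
    cleanup()


shared = Throwable("disk full")


def always_fail():
    raise shared


try:
    run_with_cleanup(always_fail, always_fail)   # block and cleanup raise the same instance
except Throwable as error:
    caught = error
```

Without the identity check, the call above would fail with the self-suppression error instead of surfacing the original exception.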
    
    ## How was this patch tested?
    
    Added new tests to verify this; these tests fail without the source code changes and pass with them.
    
    Author: Devaraj K <deva...@apache.org>
    
    Closes #18384 from devaraj-kavali/SPARK-21170.
    
    (cherry picked from commit 6beca9ce94f484de2f9ffb946bef8334781b3122)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit db21b679343aba54b240e1d552cf7ec772109f22
Author: Dongjoon Hyun <dongjoon@...>
Date:   2017-07-04T16:48:40Z

    [SPARK-20256][SQL] SessionState should be created more lazily
    
    ## What changes were proposed in this pull request?
    
    `SessionState` is designed to be created lazily. However, in reality, it is created immediately in `SparkSession.Builder.getOrCreate` ([here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L943)).
    
    This PR aims to recover the lazy behavior by keeping the options in `initialSessionOptions`. The benefit is as follows: users can start `spark-shell` and use RDD operations without any problems.
    
    **BEFORE**
    ```scala
    $ bin/spark-shell
    java.lang.IllegalArgumentException: Error while instantiating 
'org.apache.spark.sql.hive.HiveSessionStateBuilder'
    ...
    Caused by: org.apache.spark.sql.AnalysisException:
        org.apache.hadoop.hive.ql.metadata.HiveException:
           MetaException(message:java.security.AccessControlException:
              Permission denied: user=spark, access=READ,
                 inode="/apps/hive/warehouse":hive:hdfs:drwx------
    ```
    As reported in SPARK-20256, this happens when the warehouse directory is 
not allowed for this user.
    
    **AFTER**
    ```scala
    $ bin/spark-shell
    ...
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
          /_/
    
    Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 
1.8.0_112)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala> sc.range(0, 10, 1).count()
    res0: Long = 10
    ```
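
The lazy pattern itself can be sketched in plain Python. The `Session` class, its attribute names, and the warehouse option are hypothetical illustrations; only the idea of storing the options eagerly and building the state on first access comes from the description above.

```python
from functools import cached_property


class Session:
    """Toy session: options are stored eagerly, state is built on first use."""

    def __init__(self, **options):
        self.initial_session_options = dict(options)   # kept, not applied yet
        self.state_builds = 0

    @cached_property
    def session_state(self):
        self.state_builds += 1
        # Expensive or failure-prone setup would happen here, only on demand.
        return {"conf": dict(self.initial_session_options)}


session = Session(warehouse="/apps/hive/warehouse")
builds_before = session.state_builds      # nothing is built by construction alone
state = session.session_state             # first access triggers the build
state_again = session.session_state       # cached; no second build
```

Code that never touches `session_state` never pays for, or fails on, its construction, which is the property the shell session above relies on.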
    
    ## How was this patch tested?
    
    Manual.
    
    This closes #18512 .
    
    Author: Dongjoon Hyun <dongj...@apache.org>
    
    Closes #18501 from dongjoon-hyun/SPARK-20256.
    
    (cherry picked from commit 1b50e0e0d6fd9d1b815a3bb37647ea659222e3f1)
    Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit 770fd2a239798d3fa1cb4223d73cfc57413c0bb8
Author: Takuya UESHIN <ueshin@...>
Date:   2017-07-05T03:24:38Z

    [SPARK-21300][SQL] ExternalMapToCatalyst should null-check map key prior to 
converting to internal value.
    
    ## What changes were proposed in this pull request?
    
    `ExternalMapToCatalyst` should null-check map key prior to converting to 
internal value to throw an appropriate Exception instead of something like NPE.
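
A minimal Python sketch of checking the key before converting it. The function name, the converter, and the error message are illustrative assumptions, not the Spark implementation.

```python
def to_internal_map(external, convert_key=str.upper):
    """Convert keys, rejecting None up front with a descriptive error."""
    internal = {}
    for key, value in external.items():
        if key is None:
            # Fail early with a clear message instead of an obscure error
            # raised from inside the converter.
            raise ValueError("Cannot use null as map key")
        internal[convert_key(key)] = value
    return internal


converted = to_internal_map({"a": 1, "b": 2})
```

Calling `to_internal_map({None: 1})` raises the descriptive `ValueError` before the converter is ever invoked.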
    
    ## How was this patch tested?
    
    Added a test and existing tests.
    
    Author: Takuya UESHIN <ues...@databricks.com>
    
    Closes #18524 from ueshin/issues/SPARK-21300.
    
    (cherry picked from commit ce10545d3401c555e56a214b7c2f334274803660)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 6e1081cbeac58826526b6ff7f2938a556b31ca9e
Author: Sumedh Wale <swale@...>
Date:   2017-07-06T06:47:22Z

    [SPARK-21312][SQL] correct offsetInBytes in UnsafeRow.writeToStream
    
    ## What changes were proposed in this pull request?
    
    Corrects the offsetInBytes calculation in UnsafeRow.writeToStream. Known failures include writes to some DataSources that have their own SparkPlan implementations and cause an EXCHANGE in writes.
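
Why the offset matters can be shown with a plain Python sketch. It does not reproduce the actual calculation in Spark; the backing array, the offsets, and the function names are hypothetical and only illustrate a row that starts at a non-zero position in its backing byte array.

```python
import io


def write_row(backing, offset, size, out):
    """Write only this row's bytes: the slice [offset, offset + size)."""
    out.write(backing[offset:offset + size])


def write_row_ignoring_offset(backing, size, out):
    """Incorrect variant for comparison: always starts at position 0."""
    out.write(backing[:size])


backing = bytearray(b"HEADERrowdataTRAILER")   # the row occupies bytes 6..12

correct = io.BytesIO()
write_row(backing, offset=6, size=7, out=correct)

wrong = io.BytesIO()
write_row_ignoring_offset(backing, size=7, out=wrong)
```

The first stream receives exactly the row bytes; the second receives bytes that belong to whatever precedes the row in the backing array.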
    
    ## How was this patch tested?
    
    Extended UnsafeRowSuite.writeToStream to include an UnsafeRow over byte 
array having non-zero offset.
    
    Author: Sumedh Wale <sw...@snappydata.io>
    
    Closes #18535 from sumwale/SPARK-21312.
    
    (cherry picked from commit 14a3bb3a008c302aac908d7deaf0942a98c63be7)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 4e53a4edd72e372583f243c660bbcc0572205716
Author: Tathagata Das <tathagata.das1565@...>
Date:   2017-07-06T07:20:26Z

    [SS][MINOR] Fix flaky test in DatastreamReaderWriterSuite. temp checkpoint 
dir should be deleted
    
    ## What changes were proposed in this pull request?
    
    Stopping a query while it is being initialized can throw an interrupt exception, in which case temporary checkpoint directories will not be deleted and the test will fail.
    
    Author: Tathagata Das <tathagata.das1...@gmail.com>
    
    Closes #18442 from tdas/DatastreamReaderWriterSuite-fix.
    
    (cherry picked from commit 60043f22458668ac7ecba94fa78953f23a6bdcec)
    Signed-off-by: Tathagata Das <tathagata.das1...@gmail.com>

commit 576fd4c3a67b4affc5ac50979e27ae929472f0d9
Author: Tathagata Das <tathagata.das1565@...>
Date:   2017-07-07T00:28:20Z

    [SPARK-21267][SS][DOCS] Update Structured Streaming Documentation
    
    ## What changes were proposed in this pull request?
    
    A few changes to the Structured Streaming documentation:
    - Clarify that the entire stream input table is not materialized
    - Add information for Ganglia
    - Add Kafka Sink to the main docs
    - Removed a couple of leftover experimental tags
    - Added more associated reading material and talk videos.
    
    In addition, https://github.com/apache/spark/pull/16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those links (cc sameeragarwal cloud-fan).
    - Added a redirection to avoid breaking internal and possible external links.
    - Removed unnecessary redirection pages that had been there since the separate Scala, Java, and Python programming guides were merged together in 2013 or 2014.
    
    ## How was this patch tested?
    
    
    Author: Tathagata Das <tathagata.das1...@gmail.com>
    
    Closes #18485 from tdas/SPARK-21267.
    
    (cherry picked from commit 0217dfd26f89133f146197359b556c9bf5aca172)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit ab12848d624f6b74d401e924255c0b4fcc535231
Author: Prashant Sharma <prashant@...>
Date:   2017-07-08T06:33:12Z

    [SPARK-21069][SS][DOCS] Add rate source to programming guide.
    
    ## What changes were proposed in this pull request?
    
    SPARK-20979 added a new Structured Streaming source: the Rate source. This patch adds the corresponding documentation to the programming guide.
    
    ## How was this patch tested?
    
    Tested by running jekyll locally.
    
    Author: Prashant Sharma <prash...@apache.org>
    Author: Prashant Sharma <prash...@in.ibm.com>
    
    Closes #18562 from ScrapCodes/spark-21069/rate-source-docs.
    
    (cherry picked from commit d0bfc6733521709e453d643582df2bdd68f28de7)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 7d0b1c927d92cc2a4932262514ffd12c47593b80
Author: Bogdan Raducanu <bogdan@...>
Date:   2017-07-08T12:14:59Z

    [SPARK-21228][SQL][BRANCH-2.2] InSet incorrect handling of structs
    
    ## What changes were proposed in this pull request?
    
    This is a backport of https://github.com/apache/spark/pull/18455
    When the data type is a struct, InSet now uses TypeUtils.getInterpretedOrdering (similar to EqualTo) to build a TreeSet. In other cases it uses a HashSet as before (which should be faster). Similarly, In.eval uses Ordering.equiv instead of equals.
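
The difference between hash-based and ordering-based membership can be modelled in plain Python. This is an analogy under stated assumptions, not Spark code: the `Struct` class deliberately has identity-based equality, and a sorted list searched with `bisect` stands in for an ordered set.

```python
from bisect import bisect_left


class Struct:
    """Value with identity-based equality, standing in for a struct row."""

    def __init__(self, *fields):
        self.fields = fields


def build_ordered(values):
    """Sorted list of field tuples: membership is decided by ordering."""
    return sorted(v.fields for v in values)


def contains_ordered(ordered, candidate):
    """Binary search by field order instead of hash and equals."""
    i = bisect_left(ordered, candidate.fields)
    return i < len(ordered) and ordered[i] == candidate.fields


members = [Struct(1, "a"), Struct(2, "b")]
hash_set = set(members)
ordered_set = build_ordered(members)

probe = Struct(1, "a")                     # same fields, different object
in_hash = probe in hash_set                # identity-based lookup misses it
in_ordered = contains_ordered(ordered_set, probe)
```

The hash-based lookup misses a struct with equal fields, while the ordering-based lookup finds it.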
    
    ## How was this patch tested?
    New test in SQLQuerySuite.
    
    Author: Bogdan Raducanu <bog...@databricks.com>
    
    Closes #18563 from bogdanrdc/SPARK-21228-BRANCH2.2.

commit a64f10800244a8057f7f32c3d2f4a719c5080d05
Author: Dongjoon Hyun <dongjoon@...>
Date:   2017-07-08T12:16:47Z

    [SPARK-21345][SQL][TEST][TEST-MAVEN] SparkSessionBuilderSuite should clean 
up stopped sessions.
    
    `SparkSessionBuilderSuite` should clean up stopped sessions. Otherwise, it leaves behind some stopped `SparkContext`s that interfere with other test suites using `SharedSQLContext`.
    
    Recently, the master branch has been failing consecutively.
    - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
    
    Passes Jenkins with an updated suite.
    
    Author: Dongjoon Hyun <dongj...@apache.org>
    
    Closes #18567 from dongjoon-hyun/SPARK-SESSION.
    
    (cherry picked from commit 0b8dd2d08460f3e6eb578727d2c336b6f11959e7)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit c8d7855b905742033b7588ce7ee28bc23de13709
Author: Marcelo Vanzin <vanzin@...>
Date:   2017-07-08T16:24:54Z

    [SPARK-20342][CORE] Update task accumulators before sending task end event.
    
    This makes sure that listeners get updated task information; otherwise it's
    possible to write incomplete task information into event logs, for example,
    making the information in a replayed UI inconsistent with the original
    application.
    
    Added a new unit test to try to detect the problem, but it's not guaranteed
    to fail since it's a race; but it fails pretty reliably for me without the
    scheduler changes.
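
The ordering issue can be illustrated with a single-threaded Python sketch. The names `Task`, `on_task_end`, and `finish_task` are hypothetical, and the sketch cannot show the race itself; it only shows that a listener sees whatever the task holds at the moment the event fires.

```python
class Task:
    def __init__(self):
        self.metrics = {"records_read": 0}


events = []


def on_task_end(task):
    # A listener snapshots the task's metrics when the event fires.
    events.append(dict(task.metrics))


def finish_task(task, accumulator_updates, listener):
    task.metrics.update(accumulator_updates)   # update the accumulators first
    listener(task)                             # then publish the task end event


finish_task(Task(), {"records_read": 42}, on_task_end)
```

If the two statements in `finish_task` were swapped, the recorded event would carry the stale value 0 instead of 42.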
    
    Author: Marcelo Vanzin <van...@cloudera.com>
    
    Closes #18393 from vanzin/SPARK-20342.try2.
    
    (cherry picked from commit 9131bdb7e12bcfb2cb699b3438f554604e28aaa8)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 964332b2879af048a95606dfcb4f2cb2e356135b
Author: jinxing <jinxing6042@...>
Date:   2017-07-08T16:27:58Z

    [SPARK-21343] Refine the document for spark.reducer.maxReqSizeShuffleToMem.
    
    ## What changes were proposed in this pull request?
    
    In the current code, the reducer can break the old shuffle service when `spark.reducer.maxReqSizeShuffleToMem` is enabled. Let's refine the documentation.
    
    Author: jinxing <jinxing6...@126.com>
    
    Closes #18566 from jinxing64/SPARK-21343.
    
    (cherry picked from commit 062c336d06a0bd4e740a18d2349e03e311509243)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 3bfad9d4210f96dcd2270599257c3a5272cad77b
Author: Zhenhua Wang <wangzhenhua@...>
Date:   2017-07-09T10:51:06Z

    [SPARK-21083][SQL][BRANCH-2.2] Store zero size and row count when analyzing 
empty table
    
    ## What changes were proposed in this pull request?
    
    We should be able to store zero size and row count after analyzing an empty table.
    This is a backport of https://github.com/apache/spark/commit/9fccc3627fa41d32fbae6dbbb9bd1521e43eb4f0.
    
    ## How was this patch tested?
    
    Added new test.
    
    Author: Zhenhua Wang <wangzhen...@huawei.com>
    
    Closes #18575 from wzhfy/analyzeEmptyTable-2.2.

commit 40fd0ce7f2c2facb96fc5d613bc7b6e4b573d9f7
Author: jinxing <jinxing6042@...>
Date:   2017-07-10T13:06:58Z

    [SPARK-21342] Fix DownloadCallback to work well with RetryingBlockFetcher.
    
    When `RetryingBlockFetcher` retries fetching blocks, there could be two `DownloadCallback`s downloading the same content to the same target file. This could cause `ShuffleBlockFetcherIterator` to read a partial result.
    
    This PR proposes to create and delete the tmp files in `OneForOneBlockFetcher`.
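
The idea of giving each fetch attempt its own temporary file can be sketched with Python's standard library. The function name `download_to_temp` and the chunk contents are hypothetical; this is not how the Spark classes are written, only an illustration of why separate files avoid a reader seeing a half-written shared file.

```python
import os
import tempfile


def download_to_temp(chunks, directory):
    """Write one fetch attempt to its own temporary file and return the path."""
    fd, path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
    return path


with tempfile.TemporaryDirectory() as directory:
    first = download_to_temp([b"block-"], directory)           # interrupted attempt
    retry = download_to_temp([b"block-", b"data"], directory)  # retried attempt
    distinct_paths = first != retry
    with open(retry, "rb") as f:
        retry_content = f.read()
    # Whoever created the files is also responsible for deleting them.
    os.remove(first)
    os.remove(retry)
```

Because the retry writes to a different path, the interrupted attempt cannot overwrite or truncate what the retry produces.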
    
    Author: jinxing <jinxing6...@126.com>
    Author: Shixiong Zhu <zsxw...@gmail.com>
    
    Closes #18565 from jinxing64/SPARK-21342.
    
    (cherry picked from commit 6a06c4b03c4dd86241fb9d11b4360371488f0e53)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit a05edf454a67261c89f0f2ecd1fe46bb8cebc257
Author: Juliusz Sompolski <julek@...>
Date:   2017-07-10T16:26:42Z

    [SPARK-21272] SortMergeJoin LeftAnti does not update numOutputRows
    
    ## What changes were proposed in this pull request?
    
    Updating numOutputRows metric was missing from one return path of LeftAnti 
SortMergeJoin.
    
    ## How was this patch tested?
    
    Non-zero output rows manually seen in metrics.
    
    Author: Juliusz Sompolski <ju...@databricks.com>
    
    Closes #18494 from juliuszsompolski/SPARK-21272.

commit edcd9fbc92683753d55ed0c69f391bf3bed59da4
Author: Shixiong Zhu <shixiong@...>
Date:   2017-07-11T03:26:17Z

    [SPARK-21369][CORE] Don't use Scala Tuple2 in common/network-*
    
    ## What changes were proposed in this pull request?
    
    Remove all usages of Scala Tuple2 from common/network-* projects. 
Otherwise, Yarn users cannot use `spark.reducer.maxReqSizeShuffleToMem`.
    
    ## How was this patch tested?
    
    Jenkins.
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    
    Closes #18593 from zsxwing/SPARK-21369.
    
    (cherry picked from commit 833eab2c9bd273ee9577fbf9e480d3e3a4b7d203)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 399aa016e8f44fea4e5ef4b71a9a80484dd755f8
Author: Xingbo Jiang <xingbo.jiang@...>
Date:   2017-07-11T13:52:54Z

    [SPARK-21366][SQL][TEST] Add sql test for window functions
    
    ## What changes were proposed in this pull request?
    
    Add SQL tests for window functions, and remove unnecessary test cases in `WindowQuerySuite`.
    
    ## How was this patch tested?
    
    Added `window.sql` and the corresponding output file.
    
    Author: Xingbo Jiang <xingbo.ji...@databricks.com>
    
    Closes #18591 from jiangxb1987/window.
    
    (cherry picked from commit 66d21686556681457aab6e44e19f5614c5635f0c)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit cb6fc89ba20a427fa7d66fa5036b17c1a5d5d87f
Author: Eric Vandenberg <ericvandenberg@...>
Date:   2017-07-12T06:49:15Z

    [SPARK-21219][CORE] Task retry occurs on same executor due to race condition with blacklisting
    
    There's a race condition in the current TaskSetManager where a failed task is added for retry (addPendingTask) and can asynchronously be assigned to an executor *prior* to the blacklist state being updated (updateBlacklistForFailedTask). The result is that the task might re-execute on the same executor. This is particularly problematic if the executor is shutting down, since the retry task immediately becomes a lost task (ExecutorLostFailure). Another side effect is that the actual failure reason gets obscured by the retry task, which never actually executed. There are sample logs showing the issue in https://issues.apache.org/jira/browse/SPARK-21219
    
    The fix is to change the ordering of the addPendingTask and updateBlacklistForFailedTask calls in TaskSetManager.handleFailedTask.
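
The effect of the ordering can be illustrated with a toy, single-threaded Python model. The `TaskSet` class and its methods are hypothetical and much simpler than TaskSetManager; the sketch cannot reproduce the race, it only shows that a task blacklisted before it becomes pending is never offered back to the failing executor.

```python
class TaskSet:
    """Toy scheduler state: per-task executor blacklist and a pending queue."""

    def __init__(self):
        self.blacklist = {}   # task index -> set of executor ids
        self.pending = []     # task indexes waiting to be scheduled

    def handle_failed_task(self, index, executor):
        # Record the failure first, so the blacklist is already in place...
        self.blacklist.setdefault(index, set()).add(executor)
        # ...before the task becomes visible to the scheduler again.
        self.pending.append(index)

    def offer(self, executor):
        """Hand out the first pending task this executor is allowed to run."""
        for index in self.pending:
            if executor not in self.blacklist.get(index, set()):
                self.pending.remove(index)
                return index
        return None


task_set = TaskSet()
task_set.handle_failed_task(0, "exec-1")
offer_to_failed_executor = task_set.offer("exec-1")   # refused: blacklisted
offer_to_other_executor = task_set.offer("exec-2")    # accepted
```

With the two steps in the opposite order, a concurrent offer from the failing executor could pick up the task in the window before the blacklist entry exists.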
    
    Implemented a unit test that verifies the task is blacklisted before it is added to the pending tasks. Ran the unit test without the fix and it fails. Ran the unit test with the fix and it passes.
    
    
    Author: Eric Vandenberg <ericvandenbergfb.com>
    
    Closes #18427 from ericvandenbergfb/blacklistFix.
    
    ## What changes were proposed in this pull request?
    
    This is a backport of the fix to SPARK-21219, already checked in as 96d58f2.
    
    ## How was this patch tested?
    
    Ran TaskSetManagerSuite tests locally.
    
    Author: Eric Vandenberg <ericvandenb...@fb.com>
    
    Closes #18604 from jsoltren/branch-2.2.

----


