[GitHub] spark pull request #21870: Branch 2.3

lovezeropython Wed, 25 Jul 2018 00:46:10 -0700

GitHub user lovezeropython opened a pull request:

    https://github.com/apache/spark/pull/21870


    Branch 2.3

    EOFError
    # ConnectionResetError: [Errno 54] Connection reset by peer
    (Please fill in changes proposed in this fix)
    
    ```
    /pyspark.zip/pyspark/worker.py", line 255, in main
        if read_int(infile) == SpecialLengths.END_OF_STREAM:
      File 
"/Users/songhao/apps/spark-2.3.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py",
 line 683, in read_int
        length = stream.read(4)
    ConnectionResetError: [Errno 54] Connection reset by peerenter code here
    ```
    
    
![markdown20180725153615](https://user-images.githubusercontent.com/35518020/43186593-ba74b8ea-9021-11e8-8c02-c013d987f204.png)
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.3

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21870.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21870
    
----
commit 3737c3d32bb92e73cadaf3b1b9759d9be00b288d
Author: gatorsmile <gatorsmile@...>
Date:   2018-02-13T06:05:13Z

    [SPARK-20090][FOLLOW-UP] Revert the deprecation of `names` in PySpark
    
    ## What changes were proposed in this pull request?
    Deprecating the field `name` in PySpark is not expected. This PR is to 
revert the change.
    
    ## How was this patch tested?
    N/A
    
    Author: gatorsmile <gatorsm...@gmail.com>
    
    Closes #20595 from gatorsmile/removeDeprecate.
    
    (cherry picked from commit 407f67249639709c40c46917700ed6dd736daa7d)
    Signed-off-by: hyukjinkwon <gurwls...@gmail.com>

commit 1c81c0c626f115fbfe121ad6f6367b695e9f3b5f
Author: guoxiaolong <guo.xiaolong1@...>
Date:   2018-02-13T12:23:10Z

    [SPARK-23384][WEB-UI] When it has no incomplete(completed) applications 
found, the last updated time is not formatted and client local time zone is not 
show in history server web ui.
    
    ## What changes were proposed in this pull request?
    
    When it has no incomplete(completed) applications found, the last updated 
time is not formatted and client local time zone is not show in history server 
web ui. It is a bug.
    
    fix before:
    
![1](https://user-images.githubusercontent.com/26266482/36070635-264d7cf0-0f3a-11e8-8426-14135ffedb16.png)
    
    fix after:
    
![2](https://user-images.githubusercontent.com/26266482/36070651-8ec3800e-0f3a-11e8-991c-6122cc9539fe.png)
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.
    
    Author: guoxiaolong <guo.xiaolo...@zte.com.cn>
    
    Closes #20573 from guoxiaolongzte/SPARK-23384.
    
    (cherry picked from commit 300c40f50ab4258d697f06a814d1491dc875c847)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit dbb1b399b6cf8372a3659c472f380142146b1248
Author: huangtengfei <huangtengfei@...>
Date:   2018-02-13T15:59:21Z

    [SPARK-23053][CORE] taskBinarySerialization and task partitions calculate 
in DagScheduler.submitMissingTasks should keep the same RDD checkpoint status
    
    ## What changes were proposed in this pull request?
    
    When we run concurrent jobs using the same rdd which is marked to do 
checkpoint. If one job has finished running the job, and start the process of 
RDD.doCheckpoint, while another job is submitted, then submitStage and 
submitMissingTasks will be called. In 
[submitMissingTasks](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L961),
 will serialize taskBinaryBytes and calculate task partitions which are both 
affected by the status of checkpoint, if the former is calculated before 
doCheckpoint finished, while the latter is calculated after doCheckpoint 
finished, when run task, rdd.compute will be called, for some rdds with 
particular partition type such as 
[UnionRDD](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala)
 who will do partition type cast, will get a ClassCastException because the 
part params is actually a CheckpointRDDPartition.
    This error occurs  because rdd.doCheckpoint occurs in the same thread that 
called sc.runJob, while the task serialization occurs in the DAGSchedulers 
event loop.
    
    ## How was this patch tested?
    
    the exist uts and also add a test case in DAGScheduerSuite to show the 
exception case.
    
    Author: huangtengfei <huangtengfei@huangtengfeideMacBook-Pro.local>
    
    Closes #20244 from ivoson/branch-taskpart-mistype.
    
    (cherry picked from commit 091a000d27f324de8c5c527880854ecfcf5de9a4)
    Signed-off-by: Imran Rashid <iras...@cloudera.com>

commit ab01ba718c7752b564e801a1ea546aedc2055dc0
Author: Bogdan Raducanu <bogdan@...>
Date:   2018-02-13T17:49:52Z

    [SPARK-23316][SQL] AnalysisException after max iteration reached for IN 
query
    
    ## What changes were proposed in this pull request?
    Added flag ignoreNullability to DataType.equalsStructurally.
    The previous semantic is for ignoreNullability=false.
    When ignoreNullability=true equalsStructurally ignores nullability of 
contained types (map key types, value types, array element types, structure 
field types).
    In.checkInputTypes calls equalsStructurally to check if the children types 
match. They should match regardless of nullability (which is just a hint), so 
it is now called with ignoreNullability=true.
    
    ## How was this patch tested?
    New test in SubquerySuite
    
    Author: Bogdan Raducanu <bog...@databricks.com>
    
    Closes #20548 from bogdanrdc/SPARK-23316.
    
    (cherry picked from commit 05d051293fe46938e9cb012342fea6e8a3715cd4)
    Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit 320ffb1309571faedb271f2c769b4ab1ee1cd267
Author: Joseph K. Bradley <joseph@...>
Date:   2018-02-13T19:18:45Z

    [SPARK-23154][ML][DOC] Document backwards compatibility guarantees for ML 
persistence
    
    ## What changes were proposed in this pull request?
    
    Added documentation about what MLlib guarantees in terms of loading ML 
models and Pipelines from old Spark versions.  Discussed & confirmed on linked 
JIRA.
    
    Author: Joseph K. Bradley <jos...@databricks.com>
    
    Closes #20592 from jkbradley/SPARK-23154-backwards-compat-doc.
    
    (cherry picked from commit d58fe28836639e68e262812d911f167cb071007b)
    Signed-off-by: Joseph K. Bradley <jos...@databricks.com>

commit 4f6a457d464096d791e13e57c55bcf23c01c418f
Author: gatorsmile <gatorsmile@...>
Date:   2018-02-13T19:56:49Z

    [SPARK-23400][SQL] Add a constructors for ScalaUDF
    
    ## What changes were proposed in this pull request?
    
    In this upcoming 2.3 release, we changed the interface of `ScalaUDF`. 
Unfortunately, some Spark packages (e.g., spark-deep-learning) are using our 
internal class `ScalaUDF`. In the release 2.3, we added new parameters into 
this class. The users hit the binary compatibility issues and got the exception:
    
    ```
    > java.lang.NoSuchMethodError: 
org.apache.spark.sql.catalyst.expressions.ScalaUDF.&lt;init&gt;(Ljava/lang/Object;Lorg/apache/spark/sql/types/DataType;Lscala/collection/Seq;Lscala/collection/Seq;Lscala/Option;)V
    ```
    
    This PR is to improve the backward compatibility. However, we definitely 
should not encourage the external packages to use our internal classes. This 
might make us hard to maintain/develop the codes in Spark.
    
    ## How was this patch tested?
    N/A
    
    Author: gatorsmile <gatorsm...@gmail.com>
    
    Closes #20591 from gatorsmile/scalaUDF.
    
    (cherry picked from commit 2ee76c22b6e48e643694c9475e5f0d37124215e7)
    Signed-off-by: Shixiong Zhu <zsxw...@gmail.com>

commit bb26bdb55fdf84c4e36fd66af9a15e325a3982d6
Author: Dongjoon Hyun <dongjoon@...>
Date:   2018-02-14T02:55:24Z

    [SPARK-23399][SQL] Register a task completion listener first for 
OrcColumnarBatchReader
    
    This PR aims to resolve an open file leakage issue reported at 
[SPARK-23390](https://issues.apache.org/jira/browse/SPARK-23390) by moving the 
listener registration position. Currently, the sequence is like the following.
    
    1. Create `batchReader`
    2. `batchReader.initialize` opens a ORC file.
    3. `batchReader.initBatch` may take a long time to alloc memory in some 
environment and cause errors.
    4. `Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ => 
iter.close()))`
    
    This PR moves 4 before 2 and 3. To sum up, the new sequence is 1 -> 4 -> 2 
-> 3.
    
    Manual. The following test case makes OOM intentionally to cause leaked 
filesystem connection in the current code base. With this patch, leakage 
doesn't occurs.
    
    ```scala
      // This should be tested manually because it raises OOM intentionally
      // in order to cause `Leaked filesystem connection`.
      test("SPARK-23399 Register a task completion listener first for 
OrcColumnarBatchReader") {
        withSQLConf(SQLConf.ORC_VECTORIZED_READER_BATCH_SIZE.key -> 
s"${Int.MaxValue}") {
          withTempDir { dir =>
            val basePath = dir.getCanonicalPath
            Seq(0).toDF("a").write.format("orc").save(new Path(basePath, 
"first").toString)
            Seq(1).toDF("a").write.format("orc").save(new Path(basePath, 
"second").toString)
            val df = spark.read.orc(
              new Path(basePath, "first").toString,
              new Path(basePath, "second").toString)
            val e = intercept[SparkException] {
              df.collect()
            }
            assert(e.getCause.isInstanceOf[OutOfMemoryError])
          }
        }
      }
    ```
    
    Author: Dongjoon Hyun <dongj...@apache.org>
    
    Closes #20590 from dongjoon-hyun/SPARK-23399.
    
    (cherry picked from commit 357babde5a8eb9710de7016d7ae82dee21fa4ef3)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit fd66a3b7b151514a9f626444ef8710f64dab6813
Author: âattilapirosâ <piros.attila.zsolt@...>
Date:   2018-02-14T14:45:54Z

    [SPARK-23394][UI] In RDD storage page show the executor addresses instead 
of the IDs
    
    ## What changes were proposed in this pull request?
    
    Extending RDD storage page to show executor addresses in the block table.
    
    ## How was this patch tested?
    
    Manually:
    
    ![screen shot 2018-02-13 at 10 30 
59](https://user-images.githubusercontent.com/2017933/36142668-0b3578f8-10a9-11e8-95ea-2f57703ee4af.png)
    
    Author: âattilapirosâ <piros.attila.zs...@gmail.com>
    
    Closes #20589 from attilapiros/SPARK-23394.
    
    (cherry picked from commit 140f87533a468b1046504fc3ff01fbe1637e41cd)
    Signed-off-by: Marcelo Vanzin <van...@cloudera.com>

commit a5a8a86e213c34d6fb32f0ae52db24d8f1ef0905
Author: gatorsmile <gatorsmile@...>
Date:   2018-02-14T18:59:36Z

    Revert "[SPARK-23249][SQL] Improved block merging logic for partitions"
    
    This reverts commit f5f21e8c4261c0dfe8e3e788a30b38b188a18f67.

commit bd83f7ba097d9bca9a0e8c072f7566a645887a96
Author: gatorsmile <gatorsmile@...>
Date:   2018-02-15T07:52:59Z

    [SPARK-23421][SPARK-22356][SQL] Document the behavior change in
    
    ## What changes were proposed in this pull request?
    https://github.com/apache/spark/pull/19579 introduces a behavior change. We 
need to document it in the migration guide.
    
    ## How was this patch tested?
    Also update the HiveExternalCatalogVersionsSuite to verify it.
    
    Author: gatorsmile <gatorsm...@gmail.com>
    
    Closes #20606 from gatorsmile/addMigrationGuide.
    
    (cherry picked from commit a77ebb0921e390cf4fc6279a8c0a92868ad7e69b)
    Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit 129fd45efb418c6afa95aa26e5b96f03a39dcdd0
Author: gatorsmile <gatorsmile@...>
Date:   2018-02-15T07:56:02Z

    [SPARK-23094] Revert [] Fix invalid character handling in JsonDataSource
    
    ## What changes were proposed in this pull request?
    This PR is to revert the PR https://github.com/apache/spark/pull/20302, 
because it causes a regression.
    
    ## How was this patch tested?
    N/A
    
    Author: gatorsmile <gatorsm...@gmail.com>
    
    Closes #20614 from gatorsmile/revertJsonFix.
    
    (cherry picked from commit 95e4b4916065e66a4f8dba57e98e725796f75e04)
    Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit f2c0585652a8262801e92a02b56c56f16b8926e5
Author: Wenchen Fan <wenchen@...>
Date:   2018-02-15T08:59:44Z

    [SPARK-23419][SPARK-23416][SS] data source v2 write path should re-throw 
interruption exceptions directly
    
    ## What changes were proposed in this pull request?
    
    Streaming execution has a list of exceptions that means interruption, and 
handle them specially. `WriteToDataSourceV2Exec` should also respect this list 
and not wrap them with `SparkException`.
    
    ## How was this patch tested?
    
    existing test.
    
    Author: Wenchen Fan <wenc...@databricks.com>
    
    Closes #20605 from cloud-fan/write.
    
    (cherry picked from commit f38c760638063f1fb45e9ee2c772090fb203a4a0)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit d24d13179f0e9d125eaaebfcc225e1ec30c5cb83
Author: Gabor Somogyi <gabor.g.somogyi@...>
Date:   2018-02-15T11:52:40Z

    [SPARK-23422][CORE] YarnShuffleIntegrationSuite fix when SPARK_PREPENâ¦
    
    â¦D_CLASSES set to 1
    
    ## What changes were proposed in this pull request?
    
    YarnShuffleIntegrationSuite fails when SPARK_PREPEND_CLASSES set to 1.
    
    Normally mllib built before yarn module. When SPARK_PREPEND_CLASSES used 
mllib classes are on yarn test classpath.
    
    Before 2.3 that did not cause issues. But 2.3 has SPARK-22450, which 
registered some mllib classes with the kryo serializer. Now it dies with the 
following error:
    
    `
    18/02/13 07:33:29 INFO SparkContext: Starting job: collect at 
YarnShuffleIntegrationSuite.scala:143
    Exception in thread "dag-scheduler-event-loop" 
java.lang.NoClassDefFoundError: breeze/linalg/DenseMatrix
    `
    
    In this PR NoClassDefFoundError caught only in case of testing and then do 
nothing.
    
    ## How was this patch tested?
    
    Automated: Pass the Jenkins.
    
    Author: Gabor Somogyi <gabor.g.somo...@gmail.com>
    
    Closes #20608 from gaborgsomogyi/SPARK-23422.
    
    (cherry picked from commit 44e20c42254bc6591b594f54cd94ced5fcfadae3)
    Signed-off-by: Marcelo Vanzin <van...@cloudera.com>

commit bae4449ad836a64db853e297d33bfd1a725faa0b
Author: Dongjoon Hyun <dongjoon@...>
Date:   2018-02-15T16:55:39Z

    [SPARK-23426][SQL] Use `hive` ORC impl and disable PPD for Spark 2.3.0
    
    ## What changes were proposed in this pull request?
    
    To prevent any regressions, this PR changes ORC implementation to `hive` by 
default like Spark 2.2.X.
    Users can enable `native` ORC. Also, ORC PPD is also restored to `false` 
like Spark 2.2.X.
    
    
![orc_section](https://user-images.githubusercontent.com/9700541/36221575-57a1d702-1173-11e8-89fe-dca5842f4ca7.png)
    
    ## How was this patch tested?
    
    Pass all test cases.
    
    Author: Dongjoon Hyun <dongj...@apache.org>
    
    Closes #20610 from dongjoon-hyun/SPARK-ORC-DISABLE.
    
    (cherry picked from commit 2f0498d1e85a53b60da6a47d20bbdf56b42b7dcb)
    Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit 03960faa621710388c0de91d16a71dda749d173f
Author: Dongjoon Hyun <dongjoon@...>
Date:   2018-02-15T17:40:08Z

    [MINOR][SQL] Fix an error message about inserting into bucketed tables
    
    ## What changes were proposed in this pull request?
    
    This replaces `Sparkcurrently` to `Spark currently` in the following error 
message.
    
    ```scala
    scala> sql("insert into t2 select * from v1")
    org.apache.spark.sql.AnalysisException: Output Hive table `default`.`t2`
    is bucketed but Sparkcurrently does NOT populate bucketed ...
    ```
    
    ## How was this patch tested?
    
    Manual.
    
    Author: Dongjoon Hyun <dongj...@apache.org>
    
    Closes #20617 from dongjoon-hyun/SPARK-ERROR-MSG.
    
    (cherry picked from commit 6968c3cfd70961c4e86daffd6a156d0a9c1d7a2a)
    Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit 0bd7765cd9832bee348af87663f3d424b61e92fc
Author: Liang-Chi Hsieh <viirya@...>
Date:   2018-02-15T17:54:39Z

    [SPARK-23377][ML] Fixes Bucketizer with multiple columns persistence bug
    
    ## What changes were proposed in this pull request?
    
    #### Problem:
    
    Since 2.3, `Bucketizer` supports multiple input/output columns. We will 
check if exclusive params are set during transformation. E.g., if `inputCols` 
and `outputCol` are both set, an error will be thrown.
    
    However, when we write `Bucketizer`, looks like the default params and 
user-supplied params are merged during writing. All saved params are loaded 
back and set to created model instance. So the default `outputCol` param in 
`HasOutputCol` trait will be set in `paramMap` and become an user-supplied 
param. That makes the check of exclusive params failed.
    
    #### Fix:
    
    This changes the saving logic of Bucketizer to handle this case. This is a 
quick fix to catch the time of 2.3. We should consider modify the persistence 
mechanism later.
    
    Please see the discussion in the JIRA.
    
    Note: The multi-column `QuantileDiscretizer` also has the same issue.
    
    ## How was this patch tested?
    
    Modified tests.
    
    Author: Liang-Chi Hsieh <vii...@gmail.com>
    
    Closes #20594 from viirya/SPARK-23377-2.
    
    (cherry picked from commit db45daab90ede4c03c1abc9096f4eac584e9db17)
    Signed-off-by: Joseph K. Bradley <jos...@databricks.com>

commit 75bb19a018f9260eab3ea0ba3ea46e84b87eabf2
Author: âattilapirosâ <piros.attila.zsolt@...>
Date:   2018-02-15T20:03:41Z

    [SPARK-23413][UI] Fix sorting tasks by Host / Executor ID at the Stagâ¦
    
    â¦e page
    
    ## What changes were proposed in this pull request?
    
    Fixing exception got at sorting tasks by Host / Executor ID:
    ```
            java.lang.IllegalArgumentException: Invalid sort column: Host
        at org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017)
        at 
org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694)
        at org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61)
        at org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96)
        at org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708)
        at org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293)
        at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282)
        at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
        at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
        at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
        at 
org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
        at 
org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584)
    ```
    
    Moreover some refactoring to avoid similar problems by introducing 
constants for each header name and reusing them at the identification of the 
corresponding sorting index.
    
    ## How was this patch tested?
    
    Manually:
    
    ![screen shot 2018-02-13 at 18 57 
10](https://user-images.githubusercontent.com/2017933/36166532-1cfdf3b8-10f3-11e8-8d32-5fcaad2af214.png)
    
    (cherry picked from commit 1dc2c1d5e85c5f404f470aeb44c1f3c22786bdea)
    
    Author: âattilapirosâ <piros.attila.zs...@gmail.com>
    
    Closes #20623 from squito/fix_backport.

commit ccb0a59d7383db451b86aee67423eb6e28f1f982
Author: hyukjinkwon <gurwls223@...>
Date:   2018-02-16T17:41:17Z

    [SPARK-23446][PYTHON] Explicitly check supported types in toPandas
    
    ## What changes were proposed in this pull request?
    
    This PR explicitly specifies and checks the types we supported in 
`toPandas`. This was a hole. For example, we haven't finished the binary type 
support in Python side yet but now it allows as below:
    
    ```python
    spark.conf.set("spark.sql.execution.arrow.enabled", "false")
    df = spark.createDataFrame([[bytearray("a")]])
    df.toPandas()
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    df.toPandas()
    ```
    
    ```
         _1
    0  [97]
      _1
    0  a
    ```
    
    This should be disallowed. I think the same things also apply to nested 
timestamps too.
    
    I also added some nicer message about `spark.sql.execution.arrow.enabled` 
in the error message.
    
    ## How was this patch tested?
    
    Manually tested and tests added in `python/pyspark/sql/tests.py`.
    
    Author: hyukjinkwon <gurwls...@gmail.com>
    
    Closes #20625 from HyukjinKwon/pandas_convertion_supported_type.
    
    (cherry picked from commit c5857e496ff0d170ed0339f14afc7d36b192da6d)
    Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit 8360da07110d847a01b243e6d786922a5057ad9f
Author: Shintaro Murakami <mrkm4ntr@...>
Date:   2018-02-17T01:17:55Z

    [SPARK-23381][CORE] Murmur3 hash generates a different value from other 
implementations
    
    ## What changes were proposed in this pull request?
    Murmur3 hash generates a different value from the original and other 
implementations (like Scala standard library and Guava or so) when the length 
of a bytes array is not multiple of 4.
    
    ## How was this patch tested?
    Added a unit test.
    
    **Note: When we merge this PR, please give all the credits to Shintaro 
Murakami.**
    
    Author: Shintaro Murakami <mrkm4ntrgmail.com>
    
    Author: gatorsmile <gatorsm...@gmail.com>
    Author: Shintaro Murakami <mrkm4...@gmail.com>
    
    Closes #20630 from gatorsmile/pr-20568.
    
    (cherry picked from commit d5ed2108d32e1d95b26ee7fed39e8a733e935e2c)
    Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit 44095cb65500739695b0324c177c19dfa1471472
Author: Sameer Agarwal <sameerag@...>
Date:   2018-02-17T01:29:46Z

    Preparing Spark release v2.3.0-rc4

commit c7a0dea46a251a27b304ac2ec9f07f97aca4b1d0
Author: Sameer Agarwal <sameerag@...>
Date:   2018-02-17T01:29:51Z

    Preparing development version 2.3.1-SNAPSHOT

commit a1ee6f1fc543120763f1b373bb31bc6d84004318
Author: Marcelo Vanzin <vanzin@...>
Date:   2018-02-21T01:54:06Z

    [SPARK-23470][UI] Use first attempt of last stage to define job description.
    
    This is much faster than finding out what the last attempt is, and the
    data should be the same.
    
    There's room for improvement in this page (like only loading data for
    the jobs being shown, instead of loading all available jobs and sorting
    them), but this should bring performance on par with the 2.2 version.
    
    Author: Marcelo Vanzin <van...@cloudera.com>
    
    Closes #20644 from vanzin/SPARK-23470.
    
    (cherry picked from commit 2ba77ed9e51922303e3c3533e368b95788bd7de5)
    Signed-off-by: Sameer Agarwal <samee...@apache.org>

commit 1d78f03ae3037f1ddbe6533a6733b7805a6705bf
Author: Marcelo Vanzin <vanzin@...>
Date:   2018-02-21T02:06:21Z

    [SPARK-23468][CORE] Stringify auth secret before storing it in credentials.
    
    The secret is used as a string in many parts of the code, so it has
    to be turned into a hex string to avoid issues such as the random
    byte sequence not containing a valid UTF8 sequence.
    
    Author: Marcelo Vanzin <van...@cloudera.com>
    
    Closes #20643 from vanzin/SPARK-23468.
    
    (cherry picked from commit 6d398c05cbad69aa9093429e04ae44c73b81cd5a)
    Signed-off-by: Marcelo Vanzin <van...@cloudera.com>

commit 3e7269eb904b591883300d7433e5c99be0b3b5b3
Author: Tathagata Das <tathagata.das1565@...>
Date:   2018-02-21T02:16:10Z

    [SPARK-23454][SS][DOCS] Added trigger information to the Structured 
Streaming programming guide
    
    ## What changes were proposed in this pull request?
    
    - Added clear information about triggers
    - Made the semantics guarantees of watermarks more clear for streaming 
aggregations and stream-stream joins.
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.
    
    Author: Tathagata Das <tathagata.das1...@gmail.com>
    
    Closes #20631 from tdas/SPARK-23454.
    
    (cherry picked from commit 601d653bff9160db8477f86d961e609fc2190237)
    Signed-off-by: Tathagata Das <tathagata.das1...@gmail.com>

commit 373ac642fc145527bdfccae046d0e98f105ee7b3
Author: Tathagata Das <tathagata.das1565@...>
Date:   2018-02-21T22:56:13Z

    [SPARK-23484][SS] Fix possible race condition in KafkaContinuousReader
    
    ## What changes were proposed in this pull request?
    
    var `KafkaContinuousReader.knownPartitions` should be threadsafe as it is 
accessed from multiple threads - the query thread at the time of reader factory 
creation, and the epoch tracking thread at the time of `needsReconfiguration`.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Tathagata Das <tathagata.das1...@gmail.com>
    
    Closes #20655 from tdas/SPARK-23484.
    
    (cherry picked from commit 3fd0ccb13fea44727d970479af1682ef00592147)
    Signed-off-by: Tathagata Das <tathagata.das1...@gmail.com>

commit 23ba4416e1bbbaa818876d7a837f7a5e260aa048
Author: Shixiong Zhu <zsxwing@...>
Date:   2018-02-21T23:37:28Z

    [SPARK-23481][WEBUI] lastStageAttempt should fail when a stage doesn't exist
    
    ## What changes were proposed in this pull request?
    
    The issue here is `AppStatusStore.lastStageAttempt` will return the next 
available stage in the store when a stage doesn't exist.
    
    This PR adds `last(stageId)` to ensure it returns a correct `StageData`
    
    ## How was this patch tested?
    
    The new unit test.
    
    Author: Shixiong Zhu <zsxw...@gmail.com>
    
    Closes #20654 from zsxwing/SPARK-23481.
    
    (cherry picked from commit 744d5af652ee8cece361cbca31e5201134e0fb42)
    Signed-off-by: Shixiong Zhu <zsxw...@gmail.com>

commit a0d7949896e70f427e7f3942ff340c9484ff0aab
Author: Shixiong Zhu <zsxwing@...>
Date:   2018-02-22T03:43:11Z

    [SPARK-23475][WEBUI] Skipped stages should be evicted before completed 
stages
    
    ## What changes were proposed in this pull request?
    
    The root cause of missing completed stages is because `cleanupStages` will 
never remove skipped stages.
    
    This PR changes the logic to always remove skipped stage first. This is 
safe since  the job itself contains enough information to render skipped stages 
in the UI.
    
    ## How was this patch tested?
    
    The new unit tests.
    
    Author: Shixiong Zhu <zsxw...@gmail.com>
    
    Closes #20656 from zsxwing/SPARK-23475.
    
    (cherry picked from commit 45cf714ee6d4eead2fe00794a0d754fa6d33d4a6)
    Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit 992447fb30ee9ebb3cf794f2d06f4d63a2d792db
Author: Sameer Agarwal <sameerag@...>
Date:   2018-02-22T17:56:57Z

    Preparing Spark release v2.3.0-rc5

commit 285b841ffbfb21c0af3f83800f7815fb0bfe3627
Author: Sameer Agarwal <sameerag@...>
Date:   2018-02-22T17:57:03Z

    Preparing development version 2.3.1-SNAPSHOT

commit 578607b30a65d101380acbe9c9740dc267a5d55c
Author: Marco Gaido <marcogaido91@...>
Date:   2018-02-24T02:27:33Z

    [SPARK-23475][UI][BACKPORT-2.3] Show also skipped stages
    
    ## What changes were proposed in this pull request?
    
    SPARK-20648 introduced the status `SKIPPED` for the stages. On the UI, 
previously, skipped stages were shown as `PENDING`; after this change, they are 
not shown on the UI.
    
    The PR introduce a new section in order to show also `SKIPPED` stages in a 
proper table.
    
    Manual backport from to branch-2.3.
    
    ## How was this patch tested?
    
    added UT
    
    Author: Marco Gaido <marcogaid...@gmail.com>
    
    Closes #20662 from mgaido91/SPARK-23475_2.3.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21870: Branch 2.3

Reply via email to