GitHub user laixiaohang opened a pull request: https://github.com/apache/spark/pull/16348
Branch 2.0.4399 ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/laixiaohang/spark branch-2.0.4399 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16348.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16348 ---- commit c9c36fa0c7bccefde808bdbc32b04e8555356001 Author: Davies Liu <dav...@databricks.com> Date: 2016-09-02T22:10:12Z [SPARK-17230] [SQL] Should not pass optimized query into QueryExecution in DataFrameWriter Some analyzer rules make assumptions about logical plans, and the optimizer may break those assumptions, so we should not pass an optimized query plan into QueryExecution (where it will be analyzed again); otherwise we may hit weird bugs. For example, we have a rule for decimal calculation that promotes the precision before binary operations and uses PromotePrecision as a placeholder to indicate that the rule should not be applied twice. But an optimizer rule removes this placeholder, breaking that assumption; the rule is then applied twice and produces wrong results. Ideally, we should make all the analyzer rules idempotent, but that may require lots of effort to double-check them one by one (it may not be easy). An easier approach is to never feed an optimized plan into the Analyzer. This PR fixes the case of RunnableCommand: these commands are optimized, and during execution the passed `query` was fed into QueryExecution again.
This PR makes these `query` plans not part of the children, so they will not be optimized and analyzed again. Right now, we cannot tell whether a logical plan has already been optimized; we could introduce a flag for that and make sure an optimized logical plan is never analyzed again. Added regression tests. Author: Davies Liu <dav...@databricks.com> Closes #14797 from davies/fix_writer. (cherry picked from commit ed9c884dcf925500ceb388b06b33bd2c95cd2ada) Signed-off-by: Davies Liu <davies....@gmail.com> commit a3930c3b9afa9f7eba2a5c8b8f279ca38e348e9b Author: Sameer Agarwal <samee...@cs.berkeley.edu> Date: 2016-09-02T22:16:16Z [SPARK-16334] Reusing same dictionary column for decoding consecutive row groups shouldn't throw an error This patch fixes a bug in the vectorized parquet reader that's caused by re-using the same dictionary column vector while reading consecutive row groups. Specifically, this issue manifests for a certain distribution of dictionary/plain encoded data while we read/populate the underlying bit-packed dictionary data into a column-vector based data structure. Manually tested on datasets provided by the community. Thanks to Chris Perluss and Keith Kraus for their invaluable help in tracking down this issue! Author: Sameer Agarwal <samee...@cs.berkeley.edu> Closes #14941 from sameeragarwal/parquet-exception-2. (cherry picked from commit a2c9acb0e54b2e38cb8ee6431f1ea0e0b4cd959a) Signed-off-by: Davies Liu <davies....@gmail.com> commit b8f65dad7be22231e982aaec3bbd69dbeacc20da Author: Davies Liu <davies....@gmail.com> Date: 2016-09-02T22:40:02Z Fix build commit c0ea7707127c92ecb51794b96ea40d7cdb28b168 Author: Davies Liu <davies....@gmail.com> Date: 2016-09-02T23:05:37Z Revert "[SPARK-16334] Reusing same dictionary column for decoding consecutive row groups shouldn't throw an error" This reverts commit a3930c3b9afa9f7eba2a5c8b8f279ca38e348e9b.
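The core idea of the SPARK-17230 fix above (keep the already-analyzed `query` out of a command's `children`, so tree-wide transforms skip it) can be sketched with a toy plan tree. This is an illustrative model only, not Spark's actual TreeNode/RunnableCommand API:

```scala
// Toy plan tree; names and structure are illustrative, not Spark's real API.
sealed trait Plan {
  def children: Seq[Plan]
  // Apply `rule` to this node after transforming its children.
  def transform(rule: PartialFunction[Plan, Plan]): Plan
}

case class Leaf(name: String) extends Plan {
  def children: Seq[Plan] = Nil
  def transform(rule: PartialFunction[Plan, Plan]): Plan =
    rule.applyOrElse(this, identity[Plan])
}

case class Project(child: Plan) extends Plan {
  def children: Seq[Plan] = Seq(child)
  def transform(rule: PartialFunction[Plan, Plan]): Plan = {
    // An ordinary node: transforms recurse into its children.
    val newChild = child.transform(rule)
    rule.applyOrElse(Project(newChild), identity[Plan])
  }
}

// The embedded query is a constructor argument but deliberately NOT a
// child, so tree-wide transforms (analysis, optimization) never see it.
case class RunCommand(query: Plan) extends Plan {
  def children: Seq[Plan] = Nil
  def transform(rule: PartialFunction[Plan, Plan]): Plan =
    rule.applyOrElse(this, identity[Plan])
}

// Stand-in for an optimizer rule that must not touch the query again.
val optimize: PartialFunction[Plan, Plan] = {
  case Leaf(n) => Leaf(n + "_optimized")
}

val cmd = RunCommand(Leaf("analyzed-query"))
val after = cmd.transform(optimize)
// `after.query` is untouched: the rule never recursed into it.
```

Because `RunCommand.children` is empty, rules recurse past the command without ever re-visiting the embedded, already-analyzed plan, which is the assumption-preserving behavior the commit message describes.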
commit 12a2e2a5ab5db12f39a7b591e914d52058e1581b Author: Junyang Qian <junya...@databricks.com> Date: 2016-09-03T04:11:57Z [SPARKR][MINOR] Fix docs for sparkR.session and count ## What changes were proposed in this pull request? This PR tries to add some more explanation to `sparkR.session`. It also modifies doc for `count` so when grouped in one doc, the description doesn't confuse users. ## How was this patch tested? Manual test. ![screen shot 2016-09-02 at 1 21 36 pm](https://cloud.githubusercontent.com/assets/15318264/18217198/409613ac-7110-11e6-8dae-cb0c8df557bf.png) Author: Junyang Qian <junya...@databricks.com> Closes #14942 from junyangq/fixSparkRSessionDoc. (cherry picked from commit d2fde6b72c4aede2e7edb4a7e6653fb1e7b19924) Signed-off-by: Shivaram Venkataraman <shiva...@cs.berkeley.edu> commit 949544d017ab25b43b683cd5c1e6783d87bfce45 Author: CodingCat <zhunans...@gmail.com> Date: 2016-09-03T09:03:40Z [SPARK-17347][SQL][EXAMPLES] Encoder in Dataset example has incorrect type ## What changes were proposed in this pull request? We propose to fix the Encoder type in the Dataset example ## How was this patch tested? The PR will be tested with the current unit test cases Author: CodingCat <zhunans...@gmail.com> Closes #14901 from CodingCat/SPARK-17347. (cherry picked from commit 97da41039b2b8fa7f93caf213ae45b9973925995) Signed-off-by: Sean Owen <so...@cloudera.com> commit 196d62eae05be0d87a20776fa07208b7ea2ddc90 Author: Sandeep Singh <sand...@techaddict.me> Date: 2016-09-03T14:35:19Z [MINOR][SQL] Not dropping all necessary tables ## What changes were proposed in this pull request? was not dropping table `parquet_t3` ## How was this patch tested? tested `LogicalPlanToSQLSuite` locally Author: Sandeep Singh <sand...@techaddict.me> Closes #13767 from techaddict/minor-8. 
(cherry picked from commit a8a35b39b92fc9000eaac102c67c66be30b05e54) Signed-off-by: Sean Owen <so...@cloudera.com> commit a7f5e7066f935d58d702a3e86b85aa175291d0fc Author: gatorsmile <gatorsm...@gmail.com> Date: 2016-08-10T08:25:01Z [SPARK-16959][SQL] Rebuild Table Comment when Retrieving Metadata from Hive Metastore ### What changes were proposed in this pull request? The `comment` in `CatalogTable` returned from Hive is always empty. We store it in a table property when creating a table. However, when we retrieve the table metadata from the Hive metastore, we do not rebuild it, so the `comment` is always empty. This PR fixes the issue. ### How was this patch tested? Fixed the test case to verify the change. Author: gatorsmile <gatorsm...@gmail.com> Closes #14550 from gatorsmile/tableComment. (cherry picked from commit bdd537164dcfeec5e9c51d54791ef16997ff2597) Signed-off-by: Wenchen Fan <wenc...@databricks.com> commit 3500dbc9bcce243b6656f308ee4941de0350d198 Author: Wenchen Fan <wenc...@databricks.com> Date: 2016-07-26T10:46:12Z [SPARK-16663][SQL] desc table should be consistent between data source and hive serde tables Currently there are two inconsistencies: 1. For a data source table, we only print partition names; for a Hive table, we also print the partition schema. After this PR, we will always print the schema. 2. If a column doesn't have a comment, a data source table prints an empty string while a Hive table prints null. After this PR, we will always print null. New test in `HiveDDLSuite`. Author: Wenchen Fan <wenc...@databricks.com> Closes #14302 from cloud-fan/minor3. (cherry picked from commit a2abb583caaec9a2cecd5d65b05d172fc096c125) Signed-off-by: Wenchen Fan <wenc...@databricks.com> commit 704215d3055bad7957d1d6da1a1a526c0d27d37d Author: Herman van Hovell <hvanhov...@databricks.com> Date: 2016-09-03T17:02:20Z [SPARK-17335][SQL] Fix ArrayType and MapType CatalogString. ## What changes were proposed in this pull request?
The `catalogString` for `ArrayType` and `MapType` currently calls the `simpleString` method on its children. This is a problem when the child is a struct: the `struct.simpleString` implementation truncates the number of fields it shows (25 at most). This breaks the generation of a proper `catalogString`, and has been shown to cause errors while writing to Hive. This PR fixes this by providing proper `catalogString` implementations for `ArrayData` or `MapData`. ## How was this patch tested? Added testing for `catalogString` to `DataTypeSuite`. Author: Herman van Hovell <hvanhov...@databricks.com> Closes #14938 from hvanhovell/SPARK-17335. (cherry picked from commit c2a1576c230697f56f282b6388c79835377e0f2f) Signed-off-by: Herman van Hovell <hvanhov...@databricks.com> commit e387c8ba86f89115eb2eabac070c215f451c5f0f Author: gatorsmile <gatorsm...@gmail.com> Date: 2016-09-05T03:17:37Z [SPARK-17391][TEST][2.0] Fix Two Test Failures After Backport ### What changes were proposed in this pull request? In the latest branch-2.0, we have two test case failures due to the backport: - test("ALTER VIEW AS should keep the previous table properties, comment, create_time, etc.") - test("SPARK-6212: The EXPLAIN output of CTAS only shows the analyzed plan") ### How was this patch tested? N/A Author: gatorsmile <gatorsm...@gmail.com> Closes #14951 from gatorsmile/fixTestFailure. commit f92d87455214005e60b2d58aa814aaabd2ac9495 Author: gatorsmile <gatorsm...@gmail.com> Date: 2016-09-06T02:45:54Z [SPARK-17353][SPARK-16943][SPARK-16942][BACKPORT-2.0][SQL] Fix multiple bugs in CREATE TABLE LIKE command ### What changes were proposed in this pull request? This PR is to backport https://github.com/apache/spark/pull/14531. The existing `CREATE TABLE LIKE` command has multiple issues:
- The generated table is non-empty when the source table is a data source table. The major reason is that the data source table uses the table property `path` to store the location of the table contents.
Currently, we keep it unchanged. Thus, we still create the same table with the same location.
- The table type of the generated table is `EXTERNAL` when the source table is an external Hive Serde table. Currently, we explicitly set it to `MANAGED`, but Hive checks the table property `EXTERNAL` to decide whether the table is `EXTERNAL` or not. (See https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1407-L1408) Thus, the created table is still `EXTERNAL`.
- When the source table is a `VIEW`, the metadata of the generated table contains the original view text and view original text. So far, this does not break anything, but it could cause something wrong in Hive. (For example, https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1405-L1406)
- The issue regarding the table `comment`. To follow what Hive does, the table comment should be cleaned, but the column comments should still be kept.
- The `INDEX` table is not supported. Thus, we should throw an exception in this case.
- `owner` should not be retained. `ToHiveTable` sets it [here](https://github.com/apache/spark/blob/e679bc3c1cd418ef0025d2ecbc547c9660cac433/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L793) no matter which value we set in `CatalogTable`. We set it to an empty string to avoid confusing output in Explain.
- Add support for temp tables.
- Like Hive, we should not copy the table properties from the source table to the created table, especially the statistics-related properties, which could be wrong in the created table.
- `unsupportedFeatures` should not be copied from the source table. The created table does not have these unsupported features.
- When the source table is a view, the target table uses the default format of data source tables: `spark.sql.sources.default`.
This PR is to fix the above issues.
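Most of the fixes listed above amount to sanitizing the source table's metadata before creating the target table. A rough sketch of that sanitization, using a hypothetical, heavily simplified stand-in for Spark's `CatalogTable` (field and property names here are illustrative assumptions, not the real API):

```scala
// Hypothetical, simplified table metadata; not Spark's actual CatalogTable.
case class TableMeta(
    name: String,
    tableType: String,                 // "MANAGED" or "EXTERNAL"
    properties: Map[String, String],   // may hold "path", "EXTERNAL", stats keys
    comment: Option[String],
    viewText: Option[String],
    owner: String)

// Sketch of what CREATE TABLE LIKE must do with the copied metadata.
def createTableLikeMeta(source: TableMeta, target: String): TableMeta =
  source.copy(
    name = target,
    // Force a managed table, regardless of the source's Hive EXTERNAL flag.
    tableType = "MANAGED",
    // Drop the location, the EXTERNAL flag, and statistics properties:
    // the new table needs its own location and fresh stats.
    properties = source.properties.filter { case (k, _) =>
      k != "path" && k != "EXTERNAL" && !k.startsWith("stats.")
    },
    // Following Hive, clear the table comment (column comments are kept
    // elsewhere, on the columns themselves).
    comment = None,
    // Never carry view text over to a real table.
    viewText = None,
    // Owner is filled in by the metastore client, not copied.
    owner = "")

val src = TableMeta(
  name = "t1",
  tableType = "EXTERNAL",
  properties = Map("path" -> "/data/t1", "EXTERNAL" -> "TRUE",
                   "stats.numRows" -> "42", "format" -> "parquet"),
  comment = Some("source comment"),
  viewText = None,
  owner = "alice")

val copied = createTableLikeMeta(src, "t2")
// copied is a MANAGED table named "t2" with no path, stats, comment, or owner
```

The design point is that copying is opt-out: every field that ties the metadata to the source's identity, location, or statistics must be explicitly reset.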
### How was this patch tested? Improved the test coverage by adding more test cases. Author: gatorsmile <gatorsm...@gmail.com> Closes #14946 from gatorsmile/createTableLike20. commit 7b1aa2153bc6c8b753dba0710fe7b5d031158a34 Author: Sean Zhong <seanzh...@databricks.com> Date: 2016-09-06T02:50:07Z [SPARK-17369][SQL] MetastoreRelation toJSON throws AssertException due to missing otherCopyArgs ## What changes were proposed in this pull request? `TreeNode.toJSON` requires a subclass to explicitly override otherCopyArgs to include currying construction arguments; otherwise it throws an AssertException reporting that the count of construction argument values doesn't match the count of construction argument names. The class `MetastoreRelation` has a currying construction parameter `client: HiveClient`, but Spark forgets to add it to the list of otherCopyArgs. ## How was this patch tested? Unit tests. Author: Sean Zhong <seanzh...@databricks.com> Closes #14928 from clockfly/metastore_relation_toJSON. (cherry picked from commit afb3d5d301d004fd748ad305b3d72066af4ebb6c) Signed-off-by: Wenchen Fan <wenc...@databricks.com> commit dd27530c7a1f4670a8e28be37c81952eca456752 Author: Yadong Qi <qiyadong2...@gmail.com> Date: 2016-09-06T02:57:21Z [SPARK-17358][SQL] Cached table(parquet/orc) should be shared between beelines ## What changes were proposed in this pull request? A cached table (parquet/orc) couldn't be shared between beelines, because the `sameResult` method used by `CacheManager` always returns false (the `sparkSession` instances are different) when comparing two `HadoopFsRelation`s from different beelines. So we make `sparkSession` a curried parameter. ## How was this patch tested?
Beeline1
```
1: jdbc:hive2://localhost:10000> CACHE TABLE src_pqt;
+---------+--+
| Result |
+---------+--+
+---------+--+
No rows selected (5.143 seconds)
1: jdbc:hive2://localhost:10000> EXPLAIN SELECT * FROM src_pqt;
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| plan |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| == Physical Plan ==
InMemoryTableScan [key#49, value#50]
   +- InMemoryRelation [key#49, value#50], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `src_pqt`
         +- *FileScan parquet default.src_pqt[key#0,value#1] Batched: true, Format: ParquetFormat, InputPaths: hdfs://199.0.0.1:9000/qiyadong/src_pqt, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<key:int,value:string> |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
```
Beeline2
```
0: jdbc:hive2://localhost:10000> EXPLAIN SELECT * FROM src_pqt;
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| plan |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| == Physical Plan ==
InMemoryTableScan [key#68, value#69]
   +- InMemoryRelation [key#68, value#69], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `src_pqt`
         +- *FileScan parquet default.src_pqt[key#0,value#1] Batched: true, Format: ParquetFormat, InputPaths: hdfs://199.0.0.1:9000/qiyadong/src_pqt, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<key:int,value:string> |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
```
Author: Yadong Qi <qiyadong2...@gmail.com> Closes #14913 from watermen/SPARK-17358.
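The currying trick described in the SPARK-17358 entry above relies on a Scala language detail: compiler-generated case-class equality covers only the first parameter list. A minimal illustration, with made-up class names rather than Spark's actual `HadoopFsRelation`:

```scala
// `session` sits in a second parameter list, so the generated
// equals/hashCode ignore it: two relations over the same files compare
// equal even when they come from different sessions. Names are
// illustrative, not Spark's real classes.
case class RelationKey(paths: Seq[String], format: String)(val session: AnyRef)

val sessionA = new AnyRef
val sessionB = new AnyRef

val r1 = RelationKey(Seq("hdfs://host/src_pqt"), "parquet")(sessionA)
val r2 = RelationKey(Seq("hdfs://host/src_pqt"), "parquet")(sessionB)
// r1 == r2 despite different sessions, so a cache lookup keyed on the
// relation (as in sameResult) can now hit across beeline connections.
```

This is why moving `sparkSession` out of the first parameter list is enough to make `sameResult` session-agnostic without writing a custom equality method.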
(cherry picked from commit 64e826f91eabb1a22d3d163d71fbb7b6d2185f25) Signed-off-by: Wenchen Fan <wenc...@databricks.com> commit f56b70fec2d31fd062320bb328c320e4eca72f1d Author: Yin Huai <yh...@databricks.com> Date: 2016-09-06T04:13:28Z Revert "[SPARK-17369][SQL] MetastoreRelation toJSON throws AssertException due to missing otherCopyArgs" This reverts commit 7b1aa2153bc6c8b753dba0710fe7b5d031158a34. commit 286ccd6ba9e3927e8d445c2f56b6f1f5c77e11df Author: Sean Zhong <seanzh...@databricks.com> Date: 2016-09-06T07:42:52Z [SPARK-17369][SQL][2.0] MetastoreRelation toJSON throws AssertException due to missing otherCopyArgs This backports https://github.com/apache/spark/pull/14928 to 2.0. ## What changes were proposed in this pull request? `TreeNode.toJSON` requires a subclass to explicitly override otherCopyArgs to include currying construction arguments; otherwise it throws an AssertException reporting that the count of construction argument values doesn't match the count of construction argument names. The class `MetastoreRelation` has a currying construction parameter `client: HiveClient`, but Spark forgets to add it to the list of otherCopyArgs. ## How was this patch tested? Unit tests. Author: Sean Zhong <seanzh...@databricks.com> Closes #14968 from clockfly/metastore_toJSON_fix_for_spark_2.0. commit c0f1f536dc75c9a1a932282046718228b95d2f70 Author: Sean Zhong <seanzh...@databricks.com> Date: 2016-09-06T08:05:50Z [SPARK-17356][SQL] Fix out of memory issue when generating JSON for TreeNode ## What changes were proposed in this pull request? The class `org.apache.spark.sql.types.Metadata` is widely used in MLlib to store ML attributes. `Metadata` is commonly stored in an `Alias` expression.
```
case class Alias(child: Expression, name: String)(
    val exprId: ExprId = NamedExpression.newExprId,
    val qualifier: Option[String] = None,
    val explicitMetadata: Option[Metadata] = None,
    override val isGenerated: java.lang.Boolean = false)
```
The `Metadata` can have a big memory footprint since the number of attributes can be large (on the scale of millions). When `toJSON` is called on an `Alias` expression, the `Metadata` is also converted to a big JSON string. If a plan contains many such `Alias` expressions, it may trigger an out-of-memory error when `toJSON` is called, since converting all `Metadata` references to JSON takes a huge amount of memory. With this PR, we skip scanning `Metadata` when doing JSON conversion. For a reproducer of the OOM and an analysis, see JIRA https://issues.apache.org/jira/browse/SPARK-17356. ## How was this patch tested? Existing tests. Author: Sean Zhong <seanzh...@databricks.com> Closes #14915 from clockfly/json_oom. (cherry picked from commit 6f13aa7dfee12b1b301bd10a1050549008ecc67e) Signed-off-by: Wenchen Fan <wenc...@databricks.com> commit 95e44dca1d99ff7904c3c2e174f0f2123062ce3c Author: Davies Liu <dav...@databricks.com> Date: 2016-09-06T17:46:31Z [SPARK-16922] [SPARK-17211] [SQL] make the address of values portable in LongToUnsafeRowMap ## What changes were proposed in this pull request? In LongToUnsafeRowMap, we use the offset of a value as a pointer, stored in an array (also in the page) for chained values. The offset is not portable, because Platform.LONG_ARRAY_OFFSET differs with JVM heap size, so a deserialized LongToUnsafeRowMap can be corrupt. This PR changes it to use a portable address (without Platform.LONG_ARRAY_OFFSET). ## How was this patch tested? Added a test case with randomly generated keys to improve the coverage. This is not a regression test, though; that would require a Spark cluster with at least 32G of heap in the driver or an executor.
Author: Davies Liu <dav...@databricks.com> Closes #14927 from davies/longmap. (cherry picked from commit f7e26d788757f917b32749856bb29feb7b4c2987) Signed-off-by: Davies Liu <davies....@gmail.com> commit 534380484ac5f56bd3e14a8917a24ca6cccf198f Author: Sameer Agarwal <samee...@cs.berkeley.edu> Date: 2016-09-06T17:48:53Z [SPARK-16334] [BACKPORT] Reusing same dictionary column for decoding consecutive row groups shouldn't throw an error ## What changes were proposed in this pull request? Backports https://github.com/apache/spark/pull/14941 in 2.0. This patch fixes a bug in the vectorized parquet reader that's caused by re-using the same dictionary column vector while reading consecutive row groups. Specifically, this issue manifests for a certain distribution of dictionary/plain encoded data while we read/populate the underlying bit packed dictionary data into a column-vector based data structure. Manually tested on datasets provided by the community. Thanks to Chris Perluss and Keith Kraus for their invaluable help in tracking down this issue! Author: Sameer Agarwal <sameeragcs.berkeley.edu> Closes #14941 from sameeragarwal/parquet-exception-2. Author: Sameer Agarwal <samee...@cs.berkeley.edu> Closes #14944 from sameeragarwal/branch-2.0. commit 130a80fd87bd8bb275f59af6d81c2e7dcc9707f9 Author: Adam Roberts <arobe...@uk.ibm.com> Date: 2016-09-06T21:13:25Z [SPARK-17378][BUILD] Upgrade snappy-java to 1.1.2.6 ## What changes were proposed in this pull request? Upgrades the Snappy version to 1.1.2.6 from 1.1.2.4, release notes: https://github.com/xerial/snappy-java/blob/master/Milestone.md mention "Fix a bug in SnappyInputStream when reading compressed data that happened to have the same first byte with the stream magic header (#142)" ## How was this patch tested? Existing unit tests using the latest IBM Java 8 on Intel, Power and Z architectures (little and big-endian) Author: Adam Roberts <arobe...@uk.ibm.com> Closes #14958 from a-roberts/master. 
(cherry picked from commit 6c08dbf683875ff1ba724447e0531f673bcff8ba) Signed-off-by: Sean Owen <so...@cloudera.com> commit 0ae97863781200ec96f89ad98e5d11bb1778fab0 Author: Sandeep Singh <sand...@techaddict.me> Date: 2016-09-06T21:18:28Z [SPARK-17299] TRIM/LTRIM/RTRIM should not strip characters other than spaces ## What changes were proposed in this pull request? TRIM/LTRIM/RTRIM should not strip characters other than spaces; we were trimming all chars smaller than ASCII 0x20 (space). ## How was this patch tested? Fixed existing tests. Author: Sandeep Singh <sand...@techaddict.me> Closes #14924 from techaddict/SPARK-17299. (cherry picked from commit 7775d9f224e22400c6c8c093652a383f4af66ee0) Signed-off-by: Sean Owen <so...@cloudera.com> commit 015751421bc444e350ad15c6f2e8a52f2da5b6e9 Author: Josh Rosen <joshro...@databricks.com> Date: 2016-09-06T22:07:28Z [SPARK-17110] Fix StreamCorruptionException in BlockManager.getRemoteValues() ## What changes were proposed in this pull request? This patch fixes a `java.io.StreamCorruptedException` error affecting remote reads of cached values when certain data types are used. The problem stems from #11801 / SPARK-13990, a patch to have Spark automatically pick the "best" serializer when caching RDDs. If PySpark cached a PythonRDD, then this would be cached as an `RDD[Array[Byte]]` and the automatic serializer selection would pick KryoSerializer for replication and block transfer. However, the `getRemoteValues()` / `getRemoteBytes()` code path did not pass proper class tags in order to enable the same serializer to be used during deserialization, causing Java serialization to be inappropriately used instead of Kryo, leading to the StreamCorruptedException. We already fixed a similar bug in #14311, which dealt with similar issues in block replication. Prior to that patch, it seems that we had no tests to ensure that block replication actually succeeded.
Similarly, prior to this bug fix patch it looks like we had no tests to perform remote reads of cached data, which is why this bug was able to remain latent for so long. This patch addresses the bug by modifying `BlockManager`'s `get()` and `getRemoteValues()` methods to accept ClassTags, allowing the proper class tag to be threaded in the `getOrElseUpdate` code path (which is used by `rdd.iterator`) ## How was this patch tested? Extended the caching tests in `DistributedSuite` to exercise the `getRemoteValues` path, plus manual testing to verify that the PySpark bug reproduction in SPARK-17110 is fixed. Author: Josh Rosen <joshro...@databricks.com> Closes #14952 from JoshRosen/SPARK-17110. (cherry picked from commit 29cfab3f1524c5690be675d24dda0a9a1806d6ff) Signed-off-by: Josh Rosen <joshro...@databricks.com> commit f3cfce09274741cc04bf2f00e873b3b64976b6d5 Author: Shixiong Zhu <shixi...@databricks.com> Date: 2016-09-06T23:49:06Z [SPARK-17316][CORE] Fix the 'ask' type parameter in 'removeExecutor' ## What changes were proposed in this pull request? Fix the 'ask' type parameter in 'removeExecutor' to eliminate a lot of error logs `Cannot cast java.lang.Boolean to scala.runtime.Nothing$` ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixi...@databricks.com> Closes #14983 from zsxwing/SPARK-17316-3. (cherry picked from commit 175b4344112b376cbbbd05265125ed0e1b87d507) Signed-off-by: Shixiong Zhu <shixi...@databricks.com> commit a23d4065c5705b805c69e569ea177167d44b5244 Author: Wenchen Fan <wenc...@databricks.com> Date: 2016-09-06T02:36:00Z [SPARK-17279][SQL] better error message for exceptions during ScalaUDF execution ## What changes were proposed in this pull request? If `ScalaUDF` throws exceptions during executing user code, sometimes it's hard for users to figure out what's wrong, especially when they use Spark shell. 
An example:
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 325.0 failed 4 times, most recent failure: Lost task 12.3 in stage 325.0 (TID 35622, 10.0.207.202): java.lang.NullPointerException
	at line8414e872fb8b42aba390efc153d1611a12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:40)
	at line8414e872fb8b42aba390efc153d1611a12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:40)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	...
```
We should catch these exceptions and rethrow them with a better error message saying that the exception happened in a Scala UDF. This PR also does some cleanup of `ScalaUDF` and adds a unit test suite for it. ## How was this patch tested? The new test suite. Author: Wenchen Fan <wenc...@databricks.com> Closes #14850 from cloud-fan/npe. (cherry picked from commit 8d08f43d09157b98e559c0be6ce6fd571a35e0d1) Signed-off-by: Wenchen Fan <wenc...@databricks.com> commit 796577b43d3df94f5d3a8e4baeb0aa03fbbb3f21 Author: Tathagata Das <tathagata.das1...@gmail.com> Date: 2016-09-07T02:34:11Z [SPARK-17372][SQL][STREAMING] Avoid serialization issues by using Arrays to save file names in FileStreamSource ## What changes were proposed in this pull request? When we create a file stream on a directory that has partitioned subdirs (i.e. dir/x=y/), ListingFileCatalog.allFiles returns the files in the dir as a Seq[String] which internally is a Stream[String]. This is because of this [line](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala#L93), where LinkedHashSet.values.toSeq returns a Stream.
Then when the [FileStreamSource](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L79) filters this Stream[String] to remove the seen files, it creates a new Stream[String], which has a filter function that has a $outer reference to the FileStreamSource (in Scala 2.10). Trying to serialize this Stream[String] causes a NotSerializableException. This happens even if there is just one file in the dir. It's important to note that this behavior is different in Scala 2.11: there is no $outer reference to FileStreamSource, so it does not throw NotSerializableException. However, with a large sequence of files (tested with 10000 files), it throws StackOverflowError. This is because of how the Stream class is implemented: it's basically a linked list, and attempting to serialize a long Stream requires *recursively* walking the linked list, thus resulting in StackOverflowError. In short, across both Scala 2.10 and 2.11, serialization fails when both of the following conditions are true: - file stream defined on a partitioned directory - directory has 10k+ files The right solution is to convert the seq to an array before writing to the log. This PR implements the fix in two ways: - Changed all uses of HDFSMetadataLog to ensure Array is used instead of Seq - Added a `require` in HDFSMetadataLog so that it is never used with type Seq ## How was this patch tested? Added a unit test that ensures the file stream source can handle 10000 files. This test fails in both Scala 2.10 and 2.11, with the different failures indicated above. Author: Tathagata Das <tathagata.das1...@gmail.com> Closes #14987 from tdas/SPARK-17372.
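The essence of the SPARK-17372 fix above (materialize a lazy Seq into an Array before handing it to Java serialization) can be sketched as follows. This is a toy example, not the actual HDFSMetadataLog code:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// A lazily-built Stream, like the one ListingFileCatalog can return.
val files: Seq[String] = Stream.from(0).take(10000).map(i => s"file-$i")

// Materialize to a flat Array before serializing. Serializing the
// Stream itself recurses once per cons cell (risking StackOverflowError
// for long streams) and, on Scala 2.10, can drag in non-serializable
// $outer references captured by filter/map closures.
val flat: Array[String] = files.toArray

val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
oos.writeObject(flat) // safe: an array is serialized iteratively
oos.close()
```

This mirrors the PR's `require`-backed convention: the metadata log only ever sees arrays, never lazy sequences.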
(cherry picked from commit eb1ab88a86ce35f3d6ba03b3a798099fbcf6b3fc) Signed-off-by: Tathagata Das <tathagata.das1...@gmail.com> commit ee6301a88e3b109398cec9bc470b5a88f72654dd Author: Clark Fitzgerald <clarkfi...@gmail.com> Date: 2016-09-07T06:40:37Z [SPARK-16785] R dapply doesn't return array or raw columns Fixed bug in `dapplyCollect` by changing the `compute` function of `worker.R` to explicitly handle raw (binary) vectors. cc shivaram Unit tests Author: Clark Fitzgerald <clarkfi...@gmail.com> Closes #14783 from clarkfitzg/SPARK-16785. (cherry picked from commit 9fccde4ff80fb0fd65a9e90eb3337965e4349de4) Signed-off-by: Shivaram Venkataraman <shiva...@cs.berkeley.edu> commit c8811adaa6b2fb6c5ca31520908d148326ebaf18 Author: Herman van Hovell <hvanhov...@databricks.com> Date: 2016-09-07T08:38:56Z [SPARK-17296][SQL] Simplify parser join processing [BACKPORT 2.0] ## What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/pull/14867 to branch-2.0. It fixes a number of join ordering bugs. ## How was this patch tested? Added tests to `PlanParserSuite`. Author: Herman van Hovell <hvanhov...@databricks.com> Closes #14984 from hvanhovell/SPARK-17296-branch-2.0. commit e6caceb5e141a1665b21d04079a86baca041e453 Author: Srinivasa Reddy Vundela <v...@cloudera.com> Date: 2016-09-07T11:41:03Z [MINOR][SQL] Fixing the typo in unit test ## What changes were proposed in this pull request? Fixing the typo in the unit test of CodeGenerationSuite.scala ## How was this patch tested? Ran the unit test after fixing the typo and it passes Author: Srinivasa Reddy Vundela <v...@cloudera.com> Closes #14989 from vundela/typo_fix. 
(cherry picked from commit 76ad89e9241fb2dece95dd445661dd95ee4ef699) Signed-off-by: Sean Owen <so...@cloudera.com> commit 078ac0e6321aeb72c670a65ec90b9c20ef0a7788 Author: Eric Liang <e...@databricks.com> Date: 2016-09-07T19:33:50Z [SPARK-17370] Shuffle service files not invalidated when a slave is lost ## What changes were proposed in this pull request? DAGScheduler invalidates shuffle files when an executor loss event occurs, but not when the external shuffle service is enabled. This is because when shuffle service is on, the shuffle file lifetime can exceed the executor lifetime. However, it also doesn't invalidate shuffle files when the shuffle service itself is lost (due to whole slave loss). This can cause long hangs when slaves are lost since the file loss is not detected until a subsequent stage attempts to read the shuffle files. The proposed fix is to also invalidate shuffle files when an executor is lost due to a `SlaveLost` event. ## How was this patch tested? Unit tests, also verified on an actual cluster that slave loss invalidates shuffle files immediately as expected. cc mateiz Author: Eric Liang <e...@databricks.com> Closes #14931 from ericl/sc-4439. (cherry picked from commit 649fa4bf1d6fc9271ae56b6891bc93ebf57858d1) Signed-off-by: Josh Rosen <joshro...@databricks.com> commit 067752ce08dc035ee807d20be2202c385f88f01c Author: Marcelo Vanzin <van...@cloudera.com> Date: 2016-09-07T23:43:05Z [SPARK-16533][CORE] - backport driver deadlock fix to 2.0 ## What changes were proposed in this pull request? Backport changes from #14710 and #14925 to 2.0 Author: Marcelo Vanzin <van...@cloudera.com> Author: Angus Gerry <ango...@gmail.com> Closes #14933 from angolon/SPARK-16533-2.0. ----