GitHub user jagadeesanas2 opened a pull request: https://github.com/apache/spark/pull/15728
[SPARK-18133][branch-2.0][Examples][ML] Python ML Pipeline Example has syntax errors

## What changes were proposed in this pull request?

[Fix][branch-2.0] In Python 3 there is only one integer type (int), which mostly behaves like the long type in Python 2. Since Python 3 does not accept the "L" suffix on integer literals, it has been removed from all examples.

## How was this patch tested?

Unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ibmsoe/spark SPARK-18133_2.0

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15728.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #15728

----

commit 191d99692dc4315c371b566e3a9c5b114876ee49 Author: Wenchen Fan <wenc...@databricks.com> Date: 2016-09-01T00:54:59Z [SPARK-17180][SPARK-17309][SPARK-17323][SQL][2.0] create AlterViewAsCommand to handle ALTER VIEW AS ## What changes were proposed in this pull request? Currently we use `CreateViewCommand` to implement ALTER VIEW AS, which has 3 bugs: 1. SPARK-17180: ALTER VIEW AS should alter a temp view if the view name has no database part and the temp view exists. 2. SPARK-17309: ALTER VIEW AS should issue an exception if the view does not exist. 3. SPARK-17323: ALTER VIEW AS should keep the previous table properties, comment, create_time, etc. The root cause is that ALTER VIEW AS is quite different from CREATE VIEW, so we need different code paths to handle them. However, in `CreateViewCommand` there is no way to distinguish ALTER VIEW AS from CREATE VIEW without introducing an extra flag. Instead of doing this, a more natural way is to separate the ALTER VIEW AS logic into a new command. backport https://github.com/apache/spark/pull/14874 to 2.0 ## How was this patch tested? new tests in SQLViewSuite Author: Wenchen Fan <wenc...@databricks.com> Closes #14893 from cloud-fan/minor4. 
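For readers skimming the commit list below: the change proposed in #15728 itself is tiny, and can be sketched as follows (the variable names here are illustrative, not taken from the actual example files):

```python
# Python 2 examples wrote long literals with an "L" suffix (e.g. 10L).
# Python 3 has a single arbitrary-precision int type and rejects that
# suffix as a syntax error, so the fix is simply to drop the "L":
num_iterations = 10   # was written as "10L" in the Python 2 era examples
big = 2 ** 64         # still exact: Python 3 ints have arbitrary precision

print(type(num_iterations).__name__)
print(type(big).__name__)
print(big)
```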
commit 8711b451d727074173748418a47cec210f84f2f7 Author: Junyang Qian <junya...@databricks.com> Date: 2016-09-01T04:28:53Z [SPARKR][MINOR] Fix windowPartitionBy example ## What changes were proposed in this pull request? The usage in the original example is incorrect. This PR fixes it. ## How was this patch tested? Manual test. Author: Junyang Qian <junya...@databricks.com> Closes #14903 from junyangq/SPARKR-FixWindowPartitionByDoc. (cherry picked from commit d008638fbedc857c1adc1dff399d427b8bae848e) Signed-off-by: Shivaram Venkataraman <shiva...@cs.berkeley.edu> commit 6281b74b6965ffcd0600844cea168cbe71ca8248 Author: Shixiong Zhu <shixi...@databricks.com> Date: 2016-09-01T06:25:20Z [SPARK-17318][TESTS] Fix ReplSuite replicating blocks of object with class defined in repl again ## What changes were proposed in this pull request? After digging into the logs, I noticed that the failure occurs because this test starts a local cluster with 2 executors; however, when the SparkContext is created, the executors may still not be up. If one of the executors is not up while the job runs, the blocks won't be replicated. This PR just adds a wait loop before running the job to fix the flaky test. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixi...@databricks.com> Closes #14905 from zsxwing/SPARK-17318-2. (cherry picked from commit 21c0a4fe9d8e21819ba96e7dc2b1f2999d3299ae) Signed-off-by: Shixiong Zhu <shixi...@databricks.com> commit 13bacd7308c42c92f42fbc3ffbee9a13282668a9 Author: Tejas Patil <tej...@fb.com> Date: 2016-09-01T16:49:43Z [SPARK-17271][SQL] Planner adds un-necessary Sort even if child orde… ## What changes were proposed in this pull request? 
Ports https://github.com/apache/spark/pull/14841 and https://github.com/apache/spark/pull/14910 from `master` to `branch-2.0` Jira: https://issues.apache.org/jira/browse/SPARK-17271 The planner is adding an un-needed SORT operation due to a bug in the way `SortOrder` comparison is done at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L253 `SortOrder` needs to be compared semantically because the `Expression`s within two `SortOrder`s can be "semantically equal" but not literally equal objects, e.g. in the case of `sql("SELECT * FROM table1 a JOIN table2 b ON a.col1=b.col1")` Expression in required SortOrder: ``` AttributeReference( name = "col1", dataType = LongType, nullable = false ) (exprId = exprId, qualifier = Some("a") ) ``` Expression in child SortOrder: ``` AttributeReference( name = "col1", dataType = LongType, nullable = false ) (exprId = exprId) ``` Notice that the output column has a qualifier but the child attribute does not; the underlying expression is the same, and hence in this case the child satisfies the required sort order. This PR includes the following changes: - Added a `semanticEquals` method to `SortOrder` so that it can compare underlying child expressions semantically (and not using the default Object.equals) - Fixed `EnsureRequirements` to use semantic comparison of SortOrder ## How was this patch tested? - Added a test case to `PlannerSuite`. Ran the remaining tests in `PlannerSuite` Author: Tejas Patil <tej...@fb.com> Closes #14920 from tejasapatil/SPARK-17271_2.0_port. commit ac22ab0779c8672ba622b90304f05ac44ff83819 Author: Brian Cho <b...@fb.com> Date: 2016-09-01T21:13:17Z [SPARK-16926] [SQL] Remove partition columns from partition metadata. ## What changes were proposed in this pull request? This removes partition columns from the column metadata of partitions to match tables. 
A change introduced in SPARK-14388 removed partition columns from the column metadata of tables, but not for partitions. This causes TableReader to believe that the schema differs between table and partition, and to create an unnecessary conversion object inspector in TableReader. ## How was this patch tested? Existing unit tests. Author: Brian Cho <b...@fb.com> Closes #14515 from dafrista/partition-columns-metadata. (cherry picked from commit 473d78649dec7583bcc4ec24b6f38303c38e81a2) Signed-off-by: Davies Liu <davies....@gmail.com> commit dd377a52203def279b529832b888ef46be6268dc Author: Josh Rosen <joshro...@databricks.com> Date: 2016-09-01T23:45:26Z [SPARK-17355] Workaround for HIVE-14684 / HiveResultSetMetaData.isSigned exception ## What changes were proposed in this pull request? Attempting to use Spark SQL's JDBC data source against the Hive ThriftServer results in a `java.sql.SQLException: Method not supported` exception from `org.apache.hive.jdbc.HiveResultSetMetaData.isSigned`. Here are two user reports of this issue: - https://stackoverflow.com/questions/34067686/spark-1-5-1-not-working-with-hive-jdbc-1-2-0 - https://stackoverflow.com/questions/32195946/method-not-supported-in-spark I have filed [HIVE-14684](https://issues.apache.org/jira/browse/HIVE-14684) to attempt to fix this in Hive by implementing the isSigned method, but in the meantime / for compatibility with older JDBC drivers I think we should add special-case error handling to work around this bug. This patch updates `JDBCRDD`'s `ResultSetMetadata`-to-schema conversion to catch the "Method not supported" exception from Hive and return `isSigned = true`. I believe that this is safe because, as far as I know, Hive does not support unsigned numeric types. ## How was this patch tested? Tested manually against a Spark Thrift Server. Author: Josh Rosen <joshro...@databricks.com> Closes #14911 from JoshRosen/hive-jdbc-workaround. 
(cherry picked from commit 15539e54c2650a164f09c072f8fae934bb0468c9) Signed-off-by: Josh Rosen <joshro...@databricks.com> commit f9463238de1e7ea17da8f258f22e385a0ed4134e Author: Kousuke Saruta <saru...@oss.nttdata.co.jp> Date: 2016-09-02T07:46:15Z [SPARK-17342][WEBUI] Style of event timeline is broken ## What changes were proposed in this pull request? SPARK-15373 (#13158) updated the version of vis.js to 4.16.1. As of 4.0.0, some classes were renamed (e.g. 'timeline' to 'vis-timeline'), but that ticket didn't account for this, so the style is now broken. In this PR, I've restored the style by modifying `timeline-view.css` and `timeline-view.js`. ## How was this patch tested? Manual tests. * Before <img width="1258" alt="2016-09-01 1 38 31" src="https://cloud.githubusercontent.com/assets/4736016/18141311/fddf1bac-6ff3-11e6-935f-28b389073b39.png"> * After <img width="1256" alt="2016-09-01 3 30 19" src="https://cloud.githubusercontent.com/assets/4736016/18141394/49af65dc-6ff4-11e6-8640-70e20300f3c3.png"> Author: Kousuke Saruta <saru...@oss.nttdata.co.jp> Closes #14900 from sarutak/SPARK-17342. (cherry picked from commit 2ab8dbddaa31e4491b52eb0e495660ebbebfdb9e) Signed-off-by: Sean Owen <so...@cloudera.com> commit 171bdfd963b5dda85ddf5e72b72471fdaaaf2fe3 Author: wm...@hotmail.com <wm...@hotmail.com> Date: 2016-09-02T08:47:17Z [SPARK-16883][SPARKR] SQL decimal type is not properly cast to number when collecting SparkDataFrame ## What changes were proposed in this pull request? registerTempTable(createDataFrame(iris), "iris") str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y from iris limit 5"))) 'data.frame': 5 obs. of 2 variables: $ x: num 1 1 1 1 1 $ y:List of 5 ..$ : num 2 ..$ : num 2 ..$ : num 2 ..$ : num 2 ..$ : num 2 The problem is that Spark returns the `decimal(10, 0)` col type, instead of `decimal`. 
Thus, `decimal(10, 0)` is not handled correctly. It should be handled as "double". As discussed in the JIRA thread, we have two potential fixes: 1) A Scala-side fix that adds a new case when writing the object back; however, I can't use spark.sql.types._ in Spark core due to dependency issues, and I haven't found a way to do the type case match. 2) A SparkR-side fix: add a helper function to check for special types like `"decimal(10, 0)"` and replace them with `double`, which is a PRIMITIVE type. This helper is generic, so handling for new types can be added in the future. I open this PR to discuss the pros and cons of both approaches. If we want to do the Scala-side fix, we need to find a way to match the case of DecimalType and StructType in Spark Core. ## How was this patch tested? Manual test: > str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y from iris limit 5"))) 'data.frame': 5 obs. of 2 variables: $ x: num 1 1 1 1 1 $ y: num 2 2 2 2 2 R Unit tests Author: wm...@hotmail.com <wm...@hotmail.com> Closes #14613 from wangmiao1981/type. (cherry picked from commit 0f30cdedbdb0d38e8c479efab6bb1c6c376206ff) Signed-off-by: Felix Cheung <felixche...@apache.org> commit d9d10ffb9c2ee2a79257d8827bdc99052d144511 Author: Kousuke Saruta <saru...@oss.nttdata.co.jp> Date: 2016-09-02T09:26:43Z [SPARK-17352][WEBUI] Executor computing time can be a negative number because of a calculation error ## What changes were proposed in this pull request? In StagePage, the executor computing time is calculated, but a calculation error can potentially occur because it is computed by subtracting floating-point numbers. The following capture is an example. <img width="949" alt="capture-timeline" src="https://cloud.githubusercontent.com/assets/4736016/18152359/43f07a28-7030-11e6-8cbd-8e73bf4c4c67.png"> ## How was this patch tested? Manual tests. 
Author: Kousuke Saruta <saru...@oss.nttdata.co.jp> Closes #14908 from sarutak/SPARK-17352. (cherry picked from commit 7ee24dac8e779f6a9bf45371fdc2be83fb679cb2) Signed-off-by: Sean Owen <so...@cloudera.com> commit 91a3cf1365157918f280d60c9b3ffeec4c087b92 Author: gatorsmile <gatorsm...@gmail.com> Date: 2016-09-02T14:31:01Z [SPARK-16935][SQL] Verification of Function-related ExternalCatalog APIs Function-related `HiveExternalCatalog` APIs do not have enough verification logics. After the PR, `HiveExternalCatalog` and `InMemoryCatalog` become consistent in the error handling. For example, below is the exception we got when calling `renameFunction`. ``` 15:13:40.369 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database db1, returning NoSuchObjectException 15:13:40.377 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database db2, returning NoSuchObjectException 15:13:40.739 ERROR DataNucleus.Datastore.Persist: Update of object "org.apache.hadoop.hive.metastore.model.MFunction205629e9" using statement "UPDATE FUNCS SET FUNC_NAME=? WHERE FUNC_ID=?" failed : org.apache.derby.shared.common.error.DerbySQLIntegrityConstraintViolationException: The statement was aborted because it would have caused a duplicate key value in a unique or primary key constraint or unique index identified by 'UNIQUEFUNCTION' defined on 'FUNCS'. at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source) ``` Improved the existing test cases to check whether the messages are right. Author: gatorsmile <gatorsm...@gmail.com> Closes #14521 from gatorsmile/functionChecking. 
(cherry picked from commit 247a4faf06c1dd47a6543c56929cd0182a03e106) Signed-off-by: Wenchen Fan <wenc...@databricks.com> commit 30e5c84939a5169cec1720196e1122fc0759ae2a Author: Jeff Zhang <zjf...@apache.org> Date: 2016-09-02T17:08:14Z [SPARK-17261] [PYSPARK] Using HiveContext after re-creating SparkContext in Spark 2.0 throws "Java.lang.illegalStateException: Cannot call methods on a stopped sparkContext" ## What changes were proposed in this pull request? Set SparkSession._instantiatedContext as None so that we can recreate SparkSession again. ## How was this patch tested? Tested manually using the following command in pyspark shell ``` spark.stop() spark = SparkSession.builder.enableHiveSupport().getOrCreate() spark.sql("show databases").show() ``` Author: Jeff Zhang <zjf...@apache.org> Closes #14857 from zjffdu/SPARK-17261. (cherry picked from commit ea662286561aa9fe321cb0a0e10cdeaf60440b90) Signed-off-by: Davies Liu <davies....@gmail.com> commit 29ac2f62e88ea8e280b474e61cdb2ab0a0d92a94 Author: Felix Cheung <felixcheun...@hotmail.com> Date: 2016-09-02T17:12:10Z [SPARK-17376][SPARKR] Spark version should be available in R ## What changes were proposed in this pull request? Add sparkR.version() API. ``` > sparkR.version() [1] "2.1.0-SNAPSHOT" ``` ## How was this patch tested? manual, unit tests Author: Felix Cheung <felixcheun...@hotmail.com> Closes #14935 from felixcheung/rsparksessionversion. (cherry picked from commit 812333e4336113e44d2c9473bcba1cee4a989d2c) Signed-off-by: Shivaram Venkataraman <shiva...@cs.berkeley.edu> commit d4ae35d02f92df407e54b65c2d6b48388448f031 Author: Felix Cheung <felixcheun...@hotmail.com> Date: 2016-09-02T17:28:37Z [SPARKR][DOC] regexp_extract should doc that it returns empty string when match fails ## What changes were proposed in this pull request? Doc change - see https://issues.apache.org/jira/browse/SPARK-16324 ## How was this patch tested? 
manual check Author: Felix Cheung <felixcheun...@hotmail.com> Closes #14934 from felixcheung/regexpextractdoc. (cherry picked from commit 419eefd811a4e29a73bc309157f150751e478db5) Signed-off-by: Shivaram Venkataraman <shiva...@cs.berkeley.edu> commit 03d9af6043ae443ced004383c996fa8eebf3a1d1 Author: Felix Cheung <felixcheun...@hotmail.com> Date: 2016-09-02T18:08:25Z [SPARK-17376][SPARKR] followup - change since version ## What changes were proposed in this pull request? change since version in doc ## How was this patch tested? manual Author: Felix Cheung <felixcheun...@hotmail.com> Closes #14939 from felixcheung/rsparkversion2. (cherry picked from commit eac1d0e921345b5d15aa35d8c565140292ab2af3) Signed-off-by: Felix Cheung <felixche...@apache.org> commit c9c36fa0c7bccefde808bdbc32b04e8555356001 Author: Davies Liu <dav...@databricks.com> Date: 2016-09-02T22:10:12Z [SPARK-17230] [SQL] Should not pass optimized query into QueryExecution in DataFrameWriter Some analyzer rules make assumptions about logical plans, and the optimizer may break those assumptions, so we should not pass an optimized query plan into QueryExecution (where it will be analyzed again); otherwise we may hit some weird bugs. For example, we have a rule for decimal calculation that promotes the precision before binary operations, using PromotePrecision as a placeholder to indicate that this rule should not apply twice. But an optimizer rule will remove this placeholder, which breaks the assumption; the rule is then applied twice and produces a wrong result. Ideally, we should make all the analyzer rules idempotent, but that may require lots of effort to double-check them one by one (and may not be easy). An easier approach is to never feed an optimized plan into the Analyzer. This PR fixes the case of RunnableCommand: its plans would be optimized and, during execution, the passed `query` would also be passed into QueryExecution again. This PR makes these `query` plans not part of the children, so they will not be optimized and analyzed again. 
Right now, we do not know whether a logical plan has already been optimized; we could introduce a flag for that and make sure an optimized logical plan will not be analyzed again. Added regression tests. Author: Davies Liu <dav...@databricks.com> Closes #14797 from davies/fix_writer. (cherry picked from commit ed9c884dcf925500ceb388b06b33bd2c95cd2ada) Signed-off-by: Davies Liu <davies....@gmail.com> commit a3930c3b9afa9f7eba2a5c8b8f279ca38e348e9b Author: Sameer Agarwal <samee...@cs.berkeley.edu> Date: 2016-09-02T22:16:16Z [SPARK-16334] Reusing same dictionary column for decoding consecutive row groups shouldn't throw an error This patch fixes a bug in the vectorized parquet reader that's caused by re-using the same dictionary column vector while reading consecutive row groups. Specifically, this issue manifests for a certain distribution of dictionary/plain encoded data while we read/populate the underlying bit packed dictionary data into a column-vector based data structure. Manually tested on datasets provided by the community. Thanks to Chris Perluss and Keith Kraus for their invaluable help in tracking down this issue! Author: Sameer Agarwal <samee...@cs.berkeley.edu> Closes #14941 from sameeragarwal/parquet-exception-2. (cherry picked from commit a2c9acb0e54b2e38cb8ee6431f1ea0e0b4cd959a) Signed-off-by: Davies Liu <davies....@gmail.com> commit b8f65dad7be22231e982aaec3bbd69dbeacc20da Author: Davies Liu <davies....@gmail.com> Date: 2016-09-02T22:40:02Z Fix build commit c0ea7707127c92ecb51794b96ea40d7cdb28b168 Author: Davies Liu <davies....@gmail.com> Date: 2016-09-02T23:05:37Z Revert "[SPARK-16334] Reusing same dictionary column for decoding consecutive row groups shouldn't throw an error" This reverts commit a3930c3b9afa9f7eba2a5c8b8f279ca38e348e9b. commit 12a2e2a5ab5db12f39a7b591e914d52058e1581b Author: Junyang Qian <junya...@databricks.com> Date: 2016-09-03T04:11:57Z [SPARKR][MINOR] Fix docs for sparkR.session and count ## What changes were proposed in this pull request? 
This PR tries to add some more explanation to `sparkR.session`. It also modifies doc for `count` so when grouped in one doc, the description doesn't confuse users. ## How was this patch tested? Manual test. ![screen shot 2016-09-02 at 1 21 36 pm](https://cloud.githubusercontent.com/assets/15318264/18217198/409613ac-7110-11e6-8dae-cb0c8df557bf.png) Author: Junyang Qian <junya...@databricks.com> Closes #14942 from junyangq/fixSparkRSessionDoc. (cherry picked from commit d2fde6b72c4aede2e7edb4a7e6653fb1e7b19924) Signed-off-by: Shivaram Venkataraman <shiva...@cs.berkeley.edu> commit 949544d017ab25b43b683cd5c1e6783d87bfce45 Author: CodingCat <zhunans...@gmail.com> Date: 2016-09-03T09:03:40Z [SPARK-17347][SQL][EXAMPLES] Encoder in Dataset example has incorrect type ## What changes were proposed in this pull request? We propose to fix the Encoder type in the Dataset example ## How was this patch tested? The PR will be tested with the current unit test cases Author: CodingCat <zhunans...@gmail.com> Closes #14901 from CodingCat/SPARK-17347. (cherry picked from commit 97da41039b2b8fa7f93caf213ae45b9973925995) Signed-off-by: Sean Owen <so...@cloudera.com> commit 196d62eae05be0d87a20776fa07208b7ea2ddc90 Author: Sandeep Singh <sand...@techaddict.me> Date: 2016-09-03T14:35:19Z [MINOR][SQL] Not dropping all necessary tables ## What changes were proposed in this pull request? was not dropping table `parquet_t3` ## How was this patch tested? tested `LogicalPlanToSQLSuite` locally Author: Sandeep Singh <sand...@techaddict.me> Closes #13767 from techaddict/minor-8. (cherry picked from commit a8a35b39b92fc9000eaac102c67c66be30b05e54) Signed-off-by: Sean Owen <so...@cloudera.com> commit a7f5e7066f935d58d702a3e86b85aa175291d0fc Author: gatorsmile <gatorsm...@gmail.com> Date: 2016-08-10T08:25:01Z [SPARK-16959][SQL] Rebuild Table Comment when Retrieving Metadata from Hive Metastore ### What changes were proposed in this pull request? 
The `comment` in `CatalogTable` returned from Hive is always empty. We store it in a table property when creating a table; however, when we try to retrieve the table metadata from the Hive metastore, we do not rebuild it, so the `comment` is always empty. This PR fixes the issue. ### How was this patch tested? Fixed the test case to verify the change. Author: gatorsmile <gatorsm...@gmail.com> Closes #14550 from gatorsmile/tableComment. (cherry picked from commit bdd537164dcfeec5e9c51d54791ef16997ff2597) Signed-off-by: Wenchen Fan <wenc...@databricks.com> commit 3500dbc9bcce243b6656f308ee4941de0350d198 Author: Wenchen Fan <wenc...@databricks.com> Date: 2016-07-26T10:46:12Z [SPARK-16663][SQL] desc table should be consistent between data source and hive serde tables Currently there are 2 inconsistencies: 1. For a data source table, we only print partition names; for a Hive table, we also print the partition schema. After this PR, we will always print the schema. 2. If a column doesn't have a comment, a data source table will print an empty string while a Hive table will print null. After this PR, we will always print null. new test in `HiveDDLSuite` Author: Wenchen Fan <wenc...@databricks.com> Closes #14302 from cloud-fan/minor3. (cherry picked from commit a2abb583caaec9a2cecd5d65b05d172fc096c125) Signed-off-by: Wenchen Fan <wenc...@databricks.com> commit 704215d3055bad7957d1d6da1a1a526c0d27d37d Author: Herman van Hovell <hvanhov...@databricks.com> Date: 2016-09-03T17:02:20Z [SPARK-17335][SQL] Fix ArrayType and MapType CatalogString. ## What changes were proposed in this pull request? The `catalogString` for `ArrayType` and `MapType` currently calls the `simpleString` method on its children. This is a problem when the child is a struct: the `struct.simpleString` implementation truncates the number of fields it shows (25 at most). This breaks the generation of a proper `catalogString`, and has been shown to cause errors while writing to Hive. 
This PR fixes this by providing proper `catalogString` implementations for `ArrayData` or `MapData`. ## How was this patch tested? Added testing for `catalogString` to `DataTypeSuite`. Author: Herman van Hovell <hvanhov...@databricks.com> Closes #14938 from hvanhovell/SPARK-17335. (cherry picked from commit c2a1576c230697f56f282b6388c79835377e0f2f) Signed-off-by: Herman van Hovell <hvanhov...@databricks.com> commit e387c8ba86f89115eb2eabac070c215f451c5f0f Author: gatorsmile <gatorsm...@gmail.com> Date: 2016-09-05T03:17:37Z [SPARK-17391][TEST][2.0] Fix Two Test Failures After Backport ### What changes were proposed in this pull request? In the latest branch 2.0, we have two test case failure due to backport. - test("ALTER VIEW AS should keep the previous table properties, comment, create_time, etc.") - test("SPARK-6212: The EXPLAIN output of CTAS only shows the analyzed plan") ### How was this patch tested? N/A Author: gatorsmile <gatorsm...@gmail.com> Closes #14951 from gatorsmile/fixTestFailure. commit f92d87455214005e60b2d58aa814aaabd2ac9495 Author: gatorsmile <gatorsm...@gmail.com> Date: 2016-09-06T02:45:54Z [SPARK-17353][SPARK-16943][SPARK-16942][BACKPORT-2.0][SQL] Fix multiple bugs in CREATE TABLE LIKE command ### What changes were proposed in this pull request? This PR is to backport https://github.com/apache/spark/pull/14531. The existing `CREATE TABLE LIKE` command has multiple issues: - The generated table is non-empty when the source table is a data source table. The major reason is the data source table is using the table property `path` to store the location of table contents. Currently, we keep it unchanged. Thus, we still create the same table with the same location. - The table type of the generated table is `EXTERNAL` when the source table is an external Hive Serde table. Currently, we explicitly set it to `MANAGED`, but Hive is checking the table property `EXTERNAL` to decide whether the table is `EXTERNAL` or not. 
(See https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1407-L1408) Thus, the created table is still `EXTERNAL`. - When the source table is a `VIEW`, the metadata of the generated table contains the original view text and view original text. So far, this does not break anything, but it could cause something wrong in Hive. (For example, https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1405-L1406) - The issue regarding the table `comment`. To follow what Hive does, the table comment should be cleaned, but the column comments should be still kept. - The `INDEX` table is not supported. Thus, we should throw an exception in this case. - `owner` should not be retained. `ToHiveTable` set it [here](https://github.com/apache/spark/blob/e679bc3c1cd418ef0025d2ecbc547c9660cac433/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L793) no matter which value we set in `CatalogTable`. We set it to an empty string for avoiding the confusing output in Explain. - Add a support for temp tables - Like Hive, we should not copy the table properties from the source table to the created table, especially for the statistics-related properties, which could be wrong in the created table. - `unsupportedFeatures` should not be copied from the source table. The created table does not have these unsupported features. - When the type of source table is a view, the target table is using the default format of data source tables: `spark.sql.sources.default`. This PR is to fix the above issues. ### How was this patch tested? Improve the test coverage by adding more test cases Author: gatorsmile <gatorsm...@gmail.com> Closes #14946 from gatorsmile/createTableLike20. 
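The property-copying problem in the CREATE TABLE LIKE fix above can be illustrated with a small sketch: when cloning table metadata, the contents location and statistics-related properties of the source must not be carried over, otherwise the clone points at the old data and carries stale statistics. This is a toy illustration, not Spark's actual implementation; the `spark.sql.statistics.` prefix is an assumed example of a statistics-related property (only `path` is named as such in the PR description):

```python
def properties_for_clone(source_props):
    """Drop the location and statistics properties when cloning metadata."""
    return {
        k: v for k, v in source_props.items()
        if k != "path" and not k.startswith("spark.sql.statistics.")
    }

src = {
    "path": "/warehouse/src_table",        # copying this makes the clone non-empty
    "spark.sql.statistics.numRows": "42",  # stale for the (empty) clone
    "serialization.format": "1",           # safe to copy
}
print(properties_for_clone(src))
```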
commit 7b1aa2153bc6c8b753dba0710fe7b5d031158a34 Author: Sean Zhong <seanzh...@databricks.com> Date: 2016-09-06T02:50:07Z [SPARK-17369][SQL] MetastoreRelation toJSON throws AssertException due to missing otherCopyArgs ## What changes were proposed in this pull request? `TreeNode.toJSON` requires a subclass to explicitly override otherCopyArgs to include currying construction arguments; otherwise it reports an AssertException saying that the count of construction argument values doesn't match the count of construction argument names. The class `MetastoreRelation` has a currying construction parameter `client: HiveClient`, but Spark forgets to add it to the list of otherCopyArgs. ## How was this patch tested? Unit tests. Author: Sean Zhong <seanzh...@databricks.com> Closes #14928 from clockfly/metastore_relation_toJSON. (cherry picked from commit afb3d5d301d004fd748ad305b3d72066af4ebb6c) Signed-off-by: Wenchen Fan <wenc...@databricks.com> commit dd27530c7a1f4670a8e28be37c81952eca456752 Author: Yadong Qi <qiyadong2...@gmail.com> Date: 2016-09-06T02:57:21Z [SPARK-17358][SQL] Cached table (parquet/orc) should be shared between beelines ## What changes were proposed in this pull request? A cached table (parquet/orc) couldn't be shared between beelines, because the `sameResult` method used by `CacheManager` always returns false (the `sparkSession` instances are different) when comparing two `HadoopFsRelation`s from different beelines. So we make `sparkSession` a curried parameter. ## How was this patch tested? 
Beeline1 ``` 1: jdbc:hive2://localhost:10000> CACHE TABLE src_pqt; +---------+--+ | Result | +---------+--+ +---------+--+ No rows selected (5.143 seconds) 1: jdbc:hive2://localhost:10000> EXPLAIN SELECT * FROM src_pqt; +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+ | plan | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+ | == Physical Plan == InMemoryTableScan [key#49, value#50] +- InMemoryRelation [key#49, value#50], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `src_pqt` +- *FileScan parquet default.src_pqt[key#0,value#1] Batched: true, Format: ParquetFormat, InputPaths: hdfs://199.0.0.1:9000/qiyadong/src_pqt, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<key:int,value:string> | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+ ``` Beeline2 ``` 0: jdbc:hive2://localhost:10000> EXPLAIN SELECT * FROM src_pqt; 
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+ | plan | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+ | == Physical Plan == InMemoryTableScan [key#68, value#69] +- InMemoryRelation [key#68, value#69], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `src_pqt` +- *FileScan parquet default.src_pqt[key#0,value#1] Batched: true, Format: ParquetFormat, InputPaths: hdfs://199.0.0.1:9000/qiyadong/src_pqt, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<key:int,value:string> | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+ ``` Author: Yadong Qi <qiyadong2...@gmail.com> Closes #14913 from watermen/SPARK-17358. 
(cherry picked from commit 64e826f91eabb1a22d3d163d71fbb7b6d2185f25) Signed-off-by: Wenchen Fan <wenc...@databricks.com> commit f56b70fec2d31fd062320bb328c320e4eca72f1d Author: Yin Huai <yh...@databricks.com> Date: 2016-09-06T04:13:28Z Revert "[SPARK-17369][SQL] MetastoreRelation toJSON throws AssertException due to missing otherCopyArgs" This reverts commit 7b1aa2153bc6c8b753dba0710fe7b5d031158a34. commit 286ccd6ba9e3927e8d445c2f56b6f1f5c77e11df Author: Sean Zhong <seanzh...@databricks.com> Date: 2016-09-06T07:42:52Z [SPARK-17369][SQL][2.0] MetastoreRelation toJSON throws AssertException due to missing otherCopyArgs backport https://github.com/apache/spark/pull/14928 to 2.0 ## What changes were proposed in this pull request? `TreeNode.toJSON` requires a subclass to explicitly override otherCopyArgs to include currying construction arguments; otherwise it reports an AssertException saying that the count of construction argument values doesn't match the count of construction argument names. The class `MetastoreRelation` has a currying construction parameter `client: HiveClient`, but Spark forgets to add it to the list of otherCopyArgs. ## How was this patch tested? Unit tests. Author: Sean Zhong <seanzh...@databricks.com> Closes #14968 from clockfly/metastore_toJSON_fix_for_spark_2.0. ----