GitHub user jagadeesanas2 opened a pull request: https://github.com/apache/spark/pull/15728
[SPARK-18133][branch-2.0][Examples][ML] Python ML Pipeline Example has syntax errors

## What changes were proposed in this pull request?

[Fix][branch-2.0] In Python 3 there is only one integer type (int), which mostly behaves like the long type in Python 2. Since Python 3 does not accept the "L" suffix on integer literals, it has been removed from all examples.

## How was this patch tested?

Unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ibmsoe/spark SPARK-18133_2.0

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15728.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #15728

----

commit 191d99692dc4315c371b566e3a9c5b114876ee49 Author: Wenchen Fan <wenc...@databricks.com> Date: 2016-09-01T00:54:59Z [SPARK-17180][SPARK-17309][SPARK-17323][SQL][2.0] create AlterViewAsCommand to handle ALTER VIEW AS ## What changes were proposed in this pull request? Currently we use `CreateViewCommand` to implement ALTER VIEW AS, which has 3 bugs: 1. SPARK-17180: ALTER VIEW AS should alter a temp view if the view name has no database part and the temp view exists. 2. SPARK-17309: ALTER VIEW AS should issue an exception if the view does not exist. 3. SPARK-17323: ALTER VIEW AS should keep the previous table properties, comment, create_time, etc. The root cause is that ALTER VIEW AS is quite different from CREATE VIEW, so we need different code paths to handle them. However, in `CreateViewCommand` there is no way to distinguish ALTER VIEW AS from CREATE VIEW without introducing an extra flag. Instead of doing this, a more natural way is to separate the ALTER VIEW AS logic into a new command. backport https://github.com/apache/spark/pull/14874 to 2.0 ## How was this patch tested? new tests in SQLViewSuite Author: Wenchen Fan <wenc...@databricks.com> Closes #14893 from cloud-fan/minor4. 
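For readers skimming the commit list below: the change proposed in #15728 itself is tiny, and can be sketched as follows (the variable names here are illustrative, not taken from the actual example files):

```python
# Python 2 examples wrote long literals with an "L" suffix (e.g. 10L).
# Python 3 has a single arbitrary-precision int type and rejects that
# suffix as a syntax error, so the fix is simply to drop the "L":
num_iterations = 10   # was written as "10L" in the Python 2 era examples
big = 2 ** 64         # still exact: Python 3 ints have arbitrary precision

print(type(num_iterations).__name__)
print(type(big).__name__)
print(big)
```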
commit 8711b451d727074173748418a47cec210f84f2f7 Author: Junyang Qian <junya...@databricks.com> Date: 2016-09-01T04:28:53Z [SPARKR][MINOR] Fix windowPartitionBy example ## What changes were proposed in this pull request? The usage in the original example is incorrect. This PR fixes it. ## How was this patch tested? Manual test. Author: Junyang Qian <junya...@databricks.com> Closes #14903 from junyangq/SPARKR-FixWindowPartitionByDoc. (cherry picked from commit d008638fbedc857c1adc1dff399d427b8bae848e) Signed-off-by: Shivaram Venkataraman <shiva...@cs.berkeley.edu> commit 6281b74b6965ffcd0600844cea168cbe71ca8248 Author: Shixiong Zhu <shixi...@databricks.com> Date: 2016-09-01T06:25:20Z [SPARK-17318][TESTS] Fix ReplSuite replicating blocks of object with class defined in repl again ## What changes were proposed in this pull request? After digging into the logs, I noticed that the failure occurs because this test starts a local cluster with 2 executors; however, when the SparkContext is created, the executors may still not be up. If one of the executors is not up while the job runs, the blocks won't be replicated. This PR just adds a wait loop before running the job to fix the flaky test. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixi...@databricks.com> Closes #14905 from zsxwing/SPARK-17318-2. (cherry picked from commit 21c0a4fe9d8e21819ba96e7dc2b1f2999d3299ae) Signed-off-by: Shixiong Zhu <shixi...@databricks.com> commit 13bacd7308c42c92f42fbc3ffbee9a13282668a9 Author: Tejas Patil <tej...@fb.com> Date: 2016-09-01T16:49:43Z [SPARK-17271][SQL] Planner adds un-necessary Sort even if child orde… ## What changes were proposed in this pull request? 
Ports https://github.com/apache/spark/pull/14841 and https://github.com/apache/spark/pull/14910 from `master` to `branch-2.0` Jira: https://issues.apache.org/jira/browse/SPARK-17271 The planner is adding an un-needed SORT operation due to a bug in the way `SortOrder` comparison is done at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L253 `SortOrder` needs to be compared semantically because the `Expression`s within two `SortOrder`s can be "semantically equal" but not literally equal objects, e.g. in the case of `sql("SELECT * FROM table1 a JOIN table2 b ON a.col1=b.col1")` Expression in required SortOrder: ``` AttributeReference( name = "col1", dataType = LongType, nullable = false ) (exprId = exprId, qualifier = Some("a") ) ``` Expression in child SortOrder: ``` AttributeReference( name = "col1", dataType = LongType, nullable = false ) (exprId = exprId) ``` Notice that the output column has a qualifier but the child attribute does not; the underlying expression is the same, and hence in this case the child satisfies the required sort order. This PR includes the following changes: - Added a `semanticEquals` method to `SortOrder` so that it can compare underlying child expressions semantically (and not using the default Object.equals) - Fixed `EnsureRequirements` to use semantic comparison of SortOrder ## How was this patch tested? - Added a test case to `PlannerSuite`. Ran the remaining tests in `PlannerSuite` Author: Tejas Patil <tej...@fb.com> Closes #14920 from tejasapatil/SPARK-17271_2.0_port. commit ac22ab0779c8672ba622b90304f05ac44ff83819 Author: Brian Cho <b...@fb.com> Date: 2016-09-01T21:13:17Z [SPARK-16926] [SQL] Remove partition columns from partition metadata. ## What changes were proposed in this pull request? This removes partition columns from the column metadata of partitions to match tables. 
A change introduced in SPARK-14388 removed partition columns from the column metadata of tables, but not for partitions. This causes TableReader to believe that the schema differs between table and partition, and to create an unnecessary conversion object inspector in TableReader. ## How was this patch tested? Existing unit tests. Author: Brian Cho <b...@fb.com> Closes #14515 from dafrista/partition-columns-metadata. (cherry picked from commit 473d78649dec7583bcc4ec24b6f38303c38e81a2) Signed-off-by: Davies Liu <davies....@gmail.com> commit dd377a52203def279b529832b888ef46be6268dc Author: Josh Rosen <joshro...@databricks.com> Date: 2016-09-01T23:45:26Z [SPARK-17355] Workaround for HIVE-14684 / HiveResultSetMetaData.isSigned exception ## What changes were proposed in this pull request? Attempting to use Spark SQL's JDBC data source against the Hive ThriftServer results in a `java.sql.SQLException: Method not supported` exception from `org.apache.hive.jdbc.HiveResultSetMetaData.isSigned`. Here are two user reports of this issue: - https://stackoverflow.com/questions/34067686/spark-1-5-1-not-working-with-hive-jdbc-1-2-0 - https://stackoverflow.com/questions/32195946/method-not-supported-in-spark I have filed [HIVE-14684](https://issues.apache.org/jira/browse/HIVE-14684) to attempt to fix this in Hive by implementing the isSigned method, but in the meantime / for compatibility with older JDBC drivers I think we should add special-case error handling to work around this bug. This patch updates `JDBCRDD`'s `ResultSetMetadata`-to-schema conversion to catch the "Method not supported" exception from Hive and return `isSigned = true`. I believe that this is safe because, as far as I know, Hive does not support unsigned numeric types. ## How was this patch tested? Tested manually against a Spark Thrift Server. Author: Josh Rosen <joshro...@databricks.com> Closes #14911 from JoshRosen/hive-jdbc-workaround. 
(cherry picked from commit 15539e54c2650a164f09c072f8fae934bb0468c9) Signed-off-by: Josh Rosen <joshro...@databricks.com> commit f9463238de1e7ea17da8f258f22e385a0ed4134e Author: Kousuke Saruta <saru...@oss.nttdata.co.jp> Date: 2016-09-02T07:46:15Z [SPARK-17342][WEBUI] Style of event timeline is broken ## What changes were proposed in this pull request? SPARK-15373 (#13158) updated the version of vis.js to 4.16.1. As of 4.0.0, some classes were renamed (e.g. 'timeline' to 'vis-timeline'), but that ticket didn't account for this, so the style is now broken. In this PR, I've restored the style by modifying `timeline-view.css` and `timeline-view.js`. ## How was this patch tested? Manual tests. * Before <img width="1258" alt="2016-09-01 1 38 31" src="https://cloud.githubusercontent.com/assets/4736016/18141311/fddf1bac-6ff3-11e6-935f-28b389073b39.png"> * After <img width="1256" alt="2016-09-01 3 30 19" src="https://cloud.githubusercontent.com/assets/4736016/18141394/49af65dc-6ff4-11e6-8640-70e20300f3c3.png"> Author: Kousuke Saruta <saru...@oss.nttdata.co.jp> Closes #14900 from sarutak/SPARK-17342. (cherry picked from commit 2ab8dbddaa31e4491b52eb0e495660ebbebfdb9e) Signed-off-by: Sean Owen <so...@cloudera.com> commit 171bdfd963b5dda85ddf5e72b72471fdaaaf2fe3 Author: wm...@hotmail.com <wm...@hotmail.com> Date: 2016-09-02T08:47:17Z [SPARK-16883][SPARKR] SQL decimal type is not properly cast to number when collecting SparkDataFrame ## What changes were proposed in this pull request? registerTempTable(createDataFrame(iris), "iris") str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y from iris limit 5"))) 'data.frame': 5 obs. of 2 variables: $ x: num 1 1 1 1 1 $ y:List of 5 ..$ : num 2 ..$ : num 2 ..$ : num 2 ..$ : num 2 ..$ : num 2 The problem is that Spark returns the `decimal(10, 0)` col type, instead of `decimal`. 
Thus, `decimal(10, 0)` is not handled correctly. It should be handled as "double". As discussed in the JIRA thread, we have two potential fixes: 1) A Scala-side fix that adds a new case when writing the object back; however, I can't use spark.sql.types._ in Spark core due to dependency issues, and I haven't found a way to do the type case match. 2) A SparkR-side fix: add a helper function to check for special types like `"decimal(10, 0)"` and replace them with `double`, which is a PRIMITIVE type. This helper is generic, so handling for new types can be added in the future. I open this PR to discuss the pros and cons of both approaches. If we want to do the Scala-side fix, we need to find a way to match the case of DecimalType and StructType in Spark Core. ## How was this patch tested? Manual test: > str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y from iris limit 5"))) 'data.frame': 5 obs. of 2 variables: $ x: num 1 1 1 1 1 $ y: num 2 2 2 2 2 R Unit tests Author: wm...@hotmail.com <wm...@hotmail.com> Closes #14613 from wangmiao1981/type. (cherry picked from commit 0f30cdedbdb0d38e8c479efab6bb1c6c376206ff) Signed-off-by: Felix Cheung <felixche...@apache.org> commit d9d10ffb9c2ee2a79257d8827bdc99052d144511 Author: Kousuke Saruta <saru...@oss.nttdata.co.jp> Date: 2016-09-02T09:26:43Z [SPARK-17352][WEBUI] Executor computing time can be a negative number because of a calculation error ## What changes were proposed in this pull request? In StagePage, the executor computing time is calculated, but a calculation error can potentially occur because it is computed by subtracting floating-point numbers. The following capture is an example. <img width="949" alt="capture-timeline" src="https://cloud.githubusercontent.com/assets/4736016/18152359/43f07a28-7030-11e6-8cbd-8e73bf4c4c67.png"> ## How was this patch tested? Manual tests. 
Author: Kousuke Saruta <saru...@oss.nttdata.co.jp> Closes #14908 from sarutak/SPARK-17352. (cherry picked from commit 7ee24dac8e779f6a9bf45371fdc2be83fb679cb2) Signed-off-by: Sean Owen <so...@cloudera.com> commit 91a3cf1365157918f280d60c9b3ffeec4c087b92 Author: gatorsmile <gatorsm...@gmail.com> Date: 2016-09-02T14:31:01Z [SPARK-16935][SQL] Verification of Function-related ExternalCatalog APIs Function-related `HiveExternalCatalog` APIs do not have enough verification logics. After the PR, `HiveExternalCatalog` and `InMemoryCatalog` become consistent in the error handling. For example, below is the exception we got when calling `renameFunction`. ``` 15:13:40.369 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database db1, returning NoSuchObjectException 15:13:40.377 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database db2, returning NoSuchObjectException 15:13:40.739 ERROR DataNucleus.Datastore.Persist: Update of object "org.apache.hadoop.hive.metastore.model.MFunction205629e9" using statement "UPDATE FUNCS SET FUNC_NAME=? WHERE FUNC_ID=?" failed : org.apache.derby.shared.common.error.DerbySQLIntegrityConstraintViolationException: The statement was aborted because it would have caused a duplicate key value in a unique or primary key constraint or unique index identified by 'UNIQUEFUNCTION' defined on 'FUNCS'. at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source) ``` Improved the existing test cases to check whether the messages are right. Author: gatorsmile <gatorsm...@gmail.com> Closes #14521 from gatorsmile/functionChecking. 
(cherry picked from commit 247a4faf06c1dd47a6543c56929cd0182a03e106) Signed-off-by: Wenchen Fan <wenc...@databricks.com> commit 30e5c84939a5169cec1720196e1122fc0759ae2a Author: Jeff Zhang <zjf...@apache.org> Date: 2016-09-02T17:08:14Z [SPARK-17261] [PYSPARK] Using HiveContext after re-creating SparkContext in Spark 2.0 throws "Java.lang.illegalStateException: Cannot call methods on a stopped sparkContext" ## What changes were proposed in this pull request? Set SparkSession._instantiatedContext as None so that we can recreate SparkSession again. ## How was this patch tested? Tested manually using the following command in pyspark shell ``` spark.stop() spark = SparkSession.builder.enableHiveSupport().getOrCreate() spark.sql("show databases").show() ``` Author: Jeff Zhang <zjf...@apache.org> Closes #14857 from zjffdu/SPARK-17261. (cherry picked from commit ea662286561aa9fe321cb0a0e10cdeaf60440b90) Signed-off-by: Davies Liu <davies....@gmail.com> commit 29ac2f62e88ea8e280b474e61cdb2ab0a0d92a94 Author: Felix Cheung <felixcheun...@hotmail.com> Date: 2016-09-02T17:12:10Z [SPARK-17376][SPARKR] Spark version should be available in R ## What changes were proposed in this pull request? Add sparkR.version() API. ``` > sparkR.version() [1] "2.1.0-SNAPSHOT" ``` ## How was this patch tested? manual, unit tests Author: Felix Cheung <felixcheun...@hotmail.com> Closes #14935 from felixcheung/rsparksessionversion. (cherry picked from commit 812333e4336113e44d2c9473bcba1cee4a989d2c) Signed-off-by: Shivaram Venkataraman <shiva...@cs.berkeley.edu> commit d4ae35d02f92df407e54b65c2d6b48388448f031 Author: Felix Cheung <felixcheun...@hotmail.com> Date: 2016-09-02T17:28:37Z [SPARKR][DOC] regexp_extract should doc that it returns empty string when match fails ## What changes were proposed in this pull request? Doc change - see https://issues.apache.org/jira/browse/SPARK-16324 ## How was this patch tested? 
manual check Author: Felix Cheung <felixcheun...@hotmail.com> Closes #14934 from felixcheung/regexpextractdoc. (cherry picked from commit 419eefd811a4e29a73bc309157f150751e478db5) Signed-off-by: Shivaram Venkataraman <shiva...@cs.berkeley.edu> commit 03d9af6043ae443ced004383c996fa8eebf3a1d1 Author: Felix Cheung <felixcheun...@hotmail.com> Date: 2016-09-02T18:08:25Z [SPARK-17376][SPARKR] followup - change since version ## What changes were proposed in this pull request? change since version in doc ## How was this patch tested? manual Author: Felix Cheung <felixcheun...@hotmail.com> Closes #14939 from felixcheung/rsparkversion2. (cherry picked from commit eac1d0e921345b5d15aa35d8c565140292ab2af3) Signed-off-by: Felix Cheung <felixche...@apache.org> commit c9c36fa0c7bccefde808bdbc32b04e8555356001 Author: Davies Liu <dav...@databricks.com> Date: 2016-09-02T22:10:12Z [SPARK-17230] [SQL] Should not pass optimized query into QueryExecution in DataFrameWriter Some analyzer rules make assumptions about logical plans, and the optimizer may break those assumptions, so we should not pass an optimized query plan into QueryExecution (where it will be analyzed again); otherwise we may hit some weird bugs. For example, we have a rule for decimal calculation that promotes the precision before binary operations, using PromotePrecision as a placeholder to indicate that this rule should not apply twice. But an optimizer rule will remove this placeholder, which breaks the assumption; the rule is then applied twice and produces a wrong result. Ideally, we should make all the analyzer rules idempotent, but that may require lots of effort to double-check them one by one (and may not be easy). An easier approach is to never feed an optimized plan into the Analyzer. This PR fixes the case of RunnableCommand: its plans would be optimized and, during execution, the passed `query` would also be passed into QueryExecution again. This PR makes these `query` plans not part of the children, so they will not be optimized and analyzed again. 
Right now, we do not know whether a logical plan has already been optimized; we could introduce a flag for that and make sure an optimized logical plan will not be analyzed again. Added regression tests. Author: Davies Liu <dav...@databricks.com> Closes #14797 from davies/fix_writer. (cherry picked from commit ed9c884dcf925500ceb388b06b33bd2c95cd2ada) Signed-off-by: Davies Liu <davies....@gmail.com> commit a3930c3b9afa9f7eba2a5c8b8f279ca38e348e9b Author: Sameer Agarwal <samee...@cs.berkeley.edu> Date: 2016-09-02T22:16:16Z [SPARK-16334] Reusing same dictionary column for decoding consecutive row groups shouldn't throw an error This patch fixes a bug in the vectorized parquet reader that's caused by re-using the same dictionary column vector while reading consecutive row groups. Specifically, this issue manifests for a certain distribution of dictionary/plain encoded data while we read/populate the underlying bit packed dictionary data into a column-vector based data structure. Manually tested on datasets provided by the community. Thanks to Chris Perluss and Keith Kraus for their invaluable help in tracking down this issue! Author: Sameer Agarwal <samee...@cs.berkeley.edu> Closes #14941 from sameeragarwal/parquet-exception-2. (cherry picked from commit a2c9acb0e54b2e38cb8ee6431f1ea0e0b4cd959a) Signed-off-by: Davies Liu <davies....@gmail.com> commit b8f65dad7be22231e982aaec3bbd69dbeacc20da Author: Davies Liu <davies....@gmail.com> Date: 2016-09-02T22:40:02Z Fix build commit c0ea7707127c92ecb51794b96ea40d7cdb28b168 Author: Davies Liu <davies....@gmail.com> Date: 2016-09-02T23:05:37Z Revert "[SPARK-16334] Reusing same dictionary column for decoding consecutive row groups shouldn't throw an error" This reverts commit a3930c3b9afa9f7eba2a5c8b8f279ca38e348e9b. commit 12a2e2a5ab5db12f39a7b591e914d52058e1581b Author: Junyang Qian <junya...@databricks.com> Date: 2016-09-03T04:11:57Z [SPARKR][MINOR] Fix docs for sparkR.session and count ## What changes were proposed in this pull request? 
This PR tries to add some more explanation to `sparkR.session`. It also modifies doc for `count` so when grouped in one doc, the description doesn't confuse users. ## How was this patch tested? Manual test. ![screen shot 2016-09-02 at 1 21 36 pm](https://cloud.githubusercontent.com/assets/15318264/18217198/409613ac-7110-11e6-8dae-cb0c8df557bf.png) Author: Junyang Qian <junya...@databricks.com> Closes #14942 from junyangq/fixSparkRSessionDoc. (cherry picked from commit d2fde6b72c4aede2e7edb4a7e6653fb1e7b19924) Signed-off-by: Shivaram Venkataraman <shiva...@cs.berkeley.edu> commit 949544d017ab25b43b683cd5c1e6783d87bfce45 Author: CodingCat <zhunans...@gmail.com> Date: 2016-09-03T09:03:40Z [SPARK-17347][SQL][EXAMPLES] Encoder in Dataset example has incorrect type ## What changes were proposed in this pull request? We propose to fix the Encoder type in the Dataset example ## How was this patch tested? The PR will be tested with the current unit test cases Author: CodingCat <zhunans...@gmail.com> Closes #14901 from CodingCat/SPARK-17347. (cherry picked from commit 97da41039b2b8fa7f93caf213ae45b9973925995) Signed-off-by: Sean Owen <so...@cloudera.com> commit 196d62eae05be0d87a20776fa07208b7ea2ddc90 Author: Sandeep Singh <sand...@techaddict.me> Date: 2016-09-03T14:35:19Z [MINOR][SQL] Not dropping all necessary tables ## What changes were proposed in this pull request? was not dropping table `parquet_t3` ## How was this patch tested? tested `LogicalPlanToSQLSuite` locally Author: Sandeep Singh <sand...@techaddict.me> Closes #13767 from techaddict/minor-8. (cherry picked from commit a8a35b39b92fc9000eaac102c67c66be30b05e54) Signed-off-by: Sean Owen <so...@cloudera.com> commit a7f5e7066f935d58d702a3e86b85aa175291d0fc Author: gatorsmile <gatorsm...@gmail.com> Date: 2016-08-10T08:25:01Z [SPARK-16959][SQL] Rebuild Table Comment when Retrieving Metadata from Hive Metastore ### What changes were proposed in this pull request? 
The `comment` in `CatalogTable` returned from Hive is always empty. We store it in a table property when creating a table; however, when we try to retrieve the table metadata from the Hive metastore, we do not rebuild it, so the `comment` is always empty. This PR fixes the issue. ### How was this patch tested? Fixed the test case to verify the change. Author: gatorsmile <gatorsm...@gmail.com> Closes #14550 from gatorsmile/tableComment. (cherry picked from commit bdd537164dcfeec5e9c51d54791ef16997ff2597) Signed-off-by: Wenchen Fan <wenc...@databricks.com> commit 3500dbc9bcce243b6656f308ee4941de0350d198 Author: Wenchen Fan <wenc...@databricks.com> Date: 2016-07-26T10:46:12Z [SPARK-16663][SQL] desc table should be consistent between data source and hive serde tables Currently there are 2 inconsistencies: 1. For a data source table, we only print partition names; for a Hive table, we also print the partition schema. After this PR, we will always print the schema. 2. If a column doesn't have a comment, a data source table will print an empty string while a Hive table will print null. After this PR, we will always print null. new test in `HiveDDLSuite` Author: Wenchen Fan <wenc...@databricks.com> Closes #14302 from cloud-fan/minor3. (cherry picked from commit a2abb583caaec9a2cecd5d65b05d172fc096c125) Signed-off-by: Wenchen Fan <wenc...@databricks.com> commit 704215d3055bad7957d1d6da1a1a526c0d27d37d Author: Herman van Hovell <hvanhov...@databricks.com> Date: 2016-09-03T17:02:20Z [SPARK-17335][SQL] Fix ArrayType and MapType CatalogString. ## What changes were proposed in this pull request? The `catalogString` for `ArrayType` and `MapType` currently calls the `simpleString` method on its children. This is a problem when the child is a struct: the `struct.simpleString` implementation truncates the number of fields it shows (25 at most). This breaks the generation of a proper `catalogString`, and has been shown to cause errors while writing to Hive. 
This PR fixes this by providing proper `catalogString` implementations for `ArrayData` or `MapData`. ## How was this patch tested? Added testing for `catalogString` to `DataTypeSuite`. Author: Herman van Hovell <hvanhov...@databricks.com> Closes #14938 from hvanhovell/SPARK-17335. (cherry picked from commit c2a1576c230697f56f282b6388c79835377e0f2f) Signed-off-by: Herman van Hovell <hvanhov...@databricks.com> commit e387c8ba86f89115eb2eabac070c215f451c5f0f Author: gatorsmile <gatorsm...@gmail.com> Date: 2016-09-05T03:17:37Z [SPARK-17391][TEST][2.0] Fix Two Test Failures After Backport ### What changes were proposed in this pull request? In the latest branch 2.0, we have two test case failure due to backport. - test("ALTER VIEW AS should keep the previous table properties, comment, create_time, etc.") - test("SPARK-6212: The EXPLAIN output of CTAS only shows the analyzed plan") ### How was this patch tested? N/A Author: gatorsmile <gatorsm...@gmail.com> Closes #14951 from gatorsmile/fixTestFailure. commit f92d87455214005e60b2d58aa814aaabd2ac9495 Author: gatorsmile <gatorsm...@gmail.com> Date: 2016-09-06T02:45:54Z [SPARK-17353][SPARK-16943][SPARK-16942][BACKPORT-2.0][SQL] Fix multiple bugs in CREATE TABLE LIKE command ### What changes were proposed in this pull request? This PR is to backport https://github.com/apache/spark/pull/14531. The existing `CREATE TABLE LIKE` command has multiple issues: - The generated table is non-empty when the source table is a data source table. The major reason is the data source table is using the table property `path` to store the location of table contents. Currently, we keep it unchanged. Thus, we still create the same table with the same location. - The table type of the generated table is `EXTERNAL` when the source table is an external Hive Serde table. Currently, we explicitly set it to `MANAGED`, but Hive is checking the table property `EXTERNAL` to decide whether the table is `EXTERNAL` or not. 
(See https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1407-L1408) Thus, the created table is still `EXTERNAL`. - When the source table is a `VIEW`, the metadata of the generated table contains the original view text and view original text. So far, this does not break anything, but it could cause something wrong in Hive. (For example, https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1405-L1406) - The issue regarding the table `comment`. To follow what Hive does, the table comment should be cleaned, but the column comments should be still kept. - The `INDEX` table is not supported. Thus, we should throw an exception in this case. - `owner` should not be retained. `ToHiveTable` set it [here](https://github.com/apache/spark/blob/e679bc3c1cd418ef0025d2ecbc547c9660cac433/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L793) no matter which value we set in `CatalogTable`. We set it to an empty string for avoiding the confusing output in Explain. - Add a support for temp tables - Like Hive, we should not copy the table properties from the source table to the created table, especially for the statistics-related properties, which could be wrong in the created table. - `unsupportedFeatures` should not be copied from the source table. The created table does not have these unsupported features. - When the type of source table is a view, the target table is using the default format of data source tables: `spark.sql.sources.default`. This PR is to fix the above issues. ### How was this patch tested? Improve the test coverage by adding more test cases Author: gatorsmile <gatorsm...@gmail.com> Closes #14946 from gatorsmile/createTableLike20. 
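The property-copying problem in the CREATE TABLE LIKE fix above can be illustrated with a small sketch: when cloning table metadata, the contents location and statistics-related properties of the source must not be carried over, otherwise the clone points at the old data and carries stale statistics. This is a toy illustration, not Spark's actual implementation; the `spark.sql.statistics.` prefix is an assumed example of a statistics-related property (only `path` is named as such in the PR description):

```python
def properties_for_clone(source_props):
    """Drop the location and statistics properties when cloning metadata."""
    return {
        k: v for k, v in source_props.items()
        if k != "path" and not k.startswith("spark.sql.statistics.")
    }

src = {
    "path": "/warehouse/src_table",        # copying this makes the clone non-empty
    "spark.sql.statistics.numRows": "42",  # stale for the (empty) clone
    "serialization.format": "1",           # safe to copy
}
print(properties_for_clone(src))
```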
commit 7b1aa2153bc6c8b753dba0710fe7b5d031158a34 Author: Sean Zhong <seanzh...@databricks.com> Date: 2016-09-06T02:50:07Z [SPARK-17369][SQL] MetastoreRelation toJSON throws AssertException due to missing otherCopyArgs ## What changes were proposed in this pull request? `TreeNode.toJSON` requires a subclass to explicitly override otherCopyArgs to include currying construction arguments; otherwise it reports an AssertException saying that the count of construction argument values doesn't match the count of construction argument names. The class `MetastoreRelation` has a currying construction parameter `client: HiveClient`, but Spark forgets to add it to the list of otherCopyArgs. ## How was this patch tested? Unit tests. Author: Sean Zhong <seanzh...@databricks.com> Closes #14928 from clockfly/metastore_relation_toJSON. (cherry picked from commit afb3d5d301d004fd748ad305b3d72066af4ebb6c) Signed-off-by: Wenchen Fan <wenc...@databricks.com> commit dd27530c7a1f4670a8e28be37c81952eca456752 Author: Yadong Qi <qiyadong2...@gmail.com> Date: 2016-09-06T02:57:21Z [SPARK-17358][SQL] Cached table (parquet/orc) should be shared between beelines ## What changes were proposed in this pull request? A cached table (parquet/orc) couldn't be shared between beelines, because the `sameResult` method used by `CacheManager` always returns false (the `sparkSession` instances are different) when comparing two `HadoopFsRelation`s from different beelines. So we make `sparkSession` a curried parameter. ## How was this patch tested? 
Beeline1 ``` 1: jdbc:hive2://localhost:10000> CACHE TABLE src_pqt; +---------+--+ | Result | +---------+--+ +---------+--+ No rows selected (5.143 seconds) 1: jdbc:hive2://localhost:10000> EXPLAIN SELECT * FROM src_pqt; +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+ | plan | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+ | == Physical Plan == InMemoryTableScan [key#49, value#50] +- InMemoryRelation [key#49, value#50], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `src_pqt` +- *FileScan parquet default.src_pqt[key#0,value#1] Batched: true, Format: ParquetFormat, InputPaths: hdfs://199.0.0.1:9000/qiyadong/src_pqt, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<key:int,value:string> | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+ ``` Beeline2 ``` 0: jdbc:hive2://localhost:10000> EXPLAIN SELECT * FROM src_pqt; 
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+ | plan | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+ | == Physical Plan == InMemoryTableScan [key#68, value#69] +- InMemoryRelation [key#68, value#69], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `src_pqt` +- *FileScan parquet default.src_pqt[key#0,value#1] Batched: true, Format: ParquetFormat, InputPaths: hdfs://199.0.0.1:9000/qiyadong/src_pqt, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<key:int,value:string> | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+ ``` Author: Yadong Qi <qiyadong2...@gmail.com> Closes #14913 from watermen/SPARK-17358. 
(cherry picked from commit 64e826f91eabb1a22d3d163d71fbb7b6d2185f25) Signed-off-by: Wenchen Fan <wenc...@databricks.com> commit f56b70fec2d31fd062320bb328c320e4eca72f1d Author: Yin Huai <yh...@databricks.com> Date: 2016-09-06T04:13:28Z Revert "[SPARK-17369][SQL] MetastoreRelation toJSON throws AssertException due to missing otherCopyArgs" This reverts commit 7b1aa2153bc6c8b753dba0710fe7b5d031158a34. commit 286ccd6ba9e3927e8d445c2f56b6f1f5c77e11df Author: Sean Zhong <seanzh...@databricks.com> Date: 2016-09-06T07:42:52Z [SPARK-17369][SQL][2.0] MetastoreRelation toJSON throws AssertException due to missing otherCopyArgs backport https://github.com/apache/spark/pull/14928 to 2.0 ## What changes were proposed in this pull request? `TreeNode.toJSON` requires a subclass to explicitly override otherCopyArgs to include currying construction arguments; otherwise it reports an AssertException saying that the count of construction argument values doesn't match the count of construction argument names. The class `MetastoreRelation` has a currying construction parameter `client: HiveClient`, but Spark forgets to add it to the list of otherCopyArgs. ## How was this patch tested? Unit tests. Author: Sean Zhong <seanzh...@databricks.com> Closes #14968 from clockfly/metastore_toJSON_fix_for_spark_2.0. ----