GitHub user marymwu opened a pull request: https://github.com/apache/spark/pull/21759
sfas ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/marymwu/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21759.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21759 ---- commit dcf36ad54598118408c1425e81aa6552f42328c8 Author: Dongjoon Hyun <dongjoon@...> Date: 2016-05-03T13:02:04Z [SPARK-15057][GRAPHX] Remove stale TODO comment for making `enum` in GraphGenerators This PR removes a stale TODO comment in `GraphGenerators.scala` Just comment removed. Author: Dongjoon Hyun <dongj...@apache.org> Closes #12839 from dongjoon-hyun/SPARK-15057. (cherry picked from commit 46965cd014fd4ba68bdec15156ec9bcc27d9b217) Signed-off-by: Reynold Xin <r...@databricks.com> commit 1dc30f189ac30f070068ca5f60b7b4c85f2adc9e Author: Bryan Cutler <cutlerb@...> Date: 2016-05-19T02:48:36Z [DOC][MINOR] ml.feature Scala and Python API sync I reviewed Scala and Python APIs for ml.feature and corrected discrepancies. Built docs locally, ran style checks Author: Bryan Cutler <cutl...@gmail.com> Closes #13159 from BryanCutler/ml.feature-api-sync. 
(cherry picked from commit b1bc5ebdd52ed12aea3fdc7b8f2fa2d00ea09c6b) Signed-off-by: Reynold Xin <r...@databricks.com> commit 642f00980f1de13a0f6d1dc8bc7ed5b0547f3a9d Author: Zheng RuiFeng <ruifengz@...> Date: 2016-05-15T14:59:49Z [MINOR] Fix Typos 1,Rename matrix args in BreezeUtil to upper to match the doc 2,Fix several typos in ML and SQL manual tests Author: Zheng RuiFeng <ruife...@foxmail.com> Closes #13078 from zhengruifeng/fix_ann. (cherry picked from commit c7efc56c7b6fc99c005b35c335716ff676856c6c) Signed-off-by: Reynold Xin <r...@databricks.com> commit 2126fb0c2b2bb8ac4c5338df15182fcf8713fb2f Author: Sandeep Singh <sandeep@...> Date: 2016-05-19T09:44:26Z [CORE][MINOR] Remove redundant set master in OutputCommitCoordinatorIntegrationSuite Remove redundant set master in OutputCommitCoordinatorIntegrationSuite, as we are already setting it in SparkContext below on line 43. existing tests Author: Sandeep Singh <sand...@techaddict.me> Closes #13168 from techaddict/minor-1. (cherry picked from commit 3facca5152e685d9c7da96bff5102169740a4a06) Signed-off-by: Reynold Xin <r...@databricks.com> commit 1fc0f95eb8abbb9cc8ede2139670e493e6939317 Author: Andrew Or <andrew@...> Date: 2016-05-20T05:40:03Z [HOTFIX] Test compilation error from 52b967f commit dd0c7fb39cac44e8f0d73f9884fd1582c25e9cf4 Author: Reynold Xin <rxin@...> Date: 2016-05-20T05:46:08Z Revert "[HOTFIX] Test compilation error from 52b967f" This reverts commit 1fc0f95eb8abbb9cc8ede2139670e493e6939317. commit f8d0177c31d43eab59a7535945f3dfa24e906273 Author: Davies Liu <davies.liu@...> Date: 2016-05-18T23:02:52Z Revert "[SPARK-15392][SQL] fix default value of size estimation of logical plan" This reverts commit fc29b896dae08b957ed15fa681b46162600a4050. 
(cherry picked from commit 84b23453ddb0a97e3d81306de0a5dcb64f88bdd0) Signed-off-by: Reynold Xin <r...@databricks.com> commit 2ef645724a7f229309a87c5053b0fbdf45d06f52 Author: Takuya UESHIN <ueshin@...> Date: 2016-05-20T05:55:44Z [SPARK-15313][SQL] EmbedSerializerInFilter rule should keep exprIds of output of surrounded SerializeFromObject. ## What changes were proposed in this pull request? The following code: ``` val ds = Seq(("a", 1), ("b", 2), ("c", 3)).toDS() ds.filter(_._1 == "b").select(expr("_1").as[String]).foreach(println(_)) ``` throws an Exception: ``` org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: _1#420 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) ... Cause: java.lang.RuntimeException: Couldn't find _1#420 in [_1#416,_2#417] at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:94) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) ... ``` This is because `EmbedSerializerInFilter` rule drops the `exprId`s of output of surrounded `SerializeFromObject`. 
The analyzed and optimized plans of the above example are as follows: ``` == Analyzed Logical Plan == _1: string Project [_1#420] +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, scala.Tuple2]._1, true) AS _1#420,input[0, scala.Tuple2]._2 AS _2#421] +- Filter <function1>.apply +- DeserializeToObject newInstance(class scala.Tuple2), obj#419: scala.Tuple2 +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]] == Optimized Logical Plan == !Project [_1#420] +- Filter <function1>.apply +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]] ``` This PR fixes `EmbedSerializerInFilter` rule to keep `exprId`s of output of surrounded `SerializeFromObject`. The plans after this patch are as follows: ``` == Analyzed Logical Plan == _1: string Project [_1#420] +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, scala.Tuple2]._1, true) AS _1#420,input[0, scala.Tuple2]._2 AS _2#421] +- Filter <function1>.apply +- DeserializeToObject newInstance(class scala.Tuple2), obj#419: scala.Tuple2 +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]] == Optimized Logical Plan == Project [_1#416] +- Filter <function1>.apply +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]] ``` ## How was this patch tested? Existing tests and I added a test to check if `filter and then select` works. Author: Takuya UESHIN <ues...@happy-camper.st> Closes #13096 from ueshin/issues/SPARK-15313. (cherry picked from commit d5e1c5acde95158db38448526c8afad4a6d21dc2) Signed-off-by: Reynold Xin <r...@databricks.com> commit 612866473503cbf4f025ae9678cef0f75a94aba8 Author: Andrew Or <andrew@...> Date: 2016-05-20T05:55:29Z [HOTFIX] Add back intended change from SPARK-15392 This was accidentally reverted in f8d0177. 
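The binding failure in the SPARK-15313 commit above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Catalyst code: `Attr` and `bind` are hypothetical names, and the only behavior modeled is that a reference is resolved against its child's output by expression ID rather than by column name. The IDs are the ones from the plans quoted in that commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attr:
    """Hypothetical stand-in for an attribute: a column name plus a unique expression ID."""
    name: str
    expr_id: int

    def __str__(self):
        return f"{self.name}#{self.expr_id}"


def bind(ref, child_output):
    """Return the ordinal of `ref` in `child_output`, matching on expr_id only."""
    for ordinal, attr in enumerate(child_output):
        if attr.expr_id == ref.expr_id:
            return ordinal
    outputs = ",".join(str(a) for a in child_output)
    raise RuntimeError(f"Couldn't find {ref} in [{outputs}]")


# Output of the LocalRelation in the quoted plans.
relation_output = [Attr("_1", 416), Attr("_2", 417)]

# Unpatched optimized plan: Project refers to _1#420, which nothing below it produces.
try:
    bind(Attr("_1", 420), relation_output)
    message = None
except RuntimeError as err:
    message = str(err)

# Patched optimized plan: Project refers to _1#416, which the relation does produce.
ordinal = bind(Attr("_1", 416), relation_output)
```

The same column name `_1` appears in both cases; only the expression ID decides whether the lookup succeeds, which is why an optimizer rule that rewrites a subtree has to keep the IDs its parent refers to.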
commit 47feebd13dca730c7769bcdc64a0ecc5b6c6c563
Author: Lianhui Wang <lianhuiwang09@...>
Date: 2016-05-20T06:03:59Z

[SPARK-15335][SQL] Implement TRUNCATE TABLE Command

## What changes were proposed in this pull request?

Hive supports the TRUNCATE TABLE command. See the link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

Below is the related Hive JIRA: https://issues.apache.org/jira/browse/HIVE-446

This PR implements such a command for truncating a table, excluding column truncation (HIVE-4005).

## How was this patch tested?

Added a test case.

Author: Lianhui Wang <lianhuiwan...@gmail.com>
Closes #13170 from lianhuiwang/truncate.
(cherry picked from commit 09a00510c4759ff87abb0b2fdf1630ddf36ca12c)
Signed-off-by: Reynold Xin <r...@databricks.com>

commit 8fb087772d7e226a188e2f3298abb603fd3909ed
Author: dding3 <dingding@...>
Date: 2016-05-09T08:43:07Z

[SPARK-15172][ML] Explicitly tell user initial coefficients is ignored when size mismatch happened in LogisticRegression

## What changes were proposed in this pull request?

Explicitly tell the user that the initial coefficients are ignored if their size doesn't match the expected size in LogisticRegression.

## How was this patch tested?

local build

Author: dding3 <dingd...@dingding-ubuntu.sh.intel.com>
Closes #12948 from dding3/master.
(cherry picked from commit a78fbfa619a13421b294328b80c82510ca7efed0)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

commit e4e3e9867e3aba6f3c32bc2c2d060bc681d829c9
Author: wm...@hotmail.com <wm624@...>
Date: 2016-05-20T06:21:17Z

[SPARK-15363][ML][EXAMPLE] Example code shouldn't use VectorImplicits._, asML/fromML

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

In this DataFrame example, we use VectorImplicits._, which is a private API. Since the Vectors object has a public API, we use Vectors.fromML instead of the implicits.

## How was this patch tested?

(Please explain how this patch was tested. E.g.
unit tests, integration tests, manual tests) Manually run the example. Author: wm...@hotmail.com <wm...@hotmail.com> Closes #13213 from wangmiao1981/ml. (cherry picked from commit 4c7a6b385c79f4de07a89495afce4f8e73b06086) Signed-off-by: Xiangrui Meng <m...@databricks.com> commit 539dfa205dacea72188642f15773a30a99f8e8ac Author: Zheng RuiFeng <ruifengz@...> Date: 2016-05-20T06:26:11Z [SPARK-15398][ML] Update the warning message to recommend ML usage ## What changes were proposed in this pull request? MLlib are not recommended to use, and some methods are even deprecated. Update the warning message to recommend ML usage. ``` def showWarning() { System.err.println( """WARN: This is a naive implementation of Logistic Regression and is given as an example! |Please use either org.apache.spark.mllib.classification.LogisticRegressionWithSGD or |org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS |for more conventional use. """.stripMargin) } ``` To ``` def showWarning() { System.err.println( """WARN: This is a naive implementation of Logistic Regression and is given as an example! |Please use org.apache.spark.ml.classification.LogisticRegression |for more conventional use. """.stripMargin) } ``` ## How was this patch tested? local build Author: Zheng RuiFeng <ruife...@foxmail.com> Closes #13190 from zhengruifeng/update_recd. (cherry picked from commit 47a2940da97caa55bbb8bb8ec1d51c9f6d5041c6) Signed-off-by: Xiangrui Meng <m...@databricks.com> commit 5f73f627f966926ac477663642903f175cad54d0 Author: sethah <seth.hendrickson16@...> Date: 2016-05-20T06:29:37Z [SPARK-15394][ML][DOCS] User guide typos and grammar audit ## What changes were proposed in this pull request? Correct some typos and incorrectly worded sentences. ## How was this patch tested? Doc changes only. Note that many of these changes were identified by whomfire01 Author: sethah <seth.hendrickso...@gmail.com> Closes #13180 from sethah/ml_guide_audit. 
(cherry picked from commit 5e203505f1a092e5849ebd01d9ff9e4fc6cdc34a)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

commit 9963fd4398d7ef6c632fc9851ef64bd71a87aa12
Author: Yanbo Liang <ybliang8@...>
Date: 2016-05-20T06:35:20Z

[SPARK-15339][ML] ML 2.0 QA: Scala APIs and code audit for regression

## What changes were proposed in this pull request?

* `GeneralizedLinearRegression` API docs enhancement.
* The default value of `GeneralizedLinearRegression` `linkPredictionCol` is left unset rather than set to empty. This is consistent with other similar params such as `weightCol`.
* Make some methods more private.
* Fix a minor bug in LinearRegression.
* Fix some other issues.

## How was this patch tested?

Existing tests.

Author: Yanbo Liang <yblia...@gmail.com>
Closes #13129 from yanboliang/spark-15339.
(cherry picked from commit c94b34ebbf4c6ce353c899c571beb34e8db98917)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

commit 4d13348f861fd391c64433a1691c1b7f33a36db1
Author: gatorsmile <gatorsmile@...>
Date: 2016-05-20T06:38:25Z

[SPARK-15367][SQL] Add refreshTable back

#### What changes were proposed in this pull request?

`refreshTable` was a method in `HiveContext`. It was deleted accidentally while we were migrating the APIs. This PR adds it back to `HiveContext`.

In addition, in `SparkSession`, we put it under the catalog namespace (`SparkSession.catalog.refreshTable`).

#### How was this patch tested?

Changed the existing test cases to use the function `refreshTable`. Also added a test case for `refreshTable` in `hivecontext-compatibility`.

Author: gatorsmile <gatorsm...@gmail.com>
Closes #13156 from gatorsmile/refreshTable.
(cherry picked from commit 39fd469078271aa12f3163606000e06e382d35dc)
Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 4e25d6e8ce9ce88a58fc0ea0e00cc7b68370a62d
Author: Andrew Or <andrew@...>
Date: 2016-05-20T06:43:01Z

[SPARK-15421][SQL] Validate DDL property values

## What changes were proposed in this pull request?
When we parse DDLs involving table or database properties, we need to validate the values. E.g. if we alter a database's property without providing a value:

```
ALTER DATABASE my_db SET DBPROPERTIES('some_key')
```

Then we'll ignore it with Hive, but override the property with the in-memory catalog. Inconsistencies like these arise because we don't validate the property values.

In such cases, we should throw exceptions instead.

## How was this patch tested?

`DDLCommandSuite`

Author: Andrew Or <and...@databricks.com>
Closes #13205 from andrewor14/ddl-prop-values.
(cherry picked from commit 257375019266ab9e3c320e33026318cc31f58ada)
Signed-off-by: Andrew Or <and...@databricks.com>

commit 53c09f065fac9cabe479cd1f205810230eda110d
Author: Andrew Or <andrew@...>
Date: 2016-05-20T06:44:10Z

[SPARK-15417][SQL][PYTHON] PySpark shell always uses in-memory catalog

## What changes were proposed in this pull request?

There is no way to use the Hive catalog in `pyspark-shell`. This is because we used to create a `SparkContext` before calling `SparkSession.enableHiveSupport().getOrCreate()`, which just gets the existing `SparkContext` instead of creating a new one. As a result, `spark.sql.catalogImplementation` was never propagated.

## How was this patch tested?

Manual.

Author: Andrew Or <and...@databricks.com>
Closes #13203 from andrewor14/fix-pyspark-shell.
(cherry picked from commit c32b1b162e7e5ecc5c823f79ba9f23cbd1407dbf)
Signed-off-by: Andrew Or <and...@databricks.com>

commit 1346f3cd6cf78c940f646bb2b808ae3b22f936b3
Author: Liang-Chi Hsieh <simonh@...>
Date: 2016-05-20T11:40:13Z

[SPARK-15444][PYSPARK][ML][HOTFIX] Default value mismatch of param linkPredictionCol for GeneralizedLinearRegression

## What changes were proposed in this pull request?

The default value of param linkPredictionCol for GeneralizedLinearRegression differs between PySpark and Scala. This is because of a default-value conflict between #13106 and #13129. It causes ml.tests to fail.

## How was this patch tested?
Existing tests. Author: Liang-Chi Hsieh <sim...@tw.ibm.com> Closes #13220 from viirya/hotfix-regresstion. (cherry picked from commit 4e739331187f2acdd84a5e65857edb62e58a0f8f) Signed-off-by: Nick Pentreath <ni...@za.ibm.com> commit 93f9f928e45988371d7d86f080b4e9971c03fbc9 Author: WeichenXu <weichenxu123@...> Date: 2016-05-20T13:17:19Z [SPARK-15203][DEPLOY] The spark daemon shell script error, daemon process start successfully but script output fail message ## What changes were proposed in this pull request? fix the bug: The spark daemon shell script error, daemon process start successfully but script output fail message ## How was this patch tested? existing test. Author: WeichenXu <weichenxu...@outlook.com> Closes #13172 from WeichenXu123/fix-spark-15203. (cherry picked from commit a3ceb875c64421ced8e52db6d8e51aec9b758e3e) Signed-off-by: Sean Owen <so...@cloudera.com> commit 0066d35cc909361460fa99f3791453741dfd707e Author: Yanbo Liang <ybliang8@...> Date: 2016-05-20T16:30:20Z [SPARK-15222][SPARKR][ML] SparkR ML examples update in 2.0 ## What changes were proposed in this pull request? Update example code in examples/src/main/r/ml.R to reflect the new algorithms. * spark.glm and glm * spark.survreg * spark.naiveBayes * spark.kmeans ## How was this patch tested? Offline test. Author: Yanbo Liang <yblia...@gmail.com> Closes #13000 from yanboliang/spark-15222. (cherry picked from commit 9a9c6f5c22248c5a891e9d3b788ff12b6b4718b2) Signed-off-by: Xiangrui Meng <m...@databricks.com> commit 78c8825bd4b5b86596ccf260c15bf97a9689b6ac Author: Takuya UESHIN <ueshin@...> Date: 2016-05-20T16:34:55Z [SPARK-15308][SQL] RowEncoder should preserve nested column name. ## What changes were proposed in this pull request? 
The following code generates wrong schema: ``` val schema = new StructType().add( "struct", new StructType() .add("i", IntegerType, nullable = false) .add( "s", new StructType().add("int", IntegerType, nullable = false), nullable = false), nullable = false) val ds = sqlContext.range(10).map(l => Row(l, Row(l)))(RowEncoder(schema)) ds.printSchema() ``` This should print as follows: ``` root |-- struct: struct (nullable = false) | |-- i: integer (nullable = false) | |-- s: struct (nullable = false) | | |-- int: integer (nullable = false) ``` but the result is: ``` root |-- struct: struct (nullable = false) | |-- col1: integer (nullable = false) | |-- col2: struct (nullable = false) | | |-- col1: integer (nullable = false) ``` This PR fixes `RowEncoder` to preserve nested column name. ## How was this patch tested? Existing tests and I added a test to check if `RowEncoder` preserves nested column name. Author: Takuya UESHIN <ues...@happy-camper.st> Closes #13090 from ueshin/issues/SPARK-15308. (cherry picked from commit d2e1aa97ef5bf7cfffc777a178f44ab8fa775266) Signed-off-by: Reynold Xin <r...@databricks.com> commit a879e7c32e41326387e0754095a5f14d781e1cf1 Author: Reynold Xin <rxin@...> Date: 2016-05-20T16:36:14Z [SPARK-15435][SQL] Append Command to all commands ## What changes were proposed in this pull request? We started this convention to append Command suffix to all SQL commands. However, not all commands follow that convention. This patch adds Command suffix to all RunnableCommands. ## How was this patch tested? Updated test cases to reflect the renames. Author: Reynold Xin <r...@databricks.com> Closes #13215 from rxin/SPARK-15435. 
(cherry picked from commit e8adc552df80af413e1d31b020489612d13a8770) Signed-off-by: Reynold Xin <r...@databricks.com> commit 0dd3bdc2738a8ddaa69c471b2f31fd6f3d41ce46 Author: Takuya UESHIN <ueshin@...> Date: 2016-05-20T16:38:34Z [SPARK-15400][SQL] CreateNamedStruct and CreateNamedStructUnsafe should preserve metadata of value expressions if it is NamedExpression. ## What changes were proposed in this pull request? `CreateNamedStruct` and `CreateNamedStructUnsafe` should preserve metadata of value expressions if it is `NamedExpression` like `CreateStruct` or `CreateStructUnsafe` are doing. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ues...@happy-camper.st> Closes #13193 from ueshin/issues/SPARK-15400. (cherry picked from commit 2cbe96e64d5f84474b2eb59bed9ce3ab543d8aff) Signed-off-by: Reynold Xin <r...@databricks.com> commit 96e41dc6a5d8ad44f0756255e15452affabb079b Author: wm...@hotmail.com <wm624@...> Date: 2016-05-20T17:27:41Z [SPARK-15360][SPARK-SUBMIT] Should print spark-submit usage when no arguments is specified (Please fill in changes proposed in this fix) In 2.0, ./bin/spark-submit doesn't print out usage, but it raises an exception. In this PR, an exception handling is added in the Main.java when the exception is thrown. In the handling code, if there is no additional argument, it prints out usage. (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Manually tested. ./bin/spark-submit Usage: spark-submit [options] <app jar | python file> [app arguments] Usage: spark-submit --kill [submission ID] --master [spark://...] Usage: spark-submit --status [submission ID] --master [spark://...] Usage: spark-submit run-example [options] example-class [example args] Options: --master MASTER_URL spark://host:port, mesos://host:port, yarn, or local. 
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client). --class CLASS_NAME Your application's main class (for Java / Scala apps). --name NAME A name of your application. --jars JARS Comma-separated list of local jars to include on the driver and executor classpaths. --packages Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths. Will search the local maven repo, then maven central and any additional remote repositories given by --repositories. The format for the coordinates should be groupId:artifactId:version. Author: wm...@hotmail.com <wm...@hotmail.com> Closes #13163 from wangmiao1981/submit. (cherry picked from commit fe2fcb48039ac897242e2cfaed31703fa6116db7) Signed-off-by: Marcelo Vanzin <van...@cloudera.com> commit 1677fd31937fde19fdfc8323cdb33b44f3a67204 Author: Davies Liu <davies.liu@...> Date: 2016-05-20T17:44:26Z [HOTFIX] disable stress test commit e99b22080b47e0596254cad4ac6eb28b8c4c69a0 Author: Kousuke Saruta <sarutak@...> Date: 2016-05-20T17:56:35Z [SPARK-15165] [SPARK-15205] [SQL] Introduce place holder for comments in generated code ## What changes were proposed in this pull request? This PR introduce place holder for comment in generated code and the purpose is same for #12939 but much safer. Generated code to be compiled doesn't include actual comments but includes place holder instead. Place holders in generated code will be replaced with actual comments only at the time of logging. Also, this PR can resolve SPARK-15205. ## How was this patch tested? Existing tests. Author: Kousuke Saruta <saru...@oss.nttdata.co.jp> Closes #12979 from sarutak/SPARK-15205. 
(cherry picked from commit 22947cd0213856442025baf653be588c6c707e36)
Signed-off-by: Davies Liu <davies....@gmail.com>

commit 42e63c35a60dc256759cb42260ba1113df05c74b
Author: Shixiong Zhu <shixiong@...>
Date: 2016-05-20T19:38:46Z

[SPARK-15190][SQL] Support using SQLUserDefinedType for case classes

## What changes were proposed in this pull request?

Right now inferring the schema for case classes happens before searching the SQLUserDefinedType annotation, so the SQLUserDefinedType annotation for case classes doesn't work.

This PR simply changes the inferring order to resolve it. I also reenabled the java.math.BigDecimal test and added two tests for `List`.

## How was this patch tested?

`encodeDecodeTest(UDTCaseClass(new java.net.URI("http://spark.apache.org/")), "udt with case class")`

Author: Shixiong Zhu <shixi...@databricks.com>
Closes #12965 from zsxwing/SPARK-15190.
(cherry picked from commit dfa61f7b136ae060bbe04e3c0da1148da41018c7)
Signed-off-by: Michael Armbrust <mich...@databricks.com>

commit 3ed9ba6e1a8b84f69b21c4d17d0edb574de5c176
Author: Michael Armbrust <michael@...>
Date: 2016-05-20T20:00:29Z

[SPARK-10216][SQL] Revert "[] Avoid creating empty files during overwrit…

This reverts commit 8d05a7a from #12855, which seems to have caused regressions when working with empty DataFrames.

Author: Michael Armbrust <mich...@databricks.com>
Closes #13181 from marmbrus/revert12855.
(cherry picked from commit 2ba3ff044900d16d5f6331523526f785864c1e62)
Signed-off-by: Michael Armbrust <mich...@databricks.com>

commit 89e29870bb73dac9dfebd3c3663320e4fdc6c03a
Author: Davies Liu <davies@...>
Date: 2016-05-20T20:21:53Z

[SPARK-15438][SQL] improve explain of whole stage codegen

## What changes were proposed in this pull request?
Currently, the explain output of a query with whole-stage codegen looks like this:

```
>>> df = sqlCtx.range(1000);df2 = sqlCtx.range(1000);df.join(pyspark.sql.functions.broadcast(df2), 'id').explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [id#1L]
:     +- BroadcastHashJoin [id#1L], [id#4L], Inner, BuildRight, None
:        :- Range 0, 1, 4, 1000, [id#1L]
:        +- INPUT
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint]))
   +- WholeStageCodegen
      :  +- Range 0, 1, 4, 1000, [id#4L]
```

The problem is that the plan looks very different from the logical plan, which makes it hard to understand (especially when the logical plan is not shown alongside it).

This PR will change it to:

```
>>> df = sqlCtx.range(1000);df2 = sqlCtx.range(1000);df.join(pyspark.sql.functions.broadcast(df2), 'id').explain()
== Physical Plan ==
*Project [id#0L]
+- *BroadcastHashJoin [id#0L], [id#3L], Inner, BuildRight, None
   :- *Range 0, 1, 4, 1000, [id#0L]
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
      +- *Range 0, 1, 4, 1000, [id#3L]
```

The `*` before an operator means that it is part of whole-stage codegen, which makes the plan easier to read.

## How was this patch tested?

Manually ran some queries and checked the explain output.

Author: Davies Liu <dav...@databricks.com>
Closes #13204 from davies/explain_codegen.
(cherry picked from commit 0e70fd61b4bc92bd744fc44dd3cbe91443207c72)
Signed-off-by: Reynold Xin <r...@databricks.com>

----
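The `*` prefix described in the SPARK-15438 commit above can be mimicked with a small tree printer. This is a minimal sketch and not Spark's implementation: `Node`, its `codegen` flag, and `render` are hypothetical names introduced for illustration, and the operator labels are copied from the new explain output quoted in that commit message.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical physical operator; `codegen` marks it as part of whole-stage codegen."""
    label: str
    codegen: bool = False
    children: List["Node"] = field(default_factory=list)


def render(node, indent="", connector=""):
    """Return the plan as a list of lines, prefixing codegen operators with `*`."""
    star = "*" if node.codegen else ""
    lines = [indent + connector + star + node.label]
    # Children of the root start at column 0; deeper levels extend the indent.
    if connector == "":
        child_indent = indent
    elif connector == ":- ":
        child_indent = indent + ":  "
    else:
        child_indent = indent + "   "
    for i, child in enumerate(node.children):
        last = i == len(node.children) - 1
        lines.extend(render(child, child_indent, "+- " if last else ":- "))
    return lines


plan = Node("Project [id#0L]", True, [
    Node("BroadcastHashJoin [id#0L], [id#3L], Inner, BuildRight, None", True, [
        Node("Range 0, 1, 4, 1000, [id#0L]", True),
        Node("BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))", False, [
            Node("Range 0, 1, 4, 1000, [id#3L]", True),
        ]),
    ]),
])

lines = render(plan)
print("\n".join(lines))
```

Marking each operator inline keeps the physical tree the same shape as the logical one, instead of wrapping subtrees in separate `WholeStageCodegen` nodes.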