GitHub user tengpeng opened a pull request:

https://github.com/apache/spark/pull/20729

[SPARK-23578][ML]Add multicolumn support for Binarizer

[SPARK-20542] added an API to Bucketizer that can bin multiple columns. Based on this change, multi-column support is added for Binarizer.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tengpeng/spark Binarizer

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20729.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #20729

----
commit 9ca0f6eaf6744c090cab4ac6720cf11c9b83915e
Author: gatorsmile <gatorsmile@...>
Date: 2018-01-11T13:32:36Z

[SPARK-23000][TEST-HADOOP2.6] Fix Flaky test suite DataSourceWithHiveMetastoreCatalogSuite

## What changes were proposed in this pull request?

The Spark 2.3 branch still failed due to the flaky test suite `DataSourceWithHiveMetastoreCatalogSuite`. https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-sbt-hadoop-2.6/

Although https://github.com/apache/spark/pull/20207 is unable to reproduce it in Spark 2.3, it sounds like the current DB of Spark's Catalog is changed based on the following stacktrace. Thus, we just need to reset it.
```
[info] DataSourceWithHiveMetastoreCatalogSuite:
02:40:39.486 ERROR org.apache.hadoop.hive.ql.parse.CalcitePlanner: org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:14 Table not found 't'
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1594)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1545)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:10077)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10128)
    at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:209)
    at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
    at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:694)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:683)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:272)
    at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:210)
    at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:209)
    at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:255)
    at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:683)
    at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:673)
    at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$9$$anonfun$apply$1$$anonfun$apply$mcV$sp$3.apply$mcV$sp(HiveMetastoreCatalogSuite.scala:185)
    at org.apache.spark.sql.test.SQLTestUtilsBase$class.withTable(SQLTestUtils.scala:273)
    at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite.withTable(HiveMetastoreCatalogSuite.scala:139)
    at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$9$$anonfun$apply$1.apply$mcV$sp(HiveMetastoreCatalogSuite.scala:163)
    at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$9$$anonfun$apply$1.apply(HiveMetastoreCatalogSuite.scala:163)
    at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$9$$anonfun$apply$1.apply(HiveMetastoreCatalogSuite.scala:163)
    at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
    at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
    at org.scalatest.Transformer.apply(Transformer.scala:22)
    at org.scalatest.Transformer.apply(Transformer.scala:20)
    at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
    at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
    at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
    at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
    at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
    at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
    at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
    at org.scalatest.FunSuite.runTest(FunSuite.scala:1560)
    at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
    at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
    at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
    at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
    at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
    at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
    at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
    at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
    at org.scalatest.Suite$class.run(Suite.scala:1147)
    at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
    at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
    at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
    at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
    at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
    at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
    at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
    at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
    at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:31)
    at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
    at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
    at sbt.ForkMain$Run$2.call(ForkMain.java:296)
    at sbt.ForkMain$Run$2.call(ForkMain.java:286)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
```

## How was this patch tested?

N/A

Author: gatorsmile <gatorsm...@gmail.com>

Closes #20218 from gatorsmile/testFixAgain.
(cherry picked from commit 76892bcf2c08efd7e9c5b16d377e623d82fe695e)
Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit f624850fe8acce52240217f376316734a23be00b
Author: gatorsmile <gatorsmile@...>
Date: 2018-01-11T13:33:42Z

[SPARK-19732][FOLLOW-UP] Document behavior changes made in na.fill and fillna

## What changes were proposed in this pull request?

https://github.com/apache/spark/pull/18164 introduces the behavior changes. We need to document them.

## How was this patch tested?

N/A

Author: gatorsmile <gatorsm...@gmail.com>

Closes #20234 from gatorsmile/docBehaviorChange.

(cherry picked from commit b46e58b74c82dac37b7b92284ea3714919c5a886)
Signed-off-by: hyukjinkwon <gurwls...@gmail.com>

commit b94debd2b01b87ef1d2a34d48877e38ade0969e6
Author: Marcelo Vanzin <vanzin@...>
Date: 2018-01-11T18:37:35Z

[SPARK-22994][K8S] Use a single image for all Spark containers.

This change allows a user to submit a Spark application on Kubernetes while having to provide only a single image, instead of one image for each type of container. The image's entry point now takes an extra argument that identifies the process that is being started.

The configuration still allows the user to provide different images for each container type if they so desire.

On top of that, the entry point was simplified a bit to share more code; mainly, the same env variable is used to propagate the user-defined classpath to the different containers.

Aside from being modified to match the new behavior, the 'build-push-docker-images.sh' script was renamed to 'docker-image-tool.sh' to more closely match its purpose; the old name was a little awkward and now also not entirely correct, since there is a single image. It was also moved to 'bin' since it's not necessarily an admin tool.

Docs have been updated to match the new behavior.

Tested locally with minikube.

Author: Marcelo Vanzin <van...@cloudera.com>

Closes #20192 from vanzin/SPARK-22994.
(cherry picked from commit 0b2eefb674151a0af64806728b38d9410da552ec)
Signed-off-by: Marcelo Vanzin <van...@cloudera.com>

commit f891ee3249e04576dd579cbab6f8f1632550e6bd
Author: Jose Torres <jose@...>
Date: 2018-01-11T18:52:12Z

[SPARK-22908] Add kafka source and sink for continuous processing.

## What changes were proposed in this pull request?

Add kafka source and sink for continuous processing. This involves two small changes to the execution engine:

* Bring data reader close() into the normal data reader thread to avoid thread safety issues.
* Fix up the semantics of the RECONFIGURING StreamExecution state. State updates are now atomic, and we don't have to deal with swallowing an exception.

## How was this patch tested?

new unit tests

Author: Jose Torres <j...@databricks.com>

Closes #20096 from jose-torres/continuous-kafka.

(cherry picked from commit 6f7aaed805070d29dcba32e04ca7a1f581fa54b9)
Signed-off-by: Tathagata Das <tathagata.das1...@gmail.com>

commit 2ec302658c98038962c9b7a90fd2cff751a35ffa
Author: Bago Amirbekian <bago@...>
Date: 2018-01-11T21:57:15Z

[SPARK-23046][ML][SPARKR] Have RFormula include VectorSizeHint in pipeline

## What changes were proposed in this pull request?

Including VectorSizeHint in RFormula pipelines will allow them to be applied to streaming dataframes.

## How was this patch tested?

Unit tests.

Author: Bago Amirbekian <b...@databricks.com>

Closes #20238 from MrBago/rFormulaVectorSize.

(cherry picked from commit 186bf8fb2e9ff8a80f3f6bcb5f2a0327fa79a1c9)
Signed-off-by: Joseph K. Bradley <jos...@databricks.com>

commit 964cc2e31b2862bca0bd968b3e9e2cbf8d3ba5ea
Author: Sameer Agarwal <sameerag@...>
Date: 2018-01-11T23:23:10Z

Preparing Spark release v2.3.0-rc1

commit 6bb22961c0c9df1a1f22e9491894895b297f5288
Author: Sameer Agarwal <sameerag@...>
Date: 2018-01-11T23:23:17Z

Preparing development version 2.3.1-SNAPSHOT

commit 55695c7127cb2f357dfdf677cab4d21fc840aa3d
Author: WeichenXu <weichen.xu@...>
Date: 2018-01-12T00:20:30Z

[SPARK-23008][ML] OnehotEncoderEstimator python API

## What changes were proposed in this pull request?

OnehotEncoderEstimator python API.

## How was this patch tested?

doctest

Author: WeichenXu <weichen...@databricks.com>

Closes #20209 from WeichenXu123/ohe_py.

(cherry picked from commit b5042d75c2faa5f15bc1e160d75f06dfdd6eea37)
Signed-off-by: Joseph K. Bradley <jos...@databricks.com>

commit 3ae3e1bb71aa88be1c963b4416986ef679d7c8a2
Author: ho3rexqj <ho3rexqj@...>
Date: 2018-01-12T07:27:00Z

[SPARK-22986][CORE] Use a cache to avoid instantiating multiple instances of broadcast variable values

When resources happen to be constrained on an executor the first time a broadcast variable is instantiated, it is persisted to disk by the BlockManager. Consequently, every subsequent call to TorrentBroadcast::readBroadcastBlock from other instances of that broadcast variable spawns another instance of the underlying value. That is, broadcast variables are spawned once per executor **unless** memory is constrained, in which case every instance of a broadcast variable is provided with a unique copy of the underlying value.

This patch fixes the above by explicitly caching the underlying values using weak references in a ReferenceMap.

Author: ho3rexqj <ho3re...@gmail.com>

Closes #20183 from ho3rexqj/fix/cache-broadcast-values.
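The weak-reference caching described in the SPARK-22986 message above can be illustrated outside Spark. This is a minimal Python sketch under stated assumptions, not Spark's implementation: `BroadcastValue`, `ValueCache`, and `get_or_load` are hypothetical names, and Python's `weakref.WeakValueDictionary` stands in for the `ReferenceMap` the patch mentions.

```python
import weakref


class BroadcastValue:
    """Hypothetical wrapper; plain lists and dicts cannot be weakly referenced."""

    def __init__(self, data):
        self.data = data


class ValueCache:
    """Maps a broadcast id to its value through weak references.

    The cache never keeps a value alive on its own: an entry disappears
    once no caller holds a strong reference to the value any more.
    """

    def __init__(self):
        self._values = weakref.WeakValueDictionary()
        self.loads = 0  # counts how often the expensive load path ran

    def get_or_load(self, broadcast_id, load):
        value = self._values.get(broadcast_id)
        if value is None:
            # Cache miss: rebuild the value once (for example from disk).
            value = load()
            self.loads += 1
            self._values[broadcast_id] = value
        return value


cache = ValueCache()
first = cache.get_or_load(7, lambda: BroadcastValue([1, 2, 3]))
second = cache.get_or_load(7, lambda: BroadcastValue([1, 2, 3]))
```

While `first` is still referenced, the second lookup returns the same object and the load function is not called again, which is the behavior the patch aims for when several instances of one broadcast variable ask for the value.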
(cherry picked from commit cbe7c6fbf9dc2fc422b93b3644c40d449a869eea)
Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit d512d873b3f445845bd113272d7158388427f8a6
Author: WeichenXu <weichen.xu@...>
Date: 2018-01-12T09:27:02Z

[SPARK-23008][ML][FOLLOW-UP] mark OneHotEncoder python API deprecated

## What changes were proposed in this pull request?

mark OneHotEncoder python API deprecated

## How was this patch tested?

N/A

Author: WeichenXu <weichen...@databricks.com>

Closes #20241 from WeichenXu123/mark_ohe_deprecated.

(cherry picked from commit a7d98d53ceaf69cabaecc6c9113f17438c4e61f6)
Signed-off-by: Nick Pentreath <ni...@za.ibm.com>

commit 6152da3893a05b3f8dc0f13895af9be9548e5895
Author: Marco Gaido <marcogaido91@...>
Date: 2018-01-12T10:04:44Z

[SPARK-23025][SQL] Support Null type in scala reflection

## What changes were proposed in this pull request?

Add support for `Null` type in the `schemaFor` method for Scala reflection.

## How was this patch tested?

Added UT

Author: Marco Gaido <marcogaid...@gmail.com>

Closes #20219 from mgaido91/SPARK-23025.

(cherry picked from commit 505086806997b4331d4a8c2fc5e08345d869a23c)
Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit db27a93652780f234f3c5fe750ef07bc5525d177
Author: Dongjoon Hyun <dongjoon@...>
Date: 2018-01-12T18:18:42Z

[MINOR][BUILD] Fix Java linter errors

## What changes were proposed in this pull request?

This PR cleans up the java-lint errors (for v2.3.0-rc1 tag). Hopefully, this will be the final one.

```
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/unsafe/memory/HeapMemoryAllocator.java:[85] (sizes) LineLength: Line is longer than 100 characters (found 101).
[ERROR] src/main/java/org/apache/spark/launcher/InProcessAppHandle.java:[20,8] (imports) UnusedImports: Unused import - java.io.IOException.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java:[41,9] (modifier) ModifierOrder: 'private' modifier out of order with the JLS suggestions.
[ERROR] src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java:[464] (sizes) LineLength: Line is longer than 100 characters (found 102).
```

## How was this patch tested?

Manual.

```
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```

Author: Dongjoon Hyun <dongj...@apache.org>

Closes #20242 from dongjoon-hyun/fix_lint_java_2.3_rc1.

(cherry picked from commit 7bd14cfd40500a0b6462cda647bdbb686a430328)
Signed-off-by: Sameer Agarwal <samee...@apache.org>

commit 02176f4c2f60342068669b215485ffd443346aed
Author: Marco Gaido <marcogaido91@...>
Date: 2018-01-12T19:25:37Z

[SPARK-22975][SS] MetricsReporter should not throw exception when there was no progress reported

## What changes were proposed in this pull request?

`MetricsReporter` assumes that there has been some progress for the query, i.e. `lastProgress` is not null. If this is not true, as can happen under particular conditions, a `NullPointerException` can be thrown.

The PR checks whether there is a `lastProgress` and, if there is none, returns a default value for the metrics.

## How was this patch tested?

added UT

Author: Marco Gaido <marcogaid...@gmail.com>

Closes #20189 from mgaido91/SPARK-22975.

(cherry picked from commit 54277398afbde92a38ba2802f4a7a3e5910533de)
Signed-off-by: Shixiong Zhu <zsxw...@gmail.com>

commit 60bcb4685022c29a6ddcf707b505369687ec7da6
Author: Sameer Agarwal <sameerag@...>
Date: 2018-01-12T23:07:14Z

Revert "[SPARK-22908] Add kafka source and sink for continuous processing."

This reverts commit f891ee3249e04576dd579cbab6f8f1632550e6bd.
commit ca27d9cb5e30b6a50a4c8b7d10ac28f4f51d44ee
Author: hyukjinkwon <gurwls223@...>
Date: 2018-01-13T07:13:44Z

[SPARK-22980][PYTHON][SQL] Clarify the length of each series is of each batch within scalar Pandas UDF

## What changes were proposed in this pull request?

This PR proposes to add a note saying that the length of a scalar Pandas UDF's `Series` is not that of the whole input column but that of the batch.

We are fine for a group map UDF because the usage is different from our typical UDF, but scalar UDFs might cause confusion with the normal UDF.

For example, please consider this example:

```python
from pyspark.sql.functions import pandas_udf, col, lit
from pyspark.sql.types import LongType

df = spark.range(1)
# len(x) is the length of the batch Series, not of the string 'text'
f = pandas_udf(lambda x, y: len(x) + y, LongType())
df.select(f(lit('text'), col('id'))).show()
```

```
+------------------+
|<lambda>(text, id)|
+------------------+
|                 1|
+------------------+
```

```python
from pyspark.sql.functions import udf, col, lit

df = spark.range(1)
# len(x) is the length of the string 'text' for each row
f = udf(lambda x, y: len(x) + y, "long")
df.select(f(lit('text'), col('id'))).show()
```

```
+------------------+
|<lambda>(text, id)|
+------------------+
|                 4|
+------------------+
```

## How was this patch tested?

Manually built the doc and checked the output.

Author: hyukjinkwon <gurwls...@gmail.com>

Closes #20237 from HyukjinKwon/SPARK-22980.

(cherry picked from commit cd9f49a2aed3799964976ead06080a0f7044a0c3)
Signed-off-by: hyukjinkwon <gurwls...@gmail.com>

commit 801ffd799922e1c2751d3331874b88a67da8cf01
Author: Yuming Wang <yumwang@...>
Date: 2018-01-13T16:01:44Z

[SPARK-22870][CORE] Dynamic allocation should allow 0 idle time

## What changes were proposed in this pull request?

This PR makes `0` a valid value for `spark.dynamicAllocation.executorIdleTimeout`.
For details, see the jira description: https://issues.apache.org/jira/browse/SPARK-22870.

## How was this patch tested?

N/A

Author: Yuming Wang <yumw...@ebay.com>
Author: Yuming Wang <wgy...@gmail.com>

Closes #20080 from wangyum/SPARK-22870.
(cherry picked from commit fc6fe8a1d0f161c4788f3db94de49a8669ba3bcc)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit 8d32ed5f281317ba380aa6b8b3f3f041575022cb
Author: xubo245 <601450868@...>
Date: 2018-01-13T18:28:57Z

[SPARK-23036][SQL][TEST] Add withGlobalTempView for testing

## What changes were proposed in this pull request?

Add withGlobalTempView for tests that create a global temp view, like withTempView and withView, and correct some improper usage. Please see the jira.
There are other similar places. I will fix them if the community needs it. Please confirm.

## How was this patch tested?

no new test.

Author: xubo245 <601450...@qq.com>

Closes #20228 from xubo245/DropTempView.

(cherry picked from commit bd4a21b4820c4ebaf750131574a6b2eeea36907e)
Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit 0fc5533e53ad03eb67590ddd231f40c2713150c3
Author: CodingCat <zhunansjtu@...>
Date: 2018-01-13T18:36:32Z

[SPARK-22790][SQL] add a configurable factor to describe HadoopFsRelation's size

## What changes were proposed in this pull request?

As per the discussion in https://github.com/apache/spark/pull/19864#discussion_r156847927, the current HadoopFsRelation size estimate is purely based on the underlying file size, which is not accurate and makes the execution vulnerable to errors like OOM.

Users can enable CBO with the functionalities in https://github.com/apache/spark/pull/19864 to avoid this issue.

This JIRA proposes to add a configurable factor to the sizeInBytes method in the HadoopFsRelation class so that users can mitigate this problem without CBO.

## How was this patch tested?

Existing tests

Author: CodingCat <zhunans...@gmail.com>
Author: Nan Zhu <nan...@uber.com>

Closes #20072 from CodingCat/SPARK-22790.
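The size factor proposed in SPARK-22790 amounts to a simple scaling step. The sketch below is only an illustration in plain Python: `estimated_size_in_bytes` and its `factor` parameter are hypothetical names, and the actual Spark configuration key and implementation are not shown in this message.

```python
def estimated_size_in_bytes(file_size_in_bytes: int, factor: float = 1.0) -> int:
    """Scale the on-disk size by a user-supplied factor (hypothetical helper)."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return int(file_size_in_bytes * factor)


on_disk = 100 * 1024 * 1024  # 100 MiB of files
# With the default factor the estimate equals the file size.
default_estimate = estimated_size_in_bytes(on_disk)
# A factor above 1 tells the planner the data is larger than its files suggest.
scaled_estimate = estimated_size_in_bytes(on_disk, factor=3.0)
```

A larger estimate would make a planner less likely to pick strategies that assume the relation fits in memory, which is how such a factor could mitigate the OOM risk mentioned above without enabling CBO.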
(cherry picked from commit ba891ec993c616dc4249fc786c56ea82ed04a827)
Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit bcd87ae0775d16b7c3b9de0c4f2db36eb3679476
Author: Takeshi Yamamuro <yamamuro@...>
Date: 2018-01-13T21:39:38Z

[SPARK-21213][SQL][FOLLOWUP] Use compatible types for comparisons in compareAndGetNewStats

## What changes were proposed in this pull request?

This PR fixes the code that compares values in `compareAndGetNewStats`.
The test below fails in the current master:

```
val oldStats2 = CatalogStatistics(sizeInBytes = BigInt(Long.MaxValue) * 2)
val newStats5 = CommandUtils.compareAndGetNewStats(
  Some(oldStats2), newTotalSize = BigInt(Long.MaxValue) * 2, None)
assert(newStats5.isEmpty)
```

## How was this patch tested?

Added some tests in `CommandUtilsSuite`.

Author: Takeshi Yamamuro <yamam...@apache.org>

Closes #20245 from maropu/SPARK-21213-FOLLOWUP.

(cherry picked from commit 0066d6f6fa604817468471832968d4339f71c5cb)
Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit 1f4a08b15ab47cf6c3bb08c783497422f30d0709
Author: foxish <ramanathana@...>
Date: 2018-01-14T05:34:28Z

[SPARK-23063][K8S] K8s changes for publishing scripts (and a couple of other misses)

## What changes were proposed in this pull request?

Including the `-Pkubernetes` flag in a few places it was missed.

## How was this patch tested?

checkstyle, mima through manual tests.

Author: foxish <ramanath...@google.com>

Closes #20256 from foxish/SPARK-23063.

(cherry picked from commit c3548d11c3c57e8f2c6ebd9d2d6a3924ddcd3cba)
Signed-off-by: Felix Cheung <felixche...@apache.org>

commit a335a49ce4672b44e5f818145214040a67c722ba
Author: Dongjoon Hyun <dongjoon@...>
Date: 2018-01-14T07:26:12Z

[SPARK-23038][TEST] Update docker/spark-test (JDK/OS)

## What changes were proposed in this pull request?

This PR aims to update the following in `docker/spark-test`.

- JDK7 -> JDK8
  Spark 2.2+ supports JDK8 only.

- Ubuntu 12.04.5 LTS(precise) -> Ubuntu 16.04.3 LTS(xenial)
  The end of life of `precise` was April 28, 2017.

## How was this patch tested?

Manual.

* Master
```
$ cd external/docker
$ ./build
$ export SPARK_HOME=...
$ docker run -v $SPARK_HOME:/opt/spark spark-test-master
CONTAINER_IP=172.17.0.3
...
18/01/11 06:50:25 INFO MasterWebUI: Bound MasterWebUI to 172.17.0.3, and started at http://172.17.0.3:8080
18/01/11 06:50:25 INFO Utils: Successfully started service on port 6066.
18/01/11 06:50:25 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
18/01/11 06:50:25 INFO Master: I have been elected leader! New state: ALIVE
```

* Slave
```
$ docker run -v $SPARK_HOME:/opt/spark spark-test-worker spark://172.17.0.3:7077
CONTAINER_IP=172.17.0.4
...
18/01/11 06:51:54 INFO Worker: Successfully registered with master spark://172.17.0.3:7077
```

After the slave starts, the master will show
```
18/01/11 06:51:54 INFO Master: Registering worker 172.17.0.4:8888 with 4 cores, 1024.0 MB RAM
```

Author: Dongjoon Hyun <dongj...@apache.org>

Closes #20230 from dongjoon-hyun/SPARK-23038.

(cherry picked from commit 7a3d0aad2b89aef54f7dd580397302e9ff984d9d)
Signed-off-by: Felix Cheung <felixche...@apache.org>

commit 0d425c3362dc648d5c85b2b07af1a9df23b6d422
Author: Felix Cheung <felixcheung_m@...>
Date: 2018-01-14T10:43:10Z

[SPARK-23069][DOCS][SPARKR] fix R doc for describe missing text

## What changes were proposed in this pull request?

fix doc truncated

## How was this patch tested?

manually

Author: Felix Cheung <felixcheun...@hotmail.com>

Closes #20263 from felixcheung/r23docfix.

(cherry picked from commit 66738d29c59871b29d26fc3756772b95ef536248)
Signed-off-by: hyukjinkwon <gurwls...@gmail.com>

commit 5fbbd94d509dbbcfa1fe940569049f72ff4a6e89
Author: Takeshi Yamamuro <yamamuro@...>
Date: 2018-01-14T14:26:21Z

[SPARK-23021][SQL] AnalysisBarrier should override innerChildren to print correct explain output

## What changes were proposed in this pull request?
`AnalysisBarrier` in the current master cuts off explain results for parsed logical plans:

```
scala> Seq((1, 1)).toDF("a", "b").groupBy("a").count().sample(0.1).explain(true)
== Parsed Logical Plan ==
Sample 0.0, 0.1, false, -7661439431999668039
+- AnalysisBarrier Aggregate [a#5], [a#5, count(1) AS count#14L]
```

To fix this, `AnalysisBarrier` needs to override `innerChildren`. This PR changes the output to:

```
== Parsed Logical Plan ==
Sample 0.0, 0.1, false, -5086223488015741426
+- AnalysisBarrier
      +- Aggregate [a#5], [a#5, count(1) AS count#14L]
         +- Project [_1#2 AS a#5, _2#3 AS b#6]
            +- LocalRelation [_1#2, _2#3]
```

## How was this patch tested?

Added tests in `DataFrameSuite`.

Author: Takeshi Yamamuro <yamam...@apache.org>

Closes #20247 from maropu/SPARK-23021-2.

(cherry picked from commit 990f05c80347c6eec2ee06823cff587c9ea90b49)
Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit 9051e1a265dc0f1dc19fd27a0127ffa47f3ac245
Author: Sandor Murakozi <smurakozi@...>
Date: 2018-01-14T14:32:35Z

[SPARK-23051][CORE] Fix for broken job description in Spark UI

## What changes were proposed in this pull request?

In 2.2, Spark UI displayed the stage description if the job description was not set. This functionality was broken: the GUI showed no description in this case. In addition, the code uses jobName and jobDescription instead of stageName and stageDescription when JobTableRowData is created.

In this PR the logic producing values for the job rows was modified to find the latest stage attempt for the job and use that as a fallback if the job description was missing.
StageName and stageDescription are also set using values from the stage, and jobName/description is used only as a fallback.

## How was this patch tested?

Manual testing of the UI, using the code in the bug report.

Author: Sandor Murakozi <smurak...@gmail.com>

Closes #20251 from smurakozi/SPARK-23051.
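The fallback order described for SPARK-23051 can be sketched in a few lines of plain Python. This is an illustration only, not the Spark UI code: the `Job` and `StageAttempt` classes and the `job_row_description` function are hypothetical, and treating the highest `(stage_id, attempt_id)` pair as the "latest stage attempt" is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StageAttempt:
    stage_id: int
    attempt_id: int
    name: str
    description: Optional[str] = None


@dataclass
class Job:
    name: str
    description: Optional[str]
    stage_attempts: List[StageAttempt]


def job_row_description(job: Job) -> str:
    """Pick the text shown for a job row, falling back step by step."""
    if job.description:
        return job.description
    if job.stage_attempts:
        # Assumption: "latest" means the highest (stage_id, attempt_id) pair.
        latest = max(job.stage_attempts, key=lambda s: (s.stage_id, s.attempt_id))
        return latest.description or latest.name
    return job.name


stages = [StageAttempt(0, 0, "map at App.scala:10", "load input"),
          StageAttempt(1, 0, "count at App.scala:12", "count rows")]
shown = job_row_description(Job("count at App.scala:12", None, stages))
```

Here the job has no description of its own, so the row shows the description of its latest stage attempt instead of staying empty.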
(cherry picked from commit 60eeecd7760aee6ce2fd207c83ae40054eadaf83)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit 2879236b92b5712b7438b972404375bbf1993df8
Author: guoxiaolong <guo.xiaolong1@...>
Date: 2018-01-14T18:02:49Z

[SPARK-22999][SQL] 'show databases like command' can remove the like keyword

## What changes were proposed in this pull request?

The grammar is currently SHOW DATABASES (LIKE pattern = STRING)?
This change makes the LIKE keyword optional, so it can be omitted when using this command.
This mirrors the SHOW TABLES command, where both SHOW TABLES 'test *' and SHOW TABLES like 'test *' can be used.
Similarly, SHOW DATABASES 'test *' and SHOW DATABASES like 'test *' can be used.

## How was this patch tested?

unit tests
manual tests

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: guoxiaolong <guo.xiaolo...@zte.com.cn>

Closes #20194 from guoxiaolongzte/SPARK-22999.

(cherry picked from commit 42a1a15d739890bdfbb367ef94198b19e98ffcb7)
Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit 30574fd3716dbdf553cfd0f4d33164ab8fbccb77
Author: Takeshi Yamamuro <yamamuro@...>
Date: 2018-01-15T02:55:21Z

[SPARK-23054][SQL] Fix incorrect results of casting UserDefinedType to String

## What changes were proposed in this pull request?
This PR fixes the issue when casting `UserDefinedType`s into strings:

```
>>> from pyspark.ml.classification import MultilayerPerceptronClassifier
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(0.0, Vectors.dense([0.0, 0.0])), (1.0, Vectors.dense([0.0, 1.0]))], ["label", "features"])
>>> df.selectExpr("CAST(features AS STRING)").show(truncate = False)
+-------------------------------------------+
|features                                   |
+-------------------------------------------+
|[6,1,0,0,2800000020,2,0,0,0]               |
|[6,1,0,0,2800000020,2,0,0,3ff0000000000000]|
+-------------------------------------------+
```

The root cause is that `Cast` handles input data as `UserDefinedType.sqlType` (the underlying storage type), so we should pass the data into `UserDefinedType.deserialize` and then call `toString`.
This PR changes the result to:

```
+---------+
|features |
+---------+
|[0.0,0.0]|
|[0.0,1.0]|
+---------+
```

## How was this patch tested?

Added tests in `UserDefinedTypeSuite`.

Author: Takeshi Yamamuro <yamam...@apache.org>

Closes #20246 from maropu/SPARK-23054.

(cherry picked from commit b98ffa4d6dabaf787177d3f14b200fc4b118c7ce)
Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 81b989903af0cdcb6c19e6e8e7bdbac455a2c281
Author: Dongjoon Hyun <dongjoon@...>
Date: 2018-01-15T04:06:56Z

[SPARK-23049][SQL] `spark.sql.files.ignoreCorruptFiles` should work for ORC files

## What changes were proposed in this pull request?

When `spark.sql.files.ignoreCorruptFiles=true`, we should ignore corrupted ORC files.

## How was this patch tested?

Pass the Jenkins with a newly added test case.

Author: Dongjoon Hyun <dongj...@apache.org>

Closes #20240 from dongjoon-hyun/SPARK-23049.
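The behavior that `spark.sql.files.ignoreCorruptFiles=true` is expected to give for ORC files can be pictured with a small reader loop. This is a plain-Python sketch under stated assumptions, not Spark's reader: `read_all`, `fake_reader`, and the file names are hypothetical, and an `IOError` stands in for whatever error a corrupted file raises.

```python
from typing import Callable, Iterable, Iterator, List


def read_all(paths: Iterable[str],
             read_file: Callable[[str], List[str]],
             ignore_corrupt_files: bool = False) -> Iterator[str]:
    """Yield rows from every file, optionally skipping unreadable ones."""
    for path in paths:
        try:
            rows = read_file(path)
        except (IOError, ValueError):
            if not ignore_corrupt_files:
                raise  # default: a corrupted file fails the whole read
            continue  # flag set: skip the corrupted file and move on
        yield from rows


def fake_reader(path: str) -> List[str]:
    # Stand-in for a real file reader; "bad.orc" simulates corruption.
    if path == "bad.orc":
        raise IOError("malformed file: " + path)
    return [path + ":row0", path + ":row1"]


rows = list(read_all(["a.orc", "bad.orc", "b.orc"], fake_reader,
                     ignore_corrupt_files=True))
```

With the flag set, the rows of the two readable files are returned and the corrupted one is skipped; with the flag left at its default, the same call raises the reader's error.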
(cherry picked from commit 9a96bfc8bf021cb4b6c62fac6ce1bcf87affcd43)
Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 188999a3401357399d8d2b30f440d8b0b0795fc5
Author: Takeshi Yamamuro <yamamuro@...>
Date: 2018-01-15T08:26:52Z

[SPARK-23023][SQL] Cast field data to strings in showString

## What changes were proposed in this pull request?

The current `Dataset.showString` prints rows through `RowEncoder` deserializers, like:

```
scala> Seq(Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))).toDF("a").show(false)
+------------------------------------------------------------+
|a                                                           |
+------------------------------------------------------------+
|[WrappedArray(1, 2), WrappedArray(3), WrappedArray(4, 5, 6)]|
+------------------------------------------------------------+
```

This result is incorrect; the correct one is:

```
scala> Seq(Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))).toDF("a").show(false)
+------------------------+
|a                       |
+------------------------+
|[[1, 2], [3], [4, 5, 6]]|
+------------------------+
```

So, this PR fixes the code in `showString` to cast field data to strings before printing.

## How was this patch tested?

Added tests in `DataFrameSuite`.

Author: Takeshi Yamamuro <yamam...@apache.org>

Closes #20214 from maropu/SPARK-23023.

(cherry picked from commit b59808385cfe24ce768e5b3098b9034e64b99a5a)
Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 3491ca4fb5c2e3fecd727f7a31b8efbe74032bcc
Author: Yuming Wang <yumwang@...>
Date: 2018-01-15T13:49:34Z

[SPARK-19550][BUILD][FOLLOW-UP] Remove MaxPermSize for sql module

## What changes were proposed in this pull request?

Remove `MaxPermSize` for `sql` module

## How was this patch tested?

Manually tested.

Author: Yuming Wang <yumw...@ebay.com>

Closes #20268 from wangyum/SPARK-19550-MaxPermSize.
(cherry picked from commit a38c887ac093d7cf343d807515147d87ca931ce7)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit c6a3b9297f0246cfc02a57ec099ca23db90f343f
Author: gatorsmile <gatorsmile@...>
Date: 2018-01-15T14:32:38Z

[SPARK-23070] Bump previousSparkVersion in MimaBuild.scala to be 2.2.0

## What changes were proposed in this pull request?

Bump previousSparkVersion in MimaBuild.scala to be 2.2.0 and add the missing exclusions to `v23excludes` in `MimaExcludes`. No item can be un-excluded in `v23excludes`.

## How was this patch tested?

The existing tests.

Author: gatorsmile <gatorsm...@gmail.com>

Closes #20264 from gatorsmile/bump22.

(cherry picked from commit bd08a9e7af4137bddca638e627ad2ae531bce20f)
Signed-off-by: gatorsmile <gatorsm...@gmail.com>

----

---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org