GitHub user laixiaohang opened a pull request: https://github.com/apache/spark/pull/16348
Branch 2.0.4399 ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/laixiaohang/spark branch-2.0.4399 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16348.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16348 ---- commit c9c36fa0c7bccefde808bdbc32b04e8555356001 Author: Davies Liu <dav...@databricks.com> Date: 2016-09-02T22:10:12Z [SPARK-17230] [SQL] Should not pass optimized query into QueryExecution in DataFrameWriter Some analyzer rules make assumptions about logical plans, and the optimizer may break those assumptions, so we should not pass an optimized query plan into QueryExecution (where it will be analyzed again); otherwise we may hit weird bugs. For example, we have a rule for decimal calculation that promotes the precision before binary operations and uses PromotePrecision as a placeholder to indicate that the rule should not be applied twice. But an optimizer rule removes this placeholder, breaking that assumption; the rule is then applied twice and produces wrong results. Ideally, we should make all the analyzer rules idempotent, but that may require lots of effort to double-check them one by one (it may not be easy). An easier approach is to never feed an optimized plan into the Analyzer. This PR fixes the case of RunnableCommand: these commands are optimized, and during execution the passed `query` was fed into QueryExecution again.
This PR makes these `query` plans not part of the children, so they will not be optimized and analyzed again. Right now, we cannot tell whether a logical plan has already been optimized; we could introduce a flag for that and make sure an optimized logical plan is never analyzed again. Added regression tests. Author: Davies Liu <dav...@databricks.com> Closes #14797 from davies/fix_writer. (cherry picked from commit ed9c884dcf925500ceb388b06b33bd2c95cd2ada) Signed-off-by: Davies Liu <davies....@gmail.com> commit a3930c3b9afa9f7eba2a5c8b8f279ca38e348e9b Author: Sameer Agarwal <samee...@cs.berkeley.edu> Date: 2016-09-02T22:16:16Z [SPARK-16334] Reusing same dictionary column for decoding consecutive row groups shouldn't throw an error This patch fixes a bug in the vectorized parquet reader that's caused by re-using the same dictionary column vector while reading consecutive row groups. Specifically, this issue manifests for a certain distribution of dictionary/plain encoded data while we read/populate the underlying bit-packed dictionary data into a column-vector based data structure. Manually tested on datasets provided by the community. Thanks to Chris Perluss and Keith Kraus for their invaluable help in tracking down this issue! Author: Sameer Agarwal <samee...@cs.berkeley.edu> Closes #14941 from sameeragarwal/parquet-exception-2. (cherry picked from commit a2c9acb0e54b2e38cb8ee6431f1ea0e0b4cd959a) Signed-off-by: Davies Liu <davies....@gmail.com> commit b8f65dad7be22231e982aaec3bbd69dbeacc20da Author: Davies Liu <davies....@gmail.com> Date: 2016-09-02T22:40:02Z Fix build commit c0ea7707127c92ecb51794b96ea40d7cdb28b168 Author: Davies Liu <davies....@gmail.com> Date: 2016-09-02T23:05:37Z Revert "[SPARK-16334] Reusing same dictionary column for decoding consecutive row groups shouldn't throw an error" This reverts commit a3930c3b9afa9f7eba2a5c8b8f279ca38e348e9b.
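The core idea of the SPARK-17230 fix above (keep the already-analyzed `query` out of a command's `children`, so tree-wide transforms skip it) can be sketched with a toy plan tree. This is an illustrative model only, not Spark's actual TreeNode/RunnableCommand API:

```scala
// Toy plan tree; names and structure are illustrative, not Spark's real API.
sealed trait Plan {
  def children: Seq[Plan]
  // Apply `rule` to this node after transforming its children.
  def transform(rule: PartialFunction[Plan, Plan]): Plan
}

case class Leaf(name: String) extends Plan {
  def children: Seq[Plan] = Nil
  def transform(rule: PartialFunction[Plan, Plan]): Plan =
    rule.applyOrElse(this, identity[Plan])
}

case class Project(child: Plan) extends Plan {
  def children: Seq[Plan] = Seq(child)
  def transform(rule: PartialFunction[Plan, Plan]): Plan = {
    // An ordinary node: transforms recurse into its children.
    val newChild = child.transform(rule)
    rule.applyOrElse(Project(newChild), identity[Plan])
  }
}

// The embedded query is a constructor argument but deliberately NOT a
// child, so tree-wide transforms (analysis, optimization) never see it.
case class RunCommand(query: Plan) extends Plan {
  def children: Seq[Plan] = Nil
  def transform(rule: PartialFunction[Plan, Plan]): Plan =
    rule.applyOrElse(this, identity[Plan])
}

// Stand-in for an optimizer rule that must not touch the query again.
val optimize: PartialFunction[Plan, Plan] = {
  case Leaf(n) => Leaf(n + "_optimized")
}

val cmd = RunCommand(Leaf("analyzed-query"))
val after = cmd.transform(optimize)
// `after.query` is untouched: the rule never recursed into it.
```

Because `RunCommand.children` is empty, rules recurse past the command without ever re-visiting the embedded, already-analyzed plan, which is the assumption-preserving behavior the commit message describes.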
commit 12a2e2a5ab5db12f39a7b591e914d52058e1581b Author: Junyang Qian <junya...@databricks.com> Date: 2016-09-03T04:11:57Z [SPARKR][MINOR] Fix docs for sparkR.session and count ## What changes were proposed in this pull request? This PR tries to add some more explanation to `sparkR.session`. It also modifies doc for `count` so when grouped in one doc, the description doesn't confuse users. ## How was this patch tested? Manual test. ![screen shot 2016-09-02 at 1 21 36 pm](https://cloud.githubusercontent.com/assets/15318264/18217198/409613ac-7110-11e6-8dae-cb0c8df557bf.png) Author: Junyang Qian <junya...@databricks.com> Closes #14942 from junyangq/fixSparkRSessionDoc. (cherry picked from commit d2fde6b72c4aede2e7edb4a7e6653fb1e7b19924) Signed-off-by: Shivaram Venkataraman <shiva...@cs.berkeley.edu> commit 949544d017ab25b43b683cd5c1e6783d87bfce45 Author: CodingCat <zhunans...@gmail.com> Date: 2016-09-03T09:03:40Z [SPARK-17347][SQL][EXAMPLES] Encoder in Dataset example has incorrect type ## What changes were proposed in this pull request? We propose to fix the Encoder type in the Dataset example ## How was this patch tested? The PR will be tested with the current unit test cases Author: CodingCat <zhunans...@gmail.com> Closes #14901 from CodingCat/SPARK-17347. (cherry picked from commit 97da41039b2b8fa7f93caf213ae45b9973925995) Signed-off-by: Sean Owen <so...@cloudera.com> commit 196d62eae05be0d87a20776fa07208b7ea2ddc90 Author: Sandeep Singh <sand...@techaddict.me> Date: 2016-09-03T14:35:19Z [MINOR][SQL] Not dropping all necessary tables ## What changes were proposed in this pull request? was not dropping table `parquet_t3` ## How was this patch tested? tested `LogicalPlanToSQLSuite` locally Author: Sandeep Singh <sand...@techaddict.me> Closes #13767 from techaddict/minor-8. 
(cherry picked from commit a8a35b39b92fc9000eaac102c67c66be30b05e54) Signed-off-by: Sean Owen <so...@cloudera.com> commit a7f5e7066f935d58d702a3e86b85aa175291d0fc Author: gatorsmile <gatorsm...@gmail.com> Date: 2016-08-10T08:25:01Z [SPARK-16959][SQL] Rebuild Table Comment when Retrieving Metadata from Hive Metastore ### What changes were proposed in this pull request? The `comment` in `CatalogTable` returned from Hive is always empty. We store it in a table property when creating a table. However, when we retrieve the table metadata from the Hive metastore, we do not rebuild it, so the `comment` is always empty. This PR fixes the issue. ### How was this patch tested? Fixed the test case to verify the change. Author: gatorsmile <gatorsm...@gmail.com> Closes #14550 from gatorsmile/tableComment. (cherry picked from commit bdd537164dcfeec5e9c51d54791ef16997ff2597) Signed-off-by: Wenchen Fan <wenc...@databricks.com> commit 3500dbc9bcce243b6656f308ee4941de0350d198 Author: Wenchen Fan <wenc...@databricks.com> Date: 2016-07-26T10:46:12Z [SPARK-16663][SQL] desc table should be consistent between data source and hive serde tables Currently there are two inconsistencies: 1. For a data source table, we only print partition names; for a Hive table, we also print the partition schema. After this PR, we will always print the schema. 2. If a column doesn't have a comment, a data source table prints an empty string while a Hive table prints null. After this PR, we will always print null. New test in `HiveDDLSuite`. Author: Wenchen Fan <wenc...@databricks.com> Closes #14302 from cloud-fan/minor3. (cherry picked from commit a2abb583caaec9a2cecd5d65b05d172fc096c125) Signed-off-by: Wenchen Fan <wenc...@databricks.com> commit 704215d3055bad7957d1d6da1a1a526c0d27d37d Author: Herman van Hovell <hvanhov...@databricks.com> Date: 2016-09-03T17:02:20Z [SPARK-17335][SQL] Fix ArrayType and MapType CatalogString. ## What changes were proposed in this pull request?
The `catalogString` for `ArrayType` and `MapType` currently calls the `simpleString` method on its children. This is a problem when the child is a struct: the `struct.simpleString` implementation truncates the number of fields it shows (25 at most). This breaks the generation of a proper `catalogString`, and has been shown to cause errors while writing to Hive. This PR fixes this by providing proper `catalogString` implementations for `ArrayData` or `MapData`. ## How was this patch tested? Added testing for `catalogString` to `DataTypeSuite`. Author: Herman van Hovell <hvanhov...@databricks.com> Closes #14938 from hvanhovell/SPARK-17335. (cherry picked from commit c2a1576c230697f56f282b6388c79835377e0f2f) Signed-off-by: Herman van Hovell <hvanhov...@databricks.com> commit e387c8ba86f89115eb2eabac070c215f451c5f0f Author: gatorsmile <gatorsm...@gmail.com> Date: 2016-09-05T03:17:37Z [SPARK-17391][TEST][2.0] Fix Two Test Failures After Backport ### What changes were proposed in this pull request? In the latest branch-2.0, we have two test case failures due to the backport: - test("ALTER VIEW AS should keep the previous table properties, comment, create_time, etc.") - test("SPARK-6212: The EXPLAIN output of CTAS only shows the analyzed plan") ### How was this patch tested? N/A Author: gatorsmile <gatorsm...@gmail.com> Closes #14951 from gatorsmile/fixTestFailure. commit f92d87455214005e60b2d58aa814aaabd2ac9495 Author: gatorsmile <gatorsm...@gmail.com> Date: 2016-09-06T02:45:54Z [SPARK-17353][SPARK-16943][SPARK-16942][BACKPORT-2.0][SQL] Fix multiple bugs in CREATE TABLE LIKE command ### What changes were proposed in this pull request? This PR is to backport https://github.com/apache/spark/pull/14531. The existing `CREATE TABLE LIKE` command has multiple issues:
- The generated table is non-empty when the source table is a data source table. The major reason is that the data source table uses the table property `path` to store the location of the table contents.
Currently, we keep it unchanged. Thus, we still create the same table with the same location.
- The table type of the generated table is `EXTERNAL` when the source table is an external Hive Serde table. Currently, we explicitly set it to `MANAGED`, but Hive checks the table property `EXTERNAL` to decide whether the table is `EXTERNAL` or not. (See https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1407-L1408) Thus, the created table is still `EXTERNAL`.
- When the source table is a `VIEW`, the metadata of the generated table contains the original view text and view original text. So far, this does not break anything, but it could cause something wrong in Hive. (For example, https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1405-L1406)
- The issue regarding the table `comment`. To follow what Hive does, the table comment should be cleaned, but the column comments should still be kept.
- The `INDEX` table is not supported. Thus, we should throw an exception in this case.
- `owner` should not be retained. `ToHiveTable` sets it [here](https://github.com/apache/spark/blob/e679bc3c1cd418ef0025d2ecbc547c9660cac433/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L793) no matter which value we set in `CatalogTable`. We set it to an empty string to avoid confusing output in Explain.
- Add support for temp tables.
- Like Hive, we should not copy the table properties from the source table to the created table, especially the statistics-related properties, which could be wrong in the created table.
- `unsupportedFeatures` should not be copied from the source table. The created table does not have these unsupported features.
- When the source table is a view, the target table uses the default format of data source tables: `spark.sql.sources.default`.
This PR is to fix the above issues.
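Most of the fixes listed above amount to sanitizing the source table's metadata before creating the target table. A rough sketch of that sanitization, using a hypothetical, heavily simplified stand-in for Spark's `CatalogTable` (field and property names here are illustrative assumptions, not the real API):

```scala
// Hypothetical, simplified table metadata; not Spark's actual CatalogTable.
case class TableMeta(
    name: String,
    tableType: String,                 // "MANAGED" or "EXTERNAL"
    properties: Map[String, String],   // may hold "path", "EXTERNAL", stats keys
    comment: Option[String],
    viewText: Option[String],
    owner: String)

// Sketch of what CREATE TABLE LIKE must do with the copied metadata.
def createTableLikeMeta(source: TableMeta, target: String): TableMeta =
  source.copy(
    name = target,
    // Force a managed table, regardless of the source's Hive EXTERNAL flag.
    tableType = "MANAGED",
    // Drop the location, the EXTERNAL flag, and statistics properties:
    // the new table needs its own location and fresh stats.
    properties = source.properties.filter { case (k, _) =>
      k != "path" && k != "EXTERNAL" && !k.startsWith("stats.")
    },
    // Following Hive, clear the table comment (column comments are kept
    // elsewhere, on the columns themselves).
    comment = None,
    // Never carry view text over to a real table.
    viewText = None,
    // Owner is filled in by the metastore client, not copied.
    owner = "")

val src = TableMeta(
  name = "t1",
  tableType = "EXTERNAL",
  properties = Map("path" -> "/data/t1", "EXTERNAL" -> "TRUE",
                   "stats.numRows" -> "42", "format" -> "parquet"),
  comment = Some("source comment"),
  viewText = None,
  owner = "alice")

val copied = createTableLikeMeta(src, "t2")
// copied is a MANAGED table named "t2" with no path, stats, comment, or owner
```

The design point is that copying is opt-out: every field that ties the metadata to the source's identity, location, or statistics must be explicitly reset.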
### How was this patch tested? Improved the test coverage by adding more test cases. Author: gatorsmile <gatorsm...@gmail.com> Closes #14946 from gatorsmile/createTableLike20. commit 7b1aa2153bc6c8b753dba0710fe7b5d031158a34 Author: Sean Zhong <seanzh...@databricks.com> Date: 2016-09-06T02:50:07Z [SPARK-17369][SQL] MetastoreRelation toJSON throws AssertException due to missing otherCopyArgs ## What changes were proposed in this pull request? `TreeNode.toJSON` requires a subclass to explicitly override otherCopyArgs to include currying construction arguments; otherwise it throws an AssertException reporting that the count of construction argument values doesn't match the count of construction argument names. The class `MetastoreRelation` has a currying construction parameter `client: HiveClient`, but Spark forgets to add it to the list of otherCopyArgs. ## How was this patch tested? Unit tests. Author: Sean Zhong <seanzh...@databricks.com> Closes #14928 from clockfly/metastore_relation_toJSON. (cherry picked from commit afb3d5d301d004fd748ad305b3d72066af4ebb6c) Signed-off-by: Wenchen Fan <wenc...@databricks.com> commit dd27530c7a1f4670a8e28be37c81952eca456752 Author: Yadong Qi <qiyadong2...@gmail.com> Date: 2016-09-06T02:57:21Z [SPARK-17358][SQL] Cached table(parquet/orc) should be shared between beelines ## What changes were proposed in this pull request? A cached table (parquet/orc) couldn't be shared between beelines, because the `sameResult` method used by `CacheManager` always returns false (the `sparkSession` instances are different) when comparing two `HadoopFsRelation`s from different beelines. So we make `sparkSession` a curried parameter. ## How was this patch tested?
Beeline1
```
1: jdbc:hive2://localhost:10000> CACHE TABLE src_pqt;
+---------+--+
| Result |
+---------+--+
+---------+--+
No rows selected (5.143 seconds)
1: jdbc:hive2://localhost:10000> EXPLAIN SELECT * FROM src_pqt;
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| plan |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| == Physical Plan ==
InMemoryTableScan [key#49, value#50]
   +- InMemoryRelation [key#49, value#50], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `src_pqt`
         +- *FileScan parquet default.src_pqt[key#0,value#1] Batched: true, Format: ParquetFormat, InputPaths: hdfs://199.0.0.1:9000/qiyadong/src_pqt, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<key:int,value:string> |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
```
Beeline2
```
0: jdbc:hive2://localhost:10000> EXPLAIN SELECT * FROM src_pqt;
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| plan |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| == Physical Plan ==
InMemoryTableScan [key#68, value#69]
   +- InMemoryRelation [key#68, value#69], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `src_pqt`
         +- *FileScan parquet default.src_pqt[key#0,value#1] Batched: true, Format: ParquetFormat, InputPaths: hdfs://199.0.0.1:9000/qiyadong/src_pqt, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<key:int,value:string> |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
```
Author: Yadong Qi <qiyadong2...@gmail.com> Closes #14913 from watermen/SPARK-17358.
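The currying trick described in the SPARK-17358 entry above relies on a Scala language detail: compiler-generated case-class equality covers only the first parameter list. A minimal illustration, with made-up class names rather than Spark's actual `HadoopFsRelation`:

```scala
// `session` sits in a second parameter list, so the generated
// equals/hashCode ignore it: two relations over the same files compare
// equal even when they come from different sessions. Names are
// illustrative, not Spark's real classes.
case class RelationKey(paths: Seq[String], format: String)(val session: AnyRef)

val sessionA = new AnyRef
val sessionB = new AnyRef

val r1 = RelationKey(Seq("hdfs://host/src_pqt"), "parquet")(sessionA)
val r2 = RelationKey(Seq("hdfs://host/src_pqt"), "parquet")(sessionB)
// r1 == r2 despite different sessions, so a cache lookup keyed on the
// relation (as in sameResult) can now hit across beeline connections.
```

This is why moving `sparkSession` out of the first parameter list is enough to make `sameResult` session-agnostic without writing a custom equality method.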
(cherry picked from commit 64e826f91eabb1a22d3d163d71fbb7b6d2185f25) Signed-off-by: Wenchen Fan <wenc...@databricks.com> commit f56b70fec2d31fd062320bb328c320e4eca72f1d Author: Yin Huai <yh...@databricks.com> Date: 2016-09-06T04:13:28Z Revert "[SPARK-17369][SQL] MetastoreRelation toJSON throws AssertException due to missing otherCopyArgs" This reverts commit 7b1aa2153bc6c8b753dba0710fe7b5d031158a34. commit 286ccd6ba9e3927e8d445c2f56b6f1f5c77e11df Author: Sean Zhong <seanzh...@databricks.com> Date: 2016-09-06T07:42:52Z [SPARK-17369][SQL][2.0] MetastoreRelation toJSON throws AssertException due to missing otherCopyArgs This backports https://github.com/apache/spark/pull/14928 to 2.0. ## What changes were proposed in this pull request? `TreeNode.toJSON` requires a subclass to explicitly override otherCopyArgs to include currying construction arguments; otherwise it throws an AssertException reporting that the count of construction argument values doesn't match the count of construction argument names. The class `MetastoreRelation` has a currying construction parameter `client: HiveClient`, but Spark forgets to add it to the list of otherCopyArgs. ## How was this patch tested? Unit tests. Author: Sean Zhong <seanzh...@databricks.com> Closes #14968 from clockfly/metastore_toJSON_fix_for_spark_2.0. commit c0f1f536dc75c9a1a932282046718228b95d2f70 Author: Sean Zhong <seanzh...@databricks.com> Date: 2016-09-06T08:05:50Z [SPARK-17356][SQL] Fix out of memory issue when generating JSON for TreeNode ## What changes were proposed in this pull request? The class `org.apache.spark.sql.types.Metadata` is widely used in MLlib to store ML attributes. `Metadata` is commonly stored in an `Alias` expression.
```
case class Alias(child: Expression, name: String)(
    val exprId: ExprId = NamedExpression.newExprId,
    val qualifier: Option[String] = None,
    val explicitMetadata: Option[Metadata] = None,
    override val isGenerated: java.lang.Boolean = false)
```
The `Metadata` can have a big memory footprint since the number of attributes can be large (on the scale of millions). When `toJSON` is called on an `Alias` expression, the `Metadata` is also converted to a big JSON string. If a plan contains many such `Alias` expressions, it may trigger an out-of-memory error when `toJSON` is called, since converting all `Metadata` references to JSON takes a huge amount of memory. With this PR, we skip scanning `Metadata` when doing JSON conversion. For a reproducer of the OOM and an analysis, see JIRA https://issues.apache.org/jira/browse/SPARK-17356. ## How was this patch tested? Existing tests. Author: Sean Zhong <seanzh...@databricks.com> Closes #14915 from clockfly/json_oom. (cherry picked from commit 6f13aa7dfee12b1b301bd10a1050549008ecc67e) Signed-off-by: Wenchen Fan <wenc...@databricks.com> commit 95e44dca1d99ff7904c3c2e174f0f2123062ce3c Author: Davies Liu <dav...@databricks.com> Date: 2016-09-06T17:46:31Z [SPARK-16922] [SPARK-17211] [SQL] make the address of values portable in LongToUnsafeRowMap ## What changes were proposed in this pull request? In LongToUnsafeRowMap, we use the offset of a value as a pointer, stored in an array (also in the page) for chained values. The offset is not portable, because Platform.LONG_ARRAY_OFFSET differs with JVM heap size, so a deserialized LongToUnsafeRowMap can be corrupt. This PR changes it to use a portable address (without Platform.LONG_ARRAY_OFFSET). ## How was this patch tested? Added a test case with randomly generated keys to improve the coverage. This is not a regression test, though; that would require a Spark cluster with at least 32G of heap in the driver or an executor.
Author: Davies Liu <dav...@databricks.com> Closes #14927 from davies/longmap. (cherry picked from commit f7e26d788757f917b32749856bb29feb7b4c2987) Signed-off-by: Davies Liu <davies....@gmail.com> commit 534380484ac5f56bd3e14a8917a24ca6cccf198f Author: Sameer Agarwal <samee...@cs.berkeley.edu> Date: 2016-09-06T17:48:53Z [SPARK-16334] [BACKPORT] Reusing same dictionary column for decoding consecutive row groups shouldn't throw an error ## What changes were proposed in this pull request? Backports https://github.com/apache/spark/pull/14941 in 2.0. This patch fixes a bug in the vectorized parquet reader that's caused by re-using the same dictionary column vector while reading consecutive row groups. Specifically, this issue manifests for a certain distribution of dictionary/plain encoded data while we read/populate the underlying bit packed dictionary data into a column-vector based data structure. Manually tested on datasets provided by the community. Thanks to Chris Perluss and Keith Kraus for their invaluable help in tracking down this issue! Author: Sameer Agarwal <sameeragcs.berkeley.edu> Closes #14941 from sameeragarwal/parquet-exception-2. Author: Sameer Agarwal <samee...@cs.berkeley.edu> Closes #14944 from sameeragarwal/branch-2.0. commit 130a80fd87bd8bb275f59af6d81c2e7dcc9707f9 Author: Adam Roberts <arobe...@uk.ibm.com> Date: 2016-09-06T21:13:25Z [SPARK-17378][BUILD] Upgrade snappy-java to 1.1.2.6 ## What changes were proposed in this pull request? Upgrades the Snappy version to 1.1.2.6 from 1.1.2.4, release notes: https://github.com/xerial/snappy-java/blob/master/Milestone.md mention "Fix a bug in SnappyInputStream when reading compressed data that happened to have the same first byte with the stream magic header (#142)" ## How was this patch tested? Existing unit tests using the latest IBM Java 8 on Intel, Power and Z architectures (little and big-endian) Author: Adam Roberts <arobe...@uk.ibm.com> Closes #14958 from a-roberts/master. 
(cherry picked from commit 6c08dbf683875ff1ba724447e0531f673bcff8ba) Signed-off-by: Sean Owen <so...@cloudera.com> commit 0ae97863781200ec96f89ad98e5d11bb1778fab0 Author: Sandeep Singh <sand...@techaddict.me> Date: 2016-09-06T21:18:28Z [SPARK-17299] TRIM/LTRIM/RTRIM should not strip characters other than spaces ## What changes were proposed in this pull request? TRIM/LTRIM/RTRIM should not strip characters other than spaces; we were trimming all chars smaller than ASCII 0x20 (space). ## How was this patch tested? Fixed existing tests. Author: Sandeep Singh <sand...@techaddict.me> Closes #14924 from techaddict/SPARK-17299. (cherry picked from commit 7775d9f224e22400c6c8c093652a383f4af66ee0) Signed-off-by: Sean Owen <so...@cloudera.com> commit 015751421bc444e350ad15c6f2e8a52f2da5b6e9 Author: Josh Rosen <joshro...@databricks.com> Date: 2016-09-06T22:07:28Z [SPARK-17110] Fix StreamCorruptionException in BlockManager.getRemoteValues() ## What changes were proposed in this pull request? This patch fixes a `java.io.StreamCorruptedException` error affecting remote reads of cached values when certain data types are used. The problem stems from #11801 / SPARK-13990, a patch to have Spark automatically pick the "best" serializer when caching RDDs. If PySpark cached a PythonRDD, then this would be cached as an `RDD[Array[Byte]]` and the automatic serializer selection would pick KryoSerializer for replication and block transfer. However, the `getRemoteValues()` / `getRemoteBytes()` code path did not pass proper class tags in order to enable the same serializer to be used during deserialization, causing Java serialization to be inappropriately used instead of Kryo, leading to the StreamCorruptedException. We already fixed a similar bug in #14311, which dealt with similar issues in block replication. Prior to that patch, it seems that we had no tests to ensure that block replication actually succeeded.
Similarly, prior to this bug fix patch it looks like we had no tests to perform remote reads of cached data, which is why this bug was able to remain latent for so long. This patch addresses the bug by modifying `BlockManager`'s `get()` and `getRemoteValues()` methods to accept ClassTags, allowing the proper class tag to be threaded in the `getOrElseUpdate` code path (which is used by `rdd.iterator`) ## How was this patch tested? Extended the caching tests in `DistributedSuite` to exercise the `getRemoteValues` path, plus manual testing to verify that the PySpark bug reproduction in SPARK-17110 is fixed. Author: Josh Rosen <joshro...@databricks.com> Closes #14952 from JoshRosen/SPARK-17110. (cherry picked from commit 29cfab3f1524c5690be675d24dda0a9a1806d6ff) Signed-off-by: Josh Rosen <joshro...@databricks.com> commit f3cfce09274741cc04bf2f00e873b3b64976b6d5 Author: Shixiong Zhu <shixi...@databricks.com> Date: 2016-09-06T23:49:06Z [SPARK-17316][CORE] Fix the 'ask' type parameter in 'removeExecutor' ## What changes were proposed in this pull request? Fix the 'ask' type parameter in 'removeExecutor' to eliminate a lot of error logs `Cannot cast java.lang.Boolean to scala.runtime.Nothing$` ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixi...@databricks.com> Closes #14983 from zsxwing/SPARK-17316-3. (cherry picked from commit 175b4344112b376cbbbd05265125ed0e1b87d507) Signed-off-by: Shixiong Zhu <shixi...@databricks.com> commit a23d4065c5705b805c69e569ea177167d44b5244 Author: Wenchen Fan <wenc...@databricks.com> Date: 2016-09-06T02:36:00Z [SPARK-17279][SQL] better error message for exceptions during ScalaUDF execution ## What changes were proposed in this pull request? If `ScalaUDF` throws exceptions during executing user code, sometimes it's hard for users to figure out what's wrong, especially when they use Spark shell. 
An example:
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 325.0 failed 4 times, most recent failure: Lost task 12.3 in stage 325.0 (TID 35622, 10.0.207.202): java.lang.NullPointerException
	at line8414e872fb8b42aba390efc153d1611a12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:40)
	at line8414e872fb8b42aba390efc153d1611a12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:40)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	...
```
We should catch these exceptions and rethrow them with a better error message saying that the exception happened in a Scala UDF. This PR also does some cleanup of `ScalaUDF` and adds a unit test suite for it. ## How was this patch tested? The new test suite. Author: Wenchen Fan <wenc...@databricks.com> Closes #14850 from cloud-fan/npe. (cherry picked from commit 8d08f43d09157b98e559c0be6ce6fd571a35e0d1) Signed-off-by: Wenchen Fan <wenc...@databricks.com> commit 796577b43d3df94f5d3a8e4baeb0aa03fbbb3f21 Author: Tathagata Das <tathagata.das1...@gmail.com> Date: 2016-09-07T02:34:11Z [SPARK-17372][SQL][STREAMING] Avoid serialization issues by using Arrays to save file names in FileStreamSource ## What changes were proposed in this pull request? When we create a file stream on a directory that has partitioned subdirs (i.e. dir/x=y/), ListingFileCatalog.allFiles returns the files in the dir as a Seq[String] which internally is a Stream[String]. This is because of this [line](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala#L93), where LinkedHashSet.values.toSeq returns a Stream.
Then when the [FileStreamSource](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L79) filters this Stream[String] to remove the seen files, it creates a new Stream[String], which has a filter function that has a $outer reference to the FileStreamSource (in Scala 2.10). Trying to serialize this Stream[String] causes a NotSerializableException. This happens even if there is just one file in the dir. It's important to note that this behavior is different in Scala 2.11: there is no $outer reference to FileStreamSource, so it does not throw NotSerializableException. However, with a large sequence of files (tested with 10000 files), it throws StackOverflowError. This is because of how the Stream class is implemented: it's basically a linked list, and attempting to serialize a long Stream requires *recursively* walking the linked list, thus resulting in StackOverflowError. In short, across both Scala 2.10 and 2.11, serialization fails when both of the following conditions are true: - file stream defined on a partitioned directory - directory has 10k+ files The right solution is to convert the seq to an array before writing to the log. This PR implements the fix in two ways: - Changed all uses of HDFSMetadataLog to ensure Array is used instead of Seq - Added a `require` in HDFSMetadataLog so that it is never used with type Seq ## How was this patch tested? Added a unit test that ensures the file stream source can handle 10000 files. This test fails in both Scala 2.10 and 2.11, with the different failures indicated above. Author: Tathagata Das <tathagata.das1...@gmail.com> Closes #14987 from tdas/SPARK-17372.
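The essence of the SPARK-17372 fix above (materialize a lazy Seq into an Array before handing it to Java serialization) can be sketched as follows. This is a toy example, not the actual HDFSMetadataLog code:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// A lazily-built Stream, like the one ListingFileCatalog can return.
val files: Seq[String] = Stream.from(0).take(10000).map(i => s"file-$i")

// Materialize to a flat Array before serializing. Serializing the
// Stream itself recurses once per cons cell (risking StackOverflowError
// for long streams) and, on Scala 2.10, can drag in non-serializable
// $outer references captured by filter/map closures.
val flat: Array[String] = files.toArray

val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
oos.writeObject(flat) // safe: an array is serialized iteratively
oos.close()
```

This mirrors the PR's `require`-backed convention: the metadata log only ever sees arrays, never lazy sequences.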
(cherry picked from commit eb1ab88a86ce35f3d6ba03b3a798099fbcf6b3fc) Signed-off-by: Tathagata Das <tathagata.das1...@gmail.com> commit ee6301a88e3b109398cec9bc470b5a88f72654dd Author: Clark Fitzgerald <clarkfi...@gmail.com> Date: 2016-09-07T06:40:37Z [SPARK-16785] R dapply doesn't return array or raw columns Fixed bug in `dapplyCollect` by changing the `compute` function of `worker.R` to explicitly handle raw (binary) vectors. cc shivaram Unit tests Author: Clark Fitzgerald <clarkfi...@gmail.com> Closes #14783 from clarkfitzg/SPARK-16785. (cherry picked from commit 9fccde4ff80fb0fd65a9e90eb3337965e4349de4) Signed-off-by: Shivaram Venkataraman <shiva...@cs.berkeley.edu> commit c8811adaa6b2fb6c5ca31520908d148326ebaf18 Author: Herman van Hovell <hvanhov...@databricks.com> Date: 2016-09-07T08:38:56Z [SPARK-17296][SQL] Simplify parser join processing [BACKPORT 2.0] ## What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/pull/14867 to branch-2.0. It fixes a number of join ordering bugs. ## How was this patch tested? Added tests to `PlanParserSuite`. Author: Herman van Hovell <hvanhov...@databricks.com> Closes #14984 from hvanhovell/SPARK-17296-branch-2.0. commit e6caceb5e141a1665b21d04079a86baca041e453 Author: Srinivasa Reddy Vundela <v...@cloudera.com> Date: 2016-09-07T11:41:03Z [MINOR][SQL] Fixing the typo in unit test ## What changes were proposed in this pull request? Fixing the typo in the unit test of CodeGenerationSuite.scala ## How was this patch tested? Ran the unit test after fixing the typo and it passes Author: Srinivasa Reddy Vundela <v...@cloudera.com> Closes #14989 from vundela/typo_fix. 
(cherry picked from commit 76ad89e9241fb2dece95dd445661dd95ee4ef699) Signed-off-by: Sean Owen <so...@cloudera.com> commit 078ac0e6321aeb72c670a65ec90b9c20ef0a7788 Author: Eric Liang <e...@databricks.com> Date: 2016-09-07T19:33:50Z [SPARK-17370] Shuffle service files not invalidated when a slave is lost ## What changes were proposed in this pull request? DAGScheduler invalidates shuffle files when an executor loss event occurs, but not when the external shuffle service is enabled. This is because when shuffle service is on, the shuffle file lifetime can exceed the executor lifetime. However, it also doesn't invalidate shuffle files when the shuffle service itself is lost (due to whole slave loss). This can cause long hangs when slaves are lost since the file loss is not detected until a subsequent stage attempts to read the shuffle files. The proposed fix is to also invalidate shuffle files when an executor is lost due to a `SlaveLost` event. ## How was this patch tested? Unit tests, also verified on an actual cluster that slave loss invalidates shuffle files immediately as expected. cc mateiz Author: Eric Liang <e...@databricks.com> Closes #14931 from ericl/sc-4439. (cherry picked from commit 649fa4bf1d6fc9271ae56b6891bc93ebf57858d1) Signed-off-by: Josh Rosen <joshro...@databricks.com> commit 067752ce08dc035ee807d20be2202c385f88f01c Author: Marcelo Vanzin <van...@cloudera.com> Date: 2016-09-07T23:43:05Z [SPARK-16533][CORE] - backport driver deadlock fix to 2.0 ## What changes were proposed in this pull request? Backport changes from #14710 and #14925 to 2.0 Author: Marcelo Vanzin <van...@cloudera.com> Author: Angus Gerry <ango...@gmail.com> Closes #14933 from angolon/SPARK-16533-2.0. ----