GitHub user damnMeddlingKid opened a pull request: https://github.com/apache/spark/pull/11330
[SPARK-13410][SQL] Support unionAll for DataFrames with UDT columns.

## What changes were proposed in this pull request?

This PR adds equality operators to UDT classes so that they can be correctly tested for dataType equality during union operations. This was previously causing `AnalysisException: u"unresolved operator 'Union;"` when trying to unionAll two DataFrames with UDT columns, as below.

```
from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
from pyspark.sql import types

schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)])
a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema)
b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema)
c = a.unionAll(b)
```

## How was this patch tested?

Tested using two unit tests in sql/test.py and the DataFrameSuite.

Additional information here: https://issues.apache.org/jira/browse/SPARK-13410

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/damnMeddlingKid/spark udt-union-all

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11330.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #11330

----

commit 6f0f1d9e04a8db47e2f6f8fcfe9dea9de0f633da
Author: Cheng Lian <l...@databricks.com>
Date: 2016-01-25T23:05:05Z

[SPARK-12934][SQL] Count-min sketch serialization

This PR adds serialization support for `CountMinSketch`. A version number is added to version the serialized binary format.

Author: Cheng Lian <l...@databricks.com>
Closes #10893 from liancheng/cms-serialization.
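For background, the update/estimate idea behind the Count-Min Sketch being serialized above can be sketched in pure Python. This is a simplified model, not Spark's implementation; the depth/width sizing and the MD5-based hashing are illustrative choices.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counting in sub-linear space: estimates are
    never below the true count, and over-count only on hash collisions."""

    def __init__(self, depth=5, width=256):
        self.depth = depth          # number of independent hash rows
        self.width = width          # counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash per row by salting the item with the row number.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # The minimum over rows bounds the over-count from collisions.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))
```

By construction the estimate is an upper bound on the true count; a version number like the one mentioned above would simply prefix the counter table in the serialized binary format.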
commit be375fcbd200fb0e210b8edcfceb5a1bcdbba94b
Author: Wenchen Fan <wenc...@databricks.com>
Date: 2016-01-26T00:23:59Z

[SPARK-12879] [SQL] improve the unsafe row writing framework

As we begin to use the unsafe row writing framework (`BufferHolder` and `UnsafeRowWriter`) in more and more places (`UnsafeProjection`, `UnsafeRowParquetRecordReader`, `GenerateColumnAccessor`, etc.), we should add more docs to it and make it easier to use.

This PR abstracts the technique used in `UnsafeRowParquetRecordReader`: avoid unnecessary operations as much as possible. For example, do not always point the row to the buffer at the end; we only need to update the size of the row. If all fields are of primitive type, we can even skip the row size update. We can then apply this technique to more places easily.

A local benchmark shows `UnsafeProjection` is up to 1.7x faster after this PR:

**old version**
```
Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz
unsafe projection:           Avg Time(ms)    Avg Rate(M/s)    Relative Rate
-------------------------------------------------------------------------------
single long                       2616.04          102.61           1.00 X
single nullable long              3032.54           88.52           0.86 X
primitive types                   9121.05           29.43           0.29 X
nullable primitive types         12410.60           21.63           0.21 X
```

**new version**
```
Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz
unsafe projection:           Avg Time(ms)    Avg Rate(M/s)    Relative Rate
-------------------------------------------------------------------------------
single long                       1533.34          175.07           1.00 X
single nullable long              2306.73          116.37           0.66 X
primitive types                   8403.93           31.94           0.18 X
nullable primitive types         12448.39           21.56           0.12 X
```

For a single non-nullable long (the best case), we get about a 1.7x speed-up. Even when it's nullable, we still get a 1.3x speed-up. The other cases see less of a boost, since the saved operations are only a small proportion of the whole process. The benchmark code is included in this PR.

Author: Wenchen Fan <wenc...@databricks.com>
Closes #10809 from cloud-fan/unsafe-projection.
commit 109061f7ad27225669cbe609ec38756b31d4e1b9
Author: Wenchen Fan <wenc...@databricks.com>
Date: 2016-01-26T01:58:11Z

[SPARK-12936][SQL] Initial bloom filter implementation

This PR adds an initial implementation of a bloom filter in the newly added sketch module. The implementation is based on the [`BloomFilter` class in guava](https://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/hash/BloomFilter.java). Some differences from the design doc:

* expose `bitSize` instead of `sizeInBytes` to the user.
* always require the `expectedInsertions` parameter when creating a bloom filter.

Author: Wenchen Fan <wenc...@databricks.com>
Closes #10883 from cloud-fan/bloom-filter.

commit fdcc3512f7b45e5b067fc26cb05146f79c4a5177
Author: tedyu <yuzhih...@gmail.com>
Date: 2016-01-26T02:23:47Z

[SPARK-12934] use try-with-resources for streams

liancheng, please take a look.

Author: tedyu <yuzhih...@gmail.com>
Closes #10906 from tedyu/master.

commit b66afdeb5253913d916dcf159aaed4ffdc15fd4b
Author: Holden Karau <hol...@us.ibm.com>
Date: 2016-01-26T06:38:31Z

[SPARK-11922][PYSPARK][ML] Python api for ml.feature.quantile discretizer

Add a Python API for ml.feature.QuantileDiscretizer. One open question: do we want to re-use the Java model, create a new model, or use a different wrapper around the Java model?

cc brkyvz & mengxr

Author: Holden Karau <hol...@us.ibm.com>
Closes #10085 from holdenk/SPARK-11937-SPARK-11922-Python-API-for-ml.feature.QuantileDiscretizer.

commit ae47ba718a280fc12720a71b981c38dbe647f35b
Author: Xusen Yin <yinxu...@gmail.com>
Date: 2016-01-26T06:41:52Z

[SPARK-12834] Change ser/de of JavaArray and JavaList

https://issues.apache.org/jira/browse/SPARK-12834

We use `SerDe.dumps()` to serialize `JavaArray` and `JavaList` in `PythonMLLibAPI`, then deserialize them with `PickleSerializer` on the Python side. However, there is no need to transform them in such an inefficient way. Instead, we can use type conversion to convert them, e.g.
`list(JavaArray)` or `list(JavaList)`. What's more, there is an issue with ser/de of Scala Array, as I noted in https://issues.apache.org/jira/browse/SPARK-12780.

Author: Xusen Yin <yinxu...@gmail.com>
Closes #10772 from yinxusen/SPARK-12834.

commit 27c910f7f29087d1ac216d4933d641d6515fd6ad
Author: Xiangrui Meng <m...@databricks.com>
Date: 2016-01-26T06:53:34Z

[SPARK-10086][MLLIB][STREAMING][PYSPARK] ignore StreamingKMeans test in PySpark for now

I saw several failures in recent PR builds, e.g., https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50015/consoleFull. This PR marks the test as ignored; we will fix the flakiness under SPARK-10086.

gliptak Do you know why the test failure didn't show up in the Jenkins "Test Result"?

cc: jkbradley

Author: Xiangrui Meng <m...@databricks.com>
Closes #10909 from mengxr/SPARK-10086.

commit d54cfed5a6953a9ce2b9de2f31ee2d673cb5cc62
Author: Reynold Xin <r...@databricks.com>
Date: 2016-01-26T08:51:08Z

[SQL][MINOR] A few minor tweaks to CSV reader.

This pull request simply fixes a few minor coding style issues in CSV, as I was reviewing the change post-hoc.

Author: Reynold Xin <r...@databricks.com>
Closes #10919 from rxin/csv-minor.

commit 6743de3a98e3f0d0e6064ca1872fa88c3aeaa143
Author: Wenchen Fan <wenc...@databricks.com>
Date: 2016-01-26T08:53:05Z

[SPARK-12937][SQL] bloom filter serialization

This PR adds serialization support for BloomFilter. A version number is added to version the serialized binary format.

Author: Wenchen Fan <wenc...@databricks.com>
Closes #10920 from cloud-fan/bloom-filter.

commit 5936bf9fa85ccf7f0216145356140161c2801682
Author: Liang-Chi Hsieh <vii...@gmail.com>
Date: 2016-01-26T11:36:00Z

[SPARK-12961][CORE] Prevent snappy-java memory leak

JIRA: https://issues.apache.org/jira/browse/SPARK-12961

To prevent a memory leak in snappy-java, just call the method once and cache the result. Once the library releases a new version, we can remove this object.
JoshRosen

Author: Liang-Chi Hsieh <vii...@gmail.com>
Closes #10875 from viirya/prevent-snappy-memory-leak.

commit 649e9d0f5b2d5fc13f2dd5be675331510525927f
Author: Sean Owen <so...@cloudera.com>
Date: 2016-01-26T11:55:28Z

[SPARK-3369][CORE][STREAMING] Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator

Fix the Java function API methods for flatMap and mapPartitions to require producing only an Iterator, not an Iterable. Also fix DStream.flatMap to require a function producing TraversableOnce only, not Traversable.

CC rxin pwendell for the API change; tdas since it also touches streaming.

Author: Sean Owen <so...@cloudera.com>
Closes #10413 from srowen/SPARK-3369.

commit ae0309a8812a4fade3a0ea67d8986ca870aeb9eb
Author: zhuol <zh...@yahoo-inc.com>
Date: 2016-01-26T15:40:02Z

[SPARK-10911] Executors should System.exit on clean shutdown.

Call System.exit explicitly to make sure non-daemon user threads terminate. Without this, user applications might live forever if the cluster manager does not appropriately kill them. E.g., YARN had this bug: HADOOP-12441.

Author: zhuol <zh...@yahoo-inc.com>
Closes #9946 from zhuoliu/10911.

commit 08c781ca672820be9ba32838bbe40d2643c4bde4
Author: Sameer Agarwal <sam...@databricks.com>
Date: 2016-01-26T15:50:37Z

[SPARK-12682][SQL] Add support for (optionally) not storing tables in hive metadata format

This PR adds a new table option (`skip_hive_metadata`) that allows the user to skip storing the table metadata in Hive metadata format. While this could be useful in general, the specific use case for this change is that Hive doesn't handle wide schemas well (see https://issues.apache.org/jira/browse/SPARK-12682 and https://issues.apache.org/jira/browse/SPARK-6024), which in turn prevents such tables from being queried in Spark SQL.

Author: Sameer Agarwal <sam...@databricks.com>
Closes #10826 from sameeragarwal/skip-hive-metadata.
commit cbd507d69cea24adfb335d8fe26ab5a13c053ffc
Author: Shixiong Zhu <shixi...@databricks.com>
Date: 2016-01-26T19:31:54Z

[SPARK-7799][STREAMING][DOCUMENT] Add the linking and deploying instructions for streaming-akka project

Since `actorStream` is in an external project, we should add the linking and deploying instructions for it. A follow-up PR of #10744.

Author: Shixiong Zhu <shixi...@databricks.com>
Closes #10856 from zsxwing/akka-link-instruction.

commit 8beab68152348c44cf2f89850f792f164b06470d
Author: Xusen Yin <yinxu...@gmail.com>
Date: 2016-01-26T19:56:46Z

[SPARK-11923][ML] Python API for ml.feature.ChiSqSelector

https://issues.apache.org/jira/browse/SPARK-11923

Author: Xusen Yin <yinxu...@gmail.com>
Closes #10186 from yinxusen/SPARK-11923.

commit fbf7623d49525e3aa6b08f482afd7ee8118d80cb
Author: Xusen Yin <yinxu...@gmail.com>
Date: 2016-01-26T21:18:01Z

[SPARK-12952] EMLDAOptimizer initialize() should return EMLDAOptimizer rather than its parent class

https://issues.apache.org/jira/browse/SPARK-12952

Author: Xusen Yin <yinxu...@gmail.com>
Closes #10863 from yinxusen/SPARK-12952.

commit ee74498de372b16fe6350e3617e9e6ec87c6ae7b
Author: Josh Rosen <joshro...@databricks.com>
Date: 2016-01-26T22:20:11Z

[SPARK-8725][PROJECT-INFRA] Test modules in topologically-sorted order in dev/run-tests

This patch improves our `dev/run-tests` script to test modules in a topologically-sorted order based on the modules' dependencies. This will help ensure that bugs in upstream projects are not misattributed to downstream projects because those projects' tests were the first ones to exhibit the failure. Topological sorting is also useful for shortening the feedback loop when testing pull requests: if I make a change in SQL, then the SQL tests should run before MLlib, not after. In addition, this patch also updates our test module definitions to split `sql` into `catalyst`, `sql`, and `hive` in order to allow more tests to be skipped when changing only `hive/` files.
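For reference, the topological ordering described above can be sketched with Kahn's algorithm in Python. The module names and dependency map below are illustrative, not the actual dev/run-tests module graph:

```python
from collections import deque

def topo_sort(deps):
    """deps maps module -> set of modules it depends on.
    Returns the modules ordered so dependencies come before dependents."""
    indegree = {m: len(d) for m, d in deps.items()}
    dependents = {m: [] for m in deps}
    for mod, upstream in deps.items():
        for up in upstream:
            dependents[up].append(mod)
    # Start from modules with no unmet dependencies.
    ready = deque(sorted(m for m, n in indegree.items() if n == 0))
    order = []
    while ready:
        mod = ready.popleft()
        order.append(mod)
        for down in dependents[mod]:
            indegree[down] -= 1
            if indegree[down] == 0:
                ready.append(down)
    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order

# Illustrative graph: sql depends on catalyst; hive and mllib sit downstream.
modules = {"catalyst": set(), "sql": {"catalyst"},
           "hive": {"sql"}, "mllib": {"sql"}}
```

With this ordering, a failure in `catalyst` surfaces in `catalyst`'s own tests before any downstream module runs.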
Author: Josh Rosen <joshro...@databricks.com>
Closes #10885 from JoshRosen/SPARK-8725.

commit 83507fea9f45c336d73dd4795b8cb37bcd63e31d
Author: Cheng Lian <l...@databricks.com>
Date: 2016-01-26T22:29:29Z

[SQL] Minor Scaladoc format fix

Otherwise the `^` character is always marked as an error in IntelliJ, since it represents an unclosed superscript markup tag.

Author: Cheng Lian <l...@databricks.com>
Closes #10926 from liancheng/agg-doc-fix.

commit 19fdb21afbf0eae4483cf6d4ef32daffd1994b89
Author: Jeff Zhang <zjf...@apache.org>
Date: 2016-01-26T22:58:39Z

[SPARK-12993][PYSPARK] Remove usage of ADD_FILES in pyspark environment variable

ADD_FILES was created for adding Python files on the Spark context to be distributed to executors (SPARK-865); it is deprecated now. Users are encouraged to use --py-files for adding Python files.

Author: Jeff Zhang <zjf...@apache.org>
Closes #10913 from zjffdu/SPARK-12993.

commit eb917291ca1a2d68ca0639cb4b1464a546603eba
Author: Holden Karau <hol...@us.ibm.com>
Date: 2016-01-26T23:53:48Z

[SPARK-10509][PYSPARK] Reduce excessive param boiler plate code

The current Python ML params require cut-and-pasting the param setup and description between the class & ```__init__``` methods. Remove this possible source of errors & simplify the use of custom params by adding a ```_copy_new_parent``` method to Param, so as to avoid cut and pasting (and cut and pasting at different indentation levels, urgh).

Author: Holden Karau <hol...@us.ibm.com>
Closes #10216 from holdenk/SPARK-10509-excessive-param-boiler-plate-code.

commit 22662b241629b56205719ede2f801a476e10a3cd
Author: Shixiong Zhu <shixi...@databricks.com>
Date: 2016-01-27T01:24:40Z

[SPARK-12614][CORE] Don't throw non fatal exception from ask

Right now RpcEndpointRef.ask may throw an exception in some corner cases, such as calling ask after stopping the RpcEnv. It's better to avoid throwing exceptions from RpcEndpointRef.ask; we can instead send the exception to the future returned by `ask`.
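The pattern of delivering failures through the returned future rather than throwing can be sketched in plain Python. The `ask` function and its canned reply below are illustrative stand-ins, not Spark's RPC API:

```python
from concurrent.futures import Future

def ask(endpoint_alive, message):
    """Return a Future carrying either the reply or the failure;
    the caller never sees an exception raised by ask() itself."""
    future = Future()
    try:
        if not endpoint_alive:
            raise RuntimeError("RpcEnv already stopped")
        future.set_result(f"ack:{message}")   # stand-in for a real reply
    except Exception as exc:
        future.set_exception(exc)             # failure travels via the future
    return future
```

Callers then handle both outcomes in one place, by inspecting the future instead of wrapping every `ask` call in try/except.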
Author: Shixiong Zhu <shixi...@databricks.com>
Closes #10568 from zsxwing/send-ask-fail.

commit 1dac964c1b996d38c65818414fc8401961a1de8a
Author: Jeff Zhang <zjf...@apache.org>
Date: 2016-01-27T01:31:19Z

[SPARK-11622][MLLIB] Make LibSVMRelation extends HadoopFsRelation and… … Add LibSVMOutputWriter

The behavior of LibSVMRelation is not changed except for adding LibSVMOutputWriter:

* Partitioning is still not supported
* Multiple input paths are not supported

Author: Jeff Zhang <zjf...@apache.org>
Closes #9595 from zjffdu/SPARK-11622.

commit 555127387accdd7c1cf236912941822ba8af0a52
Author: Nong Li <n...@databricks.com>
Date: 2016-01-27T01:34:01Z

[SPARK-12854][SQL] Implement complex types support in ColumnarBatch

This patch adds support for complex types in ColumnarBatch. ColumnarBatch supports structs and arrays; there is a simple mapping between the richer Catalyst types and these two. Strings are treated as an array of bytes.

ColumnarBatch will contain a column for each node of the schema. Non-complex schemas consist of just leaf nodes. Structs represent an internal node with one child for each field. Arrays are internal nodes with one child. Structs just contain nullability. Arrays contain offsets and lengths into the child array. This structure is able to handle arbitrary nesting. It has the key property that we maintain a columnar layout throughout and that primitive types are only stored in the leaf nodes and contiguous across rows.

For example, if the schema is `array<array<int>>`, there are three columns in the schema. The internal nodes each have one child. The leaf node contains all the int data stored consecutively.

As part of this, this patch adds append APIs in addition to the put APIs (e.g. `putLong(rowid, v)` vs `appendLong(v)`). These APIs are necessary when the batch contains variable-length elements. The vectors are not fixed length and will grow as necessary. This should make the usage a lot simpler for the writer.
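The offsets-and-lengths layout described above can be sketched in pure Python for an `array<array<int>>` schema. This is a simplified model, not the actual ColumnarBatch code: each internal level stores (offset, length) pairs into its child column, and all leaf ints sit contiguously.

```python
def encode(rows):
    """Encode a list of array<array<int>> rows into three columnar arrays:
    outer (offset, length) pairs, inner (offset, length) pairs, leaf ints."""
    outer, inner, leaves = [], [], []
    for row in rows:
        outer.append((len(inner), len(row)))       # row spans these inner entries
        for arr in row:
            inner.append((len(leaves), len(arr)))  # arr spans these leaf ints
            leaves.extend(arr)
    return outer, inner, leaves

def decode(outer, inner, leaves):
    """Rebuild the nested rows from the three columnar arrays."""
    rows = []
    for o_off, o_len in outer:
        row = [leaves[i_off:i_off + i_len]
               for i_off, i_len in inner[o_off:o_off + o_len]]
        rows.append(row)
    return rows
```

Note the key property from the description: the leaf column stays contiguous across rows, so appending a new row only extends the three flat arrays.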
Author: Nong Li <n...@databricks.com>
Closes #10820 from nongli/spark-12854.

commit b72611f20a03c790b6fd341b6ffdb3b5437609ee
Author: Holden Karau <hol...@us.ibm.com>
Date: 2016-01-27T01:59:05Z

[SPARK-7780][MLLIB] intercept in logisticregressionwith lbfgs should not be regularized

The intercept in logistic regression represents a prior on the categories and should not be regularized. In MLlib, regularization is handled through an Updater, and the Updater penalizes all the components without excluding the intercept, which results in poor training accuracy with regularization. The new implementation in the ML framework handles this properly, and we should call the implementation in ML from MLlib, since the majority of users are still using the MLlib API. Note that both of them do feature scaling to improve convergence; the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they converge to the same solution.

Previously partially reviewed at https://github.com/apache/spark/pull/6386#issuecomment-168781424, re-opening for dbtsai to review.

Author: Holden Karau <hol...@us.ibm.com>
Author: Holden Karau <hol...@pigscanfly.ca>
Closes #10788 from holdenk/SPARK-7780-intercept-in-logisticregressionwithLBFGS-should-not-be-regularized.

commit e7f9199e709c46a6b5ad6b03c9ecf12cc19e3a41
Author: Yanbo Liang <yblia...@gmail.com>
Date: 2016-01-27T03:29:47Z

[SPARK-12903][SPARKR] Add covar_samp and covar_pop for SparkR

Add ```covar_samp``` and ```covar_pop``` for SparkR. Should we also provide a ```cov``` alias for ```covar_samp```? There is a ```cov``` implementation in stats.R which masks ```stats::cov``` already, but it may bring a breaking API change.

cc sun-rui felixcheung shivaram

Author: Yanbo Liang <yblia...@gmail.com>
Closes #10829 from yanboliang/spark-12903.
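For reference, `covar_pop` and `covar_samp` differ only in the divisor: population covariance divides by n, sample covariance by n - 1. A minimal plain-Python sketch of the formulas (not the SparkR implementation):

```python
def covar(xs, ys, sample=False):
    """Covariance of two equal-length sequences.
    sample=False -> population covariance (divide by n, like covar_pop);
    sample=True  -> sample covariance (divide by n - 1, like covar_samp)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1) if sample else total / n
```

For example, `covar([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], sample=True)` gives 2.0, while the population flavor gives 4/3.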
commit ce38a35b764397fcf561ac81de6da96579f5c13e
Author: Cheng Lian <l...@databricks.com>
Date: 2016-01-27T04:12:34Z

[SPARK-12935][SQL] DataFrame API for Count-Min Sketch

This PR integrates Count-Min Sketch from spark-sketch into DataFrame. This version resorts to `RDD.aggregate` for building the sketch. A more performant UDAF version can be built in future follow-up PRs.

Author: Cheng Lian <l...@databricks.com>
Closes #10911 from liancheng/cms-df-api.

commit 58f5d8c1da6feeb598aa5f74ffe1593d4839d11d
Author: Cheng Lian <l...@databricks.com>
Date: 2016-01-27T04:30:13Z

[SPARK-12728][SQL] Integrates SQL generation with native view

This PR is a follow-up of PR #10541. It integrates the newly introduced SQL generation feature with native view to make native view canonical.

In this PR, a new SQL option `spark.sql.nativeView.canonical` is added. When this option and `spark.sql.nativeView` are both `true`, Spark SQL tries to handle `CREATE VIEW` DDL statements using SQL query strings generated from the view definition logical plan. If we fail to map the plan to SQL, we fall back to the original native view approach.

One important issue this PR fixes is that we can now use a CTE when defining a view. Originally, when native view is turned on, we wrap the view definition text with an extra `SELECT`. However, the HiveQL parser doesn't allow a CTE to appear as a subquery. Namely, something like this is disallowed:

```sql
SELECT n FROM (
  WITH w AS (SELECT 1 AS n)
  SELECT * FROM w
) v
```

This PR fixes the issue because the extra `SELECT` is no longer needed (also, CTE expressions are inlined as subqueries during the analysis phase, so there won't be CTE expressions in the generated SQL query string).

Author: Cheng Lian <l...@databricks.com>
Author: Yin Huai <yh...@databricks.com>
Closes #10733 from liancheng/spark-12728.integrate-sql-gen-with-native-view.
commit bae3c9a4eb0c320999e5dbafd62692c12823e07d
Author: Nishkam Ravi <nishkamr...@gmail.com>
Date: 2016-01-27T05:14:39Z

[SPARK-12967][NETTY] Avoid NettyRpc error message during sparkContext shutdown

If there's an RPC issue while the SparkContext is alive but stopped (which would happen only while executing SparkContext.stop), log a warning instead. This is a common occurrence.

vanzin

Author: Nishkam Ravi <nishkamr...@gmail.com>
Author: nishkamravi2 <nishkamr...@gmail.com>
Closes #10881 from nishkamravi2/master_netty.

commit 4db255c7aa756daa224d61905db745b6bccc9173
Author: Xusen Yin <yinxu...@gmail.com>
Date: 2016-01-27T05:16:56Z

[SPARK-12780] Inconsistency in returned values of ML Python models' properties

https://issues.apache.org/jira/browse/SPARK-12780

Author: Xusen Yin <yinxu...@gmail.com>
Closes #10724 from yinxusen/SPARK-12780.

commit 90b0e562406a8bac529e190472e7f5da4030bf5c
Author: BenFradet <benjamin.fra...@gmail.com>
Date: 2016-01-27T09:27:11Z

[SPARK-12983][CORE][DOC] Correct metrics.properties.template

There are some typos and plain unintelligible sentences in the metrics template.

Author: BenFradet <benjamin.fra...@gmail.com>
Closes #10902 from BenFradet/SPARK-12983.

----