+1

On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust <mich...@databricks.com> wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v1.6.0-rc2
> (23f8dfd45187cb8f2216328ab907ddb5fbdffd0b)
> <https://github.com/apache/spark/tree/v1.6.0-rc2>*
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1169/
>
> The test repository (versioned as v1.6.0-rc2) for this release can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1168/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/
>
> =======================================
> == How can I help test this release? ==
> =======================================
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> ================================================
> == What justifies a -1 vote for this release? ==
> ================================================
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already
> present in 1.5, minor regressions, or bugs related to new features will not
> block this release.
>
> ===============================================================
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===============================================================
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentation will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> version.
>
>
> ==================================================
> == Major changes to help you focus your testing ==
> ==================================================
>
> Spark 1.6.0 Preview
>
> Notable changes since 1.6 RC1
>
> Spark Streaming
>
> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
> trackStateByKey has been renamed to mapWithState
>
> Spark SQL
>
> - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
> SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
> bugs in eviction of storage memory by execution.
> - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> Correct
> passing null into ScalaUDF
>
> Notable Features Since 1.5
>
> Spark SQL
>
> - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
> Performance - Improve Parquet scan performance when using flat schemas.
> - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
> Session Management - Isolated default database (i.e. USE mydb) even on
> shared clusters.
> - SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
> API - A type-safe API (similar to RDDs) that performs many operations
> on serialized binary data and code generation (i.e. Project Tungsten).
> - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
> Memory Management - Shared memory for execution and caching instead of
> exclusive division of the regions.
> - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
> Queries on Files - Concise syntax for running SQL queries over files
> of any supported format without registering a table.
> - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
> non-standard JSON files - Added options to read non-standard JSON
> files (e.g. single-quotes, unquoted attributes)
> - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412>
> Per-operator Metrics for SQL Execution - Display statistics on a
> per-operator basis for memory usage and spilled data size.
> - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
> (*) expansion for StructTypes - Makes it easier to nest and unnest
> arbitrary numbers of columns
> - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
> SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
> Columnar Cache Performance - Significant (up to 14x) speed up when
> caching data that contains complex types in DataFrames or SQL.
> - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
> null-safe joins - Joins using null-safe equality (<=>) will now
> execute using SortMergeJoin instead of computing a Cartesian product.
> - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
> Execution Using Off-Heap Memory - Support for configuring query
> execution to occur using off-heap memory to avoid GC overhead
> - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978>
> Datasource API Avoid Double Filter - When implementing a datasource
> with filter pushdown, developers can now tell Spark SQL to avoid double
> evaluating a pushed-down filter.
> - SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
> Layout of Cached Data - Storing partitioning and ordering schemes in
> in-memory table scan, and adding distributeBy and localSort to DF API
> - SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
> query execution - Initial support for automatically selecting the
> number of reducers for joins and aggregations.
> - SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241> Improved
> query planner for queries having distinct aggregations - Query plans
> of distinct aggregations are more robust when distinct columns have high
> cardinality.
>
> Spark Streaming
>
> - API Updates
>    - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> New
>    improved state management - mapWithState - a DStream transformation
>    for stateful stream processing, supersedes updateStateByKey in
>    functionality and performance.
>    - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198>
>    Kinesis record deaggregation - Kinesis streams have been upgraded to
>    use KCL 1.4.0 and support transparent deaggregation of KPL-aggregated
>    records.
>    - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891>
>    Kinesis message handler function - Allows an arbitrary function to be
>    applied to a Kinesis record in the Kinesis receiver to customize what
>    data is to be stored in memory.
>    - SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328> Python
>    Streaming Listener API - Get streaming statistics (scheduling
>    delays, batch processing times, etc.) in streaming.
>
> - UI Improvements
>    - Made failures visible in the streaming tab, in the timelines,
>    batch list, and batch details page.
>    - Made output operations visible in the streaming tab as progress
>    bars.
>
> MLlib
>
> New algorithms/models
>
> - SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518> Survival
> analysis - Log-linear model for survival analysis
> - SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal
> equation for least squares - Normal equation solver, providing R-like
> model summary statistics
> - SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online
> hypothesis testing - A/B testing in the Spark Streaming framework
> - SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
> feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
> transformer
> - SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
> K-Means clustering - Fast top-down clustering variant of K-Means
>
> API improvements
>
> - ML Pipelines
>    - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725>
>    Pipeline persistence - Save/load for ML Pipelines, with partial
>    coverage of spark.ml algorithms
>    - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA
>    in ML Pipelines - API for Latent Dirichlet Allocation in ML
>    Pipelines
> - R API
>    - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like
>    statistics for GLMs - (Partial) R-like stats for ordinary least
>    squares via summary(model)
>    - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> Feature
>    interactions in R formula - Interaction operator ":" in R formula
> - Python API - Many improvements to Python API to approach feature
> parity
>
> Misc improvements
>
> - SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
> SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance
> weights for GLMs - Logistic and Linear Regression can take instance
> weights
> - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
> SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
> and bivariate statistics in DataFrames - Variance, stddev,
> correlations, etc.
> - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
> data source - LIBSVM as a SQL data source
>
> Documentation improvements
>
> - SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since
> versions - Documentation includes initial version when classes and
> methods were added
> - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
> example code - Automated testing for code in user guide examples
>
> Deprecations
>
> - In spark.mllib.clustering.KMeans, the "runs" parameter has been
> deprecated.
> - In spark.ml.classification.LogisticRegressionModel and
> spark.ml.regression.LinearRegressionModel, the "weights" field has been
> deprecated, in favor of the new name "coefficients." This helps
> disambiguate from instance (row) weights given to algorithms.
>
> Changes of behavior
>
> - spark.mllib.tree.GradientBoostedTrees validationTol has changed
> semantics in 1.6. Previously, it was a threshold for absolute change in
> error. Now, it resembles the behavior of GradientDescent convergenceTol:
> For large errors, it uses relative error (relative to the previous error);
> for small errors (< 0.01), it uses absolute error.
> - spark.ml.feature.RegexTokenizer: Previously, it did not convert
> strings to lowercase before tokenizing. Now, it converts to lowercase by
> default, with an option not to. This matches the behavior of the simpler
> Tokenizer transformer.
> - Spark SQL's partition discovery has been changed to only discover
> partition directories that are children of the given path. (i.e. if
> path="/my/data/x=1" then x=1 will no longer be considered a partition
> but only children of x=1.) This behavior can be overridden by manually
> specifying the basePath that partitioning discovery should start with
> (SPARK-11678 <https://issues.apache.org/jira/browse/SPARK-11678>).
> - When casting a value of an integral type to timestamp (e.g. casting
> a long value to timestamp), the value is treated as being in seconds
> instead of milliseconds (SPARK-11724
> <https://issues.apache.org/jira/browse/SPARK-11724>).
> - With the improved query planner for queries having distinct
> aggregations (SPARK-9241
> <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
> query having a single distinct aggregation has been changed to a more
> robust version. To switch back to the plan generated by Spark 1.5's
> planner, please set spark.sql.specializeSingleDistinctAggPlanning to
> true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>).
>
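
For anyone checking an existing workload against the integral-to-timestamp change quoted above (SPARK-11724), here is a minimal sketch in plain Python of what the note describes. It is not Spark code and does not call any Spark API: the two helper names are hypothetical, and the "milliseconds" reading for the older behavior is only what the note implies.

```python
from datetime import datetime, timedelta, timezone

# Unix epoch in UTC; both interpretations count from here.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def cast_long_as_millis(value: int) -> datetime:
    """Hypothetical helper: read the integer as milliseconds since the epoch
    (the pre-1.6 interpretation implied by the release note)."""
    return EPOCH + timedelta(milliseconds=value)


def cast_long_as_seconds(value: int) -> datetime:
    """Hypothetical helper: read the integer as seconds since the epoch
    (the 1.6 interpretation described by the release note)."""
    return EPOCH + timedelta(seconds=value)


if __name__ == "__main__":
    value = 1449912000  # an epoch value in seconds, chosen for illustration
    print("as milliseconds:", cast_long_as_millis(value).isoformat())
    print("as seconds:     ", cast_long_as_seconds(value).isoformat())
```

The same stored long lands in January 1970 under one reading and in December 2015 under the other, so jobs that cast epoch-millisecond columns directly to timestamp are probably worth a look when testing this RC.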