plus 1





On 2015-12-17 09:39:39, "Joseph Bradley" <jos...@databricks.com> wrote:

+1


On Wed, Dec 16, 2015 at 5:26 PM, Reynold Xin <r...@databricks.com> wrote:

+1




On Wed, Dec 16, 2015 at 5:24 PM, Mark Hamstra <m...@clearstorydata.com> wrote:

+1


On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust <mich...@databricks.com> 
wrote:

Please vote on releasing the following candidate as Apache Spark version 1.6.0!

The vote is open until Saturday, December 19, 2015 at 18:00 UTC and passes if a 
majority of at least 3 +1 PMC votes are cast.



[ ] +1 Release this package as Apache Spark 1.6.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/


The tag to be voted on is v1.6.0-rc3 (168c89e07c51fa24b0bb88582c739cec0acb44d7)


The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/


Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc


The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1174/


The test repository (versioned as v1.6.0-rc3) for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1173/


The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/


=======================================
== How can I help test this release? ==
=======================================
If you are a Spark user, you can help us test this release by taking an 
existing Spark workload, running it on this release candidate, and reporting 
any regressions.


================================================
== What justifies a -1 vote for this release? ==
================================================
This vote is happening towards the end of the 1.6 QA period, so -1 votes should 
only occur for significant regressions from 1.5. Bugs already present in 1.5, 
minor regressions, or bugs related to new features will not block this release.


===============================================================
== What should happen to JIRA tickets still targeting 1.6.0? ==
===============================================================
1. It is OK for documentation patches to target 1.6.0 and still go into 
branch-1.6, since documentation will be published separately from the release.
2. New features for non-alpha-modules should target 1.7+.
3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target 
version.




==================================================
== Major changes to help you focus your testing ==
==================================================


Notable changes since 1.6 RC2

- SPARK_VERSION has been set correctly
- SPARK-12199 ML Docs are publishing correctly
- SPARK-12345 Mesos cluster mode has been fixed


Notable changes since 1.6 RC1

Spark Streaming
SPARK-2629  trackStateByKey has been renamed to mapWithState
Spark SQL
SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by execution.
SPARK-12258 correct passing null into ScalaUDF
Notable Features Since 1.5
Spark SQL
SPARK-11787 Parquet Performance - Improve Parquet scan performance when using 
flat schemas.
SPARK-10810 Session Management - Isolated default database (i.e. USE mydb) even 
on shared clusters.
SPARK-9999  Dataset API - A type-safe API (similar to RDDs) that performs many 
operations directly on serialized binary data and uses code generation (i.e. 
Project Tungsten).
SPARK-10000 Unified Memory Management - Shared memory for execution and caching 
instead of exclusive division of the regions.
SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries over 
files of any supported format without registering a table (see the short 
example after this list).
SPARK-11745 Reading non-standard JSON files - Added options to read 
non-standard JSON files (e.g. single-quotes, unquoted attributes)
SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics on a 
per-operator basis for memory usage and spilled data size.
SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest and 
unnest arbitrary numbers of columns
SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance - Significant (up 
to 14x) speed up when caching data that contains complex types in DataFrames or 
SQL.
SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>) will 
now execute using SortMergeJoin instead of computing a Cartesian product.
SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring query 
execution to occur using off-heap memory to avoid GC overhead
SPARK-10978 Datasource API Avoid Double Filter - When implementing a datasource 
with filter pushdown, developers can now tell Spark SQL to avoid double 
evaluating a pushed-down filter.
SPARK-4849  Advanced Layout of Cached Data - Storing partitioning and ordering 
schemes in the in-memory table scan, and adding distributeBy and localSort to 
the DataFrame API
SPARK-9858  Adaptive query execution - Initial support for automatically 
selecting the number of reducers for joins and aggregations.
SPARK-9241  Improved query planner for queries having distinct aggregations - 
Query plans of distinct aggregations are more robust when distinct columns have 
high cardinality.
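
If you want a quick way to exercise the new SQL-on-files syntax (SPARK-11197) 
while testing this RC, here is a minimal sketch for the 1.6 spark-shell, where 
sqlContext is predefined (the path is a placeholder):

    // query a Parquet file directly, without registering a temp table first
    val df = sqlContext.sql("SELECT * FROM parquet.`/path/to/events.parquet`")
    df.show()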
Spark Streaming
API Updates
SPARK-2629  New improved state management - mapWithState - a DStream 
transformation for stateful stream processing; supersedes updateStateByKey in 
functionality and performance (see the short example after this list).
SPARK-11198 Kinesis record deaggregation - Kinesis streams have been upgraded 
to use KCL 1.4.0 and support transparent deaggregation of KPL-aggregated 
records.
SPARK-10891 Kinesis message handler function - Allows an arbitrary function to 
be applied to a Kinesis record in the Kinesis receiver to customize what data 
is stored in memory.
SPARK-6328  Python Streaming Listener API - Get streaming statistics (scheduling 
delays, batch processing times, etc.) in streaming.
UI Improvements
Made failures visible in the streaming tab, in the timelines, batch list, and 
batch details page.
Made output operations visible in the streaming tab as progress bars.
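
For testing the new mapWithState API (SPARK-2629) mentioned above, a minimal 
sketch along the lines of the stateful word count example; wordDstream is 
assumed to be a DStream[(String, Int)] of (word, 1) pairs:

    import org.apache.spark.streaming.{State, StateSpec}

    // keep a running count per word and emit (word, updated count) downstream
    val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
      val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)
      (word, sum)
    }
    val runningCounts = wordDstream.mapWithState(StateSpec.function(mappingFunc))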
MLlib
New algorithms/models
SPARK-8518  Survival analysis - Log-linear model for survival analysis
SPARK-9834  Normal equation for least squares - Normal equation solver, 
providing R-like model summary statistics
SPARK-3147  Online hypothesis testing - A/B testing in the Spark Streaming 
framework
SPARK-9930  New feature transformers - ChiSqSelector, QuantileDiscretizer, SQL 
transformer
SPARK-6517  Bisecting K-Means clustering - Fast top-down clustering variant of 
K-Means
API improvements
ML Pipelines
SPARK-6725  Pipeline persistence - Save/load for ML Pipelines, with partial 
coverage of spark.ml algorithms
SPARK-5565  LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML 
Pipelines
R API
SPARK-9836  R-like statistics for GLMs - (Partial) R-like stats for ordinary 
least squares via summary(model)
SPARK-9681  Feature interactions in R formula - Interaction operator ":" in R 
formula
Python API - Many improvements to Python API to approach feature parity
Misc improvements
SPARK-7685 , SPARK-9642  Instance weights for GLMs - Logistic and Linear 
Regression can take instance weights
SPARK-10384, SPARK-10385 Univariate and bivariate statistics in DataFrames - 
Variance, stddev, correlations, etc.
SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source (see the short 
example after this list)
Documentation improvements
SPARK-7751  @since versions - Documentation includes initial version when 
classes and methods were added
SPARK-11337 Testable example code - Automated testing for code in user guide 
examples
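
To try the new LIBSVM data source (SPARK-10117) from the list above, a minimal 
sketch for the 1.6 spark-shell (the path is a placeholder):

    // yields a DataFrame with "label" and "features" columns
    val training = sqlContext.read.format("libsvm")
      .load("data/mllib/sample_libsvm_data.txt")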
Deprecations
In spark.mllib.clustering.KMeans, the "runs" parameter has been deprecated.
In spark.ml.classification.LogisticRegressionModel and 
spark.ml.regression.LinearRegressionModel, the "weights" field has been 
deprecated, in favor of the new name "coefficients." This helps disambiguate 
from instance (row) weights given to algorithms.
Changes of behavior
spark.mllib.tree.GradientBoostedTrees validationTol has changed semantics in 
1.6. Previously, it was a threshold for absolute change in error. Now, it 
resembles the behavior of GradientDescent convergenceTol: For large errors, it 
uses relative error (relative to the previous error); for small errors (< 
0.01), it uses absolute error.
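One way to read the new semantics, as an illustrative sketch only (this is not 
the actual GradientBoostedTrees code):

    // change in validation error that gets compared against validationTol
    def errorChange(previousError: Double, currentError: Double): Double = {
      val absoluteChange = math.abs(previousError - currentError)
      if (previousError < 0.01) absoluteChange        // small errors: absolute change
      else absoluteChange / previousError             // large errors: relative to previous error
    }
    // training with validation stops once this change falls below validationTol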
spark.ml.feature.RegexTokenizer: Previously, it did not convert strings to 
lowercase before tokenizing. Now, it converts to lowercase by default, with an 
option not to. This matches the behavior of the simpler Tokenizer transformer.
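If the old behavior is needed, something like the following should restore it 
(a minimal sketch, assuming the opt-out is the new toLowercase param):

    import org.apache.spark.ml.feature.RegexTokenizer

    val tokenizer = new RegexTokenizer()
      .setInputCol("text")
      .setOutputCol("words")
      .setToLowercase(false)   // 1.6 lowercases by default; this opts out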
Spark SQL's partition discovery has been changed to only discover partition 
directories that are children of the given path. (i.e. if path="/my/data/x=1" 
then x=1 will no longer be considered a partition but only children of x=1.) 
This behavior can be overridden by manually specifying the basePath that 
partitioning discovery should start with (SPARK-11678).
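For example, to keep treating x=1 as a partition column, a minimal sketch 
(paths are placeholders):

    // partition discovery starts at /my/data instead of the given leaf directory
    val df = sqlContext.read
      .option("basePath", "/my/data")
      .parquet("/my/data/x=1")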
When casting a value of an integral type to timestamp (e.g. casting a long 
value to timestamp), the value is treated as being in seconds instead of 
milliseconds (SPARK-11724).
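A quick way to see the difference in the 1.6 spark-shell:

    // 1450000000 is now read as seconds since the epoch (a timestamp in Dec 2015),
    // not milliseconds (which would land in Jan 1970)
    sqlContext.sql("SELECT cast(1450000000 AS timestamp)").show()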
With the improved query planner for queries having distinct aggregations 
(SPARK-9241), the plan of a query having a single distinct aggregation has been 
changed to a more robust version. To switch back to the plan generated by Spark 
1.5's planner, please set spark.sql.specializeSingleDistinctAggPlanning to true 
(SPARK-12077).
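For example, in the spark-shell:

    // revert to the Spark 1.5 planning for single distinct aggregations
    sqlContext.setConf("spark.sql.specializeSingleDistinctAggPlanning", "true")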




