+1 

    On Thursday, December 17, 2015 8:22 AM, Kousuke Saruta 
<saru...@oss.nttdata.co.jp> wrote:
 

  +1
 
 On 2015/12/17 6:32, Michael Armbrust wrote:
  
  Please vote on releasing the following candidate as Apache Spark version 
1.6.0! 
 The vote is open until Saturday, December 19, 2015 at 18:00 UTC and passes if 
a majority of at least 3 +1 PMC votes are cast.
  
  [ ] +1 Release this package as Apache Spark 1.6.0
  [ ] -1 Do not release this package because ...
  To learn more about Apache Spark, please see http://spark.apache.org/ 
  The tag to be voted on is v1.6.0-rc3 
(168c89e07c51fa24b0bb88582c739cec0acb44d7) 
  The release files, including signatures, digests, etc. can be found at: 
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/ 
  Release artifacts are signed with the following key: 
https://people.apache.org/keys/committer/pwendell.asc 
   The staging repository for this release can be found at: 
https://repository.apache.org/content/repositories/orgapachespark-1174/  
  The test repository (versioned as v1.6.0-rc3) for this release can be found 
at: https://repository.apache.org/content/repositories/orgapachespark-1173/ 
  The documentation corresponding to this release can be found at: 
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/ 
  =======================================
  == How can I help test this release? ==
  =======================================

  If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running it on this release candidate, then reporting
any regressions.
  ================================================
  == What justifies a -1 vote for this release? ==
  ================================================

  This vote is happening towards the end of the 1.6 QA period, so -1 votes
should only occur for significant regressions from 1.5. Bugs already present in
1.5, minor regressions, or bugs related to new features will not block this
release.
  ===============================================================
  == What should happen to JIRA tickets still targeting 1.6.0? ==
  ===============================================================

  1. It is OK for documentation patches to target 1.6.0 and still go into
     branch-1.6, since documentation will be published separately from the release.
  2. New features for non-alpha-modules should target 1.7+.
  3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target version.
  
  ==================================================
  == Major changes to help you focus your testing ==
  ==================================================
   
Notable changes since 1.6 RC2
 
 - SPARK_VERSION has been set correctly
 - SPARK-12199 ML Docs are publishing correctly
 - SPARK-12345 Mesos cluster mode has been fixed 
  
Notable changes since 1.6 RC1
 
 
Spark Streaming
    
   - SPARK-2629  trackStateByKey has been renamed to mapWithState
 
Spark SQL
    
   - SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by 
execution.
   - SPARK-12258 correct passing null into ScalaUDF
 
Notable Features Since 1.5
 
Spark SQL
    
   - SPARK-11787 Parquet Performance - Improve Parquet scan performance when 
using flat schemas.
   - SPARK-10810 Session Management - Isolated default database (i.e. USE mydb) 
even on shared clusters.
   - SPARK-9999  Dataset API - A type-safe API (similar to RDDs) that performs 
many operations on serialized binary data and code generation (i.e. Project 
Tungsten); see the sketch after this list.
   - SPARK-10000 Unified Memory Management - Shared memory for execution and 
caching instead of exclusive division of the regions.
   - SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries 
over files of any supported format without registering a table (see the sketch 
after this list).
   - SPARK-11745 Reading non-standard JSON files - Added options to read 
non-standard JSON files (e.g. single-quotes, unquoted attributes)
   - SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics on 
a per-operator basis for memory usage and spilled data size.
   - SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest 
and unnest arbitrary numbers of columns.
   - SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance - 
Significant (up to 14x) speed up when caching data that contains complex types 
in DataFrames or SQL.
   - SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>) 
will now execute using SortMergeJoin instead of computing a Cartesian product.
   - SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring 
query execution to occur using off-heap memory to avoid GC overhead.
   - SPARK-10978 Datasource API Avoid Double Filter - When implementing a 
datasource with filter pushdown, developers can now tell Spark SQL to avoid 
double evaluating a pushed-down filter.
   - SPARK-4849  Advanced Layout of Cached Data - Storing partitioning and 
ordering schemes in the in-memory table scan, and adding distributeBy and 
localSort to the DataFrame API.
   - SPARK-9858  Adaptive query execution - Initial support for automatically 
selecting the number of reducers for joins and aggregations.
   - SPARK-9241  Improved query planner for queries having distinct 
aggregations - Query plans of distinct aggregations are more robust when 
distinct columns have high cardinality.
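
A quick illustration of the Dataset API (SPARK-9999) and SQL queries on files
(SPARK-11197), as a minimal Scala sketch against the 1.6 API; the file paths and
the Person case class are placeholders, and sc is an existing SparkContext:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Dataset API: a type-safe view over a DataFrame.
    case class Person(name: String, age: Long)
    val people = sqlContext.read.json("examples/people.json").as[Person]
    val adults = people.filter(_.age >= 21).map(_.name)

    // SQL queries directly on files, without registering a temp table.
    val events = sqlContext.sql("SELECT * FROM parquet.`/path/to/events.parquet`")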
 
Spark Streaming
    
   - API Updates       
      - SPARK-2629  New improved state management - mapWithState - a DStream 
transformation for stateful stream processing; supersedes updateStateByKey in 
functionality and performance (see the sketch after this list).
      - SPARK-11198 Kinesis record deaggregation - Kinesis streams have been 
upgraded to use KCL 1.4.0 and support transparent deaggregation of 
KPL-aggregated records.
      - SPARK-10891 Kinesis message handler function - Allows an arbitrary 
function to be applied to a Kinesis record in the Kinesis receiver, to customize 
what data is to be stored in memory.
      - SPARK-6328  Python Streaming Listener API - Get streaming statistics 
(scheduling delays, batch processing times, etc.) from Python.
 
    
   - UI Improvements       
      - Made failures visible in the streaming tab, in the timelines, batch 
list, and batch details page.
      - Made output operations visible in the streaming tab as progress bars.
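
A minimal sketch of the new mapWithState transformation (SPARK-2629); the
key/value types and the wordStream DStream are illustrative placeholders:

    import org.apache.spark.streaming.{State, StateSpec}

    // Running count per key, kept in Spark-managed state.
    val spec = StateSpec.function(
      (word: String, count: Option[Int], state: State[Int]) => {
        val sum = count.getOrElse(0) + state.getOption.getOrElse(0)
        state.update(sum)          // persist the new running total
        (word, sum)                // record emitted to the mapped stream
      })

    // wordStream: DStream[(String, Int)], e.g. words mapped to 1
    val runningCounts = wordStream.mapWithState(spec)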
 
 
MLlib
 
New algorithms/models
    
   - SPARK-8518  Survival analysis - Log-linear model for survival analysis
   - SPARK-9834  Normal equation for least squares - Normal equation solver, 
providing R-like model summary statistics
   - SPARK-3147  Online hypothesis testing - A/B testing in the Spark Streaming 
framework
   - SPARK-9930  New feature transformers - ChiSqSelector, QuantileDiscretizer, 
SQL transformer
   - SPARK-6517  Bisecting K-Means clustering - Fast top-down clustering 
variant of K-Means (see the sketch after this list).
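
A minimal sketch of the new bisecting k-means API in spark.mllib (SPARK-6517);
the toy input vectors are placeholders:

    import org.apache.spark.mllib.clustering.BisectingKMeans
    import org.apache.spark.mllib.linalg.Vectors

    // data: RDD[Vector] of feature vectors (placeholder input)
    val data = sc.parallelize(Seq(
      Vectors.dense(0.1, 0.1), Vectors.dense(9.0, 9.0),
      Vectors.dense(0.2, 0.3), Vectors.dense(9.2, 8.8)))

    val model = new BisectingKMeans().setK(2).run(data)
    model.clusterCenters.foreach(println)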
 
API improvements
    
   - ML Pipelines       
      - SPARK-6725  Pipeline persistence - Save/load for ML Pipelines, with 
partial coverage of spark.ml algorithms.
      - SPARK-5565  LDA in ML Pipelines - API for Latent Dirichlet Allocation 
in ML Pipelines
 
   - R API       
      - SPARK-9836  R-like statistics for GLMs - (Partial) R-like stats for 
ordinary least squares via summary(model)
      - SPARK-9681  Feature interactions in R formula - Interaction operator 
":" in R formula
 
   - Python API - Many improvements to Python API to approach feature parity
 
Misc improvements
    
   - SPARK-7685, SPARK-9642  Instance weights for GLMs - Logistic and Linear 
Regression can take instance weights.
   - SPARK-10384, SPARK-10385 Univariate and bivariate statistics in DataFrames 
- Variance, stddev, correlations, etc.
   - SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source (see the 
sketch below).
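
A minimal sketch of reading LIBSVM-formatted data through the new data source
(SPARK-10117); the file path is a placeholder:

    // Produces a DataFrame with "label" (double) and "features" (vector) columns.
    val training = sqlContext.read
      .format("libsvm")
      .load("data/mllib/sample_libsvm_data.txt")

    training.printSchema()
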
Documentation improvements
 
   - SPARK-7751  @since versions - Documentation includes initial version when 
classes and methods were added
   - SPARK-11337 Testable example code - Automated testing for code in user 
guide examples
 
Deprecations
    
   - In spark.mllib.clustering.KMeans, the "runs" parameter has been deprecated.
   - In spark.ml.classification.LogisticRegressionModel and 
spark.ml.regression.LinearRegressionModel, the "weights" field has been 
deprecated in favor of the new name "coefficients" (see the sketch after this 
list). This helps disambiguate from instance (row) weights given to algorithms.
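
A minimal sketch of the rename, using a fitted spark.ml model; the training
DataFrame is assumed to exist:

    import org.apache.spark.ml.classification.LogisticRegression

    val model = new LogisticRegression().fit(training)   // training: DataFrame
    println(model.coefficients)   // preferred in 1.6
    println(model.weights)        // still works, but deprecated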
 
Changes of behavior
    
   - spark.mllib.tree.GradientBoostedTrees validationTol has changed semantics 
in 1.6. Previously, it was a threshold for absolute change in error. Now, it 
resembles the behavior of GradientDescent convergenceTol: For large errors, it 
uses relative error (relative to the previous error); for small errors (< 
0.01), it uses absolute error.
   - spark.ml.feature.RegexTokenizer: Previously, it did not convert strings to 
lowercase before tokenizing. Now, it converts to lowercase by default, with an 
option not to. This matches the behavior of the simpler Tokenizer transformer.
   - Spark SQL's partition discovery has been changed to only discover 
partition directories that are children of the given path. (i.e. if 
path="/my/data/x=1" then x=1 will no longer be considered a partition but only 
children of x=1.) This behavior can be overridden by manually specifying the 
basePath that partitioning discovery should start with (SPARK-11678); see the 
sketch after this list.
   - When casting a value of an integral type to timestamp (e.g. casting a long 
value to timestamp), the value is treated as being in seconds instead of 
milliseconds (SPARK-11724).
   - With the improved query planner for queries having distinct aggregations 
(SPARK-9241), the plan of a query having a single distinct aggregation has been 
changed to a more robust version. To switch back to the plan generated by Spark 
1.5's planner, please set spark.sql.specializeSingleDistinctAggPlanning to true 
(SPARK-12077).
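
A minimal sketch of pinning partition discovery with basePath (SPARK-11678) and
of the new seconds-based integral-to-timestamp cast (SPARK-11724); the paths are
placeholders:

    // Treat /my/data as the root of the partitioned table even though we only
    // load the x=1 subdirectory, so x is still discovered as a column.
    val partitioned = sqlContext.read
      .option("basePath", "/my/data")
      .parquet("/my/data/x=1")

    // Integral-to-timestamp casts are now interpreted as seconds, not ms.
    sqlContext.sql("SELECT CAST(1450310400 AS TIMESTAMP)").show()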
   
 
 

  
