+1 (non-binding)
Tested with different samples.
Regards,
JB
Sent from my Samsung device
-------- Original message --------
From: Michael Armbrust <[email protected]>
Date: 12/12/2015 18:39 (GMT+01:00)
To: [email protected]
Subject: [VOTE] Release Apache Spark 1.6.0 (RC2)
Please vote on releasing the following candidate as Apache Spark version 1.6.0!
The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and passes if a
majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.6.0
[ ] -1 Do not release this package because ...
To learn more about Apache Spark, please see http://spark.apache.org/
The tag to be voted on is v1.6.0-rc2 (23f8dfd45187cb8f2216328ab907ddb5fbdffd0b)
The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1169/

The test repository (versioned as v1.6.0-rc2) for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1168/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/
=========================================
How can I help test this release?
=========================================
If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running it on this release candidate, then
reporting any regressions. One way to do this is sketched below.
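For example, an sbt-based workload can be built against the RC by adding the
staging repository above as a resolver (a sketch only; adjust the modules to
whatever your workload actually uses):

    resolvers += "Apache Spark 1.6.0 RC2 staging" at
      "https://repository.apache.org/content/repositories/orgapachespark-1169/"

    // The RC artifacts are staged under the final version number.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"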
==================================================
What justifies a -1 vote for this release?
==================================================
This vote is happening towards the end of the 1.6 QA period, so -1 votes
should only occur for significant regressions from 1.5. Bugs already present
in 1.5, minor regressions, or bugs related to new features will not block
this release.
=================================================================
What should happen to JIRA tickets still targeting 1.6.0?
=================================================================
1. It is OK for documentation patches to target 1.6.0 and still go into
   branch-1.6, since documentation will be published separately from the
   release.
2. New features for non-alpha modules should target 1.7+.
3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
   version.
====================================================
Major changes to help you focus your testing
====================================================
Spark 1.6.0 Preview

Notable changes since 1.6 RC1

Spark Streaming
- SPARK-2629 trackStateByKey has been renamed to mapWithState

Spark SQL
- SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by execution.
- SPARK-12258 Correct passing null into ScalaUDF
Notable Features Since 1.5

Spark SQL
- SPARK-11787 Parquet Performance - Improve Parquet scan performance when
  using flat schemas.
- SPARK-10810 Session Management - Isolated default database (i.e. USE mydb)
  even on shared clusters.
- SPARK-9999 Dataset API - A type-safe API (similar to RDDs) that performs
  many operations on serialized binary data and code generation (i.e. Project
  Tungsten).
- SPARK-10000 Unified Memory Management - Shared memory for execution and
  caching instead of exclusive division of the regions.
- SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries
  over files of any supported format without registering a table (see the
  sketch after this list).
- SPARK-11745 Reading non-standard JSON files - Added options to read
  non-standard JSON files (e.g. single quotes, unquoted attributes).
- SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics on
  a per-operator basis for memory usage and spilled data size.
- SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest
  and unnest arbitrary numbers of columns.
- SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance - Significant
  (up to 14x) speed up when caching data that contains complex types in
  DataFrames or SQL.
- SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>)
  will now execute using SortMergeJoin instead of computing a cartesian
  product.
- SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring
  query execution to occur using off-heap memory to avoid GC overhead.
- SPARK-10978 Datasource API Avoid Double Filter - When implementing a
  datasource with filter pushdown, developers can now tell Spark SQL to avoid
  double evaluating a pushed-down filter.
- SPARK-4849 Advanced Layout of Cached Data - Storing partitioning and
  ordering schemes in in-memory table scan, and adding distributeBy and
  localSort to the DF API.
- SPARK-9858 Adaptive query execution - Initial support for automatically
  selecting the number of reducers for joins and aggregations.
- SPARK-9241 Improved query planner for queries having distinct aggregations
  - Query plans of distinct aggregations are more robust when distinct
  columns have high cardinality.
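A quick illustration of two of the SQL features above, written as a
spark-shell sketch (the parquet path, temporary table names, and the id
column are placeholders; sqlContext is the one spark-shell provides):

    // SPARK-11197: run SQL directly over a file, without registering a table first.
    val events = sqlContext.sql("SELECT * FROM parquet.`/tmp/events.parquet`")

    // SPARK-11111: joins on null-safe equality (<=>) now use SortMergeJoin
    // instead of computing a cartesian product.
    events.registerTempTable("a")
    events.registerTempTable("b")
    sqlContext.sql("SELECT * FROM a JOIN b ON a.id <=> b.id").show()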
Spark Streaming

API Updates
- SPARK-2629 New improved state management - mapWithState, a DStream
  transformation for stateful stream processing, supersedes updateStateByKey
  in functionality and performance (see the sketch after this list).
- SPARK-11198 Kinesis record deaggregation - Kinesis streams have been
  upgraded to use KCL 1.4.0 and support transparent deaggregation of
  KPL-aggregated records.
- SPARK-10891 Kinesis message handler function - Allows an arbitrary function
  to be applied to a Kinesis record in the Kinesis receiver, to customize
  what data is stored in memory.
- SPARK-6328 Python Streaming Listener API - Get streaming statistics
  (scheduling delays, batch processing times, etc.) in streaming.
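To make the mapWithState change concrete, here is a minimal sketch of a
running word count with the new API (the socket source, host/port, and
checkpoint directory are placeholders, not part of the release notes):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

    object MapWithStateSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("mapWithState-sketch"), Seconds(1))
        ssc.checkpoint("/tmp/checkpoint") // stateful transformations need a checkpoint dir

        // (word, 1) pairs from a text socket
        val pairs = ssc.socketTextStream("localhost", 9999)
          .flatMap(_.split(" "))
          .map((_, 1))

        // Running count per word, kept in State[Int]; emits (word, newCount)
        val spec = StateSpec.function(
          (word: String, one: Option[Int], state: State[Int]) => {
            val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
            state.update(sum)
            (word, sum)
          })

        pairs.mapWithState(spec).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }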
UI Improvements
- Made failures visible in the streaming tab, in the timelines, batch list,
  and batch details page.
- Made output operations visible in the streaming tab as progress bars.

MLlib

New algorithms/models
- SPARK-8518 Survival analysis - Log-linear model for survival analysis
- SPARK-9834 Normal equation for least squares - Normal equation solver,
  providing R-like model summary statistics
- SPARK-3147 Online hypothesis testing - A/B testing in the Spark Streaming
  framework
- SPARK-9930 New feature transformers - ChiSqSelector, QuantileDiscretizer,
  SQL transformer
- SPARK-6517 Bisecting K-Means clustering - Fast top-down clustering variant
  of K-Means
API improvements

ML Pipelines
- SPARK-6725 Pipeline persistence - Save/load for ML Pipelines, with partial
  coverage of spark.ml algorithms (see the sketch after this list).
- SPARK-5565 LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML
  Pipelines
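As a rough spark-shell sketch of what pipeline persistence looks like (the
stages and path below are illustrative only, and save/load coverage of
spark.ml algorithms is partial in 1.6):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // SPARK-6725: persist the pipeline definition and load it back later
    pipeline.save("/tmp/lr-pipeline")
    val restored = Pipeline.load("/tmp/lr-pipeline")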
R API
- SPARK-9836 R-like statistics for GLMs - (Partial) R-like stats for ordinary
  least squares via summary(model)
- SPARK-9681 Feature interactions in R formula - Interaction operator ":" in
  R formula

Python API
- Many improvements to the Python API to approach feature parity

Misc improvements
- SPARK-7685, SPARK-9642 Instance weights for GLMs - Logistic and Linear
  Regression can take instance weights
- SPARK-10384, SPARK-10385 Univariate and bivariate statistics in DataFrames
  - Variance, stddev, correlations, etc.
- SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source

Documentation improvements
- SPARK-7751 @since versions - Documentation includes the initial version in
  which classes and methods were added
- SPARK-11337 Testable example code - Automated testing for code in user
  guide examples

Deprecations
- In spark.mllib.clustering.KMeans, the "runs" parameter has been deprecated.
- In spark.ml.classification.LogisticRegressionModel and
  spark.ml.regression.LinearRegressionModel, the "weights" field has been
  deprecated in favor of the new name "coefficients". This helps disambiguate
  from instance (row) weights given to algorithms (see the sketch after this
  list).
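For example (a sketch only; lrModel stands for an already fitted spark.ml
LogisticRegressionModel):

    // Prefer the new name:
    val coeffs = lrModel.coefficients
    // lrModel.weights still works in 1.6 but is deprecated.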
Changes of behavior
- spark.mllib.tree.GradientBoostedTrees validationTol has changed semantics
  in 1.6. Previously, it was a threshold for absolute change in error. Now,
  it resembles the behavior of GradientDescent convergenceTol: for large
  errors, it uses relative error (relative to the previous error); for small
  errors (< 0.01), it uses absolute error.
- spark.ml.feature.RegexTokenizer: Previously, it did not convert strings to
  lowercase before tokenizing. Now, it converts to lowercase by default, with
  an option not to. This matches the behavior of the simpler Tokenizer
  transformer.
- Spark SQL's partition discovery has been changed to only discover partition
  directories that are children of the given path (i.e. if
  path="/my/data/x=1", then x=1 will no longer be considered a partition;
  only children of x=1 will be). This behavior can be overridden by manually
  specifying the basePath that partition discovery should start with
  (SPARK-11678; see the sketch after this list).
- When casting a value of an integral type to timestamp (e.g. casting a long
  value to timestamp), the value is treated as being in seconds instead of
  milliseconds (SPARK-11724).
- With the improved query planner for queries having distinct aggregations
  (SPARK-9241), the plan of a query having a single distinct aggregation has
  been changed to a more robust version. To switch back to the plan generated
  by Spark 1.5's planner, set spark.sql.specializeSingleDistinctAggPlanning
  to true (SPARK-12077).
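A short spark-shell style sketch of the Spark SQL behavior changes above (the
paths and the literal value are placeholders; sqlContext is the one
spark-shell provides):

    // SPARK-11678: partition discovery now starts at the given path; pass basePath
    // explicitly if directories above it should be treated as partition columns.
    val df = sqlContext.read
      .option("basePath", "/my/data")
      .parquet("/my/data/x=1")

    // SPARK-11724: integral values cast to timestamp are now read as seconds.
    sqlContext.sql("SELECT CAST(1450000000 AS TIMESTAMP)").show()

    // SPARK-12077: switch back to the Spark 1.5 plan for single distinct aggregations.
    sqlContext.setConf("spark.sql.specializeSingleDistinctAggPlanning", "true")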