+1

On Tue, Dec 22, 2015 at 8:10 PM, Denny Lee <denny.g....@gmail.com> wrote:
> +1
>
> On Tue, Dec 22, 2015 at 7:05 PM Aaron Davidson <ilike...@gmail.com> wrote:
>
>> +1
>>
>> On Tue, Dec 22, 2015 at 7:01 PM, Josh Rosen <joshro...@databricks.com> wrote:
>>
>>> +1
>>>
>>> On Tue, Dec 22, 2015 at 7:00 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>>>
>>>> +1
>>>>
>>>> On Wed, Dec 23, 2015 at 7:36 AM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Tue, Dec 22, 2015 at 12:10 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>>>
>>>>>> Please vote on releasing the following candidate as Apache Spark version 1.6.0!
>>>>>>
>>>>>> The vote is open until Friday, December 25, 2015 at 18:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.
>>>>>>
>>>>>> [ ] +1 Release this package as Apache Spark 1.6.0
>>>>>> [ ] -1 Do not release this package because ...
>>>>>>
>>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>>
>>>>>> The tag to be voted on is v1.6.0-rc4 (4062cda3087ae42c6c3cb24508fc1d3a931accdf)
>>>>>> <https://github.com/apache/spark/tree/v1.6.0-rc4>
>>>>>>
>>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-bin/
>>>>>>
>>>>>> Release artifacts are signed with the following key:
>>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>>
>>>>>> The staging repository for this release can be found at:
>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1176/
>>>>>>
>>>>>> The test repository (versioned as v1.6.0-rc4) for this release can be found at:
>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1175/
>>>>>>
>>>>>> The documentation corresponding to this release can be found at:
>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-docs/
>>>>>>
>>>>>> =======================================
>>>>>> == How can I help test this release? ==
>>>>>> =======================================
>>>>>> If you are a Spark user, you can help us test this release by taking an existing Spark workload, running it on this release candidate, and reporting any regressions.
>>>>>>
>>>>>> ================================================
>>>>>> == What justifies a -1 vote for this release? ==
>>>>>> ================================================
>>>>>> This vote is happening towards the end of the 1.6 QA period, so -1 votes should only occur for significant regressions from 1.5. Bugs already present in 1.5, minor regressions, or bugs related to new features will not block this release.
>>>>>>
>>>>>> ===============================================================
>>>>>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>>>>>> ===============================================================
>>>>>> 1. It is OK for documentation patches to target 1.6.0 and still go into branch-1.6, since documentation will be published separately from the release.
>>>>>> 2. New features for non-alpha modules should target 1.7+.
>>>>>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target version.
>>>>>>
>>>>>> ==================================================
>>>>>> == Major changes to help you focus your testing ==
>>>>>> ==================================================
>>>>>>
>>>>>> Notable changes since 1.6 RC3
>>>>>>
>>>>>> - SPARK-12404 - Fix serialization error for Datasets with Timestamps/Arrays/Decimal
>>>>>> - SPARK-12218 - Fix incorrect pushdown of filters to Parquet
>>>>>> - SPARK-12395 - Fix join columns of outer join for DataFrame using
>>>>>> - SPARK-12413 - Fix Mesos HA
>>>>>>
>>>>>> Notable changes since 1.6 RC2
>>>>>>
>>>>>> - SPARK_VERSION has been set correctly
>>>>>> - SPARK-12199 - ML docs are publishing correctly
>>>>>> - SPARK-12345 - Mesos cluster mode has been fixed
>>>>>>
>>>>>> Notable changes since 1.6 RC1
>>>>>>
>>>>>> Spark Streaming
>>>>>>
>>>>>> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> trackStateByKey has been renamed to mapWithState
>>>>>>
>>>>>> Spark SQL
>>>>>>
>>>>>> - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>, SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix bugs in eviction of storage memory by execution.
>>>>>> - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> Correct passing null into ScalaUDF
>>>>>>
>>>>>> Notable Features Since 1.5
>>>>>>
>>>>>> Spark SQL
>>>>>>
>>>>>> - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet Performance - Improve Parquet scan performance when using flat schemas.
>>>>>> - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810> Session Management - Isolated default database (i.e. USE mydb) even on shared clusters.
>>>>>> - SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999> Dataset API - A type-safe API (similar to RDDs) that performs many operations on serialized binary data and uses code generation (i.e. Project Tungsten).
>>>>>> - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified Memory Management - Shared memory for execution and caching instead of exclusive division of the regions.
>>>>>> - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL Queries on Files - Concise syntax for running SQL queries over files of any supported format without registering a table.
>>>>>> - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading non-standard JSON files - Added options to read non-standard JSON files (e.g. single quotes, unquoted attributes).
>>>>>> - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator Metrics for SQL Execution - Display statistics on a per-operator basis for memory usage and spilled data size.
>>>>>> - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star (*) expansion for StructTypes - Makes it easier to nest and unnest arbitrary numbers of columns.
>>>>>> - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>, SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory Columnar Cache Performance - Significant (up to 14x) speedup when caching data that contains complex types in DataFrames or SQL.
>>>>>> - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast null-safe joins - Joins using null-safe equality (<=>) will now execute using SortMergeJoin instead of computing a cartesian product.
>>>>>> - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL Execution Using Off-Heap Memory - Support for configuring query execution to occur using off-heap memory to avoid GC overhead.
>>>>>> - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource API Avoid Double Filter - When implementing a data source with filter pushdown, developers can now tell Spark SQL to avoid double-evaluating a pushed-down filter.
>>>>>> - SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849> Advanced Layout of Cached Data - Storing partitioning and ordering schemes in in-memory table scans, and adding distributeBy and localSort to the DataFrame API.
>>>>>> - SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive query execution - Initial support for automatically selecting the number of reducers for joins and aggregations.
>>>>>> - SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241> Improved query planner for queries having distinct aggregations - Query plans of distinct aggregations are more robust when distinct columns have high cardinality.
>>>>>>
>>>>>> Spark Streaming
>>>>>>
>>>>>> - API Updates
>>>>>>   - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> New improved state management - mapWithState, a DStream transformation for stateful stream processing, supersedes updateStateByKey in functionality and performance.
>>>>>>   - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis record deaggregation - Kinesis streams have been upgraded to use KCL 1.4.0 and support transparent deaggregation of KPL-aggregated records.
>>>>>>   - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis message handler function - Allows an arbitrary function to be applied to a Kinesis record in the Kinesis receiver to customize what data is to be stored in memory.
>>>>>>   - SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328> Python Streaming Listener API - Get streaming statistics (scheduling delays, batch processing times, etc.) in streaming.
>>>>>>
>>>>>> - UI Improvements
>>>>>>   - Made failures visible in the streaming tab, in the timelines, batch list, and batch details page.
>>>>>>   - Made output operations visible in the streaming tab as progress bars.
>>>>>>
>>>>>> MLlib
>>>>>>
>>>>>> New algorithms/models
>>>>>>
>>>>>> - SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518> Survival analysis - Log-linear model for survival analysis
>>>>>> - SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal equation for least squares - Normal equation solver, providing R-like model summary statistics
>>>>>> - SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online hypothesis testing - A/B testing in the Spark Streaming framework
>>>>>> - SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New feature transformers - ChiSqSelector, QuantileDiscretizer, SQL transformer
>>>>>> - SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting K-Means clustering - Fast top-down clustering variant of K-Means
>>>>>>
>>>>>> API improvements
>>>>>>
>>>>>> - ML Pipelines
>>>>>>   - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline persistence - Save/load for ML Pipelines, with partial coverage of spark.ml algorithms
>>>>>>   - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML Pipelines
>>>>>> - R API
>>>>>>   - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like statistics for GLMs - (Partial) R-like stats for ordinary least squares via summary(model)
>>>>>>   - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> Feature interactions in R formula - Interaction operator ":" in R formula
>>>>>> - Python API - Many improvements to the Python API to approach feature parity
>>>>>>
>>>>>> Misc improvements
>>>>>>
>>>>>> - SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>, SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance weights for GLMs - Logistic and Linear Regression can take instance weights
>>>>>> - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>, SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate and bivariate statistics in DataFrames - Variance, stddev, correlations, etc.
>>>>>> - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM data source - LIBSVM as a SQL data source
>>>>>>
>>>>>> Documentation improvements
>>>>>>
>>>>>> - SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since versions - Documentation includes the initial version when classes and methods were added
>>>>>> - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable example code - Automated testing for code in user guide examples
>>>>>>
>>>>>> Deprecations
>>>>>>
>>>>>> - In spark.mllib.clustering.KMeans, the "runs" parameter has been deprecated.
>>>>>> - In spark.ml.classification.LogisticRegressionModel and spark.ml.regression.LinearRegressionModel, the "weights" field has been deprecated in favor of the new name "coefficients." This helps disambiguate it from the instance (row) weights given to algorithms.
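[Editor's note: the "Instance weights for GLMs" item above (SPARK-7685, SPARK-9642) boils down to weighted least squares. A minimal plain-Python sketch of the 1-D weighted case, for intuition only — this is not the spark.ml API:]

```python
def weighted_least_squares(xs, ys, ws):
    """Closed-form slope/intercept for 1-D weighted linear regression.

    Minimizes sum(w_i * (y_i - (slope * x_i + intercept))**2), i.e. each
    row counts in proportion to its instance weight. Plain-Python
    illustration of what instance weights mean; NOT Spark code.
    """
    total_weight = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / total_weight
    ybar = sum(w * y for w, y in zip(ws, ys)) / total_weight
    sxx = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return slope, intercept

# With uniform weights this reduces to ordinary least squares; a weight
# of 2 on a row is equivalent to including that row twice.
print(weighted_least_squares([0, 1, 2, 3], [1, 3, 5, 7], [1, 1, 1, 1]))
```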
>>>>>>
>>>>>> Changes of behavior
>>>>>>
>>>>>> - spark.mllib.tree.GradientBoostedTrees validationTol has changed semantics in 1.6. Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of GradientDescent convergenceTol: for large errors, it uses relative error (relative to the previous error); for small errors (< 0.01), it uses absolute error.
>>>>>> - spark.ml.feature.RegexTokenizer: Previously, it did not convert strings to lowercase before tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the behavior of the simpler Tokenizer transformer.
>>>>>> - Spark SQL's partition discovery has been changed to only discover partition directories that are children of the given path. (i.e. if path="/my/data/x=1", then x=1 will no longer be considered a partition, but only children of x=1 will.) This behavior can be overridden by manually specifying the basePath that partition discovery should start with (SPARK-11678 <https://issues.apache.org/jira/browse/SPARK-11678>).
>>>>>> - When casting a value of an integral type to timestamp (e.g. casting a long value to timestamp), the value is treated as being in seconds instead of milliseconds (SPARK-11724 <https://issues.apache.org/jira/browse/SPARK-11724>).
>>>>>> - With the improved query planner for queries having distinct aggregations (SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5's planner, please set spark.sql.specializeSingleDistinctAggPlanning to true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>).
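[Editor's note: the integral-to-timestamp change (SPARK-11724) is easy to trip over when upgrading. A plain-Python illustration of the arithmetic — the new seconds-based interpretation versus the old milliseconds reading; this is not Spark code:]

```python
from datetime import datetime, timezone

def cast_integral_to_timestamp(value):
    # Spark 1.6 semantics (SPARK-11724): an integral value cast to
    # timestamp is interpreted as SECONDS since the Unix epoch.
    return datetime.fromtimestamp(value, tz=timezone.utc)

def old_milliseconds_reading(value):
    # The pre-1.6 reading treated the same value as milliseconds.
    return datetime.fromtimestamp(value / 1000, tz=timezone.utc)

# One day's worth of seconds lands a full day after the epoch under the
# new reading, but only ~86 seconds after it under the old one.
print(cast_integral_to_timestamp(86400))  # 1970-01-02 00:00:00+00:00
print(old_milliseconds_reading(86400))    # 1970-01-01 00:01:26.400000+00:00
```

So the same stored long now maps to a point on the timeline roughly 1000x further from the epoch; workloads that relied on the millisecond interpretation need to divide by 1000 before casting.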
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards
>>>>
>>>> Jeff Zhang
>>>
>>