Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

Yin Huai Sat, 12 Dec 2015 12:55:29 -0800

+1

Critical and blocker issues of SQL have been addressed.


On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust <[email protected]>
wrote:

> I'll kick off the voting with a +1.
>
> On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust <[email protected]>
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.6.0!
>>
>> The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.6.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is *v1.6.0-rc2
>> (23f8dfd45187cb8f2216328ab907ddb5fbdffd0b)
>> <https://github.com/apache/spark/tree/v1.6.0-rc2>*
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1169/
>>
>> The test repository (versioned as v1.6.0-rc2) for this release can be
>> found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1168/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/
>>
>> =======================================
>> == How can I help test this release? ==
>> =======================================
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> ================================================
>> == What justifies a -1 vote for this release? ==
>> ================================================
>> This vote is happening towards the end of the 1.6 QA period, so -1 votes
>> should only occur for significant regressions from 1.5. Bugs already
>> present in 1.5, minor regressions, or bugs related to new features will not
>> block this release.
>>
>> ===============================================================
>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>> ===============================================================
>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>> branch-1.6, since documentation will be published separately from the
>> release.
>> 2. New features for non-alpha-modules should target 1.7+.
>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
>> version.
>>
>>
>> ==================================================
>> == Major changes to help you focus your testing ==
>> ==================================================
>>
>> Spark 1.6.0 Preview
>>
>> Notable changes since 1.6 RC1
>>
>> Spark Streaming
>>
>>    - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629>
>>    trackStateByKey has been renamed to mapWithState
>>
>> Spark SQL
>>
>>    - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>>    SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
>>    bugs in eviction of storage memory by execution.
>>    - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
>>    passing null into ScalaUDF
>>
>> Notable Features Since 1.5
>>
>> Spark SQL
>>
>>    - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
>>    Performance - Improve Parquet scan performance when using flat
>>    schemas.
>>    - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
>>    Session Management - Isolated default database (i.e. USE mydb) even on
>>    shared clusters.
>>    - SPARK-9999  <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
>>    API - A type-safe API (similar to RDDs) that performs many operations
>>    on serialized binary data and code generation (i.e. Project Tungsten).
>>    - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
>>    Memory Management - Shared memory for execution and caching instead
>>    of exclusive division of the regions.
>>    - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
>>    Queries on Files - Concise syntax for running SQL queries over files
>>    of any supported format without registering a table.
>>    - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
>>    non-standard JSON files - Added options to read non-standard JSON
>>    files (e.g. single-quotes, unquoted attributes)
>>    - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
>>    Metrics for SQL Execution - Display statistics on a per-operator basis
>>    for memory usage and spilled data size.
>>    - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
>>    (*) expansion for StructTypes - Makes it easier to nest and unnest
>>    arbitrary numbers of columns
>>    - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
>>    SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
>>    Columnar Cache Performance - Significant (up to 14x) speed up when
>>    caching data that contains complex types in DataFrames or SQL.
>>    - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
>>    null-safe joins - Joins using null-safe equality (<=>) will now
>>    execute using SortMergeJoin instead of computing a Cartesian product.
>>    - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
>>    Execution Using Off-Heap Memory - Support for configuring query
>>    execution to occur using off-heap memory to avoid GC overhead
>>    - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
>>    API Avoid Double Filter - When implementing a datasource with filter
>>    pushdown, developers can now tell Spark SQL to avoid double evaluating a
>>    pushed-down filter.
>>    - SPARK-4849  <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
>>    Layout of Cached Data - storing partitioning and ordering schemes in
>>    In-memory table scan, and adding distributeBy and localSort to DF API
>>    - SPARK-9858  <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
>>    query execution - Initial support for automatically selecting the
>>    number of reducers for joins and aggregations.
>>    - SPARK-9241  <https://issues.apache.org/jira/browse/SPARK-9241> Improved
>>    query planner for queries having distinct aggregations - Query plans
>>    of distinct aggregations are more robust when distinct columns have high
>>    cardinality.
>>
>> Spark Streaming
>>
>>    - API Updates
>>       - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629> New
>>       improved state management - mapWithState - a DStream
>>       transformation for stateful stream processing, supersedes
>>       updateStateByKey in functionality and performance.
>>       - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
>>       record deaggregation - Kinesis streams have been upgraded to use
>>       KCL 1.4.0 and support transparent deaggregation of KPL-aggregated
>>       records.
>>       - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
>>       message handler function - Allows an arbitrary function to be
>>       applied to a Kinesis record in the Kinesis receiver to customize
>>       what data is stored in memory.
>>       - SPARK-6328  <https://issues.apache.org/jira/browse/SPARK-6328> Python
>>       Streaming Listener API - Get streaming statistics (scheduling
>>       delays, batch processing times, etc.) in streaming.
>>
>>
>>    - UI Improvements
>>       - Made failures visible in the streaming tab, in the timelines,
>>       batch list, and batch details page.
>>       - Made output operations visible in the streaming tab as progress
>>       bars.
>>
>> MLlib
>>
>> New algorithms/models
>>
>>    - SPARK-8518  <https://issues.apache.org/jira/browse/SPARK-8518> Survival
>>    analysis - Log-linear model for survival analysis
>>    - SPARK-9834  <https://issues.apache.org/jira/browse/SPARK-9834> Normal
>>    equation for least squares - Normal equation solver, providing R-like
>>    model summary statistics
>>    - SPARK-3147  <https://issues.apache.org/jira/browse/SPARK-3147> Online
>>    hypothesis testing - A/B testing in the Spark Streaming framework
>>    - SPARK-9930  <https://issues.apache.org/jira/browse/SPARK-9930> New
>>    feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
>>    transformer
>>    - SPARK-6517  <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
>>    K-Means clustering - Fast top-down clustering variant of K-Means
>>
>> API improvements
>>
>>    - ML Pipelines
>>       - SPARK-6725  <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
>>       persistence - Save/load for ML Pipelines, with partial coverage of
>>       spark.ml algorithms
>>       - SPARK-5565  <https://issues.apache.org/jira/browse/SPARK-5565> LDA
>>       in ML Pipelines - API for Latent Dirichlet Allocation in ML
>>       Pipelines
>>    - R API
>>       - SPARK-9836  <https://issues.apache.org/jira/browse/SPARK-9836> R-like
>>       statistics for GLMs - (Partial) R-like stats for ordinary least
>>       squares via summary(model)
>>       - SPARK-9681  <https://issues.apache.org/jira/browse/SPARK-9681> Feature
>>       interactions in R formula - Interaction operator ":" in R formula
>>    - Python API - Many improvements to Python API to approach feature
>>    parity
>>
>> Misc improvements
>>
>>    - SPARK-7685  <https://issues.apache.org/jira/browse/SPARK-7685>,
>>    SPARK-9642  <https://issues.apache.org/jira/browse/SPARK-9642> Instance
>>    weights for GLMs - Logistic and Linear Regression can take instance
>>    weights
>>    - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>>    SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
>>    and bivariate statistics in DataFrames - Variance, stddev,
>>    correlations, etc.
>>    - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
>>    data source - LIBSVM as a SQL data source
>>
>> Documentation improvements
>>
>>    - SPARK-7751  <https://issues.apache.org/jira/browse/SPARK-7751> @since
>>    versions - Documentation includes initial version when classes and
>>    methods were added
>>    - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
>>    example code - Automated testing for code in user guide examples
>>
>> Deprecations
>>
>>    - In spark.mllib.clustering.KMeans, the "runs" parameter has been
>>    deprecated.
>>    - In spark.ml.classification.LogisticRegressionModel and
>>    spark.ml.regression.LinearRegressionModel, the "weights" field has been
>>    deprecated, in favor of the new name "coefficients." This helps
>>    disambiguate from instance (row) weights given to algorithms.
>>
>> Changes of behavior
>>
>>    - spark.mllib.tree.GradientBoostedTrees validationTol has changed
>>    semantics in 1.6. Previously, it was a threshold for absolute change in
>>    error. Now, it resembles the behavior of GradientDescent convergenceTol:
>>    For large errors, it uses relative error (relative to the previous error);
>>    for small errors (< 0.01), it uses absolute error.
>>    - spark.ml.feature.RegexTokenizer: Previously, it did not convert
>>    strings to lowercase before tokenizing. Now, it converts to lowercase by
>>    default, with an option not to. This matches the behavior of the simpler
>>    Tokenizer transformer.
>>    - Spark SQL's partition discovery has been changed to only discover
>>    partition directories that are children of the given path (e.g. if
>>    path="/my/data/x=1", then x=1 is no longer considered a partition;
>>    only children of x=1 are). This behavior can be overridden by
>>    manually specifying the basePath that partitioning discovery should
>>    start with (SPARK-11678
>>    <https://issues.apache.org/jira/browse/SPARK-11678>).
>>    - When casting a value of an integral type to timestamp (e.g. casting
>>    a long value to timestamp), the value is treated as being in seconds
>>    instead of milliseconds (SPARK-11724
>>    <https://issues.apache.org/jira/browse/SPARK-11724>).
>>    - With the improved query planner for queries having distinct
>>    aggregations (SPARK-9241
>>    <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
>>    query having a single distinct aggregation has been changed to a more
>>    robust version. To switch back to the plan generated by Spark 1.5's
>>    planner, please set spark.sql.specializeSingleDistinctAggPlanning to
>>    true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>
>>    ).
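
To make the validationTol change in the list above concrete, here is a minimal sketch in plain Python. It follows only the wording of the note (relative change for large errors, absolute change for errors below 0.01) and is not taken from the Spark source; the function names and the small_error parameter are hypothetical.

```python
def should_stop_absolute(previous_error: float, current_error: float,
                         validation_tol: float) -> bool:
    """Pre-1.6 wording: stop when the absolute change in error is below the tolerance."""
    return previous_error - current_error < validation_tol


def should_stop_mixed(previous_error: float, current_error: float,
                      validation_tol: float, small_error: float = 0.01) -> bool:
    """1.6 wording: relative change for large errors, absolute change for small ones.

    Hypothetical helper; `small_error` mirrors the 0.01 cut-off named in the note.
    """
    improvement = previous_error - current_error
    if previous_error < small_error:
        # Small errors: compare the absolute change against the tolerance.
        return improvement < validation_tol
    # Large errors: compare the change relative to the previous error.
    return improvement / previous_error < validation_tol


if __name__ == "__main__":
    # Error drops from 100.0 to 99.5 with a tolerance of 0.01.
    print(should_stop_absolute(100.0, 99.5, 0.01))
    print(should_stop_mixed(100.0, 99.5, 0.01))
```

With these inputs the absolute rule keeps training (a change of 0.5 exceeds 0.01), while the mixed rule stops (a relative change of 0.005 is below 0.01), so a tolerance tuned for 1.5 may end training earlier on 1.6 when errors are large.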
>>
>>
>>
>
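
For anyone testing the integral-to-timestamp cast change (SPARK-11724) mentioned in the quoted notes, the following plain-Python sketch shows how far apart the two interpretations of the same number are. It does not call Spark; the helper names and the sample value are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Unix epoch as a timezone-aware datetime.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def as_seconds(value: int) -> datetime:
    """Interpret an integral value as seconds since the epoch (the 1.6 behavior)."""
    return EPOCH + timedelta(seconds=value)


def as_milliseconds(value: int) -> datetime:
    """Interpret an integral value as milliseconds since the epoch (the earlier behavior)."""
    return EPOCH + timedelta(milliseconds=value)


if __name__ == "__main__":
    raw = 1449953729  # sample value, assumed to be a count of seconds
    print(as_seconds(raw).isoformat())
    print(as_milliseconds(raw).isoformat())
```

The same number lands in December 2015 when read as seconds and in January 1970 when read as milliseconds, so workloads that stored milliseconds in long columns would need to divide by 1000 before casting.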
