Sean, if you would like to -1 the release you are certainly entitled to,
but in the past we have never held a release for documentation-only
issues. If you'd like to change the project's policy, I'm not sure that a
voting thread is the right place to do it.
I think the right question here is, "How are users going to be affected by
this temporary issue?" Given that I'm pretty certain no users build the
documentation from the release themselves, and instead consume it from the
published documentation, the docs contained in the release seem less
important as far as voting on the artifacts is concerned. In contrast,
there have been several threads on the users list asking when the release
is going to happen. Should we make them wait longer for something that
isn't going to affect their usage of the release? I would vote no. That
doesn't mean that we shouldn't fix the documentation issue. It just means
we shouldn't add unnecessary coupling where it has no benefit.

On Sat, Dec 12, 2015 at 1:50 PM, Sean Owen <so...@cloudera.com> wrote:

> I've heard this argument before, but don't quite get it. Documentation
> is part of a release, and I believe it is something we're voting on here
> too, and therefore needs to 'work' as documentation. We could not
> release this HTML to the Apache site, so I think that does actually mean
> the artifacts, including docs, don't work as a release.
>
> Yes, I can see that the non-code artifacts can be released a little bit
> after the code artifacts, with last-minute fixes. But the whole release
> can just happen later too. Why wouldn't this be a valid reason to block
> the release?
>
> On Sat, Dec 12, 2015 at 6:31 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Thanks Ben, but as I said in the first email, docs are published
>> separately from the release, so this isn't a valid reason to down-vote
>> the RC. We just provide them to help with testing.
>>
>> I'll ask the mllib guys to take a look at that patch, though.
>>
>> On Dec 12, 2015 9:44 AM, "Benjamin Fradet" <benjamin.fra...@gmail.com>
>> wrote:
>>
>>> -1
>>>
>>> For me the docs are not displaying except for the first page; for
>>> example,
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/mllib-guide.html
>>> is a blank page.
>>> This is because of SPARK-12199
>>> <https://github.com/apache/spark/pull/10193>:
>>> Element[W|w]iseProductExample.scala is not the same in the docs and the
>>> actual file name.
>>>
>>> On Sat, Dec 12, 2015 at 6:39 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>>> I'll kick off the voting with a +1.
>>>>
>>>> On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust <
>>>> mich...@databricks.com> wrote:
>>>>
>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>> version 1.6.0!
>>>>>
>>>>> The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and
>>>>> passes if a majority of at least 3 +1 PMC votes are cast.
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 1.6.0
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>
>>>>> The tag to be voted on is v1.6.0-rc2
>>>>> (23f8dfd45187cb8f2216328ab907ddb5fbdffd0b)
>>>>> <https://github.com/apache/spark/tree/v1.6.0-rc2>
>>>>>
>>>>> The release files, including signatures, digests, etc.,
>>>>> can be found at:
>>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-bin/
>>>>>
>>>>> Release artifacts are signed with the following key:
>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1169/
>>>>>
>>>>> The test repository (versioned as v1.6.0-rc2) for this release can be
>>>>> found at:
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1168/
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/
>>>>>
>>>>> =======================================
>>>>> == How can I help test this release? ==
>>>>> =======================================
>>>>> If you are a Spark user, you can help us test this release by taking
>>>>> an existing Spark workload, running it on this release candidate, and
>>>>> then reporting any regressions.
>>>>>
>>>>> ================================================
>>>>> == What justifies a -1 vote for this release? ==
>>>>> ================================================
>>>>> This vote is happening towards the end of the 1.6 QA period, so -1
>>>>> votes should only occur for significant regressions from 1.5. Bugs
>>>>> already present in 1.5, minor regressions, or bugs related to new
>>>>> features will not block this release.
>>>>>
>>>>> ===============================================================
>>>>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>>>>> ===============================================================
>>>>> 1. It is OK for documentation patches to target 1.6.0 and still go
>>>>> into branch-1.6, since documentation will be published separately
>>>>> from the release.
>>>>> 2. New features for non-alpha modules should target 1.7+.
>>>>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>>>>> target version.
>>>>>
>>>>> ==================================================
>>>>> == Major changes to help you focus your testing ==
>>>>> ==================================================
>>>>>
>>>>> Spark 1.6.0 Preview
>>>>>
>>>>> Notable changes since 1.6 RC1
>>>>>
>>>>> Spark Streaming
>>>>>
>>>>>    - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
>>>>>    trackStateByKey has been renamed to mapWithState
>>>>>
>>>>> Spark SQL
>>>>>
>>>>>    - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>>>>>    SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189>
>>>>>    Fix bugs in the eviction of storage memory by execution.
>>>>>    - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258>
>>>>>    Correct passing null into ScalaUDF
>>>>>
>>>>> Notable Features Since 1.5
>>>>>
>>>>> Spark SQL
>>>>>
>>>>>    - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787>
>>>>>    Parquet Performance - Improve Parquet scan performance when using
>>>>>    flat schemas.
>>>>>    - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
>>>>>    Session Management - Isolated default database (i.e., USE mydb)
>>>>>    even on shared clusters.
>>>>>    - SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999>
>>>>>    Dataset API - A type-safe API (similar to RDDs) that performs many
>>>>>    operations on serialized binary data and uses code generation
>>>>>    (i.e., Project Tungsten); a usage sketch appears at the end of
>>>>>    this thread.
>>>>>    - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000>
>>>>>    Unified Memory Management - Shared memory for execution and
>>>>>    caching instead of exclusive division of the regions.
>>>>>    - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197>
>>>>>    SQL Queries on Files - Concise syntax for running SQL queries over
>>>>>    files of any supported format without registering a table; a usage
>>>>>    sketch appears at the end of this thread.
>>>>>    - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745>
>>>>>    Reading non-standard JSON files - Added options to read
>>>>>    non-standard JSON files (e.g., single quotes, unquoted
>>>>>    attributes); a usage sketch appears at the end of this thread.
>>>>>    - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412>
>>>>>    Per-operator Metrics for SQL Execution - Display statistics on a
>>>>>    per-operator basis for memory usage and spilled data size.
>>>>>    - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329>
>>>>>    Star (*) expansion for StructTypes - Makes it easier to nest and
>>>>>    unnest arbitrary numbers of columns.
>>>>>    - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
>>>>>    SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149>
>>>>>    In-memory Columnar Cache Performance - Significant (up to 14x)
>>>>>    speedup when caching data that contains complex types in
>>>>>    DataFrames or SQL.
>>>>>    - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111>
>>>>>    Fast null-safe joins - Joins using null-safe equality (<=>) will
>>>>>    now execute using SortMergeJoin instead of computing a cartesian
>>>>>    product.
>>>>>    - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389>
>>>>>    SQL Execution Using Off-Heap Memory - Support for configuring
>>>>>    query execution to occur using off-heap memory to avoid GC
>>>>>    overhead.
>>>>>    - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978>
>>>>>    Datasource API Avoid Double Filter - When implementing a
>>>>>    datasource with filter pushdown, developers can now tell Spark SQL
>>>>>    to avoid double-evaluating a pushed-down filter.
>>>>>    - SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849>
>>>>>    Advanced Layout of Cached Data - Storing partitioning and ordering
>>>>>    schemes in the in-memory table scan, and adding distributeBy and
>>>>>    localSort to the DataFrame API.
>>>>>    - SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858>
>>>>>    Adaptive query execution - Initial support for automatically
>>>>>    selecting the number of reducers for joins and aggregations.
>>>>>    - SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241>
>>>>>    Improved query planner for queries having distinct aggregations -
>>>>>    Query plans of distinct aggregations are more robust when distinct
>>>>>    columns have high cardinality.
>>>>>
>>>>> Spark Streaming
>>>>>
>>>>>    - API Updates
>>>>>       - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
>>>>>       New improved state management - mapWithState - a DStream
>>>>>       transformation for stateful stream processing that supersedes
>>>>>       updateStateByKey in functionality and performance; a usage
>>>>>       sketch appears at the end of this thread.
>>>>>       - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198>
>>>>>       Kinesis record deaggregation - Kinesis streams have been
>>>>>       upgraded to use KCL 1.4.0 and support transparent deaggregation
>>>>>       of KPL-aggregated records.
>>>>>       - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891>
>>>>>       Kinesis message handler function - Allows an arbitrary function
>>>>>       to be applied to a Kinesis record in the Kinesis receiver, to
>>>>>       customize what data is to be stored in memory.
>>>>>       - SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328>
>>>>>       Python Streaming Listener API - Get streaming statistics
>>>>>       (scheduling delays, batch processing times, etc.) in streaming.
>>>>>
>>>>>    - UI Improvements
>>>>>       - Made failures visible in the streaming tab, in the timelines,
>>>>>       batch list, and batch details page.
>>>>>       - Made output operations visible in the streaming tab as
>>>>>       progress bars.
>>>>>
>>>>> MLlib
>>>>>
>>>>> New algorithms/models
>>>>>
>>>>>    - SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518>
>>>>>    Survival analysis - Log-linear model for survival analysis
>>>>>    - SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834>
>>>>>    Normal equation for least squares - Normal equation solver,
>>>>>    providing R-like model summary statistics
>>>>>    - SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147>
>>>>>    Online hypothesis testing - A/B testing in the Spark Streaming
>>>>>    framework
>>>>>    - SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930>
>>>>>    New feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
>>>>>    transformer
>>>>>    - SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517>
>>>>>    Bisecting K-Means clustering - Fast top-down clustering variant of
>>>>>    K-Means
>>>>>
>>>>> API improvements
>>>>>
>>>>>    - ML Pipelines
>>>>>       - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725>
>>>>>       Pipeline persistence - Save/load for ML Pipelines, with partial
>>>>>       coverage of spark.ml algorithms
>>>>>       - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565>
>>>>>       LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML
>>>>>       Pipelines
>>>>>    - R API
>>>>>       - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836>
>>>>>       R-like statistics for GLMs - (Partial) R-like stats for
>>>>>       ordinary least squares via summary(model)
>>>>>       - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681>
>>>>>       Feature interactions in R formula - Interaction operator ":" in
>>>>>       R formula
>>>>>    - Python API - Many improvements to the Python API to approach
>>>>>    feature parity
>>>>>
>>>>> Misc improvements
>>>>>
>>>>>    - SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
>>>>>    SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642>
>>>>>    Instance weights for GLMs - Logistic and Linear Regression can
>>>>>    take instance weights
>>>>>    - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>>>>>    SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385>
>>>>>    Univariate and bivariate statistics in DataFrames - Variance,
>>>>>    stddev, correlations, etc.
>>>>>    - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117>
>>>>>    LIBSVM data source - LIBSVM as a SQL data source
>>>>>
>>>>> Documentation improvements
>>>>>
>>>>>    - SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751>
>>>>>    @since versions - Documentation includes the initial version when
>>>>>    classes and methods were added
>>>>>    - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337>
>>>>>    Testable example code - Automated testing for code in user guide
>>>>>    examples
>>>>>
>>>>> Deprecations
>>>>>
>>>>>    - In spark.mllib.clustering.KMeans, the "runs" parameter has been
>>>>>    deprecated.
>>>>>    - In spark.ml.classification.LogisticRegressionModel and
>>>>>    spark.ml.regression.LinearRegressionModel, the "weights" field has
>>>>>    been deprecated in favor of the new name "coefficients". This
>>>>>    helps disambiguate them from instance (row) weights given to
>>>>>    algorithms.
>>>>>
>>>>> Changes of behavior
>>>>>
>>>>>    - spark.mllib.tree.GradientBoostedTrees validationTol has changed
>>>>>    semantics in 1.6. Previously, it was a threshold for absolute
>>>>>    change in error. Now, it resembles the behavior of GradientDescent
>>>>>    convergenceTol: for large errors, it uses relative error (relative
>>>>>    to the previous error); for small errors (< 0.01), it uses
>>>>>    absolute error.
>>>>>    - spark.ml.feature.RegexTokenizer: Previously, it did not convert
>>>>>    strings to lowercase before tokenizing. Now, it converts to
>>>>>    lowercase by default, with an option not to. This matches the
>>>>>    behavior of the simpler Tokenizer transformer.
>>>>>    - Spark SQL's partition discovery has been changed to only
>>>>>    discover partition directories that are children of the given
>>>>>    path. (I.e., if path="/my/data/x=1", then x=1 will no longer be
>>>>>    considered a partition; only children of x=1 will.) This behavior
>>>>>    can be overridden by manually specifying the basePath that
>>>>>    partition discovery should start with (SPARK-11678
>>>>>    <https://issues.apache.org/jira/browse/SPARK-11678>); a usage
>>>>>    sketch appears at the end of this thread.
>>>>>    - When casting a value of an integral type to timestamp (e.g.,
>>>>>    casting a long value to timestamp), the value is treated as being
>>>>>    in seconds instead of milliseconds (SPARK-11724
>>>>>    <https://issues.apache.org/jira/browse/SPARK-11724>); a usage
>>>>>    sketch appears at the end of this thread.
>>>>>    - With the improved query planner for queries having distinct
>>>>>    aggregations (SPARK-9241
>>>>>    <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
>>>>>    query having a single distinct aggregation has been changed to a
>>>>>    more robust version. To switch back to the plan generated by Spark
>>>>>    1.5's planner, please set
>>>>>    spark.sql.specializeSingleDistinctAggPlanning to true (SPARK-12077
>>>>>    <https://issues.apache.org/jira/browse/SPARK-12077>).
>>>
>>> --
>>> Ben Fradet.
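
For anyone planning to help test the RC, a few illustrative snippets
follow. All class names, paths, and literal values in them are made up,
not taken from the release notes. First, a minimal sketch of the new
mapWithState API (SPARK-2629), here doing a running count per key:

    import org.apache.spark.streaming.{State, StateSpec}
    import org.apache.spark.streaming.dstream.DStream

    // Add each new value to the stored state and emit the updated
    // (key, total) pair.
    val trackingFunc = (key: String, value: Option[Int], state: State[Int]) => {
      val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)
      (key, sum)
    }

    // `pairs` is assumed to be an existing DStream[(String, Int)].
    def runningCounts(pairs: DStream[(String, Int)]): DStream[(String, Int)] =
      pairs.mapWithState(StateSpec.function(trackingFunc))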
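
A quick way to exercise the new Dataset API (SPARK-9999) from the
spark-shell, where sqlContext is predefined; the Person class and values
are hypothetical:

    // Hypothetical record type for the example.
    case class Person(name: String, age: Long)

    import sqlContext.implicits._

    // Build a Dataset from a local collection.
    val people = Seq(Person("Ann", 34), Person("Bob", 19)).toDS()

    // Typed, RDD-like transformations that operate on the encoded
    // (Tungsten binary) representation.
    people.filter(_.age >= 21).map(_.name).show()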
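
The concise SQL-on-files syntax (SPARK-11197) can be smoke-tested like
this; the path and column name are hypothetical:

    // Query a Parquet file directly, without registering a temp table.
    val clicks = sqlContext.sql(
      "SELECT * FROM parquet.`/tmp/events.parquet` WHERE event = 'click'")
    clicks.show()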
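
For reading non-standard JSON (SPARK-11745), something like the sketch
below should work against a file that uses single quotes and unquoted
field names. The option names are my best recollection of what the reader
accepts, so double-check them against the docs; the path is hypothetical:

    // Read JSON records such as {name: 'Ann', age: 34}.
    val df = sqlContext.read
      .option("allowSingleQuotes", "true")
      .option("allowUnquotedFieldNames", "true")
      .json("/tmp/nonstandard.json")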
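
To see the partition-discovery behavior change (SPARK-11678) in action:
reading a partition directory directly no longer yields the partition
column unless a basePath is supplied. Paths are hypothetical:

    // In 1.6, no `x` column is discovered when reading the leaf directory.
    val leaf = sqlContext.read.parquet("/my/data/x=1")

    // Supplying basePath restores discovery of `x` as a partition column.
    val partitioned = sqlContext.read
      .option("basePath", "/my/data")
      .parquet("/my/data/x=1")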
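
And the integral-to-timestamp cast change (SPARK-11724) is easy to
eyeball; the literal below is arbitrary:

    // 1450000000 is now interpreted as seconds since the epoch (a date in
    // December 2015); the old millisecond interpretation would have
    // produced a timestamp in January 1970.
    sqlContext.sql("SELECT CAST(1450000000 AS TIMESTAMP)").show()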