Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

Sean Owen Sat, 12 Dec 2015 13:51:30 -0800

I've heard this argument before, but don't quite get it. Documentation is
part of a release, and I believe is something we're voting on here too, and
therefore needs to 'work' as documentation. We could not release this HTML
to the Apache site, so I think that does actually mean the artifacts
including docs don't work as a release.


Yes, I can see that the non-code artifacts can be released a little bit
after the code artifacts with last minute fixes. But, the whole release can
just happen later too. Why wouldn't this be a valid reason to block the
release?

On Sat, Dec 12, 2015 at 6:31 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> Thanks Ben, but as I said in the first email, docs are published
> separately from the release, so this isn't a valid reason to down vote the
> RC.  We just provide them to help with testing.
>
> I'll ask the mllib guys to take a look at that patch though.
> On Dec 12, 2015 9:44 AM, "Benjamin Fradet" <benjamin.fra...@gmail.com>
> wrote:
>
>> -1
>>
>> For me the docs are not displaying except for the first page, for example
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/mllib-guide.html
>> is a blank page.
>> This is because of SPARK-12199
>> <https://github.com/apache/spark/pull/10193>:
>> Element[W|w]iseProductExample.scala is not the same in the docs and the
>> actual file name.
>>
>> On Sat, Dec 12, 2015 at 6:39 PM, Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>>> I'll kick off the voting with a +1.
>>>
>>> On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>>> Please vote on releasing the following candidate as Apache Spark
>>>> version 1.6.0!
>>>>
>>>> The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and
>>>> passes if a majority of at least 3 +1 PMC votes are cast.
>>>>
>>>> [ ] +1 Release this package as Apache Spark 1.6.0
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>
>>>> The tag to be voted on is *v1.6.0-rc2
>>>> (23f8dfd45187cb8f2216328ab907ddb5fbdffd0b)
>>>> <https://github.com/apache/spark/tree/v1.6.0-rc2>*
>>>>
>>>> The release files, including signatures, digests, etc. can be found at:
>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-bin/
>>>>
>>>> Release artifacts are signed with the following key:
>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>
>>>> The staging repository for this release can be found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1169/
>>>>
>>>> The test repository (versioned as v1.6.0-rc2) for this release can be
>>>> found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1168/
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/
>>>>
>>>> =======================================
>>>> == How can I help test this release? ==
>>>> =======================================
>>>> If you are a Spark user, you can help us test this release by taking an
>>>> existing Spark workload and running on this release candidate, then
>>>> reporting any regressions.
>>>>
>>>> ================================================
>>>> == What justifies a -1 vote for this release? ==
>>>> ================================================
>>>> This vote is happening towards the end of the 1.6 QA period, so -1
>>>> votes should only occur for significant regressions from 1.5. Bugs already
>>>> present in 1.5, minor regressions, or bugs related to new features will not
>>>> block this release.
>>>>
>>>> ===============================================================
>>>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>>>> ===============================================================
>>>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>>>> branch-1.6, since documentation will be published separately from the
>>>> release.
>>>> 2. New features for non-alpha-modules should target 1.7+.
>>>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>>>> target version.
>>>>
>>>>
>>>> ==================================================
>>>> == Major changes to help you focus your testing ==
>>>> ==================================================
>>>>
>>>> Spark 1.6.0 Preview
>>>>
>>>> Notable changes since 1.6 RC1
>>>>
>>>> Spark Streaming
>>>>
>>>>    - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629>
>>>>    trackStateByKey has been renamed to mapWithState
>>>>
>>>> Spark SQL
>>>>
>>>>    - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>>>>    SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
>>>>    bugs in eviction of storage memory by execution.
>>>>    - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258>
>>>>    correct passing null into ScalaUDF
>>>>
>>>> Notable Features Since 1.5
>>>>
>>>> Spark SQL
>>>>
>>>>    - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787>
>>>>    Parquet Performance - Improve Parquet scan performance when using
>>>>    flat schemas.
>>>>    - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
>>>>    Session Management - Isolated default database (i.e. USE mydb) even
>>>>    on shared clusters.
>>>>    - SPARK-9999  <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
>>>>    API - A type-safe API (similar to RDDs) that performs many
>>>>    operations on serialized binary data and code generation (i.e. Project
>>>>    Tungsten).
>>>>    - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000>
>>>>    Unified Memory Management - Shared memory for execution and caching
>>>>    instead of exclusive division of the regions.
>>>>    - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
>>>>    Queries on Files - Concise syntax for running SQL queries over
>>>>    files of any supported format without registering a table.
>>>>    - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745>
>>>>    Reading non-standard JSON files - Added options to read non-standard
>>>>    JSON files (e.g. single-quotes, unquoted attributes)
>>>>    - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412>
>>>>    Per-operator Metrics for SQL Execution - Display statistics on a
>>>>    per-operator basis for memory usage and spilled data size.
>>>>    - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
>>>>    (*) expansion for StructTypes - Makes it easier to nest and unnest
>>>>    arbitrary numbers of columns
>>>>    - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
>>>>    SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149>
>>>>    In-memory Columnar Cache Performance - Significant (up to 14x) speed
>>>>    up when caching data that contains complex types in DataFrames or SQL.
>>>>    - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
>>>>    null-safe joins - Joins using null-safe equality (<=>) will now
>>>>    execute using SortMergeJoin instead of computing a Cartesian product.
>>>>    - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
>>>>    Execution Using Off-Heap Memory - Support for configuring query
>>>>    execution to occur using off-heap memory to avoid GC overhead
>>>>    - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978>
>>>>    Datasource API Avoid Double Filter - When implementing a datasource
>>>>    with filter pushdown, developers can now tell Spark SQL to avoid
>>>>    double evaluating a pushed-down filter.
>>>>    - SPARK-4849  <https://issues.apache.org/jira/browse/SPARK-4849>
>>>>    Advanced Layout of Cached Data - storing partitioning and ordering
>>>>    schemes in In-memory table scan, and adding distributeBy and localSort
>>>>    to DF API
>>>>    - SPARK-9858  <https://issues.apache.org/jira/browse/SPARK-9858>
>>>>    Adaptive query execution - Initial support for automatically selecting
>>>>    the number of reducers for joins and aggregations.
>>>>    - SPARK-9241  <https://issues.apache.org/jira/browse/SPARK-9241>
>>>>    Improved query planner for queries having distinct aggregations -
>>>>    Query plans of distinct aggregations are more robust when distinct
>>>>    columns have high cardinality.
>>>>
>>>> Spark Streaming
>>>>
>>>>    - API Updates
>>>>       - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629> New
>>>>       improved state management - mapWithState - a DStream
>>>>       transformation for stateful stream processing, supersedes
>>>>       updateStateByKey in functionality and performance.
>>>>       - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198>
>>>>       Kinesis record deaggregation - Kinesis streams have been
>>>>       upgraded to use KCL 1.4.0 and support transparent deaggregation of
>>>>       KPL-aggregated records.
>>>>       - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891>
>>>>       Kinesis message handler function - Allows an arbitrary function
>>>>       to be applied to a Kinesis record in the Kinesis receiver to
>>>>       customize what data is to be stored in memory.
>>>>       - SPARK-6328  <https://issues.apache.org/jira/browse/SPARK-6328>
>>>>       Python Streaming Listener API - Get streaming statistics (scheduling
>>>>       delays, batch processing times, etc.) in streaming.
>>>>
>>>>
>>>>    - UI Improvements
>>>>       - Made failures visible in the streaming tab, in the timelines,
>>>>       batch list, and batch details page.
>>>>       - Made output operations visible in the streaming tab as
>>>>       progress bars.
>>>>
>>>> MLlib
>>>>
>>>> New algorithms/models
>>>>
>>>>    - SPARK-8518  <https://issues.apache.org/jira/browse/SPARK-8518>
>>>>    Survival analysis - Log-linear model for survival analysis
>>>>    - SPARK-9834  <https://issues.apache.org/jira/browse/SPARK-9834> Normal
>>>>    equation for least squares - Normal equation solver, providing
>>>>    R-like model summary statistics
>>>>    - SPARK-3147  <https://issues.apache.org/jira/browse/SPARK-3147> Online
>>>>    hypothesis testing - A/B testing in the Spark Streaming framework
>>>>    - SPARK-9930  <https://issues.apache.org/jira/browse/SPARK-9930> New
>>>>    feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
>>>>    transformer
>>>>    - SPARK-6517  <https://issues.apache.org/jira/browse/SPARK-6517>
>>>>    Bisecting K-Means clustering - Fast top-down clustering variant of
>>>>    K-Means
>>>>
>>>> API improvements
>>>>
>>>>    - ML Pipelines
>>>>       - SPARK-6725  <https://issues.apache.org/jira/browse/SPARK-6725>
>>>>       Pipeline persistence - Save/load for ML Pipelines, with partial
>>>>       coverage of spark.ml algorithms
>>>>       - SPARK-5565  <https://issues.apache.org/jira/browse/SPARK-5565> LDA
>>>>       in ML Pipelines - API for Latent Dirichlet Allocation in ML
>>>>       Pipelines
>>>>    - R API
>>>>       - SPARK-9836  <https://issues.apache.org/jira/browse/SPARK-9836>
>>>>       R-like statistics for GLMs - (Partial) R-like stats for ordinary
>>>>       least squares via summary(model)
>>>>       - SPARK-9681  <https://issues.apache.org/jira/browse/SPARK-9681>
>>>>       Feature interactions in R formula - Interaction operator ":" in R
>>>>       formula
>>>>    - Python API - Many improvements to Python API to approach feature
>>>>    parity
>>>>
>>>> Misc improvements
>>>>
>>>>    - SPARK-7685  <https://issues.apache.org/jira/browse/SPARK-7685>,
>>>>    SPARK-9642  <https://issues.apache.org/jira/browse/SPARK-9642> Instance
>>>>    weights for GLMs - Logistic and Linear Regression can take instance
>>>>    weights
>>>>    - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>>>>    SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385>
>>>>    Univariate and bivariate statistics in DataFrames - Variance, stddev,
>>>>    correlations, etc.
>>>>    - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
>>>>    data source - LIBSVM as a SQL data source
>>>>
>>>> Documentation improvements
>>>>
>>>>    - SPARK-7751  <https://issues.apache.org/jira/browse/SPARK-7751> @since
>>>>    versions - Documentation includes initial version when classes and
>>>>    methods were added
>>>>    - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337>
>>>>    Testable example code - Automated testing for code in user guide
>>>>    examples
>>>>
>>>> Deprecations
>>>>
>>>>    - In spark.mllib.clustering.KMeans, the "runs" parameter has been
>>>>    deprecated.
>>>>    - In spark.ml.classification.LogisticRegressionModel and
>>>>    spark.ml.regression.LinearRegressionModel, the "weights" field has been
>>>>    deprecated, in favor of the new name "coefficients." This helps
>>>>    disambiguate from instance (row) weights given to algorithms.
>>>>
>>>> Changes of behavior
>>>>
>>>>    - spark.mllib.tree.GradientBoostedTrees validationTol has changed
>>>>    semantics in 1.6. Previously, it was a threshold for absolute change in
>>>>    error. Now, it resembles the behavior of GradientDescent convergenceTol:
>>>>    For large errors, it uses relative error (relative to the previous
>>>>    error); for small errors (< 0.01), it uses absolute error.
>>>>    - spark.ml.feature.RegexTokenizer: Previously, it did not convert
>>>>    strings to lowercase before tokenizing. Now, it converts to lowercase by
>>>>    default, with an option not to. This matches the behavior of the simpler
>>>>    Tokenizer transformer.
>>>>    - Spark SQL's partition discovery has been changed to only discover
>>>>    partition directories that are children of the given path. (i.e. if
>>>>    path="/my/data/x=1" then x=1 will no longer be considered a
>>>>    partition but only children of x=1.) This behavior can be
>>>>    overridden by manually specifying the basePath that partitioning
>>>>    discovery should start with (SPARK-11678
>>>>    <https://issues.apache.org/jira/browse/SPARK-11678>).
>>>>    - When casting a value of an integral type to timestamp (e.g.
>>>>    casting a long value to timestamp), the value is treated as being in
>>>>    seconds instead of milliseconds (SPARK-11724
>>>>    <https://issues.apache.org/jira/browse/SPARK-11724>).
>>>>    - With the improved query planner for queries having distinct
>>>>    aggregations (SPARK-9241
>>>>    <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
>>>>    query having a single distinct aggregation has been changed to a more
>>>>    robust version. To switch back to the plan generated by Spark 1.5's
>>>>    planner, please set spark.sql.specializeSingleDistinctAggPlanning
>>>>     to true (SPARK-12077
>>>>    <https://issues.apache.org/jira/browse/SPARK-12077>).
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Ben Fradet.
>>
>
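On the GradientBoostedTrees validationTol change in the quoted notes: a minimal sketch, in plain Python rather than Spark code, of the stopping rule as that bullet describes it. The function and argument names and the `small_error` parameter are illustrative assumptions, not Spark's API. Only the 0.01 cutoff and the relative/absolute distinction come from the notes, and scaling the tolerance by max(previous error, 0.01) is just one way to realize them.

```python
def should_stop_old(prev_error, curr_error, validation_tol):
    """Pre-1.6 reading of the notes: stop when the absolute improvement
    in validation error falls below the tolerance."""
    return prev_error - curr_error < validation_tol


def should_stop_new(prev_error, curr_error, validation_tol, small_error=0.01):
    """1.6 reading of the notes: the tolerance is relative to the previous
    error when that error is large, and effectively absolute (scaled by the
    0.01 floor) when the error is small. The max() scaling is an assumption."""
    scale = max(prev_error, small_error)
    return prev_error - curr_error < validation_tol * scale


# Same tolerance, large errors: an improvement of 0.5 on an error of 100
# counts as converged only under the relative rule.
print(should_stop_old(100.0, 99.5, 0.01))
print(should_stop_new(100.0, 99.5, 0.01))
```

A model tuned with a given validationTol in 1.5 may therefore stop at a different iteration in 1.6, which is worth checking when running an existing workload on the RC.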

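On the integral-to-timestamp cast change (SPARK-11724) in the quoted notes: a small sketch, again in plain Python and not Spark code, of how far apart the two readings of the same long value are. The helper names are hypothetical. The notes only say that the value is now treated as seconds instead of milliseconds.

```python
from datetime import datetime, timedelta, timezone

# Unix epoch in UTC; both helpers below are illustrative, not Spark API.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def as_timestamp_millis(value):
    """Interpret an integer as milliseconds since the epoch
    (the reading the notes imply for earlier releases)."""
    return EPOCH + timedelta(milliseconds=value)


def as_timestamp_seconds(value):
    """Interpret an integer as seconds since the epoch
    (the reading the notes describe for 1.6)."""
    return EPOCH + timedelta(seconds=value)


value = 86400
print(as_timestamp_millis(value))   # under 90 seconds after the epoch
print(as_timestamp_seconds(value))  # exactly one day after the epoch
```

Workloads that cast millisecond longs to timestamp would need to divide by 1000 first to keep the same results under the new reading.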