(I can't -1 this.) I do agree that docs have been treated as if separate from releases in the past. With more maturity in the release process, I'm questioning that now, as I don't think it's normal. It would be a reason to release or not release this particular tarball, so a vote thread is the right place to discuss it.
I'm surprised you're suggesting there's no coupling between a release's code and the docs for that release. If a release happens and the docs come out some time later, that has some effect on people's usage. Surely the ideal is for the docs for x.y to come from the bits for x.y, and thus to be available at the same time.

Reality is something else, and your argument is a practical one: the release is behind again, so shouldn't we overlook this minor problem to get it out? We agree that this particular problem has to get fixed soon. It's minor by virtue of being, hopefully, temporary. But if it can and will be fixed quickly, what's the hurry? I get it: all else equal, people want a release sooner rather than later, but that is always true.

It'd be nice to talk about what behaviors have led to being behind schedule and to this perceived rush to finish now, since the same thing happened in 1.5 and 1.4. I'd rather at least collect some opinions on it than invalidate the question.

On Sat, Dec 12, 2015 at 11:17 PM, Michael Armbrust <mich...@databricks.com> wrote:

> Sean, if you would like to -1 the release you are certainly entitled to,
> but in the past we have never held a release for documentation only
> issues. If you'd like to change the policy of the project I'm not sure
> that a voting thread is the right place to do it.
>
> I think the right question here, is "How are users going to be affected by
> this temporary issue?". Given that I'm pretty certain that no users build
> the documentation from the release themselves and instead consume it from
> the published documentation, the docs contained in the release seem less
> important as far as voting on the artifacts is concerned.
>
> In contrast, there have been several threads on the users list asking when
> the release is going to happen. Should we make them wait longer for
> something that isn't going to affect their usage of the release? I would
> vote no. That doesn't mean that we shouldn't fix the documentation issue.
> It just means we shouldn't add unnecessary coupling where it has no benefit.
>
> On Sat, Dec 12, 2015 at 1:50 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> I've heard this argument before, but don't quite get it. Documentation is
>> part of a release, and I believe is something we're voting on here too, and
>> therefore needs to 'work' as documentation. We could not release this HTML
>> to the Apache site, so I think that does actually mean the artifacts
>> including docs don't work as a release.
>>
>> Yes, I can see that the non-code artifacts can be released a little bit
>> after the code artifacts with last minute fixes. But, the whole release can
>> just happen later too. Why wouldn't this be a valid reason to block the
>> release?
>>
>> On Sat, Dec 12, 2015 at 6:31 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>
>>> Thanks Ben, but as I said in the first email, docs are published
>>> separately from the release, so this isn't a valid reason to down vote the
>>> RC. We just provide them to help with testing.
>>>
>>> I'll ask the mllib guys to take a look at that patch though.
>>> On Dec 12, 2015 9:44 AM, "Benjamin Fradet" <benjamin.fra...@gmail.com> wrote:
>>>
>>>> -1
>>>>
>>>> For me the docs are not displaying except for the first page, for example
>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/mllib-guide.html
>>>> is a blank page.
>>>> This is because of SPARK-12199
>>>> <https://github.com/apache/spark/pull/10193>:
>>>> Element[W|w]iseProductExample.scala is not the same in the docs and
>>>> the actual file name.
>>>>
>>>> On Sat, Dec 12, 2015 at 6:39 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>>
>>>>> I'll kick off the voting with a +1.
>>>>>
>>>>> On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust <mich...@databricks.com> wrote:
>>>>>
>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>> version 1.6.0!
>>>>>>
>>>>>> The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and
>>>>>> passes if a majority of at least 3 +1 PMC votes are cast.
>>>>>>
>>>>>> [ ] +1 Release this package as Apache Spark 1.6.0
>>>>>> [ ] -1 Do not release this package because ...
>>>>>>
>>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>>
>>>>>> The tag to be voted on is *v1.6.0-rc2
>>>>>> (23f8dfd45187cb8f2216328ab907ddb5fbdffd0b)
>>>>>> <https://github.com/apache/spark/tree/v1.6.0-rc2>*
>>>>>>
>>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-bin/
>>>>>>
>>>>>> Release artifacts are signed with the following key:
>>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>>
>>>>>> The staging repository for this release can be found at:
>>>>>>
>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1169/
>>>>>>
>>>>>> The test repository (versioned as v1.6.0-rc2) for this release can be
>>>>>> found at:
>>>>>>
>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1168/
>>>>>>
>>>>>> The documentation corresponding to this release can be found at:
>>>>>>
>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/
>>>>>>
>>>>>> =======================================
>>>>>> == How can I help test this release? ==
>>>>>> =======================================
>>>>>> If you are a Spark user, you can help us test this release by taking
>>>>>> an existing Spark workload and running on this release candidate, then
>>>>>> reporting any regressions.
>>>>>>
>>>>>> ================================================
>>>>>> == What justifies a -1 vote for this release? ==
>>>>>> ================================================
>>>>>> This vote is happening towards the end of the 1.6 QA period, so -1
>>>>>> votes should only occur for significant regressions from 1.5.
>>>>>> Bugs already present in 1.5, minor regressions, or bugs related to new
>>>>>> features will not block this release.
>>>>>>
>>>>>> ===============================================================
>>>>>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>>>>>> ===============================================================
>>>>>> 1. It is OK for documentation patches to target 1.6.0 and still go
>>>>>> into branch-1.6, since documentation will be published separately
>>>>>> from the release.
>>>>>> 2. New features for non-alpha-modules should target 1.7+.
>>>>>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>>>>>> target version.
>>>>>>
>>>>>>
>>>>>> ==================================================
>>>>>> == Major changes to help you focus your testing ==
>>>>>> ==================================================
>>>>>>
>>>>>> Spark 1.6.0 Preview
>>>>>>
>>>>>> Notable changes since 1.6 RC1
>>>>>>
>>>>>> Spark Streaming
>>>>>>
>>>>>> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
>>>>>>   trackStateByKey has been renamed to mapWithState
>>>>>>
>>>>>> Spark SQL
>>>>>>
>>>>>> - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>>>>>>   SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189>
>>>>>>   Fix bugs in eviction of storage memory by execution.
>>>>>> - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258>
>>>>>>   correct passing null into ScalaUDF
>>>>>>
>>>>>> Notable Features Since 1.5
>>>>>>
>>>>>> Spark SQL
>>>>>>
>>>>>> - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787>
>>>>>>   Parquet Performance - Improve Parquet scan performance when using
>>>>>>   flat schemas.
>>>>>> - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
>>>>>>   Session Management - Isolated default database (i.e. USE mydb)
>>>>>>   even on shared clusters.
>>>>>> - SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999>
>>>>>>   Dataset API - A type-safe API (similar to RDDs) that performs many
>>>>>>   operations on serialized binary data and code generation (i.e.
>>>>>>   Project Tungsten).
>>>>>> - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000>
>>>>>>   Unified Memory Management - Shared memory for execution and caching
>>>>>>   instead of exclusive division of the regions.
>>>>>> - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197>
>>>>>>   SQL Queries on Files - Concise syntax for running SQL queries over
>>>>>>   files of any supported format without registering a table.
>>>>>> - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745>
>>>>>>   Reading non-standard JSON files - Added options to read non-standard
>>>>>>   JSON files (e.g. single-quotes, unquoted attributes)
>>>>>> - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412>
>>>>>>   Per-operator Metrics for SQL Execution - Display statistics on a
>>>>>>   per-operator basis for memory usage and spilled data size.
>>>>>> - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329>
>>>>>>   Star (*) expansion for StructTypes - Makes it easier to nest and
>>>>>>   unnest arbitrary numbers of columns
>>>>>> - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
>>>>>>   SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149>
>>>>>>   In-memory Columnar Cache Performance - Significant (up to 14x) speed
>>>>>>   up when caching data that contains complex types in DataFrames or SQL.
>>>>>> - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111>
>>>>>>   Fast null-safe joins - Joins using null-safe equality (<=>) will now
>>>>>>   execute using SortMergeJoin instead of computing a Cartesian product.
>>>>>> - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389>
>>>>>>   SQL Execution Using Off-Heap Memory - Support for configuring query
>>>>>>   execution to occur using off-heap memory to avoid GC overhead
>>>>>> - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978>
>>>>>>   Datasource API Avoid Double Filter - When implementing a datasource
>>>>>>   with filter pushdown, developers can now tell Spark SQL to avoid
>>>>>>   double evaluating a pushed-down filter.
>>>>>> - SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849>
>>>>>>   Advanced Layout of Cached Data - storing partitioning and ordering
>>>>>>   schemes in In-memory table scan, and adding distributeBy and
>>>>>>   localSort to DF API
>>>>>> - SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858>
>>>>>>   Adaptive query execution - Initial support for automatically
>>>>>>   selecting the number of reducers for joins and aggregations.
>>>>>> - SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241>
>>>>>>   Improved query planner for queries having distinct aggregations -
>>>>>>   Query plans of distinct aggregations are more robust when distinct
>>>>>>   columns have high cardinality.
>>>>>>
>>>>>> Spark Streaming
>>>>>>
>>>>>> - API Updates
>>>>>>   - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
>>>>>>     New improved state management - mapWithState - a DStream
>>>>>>     transformation for stateful stream processing, supersedes
>>>>>>     updateStateByKey in functionality and performance.
>>>>>>   - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198>
>>>>>>     Kinesis record deaggregation - Kinesis streams have been upgraded
>>>>>>     to use KCL 1.4.0 and support transparent deaggregation of
>>>>>>     KPL-aggregated records.
>>>>>>   - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891>
>>>>>>     Kinesis message handler function - Allows an arbitrary function to
>>>>>>     be applied to a Kinesis record in the Kinesis receiver to customize
>>>>>>     what data is to be stored in memory.
>>>>>>   - SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328>
>>>>>>     Python Streaming Listener API - Get streaming statistics
>>>>>>     (scheduling delays, batch processing times, etc.) in streaming.
>>>>>>
>>>>>>
>>>>>> - UI Improvements
>>>>>>   - Made failures visible in the streaming tab, in the timelines,
>>>>>>     batch list, and batch details page.
>>>>>>   - Made output operations visible in the streaming tab as progress
>>>>>>     bars.
>>>>>>
>>>>>> MLlib
>>>>>>
>>>>>> New algorithms/models
>>>>>>
>>>>>> - SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518>
>>>>>>   Survival analysis - Log-linear model for survival analysis
>>>>>> - SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834>
>>>>>>   Normal equation for least squares - Normal equation solver,
>>>>>>   providing R-like model summary statistics
>>>>>> - SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147>
>>>>>>   Online hypothesis testing - A/B testing in the Spark Streaming
>>>>>>   framework
>>>>>> - SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930>
>>>>>>   New feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
>>>>>>   transformer
>>>>>> - SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517>
>>>>>>   Bisecting K-Means clustering - Fast top-down clustering variant of
>>>>>>   K-Means
>>>>>>
>>>>>> API improvements
>>>>>>
>>>>>> - ML Pipelines
>>>>>>   - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725>
>>>>>>     Pipeline persistence - Save/load for ML Pipelines, with partial
>>>>>>     coverage of spark.ml algorithms
>>>>>>   - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565>
>>>>>>     LDA in ML Pipelines - API for Latent Dirichlet Allocation in
>>>>>>     ML Pipelines
>>>>>> - R API
>>>>>>   - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836>
>>>>>>     R-like statistics for GLMs - (Partial) R-like stats for ordinary
>>>>>>     least squares via summary(model)
>>>>>>   - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681>
>>>>>>     Feature interactions in R formula - Interaction operator ":" in R
>>>>>>     formula
>>>>>> - Python API - Many improvements to Python API to approach feature
>>>>>>   parity
>>>>>>
>>>>>> Misc improvements
>>>>>>
>>>>>> - SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
>>>>>>   SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642>
>>>>>>   Instance weights for GLMs - Logistic and Linear Regression can take
>>>>>>   instance weights
>>>>>> - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>>>>>>   SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385>
>>>>>>   Univariate and bivariate statistics in DataFrames - Variance,
>>>>>>   stddev, correlations, etc.
>>>>>> - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117>
>>>>>>   LIBSVM data source - LIBSVM as a SQL data source
>>>>>>
>>>>>> Documentation improvements
>>>>>>
>>>>>> - SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751>
>>>>>>   @since versions - Documentation includes initial version when
>>>>>>   classes and methods were added
>>>>>> - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337>
>>>>>>   Testable example code - Automated testing for code in user guide
>>>>>>   examples
>>>>>>
>>>>>> Deprecations
>>>>>>
>>>>>> - In spark.mllib.clustering.KMeans, the "runs" parameter has been
>>>>>>   deprecated.
>>>>>> - In spark.ml.classification.LogisticRegressionModel and
>>>>>>   spark.ml.regression.LinearRegressionModel, the "weights" field has
>>>>>>   been deprecated, in favor of the new name "coefficients." This helps
>>>>>>   disambiguate from instance (row) weights given to algorithms.
>>>>>>
>>>>>> Changes of behavior
>>>>>>
>>>>>> - spark.mllib.tree.GradientBoostedTrees validationTol has changed
>>>>>>   semantics in 1.6. Previously, it was a threshold for absolute change
>>>>>>   in error. Now, it resembles the behavior of GradientDescent
>>>>>>   convergenceTol: For large errors, it uses relative error (relative to
>>>>>>   the previous error); for small errors (< 0.01), it uses absolute
>>>>>>   error.
>>>>>> - spark.ml.feature.RegexTokenizer: Previously, it did not convert
>>>>>>   strings to lowercase before tokenizing. Now, it converts to lowercase
>>>>>>   by default, with an option not to. This matches the behavior of the
>>>>>>   simpler Tokenizer transformer.
>>>>>> - Spark SQL's partition discovery has been changed to only
>>>>>>   discover partition directories that are children of the given path.
>>>>>>   (i.e. if path="/my/data/x=1" then x=1 will no longer be considered a
>>>>>>   partition but only children of x=1.) This behavior can be
>>>>>>   overridden by manually specifying the basePath that partitioning
>>>>>>   discovery should start with (SPARK-11678
>>>>>>   <https://issues.apache.org/jira/browse/SPARK-11678>).
>>>>>> - When casting a value of an integral type to timestamp (e.g.
>>>>>>   casting a long value to timestamp), the value is treated as being in
>>>>>>   seconds instead of milliseconds (SPARK-11724
>>>>>>   <https://issues.apache.org/jira/browse/SPARK-11724>).
>>>>>> - With the improved query planner for queries having distinct
>>>>>>   aggregations (SPARK-9241
>>>>>>   <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of
>>>>>>   a query having a single distinct aggregation has been changed to a
>>>>>>   more robust version. To switch back to the plan generated by Spark
>>>>>>   1.5's planner, please set
>>>>>>   spark.sql.specializeSingleDistinctAggPlanning to true (SPARK-12077
>>>>>>   <https://issues.apache.org/jira/browse/SPARK-12077>).
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Ben Fradet.
>>>>
>>>
>>
>
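
For anyone testing the RC against an existing workload: the integral-to-timestamp change in the quoted notes (SPARK-11724) is an easy one to trip over, so here is a minimal sketch of the difference between the two readings. This is plain Python for the arithmetic only, not Spark code; the function names and the sample value are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def as_timestamp_seconds(value: int) -> datetime:
    """Read an integral value as seconds since the Unix epoch (the 1.6 reading)."""
    return EPOCH + timedelta(seconds=value)


def as_timestamp_millis(value: int) -> datetime:
    """Read the same value as milliseconds since the epoch (the earlier reading)."""
    return EPOCH + timedelta(milliseconds=value)


# Hypothetical long value, as it might sit in a column that gets cast.
raw = 1449964800

print(as_timestamp_seconds(raw).isoformat())  # 2015-12-13T00:00:00+00:00
print(as_timestamp_millis(raw).isoformat())   # 1970-01-17T18:46:04.800000+00:00
```

The same stored number lands almost 46 years apart depending on the unit, so a workload that casts longs to timestamps would be worth a look when checking for regressions.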