Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-06 Thread Dong Joon Hyun
+1 (non-binding)

I built and tested on CentOS 7.3.1611 / OpenJDK 1.8.131 / R 3.3.3
with "-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver -Psparkr".
Java/Scala/R tests passed as expected.

There are two minor things.


  1.  For the deprecation documentation issue
(https://github.com/apache/spark/pull/18207),
I hope it goes into the `Release Notes` instead of blocking the current vote.

Something like `http://spark.apache.org/releases/spark-release-2-1-0.html`.


  2.  A 3rd-party test suite may fail due to the following difference.
Previously, up to Spark 2.1.1, the count was '1'.
This is tracked as https://issues.apache.org/jira/browse/SPARK-20954 .

scala> sql("create table t(a int)")
res0: org.apache.spark.sql.DataFrame = []

scala> sql("desc table t").show
+----------+---------+-------+
|  col_name|data_type|comment|
+----------+---------+-------+
|# col_name|data_type|comment|
|         a|      int|   null|
+----------+---------+-------+

scala> sql("desc table t").count
res2: Long = 2
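
A 3rd-party suite that wants a stable column count could skip the header
rows instead of pinning the total. A minimal spark-shell sketch (the filter
condition is just one illustrative way to do it, not an official workaround):

scala> sql("desc table t").filter(!$"col_name".startsWith("#")).count
res3: Long = 1

This returns 1 on both 2.1.1 and 2.2.0-rc4, since 2.1.1 emits no header row
and 2.2.0's "# col_name" row is filtered out.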

Best,
Dongjoon.




From: Michael Armbrust 
Date: Monday, June 5, 2017 at 12:14 PM
To: "dev@spark.apache.org" 
Subject: [VOTE] Apache Spark 2.2.0 (RC4)

Please vote on releasing the following candidate as Apache Spark version 2.2.0. 
The vote is open until Thurs, June 8th, 2017 at 12:00 PST and passes if a 
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.2.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.2.0-rc4
(377cfa8ac7ff7a8a6a6d273182e18ea7dc25ce7e).

The list of JIRA tickets resolved can be found with this filter.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1241/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-docs/


FAQ

How can I help test this release?

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions.

What should happen to JIRA tickets still targeting 2.2.0?

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked on 
immediately. Everything else please retarget to 2.3.0 or 2.2.1.

But my bug isn't fixed!??!

In order to make timely releases, we will typically not hold the release unless 
the bug in question is a regression from 2.1.1.


Re: a stage can belong to more than one job please?

2017-06-06 Thread 萝卜丝炒饭
Hi Mark,


Thanks.






 
---Original---
From: "Mark Hamstra"
Date: 2017/6/6 23:27:43
To: "dev";
Cc: "user";
Subject: Re: a stage can belong to more than one job please?


Yes, a Stage can be part of more than one Job. The jobIds field of Stage is 
used repeatedly in the DAGScheduler.
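
For context, the relevant part of org.apache.spark.scheduler.Stage in the
2.x source looks roughly like this (an abbreviated paraphrase for
illustration, not the complete class):

import scala.collection.mutable.HashSet
import org.apache.spark.rdd.RDD
import org.apache.spark.util.CallSite

abstract class Stage(
    val id: Int,
    val rdd: RDD[_],
    val numTasks: Int,
    val parents: List[Stage],
    val firstJobId: Int,   // only the *first* job this stage was part of
    val callSite: CallSite) {

  /** Set of jobs that this stage belongs to; a shared stage collects them all. */
  val jobIds = new HashSet[Int]
}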

On Tue, Jun 6, 2017 at 5:04 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
Hi all,

I was reading some of the Spark code about stages.

The constructor of Stage keeps the first job ID the stage was part of.
Does that mean a stage can belong to more than one job? And I find the
member jobIds is never used. It looks strange.


Thanks in advance

Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-06-06 Thread Zoltan Ivanfi
Hi Michael,

To answer this I think we should distinguish between the long-term fix and
the short-term fix.

If I understand the replies correctly, everyone agrees that the desired
long-term fix is to have two separate SQL types (TIMESTAMP [WITH|WITHOUT]
TIME ZONE). With separate types, mixing them as you described cannot
happen (unless a new feature intentionally allows it). Of course,
conversions are still needed, but there are many examples from different
database systems that we can follow.

Since having two separate types is a huge effort, for a short term solution
I would suggest allowing the single existing TIMESTAMP type to allow both
semantics, configurable per table. The implementation of timezone-agnostic
semantics could be similar to Hive. In Hive, just like in Spark, a
timestamp is UTC-normalized internally but it is shown as a local time when
it gets displayed. To achieve timezone-agnostic behavior, Hive still uses
UTC-based timestamps in memory and adjusts on-disk data to/from this
internal representation if needed. When the on-disk data is UTC-normalized
as well, it matches this internal representation, so the on-disk value
directly corresponds to the UTC instant of the in-memory representation.

When the on-disk data is supposed to have timezone-agnostic semantics, the
on-disk value is made to match the local time value of the in-memory
timestamp, so the value that ultimately gets displayed to the user has
timezone-agnostic semantics (although the corresponding UTC value will be
different depending on the local time zone). So instead of implementing a
separate in-memory representation for timezone-agnostic timestamps, the
desired on-disk semantics are simulated on top of the existing
representation. Timestamps are adjusted during reading/writing as needed.
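
To make the adjustment concrete, here is an illustrative java.time sketch
(my own example for this thread, not Spark or Hive API) of the read-side
conversion; it assumes non-negative microseconds for simplicity:

import java.time.{Instant, LocalDateTime, ZoneId, ZoneOffset}

// Reinterpret an on-disk timezone-agnostic value (stored as if it were UTC
// micros) as the UTC instant that *displays* as that wall-clock time in the
// session time zone.
def readTimezoneAgnostic(onDiskMicros: Long, sessionZone: ZoneId): Instant = {
  val wallClock = LocalDateTime.ofEpochSecond(
    onDiskMicros / 1000000L,
    ((onDiskMicros % 1000000L) * 1000L).toInt,
    ZoneOffset.UTC)
  // During the DST "spring forward" gap this local time does not exist;
  // atZone() silently shifts it, which is the ~0.01% caveat described below.
  wallClock.atZone(sessionZone).toInstant
}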

Implementing this workaround takes a lot less effort and simplifies some
scenarios as well. For example, the situation that you described (a union of
two queries returning timestamps of different semantics) does not have to
be handled explicitly, since the in-memory representations are the same,
including their interpretation. Semantics only matter when reading/writing
timestamps from/to disk.

A disadvantage of this workaround is that it is not perfect. In most time
zones, there is an hour skipped by the DST change every year.
Timezone-agnostic timestamps from that single hour cannot be emulated this
way, because they are invalid in the local timezone, so there is no UTC
instant that would ultimately get displayed as the desired timestamp. But
that only affects ~0.01% of all timestamps, and adopting this workaround
would allow interoperability with 99.99% of the timezone-agnostic timestamps
written by Impala and Hive, instead of the current situation in which 0% of
these timestamps are interpreted correctly.

Please let me know if some parts of my description were unclear and I will
gladly elaborate on them.

Thanks,

Zoltan

On Fri, Jun 2, 2017 at 9:41 PM Michael Allman  wrote:

> Hi Zoltan,
>
> I don't fully understand your proposal for table-specific timestamp type
> semantics. I think it will be helpful to everyone in this conversation if
> you can identify the expected behavior for a few concrete scenarios.
>
> Suppose we have a Hive metastore table hivelogs with a column named ts
> with the hive timestamp type as described here:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-timestamp.
> This table was created by Hive and is usually accessed through Hive or
> Presto.
>
> Suppose again we have a Hive metastore table sparklogs with a column named
> ts with the Spark SQL timestamp type as described here:
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.TimestampType$.
> This table was created by Spark SQL and is usually accessed through Spark
> SQL.
>
> Let's say Spark SQL sets and reads a table property called
> timestamp_interp to determine timestamp type semantics for that table.
> Consider a dataframe df defined by sql("SELECT sts as ts FROM sparklogs
> UNION ALL SELECT hts as ts FROM hivelogs"). Suppose the timestamp_interp
> table property is absent from hivelogs. For each possible value of
> timestamp_interp set on the table sparklogs,
>
> 1. does df successfully pass analysis (i.e. is it a valid query)?
> 2. if it's a valid dataframe, what is the type of the ts column?
> 3. if it's a valid dataframe, what are the semantics of the type of the ts
> column?
>
> Suppose further that Spark SQL sets the timestamp_interp on hivelogs. Can
> you answer the same three questions for each combination of
> timestamp_interp on hivelogs and sparklogs?
>
> Thank you.
>
> Michael
>
>
> On Jun 2, 2017, at 8:33 AM, Zoltan Ivanfi  wrote:
>
> Hi,
>
> We would like to solve the problem of interoperability of existing data,
> and that is the main use case for having table-level control. Spark should
> be able to read timestamps written by Impala or Hive and at 

[build system] RISELab is @ the spark summit, come say hi!

2017-06-06 Thread shane knapp
we've got a booth in the expo center, feel free to stop by, say hi and
get some stickers!

(complaining about jenkins is also welcome, and i will happily join in!)

:)

shane (formerly amplab, now riselab)




Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-06 Thread Holden Karau
+1. pip install into a local virtualenv works; no local version string
(which was blocking the PyPI upload).


On Tue, Jun 6, 2017 at 8:03 AM, Felix Cheung 
wrote:

> All tasks on the R QA umbrella (SPARK-20512) are completed.
>
> We can close this.
>
>
>
> _
> From: Sean Owen 
> Sent: Tuesday, June 6, 2017 1:16 AM
> Subject: Re: [VOTE] Apache Spark 2.2.0 (RC4)
> To: Michael Armbrust 
> Cc: 
>
>
>
> On Tue, Jun 6, 2017 at 1:06 AM Michael Armbrust 
> wrote:
>
>> Regarding the readiness of this and previous RCs.  I did cut RC1 & RC2
>> knowing that they were unlikely to pass.  That said, I still think these
>> early RCs are valuable. I know several users that wanted to test new
>> features in 2.2 that have used them.  Now, if we would prefer to call them
>> preview or RC0 or something I'd be okay with that as well.
>>
>
> They are valuable; I only suggest it's better to note explicitly when
> there are blockers or must-do tasks that will fail the RC. It makes a big
> difference to whether one would like to +1.
>
> I meant more than just calling them something different. An early RC could
> be voted as a released 'preview' artifact, at the start of the notional QA
> period, with a lower bar to passing, and releasable with known issues. This
> encourages more testing. It also resolves the controversy about whether
> it's OK to include an RC in a product (separate thread).
>
>
> Regarding doc updates, I don't think it is a requirement that they be
>> voted on as part of the release.  Even if they are something version
>> specific.  I think we have regularly updated the website with documentation
>> that was merged after the release.
>>
>
> They're part of the source release too, as markdown, and should be voted
> on. I've never understood otherwise. Have we actually released docs and
> then later changed them, so that they don't match the release? I don't
> recall that, but I do recall updating the non-version-specific website.
>
> Aside from the oddity of having docs generated from x.y source not match
> docs published for x.y, you want the same protections for doc source that
> the project distributes as anything else. It's not just correctness, but
> liability. The hypothetical is always that someone included copyrighted
> text or something without permission and now the project can't rely on the
> argument that it made a good-faith effort to review what it released on the
> site. Someone becomes personally liable.
>
> These are pretty technical reasons though. More practically, what's the
> hurry to release if docs aren't done (_if_ they're not done)? It's being
> presented as normal practice, but seems quite exceptional.
>
>
>
>> I personally don't think the QA umbrella JIRAs are particularly
>> effective, but I also wouldn't ban their use if others think they are.
>> However, I do think that real QA needs an RC to test, so I think it is fine
>> that there is still outstanding QA to be done when an RC is cut.  For
>> example, I plan to run a bunch of streaming workloads on RC4 and will vote
>> accordingly.
>>
>
> QA on RCs is great (see above). The problem is, I can't distinguish
> between a JIRA that means "we must test in general", which sounds like
> something you too would ignore, and one that means "there is specific
> functionality we have to check before a release that we haven't looked at
> yet", which is a committer waving a flag that they implicitly do not want a
> release until resolved. I wouldn't +1 a release that had a Blocker software
> defect one of us reported.
>
> I know I'm harping on this, but this is the one mechanism we do use
> consistently (Blocker JIRAs) to clearly communicate about issues vital to a
> go / no-go release decision, and I think this interferes. The rest of JIRA
> noise doesn't matter much. You can see we're already resorting to secondary
> communications as a result ("anyone have any issues that need to be fixed
> before I cut another RC?" emails) because this is kind of ignored, and I
> think we're swapping out a decent mechanism for a worse one.
>
> I suspect, as you do, that there's no to-do here in which case they should
> be resolved and we're still on track for release. I'd wait on +1 until then.
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: a stage can belong to more than one job please?

2017-06-06 Thread Mark Hamstra
Yes, a Stage can be part of more than one Job. The jobIds field of Stage is
used repeatedly in the DAGScheduler.

On Tue, Jun 6, 2017 at 5:04 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:

> Hi all,
>
> I was reading some of the Spark code about stages.
>
> The constructor of Stage keeps the first job ID the stage was part of.
> Does that mean a stage can belong to more than one job?
> And I find the member jobIds is never used. It looks strange.
>
>
> Thanks in advance
>


Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-06 Thread Felix Cheung
All tasks on the R QA umbrella (SPARK-20512) are completed.

We can close this.



_
From: Sean Owen <so...@cloudera.com>
Sent: Tuesday, June 6, 2017 1:16 AM
Subject: Re: [VOTE] Apache Spark 2.2.0 (RC4)
To: Michael Armbrust <mich...@databricks.com>
Cc: <dev@spark.apache.org>


On Tue, Jun 6, 2017 at 1:06 AM Michael Armbrust <mich...@databricks.com>
wrote:
Regarding the readiness of this and previous RCs.  I did cut RC1 & RC2 knowing 
that they were unlikely to pass.  That said, I still think these early RCs are 
valuable. I know several users that wanted to test new features in 2.2 that 
have used them.  Now, if we would prefer to call them preview or RC0 or 
something I'd be okay with that as well.

They are valuable; I only suggest it's better to note explicitly when there are
blockers or must-do tasks that will fail the RC. It makes a big difference to 
whether one would like to +1.

I meant more than just calling them something different. An early RC could be 
voted as a released 'preview' artifact, at the start of the notional QA period, 
with a lower bar to passing, and releasable with known issues. This encourages 
more testing. It also resolves the controversy about whether it's OK to include 
an RC in a product (separate thread).


Regarding doc updates, I don't think it is a requirement that they be voted on 
as part of the release.  Even if they are something version specific.  I think 
we have regularly updated the website with documentation that was merged after 
the release.

They're part of the source release too, as markdown, and should be voted on. 
I've never understood otherwise. Have we actually released docs and then later 
changed them, so that they don't match the release? I don't recall that, but I 
do recall updating the non-version-specific website.

Aside from the oddity of having docs generated from x.y source not match docs 
published for x.y, you want the same protections for doc source that the 
project distributes as anything else. It's not just correctness, but liability. 
The hypothetical is always that someone included copyrighted text or something 
without permission and now the project can't rely on the argument that it made 
a good-faith effort to review what it released on the site. Someone becomes 
personally liable.

These are pretty technical reasons though. More practically, what's the hurry 
to release if docs aren't done (_if_ they're not done)? It's being presented as 
normal practice, but seems quite exceptional.


I personally don't think the QA umbrella JIRAs are particularly effective, but 
I also wouldn't ban their use if others think they are.  However, I do think 
that real QA needs an RC to test, so I think it is fine that there is still 
outstanding QA to be done when an RC is cut.  For example, I plan to run a 
bunch of streaming workloads on RC4 and will vote accordingly.

QA on RCs is great (see above). The problem is, I can't distinguish between a 
JIRA that means "we must test in general", which sounds like something you too 
would ignore, and one that means "there is specific functionality we have to 
check before a release that we haven't looked at yet", which is a committer 
waving a flag that they implicitly do not want a release until resolved. I 
wouldn't +1 a release that had a Blocker software defect one of us reported.

I know I'm harping on this, but this is the one mechanism we do use 
consistently (Blocker JIRAs) to clearly communicate about issues vital to a go 
/ no-go release decision, and I think this interferes. The rest of JIRA noise 
doesn't matter much. You can see we're already resorting to secondary 
communications as a result ("anyone have any issues that need to be fixed 
before I cut another RC?" emails) because this is kind of ignored, and I think
we're swapping out a decent mechanism for a worse one.

I suspect, as you do, that there's no to-do here in which case they should be 
resolved and we're still on track for release. I'd wait on +1 until then.





Performance regression for partitioned parquet data

2017-06-06 Thread Bertrand Bossy
Hi,

Since moving to Spark 2.1 from 2.0, we have experienced a performance
regression when reading a large, partitioned Parquet dataset:

We observe many (hundreds of) very short jobs executing before the job that
actually reads the data starts. I looked into this issue and pinned it down
to PartitioningAwareFileIndex: while recursively listing the directories, if
a directory contains more than
"spark.sql.sources.parallelPartitionDiscovery.threshold" (default: 32)
paths, its children are listed using a Spark job. Because the tree is walked
serially, this can result in a lot of small Spark jobs executed one after
the other, and the overhead dominates. Performance can be improved by tuning
"spark.sql.sources.parallelPartitionDiscovery.threshold". However, this is
not a satisfactory solution.
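
For anyone hitting the same issue, the tuning workaround looks roughly like
this (the threshold value and the path are illustrative only; the threshold
just needs to exceed the widest directory level):

// Keep the recursive listing on the driver instead of launching many tiny
// jobs; 9999 is an arbitrary value larger than any directory fanout here.
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "9999")
val df = spark.read.parquet("/path/to/partitioned/dataset")  // hypothetical path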

I think the current behaviour could be improved by walking the directory
tree in breadth-first order and launching a single Spark job to list files
in parallel only when the number of paths to be listed at some level exceeds
spark.sql.sources.parallelPartitionDiscovery.threshold.
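
In pseudocode terms, the proposal would look something like the sketch below
(the two listing helpers are assumed signatures for illustration, not
Spark's actual internal API):

import org.apache.hadoop.fs.{FileStatus, Path}

def listBfs(roots: Seq[Path], threshold: Int)(
    listOnDriver: Seq[Path] => Seq[FileStatus],  // serial listing on the driver
    listWithJob: Seq[Path] => Seq[FileStatus]    // one Spark job for the level
): Seq[FileStatus] = {
  var level = roots
  val files = Seq.newBuilder[FileStatus]
  while (level.nonEmpty) {
    // One decision per level: parallelize only if this level is wide enough.
    val statuses =
      if (level.size > threshold) listWithJob(level) else listOnDriver(level)
    val (dirs, leaves) = statuses.partition(_.isDirectory)
    files ++= leaves
    level = dirs.map(_.getPath)  // descend one level, across all directories
  }
  files.result()
}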

Does this approach make sense? I have found "Regression in file listing
performance" ( https://issues.apache.org/jira/browse/SPARK-18679 ) as the
most closely related ticket.

Unless there is a reason for the current behaviour, I will create a ticket
on this soon. I might have some time in the coming days to work on this.

Regards,
Bertrand

-- 

Bertrand Bossy | TERALYTICS

*software engineer*

Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
www.teralytics.net



a stage can belong to more than one job please?

2017-06-06 Thread 萝卜丝炒饭
Hi all,

I was reading some of the Spark code about stages.

The constructor of Stage keeps the first job ID the stage was part of.
Does that mean a stage can belong to more than one job? And I find the
member jobIds is never used. It looks strange.


Thanks in advance

Are release docs part of a release?

2017-06-06 Thread Sean Owen
That's good, but I think we should agree on whether release docs are part
of a release. It's important for reasoning about releases.

To be clear, you're suggesting that, say, right now you are OK with
updating this page with a few more paragraphs?
http://spark.apache.org/docs/2.1.0/streaming-programming-guide.html  Even
though those paragraphs can't be in the released 2.1.0 doc source?

First, what is everyone's understanding of the answer?

The only official guidance I can find is
http://www.apache.org/legal/release-policy.html#distribute-other-artifacts ,
which suggests that docs need to be released similarly, with signatures.
Not quite the same question, but strongly implies they're treated like any
other source that is released with a vote.

--

WHAT ARE THE REQUIREMENTS TO DISTRIBUTE OTHER ARTIFACTS IN ADDITION TO THE
SOURCE PACKAGE?


ASF releases typically contain additional material together with the source
package. This material may include documentation concerning the release but
must contain LICENSE and NOTICE files. As mentioned above, these artifacts
must be signed by a committer with a detached signature if they are to be
placed in the project's distribution directory.

Again, these artifacts may be distributed only if they contain LICENSE and
NOTICE files. For example, the Java artifact format is based on a
compressed directory structure and those projects wishing to distribute
jars must place LICENSE and NOTICE files in the META-INF directory within
the jar.

Nothing in this section is meant to supersede the requirements defined here
and here that all releases be primarily based on a signed source package.

On Tue, Jun 6, 2017 at 9:50 AM Nick Pentreath 
wrote:

> The website updates for ML QA (SPARK-20507) are not *actually* critical
> as the project website certainly can be updated separately from the source
> code guide and is not part of the release to be voted on. In future that
> particular work item for the QA process could be marked down in priority,
> and is definitely not a release blocker.
>
> In any event I just resolved SPARK-20507, as I don't believe any website
> updates are required for this release anyway. That fully resolves the ML QA
> umbrella (SPARK-20499).
>


Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-06 Thread Nick Pentreath
Now, on the subject of (ML) QA JIRAs.

From the ML side, I believe they are required (I think others such as
Joseph will agree and in fact have already said as much).

Most are marked as Blockers, though of those the Python API coverage is
strictly not a Blocker, as we will never hold the release for API parity
issues (unless of course there is some critical bug or missing piece, but
that really falls under the standard RC bug triage process).

I believe they are Blockers, since they involve auditing binary compat and
new public APIs, visibility issues, Java compat, etc. I think it's obvious
that an RC should not pass if these have not been checked.

I actually agree that docs and user guide are absolutely part of the
release, and in fact are one of the more important pieces of the release.
Apart from the issues Sean mentions, not treating these things as critical
issues or even blockers is what inevitably, over time, leads to the user
guide being out of date, missing important features, etc.

In practice for ML at least we definitely aim to have all the doc / guide
issues done before the final release.

Now, in terms of process, none of these QA issues really require an RC; they
can all be carried out once the release branch is cut. Some of the issues,
like binary compat, are perhaps a bit more tricky, but they inevitably involve
manually checking through the MiMa exclusions added to verify they are OK,
etc., so again an actual RC is not required here.

So really the answer is to more aggressively burn down these QA issues the
moment the release branch has been cut. Again, I think this echoes what
Joseph has said in previous threads.



On Tue, 6 Jun 2017 at 10:16 Sean Owen  wrote:

> On Tue, Jun 6, 2017 at 1:06 AM Michael Armbrust 
> wrote:
>
>> Regarding the readiness of this and previous RCs.  I did cut RC1 & RC2
>> knowing that they were unlikely to pass.  That said, I still think these
>> early RCs are valuable. I know several users that wanted to test new
>> features in 2.2 that have used them.  Now, if we would prefer to call them
>> preview or RC0 or something I'd be okay with that as well.
>>
>
> They are valuable; I only suggest it's better to note explicitly when
> there are blockers or must-do tasks that will fail the RC. It makes a big
> difference to whether one would like to +1.
>
> I meant more than just calling them something different. An early RC could
> be voted as a released 'preview' artifact, at the start of the notional QA
> period, with a lower bar to passing, and releasable with known issues. This
> encourages more testing. It also resolves the controversy about whether
> it's OK to include an RC in a product (separate thread).
>
>
> Regarding doc updates, I don't think it is a requirement that they be
>> voted on as part of the release.  Even if they are something version
>> specific.  I think we have regularly updated the website with documentation
>> that was merged after the release.
>>
>
> They're part of the source release too, as markdown, and should be voted
> on. I've never understood otherwise. Have we actually released docs and
> then later changed them, so that they don't match the release? I don't
> recall that, but I do recall updating the non-version-specific website.
>
> Aside from the oddity of having docs generated from x.y source not match
> docs published for x.y, you want the same protections for doc source that
> the project distributes as anything else. It's not just correctness, but
> liability. The hypothetical is always that someone included copyrighted
> text or something without permission and now the project can't rely on the
> argument that it made a good-faith effort to review what it released on the
> site. Someone becomes personally liable.
>
> These are pretty technical reasons though. More practically, what's the
> hurry to release if docs aren't done (_if_ they're not done)? It's being
> presented as normal practice, but seems quite exceptional.
>
>
>
>> I personally don't think the QA umbrella JIRAs are particularly
>> effective, but I also wouldn't ban their use if others think they are.
>> However, I do think that real QA needs an RC to test, so I think it is fine
>> that there is still outstanding QA to be done when an RC is cut.  For
>> example, I plan to run a bunch of streaming workloads on RC4 and will vote
>> accordingly.
>>
>
> QA on RCs is great (see above). The problem is, I can't distinguish
> between a JIRA that means "we must test in general", which sounds like
> something you too would ignore, and one that means "there is specific
> functionality we have to check before a release that we haven't looked at
> yet", which is a committer waving a flag that they implicitly do not want a
> release until resolved. I wouldn't +1 a release that had a Blocker software
> defect one of us reported.
>
> I know I'm harping on this, but this is the one mechanism we do use
> consistently (Blocker JIRAs) to clearly communicate about issues vital to a
> go / no-go release decision, and I think this interferes.

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-06 Thread Nick Pentreath
The website updates for ML QA (SPARK-20507) are not *actually* critical as
the project website certainly can be updated separately from the source
code guide and is not part of the release to be voted on. In future that
particular work item for the QA process could be marked down in priority,
and is definitely not a release blocker.

In any event I just resolved SPARK-20507, as I don't believe any website
updates are required for this release anyway. That fully resolves the ML QA
umbrella (SPARK-20499).


On Tue, 6 Jun 2017 at 10:16 Sean Owen  wrote:

> On Tue, Jun 6, 2017 at 1:06 AM Michael Armbrust 
> wrote:
>
>> Regarding the readiness of this and previous RCs.  I did cut RC1 & RC2
>> knowing that they were unlikely to pass.  That said, I still think these
>> early RCs are valuable. I know several users that wanted to test new
>> features in 2.2 that have used them.  Now, if we would prefer to call them
>> preview or RC0 or something I'd be okay with that as well.
>>
>
> They are valuable; I only suggest it's better to note explicitly when
> there are blockers or must-do tasks that will fail the RC. It makes a big
> difference to whether one would like to +1.
>
> I meant more than just calling them something different. An early RC could
> be voted as a released 'preview' artifact, at the start of the notional QA
> period, with a lower bar to passing, and releasable with known issues. This
> encourages more testing. It also resolves the controversy about whether
> it's OK to include an RC in a product (separate thread).
>
>
> Regarding doc updates, I don't think it is a requirement that they be
>> voted on as part of the release.  Even if they are something version
>> specific.  I think we have regularly updated the website with documentation
>> that was merged after the release.
>>
>
> They're part of the source release too, as markdown, and should be voted
> on. I've never understood otherwise. Have we actually released docs and
> then later changed them, so that they don't match the release? I don't
> recall that, but I do recall updating the non-version-specific website.
>
> Aside from the oddity of having docs generated from x.y source not match
> docs published for x.y, you want the same protections for doc source that
> the project distributes as anything else. It's not just correctness, but
> liability. The hypothetical is always that someone included copyrighted
> text or something without permission and now the project can't rely on the
> argument that it made a good-faith effort to review what it released on the
> site. Someone becomes personally liable.
>
> These are pretty technical reasons though. More practically, what's the
> hurry to release if docs aren't done (_if_ they're not done)? It's being
> presented as normal practice, but seems quite exceptional.
>
>
>
>> I personally don't think the QA umbrella JIRAs are particularly
>> effective, but I also wouldn't ban their use if others think they are.
>> However, I do think that real QA needs an RC to test, so I think it is fine
>> that there is still outstanding QA to be done when an RC is cut.  For
>> example, I plan to run a bunch of streaming workloads on RC4 and will vote
>> accordingly.
>>
>
> QA on RCs is great (see above). The problem is, I can't distinguish
> between a JIRA that means "we must test in general", which sounds like
> something you too would ignore, and one that means "there is specific
> functionality we have to check before a release that we haven't looked at
> yet", which is a committer waving a flag that they implicitly do not want a
> release until resolved. I wouldn't +1 a release that had a Blocker software
> defect one of us reported.
>
> I know I'm harping on this, but this is the one mechanism we do use
> consistently (Blocker JIRAs) to clearly communicate about issues vital to a
> go / no-go release decision, and I think this interferes. The rest of JIRA
> noise doesn't matter much. You can see we're already resorting to secondary
> communications as a result ("anyone have any issues that need to be fixed
> before I cut another RC?" emails) because this is kind of ignored, and I
> think we're swapping out a decent mechanism for a worse one.
>
> I suspect, as you do, that there's no to-do here in which case they should
> be resolved and we're still on track for release. I'd wait on +1 until then.
>
>


Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-06 Thread Sean Owen
On Tue, Jun 6, 2017 at 1:06 AM Michael Armbrust 
wrote:

> Regarding the readiness of this and previous RCs.  I did cut RC1 & RC2
> knowing that they were unlikely to pass.  That said, I still think these
> early RCs are valuable. I know several users that wanted to test new
> features in 2.2 that have used them.  Now, if we would prefer to call them
> preview or RC0 or something I'd be okay with that as well.
>

They are valuable; I only suggest it's better to note explicitly when there
are blockers or must-do tasks that will fail the RC. It makes a big
difference to whether one would like to +1.

I meant more than just calling them something different. An early RC could
be voted as a released 'preview' artifact, at the start of the notional QA
period, with a lower bar to passing, and releasable with known issues. This
encourages more testing. It also resolves the controversy about whether
it's OK to include an RC in a product (separate thread).


Regarding doc updates, I don't think it is a requirement that they be voted
> on as part of the release.  Even if they are something version specific.  I
> think we have regularly updated the website with documentation that was
> merged after the release.
>

They're part of the source release too, as markdown, and should be voted
on. I've never understood otherwise. Have we actually released docs and
then later changed them, so that they don't match the release? I don't
recall that, but I do recall updating the non-version-specific website.

Aside from the oddity of having docs generated from x.y source not match
docs published for x.y, you want the same protections for doc source that
the project distributes as anything else. It's not just correctness, but
liability. The hypothetical is always that someone included copyrighted
text or something without permission and now the project can't rely on the
argument that it made a good-faith effort to review what it released on the
site. Someone becomes personally liable.

These are pretty technical reasons though. More practically, what's the
hurry to release if docs aren't done (_if_ they're not done)? It's being
presented as normal practice, but seems quite exceptional.



> I personally don't think the QA umbrella JIRAs are particularly effective,
> but I also wouldn't ban their use if others think they are.  However, I do
> think that real QA needs an RC to test, so I think it is fine that there is
> still outstanding QA to be done when an RC is cut.  For example, I plan to
> run a bunch of streaming workloads on RC4 and will vote accordingly.
>

QA on RCs is great (see above). The problem is, I can't distinguish between
a JIRA that means "we must test in general", which sounds like something
you too would ignore, and one that means "there is specific functionality
we have to check before a release that we haven't looked at yet", which is
a committer waving a flag that they implicitly do not want a release until
resolved. I wouldn't +1 a release that had a Blocker software defect one of
us reported.

I know I'm harping on this, but this is the one mechanism we do use
consistently (Blocker JIRAs) to clearly communicate about issues vital to a
go / no-go release decision, and I think this interferes. The rest of JIRA
noise doesn't matter much. You can see we're already resorting to secondary
communications as a result ("anyone have any issues that need to be fixed
before I cut another RC?" emails) because this is kind of ignored, and I
think we're swapping out a decent mechanism for a worse one.

I suspect, as you do, that there's no to-do here in which case they should
be resolved and we're still on track for release. I'd wait on +1 until then.