Re: [VOTE] Release Apache Spark 2.4.3

2019-05-01 Thread Sean Owen
+1 from me. There is little change from 2.4.2 anyway, except for the
important change to the build script that should build pyspark with
Scala 2.11 jars. I verified that the package contains the _2.11 Spark
jars, but have a look!

I'm still getting this weird error from the Kafka module when testing,
but it's a long-standing weird known issue:

[error] 
/home/ubuntu/spark-2.4.3/external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaDataConsumerSuite.scala:85:
Symbol 'term org.eclipse' is missing from the classpath.
[error] This symbol is required by 'method
org.apache.spark.metrics.MetricsSystem.getServletHandlers'.
[error] Make sure that term eclipse is in your classpath and check for
conflicting dependencies with `-Ylog-classpath`.
[error] A full rebuild may help if 'MetricsSystem.class' was compiled
against an incompatible version of org.
[error] testUtils.sendMessages(topic, data.toArray)

Killing zinc and rebuilding didn't help.
But this isn't happening in Jenkins for example, so it should be env-specific.

On Wed, May 1, 2019 at 9:39 AM Xiao Li  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version 
> 2.4.3.
>
> The vote is open until May 5th PST and passes if a majority +1 PMC votes are 
> cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.3
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.3-rc1 (commit 
> c3e32bf06c35ba2580d46150923abfa795b4446a):
> https://github.com/apache/spark/tree/v2.4.3-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.3-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1324/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.3-rc1-docs/
>
> The list of bug fixes going into 2.4.3 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12345410
>
> The release is using the release script of the branch 2.4.3-rc1 with the 
> following commit 
> https://github.com/apache/spark/commit/e417168ed012190db66a21e626b2b8d2332d6c01
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark, you can set up a virtual env, install
> the current RC, and see if anything important breaks. In Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.3?
> ===
>
> The current list of open tickets targeted at 2.4.3 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target 
> Version/s" = 2.4.3
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.




Re: Stage 152 contains a task of very large size (12747 KB). The maximum recommended task size is 100 KB

2019-05-01 Thread Long, Andrew
It turned out that I was unintentionally shipping multiple copies of the Hadoop 
config to every partition in an RDD. >.<  I was able to debug this by setting a 
breakpoint on the warning message and inspecting the partition object itself.

Cheers Andrew

From: Russell Spitzer 
Date: Thursday, April 25, 2019 at 8:47 AM
To: "Long, Andrew" 
Cc: dev 
Subject: Re: FW: Stage 152 contains a task of very large size (12747 KB). The 
maximum recommended task size is 100 KB

I usually only see that with folks parallelizing very large objects. 
From what I know, it's really just the data inside the "Partition" class of the 
RDD that is being sent back and forth. So usually it's something like 
sc.parallelize(Seq(reallyBigMap)). The parallelize function jams all that data 
into the RDD's Partition metadata, which can easily overwhelm the task size.
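
One way to narrow down which object is inflating a task is to serialize each suspect on its own and compare the result with the 100 KB figure from the warning. The following is only a rough sketch in plain Python: it uses `pickle` as a stand-in for whatever serializer the job actually uses, and the helper names and sample objects (`serialized_size_kb`, `find_oversized`, `really_big_map`, `small_settings`) are hypothetical, not part of Spark.

```python
import pickle

# Threshold quoted in the TaskSetManager warning in this thread.
RECOMMENDED_MAX_TASK_KB = 100


def serialized_size_kb(obj):
    """Return the pickled size of obj in KB (only an approximation of task payload size)."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)) / 1024.0


def find_oversized(candidates, limit_kb=RECOMMENDED_MAX_TASK_KB):
    """Return (name, size_kb) pairs for candidates above limit_kb, largest first."""
    sizes = [(name, serialized_size_kb(obj)) for name, obj in candidates.items()]
    oversized = [(name, kb) for name, kb in sizes if kb > limit_kb]
    return sorted(oversized, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-ins for objects a task might drag along.
# Each value is a distinct 100-character string, so nothing is deduplicated.
really_big_map = {i: f"{i:0100d}" for i in range(5000)}
small_settings = {"retries": 3, "timeout_s": 30}

suspects = find_oversized({
    "really_big_map": really_big_map,
    "small_settings": small_settings,
})

for name, kb in suspects:
    print(f"{name}: {kb:.0f} KB exceeds the recommended {RECOMMENDED_MAX_TASK_KB} KB")
```

In this sketch only `really_big_map` is reported. In a real job the sizes will differ from what the JVM serializer produces, so treat the numbers as relative hints rather than exact task sizes.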

On Tue, Apr 23, 2019 at 3:57 PM Long, Andrew  
wrote:
Hey Friends,

Is there an easy way of figuring out what's being pulled into the task context?  
I’ve been getting the following message, which I suspect means I’ve 
unintentionally captured some large objects, but figuring out what those objects 
are is stumping me.

19/04/23 13:52:13 WARN org.apache.spark.internal.Logging$class TaskSetManager: 
Stage 152 contains a task of very large size (12747 KB). The maximum 
recommended task size is 100 KB

Cheers Andrew


Re: [VOTE] Release Apache Spark 2.4.3

2019-05-01 Thread Gengliang Wang
+1 (non-binding)

> On May 1, 2019, at 10:16 AM, Michael Heuer wrote:
> 
> +1 (non-binding)



Re: [VOTE] Release Apache Spark 2.4.3

2019-05-01 Thread Michael Heuer
+1 (non-binding)

The binary release files are correctly built with Scala 2.11.12.

Thank you,

   michael


> On May 1, 2019, at 9:39 AM, Xiao Li  wrote:
> 
> Please vote on releasing the following candidate as Apache Spark version 
> 2.4.3.
> 
> The vote is open until May 5th PST and passes if a majority +1 PMC votes are 
> cast, with
> a minimum of 3 +1 votes.
> 
> [ ] +1 Release this package as Apache Spark 2.4.3
> [ ] -1 Do not release this package because ...
> 
> To learn more about Apache Spark, please see http://spark.apache.org/ 
> 
> 
> The tag to be voted on is v2.4.3-rc1 (commit 
> c3e32bf06c35ba2580d46150923abfa795b4446a):
> https://github.com/apache/spark/tree/v2.4.3-rc1 
> 
> 
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.3-rc1-bin/ 
> 
> 
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS 
> 
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1324/ 
> 
> 
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.3-rc1-docs/ 
> 
> 
> The list of bug fixes going into 2.4.3 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12345410 
> 
> 
> The release is using the release script of the branch 2.4.3-rc1 with the 
> following commit 
> https://github.com/apache/spark/commit/e417168ed012190db66a21e626b2b8d2332d6c01
>  
> 
> 
> FAQ
> 
> =
> How can I help test this release?
> =
> 
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
> 
> If you're working in PySpark, you can set up a virtual env, install
> the current RC, and see if anything important breaks. In Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
> 
> ===
> What should happen to JIRA tickets still targeting 2.4.3?
> ===
> 
> The current list of open tickets targeted at 2.4.3 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target 
> Version/s" = 2.4.3
> 
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
> 
> ==
> But my bug isn't fixed?
> ==
> 
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.



Re: [VOTE] Release Apache Spark 2.4.2

2019-05-01 Thread Felix Cheung
Just my 2c

If there is a known security issue, we should fix it rather than waiting for a 
black hat, or worse, to find out whether it actually affects Spark.

I don’t think any of us want to see Spark in the news for this reason.

From: Sean Owen 
Sent: Tuesday, April 30, 2019 1:52:53 PM
To: Reynold Xin
Cc: Jungtaek Lim; Dongjoon Hyun; Wenchen Fan; Michael Heuer; Terry Kim; dev; 
Xiao Li
Subject: Re: [VOTE] Release Apache Spark 2.4.2

FWIW I'm OK with this even though I proposed the backport PR for discussion. It 
really is a tough call, balancing the potential but as-yet unclear security 
benefit vs minor but real Jackson deserialization behavior change.

Because we have a pressing need for a 2.4.3 release (really a 2.4.2.1 almost) I 
think it's reasonable to defer a final call on this in 2.4.x and revert for 
now. Leaving it in 2.4.3 makes it quite permanent.

A little more color on the discussion:
- I don't think https://github.com/apache/spark/pull/22071 mitigates the 
theoretical problem here; I would guess the attack vector is deserializing a 
malicious JSON file. This is unproven either way
- The behavior change we know is basically what you see in the revert PR: 
entries like "'foo': null" aren't written by Jackson by default in 2.7+. You 
can make them so but it needs a code tweak in any app that inherits Spark's 
Jackson
- This is not related to Scala version
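
The kind of behavior change described in the second bullet can be illustrated without Jackson at all. The sketch below uses Python's standard `json` module purely as a stand-in: one serialization keeps null-valued entries, the other drops them, and a reader that assumes the key is always present breaks on the second form. The record and the two reader functions are hypothetical examples, not code from Spark or Jackson.

```python
import json

# A record with one null-valued entry, like the "'foo': null" case above.
record = {"foo": None, "bar": 1}

# Output that keeps null-valued entries.
with_nulls = json.dumps(record, sort_keys=True)

# Output that omits null-valued entries.
without_nulls = json.dumps(
    {key: value for key, value in record.items() if value is not None},
    sort_keys=True,
)


def reads_foo_strictly(text):
    """A consumer that assumes 'foo' is always present; raises KeyError if not."""
    return json.loads(text)["foo"]


def reads_foo_leniently(text):
    """A consumer that tolerates a missing 'foo' and treats it as None."""
    return json.loads(text).get("foo")


print(with_nulls)
print(without_nulls)
```

A strict consumer works on the first output and fails on the second, while a lenient one sees `None` in both cases. That is the sense in which downstream apps may need a code tweak when the serializer's default changes.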

This is for a discussion about re-including in 2.4.4:
- Does anyone know that the Jackson issues really _could_ affect Spark?
- Does anyone have concrete examples of why the behavior change is a bigger 
deal, or not as big a deal, as anticipated?

On Tue, Apr 30, 2019 at 1:34 AM Reynold Xin <r...@databricks.com> wrote:

Echoing both of you ... it's a bit risky to bump dependency versions in a patch 
release, especially for a super common library. (I wish we shaded Jackson).

Maybe the CVE is a sufficient reason to bump the dependency, ignoring the 
potential behavior changes that might happen, but I'd like to see a bit more 
discussion there and have 2.4.3 focus on fixing the Scala version issue 
first.



On Mon, Apr 29, 2019 at 11:17 PM, Jungtaek Lim <kabh...@gmail.com> wrote:
Ah! Sorry Xiao, I should have checked the fix version of the issue (it's 
2.4.3/3.0.0).

Then it looks much better to revert and avoid a dependency conflict in a bugfix 
release. Jackson is known for making non-backward-compatible changes in 
non-major versions, so I agree it's something to be careful with, or to 
shade/relocate and forget about.

On Tue, Apr 30, 2019 at 3:04 PM Xiao Li <lix...@databricks.com> wrote:
Jungtaek,

Thanks for your input! Sorry for the confusion. Let me make it clear.

  *   All the previous 2.4.x releases [including 2.4.2] are using Jackson 
2.6.7.1.
  *   In the master branch, Jackson is already upgraded to 2.9.8.
  *   Here, I am just trying to revert the Jackson upgrade in the upcoming 
2.4.3 release.

Cheers,

Xiao

On Mon, Apr 29, 2019 at 10:53 PM Jungtaek Lim <kabh...@gmail.com> wrote:
Just to be clear, is upgrading Jackson to 2.9.8 coupled with the Scala 
version? And could you summarize an actual case broken by the upgrade, if 
you have observed any? An actual case would help us weigh the impact.

Btw, my 2 cents: personally I would rather avoid upgrading dependencies in a 
bugfix release unless it resolves major bugs, so reverting it only on 
branch-2.4 sounds good to me. (I still think the Jackson upgrade is necessary 
in the master branch; otherwise there are lots of CVEs whose impact we would 
waste a huge amount of time identifying. Other libs will also start depending 
on Jackson 2.9.x, which conflicts with Spark's Jackson dependency.)

If there is a consensus on reverting that, we may also need to announce that 
using Spark 2.4.2 is discouraged; otherwise end users will suffer from the 
Jackson version going back and forth.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Tue, Apr 30, 2019 at 2:30 PM Xiao Li <lix...@databricks.com> wrote:
Before cutting 2.4.3, I just submitted a PR 
https://github.com/apache/spark/pull/24493 for reverting the commit 
https://github.com/apache/spark/commit/6f394a20bf49f67b4d6329a1c25171c8024a2fae.

In general, we need to be very cautious about the Jackson upgrade in patch 
releases, especially when the upgrade could break the existing behavior of 
external packages or data sources and generate different results after the 
upgrade. The external packages and data sources would need to change their 
source code to keep the original behavior. I think the upgrade requires more 
discussion before we release it.

In the previous PR https://github.com/apache/spark/pull/22071, we turned off 
`spark.master.rest.enabled` by default and 
added the following claim in our security doc:
The Rest Submission Server and the MesosClusterDispatcher do not support 
authentication.  You should ensure that all network access to the 

[VOTE] Release Apache Spark 2.4.3

2019-05-01 Thread Xiao Li
Please vote on releasing the following candidate as Apache Spark version
2.4.3.

The vote is open until May 5th PST and passes if a majority +1 PMC votes
are cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.3
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.4.3-rc1 (commit
c3e32bf06c35ba2580d46150923abfa795b4446a):
https://github.com/apache/spark/tree/v2.4.3-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.3-rc1-bin/
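
For anyone checking the release files by hand, a digest comparison could be scripted roughly as follows. This is a minimal sketch that assumes the published digest is a SHA-512 hex string; if the digest files use a different algorithm or layout, adjust accordingly. The function names are hypothetical, and verifying the signatures against the KEYS file is a separate step not covered here.

```python
import hashlib


def sha512_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digest_matches(path, published_hex):
    """Compare a file's digest with a published one, ignoring case and whitespace."""
    expected = "".join(published_hex.split()).lower()
    return sha512_of_file(path) == expected
```

Usage would be along the lines of `digest_matches("path/to/downloaded/archive", published_text)`, where both arguments are placeholders for a locally downloaded file and the hex digest copied from the release directory.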

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1324/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.3-rc1-docs/

The list of bug fixes going into 2.4.3 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12345410

The release is using the release script of the branch 2.4.3-rc1 with the
following commit
https://github.com/apache/spark/commit/e417168ed012190db66a21e626b2b8d2332d6c01

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark, you can set up a virtual env, install
the current RC, and see if anything important breaks. In Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 2.4.3?
===

The current list of open tickets targeted at 2.4.3 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 2.4.3

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.