Re: [DISCUSS] Enable blacklisting feature by default in 3.0

2019-03-29 Thread Reynold Xin
We tried enabling blacklisting for some customers and in the cloud, and very 
quickly they ended up with 0 executors due to various transient errors. So 
unfortunately I think the current implementation is terrible for cloud 
deployments, and shouldn't be on by default. The heart of the issue is that the 
current implementation is not good at distinguishing transient errors from 
catastrophic errors.

+Chris who was involved with those tests.

On Thu, Mar 28, 2019 at 3:32 PM, Ankur Gupta <ankur.gu...@cloudera.com.invalid> wrote:

> 
> Hi all,
> 
> 
> This is a follow-on to my PR: https://github.com/apache/spark/pull/24208,
> where I aimed to enable blacklisting for fetch failures by default. From the
> comments, there is interest in the community in enabling the overall
> blacklisting feature by default. I have listed three things we can do below
> and would like to gather feedback to see if anyone objects. Otherwise, I
> will create a PR for these changes.
> 
> 
> 1. *Enable the blacklisting feature by default*. The blacklisting feature
> was added as part of SPARK-8425 and has been available since 2.2.0. It was
> deemed experimental and is disabled by default. The feature blacklists an
> executor/node from running a particular task, any task in a particular
> stage, or all tasks in the application, based on the number of failures.
> Various configurations control those thresholds. Additionally, the
> executor/node is only blacklisted for a configurable time period. The idea
> is to enable the blacklisting feature with the existing defaults, which are
> the following:
> * spark.blacklist.task.maxTaskAttemptsPerExecutor = 1
> * spark.blacklist.task.maxTaskAttemptsPerNode = 2
> * spark.blacklist.stage.maxFailedTasksPerExecutor = 2
> * spark.blacklist.stage.maxFailedExecutorsPerNode = 2
> * spark.blacklist.application.maxFailedTasksPerExecutor = 2
> * spark.blacklist.application.maxFailedExecutorsPerNode = 2
> * spark.blacklist.timeout = 1 hour
> 
> 2. *Kill blacklisted executors/nodes by default*. This feature was added
> as part of SPARK-16554 and has been available since 2.2.0. It is a
> follow-on to blacklisting: if an executor/node is blacklisted for the
> application, Spark also terminates all tasks running on that executor/node,
> for faster failure recovery.
> 
> 
> 3. *Remove the legacy blacklisting timeout config*:
> spark.scheduler.executorTaskBlacklistTime
> 
> 
> Thanks,
> Ankur
>

Re: [DISCUSS] Enable blacklisting feature by default in 3.0

2019-03-29 Thread Ankur Gupta
Thanks Reynold! That is certainly useful to know.

@Chris Could you send out those details if you still have them, or better
yet, create a JIRA so someone can work on those improvements? If there is
already a JIRA, please provide a link.

Additionally, if the concern is the aggressiveness of the blacklisting, then
we can enable the blacklisting feature by default with higher failure
thresholds. Below is an alternate set of defaults that was also proposed in
the design document, aimed at maximizing cluster utilization:

   1. spark.blacklist.task.maxTaskAttemptsPerExecutor = 2
   2. spark.blacklist.task.maxTaskAttemptsPerNode = 2
   3. spark.blacklist.stage.maxFailedTasksPerExecutor = 5
   4. spark.blacklist.stage.maxFailedExecutorsPerNode = 4
   5. spark.blacklist.application.maxFailedTasksPerExecutor = 5
   6. spark.blacklist.application.maxFailedExecutorsPerNode = 4
   7. spark.blacklist.timeout = 5 mins
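For anyone who wants to experiment ahead of a default change, here is a
minimal sketch (PySpark) of enabling blacklisting with these thresholds at
session build time. The spark.blacklist.* threshold keys are the ones listed
above; spark.blacklist.enabled and spark.blacklist.killBlacklistedExecutors
are the documented switches for the feature itself and for the kill behavior
from SPARK-16554:

    from pyspark.sql import SparkSession

    # Sketch: enable blacklisting with the more lenient thresholds proposed
    # above. These must be set before the application starts; they cannot
    # be changed on a live session.
    spark = (
        SparkSession.builder
        .appName("blacklisting-experiment")
        .config("spark.blacklist.enabled", "true")
        .config("spark.blacklist.task.maxTaskAttemptsPerExecutor", "2")
        .config("spark.blacklist.task.maxTaskAttemptsPerNode", "2")
        .config("spark.blacklist.stage.maxFailedTasksPerExecutor", "5")
        .config("spark.blacklist.stage.maxFailedExecutorsPerNode", "4")
        .config("spark.blacklist.application.maxFailedTasksPerExecutor", "5")
        .config("spark.blacklist.application.maxFailedExecutorsPerNode", "4")
        .config("spark.blacklist.timeout", "5min")
        .config("spark.blacklist.killBlacklistedExecutors", "true")
        .getOrCreate()
    )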





Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-03-29 Thread Bryan Cutler
PyArrow dropping Python 3.4 was mainly due to support going away at
Conda-Forge and other dependencies also dropping it.  I think we'd better
upgrade Jenkins's Python while we're at it.  Are you all against jumping to
Python 3.6 so we are not in the same boat in September?
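For context, the kind of guard a minimum-version bump like this affects is
roughly the following (a sketch, not PySpark's exact code; the 0.12.1 floor
is my reading of the JIRA):

    from distutils.version import LooseVersion

    def require_minimum_pyarrow_version(minimum="0.12.1"):
        # Sketch of a minimum-version check of the kind PySpark performs
        # before using Arrow-based conversions.
        try:
            import pyarrow
        except ImportError:
            raise ImportError("PyArrow >= %s must be installed" % minimum)
        if LooseVersion(pyarrow.__version__) < LooseVersion(minimum):
            raise ImportError("PyArrow >= %s must be installed; found %s"
                              % (minimum, pyarrow.__version__))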

On Thu, Mar 28, 2019 at 7:58 PM Felix Cheung 
wrote:

> 3.4 is end of life but 3.5 is not. From your link
>
> we expect to release Python 3.5.8 around September 2019.
>
>
>
> --
> *From:* shane knapp 
> *Sent:* Thursday, March 28, 2019 7:54 PM
> *To:* Hyukjin Kwon
> *Cc:* Bryan Cutler; dev; Felix Cheung
> *Subject:* Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]
>
> looks like the same for 3.5...   https://www.python.org/dev/peps/pep-0478/
>
> let's pick a python version and start testing.
>
> On Thu, Mar 28, 2019 at 7:52 PM shane knapp  wrote:
>
>>
>>> If there was, it looks inevitable to upgrade Jenkins's Python from 3.4
>>> to 3.5.
>>>
>> this is inevitable.  3.4's final release was 10 days ago (
>> https://www.python.org/dev/peps/pep-0429/) so we're basically EOL.
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-03-29 Thread shane knapp
i'm not opposed to 3.6 at all.


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Dropping SortExec from SortMergeJoins on presorted data

2019-03-29 Thread tim
Hi all,

We ingest our data into dataframes with multiple naturally co-sorted
columns.
The redundant sort required during large SortMergeJoin operations takes
substantial time that we'd like to optimise -- a plain merge should be
sufficient.
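To make the redundant sort concrete, here is a minimal repro sketch (PySpark;
the column name is made up, and broadcast is disabled only to force a
SortMergeJoin for the demo):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("smj-sort-repro")
             # force a SortMergeJoin rather than a broadcast join
             .config("spark.sql.autoBroadcastJoinThreshold", "-1")
             .getOrCreate())

    # Both sides here are in fact already ordered by "k", but Spark has no
    # way to know that, so the plan adds a SortExec on each side of the join.
    left = spark.range(0, 1000000).withColumnRenamed("id", "k")
    right = spark.range(0, 1000000).withColumnRenamed("id", "k")

    # Look for "Sort [k ASC NULLS FIRST]" above SortMergeJoin in the output.
    left.join(right, "k").explain()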

Is there a mechanism to avoid these sorts in general?
Do we need to persist all our frames as tables with sortBy+bucketBy to get
this optimisation?
If we use sorted tables, does the "sorted by" metadata persist past the
first join or do we need to re-write each intermediate result to a (possibly
re-sorted?) table to maintain the metadata in the catalog?
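For reference, the bucketed-table route we mean looks roughly like this,
continuing from the sketch above (again a sketch -- whether the sort metadata
survives past the first join is exactly what we are unsure about):

    # Persist both sides bucketed and sorted on the join key. With matching
    # bucketing on both tables Spark can drop the exchange, and with a
    # matching sortBy it can ideally drop the per-side sort as well.
    left.write.bucketBy(64, "k").sortBy("k").saveAsTable("left_bucketed")
    right.write.bucketBy(64, "k").sortBy("k").saveAsTable("right_bucketed")

    a = spark.table("left_bucketed")
    b = spark.table("right_bucketed")
    a.join(b, "k").explain()  # ideally no SortExec under the join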

Our data actually has multiple distinct monotonically-increasing columns.
The catalog doesn't seem to be able to capture this information, requiring a
re-sort when we want to join along a different-but-still-sorted dimension.
We've prototyped hacking the external catalog to let us intercept cases
where we want to assert that outputOrdering is equivalent to the
requiredOrdering (see SortOrder.orderingSatisfies), but this feels like an
abuse.
Table information has been lost at these points too, so we need to infer
sortedness by comparing raw column names extracted from SortOrder
expressions.
This breaks in cases where our processing has caused the data to *lose* its
sortedness.

Have we missed something simple or do we have an exotic use-case unlike
other users?

Thanks!
Tim




-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.4.1 (RC9)

2019-03-29 Thread Felix Cheung
(I think the .invalid is added by the list server.)

Personally I'd rather everyone just +1 or -1, and not label votes as binding or 
not. It's really the responsibility of the RM to confirm whether a vote is 
binding. Mistakes have been made otherwise.



From: Marcelo Vanzin 
Sent: Thursday, March 28, 2019 3:56 PM
To: dev
Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC9)

(Anybody know what's the deal with all the .invalid e-mail addresses?)

Anyway. ASF has voting rules, and some things like releases follow
specific rules:
https://www.apache.org/foundation/voting.html#ReleaseVotes

So, for releases, ultimately, the only votes that "count" towards the
final tally are PMC votes. But everyone is welcome to vote, especially
if they have a reason to -1 a release. PMC members can use that to
guide how they vote, or the RM can use that to drop the RC
unilaterally if he agrees with the reason.


On Thu, Mar 28, 2019 at 3:47 PM Jonatan Jäderberg
 wrote:
>
> +1 (user vote)
>
> btw what to call a vote that is not pmc or committer?
> Some people use “non-binding”, but nobody says “my vote is binding”, and if 
> some vote is important to me, I still need to look up the who’s-who of the 
> project to be able to tally the votes.
> I like `user vote` for someone who has their say but is not speaking with any 
> authority (i.e., not pmc/committer). wdyt?
>
> Also, let’s get this release out the door!
>
> cheers,
> Jonatan
>
> On 28 Mar 2019, at 21:31, DB Tsai  wrote:
>
> +1 from myself
>
> On Thu, Mar 28, 2019 at 3:14 AM Mihaly Toth  
> wrote:
>>
>> +1 (non-binding)
>>
>> Thanks, Misi
>>
>>> Sean Owen  wrote (on Thu, 28 Mar 2019, 0:19):
>>>
>>> +1 from me - same as last time.
>>>

Re: [VOTE] Release Apache Spark 2.4.1 (RC9)

2019-03-29 Thread Felix Cheung
+1

build source
R tests
R package CRAN check locally, r-hub



From: d_t...@apple.com on behalf of DB Tsai 
Sent: Wednesday, March 27, 2019 11:31 AM
To: dev
Subject: [VOTE] Release Apache Spark 2.4.1 (RC9)

Please vote on releasing the following candidate as Apache Spark version 2.4.1.

The vote is open until March 30 PST and passes if a majority +1 PMC votes are 
cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.4.1-rc9 (commit 
58301018003931454e93d8a309c7149cf84c279e):
https://github.com/apache/spark/tree/v2.4.1-rc9

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc9-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1319/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc9-docs/

The list of bug fixes going into 2.4.1 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/2.4.1

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks; in Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
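For example, once the RC is installed into a clean virtualenv, even a tiny
smoke job like this sketch can catch packaging or startup regressions:

    from pyspark.sql import SparkSession

    # Minimal smoke test against the RC: start a local session, run a
    # small aggregation, and check the reported version.
    spark = (SparkSession.builder
             .master("local[2]")
             .appName("rc-smoke-test")
             .getOrCreate())
    assert spark.version == "2.4.1", spark.version

    df = spark.range(0, 1000).selectExpr("id", "id % 10 AS bucket")
    assert df.groupBy("bucket").count().count() == 10

    spark.stop()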

===
What should happen to JIRA tickets still targeting 2.4.1?
===

The current list of open tickets targeted at 2.4.1 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 2.4.1

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


DB Tsai | Siri Open Source Technologies [not a contribution] |  Apple, Inc


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [k8s][jenkins] spark dev tool docs now have k8s+minikube instructions!

2019-03-29 Thread Felix Cheung
Definitely the part on the PR. Thanks!



From: shane knapp 
Sent: Thursday, March 28, 2019 11:19 AM
To: dev; Stavros Kontopoulos
Subject: [k8s][jenkins] spark dev tool docs now have k8s+minikube instructions!

https://spark.apache.org/developer-tools.html

search for "Testing K8S".

this is pretty much how i build and test PRs locally...  the commands there are 
lifted straight from the k8s integration test jenkins build, so they might 
require a little tweaking to better suit your laptop/server.

k8s is great (except when it's not), and it's really quite easy to get set up 
(except when it's not).  stackoverflow is your friend, and the minikube slack 
was really useful.

some of this is a little hacky (running the mount process in the background, 
for example), but there's a lot of development on minikube right now...  the 
k8s project understands the importance of minikube and has dedicated 
engineering resources involved.

and finally, if you have a suggestion for the docs, open a PR!  they are always 
welcome!

shane

ps- and a special thanks to @Stavros Kontopoulos and the PR from hell for 
throwing me in the deep end of k8s.  :)
--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-03-29 Thread Felix Cheung
I don't read that as September 2019 being the end of life for Python 3.5, 
though. It's just giving the date of the next release.

In any case, I think it will be great to get more Python 3.x test coverage in 
the next release.



