Re: [PYTHON][DISCUSS] Moving to cloudpickle and or Py4J as a dependencies?

2017-02-13 Thread Holden Karau
It's a good question. Py4J seems to have been updated 5 times in 2016, and
updating it is a bit involved (from a review point of view, verifying the zip
file contents is somewhat tedious).

cloudpickle is a bit harder to gauge, since we can have changes to
cloudpickle which aren't correctly tagged as backporting changes from the
fork (and this can take a while to review since we don't always catch them
right away as being backports).

Another difficulty with looking at backports is that since our review
process for PySpark has historically been on the slow side, changes
benefiting systems like dask or IPython parallel were not backported to
Spark unless they caused serious errors.

I think the key benefits are: better test coverage of the forked version of
cloudpickle, more standardized packaging of dependencies, and simpler
dependency updates, which reduce the friction of picking up improvements from
other related projects' work - Python serialization really isn't our secret sauce.
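
As a rough sketch of what the pinned-dependency approach could look like on
the packaging side (the version pins and setup.py layout below are
illustrative assumptions only, not something that's been agreed on):

    # Hypothetical setup.py fragment: declare cloudpickle and Py4J as pinned
    # dependencies instead of vendoring copies in the Spark repo. The pins
    # below are examples only, not versions we'd necessarily choose.
    from setuptools import setup

    setup(
        name="pyspark",
        version="2.2.0.dev0",  # placeholder version for illustration
        packages=["pyspark"],
        install_requires=[
            "py4j==0.10.4",        # pinned: we depend on Py4J internal APIs
            "cloudpickle==0.2.2",  # pinned fork of the picloud serializer
        ],
    )

A requirements.txt with the same pins would cover people who prefer not to do
a system installation of PySpark.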

If I'm missing any substantial benefits or costs I'd love to know :)

On Mon, Feb 13, 2017 at 3:03 PM, Reynold Xin  wrote:

> With any dependency update (or refactoring of existing code), I always ask
> this question: what's the benefit? In this case it looks like the benefit
> is to reduce efforts in backports. Do you know how often we needed to do
> those?
>
>
> On Tue, Feb 14, 2017 at 12:01 AM, Holden Karau 
> wrote:
>
>> Hi PySpark Developers,
>>
>> Cloudpickle is a core part of PySpark, and is originally copied from (and
>> improved from) picloud. Since then other projects have found cloudpickle
>> useful and a fork of cloudpickle
>> <https://github.com/cloudpipe/cloudpickle> was created and is now
>> maintained as its own library <https://pypi.python.org/pypi/cloudpickle> 
>> (with
>> better test coverage and resulting bug fixes I understand). We've had a few
>> PRs backporting fixes from the cloudpickle project into Spark's local copy
>> of cloudpickle - how would people feel about moving to taking an explicit
>> (pinned) dependency on cloudpickle?
>>
>> We could add cloudpickle to the setup.py and a requirements.txt file for
>> users who prefer not to do a system installation of PySpark.
>>
>> Py4J is maybe even a simpler case, we currently have a zip of py4j in our
>> repo but could instead have a pinned version required. While we do depend
>> on a lot of py4j internal APIs, version pinning should be sufficient to
>> ensure functionality (and simplify the update process).
>>
>> Cheers,
>>
>> Holden :)
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Design document - MLlib's statistical package for DataFrames

2017-02-18 Thread Holden Karau
It's at the bottom of every message (although some mail clients hide it for
some reason) - send an email to dev-unsubscr...@spark.apache.org

On Sat, Feb 18, 2017 at 11:07 AM Pritish Nawlakhe <
prit...@nirvana-international.com> wrote:

> Hi
>
> Would anyone know how to unsubscribe to this list?
>
>
>
> Thank you!!
>
> Regards
> Pritish
> Nirvana International Inc.
>
> Big Data, Hadoop, Oracle EBS and IT Solutions
> VA - SWaM, MD - MBE Certified Company
> prit...@nirvana-international.com
> http://www.nirvana-international.com
> Twitter: @nirvanainternat
>
> -Original Message-
> From: Tim Hunter [mailto:timhun...@databricks.com]
> Sent: Friday, February 17, 2017 1:49 PM
> To: bradc
> Cc: dev@spark.apache.org
> Subject: Re: Design document - MLlib's statistical package for DataFrames
>
> Hi Brad,
>
> this task is focusing on moving the existing algorithms, so that we are not
> held up by parity issues.
>
> Do you have some paper suggestions for cardinality? I do not think there
> is a feature request on JIRA either.
>
> Tim
>
> On Thu, Feb 16, 2017 at 2:21 PM, bradc  wrote:
> > Hi,
> >
> > While it is also missing in spark.mllib, I'd suggest adding
> > cardinality as part of the Simple descriptive statistics for both
> > spark.ml and spark.mllib?
> > This is useful even for data in double precision FP to understand the
> > "uniqueness" of the feature data.
> >
> > Cheers,
> > Brad
> >
> >
> >
> >
> > --
> > View this message in context:
> > http://apache-spark-developers-list.1001551.n3.nabble.com/Design-docum
> > ent-MLlib-s-statistical-package-for-DataFrames-tp21014p21016.html
> > Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> --
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Should we consider a Spark 2.1.1 release?

2017-03-13 Thread Holden Karau
Hi Spark Devs,

Spark 2.1 has been out since end of December
<http://apache-spark-developers-list.1001551.n3.nabble.com/ANNOUNCE-Announcing-Apache-Spark-2-1-0-td20390.html>
and we've got quite a few fixes merged for 2.1.1
<https://issues.apache.org/jira/browse/SPARK-18281?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1%20ORDER%20BY%20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC>.

On the Python side one of the things I'd like to see us get out into a
patch release is a packaging fix (now merged) before we upload to PyPI &
Conda, and we also have the normal batch of fixes like toLocalIterator for
large DataFrames in PySpark.
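
For anyone not familiar with it, toLocalIterator lets the driver walk a large
DataFrame's rows incrementally instead of collect()ing everything at once; a
minimal usage sketch (the app name and sizes are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("toLocalIterator-demo").getOrCreate()
    df = spark.range(0, 10 * 1000 * 1000)  # too large to comfortably collect()

    # Rows are pulled back to the driver a partition at a time as the
    # iterator advances, keeping driver memory usage bounded.
    total = 0
    for row in df.toLocalIterator():
        total += row.id

    print(total)
    spark.stop()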

I've chatted with Felix & Shivaram who seem to think the R side is looking
close to being in good shape for a 2.1.1 release to submit to CRAN (if I've
misspoken, my apologies). The two outstanding issues that are being
tracked for R are SPARK-18817 and SPARK-19237.

Looking at the other components quickly it seems like structured streaming
could also benefit from a patch release.

What do others think - are there any issues people are actively targeting
for 2.1.1? Is this too early to be considering a patch release?

Cheers,

Holden
-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Should we consider a Spark 2.1.1 release?

2017-03-13 Thread Holden Karau
I'd be happy to do the work of coordinating a 2.1.1 release if that's a
thing a committer can do (I think the release coordinator for the most
recent Arrow release was a committer and the final publish step took a PMC
member to upload but other than that I don't remember any issues).

On Mon, Mar 13, 2017 at 1:05 PM Sean Owen  wrote:

> It seems reasonable to me, in that other x.y.1 releases have followed ~2
> months after the x.y.0 release and it's been about 3 months since 2.1.0.
>
> Related: creating releases is tough work, so I feel kind of bad voting for
> someone else to do that much work. Would it make sense to deputize another
> release manager to help get out just the maintenance releases? this may in
> turn mean maintenance branches last longer. Experienced hands can continue
> to manage new minor and major releases as they require more coordination.
>
> I know most of the release process is written down; I know it's also still
> going to be work to make it 100% documented. Eventually it'll be necessary
> to make sure it's entirely codified anyway.
>
> Not pushing for it myself, just noting I had heard this brought up in side
> conversations before.
>
>
> On Mon, Mar 13, 2017 at 7:07 PM Holden Karau  wrote:
>
> Hi Spark Devs,
>
> Spark 2.1 has been out since end of December
> <http://apache-spark-developers-list.1001551.n3.nabble.com/ANNOUNCE-Announcing-Apache-Spark-2-1-0-td20390.html>
> and we've got quite a few fixes merged for 2.1.1
> <https://issues.apache.org/jira/browse/SPARK-18281?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1%20ORDER%20BY%20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC>
> .
>
> On the Python side one of the things I'd like to see us get out into a
> patch release is a packaging fix (now merged) before we upload to PyPI &
> Conda, and we also have the normal batch of fixes like toLocalIterator for
> large DataFrames in PySpark.
>
> I've chatted with Felix & Shivaram who seem to think the R side is looking
> close to in good shape for a 2.1.1 release to submit to CRAN (if I've
> miss-spoken my apologies). The two outstanding issues that are being
> tracked for R are SPARK-18817, SPARK-19237.
>
> Looking at the other components quickly it seems like structured streaming
> could also benefit from a patch release.
>
> What do others think - are there any issues people are actively targeting
> for 2.1.1? Is this too early to be considering a patch release?
>
> Cheers,
>
> Holden
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
> --
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Should we consider a Spark 2.1.1 release?

2017-03-19 Thread Holden Karau
This discussion seems like it might benefit from its own thread; we've
previously decided to lengthen release cycles, but if there are different
opinions about that it seems unrelated to the specific 2.1.1 release.

On Sun, Mar 19, 2017 at 2:57 PM Jacek Laskowski  wrote:

> Hi Mark,
>
> I appreciate your comment.
>
> My thinking is that the more frequent minor and patch releases the
> more often end users can give them a shot and be part of the bigger
> release cycle for major releases. Spark's an OSS project and we all
> can make mistakes and my thinking is that the more eyeballs the
> less the number of the mistakes. If we make very fine/minor releases
> often we should be able to attract more people who spend their time on
> testing/verification that eventually contribute to a higher quality of
> Spark.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sun, Mar 19, 2017 at 10:50 PM, Mark Hamstra 
> wrote:
> > That doesn't necessarily follow, Jacek. There is a point where too
> frequent
> > releases decrease quality. That is because releases don't come for free
> --
> > each one demands a considerable amount of time from release managers,
> > testers, etc. -- time that would otherwise typically be devoted to
> improving
> > (or at least adding to) the code. And that doesn't even begin to consider
> > the time that needs to be spent putting a new version into a larger
> software
> > distribution or that users need to put in to deploy and use a new
> version.
> > If you have an extremely lightweight deployment cycle, then small, quick
> > releases can make sense; but "lightweight" doesn't really describe a
> Spark
> > release. The concern for excessive overhead is a large part of the
> thinking
> > behind why we stretched out the roadmap to allow longer intervals between
> > scheduled releases. A similar concern does come into play for unscheduled
> > maintenance releases -- but I don't think that that is the forcing
> function
> > at this point: A 2.1.1 release is a good idea.
> >
> > On Sun, Mar 19, 2017 at 6:24 AM, Jacek Laskowski 
> wrote:
> >>
> >> +1
> >>
> >> More smaller and more frequent releases (so major releases get even more
> >> quality).
> >>
> >> Jacek
> >>
> >> On 13 Mar 2017 8:07 p.m., "Holden Karau"  wrote:
> >>>
> >>> Hi Spark Devs,
> >>>
> >>> Spark 2.1 has been out since end of December and we've got quite a few
> >>> fixes merged for 2.1.1.
> >>>
> >>> On the Python side one of the things I'd like to see us get out into a
> >>> patch release is a packaging fix (now merged) before we upload to PyPI
> &
> >>> Conda, and we also have the normal batch of fixes like toLocalIterator
> for
> >>> large DataFrames in PySpark.
> >>>
> >>> I've chatted with Felix & Shivaram who seem to think the R side is
> >>> looking close to in good shape for a 2.1.1 release to submit to CRAN
> (if
> >>> I've miss-spoken my apologies). The two outstanding issues that are
> being
> >>> tracked for R are SPARK-18817, SPARK-19237.
> >>>
> >>> Looking at the other components quickly it seems like structured
> >>> streaming could also benefit from a patch release.
> >>>
> >>> What do others think - are there any issues people are actively
> targeting
> >>> for 2.1.1? Is this too early to be considering a patch release?
> >>>
> >>> Cheers,
> >>>
> >>> Holden
> >>> --
> >>> Cell : 425-233-8271
> >>> Twitter: https://twitter.com/holdenkarau
> >
> >
>
-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Should we consider a Spark 2.1.1 release?

2017-03-20 Thread Holden Karau
I think questions around how long the 1.6 series will be supported are
really important, but probably belong in a different thread than the 2.1.1
release discussion.

On Mon, Mar 20, 2017 at 11:34 AM Timur Shenkao  wrote:

> Hello guys,
>
> Spark benefits from stable versions, not frequent ones.
> A lot of people still have 1.6.x in production. Those who want the
> freshest (like me) can always deploy nightly builds.
> My question is: how long version 1.6 will be supported?
>
>
> On Sunday, March 19, 2017, Holden Karau  wrote:
>
> This discussion seems like it might benefit from its own thread; we've
> previously decided to lengthen release cycles, but if there are different
> opinions about that it seems unrelated to the specific 2.1.1 release.
>
> On Sun, Mar 19, 2017 at 2:57 PM Jacek Laskowski  wrote:
>
> Hi Mark,
>
> I appreciate your comment.
>
> My thinking is that the more frequent minor and patch releases the
> more often end users can give them a shot and be part of the bigger
> release cycle for major releases. Spark's an OSS project and we all
> can make mistakes and my thinking is that the more eyeballs the
> less the number of the mistakes. If we make very fine/minor releases
> often we should be able to attract more people who spend their time on
> testing/verification that eventually contribute to a higher quality of
> Spark.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sun, Mar 19, 2017 at 10:50 PM, Mark Hamstra 
> wrote:
> > That doesn't necessarily follow, Jacek. There is a point where too
> frequent
> > releases decrease quality. That is because releases don't come for free
> --
> > each one demands a considerable amount of time from release managers,
> > testers, etc. -- time that would otherwise typically be devoted to
> improving
> > (or at least adding to) the code. And that doesn't even begin to consider
> > the time that needs to be spent putting a new version into a larger
> software
> > distribution or that users need to put in to deploy and use a new
> version.
> > If you have an extremely lightweight deployment cycle, then small, quick
> > releases can make sense; but "lightweight" doesn't really describe a
> Spark
> > release. The concern for excessive overhead is a large part of the
> thinking
> > behind why we stretched out the roadmap to allow longer intervals between
> > scheduled releases. A similar concern does come into play for unscheduled
> > maintenance releases -- but I don't think that that is the forcing
> function
> > at this point: A 2.1.1 release is a good idea.
> >
> > On Sun, Mar 19, 2017 at 6:24 AM, Jacek Laskowski 
> wrote:
> >>
> >> +1
> >>
> >> More smaller and more frequent releases (so major releases get even more
> >> quality).
> >>
> >> Jacek
> >>
> >> On 13 Mar 2017 8:07 p.m., "Holden Karau"  wrote:
> >>>
> >>> Hi Spark Devs,
> >>>
> >>> Spark 2.1 has been out since end of December and we've got quite a few
> >>> fixes merged for 2.1.1.
> >>>
> >>> On the Python side one of the things I'd like to see us get out into a
> >>> patch release is a packaging fix (now merged) before we upload to PyPI
> &
> >>> Conda, and we also have the normal batch of fixes like toLocalIterator
> for
> >>> large DataFrames in PySpark.
> >>>
> >>> I've chatted with Felix & Shivaram who seem to think the R side is
> >>> looking close to in good shape for a 2.1.1 release to submit to CRAN
> (if
> >>> I've miss-spoken my apologies). The two outstanding issues that are
> being
> >>> tracked for R are SPARK-18817, SPARK-19237.
> >>>
> >>> Looking at the other components quickly it seems like structured
> >>> streaming could also benefit from a patch release.
> >>>
> >>> What do others think - are there any issues people are actively
> targeting
> >>> for 2.1.1? Is this too early to be considering a patch release?
> >>>
> >>> Cheers,
> >>>
> >>> Holden
> >>> --
> >>> Cell : 425-233-8271
> >>> Twitter: https://twitter.com/holdenkarau
> >
> >
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
> --
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Outstanding Spark 2.1.1 issues

2017-03-20 Thread Holden Karau
Hi Spark Developers!

As we start working on the Spark 2.1.1 release I've been looking at our
outstanding issues still targeted for it. I've tried to break it down by
component so that people in charge of each component can take a quick look
and see if any of these things can/should be re-targeted to 2.2 or 2.1.2 &
the overall list is pretty short (only 9 items - 5 if we only look at
explicitly tagged) :)

If you're working on something for Spark 2.1.1 and it doesn't show up in this
list please speak up now :) We have a lot of issues (including "in
progress") that are listed as impacting 2.1.0, but they aren't targeted for
2.1.1 - if there is something you are working on there which should be
targeted for 2.1.1 please let us know so it doesn't slip through the cracks.

The query string I used for looking at the 2.1.1 open issues is:

((affectedVersion = 2.1.1 AND cf[12310320] is Empty) OR fixVersion = 2.1.1
OR cf[12310320] = "2.1.1") AND project = spark AND resolution = Unresolved
ORDER BY priority DESC
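
If it's easier to script than to click through JIRA, the same JQL can be run
against the public REST search endpoint; a small sketch, assuming the
standard /rest/api/2/search API and the requests library:

    import requests

    # The same JQL used above for the 2.1.1 open-issue list.
    JQL = ('((affectedVersion = 2.1.1 AND cf[12310320] is Empty) OR '
           'fixVersion = 2.1.1 OR cf[12310320] = "2.1.1") AND project = spark '
           'AND resolution = Unresolved ORDER BY priority DESC')

    resp = requests.get(
        "https://issues.apache.org/jira/rest/api/2/search",
        params={"jql": JQL, "fields": "summary,priority", "maxResults": 50},
    )
    resp.raise_for_status()
    for issue in resp.json()["issues"]:
        print(issue["key"], "-", issue["fields"]["summary"])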

None of the open issues appear to be a regression from 2.1.0, but those
seem more likely to show up during the RC process (thanks in advance to
everyone testing their workloads :)) & generally none of them seem to be

(Note: the cfs are for Target Version/s field)

Critical Issues:
 SQL:
  SPARK-19690  - Join a
streaming DataFrame with a batch DataFrame may not work - PR
https://github.com/apache/spark/pull/17052 (review in progress by zsxwing,
currently failing Jenkins)*

Major Issues:
 SQL:
  SPARK-19035  - rand()
function in case when cause failed - no outstanding PR (consensus on JIRA
seems to be leaning towards it being a real issue but not necessarily
everyone agrees just yet - maybe we should slip this?)*
 Deploy:
  SPARK-19522 
 - --executor-memory flag doesn't work in local-cluster mode -
https://github.com/apache/spark/pull/16975 (review in progress by vanzin,
but PR currently stalled waiting on response) *
 Core:
  SPARK-20025  - Driver
fail over will not work, if SPARK_LOCAL* env is set. -
https://github.com/apache/spark/pull/17357 (waiting on review) *
 PySpark:
 SPARK-19955  - Update
run-tests to support conda [ Part of Dropping 2.6 support -- which we
shouldn't do in a minor release -- but also fixes pip installability tests
to run in Jenkins ] - PR failing Jenkins (I need to poke this some more,
but it seems like 2.7 support works with some other issues remaining. Maybe slip to 2.2?)

Minor issues:
 Tests:
  SPARK-19612  - Tests
failing with timeout - No PR per se but it seems unrelated to the 2.1.1
release. It's not targeted for 2.1.1 but listed as affecting 2.1.1 - I'd
consider explicitly targeting this for 2.2?
 PySpark:
  SPARK-19570  - Allow
to disable hive in pyspark shell - https://github.com/apache/spark/pull/16906
PR exists but it's difficult to add automated tests for
this (although if SPARK-19955 gets in it would make
testing this easier) - no reviewers yet. Possible re-target?*
 Structured Streaming:
  SPARK-19613  - Flaky
test: StateStoreRDDSuite.versioning and immutability - It's not targeted
for 2.1.1 but listed as affecting 2.1.1 - I'd consider explicitly targeting
this for 2.2?
 ML:
  SPARK-19759 
 - ALSModel.predict on Dataframes : potential optimization by not using
blas - No PR; consider re-targeting unless someone has a PR waiting in the
wings?

Explicitly targeted issues are marked with a *, the remaining issues are
listed as impacting 2.1.1 and don't have a specific target version set.

Since 2.1.1 continues the 2.1.0 branch, looking at 2.1.0 shows 1 open
blocker in SQL (SPARK-19983).

Query string is:

affectedVersion = 2.1.0 AND cf[12310320] is EMPTY AND project = spark AND
resolution = Unresolved AND priority = targetPriority

Continuing on, for unresolved 2.1.0 issues there are 163 Major (76 of
them in progress), 65 Minor (26 in progress), and 9 Trivial (6 in progress).

I'll be going through the 2.1.0 major issues with open PRs that impact the
PySpark component and seeing if any of them should be targeted for 2.1.1;
if anyone from the other components wants to take a look through, we might
find some easy wins to be merged.

Cheers,

Holden :)

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Outstanding Spark 2.1.1 issues

2017-03-20 Thread Holden Karau
I'm not super sure it should be a blocker for 2.1.1 -- is it a regression?
Maybe we can get TD's input on it?

On Mon, Mar 20, 2017 at 8:48 PM Nan Zhu  wrote:

> I think https://issues.apache.org/jira/browse/SPARK-19280 should be a
> blocker
>
> Best,
>
> Nan
>
> On Mon, Mar 20, 2017 at 8:18 PM, Felix Cheung 
> wrote:
>
> I've been scrubbing R and think we are tracking 2 issues
>
> https://issues.apache.org/jira/browse/SPARK-19237
>
> https://issues.apache.org/jira/browse/SPARK-19925
>
>
>
>
> ------
> *From:* holden.ka...@gmail.com  on behalf of
> Holden Karau 
> *Sent:* Monday, March 20, 2017 3:12:35 PM
> *To:* dev@spark.apache.org
> *Subject:* Outstanding Spark 2.1.1 issues
>
> Hi Spark Developers!
>
> As we start working on the Spark 2.1.1 release I've been looking at our
> outstanding issues still targeted for it. I've tried to break it down by
> component so that people in charge of each component can take a quick look
> and see if any of these things can/should be re-targeted to 2.2 or 2.1.2 &
> the overall list is pretty short (only 9 items - 5 if we only look at
> explicitly tagged) :)
>
> If you're working on something for Spark 2.1.1 and it doesn't show up in
> this list please speak up now :) We have a lot of issues (including "in
> progress") that are listed as impacting 2.1.0, but they aren't targeted for
> 2.1.1 - if there is something you are working on there which should be
> targeted for 2.1.1 please let us know so it doesn't slip through the cracks.
>
> The query string I used for looking at the 2.1.1 open issues is:
>
> ((affectedVersion = 2.1.1 AND cf[12310320] is Empty) OR fixVersion = 2.1.1
> OR cf[12310320] = "2.1.1") AND project = spark AND resolution = Unresolved
> ORDER BY priority DESC
>
> None of the open issues appear to be a regression from 2.1.0, but those
> seem more likely to show up during the RC process (thanks in advance to
> everyone testing their workloads :)) & generally none of them seem to be
>
> (Note: the cfs are for Target Version/s field)
>
> Critical Issues:
>  SQL:
>   SPARK-19690 <https://issues.apache.org/jira/browse/SPARK-19690> - Join
> a streaming DataFrame with a batch DataFrame may not work - PR
> https://github.com/apache/spark/pull/17052 (review in progress by
> zsxwing, currently failing Jenkins)*
>
> Major Issues:
>  SQL:
>   SPARK-19035 <https://issues.apache.org/jira/browse/SPARK-19035> - rand()
> function in case when cause failed - no outstanding PR (consensus on JIRA
> seems to be leaning towards it being a real issue but not necessarily
> everyone agrees just yet - maybe we should slip this?)*
>  Deploy:
>   SPARK-19522 <https://issues.apache.org/jira/browse/SPARK-19522> - 
> --executor-memory
> flag doesn't work in local-cluster mode -
> https://github.com/apache/spark/pull/16975 (review in progress by vanzin,
> but PR currently stalled waiting on response) *
>  Core:
>   SPARK-20025 <https://issues.apache.org/jira/browse/SPARK-20025> - Driver
> fail over will not work, if SPARK_LOCAL* env is set. -
> https://github.com/apache/spark/pull/17357 (waiting on review) *
>  PySpark:
>  SPARK-19955 <https://issues.apache.org/jira/browse/SPARK-19955> - Update
> run-tests to support conda [ Part of Dropping 2.6 support -- which we
> shouldn't do in a minor release -- but also fixes pip installability tests
> to run in Jenkins ]-  PR failing Jenkins (I need to poke this some more,
> but seems like 2.7 support works but some other issues. Maybe slip to 2.2?)
>
> Minor issues:
>  Tests:
>   SPARK-19612 <https://issues.apache.org/jira/browse/SPARK-19612> - Tests
> failing with timeout - No PR per-se but it seems unrelated to the 2.1.1
> release. It's not targeted for 2.1.1 but listed as affecting 2.1.1 - I'd
> consider explicitly targeting this for 2.2?
>  PySpark:
>   SPARK-19570 <https://issues.apache.org/jira/browse/SPARK-19570> - Allow
> to disable hive in pyspark shell -
> https://github.com/apache/spark/pull/16906 PR exists but it's difficult to
> add automated tests for this (although if SPARK-19955
> <https://issues.apache.org/jira/browse/SPARK-19955> gets in would make
> testing this easier) - no reviewers yet. Possible re-target?*
>  Structured Streaming:
>   SPARK-19613 <https://issues.apache.org/jira/browse/SPARK-19613> - Flaky
> test: StateStoreRDDSuite.versioning and immutability - It's not targeted
> for 2.1.1 but listed as affecting 2.1.1 - I'd consider explicitly targeting
> this for 2.2?
>  ML:
>   SPARK-19759 <https://issues.apache.org/jira/browse/SPARK-19759>

Re: Outstanding Spark 2.1.1 issues

2017-03-21 Thread Holden Karau
I agree with Michael - I think we've got some outstanding issues but none of
them seem like regressions from 2.1, so we should be good to start the RC
process.

On Tue, Mar 21, 2017 at 1:41 PM, Michael Armbrust 
wrote:

> Please speak up if I'm wrong, but none of these seem like critical
> regressions from 2.1.  As such I'll start the RC process later today.
>
> On Mon, Mar 20, 2017 at 9:52 PM, Holden Karau 
> wrote:
>
>> I'm not super sure it should be a blocker for 2.1.1 -- is it a
>> regression? Maybe we can get TDs input on it?
>>
>> On Mon, Mar 20, 2017 at 8:48 PM Nan Zhu  wrote:
>>
>>> I think https://issues.apache.org/jira/browse/SPARK-19280 should be a
>>> blocker
>>>
>>> Best,
>>>
>>> Nan
>>>
>>> On Mon, Mar 20, 2017 at 8:18 PM, Felix Cheung >> > wrote:
>>>
>>> I've been scrubbing R and think we are tracking 2 issues
>>>
>>> https://issues.apache.org/jira/browse/SPARK-19237
>>>
>>> https://issues.apache.org/jira/browse/SPARK-19925
>>>
>>>
>>>
>>>
>>> --
>>> *From:* holden.ka...@gmail.com  on behalf of
>>> Holden Karau 
>>> *Sent:* Monday, March 20, 2017 3:12:35 PM
>>> *To:* dev@spark.apache.org
>>> *Subject:* Outstanding Spark 2.1.1 issues
>>>
>>> Hi Spark Developers!
>>>
>>> As we start working on the Spark 2.1.1 release I've been looking at our
>>> outstanding issues still targeted for it. I've tried to break it down by
>>> component so that people in charge of each component can take a quick look
>>> and see if any of these things can/should be re-targeted to 2.2 or 2.1.2 &
>>> the overall list is pretty short (only 9 items - 5 if we only look at
>>> explicitly tagged) :)
>>>
> >>> If you're working on something for Spark 2.1.1 and it doesn't show up in
> >>> this list please speak up now :) We have a lot of issues (including "in
> >>> progress") that are listed as impacting 2.1.0, but they aren't targeted for
> >>> 2.1.1 - if there is something you are working on there which should be
> >>> targeted for 2.1.1 please let us know so it doesn't slip through the cracks.
>>>
>>> The query string I used for looking at the 2.1.1 open issues is:
>>>
>>> ((affectedVersion = 2.1.1 AND cf[12310320] is Empty) OR fixVersion =
>>> 2.1.1 OR cf[12310320] = "2.1.1") AND project = spark AND resolution =
>>> Unresolved ORDER BY priority DESC
>>>
>>> None of the open issues appear to be a regression from 2.1.0, but those
>>> seem more likely to show up during the RC process (thanks in advance to
>>> everyone testing their workloads :)) & generally none of them seem to be
>>>
>>> (Note: the cfs are for Target Version/s field)
>>>
>>> Critical Issues:
>>>  SQL:
>>>   SPARK-19690 <https://issues.apache.org/jira/browse/SPARK-19690> - Join
>>> a streaming DataFrame with a batch DataFrame may not work - PR
>>> https://github.com/apache/spark/pull/17052 (review in progress by
>>> zsxwing, currently failing Jenkins)*
>>>
>>> Major Issues:
>>>  SQL:
>>>   SPARK-19035 <https://issues.apache.org/jira/browse/SPARK-19035> - rand()
>>> function in case when cause failed - no outstanding PR (consensus on JIRA
>>> seems to be leaning towards it being a real issue but not necessarily
>>> everyone agrees just yet - maybe we should slip this?)*
>>>  Deploy:
>>>   SPARK-19522 <https://issues.apache.org/jira/browse/SPARK-19522>
>>>  - --executor-memory flag doesn't work in local-cluster mode -
>>> https://github.com/apache/spark/pull/16975 (review in progress by
>>> vanzin, but PR currently stalled waiting on response) *
>>>  Core:
>>>   SPARK-20025 <https://issues.apache.org/jira/browse/SPARK-20025> - Driver
>>> fail over will not work, if SPARK_LOCAL* env is set. -
>>> https://github.com/apache/spark/pull/17357 (waiting on review) *
>>>  PySpark:
>>>  SPARK-19955 <https://issues.apache.org/jira/browse/SPARK-19955> -
>>> Update run-tests to support conda [ Part of Dropping 2.6 support -- which
>>> we shouldn't do in a minor release -- but also fixes pip installability
>>> tests to run in Jenkins ]-  PR failing Jenkins (I need to poke this some
>>> more, but seems like 2.7 support works but some other issues. Maybe slip to

[Important for PySpark Devs]: Master now tests with Python 2.7 rather than 2.6 - please retest any Python PRs

2017-03-29 Thread Holden Karau
Hi PySpark Developers,

In https://issues.apache.org/jira/browse/SPARK-19955 /
https://github.com/apache/spark/pull/17355, as part of our continued Python
2.6 deprecation https://issues.apache.org/jira/browse/SPARK-15902 &
eventual removal https://issues.apache.org/jira/browse/SPARK-12661 ,
Jenkins master will now test with Python 2.7 rather than Python 2.6. If you
have a pending Python PR please re-run Jenkins tests prior to merge to
avoid issues.

For your local testing, *make sure you have a version of Python 2.7
installed on your machine*, otherwise it will default to using the python
executable and in the future you may run into compatibility issues.

Note: this only impacts master and has not been merged to other branches,
so if you want to make fixes that are planned to be backported to 2.1,
please continue to use 2.6-compatible Python code (and note you can always
explicitly set a Python version to be run with the --python-executables
flag when testing locally).
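
As a quick refresher on what "2.6 compatible" rules out, a few constructs
that only work on Python 2.7+ (generic examples, not taken from the Spark
code base):

    # Constructs to avoid in code that still has to run on Python 2.6:

    only_27_set = {1, 2, 3}                    # set literals are 2.7+
    ok_26_set = set([1, 2, 3])                 # 2.6-safe equivalent

    only_27_dict = {x: x * x for x in range(5)}       # dict comprehensions are 2.7+
    ok_26_dict = dict((x, x * x) for x in range(5))   # 2.6-safe equivalent

    only_27_fmt = "{} took {}s".format("job", 42)     # auto-numbered fields are 2.7+
    ok_26_fmt = "{0} took {1}s".format("job", 42)     # 2.6-safe equivalent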

Cheers,

Holden :)

P.S.

If you run into any issues around this please feel free (as always) to reach
out and ping me.

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Outstanding Spark 2.1.1 issues

2017-03-30 Thread Holden Karau
Hi All,

Just circling back to see if there is anything blocking the RC that isn't
being tracked in JIRA?

The current in progress list from ((affectedVersion = 2.1.1 AND
cf[12310320] is Empty) OR fixVersion = 2.1.1 OR cf[12310320] = "2.1.1") AND
project = spark AND resolution = Unresolved ORDER BY priority DESC is only
4 elements:

   1. SPARK-19690 <https://issues.apache.org/jira/browse/SPARK-19690> - Join
   a streaming DataFrame with a batch DataFrame may not work (PR
   https://github.com/apache/spark/pull/17052) - some discussion around
   re-targeting exists on the PR
   2. SPARK-19522 <https://issues.apache.org/jira/browse/SPARK-19522>
   - --executor-memory flag doesn't work in local-cluster mode (PR
   https://github.com/apache/spark/pull/16975)
   3. SPARK-19035 <https://issues.apache.org/jira/browse/SPARK-19035>
   - rand() function in case when cause failed - no PR exists and it
   isn't a blocker so I'd suggest we consider re-targeting
   4. SPARK-19759 <https://issues.apache.org/jira/browse/SPARK-19759>
   - ALSModel.predict on Dataframes: potential optimization by not using
   blas - not explicitly targeted but I'd suggest targeting for 2.3 if
   people agree
Cheers,

Holden :)

On Tue, Mar 28, 2017 at 2:07 PM, Xiao Li  wrote:

> Hi, Michael,
>
> Since Daniel Siegmann asked for a bug fix backport in the previous email,
> I just merged https://issues.apache.org/jira/browse/SPARK-14536 into
> Spark 2.1 branch.
>
> If this JIRA is not part of Spark 2.1.1 release, could you help me correct
> the fix version from 2.1.1 to the next release number.
>
> Thanks,
>
> Xiao
>
> 2017-03-28 8:33 GMT-07:00 Michael Armbrust :
>
>> We just fixed the build yesterday.  I'll kick off a new RC today.
>>
>> On Tue, Mar 28, 2017 at 8:04 AM, Asher Krim  wrote:
>>
>>> Hey Michael,
>>> any update on this? We're itching for a 2.1.1 release (specifically
>>> SPARK-14804 which is currently blocking us)
>>>
>>> Thanks,
>>> Asher Krim
>>> Senior Software Engineer
>>>
>>> On Wed, Mar 22, 2017 at 7:44 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>>> An update: I cut the tag for RC1 last night.  Currently fighting with
>>>> the release process.  Will post RC1 once I get it working.
>>>>
>>>> On Tue, Mar 21, 2017 at 2:16 PM, Nick Pentreath <
>>>> nick.pentre...@gmail.com> wrote:
>>>>
>>>>> As for SPARK-19759 <https://issues.apache.org/jira/browse/SPARK-19759>,
>>>>> I don't think that needs to be targeted for 2.1.1 so we don't need to 
>>>>> worry
>>>>> about it
>>>>>
>>>>>
>>>>> On Tue, 21 Mar 2017 at 13:49 Holden Karau 
>>>>> wrote:
>>>>>
>>>>>> I agree with Michael, I think we've got some outstanding issues but
>>>>>> none of them seem like regression from 2.1 so we should be good to start
>>>>>> the RC process.
>>>>>>
>>>>>> On Tue, Mar 21, 2017 at 1:41 PM, Michael Armbrust <
>>>>>> mich...@databricks.com> wrote:
>>>>>>
>>>>>> Please speak up if I'm wrong, but none of these seem like critical
>>>>>> regressions from 2.1.  As such I'll start the RC process later today.
>>>>>>
>>>>>> On Mon, Mar 20, 2017 at 9:52 PM, Holden Karau 
>>>>>> wrote:
>>>>>>
>>>>>> I'm not super sure it should be a blocker for 2.1.1 -- is it a
>>>>>> regression? Maybe we can get TDs input on it?
>>>>>>
>>>>>> On Mon, Mar 20, 2017 at 8:48 PM Nan Zhu 
>>>>>> wrote:
>>>>>>
>>>>>> I think https://issues.apache.org/jira/browse/SPARK-19280 should be
>>>>>> a blocker
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Nan
>>>>>>
>>>>>> On Mon, Mar 20, 2017 at 8:18 PM, Felix Cheung <
>>>>>> felixcheun...@hotmail.com> wrote:
>>>>>>
>>>>>> I've been scrubbing R and think we are tracking 2 issues
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/SPARK-19237
>>>>>

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-03-31 Thread Holden Karau
-1 (non-binding)

Python packaging doesn't seem to have quite worked out (looking at PKG-INFO
the description is "Description: ! missing pandoc do not upload to PyPI
"); ideally it would be nice to have this be a version we can upload to
PyPI.
Building this on my own machine results in a longer description.

My guess is that whichever machine was used to package this is missing the
pandoc executable (or possibly pypandoc library).
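
For context, that sentinel description usually comes from a setup.py pattern
roughly like the sketch below (the general pypandoc fallback idea, not
necessarily our exact code): the README gets converted to reST for the PyPI
long description, and a warning string is substituted when pandoc/pypandoc
isn't available on the machine doing the packaging.

    # Sketch of the common pypandoc fallback in a setup.py. If pandoc (and
    # the pypandoc wrapper) is installed, README.md is converted to
    # reStructuredText for the PyPI description; otherwise a sentinel string
    # ends up in PKG-INFO, which is what the RC2 artifacts show.
    try:
        import pypandoc
        long_description = pypandoc.convert("README.md", "rst")
    except (ImportError, OSError):
        long_description = "!!! missing pandoc, do not upload to PyPI !!!"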

On Fri, Mar 31, 2017 at 3:40 PM, Xiao Li  wrote:

> +1
>
> Xiao
>
> 2017-03-30 16:09 GMT-07:00 Michael Armbrust :
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.1.0. The vote is open until Sun, April 2nd, 2018 at 16:30 PST and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.1.1
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.1.1-rc2
>>  (
>> 02b165dcc2ee5245d1293a375a31660c9d4e1fa6)
>>
>> List of JIRA tickets resolved can be found with this filter
>> 
>> .
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1227/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc2-docs/
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> *What should happen to JIRA tickets still targeting 2.1.1?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.1.2 or 2.2.0.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.1.0.
>>
>> *What happened to RC1?*
>>
>> There were issues with the release packaging and as a result was skipped.
>>
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-04 Thread Holden Karau
So the fix is installing pandoc on whichever machine is used for packaging.
I thought that was generally done on the machine of the person rolling the
release so I wasn't sure it made sense as a JIRA, but from chatting with
Josh it sounds like that part might be done on one of the Jenkins workers - is
there a fixed one that is used?

Regardless I'll file a JIRA for this when I get back in front of my desktop
(~1 hour or so).

On Tue, Apr 4, 2017 at 2:35 PM Michael Armbrust 
wrote:

> Thanks for the comments everyone.  This vote fails.  Here's how I think we
> should proceed:
>  - [SPARK-20197] - SparkR CRAN - appears to be resolved
>  - [SPARK-] - Python packaging - Holden, please file a JIRA and
> report if this is a regression and if there is an easy fix that we should
> wait for.
>
> For all the other test failures, please take the time to look through JIRA
> and open an issue if one does not already exist so that we can triage if
> these are just environmental issues.  If I don't hear any objections I'm
> going to go ahead with RC3 tomorrow.
>
> On Sun, Apr 2, 2017 at 1:16 PM, Felix Cheung 
> wrote:
>
> -1
> sorry, found an issue with SparkR CRAN check.
> Opened SPARK-20197 and working on fix.
>
> ------
> *From:* holden.ka...@gmail.com  on behalf of
> Holden Karau 
> *Sent:* Friday, March 31, 2017 6:25:20 PM
> *To:* Xiao Li
> *Cc:* Michael Armbrust; dev@spark.apache.org
> *Subject:* Re: [VOTE] Apache Spark 2.1.1 (RC2)
>
> -1 (non-binding)
>
> Python packaging doesn't seem to have quite worked out (looking
> at PKG-INFO the description is "Description: ! missing pandoc do not
> upload to PyPI "), ideally it would be nice to have this as a version
> we upgrade to PyPi.
> Building this on my own machine results in a longer description.
>
> My guess is that whichever machine was used to package this is missing the
> pandoc executable (or possibly pypandoc library).
>
> On Fri, Mar 31, 2017 at 3:40 PM, Xiao Li  wrote:
>
> +1
>
> Xiao
>
> 2017-03-30 16:09 GMT-07:00 Michael Armbrust :
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.1.0. The vote is open until Sun, April 2nd, 2018 at 16:30 PST and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.1
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.1.1-rc2
> <https://github.com/apache/spark/tree/v2.1.1-rc2> (
> 02b165dcc2ee5245d1293a375a31660c9d4e1fa6)
>
> List of JIRA tickets resolved can be found with this filter
> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>
> .
>
> The release files, including signatures, digests, etc. can be found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1227/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc2-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.1.1?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.2 or 2.2.0.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.0.
>
> *What happened to RC1?*
>
> There were issues with the release packaging and as a result was skipped.
>
>
>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
>
> --
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-04 Thread Holden Karau
See SPARK-20216; if Michael can let me know which machine is being used for
packaging I can see if I can install pandoc on it (should be simple, but I
know the Jenkins cluster is a bit on the older side).

On Tue, Apr 4, 2017 at 3:06 PM, Holden Karau  wrote:

> So the fix is installing pandoc on whichever machine is used for
> packaging. I thought that was generally done on the machine of the person
> rolling the release so I wasn't sure it made sense as a JIRA, but from
> chatting with Josh it sounds like that part might be on of the Jenkins
> workers - is there a fixed one that is used?
>
> Regardless I'll file a JIRA for this when I get back in front of my
> desktop (~1 hour or so).
>
> On Tue, Apr 4, 2017 at 2:35 PM Michael Armbrust 
> wrote:
>
>> Thanks for the comments everyone.  This vote fails.  Here's how I think
>> we should proceed:
>>  - [SPARK-20197] - SparkR CRAN - appears to be resolved
>>  - [SPARK-] - Python packaging - Holden, please file a JIRA and
>> report if this is a regression and if there is an easy fix that we should
>> wait for.
>>
>> For all the other test failures, please take the time to look through
>> JIRA and open an issue if one does not already exist so that we can triage
>> if these are just environmental issues.  If I don't hear any objections I'm
>> going to go ahead with RC3 tomorrow.
>>
>> On Sun, Apr 2, 2017 at 1:16 PM, Felix Cheung 
>> wrote:
>>
>> -1
>> sorry, found an issue with SparkR CRAN check.
>> Opened SPARK-20197 and working on fix.
>>
>> --
>> *From:* holden.ka...@gmail.com  on behalf of
>> Holden Karau 
>> *Sent:* Friday, March 31, 2017 6:25:20 PM
>> *To:* Xiao Li
>> *Cc:* Michael Armbrust; dev@spark.apache.org
>> *Subject:* Re: [VOTE] Apache Spark 2.1.1 (RC2)
>>
>> -1 (non-binding)
>>
>> Python packaging doesn't seem to have quite worked out (looking
>> at PKG-INFO the description is "Description: ! missing pandoc do not
>> upload to PyPI "), ideally it would be nice to have this as a version
>> we upgrade to PyPi.
>> Building this on my own machine results in a longer description.
>>
>> My guess is that whichever machine was used to package this is missing
>> the pandoc executable (or possibly pypandoc library).
>>
>> On Fri, Mar 31, 2017 at 3:40 PM, Xiao Li  wrote:
>>
>> +1
>>
>> Xiao
>>
>> 2017-03-30 16:09 GMT-07:00 Michael Armbrust :
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.1.0. The vote is open until Sun, April 2nd, 2018 at 16:30 PST and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.1.1
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.1.1-rc2
>> <https://github.com/apache/spark/tree/v2.1.1-rc2> (
>> 02b165dcc2ee5245d1293a375a31660c9d4e1fa6)
>>
>> List of JIRA tickets resolved can be found with this filter
>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>
>> .
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1227/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc2-docs/
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> *What should happen to JIRA tickets still targeting 2.1.1?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.1.2 or 2.2.0.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.1.0.
>>
>> *What happened to RC1?*
>>
>> There were issues with the release packaging and as a result was skipped.
>>
>>
>>
>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>>
>> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>



-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-05 Thread Holden Karau
Following up, the issues with missing pypandoc/pandoc on the packaging
machine have been resolved.

On Tue, Apr 4, 2017 at 3:54 PM, Holden Karau  wrote:

> See SPARK-20216, if Michael can let me know which machine is being used
> for packaging I can see if I can install pandoc on it (should be simple but
> I know the Jenkins cluster is a bit on the older side).
>
> On Tue, Apr 4, 2017 at 3:06 PM, Holden Karau  wrote:
>
>> So the fix is installing pandoc on whichever machine is used for
>> packaging. I thought that was generally done on the machine of the person
>> rolling the release so I wasn't sure it made sense as a JIRA, but from
>> chatting with Josh it sounds like that part might be on of the Jenkins
>> workers - is there a fixed one that is used?
>>
>> Regardless I'll file a JIRA for this when I get back in front of my
>> desktop (~1 hour or so).
>>
>> On Tue, Apr 4, 2017 at 2:35 PM Michael Armbrust 
>> wrote:
>>
>>> Thanks for the comments everyone.  This vote fails.  Here's how I think
>>> we should proceed:
>>>  - [SPARK-20197] - SparkR CRAN - appears to be resolved
>>>  - [SPARK-] - Python packaging - Holden, please file a JIRA and
>>> report if this is a regression and if there is an easy fix that we should
>>> wait for.
>>>
>>> For all the other test failures, please take the time to look through
>>> JIRA and open an issue if one does not already exist so that we can triage
>>> if these are just environmental issues.  If I don't hear any objections I'm
>>> going to go ahead with RC3 tomorrow.
>>>
>>> On Sun, Apr 2, 2017 at 1:16 PM, Felix Cheung 
>>> wrote:
>>>
>>> -1
>>> sorry, found an issue with SparkR CRAN check.
>>> Opened SPARK-20197 and working on fix.
>>>
>>> --
>>> *From:* holden.ka...@gmail.com  on behalf of
>>> Holden Karau 
>>> *Sent:* Friday, March 31, 2017 6:25:20 PM
>>> *To:* Xiao Li
>>> *Cc:* Michael Armbrust; dev@spark.apache.org
>>> *Subject:* Re: [VOTE] Apache Spark 2.1.1 (RC2)
>>>
>>> -1 (non-binding)
>>>
>>> Python packaging doesn't seem to have quite worked out (looking
>>> at PKG-INFO the description is "Description: ! missing pandoc do not
>>> upload to PyPI "), ideally it would be nice to have this as a version
>>> we upgrade to PyPi.
>>> Building this on my own machine results in a longer description.
>>>
>>> My guess is that whichever machine was used to package this is missing
>>> the pandoc executable (or possibly pypandoc library).
>>>
>>> On Fri, Mar 31, 2017 at 3:40 PM, Xiao Li  wrote:
>>>
>>> +1
>>>
>>> Xiao
>>>
>>> 2017-03-30 16:09 GMT-07:00 Michael Armbrust :
>>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.1.0. The vote is open until Sun, April 2nd, 2018 at 16:30 PST
>>> and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.1.1
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v2.1.1-rc2
>>> <https://github.com/apache/spark/tree/v2.1.1-rc2> (
>>> 02b165dcc2ee5245d1293a375a31660c9d4e1fa6)
>>>
>>> List of JIRA tickets resolved can be found with this filter
>>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>
>>> .
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc2-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1227/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc2-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-13 Thread Holden Karau
If it would help I'd be more than happy to look at kicking off the
packaging for RC3 since I've been poking around in Jenkins a bit (for
SPARK-20216
& friends) (I'd still probably need some guidance from a previous release
coordinator so I understand if that's not actually faster).

On Mon, Apr 10, 2017 at 6:39 PM, DB Tsai  wrote:

> I backported the fix into both branch-2.1 and branch-2.0. Thanks.
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0x5CED8B896A6BDFA0
>
>
> On Mon, Apr 10, 2017 at 4:20 PM, Ryan Blue  wrote:
> > DB,
> >
> > This vote already failed and there isn't a RC3 vote yet. If you backport
> the
> > changes to branch-2.1 they will make it into the next RC.
> >
> > rb
> >
> > On Mon, Apr 10, 2017 at 3:55 PM, DB Tsai  wrote:
> >>
> >> -1
> >>
> >> I think that back-porting SPARK-20270 and SPARK-18555 are very important
> >> since it's a critical bug that na.fill will mess up the data in Long
> even
> >> the data isn't null.
> >>
> >> Thanks.
> >>
> >>
> >> Sincerely,
> >>
> >> DB Tsai
> >> --
> >> Web: https://www.dbtsai.com
> >> PGP Key ID: 0x5CED8B896A6BDFA0
> >>
> >> On Wed, Apr 5, 2017 at 11:12 AM, Holden Karau 
> >> wrote:
> >>>
> >>> Following up, the issues with missing pypandoc/pandoc on the packaging
> >>> machine has been resolved.
> >>>
> >>> On Tue, Apr 4, 2017 at 3:54 PM, Holden Karau 
> >>> wrote:
> >>>>
> >>>> See SPARK-20216, if Michael can let me know which machine is being
> used
> >>>> for packaging I can see if I can install pandoc on it (should be
> simple but
> >>>> I know the Jenkins cluster is a bit on the older side).
> >>>>
> >>>> On Tue, Apr 4, 2017 at 3:06 PM, Holden Karau 
> >>>> wrote:
> >>>>>
> >>>>> So the fix is installing pandoc on whichever machine is used for
> >>>>> packaging. I thought that was generally done on the machine of the
> person
> >>>>> rolling the release so I wasn't sure it made sense as a JIRA, but
> from
> >>>>> chatting with Josh it sounds like that part might be on of the
> Jenkins
> >>>>> workers - is there a fixed one that is used?
> >>>>>
> >>>>> Regardless I'll file a JIRA for this when I get back in front of my
> >>>>> desktop (~1 hour or so).
> >>>>>
> >>>>> On Tue, Apr 4, 2017 at 2:35 PM Michael Armbrust
> >>>>>  wrote:
> >>>>>>
> >>>>>> Thanks for the comments everyone.  This vote fails.  Here's how I
> >>>>>> think we should proceed:
> >>>>>>  - [SPARK-20197] - SparkR CRAN - appears to be resolved
> >>>>>>  - [SPARK-] - Python packaging - Holden, please file a JIRA and
> >>>>>> report if this is a regression and if there is an easy fix that we
> should
> >>>>>> wait for.
> >>>>>>
> >>>>>> For all the other test failures, please take the time to look
> through
> >>>>>> JIRA and open an issue if one does not already exist so that we can
> triage
> >>>>>> if these are just environmental issues.  If I don't hear any
> objections I'm
> >>>>>> going to go ahead with RC3 tomorrow.
> >>>>>>
> >>>>>> On Sun, Apr 2, 2017 at 1:16 PM, Felix Cheung
> >>>>>>  wrote:
> >>>>>>>
> >>>>>>> -1
> >>>>>>> sorry, found an issue with SparkR CRAN check.
> >>>>>>> Opened SPARK-20197 and working on fix.
> >>>>>>>
> >>>>>>> 
> >>>>>>> From: holden.ka...@gmail.com  on behalf of
> >>>>>>> Holden Karau 
> >>>>>>> Sent: Friday, March 31, 2017 6:25:20 PM
> >>>>>>> To: Xiao Li
> >>>>>>> Cc: Michael Armbrust; dev@spark.apache.org
> >>>>>>> Subject: Re: [VOTE] Apache Spark 2.1.1 (RC2)
> >>>>>>>
> >>>>>>> -1 (non-binding)

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-14 Thread Holden Karau
Sure, let me dig into it :)

On Fri, Apr 14, 2017 at 4:21 PM, Michael Armbrust 
wrote:

> Have time to figure out why the doc build failed?
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%
> 20Release/job/spark-release-docs/60/console
>
> On Thu, Apr 13, 2017 at 9:39 PM, Holden Karau 
> wrote:
>
>> If it would help I'd be more than happy to look at kicking off the
>> packaging for RC3 since I'v been poking around in Jenkins a bit (for 
>> SPARK-20216
>> & friends) (I'd still probably need some guidance from a previous release
>> coordinator so I understand if that's not actually faster).
>>
>> On Mon, Apr 10, 2017 at 6:39 PM, DB Tsai  wrote:
>>
>>> I backported the fix into both branch-2.1 and branch-2.0. Thanks.
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> --
>>> Web: https://www.dbtsai.com
>>> PGP Key ID: 0x5CED8B896A6BDFA0
>>>
>>>
>>> On Mon, Apr 10, 2017 at 4:20 PM, Ryan Blue  wrote:
>>> > DB,
>>> >
>>> > This vote already failed and there isn't a RC3 vote yet. If you
>>> backport the
>>> > changes to branch-2.1 they will make it into the next RC.
>>> >
>>> > rb
>>> >
>>> > On Mon, Apr 10, 2017 at 3:55 PM, DB Tsai  wrote:
>>> >>
>>> >> -1
>>> >>
>>> >> I think that back-porting SPARK-20270 and SPARK-18555 are very
>>> important
>>> >> since it's a critical bug that na.fill will mess up the data in Long
>>> even
>>> >> the data isn't null.
>>> >>
>>> >> Thanks.
>>> >>
>>> >>
>>> >> Sincerely,
>>> >>
>>> >> DB Tsai
>>> >> --
>>> >> Web: https://www.dbtsai.com
>>> >> PGP Key ID: 0x5CED8B896A6BDFA0
>>> >>
>>> >> On Wed, Apr 5, 2017 at 11:12 AM, Holden Karau 
>>> >> wrote:
>>> >>>
>>> >>> Following up, the issues with missing pypandoc/pandoc on the
>>> packaging
>>> >>> machine has been resolved.
>>> >>>
>>> >>> On Tue, Apr 4, 2017 at 3:54 PM, Holden Karau 
>>> >>> wrote:
>>> >>>>
>>> >>>> See SPARK-20216, if Michael can let me know which machine is being
>>> used
>>> >>>> for packaging I can see if I can install pandoc on it (should be
>>> simple but
>>> >>>> I know the Jenkins cluster is a bit on the older side).
>>> >>>>
>>> >>>> On Tue, Apr 4, 2017 at 3:06 PM, Holden Karau 
>>> >>>> wrote:
>>> >>>>>
>>> >>>>> So the fix is installing pandoc on whichever machine is used for
>>> >>>>> packaging. I thought that was generally done on the machine of the
>>> person
>>> >>>>> rolling the release so I wasn't sure it made sense as a JIRA, but
>>> from
>>> >>>>> chatting with Josh it sounds like that part might be on of the
>>> Jenkins
>>> >>>>> workers - is there a fixed one that is used?
>>> >>>>>
>>> >>>>> Regardless I'll file a JIRA for this when I get back in front of my
>>> >>>>> desktop (~1 hour or so).
>>> >>>>>
>>> >>>>> On Tue, Apr 4, 2017 at 2:35 PM Michael Armbrust
>>> >>>>>  wrote:
>>> >>>>>>
>>> >>>>>> Thanks for the comments everyone.  This vote fails.  Here's how I
>>> >>>>>> think we should proceed:
>>> >>>>>>  - [SPARK-20197] - SparkR CRAN - appears to be resolved
>>> >>>>>>  - [SPARK-] - Python packaging - Holden, please file a JIRA
>>> and
>>> >>>>>> report if this is a regression and if there is an easy fix that
>>> we should
>>> >>>>>> wait for.
>>> >>>>>>
>>> >>>>>> For all the other test failures, please take the time to look
>>> through
>>> >>>>>> JIRA and open an issue if one does not already exist so that we
>>> can triage
>>> >>>>>> if these

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-14 Thread Holden Karau
At first glance the error seems similar to one Pedro Rodriguez ran into
during 2.0, so I'm looping Pedro in if they happen to have any insight into
what was the cause last time.

On Fri, Apr 14, 2017 at 4:40 PM, Holden Karau  wrote:

> Sure, let me dig into it :)
>
> On Fri, Apr 14, 2017 at 4:21 PM, Michael Armbrust 
> wrote:
>
>> Have time to figure out why the doc build failed?
>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/
>> job/spark-release-docs/60/console
>>
>> On Thu, Apr 13, 2017 at 9:39 PM, Holden Karau 
>> wrote:
>>
>>> If it would help I'd be more than happy to look at kicking off the
>>> packaging for RC3 since I'v been poking around in Jenkins a bit (for 
>>> SPARK-20216
>>> & friends) (I'd still probably need some guidance from a previous release
>>> coordinator so I understand if that's not actually faster).
>>>
>>> On Mon, Apr 10, 2017 at 6:39 PM, DB Tsai  wrote:
>>>
>>>> I backported the fix into both branch-2.1 and branch-2.0. Thanks.
>>>>
>>>> Sincerely,
>>>>
>>>> DB Tsai
>>>> --
>>>> Web: https://www.dbtsai.com
>>>> PGP Key ID: 0x5CED8B896A6BDFA0
>>>>
>>>>
>>>> On Mon, Apr 10, 2017 at 4:20 PM, Ryan Blue  wrote:
>>>> > DB,
>>>> >
>>>> > This vote already failed and there isn't a RC3 vote yet. If you
>>>> backport the
>>>> > changes to branch-2.1 they will make it into the next RC.
>>>> >
>>>> > rb
>>>> >
>>>> > On Mon, Apr 10, 2017 at 3:55 PM, DB Tsai  wrote:
>>>> >>
>>>> >> -1
>>>> >>
>>>> >> I think that back-porting SPARK-20270 and SPARK-18555 are very
>>>> important
>>>> >> since it's a critical bug that na.fill will mess up the data in Long
>>>> even
>>>> >> the data isn't null.
>>>> >>
>>>> >> Thanks.
>>>> >>
>>>> >>
>>>> >> Sincerely,
>>>> >>
>>>> >> DB Tsai
>>>> >> ----------
>>>> >> Web: https://www.dbtsai.com
>>>> >> PGP Key ID: 0x5CED8B896A6BDFA0
>>>> >>
>>>> >> On Wed, Apr 5, 2017 at 11:12 AM, Holden Karau 
>>>> >> wrote:
>>>> >>>
>>>> >>> Following up, the issues with missing pypandoc/pandoc on the
>>>> packaging
>>>> >>> machine has been resolved.
>>>> >>>
>>>> >>> On Tue, Apr 4, 2017 at 3:54 PM, Holden Karau 
>>>> >>> wrote:
>>>> >>>>
>>>> >>>> See SPARK-20216, if Michael can let me know which machine is being
>>>> used
>>>> >>>> for packaging I can see if I can install pandoc on it (should be
>>>> simple but
>>>> >>>> I know the Jenkins cluster is a bit on the older side).
>>>> >>>>
>>>> >>>> On Tue, Apr 4, 2017 at 3:06 PM, Holden Karau >>> >
>>>> >>>> wrote:
>>>> >>>>>
>>>> >>>>> So the fix is installing pandoc on whichever machine is used for
>>>> >>>>> packaging. I thought that was generally done on the machine of
>>>> the person
>>>> >>>>> rolling the release so I wasn't sure it made sense as a JIRA, but
>>>> from
>>>> >>>>> chatting with Josh it sounds like that part might be on of the
>>>> Jenkins
>>>> >>>>> workers - is there a fixed one that is used?
>>>> >>>>>
>>>> >>>>> Regardless I'll file a JIRA for this when I get back in front of
>>>> my
>>>> >>>>> desktop (~1 hour or so).
>>>> >>>>>
>>>> >>>>> On Tue, Apr 4, 2017 at 2:35 PM Michael Armbrust
>>>> >>>>>  wrote:
>>>> >>>>>>
>>>> >>>>>> Thanks for the comments everyone.  This vote fails.  Here's how I
>>>> >>>>>> think we should proceed:
>>>> >>>>>>  - [SPARK-20197] - S

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-14 Thread Holden Karau
Ok and with a bit more digging between RC2 and RC3 we apparently switched
which JVM we are building the docs with.

The relevant side by side diff of the build logs (
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/job/spark-release-docs/60/consoleFull

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/job/spark-release-docs/59/consoleFull
):

Build 60 (v2.1.1-rc3):
  HEAD is now at 2ed19cf... Preparing Spark release v2.1.1-rc3
  Checked out Spark git hash 2ed19cf
  Building Spark docs
  Configuration file: /home/jenkins/workspace/spark-release-doc
  Moving to project root and building API docs.
  Running 'build/sbt -Pkinesis-asl clean compile unidoc' from /
  Using /usr/java/jdk1.8.0_60 as default JAVA_HOME.
  Note, this will be overridden by -java-home if it is set.

Build 59 (v2.1.1-rc2):
  HEAD is now at 02b165d... Preparing Spark release v2.1.1-rc2
  Checked out Spark git hash 02b165d
  Building Spark docs
  Configuration file: /home/jenkins/workspace/spark-release-doc
  Moving to project root and building API docs.
  Running 'build/sbt -Pkinesis-asl clean compile unidoc' from /
  Using /usr/java/jdk1.7.0_79 as default JAVA_HOME.
  Note, this will be overridden by -java-home if it is set.

There have been some known issues with building the docs with JDK8 and I
believe those fixes are in mainline, and we could cherry pick these changes
in -- but I think it might be more reasonable to just build the 2.1 docs
with JDK7.

What do people think?


On Fri, Apr 14, 2017 at 4:53 PM, Holden Karau  wrote:

> At first glance the error seems similar to one Pedro Rodriguez ran into
> during 2.0, so I'm looping Pedro in if they happen to have any insight into
> what was the cause last time.
>
> On Fri, Apr 14, 2017 at 4:40 PM, Holden Karau 
> wrote:
>
>> Sure, let me dig into it :)
>>
>> On Fri, Apr 14, 2017 at 4:21 PM, Michael Armbrust > > wrote:
>>
>>> Have time to figure out why the doc build failed?
>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/
>>> job/spark-release-docs/60/console
>>>
>>> On Thu, Apr 13, 2017 at 9:39 PM, Holden Karau 
>>> wrote:
>>>
>>>> If it would help I'd be more than happy to look at kicking off the
>>>> packaging for RC3 since I'v been poking around in Jenkins a bit (for 
>>>> SPARK-20216
>>>> & friends) (I'd still probably need some guidance from a previous release
>>>> coordinator so I understand if that's not actually faster).
>>>>
>>>> On Mon, Apr 10, 2017 at 6:39 PM, DB Tsai  wrote:
>>>>
>>>>> I backported the fix into both branch-2.1 and branch-2.0. Thanks.
>>>>>
>>>>> Sincerely,
>>>>>
>>>>> DB Tsai
>>>>> --
>>>>> Web: https://www.dbtsai.com
>>>>> PGP Key ID: 0x5CED8B896A6BDFA0
>>>>>
>>>>>
>>>>> On Mon, Apr 10, 2017 at 4:20 PM, Ryan Blue  wrote:
>>>>> > DB,
>>>>> >
>>>>> > This vote already failed and there isn't a RC3 vote yet. If you
>>>>> backport the
>>>>> > changes to branch-2.1 they will make it into the next RC.
>>>>> >
>>>>> > rb
>>>>> >
>>>>> > On Mon, Apr 10, 2017 at 3:55 PM, DB Tsai  wrote:
>>>>> >>
>>>>> >> -1
>>>>> >>
>>>>> >> I think that back-porting SPARK-20270 and SPARK-18555 are very
>>>>> important
>>>>> >> since it's a critical bug that na.fill will mess up the data in
>>>>> Long even
>>>>> >> the data isn't null.
>>>>> >>
>>>>> >> Thanks.
>>>>> >>
>>>>> >>
>>>>> >> Sincerely,
>>>>> >>
>>>>> >> DB Tsai
>>>>> >> --
>>>>> >> Web: https://www.dbtsai.com
>>>>> >> PGP Key ID: 0x5CED8B896A6BDFA0
>>>>> >>
>>>>> >> On Wed, Apr 5, 2017 at 11:12 AM, Holden Karau >>>> >
>>>>> >> wrote:
>>>>> >>>
>>>>> >>> Following up, the issues with missing pypandoc/pandoc on the
>>>>> packaging
>>>>> >>> machine has been resolved.
>>>>> >>>
>>>>> >>

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-17 Thread Holden Karau
I think this is Java 8 v Java 7, if you look at the previous build you see
a lot of the same missing classes but tagged as "warning" rather than
"error". I think, all in all, it makes sense to stick to JDK7 to build the
legacy builds, which have been built with it previously.

If there is consensus on that I'm happy to update the env variables for the
RC3 build to set a JDK7 JAVA_HOME (but I'd want to double check with
someone about which jobs need to be updated to make sure I don't miss any).

On Sat, Apr 15, 2017 at 2:33 AM, Sean Owen  wrote:

> I don't think this is an example of Java 8 javadoc being more strict; it
> is not finding classes, not complaining about syntax.
> (Hyukjin cleaned up all of the javadoc 8 errors in master, and they're
> different and much more extensive!)
>
> It wouldn't necessarily break anything to build with Java 8 because it'll
> still emit Java 7 bytecode, etc.
>
> That said, it may very well be that it is somehow due to Java 7 vs 8, and
> is probably best to stick to 1.7 in the release build.
>
> On Sat, Apr 15, 2017 at 1:38 AM Ryan Blue 
> wrote:
>
>> I've hit this before, where Javadoc for 1.8 is much more strict than 1.7.
>>
>> I think we should definitely use Java 1.7 for the release if we used it
>> for the previous releases in the 2.1 line. We don't want to break java 1.7
>> users in a patch release.
>>
>> rb
>>
>> On Fri, Apr 14, 2017 at 5:21 PM, Holden Karau 
>> wrote:
>>
>>> Ok and with a bit more digging between RC2 and RC3 we apparently
>>> switched which JVM we are building the docs with.
>>>
>>> The relevant side by side diff of the build logs (
>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%
>>> 20Release/job/spark-release-docs/60/consoleFull https://
>>> amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/
>>> job/spark-release-docs/59/consoleFull ):
>>>
>>> Build 60 (v2.1.1-rc3):
>>>   HEAD is now at 2ed19cf... Preparing Spark release v2.1.1-rc3
>>>   Checked out Spark git hash 2ed19cf
>>>   Building Spark docs
>>>   Configuration file: /home/jenkins/workspace/spark-release-doc
>>>   Moving to project root and building API docs.
>>>   Running 'build/sbt -Pkinesis-asl clean compile unidoc' from /
>>>   Using /usr/java/jdk1.8.0_60 as default JAVA_HOME.
>>>   Note, this will be overridden by -java-home if it is set.
>>>
>>> Build 59 (v2.1.1-rc2):
>>>   HEAD is now at 02b165d... Preparing Spark release v2.1.1-rc2
>>>   Checked out Spark git hash 02b165d
>>>   Building Spark docs
>>>   Configuration file: /home/jenkins/workspace/spark-release-doc
>>>   Moving to project root and building API docs.
>>>   Running 'build/sbt -Pkinesis-asl clean compile unidoc' from /
>>>   Using /usr/java/jdk1.7.0_79 as default JAVA_HOME.
>>>   Note, this will be overridden by -java-home if it is set.
>>>
>>> There have been some known issues with building the docs with JDK8 and I
>>> believe those fixes are in mainline, and we could cherry pick these changes
>>> in -- but I think it might be more reasonable to just build the 2.1 docs
>>> with JDK7.
>>>
>>> What do people think?
>>>
>>>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-23 Thread Holden Karau
What's the regression this fixed in 2.1 from 2.0?

On Fri, Apr 21, 2017 at 7:45 PM, Wenchen Fan  wrote:

> IIRC, the new "spark.sql.hive.caseSensitiveInferenceMode" stuff will scan
> all table files only once, and write back the inferred schema to the
> metastore so that we don't need to do the schema inference again.
>
> So technically this will introduce a performance regression for the first
> query, but compared to branch-2.0, it's not a performance regression. And
> this patch fixed a regression in branch-2.1 for queries that can run in
> branch-2.0. Personally, I think we should keep INFER_AND_SAVE as the default
> mode.
>
> + [Eric], what do you think?
>
> On Sat, Apr 22, 2017 at 1:37 AM, Michael Armbrust 
> wrote:
>
>> Thanks for pointing this out, Michael.  Based on the conversation on the
>> PR 
>> this seems like a risky change to include in a release branch with a
>> default other than NEVER_INFER.
>>
>> +Wenchen?  What do you think?
>>
>> On Thu, Apr 20, 2017 at 4:14 PM, Michael Allman 
>> wrote:
>>
>>> We've identified the cause of the change in behavior. It is related to
>>> the SQL conf key "spark.sql.hive.caseSensitiveInferenceMode". This key
>>> and its related functionality was absent from our previous build. The
>>> default setting in the current build was causing Spark to attempt to scan
>>> all table files during query analysis. Changing this setting to NEVER_INFER
>>> disabled this operation and resolved the issue we had.
>>>
>>> Michael
>>>
>>>
>>> On Apr 20, 2017, at 3:42 PM, Michael Allman 
>>> wrote:
>>>
>>> I want to caution that in testing a build from this morning's branch-2.1
>>> we found that Hive partition pruning was not working. We found that Spark
>>> SQL was fetching all Hive table partitions for a very simple query whereas
>>> in a build from several weeks ago it was fetching only the required
>>> partitions. I cannot currently think of a reason for the regression outside
>>> of some difference between branch-2.1 from our previous build and
>>> branch-2.1 from this morning.
>>>
>>> That's all I know right now. We are actively investigating to find the
>>> root cause of this problem, and specifically whether this is a problem in
>>> the Spark codebase or not. I will report back when I have an answer to that
>>> question.
>>>
>>> Michael
>>>
>>>
>>> On Apr 18, 2017, at 11:59 AM, Michael Armbrust 
>>> wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.1.1. The vote is open until Fri, April 21st, 2017 at 13:00
>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.1.1
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v2.1.1-rc3
>>>  (2ed19cff2f6ab79
>>> a718526e5d16633412d8c4dd4)
>>>
>>> List of JIRA tickets resolved can be found with this filter
>>> 
>>> .
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1230/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> *What should happen to JIRA tickets still targeting 2.1.1?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.1.2 or 2.2.0.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.1.0.
>>>
>>> *What happened to RC1?*
>>>
>>> There were issues with the release packaging and as a result was skipped.
>>>
>>>
>>>
>>>
>>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-24 Thread Holden Karau
It

On Mon, Apr 24, 2017 at 10:33 AM, Michael Allman 
wrote:

> The trouble we ran into is that this upgrade was blocking access to our
> tables, and we didn't know why. This sounds like a kind of migration
> operation, but it was not apparent that this was the case. It took an
> expert examining a stack trace and source code to figure this out. Would a
> more naive end user be able to debug this issue? Maybe we're an unusual
> case, but our particular experience was pretty bad. I have my doubts that
> the schema inference on our largest tables would ever complete without
> throwing some kind of timeout (which we were in fact receiving) or the end
> user just giving up and killing our job. We ended up doing a rollback while
> we investigated the source of the issue. In our case, NEVER_INFER is
> clearly the best configuration. We're going to add that to our default
> configuration files.
>
> My expectation is that a minor point release is a pretty safe bug fix
> release. We were a bit hasty in not doing better due diligence pre-upgrade.
>
> One suggestion the Spark team might consider is releasing 2.1.1 with
> NEVER_INFER and 2.2.0 with INFER_AND_SAVE. Clearly some kind of up-front
> migration notes would help in identifying this new behavior in 2.2.
>
> Thanks,
>
> Michael
>
>
> On Apr 24, 2017, at 2:09 AM, Wenchen Fan  wrote:
>
> see https://issues.apache.org/jira/browse/SPARK-19611
>
> On Mon, Apr 24, 2017 at 2:22 PM, Holden Karau 
> wrote:
>
>> Whats the regression this fixed in 2.1 from 2.0?
>>
>> On Fri, Apr 21, 2017 at 7:45 PM, Wenchen Fan 
>> wrote:
>>
>>> IIRC, the new "spark.sql.hive.caseSensitiveInferenceMode" stuff will
>>> only scan all table files only once, and write back the inferred schema to
>>> metastore so that we don't need to do the schema inference again.
>>>
>>> So technically this will introduce a performance regression for the
>>> first query, but compared to branch-2.0, it's not performance regression.
>>> And this patch fixed a regression in branch-2.1, which can run in
>>> branch-2.0. Personally, I think we should keep INFER_AND_SAVE as the
>>> default mode.
>>>
>>> + [Eric], what do you think?
>>>
>>> On Sat, Apr 22, 2017 at 1:37 AM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>>> Thanks for pointing this out, Michael.  Based on the conversation on
>>>> the PR
>>>> <https://github.com/apache/spark/pull/16944#issuecomment-285529275>
>>>> this seems like a risky change to include in a release branch with a
>>>> default other than NEVER_INFER.
>>>>
>>>> +Wenchen?  What do you think?
>>>>
>>>> On Thu, Apr 20, 2017 at 4:14 PM, Michael Allman 
>>>> wrote:
>>>>
>>>>> We've identified the cause of the change in behavior. It is related to
>>>>> the SQL conf key "spark.sql.hive.caseSensitiveInferenceMode". This
>>>>> key and its related functionality was absent from our previous build. The
>>>>> default setting in the current build was causing Spark to attempt to scan
>>>>> all table files during query analysis. Changing this setting to 
>>>>> NEVER_INFER
>>>>> disabled this operation and resolved the issue we had.
>>>>>
>>>>> Michael
>>>>>
>>>>>
>>>>> On Apr 20, 2017, at 3:42 PM, Michael Allman 
>>>>> wrote:
>>>>>
>>>>> I want to caution that in testing a build from this morning's
>>>>> branch-2.1 we found that Hive partition pruning was not working. We found
>>>>> that Spark SQL was fetching all Hive table partitions for a very simple
>>>>> query whereas in a build from several weeks ago it was fetching only the
>>>>> required partitions. I cannot currently think of a reason for the
>>>>> regression outside of some difference between branch-2.1 from our previous
>>>>> build and branch-2.1 from this morning.
>>>>>
>>>>> That's all I know right now. We are actively investigating to find the
>>>>> root cause of this problem, and specifically whether this is a problem in
>>>>> the Spark codebase or not. I will report back when I have an answer to 
>>>>> that
>>>>> question.
>>>>>
>>>>> Michael
>>>>>
>>>>>
>>>>> On Apr 18, 2017, at 11:59 AM, Mi

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-24 Thread Holden Karau
Whoops, sorry finger slipped on that last message.
It sounds like whatever we do is going to break some existing users (either
those relying on case-sensitive table schemas or those hit by the unexpected scan).

Personally I agree with Michael Allman on this, I believe we should
use NEVER_INFER for 2.1.1.
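
For anyone who wants to pin this explicitly, a minimal sketch of what that
looks like (the key and values are the ones discussed in this thread; whether
you put it in spark-defaults.conf or set it at runtime is up to you):

  // Sketch only: fall back to the metastore schema instead of inferring one.
  // Valid modes are INFER_AND_SAVE, INFER_ONLY and NEVER_INFER.
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .enableHiveSupport()
    .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
    .getOrCreate()

  // Being a SQL conf, it should also be settable on a running session:
  spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")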

On Mon, Apr 24, 2017 at 11:01 AM, Holden Karau  wrote:

> It
>
> On Mon, Apr 24, 2017 at 10:33 AM, Michael Allman 
> wrote:
>
>> The trouble we ran into is that this upgrade was blocking access to our
>> tables, and we didn't know why. This sounds like a kind of migration
>> operation, but it was not apparent that this was the case. It took an
>> expert examining a stack trace and source code to figure this out. Would a
>> more naive end user be able to debug this issue? Maybe we're an unusual
>> case, but our particular experience was pretty bad. I have my doubts that
>> the schema inference on our largest tables would ever complete without
>> throwing some kind of timeout (which we were in fact receiving) or the end
>> user just giving up and killing our job. We ended up doing a rollback while
>> we investigated the source of the issue. In our case, INFER_NEVER is
>> clearly the best configuration. We're going to add that to our default
>> configuration files.
>>
>> My expectation is that a minor point release is a pretty safe bug fix
>> release. We were a bit hasty in not doing better due diligence pre-upgrade.
>>
>> One suggestion the Spark team might consider is releasing 2.1.1 with
>> INVER_NEVER and 2.2.0 with INFER_AND_SAVE. Clearly some kind of up-front
>> migration notes would help in identifying this new behavior in 2.2.
>>
>> Thanks,
>>
>> Michael
>>
>>
>> On Apr 24, 2017, at 2:09 AM, Wenchen Fan  wrote:
>>
>> see https://issues.apache.org/jira/browse/SPARK-19611
>>
>> On Mon, Apr 24, 2017 at 2:22 PM, Holden Karau 
>> wrote:
>>
>>> Whats the regression this fixed in 2.1 from 2.0?
>>>
>>> On Fri, Apr 21, 2017 at 7:45 PM, Wenchen Fan 
>>> wrote:
>>>
>>>> IIRC, the new "spark.sql.hive.caseSensitiveInferenceMode" stuff will
>>>> only scan all table files only once, and write back the inferred schema to
>>>> metastore so that we don't need to do the schema inference again.
>>>>
>>>> So technically this will introduce a performance regression for the
>>>> first query, but compared to branch-2.0, it's not performance regression.
>>>> And this patch fixed a regression in branch-2.1, which can run in
>>>> branch-2.0. Personally, I think we should keep INFER_AND_SAVE as the
>>>> default mode.
>>>>
>>>> + [Eric], what do you think?
>>>>
>>>> On Sat, Apr 22, 2017 at 1:37 AM, Michael Armbrust <
>>>> mich...@databricks.com> wrote:
>>>>
>>>>> Thanks for pointing this out, Michael.  Based on the conversation on
>>>>> the PR
>>>>> <https://github.com/apache/spark/pull/16944#issuecomment-285529275>
>>>>> this seems like a risky change to include in a release branch with a
>>>>> default other than NEVER_INFER.
>>>>>
>>>>> +Wenchen?  What do you think?
>>>>>
>>>>> On Thu, Apr 20, 2017 at 4:14 PM, Michael Allman 
>>>>> wrote:
>>>>>
>>>>>> We've identified the cause of the change in behavior. It is related
>>>>>> to the SQL conf key "spark.sql.hive.caseSensitiveInferenceMode".
>>>>>> This key and its related functionality was absent from our previous 
>>>>>> build.
>>>>>> The default setting in the current build was causing Spark to attempt to
>>>>>> scan all table files during query analysis. Changing this setting
>>>>>> to NEVER_INFER disabled this operation and resolved the issue we had.
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>>
>>>>>> On Apr 20, 2017, at 3:42 PM, Michael Allman 
>>>>>> wrote:
>>>>>>
>>>>>> I want to caution that in testing a build from this morning's
>>>>>> branch-2.1 we found that Hive partition pruning was not working. We found
>>>>>> that Spark SQL was fetching all Hive table partitions for a very simple
>>>>>> query whereas in a build from several weeks ago it was fetching only the
>>>>>> required partitions. I cannot curren

Re: [VOTE] Apache Spark 2.1.1 (RC4)

2017-04-27 Thread Holden Karau
+1 (non-binding) PySpark packaging issue from the earlier RC seems to have
been fixed.

On Thu, Apr 27, 2017 at 1:23 PM, Dong Joon Hyun 
wrote:

> +1
>
> I’ve got the same result (Scala/R test) on JDK 1.8.0_131 at this time.
>
> Bests,
> Dongjoon.
>
> From: Reynold Xin 
> Date: Thursday, April 27, 2017 at 1:06 PM
> To: Michael Armbrust , "dev@spark.apache.org" <
> dev@spark.apache.org>
> Subject: Re: [VOTE] Apache Spark 2.1.1 (RC4)
>
> +1
> On Thu, Apr 27, 2017 at 11:59 AM Michael Armbrust 
> wrote:
>
>> I'll also +1
>>
>> On Thu, Apr 27, 2017 at 4:20 AM, Sean Owen  wrote:
>>
>>> +1 , same result as with the last RC. All checks out for me.
>>>
>>> On Thu, Apr 27, 2017 at 1:29 AM Michael Armbrust 
>>> wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
 version 2.1.1. The vote is open until Sat, April 29th, 2017 at 18:00
 PST and passes if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 2.1.1
 [ ] -1 Do not release this package because ...


 To learn more about Apache Spark, please see http://spark.apache.org/

 The tag to be voted on is v2.1.1-rc4
  (267aca5bd504230
 3a718d10635bc0d1a1596853f)

 List of JIRA tickets resolved can be found with this filter
 
 .

 The release files, including signatures, digests, etc. can be found at:
 http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc4-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1232/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc4-docs/


 *FAQ*

 *How can I help test this release?*

 If you are a Spark user, you can help us test this release by taking an
 existing Spark workload and running on this release candidate, then
 reporting any regressions.

 *What should happen to JIRA tickets still targeting 2.1.1?*

 Committers should look at those and triage. Extremely important bug
 fixes, documentation, and API tweaks that impact compatibility should be
 worked on immediately. Everything else please retarget to 2.1.2 or 2.2.0.

 *But my bug isn't fixed!??!*

 In order to make timely releases, we will typically not hold the
 release unless the bug in question is a regression from 2.1.0.

 *What happened to RC1?*

 There were issues with the release packaging and as a result was
 skipped.

>>>
>>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Uploading PySpark 2.1.1 to PyPi

2017-05-08 Thread Holden Karau
Just a heads up I'm in the process of trying to upload the latest PySpark
to PyPi (we are blocked on a ticket with the PyPi folks around file size
but I'll follow up with them).

Relatedly PySpark is available in Conda-forge, currently 2.1.0 and there is
a PR to update to 2.1.1 in process.

Happy Python Spark adventures every one :)
-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Uploading PySpark 2.1.1 to PyPi

2017-05-08 Thread Holden Karau
So I have a PR to add this to the release process documentation - I'm
waiting on the necessary approvals from PyPi folks before I merge that
in case anything changes as a result of the discussion (like uploading to
the legacy host or something). As for conda-forge, it's not something we
need to do, but I'll add a note about pinging them when we make a new
release so their users can keep up to date easily. The parent JIRA for PyPi
related tasks is SPARK-18267 :)


On Mon, May 8, 2017 at 6:22 PM cloud0fan  wrote:

> Hi Holden,
>
> Thanks for working on it! Do we have a JIRA ticket to track this? We should
> make it part of the release process in all the following Spark releases,
> and
> it will be great if we have a JIRA ticket to record the detailed steps of
> doing this and even automate it.
>
> Thanks,
> Wenchen
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Uploading-PySpark-
> 2-1-1-to-PyPi-tp21531p21532.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Uploading PySpark 2.1.1 to PyPi

2017-05-23 Thread Holden Karau
An account already exists, the PMC has the info for it. I think we will
need to wait for the 2.2 artifacts to do the actual PyPI upload because of
the local version string in 2.1.1, but rest assured this isn't something
I've lost track of.

On Wed, May 24, 2017 at 12:11 AM Xiao Li  wrote:

> Hi, Holden,
>
> Based on the PR, https://github.com/pypa/packaging-problems/issues/90 ,
> the limit has been increased to 250MB.
>
> Just wondering if we can publish PySpark to PyPI now? Have you created the
> account?
>
> Thanks,
>
> Xiao Li
>
>
>
> 2017-05-12 11:35 GMT-07:00 Sameer Agarwal :
>
>> Holden,
>>
>> Thanks again for pushing this forward! Out of curiosity, did we get an
>> approval from the PyPi folks?
>>
>> Regards,
>> Sameer
>>
>> On Mon, May 8, 2017 at 11:44 PM, Holden Karau 
>> wrote:
>>
>>> So I have a PR to add this to the release process documentation - I'm
>>> waiting on the necessary approvals from PyPi folks before I merge that
>>> incase anything changes as a result of the discussion (like uploading to
>>> the legacy host or something). As for conda-forge, it's not something we
>>> need to do, but I'll add a note about pinging them when we make a new
>>> release so their users can keep up to date easily. The parent JIRA for PyPi
>>> related tasks is SPARK-18267 :)
>>>
>>>
>>> On Mon, May 8, 2017 at 6:22 PM cloud0fan  wrote:
>>>
>>>> Hi Holden,
>>>>
>>>> Thanks for working on it! Do we have a JIRA ticket to track this? We
>>>> should
>>>> make it part of the release process in all the following Spark
>>>> releases, and
>>>> it will be great if we have a JIRA ticket to record the detailed steps
>>>> of
>>>> doing this and even automate it.
>>>>
>>>> Thanks,
>>>> Wenchen
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/Uploading-PySpark-2-1-1-to-PyPi-tp21531p21532.html
>>>> Sent from the Apache Spark Developers List mailing list archive at
>>>> Nabble.com.
>>>>
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>
>>
>>
>> --
>> Sameer Agarwal
>> Software Engineer | Databricks Inc.
>> http://cs.berkeley.edu/~sameerag
>>
>
> --
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-06 Thread Holden Karau
+1 pip install to local virtual env works, no local version string (was
blocking the pypi upload).


On Tue, Jun 6, 2017 at 8:03 AM, Felix Cheung 
wrote:

> All tasks on the R QA umbrella are completed
> SPARK-20512
>
> We can close this.
>
>
>
> _
> From: Sean Owen 
> Sent: Tuesday, June 6, 2017 1:16 AM
> Subject: Re: [VOTE] Apache Spark 2.2.0 (RC4)
> To: Michael Armbrust 
> Cc: 
>
>
>
> On Tue, Jun 6, 2017 at 1:06 AM Michael Armbrust 
> wrote:
>
>> Regarding the readiness of this and previous RCs.  I did cut RC1 & RC2
>> knowing that they were unlikely to pass.  That said, I still think these
>> early RCs are valuable. I know several users that wanted to test new
>> features in 2.2 that have used them.  Now, if we would prefer to call them
>> preview or RC0 or something I'd be okay with that as well.
>>
>
> They are valuable, I only suggest it's better to note explicitly when
> there are blockers or must-do tasks that will fail the RC. It makes a big
> difference to whether one would like to +1.
>
> I meant more than just calling them something different. An early RC could
> be voted as a released 'preview' artifact, at the start of the notional QA
> period, with a lower bar to passing, and releasable with known issues. This
> encourages more testing. It also resolves the controversy about whether
> it's OK to include an RC in a product (separate thread).
>
>
> Regarding doc updates, I don't think it is a requirement that they be
>> voted on as part of the release.  Even if they are something version
>> specific.  I think we have regularly updated the website with documentation
>> that was merged after the release.
>>
>
> They're part of the source release too, as markdown, and should be voted
> on. I've never understood otherwise. Have we actually released docs and
> then later changed them, so that they don't match the release? I don't
> recall that, but I do recall updating the non-version-specific website.
>
> Aside from the oddity of having docs generated from x.y source not match
> docs published for x.y, you want the same protections for doc source that
> the project distributes as anything else. It's not just correctness, but
> liability. The hypothetical is always that someone included copyrighted
> text or something without permission and now the project can't rely on the
> argument that it made a good-faith effort to review what it released on the
> site. Someone becomes personally liable.
>
> These are pretty technical reasons though. More practically, what's the
> hurry to release if docs aren't done (_if_ they're not done)? It's being
> presented as normal practice, but seems quite exceptional.
>
>
>
>> I personally don't think the QA umbrella JIRAs are particularly
>> effective, but I also wouldn't ban their use if others think they are.
>> However, I do think that real QA needs an RC to test, so I think it is fine
>> that there is still outstanding QA to be done when an RC is cut.  For
>> example, I plan to run a bunch of streaming workloads on RC4 and will vote
>> accordingly.
>>
>
> QA on RCs is great (see above). The problem is, I can't distinguish
> between a JIRA that means "we must test in general", which sounds like
> something you too would ignore, and one that means "there is specific
> functionality we have to check before a release that we haven't looked at
> yet", which is a committer waving a flag that they implicitly do not want a
> release until resolved. I wouldn't +1 a release that had a Blocker software
> defect one of us reported.
>
> I know I'm harping on this, but this is the one mechanism we do use
> consistently (Blocker JIRAs) to clearly communicate about issues vital to a
> go / no-go release decision, and I think this interferes. The rest of JIRA
> noise doesn't matter much. You can see we're already resorting to secondary
> communications as a result ("anyone have any issues that need to be fixed
> before I cut another RC?" emails) because this is kind of ignored, and
> think we're swapping out a decent mechanism for a worse one.
>
> I suspect, as you do, that there's no to-do here in which case they should
> be resolved and we're still on track for release. I'd wait on +1 until then.
>
>
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-07-05 Thread Holden Karau
+1 PySpark package pip installs into a virtualenv.

On Wed, Jul 5, 2017 at 9:56 PM, Felix Cheung 
wrote:

> +1 (non binding)
> Tested R, R package on Ubuntu and Windows, CRAN checks, manual tests with
> streaming & udf.
>
>
> _
> From: Denny Lee 
> Sent: Monday, July 3, 2017 9:30 PM
> Subject: Re: [VOTE] Apache Spark 2.2.0 (RC6)
> To: Liang-Chi Hsieh , 
>
>
>
> +1 (non-binding)
>
> On Mon, Jul 3, 2017 at 6:45 PM Liang-Chi Hsieh  wrote:
>
>> +1
>>
>>
>> Sameer Agarwal wrote
>> > +1
>> >
>> > On Mon, Jul 3, 2017 at 6:08 AM, Wenchen Fan <
>>
>> > cloud0fan@
>>
>> > > wrote:
>> >
>> >> +1
>> >>
>> >> On 3 Jul 2017, at 8:22 PM, Nick Pentreath <
>>
>> > nick.pentreath@
>>
>> > >
>> >> wrote:
>> >>
>> >> +1 (binding)
>> >>
>> >> On Mon, 3 Jul 2017 at 11:53 Yanbo Liang <
>>
>> > ybliang8@
>>
>> > > wrote:
>> >>
>> >>> +1
>> >>>
>> >>> On Mon, Jul 3, 2017 at 5:35 AM, Herman van Hövell tot Westerflier <
>> >>>
>>
>> > hvanhovell@
>>
>> >> wrote:
>> >>>
>>  +1
>> 
>>  On Sun, Jul 2, 2017 at 11:32 PM, Ricardo Almeida <
>> 
>>
>> > ricardo.almeida@
>>
>> >> wrote:
>> 
>> > +1 (non-binding)
>> >
>> > Built and tested with -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn
>> > -Phive -Phive-thriftserver -Pscala-2.11 on
>> >
>> >- macOS 10.12.5 Java 8 (build 1.8.0_131)
>> >- Ubuntu 17.04, Java 8 (OpenJDK 1.8.0_111)
>> >
>> >
>> >
>> >
>> >
>> > On 1 Jul 2017 02:45, "Michael Armbrust" <
>>
>> > michael@
>>
>> > > wrote:
>> >
>> > Please vote on releasing the following candidate as Apache Spark
>> > version 2.2.0. The vote is open until Friday, July 7th, 2017 at
>> 18:00
>> > PST and passes if a majority of at least 3 +1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Spark 2.2.0
>> > [ ] -1 Do not release this package because ...
>> >
>> >
>> > To learn more about Apache Spark, please see
>> https://spark.apache.org/
>> >
>> > The tag to be voted on is v2.2.0-rc6
>> > ;
>> > (a2c7b2133cfee7f
>> > a9abfaa2bfbfb637155466783)
>> >
>> > List of JIRA tickets resolved can be found with this filter
>> > > project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.2.0>
>> > .
>> >
>> > The release files, including signatures, digests, etc. can be found
>> > at:
>> > https://home.apache.org/~pwendell/spark-releases/spark-
>> 2.2.0-rc6-bin/
>> >
>> > Release artifacts are signed with the following key:
>> > https://people.apache.org/keys/committer/pwendell.asc
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/
>> orgapachespark-1245/
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://people.apache.org/~pwendell/spark-releases/spark-
>> > 2.2.0-rc6-docs/
>> >
>> >
>> > *FAQ*
>> >
>> > *How can I help test this release?*
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an
>> > existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > *What should happen to JIRA tickets still targeting 2.2.0?*
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility
>> should
>> > be
>> > worked on immediately. Everything else please retarget to 2.3.0 or
>> > 2.2.1.
>> >
>> > *But my bug isn't fixed!??!*
>> >
>> > In order to make timely releases, we will typically not hold the
>> > release unless the bug in question is a regression from 2.1.1.
>> >
>> >
>> >
>> 
>> 
>> >>>
>> >>
>> >
>> >
>> > --
>> > Sameer Agarwal
>> > Software Engineer | Databricks Inc.
>> > http://cs.berkeley.edu/~sameerag
>>
>>
>>
>>
>>
>> -
>> Liang-Chi Hsieh | @viirya
>> Spark Technology Center
>> http://www.spark.tc/
>> --
>> View this message in context: http://apache-spark-
>> developers-list.1001551.n3.nabble.com/VOTE-Apache-Spark-
>> 2-2-0-RC6-tp21902p21914.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Crowdsourced triage Scapegoat compiler plugin warnings

2017-07-13 Thread Holden Karau
I'm happy to help out in the places I'm more familiar with, starting
next week. I suspect things probably just fell to the wayside for some
folks during the 2.2.0 release.

On Thu, Jul 13, 2017 at 1:16 AM, Hyukjin Kwon  wrote:

> Hi all,
>
>
> Another gentle ping for help.
>
> I will probably open a JIRA and proceed with this after a couple of weeks
> if no one else picks it up, although I hope someone takes it on.
>
>
> Thanks.
>
> 2017-06-18 2:16 GMT+09:00 Sean Owen :
>
>> Looks like a whole lot of the results have been analyzed. I suspect
>> there's more than enough to act on already. I think we should wait until
>> after 2.2 is done.
>> Anybody prefer how to proceed here -- just open a JIRA to take care of a
>> batch of related types of issues and go for it?
>>
>> On Sat, Jun 17, 2017 at 4:45 PM Hyukjin Kwon  wrote:
>>
>>> Gentle ping to dev for help. I hope this effort is not abandoned.
>>>
>>>
>>> On 25 May 2017 9:41 am, "Josh Rosen"  wrote:
>>>
>>> I'm interested in using the Scapegoat
>>>  Scala compiler plugin to find
>>> potential bugs and performance problems in Spark. Scapegoat has a useful
>>> built-in set of inspections and is pretty easy to extend with custom ones.
>>> For example, I added an inspection to spot places where we call
>>> *.apply()* on a Seq which is not an IndexedSeq
>>>  in order to make it
>>> easier to spot potential O(n^2) performance bugs.
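>>>
>>> For illustration, roughly the shape of code that inspection flags (a
>>> made-up snippet, not something taken from the Spark codebase):
>>>
>>>   // List is a linear Seq, so xs(i) walks i elements and the loop is O(n^2).
>>>   val xs: Seq[Int] = List.tabulate(100000)(identity)
>>>   var total = 0L
>>>   for (i <- xs.indices) {
>>>     total += xs(i)
>>>   }
>>>
>>>   // Converting once to an IndexedSeq (e.g. a Vector) makes each lookup
>>>   // effectively constant time, so the same loop becomes linear.
>>>   val indexed: IndexedSeq[Int] = xs.toIndexedSeq
>>>   for (i <- indexed.indices) {
>>>     total += indexed(i)
>>>   }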
>>>
>>> There are lots of false-positives and benign warnings (as with any
>>> linter / static analyzer) so I don't think it's feasible for us to include
>>> this as a blocking step in our regular build. I am planning to build
>>> tooling to surface only new warnings so going forward this can become a
>>> useful code-review aid.
>>>
>>> The current codebase has roughly 1700 warnings that I would like to
>>> triage and categorize as false-positives or real bugs. I can't do this
>>> alone, so here's how you can help:
>>>
>>>- Visit the Google Docs spreadsheet at https://docs.google.com/spread
>>>sheets/d/1z7xNMjx7VCJLCiHOHhTth7Hh4R0F6LwcGjEwCDzrCiM/edit?
>>>usp=sharing
>>>
>>> 
>>>  and
>>>find an un-triaged warning.
>>>- In the columns at the right of the sheet, enter your name in the
>>>appropriate column to mark a warning as a false-positive or as a real bug
>>>and/or performance issue. If think a warning is a real issue then use the
>>>"comments" column for providing additional detail.
>>>- Please don't file JIRAs or PRs for individual warnings; I suspect
>>>that we'll find clusters of issues which are best fixed in a few larger 
>>> PRs
>>>vs. lots of smaller ones. Certain warnings are probably simply style 
>>> issues
>>>so we should discuss those before trying to fix them.
>>>
>>> The sheet has hidden columns capturing the Spark revision and Scapegoat
>>> revision. I can use this to programmatically update the sheet and remap
>>> lines after updating either Scapegoat (to suppress false-positives) or
>>> Spark (to incorporate fixes and surface new warnings). For those who are
>>> interested, the sheet was produced with this script:
>>> https://gist.github.com/JoshRosen/1ae12a979880d9a98988aa87d70ff2a8
>>>
>>> Depending on the results of this experiment we might want to integrate a
>>> high-signal subset of the Scapegoat warnings into our build. I'm also
>>> hoping that we'll be able to build a useful corpus of triaged warnings in
>>> order to help improve Scapegoat itself and eliminate common false-positives.
>>>
>>> Thanks and happy bug-hunting,
>>> Josh Rosen
>>>
>>>
>>>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Reparitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Holden Karau
The memory overhead is based less on the total amount of data and more on
what you end up doing with the data (e.g. if you're doing a lot of off-heap
processing or using Python you need to increase it). Honestly most people
find this number for their job "experimentally" (e.g. they try a few
different things).

On Wed, Aug 2, 2017 at 1:52 PM, Chetan Khatri 
wrote:

> Ryan,
> Thank you for reply.
>
> For 2 TB of data, what should the value of
> spark.yarn.executor.memoryOverhead be?
>
> With regards to this, I see an issue at
> https://issues.apache.org/jira/browse/SPARK-18787 -- not sure whether it
> works or not on Spark 2.0.1!
>
> Can you elaborate more on the spark.memory.fraction setting?
>
> number of partitions = 674
> Cluster: 455 GB total memory, VCores: 288, Nodes: 17
> Given / tried memory config: executor-mem = 16g, num-executor=10, executor
> cores=6, driver mem=4g
>
> spark.default.parallelism=1000
> spark.sql.shuffle.partitions=1000
> spark.yarn.executor.memoryOverhead=2048
> spark.shuffle.io.preferDirectBufs=false
>
>
>
>
>
>
>
>
>
> On Wed, Aug 2, 2017 at 10:43 PM, Ryan Blue  wrote:
>
>> Chetan,
>>
>> When you're writing to a partitioned table, you want to use a shuffle to
>> avoid the situation where each task has to write to every partition. You
>> can do that either by adding a repartition by your table's partition keys,
>> or by adding an order by with the partition keys and then columns you
>> normally use to filter when reading the table. I generally recommend the
>> second approach because it handles skew and prepares the data for more
>> efficient reads.
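>>
>> As a rough sketch of those two options (the table and column names here are
>> made up; assume a Hive table partitioned by `dt` that is commonly filtered
>> on `event_type`, and that dynamic partition inserts are enabled):
>>
>>   import org.apache.spark.sql.SparkSession
>>
>>   val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
>>   import spark.implicits._
>>
>>   val df = spark.table("db.events_staging")  // illustrative source table
>>
>>   // Option 1: shuffle by the partition key so each task writes to only a
>>   // few partitions instead of all ~700 of them.
>>   df.repartition($"dt")
>>     .write.insertInto("db.events_partitioned")
>>
>>   // Option 2: a global sort by the partition key plus commonly filtered
>>   // columns; this also handles skew and lays the files out for more
>>   // selective reads later.
>>   df.orderBy($"dt", $"event_type")
>>     .write.insertInto("db.events_partitioned")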
>>
>> If that doesn't help, then you should look at your memory settings. When
>> you're getting killed by YARN, you should consider setting `
>> spark.shuffle.io.preferDirectBufs=false` so you use less off-heap memory
>> that the JVM doesn't account for. That is usually an easier fix than
>> increasing the memory overhead. Also, when you set executor memory, always
>> change spark.memory.fraction to ensure the memory you're adding is used
>> where it is needed. If your memory fraction is the default 60%, then 60% of
>> the memory will be used for Spark execution, not reserved for whatever is
>> consuming it and causing the OOM. (If Spark's memory is too low, you'll see
>> other problems like spilling too much to disk.)
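>>
>> A sketch of what those knobs look like together (the values are purely
>> illustrative; on YARN these normally go into spark-defaults.conf or
>> spark-submit --conf flags, since they need to be in place before the
>> executors launch):
>>
>>   import org.apache.spark.sql.SparkSession
>>
>>   val spark = SparkSession.builder()
>>     // extra off-JVM-heap headroom per executor, in MB
>>     .config("spark.yarn.executor.memoryOverhead", "4096")
>>     // keep Netty shuffle buffers on-heap so the JVM accounts for them
>>     .config("spark.shuffle.io.preferDirectBufs", "false")
>>     // raise the 0.6 default if the memory you add is meant for execution
>>     .config("spark.memory.fraction", "0.8")
>>     .enableHiveSupport()
>>     .getOrCreate()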
>>
>> rb
>>
>> On Wed, Aug 2, 2017 at 9:02 AM, Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>>
>>> Can anyone please guide me with above issue.
>>>
>>>
>>> On Wed, Aug 2, 2017 at 6:28 PM, Chetan Khatri <
>>> chetan.opensou...@gmail.com> wrote:
>>>
 Hello Spark Users,

 I have an HBase table that I read from and write to a Hive managed table,
 where I applied partitioning by a date column. That worked fine, but it
 generated a large number of files across almost 700 partitions, so I wanted
 to use repartitioning to reduce file I/O by reducing the number of files
 inside each partition.

 *But i ended up with below exception:*

 ExecutorLostFailure (executor 11 exited caused by one of the running
 tasks) Reason: Container killed by YARN for exceeding memory limits. 14.0
 GB of 14 GB physical memory used. Consider boosting spark.yarn.executor.
 memoryOverhead.

 Driver memory=4g, executor mem=12g, num-executors=8, executor core=8

 Do you think the below settings can help me overcome the above issue:

 spark.default.parallelism=1000
 spark.sql.shuffle.partitions=1000

 Because the default max number of partitions is 1000.



>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Welcoming Hyukjin Kwon and Sameer Agarwal as committers

2017-08-07 Thread Holden Karau
Congrats!

On Mon, Aug 7, 2017 at 3:54 PM Bryan Cutler  wrote:

> Great work Hyukjin and Sameer!
>
> On Mon, Aug 7, 2017 at 10:22 AM, Mridul Muralidharan 
> wrote:
>
>> Congratulations Hyukjin, Sameer !
>>
>> Regards,
>> Mridul
>>
>> On Mon, Aug 7, 2017 at 8:53 AM, Matei Zaharia 
>> wrote:
>> > Hi everyone,
>> >
>> > The Spark PMC recently voted to add Hyukjin Kwon and Sameer Agarwal as
>> committers. Join me in congratulating both of them and thanking them for
>> their contributions to the project!
>> >
>> > Matei
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
> --
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: spark pypy support?

2017-08-14 Thread Holden Karau
As Dong says yes we do test with PyPy in our CI env; but we expect a
"newer" version of PyPy (although I don't think we ever bothered to write
down what the exact version requirements are for the PyPy support unlike
regular Python).

On Mon, Aug 14, 2017 at 2:06 PM, Dong Joon Hyun 
wrote:

> Hi, Tom.
>
>
>
> What version of PyPy do you use?
>
>
>
> In the Jenkins environment, `pypy` always passes like Python 2.7 and
> Python 3.4.
>
>
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%
> 20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/3340/consoleFull
>
>
>
> 
>
> Running PySpark tests
>
> 
>
> Running PySpark tests. Output is in /home/jenkins/workspace/spark-
> master-test-sbt-hadoop-2.7/python/unit-tests.log
>
> Will test against the following Python executables: ['python2.7',
> 'python3.4', 'pypy']
>
> Will test the following Python modules: ['pyspark-core', 'pyspark-ml',
> 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
>
> Starting test(python2.7): pyspark.mllib.tests
>
> Starting test(pypy): pyspark.sql.tests
>
> Starting test(pypy): pyspark.tests
>
> Starting test(pypy): pyspark.streaming.tests
>
> Finished test(pypy): pyspark.tests (181s)
>
> …
>
>
>
> Tests passed in 1130 seconds
>
>
>
>
>
> Bests,
>
> Dongjoon.
>
>
>
>
>
> *From: *Tom Graves 
> *Date: *Monday, August 14, 2017 at 1:55 PM
> *To: *"dev@spark.apache.org" 
> *Subject: *spark pypy support?
>
>
>
> Anyone know if PyPy works with Spark? I saw a JIRA that it was supported
> back in Spark 1.2, but I'm getting an error when trying it and am not sure
> if it's something with my PyPy version or just something Spark doesn't support.
>
>
>
>
>
> AttributeError: 'builtin-code' object has no attribute 'co_filename'
> Traceback (most recent call last):
>   File "/app_main.py", line 75, in run_toplevel
>   File "/homes/tgraves/mbe.py", line 40, in 
> count = sc.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
>   File "/home/gs/spark/latest/python/lib/pyspark.zip/pyspark/rdd.py",
> line 834, in reduce
> vals = self.mapPartitions(func).collect()
>   File "/home/gs/spark/latest/python/lib/pyspark.zip/pyspark/rdd.py",
> line 808, in collect
> port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
>   File "/home/gs/spark/latest/python/lib/pyspark.zip/pyspark/rdd.py",
> line 2440, in _jrdd
> self._jrdd_deserializer, profiler)
>   File "/home/gs/spark/latest/python/lib/pyspark.zip/pyspark/rdd.py",
> line 2373, in _wrap_function
> pickled_command, broadcast_vars, env, includes =
> _prepare_for_python_RDD(sc, command)
>   File "/home/gs/spark/latest/python/lib/pyspark.zip/pyspark/rdd.py",
> line 2359, in _prepare_for_python_RDD
> pickled_command = ser.dumps(command)
>   File "/home/gs/spark/latest/python/lib/pyspark.zip/pyspark/serializers.py",
> line 460, in dumps
> return cloudpickle.dumps(obj, 2)
>   File "/home/gs/spark/latest/python/lib/pyspark.zip/pyspark/cloudpickle.py",
> line 703, in dumps
> cp.dump(obj)
>   File "/home/gs/spark/latest/python/lib/pyspark.zip/pyspark/cloudpickle.py",
> line 160, in dump
>
>
>
> Thanks,
>
> Tom
>



-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: spark pypy support?

2017-08-14 Thread Holden Karau
Ah interesting, looking at our latest docs we imply that it should work
with PyPy 2.3+ -- we might want to update that to 2.5+ since we aren't
testing with 2.3 anymore?

On Mon, Aug 14, 2017 at 3:09 PM, Tom Graves 
wrote:

> I tried 5.7 and 2.5.1, so it's probably something in my setup.  I'll
> investigate that more, wanted to make sure it was still supported because I
> didn't see anything about it since the original jira that added it.
>
> Thanks,
> Tom
>
>
> On Monday, August 14, 2017, 4:29:01 PM CDT, shane knapp <
> skn...@berkeley.edu> wrote:
>
>
> actually, we *have* locked on a particular pypy versions for the
> jenkins workers:  2.5.1
>
> this applies to both the 2.7 and 3.5 conda environments.
>
> (py3k)-bash-4.1$ pypy --version
> Python 2.7.9 (9c4588d731b7fe0b08669bd732c2b676cb0a8233, Apr 09 2015,
> 02:17:39)
> [PyPy 2.5.1 with GCC 4.4.7 20120313 (Red Hat 4.4.7-11)]
>
> On Mon, Aug 14, 2017 at 2:24 PM, Holden Karau 
> wrote:
> > As Dong says yes we do test with PyPy in our CI env; but we expect a
> "newer"
> > version of PyPy (although I don't think we ever bothered to write down
> what
> > the exact version requirements are for the PyPy support unlike regular
> > Python).
> >
> > On Mon, Aug 14, 2017 at 2:06 PM, Dong Joon Hyun 
> > wrote:
> >>
> >> Hi, Tom.
> >>
> >>
> >>
> >> What version of PyPy do you use?
> >>
> >>
> >>
> >> In the Jenkins environment, `pypy` always passes like Python 2.7 and
> >> Python 3.4.
> >>
> >>
> >>
> >>
> >> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20
> (Dashboard)/job/spark-master-test-sbt-hadoop-2.7/3340/consoleFull
> >>
> >>
> >>
> >> 
> 
> >>
> >> Running PySpark tests
> >>
> >> 
> 
> >>
> >> Running PySpark tests. Output is in
> >> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/
> python/unit-tests.log
> >>
> >> Will test against the following Python executables: ['python2.7',
> >> 'python3.4', 'pypy']
> >>
> >> Will test the following Python modules: ['pyspark-core', 'pyspark-ml',
> >> 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
> >>
> >> Starting test(python2.7): pyspark.mllib.tests
> >>
> >> Starting test(pypy): pyspark.sql.tests
> >>
> >> Starting test(pypy): pyspark.tests
> >>
> >> Starting test(pypy): pyspark.streaming.tests
> >>
> >> Finished test(pypy): pyspark.tests (181s)
> >>
> >> …
> >>
> >>
> >>
> >> Tests passed in 1130 seconds
> >>
> >>
> >>
> >>
> >>
> >> Bests,
> >>
> >> Dongjoon.
> >>
> >>
> >>
> >>
> >>
> >> From: Tom Graves 
> >> Date: Monday, August 14, 2017 at 1:55 PM
> >> To: "dev@spark.apache.org" 
> >> Subject: spark pypy support?
> >>
> >>
> >>
> >> Anyone know if pypy works with spark. Saw a jira that it was supported
> >> back in Spark 1.2 but getting an error when trying and not sure if its
> >> something with my pypy version of just something spark doesn't support.
> >>
> >>
> >>
> >>
> >>
> >> AttributeError: 'builtin-code' object has no attribute 'co_filename'
> >> Traceback (most recent call last):
> >>  File "/app_main.py", line 75, in run_toplevel
> >>  File "/homes/tgraves/mbe.py", line 40, in 
> >>count = sc.parallelize(range(1, n + 1),
> partitions).map(f).reduce(add)
> >>  File "/home/gs/spark/latest/python/lib/pyspark.zip/pyspark/rdd.py",
> line
> >> 834, in reduce
> >>vals = self.mapPartitions(func).collect()
> >>  File "/home/gs/spark/latest/python/lib/pyspark.zip/pyspark/rdd.py",
> line
> >> 808, in collect
> >>port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
> >>  File "/home/gs/spark/latest/python/lib/pyspark.zip/pyspark/rdd.py",
> line
> >> 2440, in _jrdd
> >>self._jrdd_deserializer, profiler)
> >>  File "/home/gs/spark/latest/python/lib/pyspark.zip/pyspark/rdd.py",
> line
> >> 2373, in _wrap_function
> >>pickled_command, broadcast_vars, env, includes =
> >> _prepare_for_python_RDD(sc, command)
> >>  File "/home/gs/spark/latest/python/lib/pyspark.zip/pyspark/rdd.py",
> line
> >> 2359, in _prepare_for_python_RDD
> >>pickled_command = ser.dumps(command)
> >>  File
> >> "/home/gs/spark/latest/python/lib/pyspark.zip/pyspark/serializers.py",
> line
> >> 460, in dumps
> >>return cloudpickle.dumps(obj, 2)
> >>  File
> >> "/home/gs/spark/latest/python/lib/pyspark.zip/pyspark/cloudpickle.py",
> line
> >> 703, in dumps
> >>cp.dump(obj)
> >>  File
> >> "/home/gs/spark/latest/python/lib/pyspark.zip/pyspark/cloudpickle.py",
> line
> >> 160, in dump
> >>
> >>
> >>
> >> Thanks,
> >>
> >> Tom
> >
> >
> >
> >
> > --
> > Cell : 425-233-8271 <(425)%20233-8271>
> > Twitter: https://twitter.com/holdenkarau
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: SPIP: Spark on Kubernetes

2017-08-15 Thread Holden Karau
+1 (non-binding)

I (personally) think that Kubernetes as a scheduler backend should
eventually get merged in and there is clearly a community interested in the
work required to maintain it.

On Tue, Aug 15, 2017 at 9:51 AM William Benton  wrote:

> +1 (non-binding)
>
> On Tue, Aug 15, 2017 at 10:32 AM, Anirudh Ramanathan <
> fox...@google.com.invalid> wrote:
>
>> The Spark on Kubernetes effort has been developed separately in a fork, and
>> is linked back from the Apache Spark project as an experimental backend.
>> We're ~6 months in and have had 5 releases.
>>
>>- 2 Spark versions maintained (2.1 and 2.2)
>>- Extensive integration testing and refactoring efforts to maintain
>>  code quality
>>- Developer and user-facing documentation
>>- 10+ consistent code contributors from different organizations involved
>>  in actively maintaining and using the project, with several more members
>>  involved in testing and providing feedback.
>>- The community has delivered several talks on Spark-on-Kubernetes
>>  generating lots of feedback from users.
>>- In addition to these, we've seen efforts spawn off such as:
>>   - HDFS on Kubernetes with Locality and Performance Experiments
>>   - Kerberized access to HDFS from Spark running on Kubernetes
>>
>> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>>
>>- +1: Yeah, let's go forward and implement the SPIP.
>>- +0: Don't really care.
>>- -1: I don't think this is a good idea because of the following
>>technical reasons.
>>
>> If there is any further clarification desired, on the design or the
>> implementation, please feel free to ask questions or provide feedback.
>>
>>
>> SPIP: Kubernetes as A Native Cluster Manager
>>
>> Full Design Doc: link
>> 
>>
>> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>>
>> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>>
>> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
>> Cheah,
>>
>> Ilan Filonenko, Sean Suchter, Kimoon Kim
>> Background and Motivation
>>
>> Containerization and cluster management technologies are constantly
>> evolving in the cluster computing world. Apache Spark currently implements
>> support for Apache Hadoop YARN and Apache Mesos, in addition to providing
>> its own standalone cluster manager. In 2014, Google announced development
>> of Kubernetes  which has its own unique feature
>> set and differentiates itself from YARN and Mesos. Since its debut, it has
>> seen contributions from over 1300 contributors with over 5 commits.
>> Kubernetes has cemented itself as a core player in the cluster computing
>> world, and cloud-computing providers such as Google Container Engine,
>> Google Compute Engine, Amazon Web Services, and Microsoft Azure support
>> running Kubernetes clusters.
>>
>> This document outlines a proposal for integrating Apache Spark with
>> Kubernetes in a first class way, adding Kubernetes to the list of cluster
>> managers that Spark can be used with. Doing so would allow users to share
>> their computing resources and containerization framework between their
>> existing applications on Kubernetes and their computational Spark
>> applications. Although there is existing support for running a Spark
>> standalone cluster on Kubernetes,
>> there are still major advantages and significant interest in having native
>> execution support. For example, this integration provides better support
>> for multi-tenancy and dynamic resource allocation. It also allows users to
>> run applications of different Spark versions of their choices in the same
>> cluster.
>>
>> The feature is being developed in a separate fork
>>  in order to minimize risk
>> to the main project during development. Since the start of the development
>> in November of 2016, it has received over 100 commits from over 20
>> contributors and supports two releases based on Spark 2.1 and 2.2
>> respectively. Documentation is also being actively worked on both in the
>> main project repository and also in the repository
>> https://github.com/apache-spark-on-k8s/userdocs. R

Re: pyspark installation using pip3

2017-08-25 Thread Holden Karau
The reinstall and caching mechanism is managed by pip itself, so I'm not
super sure what you're asking for?
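(For illustration only, and not something PySpark itself ships: a user-side
sketch of skipping the download when a matching version is already installed,
which is roughly what pip already does on its own. The pinned version string
below is just an example.)

# Illustrative only: check the installed pyspark version before invoking pip,
# so nothing is downloaded when the requested version is already present.
import subprocess
import sys

try:
    import pkg_resources
    installed = pkg_resources.get_distribution("pyspark").version
except Exception:
    installed = None

wanted = "2.2.0"  # example pin; substitute whatever version you actually want
if installed == wanted:
    print("pyspark %s already installed, nothing to do" % installed)
else:
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "pyspark==" + wanted])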

On Fri, Aug 25, 2017 at 5:43 PM Wei-Shun Lo  wrote:

>
> Hi Spark dev team,
>
> I found that the pyspark installation may always download the full package
> again, which is not quite a reasonable move. Please see the screenshot for
> information. A version check before reinstall would be better. Please
> advise, and thanks!
>
>
> Best ,
> Ralic
> ************************
> ************
> * Contact Info*
>
>  *US Mobile: 1-408-609-7628   *
>
> Email: rali...@gmail.com
> ************************
> ************
>
>
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: 2.1.2 maintenance release?

2017-09-07 Thread Holden Karau
I'd be happy to manage the 2.1.2 maintenance release (and 2.2.1 after that)
if people are ok with a committer / me running the release process rather
than a full PMC member.

On Thu, Sep 7, 2017 at 1:05 PM, Dongjoon Hyun 
wrote:

> +1!
>
> As of today,
>
> For 2.1.2, we have 87 commits. (2.1.1 was released 4 months ago)
> For 2.2.1, we have 95 commits. (2.2.0 was released 2 months ago)
>
> Can we have 2.2.1, too?
>
> Bests,
> Dongjoon.
>
>
> On Thu, Sep 7, 2017 at 2:14 AM, Sean Owen  wrote:
>
>> In a separate conversation about bugs and a security issue fixed in 2.1.x
>> and 2.0.x, Marcelo suggested it could be time for a maintenance release.
>> I'm not sure what our stance on 2.0.x is, but 2.1.2 seems like it could be
>> valuable to release.
>>
>> Thoughts? I believe Holden had expressed interest in even managing the
>> release process, but maybe others are interested as well. That is, this
>> could also be a chance to share that burden and spread release experience
>> around a bit.
>>
>> Sean
>>
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: 2.1.2 maintenance release?

2017-09-10 Thread Holden Karau
So I think the consensus is that there is interest in having a few
maintenance releases. I'm happy to act as the RM. I think the next step is
seeing who the PMC wants as the RM for these (and if people are OK with me,
I'll start updating myself on the docs, open JIRAs, and relevant Jenkins
jobs for packaging).

On Sun, Sep 10, 2017 at 11:31 PM, Felix Cheung 
wrote:

> Hi - what are the next steps?
> Pending changes are pushed and checked that there is no open JIRA
> targeting 2.1.2 and 2.2.1
>
> _
> From: Reynold Xin 
> Sent: Friday, September 8, 2017 9:27 AM
> Subject: Re: 2.1.2 maintenance release?
> To: Felix Cheung , Holden Karau <
> hol...@pigscanfly.ca>, Sean Owen , dev <
> dev@spark.apache.org>
>
>
>
> +1 as well. We should make a few maintenance releases.
>
> On Fri, Sep 8, 2017 at 6:46 PM Felix Cheung 
> wrote:
>
>> +1 on both 2.1.2 and 2.2.1
>>
>> And would try to help and/or wrangle the release if needed.
>>
>> (Note: trying to backport a few changes to branch-2.1 right now)
>>
>> --
>> *From:* Sean Owen 
>> *Sent:* Friday, September 8, 2017 12:05:28 AM
>> *To:* Holden Karau; dev
>> *Subject:* Re: 2.1.2 maintenance release?
>>
>> Let's look at the standard ASF guidance, which actually surprised me when
>> I first read it:
>>
>> https://www.apache.org/foundation/voting.html
>>
>> VOTES ON PACKAGE RELEASES
>> Votes on whether a package is ready to be released use majority approval
>> -- i.e. at least three PMC members must vote affirmatively for release, and
>> there must be more positive than negative votes. Releases may not be
>> vetoed. Generally the community will cancel the release vote if anyone
>> identifies serious problems, but in most cases the ultimate decision, lies
>> with the individual serving as release manager. The specifics of the
>> process may vary from project to project, but the 'minimum quorum of three
>> +1 votes' rule is universal.
>>
>>
>> PMC votes on it, but no vetoes allowed, and the release manager makes the
>> final call. Not your usual vote! doesn't say the release manager has to be
>> part of the PMC though it's the role with most decision power. In practice
>> I can't imagine it's a problem, but we could also just have someone on the
>> PMC technically be the release manager even as someone else is really
>> operating the release.
>>
>> The goal is, really, to be able to put out maintenance releases with
>> important fixes. Secondly, to ramp up one or more additional people to
>> perform the release steps. Maintenance releases ought to be the least
>> controversial releases to decide.
>>
>> Thoughts on kicking off a release for 2.1.2 to see how it goes?
>>
>> Although someone can just start following the steps, I think it will
>> certainly require some help from Michael, who's run the last release, to
>> clarify parts of the process or possibly provide an essential credential to
>> upload artifacts.
>>
>>
>> On Thu, Sep 7, 2017 at 11:59 PM Holden Karau 
>> wrote:
>>
>>> I'd be happy to manage the 2.1.2 maintenance release (and 2.2.1 after
>>> that) if people are ok with a committer / me running the release process
>>> rather than a full PMC member.
>>>
>>
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: 2.1.2 maintenance release?

2017-09-12 Thread Holden Karau
Sounds good. I have a little more experience with the Jenkins packaging jobs
from helping debug the Python packaging issues, so I'll get started and look
at updating the docs as I go until I get stuck.

On Tue, Sep 12, 2017 at 2:29 AM Sean Owen  wrote:

> I think you could just dive in to the steps at
> http://spark.apache.org/release-process.html and see how far you get
> before you need assistance to execute steps like tagging and publishing
> artifacts.
>
> I think a secondary goal of this process is to update and expand those
> release documents, as a fresh set of eyes inevitably sees points that need
> clarification and that aren't obvious to people who haven't run this
> process before.
>
> Some of this is already done: no need for a new branch, all 2.1.2 issues
> are resolved, branch already set to 2.1.2-SNAPSHOT. (Here's an example: the
> docs don't note that the spark-build-info script auto-updates the string
> that is used by the Spark REPL string.)
>
> The first area where you might need help or additional access is the bit
> about Jenkins jobs that cut release candidates automatically. It'd be good
> to add any notes about how that works here, too, for posterity.
>
> Which jobs are these and who could help set one up to cut 2.1.2 RC1?
>
>
> On Mon, Sep 11, 2017 at 7:41 AM Holden Karau  wrote:
>
>> So I think the consensus is that their is interest in having a few
>> maintenance releases. I'm happy to act as the RM. I think the next step is
>> seeing who the PMC wants as the RM for these (and if people are OK with me
>> I'll start updating my self on the docs, open JIRAs, and relevant Jenkins
>> jobs for packaging).
>>
>> On Sun, Sep 10, 2017 at 11:31 PM, Felix Cheung > > wrote:
>>
>>> Hi - what are the next steps?
>>> Pending changes are pushed and checked that there is no open JIRA
>>> targeting 2.1.2 and 2.2.1
>>>
>>> _
>>> From: Reynold Xin 
>>> Sent: Friday, September 8, 2017 9:27 AM
>>> Subject: Re: 2.1.2 maintenance release?
>>> To: Felix Cheung , Holden Karau <
>>> hol...@pigscanfly.ca>, Sean Owen , dev <
>>> dev@spark.apache.org>
>>>
>>>
>>>
>>> +1 as well. We should make a few maintenance releases.
>>>
>>> On Fri, Sep 8, 2017 at 6:46 PM Felix Cheung 
>>> wrote:
>>>
>>>> +1 on both 2.1.2 and 2.2.1
>>>>
>>>> And would try to help and/or wrangle the release if needed.
>>>>
>>>> (Note: trying to backport a few changes to branch-2.1 right now)
>>>>
>>>> --
>>>> *From:* Sean Owen 
>>>> *Sent:* Friday, September 8, 2017 12:05:28 AM
>>>> *To:* Holden Karau; dev
>>>> *Subject:* Re: 2.1.2 maintenance release?
>>>>
>>>> Let's look at the standard ASF guidance, which actually surprised me
>>>> when I first read it:
>>>>
>>>> https://www.apache.org/foundation/voting.html
>>>>
>>>> VOTES ON PACKAGE RELEASES
>>>> Votes on whether a package is ready to be released use majority
>>>> approval -- i.e. at least three PMC members must vote affirmatively for
>>>> release, and there must be more positive than negative votes. Releases may
>>>> not be vetoed. Generally the community will cancel the release vote if
>>>> anyone identifies serious problems, but in most cases the ultimate
>>>> decision, lies with the individual serving as release manager. The
>>>> specifics of the process may vary from project to project, but the 'minimum
>>>> quorum of three +1 votes' rule is universal.
>>>>
>>>>
>>>> PMC votes on it, but no vetoes allowed, and the release manager makes
>>>> the final call. Not your usual vote! doesn't say the release manager has to
>>>> be part of the PMC though it's the role with most decision power. In
>>>> practice I can't imagine it's a problem, but we could also just have
>>>> someone on the PMC technically be the release manager even as someone else
>>>> is really operating the release.
>>>>
>>>> The goal is, really, to be able to put out maintenance releases with
>>>> important fixes. Secondly, to ramp up one or more additional people to
>>>> perform the release steps. Maintenance releases ought to be the least
>>>> controversial releases to decide.
>>>>
>>>> Thoughts on kicking off a release for 2.1.2 to see how it goes?
>>>>
>>>> Although someone can just start following the steps, I think it will
>>>> certainly require some help from Michael, who's run the last release, to
>>>> clarify parts of the process or possibly provide an essential credential to
>>>> upload artifacts.
>>>>
>>>>
>>>> On Thu, Sep 7, 2017 at 11:59 PM Holden Karau 
>>>> wrote:
>>>>
>>>>> I'd be happy to manage the 2.1.2 maintenance release (and 2.2.1 after
>>>>> that) if people are ok with a committer / me running the release process
>>>>> rather than a full PMC member.
>>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
> --
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Starting the process for a Spark 2.1.2 Release

2017-09-12 Thread Holden Karau
Hi Spark Developers,

After the discussion around the need for a Spark 2.1.2 release I'd like to
start get the ball rolling.

If you are a developer on a specific component in Spark, now is a good time
to look and see if there are any important bug fixes that should be
backported into 2.1.2, and to create issues for the backporting so we can
track them throughout the release.

There are currently no "open" issues for Spark 2.1.2, but I suspect we'll
find a few more things to back-port before the first RC is cut. The current
list of all of the issues for 2.1.2 (open, resolved, etc.) can be viewed
with this filter.
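For anyone who prefers to pull that list programmatically, here is a rough
sketch against JIRA's REST search API (the JQL mirrors the web filter; the
field selection and maxResults value are arbitrary choices on my part):

# Sketch: list issues with fixVersion 2.1.2 straight from JIRA.
import requests

jql = "project = SPARK AND fixVersion = 2.1.2"
resp = requests.get(
    "https://issues.apache.org/jira/rest/api/2/search",
    params={"jql": jql, "fields": "summary,status", "maxResults": 200})
resp.raise_for_status()
for issue in resp.json()["issues"]:
    print(issue["key"],
          issue["fields"]["status"]["name"],
          issue["fields"]["summary"])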


In the meantime I'll be re-familiarizing myself with the Jenkins jobs and
updating the release docs to capture some of the institutional knowledge
around how to publish a release.

If nothing big pops up I'll try and cut an RC1 later on this week :)

Cheers,

Holden :)

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


[VOTE] Spark 2.1.2 (RC1)

2017-09-14 Thread Holden Karau
Please vote on releasing the following candidate as Apache Spark version
2.1.2. The vote is open until Friday September 22nd at 18:00 PST and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.2
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v2.1.2-rc1
(6f470323a0363656999dd36cb33f528afe627c12)

List of JIRA tickets resolved in this release can be found with this filter.


The release files, including signatures, digests, etc. can be found at:
https://home.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1248/

The documentation corresponding to this release can be found at:
https://people.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-docs/


*FAQ*

*How can I help test this release?*

If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install the
current RC and see if anything important breaks; in Java/Scala you can
add the staging repository to your project's resolvers and test with the RC
(make sure to clean up the artifact cache before/after so you don't end up
building with an out-of-date RC going forward).
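As a rough illustration of the PySpark side, a smoke test can be as small as
the following (the pyspark artifact name/URL is an assumption; use whichever
tarball is actually published under the -bin/ directory above, and note you
still need a local JDK on the PATH):

# Sketch of a PySpark smoke test against the RC in a throwaway virtual env.
# Assumes Python 3 on a Unix-like machine; the artifact URL is a placeholder.
import subprocess
import venv

ENV = "spark-2.1.2-rc1-test"
RC_ARTIFACT = ("https://home.apache.org/~pwendell/spark-releases/"
               "spark-2.1.2-rc1-bin/pyspark-2.1.2.tar.gz")  # placeholder name

venv.EnvironmentBuilder(with_pip=True).create(ENV)
subprocess.check_call([ENV + "/bin/pip", "install", RC_ARTIFACT])
# Swap this for your real workload; it just proves the install is usable.
subprocess.check_call([ENV + "/bin/python", "-c",
    "from pyspark.sql import SparkSession; "
    "spark = SparkSession.builder.master('local[2]').getOrCreate(); "
    "print(spark.range(100).count()); "
    "spark.stop()"])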

*What should happen to JIRA tickets still targeting 2.1.2?*

Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Everything else please retarget to 2.1.3.

*But my bug isn't fixed!??!*

In order to make timely releases, we will typically not hold the release
unless the bug in question is a regression from 2.1.1. That being said, if
there is something which is a regression from 2.1.1 that has not been
correctly targeted, please ping a committer to help target the issue (you
can see the open issues listed as impacting Spark 2.1.1 & 2.1.2).

*What are the unresolved* issues targeted for 2.1.2?

At the time of writing, there is one in-progress major issue, SPARK-21985;
I believe Andrew Ray & Hyukjin Kwon are looking into this one.

-- 
Twitter: https://twitter.com/holdenkarau


Re: [VOTE] Spark 2.1.2 (RC1)

2017-09-15 Thread Holden Karau
That's a good question. I built the release candidate; however, the Jenkins
scripts don't take a parameter for configuring who signs them, so they
always sign with Patrick's key. You can see this from previous
releases, which were managed by other folks but still signed by Patrick.
On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue  wrote:

> The signature is valid, but why was the release signed with Patrick
> Wendell's private key? Did Patrick build the release candidate?
>
> rb
>
> On Fri, Sep 15, 2017 at 6:36 AM, Denny Lee  wrote:
>
>> +1 (non-binding)
>>
>> On Thu, Sep 14, 2017 at 10:57 PM Felix Cheung 
>> wrote:
>>
>>> +1 tested SparkR package on Windows, r-hub, Ubuntu.
>>>
>>> _
>>> From: Sean Owen 
>>> Sent: Thursday, September 14, 2017 3:12 PM
>>> Subject: Re: [VOTE] Spark 2.1.2 (RC1)
>>> To: Holden Karau , 
>>>
>>>
>>>
>>> +1
>>> Very nice. The sigs and hashes look fine, it builds fine for me on
>>> Debian Stretch with Java 8, yarn/hive/hadoop-2.7 profiles, and passes
>>> tests.
>>>
>>> Yes as you say, no outstanding issues except for this which doesn't look
>>> critical, as it's not a regression.
>>>
>>> SPARK-21985 PySpark PairDeserializer is broken for double-zipped RDDs
>>>
>>>
>>> On Thu, Sep 14, 2017 at 7:47 PM Holden Karau 
>>> wrote:
>>>
>>>> Please vote on releasing the following candidate as Apache Spark
>>>> version 2.1.2. The vote is open until Friday September 22nd at 18:00
>>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>>
>>>> [ ] +1 Release this package as Apache Spark 2.1.2
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>>
>>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>>
>>>> The tag to be voted on is v2.1.2-rc1
>>>> <https://github.com/apache/spark/tree/v2.1.2-rc1> (6f470323a036365
>>>> 6999dd36cb33f528afe627c12)
>>>>
>>>> List of JIRA tickets resolved in this release can be found with this
>>>> filter.
>>>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.2>
>>>>
>>>> The release files, including signatures, digests, etc. can be found at:
>>>> https://home.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-bin/
>>>>
>>>> Release artifacts are signed with the following key:
>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>
>>>> The staging repository for this release can be found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1248/
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>> https://people.apache.org/~pwendell/spark-releases/spark-2.
>>>> 1.2-rc1-docs/
>>>>
>>>>
>>>> *FAQ*
>>>>
>>>> *How can I help test this release?*
>>>>
>>>> If you are a Spark user, you can help us test this release by taking an
>>>> existing Spark workload and running on this release candidate, then
>>>> reporting any regressions.
>>>>
>>>> If you're working in PySpark you can set up a virtual env and install
>>>> the current RC and see if anything important breaks, in the Java/Scala you
>>>> can add the staging repository to your projects resolvers and test with the
>>>> RC (make sure to clean up the artifact cache before/after so you don't end
>>>> up building with a out of date RC going forward).
>>>>
>>>> *What should happen to JIRA tickets still targeting 2.1.2?*
>>>>
>>>> Committers should look at those and triage. Extremely important bug
>>>> fixes, documentation, and API tweaks that impact compatibility should be
>>>> worked on immediately. Everything else please retarget to 2.1.3.
>>>>
>>>> *But my bug isn't fixed!??!*
>>>>
>>>> In order to make timely releases, we will typically not hold the
>>>> release unless the bug in question is a regression from 2.1.1. That being
>>>> said if there is something which is a regression form 2.1.1 that has not
>>>> been correctly targeted please ping a committer to help target the issue
>>>> (you can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
>>>> <https://issues.apache.org/jira/browse/SPARK-21985?jql=project%20%3D%20SPARK%20AND%20status%20%3D%20OPEN%20AND%20(affectedVersion%20%3D%202.1.2%20OR%20affectedVersion%20%3D%202.1.1)>
>>>> )
>>>>
>>>> *What are the unresolved* issues targeted for 2.1.2
>>>> <https://issues.apache.org/jira/browse/SPARK-21985?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.2>
>>>> ?
>>>>
>>>> At the time of the writing, there is one in progress major issue
>>>> SPARK-21985 <https://issues.apache.org/jira/browse/SPARK-21985>, I
>>>> believe Andrew Ray & HyukjinKwon are looking into this one.
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>
>>>
>>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>



-- 
Twitter: https://twitter.com/holdenkarau


Re: [VOTE] Spark 2.1.2 (RC1)

2017-09-15 Thread Holden Karau
Xiao, if it doesn't apply or you've changed your mind, it would be rad if
you could re-vote.

On Fri, Sep 15, 2017 at 2:22 PM, Felix Cheung 
wrote:

> Yes ;)
>
> --
> *From:* Xiao Li 
> *Sent:* Friday, September 15, 2017 2:22:03 PM
> *To:* Holden Karau
> *Cc:* Ryan Blue; Denny Lee; Felix Cheung; Sean Owen; dev@spark.apache.org
>
> *Subject:* Re: [VOTE] Spark 2.1.2 (RC1)
>
> Sorry, this release candidate is 2.1.2. The issue is in 2.2.1.
>
> 2017-09-15 14:21 GMT-07:00 Xiao Li :
>
>> -1
>>
>> See the discussion in https://github.com/apache/spark/pull/19074
>>
>> Xiao
>>
>>
>>
>> 2017-09-15 12:28 GMT-07:00 Holden Karau :
>>
>>> That's a good question, I built the release candidate however the
>>> Jenkins scripts don't take a parameter for configuring who signs them
>>> rather it always signs them with Patrick's key. You can see this from
>>> previous releases which were managed by other folks but still signed by
>>> Patrick.
>>>
>>> On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue  wrote:
>>>
>>>> The signature is valid, but why was the release signed with Patrick
>>>> Wendell's private key? Did Patrick build the release candidate?
>>>>
>>>> rb
>>>>
>>>> On Fri, Sep 15, 2017 at 6:36 AM, Denny Lee 
>>>> wrote:
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> On Thu, Sep 14, 2017 at 10:57 PM Felix Cheung <
>>>>> felixcheun...@hotmail.com> wrote:
>>>>>
>>>>>> +1 tested SparkR package on Windows, r-hub, Ubuntu.
>>>>>>
>>>>>> _
>>>>>> From: Sean Owen 
>>>>>> Sent: Thursday, September 14, 2017 3:12 PM
>>>>>> Subject: Re: [VOTE] Spark 2.1.2 (RC1)
>>>>>> To: Holden Karau , 
>>>>>>
>>>>>>
>>>>>>
>>>>>> +1
>>>>>> Very nice. The sigs and hashes look fine, it builds fine for me on
>>>>>> Debian Stretch with Java 8, yarn/hive/hadoop-2.7 profiles, and passes
>>>>>> tests.
>>>>>>
>>>>>> Yes as you say, no outstanding issues except for this which doesn't
>>>>>> look critical, as it's not a regression.
>>>>>>
>>>>>> SPARK-21985 PySpark PairDeserializer is broken for double-zipped RDDs
>>>>>>
>>>>>>
>>>>>> On Thu, Sep 14, 2017 at 7:47 PM Holden Karau 
>>>>>> wrote:
>>>>>>
>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>> version 2.1.2. The vote is open until Friday September 22nd at
>>>>>>> 18:00 PST and passes if a majority of at least 3 +1 PMC votes are
>>>>>>> cast.
>>>>>>>
>>>>>>> [ ] +1 Release this package as Apache Spark 2.1.2
>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>
>>>>>>>
>>>>>>> To learn more about Apache Spark, please see
>>>>>>> https://spark.apache.org/
>>>>>>>
>>>>>>> The tag to be voted on is v2.1.2-rc1
>>>>>>> <https://github.com/apache/spark/tree/v2.1.2-rc1> (6f470323a036365
>>>>>>> 6999dd36cb33f528afe627c12)
>>>>>>>
>>>>>>> List of JIRA tickets resolved in this release can be found with
>>>>>>> this filter.
>>>>>>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.2>
>>>>>>>
>>>>>>> The release files, including signatures, digests, etc. can be found
>>>>>>> at:
>>>>>>> https://home.apache.org/~pwendell/spark-releases/spark-2.1.2
>>>>>>> -rc1-bin/
>>>>>>>
>>>>>>> Release artifacts are signed with the following key:
>>>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>>>
>>>>>>> The staging repository for this release can be found at:
>>>>>>> https://repository.apache.org/content/repositories/orgapache
>>>>>>> spark-1248/
>>>>>>>
>>>>>>> The

Re: Signing releases with pwendell or release manager's key?

2017-09-15 Thread Holden Karau
Changing the release jobs, beyond the available parameters, right now
depends on Josh Rosen, as there are some scripts which generate the jobs
that aren't public. I've done temporary fixes in the past with the Python
packaging, but my understanding is that in the medium term it requires
access to the scripts.

So +CC Josh.

On Fri, Sep 15, 2017 at 4:38 PM Ryan Blue  wrote:

> I think this needs to be fixed. It's true that there are barriers to
> publication, but the signature is what we use to authenticate Apache
> releases.
>
> If Patrick's key is available on Jenkins for any Spark committer to use,
> then the chance of a compromise are much higher than for a normal RM key.
>
> rb
>
> On Fri, Sep 15, 2017 at 12:34 PM, Sean Owen  wrote:
>
>> Yeah I had meant to ask about that in the past. While I presume Patrick
>> consents to this and all that, it does mean that anyone with access to said
>> Jenkins scripts can create a signed Spark release, regardless of who they
>> are.
>>
>> I haven't thought through whether that's a theoretical issue we can
>> ignore or something we need to fix up. For example you can't get a release
>> on the ASF mirrors without more authentication.
>>
>> How hard would it be to make the script take in a key? it sort of looks
>> like the script already takes GPG_KEY, but don't know how to modify the
>> jobs. I suppose it would be ideal, in any event, for the actual release
>> manager to sign.
>>
>> On Fri, Sep 15, 2017 at 8:28 PM Holden Karau 
>> wrote:
>>
>>> That's a good question, I built the release candidate however the
>>> Jenkins scripts don't take a parameter for configuring who signs them
>>> rather it always signs them with Patrick's key. You can see this from
>>> previous releases which were managed by other folks but still signed by
>>> Patrick.
>>>
>>> On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue  wrote:
>>>
>>>> The signature is valid, but why was the release signed with Patrick
>>>> Wendell's private key? Did Patrick build the release candidate?
>>>>
>>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
-- 
Twitter: https://twitter.com/holdenkarau


Re: Signing releases with pwendell or release manager's key?

2017-09-15 Thread Holden Karau
Also continuing the discussion from the vote threads, Shane probably has
the best idea on the ACLs for Jenkins so I've CC'd him as well.


On Fri, Sep 15, 2017 at 5:09 PM Holden Karau  wrote:

> Changing the release jobs, beyond the available parameters, right now
> depends on Josh arisen as there are some scripts which generate the jobs
> which aren't public. I've done temporary fixes in the past with the Python
> packaging but my understanding is that in the medium term it requires
> access to the scripts.
>
> So +CC Josh.
>
> On Fri, Sep 15, 2017 at 4:38 PM Ryan Blue  wrote:
>
>> I think this needs to be fixed. It's true that there are barriers to
>> publication, but the signature is what we use to authenticate Apache
>> releases.
>>
>> If Patrick's key is available on Jenkins for any Spark committer to use,
>> then the chance of a compromise are much higher than for a normal RM key.
>>
>> rb
>>
>> On Fri, Sep 15, 2017 at 12:34 PM, Sean Owen  wrote:
>>
>>> Yeah I had meant to ask about that in the past. While I presume Patrick
>>> consents to this and all that, it does mean that anyone with access to said
>>> Jenkins scripts can create a signed Spark release, regardless of who they
>>> are.
>>>
>>> I haven't thought through whether that's a theoretical issue we can
>>> ignore or something we need to fix up. For example you can't get a release
>>> on the ASF mirrors without more authentication.
>>>
>>> How hard would it be to make the script take in a key? it sort of looks
>>> like the script already takes GPG_KEY, but don't know how to modify the
>>> jobs. I suppose it would be ideal, in any event, for the actual release
>>> manager to sign.
>>>
>>> On Fri, Sep 15, 2017 at 8:28 PM Holden Karau 
>>> wrote:
>>>
>>>> That's a good question, I built the release candidate however the
>>>> Jenkins scripts don't take a parameter for configuring who signs them
>>>> rather it always signs them with Patrick's key. You can see this from
>>>> previous releases which were managed by other folks but still signed by
>>>> Patrick.
>>>>
>>>> On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue  wrote:
>>>>
>>>>> The signature is valid, but why was the release signed with Patrick
>>>>> Wendell's private key? Did Patrick build the release candidate?
>>>>>
>>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
> --
> Twitter: https://twitter.com/holdenkarau
>
-- 
Twitter: https://twitter.com/holdenkarau


Re: [VOTE] Spark 2.1.2 (RC1)

2017-09-15 Thread Holden Karau
Indeed it's limited to people with login permissions on the Jenkins host
(and perhaps further limited, I'm not certain). Shane probably knows more
about the ACLs, so I'll ask him in the other thread for specifics.

This is maybe branching a bit from the question of the current RC though,
so I'd suggest we continue this discussion on the thread Sean Owen made.

On Fri, Sep 15, 2017 at 4:04 PM Ryan Blue  wrote:

> I'm not familiar with the release procedure, can you send a link to this
> Jenkins job? Can anyone run this job, or is it limited to committers?
>
> rb
>
> On Fri, Sep 15, 2017 at 12:28 PM, Holden Karau 
> wrote:
>
>> That's a good question, I built the release candidate however the Jenkins
>> scripts don't take a parameter for configuring who signs them rather it
>> always signs them with Patrick's key. You can see this from previous
>> releases which were managed by other folks but still signed by Patrick.
>>
>> On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue  wrote:
>>
>>> The signature is valid, but why was the release signed with Patrick
>>> Wendell's private key? Did Patrick build the release candidate?
>>>
>>> rb
>>>
>>> On Fri, Sep 15, 2017 at 6:36 AM, Denny Lee 
>>> wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> On Thu, Sep 14, 2017 at 10:57 PM Felix Cheung <
>>>> felixcheun...@hotmail.com> wrote:
>>>>
>>>>> +1 tested SparkR package on Windows, r-hub, Ubuntu.
>>>>>
>>>>> _
>>>>> From: Sean Owen 
>>>>> Sent: Thursday, September 14, 2017 3:12 PM
>>>>> Subject: Re: [VOTE] Spark 2.1.2 (RC1)
>>>>> To: Holden Karau , 
>>>>>
>>>>>
>>>>>
>>>>> +1
>>>>> Very nice. The sigs and hashes look fine, it builds fine for me on
>>>>> Debian Stretch with Java 8, yarn/hive/hadoop-2.7 profiles, and passes
>>>>> tests.
>>>>>
>>>>> Yes as you say, no outstanding issues except for this which doesn't
>>>>> look critical, as it's not a regression.
>>>>>
>>>>> SPARK-21985 PySpark PairDeserializer is broken for double-zipped RDDs
>>>>>
>>>>>
>>>>> On Thu, Sep 14, 2017 at 7:47 PM Holden Karau 
>>>>> wrote:
>>>>>
>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>> version 2.1.2. The vote is open until Friday September 22nd at 18:00
>>>>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>>>>
>>>>>> [ ] +1 Release this package as Apache Spark 2.1.2
>>>>>> [ ] -1 Do not release this package because ...
>>>>>>
>>>>>>
>>>>>> To learn more about Apache Spark, please see
>>>>>> https://spark.apache.org/
>>>>>>
>>>>>> The tag to be voted on is v2.1.2-rc1
>>>>>> <https://github.com/apache/spark/tree/v2.1.2-rc1> (
>>>>>> 6f470323a0363656999dd36cb33f528afe627c12)
>>>>>>
>>>>>> List of JIRA tickets resolved in this release can be found with this
>>>>>> filter.
>>>>>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.2>
>>>>>>
>>>>>> The release files, including signatures, digests, etc. can be found
>>>>>> at:
>>>>>> https://home.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-bin/
>>>>>>
>>>>>> Release artifacts are signed with the following key:
>>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>>
>>>>>> The staging repository for this release can be found at:
>>>>>>
>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1248/
>>>>>>
>>>>>> The documentation corresponding to this release can be found at:
>>>>>>
>>>>>> https://people.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-docs/
>>>>>>
>>>>>>
>>>>>> *FAQ*
>>>>>>
>>>>>> *How can I help test this release?*
>>>>>>
>>>>>> If you are a Spark user, you can help us test this release by ta

Re: Signing releases with pwendell or release manager's key?

2017-09-15 Thread Holden Karau
Oh yes and to keep people more informed I've been updating a PR for the
release documentation as I go to write down some of this unwritten
knowledge -- https://github.com/apache/spark-website/pull/66


On Fri, Sep 15, 2017 at 5:12 PM Holden Karau  wrote:

> Also continuing the discussion from the vote threads, Shane probably has
> the best idea on the ACLs for Jenkins so I've CC'd him as well.
>
>
> On Fri, Sep 15, 2017 at 5:09 PM Holden Karau  wrote:
>
>> Changing the release jobs, beyond the available parameters, right now
>> depends on Josh arisen as there are some scripts which generate the jobs
>> which aren't public. I've done temporary fixes in the past with the Python
>> packaging but my understanding is that in the medium term it requires
>> access to the scripts.
>>
>> So +CC Josh.
>>
>> On Fri, Sep 15, 2017 at 4:38 PM Ryan Blue  wrote:
>>
>>> I think this needs to be fixed. It's true that there are barriers to
>>> publication, but the signature is what we use to authenticate Apache
>>> releases.
>>>
>>> If Patrick's key is available on Jenkins for any Spark committer to use,
>>> then the chance of a compromise are much higher than for a normal RM key.
>>>
>>> rb
>>>
>>> On Fri, Sep 15, 2017 at 12:34 PM, Sean Owen  wrote:
>>>
>>>> Yeah I had meant to ask about that in the past. While I presume Patrick
>>>> consents to this and all that, it does mean that anyone with access to said
>>>> Jenkins scripts can create a signed Spark release, regardless of who they
>>>> are.
>>>>
>>>> I haven't thought through whether that's a theoretical issue we can
>>>> ignore or something we need to fix up. For example you can't get a release
>>>> on the ASF mirrors without more authentication.
>>>>
>>>> How hard would it be to make the script take in a key? it sort of looks
>>>> like the script already takes GPG_KEY, but don't know how to modify the
>>>> jobs. I suppose it would be ideal, in any event, for the actual release
>>>> manager to sign.
>>>>
>>>> On Fri, Sep 15, 2017 at 8:28 PM Holden Karau 
>>>> wrote:
>>>>
>>>>> That's a good question, I built the release candidate however the
>>>>> Jenkins scripts don't take a parameter for configuring who signs them
>>>>> rather it always signs them with Patrick's key. You can see this from
>>>>> previous releases which were managed by other folks but still signed by
>>>>> Patrick.
>>>>>
>>>>> On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue  wrote:
>>>>>
>>>>>> The signature is valid, but why was the release signed with Patrick
>>>>>> Wendell's private key? Did Patrick build the release candidate?
>>>>>>
>>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
> --
> Twitter: https://twitter.com/holdenkarau
>
-- 
Twitter: https://twitter.com/holdenkarau


Re: [VOTE] Spark 2.1.2 (RC1)

2017-09-16 Thread Holden Karau
Ok :) Was this working in 2.1.1?

On Sat, Sep 16, 2017 at 3:59 PM Xiao Li  wrote:

> Still -1
>
> Unable to pass the tests in my local environment. Open a JIRA
> https://issues.apache.org/jira/browse/SPARK-22041
>
> - SPARK-16625: General data types to be mapped to Oracle *** FAILED ***
>
>   types.apply(9).equals(org.apache.spark.sql.types.DateType) was false
> (OracleIntegrationSuite.scala:158)
>
> Xiao
>
> 2017-09-15 17:35 GMT-07:00 Ryan Blue :
>
>> -1 (with my Apache member hat on, non-binding)
>>
>> I'll continue discussion in the other thread, but I don't think we should
>> share signing keys.
>>
>> On Fri, Sep 15, 2017 at 5:14 PM, Holden Karau 
>> wrote:
>>
>>> Indeed it's limited to a people with login permissions on the Jenkins
>>> host (and perhaps further limited, I'm not certain). Shane probably knows
>>> more about the ACLs, so I'll ask him in the other thread for specifics.
>>>
>>> This is maybe branching a bit from the question of the current RC
>>> though, so I'd suggest we continue this discussion on the thread Sean Owen
>>> made.
>>>
>>> On Fri, Sep 15, 2017 at 4:04 PM Ryan Blue  wrote:
>>>
>>>> I'm not familiar with the release procedure, can you send a link to
>>>> this Jenkins job? Can anyone run this job, or is it limited to committers?
>>>>
>>>> rb
>>>>
>>>> On Fri, Sep 15, 2017 at 12:28 PM, Holden Karau 
>>>> wrote:
>>>>
>>>>> That's a good question, I built the release candidate however the
>>>>> Jenkins scripts don't take a parameter for configuring who signs them
>>>>> rather it always signs them with Patrick's key. You can see this from
>>>>> previous releases which were managed by other folks but still signed by
>>>>> Patrick.
>>>>>
>>>>> On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue  wrote:
>>>>>
>>>>>> The signature is valid, but why was the release signed with Patrick
>>>>>> Wendell's private key? Did Patrick build the release candidate?
>>>>>>
>>>>>> rb
>>>>>>
>>>>>> On Fri, Sep 15, 2017 at 6:36 AM, Denny Lee 
>>>>>> wrote:
>>>>>>
>>>>>>> +1 (non-binding)
>>>>>>>
>>>>>>> On Thu, Sep 14, 2017 at 10:57 PM Felix Cheung <
>>>>>>> felixcheun...@hotmail.com> wrote:
>>>>>>>
>>>>>>>> +1 tested SparkR package on Windows, r-hub, Ubuntu.
>>>>>>>>
>>>>>>>> _
>>>>>>>> From: Sean Owen 
>>>>>>>> Sent: Thursday, September 14, 2017 3:12 PM
>>>>>>>> Subject: Re: [VOTE] Spark 2.1.2 (RC1)
>>>>>>>> To: Holden Karau , 
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> +1
>>>>>>>> Very nice. The sigs and hashes look fine, it builds fine for me on
>>>>>>>> Debian Stretch with Java 8, yarn/hive/hadoop-2.7 profiles, and passes
>>>>>>>> tests.
>>>>>>>>
>>>>>>>> Yes as you say, no outstanding issues except for this which doesn't
>>>>>>>> look critical, as it's not a regression.
>>>>>>>>
>>>>>>>> SPARK-21985 PySpark PairDeserializer is broken for double-zipped
>>>>>>>> RDDs
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Sep 14, 2017 at 7:47 PM Holden Karau 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>>>> version 2.1.2. The vote is open until Friday September 22nd at
>>>>>>>>> 18:00 PST and passes if a majority of at least 3 +1 PMC votes are
>>>>>>>>> cast.
>>>>>>>>>
>>>>>>>>> [ ] +1 Release this package as Apache Spark 2.1.2
>>>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> To learn more about Apache Spark, please see
>>>>>>>>

Re: Signing releases with pwendell or release manager's key?

2017-09-17 Thread Holden Karau
Would any of Patrick/Josh/Shane (or other PMC folks with
understanding/opinions on this setup) care to comment? If this is a
blocking issue I can cancel the current release vote thread while we
discuss this some more.

On Fri, Sep 15, 2017 at 5:18 PM Holden Karau  wrote:

> Oh yes and to keep people more informed I've been updating a PR for the
> release documentation as I go to write down some of this unwritten
> knowledge -- https://github.com/apache/spark-website/pull/66
>
>
> On Fri, Sep 15, 2017 at 5:12 PM Holden Karau  wrote:
>
>> Also continuing the discussion from the vote threads, Shane probably has
>> the best idea on the ACLs for Jenkins so I've CC'd him as well.
>>
>>
>> On Fri, Sep 15, 2017 at 5:09 PM Holden Karau 
>> wrote:
>>
>>> Changing the release jobs, beyond the available parameters, right now
>>> depends on Josh arisen as there are some scripts which generate the jobs
>>> which aren't public. I've done temporary fixes in the past with the Python
>>> packaging but my understanding is that in the medium term it requires
>>> access to the scripts.
>>>
>>> So +CC Josh.
>>>
>>> On Fri, Sep 15, 2017 at 4:38 PM Ryan Blue  wrote:
>>>
>>>> I think this needs to be fixed. It's true that there are barriers to
>>>> publication, but the signature is what we use to authenticate Apache
>>>> releases.
>>>>
>>>> If Patrick's key is available on Jenkins for any Spark committer to
>>>> use, then the chance of a compromise are much higher than for a normal RM
>>>> key.
>>>>
>>>> rb
>>>>
>>>> On Fri, Sep 15, 2017 at 12:34 PM, Sean Owen  wrote:
>>>>
>>>>> Yeah I had meant to ask about that in the past. While I presume
>>>>> Patrick consents to this and all that, it does mean that anyone with 
>>>>> access
>>>>> to said Jenkins scripts can create a signed Spark release, regardless of
>>>>> who they are.
>>>>>
>>>>> I haven't thought through whether that's a theoretical issue we can
>>>>> ignore or something we need to fix up. For example you can't get a release
>>>>> on the ASF mirrors without more authentication.
>>>>>
>>>>> How hard would it be to make the script take in a key? it sort of
>>>>> looks like the script already takes GPG_KEY, but don't know how to modify
>>>>> the jobs. I suppose it would be ideal, in any event, for the actual 
>>>>> release
>>>>> manager to sign.
>>>>>
>>>>> On Fri, Sep 15, 2017 at 8:28 PM Holden Karau 
>>>>> wrote:
>>>>>
>>>>>> That's a good question, I built the release candidate however the
>>>>>> Jenkins scripts don't take a parameter for configuring who signs them
>>>>>> rather it always signs them with Patrick's key. You can see this from
>>>>>> previous releases which were managed by other folks but still signed by
>>>>>> Patrick.
>>>>>>
>>>>>> On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue 
>>>>>> wrote:
>>>>>>
>>>>>>> The signature is valid, but why was the release signed with Patrick
>>>>>>> Wendell's private key? Did Patrick build the release candidate?
>>>>>>>
>>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
> --
> Twitter: https://twitter.com/holdenkarau
>
-- 
Twitter: https://twitter.com/holdenkarau


Re: Signing releases with pwendell or release manager's key?

2017-09-18 Thread Holden Karau
I'm more than willing to help migrate the scripts as part of either this
release or the next.

It sounds like there is a consensus developing around changing the process
-- should we hold off on the 2.1.2 release or roll this into the next one?
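To make the "script in the repo that takes the RM's key" idea concrete, here
is a minimal sketch; the flag names are made up, and the downstream build
step is my reading of dev/create-release rather than the actual Jenkins job:

# Illustrative entry point: accept the release manager's GPG key ID as a
# parameter and pass it to the packaging step via the GPG_KEY environment
# variable, instead of baking a single shared key into the job.
import argparse
import os
import subprocess

parser = argparse.ArgumentParser(description="Cut a Spark release candidate")
parser.add_argument("--gpg-key", required=True,
                    help="key ID of the release manager signing this RC")
parser.add_argument("--release-version", required=True)
args = parser.parse_args()

env = dict(os.environ, GPG_KEY=args.gpg_key,
           RELEASE_VERSION=args.release_version)
# Assumed packaging/signing step; the real logic would live alongside this.
subprocess.check_call(["./dev/create-release/release-build.sh", "package"],
                      env=env)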

On Mon, Sep 18, 2017 at 7:37 PM, Marcelo Vanzin  wrote:

> +1 to this. There should be a script in the Spark repo that has all
> the logic needed for a release. That script should take the RM's key
> as a parameter.
>
> if there's a desire to keep the current Jenkins job to create the
> release, it should be based on that script. But from what I'm seeing
> there are currently too many unknowns in the release process.
>
> On Mon, Sep 18, 2017 at 4:55 PM, Ryan Blue 
> wrote:
> > I don't understand why it is necessary to share a release key. If this is
> > something that can be automated in a Jenkins job, then can it be a script
> > with a reasonable set of build requirements for Mac and Ubuntu? That's
> the
> > approach I've seen the most in other projects.
> >
> > I'm also not just concerned about release managers. Having a key stored
> > persistently on outside infrastructure adds the most risk, as Luciano
> noted
> > as well. We should also start publishing checksums in the Spark VOTE
> thread,
> > which are currently missing. The risk I'm concerned about is that if the
> key
> > were compromised, it would be possible to replace binaries with perfectly
> > valid ones, at least on some mirrors. If the Apache copy were replaced,
> then
> > we wouldn't even be able to catch that it had happened. Given the high
> > profile of Spark and the number of companies that run it, I think we
> need to
> > take extra care to make sure that can't happen, even if it is an
> annoyance
> > for the release managers.
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


-- 
Twitter: https://twitter.com/holdenkarau


Re: Signing releases with pwendell or release manager's key?

2017-09-18 Thread Holden Karau
That sounds like a pretty good temporary workaround. If folks agree, I'll
cancel the release vote for 2.1.2 and work on getting a manually signed RC2
out later this week. I've filed JIRAs SPARK-22055 & SPARK-22054 to port the
release scripts and allow injecting the RM's key.
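For the interim manual pass, roughly the following (the glob pattern and key
ID are placeholders) would cover re-signing the artifacts with the RM's own
key plus the checksums that came up earlier in the thread:

# Sketch: detached-sign each RC artifact with the release manager's own key
# and print SHA-512 checksums suitable for pasting into the vote thread.
import glob
import hashlib
import subprocess

KEY_ID = "RM_KEY_ID"  # the release manager's own key, not a shared one

for artifact in sorted(glob.glob("spark-2.1.2-rc2-bin/*.tgz")):
    subprocess.check_call(["gpg", "--armor", "--detach-sign",
                           "--local-user", KEY_ID, artifact])
    digest = hashlib.sha512()
    with open(artifact, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    print("%s  %s" % (digest.hexdigest(), artifact))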

On Mon, Sep 18, 2017 at 8:11 PM, Patrick Wendell 
wrote:

> For the current release - maybe Holden could just sign the artifacts with
> her own key manually, if this is a concern. I don't think that would
> require modifying the release pipeline, except to just remove/ignore the
> existing signatures.
>
> - Patrick
>
> On Mon, Sep 18, 2017 at 7:56 PM, Reynold Xin  wrote:
>
>> Does anybody know whether this is a hard blocker? If it is not, we should
>> probably push 2.1.2 forward quickly and do the infrastructure improvement
>> in parallel.
>>
>> On Mon, Sep 18, 2017 at 7:49 PM, Holden Karau 
>> wrote:
>>
>>> I'm more than willing to help migrate the scripts as part of either this
>>> release or the next.
>>>
>>> It sounds like there is a consensus developing around changing the
>>> process -- should we hold off on the 2.1.2 release or roll this into the
>>> next one?
>>>
>>> On Mon, Sep 18, 2017 at 7:37 PM, Marcelo Vanzin 
>>> wrote:
>>>
>>>> +1 to this. There should be a script in the Spark repo that has all
>>>> the logic needed for a release. That script should take the RM's key
>>>> as a parameter.
>>>>
>>>> if there's a desire to keep the current Jenkins job to create the
>>>> release, it should be based on that script. But from what I'm seeing
>>>> there are currently too many unknowns in the release process.
>>>>
>>>> On Mon, Sep 18, 2017 at 4:55 PM, Ryan Blue 
>>>> wrote:
>>>> > I don't understand why it is necessary to share a release key. If
>>>> this is
>>>> > something that can be automated in a Jenkins job, then can it be a
>>>> script
>>>> > with a reasonable set of build requirements for Mac and Ubuntu?
>>>> That's the
>>>> > approach I've seen the most in other projects.
>>>> >
>>>> > I'm also not just concerned about release managers. Having a key
>>>> stored
>>>> > persistently on outside infrastructure adds the most risk, as Luciano
>>>> noted
>>>> > as well. We should also start publishing checksums in the Spark VOTE
>>>> thread,
>>>> > which are currently missing. The risk I'm concerned about is that if
>>>> the key
>>>> > were compromised, it would be possible to replace binaries with
>>>> perfectly
>>>> > valid ones, at least on some mirrors. If the Apache copy were
>>>> replaced, then
>>>> > we wouldn't even be able to catch that it had happened. Given the high
>>>> > profile of Spark and the number of companies that run it, I think we
>>>> need to
>>>> > take extra care to make sure that can't happen, even if it is an
>>>> annoyance
>>>> > for the release managers.
>>>>
>>>> --
>>>> Marcelo
>>>>
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>
>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: [VOTE] Spark 2.1.2 (RC1)

2017-09-18 Thread Holden Karau
As per the conversation happening around the signing of releases, I'm
cancelling this vote. If folks agree with the temporary solution there, I'll
try and get a new RC out shortly, but if we end up blocking on migrating the
Jenkins jobs it could take a bit longer.

On Sun, Sep 17, 2017 at 1:30 AM, yuming wang  wrote:

> Yes, it doesn't work in 2.1.0 and 2.1.1; I created a PR for this:
> https://github.com/apache/spark/pull/19259.
>
>
> On Sep 17, 2017, at 16:14, Sean Owen  wrote:
>
> So, didn't work in 2.1.0 or 2.1.1? If it's not a regression and not
> critical, it shouldn't block a release. It seems like this can only affect
> Docker and/or Oracle JDBC? Well, if we need to roll another release anyway,
> seems OK.
>
> On Sun, Sep 17, 2017 at 6:06 AM Xiao Li  wrote:
>
>> This is a bug introduced in 2.1. It works fine in 2.0
>>
>> 2017-09-16 16:15 GMT-07:00 Holden Karau :
>>
>>> Ok :) Was this working in 2.1.1?
>>>
>>> On Sat, Sep 16, 2017 at 3:59 PM Xiao Li  wrote:
>>>
>>>> Still -1
>>>>
>>>> Unable to pass the tests in my local environment. Open a JIRA
>>>> https://issues.apache.org/jira/browse/SPARK-22041
>>>>
>>>> - SPARK-16625: General data types to be mapped to Oracle *** FAILED ***
>>>>
>>>>   types.apply(9).equals(org.apache.spark.sql.types.DateType) was false
>>>> (OracleIntegrationSuite.scala:158)
>>>>
>>>> Xiao
>>>>
>>>> 2017-09-15 17:35 GMT-07:00 Ryan Blue :
>>>>
>>>>> -1 (with my Apache member hat on, non-binding)
>>>>>
>>>>> I'll continue discussion in the other thread, but I don't think we
>>>>> should share signing keys.
>>>>>
>>>>> On Fri, Sep 15, 2017 at 5:14 PM, Holden Karau 
>>>>> wrote:
>>>>>
>>>>>> Indeed it's limited to a people with login permissions on the Jenkins
>>>>>> host (and perhaps further limited, I'm not certain). Shane probably knows
>>>>>> more about the ACLs, so I'll ask him in the other thread for specifics.
>>>>>>
>>>>>> This is maybe branching a bit from the question of the current RC
>>>>>> though, so I'd suggest we continue this discussion on the thread Sean 
>>>>>> Owen
>>>>>> made.
>>>>>>
>>>>>> On Fri, Sep 15, 2017 at 4:04 PM Ryan Blue  wrote:
>>>>>>
>>>>>>> I'm not familiar with the release procedure, can you send a link to
>>>>>>> this Jenkins job? Can anyone run this job, or is it limited to 
>>>>>>> committers?
>>>>>>>
>>>>>>> rb
>>>>>>>
>>>>>>> On Fri, Sep 15, 2017 at 12:28 PM, Holden Karau >>>>>> > wrote:
>>>>>>>
>>>>>>>> That's a good question, I built the release candidate however the
>>>>>>>> Jenkins scripts don't take a parameter for configuring who signs them
>>>>>>>> rather it always signs them with Patrick's key. You can see this from
>>>>>>>> previous releases which were managed by other folks but still signed by
>>>>>>>> Patrick.
>>>>>>>>
>>>>>>>> On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The signature is valid, but why was the release signed with
>>>>>>>>> Patrick Wendell's private key? Did Patrick build the release 
>>>>>>>>> candidate?
>>>>>>>>>
>>>>>>>>> rb
>>>>>>>>>
>>>>>>>>> On Fri, Sep 15, 2017 at 6:36 AM, Denny Lee 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> +1 (non-binding)
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 14, 2017 at 10:57 PM Felix Cheung <
>>>>>>>>>> felixcheun...@hotmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1 tested SparkR package on Windows, r-hub, Ubuntu.
>>>>>>>>>>>
>>>>>>>>>>> _
>>>>>>>>>>> From: Sean Ow

Re: Signing releases with pwendell or release manager's key?

2017-09-19 Thread Holden Karau
Thanks for the reminder :)

On Tue, Sep 19, 2017 at 9:02 AM Luciano Resende 
wrote:

> Manually signing seems a good compromise for now, but note that there are
> two places that this needs to happen, the artifacts that goes to dist.a.o
> as well as the ones that are published to maven.
>
> On Tue, Sep 19, 2017 at 8:53 AM, Ryan Blue 
> wrote:
>
>> +1. Thanks for coming up with a solution, everyone! I think the manually
>> signed RC as a work around will work well, and it will be an improvement
>> for the rest to be updated.
>>
>> On Mon, Sep 18, 2017 at 8:25 PM, Patrick Wendell 
>> wrote:
>>
>>> Sounds good - thanks Holden!
>>>
>>> On Mon, Sep 18, 2017 at 8:21 PM, Holden Karau 
>>> wrote:
>>>
>>>> That sounds like a pretty good temporary work around if folks agree
>>>> I'll cancel release vote for 2.1.2 and work on getting an RC2 out later
>>>> this week manually signed. I've filed JIRA SPARK-22055 & SPARK-22054 to
>>>> port the release scripts and allow injecting of the RM's key.
>>>>
>>>> On Mon, Sep 18, 2017 at 8:11 PM, Patrick Wendell <
>>>> patr...@databricks.com> wrote:
>>>>
>>>>> For the current release - maybe Holden could just sign the artifacts
>>>>> with her own key manually, if this is a concern. I don't think that would
>>>>> require modifying the release pipeline, except to just remove/ignore the
>>>>> existing signatures.
>>>>>
>>>>> - Patrick
>>>>>
>>>>> On Mon, Sep 18, 2017 at 7:56 PM, Reynold Xin 
>>>>> wrote:
>>>>>
>>>>>> Does anybody know whether this is a hard blocker? If it is not, we
>>>>>> should probably push 2.1.2 forward quickly and do the infrastructure
>>>>>> improvement in parallel.
>>>>>>
>>>>>> On Mon, Sep 18, 2017 at 7:49 PM, Holden Karau 
>>>>>> wrote:
>>>>>>
>>>>>>> I'm more than willing to help migrate the scripts as part of either
>>>>>>> this release or the next.
>>>>>>>
>>>>>>> It sounds like there is a consensus developing around changing the
>>>>>>> process -- should we hold off on the 2.1.2 release or roll this into the
>>>>>>> next one?
>>>>>>>
>>>>>>> On Mon, Sep 18, 2017 at 7:37 PM, Marcelo Vanzin >>>>>> > wrote:
>>>>>>>
>>>>>>>> +1 to this. There should be a script in the Spark repo that has all
>>>>>>>> the logic needed for a release. That script should take the RM's key
>>>>>>>> as a parameter.
>>>>>>>>
>>>>>>>> if there's a desire to keep the current Jenkins job to create the
>>>>>>>> release, it should be based on that script. But from what I'm seeing
>>>>>>>> there are currently too many unknowns in the release process.
>>>>>>>>
>>>>>>>> On Mon, Sep 18, 2017 at 4:55 PM, Ryan Blue
>>>>>>>>  wrote:
>>>>>>>> > I don't understand why it is necessary to share a release key. If
>>>>>>>> this is
>>>>>>>> > something that can be automated in a Jenkins job, then can it be
>>>>>>>> a script
>>>>>>>> > with a reasonable set of build requirements for Mac and Ubuntu?
>>>>>>>> That's the
>>>>>>>> > approach I've seen the most in other projects.
>>>>>>>> >
>>>>>>>> > I'm also not just concerned about release managers. Having a key
>>>>>>>> stored
>>>>>>>> > persistently on outside infrastructure adds the most risk, as
>>>>>>>> Luciano noted
>>>>>>>> > as well. We should also start publishing checksums in the Spark
>>>>>>>> VOTE thread,
>>>>>>>> > which are currently missing. The risk I'm concerned about is that
>>>>>>>> if the key
>>>>>>>> > were compromised, it would be possible to replace binaries with
>>>>>>>> perfectly
>>>>>>>> > valid ones, at least on some mirrors. If the Apache copy were
>>>>>>>> replaced, then
>>>>>>>> > we wouldn't even be able to catch that it had happened. Given the
>>>>>>>> high
>>>>>>>> > profile of Spark and the number of companies that run it, I think
>>>>>>>> we need to
>>>>>>>> > take extra care to make sure that can't happen, even if it is an
>>>>>>>> annoyance
>>>>>>>> > for the release managers.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Marcelo
>>>>>>>>
>>>>>>>>
>>>>>>>> -
>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Cell : 425-233-8271
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
>
> --
> Luciano Resende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>
-- 
Twitter: https://twitter.com/holdenkarau


Re: Signing releases with pwendell or release manager's key?

2017-09-19 Thread Holden Karau
Another option is that I can just run the build locally. This might be a
better approach, since it will help make sure we have the dependencies
documented for the eventual transition to dockerized builds.

On Tue, Sep 19, 2017 at 9:53 AM, Holden Karau  wrote:

> Thanks for the reminder :)
>
> On Tue, Sep 19, 2017 at 9:02 AM Luciano Resende 
> wrote:
>
>> Manually signing seems a good compromise for now, but note that there are
>> two places that this needs to happen, the artifacts that goes to dist.a.o
>> as well as the ones that are published to maven.
>>
>> On Tue, Sep 19, 2017 at 8:53 AM, Ryan Blue 
>> wrote:
>>
>>> +1. Thanks for coming up with a solution, everyone! I think the manually
>>> signed RC as a work around will work well, and it will be an improvement
>>> for the rest to be updated.
>>>
>>> On Mon, Sep 18, 2017 at 8:25 PM, Patrick Wendell >> > wrote:
>>>
>>>> Sounds good - thanks Holden!
>>>>
>>>> On Mon, Sep 18, 2017 at 8:21 PM, Holden Karau 
>>>> wrote:
>>>>
>>>>> That sounds like a pretty good temporary work around if folks agree
>>>>> I'll cancel release vote for 2.1.2 and work on getting an RC2 out later
>>>>> this week manually signed. I've filed JIRA SPARK-22055 & SPARK-22054 to
>>>>> port the release scripts and allow injecting of the RM's key.
>>>>>
>>>>> On Mon, Sep 18, 2017 at 8:11 PM, Patrick Wendell <
>>>>> patr...@databricks.com> wrote:
>>>>>
>>>>>> For the current release - maybe Holden could just sign the artifacts
>>>>>> with her own key manually, if this is a concern. I don't think that would
>>>>>> require modifying the release pipeline, except to just remove/ignore the
>>>>>> existing signatures.
>>>>>>
>>>>>> - Patrick
>>>>>>
>>>>>> On Mon, Sep 18, 2017 at 7:56 PM, Reynold Xin 
>>>>>> wrote:
>>>>>>
>>>>>>> Does anybody know whether this is a hard blocker? If it is not, we
>>>>>>> should probably push 2.1.2 forward quickly and do the infrastructure
>>>>>>> improvement in parallel.
>>>>>>>
>>>>>>> On Mon, Sep 18, 2017 at 7:49 PM, Holden Karau 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I'm more than willing to help migrate the scripts as part of either
>>>>>>>> this release or the next.
>>>>>>>>
>>>>>>>> It sounds like there is a consensus developing around changing the
>>>>>>>> process -- should we hold off on the 2.1.2 release or roll this into 
>>>>>>>> the
>>>>>>>> next one?
>>>>>>>>
>>>>>>>> On Mon, Sep 18, 2017 at 7:37 PM, Marcelo Vanzin <
>>>>>>>> van...@cloudera.com> wrote:
>>>>>>>>
>>>>>>>>> +1 to this. There should be a script in the Spark repo that has all
>>>>>>>>> the logic needed for a release. That script should take the RM's
>>>>>>>>> key
>>>>>>>>> as a parameter.
>>>>>>>>>
>>>>>>>>> if there's a desire to keep the current Jenkins job to create the
>>>>>>>>> release, it should be based on that script. But from what I'm
>>>>>>>>> seeing
>>>>>>>>> there are currently too many unknowns in the release process.
>>>>>>>>>
>>>>>>>>> On Mon, Sep 18, 2017 at 4:55 PM, Ryan Blue
>>>>>>>>>  wrote:
>>>>>>>>> > I don't understand why it is necessary to share a release key.
>>>>>>>>> If this is
>>>>>>>>> > something that can be automated in a Jenkins job, then can it be
>>>>>>>>> a script
>>>>>>>>> > with a reasonable set of build requirements for Mac and Ubuntu?
>>>>>>>>> That's the
>>>>>>>>> > approach I've seen the most in other projects.
>>>>>>>>> >
>>>>>>>>> > I'm also not just concerned about release managers. Having a key
>>>>>>>&

Re: What steps to take to work on [Spark-8899] issue?

2015-07-08 Thread Holden Karau
Not exactly, but it means someone has come up with what they think is a
solution to the problem and has submitted some code for
consideration/review.

On Wednesday, July 8, 2015, Chandrashekhar Kotekar <
shekhar.kote...@gmail.com> wrote:

> Maybe it is stupid question but 'pull request posted to it' means this bug
> is already fixed?
>
>
> Regards,
> Chandrash3khar Kotekar
> Mobile - +91 8600011455
>
> On Thu, Jul 9, 2015 at 12:14 AM, Michael Armbrust  > wrote:
>
>> There is a lot of info here:
>> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
>>
>> In this particular case I'd start by looking at the JIRA (which already
>> has a pull request posted to it).
>>
>> On Wed, Jul 8, 2015 at 11:40 AM, Chandrashekhar Kotekar <
>> shekhar.kote...@gmail.com
>> > wrote:
>>
>>> Hi,
>>>
>>> Although I have 7+ years experience in Java development, I am new to
>>> open source contribution. To understand which steps one needs to take to
>>> work on some issue and upload those changes, I have decided to work on this
>>> [Spark-8899] issue which is marked as 'trivial'.
>>>
>>> So far I have done following steps :
>>>
>>> 1. Opened http://github.com/apache/spark in chrome
>>> 2. Clicked on 'Fork' button which is on upper right hand side of the page
>>> 3. Cloned the project using github shell app.
>>>
>>> Now what should I do next to work on this issue? Can anyone please help?
>>>
>>> Thanks,
>>> Chandrash3khar Kotekar
>>> Mobile - +91 8600011455
>>>
>>
>>
>

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
Linked In: https://www.linkedin.com/in/holdenkarau


Re: [VOTE] Release Apache Spark 1.4.1 (RC4)

2015-07-09 Thread Holden Karau
+1 - compiled on ubuntu & centos, spark-perf run against yarn in client
mode on a small cluster comparing 1.4.0 & 1.4.1 (for core) doesn't have any
huge jumps (albeit with a small scaling factor).

On Wed, Jul 8, 2015 at 11:58 PM, Patrick Wendell  wrote:

> +1
>
> On Wed, Jul 8, 2015 at 10:55 PM, Patrick Wendell 
> wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> 1.4.1!
> >
> > This release fixes a handful of known issues in Spark 1.4.0, listed here:
> > http://s.apache.org/spark-1.4.1
> >
> > The tag to be voted on is v1.4.1-rc4 (commit dbaa5c2):
> > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
> > dbaa5c294eb565f84d7032e387e4b8c1a56e4cd2
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc4-bin/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > [published as version: 1.4.1]
> > https://repository.apache.org/content/repositories/orgapachespark-1125/
> > [published as version: 1.4.1-rc4]
> > https://repository.apache.org/content/repositories/orgapachespark-1126/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc4-docs/
> >
> > Please vote on releasing this package as Apache Spark 1.4.1!
> >
> > The vote is open until Sunday, July 12, at 06:55 UTC and passes
> > if a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 1.4.1
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see
> > http://spark.apache.org/
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
Linked In: https://www.linkedin.com/in/holdenkarau


Building with sbt "impossible to get artifacts when data has not been loaded"

2015-08-26 Thread Holden Karau
Has anyone else run into "impossible to get artifacts when data has not
been loaded. IvyNode = org.scala-lang#scala-library;2.10.3" during
hive/update when building with sbt? Working around it is pretty simple
(just add it as a dependency), but I'm wondering if it's impacting anyone
else and I should make a PR for it, or if it's something funky with my local
build setup.

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
Linked In: https://www.linkedin.com/in/holdenkarau


Adding Spark Testing functionality

2015-10-06 Thread Holden Karau
Hi Spark Devs,

So this has been brought up a few times before, and generally on the user
list people get directed to use spark-testing-base. I'd like to start
moving some of spark-testing-base's functionality into Spark so that people
don't need a library to do what is (hopefully :p) a very common requirement
across all Spark projects.

To that end I was wondering what people's thoughts are on where this should
live inside of Spark. I was thinking it could either be a separate testing
project (like sql or similar), or just put the bits to enable testing
inside of each relevant project.

I was also thinking it probably makes sense to only move the unit testing
parts at the start and leave things like integration testing in a testing
project since that could vary depending on the user's environment.
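
To make this concrete, here's a rough sketch (my own illustration, not an
existing API) of the kind of shared-context helper I have in mind for the
unit testing side, assuming plain unittest and a local-mode SparkContext:

import unittest
from pyspark import SparkConf, SparkContext

class SharedSparkContextTestCase(unittest.TestCase):
    # one local SparkContext shared across the tests in a class
    @classmethod
    def setUpClass(cls):
        conf = SparkConf().setMaster("local[2]").setAppName(cls.__name__)
        cls.sc = SparkContext(conf=conf)

    @classmethod
    def tearDownClass(cls):
        cls.sc.stop()

class WordCountTest(SharedSparkContextTestCase):
    def test_word_count(self):
        rdd = self.sc.parallelize(["hello world", "hello spark"])
        counts = dict(rdd.flatMap(lambda line: line.split())
                         .map(lambda word: (word, 1))
                         .reduceByKey(lambda a, b: a + b)
                         .collect())
        self.assertEqual(counts["hello"], 2)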

What are people's thoughts?

Cheers,

Holden :)


Re: Adding Spark Testing functionality

2015-10-06 Thread Holden Karau
I'll put together a google doc and send that out (in the meantime a quick
guide of sorts on how the current package can be used is in the blog post I
did at
http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/
). If people think it's better to keep this as a package I am of course happy to
keep doing that. It feels a little strange to have something as core as
being able to test your code live outside of the project.

On Tue, Oct 6, 2015 at 3:44 PM, Patrick Wendell  wrote:

> Hey Holden,
>
> It would be helpful if you could outline the set of features you'd imagine
> being part of Spark in a short doc. I didn't see a README on the existing
> repo, so it's hard to know exactly what is being proposed.
>
> As a general point of process, we've typically avoided merging modules
> into Spark that can exist outside of the project. A testing utility package
> that is based on Spark's public API's seems like a really useful thing for
> the community, but it does seem like a good fit for a package library. At
> least, this is my first question after taking a look at the project.
>
> In any case, getting some high level view of the functionality you imagine
> would be helpful to give more detailed feedback.
>
> - Patrick
>
> On Tue, Oct 6, 2015 at 3:12 PM, Holden Karau  wrote:
>
>> Hi Spark Devs,
>>
>> So this has been brought up a few times before, and generally on the user
>> list people get directed to use spark-testing-base. I'd like to start
>> moving some of spark-testing-base's functionality into Spark so that people
>> don't need a library to do what is (hopefully :p) a very common requirement
>> across all Spark projects.
>>
>> To that end I was wondering what peoples thoughts are on where this
>> should live inside of Spark. I was thinking it could either be a separate
>> testing project (like sql or similar), or just put the bits to enable
>> testing inside of each relevant project.
>>
>> I was also thinking it probably makes sense to only move the unit testing
>> parts at the start and leave things like integration testing in a testing
>> project since that could vary depending on the users environment.
>>
>> What are peoples thoughts?
>>
>> Cheers,
>>
>> Holden :)
>>
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
Linked In: https://www.linkedin.com/in/holdenkarau


Re: Adding Spark Testing functionality

2015-10-12 Thread Holden Karau
So here is a quick description of the current testing bits (I can expand on
it if people are interested) http://bit.ly/pandaPandaPanda .

On Tue, Oct 6, 2015 at 3:49 PM, Holden Karau  wrote:

> I'll put together a google doc and send that out (in the meantime a quick
> guide of sort of how the current package can be used is in the blog post I
> did at
> http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/
> )  If people think its better to keep as a package I am of course happy to
> keep doing that. It feels a little strange to have something as core as
> being able to test your code live outside.
>
> On Tue, Oct 6, 2015 at 3:44 PM, Patrick Wendell 
> wrote:
>
>> Hey Holden,
>>
>> It would be helpful if you could outline the set of features you'd
>> imagine being part of Spark in a short doc. I didn't see a README on the
>> existing repo, so it's hard to know exactly what is being proposed.
>>
>> As a general point of process, we've typically avoided merging modules
>> into Spark that can exist outside of the project. A testing utility package
>> that is based on Spark's public API's seems like a really useful thing for
>> the community, but it does seem like a good fit for a package library. At
>> least, this is my first question after taking a look at the project.
>>
>> In any case, getting some high level view of the functionality you
>> imagine would be helpful to give more detailed feedback.
>>
>> - Patrick
>>
>> On Tue, Oct 6, 2015 at 3:12 PM, Holden Karau 
>> wrote:
>>
>>> Hi Spark Devs,
>>>
>>> So this has been brought up a few times before, and generally on the
>>> user list people get directed to use spark-testing-base. I'd like to start
>>> moving some of spark-testing-base's functionality into Spark so that people
>>> don't need a library to do what is (hopefully :p) a very common requirement
>>> across all Spark projects.
>>>
>>> To that end I was wondering what peoples thoughts are on where this
>>> should live inside of Spark. I was thinking it could either be a separate
>>> testing project (like sql or similar), or just put the bits to enable
>>> testing inside of each relevant project.
>>>
>>> I was also thinking it probably makes sense to only move the unit
>>> testing parts at the start and leave things like integration testing in a
>>> testing project since that could vary depending on the users environment.
>>>
>>> What are peoples thoughts?
>>>
>>> Cheers,
>>>
>>> Holden :)
>>>
>>
>>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
> Linked In: https://www.linkedin.com/in/holdenkarau
>



-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
Linked In: https://www.linkedin.com/in/holdenkarau


Re: Live UI

2015-10-12 Thread Holden Karau
I don't think there has been much work done with ScalaJS and Spark (outside
of the April fools press release), but there is a live Web UI project out
of hammerlab with Ryan Williams https://github.com/hammerlab/spree which
you may want to take a look at.

On Mon, Oct 12, 2015 at 2:36 PM, Jakob Odersky  wrote:

> Hi everyone,
> I am just getting started working on spark and was thinking of a first way
> to contribute whilst still trying to wrap my head around the codebase.
>
> Exploring the web UI, I noticed it is a classic request-response website,
> requiring manual refresh to get the latest data.
> I think it would be great to have a "live" website where data would be
> displayed real-time without the need to hit the refresh button. I would be
> very interested in contributing this feature if it is acceptable.
>
> Specifically, I was thinking of using websockets with a ScalaJS front-end.
> Please let me know if this design would be welcome or if it introduces
> unwanted dependencies, I'll be happy to discuss this further in detail.
>
> thanks for your feedback,
> --Jakob
>



-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
Linked In: https://www.linkedin.com/in/holdenkarau


Ability to offer initial coefficients in ml.LogisticRegression

2015-11-02 Thread Holden Karau
Hi YiZhi,

I've been waiting on the shared param to go in (I think it was kmeans) so
we could have a common API. I think the issue is SPARK-7852 but I am on
mobile right now.
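
For reference, the RDD-based API mentioned further down the thread does take
initial weights today; roughly (a from-memory sketch, assuming an existing
SparkContext sc):

from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

data = sc.parallelize([LabeledPoint(0.0, [0.0, 1.0]),
                       LabeledPoint(1.0, [1.0, 0.0])])
# e.g. warm-start from yesterday's model's coefficients (illustrative values)
initial = Vectors.dense([0.5, -0.5])
model = LogisticRegressionWithSGD.train(data, iterations=10,
                                        initialWeights=initial)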

Cheers,

Holden :)

On Monday, November 2, 2015, DB Tsai > wrote:

> Hi YiZhi,
>
> Sure. I think Holden already created a JIRA for this. Please
> coordinate with Holden, and keep me in the loop. Thanks.
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Mon, Nov 2, 2015 at 7:32 AM, YiZhi Liu  wrote:
> > Hi Tsai,
> >
> > Is it proper if I create a jira and try to work on it?
> >
> > 2015-10-23 10:40 GMT+08:00 YiZhi Liu :
> >> Thank you Tsai.
> >>
> >> Holden, would you mind posting the JIRA issue id here? I searched but
> >> found nothing. Thanks.
> >>
> >> 2015-10-23 1:36 GMT+08:00 DB Tsai :
> >>> There is a JIRA for this. I know Holden is interested in this.
> >>>
> >>>
> >>> On Thursday, October 22, 2015, YiZhi Liu  wrote:
> 
>  Would someone mind giving some hint?
> 
>  2015-10-20 15:34 GMT+08:00 YiZhi Liu :
>  > Hi all,
>  >
>  > I noticed that in ml.classification.LogisticRegression, users are
> not
>  > allowed to set initial coefficients, while it is supported in
>  > mllib.classification.LogisticRegressionWithSGD.
>  >
>  > Sometimes we know specific coefficients are close to the final
> optima.
>  > e.g., we usually pick yesterday's output model as init coefficients
>  > since the data distribution between two days' training sample
>  > shouldn't change much.
>  >
>  > Is there any concern for not supporting this feature?
>  >
>  > --
>  > Yizhi Liu
>  > Senior Software Engineer / Data Mining
>  > www.mvad.com, Shanghai, China
> 
> 
> 
>  --
>  Yizhi Liu
>  Senior Software Engineer / Data Mining
>  www.mvad.com, Shanghai, China
> 
>  -
>  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>  For additional commands, e-mail: dev-h...@spark.apache.org
> 
> >>>
> >>>
> >>> --
> >>> - DB
> >>>
> >>> Sent from my iPhone
> >>
> >>
> >>
> >> --
> >> Yizhi Liu
> >> Senior Software Engineer / Data Mining
> >> www.mvad.com, Shanghai, China
> >
> >
> >
> > --
> > Yizhi Liu
> > Senior Software Engineer / Data Mining
> > www.mvad.com, Shanghai, China
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Ability to offer initial coefficients in ml.LogisticRegression

2015-11-03 Thread Holden Karau
That's correct :)

On Mon, Nov 2, 2015 at 8:04 PM, YiZhi Liu  wrote:

> Hi Holden,
>
> Yep the issue id is correct. It seems that you're waiting for
> SPARK-11136 which Jayant is working on?
>
> Best,
> Yizhi
>
> 2015-11-03 11:14 GMT+08:00 Holden Karau :
> > Hi YiZhi,
> >
> > I've been waiting on the shared param to go in (I think it was kmeans)
> so we
> > could have a common API. I think the issue is SPARK-7852 but I am on
> mobile
> > right now.
> >
> > Cheers,
> >
> > Holden :)
> >
> >
> > On Monday, November 2, 2015, DB Tsai  wrote:
> >>
> >> Hi YiZhi,
> >>
> >> Sure. I think Holden already created a JIRA for this. Please
> >> coordinate with Holden, and keep me in the loop. Thanks.
> >>
> >> Sincerely,
> >>
> >> DB Tsai
> >> --
> >> Web: https://www.dbtsai.com
> >> PGP Key ID: 0xAF08DF8D
> >>
> >>
> >> On Mon, Nov 2, 2015 at 7:32 AM, YiZhi Liu  wrote:
> >> > Hi Tsai,
> >> >
> >> > Is it proper if I create a jira and try to work on it?
> >> >
> >> > 2015-10-23 10:40 GMT+08:00 YiZhi Liu :
> >> >> Thank you Tsai.
> >> >>
> >> >> Holden, would you mind posting the JIRA issue id here? I searched but
> >> >> found nothing. Thanks.
> >> >>
> >> >> 2015-10-23 1:36 GMT+08:00 DB Tsai :
> >> >>> There is a JIRA for this. I know Holden is interested in this.
> >> >>>
> >> >>>
> >> >>> On Thursday, October 22, 2015, YiZhi Liu 
> wrote:
> >> >>>>
> >> >>>> Would someone mind giving some hint?
> >> >>>>
> >> >>>> 2015-10-20 15:34 GMT+08:00 YiZhi Liu :
> >> >>>> > Hi all,
> >> >>>> >
> >> >>>> > I noticed that in ml.classification.LogisticRegression, users are
> >> >>>> > not
> >> >>>> > allowed to set initial coefficients, while it is supported in
> >> >>>> > mllib.classification.LogisticRegressionWithSGD.
> >> >>>> >
> >> >>>> > Sometimes we know specific coefficients are close to the final
> >> >>>> > optima.
> >> >>>> > e.g., we usually pick yesterday's output model as init
> coefficients
> >> >>>> > since the data distribution between two days' training sample
> >> >>>> > shouldn't change much.
> >> >>>> >
> >> >>>> > Is there any concern for not supporting this feature?
> >> >>>> >
> >> >>>> > --
> >> >>>> > Yizhi Liu
> >> >>>> > Senior Software Engineer / Data Mining
> >> >>>> > www.mvad.com, Shanghai, China
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> --
> >> >>>> Yizhi Liu
> >> >>>> Senior Software Engineer / Data Mining
> >> >>>> www.mvad.com, Shanghai, China
> >> >>>>
> >> >>>>
> -
> >> >>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> >>>> For additional commands, e-mail: dev-h...@spark.apache.org
> >> >>>>
> >> >>>
> >> >>>
> >> >>> --
> >> >>> - DB
> >> >>>
> >> >>> Sent from my iPhone
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Yizhi Liu
> >> >> Senior Software Engineer / Data Mining
> >> >> www.mvad.com, Shanghai, China
> >> >
> >> >
> >> >
> >> > --
> >> > Yizhi Liu
> >> > Senior Software Engineer / Data Mining
> >> > www.mvad.com, Shanghai, China
> >
> >
> >
> > --
> > Cell : 425-233-8271
> > Twitter: https://twitter.com/holdenkarau
> >
>
>
>
> --
> Yizhi Liu
> Senior Software Engineer / Data Mining
> www.mvad.com, Shanghai, China
>



-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: [ML] Missing documentation for the IndexToString feature transformer

2015-12-05 Thread Holden Karau
I'd be more than happy to help review the docs if that would be useful :)

On Sat, Dec 5, 2015 at 2:21 PM, Joseph Bradley 
wrote:

> Thanks for reporting this!  I just added a JIRA:
> https://issues.apache.org/jira/browse/SPARK-12159
> That would be great if you could send a PR for it; thanks!
> Joseph
>
> On Sat, Dec 5, 2015 at 5:02 AM, Benjamin Fradet  > wrote:
>
>> Hi,
>>
>> I was wondering why the IndexToString
>> 
>>  label
>> transformer was not documented in ml-features.md
>> .
>>
>> If it's not intentional, having used it a few times, I'd be happy to
>> submit a jira and the pr associated.
>>
>> Best,
>> Ben.
>>
>> --
>> Ben Fradet.
>>
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: How to run PySpark tests?

2016-02-18 Thread Holden Karau
I've run into some problems with the Python tests in the past when I
haven't built with hive support; you might want to build your assembly with
hive support and see if that helps.

On Thursday, February 18, 2016, Jason White  wrote:

> Hi,
>
> I'm trying to finish up a PR (https://github.com/apache/spark/pull/10089)
> which is currently failing PySpark tests. The instructions to run the test
> suite seem a little dated. I was able to find these:
> https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
> http://spark.apache.org/docs/latest/building-spark.html
>
> I've tried running `python/run-tests`, but it fails hard at the ORC tests.
> I
> suspect it has to do with the external libraries not being compiled or put
> in the right location.
> I've tried running `SPARK_TESTING=1 ./bin/pyspark
> python/pyspark/streaming/tests.py` as suggested, but this doesn't work on
> Spark 2.0.
> I've tried running `SPARK_TESTING=1 ./bin/spark-submit
> python/pyspark/streaming/tests.py`and that worked a little better, but it
> failed at `pyspark.streaming.tests.KafkaStreamTests`, with
> `java.lang.ClassNotFoundException:
> org.apache.spark.streaming.kafka.KafkaTestUtils`. I suspect the same issue
> with external libraries.
>
> I've compiling Spark with `build/mvn -Pyarn -Phadoop-2.4
> -Dhadoop.version=2.4.0 -DskipTests clean package` with no trouble.
>
> Is there any better documentation somewhere about how to run the PySpark
> tests?
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-run-PySpark-tests-tp16357.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> For additional commands, e-mail: dev-h...@spark.apache.org 
>
>

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: How to run PySpark tests?

2016-02-18 Thread Holden Karau
Great - I'll update the wiki.

On Thu, Feb 18, 2016 at 8:34 PM, Jason White 
wrote:

> Compiling with `build/mvn -Pyarn -Phadoop-2.4 -Phive -Dhadoop.version=2.4.0
> -DskipTests clean package` followed by `python/run-tests` seemed to do the
> trick! Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-run-PySpark-tests-tp16357p16362.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Write access to wiki

2016-02-19 Thread Holden Karau
Any chance I could also get write access to the wiki? I'd like to update
some of the PySpark documentation in the wiki.

On Tue, Jan 12, 2016 at 10:14 AM, shane knapp  wrote:

> > Ok, sounds good. I think it would be great, if you could add installing
> the
> > 'docker-engine' package and starting the 'docker' service in there too. I
> > was planning to update the playbook if there were one in the apache/spark
> > repo but I didn't see one, hence my question.
> >
> we currently have docker 1.5 running on the worker, and after the
> Great Upgrade To CentOS7, we'll be running a much more modern version
> of docker.
>
> shane
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: How to run PySpark tests?

2016-02-19 Thread Holden Karau
Or wait, I don't have access to the wiki - if anyone can give me wiki access
I'll update the instructions.

On Thu, Feb 18, 2016 at 8:45 PM, Holden Karau  wrote:

> Great - I'll update the wiki.
>
> On Thu, Feb 18, 2016 at 8:34 PM, Jason White 
> wrote:
>
>> Compiling with `build/mvn -Pyarn -Phadoop-2.4 -Phive
>> -Dhadoop.version=2.4.0
>> -DskipTests clean package` followed by `python/run-tests` seemed to do the
>> trick! Thanks!
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-run-PySpark-tests-tp16357p16362.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>



-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-04-05 Thread Holden Karau
One minor downside to having both 2.10 and 2.11 (and eventually 2.12) is
deprecation warnings in our builds that we can't fix without introducing a
wrapper or Scala-version-specific code. This isn't a big deal, and if we drop
2.10 in the 3-6 month time frame talked about we can clean up those warnings
once we get there.

On Fri, Apr 1, 2016 at 10:00 PM, Raymond Honderdors <
raymond.honderd...@sizmek.com> wrote:

> What about a seperate branch for scala 2.10?
>
>
>
> Sent from my Samsung Galaxy smartphone.
>
>
>  Original message 
> From: Koert Kuipers 
> Date: 4/2/2016 02:10 (GMT+02:00)
> To: Michael Armbrust 
> Cc: Matei Zaharia , Mark Hamstra <
> m...@clearstorydata.com>, Cody Koeninger , Sean Owen <
> so...@cloudera.com>, dev@spark.apache.org
> Subject: Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle
>
> as long as we don't lock ourselves into supporting scala 2.10 for the
> entire spark 2 lifespan it sounds reasonable to me
>
> On Wed, Mar 30, 2016 at 3:25 PM, Michael Armbrust 
> wrote:
>
>> +1 to Matei's reasoning.
>>
>> On Wed, Mar 30, 2016 at 9:21 AM, Matei Zaharia 
>> wrote:
>>
>>> I agree that putting it in 2.0 doesn't mean keeping Scala 2.10 for the
>>> entire 2.x line. My vote is to keep Scala 2.10 in Spark 2.0, because it's
>>> the default version we built with in 1.x. We want to make the transition
>>> from 1.x to 2.0 as easy as possible. In 2.0, we'll have the default
>>> downloads be for Scala 2.11, so people will more easily move, but we
>>> shouldn't create obstacles that lead to fragmenting the community and
>>> slowing down Spark 2.0's adoption. I've seen companies that stayed on an
>>> old Scala version for multiple years because switching it, or mixing
>>> versions, would affect the company's entire codebase.
>>>
>>> Matei
>>>
>>> On Mar 30, 2016, at 12:08 PM, Koert Kuipers  wrote:
>>>
>>> oh wow, had no idea it got ripped out
>>>
>>> On Wed, Mar 30, 2016 at 11:50 AM, Mark Hamstra 
>>> wrote:
>>>
 No, with 2.0 Spark really doesn't use Akka:
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkConf.scala#L744

 On Wed, Mar 30, 2016 at 9:10 AM, Koert Kuipers 
 wrote:

> Spark still runs on akka. So if you want the benefits of the latest
> akka (not saying we do, was just an example) then you need to drop scala
> 2.10
> On Mar 30, 2016 10:44 AM, "Cody Koeninger"  wrote:
>
>> I agree with Mark in that I don't see how supporting scala 2.10 for
>> spark 2.0 implies supporting it for all of spark 2.x
>>
>> Regarding Koert's comment on akka, I thought all akka dependencies
>> have been removed from spark after SPARK-7997 and the recent removal
>> of external/akka
>>
>> On Wed, Mar 30, 2016 at 9:36 AM, Mark Hamstra <
>> m...@clearstorydata.com> wrote:
>> > Dropping Scala 2.10 support has to happen at some point, so I'm not
>> > fundamentally opposed to the idea; but I've got questions about how
>> we go
>> > about making the change and what degree of negative consequences we
>> are
>> > willing to accept.  Until now, we have been saying that 2.10
>> support will be
>> > continued in Spark 2.0.0.  Switching to 2.11 will be non-trivial
>> for some
>> > Spark users, so abruptly dropping 2.10 support is very likely to
>> delay
>> > migration to Spark 2.0 for those users.
>> >
>> > What about continuing 2.10 support in 2.0.x, but repeatedly making
>> an
>> > obvious announcement in multiple places that such support is
>> deprecated,
>> > that we are not committed to maintaining it throughout 2.x, and
>> that it is,
>> > in fact, scheduled to be removed in 2.1.0?
>> >
>> > On Wed, Mar 30, 2016 at 7:45 AM, Sean Owen 
>> wrote:
>> >>
>> >> (This should fork as its own thread, though it began during
>> discussion
>> >> of whether to continue Java 7 support in Spark 2.x.)
>> >>
>> >> Simply: would like to more clearly take the temperature of all
>> >> interested parties about whether to support Scala 2.10 in the Spark
>> >> 2.x lifecycle. Some of the arguments appear to be:
>> >>
>> >> Pro
>> >> - Some third party dependencies do not support Scala 2.11+ yet and
>> so
>> >> would not be usable in a Spark app
>> >>
>> >> Con
>> >> - Lower maintenance overhead -- no separate 2.10 build,
>> >> cross-building, tests to check, esp considering support of 2.12
>> will
>> >> be needed
>> >> - Can use 2.11+ features freely
>> >> - 2.10 was EOL in late 2014 and Spark 2.x lifecycle is years to
>> come
>> >>
>> >> I would like to not support 2.10 for Spark 2.x, myself.
>> >>
>> >>
>> -
>> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: dev-h...@spark.apa

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Holden Karau
I'm very much in favor of this; the less porting work there is, the better :)

On Tue, Apr 5, 2016 at 5:32 PM, Joseph Bradley 
wrote:

> +1  By the way, the JIRA for tracking (Scala) API parity is:
> https://issues.apache.org/jira/browse/SPARK-4591
>
> On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia 
> wrote:
>
>> This sounds good to me as well. The one thing we should pay attention to
>> is how we update the docs so that people know to start with the spark.ml
>> classes. Right now the docs list spark.mllib first and also seem more
>> comprehensive in that area than in spark.ml, so maybe people naturally
>> move towards that.
>>
>> Matei
>>
>> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng  wrote:
>>
>> Yes, DB (cc'ed) is working on porting the local linear algebra library
>> over (SPARK-13944). There are also frequent pattern mining algorithms we
>> need to port over in order to reach feature parity. -Xiangrui
>>
>> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>>> Overall this sounds good to me. One question I have is that in
>>> addition to the ML algorithms we have a number of linear algebra
>>> (various distributed matrices) and statistical methods in the
>>> spark.mllib package. Is the plan to port or move these to the spark.ml
>>> namespace in the 2.x series ?
>>>
>>> Thanks
>>> Shivaram
>>>
>>> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen  wrote:
>>> > FWIW, all of that sounds like a good plan to me. Developing one API is
>>> > certainly better than two.
>>> >
>>> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng 
>>> wrote:
>>> >> Hi all,
>>> >>
>>> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API
>>> built
>>> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based
>>> API has
>>> >> been developed under the spark.ml package, while the old RDD-based
>>> API has
>>> >> been developed in parallel under the spark.mllib package. While it was
>>> >> easier to implement and experiment with new APIs under a new package,
>>> it
>>> >> became harder and harder to maintain as both packages grew bigger and
>>> >> bigger. And new users are often confused by having two sets of APIs
>>> with
>>> >> overlapped functions.
>>> >>
>>> >> We started to recommend the DataFrame-based API over the RDD-based
>>> API in
>>> >> Spark 1.5 for its versatility and flexibility, and we saw the
>>> development
>>> >> and the usage gradually shifting to the DataFrame-based API. Just
>>> counting
>>> >> the lines of Scala code, from 1.5 to the current master we added
>>> ~1
>>> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So,
>>> to
>>> >> gather more resources on the development of the DataFrame-based API
>>> and to
>>> >> help users migrate over sooner, I want to propose switching RDD-based
>>> MLlib
>>> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>>> >>
>>> >> * We do not accept new features in the RDD-based spark.mllib package,
>>> unless
>>> >> they block implementing new features in the DataFrame-based spark.ml
>>> >> package.
>>> >> * We still accept bug fixes in the RDD-based API.
>>> >> * We will add more features to the DataFrame-based API in the 2.x
>>> series to
>>> >> reach feature parity with the RDD-based API.
>>> >> * Once we reach feature parity (possibly in Spark 2.2), we will
>>> deprecate
>>> >> the RDD-based API.
>>> >> * We will remove the RDD-based API from the main Spark repo in Spark
>>> 3.0.
>>> >>
>>> >> Though the RDD-based API is already in de facto maintenance mode, this
>>> >> announcement will make it clear and hence important to both MLlib
>>> developers
>>> >> and users. So we’d greatly appreciate your feedback!
>>> >>
>>> >> (As a side note, people sometimes use “Spark ML” to refer to the
>>> >> DataFrame-based API or even the entire MLlib component. This also
>>> causes
>>> >> confusion. To be clear, “Spark ML” is not an official name and there
>>> are no
>>> >> plans to rename MLlib to “Spark ML” at this time.)
>>> >>
>>> >> Best,
>>> >> Xiangrui
>>> >
>>> > -
>>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> > For additional commands, e-mail: user-h...@spark.apache.org
>>> >
>>>
>>
>>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Holden Karau
Personally I'd rather err on the side of keeping PRs open, but I understand
wanting to keep the open PRs limited to ones which have a reasonable chance
of being merged.

What if we filtered for non-mergeable PRs, or instead left a comment
asking the author to respond if they are still available to move the PR
forward - and closed the ones where they don't respond for a week?

Just a suggestion.
On Monday, April 18, 2016, Ted Yu  wrote:

> I had one PR which got merged after 3 months.
>
> If the inactivity was due to contributor, I think it can be closed after
> 30 days.
> But if the inactivity was due to lack of review, the PR should be kept
> open.
>
> On Mon, Apr 18, 2016 at 12:17 PM, Cody Koeninger  > wrote:
>
>> For what it's worth, I have definitely had PRs that sat inactive for
>> more than 30 days due to committers not having time to look at them,
>> but did eventually end up successfully being merged.
>>
>> I guess if this just ends up being a committer ping and reopening the
>> PR, it's fine, but I don't know if it really addresses the underlying
>> issue.
>>
>> On Mon, Apr 18, 2016 at 2:02 PM, Reynold Xin > > wrote:
>> > We have hit a new high in open pull requests: 469 today. While we can
>> > certainly get more review bandwidth, many of these are old and still
>> open
>> > for other reasons. Some are stale because the original authors have
>> become
>> > busy and inactive, and some others are stale because the committers are
>> not
>> > sure whether the patch would be useful, but have not rejected the patch
>> > explicitly. We can cut down the signal to noise ratio by closing pull
>> > requests that have been inactive for greater than 30 days, with a nice
>> > message. I just checked and this would close ~ half of the pull
>> requests.
>> >
>> > For example:
>> >
>> > "Thank you for creating this pull request. Since this pull request has
>> been
>> > inactive for 30 days, we are automatically closing it. Closing the pull
>> > request does not remove it from history and will retain all the diff and
>> > review comments. If you have the bandwidth and would like to continue
>> > pushing this forward, please reopen it. Thanks again!"
>> >
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> 
>> For additional commands, e-mail: dev-h...@spark.apache.org
>> 
>>
>>
>

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: persist versus checkpoint

2016-04-30 Thread Holden Karau
They are different (also, this might be better suited for the user list).
Persist by default will cache the RDD's partitions in executor memory, although
you can specify a different storage level. Checkpoint, on the other hand, will
write the data out to a persistent store and get rid of the dependency (lineage)
graph used to compute the RDD (so it is often seen in iterative algorithms which
may build very large or complex dependency graphs over time).
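
A quick sketch of the difference in PySpark (assuming an existing SparkContext sc):

from pyspark import StorageLevel

sc.setCheckpointDir("/tmp/spark-checkpoints")  # any reliable storage (e.g. HDFS)

rdd = sc.parallelize(range(1000)).map(lambda x: x * x)

# persist: keeps the computed partitions around (memory by default, other
# levels available), but the lineage used to compute them is retained
rdd.persist(StorageLevel.MEMORY_AND_DISK)

# checkpoint: writes the data to the checkpoint dir and truncates the
# lineage, so recovery reads the checkpoint files instead of recomputing
rdd.checkpoint()

rdd.count()  # an action triggers both the caching and the checkpoint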

On Saturday, April 30, 2016, Renyi Xiong  wrote:

> Hi,
>
> Is RDD.persist equivalent to RDD.checkpoint If they save same number of
> copies (say 3) to disk?
>
> (I assume persist saves copies on different machines ?)
>
> thanks,
> Renyi.
>
>

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


PySpark mixed with Jython

2016-05-15 Thread Holden Karau
I've been doing some looking at EclairJS (Spark + Javascript) which takes a
really interesting approach. The driver program is run in node and the
workers are run in nashorn. I was wondering if anyone has given much thought
to optionally exposing an interface for PySpark in a similar fashion. For
some UDFs and UDAFs we could keep the data entirely in the JVM, and still
go back to our old PipelinedRDD based interface for operations which
require native libraries or otherwise aren't supported in Jython. Have I
had too much coffee and this is actually a bad idea or is this something
people think would be worth investigating some?

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: [VOTE] Removing module maintainer process

2016-05-23 Thread Holden Karau
+1 non-binding (as a contributor, anything which speeds things up is worth a
try, and git blame is a good enough substitute for the list when figuring
out who to ping on a PR).

On Monday, May 23, 2016, Imran Rashid  wrote:

> +1 (binding)
>
> On Mon, May 23, 2016 at 8:13 AM, Tom Graves  > wrote:
>
>> +1 (binding)
>>
>> Tom
>>
>>
>> On Sunday, May 22, 2016 7:34 PM, Matei Zaharia > > wrote:
>>
>>
>> It looks like the discussion thread on this has only had positive
>> replies, so I'm going to call a VOTE. The proposal is to remove the
>> maintainer process in 
>> https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers
>> <
>> https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers>
>> given that it doesn't seem to have had a huge impact on the project, and it
>> can unnecessarily create friction in contributing. We already have +1s from
>> Mridul, Tom, Andrew Or and Imran on that thread.
>>
>> I'll leave the VOTE open for 48 hours, until 9 PM EST on May 24, 2016.
>>
>> Matei
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> 
>> For additional commands, e-mail: dev-h...@spark.apache.org
>> 
>>
>>
>>
>

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: ImportError: No module named numpy

2016-06-01 Thread Holden Karau
Generally this means numpy isn't installed on the system or your PYTHONPATH
has somehow gotten pointed somewhere odd.
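
A quick sanity check (rough sketch; run it with the same interpreter PySpark
is configured to use, e.g. whatever PYSPARK_PYTHON points at):

import sys
print(sys.executable)   # which python is actually being used
print(sys.path)         # where it will look for packages

import numpy            # raises ImportError if numpy isn't visible here
print(numpy.__version__)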

On Wed, Jun 1, 2016 at 8:31 AM, Bhupendra Mishra  wrote:

> If any one please can help me with following error.
>
>  File
> "/opt/mapr/spark/spark-1.6.1/python/lib/pyspark.zip/pyspark/mllib/__init__.py",
> line 25, in 
>
> ImportError: No module named numpy
>
>
> Thanks in advance!
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: JIRA SPARK-2984

2016-06-09 Thread Holden Karau
I think your error could possibly be different - looking at the original
JIRA, the issue was happening on HDFS and you seem to be experiencing the
issue on s3n. While I don't have a full view of the problem, I could see
this being s3-specific (read-after-write on s3 is trickier than
read-after-write on HDFS).

On Thursday, June 9, 2016, Sunil Kumar 
wrote:

> Hi,
>
> I am running into SPARK-2984 while running my spark 1.6.1 jobs over yarn
> in AWS. I have tried with spark.speculation=false but still see the same
> failure with _temporary file missing for task_xxx...This ticket is in
> resolved state. How can this be reopened ? Is there a workaround ?
>
> thanks
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: JIRA SPARK-2984

2016-06-09 Thread Holden Karau
I'd do some searching and see if there is a JIRA related to this problem
on s3 and if you don't find one go ahead and make one. Even if it is an
intrinsic problem with s3 (and I'm not super sure since I'm just reading
this on mobile) - it would maybe be a good thing for us to document.

On Thursday, June 9, 2016, Sunil Kumar  wrote:

> Holden
> Thanks for your prompt reply... Any suggestions on the next step ? Does
> this call for a new spark jira ticket or is this an issue for s3?
> Thx
>
>
> I think your error could possibly be different - looking at the original
> JIRA the issue was happening on HDFS and you seem to be experiencing the
> issue on s3n, and while I don't have full view of the problem I could see
> this being s3 specific (read-after-write on s3 is trickier than
> read-after-write on HDFS).
>
> On Thursday, June 9, 2016, Sunil Kumar 
> wrote:
>
>> Hi,
>>
>> I am running into SPARK-2984 while running my spark 1.6.1 jobs over yarn
>> in AWS. I have tried with spark.speculation=false but still see the same
>> failure with _temporary file missing for task_xxx...This ticket is in
>> resolved state. How can this be reopened ? Is there a workaround ?
>>
>> thanks
>>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
>

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Creating a python port for a Scala Spark Projeect

2016-06-22 Thread Holden Karau
PySpark RDDs (on the Java side) are essentially RDDs of pickled objects
and are mostly (but not entirely) opaque to the JVM. It is possible (by using
some internals) to pass a PySpark DataFrame to a Scala library (you may or
may not find the talk I gave at Spark Summit useful
https://www.youtube.com/watch?v=V6DkTVvy9vk as well as some of the Python
examples in
https://github.com/high-performance-spark/high-performance-spark-examples
). Good luck! :)
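
The rough shape of the DataFrame hand-off looks something like this (a sketch
only - _jdf/_jvm are internal, unsupported hooks, and the Scala entry point
here is hypothetical):

# assuming an existing SparkSession `spark` and a PySpark DataFrame `df`
from pyspark.sql import DataFrame

jvm = spark.sparkContext._jvm

# call into a hypothetical Scala object, e.g.
#   object MySystem { def process(df: DataFrame): DataFrame = ... }
result_jdf = jvm.com.example.MySystem.process(df._jdf)

# wrap the returned JVM DataFrame back up as a Python DataFrame
result = DataFrame(result_jdf, df.sql_ctx)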

On Wed, Jun 22, 2016 at 7:07 PM, Daniel Imberman 
wrote:

> Hi All,
>
> I've developed a spark module in scala that I would like to add a python
> port for. I want to be able to allow users to create a pyspark RDD and send
> it to my system. I've been looking into the pyspark source code as well as
> py4J and was wondering if there has been anything like this implemented
> before.
>
> Thank you
>



-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: [GitHub] spark pull request: MLI-2: Start adding k-fold cross validation to...

2014-03-06 Thread Holden Karau
Sure, unique from MLI-2?


On Thu, Mar 6, 2014 at 2:15 PM, mengxr  wrote:

> Github user mengxr commented on the pull request:
>
> https://github.com/apache/spark/pull/18#issuecomment-36944266
>
> LGTM, except the extra empty line. Do you mind creating a Spark JIRA
> for this PR?
>
>
> ---
> If your project is set up for it, you can reply to this email and have your
> reply appear on GitHub as well. If your project does not have this feature
> enabled and wishes so, or if the feature is enabled but not working, please
> contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
> with INFRA.
> ---
>



-- 
Cell : 425-233-8271


Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-27 Thread Holden Karau
+1 (I did some very basic testing with PySpark & Pandas on rc11)


On Tue, May 27, 2014 at 3:53 PM, Mark Hamstra wrote:

> +1
>
>
> On Tue, May 27, 2014 at 9:26 AM, Ankur Dave  wrote:
>
> > 0
> >
> > OK, I withdraw my downvote.
> >
> > Ankur 
> >
>



-- 
Cell : 425-233-8271


Re: Easy win: SBT plugin config expert to help on SPARK-3359?

2014-10-22 Thread Holden Karau
Hi Sean,

I've pushed a PR for this https://github.com/apache/spark/pull/2893 :)

Cheers,

Holden :)

On Tue, Oct 21, 2014 at 4:41 AM, Sean Owen  wrote:

> This one can be resolved, I think, with a bit of help from someone who
> understands SBT + plugin config:
>
> https://issues.apache.org/jira/browse/SPARK-3359
>
> Just a matter of figuring out how to set a property on the plugin.
> This would make Java 8 javadoc work much more nicely. Minor but
> useful!
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


-- 
Cell : 425-233-8271


Re: Development testing code

2014-10-22 Thread Holden Karau
Hi,

Many tests in PySpark are implemented as doctests, and the Python
unittest framework is also used for additional tests.
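
For example, the doctest style looks roughly like this (a simplified sketch
of the pattern, not code copied from pyspark):

import doctest
import sys
from pyspark import SparkContext

def add_one(rdd):
    """Add one to every element of the RDD.

    >>> add_one(sc.parallelize([1, 2, 3])).collect()
    [2, 3, 4]
    """
    return rdd.map(lambda x: x + 1)

if __name__ == "__main__":
    globs = {"sc": SparkContext("local[2]", "doctest-example")}
    failures, _ = doctest.testmod(extraglobs=globs)
    globs["sc"].stop()
    if failures:
        sys.exit(-1)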

Cheers,

Holden :)

On Wed, Oct 22, 2014 at 4:13 PM, catchmonster  wrote:

> Hi,
> If developing in python, what is preffered way to do unit testing?
> Do I use pyunit framework or I need to go with scalaTest?
>
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Development-testing-code-tp8911.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


-- 
Cell : 425-233-8271


Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark

2024-03-12 Thread Holden Karau
+1

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Mon, Mar 11, 2024 at 7:44 PM Reynold Xin 
wrote:

> +1
>
>
> On Mon, Mar 11 2024 at 7:38 PM, Jungtaek Lim 
> wrote:
>
>> +1 (non-binding), thanks Gengliang!
>>
>> On Mon, Mar 11, 2024 at 5:46 PM Gengliang Wang  wrote:
>>
>>> Hi all,
>>>
>>> I'd like to start the vote for SPIP: Structured Logging Framework for
>>> Apache Spark
>>>
>>> References:
>>>
>>>- JIRA ticket 
>>>- SPIP doc
>>>
>>> 
>>>- Discussion thread
>>>
>>>
>>> Please vote on the SPIP for the next 72 hours:
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>> Thanks!
>>> Gengliang Wang
>>>
>>


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Holden Karau
+1

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Mon, Apr 1, 2024 at 5:44 PM Xinrong Meng  wrote:

> +1
>
> Thank you @Hyukjin Kwon 
>
> On Mon, Apr 1, 2024 at 10:19 AM Felix Cheung 
> wrote:
>
>> +1
>> --
>> *From:* Denny Lee 
>> *Sent:* Monday, April 1, 2024 10:06:14 AM
>> *To:* Hussein Awala 
>> *Cc:* Chao Sun ; Hyukjin Kwon ;
>> Mridul Muralidharan ; dev 
>> *Subject:* Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)
>>
>> +1 (non-binding)
>>
>>
>> On Mon, Apr 1, 2024 at 9:24 AM Hussein Awala  wrote:
>>
>> +1(non-binding) I add to the difference will it make that it will also
>> simplify package maintenance and easily release a bug fix/new feature
>> without needing to wait for Pyspark to release.
>>
>> On Mon, Apr 1, 2024 at 4:56 PM Chao Sun  wrote:
>>
>> +1
>>
>> On Sun, Mar 31, 2024 at 10:31 PM Hyukjin Kwon 
>> wrote:
>>
>> Oh I didn't send the discussion thread out as it's pretty simple,
>> non-invasive and the discussion was sort of done as part of the Spark
>> Connect initial discussion ..
>>
>> On Mon, Apr 1, 2024 at 1:59 PM Mridul Muralidharan 
>> wrote:
>>
>>
>> Can you point me to the SPIP’s discussion thread please ?
>> I was not able to find it, but I was on vacation, and so might have
>> missed this …
>>
>>
>> Regards,
>> Mridul
>>
>>
>> On Sun, Mar 31, 2024 at 9:08 PM Haejoon Lee
>>  wrote:
>>
>> +1
>>
>> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon 
>> wrote:
>>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
>> Connect)
>>
>> JIRA 
>> Prototype 
>> SPIP doc
>> 
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks.
>>
>>


Re: Apache Spark 3.4.3 (?)

2024-04-06 Thread Holden Karau
Sounds good to me :)

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Sat, Apr 6, 2024 at 2:51 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> Apache Spark 3.4.2 tag was created on Nov 24th and `branch-3.4` has 85
> commits including important security and correctness patches like
> SPARK-45580, SPARK-46092, SPARK-46466, SPARK-46794, and SPARK-46862.
>
> https://github.com/apache/spark/releases/tag/v3.4.2
>
> $ git log --oneline v3.4.2..HEAD | wc -l
>   85
>
> SPARK-45580 Subquery changes the output schema of the outer query
> SPARK-46092 Overflow in Parquet row group filter creation causes incorrect
> results
> SPARK-46466 Vectorized parquet reader should never do rebase for timestamp
> ntz
> SPARK-46794 Incorrect results due to inferred predicate from checkpoint
> with subquery
> SPARK-46862 Incorrect count() of a dataframe loaded from CSV datasource
> SPARK-45445 Upgrade snappy to 1.1.10.5
> SPARK-47428 Upgrade Jetty to 9.4.54.v20240208
> SPARK-46239 Hide `Jetty` info
>
>
> Currently, I'm checking more applicable patches for branch-3.4. I'd like
> to propose to release Apache Spark 3.4.3 and volunteer as the release
> manager for Apache Spark 3.4.3. If there are no additional blockers, the
> first tentative RC1 vote date is April 15th (Monday).
>
> WDYT?
>
>
> Dongjoon.
>


Re: SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-09 Thread Holden Karau
I like the idea of improving flexibility of Sparks physical plans and
really anything that might reduce code duplication among the ~4 or so
different accelerators.

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Tue, Apr 9, 2024 at 3:14 AM Dongjoon Hyun 
wrote:

> Thank you for sharing, Jia.
>
> I have the same questions like the previous Weiting's thread.
>
> Do you think you can share the future milestone of Apache Gluten?
> I'm wondering when the first stable release will come and how we can
> coordinate across the ASF communities.
>
> > This project is still under active development now, and doesn't have a
> stable release.
> > https://github.com/apache/incubator-gluten/releases/tag/v1.1.1
>
> In the Apache Spark community, Apache Spark 3.2 and 3.3 is the end of
> support.
> And, 3.4 will have 3.4.3 next week and 3.4.4 (another EOL release) is
> scheduled in October.
>
> For the SPIP, I guess it's applicable for Apache Spark 4.0.0 only if there
> is something we need to do from Spark side.
>
+1 I think any changes need to target 4.0

>
> Thanks,
> Dongjoon.
>
>
> On Tue, Apr 9, 2024 at 12:22 AM Ke Jia  wrote:
>
>> Apache Spark currently lacks an official mechanism to support
>> cross-platform execution of physical plans. The Gluten project offers a
>> mechanism that utilizes the Substrait standard to convert and optimize
>> Spark's physical plans. By introducing Gluten's plan conversion,
>> validation, and fallback mechanisms into Spark, we can significantly
>> enhance the portability and interoperability of Spark's physical plans,
>> enabling them to operate across a broader spectrum of execution
>> environments without requiring users to migrate, while also improving
>> Spark's execution efficiency through the utilization of Gluten's advanced
>> optimization techniques. And the integration of Gluten into Spark has
>> already shown significant performance improvements with ClickHouse and
>> Velox backends and has been successfully deployed in production by several
>> customers.
>>
>> References:
>> JIAR Ticket 
>> SPIP Doc
>> 
>>
>> Your feedback and comments are welcome and appreciated.  Thanks.
>>
>> Thanks,
>> Jia Ke
>>
>


Re: Introducing Apache Gluten(incubating), a middle layer to offload Spark to native engine

2024-04-10 Thread Holden Karau
On Wed, Apr 10, 2024 at 9:54 PM Binwei Yang  wrote:

>
> Gluten currently already support Velox backend and Clickhouse backend.
> data fusion support is also proposed but no one worked on it.
>
> Gluten isn't a POC. It's under actively developing but some companies
> already used it.
>
>
> On 2024/04/11 03:32:01 Dongjoon Hyun wrote:
> > I'm interested in your claim.
> >
> > Could you elaborate or provide some evidence for your claim, *a door for
> > all native libraries*, Binwei?
> >
> > For example, is there any POC for that claim? Maybe, did I miss something
> > in that SPIP?
>
I think the concern here is that there are multiple different layers to get from
Spark -> native code, and ideally any changes we introduce in Spark would be
for common functionality that is useful across them (e.g. DataFusion Comet
& Gluten & Photon*, etc.)


* Photon being harder to guess at since it's closed source.

> >
> > Dongjoon.
> >
> > On Wed, Apr 10, 2024 at 8:19 PM Binwei Yang  wrote:
> >
> > >
> > > The SPIP is not for current Gluten, but open a door for all native
> > > libraries and accelerators support.
> > >
> > > On 2024/04/11 00:27:43 Weiting Chen wrote:
> > > > Yes, the 1st Apache release(v1.2.0) for Gluten will be in September.
> > > > For Spark version support, currently Gluten v1.1.1 support Spark3.2
> and
> > > 3.3.
> > > > We are planning to support Spark3.4 and 3.5 in Gluten v1.2.0.
> > > > Spark4.0 support for Gluten is depending on the release schedule in
> > > Spark community.
> > > >
> > > > On 2024/04/09 07:14:13 Dongjoon Hyun wrote:
> > > > > Thank you for sharing, Weiting.
> > > > >
> > > > > Do you think you can share the future milestone of Apache Gluten?
> > > > > I'm wondering when the first stable release will come and how we
> can
> > > > > coordinate across the ASF communities.
> > > > >
> > > > > > This project is still under active development now, and doesn't
> have
> > > a
> > > > > stable release.
> > > > > > https://github.com/apache/incubator-gluten/releases/tag/v1.1.1
> > > > >
> > > > > In the Apache Spark community, Apache Spark 3.2 and 3.3 is the end
> of
> > > > > support.
> > > > > And, 3.4 will have 3.4.3 next week and 3.4.4 (another EOL release)
> is
> > > > > scheduled in October.
> > > > >
> > > > > For the SPIP, I guess it's applicable for Apache Spark 4.0.0 only
> if
> > > there
> > > > > is something we need to do from Spark side.
> > > > >
> > > > > Thanks,
> > > > > Dongjoon.
> > > > >
> > > > >
> > > > > On Mon, Apr 8, 2024 at 11:19 PM WeitingChen <
> weitingc...@apache.org>
> > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > We are excited to introduce a new Apache incubating project
> called
> > > Gluten.
> > > > > > Gluten serves as a middleware layer designed to offload Spark to
> > > native
> > > > > > engines like Velox or ClickHouse.
> > > > > > For more detailed information, please visit the project
> repository at
> > > > > > https://github.com/apache/incubator-gluten
> > > > > >
> > > > > > Additionally, a new Spark SPIP related to Spark + Gluten
> > > collaboration has
> > > > > > been proposed at
> https://issues.apache.org/jira/browse/SPARK-47773.
> > > > > > We eagerly await feedback from the Spark community.
> > > > > >
> > > > > > Thanks,
> > > > > > Weiting.
> > > > > >
> > > > > >
> > > > >
> > > >
> > > > -
> > > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > > >
> > > >
> > >
> > > -
> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >
> > >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-13 Thread Holden Karau
+1 -- even if it's not perfect, now is the time to change default values

On Sat, Apr 13, 2024 at 4:11 PM Hyukjin Kwon  wrote:

> +1
>
> On Sun, Apr 14, 2024 at 7:46 AM Chao Sun  wrote:
>
>> +1.
>>
>> This feature is very helpful for guarding against correctness issues,
>> such as null results due to invalid input or math overflows. It’s been
>> there for a while now and it’s a good time to enable it by default as Spark
>> enters the next major release.
>>
>> On Sat, Apr 13, 2024 at 3:27 PM Dongjoon Hyun 
>> wrote:
>>
>>> I'll start from my +1.
>>>
>>> Dongjoon.
>>>
>>> On 2024/04/13 22:22:05 Dongjoon Hyun wrote:
>>> > Please vote on SPARK-44444 to use ANSI SQL mode by default.
>>> > The technical scope is defined in the following PR which is
>>> > one line of code change and one line of migration guide.
>>> >
>>> > - DISCUSSION:
>>> > https://lists.apache.org/thread/ztlwoz1v1sn81ssks12tb19x37zozxlz
>>> > - JIRA: https://issues.apache.org/jira/browse/SPARK-44444
>>> > - PR: https://github.com/apache/spark/pull/46013
>>> >
>>> > The vote is open until April 17th 1AM (PST) and passes
>>> > if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>> >
>>> > [ ] +1 Use ANSI SQL mode by default
>>> > [ ] -1 Do not use ANSI SQL mode by default because ...
>>> >
>>> > Thank you in advance.
>>> >
>>> > Dongjoon
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Holden Karau
+1

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Thu, Apr 25, 2024 at 11:18 AM Maciej  wrote:

> +1
>
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> PGP: A30CEF0C31A501EC
>
> On 4/25/24 6:21 PM, Reynold Xin wrote:
>
> +1
>
> On Thu, Apr 25, 2024 at 9:01 AM Santosh Pingale
>  
> wrote:
>
>> +1
>>
>> On Thu, Apr 25, 2024, 5:41 PM Dongjoon Hyun 
>> wrote:
>>
>>> FYI, there is a proposal to drop Python 3.8 because its EOL is October
>>> 2024.
>>>
>>> https://github.com/apache/spark/pull/46228
>>> [SPARK-47993][PYTHON] Drop Python 3.8
>>>
>>> Since it's still alive and there will be an overlap between the
>>> lifecycle of Python 3.8 and Apache Spark 4.0.0, please give us your
>>> feedback on the PR, if you have any concerns.
>>>
>>> From my side, I agree with this decision.
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>


Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Holden Karau
+1

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Fri, Apr 26, 2024 at 12:06 PM L. C. Hsieh  wrote:

> +1
>
> On Fri, Apr 26, 2024 at 10:01 AM Dongjoon Hyun 
> wrote:
> >
> > I'll start with my +1.
> >
> > Dongjoon.
> >
> > On 2024/04/26 16:45:51 Dongjoon Hyun wrote:
> > > Please vote on SPARK-46122 to set
> spark.sql.legacy.createHiveTableByDefault
> > > to `false` by default. The technical scope is defined in the following
> PR.
> > >
> > > - DISCUSSION:
> > > https://lists.apache.org/thread/ylk96fg4lvn6klxhj6t6yh42lyqb8wmd
> > > - JIRA: https://issues.apache.org/jira/browse/SPARK-46122
> > > - PR: https://github.com/apache/spark/pull/46207
> > >
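For context, the flag decides what a plain CREATE TABLE (without USING or
STORED AS) produces: with the legacy value true it creates a Hive SerDe table,
with false it creates a native data source table. A minimal sketch of my own,
assuming a Hive-enabled local PySpark session:

    # Illustration of the SPARK-46122 behavior change (not project code).
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[1]")
             .config("spark.sql.legacy.createHiveTableByDefault", "false")
             .enableHiveSupport()
             .getOrCreate())

    # Without a USING clause, the table is now a native data source table
    # (spark.sql.sources.default, typically parquet), not a Hive SerDe table.
    spark.sql("CREATE TABLE t_demo (id INT, name STRING)")
    spark.sql("DESCRIBE TABLE EXTENDED t_demo") \
         .filter("col_name = 'Provider'") \
         .show()   # expected to report the data source provider

    spark.stop()
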
> > > The vote is open until April 30th 1AM (PST) and passes
> > > if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> > >
> > > [ ] +1 Set spark.sql.legacy.createHiveTableByDefault to false by
> default
> > > [ ] -1 Do not change spark.sql.legacy.createHiveTableByDefault because
> ...
> > >
> > > Thank you in advance.
> > >
> > > Dongjoon
> > >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Holden Karau
+1 :) yay previews

On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:

> +1
>
> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>
>> +1 for next Monday.
>>
>> We can do more previews when the other features are ready for preview.
>>
>>> Tathagata Das  wrote on Wed, May 1, 2024 at 08:46:
>>
>>> Next week sounds great! Thank you Wenchen!
>>>
>>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>>>
 Yea I think a preview release won't hurt (without a branch cut). We
 don't need to wait for all the ongoing projects to be ready. How about we
 do a 4.0 preview release based on the current master branch next Monday?

 On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
 tathagata.das1...@gmail.com> wrote:

> Hey all,
>
> Reviving this thread, but Spark master has already accumulated a huge
> amount of changes.  As a downstream project maintainer, I want to really
> start testing the new features and other breaking changes, and it's hard 
> to
> do that without a Preview release. So the sooner we make a Preview 
> release,
> the faster we can start getting feedback for fixing things for a great
> Spark 4.0 final release.
>
> So I urge the community to produce a Spark 4.0 Preview soon even if
> certain features targeting the Delta 4.0 release are still incomplete.
>
> Thanks!
>
>
> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan 
> wrote:
>
>> Thank you all for the replies!
>>
>> To @Nicholas Chammas  : Thanks for
>> cleaning up the error terminology and documentation! I've merged the 
>> first
>> PR and let's finish others before the 4.0 release.
>> To @Dongjoon Hyun  : Thanks for driving the
>> ANSI on by default effort! Now the vote has passed, let's flip the config
>> and finish the DataFrame error context feature before 4.0.
>> To @Jungtaek Lim  : Ack. We can treat
>> the Streaming state store data source as completed for 4.0 then.
>> To @Cheng Pan  : Yea we definitely should have
>> a preview release. Let's collect more feedback on the ongoing projects 
>> and
>> then we can propose a date for the preview release.
>>
>> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>>
>>> will we have preview release for 4.0.0 like we did for 2.0.0 and
>>> 3.0.0?
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>>
>>> > On Apr 15, 2024, at 09:58, Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>> >
>>> > W.r.t. state data source - reader (SPARK-45511), there are several
>>> follow-up tickets, but we don't plan to address them soon. The current
>>> implementation is the final shape for Spark 4.0.0, unless there are 
>>> demands
>>> on the follow-up tickets.
>>> >
>>> > We may want to check the plan for transformWithState - my
>>> understanding is that we want to release the feature to 4.0.0, but there
>>> are several remaining works to be done. While the tentative timeline for
>>> releasing is June 2024, what would be the tentative timeline for the RC 
>>> cut?
>>> > (cc. Anish to add more context on the plan for transformWithState)
>>> >
>>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>>> wrote:
>>> > Hi all,
>>> >
>>> > It's close to the previously proposed 4.0.0 release date (June
>>> 2024), and I think it's time to prepare for it and discuss the ongoing
>>> projects:
>>> > • ANSI by default
>>> > • Spark Connect GA
>>> > • Structured Logging
>>> > • Streaming state store data source
>>> > • new data type VARIANT
>>> > • STRING collation support
>>> > • Spark k8s operator versioning
>>> > Please help to add more items to this list that are missed here. I
>>> would like to volunteer as the release manager for Apache Spark 4.0.0 if
>>> there is no objection. Thank you all for the great work that fills Spark
>>> 4.0!
>>> >
>>> > Wenchen Fan
>>>
>>>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: ASF board report draft for May

2024-05-05 Thread Holden Karau
Do we want to include that we’re planning on having a preview release of
Spark 4 so folks can see the APIs “soon”?

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Sun, May 5, 2024 at 3:24 PM Matei Zaharia 
wrote:

> It’s time for our quarterly ASF board report on Apache Spark this
> Wednesday. Here’s a draft, feel free to suggest changes.
>
> 
>
> Description:
>
> Apache Spark is a fast and general purpose engine for large-scale data
> processing. It offers high-level APIs in Java, Scala, Python, R and SQL as
> well as a rich set of libraries including stream processing, machine
> learning, and graph analytics.
>
> Issues for the board:
>
> - None
>
> Project status:
>
> - We made two patch releases: Spark 3.5.1 on February 28, 2024, and Spark
> 3.4.3 on April 18, 2024.
> - The votes on "SPIP: Structured Logging Framework for Apache Spark" and
> "Pure Python Package in PyPI (Spark Connect)" have passed.
> - The votes for two behavior changes have passed: "SPARK-44444: Use ANSI
> SQL mode by default" and "SPARK-46122: Set
> spark.sql.legacy.createHiveTableByDefault to false".
> - The community decided that upcoming Spark 4.0 release will drop support
> for Python 3.8.
> - We started a discussion about the definition of behavior changes that is
> critical for version upgrades and user experience.
> - We've opened a dedicated repository for the Spark Kubernetes Operator at
> https://github.com/apache/spark-kubernetes-operator. We added a new
> version in Apache Spark JIRA for versioning of the Spark operator based on
> a vote result.
>
> Trademarks:
>
> - No changes since the last report.
>
> Latest releases:
> - Spark 3.4.3 was released on April 18, 2024
> - Spark 3.5.1 was released on February 28, 2024
> - Spark 3.3.4 was released on December 16, 2023
>
> Committers and PMC:
>
> - The latest committer was added on Oct 2nd, 2023 (Jiaan Geng).
> - The latest PMC members were added on Oct 2nd, 2023 (Yuanjian Li and
> Yikun Jiang).
>
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: ASF board report draft for May

2024-05-06 Thread Holden Karau
If folks are against the term “soon”, we could say “in-progress”.

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Mon, May 6, 2024 at 2:08 AM Mich Talebzadeh 
wrote:

> Hi,
>
> We should reconsider using the term "soon" for ASF board as it is
> subjective with no date (assuming this is an official communication on
> Wednesday). We ought to say
>
>  "Spark 4, the next major release after Spark 3.x, is currently under
> development. We plan to make a preview version available for evaluation as
> soon as it is feasible"
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
> Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>
>
> On Mon, 6 May 2024 at 05:09, Dongjoon Hyun 
> wrote:
>
>> +1 for Holden's comment. Yes, it would be great to mention `it` as
>> "soon".
>> (If Wenchen release it on Monday, we can simply mention the release)
>>
>> In addition, Apache Spark PMC received an official notice from ASF Infra
>> team.
>>
>> https://lists.apache.org/thread/rgy1cg17tkd3yox7qfq87ht12sqclkbg
>> > [NOTICE] Apache Spark's GitHub Actions usage exceeds allowances for ASF
>> projects
>>
>> To track and comply with the new ASF Infra Policy as much as possible, we
>> opened a blocker-level JIRA issue and have been working on it.
>> - https://infra.apache.org/github-actions-policy.html
>>
>> Please include a sentence that Apache Spark PMC is working on under the
>> following umbrella JIRA issue.
>>
>> https://issues.apache.org/jira/browse/SPARK-48094
>> > Reduce GitHub Action usage according to ASF project allowance
>>
>> Thanks,
>> Dongjoon.
>>
>>
>> On Sun, May 5, 2024 at 3:45 PM Holden Karau 
>> wrote:
>>
>>> Do we want to include that we’re planning on having a preview release of
>>> Spark 4 so folks can see the APIs “soon”?
>>>
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>>
>>> On Sun, May 5, 2024 at 3:24 PM Matei Zaharia 
>>> wrote:
>>>
>>>> It’s time for our quarterly ASF board report on Apache Spark this
>>>> Wednesday. Here’s a draft, feel free to suggest changes.
>>>>
>>>> 
>>>>
>>>> Description:
>>>>
>>>> Apache Spark is a fast and general purpose engine for large-scale data
>>>> processing. It offers high-level APIs in Java, Scala, Python, R and SQL as
>>>> well as a rich set of libraries including stream processing, machine
>>>> learning, and graph analytics.
>>>>
>>>> Issues for the board:
>>>>
>>>> - None
>>>>
>>>> Project status:
>>>>
>>>> - We made two patch releases: Spark 3.5.1 on February 28, 2024, and
>>>> Spark 3.4.3 on April 18, 2024.
>>>> - The votes on "SPIP: Structured Logging Framework for Apache Spark"
>>>> and "Pure Python Package in PyPI (Spark Connect)" have passed.
>>>> - The votes for two behavior changes have passed: "SPARK-44444: Use
>>>> ANSI SQL mode by default" and "SPARK-46122: Set
>>>> spark.sql.legacy.createHiveTableByDefault to false".
>>>> - The community decided that upcoming Spark 4.0 release will drop
>>>> support for Python 3.8.
>>>> - We started a discussion about the definition of behavior changes that
>>>> is critical for version upgrades and user experience.
>>>> - We've opened a dedicated repository for the Spark Kubernetes Operator
>>>> at https://github.com/apache/spark-kubernetes-operator. We added a new
>>>> version in Apache Spark JIRA for versioning of the Spark operator based on
>>>> a vote result.
>>>>
>>>> Trademarks:
>>>>
>>>> - No changes since the last report.
>>>>
>>>> Latest releases:
>>>> - Spark 3.4.3 was released on April 18, 2024
>>>> - Spark 3.5.1 was released on February 28, 2024
>>>> - Spark 3.3.4 was released on December 16, 2023
>>>>
>>>> Committers and PMC:
>>>>
>>>> - The latest committer was added on Oct 2nd, 2023 (Jiaan Geng).
>>>> - The latest PMC members were added on Oct 2nd, 2023 (Yuanjian Li and
>>>> Yikun Jiang).
>>>>
>>>> 
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>


Re: ASF board report draft for May

2024-05-06 Thread Holden Karau
I trust Wenchen to manage the preview release effectively, but if there are
concerns around how to manage a developer preview release, let's split that
off from the board report discussion.

On Mon, May 6, 2024 at 10:44 AM Mich Talebzadeh 
wrote:

> I did some historical digging on this.
>
> Whilst both preview releases and RCs are pre-release versions, the main
> difference lies in their maturity and readiness for production use. Preview
> releases are early versions aimed at gathering feedback, while release
> candidates (RCs) are nearly finished versions that undergo final testing
> and voting before the official release.
>
> So in our case, we have two options:
>
>
>    1. Skip mentioning the Preview and focus on "We are intending to
>    gather feedback on version 4 by releasing an earlier version to the
>    community for look-and-feel feedback, especially focused on APIs."
>    2. Mention the Preview in the form: "There will be a Preview release with
>    the aim of gathering feedback from the community, focused on APIs."
>
> IMO Preview release does not require a formal vote. Preview releases are
> often considered experimental or pre-alpha versions and are not expected to
> meet the same level of stability and completeness as release candidates or
> final releases.
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
> Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>
>
> On Mon, 6 May 2024 at 14:10, Mich Talebzadeh 
> wrote:
>
>> @Wenchen Fan 
>>
>> Thanks for the update! To clarify, is the vote for approving a specific
>> preview build, or is it for moving towards an RC stage? I gather there is a
>> distinction between these two?
>>
>>
>> Mich Talebzadeh,
>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>
>>
>> On Mon, 6 May 2024 at 13:03, Wenchen Fan  wrote:
>>
>>> The preview release also needs a vote. I'll try my best to cut the RC on
>>> Monday, but the actual release may take some time. Hopefully, we can get it
>>> out this week but if the vote fails, it will take longer as we need more
>>> RCs.
>>>
>>> On Mon, May 6, 2024 at 7:22 AM Dongjoon Hyun 
>>> wrote:
>>>
>>>> +1 for Holden's comment. Yes, it would be great to mention `it` as
>>>> "soon".
>>>> (If Wenchen release it on Monday, we can simply mention the release)
>>>>
>>>> In addition, Apache Spark PMC received an official notice from ASF
>>>> Infra team.
>>>>
>>>> https://lists.apache.org/thread/rgy1cg17tkd3yox7qfq87ht12sqclkbg
>>>> > [NOTICE] Apache Spark's GitHub Actions usage exceeds allowances for
>>>> ASF projects
>>>>
>>>> To track and comply with the new ASF Infra Policy as much as possible,
>>>> we opened a blocker-level JIRA issue and have been working on it.
>>>> - https://infra.apache.org/github-actions-policy.html
>>>>
>>>> Please include a sentence that Apache Spark PMC is working on under the
>>>> following umbrella JIRA issue.
>>>>
>>>> https://issues.apache.org/jira/browse/SPARK-48094
>>>> > Reduce GitHub Action usage according to ASF project allowance
>>>>
>>>> Thanks,
>>>> Dongjoon.
>>>>
>>>>
>>>> On Sun, May 5, 2024 at 3:45 PM 

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Holden Karau
Indeed. We could conceivably build the release in CI/CD, but the final
verification and signing should be done locally to keep the keys safe (there
was some concern about this in earlier release processes).
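
To make the "verify locally" step concrete, here is a rough sketch (my
illustration, not the project's actual release tooling; the artifact name
below is hypothetical). It assumes gpg is installed and the release manager's
public key from the Spark KEYS file has already been imported:

    # Local verification of a downloaded release artifact.
    import hashlib
    import subprocess

    artifact = "spark-4.0.0-preview1-bin-hadoop3.tgz"   # hypothetical name

    # 1. Compute the SHA-512 locally and compare it with the published
    #    .sha512 file that sits next to the artifact on dist.apache.org.
    with open(artifact, "rb") as f:
        print(hashlib.sha512(f.read()).hexdigest())

    # 2. Verify the detached GPG signature (.asc) against the artifact.
    subprocess.run(["gpg", "--verify", artifact + ".asc", artifact], check=True)
    print("signature OK")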

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek  wrote:

> Hi,
>
> Sorry for the novice question, Wenchen - the release is done manually from
> a laptop? Not using a CI/CD process on a build server?
>
> Thanks,
> Nimrod
>
> On Tue, May 7, 2024 at 8:50 PM Wenchen Fan  wrote:
>
>> UPDATE:
>>
>> Unfortunately, it took me quite some time to set up my laptop and get it
>> ready for the release process (docker desktop doesn't work anymore, my pgp
>> key is lost, etc.). I'll start the RC process tomorrow, my time. Thanks for
>> your patience!
>>
>> Wenchen
>>
>> On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:
>>
>>> +1
>>>
>>>
>>>
>>> *From:* Jungtaek Lim 
>>> *Date:* Thursday, May 2, 2024 at 10:21
>>> *To:* Holden Karau 
>>> *Cc:* Chao Sun , Xiao Li ,
>>> Tathagata Das , Wenchen Fan <
>>> cloud0...@gmail.com>, Cheng Pan , Nicholas Chammas <
>>> nicholas.cham...@gmail.com>, Dongjoon Hyun ,
>>> Cheng Pan , Spark dev list ,
>>> Anish Shrigondekar 
>>> *Subject:* Re: [DISCUSS] Spark 4.0.0 release
>>>
>>>
>>>
>>> +1 love to see it!
>>>
>>>
>>>
>>> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
>>> wrote:
>>>
>>> +1 :) yay previews
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>>>
>>> +1
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>>>
>>> +1 for next Monday.
>>>
>>>
>>>
>>> We can do more previews when the other features are ready for preview.
>>>
>>>
>>>
>>> Tathagata Das  wrote on Wed, May 1, 2024 at 08:46:
>>>
>>> Next week sounds great! Thank you Wenchen!
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>>>
>>> Yea I think a preview release won't hurt (without a branch cut). We
>>> don't need to wait for all the ongoing projects to be ready. How about we
>>> do a 4.0 preview release based on the current master branch next Monday?
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
>>> Hey all,
>>>
>>>
>>>
>>> Reviving this thread, but Spark master has already accumulated a huge
>>> amount of changes.  As a downstream project maintainer, I want to really
>>> start testing the new features and other breaking changes, and it's hard to
>>> do that without a Preview release. So the sooner we make a Preview release,
>>> the faster we can start getting feedback for fixing things for a great
>>> Spark 4.0 final release.
>>>
>>>
>>>
>>> So I urge the community to produce a Spark 4.0 Preview soon even if
>>> certain features targeting the Delta 4.0 release are still incomplete.
>>>
>>>
>>>
>>> Thanks!
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>>>
>>> Thank you all for the replies!
>>>
>>>
>>>
>>> To @Nicholas Chammas  : Thanks for cleaning
>>> up the error terminology and documentation! I've merged the first PR and
>>> let's finish others before the 4.0 release.
>>>
>>> To @Dongjoon Hyun  : Thanks for driving the
>>> ANSI on by default effort! Now the vote has passed, let's flip the config
>>> and finish the DataFrame error context feature before 4.0.
>>>
>>> To @Jungtaek Lim  : Ack. We can treat the
>>> Streaming state store data source as completed for 4.0 then.
>>>
>>> To @Cheng Pan  : Yea we definitely should have a
>>> preview release. Let's collect more feedback on the ongoing projects and
>>> then we can propose a date for the preview release.
>>>
>>>
>>>
>>> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>>>
>>> will we have preview release for 4.0.0 like

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Holden Karau
I think signing the artifacts produced from a secure CI sounds like a good
idea. I know we’ve been asked to reduce our GitHub Actions usage, but perhaps
someone interested could volunteer to set that up.

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Tue, May 7, 2024 at 9:43 PM Nimrod Ofek  wrote:

> Hi,
> Thanks for the reply.
>
> From my experience, a build on a build server would be much more
> predictable and less error prone than building on some laptop- and of
> course much faster to have builds, snapshots, early preview releases,
> release candidates, or final releases.
> It will enable us to have a preview version with current changes- snapshot
> version, either automatically every day or if we need to save costs
> (although build is really not expensive) - with a click of a button.
>
> Regarding keys for signing - that's what vaults are for; all across the
> industry we are using vaults (such as HashiCorp Vault) - but if the build
> will be automated and the only thing which will be manual is to sign the
> release for security reasons, that would be reasonable.
>
> Thanks,
> Nimrod
>
>
> On Wed, May 8, 2024 at 00:54, Holden Karau <
> holden.ka...@gmail.com> wrote:
>
>> Indeed. We could conceivably build the release in CI/CD but the final
>> verification / signing should be done locally to keep the keys safe (there
>> was some concern from earlier release processes).
>>
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>
>> On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek 
>> wrote:
>>
>>> Hi,
>>>
>>> Sorry for the novice question, Wenchen - the release is done manually
>>> from a laptop? Not using a CI/CD process on a build server?
>>>
>>> Thanks,
>>> Nimrod
>>>
>>> On Tue, May 7, 2024 at 8:50 PM Wenchen Fan  wrote:
>>>
>>>> UPDATE:
>>>>
>>>> Unfortunately, it took me quite some time to set up my laptop and get
>>>> it ready for the release process (docker desktop doesn't work anymore, my
>>>> pgp key is lost, etc.). I'll start the RC process tomorrow, my time. Thanks
>>>> for your patience!
>>>>
>>>> Wenchen
>>>>
>>>> On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:
>>>>
>>>>> +1
>>>>>
>>>>>
>>>>>
>>>>> *From:* Jungtaek Lim 
>>>>> *Date:* Thursday, May 2, 2024 at 10:21
>>>>> *To:* Holden Karau 
>>>>> *Cc:* Chao Sun , Xiao Li ,
>>>>> Tathagata Das , Wenchen Fan <
>>>>> cloud0...@gmail.com>, Cheng Pan , Nicholas Chammas
>>>>> , Dongjoon Hyun ,
>>>>> Cheng Pan , Spark dev list ,
>>>>> Anish Shrigondekar 
>>>>> *Subject:* Re: [DISCUSS] Spark 4.0.0 release
>>>>>
>>>>>
>>>>>
>>>>> +1 love to see it!
>>>>>
>>>>>
>>>>>
>>>>> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
>>>>> wrote:
>>>>>
>>>>> +1 :) yay previews
>>>>>
>>>>>
>>>>>
>>>>> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>>>>>
>>>>> +1
>>>>>
>>>>>
>>>>>
>>>>> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>>>>>
>>>>> +1 for next Monday.
>>>>>
>>>>>
>>>>>
>>>>> We can do more previews when the other features are ready for preview.
>>>>>
>>>>>
>>>>>
>>>>> Tathagata Das  wrote on Wed, May 1, 2024 at 08:46:
>>>>>
>>>>> Next week sounds great! Thank you Wenchen!
>>>>>
>>>>>
>>>>>
>>>>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan 
>>>>> wrote:
>>>>>
>>>>> Yea I think a preview release won't hurt (without a branch cut). We
>>>>> don't need to wait for all the ongoing projects to be ready. How about we
>>>>> do a 4.0 preview release based on the current master branch next Monday?
>

Re: [DISCUSS] Spark 4.0.0 release

2024-05-08 Thread Holden Karau
That looks cool, maybe let’s split off a thread on how to improve our
release processes?

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Wed, May 8, 2024 at 9:31 AM Erik Krogen  wrote:

> On that note, GitHub recently released (public preview) a new feature
> called Artifact Attestations, which may be relevant/useful here: Introducing
> Artifact Attestations–now in public beta - The GitHub Blog
> <https://github.blog/2024-05-02-introducing-artifact-attestations-now-in-public-beta/>
>
> On Wed, May 8, 2024 at 9:06 AM Nimrod Ofek  wrote:
>
>> I have no permissions so I can't do it but I'm happy to help (although I
>> am more familiar with Gitlab CICD than Github Actions).
>> Is there some point of contact that can provide me needed context and
>> permissions?
>> I'd also love to see why the costs are high and see how we can reduce
>> them...
>>
>> Thanks,
>> Nimrod
>>
>> On Wed, May 8, 2024 at 8:26 AM Holden Karau 
>> wrote:
>>
>>> I think signing the artifacts produced from a secure CI sounds like a
>>> good idea. I know we’ve been asked to reduce our GitHub action usage but
>>> perhaps someone interested could volunteer to set that up.
>>>
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>>
>>> On Tue, May 7, 2024 at 9:43 PM Nimrod Ofek 
>>> wrote:
>>>
>>>> Hi,
>>>> Thanks for the reply.
>>>>
>>>> From my experience, a build on a build server would be much more
>>>> predictable and less error prone than building on some laptop- and of
>>>> course much faster to have builds, snapshots, early preview releases,
>>>> release candidates, or final releases.
>>>> It will enable us to have a preview version with current changes-
>>>> snapshot version, either automatically every day or if we need to save
>>>> costs (although build is really not expensive) - with a click of a button.
>>>>
>>>> Regarding keys for signing - that's what vaults are for; all across
>>>> the industry we are using vaults (such as HashiCorp Vault) - but if the
>>>> build will be automated and the only thing which will be manual is to sign
>>>> the release for security reasons, that would be reasonable.
>>>>
>>>> Thanks,
>>>> Nimrod
>>>>
>>>>
>>>> On Wed, May 8, 2024 at 00:54, Holden Karau <
>>>> holden.ka...@gmail.com> wrote:
>>>>
>>>>> Indeed. We could conceivably build the release in CI/CD but the final
>>>>> verification / signing should be done locally to keep the keys safe (there
>>>>> was some concern from earlier release processes).
>>>>>
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>
>>>>>
>>>>> On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek 
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Sorry for the novice question, Wenchen - the release is done manually
>>>>>> from a laptop? Not using a CI/CD process on a build server?
>>>>>>
>>>>>> Thanks,
>>>>>> Nimrod
>>>>>>
>>>>>> On Tue, May 7, 2024 at 8:50 PM Wenchen Fan 
>>>>>> wrote:
>>>>>>
>>>>>>> UPDATE:
>>>>>>>
>>>>>>> Unfortunately, it took me quite some time to set up my laptop and
>>>>>>> get it ready for the release process (docker desktop doesn't work 
>>>>>>> anymore,
>>>>>>> my pgp key is lost, etc.). I'll start the RC process tomorrow, my time.
>>>>>>> Thanks
>>>>>>> for your patience!
>>>>>>>
>>>>>>> Wenchen
>>>>>>>
>>>>>>> On Fri, May 3, 2024 at 7:47 AM yang

Re: [DISCUSS] Move Spark Connect server to builtin package (Client API layer stays external)

2024-07-02 Thread Holden Karau
I guess my one concern here would be: are we going to expand the
dependencies that are visible on the classpath for non-Connect users?

One of the pain points that folks have experienced with upgrades comes from
those dependencies changing.

Otherwise this seems pretty reasonable.

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Tue, Jul 2, 2024 at 5:36 AM Matthew Powers 
wrote:

> This is a great idea and would be a great quality of life improvement.
>
> +1 (non-binding)
>
> On Tue, Jul 2, 2024 at 4:56 AM Hyukjin Kwon  wrote:
>
>> > while leaving the connect jvm client in a separate folder looks weird
>>
>> I plan to actually put it at the top level together, but I feel like this
>> has to be done with a SPIP, so I am moving the internal server side first,
>> orthogonally.
>>
>> On Tue, 2 Jul 2024 at 17:54, Cheng Pan  wrote:
>>
>>> Thanks for raising this discussion, I think putting the connect folder
>>> on the top level is a good idea to promote Spark Connect, while leaving the
>>> connect jvm client in a separate folder looks weird. I suppose there is no
>>> contract to leave all optional modules under `connector`? e.g.
>>> `resource-managers/kubernetes/{docker,integration-tests}`, `hadoop-cloud`.
>>> What about moving the whole `connect` folder to the top level?
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>>
>>> On Jul 2, 2024, at 08:19, Hyukjin Kwon  wrote:
>>>
>>> Hi all,
>>>
>>> I would like to discuss moving Spark Connect server to builtin package.
>>> Right now, users have to specify —packages when they run Spark Connect
>>> server script, for example:
>>>
>>> ./sbin/start-connect-server.sh --jars `ls 
>>> connector/connect/server/target/**/spark-connect*SNAPSHOT.jar`
>>>
>>> or
>>>
>>> ./sbin/start-connect-server.sh --packages 
>>> org.apache.spark:spark-connect_2.12:3.5.1
>>>
>>> which is a little bit odd that sbin scripts should provide jars to start.
>>>
>>> Moving it to the builtin package is pretty straightforward because most of
>>> the jars are shaded, and the impact would be minimal; I have a prototype here:
>>> apache/spark#47157. This
>>> also simplifies the Python local running logic a lot.
>>>
>>> User facing API layer, Spark Connect Client, stays external but I would
>>> like the internal/admin server layer, Spark Connect Server, implementation
>>> to be built in Spark.
>>>
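As a point of reference, the client-side programming model is unaffected by
where the server code lives; only the packaging of the server changes. A
minimal sketch (mine, not part of the proposal), assuming a Connect server is
already listening on the default port 15002:

    # Client-side usage stays the same regardless of server packaging.
    from pyspark.sql import SparkSession

    # The Spark Connect client talks to a running server over gRPC.
    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

    df = spark.range(5).selectExpr("id", "id * 2 AS doubled")
    df.show()

    spark.stop()
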
>>> Please let me know if you have thoughts on this!
>>>
>>>
>>>


Re: [External Mail] Re: [VOTE] Move Spark Connect server to builtin package (Client API layer stays external)

2024-07-02 Thread Holden Karau
+1

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Tue, Jul 2, 2024 at 10:18 PM yangjie01 
wrote:

> +1 (non-binding)
>
>
>
> *From:* Denny Lee 
> *Date:* Wednesday, July 3, 2024 at 09:12
> *To:* Hyukjin Kwon 
> *Cc:* dev 
> *Subject:* [External Mail] Re: [VOTE] Move Spark Connect server to builtin package
> (Client API layer stays external)
>
>
>
> +1 (non-binding)
>
>
>
> On Wed, Jul 3, 2024 at 9:11 AM Hyukjin Kwon  wrote:
>
> Starting with my own +1.
>
>
>
> On Wed, 3 Jul 2024 at 09:59, Hyukjin Kwon  wrote:
>
> Hi all,
>
> I’d like to start a vote for moving Spark Connect server to builtin
> package (Client API layer stays external).
>
> Please also refer to:
>
>- Discussion thread:
> https://lists.apache.org/thread/odlx9b552dp8yllhrdlp24pf9m9s4tmx
> 
>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-48763
> 
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thank you!
>
>


Re: [VOTE] Allow GitHub Actions runs for contributors' PRs without approvals in apache/spark-connect-go

2024-07-04 Thread Holden Karau
+1

Although given it's a US holiday, maybe keep the vote open for an extra day?

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Thu, Jul 4, 2024 at 7:33 AM Denny Lee  wrote:

> +1 (non-binding)
>
> On Thu, Jul 4, 2024 at 19:13 Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> I’d like to start a vote for allowing GitHub Actions runs for
>> contributors' PRs without approvals in apache/spark-connect-go.
>>
>> Please also refer to:
>>
>>- Discussion thread:
>> https://lists.apache.org/thread/tsqm0dv01f7jgkv5l4kyvtpw4tc6f420
>>- JIRA ticket: https://issues.apache.org/jira/browse/INFRA-25936
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thank you!
>>
>>


<    1   2   3   4   5   6   >